Easy Steps To Install PySpark on Ubuntu 22.04 Terminal
In this guide, we want to teach you how to Install PySpark on Ubuntu 22.04 Terminal. Apache Spark is one of the most popular open-source distributed processing frameworks for big data. It provides development APIs in Java, Scala, Python, and R. PySpark is the Python API for Apache Spark. With PySpark, you can perform real-time, large-scale data processing in a distributed environment using Python. It also provides a PySpark shell for interactively analyzing your data.
Now follow the steps below on the Orcacore website to Install Apache Spark and Run PySpark on Ubuntu 22.04.
To Install PySpark on Ubuntu 22.04 Terminal, you must have access to your server as a non-root user with sudo privileges. To set this up, you can check this guide on Initial Server Setup with Ubuntu 22.04.
Step 1 – Install Java on Ubuntu 22.04
To Install PySpark on Ubuntu 22.04 Terminal, you must have Java installed on your server. First, update your system by using the command below:
sudo apt update
Then, use the following command to install Java:
sudo apt install default-jdk -y
Verify your Java installation by checking its version:
java --version
**Output**
openjdk 11.0.19 2023-04-18
OpenJDK Runtime Environment (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1, mixed mode, sharing)
Step 2 – Download Apache Spark on Ubuntu 22.04
To Install PySpark on Ubuntu 22.04 Terminal, you need to install some required packages by using the command below:
sudo apt install mlocate git scala -y
Then, visit the Apache Spark Downloads page and download the latest Apache Spark release (pre-built for Apache Hadoop 3) by using the following wget command:
Note: Hadoop is the foundation of your big data architecture. It’s responsible for storing and processing your data.
sudo wget https://dlcdn.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
When your Spark download is completed on Ubuntu 22.04, extract your downloaded file with the following command:
sudo tar xvf spark-3.4.0-bin-hadoop3.tgz
Move your extracted file to a new directory with the command below:
sudo mv spark-3.4.0-bin-hadoop3 /opt/spark
Step 3 – How To Configure Spark Environment?
At this point, you need to add the Spark environment variables to your ~/.bashrc file. Open the file with your preferred text editor; here we use vi:
vi ~/.bashrc
At the end of the file, add the following content to the file:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Note: If you extracted Spark to a different directory, use that path as the value of SPARK_HOME.
When you are done, save and close the file.
Next, source your bashrc file so the changes take effect in your current shell:
source ~/.bashrc
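Before moving on, you can optionally confirm that the new variables are visible to programs started from your shell. The short Python check below is just a convenience sketch; it only reads the variables exported above.

```python
# check_spark_env.py - quick sanity check for the Spark environment variables
# (run it from a shell in which ~/.bashrc has been sourced)
import os

spark_home = os.environ.get("SPARK_HOME")
path = os.environ.get("PATH", "")

print("SPARK_HOME =", spark_home)
print("spark bin on PATH:", bool(spark_home) and (spark_home + "/bin") in path)
```

If SPARK_HOME prints as /opt/spark and the second line prints True, the environment is set up correctly.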
Step 4 – How To Run Spark Shell on Ubuntu 22.04?
At this point, you can verify your Spark installation by running the Spark shell command:
spark-shell
If everything is ok, you should get the following output:
**Output**
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.19)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Step 5 – How To Run PySpark on Ubuntu 22.04?
If you want to use Python instead of Scala, you can easily run PySpark on your Ubuntu server with the command below:
pyspark
In the output, you should see a similar welcome banner that reports the Python version in use, notes that the SparkSession is available as 'spark', and ends with a Python >>> prompt.
From your PySpark shell, you can easily write your code and execute it interactively.
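As a quick illustration, here is a small snippet you can paste into the PySpark shell; the spark session is already created for you there. The column names and sample rows are made up for the example.

```python
# Paste into the PySpark shell; `spark` is already available as the SparkSession.
data = [("Alice", 34), ("Bob", 45), ("Carol", 29)]
df = spark.createDataFrame(data, ["name", "age"])

# A simple transformation (filter) followed by an action (show)
df.filter(df.age > 30).show()
```

You should see a small table containing only the rows where age is greater than 30.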
Conclusion
At this point, you have learned how to install Apache Spark, run the Spark shell, and Install PySpark on Ubuntu 22.04 Terminal. With PySpark, you can interactively analyze your data. We hope you enjoy it.
Now that we’ve covered the standard method to Install PySpark on Ubuntu 22.04 Terminal, let’s explore alternative approaches.
Alternative Solutions to Installing PySpark on Ubuntu 22.04
While the manual installation process outlined above is effective, it involves several steps. Here are two alternative approaches that can simplify the process: using Anaconda and using Docker.
Alternative 1: Using Anaconda Environment
Anaconda is a popular Python distribution that simplifies package management and deployment. It provides an isolated environment where you can install PySpark and its dependencies without affecting other Python projects on your system. This is particularly useful for managing different versions of PySpark and its dependencies.
Steps:
- Install Anaconda: If you don’t have Anaconda installed, download the appropriate installer for Linux from the Anaconda website and follow the installation instructions.

- Create a Conda Environment: Create a new conda environment specifically for PySpark. This isolates your PySpark installation from other Python projects.

conda create -n pyspark_env python=3.9  # or any Python version you prefer
conda activate pyspark_env

- Install PySpark: Use pip, the Python package installer, to install PySpark within the activated conda environment. Make sure you also install findspark:

pip install pyspark findspark

- Configure Environment (using findspark): findspark simplifies the process of making Spark available to your Python environment. Add the following to your Python script before you initialize Spark:

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Example").getOrCreate()
# Your PySpark code here
spark.stop()
Explanation:
- Anaconda manages dependencies and creates isolated environments, preventing conflicts.
- findspark automatically configures the necessary environment variables for PySpark, simplifying setup.
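To verify the conda-based setup end to end, you can run a small standalone script along the following lines; the file name and sample data here are just for illustration.

```python
# conda_check.py - minimal end-to-end check of the conda + findspark setup
import findspark
findspark.init()  # locates the Spark distribution installed via pip

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CondaCheck").getOrCreate()

# Build a tiny DataFrame and run a simple aggregation
df = spark.createDataFrame([("spark",), ("pyspark",), ("spark",)], ["word"])
df.groupBy("word").count().show()

spark.stop()
```

Run it with python conda_check.py inside the activated pyspark_env environment; you should see a two-row count table in the output.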
Alternative 2: Using Docker
Docker provides a containerization platform, allowing you to package PySpark and its dependencies into a portable container. This approach ensures consistent execution across different environments and simplifies deployment.
Steps:
- Install Docker: If you don’t have Docker installed, follow the instructions on the Docker website to install Docker Engine and Docker Compose on your Ubuntu system.

- Create a Dockerfile: Create a Dockerfile that defines the environment for your PySpark application. Here’s an example:

FROM ubuntu:22.04

# Update and install dependencies
RUN apt-get update && apt-get install -y openjdk-11-jdk python3 python3-pip wget && apt-get clean

# Set environment variables
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
ENV SPARK_VERSION=3.4.0
ENV HADOOP_VERSION=hadoop3

# Download and extract Spark
RUN wget https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-${HADOOP_VERSION}.tgz && \
    tar -xzf spark-${SPARK_VERSION}-bin-${HADOOP_VERSION}.tgz && \
    mv spark-${SPARK_VERSION}-bin-${HADOOP_VERSION} /opt/spark

# Set Spark home
ENV SPARK_HOME=/opt/spark
ENV PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

# Install PySpark
RUN pip3 install pyspark

# Set working directory
WORKDIR /app

# Copy application files (optional)
COPY . /app

# Command to run (example)
CMD ["pyspark"]

- Build the Docker Image: Navigate to the directory containing the Dockerfile and build the Docker image.

docker build -t pyspark-image .

- Run the Docker Container: Run the Docker container based on the image you built.

docker run -it pyspark-image
Explanation:
- The Dockerfile defines the base image (Ubuntu 22.04), installs Java, downloads and extracts Spark, sets the environment variables, and installs PySpark.
- The docker build command creates a Docker image from the Dockerfile.
- The docker run command starts a container based on the image, providing an isolated environment for PySpark.
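If you would rather run a batch job than the interactive shell, you can copy a small application into the image (the COPY . /app line above already does this) and launch it with spark-submit. The file name app.py below is an assumption for illustration; it is not part of the Dockerfile above.

```python
# app.py - hypothetical PySpark application to run inside the container,
# e.g. with: docker run -it pyspark-image spark-submit /app/app.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DockerizedPySpark").getOrCreate()

# Generate a small range of numbers and compute a simple aggregate
df = spark.range(0, 100)
total = df.groupBy().sum("id").collect()[0][0]
print("Sum of 0..99:", total)

spark.stop()
```

Overriding the container command with spark-submit /app/app.py runs this script instead of the default pyspark shell.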
These alternative methods offer different trade-offs in terms of complexity and control. Anaconda simplifies dependency management, while Docker provides a fully isolated and reproducible environment. Choose the method that best suits your needs and experience level. The ability to Install PySpark on Ubuntu 22.04 Terminal is crucial for modern data processing workflows. These approaches help ensure a smooth and efficient installation process.
This article has now covered the original method to Install PySpark on Ubuntu 22.04 Terminal, as well as providing two alternative ways of solving this problem.