Easy Steps To Install PySpark on Ubuntu 22.04 Terminal

In this guide, we want to teach you how to Install PySpark on Ubuntu 22.04 Terminal. Apache Spark is one of the most popular open-source distributed processing frameworks for big data. It provides development APIs in Java, Scala, Python, and R. PySpark is the Python API for Apache Spark. With PySpark, you can perform real-time, large-scale data processing in a distributed environment using Python. It also provides a PySpark shell for interactively analyzing your data.

Now follow the steps below on the Orcacore website to Install Apache Spark and Run PySpark on Ubuntu 22.04.

To Install PySpark on Ubuntu 22.04 Terminal, you must have access to your server as a non-root user with sudo privileges. To do this, you can check this guide on Initial Server Setup with Ubuntu 22.04.

Step 1 – Install Java on Ubuntu 22.04

To Install PySpark on Ubuntu 22.04 Terminal, you must have Java installed on your server. First, update your system by using the command below:

sudo apt update

Then, use the following command to install Java:

sudo apt install default-jdk -y

Verify your Java installation by checking its version:

java --version
Output
openjdk 11.0.19 2023-04-18
OpenJDK Runtime Environment (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1, mixed mode, sharing)

Step 2 – Download Apache Spark on Ubuntu 22.04

To Install PySpark on Ubuntu 22.04 Terminal, you need to install some required packages by using the command below:

sudo apt install mlocate git scala -y

Then, visit the Apache Spark Downloads page and get the latest Spark release (prebuilt for Apache Hadoop 3) by using the following wget command:

Note: Hadoop is the foundation of your big data architecture. It’s responsible for storing and processing your data.

sudo wget https://dlcdn.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz

When the Spark download is complete, extract the downloaded archive with the following command:

sudo tar xvf spark-3.4.0-bin-hadoop3.tgz

Move the extracted directory to /opt/spark with the command below (sudo is required because /opt is owned by root):

sudo mv spark-3.4.0-bin-hadoop3 /opt/spark

Step 3 – How To Configure Spark Environment?

At this point, you need to add the Spark environment variables to your ~/.bashrc file. Open the file with your preferred text editor; here we use vi (sudo is not needed, since the file belongs to your user):

vi ~/.bashrc

At the end of the file, add the following content to the file:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Note: Make sure the value of SPARK_HOME matches your Spark installation directory (here, /opt/spark).

When you are done, save and close the file.

Next, source your bashrc file so the changes take effect in your current shell (source is a shell builtin, so it must not be run with sudo):

source ~/.bashrc
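
To confirm that the variables are now visible, you can run a quick Python check like the one below. This is a minimal optional sketch, not part of the original guide; it only assumes you have sourced ~/.bashrc (or opened a new shell) as above.

# check_env.py - optional sanity check that Spark and Java are reachable
import os
import shutil

print("SPARK_HOME:", os.environ.get("SPARK_HOME"))   # expected: /opt/spark
print("pyspark on PATH:", shutil.which("pyspark"))   # expected: /opt/spark/bin/pyspark
print("java on PATH:", shutil.which("java"))         # PySpark needs a working Java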

Step 4 – How To Run Spark Shell on Ubuntu 22.04?

At this point, you can verify your Spark installation by running the Spark shell command:

spark-shell

If everything is ok, you should get the following output:

Output
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.0
      /_/

Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.19)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Step 5 – How To Run PySpark on Ubuntu 22.04?

If you want to use Python instead of Scala, you can easily run PySpark on your Ubuntu server with the command below:

pyspark

In your output, you should see the PySpark welcome banner along with your Spark and Python versions, ending at a >>> prompt. As in the Scala shell, the Spark session is available as spark.

From the PySpark shell, you can write and execute Python code interactively.
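
For example, here is a small sketch of what you might type at the PySpark prompt. The spark session object is created for you by the shell, and the sample data below is made up purely for illustration:

# A SparkSession is already available as `spark` inside the PySpark shell.
data = [("alice", 34), ("bob", 45), ("carol", 29)]
df = spark.createDataFrame(data, ["name", "age"])

df.filter(df.age > 30).show()    # rows with age greater than 30
df.groupBy().avg("age").show()   # average age across all rows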

Conclusion

At this point, you have learned how to install Apache Spark, run the Spark shell, and Install PySpark on Ubuntu 22.04 Terminal. With PySpark, you can interactively analyze your data. We hope you enjoy it.

Now that we’ve covered the standard method to Install PySpark on Ubuntu 22.04 Terminal, let’s explore alternative approaches.

Alternative Solutions to Installing PySpark on Ubuntu 22.04

While the manual installation process outlined above is effective, it involves several steps. Here are two alternative approaches that can simplify the process: using Anaconda and using Docker.

Alternative 1: Using Anaconda Environment

Anaconda is a popular Python distribution that simplifies package management and deployment. It provides an isolated environment where you can install PySpark and its dependencies without affecting other Python projects on your system. This is particularly useful for managing different versions of PySpark and its dependencies.

Steps:

  1. Install Anaconda: If you don’t have Anaconda installed, download the appropriate installer for Linux from the Anaconda website and follow the installation instructions.

  2. Create a Conda Environment: Create a new conda environment specifically for PySpark. This isolates your PySpark installation from other Python projects.

    conda create -n pyspark_env python=3.9 # or any Python version you prefer
    conda activate pyspark_env
  3. Install PySpark: Use pip, the Python package installer, to install PySpark within the activated conda environment. Make sure you also install findspark.

    pip install pyspark findspark
  4. Configure Environment (using findspark): findspark simplifies the process of making Spark available to your Python environment. Add the following to your Python script before you initialize Spark:

    import findspark
    findspark.init()
    import pyspark
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.appName("Example").getOrCreate()
    
    # Your PySpark code here
    spark.stop()

Explanation:

  • Anaconda manages dependencies and creates isolated environments, preventing conflicts.
  • findspark automatically configures the necessary environment variables for PySpark, simplifying setup.
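
If you also completed the manual installation above, you can point findspark at that copy of Spark instead of the pip-installed one; findspark.init() accepts the Spark home directory as an optional argument. The following is a minimal sketch, assuming Spark lives in /opt/spark as in Step 2:

    import findspark
    findspark.init("/opt/spark")  # point findspark at the Spark install from Step 2

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CondaEnvExample").getOrCreate()
    print(spark.range(10).count())  # quick sanity check: should print 10
    spark.stop()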

Alternative 2: Using Docker

Docker provides a containerization platform, allowing you to package PySpark and its dependencies into a portable container. This approach ensures consistent execution across different environments and simplifies deployment.

Steps:

  1. Install Docker: If you don’t have Docker installed, follow the instructions on the Docker website to install Docker Engine and Docker Compose on your Ubuntu system.

  2. Create a Dockerfile: Create a Dockerfile that defines the environment for your PySpark application. Here’s an example:

    FROM ubuntu:22.04
    
    # Update and install dependencies
    RUN apt-get update && \
        apt-get install -y openjdk-11-jdk python3 python3-pip wget && \
        apt-get clean
    
    # Set environment variables
    ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
    ENV SPARK_VERSION=3.4.0
    ENV HADOOP_VERSION=hadoop3
    
    # Download and extract Spark
    RUN wget https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-${HADOOP_VERSION}.tgz && \
        tar -xzf spark-${SPARK_VERSION}-bin-${HADOOP_VERSION}.tgz && \
        mv spark-${SPARK_VERSION}-bin-${HADOOP_VERSION} /opt/spark
    
    # Set Spark home
    ENV SPARK_HOME=/opt/spark
    ENV PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
    
    # Install PySpark
    RUN pip3 install pyspark
    
    # Set working directory
    WORKDIR /app
    
    # Copy application files (optional)
    COPY . /app
    
    # Command to run (example)
    CMD ["pyspark"]
  3. Build the Docker Image: Navigate to the directory containing the Dockerfile and build the Docker image.

    docker build -t pyspark-image .
  4. Run the Docker Container: Run the Docker container based on the image you built.

    docker run -it pyspark-image

Explanation:

  • The Dockerfile defines the base image (Ubuntu 22.04), installs Java, downloads and extracts Spark, sets environment variables, and installs PySpark.
  • The docker build command creates a Docker image from the Dockerfile.
  • The docker run command starts a container based on the image, providing an isolated environment for PySpark.
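
If you want the container to run your own job instead of the interactive shell, you could place a script in the build context so that COPY . /app picks it up, and change the CMD to something like ["spark-submit", "app.py"]. The filename app.py and the script below are assumptions for illustration, not part of the original Dockerfile:

    # app.py -- hypothetical entry point copied into /app by the Dockerfile above
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DockerizedPySpark").getOrCreate()

    # Tiny self-contained job: count the even numbers in a generated range.
    evens = spark.range(1000).filter("id % 2 = 0").count()
    print("Even numbers found:", evens)

    spark.stop()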

These alternative methods offer different trade-offs in terms of complexity and control. Anaconda simplifies dependency management, while Docker provides a fully isolated and reproducible environment. Choose the method that best suits your needs and experience level; either approach helps ensure a smooth and efficient way to Install PySpark on Ubuntu 22.04 Terminal.

This article has now covered the original method to Install PySpark on Ubuntu 22.04 Terminal, as well as providing two alternative ways of solving this problem.
