Install Apache Spark and Run PySpark on AlmaLinux 9: Easy Guide
This tutorial guides you through installing Apache Spark and running PySpark on AlmaLinux 9 directly from the command line. Before diving into the installation steps, let’s briefly understand what Apache Spark and PySpark are.
What Is Apache Spark?
Apache Spark is a powerful, open-source, and widely used big data distributed processing framework. It provides high-level APIs for development in various languages, including Java, Scala, Python, and R. Spark’s in-memory processing capabilities make it significantly faster than traditional disk-based data processing frameworks like Hadoop MapReduce.
What Is PySpark?
PySpark is the Python API for Apache Spark. It allows you to leverage the power of Spark for real-time, large-scale data processing in a distributed environment using the familiar Python language. This makes Spark accessible to a broader range of developers and data scientists.
Now, let’s proceed with the steps to Install Apache Spark and Run PySpark on AlmaLinux 9.
Before you begin, ensure you have the following prerequisites:
- Access to an AlmaLinux 9 server: You’ll need access to a server running AlmaLinux 9. Ideally, you should have a non-root user with sudo privileges. You can refer to a guide like Initial Server Setup with AlmaLinux 9 for setting up such a user.
- Java Development Kit (JDK) installed: Apache Spark requires Java to run. Make sure you have a JDK installed on your server. A helpful resource is How To Install Java with DNF on AlmaLinux 9.
With these prerequisites in place, you’re ready to begin the installation process.
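Before moving on, you can quickly confirm that Java is available. The exact version string will depend on the JDK you installed:
java -version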
Step 1 – Download Apache Spark with Hadoop on AlmaLinux 9
First, install a few supporting packages using the following command:
sudo dnf install mlocate git -y
Next, visit the Apache Spark Downloads page and obtain the link for the latest release of Apache Spark with the appropriate Hadoop version. Use the wget command to download the file.
Note: Hadoop is a crucial component of the big data ecosystem, providing distributed storage and processing capabilities. Spark can leverage Hadoop’s distributed file system (HDFS) for data access.
In this example, we’ll download Spark 3.4.1 with Hadoop 3:
sudo wget https://dlcdn.apache.org/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
Once the download is complete, extract the downloaded archive using the tar command:
sudo tar xvf spark-3.4.1-bin-hadoop3.tgz
Then, move the extracted directory to a more convenient location, such as /opt/spark:
sudo mv spark-3.4.1-bin-hadoop3 /opt/spark
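As a quick sanity check, you can list the scripts that ship with the distribution; you should see spark-shell, pyspark, and spark-submit among them (assuming the /opt/spark location used above):
ls /opt/spark/bin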
Step 2 – Configure Spark Environment Variables on AlmaLinux 9
Now, you need to configure the environment variables that Spark requires. Add these variables to your ~/.bashrc file. Since ~/.bashrc belongs to your own user, sudo is not needed here. Open the file using a text editor like vi:
vi ~/.bashrc
Add the following lines to the end of the file:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Note: Ensure that the SPARK_HOME variable points to the correct Spark installation directory.
Save and close the file.
Next, apply the changes by sourcing the ~/.bashrc file (source is a shell builtin, so it does not need sudo):
source ~/.bashrc
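To confirm that the variables are set and the Spark binaries are on your PATH, you can run the following; the version reported should match the release you downloaded:
echo $SPARK_HOME
spark-submit --version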
Step 3 – How To Run Spark Shell on AlmaLinux 9?
To verify your Spark installation, run the Spark shell:
spark-shell
If the installation was successful, you should see output similar to the following:
**Output**
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.4.1
      /_/
Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 11.0.19)
Type in expressions to have them evaluated.
Type :help for more information.
**scala>**
This indicates that Spark is running correctly, and you can now interact with the Spark shell using Scala.
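As an optional, non-interactive check, you can also submit the SparkPi example that ships with the distribution. The jar path below assumes the /opt/spark layout from Step 1 and Spark 3.4.1 built for Scala 2.12, so adjust it if your version differs:
spark-submit --class org.apache.spark.examples.SparkPi /opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar 10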
Step 4 – How To Run PySpark on AlmaLinux 9?
To use Python with Spark, you can run PySpark using the following command:
pyspark
The output should look similar to this:
**Output**
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.4.1
      /_/
Using Python version 3.9.16 (main, May 29 2023 00:00:00)
Spark context Web UI available at http://37.120.247.7:4040
Spark context available as 'sc' (master = local[*], app id = local-1689750671688).
SparkSession available as 'spark'.
>>>
You can now write and execute Python code within the PySpark shell, leveraging Spark’s distributed processing capabilities.
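As shown in the output above, the shell starts with master = local[*] by default, meaning it uses all available local cores. If you want to limit this, you can pass a different master URL when starting the shell (local[2] here is just an illustrative value):
pyspark --master local[2]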
Conclusion
You have successfully installed Apache Spark and run PySpark on AlmaLinux 9. PySpark enables you to perform real-time, large-scale data processing in a distributed environment using Python. This guide has provided the basic knowledge you need to get started.
Alternative Solutions for Installing and Running Spark
While the above steps provide a manual approach to installing Spark, there are alternative methods that can simplify the process and offer more flexibility. Here are two alternative solutions:
1. Using Conda Environment
Conda is a popular package, dependency, and environment management system. It allows you to create isolated environments for your projects, preventing conflicts between different software versions. You can use Conda to install Spark and its dependencies, including Python and Java.
Explanation:
Conda manages dependencies effectively. Creating a separate environment ensures that Spark’s dependencies don’t clash with other Python packages on your system. This is particularly useful if you have multiple projects with different dependency requirements.
Steps:
- Install Conda: If you don’t have Conda installed, download and install Miniconda (a minimal Conda distribution) from the official Anaconda website. Follow the instructions for Linux.
- Create a Conda environment:
conda create -n spark_env python=3.9
conda activate spark_env
This creates a new environment named spark_env with Python 3.9 and activates it.
- Install PySpark:
conda install -c conda-forge pyspark
This installs PySpark and its dependencies within the Conda environment. Conda can also pull in a compatible JDK if one is not already available.
- Run PySpark:
pyspark
You can now run PySpark within the activated Conda environment.
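To confirm that the shell is resolving the Conda-provided PySpark rather than a system-wide installation, you can check where the command lives and which version is importable; the exact path and version will vary on your system:
which pyspark
python -c "import pyspark; print(pyspark.__version__)"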
Code Example:
There isn’t a specific code example for the installation process itself, as it primarily involves using Conda commands. However, here’s an example of running PySpark within the Conda environment:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("CondaPySparkExample").getOrCreate()
# Sample data
data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
# Create a DataFrame
df = spark.createDataFrame(data, ["Name", "Age"])
# Show the DataFrame
df.show()
# Stop the SparkSession
spark.stop()
Save this code as example.py and run it using spark-submit example.py. This will execute the PySpark code using the Spark installation within the Conda environment.
2. Using Docker
Docker is a containerization platform that allows you to package applications and their dependencies into isolated containers. You can use Docker to create a container with Spark pre-installed, making it easy to deploy and run Spark applications consistently across different environments.
Explanation:
Docker provides a consistent and reproducible environment for running Spark. The container encapsulates all the necessary dependencies, eliminating the need to manually install and configure them on the host system. This is particularly useful for deploying Spark applications in production environments.
Steps:
- Install Docker: If you don’t have Docker installed, follow the instructions on the official Docker website to install it on AlmaLinux 9 (one common approach is sketched after these steps).
- Pull a Spark Docker image:
docker pull apache/spark:3.4.1
This pulls the official Apache Spark Docker image with version 3.4.1.
- Run the Docker container:
docker run -it --name spark-container apache/spark:3.4.1 /bin/bash
This creates and starts a Docker container named spark-container based on the pulled image and opens a bash shell within the container.
- Run PySpark: Within the Docker container, you can now run PySpark:
./bin/pyspark
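As referenced in the Install Docker step above, one common approach on AlmaLinux 9 is to use Docker’s upstream RPM repository. This is only a sketch, so consult the official Docker documentation for current instructions:
sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo dnf install -y docker-ce docker-ce-cli containerd.io
sudo systemctl enable --now docker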
Code Example:
Here’s an example of running PySpark within the Docker container:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("DockerPySparkExample").getOrCreate()
# Sample data
data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
# Create a DataFrame
df = spark.createDataFrame(data, ["Name", "Age"])
# Show the DataFrame
df.show()
# Stop the SparkSession
spark.stop()
Save this code as example.py. To run it within the container, you’d need to either copy the file into the container or mount a volume containing it (a sketch of both approaches follows below). For simplicity, let’s assume you copied the file into the container’s / directory. Then, run:
./bin/spark-submit /example.py
This will execute the PySpark code using the Spark installation within the Docker container.
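For reference, here is one way to get example.py from the host into the container, using either docker cp into the running container or a bind mount on a fresh container. The container name and paths follow the examples above, and this sketch assumes Spark lives at /opt/spark inside the image:
# Option 1: copy the script into the running container, then submit it there
docker cp example.py spark-container:/example.py
docker exec -it spark-container /opt/spark/bin/spark-submit /example.py
# Option 2: mount the current directory into a fresh container and submit directly
docker run --rm -it -v "$PWD":/work apache/spark:3.4.1 /opt/spark/bin/spark-submit /work/example.py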
These alternative solutions offer different advantages over the manual installation method. Conda simplifies dependency management and environment isolation, while Docker provides a consistent and reproducible environment for deploying Spark applications. Choosing the best method depends on your specific needs and preferences.