
How to install Apache Spark on Ubuntu 20.04


Apache Spark is an open-source, general-purpose cluster computing framework. Spark provides high-level APIs in Java, Scala, Python, and R that support general execution graphs. It comes with built-in modules for streaming, SQL, machine learning, and graph processing. It can analyze large amounts of data, distribute the work across the cluster, and process the data in parallel.

In this tutorial, we will explain how to install the Apache Spark cluster computing framework on Ubuntu 20.04.

Prerequisites

  • A server running Ubuntu 20.04.
  • A root password is configured on the server.

Getting Started

First, you need to update your system packages to the latest version. You can update them all with the following command:

apt-get update -y

Once all packages have been updated, you can proceed to the next step.

Install Java

Apache Spark is a Java-based application, so Java must be installed on your system. You can install it with the following command:

apt-get install default-jdk -y

Once Java is installed, verify the installed version of Java with the following command:

java --version

You should see the following output:

openjdk 11.0.8 2020-07-14
OpenJDK Runtime Environment (build 11.0.8+10-post-Ubuntu-0ubuntu120.04)
OpenJDK 64-Bit Server VM (build 11.0.8+10-post-Ubuntu-0ubuntu120.04, mixed mode, sharing)

Install Scala

Apache Spark is written in Scala, so you also need to install Scala on your system. You can install it with the following command:

apt-get install scala -y

After the installation of Scala, you can verify the Scala version with the following command:

scala -version

You should see the following output:

Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

Now connect to the Scala interface with the following command:

scala

You should get the following output:

Welcome to Scala 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.8).
Type in expressions for evaluation. Or try :help.

Now test Scala with the following command:

scala> println("Hitesh Jethva")

You should get the following output:

Hitesh Jethva
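
You can also evaluate ordinary Scala expressions in the REPL, and exit it with :quit when you are done. For example, the following addition should be evaluated and printed (the result variable name, such as res1, may differ):

scala> 1 + 2
scala> :quit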

Install Apache Spark

First, you need to download the latest version of Apache Spark from its official website. At the time of writing, the latest version of Apache Spark is 2.4.6. You can download it to the /opt directory with the following command:

cd /opt
wget https://archive.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz

After downloading, extract the downloaded file with the following command:

tar -xvzf spark-2.4.6-bin-hadoop2.7.tgz

Then rename the extracted directory to spark as shown below:

mv spark-2.4.6-bin-hadoop2.7 spark

Next, you need to configure the Spark environment so that you can easily run Spark commands. You can configure it by editing the .bashrc file:

nano ~/.bashrc

Add the following lines to the end of the file:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Save and close the file and then activate the environment with the following command:

source ~/.bashrc
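
To confirm that the new environment variables are in effect, you can check that SPARK_HOME is set and that the Spark scripts are found on your PATH; both commands below should point to /opt/spark:

echo $SPARK_HOME
which spark-shell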

Launch Spark Master Server

At this point, Apache Spark is installed and configured. Now start the Spark master server with the following command:

start-master.sh

You should see the following output:

starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu2004.out

By default, the Spark master web UI listens on port 8080. You can check this with the following command:

ss -tpln | grep 8080

You should see the following output:

LISTEN   0        1                               *:8080                *:*      users:(("java",pid=4930,fd=249))   
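
If another service on your server already uses port 8080, you can run the master web UI on a different port. The start-master.sh script accepts a --webui-port option; stop the master first with stop-master.sh, then restart it with an alternative port (8081 here is just an arbitrary free port):

start-master.sh --webui-port 8081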

Now open your browser and access the Spark web interface at the URL http://your-server-ip:8080. You should see the following screen:

Apache Spark Web UI

Start the Spark Worker Process

As you can see, the Spark master service is running at spark://your-server-ip:7077, so you can use this address to start the Spark worker process with the following command:

start-slave.sh spark://your-server-ip:7077

You should see the following output:

starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu2004.out

Now go to the Spark dashboard and refresh the screen. You should see the Spark worker process on the following screen:

Apache Spark Worker
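
By default, the worker offers all CPU cores and most of the available memory on the machine to Spark. If you want to limit this, the start-slave.sh script accepts --cores and --memory options; for example, to offer two cores and 2 GB of memory (adjust the values to your hardware):

start-slave.sh spark://your-server-ip:7077 --cores 2 --memory 2G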

Work with Spark Shell

You can also connect to the Spark server from the command line using the spark-shell command, as shown below:

spark-shell

Once connected, you should see the following output:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.6.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/08/29 14:35:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://ubuntu2004:4040
Spark context available as 'sc' (master = local[*], app id = local-1598711719335).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/
         
Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 11.0.8)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
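
As a quick smoke test, you can create a small RDD from the built-in SparkContext (available as sc) and run a couple of actions on it. The count should return 100 and the sum 5050.0; type :quit to leave the shell:

scala> val data = sc.parallelize(1 to 100)
scala> data.count()
scala> data.sum()
scala> :quit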

If you want to use Python with Spark, you can use the pyspark command-line tool.

First install Python version 2 with the following command:

apt-get install python -y

Once installed, you can connect to Spark with the following command:

pyspark

Once connected, you should get the following output:

Python 2.7.18rc1 (default, Apr  7 2020, 12:05:55) 
[GCC 9.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.11-2.4.6.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/08/29 14:36:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.6
      /_/

Using Python version 2.7.18rc1 (default, Apr  7 2020 12:05:55)
SparkSession available as 'spark'.
>>> 
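
The same kind of smoke test works from pyspark, where the SparkContext is also exposed as sc; count() should again return 100, and exit() leaves the shell:

>>> data = sc.parallelize(range(1, 101))
>>> data.count()
>>> exit()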

To stop the master and slave servers, run the following commands:

stop-slave.sh
stop-master.sh
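
Alternatively, Spark ships a stop-all.sh script (also in $SPARK_HOME/sbin, which is already on your PATH) that stops the master and the local worker in one step on a single-node setup like this one:

stop-all.sh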

Conclusion

Congratulations! You have successfully installed Apache Spark on your Ubuntu 20.04 server. You should now be able to perform basic tests before you start setting up a full Spark cluster. Feel free to ask me if you have any questions.

