
How to install and configure Apache Hadoop on Ubuntu 20.04



Apache Hadoop is an open-source framework used to manage, store, and process data for various big data applications running on clustered systems. It is written in Java, with some embedded code in C and shell scripts. It uses a distributed file system (HDFS) and scales from single servers to thousands of machines.

Apache Hadoop is based on four main components:

  • Hadoop Common: the collection of utilities and libraries needed by the other Hadoop modules.
  • HDFS: the Hadoop Distributed File System, which stores data distributed across multiple nodes.
  • MapReduce: a framework used to write applications that process huge amounts of data.
  • Hadoop YARN: short for Yet Another Resource Negotiator, the resource management layer of Hadoop.

In this tutorial, we will explain how to set up a single-node Hadoop cluster on Ubuntu 20.04.

Prerequisites

  • A server running Ubuntu 20.04 with 4 GB of RAM.
  • A root password is configured on your server.

Update system packages

Before you begin, it is recommended that you update your system packages to the latest version. You can do so with the following command:

  apt-get update -y 
apt-get upgrade -y

Once your system is updated, restart it to apply the changes.
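
If you are working on the console or over SSH, you can restart the server directly from the command line (as root):

  reboot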

Installing Java

Apache Hadoop is a Java-based application, so you need to install Java on your system. You can install it with the following command:

  apt-get install default-jdk default-jre -y 

Once installed, you can verify the installed version of Java with the following command:

  java -version

You should get the following output:

  openjdk version "11.0.7" 2020-04-14
  OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-3ubuntu1)
  OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-3ubuntu1, mixed mode, sharing)

Create Hadoop User and Setup Passwordless SSH

First create a new user named hadoop with the following command:

  adduser hadoop 

Then add the hadoop user to the sudo group:

  usermod -aG sudo hadoop

Then log in as the hadoop user and generate an SSH key pair with the following command:

  su - hadoop 
ssh-keygen -t rsa

You should get the following output:

  Generating public/private rsa key pair.
  Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
  Created directory '/home/hadoop/.ssh'.
  Enter passphrase (empty for no passphrase):
  Enter same passphrase again:
  Your identification has been saved in /home/hadoop/.ssh/id_rsa
  Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub
  The key fingerprint is:
  SHA256:HG2K6x1aCGuJMqRKJb+GKIDRdKCd8LXnGsB7WSxApno hadoop@ubuntu2004
  The key's randomart image is:
  +---[RSA 3072]----+
  |..=..            |
  |O.+.o.           |
  |oo*.o+. o        |
  |o .o*o+          |
  |o + E.=o S       |
  |=.+o*o           |
  |*.o.=o o         |
  |=+o ..+.         |
  |o ..o.           |
  +----[SHA256]-----+

Then add this key to the authorized SSH keys and set the correct permissions:

  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 0600 ~/.ssh/authorized_keys

Then verify passwordless SSH with the following command:

  ssh localhost 

When you are logged in without a password, you can proceed to the next step.
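
Note that ssh localhost opens a new shell session on the same machine; once you have confirmed that no password is requested, you can leave it again:

  exit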

Install Hadoop

First, log in as the hadoop user and download Hadoop 3.2.1 with the following command:

  su - hadoop
  wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

When the download is complete, extract the downloaded file with the following command:

  tar -xvzf hadoop-3.2.1.tar.gz 

Then move the extracted directory to /usr/local/:

  sudo mv hadoop-3.2.1 /usr/local/hadoop

Then create a directory to store logs with the following command:

  sudo mkdir /usr/local/hadoop/logs

Then change the ownership of the hadoop directory to the hadoop user:

  sudo chown -R hadoop:hadoop /usr/local/hadoop

Next, you need to configure the Hadoop environment variables. You can do this by editing the ~/.bashrc file:

  nano ~/.bashrc

Add the following lines:

  export HADOOP_HOME=/usr/local/hadoop
  export HADOOP_INSTALL=$HADOOP_HOME
  export HADOOP_MAPRED_HOME=$HADOOP_HOME
  export HADOOP_COMMON_HOME=$HADOOP_HOME
  export HADOOP_HDFS_HOME=$HADOOP_HOME
  export YARN_HOME=$HADOOP_HOME
  export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
  export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
  export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Save and close the file when you are done. Then enable the environment variables with the following command:

  source ~/.bashrc
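
As a quick sanity check, you can confirm that the variables are now set in the current shell; the command below should print the installation path used above:

  echo $HADOOP_HOME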

Configure Hadoop

In this section, we will learn how to set up Hadoop on a single node.

Configuring Java Environment Variables

Next, you must define the Java environment variables in hadoop-env.sh to configure YARN, HDFS, MapReduce, and other Hadoop-related settings.

First, find the correct Java path using the following command:

  which javac 

You should see the following output:

  /usr/bin/javac

Then find the OpenJDK directory with the following command:

  readlink -f /usr/bin/javac

You should see the following output:

  /usr/lib/jvm/java-11-openjdk-amd64/bin/javac

Then edit the hadoop-env.sh file and define the Java path:

  sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add the following lines:

  export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
  export HADOOP_CLASSPATH+="$HADOOP_HOME/lib/*.jar"

Then you must also download the Javax activation file. You can download it with the following command:

  cd /usr/local/hadoop/lib
  sudo wget https://jcenter.bintray.com/javax/activation/javax.activation-api/1.2.0/javax.activation-api-1.2.0.jar
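
Optionally, you can confirm that the JAR was saved to the Hadoop lib directory:

  ls -l /usr/local/hadoop/lib/javax.activation-api-1.2.0.jar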

You can now verify the Hadoop version with the following command:

  hadoop version 

You should get the following output:

  Hadoop 3.2.1
  Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r b3cbbb467e22ea829b3808f4b7b01d07e0bf3842
  Compiled by rohithsharmaks on 2019-09-10T15:56Z
  Compiled with protoc 2.5.0
  From source with checksum 776eaf9eee9c0ffc370bcbc1888737
  This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.2.1.jar

Configure core-site.xml file

Next, you must specify the URL of your NameNode. You can do this by editing the core-site.xml file:

  sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following lines:

  <configuration>
     <property>
        <name>fs.default.name</name>
        <value>hdfs://0.0.0.0:9000</value>
        <description>The default file system URI</description>
     </property>
  </configuration>


Save and close the file when you are finished.

Configure hdfs-site.xml File

Then you must define the location for storing node metadata, the fsimage file, and the edit log file. You can do so by editing the hdfs-site.xml file. First, create directories for storing node metadata:

  sudo mkdir -p /home/hadoop/hdfs/{namenode,datanode}
  sudo chown -R hadoop:hadoop /home/hadoop/hdfs
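
You can verify that both directories exist and are owned by the hadoop user:

  ls -l /home/hadoop/hdfs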

Next, edit the hdfs-site.xml file and define the location of the directories:

  sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following lines:

  <configuration>
     <property>
        <name>dfs.replication</name>
        <value>1</value>
     </property>

     <property>
        <name>dfs.name.dir</name>
        <value>file:///home/hadoop/hdfs/namenode</value>
     </property>

     <property>
        <name>dfs.data.dir</name>
        <value>file:///home/hadoop/hdfs/datanode</value>
     </property>
  </configuration>

Save and close the file.

Configure mapred-site.xml File

Then you must define which framework MapReduce will use. You can do so by editing the mapred-site.xml file:

  sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following lines:

  <configuration>
     <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
     </property>
  </configuration>

Save and close the file.

Configure yarn-site.xml File

Then you need to edit the yarn-site.xml file and define YARN-related settings:

  sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the following lines:

  <configuration>
     <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
     </property>
  </configuration>

Save and close the file when you are done.

Format HDFS NameNode

Next, you must validate the Hadoop configuration and format the HDFS NameNode.

First, log in as the hadoop user and format the HDFS NameNode with the following command:

  su - hadoop
  hdfs namenode -format

You should get the following output:

  2020-06-07 11:35:57,691 INFO util.GSet: VM type       = 64-bit
  2020-06-07 11:35:57,692 INFO util.GSet: 0.25% max memory 1.9 GB = 5.0 MB
  2020-06-07 11:35:57,692 INFO util.GSet: capacity      = 2^19 = 524288 entries
  2020-06-07 11:35:57,706 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
  2020-06-07 11:35:57,706 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
  2020-06-07 11:35:57,706 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
  2020-06-07 11:35:57,710 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
  INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
  2020-06-07 11:35:57,712 INFO util.GSet: Computing capacity for map NameNodeRetryCache
  2020-06-07 11:35:57,712 INFO util.GSet: VM type       = 64-bit
  2020-06-07 11:35:57,712 INFO util.GSet: 0.029999999329447746% max memory 1.9 GB = 61.9 KB
  2020-06-07 11:35:57,712 INFO util.GSet: capacity      = 2^16 = 65536 entries
  2020-06-07 11:35:57,743 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1242120599-69.87.216.36-1591529757733
  2020-06-07 11:35:57,763 INFO common.Storage: Storage directory /home/hadoop/hdfs/namenode has been successfully formatted.
  2020-06-07 11:35:57,817 INFO namenode.FSImageFormatProtobuf: Saving image file /home/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 using no compression
  2020-06-07 11:35:57,972 INFO namenode.FSImageFormatProtobuf: Image file /home/hadoop/hdfs/namenode/current/fsimage.ckpt_0000000000000000000 of size 398 bytes saved in 0 seconds.
  2020-06-07 11:35:57,987 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
  2020-06-07 11:35:58,000 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid=0 when meet shutdown.
  2020-06-07 11:35:58,003 INFO namenode.NameNode: SHUTDOWN_MSG:
  /************************************************************
  SHUTDOWN_MSG: Shutting down NameNode at ubuntu2004/69.87.216.36
  ************************************************************/

Start Hadoop Cluster

First start NameNode and DataNode with the following command:

  start-dfs.sh 

You should get the following output:

  Starting namenodes on [0.0.0.0]
  Starting datanodes
  Starting secondary namenodes [ubuntu2004]

Then start the YARN resource and node managers by running the following command:

  start-yarn.sh 

You should get the following output:

  Starting resourcemanager
  Starting nodemanagers

You can now verify the running Hadoop daemons with the following command:

  jps 

You should get the following output:

  5047 NameNode
  5850 Jps
  5326 SecondaryNameNode
  5151 DataNode
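
start-yarn.sh also launches the ResourceManager and NodeManager daemons; they may take a few seconds to show up in the jps output. You can check for them specifically with, for example:

  jps | grep -E 'ResourceManager|NodeManager'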

Accessing the Hadoop Web Interface

You can now access the Hadoop NameNode web interface at the URL http://your-server-ip:9870. You should see the following screen:


You can also access the individual DataNodes at the URL http://your-server-ip:9864. The following screen appears:

 Hadoop Data Node

To access the YARN Resource Manager, use the URL http://your-server-ip:8088. You should see the following screen:

 Hadoop Yarn Resource Manager

Conclusion

Congratulations! You have installed Hadoop on a single node. You can now start exploring basic HDFS commands and designing a fully distributed Hadoop cluster, as sketched below. Feel free to ask me if you have any questions.
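
As a quick first experiment, for example, you could create a directory in HDFS, copy a local file into it, and list its contents; the directory name below is only an illustrative example:

  hdfs dfs -mkdir -p /user/hadoop/test
  hdfs dfs -put /etc/hosts /user/hadoop/test/
  hdfs dfs -ls /user/hadoop/test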

