
Configuring Hadoop on Ubuntu 20.04

-Using Ubuntu 20.04 & Hadoop 3.3.0-

*Inspired by https://www.digitalocean.com/community/tutorials/how-to-spin-up-a-hadoop-cluster-with-digitalocean-droplets, with changes made to reflect updates to Ubuntu and Hadoop.*


Initial Configuration (on each node)

Open a terminal on each node and run the following commands:

sudo apt-get update && sudo apt-get -y dist-upgrade
sudo adduser hadoop
sudo usermod -aG sudo hadoop

On each node, change the hostname to something unique. For my setup, I used “namenode” on the name node, and “datanode” on the worker node. Use the following command on each respective node to do this, replacing “<nodename>” with the name you choose:

sudo hostnamectl set-hostname <nodename>
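
For example, on the name node in my setup this was:

sudo hostnamectl set-hostname namenode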

Reboot each node using the following command:

sudo reboot

Log in to the newly created “hadoop” user. Open a terminal and run the following command to edit the “hosts” file. Comment out the “localhost” entries by adding a preceding “#”, then add the IP addresses and respective hostnames of your Hadoop nodes. See the example below.

sudo nano /etc/hosts
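
The edited file should end up looking roughly like the following (the addresses here are placeholders; substitute your own nodes’ IPs and hostnames):

# 127.0.0.1     localhost
# 127.0.1.1     namenode

203.0.113.10    namenode
203.0.113.11    datanode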

Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following commands:

sudo apt-get -y install openjdk-8-jdk
sudo apt install openssh-server openssh-client -y
mkdir my-hadoop-install && cd my-hadoop-install
wget http://mirror.cc.columbia.edu/pub/software/apache/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
tar xvzf hadoop-3.3.0.tar.gz
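
Before moving on, you can optionally confirm that the JDK installed correctly and that the Hadoop archive unpacked where expected:

java -version
ls ~/my-hadoop-install/hadoop-3.3.0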

Hadoop Environment Configuration (on each node)

Open a terminal on each node and run the following command:

nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/hadoop-env.sh

Add the following lines anywhere in the file, making sure they are not commented out (no preceding “#”):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HDFS_NAMENODE_USER="hadoop"
export HDFS_DATANODE_USER="hadoop"
export HDFS_SECONDARYNAMENODE_USER="hadoop"
export YARN_RESOURCEMANAGER_USER="hadoop"
export YARN_NODEMANAGER_USER="hadoop"

Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following commands:

source ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/hadoop-env.sh
sudo mkdir -p /usr/local/hadoop/hdfs/data
sudo chown -R hadoop:hadoop /usr/local/hadoop/hdfs/data
nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/core-site.xml

Between “<configuration>” and “</configuration>”, add the following, replacing “master-server-ip” with your own master/namenode server’s IP address. Do NOT replace it with each server’s own IP address (as the DigitalOcean guide recommends); every node should point at the master/namenode.

<property>
        <name>fs.defaultFS</name>
        <value>hdfs://master-server-ip:9000</value>
</property>
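
For reference, the edited section of core-site.xml should end up looking roughly like this (with your own master/namenode IP in place of the placeholder):

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://master-server-ip:9000</value>
        </property>
</configuration>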

Setup Passwordless SSH

On the master/namenode server, run the following command, and then press “Enter” three times to accept the default key location and an empty passphrase.

ssh-keygen

Run the following command and copy the entire output onto your clipboard.

cat ~/.ssh/id_rsa.pub

Run the following command on both the master/namenode server and any worker nodes, and paste in the output from the previous command.

nano ~/.ssh/authorized_keys
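
Alternatively, if you would rather not paste the key by hand, ssh-copy-id (installed alongside openssh-client) can append it to each node’s authorized_keys for you; replace the placeholder with the relevant node’s IP:

ssh-copy-id hadoop@hadoop-worker-01-server-ip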

On the master/namenode server, run the following command:

nano ~/.ssh/config

Edit the file using the following format, replacing “hadoop-master-server-ip” and “hadoop-worker-01-server-ip” with your respective master/namenode and worker node IPs.

Host hadoop-master-server-ip
    HostName hadoop-master-server-ip
    User hadoop
    IdentityFile ~/.ssh/id_rsa

Host hadoop-worker-01-server-ip
    HostName hadoop-worker-01-server-ip
    User hadoop
    IdentityFile ~/.ssh/id_rsa

Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal.

SSH into your worker node(s) from your master/namenode server, replacing “hadoop-worker-01-ip” with your respective IP(s).

ssh hadoop@hadoop-worker-01-server-ip

Reply to the prompt with “yes”, and then logout by typing the following (admittedly self-explanatory) command:

logout

Configure the Master Node

On the master node, run the following command:

nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/hdfs-site.xml

Between “<configuration>” and “</configuration>”, add the following:

<property>
        <name>dfs.replication</name>
        <value>3</value>
</property>
<property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/local/hadoop/hdfs/data</value>
</property>

Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following command:

nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/mapred-site.xml

Between “<configuration>” and “</configuration>”, add the following, replacing “hadoop-master-server-ip” with your master/namenode server’s IP address.

<property>
        <name>mapreduce.jobtracker.address</name>
        <value>hadoop-master-server-ip:54311</value>
</property>
<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
</property>

Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following command:

nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/yarn-site.xml

Between “<configuration>” and “</configuration>”, add the following, replacing “hadoop-master-server-ip” with your master/namenode server’s IP address.

<property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
</property>
<property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-master-server-ip</value>
</property>

Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following command:

nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/masters

Type in your master/namenode server’s IP address.

hadoop-master-server-ip

Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal. Run the following command:

nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/workers

Type in your worker node(s) IP address(es), one per line, below the “localhost” entry.

localhost
hadoop-worker-01-server-ip
hadoop-worker-02-server-ip
hadoop-worker-03-server-ip

Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal.

Configure the Worker Node(s)

On the worker node(s), run the following command:

nano ~/my-hadoop-install/hadoop-3.3.0/etc/hadoop/hdfs-site.xml

Between “<configuration>” and “</configuration>”, add the following.

<property>
        <name>dfs.replication</name>
        <value>3</value>
</property>
<property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/local/hadoop/hdfs/data</value>
</property>

Press “Ctrl + X”, “Y” and then “Enter” to save your changes and return to the terminal.

Starting up Hadoop

On the master/namenode server, run the following commands:

cd ~/my-hadoop-install/hadoop-3.3.0/
sudo ./bin/hdfs namenode -format
sudo ./sbin/start-dfs.sh
./sbin/start-yarn.sh
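
When you want to shut the cluster down later, the matching stop scripts live in the same directory (shown here mirroring the start commands above):

./sbin/stop-yarn.sh
sudo ./sbin/stop-dfs.sh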

Verify Functionality

On each of your nodes, run the following command to ensure Hadoop processes are running:

jps
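
The exact list depends on your configuration, but on the master/namenode (which also runs worker processes because of the “localhost” entry in the workers file) the output should look roughly like the following, with different process IDs. The worker nodes should show at least DataNode, NodeManager and Jps.

21312 NameNode
21456 DataNode
21578 SecondaryNameNode
21847 ResourceManager
21995 NodeManager
22467 Jps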

On the master/namenode, visit the following URL in a web browser, replacing “hadoop-master-server-ip” with your own master/namenode server’s IP address:

http://hadoop-master-server-ip:9870

Click on “Datanodes” on the menu bar, and ensure that all of your worker nodes’ IP addresses are showing up on the web GUI.
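
As an extra check from the command line, you can also ask HDFS for a datanode report on the master/namenode (run as the “hadoop” user); each live worker should appear in the output:

cd ~/my-hadoop-install/hadoop-3.3.0/
./bin/hdfs dfsadmin -report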