Setting Up a Single-Node Hadoop Cluster on a Linux Server


Notes and pitfalls encountered while setting up a single-node Hadoop cluster on a cloud server.

Official Hadoop guide: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

Prerequisites

Server specs provisioned for the team:

8 cores; 16 GB RAM; System disk: 100 GB; Data disk: 500 GB
TencentOS Server

For simplicity, no containers are used — the service is deployed and run natively on the host machine (production deployments will likely migrate to Docker later).

Hadoop Installation

Install Java and SSH

Pre-installed; steps omitted.
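A quick sanity check (standard POSIX tooling) that the prerequisites are actually present before continuing:

```shell
# Verify that a JDK and an SSH client are on the PATH.
command -v java >/dev/null && echo "java: ok" || echo "java: MISSING"
command -v ssh  >/dev/null && echo "ssh: ok"  || echo "ssh: MISSING"
```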

Official docs recommend installing pdsh:

yum install pdsh
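One pitfall worth flagging here: pdsh defaults to the rsh protocol for remote commands, and with that default the HDFS start scripts later fail with "Connection refused" errors. Forcing pdsh onto ssh avoids this:

```shell
# pdsh uses rsh by default; rsh is usually not installed, so the
# Hadoop start scripts fail. Force pdsh to go over ssh instead.
# Add this to ~/.bashrc (or etc/hadoop/hadoop-env.sh) to make it stick.
export PDSH_RCMD_TYPE=ssh
```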

Download the Distribution

Latest stable release (3.4.0 at the time of writing): https://dlcdn.apache.org/hadoop/common/current/hadoop-3.4.0.tar.gz

Download locally, then upload to the server via scp:

scp -P 36000 hadoop-3.4.0.tar.gz <user>@xxx.xxx.xxx.xxx:/data/download

Extract on the server:

tar -zxvf hadoop-3.4.0.tar.gz

Enter the Hadoop directory. Since Hadoop's executables are not added to the system PATH, all subsequent commands are run from this directory using relative paths (bin/..., sbin/...).
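Alternatively, the repeated bin/ and sbin/ prefixes can be avoided by putting Hadoop on the PATH. The directory below is this setup's extraction path; adjust as needed:

```shell
# Optional: expose the hadoop/hdfs/yarn commands globally instead of
# running everything from the distribution directory.
export HADOOP_HOME=/data/download/hadoop-3.4.0
export PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
```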

Configure Java

Find the Java installation path:

java -XshowSettings:properties -version 2>&1 | grep 'java.home'
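The java.home path printed by that command can be copied straight into JAVA_HOME. A sketch that derives it automatically from whichever java is on the PATH (assumes GNU readlink, which TencentOS provides):

```shell
# Resolve the real java binary through any symlinks, then strip the
# trailing /bin/java to get the JDK root directory.
JAVA_HOME="$(dirname "$(dirname "$(readlink -f "$(command -v java)")")")"
echo "$JAVA_HOME"
```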

Edit etc/hadoop/hadoop-env.sh:

 # set to the root of your Java installation
 export JAVA_HOME=/usr/java/latest

Run:

bin/hadoop

The command's usage documentation should be printed at this point, confirming that Hadoop can find Java and run.

HDFS Testing

A Hadoop cluster can be configured in the following modes:

  • Local (Standalone) Mode — single node, single process
  • Pseudo-Distributed Mode — single node, multiple processes
  • Fully-Distributed Mode — multiple nodes, multiple processes

Since only one machine is available, the first two modes are tested here.

Standalone Operation

mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar grep input output 'dfs[a-z.]+'
cat output/*

This example runs the bundled grep application: a MapReduce job that matches the regular expression 'dfs[a-z.]+' against the input files and writes the count of each match to the output directory.

Pseudo-Distributed Operation

This is where things get tricky. The cloud server’s sshd listens on port 36000 rather than the default 22, so Hadoop’s SSH-based start scripts have to be told about the non-standard port.

Edit etc/hadoop/hadoop-env.sh and add:

export HADOOP_SSH_OPTS="-p 36000"

Edit core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

(A hadoop.ssh.port property is sometimes suggested online, but it is not a property the stock Hadoop scripts read; the HADOOP_SSH_OPTS export above is what actually carries the port.)

Official description of fs.defaultFS:

The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.

This property sets the type and address of the default filesystem. In the configuration above, the scheme is hdfs and the authority is localhost:9000. On a default install, localhost resolves via /etc/hosts (not DNS) to 127.0.0.1, the IPv4 loopback address. In a production environment, this should be set to the address and exposed port of the HDFS NameNode.

You can check the NameNode address with:

bin/hdfs getconf -namenodes

Edit hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>localhost:50090</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data/hdfs/data</value>
    </property>
</configuration>

The replication factor is set to 1, since there is only one DataNode. dfs.namenode.secondary.http-address binds the Secondary NameNode to the local machine on port 50090 (the Hadoop 2.x default; Hadoop 3 defaults to 9868). Storage directories for the NameNode and DataNode are placed on the data disk; without explicit storage directories, Hadoop falls back to a location under /tmp and subsequent steps will fail. Use the following command to check DataNode status:

bin/hdfs dfsadmin -report

Configure passwordless SSH login to localhost:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
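Before formatting HDFS it is worth confirming that the key actually works non-interactively on the custom port. BatchMode makes ssh fail instead of falling back to a password prompt (both are standard OpenSSH options); sshd also silently ignores an authorized_keys file whose permissions are looser than 0600:

```shell
# Permissions looser than 0600 make sshd ignore authorized_keys.
stat -c '%a' ~/.ssh/authorized_keys   # expect: 600
# Should run `true` remotely with no password prompt; a non-zero
# exit means the key setup (or the sshd port) is wrong.
ssh -p 36000 -o BatchMode=yes localhost true && echo "passwordless SSH OK"
```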

Stop any running HDFS daemons, format the NameNode (note: formatting erases any existing HDFS metadata), then start HDFS:

sbin/stop-dfs.sh
bin/hdfs namenode -format
sbin/start-dfs.sh

At this point, the NameNode web UI is accessible at xxx.xxx.xxx.xxx:9870 (assuming port 9870 is open in the server’s firewall or security-group rules).

Create directories in HDFS and run the sample job:

bin/hdfs dfs -mkdir -p /user/wjrtest/input
bin/hdfs dfs -ls /user/wjrtest/input
bin/hdfs dfs -put etc/hadoop/*.xml /user/wjrtest/input

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar grep /user/wjrtest/input /user/wjrtest/output 'dfs[a-z.]+'
bin/hdfs dfs -get /user/wjrtest/output output
# or
bin/hdfs dfs -cat /user/wjrtest/output/*

YARN Testing

The daemons here run as root, which Hadoop’s start scripts refuse unless the corresponding *_USER variables are set. Edit etc/hadoop/hadoop-env.sh and add:

export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
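
Starting YARN alone is not enough to make MapReduce jobs actually run on it; per the official single-node guide, etc/hadoop/mapred-site.xml and etc/hadoop/yarn-site.xml also need the following (reproduced from the official pseudo-distributed instructions):

```xml
<!-- etc/hadoop/mapred-site.xml -->
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

<!-- etc/hadoop/yarn-site.xml -->
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
```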

Start YARN:

sbin/start-yarn.sh

Access the ResourceManager web UI at xxx.xxx.xxx.xxx:8088.

To re-run the MapReduce test job, specify a different output directory or remove the previous one:

bin/hdfs dfs -rm -r /user/wjrtest/output

The job log confirms that YARN is being used: the client connects to the ResourceManager on port 8032 and the job is submitted as a YARN application:

[root@*******-tencentos /data/download/hadoop-3.4.0]# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar grep /user/wjrtest/input /user/wjrtest/output 'dfs[a-z.]+'
2024-07-05 17:40:33,883 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2024-07-05 17:40:34,145 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1720171933597_0003
2024-07-05 17:40:34,418 INFO input.FileInputFormat: Total input files to process : 10
2024-07-05 17:40:34,856 INFO mapreduce.JobSubmitter: number of splits:10
2024-07-05 17:40:35,343 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1720171933597_0003
2024-07-05 17:40:35,344 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-07-05 17:40:35,465 INFO conf.Configuration: resource-types.xml not found
2024-07-05 17:40:35,465 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-07-05 17:40:35,519 INFO impl.YarnClientImpl: Submitted application application_1720171933597_0003
2024-07-05 17:40:35,545 INFO mapreduce.Job: The url to track the job: http://*******-tencentos:8088/proxy/application_1720171933597_0003/
2024-07-05 17:40:35,545 INFO mapreduce.Job: Running job: job_1720171933597_0003
2024-07-05 17:40:40,619 INFO mapreduce.Job: Job job_1720171933597_0003 running in uber mode : false
...

Stop YARN:

sbin/stop-yarn.sh