Setting Up a Single-Node Hadoop Cluster on a Linux Server
Notes and pitfalls encountered while setting up a single-node Hadoop cluster on a cloud server.
Official Hadoop guide: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
Prerequisites
Server specs provisioned for the team:
8 cores; 16 GB RAM; System disk: 100 GB; Data disk: 500 GB
TencentOS Server
For simplicity, no containers are used — the service is deployed and run natively on the host machine (production deployments will likely migrate to Docker later).
Hadoop Installation
Install Java and SSH
Pre-installed; steps omitted.
Official docs recommend installing pdsh:
yum install pdsh
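A commonly reported pitfall: pdsh defaults to rsh as its remote-command backend, which can make sbin/start-dfs.sh fail with connection errors. A frequently cited workaround (add to ~/.bashrc or etc/hadoop/hadoop-env.sh) is to force the ssh backend:

```shell
# Force pdsh to use ssh instead of its default rsh backend
export PDSH_RCMD_TYPE=ssh
```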
Download the Distribution
Latest stable release at the time of writing (3.4.0): https://dlcdn.apache.org/hadoop/common/current/hadoop-3.4.0.tar.gz
Download locally, then upload to the server via scp:
scp -P 36000 hadoop-3.4.0.tar.gz <user>@xxx.xxx.xxx.xxx:/data/download
Extract on the server:
tar -zxvf hadoop-3.4.0.tar.gz
Enter the Hadoop directory. Since the Hadoop binaries are not added to the system PATH, all subsequent commands are run from this directory using relative paths.
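Alternatively, if you prefer running hadoop/hdfs from anywhere, the bin and sbin directories can be put on PATH. A sketch, assuming the tarball was extracted under /data/download (adjust to the actual path):

```shell
# Hypothetical extraction path -- adjust to your setup
export HADOOP_HOME=/data/download/hadoop-3.4.0
export PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
```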
Configure Java
Find the Java installation path:
java -XshowSettings:properties -version 2>&1 | grep 'java.home'
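When the java binary is a symlink (common with package-managed installs), resolving it directly is another way to find the installation root:

```shell
# Resolve the real path of the java binary, then strip the trailing /bin/java
readlink -f "$(which java)" | sed 's|/bin/java$||'
```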
Edit etc/hadoop/hadoop-env.sh:
# set to the root of your Java installation
export JAVA_HOME=/usr/java/latest
Run:
bin/hadoop
The usage documentation should be printed at this point.
HDFS Testing
A Hadoop cluster can be configured in the following modes:
- Local (Standalone) Mode — single node, single process
- Pseudo-Distributed Mode — single node, multiple processes
- Fully-Distributed Mode — multiple nodes, multiple processes
Since only one machine is available, the first two modes are tested here.
Standalone Operation
mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar grep input output 'dfs[a-z.]+'
cat output/*
This example is a grep application: a MapReduce job that extracts strings matching the given regular expression and counts their occurrences.
Pseudo-Distributed Operation
This is where things get tricky: the cloud server’s SSH daemon listens on port 36000 rather than the default 22, so Hadoop’s SSH-based start/stop scripts must be told to use that port.
Edit etc/hadoop/hadoop-env.sh and add:
export HADOOP_SSH_OPTS="-p 36000"
Edit core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
(Note: hadoop.ssh.port is sometimes suggested online for non-standard SSH ports, but it is not a standard Hadoop property; the HADOOP_SSH_OPTS setting above is what the start/stop scripts actually use.)
Official description of fs.defaultFS:
The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.
This property sets the type and address of the default filesystem. In the configuration above, the scheme is hdfs and the authority is localhost:9000. localhost typically resolves via /etc/hosts (not DNS) to 127.0.0.1, the IPv4 loopback address. In a production environment, this should be set to the advertised address and port of the HDFS NameNode.
You can check the NameNode address with:
hdfs getconf -namenodes
Edit hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>localhost:50090</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///data/hdfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///data/hdfs/data</value>
</property>
</configuration>
The replication factor is set to 1, since there is only one DataNode. dfs.namenode.secondary.http-address binds the Secondary NameNode to the local machine on port 50090 (the Hadoop 2 default; Hadoop 3's default is 9868). Storage directories for the NameNode and DataNode are also configured; without them, subsequent steps will fail. Use the following command to check DataNode status:
hdfs dfsadmin -report
Configure passwordless SSH login to localhost:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
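Before moving on, it is worth confirming that key-based login actually works on the non-standard port. With BatchMode enabled, ssh fails instead of prompting for a password, which makes this a clean check:

```shell
# Should print "ok" without any password prompt
ssh -p 36000 -o BatchMode=yes localhost 'echo ok'
```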
Stop / Format / Start HDFS:
sbin/stop-dfs.sh
bin/hdfs namenode -format
sbin/start-dfs.sh
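If startup succeeded, the three HDFS daemons should be visible as running JVM processes. jps ships with the JDK; ps works as a fallback:

```shell
# NameNode, DataNode, and SecondaryNameNode should all appear
jps
# Fallback if JDK tools are not on PATH:
ps -ef | grep -E 'NameNode|DataNode' | grep -v grep
```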
At this point, NameNode information is accessible at xxx.xxx.xxx.xxx:9870.
Create directories in HDFS and run the sample job:
bin/hdfs dfs -mkdir -p /user/wjrtest/input
bin/hdfs dfs -ls /user/wjrtest/input
bin/hdfs dfs -put etc/hadoop/*.xml /user/wjrtest/input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar grep /user/wjrtest/input /user/wjrtest/output 'dfs[a-z.]+'
bin/hdfs dfs -get /user/wjrtest/output output
# or
bin/hdfs dfs -cat /user/wjrtest/output/*
YARN Testing
Edit etc/hadoop/hadoop-env.sh and add:
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
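One step these notes skip (the job log below shows the job submitting to the ResourceManager, so it must have been done): MapReduce has to be pointed at YARN, or the example job will use the local runner instead. The values below mirror the official SingleCluster guide linked at the top.

etc/hadoop/mapred-site.xml:

```xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
```

etc/hadoop/yarn-site.xml:

```xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
```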
Start YARN:
sbin/start-yarn.sh
Access the ResourceManager web UI at xxx.xxx.xxx.xxx:8088.
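Besides the web UI, the ResourceManager exposes a REST API, which is handy for a quick health probe from the server itself:

```shell
# Ask the ResourceManager for basic cluster info (returns JSON)
curl -s http://localhost:8088/ws/v1/cluster/info
```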
To re-run the MapReduce test job, specify a different output directory or remove the previous one:
bin/hdfs dfs -rm -r /user/wjrtest/output
The output confirms that YARN is being used:
[root@*******-tencentos /data/download/hadoop-3.4.0]# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar grep /user/wjrtest/input /user/wjrtest/output 'dfs[a-z.]+'
2024-07-05 17:40:33,883 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2024-07-05 17:40:34,145 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1720171933597_0003
2024-07-05 17:40:34,418 INFO input.FileInputFormat: Total input files to process : 10
2024-07-05 17:40:34,856 INFO mapreduce.JobSubmitter: number of splits:10
2024-07-05 17:40:35,343 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1720171933597_0003
2024-07-05 17:40:35,344 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-07-05 17:40:35,465 INFO conf.Configuration: resource-types.xml not found
2024-07-05 17:40:35,465 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-07-05 17:40:35,519 INFO impl.YarnClientImpl: Submitted application application_1720171933597_0003
2024-07-05 17:40:35,545 INFO mapreduce.Job: The url to track the job: http://*******-tencentos:8088/proxy/application_1720171933597_0003/
2024-07-05 17:40:35,545 INFO mapreduce.Job: Running job: job_1720171933597_0003
2024-07-05 17:40:40,619 INFO mapreduce.Job: Job job_1720171933597_0003 running in uber mode : false
...
Stop YARN:
sbin/stop-yarn.sh