Setting Up a Single-Node Hadoop Cluster on a Linux Server


Notes and pitfalls encountered while setting up a single-node Hadoop cluster on a cloud server.

Official Hadoop guide: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

Prerequisites

Server specs provisioned for the team:

8 cores; 16 GB RAM; System disk: 100 GB; Data disk: 500 GB
TencentOS Server

For simplicity, no containers are used — the service is deployed and run natively on the host machine (production deployments will likely migrate to Docker later).

Hadoop Installation

Install Java and SSH

Pre-installed; steps omitted.
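A quick sanity check (standard POSIX tooling) that the prerequisites are actually present before continuing:

```shell
# Verify that a JDK and an SSH client are on the PATH.
command -v java >/dev/null && echo "java: ok" || echo "java: MISSING"
command -v ssh  >/dev/null && echo "ssh: ok"  || echo "ssh: MISSING"
```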

Official docs recommend installing pdsh:

yum install pdsh
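One pitfall worth flagging here: pdsh defaults to the rsh protocol for remote commands, and with that default the HDFS start scripts later fail with "Connection refused" errors. Forcing pdsh onto ssh avoids this:

```shell
# pdsh uses rsh by default; rsh is usually not installed, so the
# Hadoop start scripts fail. Force pdsh to go over ssh instead.
# Add this to ~/.bashrc (or etc/hadoop/hadoop-env.sh) to make it stick.
export PDSH_RCMD_TYPE=ssh
```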

Download the Distribution

Latest stable release (3.4.0 at the time of writing): https://dlcdn.apache.org/hadoop/common/current/hadoop-3.4.0.tar.gz

Download locally, then upload to the server via scp:

scp -P 36000 hadoop-3.4.0.tar.gz <user>@xxx.xxx.xxx.xxx:/data/download

Extract on the server:

tar -zxvf hadoop-3.4.0.tar.gz

Enter the Hadoop directory. Since Hadoop's executables are not added to the system PATH, all subsequent commands are run from this directory using relative paths (bin/..., sbin/...).
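Alternatively, the repeated bin/ and sbin/ prefixes can be avoided by putting Hadoop on the PATH. The directory below is this setup's extraction path; adjust as needed:

```shell
# Optional: expose the hadoop/hdfs/yarn commands globally instead of
# running everything from the distribution directory.
export HADOOP_HOME=/data/download/hadoop-3.4.0
export PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
```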

Configure Java

Find the Java installation path:

java -XshowSettings:properties -version 2>&1 | grep 'java.home'
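The java.home path printed by that command can be copied straight into JAVA_HOME. A sketch that derives it automatically from whichever java is on the PATH (assumes GNU readlink, which TencentOS provides):

```shell
# Resolve the real java binary through any symlinks, then strip the
# trailing /bin/java to get the JDK root directory.
JAVA_HOME="$(dirname "$(dirname "$(readlink -f "$(command -v java)")")")"
echo "$JAVA_HOME"
```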

Edit etc/hadoop/hadoop-env.sh:

 # set to the root of your Java installation
 export JAVA_HOME=/usr/java/latest

Run:

bin/hadoop

The command's usage documentation should be printed at this point, confirming that Hadoop can find Java and run.

HDFS Testing

A Hadoop cluster can be configured in the following modes:

  • Local (Standalone) Mode — single node, single process
  • Pseudo-Distributed Mode — single node, multiple processes
  • Fully-Distributed Mode — multiple nodes, multiple processes

Since only one machine is available, the first two modes are tested here.

Standalone Operation

mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar grep input output 'dfs[a-z.]+'
cat output/*

This example runs the bundled grep application: a MapReduce job that matches the regular expression 'dfs[a-z.]+' against the input files and writes the count of each match to the output directory.

Pseudo-Distributed Operation

This is where things get tricky. The cloud server’s sshd listens on port 36000 rather than the default 22, so Hadoop’s SSH-based start scripts have to be told about the non-standard port.

Edit etc/hadoop/hadoop-env.sh and add:

export HADOOP_SSH_OPTS="-p 36000"

Edit core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

(A hadoop.ssh.port property is sometimes suggested online, but it is not a property the stock Hadoop scripts read; the HADOOP_SSH_OPTS export above is what actually carries the port.)

Official description of fs.defaultFS:

The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.

This property sets the type and address of the default filesystem. In the configuration above, the scheme is hdfs and the authority is localhost:9000. On a default install, localhost resolves via /etc/hosts (not DNS) to 127.0.0.1, the IPv4 loopback address. In a production environment, this should be set to the address and exposed port of the HDFS NameNode.

You can check the NameNode address with:

bin/hdfs getconf -namenodes

Edit hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
      <name>dfs.namenode.secondary.http-address</name>
      <value>localhost:50090</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data/hdfs/data</value>
    </property>
</configuration>

The replication factor is set to 1, since there is only one DataNode. dfs.namenode.secondary.http-address binds the Secondary NameNode to the local machine on port 50090 (the Hadoop 2.x default; Hadoop 3 defaults to 9868). Storage directories for the NameNode and DataNode are placed on the data disk; without explicit storage directories, Hadoop falls back to a location under /tmp and subsequent steps will fail. Use the following command to check DataNode status:

bin/hdfs dfsadmin -report

Configure passwordless SSH login to localhost:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
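Before formatting HDFS it is worth confirming that the key actually works non-interactively on the custom port. BatchMode makes ssh fail instead of falling back to a password prompt (both are standard OpenSSH options); sshd also silently ignores an authorized_keys file whose permissions are looser than 0600:

```shell
# Permissions looser than 0600 make sshd ignore authorized_keys.
stat -c '%a' ~/.ssh/authorized_keys   # expect: 600
# Should run `true` remotely with no password prompt; a non-zero
# exit means the key setup (or the sshd port) is wrong.
ssh -p 36000 -o BatchMode=yes localhost true && echo "passwordless SSH OK"
```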

Stop any running HDFS daemons, format the NameNode (note: formatting erases any existing HDFS metadata), then start HDFS:

sbin/stop-dfs.sh
bin/hdfs namenode -format
sbin/start-dfs.sh

At this point, the NameNode web UI is accessible at xxx.xxx.xxx.xxx:9870 (assuming port 9870 is open in the server’s firewall or security-group rules).

Create directories in HDFS and run the sample job:

bin/hdfs dfs -mkdir -p /user/wjrtest/input
bin/hdfs dfs -ls /user/wjrtest/input
bin/hdfs dfs -put etc/hadoop/*.xml /user/wjrtest/input

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar grep /user/wjrtest/input /user/wjrtest/output 'dfs[a-z.]+'
bin/hdfs dfs -get /user/wjrtest/output output
# or
bin/hdfs dfs -cat /user/wjrtest/output/*

YARN Testing

The daemons here run as root, which Hadoop’s start scripts refuse unless the corresponding *_USER variables are set. Edit etc/hadoop/hadoop-env.sh and add:

export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
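
Starting YARN alone is not enough to make MapReduce jobs actually run on it; per the official single-node guide, etc/hadoop/mapred-site.xml and etc/hadoop/yarn-site.xml also need the following (reproduced from the official pseudo-distributed instructions):

```xml
<!-- etc/hadoop/mapred-site.xml -->
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

<!-- etc/hadoop/yarn-site.xml -->
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
```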

Start YARN:

sbin/start-yarn.sh

Access the ResourceManager web UI at xxx.xxx.xxx.xxx:8088.

To re-run the MapReduce test job, specify a different output directory or remove the previous one:

bin/hdfs dfs -rm -r /user/wjrtest/output

The job log confirms that YARN is being used: the client connects to the ResourceManager on port 8032 and the job is submitted as a YARN application:

[root@*******-tencentos /data/download/hadoop-3.4.0]# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar grep /user/wjrtest/input /user/wjrtest/output 'dfs[a-z.]+'
2024-07-05 17:40:33,883 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at /0.0.0.0:8032
2024-07-05 17:40:34,145 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1720171933597_0003
2024-07-05 17:40:34,418 INFO input.FileInputFormat: Total input files to process : 10
2024-07-05 17:40:34,856 INFO mapreduce.JobSubmitter: number of splits:10
2024-07-05 17:40:35,343 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1720171933597_0003
2024-07-05 17:40:35,344 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-07-05 17:40:35,465 INFO conf.Configuration: resource-types.xml not found
2024-07-05 17:40:35,465 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-07-05 17:40:35,519 INFO impl.YarnClientImpl: Submitted application application_1720171933597_0003
2024-07-05 17:40:35,545 INFO mapreduce.Job: The url to track the job: http://*******-tencentos:8088/proxy/application_1720171933597_0003/
2024-07-05 17:40:35,545 INFO mapreduce.Job: Running job: job_1720171933597_0003
2024-07-05 17:40:40,619 INFO mapreduce.Job: Job job_1720171933597_0003 running in uber mode : false
...

Stop YARN:

sbin/stop-yarn.sh