Create a basic Hadoop cluster to play with.
Initial information
The goal is to create a basic Hadoop cluster for testing purposes.
I will set up the following servers using Hadoop 3.2.2. The YARN resource manager gets a dedicated server here, although it could also be configured on the name node.
Role | Server name |
---|---|
namenode | namenode.example.org |
secondary namenode | secondarynamenode.example.org |
resource manager | resourcemanager.example.org |
datanode | datanode1.example.org |
datanode | datanode2.example.org |
datanode | datanode3.example.org |
Common setup
Update package index.
$ sudo apt update
Install Java.
$ sudo apt install openjdk-11-jre-headless
Create a dedicated hadoop user.
$ sudo adduser --system --home /opt/hadoop --shell /bin/bash --uid 800 --group --disabled-login hadoop
Create a directory for the software archive.
$ sudo -u hadoop mkdir /opt/hadoop/software
Download Hadoop archive.
$ sudo -u hadoop wget --quiet --directory-prefix /opt/hadoop/software https://downloads.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
Extract downloaded archive.
$ sudo -u hadoop tar --directory /opt/hadoop/ --extract --file /opt/hadoop/software/hadoop-3.2.2.tar.gz
Export JAVA_HOME
and extend PATH
to use Hadoop utilities.
$ cat <<EOF | sudo tee /etc/profile.d/javahome.sh
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
EOF
$ cat <<EOF | sudo tee /etc/profile.d/hadoop_path.sh
export PATH=/opt/hadoop/hadoop-3.2.2/bin:/opt/hadoop/hadoop-3.2.2/sbin:\$PATH
EOF
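To confirm the new environment takes effect, open a fresh login shell (or source the profile scripts directly) and check that both Java and the Hadoop utilities resolve:

```shell
# load the new profile scripts into the current shell
source /etc/profile.d/javahome.sh
source /etc/profile.d/hadoop_path.sh

# both commands should print version information
java -version
hadoop version

# JAVA_HOME should point at the JDK directory
echo "$JAVA_HOME"
```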
Create a systemd service template.
$ cat <<EOF | sudo tee /opt/hadoop/software/systemd_service.template
[Unit]
Description=\${SERVICE_DESCRIPTION}
After=network-online.target
Requires=network-online.target

[Service]
Type=forking
User=hadoop
Group=hadoop
ExecStart=/opt/hadoop/hadoop-3.2.2/\${SERVICE_START_CMD}
ExecStop=/opt/hadoop/hadoop-3.2.2/\${SERVICE_STOP_CMD}
WorkingDirectory=/opt/hadoop/hadoop-3.2.2
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Environment=HADOOP_HOME=/opt/hadoop/hadoop-3.2.2
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
Create systemd services for this setup.
$ SERVICE_DESCRIPTION="Hadoop HDFS datanode service" \
  SERVICE_START_CMD="bin/hdfs --daemon start datanode" \
  SERVICE_STOP_CMD="bin/hdfs --daemon stop datanode" \
  envsubst < /opt/hadoop/software/systemd_service.template | sudo tee /etc/systemd/system/hadoop-datanode.service
$ SERVICE_DESCRIPTION="Hadoop HDFS namenode service" \
  SERVICE_START_CMD="bin/hdfs --daemon start namenode" \
  SERVICE_STOP_CMD="bin/hdfs --daemon stop namenode" \
  envsubst < /opt/hadoop/software/systemd_service.template | sudo tee /etc/systemd/system/hadoop-namenode.service
$ SERVICE_DESCRIPTION="Hadoop HDFS secondary namenode service" \
  SERVICE_START_CMD="bin/hdfs --daemon start secondarynamenode" \
  SERVICE_STOP_CMD="bin/hdfs --daemon stop secondarynamenode" \
  envsubst < /opt/hadoop/software/systemd_service.template | sudo tee /etc/systemd/system/hadoop-secondarynamenode.service
$ SERVICE_DESCRIPTION="Hadoop YARN resourcemanager service" \
  SERVICE_START_CMD="bin/yarn --daemon start resourcemanager" \
  SERVICE_STOP_CMD="bin/yarn --daemon stop resourcemanager" \
  envsubst < /opt/hadoop/software/systemd_service.template | sudo tee /etc/systemd/system/hadoop-yarn-resourcemanager.service
$ SERVICE_DESCRIPTION="Hadoop YARN nodemanager service" \
  SERVICE_START_CMD="bin/yarn --daemon start nodemanager" \
  SERVICE_STOP_CMD="bin/yarn --daemon stop nodemanager" \
  envsubst < /opt/hadoop/software/systemd_service.template | sudo tee /etc/systemd/system/hadoop-yarn-nodemanager.service
Reload systemd configuration.
$ sudo systemctl daemon-reload
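Before starting anything, it may help to confirm that the generated unit files are in place and recognized by systemd; at this point each unit should be listed as disabled:

```shell
# list the generated unit files
ls -1 /etc/systemd/system/hadoop-*.service

# each unit should show up as "disabled" until enabled on its target server
systemctl list-unit-files 'hadoop-*'
```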
Hadoop will use DNS names for communication, so provide a consistent name resolution service using either a DNS server or the hosts file. Update the IP addresses to reflect your setup.
$ cat <<EOF | sudo tee /etc/hosts
127.0.0.1    localhost
172.16.0.110 namenode.example.org
172.16.0.111 secondarynamenode.example.org
172.16.0.120 resourcemanager.example.org
172.16.0.131 datanode1.example.org
172.16.0.132 datanode2.example.org
172.16.0.133 datanode3.example.org
172.16.0.150 client.example.org
EOF
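Name resolution can be verified immediately; each hostname should resolve to the address defined above:

```shell
# resolve names through the standard NSS lookup (this reads /etc/hosts)
getent hosts namenode.example.org
getent hosts datanode1.example.org

# optionally confirm basic connectivity between cluster members
ping -c 1 namenode.example.org
```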
Ensure that each server has a proper and persistent hostname set.
$ sudo hostnamectl --static set-hostname namenode.example.org
This is just an example; define the hostname according to your configuration.
Namenode
Create additional directories.
$ sudo -u hadoop mkdir -p /opt/hadoop/local_data/{namenode,tmp}
Create core configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.org:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/local_data/tmp</value>
  </property>
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>hadoop</value>
  </property>
</configuration>
EOF
Create hdfs configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop/local_data/namenode</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>secondarynamenode.example.org:9868</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
Format the namenode.
$ sudo -i -u hadoop hdfs namenode -format
Start and enable service.
$ sudo systemctl enable --now hadoop-namenode.service
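A quick way to verify the namenode came up cleanly, assuming the setup above, is to check the running Java processes and probe the web UI port:

```shell
# the NameNode JVM should appear in the process list of the hadoop user
sudo -i -u hadoop jps

# the namenode web interface should answer on port 9870 once it is up
curl -s -o /dev/null -w '%{http_code}\n' http://namenode.example.org:9870/
```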
Secondary namenode
Create additional directories.
$ sudo -u hadoop mkdir -p /opt/hadoop/local_data/{secondarynamenode,tmp}
Create core configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.org:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/local_data/tmp</value>
  </property>
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>hadoop</value>
  </property>
</configuration>
EOF
Create hdfs configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/opt/hadoop/local_data/secondarynamenode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
Start and enable service.
$ sudo systemctl enable --now hadoop-secondarynamenode.service
Datanode
Create additional directories.
$ sudo -u hadoop mkdir -p /opt/hadoop/local_data/{datanode,tmp}
Create core configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.org:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/local_data/tmp</value>
  </property>
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>hadoop</value>
  </property>
</configuration>
EOF
Create hdfs configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop/local_data/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
Start and enable service.
$ sudo systemctl enable --now hadoop-datanode.service
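To verify that the datanode started and registered with the namenode, check the service state and ask HDFS for a cluster report:

```shell
# confirm the service is active and the DataNode JVM is running
sudo systemctl status hadoop-datanode.service
sudo -i -u hadoop jps

# the namenode should now report this datanode as a live node
sudo -i -u hadoop hdfs dfsadmin -report
```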
Create mapred configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
  </property>
</configuration>
EOF
Create yarn configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<configuration>
<!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.example.org</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
EOF
Start and enable service.
$ sudo systemctl enable --now hadoop-yarn-nodemanager.service
Resource manager
Create mapred configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
  </property>
</configuration>
EOF
Create yarn configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<configuration>
<!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
EOF
Start and enable service.
$ sudo systemctl enable --now hadoop-yarn-resourcemanager.service
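Once the nodemanagers on the datanodes have connected, the resource manager can list them:

```shell
# the ResourceManager JVM should appear in the process list
sudo -i -u hadoop jps

# list the nodemanagers that registered with this resource manager
sudo -i -u hadoop yarn node -list
```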
Client
Create additional directories.
$ sudo -u hadoop mkdir -p /opt/hadoop/local_data/tmp
Create core configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.org:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/local_data/tmp</value>
  </property>
  <property>
    <name>hadoop.http.staticuser.user</name>
    <value>hadoop</value>
  </property>
</configuration>
EOF
Create mapred configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
  </property>
</configuration>
EOF
Create yarn configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<configuration>
<!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.example.org</value>
  </property>
</configuration>
EOF
Create hdfs configuration.
$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
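With the client configured, run a quick end-to-end smoke test: write a file to HDFS, read it back, and submit the bundled pi-estimator example to YARN (the examples jar ships with the Hadoop archive under `share/hadoop/mapreduce/`):

```shell
# create a home directory for the hadoop user and store a test file
sudo -i -u hadoop hdfs dfs -mkdir -p /user/hadoop
echo "hello hadoop" | sudo -i -u hadoop hdfs dfs -put - /user/hadoop/hello.txt
sudo -i -u hadoop hdfs dfs -cat /user/hadoop/hello.txt

# submit the example pi-estimation job (2 maps, 10 samples each) to YARN
sudo -i -u hadoop yarn jar \
  /opt/hadoop/hadoop-3.2.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar \
  pi 2 10
```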
Web interface
Use http://namenode.example.org:9870/ to access the namenode.

Use http://datanode1.example.org:9864/ to access a specific datanode.

Use http://resourcemanager.example.org:8088/ to access the YARN cluster.

Use http://datanode2.example.org:8042/ to access a specific YARN node.
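The same endpoints can be checked from the command line; each UI should return HTTP 200 when its daemon is healthy:

```shell
# print the HTTP status code returned by each web interface
for url in \
  http://namenode.example.org:9870/ \
  http://datanode1.example.org:9864/ \
  http://resourcemanager.example.org:8088/ \
  http://datanode2.example.org:8042/; do
  printf '%s ' "$url"
  curl -s -o /dev/null -w '%{http_code}\n' "$url"
done
```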

Additional notes
Play with it! This is a simple setup to learn from and to discover new possibilities, easily reproduced using Vagrant, Proxmox, or LXD.
GitHub repository: https://github.com/milosz/vagrant-hadoop-basic