Categories
SysOps

How to create Hadoop cluster

Create a basic Hadoop cluster to play with.

Initial information

The goal is to create a basic Hadoop cluster for testing purposes.

I will set up the following servers using Hadoop 3.2.2. The YARN resource manager gets a dedicated server here, although it could also be configured on the name node.

Role                 Server name
namenode             namenode.example.org
secondary namenode   secondarynamenode.example.org
resource manager     resourcemanager.example.org
datanode             datanode1.example.org
datanode             datanode2.example.org
datanode             datanode3.example.org

Common setup

Update package index.

$ sudo apt update 

Install Java.

$ sudo apt install openjdk-11-jre-headless

Create a dedicated hadoop user.

$ sudo adduser --system --home /opt/hadoop --shell /bin/bash --uid 800  --group --disabled-login hadoop

Create a directory for the software archive.

$ sudo -u hadoop mkdir /opt/hadoop/software

Download the Hadoop archive. Note that Apache moves older releases off the main download mirrors to archive.apache.org, so adjust the URL if this version is no longer available there.

$ sudo -u hadoop wget --quiet --directory-prefix /opt/hadoop/software https://downloads.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
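It is good practice to verify the download. A minimal sketch, assuming the usual Apache mirror layout where a .sha512 companion file sits next to the archive:

```shell
# Fetch the published checksum (file name follows the usual Apache layout;
# adjust if the mirror names it differently)
sudo -u hadoop wget --quiet --directory-prefix /opt/hadoop/software \
  https://downloads.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz.sha512
# Print both digests and compare them by eye
sha512sum /opt/hadoop/software/hadoop-3.2.2.tar.gz
cat /opt/hadoop/software/hadoop-3.2.2.tar.gz.sha512
```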

Extract the downloaded archive.

$ sudo -u hadoop tar --directory /opt/hadoop/ --extract --file /opt/hadoop/software/hadoop-3.2.2.tar.gz

Export JAVA_HOME and extend PATH to use Hadoop utilities.

$ cat <<EOF | sudo tee /etc/profile.d/javahome.sh
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
EOF
$ cat <<EOF | sudo tee /etc/profile.d/hadoop_path.sh
export PATH=/opt/hadoop/hadoop-3.2.2/bin:/opt/hadoop/hadoop-3.2.2/sbin:\$PATH
EOF
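The profile scripts only take effect in new login shells; to pick up these variables immediately, source them in the current shell and confirm the binaries resolve:

```shell
# Load the new profile scripts into the current shell
source /etc/profile.d/javahome.sh
source /etc/profile.d/hadoop_path.sh
# The hadoop binary should now be on the PATH
hadoop version
```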

Create a systemd service template.

$ cat <<EOF | sudo tee /opt/hadoop/software/systemd_service.template
[Unit]
Description=\${SERVICE_DESCRIPTION}
After=network-online.target 
Requires=network-online.target

[Service]
Type=forking

User=hadoop
Group=hadoop

ExecStart=/opt/hadoop/hadoop-3.2.2/\${SERVICE_START_CMD}
ExecStop=/opt/hadoop/hadoop-3.2.2/\${SERVICE_STOP_CMD}

WorkingDirectory=/opt/hadoop/hadoop-3.2.2
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Environment=HADOOP_HOME=/opt/hadoop/hadoop-3.2.2
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

Create systemd services for this setup.

$ SERVICE_DESCRIPTION="Hadoop HDFS datanode service" SERVICE_START_CMD="bin/hdfs --daemon start datanode" SERVICE_STOP_CMD="bin/hdfs --daemon stop datanode" \
	 envsubst < /opt/hadoop/software/systemd_service.template | sudo tee /etc/systemd/system/hadoop-datanode.service
$ SERVICE_DESCRIPTION="Hadoop HDFS namenode service" SERVICE_START_CMD="bin/hdfs --daemon start namenode" SERVICE_STOP_CMD="bin/hdfs --daemon stop namenode" \
	 envsubst < /opt/hadoop/software/systemd_service.template | sudo tee /etc/systemd/system/hadoop-namenode.service
$ SERVICE_DESCRIPTION="Hadoop HDFS secondary namenode service" SERVICE_START_CMD="bin/hdfs --daemon start secondarynamenode" SERVICE_STOP_CMD="bin/hdfs --daemon stop secondarynamenode" \
	 envsubst < /opt/hadoop/software/systemd_service.template | sudo tee /etc/systemd/system/hadoop-secondarynamenode.service
$ SERVICE_DESCRIPTION="Hadoop YARN resourcemanager service" SERVICE_START_CMD="bin/yarn --daemon start resourcemanager" SERVICE_STOP_CMD="bin/yarn --daemon stop resourcemanager" \
	 envsubst < /opt/hadoop/software/systemd_service.template | sudo tee /etc/systemd/system/hadoop-yarn-resourcemanager.service
$ SERVICE_DESCRIPTION="Hadoop YARN nodemanager service" SERVICE_START_CMD="bin/yarn --daemon start nodemanager" SERVICE_STOP_CMD="bin/yarn --daemon stop nodemanager" \
	 envsubst < /opt/hadoop/software/systemd_service.template | sudo tee /etc/systemd/system/hadoop-yarn-nodemanager.service

Reload systemd configuration.

$ sudo systemctl daemon-reload
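The generated units can be listed to confirm that systemd picked them up:

```shell
# List all Hadoop-related unit files known to systemd
systemctl list-unit-files 'hadoop-*'
```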

Hadoop uses DNS names for communication, so provide consistent name resolution on every server using either a DNS server or the hosts file. Update the IP addresses to reflect your setup.

$ cat <<EOF | sudo tee /etc/hosts
127.0.0.1    localhost

172.16.0.110 namenode.example.org
172.16.0.111 secondarynamenode.example.org

172.16.0.120 resourcemanager.example.org

172.16.0.131 datanode1.example.org
172.16.0.132 datanode2.example.org
172.16.0.133 datanode3.example.org

172.16.0.150 client.example.org
EOF
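Name resolution can be checked on each server; getent consults the hosts file and DNS the same way most system tools do:

```shell
# Each name should resolve to the address configured above
for host in namenode secondarynamenode resourcemanager \
            datanode1 datanode2 datanode3; do
    getent hosts "${host}.example.org"
done
```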

Ensure that each server has a proper and persistent hostname set.

$ sudo hostnamectl --static set-hostname namenode.example.org

This is just an example; set the hostname on each server according to your configuration.

Namenode

Create additional directories.

$ sudo -u hadoop mkdir -p /opt/hadoop/local_data/{namenode,tmp}

Create core configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://namenode.example.org:9000</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/opt/hadoop/local_data/tmp</value>
	</property>
	<property>
		<name>hadoop.http.staticuser.user</name>
		<value>hadoop</value>
	</property>
</configuration>
EOF

Create hdfs configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>/opt/hadoop/local_data/namenode</value>
	</property>
	<property>
		<name>dfs.namenode.secondary.http-address</name>
		<value>secondarynamenode.example.org:9868</value>
	</property>
	<property>
		<name>dfs.replication</name>
		<value>1</value>
	</property>
</configuration>
EOF

Format the namenode.

$ sudo -i -u hadoop hdfs namenode -format

Start and enable service.

$ sudo systemctl enable --now hadoop-namenode.service
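A quick way to confirm the namenode came up is to query it directly on the namenode itself:

```shell
# Service state according to systemd
sudo systemctl --no-pager status hadoop-namenode.service
# Ask the namenode for a filesystem report; capacity stays at zero
# until the datanodes join later in this guide
sudo -i -u hadoop hdfs dfsadmin -report
```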

Secondary namenode

Create additional directories.

$ sudo -u hadoop mkdir -p /opt/hadoop/local_data/{secondarynamenode,tmp}

Create core configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://namenode.example.org:9000</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/opt/hadoop/local_data/tmp</value>
	</property>
	<property>
		<name>hadoop.http.staticuser.user</name>
		<value>hadoop</value>
	</property>
</configuration>
EOF

Create hdfs configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>dfs.namenode.checkpoint.dir</name>
		<value>/opt/hadoop/local_data/secondarynamenode</value>
	</property>
	<property>
		<name>dfs.replication</name>
		<value>1</value>
	</property>
</configuration>
EOF

Start and enable service.

$ sudo systemctl enable --now hadoop-secondarynamenode.service 

Datanode

Create additional directories.

$ sudo -u hadoop mkdir -p /opt/hadoop/local_data/{datanode,tmp}

Create core configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://namenode.example.org:9000</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/opt/hadoop/local_data/tmp</value>
	</property>
	<property>
		<name>hadoop.http.staticuser.user</name>
		<value>hadoop</value>
	</property>
</configuration>
EOF

Create hdfs configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>/opt/hadoop/local_data/datanode</value>
	</property>
	<property>
		<name>dfs.replication</name>
		<value>1</value>
	</property>
</configuration>
EOF

Start and enable service.

$ sudo systemctl enable --now hadoop-datanode.service
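After a datanode starts it registers with the namenode; its service state and recent log entries can be inspected locally (the log file name follows the hadoop-&lt;user&gt;-&lt;role&gt;-&lt;host&gt;.log pattern):

```shell
# Service state according to systemd
sudo systemctl --no-pager status hadoop-datanode.service
# Daemon logs land under the Hadoop installation, one file per role and host
sudo tail /opt/hadoop/hadoop-3.2.2/logs/hadoop-hadoop-datanode-*.log
```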

Create mapred configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Site specific MapReduce configuration properties -->

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
      <name>yarn.app.mapreduce.am.env</name>
      <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
    </property>
    <property>
      <name>mapreduce.map.env</name>
      <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
    </property>
    <property>
      <name>mapreduce.reduce.env</name>
      <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
    </property>
</configuration>
EOF

Create yarn configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>resourcemanager.example.org</value>
        </property>

        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
                <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
</configuration>
EOF

Start and enable service.

$ sudo systemctl enable --now hadoop-yarn-nodemanager.service 

Resource manager

Create yarn configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->

    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
</configuration>
EOF

Create mapred configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific MapReduce configuration properties -->

    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
    </property>
</configuration>
EOF

Start and enable service.

$ sudo systemctl enable --now hadoop-yarn-resourcemanager.service
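Once the resource manager is up and the node managers have registered, the cluster nodes can be listed from the resource manager host:

```shell
# Ask YARN for the registered node managers and their state
sudo -i -u hadoop yarn node -list
```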

Client

Create additional directories.

$ sudo -u hadoop mkdir -p /opt/hadoop/local_data/tmp

Create core configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://namenode.example.org:9000</value>
	</property>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/opt/hadoop/local_data/tmp</value>
	</property>
	<property>
		<name>hadoop.http.staticuser.user</name>
		<value>hadoop</value>
	</property>
</configuration>
EOF

Create mapred configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/mapred-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific MapReduce configuration properties -->

    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=/opt/hadoop/hadoop-3.2.2</value>
    </property>
</configuration>
EOF

Create yarn configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/yarn-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->

        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>resourcemanager.example.org</value>
        </property>
</configuration>
EOF

Create hdfs configuration.

$ cat <<EOF | sudo tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
</configuration>
EOF
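With the client configured, a short smoke test exercises HDFS and YARN end to end; the examples jar ships with the distribution:

```shell
# Create a home directory for the hadoop user and round-trip a file
sudo -i -u hadoop hdfs dfs -mkdir -p /user/hadoop
echo "hello hadoop" | sudo -i -u hadoop hdfs dfs -put - /user/hadoop/hello.txt
sudo -i -u hadoop hdfs dfs -cat /user/hadoop/hello.txt
# Submit a bundled example job to verify YARN scheduling
sudo -i -u hadoop yarn jar \
  /opt/hadoop/hadoop-3.2.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.2.jar pi 2 10
```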

Web interface

Use http://namenode.example.org:9870/ to access the namenode web interface.

Use http://datanode1.example.org:9864/ to access a specific datanode.

Use http://resourcemanager.example.org:8088/ to access the YARN cluster overview.

Use http://datanode2.example.org:8042/ to access a specific YARN node manager.
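The same information is exposed over HTTP APIs, which is handy for scripted checks; these endpoint paths are standard for Hadoop 3, but confirm them against your version:

```shell
# Namenode status via the JMX endpoint
curl --silent "http://namenode.example.org:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"
# Cluster summary from the YARN resource manager REST API
curl --silent "http://resourcemanager.example.org:8088/ws/v1/cluster/info"
```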

Additional notes

Play with it! This is a simple setup for learning and for discovering new possibilities using Vagrant, Proxmox, or LXD.

GitHub repository: https://github.com/milosz/vagrant-hadoop-basic