How to update Hadoop heartbeat interval

Adjust the Hadoop heartbeat settings so that the name node marks an unresponsive data node as dead after a period that suits your requirements.

The name node considers a data node dead once the following time (in milliseconds) has passed without a heartbeat.

2 * dfs.namenode.heartbeat.recheck-interval + 10 * (1000 * dfs.heartbeat.interval)

By default, dfs.namenode.heartbeat.recheck-interval is 300000 milliseconds (5 minutes).

By default, dfs.heartbeat.interval is 3 seconds, so with the default values the timeout is:

2 * 300000 + 10 * 3000 = 630000 milliseconds = 10 minutes 30 seconds
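The arithmetic above can be reproduced with a small shell helper. This is only a sketch: the function name dead_node_timeout_ms is made up for illustration, and it assumes the recheck interval is passed in milliseconds and the heartbeat interval in whole seconds.

```shell
# Time (in milliseconds) after which the name node declares a data node dead.
# $1: dfs.namenode.heartbeat.recheck-interval, in milliseconds
# $2: dfs.heartbeat.interval, in seconds
# (dead_node_timeout_ms is a hypothetical helper, not part of Hadoop.)
dead_node_timeout_ms() {
    echo $(( 2 * $1 + 10 * 1000 * $2 ))
}

# Default values: prints 630000 (10 minutes 30 seconds)
dead_node_timeout_ms 300000 3
```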

Note that this default is deliberate (see the additional notes).

In my specific case, on a small cluster, this was too long. I wanted a data node to be marked as dead after it had not responded for 5 minutes.

To achieve this, I needed to change dfs.namenode.heartbeat.recheck-interval to 135000 milliseconds.

2 * 135000 + 10 * 3000 = 300000 milliseconds = 5 minutes
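The value 135000 comes from solving the formula for the recheck interval. A minimal sketch of that step, assuming the desired timeout is given in milliseconds and the heartbeat interval in whole seconds (recheck_interval_ms is a hypothetical helper name):

```shell
# Recheck interval (in milliseconds) needed to reach a desired dead-node timeout.
# $1: desired timeout, in milliseconds
# $2: dfs.heartbeat.interval, in seconds
# (recheck_interval_ms is a hypothetical helper, not part of Hadoop.)
recheck_interval_ms() {
    # Shell arithmetic is integer-only, so an odd remainder is rounded down.
    echo $(( ($1 - 10 * 1000 * $2) / 2 ))
}

# 5 minutes with a 3-second heartbeat: prints 135000
recheck_interval_ms 300000 3
```

The desired timeout has to exceed ten heartbeat intervals (30000 milliseconds here); otherwise the result is zero or negative and not a usable setting.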

Update the hdfs-site.xml configuration file on the name node.

$ cat /opt/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>dfs.name.dir</name>
                <value>/opt/hadoop/local_data/namenode</value>
        </property>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>https://secondarynamenode.example.org:9870</value>
        </property>

        <property>
                <name>dfs.replication</name>
                <value>2</value>
        </property>

        <property>
                <name>dfs.heartbeat.interval</name>
                <value>3s</value>
        </property>

        <property>
                <name>dfs.namenode.heartbeat.recheck-interval</name>
                <value>135000</value>
        </property>
</configuration>
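Before restarting, it may be worth checking that the edited file resolves to the expected values. The sketch below wraps the hdfs getconf -confKey command in a hypothetical function, show_heartbeat_settings. Note that hdfs getconf reads the local configuration files rather than querying the running name node, so it confirms what the file says, not what the service currently uses.

```shell
# Print the heartbeat-related settings as resolved from the local
# Hadoop configuration files (not from the running name node).
# (show_heartbeat_settings is a hypothetical helper name.)
show_heartbeat_settings() {
    for key in dfs.heartbeat.interval dfs.namenode.heartbeat.recheck-interval; do
        printf '%s=%s\n' "$key" "$(hdfs getconf -confKey "$key")"
    done
}
```

Run show_heartbeat_settings on the name node as a user that has the hdfs command on its PATH.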

Restart the name node service.

$ sudo systemctl restart hadoop-namenode.service

Disable a data node and wait 5 minutes to verify that the change works as expected.
Do not perform this step in a production environment.

$ date
Mon May 31 21:46:07 UTC 2021
$ hdfs dfsadmin -report -dead
Configured Capacity: 42007166976 (39.12 GB)
Present Capacity: 34880764867 (32.49 GB)
DFS Remaining: 33444012032 (31.15 GB)
DFS Used: 1436752835 (1.34 GB)
DFS Used%: 4.12%
Replicated Blocks:
        Under replicated blocks: 10837
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 10837
        Pending deletion blocks: 0
Erasure Coded Block Groups: 
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

-------------------------------------------------
Dead datanodes (1):

Name: 192.168.8.175:9866 (datanode3.example.org)
Hostname: datanode3.example.org
Decommission Status : Normal
Configured Capacity: 21003583488 (19.56 GB)
DFS Used: 1430159360 (1.33 GB)
Non DFS Used: 2467323904 (2.30 GB)
DFS Remaining: 16015581184 (14.92 GB)
DFS Used%: 6.81%
DFS Remaining%: 76.25%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Mon May 31 21:39:57 UTC 2021
Last Block Report: Mon May 31 21:38:57 UTC 2021
Num of Blocks: 0

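The data node was marked as dead shortly after the 5-minute timeout elapsed (last contact at 21:39:57, removal logged at 21:45:41). The name node logs a lost heartbeat message when this happens.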

$ tail /opt/hadoop/hadoop-3.2.2/logs/hadoop-hadoop-namenode-namenode.log 
2021-05-31 21:45:41,888 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* removeDeadDatanode: lost heartbeat from 192.168.8.175:9866, removeBlocksFromBlockMap true
2021-05-31 21:45:41,963 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.8.175:9866
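To see how long detection actually took, the "Last contact" time from the report can be compared with the timestamp of the lost heartbeat log line. This is a rough sketch that relies on the -d option of GNU date and assumes both timestamps are in UTC; elapsed_seconds is a hypothetical helper name.

```shell
# Seconds elapsed between two timestamps (requires GNU date for -d).
# $1: earlier timestamp, $2: later timestamp
# (elapsed_seconds is a hypothetical helper name.)
elapsed_seconds() {
    echo $(( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ))
}

# Last contact vs. lost heartbeat log entry: prints 344 (5 minutes 44 seconds)
elapsed_seconds "2021-05-31 21:39:57 UTC" "2021-05-31 21:45:41 UTC"
```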

Additional notes

Decrease the datanode failure detection time