How to put Hadoop datanodes in maintenance state

Use the JSON-based hosts file format to whitelist Hadoop datanodes and to put each of them into the normal, decommissioned, or maintenance state.

The maintenance state is useful for short-term operations such as a system upgrade or a reboot. That is its advantage over decommissioning a datanode, which is better suited to long-term operations.

Display the HDFS report.

$ hdfs dfsadmin -report
Configured Capacity: 63010750464 (58.68 GB)
Present Capacity: 52226306048 (48.64 GB)
DFS Remaining: 43579908096 (40.59 GB)
DFS Used: 8646397952 (8.05 GB)
DFS Used%: 16.56%
Replicated Blocks:
        Under replicated blocks: 14
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups: 
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 192.168.8.173:9866 (datanode1.example.org)
Hostname: datanode1.example.org
Decommission Status : Normal
Configured Capacity: 21003583488 (19.56 GB)
DFS Used: 2378018816 (2.21 GB)
Non DFS Used: 2480680960 (2.31 GB)
DFS Remaining: 15054364672 (14.02 GB)
DFS Used%: 11.32%
DFS Remaining%: 71.68%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jul 17 21:51:37 UTC 2021
Last Block Report: Sat Jul 17 21:43:38 UTC 2021
Num of Blocks: 3470


Name: 192.168.8.174:9866 (datanode2.example.org)
Hostname: datanode2.example.org
Decommission Status : Normal
Configured Capacity: 21003583488 (19.56 GB)
DFS Used: 3345268736 (3.12 GB)
Non DFS Used: 2472132608 (2.30 GB)
DFS Remaining: 14095663104 (13.13 GB)
DFS Used%: 15.93%
DFS Remaining%: 67.11%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jul 17 21:51:38 UTC 2021
Last Block Report: Sat Jul 17 21:43:45 UTC 2021
Num of Blocks: 10736


Name: 192.168.8.175:9866 (datanode3.example.org)
Hostname: datanode3.example.org
Decommission Status : Normal
Configured Capacity: 21003583488 (19.56 GB)
DFS Used: 2923110400 (2.72 GB)
Non DFS Used: 2560073728 (2.38 GB)
DFS Remaining: 14429880320 (13.44 GB)
DFS Used%: 13.92%
DFS Remaining%: 68.70%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jul 17 21:51:39 UTC 2021
Last Block Report: Sat Jul 17 21:43:45 UTC 2021
Num of Blocks: 7598
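
The full report is long, and for this procedure only the admin state of each datanode matters. The following is a minimal sketch of a filter, assuming the report keeps the "Hostname:" and "Decommission Status :" layout shown above; the function name dn_status_summary is made up for illustration.

```shell
# Print "hostname<TAB>status" for each datanode section in a
# `hdfs dfsadmin -report` listing read from standard input.
# Assumes the "Hostname:" / "Decommission Status :" layout shown above.
dn_status_summary() {
  awk -F': ' '
    /^Hostname:/           { host = $2 }
    /^Decommission Status/ { printf "%s\t%s\n", host, $2 }
  '
}

# Usage (on a host with the HDFS client configured):
#   hdfs dfsadmin -report | dn_status_summary
```

A datanode that the report lists in more than one section is printed once per section, so piping the result through sort -u may be convenient.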

Create the simplest possible hosts.json file on the namenode; it will function as a whitelist.

$ cat <<EOF | sudo -u hadoop tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/hosts.json
[
  {
    "hostName": "datanode1.example.org"
  },
  {
    "hostName": "datanode2.example.org"
  },
  {
    "hostName": "datanode3.example.org"
  }
]
EOF
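
With more than a handful of datanodes, typing the whitelist by hand becomes error-prone. The sketch below renders the same structure from a list of hostnames. The helper name render_hosts_json is hypothetical, and it does no JSON escaping, so it assumes plain hostnames without quotes or backslashes.

```shell
# Render a whitelist in the JSON format shown above, one entry per
# hostname argument, on standard output.
# Hypothetical helper; hostnames are inserted verbatim (no escaping).
render_hosts_json() {
  printf '[\n'
  first=1
  for host in "$@"; do
    # Separate entries with a comma, but not before the first one.
    [ "$first" -eq 1 ] || printf ',\n'
    printf '  {\n    "hostName": "%s"\n  }' "$host"
    first=0
  done
  printf '\n]\n'
}

# Usage:
#   render_hosts_json datanode1.example.org datanode2.example.org \
#     | sudo -u hadoop tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/hosts.json
```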

Define the dfs.namenode.hosts.provider.classname and dfs.hosts options in hdfs-site.xml on the namenode.

$ sudo -u hadoop vim /opt/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml 
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
        <property>
                <name>dfs.name.dir</name>
                <value>/opt/hadoop/local_data/namenode</value>
        </property>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>https://secondarynamenode.example.org:9870</value>
        </property>

        <property>
                <name>dfs.replication</name>
                <value>2</value>
        </property>
        <property>
                <name>dfs.namenode.hosts.provider.classname</name>
                <value>org.apache.hadoop.hdfs.server.blockmanagement.CombinedHostFileManager</value>
        </property>
        <property>
                <name>dfs.hosts</name>
                <value>/opt/hadoop/hadoop-3.2.2/etc/hadoop/hosts.json</value>
        </property>
</configuration>
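
Before restarting anything, it can be worth confirming that the two new options resolve as intended. The sketch below wraps hdfs getconf -confKey, which reads the configuration files on the local host; it does not show what an already running namenode has loaded. The function name check_hdfs_conf is made up for illustration.

```shell
# Compare the value of an HDFS configuration key, as resolved from the
# local configuration files, with an expected value.
# Hypothetical helper built on `hdfs getconf -confKey`.
check_hdfs_conf() {
  key=$1
  expected=$2
  actual=$(hdfs getconf -confKey "$key") || return 1
  if [ "$actual" = "$expected" ]; then
    echo "OK   $key"
  else
    echo "DIFF $key: got '$actual', expected '$expected'"
    return 1
  fi
}

# Usage on the namenode:
#   check_hdfs_conf dfs.namenode.hosts.provider.classname \
#     org.apache.hadoop.hdfs.server.blockmanagement.CombinedHostFileManager
#   check_hdfs_conf dfs.hosts /opt/hadoop/hadoop-3.2.2/etc/hadoop/hosts.json
```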

Restart the namenode service.

$ sudo systemctl restart hadoop-namenode.service

Inspect the datanodes; all three should still be listed as live with the Normal status.

$ hdfs dfsadmin -report 
Configured Capacity: 63010750464 (58.68 GB)
Present Capacity: 52226273280 (48.64 GB)
DFS Remaining: 43579875328 (40.59 GB)
DFS Used: 8646397952 (8.05 GB)
DFS Used%: 16.56%
Replicated Blocks:
        Under replicated blocks: 14
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups: 
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 192.168.8.173:9866 (datanode1.example.org)
Hostname: datanode1.example.org
Decommission Status : Normal
Configured Capacity: 21003583488 (19.56 GB)
DFS Used: 2378018816 (2.21 GB)
Non DFS Used: 2480693248 (2.31 GB)
DFS Remaining: 15054352384 (14.02 GB)
DFS Used%: 11.32%
DFS Remaining%: 71.68%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jul 17 23:19:41 UTC 2021
Last Block Report: Sat Jul 17 23:09:17 UTC 2021
Num of Blocks: 3470


Name: 192.168.8.174:9866 (datanode2.example.org)
Hostname: datanode2.example.org
Decommission Status : Normal
Configured Capacity: 21003583488 (19.56 GB)
DFS Used: 3345268736 (3.12 GB)
Non DFS Used: 2472144896 (2.30 GB)
DFS Remaining: 14095650816 (13.13 GB)
DFS Used%: 15.93%
DFS Remaining%: 67.11%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jul 17 23:19:41 UTC 2021
Last Block Report: Sat Jul 17 23:09:17 UTC 2021
Num of Blocks: 10736


Name: 192.168.8.175:9866 (datanode3.example.org)
Hostname: datanode3.example.org
Decommission Status : Normal
Configured Capacity: 21003583488 (19.56 GB)
DFS Used: 2923110400 (2.72 GB)
Non DFS Used: 2560081920 (2.38 GB)
DFS Remaining: 14429872128 (13.44 GB)
DFS Used%: 13.92%
DFS Remaining%: 68.70%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jul 17 23:19:41 UTC 2021
Last Block Report: Sat Jul 17 23:09:17 UTC 2021
Num of Blocks: 7598

Put datanode2.example.org into the maintenance state and datanode3.example.org into the decommissioned state.

$ cat <<EOF | sudo -u hadoop tee /opt/hadoop/hadoop-3.2.2/etc/hadoop/hosts.json
[
  {
    "hostName": "datanode1.example.org",
    "adminState": "NORMAL"
  },
  {
    "hostName": "datanode2.example.org",
    "adminState": "IN_MAINTENANCE"
  },
  {
    "hostName": "datanode3.example.org",
    "adminState": "DECOMMISSIONED"
  }
]
EOF

Make the namenode re-read the hosts.json file.

$ hdfs dfsadmin -refreshNodes
Refresh nodes successful
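
The state transition is not immediate, so a script that continues with the actual maintenance work needs to wait for it. The following is a rough sketch of such a wait loop, assuming the report layout used throughout this post; the function names, the 30-second interval, and the attempt limit are arbitrary choices for illustration.

```shell
# Print the "Decommission Status" of one datanode, taken from the first
# matching section of `hdfs dfsadmin -report` output on standard input.
datanode_state() {
  awk -F': ' -v host="$1" '
    /^Hostname:/ { current = $2 }
    !found && /^Decommission Status/ && current == host { print $2; found = 1 }
  '
}

# Poll the report until the datanode reaches the wanted state.
# Arguments: hostname, state, interval in seconds (default 30),
# maximum number of attempts (default 120).
# Returns 0 when the state is reached, 1 when attempts run out.
wait_for_datanode_state() {
  host=$1; wanted=$2; interval=${3:-30}; attempts=${4:-120}
  while [ "$attempts" -gt 0 ]; do
    state=$(hdfs dfsadmin -report | datanode_state "$host")
    echo "$host: ${state:-unknown}"
    [ "$state" = "$wanted" ] && return 0
    attempts=$((attempts - 1))
    sleep "$interval"
  done
  return 1
}

# Usage:
#   wait_for_datanode_state datanode2.example.org "In maintenance"
```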

Display HDFS report to inspect datanodes.

$ hdfs dfsadmin -report
Configured Capacity: 21003583488 (19.56 GB)
Present Capacity: 17432244957 (16.24 GB)
DFS Remaining: 15054168488 (14.02 GB)
DFS Used: 2378076469 (2.21 GB)
DFS Used%: 13.64%
Replicated Blocks:
        Under replicated blocks: 7586
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 7572
        Pending deletion blocks: 0
Erasure Coded Block Groups: 
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 192.168.8.173:9866 (datanode1.example.org)
Hostname: datanode1.example.org
Decommission Status : Normal
Configured Capacity: 21003583488 (19.56 GB)
DFS Used: 2378076469 (2.21 GB)
Non DFS Used: 2480721611 (2.31 GB)
DFS Remaining: 15054168488 (14.02 GB)
DFS Used%: 11.32%
DFS Remaining%: 71.67%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 4
Last contact: Sat Jul 17 23:25:50 UTC 2021
Last Block Report: Sat Jul 17 23:09:17 UTC 2021
Num of Blocks: 3478


Name: 192.168.8.174:9866 (datanode2.example.org)
Hostname: datanode2.example.org
Decommission Status : Entering maintenance
Configured Capacity: 21003583488 (19.56 GB)
DFS Used: 3345268736 (3.12 GB)
Non DFS Used: 2472144896 (2.30 GB)
DFS Remaining: 14095650816 (13.13 GB)
DFS Used%: 15.93%
DFS Remaining%: 67.11%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jul 17 23:25:50 UTC 2021
Last Block Report: Sat Jul 17 23:09:17 UTC 2021
Num of Blocks: 10736


Name: 192.168.8.175:9866 (datanode3.example.org)
Hostname: datanode3.example.org
Decommission Status : Decommission in progress
Configured Capacity: 21003583488 (19.56 GB)
DFS Used: 2923110400 (2.72 GB)
Non DFS Used: 2560081920 (2.38 GB)
DFS Remaining: 14429872128 (13.44 GB)
DFS Used%: 13.92%
DFS Remaining%: 68.70%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jul 17 23:25:50 UTC 2021
Last Block Report: Sat Jul 17 23:09:17 UTC 2021
Num of Blocks: 7598


Decommissioning datanodes (1):

Name: 192.168.8.175:9866 (datanode3.example.org)
Hostname: datanode3.example.org
Decommission Status : Decommission in progress
Configured Capacity: 21003583488 (19.56 GB)
DFS Used: 2923110400 (2.72 GB)
Non DFS Used: 2560081920 (2.38 GB)
DFS Remaining: 14429872128 (13.44 GB)
DFS Used%: 13.92%
DFS Remaining%: 68.70%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jul 17 23:25:50 UTC 2021
Last Block Report: Sat Jul 17 23:09:17 UTC 2021
Num of Blocks: 7598


Entering maintenance datanodes (1):

Name: 192.168.8.174:9866 (datanode2.example.org)
Hostname: datanode2.example.org
Decommission Status : Entering maintenance
Configured Capacity: 21003583488 (19.56 GB)
DFS Used: 3345268736 (3.12 GB)
Non DFS Used: 2472144896 (2.30 GB)
DFS Remaining: 14095650816 (13.13 GB)
DFS Used%: 15.93%
DFS Remaining%: 67.11%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sat Jul 17 23:25:50 UTC 2021
Last Block Report: Sat Jul 17 23:09:17 UTC 2021
Num of Blocks: 10736

After a while, the status of datanode2.example.org changes from "Entering maintenance" to "In maintenance".

[...]

In maintenance datanodes (1):

Name: 192.168.8.174:9866 (datanode2.example.org)
Hostname: datanode2.example.org
Decommission Status : In maintenance
Configured Capacity: 21003583488 (19.56 GB)
DFS Used: 3350933504 (3.12 GB)
Non DFS Used: 2473369600 (2.30 GB)
DFS Remaining: 14088761344 (13.12 GB)
DFS Used%: 15.95%
DFS Remaining%: 67.08%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Sun Jul 18 00:14:20 UTC 2021
Last Block Report: Sat Jul 17 23:09:17 UTC 2021
Num of Blocks: 10836
[...]

Additional notes

By default, the minimum number of live replicas that must exist for blocks stored on a datanode undergoing maintenance is 1. Raise the dfs.namenode.maintenance.replication.min option in hdfs-site.xml to require more replicas.

You can define when the maintenance state expires.

$ date
Sun Jul 18 00:01:23 UTC 2021
$ date +%s
1626566485

Calculate the maintenance expiry time, one hour from now, in milliseconds since the epoch.

$ date +"%s000" --date "+1 hour"
1626570086000
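
The same calculation can be wrapped in small helpers, together with the reverse conversion for double-checking a value before it goes into the file. This is only a sketch: the function names are made up, and ms_to_utc relies on the GNU date -d @SECONDS form, in line with the GNU date usage above.

```shell
# Print a maintenance expiry time as epoch milliseconds, a given number
# of minutes from now (default 60).
maintenance_expiry_ms() {
  minutes=${1:-60}
  now=$(date +%s)
  echo $(( (now + minutes * 60) * 1000 ))
}

# Convert epoch milliseconds back to a readable UTC timestamp
# (GNU date only).
ms_to_utc() {
  date -u -d "@$(( $1 / 1000 ))" +%Y-%m-%dT%H:%M:%SZ
}

# Usage:
#   maintenance_expiry_ms 90
#   ms_to_utc 1626570086000    # 2021-07-18T01:01:26Z
```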

Define it in the hosts.json file as maintenanceExpireTimeInMS, and remember to refresh the nodes afterwards.

$ sudo -u hadoop cat /opt/hadoop/hadoop-3.2.2/etc/hadoop/hosts.json
[
  {
    "hostName": "datanode1.example.org",
    "adminState": "NORMAL"
  },
  {
    "hostName": "datanode2.example.org",
    "adminState": "IN_MAINTENANCE",
    "maintenanceExpireTimeInMS": 1626570086000
  },
  {
    "hostName": "datanode3.example.org",
    "adminState": "NORMAL"
  }
]

Useful links:

[Umbrella] Support maintenance state for datanodes

HDFS DataNode Admin Guide