How to mount HDFS as a local file system

Mount HDFS as a local file system using fuse-dfs.

Beware: it supports only basic file operations, it is slower than the native HDFS client, and the installation is cumbersome, but it works.

I will perform the installation on a client machine, but the compilation only needs to be done once and can be executed on a different machine.

Prerequisites

You need Docker to create the Hadoop Developer build environment, plus Java and the Hadoop libraries to run the fuse_dfs application.
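
Before starting, it may help to confirm that the required tools are on the PATH. The following is a minimal sketch; the helper name missing_tools and the exact tool list are assumptions made for illustration, not part of the original procedure.

```shell
#!/bin/sh
# Print the name of every command from the argument list that is not on PATH.
# missing_tools is a hypothetical helper used for illustration only.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools used in this guide; adjust the list to your environment.
missing=$(missing_tools git docker java wget)
if [ -n "$missing" ]; then
  echo "Missing tools:" $missing
else
  echo "All required tools found."
fi
```

Docker and git are only needed on the machine that compiles fuse_dfs, so the list can be shortened on the client.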

Ownership works slightly differently because fuse_dfs maps the user name, not the user ID, so change the ownership of a directory using the hdfs utilities before you try to write anything there through FUSE.

$ hdfs dfs -mkdir /milosz
$ hdfs dfs -chown milosz /milosz
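
Since owners are matched by name, it can be worth checking that the owner string stored in HDFS equals the local user name. A minimal sketch, assuming the usual column layout of `hdfs dfs -ls -d` output (permissions, replication, owner, group, ...); the helper hdfs_owner and the sample line are illustrative and were not captured from a real cluster.

```shell
#!/bin/sh
# Extract the owner (third column) from one line of `hdfs dfs -ls -d` output.
# hdfs_owner is a hypothetical helper; the column layout is an assumption.
hdfs_owner() {
  printf '%s\n' "$1" | awk '{ print $3 }'
}

# Illustrative sample line; on a real client use: line=$(hdfs dfs -ls -d /milosz)
sample='drwxr-xr-x   - milosz supergroup          0 2021-06-08 00:40 /milosz'
expected_user=milosz   # on a real client this could be $(id -un)

if [ "$(hdfs_owner "$sample")" = "$expected_user" ]; then
  echo "owner matches $expected_user"
else
  echo "owner differs from $expected_user"
fi
```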

Installation

Create a workspace directory for compilation purposes.

$ sudo mkdir /opt/workspace

Change directory ownership.

$ sudo chown $(whoami) /opt/workspace

Change working directory.

$ cd /opt/workspace

Clone the Hadoop source code. I will stick with branch-3.2 as I am using Hadoop 3.2.2, but you can omit the --branch option to use the default branch.

$ git clone --depth 1 --branch branch-3.2 https://github.com/apache/hadoop.git
Cloning into 'hadoop'...
remote: Enumerating objects: 16383, done.
remote: Counting objects: 100% (16383/16383), done.
remote: Compressing objects: 100% (11609/11609), done.
remote: Total 16383 (delta 5653), reused 7976 (delta 3721), pack-reused 0
Receiving objects: 100% (16383/16383), 34.38 MiB | 7.34 MiB/s, done.
Resolving deltas: 100% (5653/5653), done.

Change working directory.

$ cd hadoop

Create a build environment using Docker. Inspect BUILDING.txt for details.

$ ./start-build-env.sh
[...]
 _   _           _                    ______
| | | |         | |                   |  _  \
| |_| | __ _  __| | ___   ___  _ __   | | | |_____   __
|  _  |/ _` |/ _` |/ _ \ / _ \| '_ \  | | | / _ \ \ / /
| | | | (_| | (_| | (_) | (_) | |_) | | |/ /  __/\ V /
\_| |_/\__,_|\__,_|\___/ \___/| .__/  |___/ \___| \_(_)
                              | |
                              |_|

This is the standard Hadoop Developer build environment.
This has all the right tools installed required to build
Hadoop from source.

You will be left inside the Hadoop Developer build environment.

milosz@d463965d9dfc:~/hadoop$

Build the fuse_dfs executable.

$ mvn package -Pnative -Drequire.fuse=true -DskipTests -Dmaven.javadoc.skip=true
[...]
[INFO] Reactor Summary for Apache Hadoop Main 3.2.3-SNAPSHOT:
[INFO]
[INFO] Apache Hadoop Main ................................. SUCCESS [  1.115 s]
[INFO] Apache Hadoop Build Tools .......................... SUCCESS [  0.813 s]
[INFO] Apache Hadoop Project POM .......................... SUCCESS [  0.529 s]
[INFO] Apache Hadoop Annotations .......................... SUCCESS [  1.126 s]
[INFO] Apache Hadoop Project Dist POM ..................... SUCCESS [  0.116 s]
[INFO] Apache Hadoop Assemblies ........................... SUCCESS [  0.135 s]
[INFO] Apache Hadoop Maven Plugins ........................ SUCCESS [ 31.168 s]
[INFO] Apache Hadoop MiniKDC .............................. SUCCESS [  2.158 s]
[INFO] Apache Hadoop Auth ................................. SUCCESS [ 13.252 s]
[INFO] Apache Hadoop Auth Examples ........................ SUCCESS [  0.830 s]
[INFO] Apache Hadoop Common ............................... SUCCESS [ 35.768 s]
[INFO] Apache Hadoop NFS .................................. SUCCESS [  0.922 s]
[INFO] Apache Hadoop KMS .................................. SUCCESS [ 12.954 s]
[INFO] Apache Hadoop Common Project ....................... SUCCESS [  0.022 s]
[INFO] Apache Hadoop HDFS Client .......................... SUCCESS [ 48.808 s]
[INFO] Apache Hadoop HDFS ................................. SUCCESS [ 17.106 s]
[INFO] Apache Hadoop HDFS Native Client ................... SUCCESS [01:26 min]
[INFO] Apache Hadoop HttpFS ............................... SUCCESS [  1.738 s]
[INFO] Apache Hadoop HDFS-NFS ............................. SUCCESS [  0.521 s]
[INFO] Apache Hadoop HDFS-RBF ............................. SUCCESS [  1.687 s]
[INFO] Apache Hadoop HDFS Project ......................... SUCCESS [  0.018 s]
[INFO] Apache Hadoop YARN ................................. SUCCESS [  0.019 s]
[INFO] Apache Hadoop YARN API ............................. SUCCESS [  4.747 s]
[INFO] Apache Hadoop YARN Common .......................... SUCCESS [ 24.420 s]
[INFO] Apache Hadoop YARN Registry ........................ SUCCESS [  0.792 s]
[INFO] Apache Hadoop YARN Server .......................... SUCCESS [  0.021 s]
[INFO] Apache Hadoop YARN Server Common ................... SUCCESS [  2.401 s]
[INFO] Apache Hadoop YARN NodeManager ..................... SUCCESS [ 21.355 s]
[INFO] Apache Hadoop YARN Web Proxy ....................... SUCCESS [  0.654 s]
[INFO] Apache Hadoop YARN ApplicationHistoryService ....... SUCCESS [ 12.796 s]
[INFO] Apache Hadoop YARN Timeline Service ................ SUCCESS [  0.923 s]
[INFO] Apache Hadoop YARN ResourceManager ................. SUCCESS [  6.613 s]
[INFO] Apache Hadoop YARN Server Tests .................... SUCCESS [  0.898 s]
[INFO] Apache Hadoop YARN Client .......................... SUCCESS [  1.346 s]
[INFO] Apache Hadoop YARN SharedCacheManager .............. SUCCESS [  0.717 s]
[INFO] Apache Hadoop YARN Timeline Plugin Storage ......... SUCCESS [  1.091 s]
[INFO] Apache Hadoop YARN TimelineService HBase Backend ... SUCCESS [  0.038 s]
[INFO] Apache Hadoop YARN TimelineService HBase Common .... SUCCESS [  7.515 s]
[INFO] Apache Hadoop YARN TimelineService HBase Client .... SUCCESS [ 20.104 s]
[INFO] Apache Hadoop YARN TimelineService HBase Servers ... SUCCESS [  0.022 s]
[INFO] Apache Hadoop YARN TimelineService HBase Server 1.2  SUCCESS [  1.351 s]
[INFO] Apache Hadoop YARN TimelineService HBase tests ..... SUCCESS [  5.786 s]
[INFO] Apache Hadoop YARN Router .......................... SUCCESS [  0.825 s]
[INFO] Apache Hadoop YARN Applications .................... SUCCESS [  0.015 s]
[INFO] Apache Hadoop YARN DistributedShell ................ SUCCESS [  0.624 s]
[INFO] Apache Hadoop YARN Unmanaged Am Launcher ........... SUCCESS [  0.441 s]
[INFO] Apache Hadoop MapReduce Client ..................... SUCCESS [  0.090 s]
[INFO] Apache Hadoop MapReduce Core ....................... SUCCESS [  2.469 s]
[INFO] Apache Hadoop MapReduce Common ..................... SUCCESS [  1.343 s]
[INFO] Apache Hadoop MapReduce Shuffle .................... SUCCESS [  1.036 s]
[INFO] Apache Hadoop MapReduce App ........................ SUCCESS [  1.952 s]
[INFO] Apache Hadoop MapReduce HistoryServer .............. SUCCESS [  1.057 s]
[INFO] Apache Hadoop MapReduce JobClient .................. SUCCESS [  2.217 s]
[INFO] Apache Hadoop Mini-Cluster ......................... SUCCESS [  0.780 s]
[INFO] Apache Hadoop YARN Services ........................ SUCCESS [  0.015 s]
[INFO] Apache Hadoop YARN Services Core ................... SUCCESS [ 13.258 s]
[INFO] Apache Hadoop YARN Services API .................... SUCCESS [  0.768 s]
[INFO] Apache Hadoop Image Generation Tool ................ SUCCESS [  0.563 s]
[INFO] Yet Another Learning Platform ...................... SUCCESS [  0.670 s]
[INFO] Apache Hadoop YARN Site ............................ SUCCESS [  0.013 s]
[INFO] Apache Hadoop YARN UI .............................. SUCCESS [  0.013 s]
[INFO] Apache Hadoop YARN Project ......................... SUCCESS [  0.555 s]
[INFO] Apache Hadoop MapReduce HistoryServer Plugins ...... SUCCESS [  0.326 s]
[INFO] Apache Hadoop MapReduce NativeTask ................. SUCCESS [ 25.024 s]
[INFO] Apache Hadoop MapReduce Uploader ................... SUCCESS [  0.377 s]
[INFO] Apache Hadoop MapReduce Examples ................... SUCCESS [  0.436 s]
[INFO] Apache Hadoop MapReduce ............................ SUCCESS [  0.113 s]
[INFO] Apache Hadoop MapReduce Streaming .................. SUCCESS [  0.431 s]
[INFO] Apache Hadoop Distributed Copy ..................... SUCCESS [  0.685 s]
[INFO] Apache Hadoop Archives ............................. SUCCESS [  0.209 s]
[INFO] Apache Hadoop Archive Logs ......................... SUCCESS [  0.305 s]
[INFO] Apache Hadoop Rumen ................................ SUCCESS [  0.576 s]
[INFO] Apache Hadoop Gridmix .............................. SUCCESS [  0.444 s]
[INFO] Apache Hadoop Data Join ............................ SUCCESS [  0.188 s]
[INFO] Apache Hadoop Extras ............................... SUCCESS [  0.196 s]
[INFO] Apache Hadoop Pipes ................................ SUCCESS [  4.137 s]
[INFO] Apache Hadoop OpenStack support .................... SUCCESS [  0.422 s]
[INFO] Apache Hadoop Amazon Web Services support .......... SUCCESS [ 50.254 s]
[INFO] Apache Hadoop Kafka Library support ................ SUCCESS [ 13.164 s]
[INFO] Apache Hadoop Azure support ........................ SUCCESS [  3.075 s]
[INFO] Apache Hadoop Aliyun OSS support ................... SUCCESS [  0.193 s]
[INFO] Apache Hadoop Client Aggregator .................... SUCCESS [  0.906 s]
[INFO] Apache Hadoop Scheduler Load Simulator ............. SUCCESS [  1.153 s]
[INFO] Apache Hadoop Resource Estimator Service ........... SUCCESS [  0.644 s]
[INFO] Apache Hadoop Azure Data Lake support .............. SUCCESS [  1.934 s]
[INFO] Apache Hadoop Tools Dist ........................... SUCCESS [  0.736 s]
[INFO] Apache Hadoop Tools ................................ SUCCESS [  0.015 s]
[INFO] Apache Hadoop Client API ........................... SUCCESS [01:05 min]
[INFO] Apache Hadoop Client Runtime ....................... SUCCESS [ 56.817 s]
[INFO] Apache Hadoop Client Packaging Invariants .......... SUCCESS [  0.138 s]
[INFO] Apache Hadoop Client Test Minicluster .............. SUCCESS [01:50 min]
[INFO] Apache Hadoop Client Packaging Invariants for Test . SUCCESS [  0.102 s]
[INFO] Apache Hadoop Client Packaging Integration Tests ... SUCCESS [  0.056 s]
[INFO] Apache Hadoop Distribution ......................... SUCCESS [  0.095 s]
[INFO] Apache Hadoop Client Modules ....................... SUCCESS [  0.012 s]
[INFO] Apache Hadoop Cloud Storage ........................ SUCCESS [  0.452 s]
[INFO] Apache Hadoop Cloud Storage Project ................ SUCCESS [  0.014 s]
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  12:21 min
[INFO] Finished at: 2021-06-07T20:53:41Z
[INFO] ------------------------------------------------------------------------

Exit the Hadoop Developer build environment.

milosz@d463965d9dfc:~/hadoop$ exit

Download the Hadoop binary archive.

$ wget --quiet --directory-prefix /opt/workspace https://downloads.apache.org/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz

Extract the downloaded archive.

$ sudo tar --directory /opt/ --extract --file /opt/workspace/hadoop-3.2.2.tar.gz

Copy the built fuse_dfs binary.

$ sudo cp /opt/workspace/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/fuse-dfs/fuse_dfs /usr/bin/

Create a simple shell wrapper that defines the required environment variables.

$ cat <<EOF | sudo tee /usr/bin/fuse_dfs_wrapper.sh
#!/bin/bash
# naive fuse_dfs wrapper

# define HADOOP_HOME & JAVA_HOME
export HADOOP_HOME=/opt/hadoop-3.2.2
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

# define CLASSPATH
while IFS= read -r -d '' file
do
  export CLASSPATH=\$CLASSPATH:\$file
done < <(find \${HADOOP_HOME}/share/hadoop/{common,hdfs} -name "*.jar" -print0)

# define LD_LIBRARY_PATH
export LD_LIBRARY_PATH=\${HADOOP_HOME}/lib/native/:\${JAVA_HOME}/lib/server/
#export LD_LIBRARY_PATH=\$(find \${HADOOP_HOME} -name libhdfs.so.0.0.0 -exec dirname {} \;):\$(find \${JAVA_HOME} -name libjvm.so -exec dirname {} \;)

fuse_dfs "\$@"
EOF
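
The two path-building steps of the wrapper can also be tried in isolation. The sketch below is an alternative formulation, not the wrapper itself: build_classpath and find_lib_dir are hypothetical helper names, and the demonstration runs against a throwaway directory tree rather than a real Hadoop installation. It assumes file names without embedded newlines.

```shell
#!/bin/sh
# Join every *.jar under the given directories into one colon-separated
# list (sorted, no leading or trailing colon). Hypothetical helper.
build_classpath() {
  find "$@" -name '*.jar' | sort | paste -s -d ':' -
}

# Print the directory holding the first file named $2 found under $1,
# similar to the commented-out LD_LIBRARY_PATH line in the wrapper.
find_lib_dir() {
  find "$1" -name "$2" -exec dirname {} \; | head -n 1
}

# Demonstration on a temporary tree instead of /opt/hadoop-3.2.2.
demo=$(mktemp -d)
mkdir -p "$demo/common" "$demo/hdfs" "$demo/lib/server"
touch "$demo/common/a.jar" "$demo/hdfs/b.jar" "$demo/lib/server/libjvm.so"

build_classpath "$demo/common" "$demo/hdfs"
find_lib_dir "$demo" libjvm.so
rm -rf "$demo"
```

In a real wrapper the results would be exported, for example as CLASSPATH=$(build_classpath "$HADOOP_HOME/share/hadoop/common" "$HADOOP_HOME/share/hadoop/hdfs").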

Set the executable bit.

$ sudo chmod +x /usr/bin/fuse_dfs
$ sudo chmod +x /usr/bin/fuse_dfs_wrapper.sh

Configuration

Create a mount directory.

$ mkdir /opt/workspace/hadoop_data

Inspect fuse_dfs options.

$ fuse_dfs_wrapper.sh                              
USAGE: fuse_dfs [debug] [--help] [--version] [-oprotected=] [-oport=] [-oentry_timeout=] [-oattribute_timeout=] [-odirect_io] [-onopoermissions] [-o]  [fuse options]
NOTE: debugging option for fuse is -debug

Mount HDFS as a local file system.

$ sudo fuse_dfs_wrapper.sh dfs://namenode.example.org:9000 /opt/workspace/hadoop_data -oinitchecks 
INFO /home/milosz/hadoop/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/fuse-dfs/fuse_options.c:164 Adding FUSE arg /opt/workspace/hadoop_data

Inspect it.

$ df -h  /opt/workspace/hadoop_data
Filesystem      Size  Used Avail Use% Mounted on
fuse_dfs         59G  2,7G   56G   5% /opt/workspace/hadoop_data
$ ls -l  /opt/workspace/hadoop_data/milosz/
total 0
-rw-r--r-- 1 milosz 99 0 Jun  8 00:45 readme.txt
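
For scripts it can be handy to test whether the mount is active before touching files. A minimal sketch, assuming a Linux system where /proc/mounts lists the mount point in its second column; is_mounted is a hypothetical helper, and the optional second argument exists only so the check can be tried against a sample file.

```shell
#!/bin/sh
# Succeed if directory $1 appears as a mount point in the mounts table.
# The table defaults to /proc/mounts (Linux). Hypothetical helper.
is_mounted() {
  awk -v dir="$1" '$2 == dir { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

if is_mounted /opt/workspace/hadoop_data; then
  echo "HDFS mount is active"
else
  echo "HDFS mount is not active"
fi
```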

You can store the mount configuration in /etc/fstab.

fuse_dfs_wrapper.sh#dfs://namenode.example.org:9000 /opt/workspace/hadoop_data fuse rw,initchecks 0 0
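
The entry follows the pattern helper#source, mount point, type, options. As a sketch of how the fields fit together, the hypothetical helper fstab_entry below assembles the same line from its parts; the name and argument order are assumptions for illustration.

```shell
#!/bin/sh
# Print an /etc/fstab line for a fuse_dfs mount.
# Arguments: wrapper name, HDFS URI, mount point, mount options.
# fstab_entry is a hypothetical helper used for illustration only.
fstab_entry() {
  printf '%s#%s %s fuse %s 0 0\n' "$1" "$2" "$3" "$4"
}

fstab_entry fuse_dfs_wrapper.sh dfs://namenode.example.org:9000 \
  /opt/workspace/hadoop_data rw,initchecks
```

Once the line is in /etc/fstab, the file system can be mounted with sudo mount /opt/workspace/hadoop_data and unmounted with sudo umount /opt/workspace/hadoop_data.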

Additional notes

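If the dev and suid options passed at mount time bother you, update the shell wrapper to strip them before calling fuse_dfs.

# strip ",dev" and ",suid" from the arguments, then run fuse_dfs
parameters="$(echo "$@" | sed -e 's/,dev//' -e 's/,suid//')"
fuse_cmd="fuse_dfs $parameters"
$fuse_cmd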
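
The sed expressions only match the options when they follow a comma. A slightly more defensive variant, sketched below, filters whole entries of the comma-separated list so that position does not matter and similar names such as nodev are left alone. strip_mount_opts is a hypothetical helper, not part of fuse-dfs.

```shell
#!/bin/sh
# Remove the dev and suid entries from a comma-separated mount option list.
# The body runs in a subshell so the IFS and globbing changes do not leak
# into the caller. strip_mount_opts is a hypothetical helper.
strip_mount_opts() (
  set -f      # no filename expansion while splitting
  IFS=','
  result=""
  for opt in $1; do
    case "$opt" in
      dev|suid) ;;                             # drop these two entries
      *) result="${result:+$result,}$opt" ;;   # keep everything else
    esac
  done
  printf '%s\n' "$result"
)

strip_mount_opts rw,initchecks,dev,suid
```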

Inspect the fuse-dfs README file.

$ cat hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/fuse-dfs/doc/README
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
Fuse-DFS

Fuse-DFS allows HDFS to be mounted as a local file system.
It currently supports reads, writes, and directory operations (e.g., cp, ls, more, cat, find, less, rm, mkdir, mv, rmdir, touch, chmod, chown and permissions). Random access writing is not supported.

Contributing

It's pretty straightforward to add functionality to fuse-dfs as fuse makes things relatively simple. Some other tasks require also augmenting libhdfs to expose more hdfs functionality to C. See [https://issues.apache.org/jira/issues/?jql=text%20~%20%22fuse-dfs%22  fuse-dfs JIRAs]

Requirements

 * Hadoop with compiled libhdfs.so
 * Linux kernel > 2.6.9 with fuse, which is the default or Fuse 2.7.x, 2.8.x installed. See: [http://fuse.sourceforge.net/]
 * modprobe fuse to load it
 * fuse_dfs executable (see below)
 * fuse_dfs_wrapper.sh installed in /bin or other appropriate location (see below)


BUILDING

   fuse-dfs executable can be built by setting `require.fuse` option to true using Maven. For example:
   in HADOOP_HOME: `mvn package -Pnative -Drequire.fuse=true -DskipTests -Dmaven.javadoc.skip=true`

   The executable `fuse_dfs` will be located at HADOOP_HOME/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/fuse-dfs/

Common build problems include not finding the libjvm.so in JAVA_HOME/jre/lib/OS_ARCH/server or not finding fuse in FUSE_HOME or /usr/local.


CONFIGURING

fuse_dfs_wrapper.sh may not work out of box. To use it, look at all the paths in fuse_dfs_wrapper.sh and either correct them or set them in your environment before running. (note for automount and mount as root, you probably cannot control the environment, so best to set them in the wrapper)

INSTALLING

1. `mkdir /export/hdfs` (or wherever you want to mount it)

2. `fuse_dfs_wrapper.sh dfs://hadoop_server1.foo.com:9000 /export/hdfs -odebug` and from another terminal, try `ls /export/hdfs`

If 2 works, try again dropping the debug mode, i.e., -debug

(note - common problems are that you don't have libhdfs.so or libjvm.so or libfuse.so on your LD_LIBRARY_PATH, and your CLASSPATH does not contain hadoop and other required jars.)

Also note, fuse-dfs will write error/warn messages to the syslog - typically in /var/log/messages

You can use fuse-dfs to mount multiple hdfs instances by just changing the server/port name and directory mount point above.

DEPLOYING

in a root shell do the following:

1. add the following to /etc/fstab

fuse_dfs#dfs://hadoop_server.foo.com:9000 /export/hdfs fuse -oallow_other,rw,-ousetrash,-oinitchecks 0 0


2. Mount using: `mount /export/hdfs`. Expect problems with not finding fuse_dfs. You will need to probably add this to /sbin and then problems finding the above 3 libraries. Add these using ldconfig.


Fuse DFS takes the following mount options (i.e., on the command line or the comma separated list of options in /etc/fstab:

-oserver=%s  (optional place to specify the server but in fstab use the format above)
-oport=%d (optional port see comment on server option)
-oentry_timeout=%d (how long directory entries are cached by fuse in seconds - see fuse docs)
-oattribute_timeout=%d (how long attributes are cached by fuse in seconds - see fuse docs)
-oprotected=%s (a colon separated list of directories that fuse-dfs should not allow to be deleted or moved - e.g., /user:/tmp)
-oprivate (not often used but means only the person who does the mount can use the filesystem - aka ! allow_others in fuse speak)
-ordbuffer=%d (in KBs how large a buffer should fuse-dfs use when doing hdfs reads)
ro 
rw
-ousetrash (should fuse dfs throw things in /Trash when deleting them)
-onotrash (opposite of usetrash)
-odebug (do not daemonize - aka -d in fuse speak)
-obig_writes (use fuse big_writes option so as to allow better performance of writes on kernels >= 2.6.26)
-initchecks - have fuse-dfs try to connect to hdfs to ensure all is ok upon startup. recommended to have this  on
The defaults are:

entry,attribute_timeouts = 60 seconds
rdbuffer = 10 MB
protected = null
debug = 0
notrash
private = 0

EXPORTING

Add the following to /etc/exports:

/export/hdfs *.foo.com(no_root_squash,rw,fsid=1,sync)

NOTE - you cannot export this with a FUSE module built into the kernel
- e.g., kernel 2.6.17. For info on this, refer to the FUSE wiki.


RECOMMENDATIONS

1. From /bin, `ln -s HADOOP_HOME/hadoop-hdfs-project/hadoop-hdfs-native-client/target/main/native/fuse-dfs/fuse_dfs* .`

2. Always start with debug on so you can see if you are missing a classpath or something like that.

3. use -obig_writes

4. use -initchecks

KNOWN ISSUES 

1. if you alias `ls` to `ls --color=auto` and try listing a directory with lots (over thousands) of files, expect it to be slow and at 10s of thousands, expect it to be very very slow.  This is because `--color=auto` causes ls to stat every file in the directory. Since fuse-dfs does not cache attribute entries when doing a readdir, 
this is very slow. see [https://issues.apache.org/jira/browse/HADOOP-3797 HADOOP-3797]

2. Writes are approximately 33% slower than the DFSClient. TBD how to optimize this. see: [https://issues.apache.org/jira/browse/HADOOP-3805 HADOOP-3805] - try using -obig_writes if on a >2.6.26 kernel, should perform much better since bigger writes implies less context switching.

3. Reads are ~20-30% slower even with the read buffering. 

It is recommended to use the -obig_writes option.