Want to play with Hadoop without a lot of overhead? This is a step-by-step guide to setting up a prototype Hadoop cluster, loading data, and accessing that data from Python.
Create Hadoop Cluster
Create a temporary Hadoop cluster as proof of concept using KiwenLau’s implementation.
First, install Docker if necessary (see Docker's installation instructions), or, if it is already installed, start the daemon:
sudo service docker start
Follow directions on KiwenLau’s page:
1. sudo docker pull kiwenlau/hadoop:1.0
2. git clone https://github.com/kiwenlau/hadoop-cluster-docker
3. sudo docker network create --driver=bridge hadoop
4. cd hadoop-cluster-docker
5. sudo ./start-container.sh
This will leave you inside the container as user root, in directory /root.
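From another terminal on the host you can check that the cluster containers are up; KiwenLau's start script should create them as hadoop-master, hadoop-slave1, and hadoop-slave2:
docker ps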
Grant access to Hadoop Master
The container already has ssh, and it turns out that explicitly opening port 22 is unnecessary and in fact causes a port-conflict error. Root is the only user defined within the container, so you must add authentication keys for whoever will log into the container.
Append the contents of the user's public key (~/.ssh/id_rsa.pub) to the container's /root/.ssh/authorized_keys file. Example:
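One way to do this from the container host, assuming the master container is named hadoop-master (the name KiwenLau's start script uses):
# append the host user's public key to root's authorized_keys inside the container
docker exec -i hadoop-master bash -c 'cat >> /root/.ssh/authorized_keys' < ~/.ssh/id_rsa.pub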
Now, if you exit the container and need to get back in, you can gain access with ssh; for example:
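ssh root@172.18.0.4
(172.18.0.4 is the master container's address assumed in the snakebite example below; substitute your own container's IP, which you can find as described under "Container IP Address".)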
Start up Hadoop from within the Hadoop master container. Again, per KiwenLau's page:
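The start script is in /root, where start-container.sh leaves you:
./start-hadoop.sh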
See KiwenLau's page for the expected output.
Load the Data
Add your data file to the container, and then put it into Hadoop.
Example using sftp from the container host to the container:
sftp root@<container ip address>
put <data file(s)>
exit
or, from within the container:
sftp <host user>@<host ip>
get <data file(s)>
exit
Create an input directory on HDFS, if desired:
hadoop fs -mkdir -p <directory name>
ex: hadoop fs -mkdir -p testData
Add the file(s) to HDFS:
hdfs dfs -put <data file(s)> <directory name>
ex: hdfs dfs -put simpleSample.csv testData
Look to ensure the data is really there!
hdfs dfs -ls
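To list the contents of the example directory created above:
hdfs dfs -ls testData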
See what else you can do with the Hadoop file system with:
hdfs dfs -help
You can now exit the Hadoop master container.
Access the data
There are several Python libraries for Hadoop HDFS access. This example uses Spark 1.6 and snakebite.
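If snakebite is not already installed, it is available from PyPI (note that the classic snakebite client targets Python 2, which matches the print syntax below):
pip install snakebite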
Here is a Python snippet that reads a file from HDFS and prints each line for testing purposes. (Don't do this with a large file!) It assumes the Hadoop master Docker container's IP address is "172.18.0.4", the data file is "simpleSample.csv", and it is located in the directory "testData".
from snakebite.client import Client

# file path in HDFS, and the namenode address/port of the Hadoop master container
datafile = r'/user/root/testData/simpleSample.csv'
client = Client("172.18.0.4", 9000, use_trash=False, effective_user="root")

f = client.text([datafile])
for line in f:
    print "line from file: ", line
You might want to know:
9000 is a standard HDFS access port number, and the one that Hadoop-master uses.
Accessing a docker container
Once you have set up ssh authentication, you may gain access with ssh; for example:
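ssh root@<container ip address>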
Otherwise, issue the command
docker exec -it <container name> bash
Container IP Address
To determine a docker container’s IP address, enter the container and issue an ifconfig command. The eth0 IP address is that of the container.
$ docker exec -it <container name> bash
$ ifconfig
Or externally with:
$ docker inspect <container name>
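To print just the address, docker inspect also accepts a format string; something like this should work for a container on a user-defined bridge network:
$ docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container name>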