Fixing a hanging NameNode for Hadoop 2.7.1 on a Raspberry Pi 2 B

The problem: in /usr/local/hadoop-2.7.1 I can run ./sbin/start-dfs.sh, and the processes for both the NameNode and the DataNodes on all nodes are started. jps shows that the DataNode, NameNode and SecondaryNameNode are all running. But they are not binding to any ports - neither netstat nor "lsof -i" shows any TCP ports in use besides the SSH port.

Running any hdfs command gets me a "connection refused" error.
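
For reference, these are the checks mentioned above, run on the NameNode host (the exact netstat flags are just one way to list listening TCP sockets):

jps                  # DataNode, NameNode, SecondaryNameNode all show up
sudo netstat -tlnp   # ...but no Hadoop ports are listed
sudo lsof -i tcp     # only sshd is bound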

Apache Hadoop has a nice wiki page explaining that any networking problem involving "connection refused" is not their problem, but mine. Okaaaay.

I searched for hours - was it my DNS / IP settings, was it one of the configuration files, was it running on Raspbian?

No. In the end, I found a posting on Stack Overflow:

http://stackoverflow.com/questions/17392531/namenode-appears-to-hang-on-start

The reason the Hadoop services were not binding to any ports and refused all connections: the processes were hanging because of the old version of the Google Guava jar that ships with Hadoop.

So I wrote a small script to download a more current version and install it in place of the old one:

#!/bin/bash
set -e
cd /tmp
wget http://central.maven.org/maven2/com/google/guava/guava/18.0/guava-18.0.jar

HADOOP_SHARED=/usr/local/hadoop-2.7.1/share/hadoop

# Swap the bundled Guava 11.0.2 for Guava 18.0 in every lib
# directory that ships a copy of it.
for dir in \
    common/lib \
    hdfs/lib \
    httpfs/tomcat/webapps/webhdfs/WEB-INF/lib \
    kms/tomcat/webapps/kms/WEB-INF/lib \
    tools/lib \
    yarn/lib ; do
    rm "${HADOOP_SHARED}/${dir}/guava-11.0.2.jar"
    cp guava-18.0.jar "${HADOOP_SHARED}/${dir}/guava-18.0.jar"
done
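
To verify the swap afterwards, a quick check (assuming the same installation path as in the script):

# Only guava-18.0.jar entries should show up now.
find /usr/local/hadoop-2.7.1/share/hadoop -name 'guava-*.jar'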

And now Hadoop (at least the HDFS part) starts and listens on the default ports:

hduser@node1 ~ $ sudo lsof -i tcp
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
sshd 2169 root 3u IPv4 6612 0t0 TCP *:ssh (LISTEN)
sshd 2292 root 3u IPv4 6663 0t0 TCP node1:ssh->turtle.local:55560 (ESTABLISHED)
sshd 2296 hduser 3u IPv4 6663 0t0 TCP node1:ssh->turtle.local:55560 (ESTABLISHED)
java 4395 hduser 196u IPv4 13062 0t0 TCP *:50090 (LISTEN)
java 5029 hduser 190u IPv4 22738 0t0 TCP *:50070 (LISTEN)
java 5029 hduser 202u IPv4 21275 0t0 TCP node1:8020 (LISTEN)
java 5029 hduser 212u IPv4 21877 0t0 TCP node1:8020->node4:36701 (ESTABLISHED)
java 5029 hduser 213u IPv4 21878 0t0 TCP node1:8020->node2:46195 (ESTABLISHED)
java 5029 hduser 214u IPv4 21879 0t0 TCP node1:8020->node5:44667 (ESTABLISHED)
java 5029 hduser 215u IPv4 21880 0t0 TCP node1:8020->node3:46794 (ESTABLISHED)
java 5029 hduser 216u IPv4 21882 0t0 TCP node1:8020->node1:36134 (ESTABLISHED)
java 5130 hduser 192u IPv4 21626 0t0 TCP *:50010 (LISTEN)
java 5130 hduser 196u IPv4 21632 0t0 TCP localhost:57456 (LISTEN)
java 5130 hduser 250u IPv4 18108 0t0 TCP *:50075 (LISTEN)
java 5130 hduser 251u IPv4 21851 0t0 TCP *:50020 (LISTEN)
java 5130 hduser 262u IPv4 23654 0t0 TCP node1:36134->node1:8020 (ESTABLISHED)
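
With the NameNode listening on port 8020 again, a quick smoke test from the shell confirms that HDFS actually answers (assuming the Hadoop bin directory is on the PATH):

hdfs dfsadmin -report      # should list all live DataNodes
hdfs dfs -mkdir -p /test   # create a directory...
hdfs dfs -ls /             # ...and read it back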


Building Hadoop native libs for Raspberry Pi 2 B

If you try to download and install the binary packages of Apache Hadoop 2.7.1 on a Raspberry Pi 2 B, you will get some ugly warnings about the native libs being all wrong, something along the lines of:

Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.

Those warnings also clutter the output of the start-all.sh script when starting a cluster of several machines.
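
The root cause is easy to confirm: the native libraries bundled with the binary release are compiled for x86-64, not for the Pi's ARM CPU. A quick check, using the path from the warning above:

# On the Pi, file(1) reports an x86-64 ELF binary instead of an ARM one.
file /usr/local/hadoop/lib/native/libhadoop.so.1.0.0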

From searching the web, it looks like the cleanest way to get rid of the warnings is to compile Hadoop natively on the Raspberry Pi. A good source of information is

http://www.instructables.com/id/Native-Hadoop-260-Build-on-Pi/?ALLSTEPS

It is a tutorial for compiling Hadoop 2.6.0, but it seems to still be valid for 2.7.1.

When compiling the current version, I got out-of-memory errors, so I disabled javadoc generation. After that, the build succeeded:

mvn package -Pdist,native -DskipTests -Dtar -Dmaven.javadoc.skip=true
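
Once the build finishes, the fresh native libraries can be copied over the bundled ones and checked. A minimal sketch, assuming the default Maven build layout and an installation under /usr/local/hadoop-2.7.1:

# Copy the newly built ARM native libraries over the bundled x86-64 ones.
cp hadoop-dist/target/hadoop-2.7.1/lib/native/* /usr/local/hadoop-2.7.1/lib/native/

# Let Hadoop report which native components it can load now.
/usr/local/hadoop-2.7.1/bin/hadoop checknative -a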


Create a KVM instance with vm-builder

I want to experiment with Apache Hadoop, and to create a cluster of machines I will use KVM with Ubuntu guests.

To create VMs, the first step is to install vm-builder, which lets you create new virtual machines directly from the command line:

sudo apt-get install python-vm-builder

Current build script (for creating a raw VM with just Java and OpenSSH):

export MY_GNOME="gnome1"
sudo vmbuilder kvm ubuntu --suite trusty --flavour virtual \
    --destdir "/media/data/kvm/${MY_GNOME}" \
    --rootsize 10000 \
    --domain qatal.de \
    --addpkg acpid --addpkg openssh-server --addpkg linux-image-generic --addpkg openjdk-7-jre-headless \
    --user admin --pass admin \
    --mirror http://gb.archive.ubuntu.com/ubuntu/ --components main,universe,restricted \
    --arch amd64 --hostname "${MY_GNOME}" \
    --libvirt qemu:///system --bridge virbr0 \
    --mem 2048 --cpus 1

This will build a machine with a 10 GByte file system and 2 GByte of RAM.
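
Since the machine is registered with libvirt (--libvirt qemu:///system above), it can be managed with the usual virsh commands, for example:

virsh --connect qemu:///system list --all      # show all defined VMs
virsh --connect qemu:///system start gnome1    # boot the new machine
virsh --connect qemu:///system console gnome1  # attach to its console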

My plan was to create a couple of such VMs and then "install and run Hadoop" on them. But "one does not simply install a Hadoop cluster" - I underestimated the complexity of the project, and so I am currently back at the recommended beginner step: installing a single standalone node before diving into the cluster setup via Puppet.


Indexing all Reddit comments

When a Reddit user created a giant archive of all Reddit comments available so far, I downloaded the data set and later indexed it with a current version of Lucene. The code for this is in my GitHub repository: reddit-data-tools. At the moment it is only capable of a limited kind of search from the command line.

Searching for "fun +author:girl~" yields a count of all matching documents as well as a display of the top ten, including their comment text and a link back to Reddit.

Opening search index at F:/reddit_data/index-all. This may take a moment.
Going to search over 1532362437 documents.
Found: 1621 matching documents.
Going to display top 10:
Score: 2.742787 author: girfl, url: http://www.reddit.com/r/Hair/comments/17njif/c87a4al
That was fun!
Score: 2.5318978 author: qirl, url: http://www.reddit.com/r/AskReddit/comments/26s6br/chtxurs
that was fun! cya at the next family reunion? :)
Score: 2.4616015 author: qgirl, url: http://www.reddit.com/r/TwoXChromosomes/comments/au7t1/c0jfana
Yes! I almost always regret it when it's time to reverse the process, but it's fun anyway.
Score: 2.4616015 author: girfl, url: http://www.reddit.com/r/AskReddit/comments/18g4t6/c8esn2j
I have old lady hands. They are very wrinkled and I used to get made fun all the time in high school.
Score: 2.4616015 author: girfl, url: http://www.reddit.com/r/MakeupAddiction/comments/ytfi3/c5ypqon
Getting ready to go out is often more fun than actually being out.
Score: 2.4616015 author: zgirl, url: http://www.reddit.com/r/Fitness/comments/1fbagh/ca914bt
I also give fellow runners high fives as I pass them. It's fun.
Score: 2.4616015 author: girl8, url: http://www.reddit.com/r/AskReddit/comments/2d7g5o/cjmtslp
Have fun, be yourself. Don't go home the first few weeks no matter how much you might want to.
Score: 2.4616015 author: qirl, url: http://www.reddit.com/r/AskReddit/comments/26u8w0/chuijd2
Because we make fun of ourselves, so y'all are given free reign to do the same.
Score: 2.4616015 author: gdrl, url: http://www.reddit.com/r/Steam/comments/2riu4j/cngq8r9
Some of the fun is just gifting friends ultra ridiculous games and then having them do the "wtf?"
when they see/msg you. :)
Score: 2.4498854 author: girl_, url: http://www.reddit.com/r/berlinsocialclub/comments/lffry/c2t2vyf
Okay me and M^^^^ want to go out Saturday night. Where can we go thats fun without
people fornicating in the bathroom?
Search took 3979 ms

The clause "+author:girl~" is a fuzzy query: it selects all authors with a username equal to or resembling "girl". The search took almost 4 seconds, with the index residing on a hard disk and another process hogging the CPU.

Indexed fields are:

  • author
  • name (id of the comment in t1_-format)
  • body (text of the comment)
  • gilded
  • score
  • ups
  • downs
  • created_utc
  • parent_id
  • subreddit
  • id (id of the comment)
  • url (custom field, which *should* contain a valid link to the comment or the comment thread)

All fields are stored fields (so you can display the field value of a document).

"Body" is the only field that is considered as "text".


Another example: a search for "love story twilight" with more than 1000 upvotes (the links are not fully reliable at the moment):

Opening search index at F:/reddit_data/index-all. This may take a moment. 
Going to search over 1532362437 documents.
Found: 20 matching documents. Going to display top 10:
DocScore: 4.435478 author: dathom, ups:1103, url: http://www.reddit.com/r/AskReddit/comments/psoue/c3s132v
  Still a better love story than Twilight.
DocScore: 4.435478 author: Xenoo, ups:1358, url: http://www.reddit.com/r/funny/comments/qqhcm/c3zn0xo
  Still a better love story than twilight.
DocScore: 4.435478 author: unglad, ups:1986, url: http://www.reddit.com/r/nottheonion/comments/2ewday/ck3knl6
  OK maybe twilight was a better love story than this
(...)
Search took 4392 ms
