Building Hadoop native libs for Raspberry Pi 2 B

If you try to download and install the binary packages of Apache Hadoop 2.7.1 on a Raspberry Pi 2 B, you will get some ugly warnings about the native libs being all wrong, something along the lines of:

Unable to load native-hadoop library for your platform... using builtin-java classes
where applicable...
warning: You have loaded library
/usr/local/hadoop/lib/native/ which might have
disabled stack guard. The VM will try to fix the stack guard now. It's
highly recommended that you fix the library with 'execstack -c
<libfile>', or link it with '-z noexecstack'.

Those warnings also mess up the script for starting a cluster of several machines.

From searching the web, it looks like the cleanest option to get rid of the warnings is to compile Hadoop natively on the Raspberry. A good source of information is

which is a tutorial for compiling Hadoop 2.6.0, but it still seems to be valid for 2.7.1.

When compiling the current version, I got out-of-memory errors, so I disabled javadoc generation. Afterwards, it worked:

mvn package -Pdist,native -DskipTests -Dtar -Dmaven.javadoc.skip=true


Create a KVM instance with vm-builder

I want to experiment with Apache Hadoop, and to create a cluster of machines I will use KVM with Ubuntu guests.

To create VMs, the first step is to install the vm-builder, which enables you to create new virtual machines directly from the command line:

sudo apt-get install python-vm-builder

Current build script (for creating a raw VM with just Java and OpenSSH):

export MY_GNOME="gnome1"
sudo vmbuilder kvm ubuntu --suite trusty --flavour virtual \
    --destdir "/media/data/kvm/${MY_GNOME}" \
    --rootsize 10000 \
    --domain \
    --addpkg acpid --addpkg openssh-server --addpkg linux-image-generic --addpkg openjdk-7-jre-headless \
    --user admin --pass admin \
    --mirror --components main,universe,restricted \
    --arch amd64 --hostname "${MY_GNOME}" \
    --libvirt qemu:///system --bridge virbr0 \
    --mem 2048 --cpus 1 ;

This will build a machine with a 10 GByte file system and 2 GByte of RAM.

My plan was to create a couple of such VMs and then "install and run Hadoop" on them. But "one does not simply install a Hadoop cluster" - I underestimated the complexity of the project, so I am currently back at the recommended beginner step: installing a single standalone node before diving into the cluster setup via Puppet.


Indexing all reddit comments

When a Reddit user created a giant archive of all Reddit comments available so far, I downloaded the data set and later indexed it with a current version of Lucene. The code for this is in my GitHub repository: reddit-data-tools. At the moment it only supports a limited kind of search from the command line.

Searching for "fun +author:girl~" yields both a count of all matching documents and a display of the top ten hits, including their comment text and a backlink to Reddit.

Opening search index at F:/reddit_data/index-all. This may take a moment.
Going to search over 1532362437 documents.
Found: 1621 matching documents.
Going to display top 10:
Score: 2.742787 author: girfl, url:
That was fun!
Score: 2.5318978 author: qirl, url:
that was fun! cya at the next family reunion? :)
Score: 2.4616015 author: qgirl, url:
Yes! I almost always regret it when it's time to reverse the process, but it's fun anyway.
Score: 2.4616015 author: girfl, url:
I have old lady hands. They are very wrinkled and I used to get made fun all the time in high school.
Score: 2.4616015 author: girfl, url:
Getting ready to go out is often more fun than actually being out.
Score: 2.4616015 author: zgirl, url:
I also give fellow runners high fives as I pass them. It's fun.
Score: 2.4616015 author: girl8, url:
Have fun, be yourself. Don't go home the first few weeks no matter how much you might want to.
Score: 2.4616015 author: qirl, url:
Because we make fun of ourselves, so y'all are given free reign to do the same.
Score: 2.4616015 author: gdrl, url:
Some of the fun is just gifting friends ultra ridiculous games and then having them do the "wtf?"
when they see/msg you. :)
Score: 2.4498854 author: girl_, url:
Okay me and M^^^^ want to go out Saturday night. Where can we go thats fun without
people fornicating in the bathroom?
Search took 3979 ms

The clause "+author:girl~" selects all comments whose author username equals or resembles "girl". The search took almost 4 seconds, with the index residing on a hard disk and another process hogging the CPU.
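The fuzzy operator "~" is based on edit distance: by default, Lucene's FuzzyQuery matches terms within a Levenshtein distance of 2 of the query term. A minimal plain-Java sketch of that distance function (not Lucene's actual automaton-based implementation) shows why the authors in the result list above match:

```java
public class FuzzyDemo {
    // Plain dynamic-programming Levenshtein distance. Lucene's FuzzyQuery
    // (default maxEdits = 2) matches terms whose distance to the query term
    // is at most 2 -- "girfl", "qirl" and "gdrl" are all one edit away
    // from "girl".
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        for (String author : new String[] {"girfl", "qirl", "gdrl", "somebody"}) {
            System.out.println(author + " -> edit distance " + distance("girl", author));
        }
    }
}
```

In reality Lucene does not score every term like this; it compiles the fuzzy term into a Levenshtein automaton and intersects it with the term dictionary, which is what makes fuzzy search over 1.5 billion documents feasible at all.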

Indexed fields are:

  • author
  • name (id of the comment in t3_-format)
  • body (text of the comment)
  • gilded
  • score
  • ups
  • downs
  • created_utc
  • parent_id
  • subreddit
  • id (id of the comment)
  • url (custom field, which *should* contain a valid link to the comment or the comment thread)

All fields are stored fields (so you can display the field value of a document).

"Body" is the only field that is considered as "text".


Another example: searching for "love story twilight" with more than 1000 upvotes (links are currently not really reliable):

Opening search index at F:/reddit_data/index-all. This may take a moment. 
Going to search over 1532362437 documents.
Found: 20 matching documents. Going to display top 10:
DocScore: 4.435478 author: dathom, ups:1103, url:
  Still a better love story than Twilight.
DocScore: 4.435478 author: Xenoo, ups:1358, url:
  Still a better love story than twilight.
DocScore: 4.435478 author: unglad, ups:1986, url:
  OK maybe twilight was a better love story than this
Search took 4392 ms



8 GByte tar problem

Yesterday I encountered a strange bug: programmatically generated tar archives of a Lucene index were corrupted during creation. The process finished normally, but occasionally the resulting archive would be broken. It turned out that someone had (probably with good reason, once upon a time) created our own implementation of a tar packaging module in Java, based on the Apache Ant tar task.

The problem with the original source code: the Ant tar task is limited to an individual entry size of 8 GByte, even though the resulting archive may be far larger. This was fixed in the Apache Commons Compress library some time ago (version 1.4), but you have to use one of the GNU or POSIX tar formats, which support practically unlimited file sizes.

The workaround for the current problem: use the default Lucene indexRamBufferSize setting of 16 MByte, so the segment files stay below the 8 GByte limit instead of growing to 25 GByte. A change of the compression module in the near future (preferably using standard open-source components instead of homebrew versions) is already planned.
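The 8 GByte limit itself comes from the classic ustar header format, which stores the entry size as 11 octal digits, so nothing larger than 8^11 - 1 bytes fits; the GNU and POSIX formats escape this via base-256 encoding and extended headers, respectively. A quick sanity check of that number:

```java
public class TarLimit {
    // The historic ustar tar header stores the entry size in a 12-byte
    // field: 11 octal digits plus a terminator. The largest representable
    // entry size is therefore 8^11 - 1 bytes, one byte short of 8 GiB.
    static long maxUstarSize() {
        return Long.parseLong("77777777777", 8); // eleven octal sevens
    }

    public static void main(String[] args) {
        System.out.println(maxUstarSize() + " bytes"); // 8589934591, just under 8 GiB
    }
}
```

When the module is eventually replaced, Commons Compress's TarArchiveOutputStream can be switched to its POSIX big-number mode (available since 1.4), which would lift the limit and make the RAM-buffer workaround unnecessary.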
