Saturday, September 8, 2012

Notes on Hadoop on Amazon Elastic MapReduce

I am on a project which requires Hadoop for crunching millions of documents. Being new to Hadoop, I was faced with a learning curve.  Here are a couple of notes on the experience.

Hadoop Streaming

Working with streaming jobs on Hadoop is straightforward enough, except when it comes to controlling how Hadoop treats the input files.

The input files I needed to process were large (hundreds of MB each). Hadoop insisted on chunking them into smaller splits, but unfortunately the default splitting wasn't compatible with my file type.  I ended up fooling Hadoop into accepting a bigger minimum split size (10 GB, far larger than any of my files):

-D mapred.min.split.size=10737418240
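Here is a rough sketch of why such a large value keeps each file in one piece. It assumes the split-size rule of the classic FileInputFormat, max(minSize, min(goalSize, blockSize)); the 64 MB block size and the 300 MB file size are made-up figures for illustration, and the class and method names are hypothetical.

```java
public class SplitSize {
    // Assumed rule, modeled on the classic FileInputFormat:
    // splitSize = max(minSize, min(goalSize, blockSize)).
    public static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Converts GiB to bytes using long arithmetic; int arithmetic would overflow here.
    public static long gibToBytes(long gib) {
        return gib * 1024L * 1024L * 1024L;
    }

    public static void main(String[] args) {
        long minSplit = gibToBytes(10);          // the value passed to -D above
        long blockSize = 64L * 1024L * 1024L;    // hypothetical 64 MB block
        long fileSize = 300L * 1024L * 1024L;    // hypothetical 300 MB input file

        long splitSize = computeSplitSize(fileSize, minSplit, blockSize);

        System.out.println("-D mapred.min.split.size=" + minSplit);
        // When the split size is at least the file size, the file is not chunked.
        System.out.println("file fits in one split: " + (splitSize >= fileSize));
    }
}
```

With a minimum split size above the size of every input file, the computed split size always covers the whole file, so each mapper receives a complete file.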

Hadoop with Custom JAR

Although one has full control of Hadoop when using custom JAR jobs, I must say I had quite a rocky ride getting my stuff to work.

First off, I needed a way to package the dependent JAR libraries inside the main JAR.  I was forced to write a custom ANT build file to do this.



The trick is to put the dependent JARs in the ./lib directory inside the main JAR so that Hadoop on EMR can include them in the Java classpath (I don't know whether standard Hadoop behaves the same way, sorry).

Second, I wanted to use Amazon S3 as both input and output for files.  The trick there is to grab FileSystem objects based on the URI scheme of the files.

    // Resolve the FileSystem from the path's own URI scheme (e.g. S3), not the cluster default.
    Path opath = new Path(outputPath);
    FileSystem ofs = opath.getFileSystem(conf);

Forget a single one of those and the whole job crashes, and you end up paying Amazon for nothing... hundreds of dollars went down the drain ironing this one out.
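To illustrate what "based on the URI scheme" means, here is a small sketch that uses only java.net.URI, so it runs without Hadoop. As far as I understand it, Path.getFileSystem(conf) picks the FileSystem implementation from the path's scheme, whereas FileSystem.get(conf) returns the cluster's default filesystem, which is not S3. The bucket name, host name, and paths below are hypothetical.

```java
import java.net.URI;

public class SchemeDemo {
    // Returns the URI scheme of a location, or null when the location has none.
    public static String schemeOf(String location) {
        return URI.create(location).getScheme();
    }

    public static void main(String[] args) {
        // Hypothetical locations: an S3 input, an HDFS scratch directory, a bare path.
        String[] locations = {
            "s3n://example-bucket/input/docs",
            "hdfs://namenode:9000/tmp/work",
            "/tmp/work"
        };

        for (String location : locations) {
            String scheme = schemeOf(location);
            // A path without a scheme falls back to the default filesystem.
            String resolved = (scheme == null) ? "default filesystem" : scheme;
            System.out.println(location + " -> " + resolved);
        }
    }
}
```

Each path carries its own scheme, so asking the Path for its FileSystem keeps S3 inputs and outputs from being opened against the default filesystem by mistake.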

Wednesday, February 22, 2012

Lego Style Software Design

My company recently got the chance to work on a very interesting project (which unfortunately I can't divulge at the moment).  What made the project fun to work on, aside from the product level challenges, was the fact that we were given much more architectural flexibility than usual.

Amongst the work emerged the following:

Sunday, February 5, 2012

Python Functional Tools

I came across a pretty cool project recently: Python Moka.  It consists of functional-programming-friendly implementations of the standard Python dictionary and list classes.  Then it struck me:  wouldn't it be nice to have Erlang-ish pattern matching functionality in Python?

So I crafted a small Python package scratching an itch I have had for way too long:  function dispatching based on pattern matching.  For those interested, here are the relevant links:

Feedback welcome :)

Thursday, February 2, 2012

EC2 architecture notes

I've updated the home page of Systemical, where you'll find a bunch of useful links to documents:

Enjoy :)

Friday, January 20, 2012

Amazon AWS tools - jldaws

Today I am open-sourcing yet another project.  It consists of a collection of Linux scripts related to Amazon Web Services (AWS).

The project's home page can be found here whilst the code repository is there.