Saturday, September 8, 2012

Notes on Hadoop on Amazon Elastic MapReduce

I am on a project that requires Hadoop for crunching millions of documents. Being new to Hadoop, I was faced with a learning curve. Here are a few notes on the experience.

Hadoop Streaming

Working with streaming jobs on Hadoop is straightforward enough, except when it comes to controlling how Hadoop treats the input files.

The input files I needed to process were large (hundreds of MB each). Hadoop insisted on chunking them into smaller splits, but unfortunately the default splitting wasn't compatible with my file type. I ended up fooling Hadoop into accepting a much bigger minimum split size:

-D mapred.min.split.size=10737418240
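
For reference, here is roughly what the full streaming invocation looks like with that flag in place (the streaming JAR path, the S3 locations and the mapper/reducer scripts are placeholders for whatever your job uses):

hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
    -D mapred.min.split.size=10737418240 \
    -input s3://mybucket/input/ \
    -output s3://mybucket/output/ \
    -mapper mapper.py \
    -reducer reducer.py

Note that generic -D options have to come before the streaming-specific ones like -input and -output.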

Hadoop with Custom JAR

Although one has full control of Hadoop when using custom JAR jobs, I must say I had quite a rocky ride getting my stuff to work.

First off, I needed a way to package the dependent JAR libraries inside the main JAR. I was forced to write a custom ANT build file to do this.

The trick is to get the dependent JARs into a ./lib directory inside the job JAR, so that Hadoop EMR (I don't know about standard Hadoop, sorry) can include them in the Java classpath.
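
A minimal ANT target along these lines does it; treat it as a sketch, since the directory layout, JAR name and main class are placeholders (it assumes the compiled job classes live under build/classes and the dependent JARs in a local lib/ directory):

    <project name="emr-job" default="jar">
      <target name="jar">
        <jar destfile="dist/job.jar">
          <!-- the compiled job classes go at the root of the JAR -->
          <fileset dir="build/classes"/>
          <!-- pack the dependent JARs under lib/ inside the job JAR -->
          <zipfileset dir="lib" prefix="lib" includes="*.jar"/>
          <manifest>
            <attribute name="Main-Class" value="com.example.JobDriver"/>
          </manifest>
        </jar>
      </target>
    </project>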

Second, I wanted to use Amazon S3 for both input and output files. The trick there is to grab a FileSystem object based on the URI scheme of each path.

    Path opath=new Path(outputPath);
    FileSystem ofs=opath.getFileSystem(conf);
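
Putting it together, the pattern looks roughly like this (the bucket names and paths are made up); the point is that each Path resolves its own FileSystem from its URI scheme, instead of relying on FileSystem.get(conf), which hands back the default (HDFS) filesystem:

    Configuration conf = new Configuration();
    Path in = new Path("s3n://mybucket/input/");    // S3 input (illustrative)
    Path out = new Path("s3n://mybucket/output/");  // S3 output (illustrative)

    // Each Path resolves the FileSystem matching its scheme (s3n, hdfs, file, ...)
    FileSystem ifs = in.getFileSystem(conf);   // used for reading the input side
    FileSystem ofs = out.getFileSystem(conf);

    // For example, wipe a stale output directory before submitting the job
    if (ofs.exists(out)) {
        ofs.delete(out, true);
    }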

Forget a single one of those and the whole job crashes, and you end up paying Amazon for nothing... hundreds of dollars down the drain ironing this one out.