Saturday, September 8, 2012

Notes on Hadoop on Amazon Elastic MapReduce

I am on a project which requires Hadoop for crunching millions of documents. Being new to Hadoop, I was faced with a learning curve.  Here are a couple of notes on the experience.

Hadoop Streaming

Working with streaming jobs on Hadoop is straightforward enough, except when it comes to controlling how Hadoop treats the input files.

The input files I needed to process were large (100's MB). Hadoop insisted on chunking those to smaller size but unfortunately the default splitting wasn't compatible with my file type.  I ended up fooling Hadoop in accepting bigger minimum split size:

-D mapred.min.split.size=10737418240

Hadoop with Custom JAR

Although one has full control of Hadoop when using custom JAR jobs, I must say I had quite a rocky ride getting my stuff to work.

First off, I needed a a way to package the dependent JAR librairies inside the main JAR.  I was forced to write a custom ANT build file to do this.

  

   
    
      
      
    
      
  


The trick is to get the dependent jars in the directory ./lib for Hadoop EMR (I don't know about the standard Hadoop, sorry) to be able to include those in the Java classpath.

Second, I wanted to use Amazon S3 as both input and output for files.  The trick there is to grab FileSystem objects based on the URI scheme of the files.

    Path opath=new Path(outputPath);
    FileSystem ofs=opath.getFileSystem(conf);

Forget a single of those and the whole job crashes and you end up paying Amazon for nothing... $100's of dollars down the drain ironing up this one.

4 comments:

  1. Have you looked into using Pail, Storm, Cascading/Cascalog and/or Trident to abstract away most of this cruft? See the following book (even the free chapters) to see how this fits your needs: http://www.manning.com/marz/
    This book (at least the chapters available now) is one of the best technical reads to be found out there.

    ReplyDelete
  2. Salut Daniel,

    Thanks for the pointers.

    The problem with the platforms you mentioned rests in the fact they are not packaged as products on Amazon AWS. I don't want to be spending energy setting those up. I am mostly happy with EMR at the moment.

    ReplyDelete
  3. Salut Jean-Lou, that said, the author also uses EMR. Consequently, all of this stuff works with EMR. Remember that EMR is FWIR a pretty standard Hadoop + a few fixes + integration into Amazon's infrastructure and management console.

    See the chapter on Pail which it seems you can now read freely.

    Oh but I'm missing the point - it's not packaged there; that's right. ;)

    ReplyDelete