I am on a project that requires Hadoop for crunching millions of documents. Being new to Hadoop, I faced a bit of a learning curve. Here are a couple of notes on the experience.
Hadoop Streaming
Working with streaming jobs on Hadoop is straightforward enough, except when it comes to controlling how Hadoop treats the input files. The input files I needed to process were large (hundreds of MB). Hadoop insisted on splitting them into smaller chunks, but unfortunately the default splitting wasn't compatible with my file type. I ended up fooling Hadoop into keeping each file whole by setting a minimum split size larger than any of my files:
-D mapred.min.split.size=10737418240
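For context, the full streaming invocation looked roughly like this; the paths and script names are placeholders rather than the actual job, and the location of hadoop-streaming.jar varies by installation:

    hadoop jar /path/to/hadoop-streaming.jar \
        -D mapred.min.split.size=10737418240 \
        -input /data/documents \
        -output /data/output \
        -mapper mapper.py \
        -reducer reducer.py \
        -file mapper.py \
        -file reducer.py

With the minimum split size set to 10 GB, far larger than any single input file, each file lands in a single split and gets processed whole by one mapper.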
Hadoop with Custom JAR
Although one has full control of Hadoop when using custom JAR jobs, I must say I had quite a rocky ride getting my code to work. First off, I needed a way to package the dependent JAR libraries inside the main JAR, and I was forced to write a custom ANT build file to do this.
The trick is to place the dependent jars in a ./lib directory inside the main JAR so that Hadoop on EMR (I don't know about standard Hadoop, sorry) can include them in the Java classpath.
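For what it's worth, here is a minimal sketch of the kind of ANT target I ended up with; the directory layout, jar name, and main class are placeholders, not the actual project:

    <project name="hadoop-job" default="jar">
        <target name="jar">
            <!-- Bundle compiled classes plus third-party jars into one job JAR -->
            <jar destfile="dist/my-job.jar">
                <fileset dir="build/classes"/>
                <!-- Dependent jars go under lib/ inside the JAR so they end up on the task classpath -->
                <zipfileset dir="deps" includes="*.jar" prefix="lib"/>
                <manifest>
                    <attribute name="Main-Class" value="com.example.MyJob"/>
                </manifest>
            </jar>
        </target>
    </project>

The zipfileset with prefix="lib" is what puts the third-party jars under lib/ inside the final JAR instead of merging their contents into the root.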