One Big Lock, Many Keys


Hadoop has changed the way we process and analyse large volumes of data.

The Java MapReduce framework has been the workhorse of Hadoop data processing; it is the nuts and bolts of Hadoop. However, vanilla MapReduce requires a lot of code to be written in Java. The amount of boilerplate code required to create a simple MapReduce program that joins two datasets is staggering. Investing the time and effort to develop such tedious but performant MapReduce programs makes sense in certain cases, such as algorithm development, where the algorithm is written once and reused many times. It is not practical to spend that much time and effort on trivial data analysis and ETL applications.
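To give a sense of what even a simple join involves, here is a minimal in-memory sketch in Python (not Hadoop code) of the common reduce-side join pattern: the map step tags each record with its source, the shuffle groups records by key, and the reduce step pairs them up. The dataset shapes and the function names (map_phase, shuffle, reduce_phase, join) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(users, orders):
    """Tag each record with its source so the reducer can tell them apart."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, amount in orders:
        yield user_id, ("order", amount)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Inner join: pair every user record with every order sharing its key."""
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "user"]
        amounts = [v for tag, v in groups[key] if tag == "order"]
        for name in names:
            for amount in amounts:
                yield key, name, amount


def join(users, orders):
    """Run the three phases end to end on in-memory lists."""
    return list(reduce_phase(shuffle(map_phase(users, orders))))
```

A Java MapReduce version wraps this same logic in Mapper and Reducer classes, Writable types and job configuration, which is where most of the boilerplate comes from.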

This problem led to the development of several high-level frameworks. Below is a list of categories of such frameworks.

SQL on Hadoop

Hadoop meets SQL. SQL has become the de facto language for data analysis, and business users are adept at it. Facebook developed Hive to bring a SQL-like interface for querying Hadoop. Cloudera developed Impala as a fast interactive query engine for data stored in Hadoop. Hadapt is a startup that saw the potential for interactive SQL on Hadoop early on; it enables users to run SQL natively on Hadoop. HAWQ is a technology developed by Pivotal (an EMC company). HAWQ users can run SQL queries against data stored in Hive, HBase and HDFS.

Special Language

Pig Latin is a high-level language for writing ETL and data analysis programs. It was developed at Yahoo and is now a top-level Apache open source project. Pig abstracts away the complexity of MapReduce programs: it can generate a sequence of MapReduce jobs in Java. Pig can be extended by writing User Defined Functions (UDFs) in Java, Python, Ruby, JavaScript or Groovy.

Java-Based APIs & Their Derivatives

Cascading, developed by Concurrent, and Crunch, developed by Cloudera, are two Java frameworks built on top of the Hadoop Java APIs. These libraries make writing, testing and running complex MapReduce pipelines easy and efficient. Scalding and Scrunch are Scala APIs for Cascading and Crunch respectively. Cascalog is a Cascading library for Clojure.

Hadoop Streaming

Hadoop Streaming lets you develop mappers and reducers in any language of your choosing, provided they read their input from STDIN and write their output to STDOUT. Frameworks that support Hadoop Streaming include:

Ruby – Wukong, Mrtoolkit

Python – mrjob, dumbo, pydoop

R – RHadoop, rmr
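The STDIN/STDOUT contract described above can be sketched without any framework at all. The following is a minimal word-count example in plain Python, assuming the default streaming convention of one tab-separated key/value record per line and reducer input sorted by key. The script name wc.py and the map/reduce command-line argument are hypothetical choices for this illustration.

```python
#!/usr/bin/env python3
"""Word count for Hadoop Streaming: one script, run as mapper or reducer."""
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per word. Input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(argv):
    """Pick the role from the first argument and stream STDIN to STDOUT."""
    role = argv[1] if len(argv) > 1 else ""
    if role == "map":
        records = map_lines(sys.stdin)
    elif role == "reduce":
        records = reduce_lines(sys.stdin)
    else:
        print("usage: wc.py map|reduce", file=sys.stderr)
        return
    for record in records:
        print(record)


if __name__ == "__main__":
    main(sys.argv)
```

Because the script only talks to STDIN and STDOUT, it can be tried locally with an ordinary shell pipeline such as cat input.txt | python3 wc.py map | sort | python3 wc.py reduce, where sort stands in for Hadoop's shuffle. On a cluster the same two commands would be passed to the Hadoop Streaming jar through its -mapper and -reducer options; the jar's location depends on the installation.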