Hadoop and Real-time Processing

June 21, 2013 at 8:50 am | Posted in Performance | Leave a comment
Tags: , , , , , , ,

Almost since the day that Hadoop became big news some people have been predicting the demise of the system. I have heard several different flavours of this argument one being that what is needed is ‘real-time’ big data analytics and that Hadoop with its batch processing and CPU hungry data-munching is not fit for the task. I think this misunderstands the role that Hadoop is and will continue to play in any big data analytics system. In many cases batch oriented applications (often based on Hadoop and its various ecosystem products) will do the big data crunching and CPU-hungry work offline, under non-realtime constraints. Models and output then feeds into real-time systems that are able to process real-time data through the model.

A paper by Bhattacharya and Mitra called Analytics on Big FAST Data Using a Realtime Stream Data Processing Architecture on the EMC Knowledge Sharing site provides a great example of how this offline/real-time combination works. I believe this will become an archetype for how such systems should be built.

Not only do they show how event collection (Apache Kafka), batch model building (Hadoop/Mahout) and Real-time processing (Storm) can work together but they also provide a very accessible introduction to Hidden Markov Models using a couple of characters called Alice and Bob. With a 60% chance of rain Bob clearly lives in the UK. Probably somewhere near Manchester.

Edit 20 Dec 2016: Seems that link has disappeared. If you search for the paper you should be able to find a copy e.g. http://docplayer.net/1475672-Analytics-on-big-fast-data-using-real-time-stream-data-processing-architecture.html

Blog at WordPress.com.
Entries and comments feeds.