Hadoop and Real-time Processing

June 21, 2013 at 8:50 am | Posted in Performance

Almost since the day Hadoop became big news, some people have been predicting the demise of the system. I have heard several flavours of this argument, one being that what is really needed is 'real-time' big data analytics, and that Hadoop, with its batch processing and CPU-hungry data-munching, is not fit for the task. I think this misunderstands the role that Hadoop plays, and will continue to play, in a big data analytics system. In many cases batch-oriented applications (often based on Hadoop and its various ecosystem products) will do the heavy data crunching and CPU-hungry work offline, free of real-time constraints. The models and output then feed into real-time systems that can process live data against the model.
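
As an illustration of that split, here is a minimal Python sketch of the pattern. Everything in it is hypothetical – the function names, the pickled file and the scikit-learn model standing in for whatever the batch layer actually builds – it is only meant to show the shape of the offline/online handover.

    # Minimal sketch of the batch/real-time split described above.
    # All names here (train_batch, score_stream, model.pkl) are hypothetical,
    # and LogisticRegression is a stand-in for whatever model the batch layer builds.
    import pickle
    from sklearn.linear_model import LogisticRegression

    def train_batch(features, labels, model_path="model.pkl"):
        """Offline, CPU-hungry step: run periodically, e.g. as a nightly batch job."""
        model = LogisticRegression()
        model.fit(features, labels)
        with open(model_path, "wb") as f:
            pickle.dump(model, f)

    def score_stream(events, model_path="model.pkl"):
        """Online step: load the pre-built model and score events as they arrive."""
        with open(model_path, "rb") as f:
            model = pickle.load(f)
        for event in events:                  # e.g. messages consumed from a queue
            yield model.predict([event])[0]   # cheap per-event scoring, no retraining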

A paper by Bhattacharya and Mitra, Analytics on Big FAST Data Using a Realtime Stream Data Processing Architecture, on the EMC Knowledge Sharing site provides a great example of how this offline/real-time combination works. I believe it will become an archetype for how such systems should be built.

Not only do they show how event collection (Apache Kafka), batch model building (Hadoop/Mahout) and real-time processing (Storm) can work together, but they also provide a very accessible introduction to Hidden Markov Models using a couple of characters called Alice and Bob. With a 60% chance of rain, Bob clearly lives in the UK. Probably somewhere near Manchester.
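
For anyone who wants to poke at the Alice-and-Bob example, below is a tiny Viterbi decoder in Python. The probabilities are the standard textbook numbers (including the 60% chance of rain) rather than anything taken from the paper, so treat it as a sketch of the idea, not a reproduction of their model.

    # Hidden states are the weather where Bob lives; observations are his activities.
    states = ("Rainy", "Sunny")
    start_p = {"Rainy": 0.6, "Sunny": 0.4}                      # the 60% chance of rain
    trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
               "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
    emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
              "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

    def viterbi(obs):
        """Return (probability, path) of the most likely hidden state sequence."""
        # V[t][s] = (best probability of any path ending in state s at time t, that path)
        V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
        for o in obs[1:]:
            V.append({s: max((V[-1][prev][0] * trans_p[prev][s] * emit_p[s][o],
                              V[-1][prev][1] + [s]) for prev in states)
                      for s in states})
        return max(V[-1].values())

    print(viterbi(["walk", "shop", "clean"]))   # roughly (0.01344, ['Sunny', 'Rainy', 'Rainy'])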

Edit 20 Dec 2016: it seems that link has disappeared. If you search for the paper you should be able to find a copy, e.g. http://docplayer.net/1475672-Analytics-on-big-fast-data-using-real-time-stream-data-processing-architecture.html

Data Science London Meetup June 2013

June 14, 2013 at 2:08 pm | Posted in Performance

This is a quick post to record my thoughts and impressions from the Data Science London meetup I attended this week. We were treated to four presentations on a variety of Data Science/Machine Learning/Big Data topics. First up was Rosaria Silipo from Knime. Knime is new to me: it is a visual, interactive machine learning environment in which you develop your data science and machine learning workflows. Data sources, data manipulation steps, algorithm execution and outputs are nodes in an Eclipse-like environment that are joined together to give you an end-to-end execution pipeline. Rosaria took us through a previous project, showing how the Knime interface helped and how Knime can be extended to integrate other tools such as R. I like the idea and would love to find some time to investigate further.

Next up was Ian Hopkinson from Scraperwiki, talking about scraping and parsing PDFs. Ian is a self-effacing but engaging speaker, which made the relatively dry subject matter pretty easy to digest – essentially a technical walkthrough, warts and all, of extracting data from thousands of PDFs. Two key points:

  1. Regular expressions are still a significant tool in data extraction. This is a dirty little secret of NLP that I've heard before. Kind of depressing, as one of the things that attracted me to machine learning was the hope that I might write fewer REs in the future (there's a small illustrative sketch after this list).
  2. Scraperwiki are involved in some really interesting public data extraction, for example digitizing UN Assembly archives. I don't know if anyone has done a large-scale analysis of UN voting patterns, but I for one would be interested to know whether they correlate with voting in the Eurovision Song Contest.
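
For the curious, the sketch below gives a flavour of the sort of regex work involved. The sample line and pattern are made up for illustration, not taken from Scraperwiki's code.

    # Pulling a reference, a date and an amount out of a line of text lifted from a PDF.
    import re

    line = "Grant ref GB-1-2047 approved 12/03/2013 for GBP 250,000.00"
    pattern = re.compile(
        r"(?P<ref>[A-Z]{2}-\d-\d{4}).*?"        # reference code
        r"(?P<date>\d{2}/\d{2}/\d{4}).*?"       # approval date
        r"GBP\s(?P<amount>[\d,]+\.\d{2})"       # amount in pounds
    )
    match = pattern.search(line)
    if match:
        print(match.groupdict())
        # {'ref': 'GB-1-2047', 'date': '12/03/2013', 'amount': '250,000.00'}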

Third up was Doug Cutting. Doug is the originator of Lucene and Hadoop, which probably explains the frenzy to get into the meeting (I had been on the waiting list for a week and eventually got a place at 4.00 for a 6.30 start) and the packed hall. Doug now works for the Hadoop provider Cloudera and was speaking on the recently announced Cloudera Search. Cloudera Search enables Lucene indexing of data stored on HDFS, with the index files themselves also stored in HDFS. It has always been possible (albeit a bit fiddly) to do this, but there were performance issues, which were mostly resolved by adding a page cache to HDFS. They also incorporated and 'glued in' some supporting technologies such as Apache Tika (which extracts indexable content from multiple document formats like Word, Excel, PDF and HTML), Apache ZooKeeper and some others that I don't remember. A really neat idea is the ability to index a batch of content offline using MapReduce (something MapReduce is really good at) and then merge the offline index into the main online index. This supports use cases where companies need to re-index their content on a regular basis but still need near real-time indexing and search of new content. I can also see this being great for data migration scenarios. All in all I think this is fascinating, and it will be interesting to see how the other Hadoop providers respond.

Last up was Ian Ozsvald, talking about producing a better Named Entity Recogniser (NER) for brands in Twitter-like social media. NER is a fairly mature technology these days, but most of the available tooling is apparently trained on more traditional long-form content with good syntax, 'proper' writing and an emphasis on big (often American) brands. I particularly applaud the fact that he has only just started the project and came along to present his ideas and make his work freely available on GitHub. I would love to find the time to download it myself and will be following his progress. If you are interested I suggest you check out his blog post. As an aside he also has a personal project to track his cat using a Raspberry Pi, which you can follow on Twitter as @QuantifiedPolly.
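
To make the problem concrete, here is a quick illustration (emphatically not Ian's code) of what happens when a stock, long-form-trained NER pipeline such as NLTK's is pointed at a tweet-like string; it assumes the usual NLTK data packages (a tokenizer, a POS tagger model, maxent_ne_chunker and words) have been downloaded.

    # Off-the-shelf NER run over an informally written, lower-case "tweet".
    import nltk

    tweet = "loving my new xperia z, way better than my old iphone tbh"
    tokens = nltk.word_tokenize(tweet)
    tagged = nltk.pos_tag(tokens)
    print(nltk.ne_chunk(tagged))
    # Lower-case brand mentions like "xperia" and "iphone" typically come out
    # untagged, which is exactly the gap a social-media-specific NER would fill.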

All in all a great event, and thanks to Carlos for the organisation and to the sponsors for the beer and pizza. Looking forward to the next time – assuming I can get in.
