Data Science London Meetup June 2013
June 14, 2013 at 2:08 pm | Posted in Performance
Tags: Big Data, Cloudera, data science, Data Science London, Doug Cutting, Ian Ozsvald, Knime, machine learning, NLP, Scraper Wiki
This is a quick post to record my thoughts and impressions from the Data Science London meetup I attended this week. We were treated to 4 presentations on a variety of Data Science/Machine Learning/Big Data topics. First up was Rosaria Silipo from Knime. Knime is new to me. It’s a visual and interactive environment in which you develop your data science and machine learning workflows. Data sources, data manipulation, algorithm execution and outputs are nodes in an Eclipse-like environment, and you join them together to get an end-to-end execution pipeline. Rosaria took us through a previous project, showing how the Knime interface helped and how Knime can be extended to integrate other tools like R. I like the idea and would love to find some time to investigate further.
Next up was Ian Hopkinson from Scraperwiki talking about scraping and parsing PDFs. Ian is a self-effacing but engaging speaker, which made the relatively dry subject matter pretty easy to digest – essentially a technical walkthrough of extracting data from 1000s of PDFs, warts and all. 2 key points:
- Regular Expressions are still a significant tool in data extraction. This is a dirty little secret of NLP that I’ve heard before. Kind of depressing, as one of the things that attracted me to machine learning was the hope that I might write fewer REs in the future.
- Scraperwiki are involved in some really interesting public data extraction, for example digitizing UN Assembly archives. I don’t know if anyone has done a large-scale analysis of UN voting patterns, but I for one would be interested to know if they correlate with voting in the Eurovision Song Contest.
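To give a flavour of the regular expression point, here is a minimal sketch in Python. It is not Scraperwiki's code: the line format, the field names and the `parse_votes` helper are all made up for illustration, and it assumes the text has already been pulled out of the PDF by some other tool.

```python
import re

# Hypothetical line format, invented for this sketch. Text extracted from
# real PDFs is usually far messier and needs several patterns.
VOTE_LINE = re.compile(
    r"^Resolution\s+(?P<number>\d+/\d+)\s*:\s*"
    r"(?P<yes>\d+)\s+in favou?r,\s*"
    r"(?P<no>\d+)\s+against,\s*"
    r"(?P<abstain>\d+)\s+abstentions?$",
    re.IGNORECASE,
)


def parse_votes(text):
    """Return one dict per line that matches the assumed vote format."""
    records = []
    for line in text.splitlines():
        match = VOTE_LINE.match(line.strip())
        if match is None:
            # Page headers, footers and other noise are silently skipped.
            continue
        records.append({
            "resolution": match.group("number"),
            "yes": int(match.group("yes")),
            "no": int(match.group("no")),
            "abstain": int(match.group("abstain")),
        })
    return records


# Invented sample text, not real voting data.
sample = (
    "Resolution 1/23: 100 in favour, 5 against, 7 abstentions\n"
    "Page 3 of 12\n"
    "resolution 4/5 :  10 in favor, 2 against, 1 abstention\n"
)
records = parse_votes(sample)
```

Even in this toy case the pattern has to cope with spelling variants, stray whitespace and singular/plural forms, which is roughly why REs refuse to go away.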
Third up was Doug Cutting. Doug is the originator of Lucene and Hadoop, which probably explains the frenzy to get into the meeting (I had been on the waiting list for a week and eventually got a place at 4.00 for a 6.30 start) and the packed hall. Doug now works for the Hadoop provider Cloudera and was speaking on the recently announced Cloudera Search. Cloudera Search enables Lucene indexing of data stored on HDFS, with the index files also stored in HDFS. It has always been possible (albeit a bit fiddly) to do this; however, there were performance issues. These were mostly resolved by adding a page cache to HDFS. They also incorporated and ‘glued in’ some supporting technologies such as Apache Tika (which extracts indexable content from multiple document formats like Word, Excel, PDF and HTML), Apache ZooKeeper and some others that I don’t remember. A really neat idea is the ability to index a batch of content offline using MapReduce (something MapReduce would be really good at) and then merge the offline index into the main online index. This supports use cases where companies need to re-index their content on a regular basis but still need near real-time indexing and search of new content. I can also see this being great for data migration type scenarios. All in all I think this is fascinating and it will be interesting to see how the other Hadoop providers respond.
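The offline-then-merge idea can be sketched with a toy in-memory inverted index in Python. To be clear, this is not Lucene or the Cloudera Search API, and `build_index` and `merge_into` are hypothetical helpers of my own; the sketch only illustrates the shape of the technique, assuming whitespace tokenisation and document ids that don't clash between batches.

```python
def build_index(docs):
    """Build a toy inverted index (term -> set of doc ids) from {doc_id: text}."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def merge_into(main_index, batch_index):
    """Fold a batch-built index into the live index, in place."""
    for term, doc_ids in batch_index.items():
        main_index.setdefault(term, set()).update(doc_ids)


# The "online" index receives new content as it arrives...
online = build_index({"new1": "hadoop search"})

# ...while a large backlog is indexed separately (the MapReduce job in the
# talk) and merged in afterwards.
offline = build_index({"old1": "lucene search", "old2": "hadoop cluster"})
merge_into(online, offline)
```

The point of the design is that the expensive bulk indexing never blocks the live index; only the comparatively cheap merge touches it.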
Last up was Ian Ozsvald talking about producing a better Named Entity Recogniser (NER) for brands in Twitter-like social media. NER is a fairly mature technology these days; however, most of the available tools are apparently trained on more traditional long-form content with good syntax and ‘proper’ writing, and with an emphasis on big (often American) brands. I particularly applaud the fact that he has only just started the project and came along to present his ideas and to make his work freely available on GitHub. I would love to find the time to download it myself and will be following his progress. If you are interested I suggest you check out his blog posting. As an aside, he also has a personal project to track his cat using a Raspberry Pi, which you can follow on Twitter as @QuantifiedPolly.
All in all a great event and thanks to Carlos for the organisation, and the sponsors for the beer and pizza. Looking forward to the next time – assuming I can get in.