Documentum on PostgreSQL

September 17, 2013 at 3:53 pm | Posted in Performance

Great news from Lee Dallas reporting from the Documentum Developer Conference: Documentum Developer Edition is back and now runs on PostgreSQL. I discussed this a few months back and I thought that maybe EMC didn’t have the stomach for something so technical, but I was wrong. So kudos to EMC.

Lee mentions it’s not yet production ready, so hopefully that is in the pipeline. After that, how about certifying it to run on Greenplum, EMC’s massively scalable PostgreSQL? Then the sky is the limit for large-scale NLP and machine learning tasks. For example, last year I wanted to run a classification algorithm on document content to identify certain types of document that couldn’t be found by metadata. I can think of plenty of other uses.

I’ll be downloading the edition as soon as possible to see how it runs.

Running ADMM LASSO example on Mac OS X Mountain Lion

July 5, 2013 at 2:13 pm | Posted in Big Data

Being able to process massive datasets for machine learning is becoming increasingly important. By massive datasets I mean data that won’t fit into RAM on a single machine (even with sparse representations or using the hashing trick). There have been a number of initiatives in the academic and research arena that attempt to address the problem; one very interesting one is the Alternating Direction Method of Multipliers (ADMM). It’s an old idea that has been resurrected in this paper by Stephen Boyd’s team at Stanford. A quick Google search for ‘Alternating Direction Method of Multipliers’ shows a recent surge of academic papers as people have started to take the ideas on board.
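For reference, in the consensus form of the LASSO problem that the example code solves, the data A and b are split by rows across N processes, and the ADMM iterations reduce to the following (notation follows the Boyd et al. paper; S_κ is elementwise soft-thresholding):

```latex
x_i^{k+1} = \operatorname*{argmin}_{x_i}\;
  \tfrac{1}{2}\lVert A_i x_i - b_i \rVert_2^2
  + \tfrac{\rho}{2}\lVert x_i - z^k + u_i^k \rVert_2^2
\qquad
z^{k+1} = S_{\lambda/(\rho N)}\!\left( \bar{x}^{k+1} + \bar{u}^k \right)
\qquad
u_i^{k+1} = u_i^k + x_i^{k+1} - z^{k+1}
```

where bars denote averages over the N processes and S_κ(a) = max(a − κ, 0) − max(−a − κ, 0). The x-update is entirely local to each process; only the averages need to be communicated, which is why the method maps so naturally onto MPI.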

That paper comes with some example code, including a complete small-scale example of distributed L1-regularized least squares using MPI. The code was tested on Mac OS X 10.6, Debian 6, and Ubuntu 10.04. It requires an MPI implementation, but the authors state that Open MPI is installed with Mac OS X 10.5 and later, so it sounded like it would be easy to run on my new iMac. Well, it turns out that from Mac OS X 10.7 (Lion) onwards this is no longer true (see here). So here are the augmented instructions for Mac OS X 10.8 that worked for me; they come with the usual ‘your mileage may vary’ caveat.

Before You Start

I assume that Xcode is already installed (freely available from the App Store; I’m using 4.6.3) and that the command line tools are installed (Xcode | Preferences | install Command Line Tools). Typing gcc in the terminal should confirm the compiler is present (with no input files it just prints an error, which is fine).

You should, of course, always download from a reputable site and verify the checksum (e.g. using md5 or gpg). Safari seems to be set up to automatically uncompress .tar.gz files to .tar. Very helpful, Safari, but now I can’t checksum the downloaded file! To prevent this behaviour go to Safari | Preferences | General tab and untick ‘Open “safe” files after downloading’. Yes, I found that ironic too.

Install GNU Scientific Library

First you need to download and install the GNU Scientific Library. I used the mirror suggested by the GSL site. Download the latest release, which in my case was 1.15 (gsl-1.15.tar.gz). Now do the following:

tar zxf gsl-1.15.tar.gz
mv gsl-1.15 ~
cd ~/gsl-1.15
export CC=clang
./configure
make
make check > log 2>&1

The ‘make check’ call runs some tests on the build. Originally I didn’t have the export CC=clang line and some of the tests failed, so it seems worthwhile to run the checks.

Review the file called log and, if everything passed with no failures, proceed as follows:

sudo make install

This will place GSL in /usr/local and requires admin privileges. You should be able to pass --prefix to ./configure to put it somewhere else, but I didn’t try that.

Install OpenMPI

Go to the Open MPI site and download the latest stable release – at the time of writing that was 1.6.5. Then the following sequence will install it (again I’m installing to /usr/local):

tar zxf openmpi-1.6.5.tar.gz
mv openmpi-1.6.5 ~
cd ~/openmpi-1.6.5
./configure --prefix=/usr/local
make
sudo make install

Download and Run Distributed LASSO

The link to the ADMM source code is on the page ‘MPI example for alternating direction method of multipliers’, along with instructions for installing:

  1. Download and expand the mpi_lasso tar ball. The package contains a Makefile, the solver, and a standard library for reading in matrix data.
  2. Edit the Makefile to ensure that the GSLROOT variable points to the location where you installed GSL, and that the ARCH variable is set appropriately (most likely to i386 or x86_64). On some machines, it may be necessary to remove the use of the -arch flag entirely.
  3. Run make. This produces a binary called lasso.

Incidentally the Makefile seems to contain additional instructions to build a component called ‘gam’. gam.c is not included in the download so I just removed all references to gam. Here is what my Makefile looks like:

# use this if on 64-bit machine with 64-bit GSL libraries
ARCH=x86_64
# use this if on 32-bit machine with 32-bit GSL libraries
# ARCH=i386

GSLROOT=/usr/local
MPICC=mpicc

CFLAGS=-Wall -std=c99 -arch $(ARCH) -I$(GSLROOT)/include
LDFLAGS=-L$(GSLROOT)/lib -lgsl -lgslcblas -lm

all: lasso

lasso: lasso.o mmio.o
	$(MPICC) $(CFLAGS) $(LDFLAGS) lasso.o mmio.o -o lasso

lasso.o: lasso.c mmio.o
	$(MPICC) $(CFLAGS) -c lasso.c

mmio.o: mmio.c
	$(CC) $(CFLAGS) -c mmio.c

clean:
	rm -vf *.o lasso

A typical execution using the provided data set, with 4 processes on the same machine, is

mpirun -np 4 lasso

The output should look like this:

[0] reading data/A1.dat
[1] reading data/A2.dat
[2] reading data/A3.dat
[3] reading data/A4.dat
[3] reading data/b4.dat
[1] reading data/b2.dat
[0] reading data/b1.dat
[2] reading data/b3.dat
using lambda: 0.5000
  #     r norm    eps_pri     s norm   eps_dual  objective
  0     0.0000     0.0430     0.1692     0.0045    12.0262
  1     3.8267     0.0340     0.9591     0.0427    11.8101
  2     2.6698     0.0349     1.5638     0.0687    12.1617
  3     1.5666     0.0476     1.6647     0.0831    13.2944
  4     0.8126     0.0614     1.4461     0.0886    14.8081
  5     0.6825     0.0721     1.1210     0.0886    16.1636
  6     0.7332     0.0793     0.8389     0.0862    17.0764
  7     0.6889     0.0838     0.6616     0.0831    17.5325
  8     0.5750     0.0867     0.5551     0.0802    17.6658
  9     0.4539     0.0885     0.4675     0.0778    17.6560
 10     0.3842     0.0897     0.3936     0.0759    17.5914
 11     0.3121     0.0905     0.3389     0.0744    17.5154
 12     0.2606     0.0912     0.2913     0.0733    17.4330
 13     0.2245     0.0917     0.2558     0.0725    17.3519
 14     0.1847     0.0923     0.2276     0.0720    17.2874
 15     0.1622     0.0928     0.2076     0.0716    17.2312
 16     0.1335     0.0934     0.1858     0.0713    17.1980
 17     0.1214     0.0939     0.1689     0.0712    17.1803
 18     0.1045     0.0944     0.1548     0.0710    17.1723
 19     0.0931     0.0950     0.1344     0.0708    17.1768
 20     0.0919     0.0954     0.1243     0.0707    17.1824
 21     0.0723     0.0958     0.1152     0.0705    17.1867
 22     0.0638     0.0962     0.1079     0.0704    17.1896
 23     0.0570     0.0965     0.1019     0.0702    17.1900
 24     0.0507     0.0968     0.0964     0.0701    17.1898
 25     0.0460     0.0971     0.0917     0.0700    17.1885
 26     0.0416     0.0973     0.0874     0.0699    17.1866
 27     0.0382     0.0976     0.0834     0.0698    17.1846
 28     0.0354     0.0978     0.0798     0.0697    17.1827
 29     0.0329     0.0980     0.0762     0.0697    17.1815
 30     0.0311     0.0983     0.0701     0.0696    17.1858
 31     0.0355     0.0985     0.0667     0.0696    17.1890

If you open the file data/solution.dat you will find the optimal z parameters (which equal x at convergence), most of which should be zero.
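A quick way to check that sparsity claim is to count the near-zero entries. This sketch assumes solution.dat holds one value per line; it is shown here on toy data – point the awk command at data/solution.dat after a run:

```shell
# toy stand-in for data/solution.dat: one parameter value per line
printf '0.0\n1.2\n0.0\n-0.3\n0.0\n' > solution.dat

# count entries within 1e-6 of zero
awk '{ if ($1 < 1e-6 && $1 > -1e-6) n++ } END { print n }' solution.dat
```

For the toy file above this prints 3, i.e. three of the five parameters are (near-)zero.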

Data Science London Meetup June 2013

June 14, 2013 at 2:08 pm | Posted in Performance

This is a quick post to record my thoughts and impressions from the Data Science London meetup I attended this week. We were treated to four presentations on a variety of Data Science/Machine Learning/Big Data topics. First up was Rosaria Silipo from Knime. Knime is new to me: it’s a visual, interactive machine learning environment in which you develop your data science and machine learning workflows. Data sources, data manipulation, algorithm execution and outputs are nodes in an Eclipse-like environment that are joined together to give you an end-to-end execution environment. Rosaria took us through a previous project, showing how the Knime interface helped and how Knime can be extended to integrate other tools like R. I like the idea and would love to find some time to investigate further.

Next up was Ian Hopkinson from ScraperWiki, talking about scraping and parsing PDFs. Ian is a self-effacing but engaging speaker, which made the relatively dry subject matter pretty easy to digest – essentially a technical walkthrough on extracting data from thousands of PDFs, warts and all. Two key points:

  1. Regular expressions are still a significant tool in data extraction. This is a dirty little secret of NLP that I’ve heard before. Kind of depressing, as one of the things that attracted me to machine learning was the hope that I might write fewer REs in the future.
  2. ScraperWiki are involved in some really interesting public data extraction, for example digitizing UN Assembly archives. I don’t know if anyone has done a large-scale analysis of UN voting patterns, but I for one would be interested to know if they correlate with voting in the Eurovision Song Contest.

Third up was Doug Cutting. Doug is the originator of Lucene and Hadoop, which probably explains the frenzy to get into the meeting (I had been on the waiting list for a week and eventually got a place at 4.00 for a 6.30 start) and the packed hall. Doug now works for the Hadoop provider Cloudera and was speaking on the recently announced Cloudera Search. Cloudera Search enables Lucene indexing of data stored on HDFS, with the index files themselves stored in HDFS. It has always been possible (albeit a bit fiddly) to do this, but there were performance issues, mostly resolved by adding a page cache to HDFS. They also incorporated and ‘glued in’ some supporting technologies such as Apache Tika (which extracts indexable content from multiple document formats like Word, Excel, PDF and HTML), Apache ZooKeeper and some others that I don’t remember. A really neat idea is the ability to index a batch of content offline using MapReduce (something MapReduce would be really good at) and then merge the offline index into the main online index. This supports use cases where companies need to re-index their content on a regular basis but still need near real-time indexing and search of new content. I can also see this being great for data migration scenarios. All in all I think this is fascinating, and it will be interesting to see how the other Hadoop providers respond.

Last up was Ian Ozsvald, talking about producing a better Named Entity Recogniser (NER) for brands in Twitter-like social media. NER is a fairly mature technology these days, but most of the available tooling is apparently trained on more traditional long-form content with good syntax, ‘proper’ writing and an emphasis on big (often American) brands. I particularly applaud the fact that he has only just started the project and came along to present his ideas and make his work freely available on GitHub. I would love to find the time to download it myself and will be following his progress. If you are interested I suggest you check out his blog post. As an aside, he also has a personal project to track his cat using a Raspberry Pi, which you can follow on Twitter as @QuantifiedPolly.

All in all a great event and thanks to Carlos for the organisation, and the sponsors for the beer and pizza. Looking forward to the next time – assuming I can get in.

Taking the EMC Data Science associate certification

May 13, 2013 at 10:06 am | Posted in Big Data, Performance

In the last couple of weeks I’ve been studying for the EMC data science certification. There are a number of ways of studying for this certificate, but I chose the virtual learning option, which comes as a DVD that installs on a Windows PC (yes, Macs are no good!).

The course consists of six modules and is derived from the classroom-based delivery of the course. Each module is dedicated to a particular aspect of data science and big data, and each follows a similar pattern: a number of video lectures followed by a set of lab exercises. There are also occasional short interviews with professional data scientists focusing on various topical areas. At the end of each module there is a multiple-choice test to check your understanding of the subjects.

The video lectures are a recording of the course delivered to some EMC employees. This has pros and cons. Occasionally we veer off from the lecture into a group discussion. Sometimes this is enlightening and provides a counterpoint to the formal material; sometimes microphones are switched off or the conversation becomes confused and off-topic (just like real life!). Overall this worked pretty well and made it easier to watch.

The labs are more problematic. You get the same labs as the classroom course, but you simply watch a Camtasia Studio recording of each lab with a voiceover by one of the presenters. Clearly the main benefit of labs is to let people experience the software hands-on, an essential part of learning practical skills. Most of the labs use either the open-source R software or EMC’s own Greenplum, which is available as a community download. There is nothing to stop you from downloading your own copies of these pieces of software, and in fact that is what I did with R. However, many of the labs assume certain sets of data are available on the system; in some cases these are CSV files that are actually provided with the course, but the relational tables used in Greenplum are not. It would have been nice if a dump of the relational tables had been provided on the DVD. A more ambitious idea would have been to provide some sort of online virtual machine in which subscribers to the course could run the labs.

Since the lab guide was provided I was able in many cases to follow the labs exactly (where the data was provided) or to do something close to it by generating my own data. I also used an existing Postgres database as a substitute for some of the Greenplum work. However, I didn’t have time to get the MADlib extensions working in Postgres (they come out of the box with Greenplum). This is unfortunate, as clearly one of the things EMC/Pivotal/Greenplum would like is for more people to use MADlib. By the way, if you didn’t know, MADlib is a way of running advanced analytics in-database, with the possibility of using massively parallel processing to speed delivery of results.

The first couple of modules are high-level in nature, aimed more at Project Manager or Business Analyst types. The presenter, David Dietrich, is clearly very comfortable with this material and appears to have considerable experience at the business end of analytics projects. The material centres around a six-step, iterative analytics methodology which seemed very sensible to me and would be a good framework for many analytics projects. It emphasises that much of the work will go into the early Discovery phase (i.e. the ‘What the hell are we actually doing?’ phase) and particularly Data Preparation (the unsexy bit of data projects). All in all this seemed both sensible and easy material.

Things start getting technical in Module 3, which provides background on statistical theory and R, the open-source statistics software. The course assumes a certain level of statistical background and programming ability, and if you don’t have that this is where you might start to struggle. As an experienced programmer I found R no problem at all and thoroughly enjoyed both the programming and the statistics.

The real meat of the course is Modules 4 and 5. Module 4 is a big beast as it dives into a number of machine learning algorithms: K-means clustering, Apriori association rules, linear and logistic regression, Naive Bayes and decision trees. Throw in some introductory text analysis and you have a massive subject base to cover. This particular part of the course is exceptionally well-written and pretty well presented. I’m not saying it’s perfect, but it is hard to overstate how difficult it is to cover all this material effectively in a relatively short space of time. Each of these algorithms is presented with use cases, some theoretical background and insight, pros and cons, and a lab.

It should be acknowledged that analytics and big data projects require a considerable range of skills, and this course provides a broad-brush overview of some of the more common techniques. Clearly you wouldn’t expect participation in this course to make you an expert Data Scientist, any more than you would employ someone to program in Java or C based on courses and exams alone. I certainly wouldn’t let someone loose administering a production Documentum system without being very sure they had the experience to back up the certificates. Somewhere in the introduction they make clear that the aim is to enable you to become an effective participant in a big data analytics project; not necessarily as a data scientist, but as someone who needs to understand both the process and the technicals. As far as this is the aim, I think it is well met in Module 4.

Module 5 is an introduction to big data processing, in particular Hadoop and MADlib. I just want to make one point here. This is very much an overview, and the stance taken by the course is clear: a Data Scientist would be very concerned with the technical details of which analytics methods to use and how to evaluate them (the subject of Module 4), whereas the processing side is just something they need to be aware of. I suspect that in real life this dichotomy is nowhere near as clear-cut.

Finally, Module 6 is back to the high-level material of Modules 1 and 2: some useful stuff about how to write reports for project sponsors and other non-Data Scientists, and the dos and don’ts of diagrams and visualisations. If this all seems a bit obvious, it’s amazing how often it is done badly. As the presenter points out, it’s no good spending tons of time and effort producing great analytics if you can’t effectively convince your stakeholders of your results and recommendations. This is so true. The big takeaways: don’t use 3D charts, and pie charts are usually a waste of ink (or screen real estate).

If I have one major complaint about the content it is that feature selection is not covered in any depth. It’s certainly touched on in places in Module 4, but given that coming up with the right features can have a huge impact on the predictive power of your model, there is a case for a specific focus.

So overall I think this was a worthwhile course, as long as you don’t have unrealistic expectations of what you will achieve. Furthermore, if you want full value from the labs, you will have to invest some effort in installing software (R and Greenplum/Postgres) and ‘munging’ data sets to use with it.

Oh, by the way, I passed the exam!
