Running ADMM LASSO example on Mac OS X Mountain Lion

July 5, 2013 at 2:13 pm | Posted in Big Data

Being able to process massive datasets for machine learning is becoming increasingly important. By massive datasets I mean data that won’t fit into RAM on a single machine (even with sparse representations or the hashing trick). There have been a number of initiatives in the academic and research arena that attempt to address the problem; one particularly interesting approach is the Alternating Direction Method of Multipliers (ADMM). It’s an old idea that has been resurrected in this paper by Stephen Boyd’s team at Stanford. A quick google on ‘Alternating Direction Method of Multipliers’ shows a recent surge of academic papers as people have started to take the ideas on board.

That paper comes with some example code, including a complete small-scale example of distributed L1 regularized least squares using MPI. The code was tested on Mac OS X 10.6, Debian 6, and Ubuntu 10.04. It requires installation of an MPI implementation, but the authors state that Open MPI is installed with Mac OS X 10.5 and later, so it sounds like it should be easy to run on my new iMac. Well, it turns out that from Mac OS X 10.7 (Lion) this is no longer true (see here). So here are the augmented instructions for Mac OS X 10.8 that worked for me; they come with the usual ‘your mileage may vary’ caveat.
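For reference, the distributed lasso in that example is solved in consensus form: the rows of A and b are split across N processes, each process i holding a block (A_i, b_i), and every ADMM iteration performs a local least-squares update, a global soft-thresholded averaging step, and a dual update. A rough sketch of the updates, following the formulation in the Boyd et al. paper (the exact scaling of lambda and rho used in lasso.c may differ slightly):

x_i^{k+1} = \arg\min_{x_i} \left( \tfrac{1}{2}\|A_i x_i - b_i\|_2^2 + \tfrac{\rho}{2}\|x_i - z^k + u_i^k\|_2^2 \right)
z^{k+1} = S_{\lambda/(\rho N)}\left( \bar{x}^{k+1} + \bar{u}^k \right)
u_i^{k+1} = u_i^k + x_i^{k+1} - z^{k+1}

where S_\kappa is the soft-thresholding operator S_\kappa(a) = \mathrm{sign}(a)\,\max(|a| - \kappa, 0) and the bars denote averages over the N processes. The r norm and s norm columns in the solver output further down are the primal and dual residual norms, which are compared against the eps_pri and eps_dual tolerances to decide convergence.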

Before You Start

I assume that Xcode is already installed (freely available from the App Store; I’m using 4.6.3) and that the command line tools are installed (Xcode | Preferences | Install Command Line Tools). Typing gcc in the terminal gives me
i686-apple-darwin11-llvm-gcc-4.2.

You should, of course, always download from a reputable site and verify the checksum (e.g. using md5 or gpg). Safari seems to be set up to automatically uncompress .tar.gz files to .tar. Very helpful, Safari, but now I can’t checksum the downloaded file! To prevent this behaviour go to Safari | Preferences | General tab and untick ‘Open “safe” files after downloading’. Yes, I found that ironic too.

Install GNU Scientific Library

First you need to download and install the GNU Scientific Library (GSL). I used the mirror suggested by the GSL site. Download the latest release, which in my case was 1.15 (gsl-1.15.tar.gz). Now do the following:

tar zxf gsl-1.15.tar.gz
mv gsl-1.15 ~
cd ~/gsl-1.15
export CC=clang
./configure
make
make check > log 2>&1

The ‘make check’ step runs the GSL test suite against the build. Originally I didn’t have the export CC=clang line and some of the tests failed, so it seems worthwhile to run the checks.

Review the file called log; if everything passed with no failures, proceed as follows:

sudo make install

This places GSL in /usr/local and requires admin privileges. You should be able to pass --prefix to ./configure to install it elsewhere, but I didn’t try that.

Install OpenMPI

Go to http://www.open-mpi.org and download the latest stable release of Open MPI – at the time of writing that was 1.6.5. Then the following sequence will build and install it (again I’m installing to /usr/local):

tar zxf openmpi-1.6.5.tar.gz
mv openmpi-1.6.5 ~
cd ~/openmpi-1.6.5
./configure --prefix /usr/local
make
sudo make install

Download and Run Distributed LASSO

The link to the ADMM source code is on the page ‘MPI example for alternating direction method of multipliers‘ along with instructions for installing:

  1. Download and expand the mpi_lasso tar ball. The package contains a Makefile, the solver, and a standard library for reading in matrix data.
  2. Edit the Makefile to ensure that the GSLROOT variable is set to point to the location where you installed GSL, and that the ARCH variable is set appropriately (most likely to i386 or x86_64). On some machines, it may be necessary to remove the use of the -arch flag entirely.
  3. Run make. This produces a binary called lasso.

Incidentally, the Makefile seems to contain additional instructions to build a component called ‘gam’. gam.c is not included in the download, so I just removed all references to gam. Here is what my Makefile looks like:

GSLROOT=/usr/local
# use this if on 64-bit machine with 64-bit GSL libraries
ARCH=x86_64
# use this if on 32-bit machine with 32-bit GSL libraries
# ARCH=i386

MPICC=mpicc
CC=gcc
CFLAGS=-Wall -std=c99 -arch $(ARCH) -I$(GSLROOT)/include
LDFLAGS=-L$(GSLROOT)/lib -lgsl -lgslcblas -lm

all: lasso 

lasso: lasso.o mmio.o
	$(MPICC) $(CFLAGS) $(LDFLAGS) lasso.o mmio.o -o lasso

lasso.o: lasso.c mmio.o
	$(MPICC) $(CFLAGS) -c lasso.c

mmio.o: mmio.c
	$(CC) $(CFLAGS) -c mmio.c

clean:
	rm -vf *.o lasso 

A typical execution on the provided data set, using 4 processes on the same machine, is

mpirun -np 4 lasso

The output should look like this:


[0] reading data/A1.dat
[1] reading data/A2.dat
[2] reading data/A3.dat
[3] reading data/A4.dat
[3] reading data/b4.dat
[1] reading data/b2.dat
[0] reading data/b1.dat
[2] reading data/b3.dat
using lambda: 0.5000
  #     r norm    eps_pri     s norm   eps_dual  objective
  0     0.0000     0.0430     0.1692     0.0045    12.0262
  1     3.8267     0.0340     0.9591     0.0427    11.8101
  2     2.6698     0.0349     1.5638     0.0687    12.1617
  3     1.5666     0.0476     1.6647     0.0831    13.2944
  4     0.8126     0.0614     1.4461     0.0886    14.8081
  5     0.6825     0.0721     1.1210     0.0886    16.1636
  6     0.7332     0.0793     0.8389     0.0862    17.0764
  7     0.6889     0.0838     0.6616     0.0831    17.5325
  8     0.5750     0.0867     0.5551     0.0802    17.6658
  9     0.4539     0.0885     0.4675     0.0778    17.6560
 10     0.3842     0.0897     0.3936     0.0759    17.5914
 11     0.3121     0.0905     0.3389     0.0744    17.5154
 12     0.2606     0.0912     0.2913     0.0733    17.4330
 13     0.2245     0.0917     0.2558     0.0725    17.3519
 14     0.1847     0.0923     0.2276     0.0720    17.2874
 15     0.1622     0.0928     0.2076     0.0716    17.2312
 16     0.1335     0.0934     0.1858     0.0713    17.1980
 17     0.1214     0.0939     0.1689     0.0712    17.1803
 18     0.1045     0.0944     0.1548     0.0710    17.1723
 19     0.0931     0.0950     0.1344     0.0708    17.1768
 20     0.0919     0.0954     0.1243     0.0707    17.1824
 21     0.0723     0.0958     0.1152     0.0705    17.1867
 22     0.0638     0.0962     0.1079     0.0704    17.1896
 23     0.0570     0.0965     0.1019     0.0702    17.1900
 24     0.0507     0.0968     0.0964     0.0701    17.1898
 25     0.0460     0.0971     0.0917     0.0700    17.1885
 26     0.0416     0.0973     0.0874     0.0699    17.1866
 27     0.0382     0.0976     0.0834     0.0698    17.1846
 28     0.0354     0.0978     0.0798     0.0697    17.1827
 29     0.0329     0.0980     0.0762     0.0697    17.1815
 30     0.0311     0.0983     0.0701     0.0696    17.1858
 31     0.0355     0.0985     0.0667     0.0696    17.1890

If you open up the file data/solution.dat it will contain the optimal z parameters (which equal x at convergence), most of which should be zero.
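If you want to inspect the result programmatically rather than eyeballing the file, here is a minimal C sketch that counts the near-zero coefficients. It assumes solution.dat is plain text containing whitespace-separated doubles, one per coefficient; check the format lasso.c actually writes before relying on this, and note the 1e-4 tolerance is an arbitrary choice.

#include <stdio.h>
#include <math.h>

/* Count the near-zero entries in the lasso solution.
 * Assumes whitespace-separated doubles; adjust the parsing
 * if lasso.c writes a different format. */
int main(void) {
    FILE *f = fopen("data/solution.dat", "r");
    if (!f) { perror("data/solution.dat"); return 1; }

    const double tol = 1e-4;   /* treat |z| < tol as zero */
    double z;
    int n = 0, zeros = 0;
    while (fscanf(f, "%lf", &z) == 1) {
        n++;
        if (fabs(z) < tol) zeros++;
    }
    fclose(f);

    printf("%d of %d coefficients are (near) zero\n", zeros, n);
    return 0;
}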

Documentum and Databases

July 3, 2013 at 4:11 pm | Posted in Performance

Here’s a quick thought on Documentum and databases. For a long time Documentum supported a variety of databases; these days support is down to just 2 in D7 (Oracle and SQL Server), from 4 in D6.7 (the previous 2 plus DB2 and Sybase).

The clear reason for narrowing down the choice of database server is (I suspect) the cost of developing for and supporting a large number of choices, particularly since most of the database/OS combinations were used by only a handful of customers.

So why doesn’t EMC port the application to postgres and cut that choice down to 1? Why postgres? Well because EMC owns Greenplum (well actually it’s now part of Pivotal but that just complicates the story) and Greenplum is an enhanced postgres.

The logic for this is clear: EMC would like people to move to OnDemand and it makes sense for them to have ownership of the whole technical stack. At the very least they must be shelling out money to one of the database vendors. I’m not sure which one – if you have access to an OnDemand installation, try running ‘select r_server_version from dm_server_config’ and see what’s returned; let me know the results if you could.

There are a couple of reasons why EMC might be reluctant. First, it’s a big change and people (including EMC’s own development and support teams) have a big skills base in the legacy databases. Taking a medium-term strategic view this is not a great reason and is just a product of FUD – Documentum has taken brave technical steps in the past, such as eliminating the dmcl layer, with great success.

Second, we’ve been hearing a lot over the last few years about the NG server, which runs on the XHive XML database and is touted to replace the venerable Content Server in the longer term. Perhaps EMC is reluctant to work on 2 such radical changes at once.

Who knows? It’s just a thought …


Customising Documentum’s Netegrity Siteminder SSO plugin pt 2

July 1, 2013 at 9:51 am | Posted in Performance

The 1st part of this article introduced the motivation and architecture behind web-based Single Sign-On systems and Documentum’s SSO plugin. This 2nd part discusses limitations of the out-of-the-box plugin and a customisation approach to deal with them.

Sometimes you don’t want SSO

Whilst SSO is a great boon when you just want to log in and get on with some work, there are situations where it is positively unwanted. A case in point is electronic sign-off of documents in systems like Documentum Compliance Manager (DCM). The document sign-off screens in DCM require entry of a username and password (a GxP requirement), yet the out-of-the-box Netegrity plugin only understands SSO cookies; it doesn’t know what to do with passwords.

Inside the plugin

Before looking at the solution, let’s look in detail at how the out-of-the-box plugin works. When the dm_netegrity plugin receives an authentication request it contacts the SiteMinder application via the SiteMinder Agent API (the SiteMinder libraries are included with the Content Server installation). The following API calls are made to the SiteMinder server:

  1. Sm_AgentApi_Init(). Sets up the connection to the SiteMinder server.

  2. Sm_AgentApi_DoManagement(). “Best practice” call to the SiteMinder server passing an authentication agent identification string: Product=DocumentumAgent,Platform=All,Version=5.2,Label=None.

  3. Sm_AgentApi_DecodeSSOToken(). Passes the SSO token to SiteMinder to confirm that the token is valid i.e. that it has been produced by that SiteMinder infrastructure. If the call returns a success code then the token is valid. A session specification is also returned to the calling program – this is the identifier that connects the SSO token to the session originally created on the SiteMinder infrastructure.

  4. Sm_AgentApi_IsProtected(). Checks whether SiteMinder regards the web application context as a protected resource. This call is probably needed to fill in a data structure that is used in the next call.

  5. Sm_AgentApi_Login(). One of the input parameters to this call is the session specification (from step 3). If the session specification is passed then SiteMinder will do some verification checks on the session (has it expired? is the user active?) and then return the user LDAP identifier. The plugin uses this information to check that the token is for the correct user.

Solution

The out-of-the-box (OOB) dm_netegrity plugin provided by EMC is set up to authenticate users who have previously authenticated against SiteMinder and received an SSO token in their browser session. In our case, where authentication with a username and password is required, there is no support in the DCM application for re-authenticating against the SiteMinder SSO solution. Where such authentication is attempted the OOB plugin will return an authentication failure, as it is not designed to authenticate usernames and passwords against SiteMinder.

One way to solve this problem is to add support in the authentication plugin for authenticating with a username and password as well as an SSO token. Since SSO tokens are very large (several hundred characters) whilst passwords are generally significantly shorter, we can use the length of the authentication token to decide whether it is an SSO credential or a password. In practice something like 20 characters is a good cutoff point. If the length is greater than this limit it is treated as an SSO credential and processed as described above. If the length is 20 characters or less it is treated as a password and processed using the following API calls (a sketch of this dispatch appears after the list):

  1. Sm_AgentApi_Init(). Sets up the connection to the SiteMinder server.

  2. Sm_AgentApi_DoManagement(). “Best practice” call to the SiteMinder server passing an authentication agent identification string: Product=DocumentumAgent,Platform=All,Version=5.2,Label=None.

  3. Sm_AgentApi_IsProtected(). Checks whether SiteMinder regards the web application context as a protected resource.

  4. Sm_AgentApi_Login(). Since Sm_AgentApi_DecodeSSOToken() has not been called, no session specification is available and none is passed into the Login call (compare the out-of-the-box logic). However, if the username and password are passed to the Login function, SiteMinder will validate the credentials. If a success return code is received the user is authenticated; otherwise the user is not authenticated.
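To make the dispatch concrete, here is a heavily simplified C sketch. The function names (netegrity_authenticate, authenticate_sso_token, authenticate_password), the MAX_PASSWORD_LEN constant and all signatures are illustrative only; the real plugin implements the Content Server authentication-plugin interface and calls the Sm_AgentApi_* functions with their full argument lists. The point is simply the length-based branch between the SSO-token path and the password path.

#include <stdio.h>
#include <string.h>

/* Credentials longer than this are treated as SSO tokens; shorter ones as passwords. */
#define MAX_PASSWORD_LEN 20

/* Illustrative stub: the real code runs the out-of-the-box flow
 * (Sm_AgentApi_Init, Sm_AgentApi_DoManagement, Sm_AgentApi_DecodeSSOToken,
 * Sm_AgentApi_IsProtected, then Sm_AgentApi_Login with the session
 * specification) and checks the returned user against the login name. */
static int authenticate_sso_token(const char *user, const char *token) {
    printf("validating SSO token (%zu chars) for %s\n", strlen(token), user);
    return 1; /* pretend success */
}

/* Illustrative stub: the real code runs the new flow
 * (Sm_AgentApi_Init, Sm_AgentApi_DoManagement, Sm_AgentApi_IsProtected,
 * then Sm_AgentApi_Login with the username and password and no
 * session specification). */
static int authenticate_password(const char *user, const char *password) {
    printf("validating password (%zu chars) for %s\n", strlen(password), user);
    return 1; /* pretend success */
}

/* The length-based dispatch described above. */
int netegrity_authenticate(const char *user, const char *credential) {
    if (strlen(credential) > MAX_PASSWORD_LEN)
        return authenticate_sso_token(user, credential);
    return authenticate_password(user, credential);
}

int main(void) {
    netegrity_authenticate("jsmith", "s3cret");                               /* password path */
    netegrity_authenticate("jsmith", "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"); /* SSO token path */
    return 0;
}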

Implementation and Deployment

Source code for the out of the box plugin is provided in the Content Server installation. It is written in C++ and has a makefile that covers a number of operating systems. To get this to work for 64-bit Linux took a little manipulation of the compiler and linker options.

The customisation should be deployed as a single *nix shared library. When the file is deployed to $DOCUMENTUM/dba/auth on the Content Server it is available as a dm_netegrity plugin (after a Content Server restart).

Note: the out-of-the-box dm_netegrity_auth.so library must not be present in the auth directory as this will cause a conflict when the plugins are loaded by Content Server and both try to register themselves as ‘dm_netegrity’.

Conclusion

The solution is fairly simple in concept; the devil is in the details of compile/link, deployment and testing. If you think you need to implement customised SSO for your project and want some help designing and implementing your solution, please contact me for consulting work – initial advice is not charged.
