Running ADMM LASSO example on Mac OS X Mountain Lion

July 5, 2013 at 2:13 pm | Posted in Big Data

Being able to process massive datasets for machine learning is becoming increasingly important. By massive datasets I mean data that won’t fit into RAM on a single machine (even with sparse representations or the hashing trick). There have been a number of initiatives in academia and research that attempt to address the problem; one very interesting one is the Alternating Direction Method of Multipliers (ADMM). It’s an old idea that has been resurrected in this paper by Stephen Boyd’s team at Stanford. A quick Google search for ‘Alternating Direction Method of Multipliers’ shows a recent surge of academic papers as people have started to take the ideas on board.

That paper comes with some example code, including a complete small-scale example of distributed L1 regularized least squares using MPI. The code was tested on Mac OS X 10.6, Debian 6, and Ubuntu 10.04. It requires an MPI implementation, but the authors state that OpenMPI is installed with Mac OS X 10.5 and later, so it sounded like it would be easy to run on my new iMac. Well, it turns out that from Mac OS X 10.7 (Lion) onwards this is no longer true (see here). So here are the augmented instructions for Mac OS X 10.8 that worked for me; they come with the usual ‘your mileage may vary’ caveat.

Before You Start

I assume that Xcode is already installed (freely available from the App Store; I’m using 4.6.3) and that the command line tools are installed (Xcode | Preferences | install Command Line Tools). Typing gcc in the terminal gives me
i686-apple-darwin11-llvm-gcc-4.2.

You should, of course, always download from a reputable site and verify the checksum or signature (e.g. using md5 or gpg). Safari seems to be set up to automatically uncompress .tar.gz files to .tar. Very helpful, Safari, but now I can’t checksum the downloaded file! To prevent this behaviour go to Safari | Preferences | General tab and untick ‘Open “safe” files after downloading’. Yes, I found that ironic too.
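
As a rough sketch of the checksum step, the helper below compares a file’s MD5 digest with a published value. The function name verify_md5 is made up for illustration and the digest in the usage comment is a placeholder, not the real GSL checksum. It assumes either the md5 tool that ships with OS X or md5sum is on the PATH.

```shell
# Hypothetical helper: succeed only if FILE's MD5 digest equals EXPECTED.
verify_md5() {
    file=$1
    expected=$2
    if command -v md5 >/dev/null 2>&1; then
        actual=$(md5 -q "$file")                    # OS X: -q prints only the digest
    else
        actual=$(md5sum "$file" | cut -d' ' -f1)    # GNU coreutils fallback
    fi
    [ "$actual" = "$expected" ]
}

# Usage (placeholder digest, substitute the one published by the download site):
# verify_md5 gsl-1.15.tar.gz <published-md5> && echo "checksum OK"
```

If the site publishes a detached signature instead of a checksum, gpg --verify with the signature file and the tarball is the equivalent check.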

Install GNU Scientific Library

First you need to download and install the GNU Scientific Library (GSL). I used the mirror suggested by the GSL site. Download the latest release, which in my case was 1.15 (gsl-1.15.tar.gz). Now do the following:

tar zxf gsl-1.15.tar.gz
mv gsl-1.15 ~
cd ~/gsl-1.15
export CC=clang
./configure
make
make check > log 2>&1

The ‘make check’ call runs GSL’s test suite against the freshly built library. Originally I didn’t have the export CC=clang line and some of the tests failed, so it seems worthwhile to run the checks.
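
Rather than reading the whole log by eye, a quick scan can help. This is only a sketch: it assumes the log uses automake-style result lines that start with PASS: or FAIL:, and count_failures is a made-up name.

```shell
# Hypothetical helper: print how many lines in a 'make check' log start with FAIL.
# Assumes automake-style "PASS: name" / "FAIL: name" result lines.
count_failures() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence '|| true'
    grep -c '^FAIL' "$1" || true
}

# Usage, from the GSL build directory after 'make check > log 2>&1':
# [ "$(count_failures log)" -eq 0 ] && echo "no failures recorded"
```

A count of zero is not proof that everything passed (the build itself could have stopped early), so it is still worth glancing at the end of the log.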

Review the file called log and, if everything passed with no failures, proceed as follows:

sudo make install

This will place GSL in /usr/local and requires admin privileges. You should be able to pass --prefix to ./configure to put it elsewhere, but I didn’t try that.

Install OpenMPI

Go to http://www.open-mpi.org and download the latest stable release of Open MPI; at the time of writing that was 1.6.5. The following sequence will then install it (again I’m installing to /usr/local):

tar zxf openmpi-1.6.5.tar.gz
mv openmpi-1.6.5 ~
cd ~/openmpi-1.6.5
./configure --prefix=/usr/local
make
sudo make install
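
Before moving on, it may be worth confirming that the Open MPI wrappers ended up on the PATH. A minimal sketch, where missing_tools is a hypothetical helper name:

```shell
# Hypothetical helper: print each named command that is NOT found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
    return 0
}

# The lasso Makefile calls mpicc, and the example is launched with mpirun.
missing=$(missing_tools mpicc mpirun)
if [ -z "$missing" ]; then
    echo "Open MPI wrappers found"
else
    echo "not on PATH:" $missing
fi
```

If either tool is reported missing after installing to /usr/local, checking that /usr/local/bin is in PATH is a reasonable first step.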

Download and Run Distributed LASSO

The link to the ADMM source code is on the page ‘MPI example for alternating direction method of multipliers‘ along with instructions for installing:

  1. Download and expand the mpi_lasso tar ball. The package contains a Makefile, the solver, and a standard library for reading in matrix data.
  2. Edit the Makefile to ensure that the GSLROOT variable is set to point to the location where you installed GSL, and that the ARCH variable is set appropriately (most likely to i386 or x86_64). On some machines, it may be necessary to remove the use of the -arch flag entirely.
  3. Run make. This produces a binary called lasso.
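
If you are unsure what to put in GSLROOT and ARCH for step 2, the following sketch prints candidate values. It assumes gsl-config (which is installed alongside GSL) is on the PATH and falls back to /usr/local otherwise; gsl_root is a made-up helper name.

```shell
# Hypothetical helper: guess the GSL install prefix for the Makefile's GSLROOT.
# Uses gsl-config (installed alongside GSL) when available, else /usr/local.
gsl_root() {
    if command -v gsl-config >/dev/null 2>&1; then
        gsl-config --prefix
    else
        echo /usr/local
    fi
}

echo "GSLROOT=$(gsl_root)"
echo "ARCH=$(uname -m)"    # prints the machine type, e.g. x86_64 or i386
```

Treat the output as a suggestion to copy into the Makefile by hand; the ARCH value has to match the architecture GSL was actually built for.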

Incidentally, the Makefile seems to contain additional instructions to build a component called ‘gam’. gam.c is not included in the download, so I just removed all references to gam. Here is what my Makefile looks like:

GSLROOT=/usr/local
# use this if on 64-bit machine with 64-bit GSL libraries
ARCH=x86_64
# use this if on 32-bit machine with 32-bit GSL libraries
# ARCH=i386

MPICC=mpicc
CC=gcc
CFLAGS=-Wall -std=c99 -arch $(ARCH) -I$(GSLROOT)/include
LDFLAGS=-L$(GSLROOT)/lib -lgsl -lgslcblas -lm

all: lasso 

lasso: lasso.o mmio.o
	$(MPICC) $(CFLAGS) $(LDFLAGS) lasso.o mmio.o -o lasso

lasso.o: lasso.c mmio.o
	$(MPICC) $(CFLAGS) -c lasso.c

mmio.o: mmio.c
	$(CC) $(CFLAGS) -c mmio.c

clean:
	rm -vf *.o lasso 

A typical execution using the provided data set and using 4 processes on the same machine is

mpirun -np 4 lasso

The output should look like this:


[0] reading data/A1.dat
[1] reading data/A2.dat
[2] reading data/A3.dat
[3] reading data/A4.dat
[3] reading data/b4.dat
[1] reading data/b2.dat
[0] reading data/b1.dat
[2] reading data/b3.dat
using lambda: 0.5000
  #     r norm    eps_pri     s norm   eps_dual  objective
  0     0.0000     0.0430     0.1692     0.0045    12.0262
  1     3.8267     0.0340     0.9591     0.0427    11.8101
  2     2.6698     0.0349     1.5638     0.0687    12.1617
  3     1.5666     0.0476     1.6647     0.0831    13.2944
  4     0.8126     0.0614     1.4461     0.0886    14.8081
  5     0.6825     0.0721     1.1210     0.0886    16.1636
  6     0.7332     0.0793     0.8389     0.0862    17.0764
  7     0.6889     0.0838     0.6616     0.0831    17.5325
  8     0.5750     0.0867     0.5551     0.0802    17.6658
  9     0.4539     0.0885     0.4675     0.0778    17.6560
 10     0.3842     0.0897     0.3936     0.0759    17.5914
 11     0.3121     0.0905     0.3389     0.0744    17.5154
 12     0.2606     0.0912     0.2913     0.0733    17.4330
 13     0.2245     0.0917     0.2558     0.0725    17.3519
 14     0.1847     0.0923     0.2276     0.0720    17.2874
 15     0.1622     0.0928     0.2076     0.0716    17.2312
 16     0.1335     0.0934     0.1858     0.0713    17.1980
 17     0.1214     0.0939     0.1689     0.0712    17.1803
 18     0.1045     0.0944     0.1548     0.0710    17.1723
 19     0.0931     0.0950     0.1344     0.0708    17.1768
 20     0.0919     0.0954     0.1243     0.0707    17.1824
 21     0.0723     0.0958     0.1152     0.0705    17.1867
 22     0.0638     0.0962     0.1079     0.0704    17.1896
 23     0.0570     0.0965     0.1019     0.0702    17.1900
 24     0.0507     0.0968     0.0964     0.0701    17.1898
 25     0.0460     0.0971     0.0917     0.0700    17.1885
 26     0.0416     0.0973     0.0874     0.0699    17.1866
 27     0.0382     0.0976     0.0834     0.0698    17.1846
 28     0.0354     0.0978     0.0798     0.0697    17.1827
 29     0.0329     0.0980     0.0762     0.0697    17.1815
 30     0.0311     0.0983     0.0701     0.0696    17.1858
 31     0.0355     0.0985     0.0667     0.0696    17.1890
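
The run stops at iteration 31, which appears to be the first row where both residuals fall below their tolerances (r norm <= eps_pri and s norm <= eps_dual). That reading of the columns is an assumption based on the usual ADMM stopping criterion, not something the program output states. A small sketch that finds that row, with first_converged a made-up name:

```shell
# Hypothetical helper: read the iteration table on stdin and print the first
# iteration where r norm <= eps_pri and s norm <= eps_dual.
# Column order assumed from the header: #, r norm, eps_pri, s norm, eps_dual, objective.
first_converged() {
    awk 'NF == 6 && $1 ~ /^[0-9]+$/ && $2 + 0 <= $3 + 0 && $4 + 0 <= $5 + 0 { print $1; exit }'
}

# Last three rows of the output above
printf '%s\n' \
    ' 29     0.0329     0.0980     0.0762     0.0697    17.1815' \
    ' 30     0.0311     0.0983     0.0701     0.0696    17.1858' \
    ' 31     0.0355     0.0985     0.0667     0.0696    17.1890' | first_converged
```

For these rows it prints 31: at iteration 30 the s norm (0.0701) is still just above eps_dual (0.0696).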

The file data/solution.dat will contain the optimal z parameters (z equals x at convergence), most of which should be zero.
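
To check the sparsity claim without scrolling through the file, something like the following could be used. It is a sketch that assumes solution.dat holds one numeric value per line (I have not confirmed the exact file format); count_nonzero is a hypothetical name.

```shell
# Hypothetical helper: print "<nonzero> of <total>" for a file of numbers.
# Assumes one value per line; lines not starting with a number are skipped.
count_nonzero() {
    awk '$1 ~ /^[-+]?[0-9.]/ { n++; if ($1 + 0 != 0) nz++ }
         END { printf "%d of %d\n", nz, n }' "$1"
}

# Usage, after a run has written the solution:
# count_nonzero data/solution.dat
```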
