Using the Spark Event Timeline

September 8, 2015 at 1:42 pm | Posted in Performance | Leave a comment

Source: Using the Spark Event Timeline

New Blog

March 4, 2015 at 10:59 pm | Posted in Performance | Leave a comment

I have just published the first post on my new blog, Machine Learning at Speed. I left the world of EMC Documentum mid-2013 to concentrate on my other technical interests: Hadoop, data science and machine learning. If you’ve enjoyed Inside Documentum please consider following my posts on the new site – I believe that data science and machine learning will finally allow us to fulfil the promise of true knowledge discovery that content management promised for so long.

Documentum on PostgreSQL

September 17, 2013 at 3:53 pm | Posted in Performance | Leave a comment

Great news from Lee Dallas reporting from the Documentum Developer Conference: Documentum Developer Edition is back and now runs on PostgreSQL. I discussed this a few months back and I thought that maybe EMC didn’t have the stomach for something so technical, but I was wrong. So kudos to EMC.

Lee mentions it’s not yet production ready, so hopefully that is in the pipeline. After that, how about certifying it to run on Greenplum, EMC’s massively scalable PostgreSQL? Then the sky is the limit for large-scale NLP and machine learning tasks. For example, last year I wanted to run a classification algorithm on document content to identify certain types of document that couldn’t be found by metadata. There are plenty of other uses I can think of.

I’ll be downloading the edition as soon as possible to see how it runs.

Documentum and Databases

July 3, 2013 at 4:11 pm | Posted in Performance | 4 Comments

Here’s a quick thought on Documentum and databases. For a long time Documentum supported a variety of databases; however, these days D7 supports just 2 (Oracle and SQL Server), down from 4 in D6.7 (the previous 2 plus DB2 and Sybase).

The clear reason for narrowing down the choice of database server is (I suspect) the cost of developing for and supporting a large number of choices, particularly since most of the database/OS combinations were used by only a handful of customers.

So why doesn’t EMC port the application to postgres and cut that choice down to 1? Why postgres? Well because EMC owns Greenplum (well actually it’s now part of Pivotal but that just complicates the story) and Greenplum is an enhanced postgres.

The logic for this is clear: EMC would like people to move to OnDemand and it makes sense for them to have ownership of the whole technical stack. At the very least they must be shelling out money to one of the database vendors. I’m not sure which one – if you have access to an OnDemand installation, try running ‘select r_server_version from dm_server_config’ and see what’s returned. Someone let me know the results if you could.

There are a couple of reasons why EMC might be reluctant. First, it’s a big change and people (including EMC’s own development and support teams) have a big skills base in the legacy databases. Taking a medium-term strategic view, this is not a great reason and is just a product of FUD – Documentum has taken brave technical steps in the past, such as eliminating the dmcl layer, with great success.

Second, we’ve been hearing a lot over the last few years about the NG server, which runs on the XHive XML database and is touted to replace the venerable Content Server in the longer term. Perhaps EMC is reluctant to work on 2 such radical changes.

Who knows? It’s just a thought …

Customising Documentum’s Netegrity Siteminder SSO plugin pt 2

July 1, 2013 at 9:51 am | Posted in Performance | Leave a comment

The 1st part of this article introduced the motivation and architecture behind web-based Single Sign-On (SSO) systems and Documentum’s SSO plugin. This 2nd part discusses a limitation in the out-of-the-box plugin and a customisation approach to deal with it.

Sometimes you don’t want SSO

Whilst SSO is a great boon when you just want to log in and get on with some work, there are situations when it is positively unwanted. A case in point is electronic sign-off of documents in systems like Documentum Compliance Manager (DCM). The document sign-off screens in DCM require entry of a username and password (a GxP requirement), yet the out-of-the-box netegrity plugin only understands SSO cookies; it doesn’t know what to do with passwords.

Inside the plugin

Before looking at the solution, let’s look in detail at how the out-of-the-box plugin works. When the dm_netegrity plugin receives an authentication request it contacts the SiteMinder application via the SiteMinder Agent API (SiteMinder libraries are included with the Content Server installation). The following API calls are made to the SiteMinder server:

  1. Sm_AgentApi_Init(). Sets up the connection to the SiteMinder server.

  2. Sm_AgentApi_DoManagement(). “Best practice” call to the SiteMinder server passing an authentication agent identification string: Product=DocumentumAgent,Platform=All,Version=5.2,Label=None.

  3. Sm_AgentApi_DecodeSSOToken(). Passes the SSO token to SiteMinder to confirm that the token is valid i.e. that it has been produced by that SiteMinder infrastructure. If the call returns a success code then the token is valid. A session specification is also returned to the calling program – this is the identifier that connects the SSO token to the session originally created on the SiteMinder infrastructure.

  4. Sm_AgentApi_IsProtected(). Checks whether SiteMinder regards the web application context as a protected resource. This call is probably needed to fill in a data structure that is used in the next call.

  5. Sm_AgentApi_Login(). One of the input parameters to this call is the session specification (from step 3). If the session specification is passed then SiteMinder will do some verification checks on the session (has it expired? is the user active?) and then return the user LDAP identifier. The plugin uses this information to check that the token is for the correct user.

Solution

The out-of-the-box (OOB) dm_netegrity plugin provided by EMC is set up to authenticate users who have previously authenticated against SiteMinder and received an SSO token in their browser session. In our case, where authentication with a username and password is required, there is no support in the DCM application for re-authenticating against the SiteMinder SSO solution. Where such authentication is attempted the OOB plugin will return an authentication failure, as it is not designed to authenticate usernames and passwords against SiteMinder.

One way to solve this problem is to add support in the authentication plugin for authenticating against a username and password as well as an SSO token. Since SSO tokens are very large (several hundred characters) whilst passwords are generally significantly smaller, we can use the length of the authentication token to decide whether the token is an SSO credential or a password. In practice something like 20 characters is a good cutoff point. If the length is greater than this limit it is treated as an SSO credential and processed as described above. If the length is 20 characters or less it is treated as a password and processed using the following API calls.

  1. Sm_AgentApi_Init(). Sets up the connection to the SiteMinder server.

  2. Sm_AgentApi_DoManagement(). “Best practice” call to the SiteMinder server passing an authentication agent identification string: Product=DocumentumAgent,Platform=All,Version=5.2,Label=None.

  3. Sm_AgentApi_IsProtected(). Checks whether SiteMinder regards the web application context as a protected resource.

  4. Sm_AgentApi_Login(). Since Sm_AgentApi_DecodeSSOToken() has not been called, no session specification is available and none is passed into the Login call (compare the out-of-the-box logic). However, if the username and password are passed to the Login function, SiteMinder will validate the credentials. If a success return code is received the user is authenticated; otherwise the user is not authenticated.
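The length-based dispatch described above can be sketched in a few lines. This is only an illustration in Python, not the plugin itself (which is C++): the function names classify_credential and siteminder_calls are hypothetical, the SiteMinder calls appear only as names in a list, and the 20-character cutoff is the rule of thumb from the text rather than a fixed value.

```python
from typing import List

# Cutoff suggested in the text; an assumption to be tuned per site.
SSO_CUTOFF_LENGTH = 20


def classify_credential(credential: str) -> str:
    """Decide, purely by length, whether a credential is an SSO token or a password."""
    return "sso_token" if len(credential) > SSO_CUTOFF_LENGTH else "password"


def siteminder_calls(credential: str) -> List[str]:
    """List, in order, the Agent API calls the customised plugin would make.

    Only the call names are modelled here; nothing is sent to SiteMinder.
    """
    calls = ["Sm_AgentApi_Init", "Sm_AgentApi_DoManagement"]
    if classify_credential(credential) == "sso_token":
        # Only the SSO path decodes the token to obtain a session specification.
        calls.append("Sm_AgentApi_DecodeSSOToken")
    calls += ["Sm_AgentApi_IsProtected", "Sm_AgentApi_Login"]
    return calls


if __name__ == "__main__":
    print(siteminder_calls("x" * 300))  # long value: treated as an SSO token
    print(siteminder_calls("s3cret"))   # short value: treated as a password
```

One consequence of this design is worth keeping in mind: a password longer than the cutoff would be sent down the SSO path, so the cutoff needs to sit above the site’s maximum password length.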

Implementation and Deployment

Source code for the out of the box plugin is provided in the Content Server installation. It is written in C++ and has a makefile that covers a number of operating systems. To get this to work for 64-bit Linux took a little manipulation of the compiler and linker options.

The customisation should be deployed as a single *nix shared library. When the file is deployed to $DOCUMENTUM/dba/auth on the Content Server it is available as a dm_netegrity plugin (after a Content Server restart).

Note: the out-of-the-box dm_netegrity_auth.so library must not be present in the auth directory as this will cause a conflict when the plugins are loaded by Content Server and both try to register themselves as ‘dm_netegrity’.

Conclusion

The solution is fairly simple in concept; the devil is in the details of compile/link, deployment and testing. If you think you need to implement customised SSO for your project and want some help designing and implementing your solution please contact me for consulting work – initial advice is not charged.

Hadoop and Real-time Processing

June 21, 2013 at 8:50 am | Posted in Performance | Leave a comment

Almost since the day that Hadoop became big news some people have been predicting the demise of the system. I have heard several different flavours of this argument, one being that what is needed is ‘real-time’ big data analytics and that Hadoop, with its batch processing and CPU-hungry data-munching, is not fit for the task. I think this misunderstands the role that Hadoop plays and will continue to play in any big data analytics system. In many cases batch-oriented applications (often based on Hadoop and its various ecosystem products) will do the big data crunching and CPU-hungry work offline, under non-realtime constraints. Models and output then feed into real-time systems that are able to process real-time data through the model.
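To make the offline/real-time split concrete, here is a toy sketch in Python. It is not taken from any particular system: build_model stands in for the batch job (which in practice might run on Hadoop), score_stream stands in for the real-time component, and the "model" is just a mean and standard deviation used to flag outliers. All names and the threshold k are assumptions for illustration.

```python
import statistics
from typing import Dict, Iterable, Iterator, List, Tuple


def build_model(history: List[float]) -> Dict[str, float]:
    """Batch step: crunch historical data offline and emit a small model."""
    return {
        "mean": statistics.mean(history),
        "stdev": statistics.pstdev(history),
    }


def score_stream(
    model: Dict[str, float], events: Iterable[float], k: float = 3.0
) -> Iterator[Tuple[float, bool]]:
    """Real-time step: score each event as it arrives using the prebuilt model.

    An event is flagged when it lies more than k standard deviations
    from the mean learned in the batch step.
    """
    for value in events:
        yield value, abs(value - model["mean"]) > k * model["stdev"]


if __name__ == "__main__":
    model = build_model([10, 12, 11, 9, 10, 11, 12, 9])  # slow, offline
    for value, is_outlier in score_stream(model, [10, 50]):  # fast, per event
        print(value, is_outlier)
```

The point of the split is that the expensive work happens once, offline, while the per-event work is a cheap lookup against the model.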

A paper by Bhattacharya and Mitra called Analytics on Big FAST Data Using a Realtime Stream Data Processing Architecture on the EMC Knowledge Sharing site provides a great example of how this offline/real-time combination works. I believe this will become an archetype for how such systems should be built.

Not only do they show how event collection (Apache Kafka), batch model building (Hadoop/Mahout) and real-time processing (Storm) can work together, but they also provide a very accessible introduction to Hidden Markov Models using a couple of characters called Alice and Bob. With a 60% chance of rain Bob clearly lives in the UK. Probably somewhere near Manchester.

Edit 20 Dec 2016: Seems that link has disappeared. If you search for the paper you should be able to find a copy e.g. http://docplayer.net/1475672-Analytics-on-big-fast-data-using-real-time-stream-data-processing-architecture.html

Data Science London Meetup June 2013

June 14, 2013 at 2:08 pm | Posted in Performance | Leave a comment

This is a quick post to record my thoughts and impressions from the Data Science London meetup I attended this week. We were treated to 4 presentations on a variety of Data Science/Machine Learning/Big Data topics. First up was Rosaria Silipo from Knime. Knime is new to me; it’s a visual and interactive machine learning environment where you develop your data science and machine learning workflows. Data sources, data manipulation, algorithm execution and outputs are nodes in an Eclipse-like environment that are joined together to give you an end-to-end execution environment. Rosaria took us through a previous project, showing how the Knime interface helped the project and how Knime can be extended to integrate other tools like R. I like the idea and would love to find some time to investigate further.

Next up was Ian Hopkinson from Scraperwiki talking about scraping and parsing PDF. Ian is a self-effacing but engaging speaker which made the relatively dry subject matter pretty easy to digest – essentially a technical walkthrough on implementing the extraction of data from 1000s of PDFs, warts and all. 2 key points:

  1. Regular Expressions are still a significant tool in data extraction. This is a dirty little secret of NLP that I’ve heard before. Kind of depressing, as one of the things that attracted me to machine learning was the hope that I might write fewer REs in the future.
  2. Scraperwiki are involved in some really interesting public data extraction, for example digitizing UN Assembly archives. I don’t know if anyone has done analysis of UN voting patterns on a large scale, but I for one would be interested to know if they correlate with voting in the Eurovision Song Contest.

Third up was Doug Cutting. Doug is the originator of Lucene and Hadoop, which probably explains the frenzy to get into the meeting (I had been on the waiting list for a week and eventually got a place at 4.00 for a 6.30 start) and the packed hall. Doug now works for the Hadoop provider Cloudera and was speaking on the recently announced Cloudera Search. Cloudera Search enables Lucene indexing of data stored on HDFS, with index files stored in HDFS. It has always been possible (albeit a bit fiddly) to do this; however, there were performance issues, which were mostly resolved by adding a page cache to HDFS. They also incorporated and ‘glued-in’ some supporting technologies such as Apache Tika (extracts indexable content out of multiple document formats like Word, Excel, PDF, HTML), Apache Zookeeper and some others that I don’t remember. A really neat idea is the ability to index a batch of content offline using MapReduce (something MapReduce would be really good at) and then merge the offline index into the main online index. This supports use cases where companies need to re-index their content on a regular basis but still need near real-time indexing and search of new content. I can also see this being great for data migration type scenarios. All in all I think this is fascinating and it will be interesting to see how the other Hadoop providers respond.

Last up was Ian Ozsvald talking about producing a better Named Entity Recogniser (NER) for brands in Twitter-like social media. NER is a fairly mature technology these days; however, most of the available technology is apparently trained on more traditional long-form content with good syntax and ‘proper’ writing, and with an emphasis on big (often American) brands. I particularly applaud the fact that he has only just started the project and came along to present his ideas and to make his work freely available on GitHub. I would love to find the time to download it myself and will be following his progress. If you are interested I suggest you check out his blog posting. As an aside, he also has a personal project to track his cat using a Raspberry Pi, which you can follow on Twitter as @QuantifiedPolly.

All in all a great event and thanks to Carlos for the organisation, and the sponsors for the beer and pizza. Looking forward to the next time – assuming I can get in.

Taking the EMC Data Science associate certification

May 13, 2013 at 10:06 am | Posted in Big Data, Performance | 8 Comments

In the last couple of weeks I’ve been studying for the EMC data science certification. There are a number of ways of studying for this certificate but I chose the virtual learning option, which comes as a DVD that installs on a Windows PC (yes, Macs are no good!).

The course consists of six modules and is derived from the classroom-based delivery of the course. Each module is dedicated to a particular aspect of data science and big data, and each follows a similar pattern: a number of video lectures followed by a set of lab exercises. There are also occasional short interviews with professional data scientists focusing on various topical areas. At the end of each module there is a multiple-choice quiz to test your understanding of the subjects.

The video lectures are a recording of the course delivered to some EMC employees. This has some pros and cons. Occasionally we veer off from the lecture to a group discussion. Sometimes this is enlightening and provides a counterpoint to the formal material; however, sometimes microphones are switched off or the conversation becomes confused and off-topic (just like real life!). Overall this worked pretty well and made it easier to watch.

The labs are more problematic. You get the same labs as delivered in the classroom course; however, you simply get to watch a Camtasia Studio recording of the lab with a voiceover by one of the presenters. Clearly the main benefit of labs is to enable people to experience the software hands-on, an essential part of learning practical skills. Most of the labs use either the open source R software or EMC’s own Greenplum, which is available as a community software download. There is nothing to stop you from downloading your own copies of these pieces of software, and in fact that is what I did with R. However, many of the labs assume there are certain sets of data available on the system; in some cases these are CSV files, which are actually provided with the course. The relational tables used in Greenplum, however, are not provided. It would have been nice if a dump of the relational tables had been provided on the DVD. A more ambitious idea would have been to provide some sort of online virtual machine in which subscribers to the course could run the labs.

Since the lab guide was provided I was able in many cases to follow the labs exactly, where the data was provided, or something close to it by generating my own data. I also used an existing Postgres database as a substitute for some of the Greenplum work. However I didn’t have time to get MADLib extensions working in Postgres (these come as part of out-of-the-box Greenplum). This is unfortunate as clearly one of the things that EMC/Pivotal/Greenplum would like is for more people to use MADLib. By the way, if you didn’t know, MADLib is a way of running advanced analytics in-database with the possibility of using Massively Parallel Processing to speed delivery of results.

The first couple of modules are of a high-level nature, aimed more at Project Manager or Business Analyst type people. The presenter, David Dietrich, is clearly very comfortable with this material and appears to have had considerable experience at the business end of analytics projects. The material centres around a 6-step, iterative analytics methodology which seemed very sensible to me and would be a good framework for many analytics projects. It emphasises that much of the work will go into the early Discovery phase (i.e. the ‘What the hell are we actually doing?’ phase) and particularly the Data Preparation phase (the unsexy bit of data projects). All in all this seemed both sensible and easy material.

Things start getting technical in Module 3 which provides background technicals on statistical theory and R, the open-source statistics software. The course assumes a certain level of statistical background and programming ability and if you don’t have that this is where you might start to struggle. As an experienced programmer I found R no problem at all and thoroughly enjoyed both the programming and the statistics.

The real meat of the course is Modules 4 and 5. Module 4 is a big beast as it dives into a number of machine learning algorithms: K-means clustering, Apriori association rules, linear and logistic regression, Naive Bayes and Decision Trees. Throw in some introductory Text Analysis and you have a massive subject base to cover. This particular part of the course is exceptionally well-written and pretty well presented. I’m not saying it’s perfect but it is hard to overstate how difficult it is to cover all this material effectively in a relatively short space of time. Each of these algorithms is presented with use-cases, some theoretical background and insight, pros and cons, and a lab.

It should be acknowledged that analytics and big data projects require a considerable range of skills and this course provides a broad-brush overview of some of the more common techniques. Clearly you wouldn’t expect participation in this course to make you an expert Data Scientist any more than you would employ someone to program in Java or C just based on courses and exams taken. I certainly wouldn’t let someone loose to administer a production Documentum system without being very sure they had the tough experience to back up the certificates. Somewhere in the introduction to this course they make clear that the aim is to enable you to become an effective participant in a big data analytics project; not necessarily as a data scientist but as someone who needs to understand both the process and the technicals. As far as this is the aim I think it is well met in Module 4.

Module 5 is an introduction to big data processing, in particular Hadoop and MADLib. I just want to make 1 point here. This is very much an overview and it is clear that the stance taken by the course is that a Data Scientist would be very concerned with technical details about which analytics methods to use and evaluate (the subject of module 4), however the processing side is just something that they need to be aware of. I suspect in real-life that this dichotomy is nowhere near as clear-cut.

Finally Module 6 is back to the high-level stuff of modules 1 and 2. Some useful stuff about how to write reports for project sponsors and other non-Data Scientists and dos and don’ts of diagrams and visualisations. If this all seems a bit obvious it’s amazing how often this is done badly. As the presenter points out it’s no good spending tons of time and effort producing great analytics if you aren’t able to effectively convince your stakeholders of your results and recommendations. This is so true. The big takeaways: don’t use 3D charts, and pie charts are usually a waste of ink (or screen real estate).

If I have one major complaint about the content it is that Feature Selection is not covered in any depth. It’s certainly there in places in module 4 but given that coming up with the right features to model on can have a huge impact on the predictive power of your model there is a case for specific focus.

So overall I think this was a worthwhile course as long as you don’t have unrealistic expectations of what you will achieve. Furthermore if you want to get full value from the labs you are going to have to invest some effort in installing software (R and Greenplum/Postgres) and ‘munging’ data sets to use.

Oh, by the way, I passed the exam!

Supporting Testing

November 30, 2012 at 10:30 am | Posted in Performance | Leave a comment

When the designers of WDK sat down to design the framework, one thing I don’t think they did was decide to make it easy to test. Anyone who has tried to design scripts for LoadRunner, JMeter or any other tool will have experienced the pain of trying to trap the right dmfRequestId, dmfSerialNum and so on. As for content transfer testing, it is really only possible with the unsupported Invoker tool that comes with the LoadRunner scripts.

So my question to IIG is: do xCP, D2 and any other new interface coming out of IIG make it easier to test?

Troubleshooting weird DCM messages

July 24, 2012 at 5:34 pm | Posted in Performance | Leave a comment

This came up on the ECN forum today and the message is so obscure (but quite common) that I thought it worth writing up the troubleshooting notes.

The original post is here. The poster was trying to create a Change Notice or Change Request in Documentum Compliance Manager (DCM) and got the following error message in a dialog box:

The System can not complete your request. The action you have chosen is no longer valid because of a change in repository

This seems to be a generic message that DCM pops up whenever an ‘onexecutiononly’ pre-condition check fails.

What’s a pre-condition?

A pre-condition is a mechanism built into Documentum WDK (the framework that DCM, Webtop, Taskspace, WebPublisher, etc. are built on) that allows menu options to be programmatically turned on, turned off, greyed out or hidden in the browser interface. To give an example, a developer may have created a component to display the contents of a folder, and for each entry there can be different menu options available such as View, Edit, Check-out, Check-in, Create PDF rendition and so on. Now if a document is not checked out it doesn’t make sense for the check-in option to be available. In fact it would be just confusing if that selection was left available (WDK applications tend to be confusing enough as it is). So a pre-condition is a piece of code that can be run for each item, returning either true or false to decide whether a menu option is available.

What’s an ‘onexecutiononly’ pre-condition?

With great power comes great responsibility! Imagine you have 100 objects in a folder and you have 40 or 50 menu options for each one (not untypical). That’s 4,000 – 5,000 pre-condition checks. If the pre-condition code just does calculations and checks based on information available or cached on the application server then generally this is not a problem and your UI should remain pretty responsive. However, if your pre-condition runs a query against the content server, however ‘fast’, or does an object fetch (e.g. using IdfSession.getObjectBy…()) then you are going to suffer some pretty sluggish UI performance.

The WDK references do warn about this in the section on pre-conditions; however, it seems that this warning was not heeded in DCM 5.3 (naughty EMC). Generally navigating around DCM 5.3 is pretty miserable for most production users and the best that can be suggested is to upgrade to DCM 6.x (by the way, if you absolutely have to stay on DCM 5.x but can bear some development and testing effort to alleviate the pain then there are some code-based possibilities). WDK 6 introduced a new pre-condition setting – onexecutiononly – which was taken up by the DCM developers to ‘fix’ the performance problems they had introduced.

‘onexecutiononly’ means that the pre-condition is not evaluated when the list of objects is rendered onto the screen but only when the user selects the menu option in the user interface. As a result you no longer have 1000s of pre-conditions running when rendering the interface. Of course in a way this rather ‘neuters’ the power of the pre-condition because now we could have, for instance, check-in available for documents that aren’t checked out. If we try to check in the document the pre-condition will return false and we will get a warning message on the screen, typically like the one the poster saw when trying to create a change notice or change request. In that particular case there are likely to be some checks in the pre-condition code for a newchangerequest or newchangenotice action and they have ‘failed’. At the time of writing the problem hadn’t been fully resolved so I’ll update this entry if any new information comes to light.
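The trade-off between evaluating pre-conditions at render time and deferring them can be sketched with a toy model. This is plain Python and not WDK code: render, execute and the dictionaries of actions are hypothetical stand-ins, and the failure string only loosely mimics the dialog text quoted earlier.

```python
from typing import Callable, Dict, List, Tuple

# A pre-condition takes an object and says whether the action applies to it.
Precondition = Callable[[dict], bool]


def render(
    objects: List[dict], actions: Dict[str, Precondition], on_execution_only: bool
) -> Tuple[List[List[str]], int]:
    """Build the list of enabled menu options per object and count the checks run."""
    checks_run = 0
    menus = []
    for obj in objects:
        enabled = []
        for name, check in actions.items():
            if on_execution_only:
                enabled.append(name)  # shown unconditionally; check is deferred
            else:
                checks_run += 1
                if check(obj):
                    enabled.append(name)
        menus.append(enabled)
    return menus, checks_run


def execute(obj: dict, name: str, actions: Dict[str, Precondition]) -> str:
    """Run the deferred check at the moment the user picks the menu option."""
    if not actions[name](obj):
        return "The action you have chosen is no longer valid"
    return "ok"


if __name__ == "__main__":
    objects = [{"checked_out": False} for _ in range(100)]
    actions: Dict[str, Precondition] = {f"action{i}": (lambda o: True) for i in range(49)}
    actions["checkin"] = lambda o: o["checked_out"]

    _, eager = render(objects, actions, on_execution_only=False)
    menus, deferred = render(objects, actions, on_execution_only=True)
    print(eager, deferred)                          # checks run at render time
    print(execute(objects[0], "checkin", actions))  # deferred check fails here
```

With 100 objects and 50 actions, the eager path runs 5,000 checks during rendering and the deferred path runs none, at the price of offering check-in on a document that is not checked out.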
