Taking the EMC Data Science associate certification

May 13, 2013 at 10:06 am | Posted in Big Data, Performance | 12 Comments

Over the last couple of weeks I’ve been studying for the EMC Data Science Associate certification. There are a number of ways of studying for this certificate, but I chose the virtual learning option, which comes as a DVD that installs on a Windows PC (yes, Macs are no good!).

The course consists of six modules and is derived from the classroom-based delivery of the course. Each module is dedicated to a particular aspect of data science and big data, and each follows a similar pattern: a number of video lectures followed by a set of lab exercises. There are also occasional short interviews with professional data scientists focusing on various topical areas. At the end of each module there is a multiple-choice quiz to test your understanding of the subjects.

The video lectures are a recording of the course being delivered to some EMC employees. This has pros and cons. Occasionally we veer off from the lecture into a group discussion. Sometimes this is enlightening and provides a counterpoint to the formal material; at other times microphones are switched off or the conversation becomes confused and off-topic (just like real life!). Overall this worked pretty well and made the lectures easier to watch.

The labs are more problematic. You get the same labs as delivered in the classroom course, but you simply get to watch a Camtasia Studio recording of the lab with a voiceover by one of the presenters. Clearly the main benefit of labs is to let people experience the software hands-on, an essential part of learning practical skills. Most of the labs use either the open-source R software or EMC’s own Greenplum, which is available as a community software download. There is nothing to stop you from downloading your own copies of these pieces of software, and in fact that is what I did with R. However, many of the labs assume that certain sets of data are available on the system. In some cases these are CSV files, which are provided with the course, but the relational tables used in Greenplum are not. It would have been nice if a dump of the relational tables had been provided on the DVD. A more ambitious idea would have been to provide some sort of online virtual machine in which subscribers to the course could run the labs.

Since the lab guide was provided, I was able in many cases to follow the labs exactly where the data was provided, or to do something close to them by generating my own data. I also used an existing Postgres database as a substitute for some of the Greenplum work. However, I didn’t have time to get the MADlib extensions working in Postgres (these come as part of out-of-the-box Greenplum). This is unfortunate, as clearly one of the things that EMC/Pivotal/Greenplum would like is for more people to use MADlib. By the way, if you didn’t know, MADlib is a way of running advanced analytics in-database, with the possibility of using Massively Parallel Processing to speed up delivery of results.
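For anyone wanting to do something similar, here is a minimal Python sketch of generating a synthetic data set, writing it out as CSV and loading it into a relational table. It is not taken from the course: the labs themselves use R and Greenplum, the in-memory SQLite database is only a stand-in for Postgres, and the table and column names (customers, income, spend) are invented for illustration.

```python
import csv
import io
import random
import sqlite3


def make_rows(n, seed=42):
    """Generate n synthetic (customer_id, income, spend) rows.

    Spend is made to loosely track income so that later analysis
    has some structure to find. All values are invented.
    """
    rng = random.Random(seed)  # fixed seed so the data is repeatable
    rows = []
    for customer_id in range(1, n + 1):
        income = round(rng.gauss(50000, 12000), 2)
        spend = round(0.1 * income + rng.gauss(0, 500), 2)
        rows.append((customer_id, income, spend))
    return rows


def rows_to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["customer_id", "income", "spend"])
    writer.writerows(rows)
    return buf.getvalue()


def load_csv(conn, csv_text):
    """Create the (hypothetical) customers table and load the CSV into it."""
    conn.execute(
        "CREATE TABLE customers ("
        "customer_id INTEGER PRIMARY KEY, income REAL, spend REAL)"
    )
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header line
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", reader)
    conn.commit()


# SQLite in memory stands in for a real Postgres/Greenplum instance here.
conn = sqlite3.connect(":memory:")
load_csv(conn, rows_to_csv(make_rows(100)))
row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

With a real Postgres instance the same CSV could be bulk-loaded instead of inserted row by row; the point is only that a seeded generator gives you repeatable data to run lab-style queries against.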

The first couple of modules are of a high-level nature, aimed more at Project Manager or Business Analyst types. The presenter, David Dietrich, is clearly very comfortable with this material and appears to have had considerable experience at the business end of analytics projects. The material centres around a six-step, iterative analytics methodology which seemed very sensible to me and would be a good framework for many analytics projects. It emphasises that much of the work will go into the early Discovery phase (i.e. the ‘What the hell are we actually doing?’ phase) and particularly into Data Preparation (the unsexy bit of data projects). All in all this seemed both sensible and easy material.

Things start getting technical in Module 3, which provides technical background on statistical theory and R, the open-source statistics software. The course assumes a certain level of statistical background and programming ability, and if you don’t have that this is where you might start to struggle. As an experienced programmer I found R no problem at all and thoroughly enjoyed both the programming and the statistics.

The real meat of the course is Modules 4 and 5. Module 4 is a big beast as it dives into a number of machine learning algorithms: K-means clustering, Apriori association rules, linear and logistic regression, Naive Bayes and Decision Trees. Throw in some introductory Text Analysis and you have a massive subject base to cover. This particular part of the course is exceptionally well-written and pretty well presented. I’m not saying it’s perfect, but it is hard to overstate how difficult it is to cover all this material effectively in a relatively short space of time. Each of these algorithms is presented with use-cases, some theoretical background and insight, pros and cons, and a lab.
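To give a flavour of the first algorithm on that list, here is a minimal K-means sketch in plain Python. It is not the course’s lab code (the labs use R and Greenplum); the function names, the toy points and the choice of random initial centroids are all assumptions made for illustration.

```python
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, k, iterations=50, seed=0):
    """Cluster points (tuples of floats) into k groups.

    Returns (centroids, labels), where labels[i] is the index of the
    centroid that points[i] is assigned to.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep a centroid that lost all its points
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Two obviously separate toy blobs (invented data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

On data this well separated the two centroids settle on the blob means whatever the starting points. On real data the result depends on the initialisation and on the choice of k, which is one of the cons usually listed for the method.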

It should be acknowledged that analytics and big data projects require a considerable range of skills, and this course provides a broad-brush overview of some of the more common techniques. Clearly you wouldn’t expect participation in this course to make you an expert Data Scientist, any more than you would employ someone to program in Java or C just on the basis of courses and exams taken. I certainly wouldn’t let someone loose to administer a production Documentum system without being very sure they had the tough experience to back up the certificates. Somewhere in the introduction to this course they make clear that the aim is to enable you to become an effective participant in a big data analytics project; not necessarily as a data scientist but as someone who needs to understand both the process and the technical details. Insofar as this is the aim, I think it is well met in Module 4.

Module 5 is an introduction to big data processing, in particular Hadoop and MADlib. I just want to make one point here. This is very much an overview, and it is clear that the stance taken by the course is that a Data Scientist would be very concerned with the technical details of which analytics methods to use and evaluate (the subject of Module 4), whereas the processing side is just something they need to be aware of. I suspect that in real life this dichotomy is nowhere near as clear-cut.

Finally, Module 6 is back to the high-level stuff of Modules 1 and 2. There is some useful stuff about how to write reports for project sponsors and other non-Data Scientists, and about the dos and don’ts of diagrams and visualisations. If this all seems a bit obvious, it’s amazing how often it is done badly. As the presenter points out, it’s no good spending tons of time and effort producing great analytics if you aren’t able to convince your stakeholders of your results and recommendations. This is so true. The big takeaways: don’t use 3D charts, and pie charts are usually a waste of ink (or screen real estate).

If I have one major complaint about the content it is that Feature Selection is not covered in any depth. It’s certainly there in places in module 4 but given that coming up with the right features to model on can have a huge impact on the predictive power of your model there is a case for specific focus.
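As a rough illustration of what even a very simple form of feature selection looks like, the Python sketch below ranks candidate features by the absolute value of their Pearson correlation with the target. It is only one basic filter-style approach, not something taken from the course, and the feature names and numbers are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence is constant, since the
    correlation is undefined in that case.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Return (name, |correlation with target|) pairs, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented toy data: one strong feature, one noisy one, one useless one.
target = [1, 2, 3, 4, 5]
features = {
    "signal": [2, 4, 6, 8, 10],
    "noisy": [1, 3, 2, 5, 4],
    "constant": [7, 7, 7, 7, 7],
}
ranking = rank_features(features, target)
```

A univariate filter like this ignores interactions between features and any non-linear relationship with the target, so it is a starting point for discussion, not a recommendation.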

So overall I think this was a worthwhile course as long as you don’t have unrealistic expectations of what you will achieve. Furthermore if you want to get full value from the labs you are going to have to invest some effort in installing software (R and Greenplum/Postgres) and ‘munging’ data sets to use.

Oh, by the way, I passed the exam!

12 Comments »

  1. Congrats Robin! Thanks for sharing your experience with this course.

  2. Congrats Robin. Thanks for sharing. I am planning to invest in this course. Will I need any preparation outside of the material covered in the course to pass the exam? How close was the practice test to the real one?

    • The course ought to give you the requisite practice to pass the exam. The sample exam gives a good indication of the type of questions you will get. Getting some additional practice in won’t be a bad thing, though; sites like Kaggle have plenty of competitions, old and new, that give you data to work with. As I pointed out in the post, the Data Science certificate is not just about the technical side but also about some of the softer skills of being involved in a data science project.

      Good luck!

  3. Robin,

    First off, congratulations on becoming an Associate Data Scientist.

    Well, this certainly matches my own view of the course.

    I have completed the tutorials up to Module 5:
    The best modules in terms of content & learning are 3 & 4.
    Modules 1 & 2 cover the theoretical aspects of data science, which was still fine as it gives an introduction to data analytics and its life cycle.

    What bothers me the most is Module 5: it provides some basic highlights of Hadoop & MapReduce but lacks an actual hands-on practice session (there were some labs, but I didn’t find them relevant).

    Robin, I would also like your view on the importance of the content and the level of the questions from Module 5. Were the questions restricted to the same level (the gist of Hadoop & MapReduce) as the advanced analytics tools & techniques material seen in the module?
    Moreover, can you give me the link to the sample exam you mentioned in the previous comment, and to any more practice material?

    Lastly, please share the sources for the data sets & SQL tables used during the labs, if you are acquainted with them.

    Thanks

    – Aman Vij

    • Yes module 5 was very high-level in content and the questions matched that. Basically if you are looking to learn in-depth Hadoop/MapReduce this isn’t the course to do it.

      Practice exams are here: https://education.emc.com/guest/certification/exams.aspx

      Much of the data for the courses is on the DVD – not all of it but some of it in CSV files.

      Good luck
      Robin

  4. Thanks a lot for this valuable information.
    I would be really grateful if you could tell me of any other source of practice for the examination apart from the EMC mock exam.
    I have one more question: in your personal opinion, how were the questions distributed across the modules? It seems like Module 3 or 4 would have the majority of the questions. Was that true?

    Thanks again for your help!

    I have scheduled my exam for 6th July 2013.

    • As far as practice goes that’s it! Would be great if EMC could cobble together more than 1 practice exam.

      I don’t recall the exact proportions but Modules 3 and 4 definitely provided the most questions. I guess it’s pretty much in proportion to how long each of the modules was.

  5. Hey Robin,

    First, congratulations on the certification. This review is very helpful in giving somebody a good start on their preparation. I am also pursuing the same certification. Is it possible for you to share the CSV files which are used in the labs?

    Thanks
    D soni

  6. Congrats Robin. Which topics in statistics should I know before I start studying for the exam?

    • It’s been a while since I did the course. From recollection, you need to know about probability distributions in concept, plus the normal distribution in general, and the standard stuff about mean, variance and standard deviation: what you might typically cover in, say, a first-year undergraduate course or even high school stats.

      • Thank you for replying. Is there a place where I can find which topics in statistics I should know before I take this course? Besides statistics, what should I know? Do I have to know R and Python?
