Last night I went to a really interesting Scala BASE meetup at LinkedIn. The topic was Scalable and Flexible Machine Learning With Scala (slides). The large room was packed with over 300 people! I was sitting down at the front, which made it difficult to see the audience, but my impression from the audience polling was that most folks were developing in Java, many were using Scala, and lots were doing machine learning.
The speakers were Chris Severs from eBay and Vitaly Gordon from LinkedIn. The talk started with a characterization of the different types of data scientists and the pros/cons of the way each writes map/reduce jobs:
- Mixer – uses whatever combination of tools works, such as a mix of Apache Pig and Python: a Python script that executes an Apache Pig script that invokes a user-defined function written in Python.
- Single language expert – uses the one tool they are familiar with, e.g. Perl scripts + Hadoop Streaming.
- Craftsman – writes Map/Reduce jobs in Java and specifies low-level configuration details.
Next, the speakers claimed that we need a better, more pragmatic approach: a “five tool tool” (see Five Tool Player) that is agile, productive, correct, scalable, and simple. Their candidate is Scalding, a Scala-based DSL developed by Twitter for writing Map/Reduce jobs in a familiar functional style. Looks impressive!
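To give a flavour of that functional style, here is a sketch along the lines of Scalding's canonical word-count example (the `input`/`output` argument names are the usual placeholders, not anything from the talk):

```scala
import com.twitter.scalding._

// Counts word occurrences in a text file; each transformation maps
// onto a Map/Reduce phase, but reads like ordinary collection code.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Split on whitespace, dropping punctuation.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
}
```

The appeal is that the same `flatMap`/`groupBy` pipeline you would write over an in-memory `List` becomes a distributed job, with Scalding handling the Hadoop plumbing.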
The final part of the talk consisted of some machine learning examples:
- Using a decision tree to estimate the insurance risk of Titanic 2 passengers (toy sketch after this list)
- Using streaming K-means clustering (coming in Mahout 0.8) to analyze eBay data (sketch of the idea below)
- Using the PageRank algorithm to analyze LinkedIn endorsements to rank Scala expertise in the Bay Area and amongst attendees of the meetup (sketch below)
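For the decision-tree example, here is a toy one-level tree (a decision stump) in plain Scala. The passenger features, data, and candidate splits are entirely made up for illustration; the talk presumably used a real learner rather than this hand-rolled rule:

```scala
// A decision stump: pick the single yes/no test that best predicts survival.
case class Passenger(age: Double, inFirstClass: Boolean, survived: Boolean)

object StumpDemo extends App {
  val data = Seq(
    Passenger(6,  inFirstClass = true,  survived = true),
    Passenger(30, inFirstClass = false, survived = false),
    Passenger(45, inFirstClass = true,  survived = true),
    Passenger(22, inFirstClass = false, survived = false)
  )

  // Fraction of passengers a candidate rule classifies correctly.
  def accuracy(pred: Passenger => Boolean): Double =
    data.count(p => pred(p) == p.survived).toDouble / data.size

  val splits: Seq[(String, Passenger => Boolean)] = Seq(
    "first class" -> ((p: Passenger) => p.inFirstClass),
    "age < 18"    -> ((p: Passenger) => p.age < 18)
  )
  val (bestName, bestPred) = splits.maxBy { case (_, f) => accuracy(f) }
  println(s"best split: $bestName (accuracy ${accuracy(bestPred)})")
}
```

A full decision tree just applies this split selection recursively to each resulting subset.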
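For the clustering example, this is the core idea of single-pass (streaming) K-means: touch each point once, assign it to the nearest centroid, and nudge that centroid toward it. It is emphatically not the Mahout 0.8 implementation the speakers demoed, just a minimal sketch of the concept:

```scala
// One-pass K-means: suitable when the data is too big to iterate over repeatedly.
object StreamingKMeansSketch extends App {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def cluster(stream: Iterator[Point], k: Int): Seq[Point] = {
    val centroids = scala.collection.mutable.ArrayBuffer.empty[Point]
    val counts    = scala.collection.mutable.ArrayBuffer.empty[Long]
    for (p <- stream) {
      if (centroids.size < k) { centroids += p.clone(); counts += 1L }
      else {
        // Assign to the nearest centroid and move it toward the point,
        // with a step size that shrinks as the cluster grows.
        val i = centroids.indices.minBy(j => dist(p, centroids(j)))
        counts(i) += 1
        val lr = 1.0 / counts(i)
        centroids(i) = centroids(i).zip(p).map { case (c, x) => c + lr * (x - c) }
      }
    }
    centroids.toSeq
  }

  val points = Iterator(Array(0.0, 0.1), Array(0.2, 0.0), Array(5.0, 5.1), Array(5.2, 4.9))
  cluster(points, k = 2).foreach(c => println(c.mkString("(", ", ", ")")))
}
```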
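And for the PageRank example, here is a minimal power-iteration PageRank over a hypothetical endorsement graph, where an edge a -> b means "a endorses b for Scala". The names, damping factor choice, and iteration count are illustrative; the speakers' actual pipeline ran over real LinkedIn data:

```scala
object EndorsementRank extends App {
  // Hypothetical endorsements: who vouches for whose Scala skills.
  val edges  = Seq("alice" -> "bob", "carol" -> "bob", "bob" -> "carol", "alice" -> "carol")
  val nodes  = (edges.map(_._1) ++ edges.map(_._2)).distinct
  val outDeg = edges.groupBy(_._1).map { case (n, es) => n -> es.size }
  val d = 0.85 // the standard damping factor

  var rank = nodes.map(n => n -> 1.0 / nodes.size).toMap
  for (_ <- 1 to 20) { // fixed iteration count instead of a convergence test
    // Each node splits its rank evenly across its outgoing endorsements.
    val contrib = edges
      .map { case (src, dst) => dst -> rank(src) / outDeg(src) }
      .groupBy(_._1)
      .map { case (n, cs) => n -> cs.map(_._2).sum }
    rank = nodes.map(n => n -> ((1 - d) / nodes.size + d * contrib.getOrElse(n, 0.0))).toMap
  }
  rank.toSeq.sortBy(-_._2).foreach { case (n, r) => println(f"$n%-6s $r%.3f") }
}
```

The intuition is the same as for web pages: an endorsement counts for more when it comes from someone who is themselves highly endorsed.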
The take-away from the examples was that:
- many libraries are available in the Java/Scala ecosystem, and that
- unlike Python, Java/Scala can scale beyond a single machine by leveraging Hadoop.
Seems plausible, but would anyone else care to comment?