Last night I went to a really interesting Scala BASE meetup at LinkedIn. The topic was Scalable and Flexible Machine Learning With Scala (slides). The large room was packed with over 300 people! I was sitting down at the front, which made it difficult to see the audience, but my impression from the audience polling was that most folks were developing in Java, many were using Scala, and lots were doing machine learning.
The speakers were Chris Severs from eBay and Vitaly Gordon from LinkedIn. The talk started with a characterization of the different types of data scientists and the pros/cons of the way each writes map/reduce jobs:
- Mixer – uses whatever combination of tools works, such as a mix of Apache Pig and Python: a Python script that executes an Apache Pig script that invokes a user-defined function written in Python.
- Single language expert – uses the one tool they are familiar with, e.g. Perl scripts + Hadoop Streaming.
- Craftsman – writes Map/Reduce jobs in Java and specifies low-level configuration details.
Next, the speakers claimed that we need a better, more pragmatic approach: a “five tool tool” (see Five Tool Player) that is agile, productive, correct, scalable, and simple. Their candidate is Scalding, a Scala-based DSL developed by Twitter for writing Map/Reduce jobs in a familiar functional style. Looks impressive!
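To give a flavour of that functional style, here is a sketch along the lines of Scalding's canonical word-count example (the `input`/`output` argument names are the usual placeholders, not anything from the talk):

```scala
import com.twitter.scalding._

// Counts word occurrences in a text file; each transformation maps
// onto a Map/Reduce phase, but reads like ordinary collection code.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Split on whitespace, dropping punctuation.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
}
```

The appeal is that the same `flatMap`/`groupBy` pipeline you would write over an in-memory `List` becomes a distributed job, with Scalding handling the Hadoop plumbing.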
The final part of the talk consisted of some machine learning examples:
- Using a decision tree to estimate the insurance risk of Titanic 2 passengers (toy sketch after this list)
- Using streaming K-means clustering (coming in Mahout 0.8) to analyze eBay data (sketch of the idea below)
- Using the PageRank algorithm to analyze LinkedIn endorsements to rank Scala expertise in the Bay Area and amongst attendees of the meetup (sketch below)
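For the decision-tree example, here is a toy one-level tree (a decision stump) in plain Scala. The passenger features, data, and candidate splits are entirely made up for illustration; the talk presumably used a real learner rather than this hand-rolled rule:

```scala
// A decision stump: pick the single yes/no test that best predicts survival.
case class Passenger(age: Double, inFirstClass: Boolean, survived: Boolean)

object StumpDemo extends App {
  val data = Seq(
    Passenger(6,  inFirstClass = true,  survived = true),
    Passenger(30, inFirstClass = false, survived = false),
    Passenger(45, inFirstClass = true,  survived = true),
    Passenger(22, inFirstClass = false, survived = false)
  )

  // Fraction of passengers a candidate rule classifies correctly.
  def accuracy(pred: Passenger => Boolean): Double =
    data.count(p => pred(p) == p.survived).toDouble / data.size

  val splits: Seq[(String, Passenger => Boolean)] = Seq(
    "first class" -> ((p: Passenger) => p.inFirstClass),
    "age < 18"    -> ((p: Passenger) => p.age < 18)
  )
  val (bestName, bestPred) = splits.maxBy { case (_, f) => accuracy(f) }
  println(s"best split: $bestName (accuracy ${accuracy(bestPred)})")
}
```

A full decision tree just applies this split selection recursively to each resulting subset.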
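For the clustering example, this is the core idea of single-pass (streaming) K-means: touch each point once, assign it to the nearest centroid, and nudge that centroid toward it. It is emphatically not the Mahout 0.8 implementation the speakers demoed, just a minimal sketch of the concept:

```scala
// One-pass K-means: suitable when the data is too big to iterate over repeatedly.
object StreamingKMeansSketch extends App {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  def cluster(stream: Iterator[Point], k: Int): Seq[Point] = {
    val centroids = scala.collection.mutable.ArrayBuffer.empty[Point]
    val counts    = scala.collection.mutable.ArrayBuffer.empty[Long]
    for (p <- stream) {
      if (centroids.size < k) { centroids += p.clone(); counts += 1L }
      else {
        // Assign to the nearest centroid and move it toward the point,
        // with a step size that shrinks as the cluster grows.
        val i = centroids.indices.minBy(j => dist(p, centroids(j)))
        counts(i) += 1
        val lr = 1.0 / counts(i)
        centroids(i) = centroids(i).zip(p).map { case (c, x) => c + lr * (x - c) }
      }
    }
    centroids.toSeq
  }

  val points = Iterator(Array(0.0, 0.1), Array(0.2, 0.0), Array(5.0, 5.1), Array(5.2, 4.9))
  cluster(points, k = 2).foreach(c => println(c.mkString("(", ", ", ")")))
}
```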
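And for the PageRank example, here is a minimal power-iteration PageRank over a hypothetical endorsement graph, where an edge a -> b means "a endorses b for Scala". The names, damping factor choice, and iteration count are illustrative; the speakers' actual pipeline ran over real LinkedIn data:

```scala
object EndorsementRank extends App {
  // Hypothetical endorsements: who vouches for whose Scala skills.
  val edges  = Seq("alice" -> "bob", "carol" -> "bob", "bob" -> "carol", "alice" -> "carol")
  val nodes  = (edges.map(_._1) ++ edges.map(_._2)).distinct
  val outDeg = edges.groupBy(_._1).map { case (n, es) => n -> es.size }
  val d = 0.85 // the standard damping factor

  var rank = nodes.map(n => n -> 1.0 / nodes.size).toMap
  for (_ <- 1 to 20) { // fixed iteration count instead of a convergence test
    // Each node splits its rank evenly across its outgoing endorsements.
    val contrib = edges
      .map { case (src, dst) => dst -> rank(src) / outDeg(src) }
      .groupBy(_._1)
      .map { case (n, cs) => n -> cs.map(_._2).sum }
    rank = nodes.map(n => n -> ((1 - d) / nodes.size + d * contrib.getOrElse(n, 0.0))).toMap
  }
  rank.toSeq.sortBy(-_._2).foreach { case (n, r) => println(f"$n%-6s $r%.3f") }
}
```

The intuition is the same as for web pages: an endorsement counts for more when it comes from someone who is themselves highly endorsed.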
The take-away from the examples was that:
- many libraries are available in the Java/Scala ecosystem, and that
- unlike Python, Java/Scala can scale beyond a single machine by leveraging Hadoop.
Seems plausible, but would anyone else care to comment?