Sandhya Prabhakaran MLConf 2016 Event Summary

Post written by Sandhya Prabhakaran:

MLConf 2016

Firstly, I would like to thank MLConf and NYC Women in Machine Learning & Data Science for giving me the opportunity to attend this informative, educational and cutting-edge conference. I would also like to thank all the conference sponsors who made this event possible: http://mlconf.com/events/new-york-city-ny/.

The first talk was by Edo Liberty, Research Director at Yahoo Labs, on Online Data Mining: PCA and K-Means. Traditional PCA and k-means work on batch data, meaning the data is fixed, resides in one place and never changes during the execution of the algorithm. With online data, the scenario is different: the data resides on multiple servers (distributed data) and changes over time (streaming data). Edo presents a distributed, streaming computational model for online data that also deals with on-the-fly optimisation of a cost function. He cites the ski rental problem. Say a daily ski rental costs $100 and buying skis costs $1,000. A newcomer may want to rent a few times before buying their own skis. The question is how many times one should rent so as not to incur huge losses in the long run. Solutions are a) buy on day 1 itself (total cost $1,000) or b) rent 10 times and then buy (total cost 10 × $100 + $1,000 = $2,000; the worst case). There could also be a situation where you buy skis, then meet with an accident and can no longer ski. The cost function should deal with such future (unknown) events, and still do so on the fly. Examples of distributed streaming data where such look-ahead optimisation is crucial are online portfolio management, online advertising (catered to every user), the Yahoo Finance app and the Yahoo News app (where information is mashed up for every user).
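As a quick illustration of the break-even strategy for the ski rental problem (using the prices from the talk; the code itself is my own sketch, not Edo's), renting until the rent paid equals the purchase price and then buying never costs more than twice what an all-knowing offline strategy would pay:

```python
# Ski rental: rent for $100/day or buy outright for $1,000.
# Break-even strategy: rent until total rent paid equals the purchase
# price, then buy. Worst case costs at most 2x the offline optimum.

RENT, BUY = 100, 1000

def online_cost(days_skied, buy_after=BUY // RENT):
    """Cost of renting `buy_after` times, then buying."""
    if days_skied <= buy_after:
        return days_skied * RENT
    return buy_after * RENT + BUY

def offline_cost(days_skied):
    """Cost if the number of ski days were known in advance."""
    return min(days_skied * RENT, BUY)

for days in (3, 10, 11, 50):
    print(days, online_cost(days), offline_cost(days))
```

Skiing 11 or more days is the worst case: $2,000 spent against an offline optimum of $1,000, a competitive ratio of exactly 2.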

The second talk was by Samantha Kleinberg, Assistant Professor of Computer Science, Stevens Institute of Technology, on Causal Inference and Explanation to Improve Human Health. A huge amount of data is being collected by body-worn sensors, ICU data streams and electronic health records. Most ML methods find only a correlation between factors, which Samantha rightfully says is not sufficient, even more so when the data are noisy and have missing values. She goes the extra mile to identify causal factors: what caused factor A — was it factor B alone, or factors B and C together? For example, we know that smoking and lung cancer are positively correlated. The causality-based question would be: how long must one smoke to get lung cancer? This poses great challenges in data generation, data collection and interpretation.

The third talk was by Braxton McKee, CEO & Founder, Ufora, on Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code. Braxton's take was on scalable ML code. A programmer writes and rewrites code to suit high-dimensional data. Most of the time, the logic in these multiple code versions is the same; the extra lines of code exist only to make the software work on scalable frameworks like Theano and TensorFlow. Braxton argues that there is enough information in the source code itself to enable scalability, and that one need not rewrite it. Pyfora from Ufora is a scalable implementation of Python which accepts the original, "grass-fed" code within pre-written Pyfora blocks that take care of scalability. The programmer need not worry about calling libraries, and there are no complex frameworks or APIs. Certainly something worth considering for scaling ML code.

The fourth talk was by Geetu Ambwani, Principal Data Scientist, Huffington Post, on Data Science in the Newsroom. Geetu introduces the ML challenges faced at the Huffington Post, the biggest social publisher and a pioneer of blogging and its aggregation via massive blogging networks. Digital media data processing consists of content creation, followed by content distribution and content consumption. Content creation deals with tools to discover trends and to optimise headlines and images. Content consumption deals with tools for better recommendations to a user, whether the news is read on a mobile or a desktop, etc. Content distribution is where ML chips in. Given the various distribution platforms such as Facebook Instant Articles, Google AMP and Apple News, the publisher has no control over where the data (news/ads) gets published. The challenge is therefore to know what to publish on which platform, and when, so as to maximise readership, advertisement viewing, etc. This requires predicting upfront which articles will gain traction in the future and when such a prediction must occur. It is a similar problem to the online streaming setting Edo Liberty explained with his ski rental example.

The fifth talk was by Soumith Chintala, Artificial Intelligence Research Engineer, Facebook, on Predicting the Future Using Deep Adversarial Networks: Learning With No Labeled Data. Soumith begins his talk by demonstrating how one might solve a route-finding problem: if one is at 42nd Street and 9th Avenue and needs to get to 49th Street and 7th Avenue, a linear regression minimising squared error would return the diagonal route, i.e. one that goes right through the buildings. Soumith introduces Generative Adversarial Networks (GANs), an unsupervised technique that learns a loss function suited to the problem at hand by generating data-driven distributions. These algorithms can be used in scenarios where there is unlimited unlabelled data and very limited labelled data, as is the case in many applications.
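To make the adversarial setup concrete, here is a small NumPy sketch (my own illustration, not code from the talk) of the two losses a GAN trains against; the discriminator logits for real and generated samples are hypothetical numbers chosen for the example:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical discriminator scores (logits): positive = "looks real".
real_logits = np.array([2.0, 1.5, 3.0])    # D is confident these are real
fake_logits = np.array([-1.0, -2.0, 0.5])  # D mostly flags these as fake

# Discriminator minimises -[log D(x) + log(1 - D(G(z)))]:
# be right about both real and generated samples.
d_loss = -(np.log(sigmoid(real_logits)).mean()
           + np.log(1.0 - sigmoid(fake_logits)).mean())

# Generator (non-saturating form) minimises -log D(G(z)):
# fool the discriminator into scoring fakes as real.
g_loss = -np.log(sigmoid(fake_logits)).mean()
```

Training alternates gradient steps on these two losses; the generator's loss is high here because the discriminator is still winning on the fake samples.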

The next talk was by Lei Yang, Senior Engineering Manager, Quora, on Sharing and Growing the World's Knowledge with Machine Learning. Lei's talk opens with Quora's mission to 'share and grow the world's knowledge'. At Quora, quality, demand and relevance are all valued with equal importance. Data@Quora is big and 'rich': it consists of topics, users and Q&A, which form the nodes of a graph, with user actions captured by the edges. This is a huge knowledge base layered with social networks and a topic ontology. ML@Quora deals with algorithms for ranking feeds and answers and for recommending users and topics to individual users. Lei also stressed the difference between similarity of topics and 'interestingness': what other topics/questions could be interesting given a user's interest in a particular topic/question. Given that Quora makes use of a wide variety of ML algorithms, ranging from regression to deep learning, they have an in-house ML platform in Python and C++. QMF, their matrix factorisation library, can be found here.
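As a toy illustration of the kind of matrix factorisation a library like QMF implements (this sketch is my own, not Quora's code), here is a minimal SGD factoriser in NumPy over a hypothetical user–item interaction matrix:

```python
import numpy as np

# Toy user-item interactions (e.g. answer upvotes); 0 = unobserved.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)

rng = np.random.default_rng(0)
k = 2                                            # latent factors
U = 0.1 * rng.standard_normal((R.shape[0], k))   # user factors
V = 0.1 * rng.standard_normal((R.shape[1], k))   # item factors

lr, reg = 0.01, 0.02
observed = [(i, j) for i in range(R.shape[0])
            for j in range(R.shape[1]) if R[i, j] > 0]

# SGD on the regularised squared error of observed entries only.
for _ in range(2000):
    for i, j in observed:
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * U[i] - reg * V[j])

rmse = np.sqrt(np.mean([(R[i, j] - U[i] @ V[j]) ** 2
                        for i, j in observed]))
```

The learned factors then score the unobserved (user, item) pairs — the basis of the topic and user recommendations Lei described.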

The next talk was by Furong Huang, Ph.D. Candidate, UC Irvine (winner of the MLconf Industry Impact Student Research Award), on Discovery of Latent Factors in High-dimensional Data Using Tensor Methods. A vector is a 1-dimensional object, a matrix is a 2-dimensional object and a tensor is a 3- or higher-dimensional object (visualise a Rubik's cube). A tensor captures third-order moments, i.e. triple-wise relationships amongst objects. In her research, Furong uses tensor decomposition to uncover hidden variables in high-dimensional space. Tensor decomposition is further shown to provide more stable inference, and can be used in place of MCMC sampling, which can be slow, or Variational Inference, which operates over a non-convex likelihood and can be unstable. Both MCMC and VI are common techniques for inferring model parameters in a Bayesian probabilistic setting.
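The third-order moment tensor behind these methods can be formed directly in NumPy; the data here is synthetic and only illustrates the shape of the object:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))  # 500 samples in 3 dimensions

# Empirical third-order moment tensor M3[a, b, c] = E[x_a * x_b * x_c]:
# a d x d x d array capturing the triple-wise relationships in the text.
M3 = np.einsum('na,nb,nc->abc', X, X, X) / X.shape[0]

print(M3.shape)  # a 3-d "Rubik's cube" of moments
```

Tensor decomposition methods factor M3 (after whitening) into rank-one components whose directions recover the latent factors.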

The next talk was by Jennifer Marsman, Principal Developer Evangelist, Microsoft, on Using EEG and Azure Machine Learning to Perform Lie Detection. Jennifer showed how her lie detector works at runtime. For this she used the EPOC headset from Emotiv (a female-run company) to capture EEG brain waves from participants. A set of yes-no questions was asked of participants, from which Jennifer created her labelled training set. This was then used as input to a binary classifier that predicted whether a particular answer was a lie or not. She demonstrated the lie detector on Microsoft's Azure ML platform. Download the MS Azure ML algorithm cheat sheet.
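The pipeline Jennifer describes — labelled feature vectors fed to a binary classifier — can be sketched with a plain logistic regression in NumPy; the "EEG features" below are synthetic stand-ins, not real data or her Azure ML model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for per-answer EEG feature vectors: class 1 ("lie")
# is drawn with a shifted mean relative to class 0 ("truth").
n, d = 200, 4
X = np.vstack([rng.standard_normal((n, d)),
               rng.standard_normal((n, d)) + 1.0])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Logistic regression trained by full-batch gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(lie)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == y)        # training accuracy
```

Any binary classifier from the Azure ML palette could take the place of this hand-rolled one; the structure of the problem is the same.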

The next talk was by Mathias Brandewinder, Software Engineer & Data Scientist, Clear Lines Consulting, on Scripts that Scale with F# and mbrace.io. Most interactive scripting languages tend to get slow when processing large amounts of data. Mathias introduces Microsoft's F#, an open-source language in the ML (Meta Language) family. F# brings together the best of Scala and Python. F#, along with mbrace.io, enables scalable cloud programming, for example on the Azure cloud.

The next talk was by Sergei Vassilvitskii, Research Scientist, Google, on Teaching K-Means New Tricks. Sergei delves into how to make good old k-means run on large datasets. He introduces k-means++, which does not rely purely on random initialisation of the cluster centres: instead, one starts by choosing k* centres (where k* > k), clusters, and then reclusters those clusters down to k. There is a tradeoff between memory and the number of iterations, but k-means++ still offers significantly more power than naive k-means.
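The classic k-means++ seeding step can be sketched as follows; this is standard D²-sampling (each new centre drawn with probability proportional to its squared distance from the nearest centre so far), not necessarily the exact oversampling variant Sergei presented:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: sample each new centre with probability
    proportional to squared distance from the nearest existing centre."""
    centres = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest centre so far.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centres], axis=0)
        centres.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centres)

rng = np.random.default_rng(0)
# Three well-separated blobs; D^2-sampling tends to seed one centre each.
X = np.vstack([rng.standard_normal((50, 2)) + offset
               for offset in ([0, 0], [10, 0], [0, 10])])
centres = kmeans_pp_init(X, 3, rng)
```

Lloyd's iterations then start from these seeds instead of uniformly random ones, which is where the quality guarantee over naive k-means comes from.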