Carissa Shafto – Strata+Hadoop conference

Post written by Carissa Shafto (@carissa_shafto):

Thanks to a complimentary pass from WiMLDS sponsored by both O’Reilly and Cloudera, I attended Strata+Hadoop at the Javits Center in NYC. The conference included two full days of keynote speakers, talks by fellow data scientists and engineers, and product representatives sharing a multitude of tools for managing and using data. The keynotes were a highpoint of the conference for me. The conference organizers invited speakers that highlighted all the ways data can be used to address social issues such as healthcare, politics, and criminal justice.


Mike Olson (@mikeolson; Cloudera) began the line-up by announcing three projects that Cloudera has taken on for social good. First, in the next week they will release Apache Spot, a shared, open source data model for cybersecurity. Second, they have launched the Precision Medicine Initiative, which awards grants to organizations who are using data to improve healthcare. Third, they are partnering with Thorn (@thorn) to support work to stop child sex exploitation, by looking for child references in social media data.

Other keynote highlights included Susan Woodward (Sand Hill Econometrics), who presented statistics on the success of start-ups. It was disappointing to hear the difference between female-founded start-ups and male-founded ones, with exit values for female-founded start-ups at 1.5 times initial investment compared to male-founded start-ups exiting at 3 times initial investment. Sriram Vishwanath (@AccordionHealth; Accordion Health) busted myths about healthcare, as well as sharing how data and analytics can be used to reduce costs of healthcare. He focused on misperceptions regarding the Affordable Care Act. The finale was Jill Lepore (Harvard University | The New Yorker), who shared how poll data is collected and how unrepresentative current polls tend to be – in some cases representing less than 5% of the actual population. This is an important fact to keep in mind as we all prepare for the upcoming and future elections.

During lunch on Wednesday there was a meetup for the Women in Big Data Forum. The speaker for the lunch was Jill Dyche (@jilldyche; SAS). She gave a compelling presentation about the need for data and analytics in social organizations. She used dog shelters as an example of organizations that still rely on analog data, and what that means for consumers (and the dogs). I was lucky enough to win a copy of her book “The New IT.” Click here to learn more about the Women in Big Data Forum.

I attended many individual presentations on Wednesday, but want to highlight one that was particularly fascinating. June Andrews (@DrJuneAndrews; Pinterest), gave a talk titled ‘Iterative supervised clustering: A dance between data scientists and machine learning.’ She described a project in which she used a clustering algorithm to cluster pins and re-pins over the course of the first year of the item’s arrival on Pinterest. Her analysis yielded six unique clusters. However, she was curious whether other data scientists would come to the same cluster solution as her, so she had nine other Pinterest analysts perform the same clustering analysis on the same data. The idea was to better understand the data; generally it is not cost-effective to have more than one data scientist do the same analysis. Interestingly, the nine solutions did not overlap nearly as much as expected, somewhere around 50%. This lack of replication in results is similar to problems in psychology research that have made press recently and a very recent New York Times article that compared the results from five independent data analysts using Florida presidential poll data. Her take-home message was “It depends which data scientist does an analysis.” This is important because human interpretations of data are driving product decisions every day at Pinterest and at other companies.

June Andrews


Thursday morning offered up another great lineup of keynote presentations. Mar Cabra (@cabralens; International Consortium of Investigative Journalists) described the discovery of the Panama Papers and the tools that were required to make that data useful. She called for more data scientists to work in journalism so that we do not have to rely on serendipitous discoveries. DJ Patil (@DJ44) and Lynn Overmann (@LynnOvermann) of The White House gave a joint presentation on the ways the U.S. government is utilizing data for social good. A quote that really resonated with me was that “A technology is neither radical nor revolutionary unless it benefits everyone.” Lynn specifically highlighted how the Miami-Dade Police Department has been using data to redirect arrestees to other social services (e.g., addiction counseling, mental health treatment) instead of sending them to jail. This is part of a broader government focus on #DataDrivenJustice.

Lynn Overmann

I want to highlight one presentation from Thursday that was particularly insightful. Danielle Dean (@danielleodean) and Shaheen Gauher of Microsoft gave a talk titled ‘Evaluating models for a needle in a haystack: Applications in predictive maintenance.’ They described the process for building and evaluating models for rare events, using the example of predictive maintenance. One important point they made was the need to re-think what you consider to be a failure for the purposes of building a predictive model. If you only look at the actual failures and it is a low rate, then it is difficult to teach your model enough about the failures to be able to accurately predict future failures (see photo below). They also stressed the importance of comparing your machine learning model to a baseline solution (think confusion matrix) to ensure that what you have developed is actually an improvement over baseline.

Dean & Gauher

One final topic worth mentioning is the large number of conference presentations related to data privacy. For decades now there have been conversations about data security and the tools necessary to keep data secure. This has been largely driven by the need to keep data out of the hands of criminals. With the explosion of social media the conversation (at least for some folks) also now includes data privacy. Data privacy is really focused on only certain individuals having access to data, in some cases for only specific purposes. Several presentations focused on what is necessary to keep data private, particularly when one is working with health data. With so much geo data available now all of us are more identifiable.