Post written by Lucia Palova:

PyGotham 2016 – A peek into the world of data

Thanks to WiMLDS, I attended the PyGotham 2016 conference this year. The two-day event hosted about eighty excellent talks. Topics ranged from getting and handling data (for example, Tracy Osborn’s Design for Non-Designers, Adrian Cruz’s Push Data, Pull Data, Present Data, Stevie Slloterback’s Introduction to BeautifulSoup, or David Baumgold’s Advanced Git), through special-interest topics like Making Games (Piper Thunstrom’s step-by-step guide, my favorite), to hot topics in machine learning: neural networks (Mike Craig’s Introduction to TensorFlow, Eric Schles’ walk-through on neural nets), text summarization and NLP tools (Mike Williams), vector space models (Tim Schmeier’s music data analysis), and reinforcement learning (Jessica Forde’s Introduction to Reinforcement Learning).

See the (upcoming) PyGotham 2016 videos for more details!

I’d like to highlight two talks that caught my attention. Manojit Nandi introduced anomaly detection techniques. Anomaly detection, also called outlier detection, is the identification of items or observations that do not conform to an expected pattern or to the other observations in a dataset. Typically, the anomalous items point to some kind of problem, such as bank fraud, a structural defect, a medical problem, or errors in a text. One example of a density-based anomaly detection technique is the local outlier factor (LOF) algorithm. The LOF score of an object is the average local reachability density of its neighbors divided by the object’s own local reachability density; a score much greater than 1 suggests an outlier. Here I’ll cite Manojit’s words that best describe the algorithm: “Do your close neighbors see you as one of their close neighbors? If not, you are an outlier.”
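To make the LOF ratio concrete, here is a minimal sketch in plain Python. It is my own illustration, not code from Manojit’s talk. The function name `local_outlier_factor`, the toy `points`, and the choice `k=2` are assumptions, and for simplicity it takes exactly the k nearest neighbors instead of handling distance ties the way the full definition does.

```python
import math


def local_outlier_factor(points, k=2):
    """Return one LOF score per point (simplified: exactly k neighbors each).

    Assumes no duplicate points, otherwise a reachability sum can be zero.
    """
    n = len(points)
    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    def local_reachability_density(i):
        # Reachability distance from i to neighbor j is max(k-distance(j), d(i, j)).
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]

    # LOF = average density of the neighbors / the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical toy data: four points on a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = local_outlier_factor(points, k=2)
for point, score in zip(points, scores):
    print(point, round(score, 2))
```

In this toy dataset the four square corners see each other as close neighbors and score about 1, while the far-away point scores much higher, which is the behavior the quote describes.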

If you are dealing with time series, Twitter’s AnomalyDetection R package might be the right choice for you (if you are also an R coder)! Its Seasonal Hybrid ESD (S-H-ESD) algorithm builds on the generalized extreme Studentized deviate (ESD) test to detect anomalies in time series. Besides time series, the package can also be used to detect anomalies in a vector of numerical values. Thanks for the great intro to the topic, Manojit!

Aileen Nielsen told us about probabilistic graphical models. pgmpy, a must-try Python library for graphical models, lets you play with simple Bayesian networks and their conditional probability tables. It is a nice addition to libpgm. Thanks, Aileen!
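For readers who have not met a Bayesian network with conditional probability tables before, here is a tiny sketch in plain Python. It deliberately does not use the pgmpy or libpgm APIs. The network (Rain -> WetGrass <- Sprinkler), all the probabilities, and the function names are made up for illustration.

```python
from itertools import product

# Hypothetical conditional probability tables (CPTs).
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: 0.1, False: 0.9}
# P(wet grass = True | rain, sprinkler)
P_WET_GIVEN = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}


def joint(rain, sprinkler, wet):
    """Joint probability, factored along the network's edges."""
    p_wet = P_WET_GIVEN[(rain, sprinkler)]
    return P_RAIN[rain] * P_SPRINKLER[sprinkler] * (p_wet if wet else 1.0 - p_wet)


def posterior_rain_given_wet():
    """P(rain | wet grass), by summing the joint over the hidden variable."""
    numerator = sum(joint(True, s, True) for s in (True, False))
    evidence = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numerator / evidence


posterior = posterior_rain_given_wet()
print(f"P(rain | wet grass) = {posterior:.3f}")
```

Summing over every combination like this only works for very small networks; libraries such as pgmpy exist to do this kind of inference more conveniently and efficiently.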