Post taken from Isha’s Medium blog:

PyData NYC 2015

Where Python and Data have beautiful babies

Python enthusiasts who are interested in data analysis have no doubt used at least one, if not many, of the scientific computing libraries such as numpy, scipy, pandas, scikit-learn, statsmodels, matplotlib and IPython/Jupyter. It’s hard not to become a fan of all this great open source software once you see your productivity and effectiveness at handling and analyzing data at least double.

# No doubt there is a lot more to data analytics than simply building
# and training models. But, really, this is how easy it is to train a
# classifier and build a predictive model using scikit-learn
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier()  # or one of many other estimators
classifier.fit(X_train, y_train)  # learn from the training data
y_pred = classifier.predict(X_test)  # predict labels for unseen data
accuracy = classifier.score(X_test, y_test)  # mean accuracy on the test set

Even those of us who are used to programming in Java, C or a number of other languages find the data analysis use case very compelling in Python. This is because of the large, vibrant and very active community of developers building a wide variety of open source tools, which make the whole process of data analytics, from step 1 (getting the data) to step n (visualizing and telling the story of the analysis), very easy, all within the Python ecosystem. There are other languages, R and Julia for example, that have either the community or the convenience, but Python is unique in that it really has both.

When I got the chance, through WiMLDS, to attend the PyData conference in NYC last week, I jumped at the opportunity to connect with the people who were actively working on the things I found useful, and also beautiful.


The conference started, fittingly, with a talk titled Python as the Zen of Data Science by Travis Oliphant. Oliphant is the creator of NumPy and SciPy and a founder of NumFOCUS, a nonprofit that promotes and supports much of the open source scientific software in Python. He talked about how pythonic data analysis really “fits the brain” of the data scientist, rather than the other way around, and, honestly, I couldn’t agree more.

Achieving the ‘scale-out’

One thing that Python, historically, has not been great at is ‘scale-out.’ One of the popular talks at the conference was by Andrew Montalenti, CTO at Parse.ly. His company is building open source tools, such as pykafka, streamparse and pystorm, to help do real-time analysis over streaming data in Python. Montalenti talked to a crammed room full of self-styled Pythonistas (because, really, is there any other kind?) about “Beating Python’s GIL to Max Out Your CPUs.” The GIL, for the uninitiated, is Python’s global interpreter lock. Given the interest in that talk, and the number of people who raised their hands for “having run into GIL issues,” I would say that all the love for Python notwithstanding, performance and scale-out issues still crop up for people doing analysis in Python.

In his talk, Montalenti argued against going multi-threaded (which often performs worse than single-threaded Python because of the GIL) and recommended going multi-process instead (by using libraries such as joblib and ipyparallel), so as not to be limited by the GIL. Additionally, this way you are not limited to the cores on your own machine but can scale out to multiple machines on a cluster. Montalenti said, tongue-in-cheek, that the GIL is a feature, rather than a bug, of Python because it prevents people from wasting their time on the multi-core use case when, really, the multi-node approach is more useful and scalable anyway. Point taken.
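For the curious, here is a minimal sketch of what the multi-process route can look like with joblib. The expensive_score function and its inputs are hypothetical stand-ins for whatever CPU-bound work you need to fan out:

from joblib import Parallel, delayed

def expensive_score(n):
    # CPU-bound work like this would serialize on the GIL in a thread pool
    return sum(i * i for i in range(n))

inputs = [10**7, 2 * 10**7, 3 * 10**7, 4 * 10**7]

# n_jobs=-1 starts one worker process per available core; each process
# gets its own interpreter, and therefore its own GIL
results = Parallel(n_jobs=-1)(delayed(expensive_score)(n) for n in inputs)
print(results)

ipyparallel takes the same idea further by letting the worker processes live on other machines, which is where the cluster-scale story comes in.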

From darkness to light

Another really interesting talk at the conference was by Katrina Riehl of Continuum Analytics on a project they are doing with DARPA called Memex. The focus of the project is on mining the “dark” web, the part of the internet that doesn’t show up in our search results because the data is dynamic, or access is discouraged by robots.txt, or it is multimedia content that search engines don’t usually crawl. All of this is now being crawled as part of the Memex project, the hope being that it will surface information about illicit activities such as child trafficking.

Riehl made it very clear that they don’t go after people’s private data, such as health or financial information, but that is not because they cannot. Despite the noble goals, though, at some level I do worry about government organizations gaining access to more and more digital information about us. As users, it is hard to even know who knows what and how much, or where the information leaks are, so there is very little guidance on what one can do to keep one’s private information private.

Zen and the purpose of being

One final thing I want to mention is the discussion hosted by Tara Adiseshan of the Coral Project about user comments on news sites. Adiseshan mentioned that journalists are told ‘don’t read the comments,’ yet toxicity decreases when journalists engage. She wanted to talk about how to make online communities more constructive than those normally ‘found in the wild.’

I had missed the initial part of the discussion, but I joined, drawn in by the buzz on Twitter. And I am glad I did, because rich, lively and honest discussions such as these are why one goes to conferences. You can sit in a big hall and listen to someone talk about the latest features and advancements in a cool tool used by thousands, and that is useful and informative, but what is truly inspiring is finding out what people are doing with those tools and where they are making a difference.


I would like to end with shout-outs to two excellent tutorials at the conference. The first was a highly methodical take on how to approach machine learning with scikit-learn by Andreas Mueller. The second was a tutorial on becoming a power user of pandas by Jeff Reback; I realized I had been using only about 20% of pandas before it. Wes McKinney, the creator of pandas, was conspicuous by his absence, though.

I look forward to the next meetup of the PyData community in 2016 and hope to have a project of my own to present by then, at least as a lightning talk.


Full Disclosure: WiMLDS and NumFOCUS paid for my ticket to PyData and I am very grateful for the opportunity. Having said that, the opinions expressed in this post are entirely my own.