The 11th European Conference on Python in Science took place in Trento, Italy from August 28 to September 1.
During this conference, I delivered a tutorial session on data visualization, gave a talk (read more about it below), and also attended several tutorials and many interesting sessions. This was a very educational and productive conference with a good balance between academic and industry presence. All the conference activities were streamed live to YouTube and are available on the EuroSciPy channel.
Here’s a summary of several talks that I attended, in no particular order. The full program is here.
Data visualization
I delivered a tutorial, “Data visualization — from default and suboptimal to efficient and awesome” (the video is here), and gave a short related talk entitled “Three most common mistakes in data visualization.” I also invite you to read my thoughts on preparing the tutorial and the talk.
Also on the topic of data visualization, Pietro Marchesi talked about “Data visualizations for the web with Altair and Vega(-Lite).” Vega is a high-level grammar of interactive graphics that provides a concise JSON syntax for rapidly generating visualizations to support analysis. Altair is a Python package that generates Vega graphics from Python code. Unlike Matplotlib, Vega and Altair target web graphics that support interactivity, by generating JavaScript documents that can be customized easily. Personally, I would not use Altair to create static publication-ready figures; however, it is a very good fit for interactive data exploration.
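For a taste of the API, here is a minimal Altair sketch (my own toy example with made-up data, not code from the talk):

```python
import altair as alt
import pandas as pd

# Made-up data, just for illustration
df = pd.DataFrame({"x": list(range(10)), "y": [i ** 2 for i in range(10)]})

# Altair builds a Vega-Lite specification from chained Python calls;
# rendering happens in the browser or in a notebook frontend
chart = alt.Chart(df).mark_line().encode(x="x", y="y").interactive()

# Saving produces a self-contained HTML document with embedded JavaScript
chart.save("chart.html")
```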
How to not screw up with machine learning in production
This short and well-made talk by Denys Kovalenko surveys several “deadly sins” of running machine learning in production. Even if you are new to delivering machine learning solutions to production, or don’t deal with it on a daily basis, I highly recommend that you watch this talk on YouTube.
Running time optimization and profiling
There were several talks and workshops devoted to running time optimization and profiling.
The tutorial “From exploratory computing to performances, a tour of Python profiling and optimization” by Antonio Ingargiola dealt with overall strategies and offered a survey of existing methods. One of the biggest surprises for me was the fact that several mathematical computations can be much faster when performed with the standard math module than with the corresponding NumPy functions.
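As a quick illustration of this point (my own toy benchmark, not taken from the tutorial), compare the two square-root implementations on a scalar:

```python
import timeit

# math.sqrt avoids NumPy's per-call overhead on scalars
# (argument checking, dtype dispatch, array machinery)
t_math = timeit.timeit("math.sqrt(2.5)", setup="import math", number=1_000_000)
t_np = timeit.timeit("np.sqrt(2.5)", setup="import numpy as np", number=1_000_000)

print(f"math.sqrt: {t_math:.3f} s per million calls")
print(f"np.sqrt:   {t_np:.3f} s per million calls")
```

On a typical machine, math.sqrt is several times faster for scalars; NumPy wins, of course, as soon as you operate on whole arrays.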
The tutorial by Matti Picus, “CFFI, Ctypes, Cython: The Good, The Bad and The Ugly,” demonstrated several ways to implement mathematical computations in C, with subsequent embedding in Python code using CFFI, Ctypes, and Cython. The speed improvement was very impressive — up to a factor of six. Even more impressive was the fact that after fiddling with C/Python integration, Matti executed the original pure Python code, without any modifications, using PyPy, which resulted in execution times very similar to the pure C implementation. Before you switch your own programs to PyPy, it’s worth remembering that PyPy is very fast with pure Python functions but is much slower when dealing with functions that are already implemented in C, such as NumPy and large parts of Pandas.
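To give a flavor of the Ctypes approach, here is a minimal sketch (my own example, not Matti’s code) that calls sqrt from the system C math library:

```python
import ctypes
import ctypes.util

# Locate and load the C math library; the library name differs per
# platform, and the fallback below assumes a typical Linux system
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare the C signature: double sqrt(double)
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double

print(libm.sqrt(2.0))  # 1.4142135623730951
```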
Another related talk, “Benchmarking and performance analysis for scientific Python,” provided a survey of profiling methods and promoted neurtu (https://github.com/symerio/neurtu), a simple performance measurement tool.
Data privacy for data scientists
Katharine Jarmul gave an interesting workshop surveying several ways to anonymize and de-anonymize data. The workshop’s most important message was that sharing anonymized data is a complex task: once you share it, someone will try to de-anonymize it, which means that sharing such data needs to be done with a great deal of caution. It is good to remember that several solutions exist; see the tutorial video to learn about some of them.
scikit-learn and tabular data: closing the gap
This talk was presented by Gael Varoquaux, one of the core contributors to scikit-learn — THE machine learning library in the Python ecosystem. Gael reviewed the upcoming changes in sklearn 0.20 (expected to be released very soon). The biggest changes in 0.20 make data pipelining easier, especially when dealing with tabular data of mixed types; better support for categorical data is also planned. One of sklearn’s biggest problems is that it doesn’t fully support Pandas data types: although we can send Pandas DataFrames and Series to sklearn, the output is always a NumPy array or matrix, and both lack the variable names, which makes model analysis harder. This problem won’t be solved in 0.20, but it’s good to know that it is on the developers’ minds.
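Here is a minimal sketch of the mixed-type pipelining that 0.20 enables through the new ColumnTransformer (the data below are made up for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A made-up table with mixed data types
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Trento", "Paris", "Berlin", "Trento"],
    "bought": [0, 1, 1, 0],
})

# Scale the numeric column and one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(), ["city"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(df[["age", "city"]], df["bought"])
```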
This brings us to the next talk.
NumPy: where we are and where we want to be
Matti Picus (mentioned earlier) is one of the two full-time programmers employed to develop NumPy. According to Matti, the project’s main effort is compatibility with other libraries such as sklearn (see above) and Dask. These improvements will enable several useful features, such as native support for categorical data types in sklearn and a dedicated “missing data” value (currently, categories are encoded as integers, and missing values are encoded as sentinel values).
According to Matti, the team’s priorities are determined by community demand, as tracked in the NumPy issue tracker. That is why we are all encouraged to file bug reports.
Parallelization
There were two tutorials devoted to parallel processing; you may find them here and here. These interesting tutorials demonstrate, among other things, how to use Python’s native multiprocessing, the joblib package, and Dask.
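As a taste of the simplest of these tools, here is a minimal joblib sketch (my own example, not taken from the tutorials):

```python
from joblib import Parallel, delayed

def slow_square(x):
    # Stand-in for a CPU-heavy computation
    return x * x

# Run the function on four worker processes in parallel
results = Parallel(n_jobs=4)(delayed(slow_square)(i) for i in range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```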
CatBoost — better boosted trees with categorical variables
In this talk, Anna Veronika Dorogush from Yandex demonstrated CatBoost, an open-source gradient boosting library that handles categorical variables natively. The library also includes visualization and diagnostic functions. According to most of the benchmarks that we saw (surprise!!!), CatBoost demonstrated faster training and prediction times than the alternatives.
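Here is a minimal usage sketch (my own toy data): note that the categorical column is passed as raw strings, with no manual encoding step.

```python
from catboost import CatBoostClassifier

# Made-up data: column 0 is a raw categorical feature that
# CatBoost encodes internally
X = [["Trento", 25], ["Paris", 32], ["Berlin", 47], ["Trento", 51]]
y = [0, 1, 1, 0]

model = CatBoostClassifier(iterations=50, verbose=False)
model.fit(X, y, cat_features=[0])  # mark column 0 as categorical
print(model.predict([["Paris", 30]]))
```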
Closing the gap with data classes
Data classes are a cool way to define “data classes” (duh) — classes that are meant mainly to store data but still need object functionality. Formally speaking, a data class is a regular class whose “magic” methods are auto-generated by a decorator; these include __init__, __repr__, and others. Data classes have been part of the standard Python library since version 3.7, and for 3.6 they can be installed as a backport package.
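A minimal example:

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float
    label: str = "unnamed"

# __init__ and __repr__ are generated automatically by the decorator
p = Point(1.0, 2.0)
print(p)  # Point(x=1.0, y=2.0, label='unnamed')
```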
Final words
EuroSciPy 2018 was an excellent experience for me, and from what I understand, for many more attendees. I’m grateful to the organizers for this wonderful conference.