Jun 6, 2017

This Week in Data Reading

This week, Charles, Xiao, Chris, and Boris share pieces on machine learning, MySQL, pop lyrics, and career advice.

Charles Earl

Last week, I came to appreciate the post “How to Use t-SNE Effectively.”

t-SNE is a visualization algorithm that has been a boon for understanding machine learning models. Machine learning models, particularly those used in deep learning, may have hundreds or thousands of dimensions. t-SNE can be used to transform hundreds of dimensions down to two or three. With two dimensions, it is possible to draw graphs that make it easier to see the relationships a model has learned. The article sheds light on what t-SNE does well, and where its results are more open to question.

The post is published in the Distill Research Journal — a new online machine learning journal focused on articles that prioritize visualization and interactivity. Distill is definitely worth checking out.

Xiao Yu

I found this post about MySQL removing the query cache quite interesting. What resonated most with me was the idea that predictability of query speed was much more important than actual raw performance.

For user-facing systems, reducing the variability of performance is often more important than improving peak throughput.

We see this in the internal tools that we build for WordPress.com analytics. It was more important to build our data systems so that, for example, blog visitor counts update predictably at a set interval for every user than to have a system that can update counts quickly but with so much variability that users can’t easily predict how up-to-date the numbers are.

People can work around a constant speed bottleneck and schedule around it, but a large variation forces them to wait on the application, which is a worse overall user experience even if the average time spent waiting is lower.
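
To make the trade-off concrete, here is a minimal sketch in Python. The numbers are invented for illustration and are not measurements from our systems: `steady` and `bursty` are hypothetical lists of update delays, in minutes, for two imaginary pipelines.

```python
import statistics

# Hypothetical update delays in minutes (illustrative values, not real data).
steady = [10.0] * 10                # always ten minutes behind
bursty = [1.0] * 9 + [60.0]         # usually one minute, occasionally an hour


def summarize(delays):
    """Return the mean, spread, and worst case for a list of delays."""
    return {
        "mean": statistics.mean(delays),
        "stdev": statistics.pstdev(delays),  # population standard deviation
        "worst": max(delays),
    }


for name, delays in (("steady", steady), ("bursty", bursty)):
    stats = summarize(delays)
    print(f"{name}: mean={stats['mean']:.1f} "
          f"stdev={stats['stdev']:.1f} worst={stats['worst']:.1f}")
```

In this made-up data the bursty pipeline has the lower average delay, but its spread and worst case are far larger, so a user can never be sure how fresh a number is. The steady pipeline is slower on average, yet anyone can plan around it.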

Chris Rosser

Around the world, around the world
Around the world, around the world…

Are pop lyrics getting more repetitive?

The Pudding looked for repeated sequences in pop lyrics from the ’60s to the present day using the Lempel-Ziv (LZ) compression algorithm. The findings are interesting, but it’s the storytelling and data visualization that stood out for me. Graphs are drawn and words combined as you scroll through the post, and gradually the results unfold.
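
If you want to get a feel for the idea, here is a rough sketch in Python. It uses the standard library's `zlib` module, which implements DEFLATE (an LZ77-style compressor plus Huffman coding), as a stand-in; it is not necessarily the exact method The Pudding used. The `size_reduction` helper and the sample texts are my own illustration.

```python
import zlib


def size_reduction(text: str) -> float:
    """Fraction by which zlib shrinks the text; higher suggests more repetition."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    compressed = zlib.compress(raw, 9)  # 9 = maximum compression level
    return 1 - len(compressed) / len(raw)


# The lyric quoted above, repeated many times.
repetitive = "Around the world, around the world\n" * 36

# A made-up sentence with little internal repetition, for contrast.
varied = ("The quick brown fox jumps over the lazy dog while seven "
          "wizards quietly judge a boxing match.")

print(f"repetitive: {size_reduction(repetitive):.2f}")
print(f"varied:     {size_reduction(varied):.2f}")
```

The repeated lyric shrinks far more than the varied sentence, because the compressor can replace each repeat with a short back-reference to an earlier occurrence. Very short texts can even come out slightly larger after compression, so the score is only meaningful when comparing texts of reasonable length.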

Boris Gorelik

In my blog post, “Don’t study data science as a career move; you’ll waste your time!” I warn aspiring data scientists about getting a subpar education. “You might end up a mediocre Python or R programmer…one of the many. Sometimes it’s good enough. Frequently, it’s not.”