Apr 4, 2017

This Week in Data Reading

This week, Demet offers a piece that dissects President Donald Trump’s support on Reddit, Charles shares two papers he recently enjoyed on natural language processing, and Carly offers a piece on back propagation.

Demet Dagdelen

This week, I’m sharing “Dissecting Trump’s Most Rabid Online Following” by Trevor Martin over at FiveThirtyEight. There are so many things I love about this article. Not only is it a strong article from a data science perspective — it explains and introduces technical concepts well — I also like to read it as a digital anthropological analysis.

I spend way too much time on Reddit. In the past few years, whenever I venture away from the safe haven of smaller subreddits I’ve curated to my front page, I realize how hate-filled Reddit can be. But online communities, especially on Reddit, are fascinating to me. Author Trevor Martin has a very good understanding of Reddit (which many journalists don’t). The results of his analysis come as no surprise to anyone who spends enough time on the website. In my experience in data science, results that come as no surprise to anyone are a good sign that the methodology works. A minor improvement could have been made by taking the score of the comments into account, since not everyone who comments on a subreddit actually supports the ideas of the community.

I’m still mulling over this article days later and can’t wait to dig into the code and the data as well — I feel like I can already tell this will be one of my favourite pieces from 2017.

There is an interactive tool based on this methodology and data that received “the Reddit hug of death” when the article was first published, but you can play around with it here. Did you discover any interesting connections? Let us know!

Charles Earl

A couple of papers by Alexandra Schofield and David Mimno recently caught my eye. Mimno maintains the popular NLP library Mallet.

In natural language processing pipelines, a text is usually pre-processed in some way before we begin to look for deep statistical or semantic regularities. Stop wording and stemming are two common pre-processing operations. Stop wording is where ubiquitous words that might contribute little to the meaning of a text are removed. For example, “the apple in the orchard” becomes “apple in orchard” if “the” is in the list of stop words. Stemming is when a word is replaced by its base. For example, “eating apples” becomes “eat apple.”
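To make the two operations concrete, here is a minimal Python sketch that reproduces the examples above. Everything in it is an assumption for illustration: the one-word stop list, the function names, and the toy suffix-stripping stemmer, which is far cruder than the stemmers evaluated in the papers and is not how Mallet handles either step.

```python
# Hypothetical, deliberately tiny stop list; real lists hold hundreds of words.
STOP_WORDS = {"the"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]


def naive_stem(token):
    """Toy stemmer: strip one common suffix, keeping at least three letters.

    This is an illustration only, not the Porter algorithm or any
    published stemmer.
    """
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, stem=True, drop_stop_words=True):
    """Lowercase, split on whitespace, then apply the optional steps."""
    tokens = text.lower().split()
    if drop_stop_words:
        tokens = remove_stop_words(tokens)
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


print(preprocess("the apple in the orchard", stem=False))  # ['apple', 'in', 'orchard']
print(preprocess("eating apples"))  # ['eat', 'apple']
```

Keeping both steps behind flags makes it easy to feed the same text into a model with and without each one and compare the results for yourself.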

In the papers, “Pulling Out the Stops: Rethinking Stopword Removal for Topic Models” (PDF), and “Comparing Apples to Apple: The Effects of Stemmers on Topic Models” (PDF), Schofield and Mimno look at natural language pipelines that produce topic models — a kind of text feature extraction that infers topics common to a collection of documents. They make convincing arguments that stop wording and stemming add little to topic model quality.

While these results are specific to the kind of models Schofield and Mimno are building and the corpora they analyze, they certainly offer some flexibility to the practitioner in terms of constructing text pipelines. In many cases, reducing the size of an index or model is critical (filter stop words liberally!), while in others, being able to feed text “as-is” into an application can simplify the code and enable faster development.

P.S. I loved the titles of these papers!

Carly Stambaugh

I do believe there are situations where it’s perfectly acceptable to treat your tools as a black box, but Andrej Karpathy makes a great argument in his post, “Yes you should understand backprop,” that back propagation is not one of them. It includes practical examples, and as an added bonus, has links to free lectures on the subject.