Jan 31, 2017

This Week in Data Reading

This week, Boris, Charles, and Greg bring you three new resources for data reading and pose some questions for discussion: how do you approach reading scientific literature critically, and how do you detect and deal with the bias inherent in your applications? We look forward to your comments!

Boris Gorelik

If you’re looking for merciless, data-based critiques of published scientific papers, check out the site of Professor Lior Pachter (@lpachter) of UC Berkeley. You’ll want to read his critiques in the field of network analysis and graph theory, especially his critiques of papers written by famous figures in that field. (Note that Pachter’s critiques can be harsh, as for example here or here; be sure to read the comments for criticism of the critique itself.) If you write or read scientific literature as part of your work, follow this blog as a reminder not to blindly follow “expert authority.”

A question to you, the reader: how often do you find yourself disagreeing with a renowned scientist only to dismiss your own doubts, thinking “they know better”? Do you even bother reviewing a paper critically if it was written by a celebrity author and published in a “glamor” journal? I’m curious about your thoughts!

Charles Earl

At our recent meetup at the NETSCI-X conference, Yoav Goldberg recommended “Quantifying and Reducing Stereotypes in Word Embeddings” (PDF) by Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai as a good study of how the gender and racial bias inherent in training data can creep into machine learning models. Word embeddings have emerged in the last two years as a very powerful and readily applied tool for natural language processing. I think the work is quite relevant for us in particular as we incorporate this technology into search and recommendation products.
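To make the idea concrete, here is a rough sketch of one common way to probe an embedding for gender bias: take the difference between the vectors for “he” and “she” as a gender direction, then measure how strongly an occupation word points along it. This is an illustration only, not the authors’ exact procedure. The three-dimensional vectors are invented toy values, and the names `gender_direction`, `bias_score`, and `neutralize` are hypothetical; a real check would load trained embeddings instead.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# Real word vectors have hundreds of dimensions and come from a trained model.
VECTORS = {
    "he": [1.0, 0.2, 0.0],
    "she": [-1.0, 0.2, 0.0],
    "engineer": [0.4, 0.9, 0.1],
    "nurse": [-0.5, 0.8, 0.2],
}


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def gender_direction(vectors):
    """Unit vector pointing from 'she' toward 'he' (hypothetical helper)."""
    diff = [h - s for h, s in zip(vectors["he"], vectors["she"])]
    return unit(diff)


def bias_score(word, vectors):
    """Cosine between a word and the gender direction.

    Positive values lean toward 'he', negative values toward 'she'.
    """
    return dot(unit(vectors[word]), gender_direction(vectors))


def neutralize(word, vectors):
    """Remove the component of a word vector along the gender direction."""
    g = gender_direction(vectors)
    v = vectors[word]
    projection = dot(v, g)
    return [x - projection * gx for x, gx in zip(v, g)]


if __name__ == "__main__":
    for word in ("engineer", "nurse"):
        print(word, round(bias_score(word, VECTORS), 3))
```

With these toy values, “engineer” scores positive and “nurse” scores negative, which is the kind of stereotype the paper quantifies. Calling `neutralize` on either word yields a vector with no component along the gender direction, a simplified version of the “reducing” half of the title.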

My questions for you: how are you addressing the bias that might be implicit in the data and models used to power your applications? Can you effectively detect or remove the bias? How do you communicate this to your users?

Greg Ichneumon Brown

I recommend “An overview of Mozilla’s Data Pipeline” over at Roberto Agostino Vitillo’s blog. Roberto is a staff data engineer at Mozilla. This piece is a good overview of how Mozilla gathers telemetry data from Firefox and processes it using a number of open source data tools.