This month a few members of the Data team attended Spark + AI Summit 2018 in San Francisco. In addition to speakers from Databricks, the keynotes featured speakers from a wide variety of industries, including software, car manufacturing, genomics, and even construction. There were over 200 sessions that were organized into tracks, such as developer, data scientist, AI, technical deep dives, and advanced analytics. We’d like to share our favorite sessions with you!
I was really intrigued by Luis Leal’s talk about Differentiable Neural Computers. This model, recently developed by DeepMind, combines the learning capabilities of a neural network with an external memory store. It one-ups recurrent neural nets by dramatically increasing the amount of memory the network can retain. What’s really fascinating, though, is that the neural net learns how to manage memory storage and retrieval on its own, in a manner similar to how neuroscientists believe the human brain manages memories.
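To give a feel for what "retrieval from an external memory" can mean, here is a minimal sketch of content-based addressing, one of the read mechanisms described for this family of models. It is not taken from the talk: the function names, the tiny 3-slot memory, and the `sharpness` value are all made up for illustration, and in a real model the key and sharpness would be produced by the trained network rather than hard-coded.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # small epsilon avoids division by zero

def content_read(memory, key, sharpness=10.0):
    """Soft lookup: weight every memory row by how similar it is to the key.

    Returns the softmax weights over rows and the weighted-average read vector.
    `sharpness` is a hypothetical fixed value here; a network would emit it.
    """
    scores = [sharpness * cosine_similarity(row, key) for row in memory]
    max_score = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    read_vector = [
        sum(w * row[j] for w, row in zip(weights, memory)) for j in range(width)
    ]
    return weights, read_vector

# Toy memory with three slots (illustrative values only).
memory = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
weights, read_vector = content_read(memory, key=[0.9, 0.1, 0.0])
```

Because every step is differentiable, the lookup itself can be trained with gradient descent, which is what lets the network learn its own memory access pattern.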
My first question was whether this model was feasible to implement with limited resources, and Luis said that he built one using two GPUs! You can find a video of the talk and a link to the slides on the Databricks website. I hope you enjoy it as much as I did.
In working with Data Scientists and Data Engineers at Automattic, I’ve noticed that one of our biggest pain points is the mismatch in preferred systems and languages. Data Scientists in general love using Python: its wide array of tools and libraries makes manipulating and processing data easy. Data Engineers, on the other hand, are very adept at working with Java and the JVM, as that is how most of the distributed systems in the Hadoop universe are built and how clusters are managed. This results in a trade-off for every project we tackle: either use Python with smaller, sampled data because of the resource constraints of running on a single computer, or use Spark to get fast distributed execution across a large cluster but without the libraries needed for even basic data science.
To bridge this gulf we have tried using PySpark, with limited success, so I was quite excited by Holden Karau’s talk, Making PySpark Amazing, and it did not disappoint. Not only does she summarize the common problems everyone is facing and explain where some of the bottlenecks exist, but the talk also offers a glimpse into the future and points to tools (such as Apache Arrow) that we can try now to get more usable PySpark performance.
As a data engineer, I’m always looking to extract more value from our Hadoop cluster by optimizing the jobs running on it and so increasing cluster utilization. Oversubscribing Apache Spark Resource Usage for Fun and $$$, by Sital Kedia and Sergey Makagonov, was of particular interest to me, as it explains how to play with some of the Spark knobs, like spark.task.cpus, to close the gap between the resources reserved by a job and its actual resource utilization. The second part covered how to use historical job data to predict and auto-tune resource allocations for Spark jobs.
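As a rough illustration of why spark.task.cpus matters, the sketch below works through the arithmetic: Spark runs as many tasks at once on an executor as spark.executor.cores divided by spark.task.cpus allows. The helper names and all of the numbers are hypothetical and are not the settings or measurements from the talk.

```python
def task_slots(executor_cores: int, task_cpus: int = 1) -> int:
    """Concurrent tasks per executor: spark.executor.cores // spark.task.cpus."""
    if executor_cores < 1 or task_cpus < 1:
        raise ValueError("core counts must be positive integers")
    return executor_cores // task_cpus

def utilization(avg_busy_cores: float, reserved_cores: int) -> float:
    """Fraction of the reserved cores that a job actually keeps busy."""
    return avg_busy_cores / reserved_cores

# Hypothetical executor that reserves 8 cores.
slots_conservative = task_slots(executor_cores=8, task_cpus=2)  # 4 tasks at once
slots_packed = task_slots(executor_cores=8, task_cpus=1)        # 8 tasks at once

# Made-up measurement: on average only 3 of the 8 reserved cores are busy.
observed = utilization(avg_busy_cores=3.0, reserved_cores=8)    # 0.375
```

A low utilization figure like this is the gap the talk is about: the job has reserved capacity it is not using, so running more tasks per reserved core is one way to claw it back.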
Few companies collect and analyze as many data points as Strava, the fitness world’s most loved activity tracking social platform. Multiple streams of data are collected for each activity type. For example, a single run includes multiple streams at 1-second intervals: time, GPS position, elevation, heart rate, etc.
I was lucky enough to catch Drew Robb’s talk on how they switched to Spark to help process more than 17 billion miles of exercise data into beautiful rasterized global heatmaps (down to a resolution of 2 metres!). I really love how they use bilinear smoothing and a weighted average of the cumulative distribution functions (CDFs) of neighboring tiles to remove tile artifacts and ensure that the heatmaps are visually compelling at all zoom levels.
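To make the CDF idea a little more concrete, here is a small sketch of how it could work. It is not Strava's code: the function names, the four-tile layout, and the sample counts are all assumptions made for illustration. Each tile turns a raw visit count into a value between 0 and 1 using its own empirical CDF, and a pixel then blends the answers from four neighboring tiles with bilinear weights so the colour scale does not jump at tile borders.

```python
from bisect import bisect_right

def make_cdf(counts):
    """Build an empirical CDF from one tile's raw pixel counts."""
    ordered = sorted(counts)
    n = len(ordered)

    def cdf(value):
        # Fraction of pixels in this tile with a count <= value.
        return bisect_right(ordered, value) / n

    return cdf

def bilinear(v00, v10, v01, v11, fx, fy):
    """Blend four corner values; fx and fy are fractional positions in [0, 1]."""
    top = v00 * (1 - fx) + v10 * fx
    bottom = v01 * (1 - fx) + v11 * fx
    return top * (1 - fy) + bottom * fy

def normalized_heat(value, corner_cdfs, fx, fy):
    """Normalize a raw count using a bilinear blend of four tile CDFs."""
    v00, v10, v01, v11 = (cdf(value) for cdf in corner_cdfs)
    return bilinear(v00, v10, v01, v11, fx, fy)

# Made-up data: a quiet rural tile next to a busy city tile.
quiet = make_cdf([0, 0, 1, 2])
busy = make_cdf([10, 50, 100, 200])
corner_cdfs = [quiet, busy, quiet, busy]  # left column quiet, right column busy

at_quiet_edge = normalized_heat(10, corner_cdfs, fx=0.0, fy=0.0)
at_busy_edge = normalized_heat(10, corner_cdfs, fx=1.0, fy=0.0)
in_between = normalized_heat(10, corner_cdfs, fx=0.5, fy=0.5)
```

The same raw count of 10 looks "hot" next to the quiet tile and "cool" next to the busy one, with a smooth transition in between instead of a visible seam.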
Strava heatmap processing numbers:
- 1+ billion activities
- 17 billion miles of activities
- 3 trillion GPS coordinates
- ~8 trillion rasterized pixels
- 100,000+ years of exercise!
- 100 i3.2xlarge machines (8 CPUs, 60 GB RAM, 1.7 TB SSD)
- 5 hours for full heatmap build