May 23, 2019

Data Science Insights from Cameron Davidson-Pilon

We had the honor of hosting Cameron Davidson-Pilon for an Ask Me Anything (AMA) session. Among other things, Cam is known for writing the book Probabilistic Programming & Bayesian Methods for Hackers and for the lifelines and lifetimes Python packages, which implement algorithms for survival analysis and customer lifetime value estimation, respectively. He recently left Shopify, where he was a Director of Data Science, to pursue professional development toward a potential career change. This post presents some highlights and clips from our conversation.

What strategies do you find effective for improving adoption of good data practices across the organization?

Telling stakeholders about cases where causal reasoning goes wrong, and how going with gut feeling can mislead. One example is birth order and the probability of Down syndrome. They are positively correlated, so people may wrongly infer that it is best to stop after the first few children. However, once maternal age is taken into account, the apparent effect of birth order evaporates: later-born children tend to have older mothers, and it is maternal age that drives the risk. This gets people thinking about confounding variables and understanding why simply looking at 2D charts that ignore the full causal structure is problematic.
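To make the confounding concrete, here is a minimal simulation sketch in Python. All numbers are invented and deliberately exaggerated for illustration (they are not real epidemiological rates), and the names `AGE_WEIGHTS_BY_ORDER`, `RISK_BY_AGE`, `simulate`, and `rates` are hypothetical. By construction, risk depends only on the mother's age band, yet the pooled rate still climbs with birth order.

```python
import random
from collections import defaultdict

AGE_BANDS = ["<30", "30-34", "35+"]

# ASSUMPTION: made-up share of mothers in each age band, per birth order.
# Later-born children are given older mothers on average.
AGE_WEIGHTS_BY_ORDER = {
    1: [0.70, 0.25, 0.05],
    2: [0.50, 0.35, 0.15],
    3: [0.30, 0.40, 0.30],
    4: [0.15, 0.40, 0.45],
}

# ASSUMPTION: made-up, exaggerated risks that depend on age band only.
RISK_BY_AGE = {"<30": 0.02, "30-34": 0.04, "35+": 0.10}


def simulate(n_births=200_000, seed=0):
    """Return a list of (birth_order, age_band, affected) tuples."""
    rng = random.Random(seed)
    births = []
    for _ in range(n_births):
        order = rng.randint(1, 4)
        band = rng.choices(AGE_BANDS, weights=AGE_WEIGHTS_BY_ORDER[order])[0]
        affected = rng.random() < RISK_BY_AGE[band]  # birth order plays no role
        births.append((order, band, affected))
    return births


def rates(births, key):
    """Share of affected births within each group defined by key(order, band)."""
    counts = defaultdict(lambda: [0, 0])  # group -> [cases, births]
    for order, band, affected in births:
        group = key(order, band)
        counts[group][0] += affected
        counts[group][1] += 1
    return {group: cases / total for group, (cases, total) in sorted(counts.items())}


births = simulate()

# The "2D chart" view: the rate rises with birth order.
pooled = rates(births, key=lambda order, band: order)

# The same data split by maternal age: roughly flat across birth orders.
stratified = rates(births, key=lambda order, band: (band, order))

for order, rate in pooled.items():
    print(f"birth order {order}: pooled rate {rate:.3f}")
for (band, order), rate in stratified.items():
    print(f"age {band}, birth order {order}: rate {rate:.3f}")
```

Comparing the two printouts side by side is the kind of exercise that can be shown to stakeholders: the trend in the pooled view comes entirely from who ends up in each birth-order group.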

In general, it’s not enough to say that correlation isn’t causation. As data scientists, we should also provide examples where assuming causation from correlation fails, and show stakeholders how to model causal relationships with DAGs (directed acyclic graphs). Every data scientist should read the Causal Inference book by Hernán and Robins, and communicate its contents to stakeholders. It’s often the stakeholders that can draw the best DAGs because they’re the domain experts.
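As a rough sketch of what drawing a DAG can look like in code, the birth order example can be written as a small adjacency map. The structure below is an assumption based on the example as described here, and `ancestors` and `common_causes` are hypothetical helpers. Shared ancestors are only candidate confounders; this is a simplification, not a full back-door criterion check.

```python
# Each key is a cause; each value lists its direct effects.
# ASSUMPTION: this graph encodes the birth order example as described,
# with no arrow from birth order to Down syndrome.
DAG = {
    "maternal_age": ["birth_order", "down_syndrome"],
    "birth_order": [],
    "down_syndrome": [],
}


def ancestors(dag, node):
    """Return every variable with a directed path into `node`."""
    parents = {cause for cause, effects in dag.items() if node in effects}
    found = set(parents)
    for parent in parents:
        found |= ancestors(dag, parent)
    return found


def common_causes(dag, exposure, outcome):
    """Variables upstream of both exposure and outcome (candidate confounders)."""
    return ancestors(dag, exposure) & ancestors(dag, outcome)


print(common_causes(DAG, "birth_order", "down_syndrome"))
```

Writing the graph down this way keeps the domain knowledge (the arrows) separate from the analysis, so stakeholders can review and correct the arrows without reading any modeling code.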

To avoid a negative reaction, I wouldn’t present such findings as something that specific stakeholders do wrong. Instead, highlight cases where the entire scientific community got something wrong, or where data scientists made mistakes that we learned from.

What techniques and tools do you recommend for causal inference?

The CausalImpact package for one‑off events that can’t be A/B tested. The visualizations produced by the package tell a very good story that’s accessible to stakeholders.
The GeoexperimentsResearch package is useful for A/B testing marketing campaigns across regions.
As noted, the Causal Inference book should be read by all data scientists. The zEpid package provides Python implementations of some of the algorithms from the book.

Can you share some thoughts on the organizational structure of data teams?

This depends on the company. At Shopify we had one team per business area (e.g., marketing, finance), which worked well. The data organization was strongly influenced by the organization of the company as a whole, as the focus was on engineering, product, and design. Within each group, we maintained separate ETL (extract‑transform‑load) flows, datasets, and stakeholder relationships, but APIs were well‑documented and discoverable so that data scientists from the entire company could use data regardless of their team and domain.

However, as with any company, areas of expertise aren’t evenly distributed across teams. For example, my teams were focused on reporting and ETL, whereas product‑facing teams were focused more on machine learning. Due to hype cycles, many people wanted to do more machine learning work, but I highlighted causal inference as a cool new area that is worth specializing in. This is especially worthwhile as there are fewer causal inference specialists than machine learning specialists.

We cultivated a community of data experts across the domain‑focused teams by implementing consistent APIs for accessing data. We also started a data nomenclature movement, enforced rules around dataset creation and storage, and invested in data discovery. This included the creation of metadata for each dataset, and showing the lineage and dependencies of the data. In addition to helping people work across teams, it helped with on‑boarding new hires.

This wasn’t easy to implement in existing teams with legacy constraints, but when new teams were established, it was a good opportunity to create new APIs and a warehouse layout that implemented the lessons we had learned over the years. Such teams could get everything right from day one, e.g., having tests and regex constraints on column names. This is similar to charter cities, where a state creates a brand new city with all the best infrastructure and principles. As with real cities and their physical infrastructure, it’s hard to bring all the best practices to existing teams, as changing infrastructure like table schemas requires a lot of effort. It’s easier to build things correctly when projects are new, and use this “charter data warehouse” approach to spread best practices across the organization.
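A regex constraint on column names might look something like the sketch below. The naming convention (lowercase snake_case starting with a letter), the pattern, and the function name `invalid_column_names` are all assumptions for illustration, not the rules Shopify used.

```python
import re

# ASSUMPTION: hypothetical convention of lowercase snake_case, starting with
# a letter, with no leading, trailing, or doubled underscores.
COLUMN_NAME_PATTERN = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def invalid_column_names(columns):
    """Return the column names that break the naming convention."""
    return [name for name in columns if not COLUMN_NAME_PATTERN.fullmatch(name)]


def test_column_names_follow_convention():
    # Hypothetical table schema; in practice this would come from the warehouse.
    columns = ["order_id", "created_at", "total_price_usd"]
    assert invalid_column_names(columns) == []
```

A check like this is cheap to run in a test suite when a dataset is first created, which is much easier than renaming columns after downstream users depend on them.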

What can data science teams do to become more successful?

Talk to your stakeholders a lot to ensure you have a good grasp of their wants and needs.
Do the legwork of basic descriptive stats (analytics, reporting, and dashboards) to get a good idea of what the client teams are doing. This is also important for building trust between the data team and stakeholders. Once trust is established, stakeholders may ask riskier (and more interesting) questions around machine learning and causal inference.
On the topic of trust, it is important to ship by the deadlines the team has committed to, and continuously inform stakeholders of any delays.
It is worth looking beyond reporting into causal inference questions.
In general, strive to save your clients time and money, which are always in short supply.

What have you been up to since leaving Shopify? What’s next for you?

I worked a lot on lifetimes and lifelines. That took a few months. I was never really proud of the packages, but now I think they’re in a very good spot. I recently posted on the evolution of lifelines over the past few months.

Now I’m thinking of a change in the domain I’m working in. I’m going back to school to take chemistry and biology courses. I like the idea of cellular agriculture (lab‑grown meat), as it’s good for the environment and for animal welfare. I may do a master’s degree in food science or microbiology. There’s not much happening in Canada in this field, and I want to change that, as there is a good opportunity to do more here. This may or may not include making Margaret Atwood’s ChickieNobs a reality.
