Most data scientists have to write code to analyze data or build products. While coding, data scientists act as software engineers. Adopting best practices from software engineering is key to ensuring the correctness, reproducibility, and maintainability of data science projects. This post describes some of our efforts in the area.
Different data scientists, different backgrounds
Data science is often defined as the intersection of many fields, including software engineering and statistics. However, as demonstrated by the above Venn diagram, viewing it as an intersection tends to be too exclusive – in reality, it’s a union of many fields. Hence, data scientists tend to come from various backgrounds, and it is common to encounter data scientists with no formal training in computer science or software engineering. According to Michael Hochster, data scientists can be classified into two types:
- Type A (analyst): Focused on static data analysis. More of a statistician or a scientist with coding skills than an engineer.
- Type B (builder): Focused on building data products. More of an engineer with a strong statistical or scientific background than a scientist.
With advanced degrees in areas such as physics, computational chemistry, and mathematics, most of my colleagues at Automattic initially leaned more toward Type A than B. In contrast, I’m a classic Type B, having worked as a software engineer before doing a PhD and becoming a data scientist.
Type B data scientists may take for granted best practices from software engineering that are unfamiliar to Type A data scientists. For example, in a post on software development skills for data scientists, Trey Causey recounts an imaginary dialogue between a software engineer and a data scientist, in which the data scientist responds:
DS: “Checked my what into what without a what?”
While this is an extreme example, it is a good demonstration of the knowledge gaps of new Type A data scientists. Causey lists five engineering skills that data scientists should acquire to start bridging those gaps, but he notes that his list isn't comprehensive. This is partly because there is no single definition of best practices, and many of them are considered obvious or left implicit. Nonetheless, since joining Automattic last year, I've dedicated some of my time to helping increase awareness and adoption of engineering best practices in our data science projects. The rest of this post describes some of the results of this work.
Adopting better practices
Using a version control system like Git is among the most common advice given to anyone who writes code. It is useful even for solo projects, as it makes it easy to track development progress, explore different ideas, and revert changes that haven't worked out. For complex multi-person projects, version control is essential: working effectively without it is nearly impossible. Fortunately, Automattic's data science projects were already in various GitHub repositories when I joined. However, in the past few months we have improved our use of version control in the following ways:
- Started using the pre-commit Python package to check our code locally before it gets committed. The checkers that we use are pycodestyle, pydocstyle, and Pylint. The former two enforce compliance with some of the guidelines from PEP8 and PEP257 (the official Python style guides for code and documentation, respectively), while the latter detects potential programming errors and ensures adherence to a wide range of coding standards.
- Set up CircleCI to run the above checkers whenever a pull request is made or code is merged to the master branch. This ensures that even if pre-commit hooks were skipped for some reason (e.g., misconfiguration), errors would be caught. In addition, CircleCI uses pytest to run our tests in an isolated Conda environment that is similar to the environment in which the code is expected to run (e.g., a sandbox that connects to our Hadoop cluster).
- Increased clarity around code placement by setting up a central data science repository for all our shared code, one-time analyses, notebooks, and various experiments. The only projects that have their own repositories are those that get developed and deployed as standalone tools (e.g., a2f2 – our time series analysis tool).
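For readers unfamiliar with pre-commit, the hooks are declared in a `.pre-commit-config.yaml` file at the root of the repository. A minimal sketch using local hooks (this is an illustration, not our exact configuration) might look like:

```yaml
# .pre-commit-config.yaml -- illustrative sketch, not our actual setup.
# Assumes pycodestyle, pydocstyle, and pylint are installed in the
# active environment.
repos:
  - repo: local
    hooks:
      - id: pycodestyle
        name: pycodestyle
        entry: pycodestyle
        language: system
        types: [python]
      - id: pydocstyle
        name: pydocstyle
        entry: pydocstyle
        language: system
        types: [python]
      - id: pylint
        name: pylint
        entry: pylint
        language: system
        types: [python]
```

With a config like this in place, running `pre-commit install` once makes the checkers run automatically on the staged Python files of every subsequent commit.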
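The tests that CircleCI runs are ordinary pytest functions: plain functions whose names start with `test_` and whose assertions use the bare `assert` statement. As a toy illustration of the shape of such a test (the `moving_average` function is hypothetical, not from our codebase):

```python
# test_moving_average.py -- a toy pytest-style test module; the function
# under test is hypothetical, not part of our actual codebase.


def moving_average(values, window):
    """Return the simple moving averages of `values` for a given window size."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def test_moving_average_basic():
    assert moving_average([1, 2, 3, 4], 2) == [1.5, 2.5, 3.5]


def test_moving_average_window_of_one_is_identity():
    assert moving_average([5, 7], 1) == [5.0, 7.0]
```

Running `pytest` in the project directory discovers and runs these functions; in CI, the same command runs inside the isolated Conda environment described above.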
To make it easier to share code across projects, we converted our collection of reusable scripts to a private Conda package. Originally, code was sometimes copied to different projects, violating the DRY (don’t repeat yourself) principle. In other cases, the shared code was just assumed to be at a specific local path, which made it hard to reproduce results if the shared code changed in ways that broke the dependent project. With our versioned Conda package, different projects can rely on different package versions, making long-term maintenance easier. We chose Conda rather than pip because it allows us to manage the binary dependencies of our projects without making special requests to our system administrators.
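To illustrate the versioning benefit, each project's Conda environment file can pin a specific release of the shared package, so upgrading the shared code is an explicit, per-project decision. A sketch with made-up package and channel names:

```yaml
# environment.yml -- illustrative; the package and channel names are
# made up for this example.
name: churn-model
channels:
  - our-private-channel
  - defaults
dependencies:
  - python=3.6
  - pandas
  - ds-shared=0.3.1  # pinned release of the shared-code package
```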
While our projects handle diverse datasets and address various customer needs, they often end up following similar patterns. For example, all projects accept some data as input, process it, and produce an output (e.g., predictions). Different data scientists working in isolation tend to come up with different directory structures, file names, and identifiers to represent similar concepts. To reduce this divergence and make it easier for us to switch between projects, we have started working towards a standard project structure, inspired by the Cookiecutter Data Science template. However, we found that the suggested template doesn’t fully match our needs, so we are in the process of adapting it, and may share the results in the future.
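For reference, the Cookiecutter Data Science template suggests a layout roughly along these lines (abridged here; our adaptation differs in the details):

```
├── data
│   ├── raw        <- The original, immutable data dump.
│   └── processed  <- Final datasets, ready for modelling.
├── models         <- Trained models and model predictions.
├── notebooks      <- Exploratory Jupyter notebooks.
├── reports        <- Generated analyses and figures.
└── src            <- Reusable source code for the project.
```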
In addition to the above, we have adopted code reviews as a standard practice. Having code reviewed before it is merged has many benefits:
- It spreads code ownership.
- Communicates changes across teams.
- Serves as a sanity check to verify that requirements make sense.
- Allows folks to find bugs, both through manual testing and by reading the code.
- Lets all folks involved learn new things!
- Can also serve as documentation for how something worked, and why certain decisions were made. Perhaps even for a future you!
In addition to improving the quality of our projects, code reviews and our other efforts in adopting best practices address the main problem identified in Trey Causey's article: that "many new data scientists don't know how to effectively collaborate." As with all things we do, we are committed to never stop learning, and we strive to further improve our processes. As noted above, initiatives like our standard project structure still require more work. In addition, we would love to share more of our data science code with the open source community. This is likely to happen as we further improve the modularity of our projects, making it possible to publish generally useful libraries. Watch this space!