Today, I wanted to share how we’ve empowered colleagues outside the data engineering team to write their own data transformations by overcoming coding language barriers. Within our Data team, there are several specializations:
- Data Analysts create dashboards and analyze data for business leads and product teams across the company.
- Data Scientists apply machine learning technology at Automattic.
- Data Engineers build the infrastructure and data pipelines to support data analysts and data scientists.
A couple of years ago we faced a problem with data transformations. We wrote data pipelines in Scala and executed them in Spark. While data scientists and data analysts at Automattic know which data they need and in which format, most of them do not know Scala very well. Setting up these pipelines and writing the data transformations was thus the responsibility of a small team of data engineers, and they soon became a bottleneck for the work of data analysts and data scientists.
To empower data scientists and data analysts, and free up some of our data engineers’ time, we created a toolset that allows everyone to transform and prepare data according to their needs. This solution is based on writing transformations in pure SQL, for two main reasons:
- Everyone on the team — from data analysts to data engineers — knows SQL.
- SQL is well supported by Spark, and Spark is already used for existing transformations.
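To illustrate the idea, here is a minimal sketch of what "transformations in pure SQL" means in practice: the contributor writes only a query, and the framework materializes it into a table. This is not our actual framework — the table and column names are hypothetical, and Python's built-in sqlite3 stands in for Spark SQL:

```python
import sqlite3

# The contributor writes only this: a pure-SQL transformation.
# (Hypothetical tables and columns, not a real schema.)
TRANSFORMATION_SQL = """
    SELECT user_id,
           COUNT(*) AS post_count
    FROM posts
    GROUP BY user_id
"""

def run_transformation(conn, sql, target_table):
    """The framework's job: materialize the query into a new table."""
    conn.execute(f"CREATE TABLE {target_table} AS {sql}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (user_id INTEGER, title TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(1, "a"), (1, "b"), (2, "c")])

run_transformation(conn, TRANSFORMATION_SQL, "post_counts")
print(conn.execute("SELECT * FROM post_counts ORDER BY user_id").fetchall())
# → [(1, 2), (2, 1)]
```

The key design point is the split of responsibilities: the SQL file is all a data analyst or data scientist has to write, while scheduling and execution stay inside the framework.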
SQL is widely used, but there were a few things we needed to overcome:
- Code style. While Scala, Python, and other languages have standard formatting guidelines and tools, SQL is wild. We spent a lot of time agreeing on a set of rules.
- Even once you come up with a code style, it’s hard to find tools that will check your SQL code against it. We use the formatter that comes with JetBrains IDEs.
- You need to target an exact SQL dialect. For example, column names are not case sensitive in MySQL, but quoted identifiers in PostgreSQL are. Hive and Impala differ, among other things, in the date functions they provide. Thus, a query that runs fine in Hive might not run in Impala. We use Spark for transformations, but we also allow access to our data with Impala. People should always keep in mind which engine will execute their SQL code at a given moment.
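These dialect gaps are easy to demonstrate even outside Hive and Impala. The sketch below (again using sqlite3; the Hive and Impala spellings are shown only as strings for comparison, not executed) shows how the same logical operation — extracting the year from a date — is written differently per engine, and how identifier case-sensitivity varies:

```python
import sqlite3

# The same logical operation, spelled per dialect. Only the SQLite
# expression is executed here; the others are for comparison.
EXTRACT_YEAR = {
    "sqlite": "CAST(strftime('%Y', d) AS INTEGER)",
    "hive":   "year(d)",   # Hive builtin
    "impala": "year(d)",   # same name here, but other date functions differ
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (d TEXT)")
conn.execute("INSERT INTO events VALUES ('2019-06-15')")

# Like MySQL, SQLite matches unquoted column names case-insensitively,
# so selecting D instead of d still works. PostgreSQL instead folds
# unquoted identifiers to lower case and treats quoted ones as
# case-sensitive, so the same query can behave differently there.
row = conn.execute(
    f"SELECT {EXTRACT_YEAR['sqlite']} FROM events WHERE D IS NOT NULL"
).fetchone()
print(row[0])  # → 2019
```

This is why reviewers in our setup have to know not just the query, but which engine will run it.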
We developed a framework on top of Spark SQL to give everyone the ability to build their own data pipelines. They don’t need to know Scala (or even Python) to transform data. In addition to writing an SQL query, they define a data schema in the Avro IDL language and write tests using Cucumber steps. Cucumber comes with data tables that conveniently define input and output data of ETLs. Any person at the company can create a PR on GitHub. They get a boost of confidence when they pass all the continuous integration tests, get reviewer approval, and enjoy working with the new data table they created.
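As an illustration of the three pieces a contributor writes (the transformation itself, a schema, and tests) — with hypothetical table and field names, not one of our real transformations — the schema might be declared in Avro IDL like this:

```
protocol PostCounts {
  record PostCount {
    long user_id;
    long post_count;
  }
}
```

And the Cucumber scenario, with data tables describing the ETL’s input and output, might look like the following (the exact step wording is illustrative, since the steps are defined by the framework):

```gherkin
Feature: post_counts transformation

  Scenario: counting posts per user
    Given the input table "posts":
      | user_id | title |
      | 1       | a     |
      | 1       | b     |
      | 2       | c     |
    When the transformation runs
    Then the output table "post_counts" contains:
      | user_id | post_count |
      | 1       | 2          |
      | 2       | 1          |
```

The data tables double as readable documentation of what the transformation does, which makes PR review accessible to non-engineers as well.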
We have used this system for more than a year now, and more than 700 data transformations have been written this way. While the Data Engineering team still has only five people, more than 50 people now commit code and extend the usage of the data available in our data warehouse.