Nov 6, 2023

Synchronizing Data with Apache Superset: Our Internal Solution

It is well known that, at Automattic, we love free and open source software. In fact, we have Automatticians actively contributing to a multitude of open source projects, with the main one being of course: WordPress. That’s why, as soon as we verified that the current versions of Apache Superset are robust enough to satisfy many of our data visualization needs, we proceeded to install an instance of the software and make it available to every member of our organization.

Installation involved implementing a multitude of configurations, such a sophisticated authentication protocol associated with the data on our intranet and, of course, the connection to our different data sources. As you may imagine, it was a most entertaining job for our efficient data engineers!

Now we’d like to describe one of those integrations that we are particularly proud of: the automation of the creation and synchronization of Superset datasets. Jarek Cellary did the work, and we enjoy sharing his novel work that solves a friction point while improving our data discoverability.

Problem statement

Superset charts, and dashboards are always nourished by datasets. This means that it is not enough to simply connect Superset to a database: an extra step of creating a corresponding dataset is necessary for its tables (with or without extra data transformations) to be available.

*The Superset data flow pyramid: notice the semantic layer with datasets on top of the database (tables)*

This extra step probably means little for workers with plenty of knowledge of the world of data, but it is definitely an impediment that can discourage folks in other business units, who may be less familiar with the data realm. And, why not say it, it’s also wonderful to eliminate additional hurdles in our daily work! 🙂

In short, these are some of the handicaps that Superset’s native table‑dataset solution presents:

It always requires an extra manual step for the dataset creation.
Table updates are also a manual step folks need to be aware of, as changes are not synced.
Datasets generated by multiple users but from the same tables might be redundant or duplicates.
Other users’ datasets might present permissions issues.
As explained in a previous post about our Data Discovery Tool, we are also generating table and columns metadata annotations using the Apache Avro format. We’d like to keep those so Superset users have all the context handy.

The solution

Our goal was straightforward: to make all the tables hosted in our database available to our users as if they already had the category of datasets. In this way, the creation of visualizations would be fast, intuitive, and supported by the security of having updated schemas.

Accordingly, we set out to create a script that took into account the following requirements:

Browse the catalog: it is necessary that we constantly scan all the available tables to detect possible new features and changes.
Frequency: we believe that daily execution would be ideal.
Interaction with Superset, in such a way that the datasets can be rewritten in case of detection of changes in the tables.
Mirror our Apache Avro annotations, if possible

The development that we finally implemented was a Python script that executes queries on the database, such as SHOW SCHEMAS or SHOW TABLES FROM "{schema}", and later compares the response with the existing datasets in Superset. If any change is detected, then the script initiates a communication via API to create a new data set or modify an existing one. Fortunately, Superset supports datasets and column descriptions, so the annotations are covered too. We run this script daily on our Airflow instance.

*Example of one of our available synchronized table-dataset. Notice the table description, displayed when hoovering on the information icon*

Conclusion

Finding satisfaction in utilizing open source solutions is even greater when we can successfully modify and evolve the code further to meet our needs. In so doing, we enrich the software’s development, contributing our grain of sand to the community. For instance, here’s our Data Engineering team lead Igor Khrol proposing and implementing a fix.

We hope that this post inspires Superset installations in other organizations, along with potential solutions that address this problem in future releases.

Data Engineering