We love data, and so do our bloggers, many of whom fastidiously check their stats pages after publishing to see how well their posts are engaging readers. For this they rely on fast, accurate tracking of visitors and views. At WordPress.com, we serve over 23 billion pages each month. On the data team, we are responsible for taking a stream of Nginx logs and turning them into those counts of unique visitors and views, for each individual site as well as globally across all sites, on a daily, weekly, and monthly basis. For all of that to work efficiently and quickly, we run a near-realtime pipeline built on a range of technologies, including PHP, Kafka, Flume, and various components of the Hadoop ecosystem.
While this automated system works quite well, we occasionally want to perform custom log analysis or run one-off queries against the raw data. But with tens of billions of log lines a day, looking at even a couple of days of data means sifting through an enormous amount of it, and that's certainly not something we want to churn through on a single server. Luckily, we already have a Hadoop cluster in place for our near-realtime processing, so we can tap into the power of MapReduce to solve this problem.
Hive to the rescue
Apache Hive is a data warehouse infrastructure built on top of Hadoop. It allows data to be processed with SQL-like queries, and it's very pluggable, so we can configure it to handle our logs quite easily.
First, we will create a Hive table that's configured to read raw compressed Nginx logs. To do this, we will instruct Hive to create an external table at the HDFS location to which we will copy our logs. In the table schema we specify that the data is stored as TEXTFILE, which lets Hive read the stored files as plain text, or, for files with a .gz extension, as gzipped plain text. We also apply a custom serialization/deserialization (SerDe) format to each log line so that every line is parsed into the columns of our table. For that, we use the included RegEx SerDe; it's slow, but since it simply applies a regular expression to each line to extract the fields, it's highly configurable.
Our CREATE TABLE statement then looks something like this:
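The exact statement from the original post isn't reproduced here; a minimal sketch, assuming Nginx's default "combined" log format and hypothetical column names and HDFS path, might look like:

```sql
-- Hypothetical sketch: the column names, regex, and location are
-- assumptions matched to Nginx's default "combined" log format,
-- not our exact production schema.
CREATE EXTERNAL TABLE nginx_logs (
  remote_addr STRING,
  remote_user STRING,
  time_local  STRING,
  request     STRING,
  status      STRING,
  bytes_sent  STRING,
  referer     STRING,
  user_agent  STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) [^ ]* ([^ ]*) \\[([^\\]]*)\\] \"([^\"]*)\" ([^ ]*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\""
)
STORED AS TEXTFILE
LOCATION '/logs/nginx/';
```

Each capture group in `input.regex` maps positionally to a column of the table, and the RegexSerDe treats every column as a string.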
Once we're done, all we need to do is copy our gzipped logs to the specified HDFS location and query the Hive table. Hive even offers some very helpful functions, like parse_url(), which we can use to extract parts of the URL or individual query parameters.
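For example, assuming the hypothetical nginx_logs table above and a column holding the full request URL (parse_url() expects a complete URL, scheme included), counting views per host, path, and campaign source might look like:

```sql
-- Hypothetical query; the table and column names are assumptions.
SELECT
  parse_url(request, 'HOST')                AS host,
  parse_url(request, 'PATH')                AS path,
  parse_url(request, 'QUERY', 'utm_source') AS utm_source,
  COUNT(*)                                  AS views
FROM nginx_logs
GROUP BY
  parse_url(request, 'HOST'),
  parse_url(request, 'PATH'),
  parse_url(request, 'QUERY', 'utm_source');
```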
While HiveQL comes with lots of built-in functions, you may well have custom business logic that cannot be expressed with them alone. Fortunately, Hive is just as pluggable on the querying side, with its concept of User Defined Functions (UDFs). With UDFs, we can write arbitrarily complex Java classes that implement a common UDF interface and install them onto our Hive cluster at query time. These UDFs are then distributed to our mappers and executed in parallel.
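As a sketch of what that logic can look like, here is a hypothetical helper that strips a leading "www." from a hostname. In a real Hive UDF this would be an evaluate() method on a class extending org.apache.hadoop.hive.ql.exec.UDF, typically operating on Hadoop Text objects rather than plain Strings:

```java
// Hypothetical example: the core logic of a simple UDF.
// In a real deployment this class would extend
// org.apache.hadoop.hive.ql.exec.UDF, and Hive would call evaluate()
// once per row, in parallel across mappers.
public final class StripWww {
    public static String evaluate(final String host) {
        if (host == null) {
            return null; // NULL columns pass through as null
        }
        return host.startsWith("www.") ? host.substring(4) : host;
    }
}
```

Once packaged into a jar, it could be wired in at query time with `ADD JAR /path/to/udfs.jar;` and `CREATE TEMPORARY FUNCTION strip_www AS 'StripWww';`, after which `strip_www(host)` can be used like any built-in function.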
By combining the flexibility of custom deserializers on the parsing side with custom UDFs on the query side, we can provide a relatively simple SQL interface that can process terabytes of logs without requiring new, complex ETL processes for answering one-off questions or doing freeform data exploration.
If you’ve found this post interesting, I also occasionally blog about data engineering over at my own blog, xyu.io, and would recommend you check out this post about how the ETL process works for WordPress.com.