Nov 22, 2016

Bulk Log Analytics With Hive

We love data and so do our bloggers, many of whom fastidiously check their stats pages after publishing posts to see how well posts are engaging readers. For this they rely on fast and accurate tracking of visitors and views. At WordPress.com, we serve over 23 billion pages each month. On the data team, we are responsible for taking a stream of Nginx logs and turning them into those counts of unique visitors and views. We do this for each individual site as well as globally across all sites on a per day, week, and month basis. For all that to work efficiently and quickly, we have a near-realtime pipeline that uses a myriad of technologies including PHP, Kafka, Flume, and various components from the Hadoop ecosystem.

While this automated system works quite well, we occasionally will want to perform custom log analysis or some one-time queries on the raw data. But with tens of billions of log lines a day, simply looking at just a couple of days of data means that we are left to sift through a lot of data. It’s certainly not something we want to try and churn through on a single server. Luckily, we already have a Hadoop cluster in place for our near-realtime processing so we can tap into the powers of MapReduce to solve this problem.

Hive to the rescue

Apache Hive is a data warehouse infrastructure built on top of Hadoop. It allows processing of data with SQL-like queries and it’s very pluggable so we can configure it to process our logs quite easily.

First, we will create a Hive table that’s configured to read raw compressed Nginx logs. To do this, we will instruct Hive to create an external table at the location where we will copy our logs to on HDFS. With the table schema we specify that it is to be stored as TEXTFILE which allows Hive to read the stored files as plain text — or, for those files with a .gz extension, to be read as gziped plain text. We also want to apply a custom serialization/deserialization (serde) format to each log line so that each line is parsed into columns of data in our table. To do that, we will use the included RegEx serde; it’s slow, but given it simply applies a RegEx expression to each line in order to extract data, it’s easily and highly configurable.

Our create table statement then looks something like this:

	— Table for raw NGINX logs
	CREATE EXTERNAL TABLE IF NOT EXISTS `raw_nginx_logs` (
	`remote_addr` string,
	`remote_user` string,
	`time_local` string,
	`http_verb` string,
	`url` string,
	`http_ver` string,
	`status` int,
	`body_bytes_sent` int,
	`http_referer` string,
	`http_user_agent` string
	)

	— Use RegEx to parse each line
	ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
	WITH serdeproperties(
	"input.regex" = "([0-9\\.]+) – ([^ ]) \\[([^\\]])\\] \"([^ ]) ([^ ]) ([^ ])\" ([0-9]) ([0-9]) \"(.)\" \"(.*)\",
	"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s"
	)

	— This also supports gzip, just have to have a .gz extension
	STORED AS TEXTFILE

	— Be explicit where we are going to store this table
	LOCATION 'hdfs://nameservice/user/hive/warehouse/raw_nginx_logs';

view raw

hive-nginx.hql

hosted with ❤ by GitHub

Once we’re done, all we need to do is to copy our gziped logs to the specified HDFS location and query that Hive table. Hive even offers some very helpful functions like parse_url() which we can use to extract parts of the URL or query parameters.

While HiveSQL comes with lots of functions, it’s likely you may have some custom business logic that cannot be handled with pure HiveSQL functions. Fortunately, Hive is super pluggable on the querying side as well and comes with a concept of User Defined Functions (UDFs). With UDFs, we can write arbitrarily complex Java classes that implement a common UDF interface and install it onto our Hive cluster at query time for use. These UDFs will then be distributed across our mappers and executed in parallel.

By combining the flexibility of custom deserializers on the parser side with custom UDFs on the query side, we can provide a relatively simple SQL interface that can be used to process TBs of logs without needing to build new complex ETL processes when answering one-off questions or freeform data exploration.

If you’ve found this post interesting, I also occasionally blog about data engineering over at my own blog, xyu.io, and would recommend you check out this post about how the ETL process works for WordPress.com.

Data Engineering

Automattic, Data Analytics, Open Source, WordPress

Comments

afthead

10 years ago

Does Automattic open source your analytics solution? We’ve got an analytics mining challenge and would love to dig into your code.

Reply
1. Xiao
  
  10 years ago
  
  We have not yet open sourced our analytics implementation however it’s mostly built on top of other open source systems like Hadoop, Kafka, Zookeeper, Elasticsearch, etc. We are also planning on blogging more about how we are using these systems with examples on this blog in the future, so stay tuned.
  
  Reply
  1. afthead
    
    10 years ago
    
    Thanks! We are using many of the same technologies, so I look forward to hearing more about your implementation.
Architecting ExPlat: Automattic’s New Experimentation Platform – Data for Breakfast

5 years ago

[…] Tracks, our home-grown user analytics solution. Tracks has clients in many of Automattic’s products that collect data and send it back to Hadoop via Kafka. […]

Reply