On WordPress.com we average 23 million actions per day that trigger indexing 75 million Elasticsearch documents into hundreds of indices. Users cause these actions by making changes to the data in more than 1 billion MySQL tables, which we mirror into our many Elasticsearch indices. Keeping data correctly mirrored between MySQL and Elasticsearch, where both are distributed across multiple data centers, gets pretty complex.
We get asked about the details of this process (usually when something appears to be going wrong), and we’ve made a lot of improvements over the past six years. Our system is a moving target, but to the extent that prose can describe code that coordinates thousands of servers, this post outlines our current system.
I’m going to focus on indexing posts/pages since that comprises most of our data, but we index many other document types that follow the same process. The steps of the process are:
- User makes a change triggering a WordPress action hook.
- Indexing job submitted and deduped.
- Indexing job runs.
- Check for errors and start indexing.
- Elasticsearch refreshes indexed documents.
- Recover from indexing errors.
Indexing a single post takes between one and thirty seconds to accomplish assuming there are no errors. In cases where we’re reindexing a site with millions of posts, it can take hours to synchronize Elasticsearch and MySQL.
Let’s look at each of these stages in detail.
User Makes a Change
This is pretty simple, although the change could come from many different sources:
- A new post.
- Adding a tag.
- Changing the site name (reindex all posts).
- Running an import.
- Running a wp-cli command that updates many posts.
- A request on the public api.
- A change on a Jetpack-connected site synced to WP.com.
All of these somehow make their way to a WP.com web server, where they trigger one of the 68 WordPress actions/filters that we use to initiate indexing.
Submit Indexing Job
Because a single WordPress process could trigger many indexing jobs, all the indexing jobs go into an in-memory queue. There, we dedupe the jobs and optimize them. (For example, if 10 posts get changed we put them into a single job rather than 10 separate jobs.) Then, we add these jobs to a MySQL queue for indexing jobs on the WP shutdown action (runs at the end of every WordPress process).
Longer running jobs are submitted to a separate queue so we can dedupe them across multiple web servers. This way, if a user’s action requires us to reindex all of their posts, and then they immediately take another action that also triggers reindexing all the posts, we will only run a single job.
Run the Indexing Job
The main jobs system (built on a MySQL queue) runs jobs (mostly) in order, with a slight preference for higher priority jobs. We prioritize changes to single posts above bulk changes to many posts. The jobs system also imposes a slight delay on the jobs (milliseconds) to give the the database and memcache time to sync between data centers.
Each indexing job runs as its own WordPress process for the site to ensure we have the correct data. This means loading the correct theme and loading custom post types code. Because of the many customizations possible in WordPress code, it’s not sufficient to look only at the database to get the data correctly prepared for indexing.
The start of the indexing job is our first opportunity to check for errors. We don’t want to check for errors when submitting jobs because users are waiting on those requests.
All indexing jobs run a number of checks when they start:
- Check if the job is disabled. This applies backpressure on the jobs system if a cluster is overloaded. We also use this as a mechanism to disable indexing in emergencies. When jobs get disabled, the
blog_idis recorded in our Failed Blog Queue so they can be reindexed later.
- Verify that the indexing job looks right, otherwise
die(), which will show up in error logs and alerts (e.g., an incorrect
- For bulk indexing/deleting a blog, we apply a lock on the
blog_id. When we can’t get a lock, the job is retried (see below).
- It is also possible for jobs to have fatal errors due to custom VIP code or transient errors. These get picked up from our error logging and then go into the Failed Blog Queue.
Assuming those checks pass, we start indexing posts based on job type. Some jobs index/delete/update a single post; others index/delete many posts on a blog.
Refresh Elasticsearch Index
Once a document has been sent to Elasticsearch, it still takes time before it’s available in search results. Our indices’ refresh rates are set to one second (for WordPress.com VIP indices) or 30 seconds (global indices containing all sites) depending on the application and indexing rate they experience. We’re planning to evaluate how many servers we will need to lower the 30 second refresh rate down to one second.
Handle Indexing Runtime Errors
A number of errors can occur while a job runs:
- If an indexing error gets returned by Elasticsearch during indexing, the job enters the Retry Jobs Queue. Our most common errors occur when bulk indexing a set of 100 or more posts; if the posts are very large or need extra analysis, the indexing request can timeout. Sometimes the Elasticsearch server can also get briefly overloaded with a burst of indexing and reject some requests. Rejected requests are very rare for us — unless a server has a hardware problem.
- When database errors occur when trying to get the latest data, we also put the job into the Retry Jobs Queue. In these cases we’re hoping a temporary database performance issue will resolve itself by the next time we run the job. This typically occurs 500-600 times a day (0.0026% of all jobs).
- Due to the “fun” of PHP memory management, long running jobs (sometimes many hours) can run out of memory and we need to start a new job where we left off. For the most part we detect when the PHP process is about to run out of memory and resubmit the job, but very rarely the process will crash before we catch it and so it ends up in the Failed Jobs Queue.
Of course, when things go wrong the error rate can jump, and the same code that handles “normal” errors also handles cases where many servers or an entire Elasticsearch cluster is non-responsive (hardware failures or network failures).
If an Elasticsearch error does occur, the indexing job sees the error, and based on the error code, may mark the Elasticsearch server down (stored in memcache) so that other jobs won’t try to query/index that server for a few minutes. This further attempts to prevent indexing errors due to transient network connectivity or server issues. These events are pretty bursty; some days no servers are marked as down, other days, it happens 10-20 times in a day.
Retrying a Job
A retry occurs when we expect a job’s error may resolve itself soon; for example a http error, a database error, or a timeout.
In these cases, we use the indexing job retry code:
- The first time a job fails, resubmit it as a deferred job we’ll run two minutes later.
- Each subsequent time a job fails, we double the time until we’ve run it four times over a period of 14 minutes (2 + 4 + 8 = 14 minutes). At that time, we add the job to the Failed Jobs Queue (below).
We do a few hundred retries an hour with spikes if there’s some indexing/cluster event going on. We add a very small number of jobs to the indexing retry queue. This really only happens during significant cluster events. Of 23 million jobs a day, we typically retry about 20k jobs a day (0.087% of all jobs).
Failed Jobs Queue
This queue handles the cases where a job fails completely, or something is so wrong with the clusters that we need to stop indexing. Job failures are spiky, and typically driven by fatal errors due to bad deploys. On normal days, we see 10 to 100 failed jobs.
We have an hourly cron job that looks at failed jobs each hour and sticks them into the Failed Jobs Queue. This queue currently consists of a table of blog_ids and the indices that those blogs should be reindexed into. Once a day, a cron job empties the queue. Effectively blogs that experience fatal errors or some other repeating failure to index should get back in sync with the database within 24 hours.
So how long does it take for a change to reach Elasticsearch?
We don’t have full end-to-end instrumentation, and in the rare cases where an error occurs, the time can increase, but let’s look at the typical times for a couple of examples.
- New post on Jetpack powered site -> WordPress.com Database -> VIP Elasticsearch Index: a couple of seconds.
- Sync data from the original site to WP.com: typically seconds unless there is a large backlog.
- WP.com database writes to running the indexing job: less than a second.
- Add the doc to Elasticsearch: probably tens of milliseconds.
- Refresh time on the Elasticsearch index: one second.
- Bulk sync a Jetpack-powered site -> WordPress.com Database -> Global Elasticsearch Index: minutes to hours.
- Sync all data from the original site to WP.com: minutes to hours depending on the size of the site.
- WP.com database writes completes running the bulk indexing job: five minutes because we dedupe bulk indexing jobs.
- Walk all posts in the WP.com database and send them to Elasticsearch: minutes to hours depending on the size of the site. A million post site can take four to eight hours.
- Refresh time on the global Elasticsearch index: 30 seconds.
- Change the name of a tag on WP.com -> WordPress.com Database -> Reindexing all posts that have that tag in a VIP Elasticsearch Index: five minutes up to an hour.
- WP.com database writes to run the indexing job: five minutes because we dedupe bulk jobs like this.
- Updating the docs in Elasticsearch that have that tag: depends on the number of docs, but never more than an hour. If there are only a few hundred posts it should complete within a minute or so.
- Refresh time on the Elasticsearch index: one second.
For many cases the slowest part of an “update” is actually waiting for the Elasticsearch query caching to timeout. We cut our query volume in half (and greatly reduce query latency) by caching all Elasticsearch queries in memcache. The cache timeout depends on the particular application. For most search applications it is between 30 seconds and five minutes. For services such as related posts, we have a 48 hour cache timeout.
Our indexing system has been iterated on for almost six years and continues to get better. We’re constantly learning how to run a better service. One multi-datacenter caching problem we solved in 2013 led me to this great quote from a Facebook paper about replication across data centers:
We treat the probability of reading transient stale data as a parameter to be tuned, similar to responsiveness.
Elasticsearch mirrors our database, so just like a cache, our ES data will always be slightly stale relative to new changes to the database. By focusing on tuning the probability of not having stale data we reduce the likelihood of stale data affecting the user experience.
The indexing system we have today is significantly improved from what it looked like one year ago. We expect those improvements to constantly continue, year after year, as we find gaps we have missed, or reduce the gaps we are already aware of.