Improving Relevance and Elasticsearch Query Patterns

The WordPress.org plugin directory has been significantly rebuilt over the past year and should go live soon (test site). Many from across the WordPress community helped with this effort. I focused on improving plugin search relevancy. This was a great learning experience on how to build more relevant searches for a couple of reasons:

  1. There is a decent volume of search traffic (100,000 searches per day and over 500k unique search queries per month).
  2. The repo is small enough to iterate easily (45k total plugins) and yet has enough users and use cases that it can be pretty complex. We went through five major iterations on how to index the data.
  3. A lot of people care and have opinions about how plugin search can be better. This makes for a great opportunity to learn because it is easy to get lots of feedback.

Despite building search engines with Elasticsearch for many years, my opinion on how to structure an Elasticsearch query and index content changed a lot because of this project. This post describes my latest opinions.

Background on Plugin Search

In surveys about WordPress, the community regularly rates the plugin ecosystem as both a top strength and a top weakness of WordPress. Plugins give users flexibility in building websites, but can also be a source of frustration due to updates, incompatibility, and getting support when something goes wrong.

The most popular plugins are installed on millions of websites and are often built and maintained by teams of developers. But many plugins are small; they fill many different niches, and have varying levels of developer support. Sometimes they solve a problem really well, sometimes they are abandoned and rarely used.

Algorithm Philosophy

Ultimately, a search algorithm is driven by design opinions about the problem you’re solving. In looking at the data and having discussions over the years, I’ve developed some opinions about plugin search:

  • We design search primarily for end users, not for developers. Developers make up a small percentage of the 100k searches each day.
  • We steer end users toward plugins that are most likely to give them the best WordPress experience. This doesn’t mean simply matching text, but rather trying to answer the question implied by the text: Which plugin will solve problem X for me?
  • Plugins shouldn’t just solve the problem right now; they should still solve the problem a year or two from now. Past history is the best indicator we have of a plugin’s future.
  • Search terms indicate demand for a feature, and active plugin installs indicate the supply side of that equation. If 10k people per year want a particular feature, we should recommend plugins that can support that volume of new users.

I expressed this on the search relevancy ticket and I know that some of it is controversial. Opinions usually are. Let me try an example. (Disclaimer: I’ve worked on Jetpack Stats in the past.)

Users search for “stats” 93 thousand times a year. (It’s the 13th most frequent search on the plugin directory.) Here are the top four search results with the old algorithm:

[Screenshot: top four search results for “stats” with the old algorithm]

The old search suggests that 90 thousand people per year go and install plugins that, collectively, have only proven that they can “handle” 20 thousand users. Scaling stats as a service (as many plugins do) can be quite hard and expensive. Sending 90k new users per year to these plugins seems unrealistic. Even if the plugin doesn’t work as a hosted service, it still needs to scale answering support requests from end users.

Let’s compare that to the top four suggestions from the new search algorithm:

[Screenshot: top four search results for “stats” with the new algorithm]

These install counts make a lot more sense given that we’re getting 93k searches per year. Collectively, these four are already used by millions of sites, so sending a hundred thousand more sites toward them each year will not overwhelm them. Presumably users we direct toward them will have a good experience with those plugins — as millions already have.

In improving the fidelity of search results, it’s not just a question of how we satisfy a single user’s search query, but how we satisfy thousands of users for each unique search term: which plugins will support that volume of users and their requests for support? Which are most likely to give all of these users a great WordPress experience?

Evaluating Results

To build relevant search, you need a plan on how to iterate on evaluating search results. In the previous iteration of plugin repo search, we didn’t have good click data for evaluating search, nor did we have a way to evaluate it on live traffic, so I used a few sets of searches for my testing:

  • “Important” searches. This was a list of about 50 searches. Important searches include some searches from the top 10 and some that I and others found interesting. A number of the searches that were cited as feedback on the search ticket ended up in here (“responsive slider,” “event,” “import,” “transport,” “glotpress”). We focused particularly on searches that were ambiguous or were words that would show up for completely unrelated plugins.
  • Top 1k searches. This covers 46% of all searches.
  • Random 1k from the top 100k: just take the top 100k and randomly select from among them.
  • Random 1k from the bottom 400k searches. These are mostly searches that occurred once over the course of two months.

I repeatedly ran the search queries I evaluated against these lists and imported them into a spreadsheet to sort and evaluate them. Because 3000+ searches is way too many to manually evaluate, I tried to focus on a few things:

  • Were the “important” searches looking good for the top 14 results?
  • Which searches received zero to four results? How many of them were there? (This analysis also led to defining some future work on auto corrections)
  • Which searches had the lowest/highest ratio of active_installs to number of times the search was performed? (I tried to use this as a proxy for the supply vs. demand ratio I mentioned above. I also used a similar ratio for resolved support threads.)

It wasn’t a perfect system, but it allowed me to quickly iterate and evaluate new index mappings and queries. None of these “metrics” were good indicators of performance, but having them let me sort the queries, focus on the outliers, and work to improve the results overall.
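The outlier sort described above can be sketched in a few lines of Python. This is only an illustration: the record layout, the sample numbers, and the choice to sum active installs over the returned results are assumptions, not the spreadsheet that was actually used.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    query: str            # the search text
    searches: int         # times the query was run in the sample period
    active_installs: int  # summed active installs of the top results (assumed aggregation)


def supply_demand_ratio(stats: QueryStats) -> float:
    """Installs available per search performed; low values flag under-served queries."""
    return stats.active_installs / max(stats.searches, 1)


def sort_by_ratio(rows):
    """Return rows ordered from the lowest ratio to the highest."""
    return sorted(rows, key=supply_demand_ratio)


# Hypothetical sample data, for illustration only.
sample = [
    QueryStats("query a", searches=1000, active_installs=20_000),
    QueryStats("query b", searches=1000, active_installs=3_000_000),
    QueryStats("query c", searches=10, active_installs=500),
]

for row in sort_by_ratio(sample):
    print(f"{row.query}: {supply_demand_ratio(row):.1f} installs per search")
```

Sorting this way puts the queries whose results have the least proven capacity at the top of the list, which is where manual review time is best spent.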

Common Elasticsearch Query Patterns

I’ve started to think about the generic structure that I apply when writing Elasticsearch (ES) queries, and how that structure can help me create relevant results.

Over the past year, I’ve settled into some common patterns. The core of a search is almost always a text-matching portion boosted by a number of other signals. The structure splits the query into three parts: (1) function boosting applied to metadata about the documents; (2) an AND text query that reduces the set of documents that match; and (3) a section that boosts the document scores based on the text of the documents.

These three sections (function boosting, text matching, and text boosting) all get wrapped together into a single ES query, but let’s discuss them separately.

Function Boosting

A search query that is modified by a number of other signals is straightforward:

  "query": {
    "function_score": {
      "query": {
        //Text scoring query (text matching and text boosting)
      },
      "functions": [
        //series of functions to multiply the text score
      ],
      "boost_mode": "multiply"
      //if any signals are open ended, set a max using max_boost
    }
  }

For simplicity I’ve left off any filtering of the results to focus just on scoring the documents.

For the plugin search, we use a number of different signals. Initially we adjusted them on an ad-hoc basis, by looking at a single query and making small tweaks. Eventually we discovered that making adjustments for one query would hurt a different query. I found that graphing the functions across the ranges that mattered to me helped a lot. The charts below are from my (very simple) graphing code.

We iterated on the active_installs boosting three times, so it’s a good example.

It started as sqrt( active_installs ):

[Chart: sqrt( active_installs ) boost]

This had the problem where sqrt increases very quickly when applied to a field that is exponential in nature. Starting from 1, active_installs goes up to 1,000,000. Clearly, 1 million installs is not a million times more important than a single install. But even with a square root we increase much too quickly and give the 1 million plugin 1,000 times more importance.

And so we switched to log2p( active_installs ):

[Chart: log2p( active_installs ) boost]

That is better, but we found that there was not enough differentiation between a plugin installed 100k times and one that was installed more than a million times. We often ended up recommending a plugin with 50k installs even though there was one with 500k that seemed like a better option.
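To make the trade-off concrete, here is a small Python sketch that evaluates both curves at a few install counts. It assumes Elasticsearch's definition of the log2p modifier (common logarithm of the value plus 2); the sample install counts are arbitrary.

```python
import math


def sqrt_boost(installs: float) -> float:
    """Square-root boost: still grows far too fast over six orders of magnitude."""
    return math.sqrt(installs)


def log2p_boost(installs: float) -> float:
    """Elasticsearch's log2p modifier: log10(value + 2)."""
    return math.log10(installs + 2)


for installs in (1, 1_000, 50_000, 500_000, 1_000_000):
    print(
        f"{installs:>9,}  sqrt={sqrt_boost(installs):8.1f}  "
        f"log2p={log2p_boost(installs):.2f}"
    )
```

With the square root, a plugin with 1,000,000 installs gets a boost of 1,000. With log2p, going from 50k to 500k installs only raises the boost by about 1.2 times, which is why the two plugins were hard to tell apart.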

This resulted in combining two functions for boosting active installs:

{
  "field_value_factor": {
    "field": "active_installs",
    "factor": 0.375,
    "modifier": "log2p",
    "missing": 1
  }
},
{
  "exp": {
    "active_installs": {
      "origin": 1000000,
      "offset": 0,
      "scale": 900000,
      "decay": 0.75
    }
  }
}

The exponential turns the logarithmic curve into a mostly straight line from 100k to 1m so that there’s more differentiation in that range. You can see the difference:

[Chart: combined log2p and exp boost for active_installs]
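The combined curve can be reproduced outside Elasticsearch with a short Python sketch. It assumes the documented formulas: field_value_factor applies the modifier to factor * value, and the exp decay is exp(lambda * max(0, |value - origin| - offset)) with lambda = ln(decay) / scale. The function names are made up for illustration.

```python
import math


def log2p(value: float, factor: float = 1.0) -> float:
    """field_value_factor with the log2p modifier: log10(factor * value + 2)."""
    return math.log10(factor * value + 2)


def exp_decay(value, origin, offset, scale, decay):
    """Elasticsearch-style exponential decay around an origin."""
    lam = math.log(decay) / scale
    return math.exp(lam * max(0.0, abs(value - origin) - offset))


def installs_boost(installs: float) -> float:
    """Product of the two active_installs functions, using the parameters shown above."""
    return log2p(installs, factor=0.375) * exp_decay(
        installs, origin=1_000_000, offset=0, scale=900_000, decay=0.75
    )


for installs in (1_000, 50_000, 100_000, 500_000, 1_000_000):
    print(f"{installs:>9,}  boost={installs_boost(installs):.3f}")
```

With these parameters the decay multiplies the score by 0.75 at 100k installs and by 1.0 at 1m, so a plugin with 500k installs ends up with roughly 1.4 times the boost of one with 50k, compared with about 1.2 times for log2p alone.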

Most of the other signals were not this complicated and only needed minor adjustments. Along the way, writing out the actual scoring equation also helped a lot. In the end, our boosting looks something like this:

text_score *
	0.375 * log2p( active_installs ) *
	exp( active_installs, 1000000, 0, 900000, 0.75 ) *
	0.25 * log2p( support_threads_resolved ) *
	0.25 * sqrt( rating ) *
	exp( tested, 4.7, 0.1, 0.4, 0.6 ) *
	gauss( plugin_modified, 2017, 180d, 360d, 0.5 )

Though active installs is important, we also focus on signals that plugin authors have a lot more control over and are behaviors that should lead to a good user experience:

  • Resolving support threads.
  • Keeping the plugin up to date.
  • Testing the plugin on the latest versions of WordPress.

Text Matching

During this project, I decided that the way I’ve been structuring my search queries did not give me enough flexibility when trying to reason about and fine tune the search. In the past, I have been using a text query that looks something like this:

{
  "multi_match" : {
    "fields" : [
      "title.ngram^2",
      "content",
      "tags",
      "author^2"
    ],
    "query" : "post stats",
    "operator" : "and",
    "type" : "cross_fields"
  }
}

Let’s break down these pieces:

  • We search across a number of fields, and boost some individual fields. (For example, the author is twice as important as the content because if you match on author you are probably doing an author search.)
  • Partial word boosting is very helpful for plugin titles so we use n-grams. My favorite example of this is “Formidable,” a form builder plugin with a name that is slightly too clever to score well in a search for “form.”
  • Use an AND operator. The user specified both “post” and “stats.” The docs we return should have both terms in them. This is the behavior users expect.
  • We match across fields. This way if “stats” is in the title and “post” is in the content, the document will still match. Search will feel broken otherwise.

Often I would wrap the above in a boolean query with this as a must clause, and then add a should clause that would boost the results when we matched a phrase. So if “post stats” is in the doc, it would be boosted a bit.
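As a sketch of that older pattern, the following Python builds the request body as a plain dict. The cross_fields clause is the one shown above; the fields used for the phrase boost ("title" and "content") are an assumption, since the original phrase clause is not shown.

```python
import json


def legacy_text_query(user_query: str) -> dict:
    """Older pattern: cross_fields AND match as `must`, phrase match as `should`."""
    cross_fields = {
        "multi_match": {
            "fields": ["title.ngram^2", "content", "tags", "author^2"],
            "query": user_query,
            "operator": "and",
            "type": "cross_fields",
        }
    }
    phrase_boost = {
        "multi_match": {
            # Assumed fields for the phrase boost; adjust to the real mapping.
            "fields": ["title", "content"],
            "query": user_query,
            "type": "phrase",
        }
    }
    return {"bool": {"must": cross_fields, "should": [phrase_boost]}}


print(json.dumps(legacy_text_query("post stats"), indent=2))
```

Note that the must clause here both selects the documents and contributes most of the score, which is the coupling described in the next paragraph.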

The above query structure works OK, but I found it very difficult to separate which set of documents were getting matched from how we were scoring those documents. The title.ngram for instance can be a very noisy matching query, but when I tried adjusting it I kept finding other cases that would break. Like most search problems, when something isn’t working well, it’s time to rebuild and restructure your index. So I added a new field called all_content that contained all the text about the plugin: title, author, slug, content, tags, etc. Then I built a query that mostly separated matching documents from boosting the score based on the text of the docs.

The text query pattern became:

"bool": {
  "must": {
    "multi_match": {
      //must match against the content,
      // but very low boost so the score is mostly inconsequential
      "fields" : [ "all_content_en^0.1" ],
      "query": "USER_QUERY",
      "operator": "and"
    }
  },
  "should": [
    //a series of other queries that are used to boost the results
  ]
}

Now, document matching is mostly separated from determining which document has the most relevant text. The docs which match our query are determined by the all_content field entirely. This also works great for search as you type. We can build an all_content.edgengram field to match against very efficiently. When all we have is a few characters, the ranking will be determined by this field.

The all_content field always contributes some scoring to our results, but because of the low boost, if any of the should clauses match, then those will completely dominate the scoring. If none of the boosts match though, then this basic query ensures we get some ranking which will mostly get reranked by our function boosting.

Implementation detail: In our case we treat the all_content field as an entirely independent ES field, but Elasticsearch also has a copy_to parameter in its mappings that can be used to implement it.
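For readers who want to try the copy_to route, here is a minimal Python sketch that generates such a mapping. The source field names and the typeless "mappings.properties" layout used by recent Elasticsearch versions are assumptions; this is not the mapping used for the plugin directory.

```python
import json

# Hypothetical source fields that should all feed the combined field.
SOURCE_FIELDS = ["title", "author", "slug", "content", "tags"]


def build_mapping(source_fields, target="all_content"):
    """Map each source field as text and copy its values into one combined field."""
    properties = {
        name: {"type": "text", "copy_to": target} for name in source_fields
    }
    # The combined field itself is an ordinary text field.
    properties[target] = {"type": "text"}
    return {"mappings": {"properties": properties}}


print(json.dumps(build_mapping(SOURCE_FIELDS), indent=2))
```

The resulting body could be sent when creating an index; queries then match against all_content while the individual fields remain available for boosting.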

Text Boosting

When any of the should clauses match, our score effectively becomes the sum of the different queries in the should clause. When debugging individual queries that are not performing well, we can focus on coming up with a new query clause to improve those specific results without worrying too much about the changes affecting the set of documents we’re scoring. It’s still important to test the impact, but in my experience it’s a lot easier to reason about.

For instance, if I see a number of cases where it looks like we want to match a partial word in the title of a plugin, we can add this query to the should list:

{
  "match_phrase": {
    "title_en.ngram": {
      "query": "USER_QUERY",
      "boost": 0.2
    }
  }
}

Because these are n-grams (and it is a 3-5 n-gram field), the boost doesn’t need to be very high to have a strong impact. But because we’re not including n-grams in the must portion of the query, we won’t have as much noise as we did before.

In the end, we now have a much more complicated-looking query, but it was surprisingly easy to reason about how to build it when evaluating specific cases:

"bool": {
  "must": {
    "match": {
      "all_content_en": {
        "query": "stats",
        "operator": "and",
        "boost": 0.1
      }
    }
  },
  "should": [
    {
      "multi_match": {
        "query": "stats",
        "fields": [
          "title_en",
          "excerpt_en",
          "description_en",
          "taxonomy.plugin_tags.name"
        ],
        "type": "phrase",
        "boost": 2
      }
    },
    {
      "match_phrase": {
        "title_en.ngram": {
          "query": "stats",
          "boost": 0.2
        }
      }
    },
    {
      "multi_match": {
        "query": "stats",
        "fields": [
          "title_en",
          "slug_text"
        ],
        "type": "best_fields",
        "boost": 2
      }
    },
    {
      "multi_match": {
        "query": "stats",
        "fields": [
          "excerpt_en",
          "description_en",
          "taxonomy.plugin_tags.name"
        ],
        "type": "best_fields",
        "boost": 2
      }
    },
    {
      "multi_match": {
        "query": "stats",
        "fields": [
          "author",
          "contributors"
        ],
        "type": "best_fields",
        "boost": 2
      }
    }
  ]
}

I have also been using this same query structure on other projects and so far it’s made iterating on algorithm relevancy much easier.
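To show how the three sections fit together, here is a sketch in Python that assembles the whole request body for an arbitrary user query. The field names, boosts, and the two active_installs functions are copied from the snippets above; the helper names are made up, and the remaining boost functions and any filtering are left out, so treat it as an outline, not the production query.

```python
import json


def build_text_query(user_query: str) -> dict:
    """Text matching (must) plus text boosting (should)."""

    def best_fields(fields):
        return {
            "multi_match": {
                "query": user_query,
                "fields": fields,
                "type": "best_fields",
                "boost": 2,
            }
        }

    return {
        "bool": {
            # Matching: every term must appear somewhere in the combined field.
            "must": {
                "match": {
                    "all_content_en": {
                        "query": user_query,
                        "operator": "and",
                        "boost": 0.1,
                    }
                }
            },
            # Boosting: each clause adds to the score when it matches.
            "should": [
                {
                    "multi_match": {
                        "query": user_query,
                        "fields": [
                            "title_en",
                            "excerpt_en",
                            "description_en",
                            "taxonomy.plugin_tags.name",
                        ],
                        "type": "phrase",
                        "boost": 2,
                    }
                },
                {"match_phrase": {"title_en.ngram": {"query": user_query, "boost": 0.2}}},
                best_fields(["title_en", "slug_text"]),
                best_fields(["excerpt_en", "description_en", "taxonomy.plugin_tags.name"]),
                best_fields(["author", "contributors"]),
            ],
        }
    }


def build_search_body(user_query: str) -> dict:
    """Wrap the text query in function_score so metadata signals multiply the text score."""
    functions = [
        {
            "field_value_factor": {
                "field": "active_installs",
                "factor": 0.375,
                "modifier": "log2p",
                "missing": 1,
            }
        },
        {
            "exp": {
                "active_installs": {
                    "origin": 1000000,
                    "offset": 0,
                    "scale": 900000,
                    "decay": 0.75,
                }
            }
        },
    ]
    return {
        "query": {
            "function_score": {
                "query": build_text_query(user_query),
                "functions": functions,
                "boost_mode": "multiply",
            }
        }
    }


print(json.dumps(build_search_body("stats"), indent=2))
```

Keeping the matching clause, the boosting clauses, and the function list in separate places makes it easy to change one without touching the others.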

You can also take a look at how the whole query gets put together in the source code. The actual query also has some interesting customizations for searching in non-English languages given that many plugins do not have good translations to the hundred or so languages that WordPress supports.

 


Want to help build better search for the Open Web? We’re hiring.

14 thoughts on “Improving Relevance and Elasticsearch Query Patterns”

  1. Thank you for sharing your experience with Elastic Search!
    Have you got any idea whether these ideas will be applied also to the plugin support forums?
    Right now, unless I am overlooking something obvious, these forums are not searchable, and a huge pain in the neck to use.
    Thanks,
    Guido


  2. Hi Greg,
    Interesting read.
    I don’t see you accounting for the new plugins anywhere? A new plugin may be able to support more users than existing plugins, but since you are not sending them any users, we will never know.

    We recently launched a plugin called Easy AdSense Ads and Scripts Manager.
    https://wordpress.org/plugins/easy-adsense-ads-scripts-manager/

    It used to be in the top 10 for the search term “AdSense” in the old search system. In our short time out there we have seen users switching from plugins which come at the top in the search results to ours.

    1. We have a decent product which can give strong competition to the existing plugins in the same niche.
    2. We have the resources, energy and enthu to support our product.

    But, too bad. We are crushed by the numbers other plugins have accumulated over years.

    Any ranking which factors metrics accumulated over time should also factor in time to normalise the score.

    You have to normalize your score by factoring in the amount of time it took them to achieve their numbers. Then the system will be fair to new entrants and give them a fair chance to prove their mettle.

    Hope you make the necessary changes in the next iteration.


    1. Hi Satish,

      Yup, that’s a valid criticism from the point of view of a plugin author. It sounds like you’re working hard to make your plugin great. That’s awesome.

      As background though, there are also 475 other plugins that mention “adsense” in their description. There are 50-60 new “adsense” plugins a year, so I don’t think what you suggest will actually help you that much.

      As I said on the other thread, I do like the idea of looking at install count changes over time and think it could be a good signal. When we’ve gathered enough data for another iteration on the algorithm we’ll take a look at trying it.


  3. Totally agree with Satish. It is quite discouraging for the new plugin developers. If the stat numbers are not normalized for time, the new plugin developers simply won’t stand a chance and will be discouraged from releasing new plugins to the repository and from maintaining or adding new features to them. Why bother if your shiny new plugins with latest features are not seen or downloaded by anyone? It will establish a status quo where age of a plugin is the single most dominating factor.

    I sincerely hope that this issue is taken into account and soon.

    Thanks


    1. I’m definitely keeping it in mind and thinking of ways to help. There are also discussions about creating UI for finding new plugins or recently updated plugins. I’m planning further work in these directions over the next few weeks as I get more examples of searches that aren’t working well. If you have examples that would be very helpful.

      The primary goal of the search though is to connect end users with a plugin that will meet their needs in both the short and long term. The primary search algorithm needs to be tailored to that core use case so that we give end users the best possible WordPress experience based on the best data that we have. I think the current algorithm is mostly doing that, but there are a lot of corner cases where it can be further improved. Exact matches against plugins is the top one, but the changes in install counts over some period of time is also interesting.


      1. New areas for finding new plugins or recently updated plugins will help. But it will not be enough because it is true that the very high weight given to the plugin age will hurt new WordPress developers. The effect will not be visible today or in a few months. But as it gradually sets in that no matter what you do, your new plugin’s ranking in search results cannot be helped and your plugins will get almost no downloads – developers will find little incentive to submit to the repo.

        I want to quote Mika here “Your plugin rank is, as always, based on the quality of your readme, the resolved support posts, and the average of your reviews. That’s not changed.” (Ipstenu (Mika Epstein) 8:40 pm on March 29, 2017 comment in https://make.wordpress.org/plugins/2017/03/28/the-new-directory-is-mostly-live/ )

        It used to be true. So, a new developer could make a great new plugin that solves some old problem better, write a killer readme text file, show live demos, give great and quick support etc.- there was a chance that he could rank higher and get downloads and feedback and improve on it. Now there will be no download, no feedback, no incentive to make the plugin better. This creates a bad loop. Why even develop the next free plugin for WordPress repository?

        Examples:
        1. Searching for the generic term “directory” – I see Next Active Directory Integration, TablePress, Redirection, Dashboard Directory Size in the 1st page. What are tablepress and Redirection doing there? Do not seem relevant. tablepress has only one mention of the term “directory” – “If you like TablePress, please rate and review it here in the WordPress Plugin Directory,” But being in the repo for 5 years has really really paid off in this case. This single example shows how much weight is given to the plugin age. Too much.

        2. Searching for the generic term “charts” – I see Gravity Forms Charts Reports, Inline Google Spreadsheet Viewer, NC Size Chart for Woocommerce, Easy Digital Downloads, YITH Product Size Charts for WooCommerce, Live Gold Price & Silver Price Charts Widgets in 1st page. Easy Digital Downloads should not be there at all. Others are for highly specific scenarios and should turn up for longer tails or more Exact match keywords only – not for a generic term like Charts where a user is more likely to be looking for a generic solution to create Charts and Graphs. But as I understand exact match keywords are devalued as well.

        Both examples show an overwhelming power of plugin age – over which a new developer has no control. This should be reduced in my humble opinion and make the readme file more relevant again. The developers know what they made the plugin for – which is reflected in their readme file. But if readme file gets beaten by irrelevant plugins’ age in search results every time then the developer is hopeless.

        Thanks


    2. Thanks for the feedback.

      Both of these examples are helpful. I agree there is some noise in them. The top 9-11 results I think are pretty good in both cases. There are also a number of plugins there that are pretty new (600 installs or 100 installs). After those we start to see some problems with very active plugins getting boosted up. Looking at the results on pages two and three, the problem is kinda less about these plugins and more about the other choices that match. We quickly start to see a lot of plugins that are out of date or that are not tested on the latest WP. But I’ll take a look at these cases in more detail on my next run through of the search alg. I’ve added it to this ticket:

      https://meta.trac.wordpress.org/ticket/2642#comment:15

      Thanks.

      “So, a new developer could make a great new plugin that solves some old problem better, write a killer readme text file, show live demos, give great and quick support etc.- there was a chance that he could rank higher and get downloads and feedback and improve on it”

      It was also true that a new developer could write a terrible plugin, spend lots of time crafting a readme/title/slug/tags to hit a bunch of keywords and provide a pretty terrible experience for the end users who downloaded and tried to use their plugin. We needed to do a better job of connecting end users with high quality and proven plugins. To the extent that we can also find ways to highlight new plugins or enable users to find new plugins we should, but first we needed to get the main search algorithm working better for the first 5-10 results.


      1. Thanks for taking these into considerations. Truly appreciate that. I believe you have done a wonderful job so far for the first round!

        I understand that in the 1st iteration you concentrated on getting the best possible results for top searched keywords and tried to rank the trusted plugins for them. But giving too much weight to plugin age/install skews a lot of other search results. Example search term: knowledgebase – results in 1st page:
        Yoast SEO
        ShiftNav – Responsive Mobile Menu
        WC Vendors
        The Events Calendar
        LifterLMS
        WP Product Review Lite
        Key4ce osTicket Bridge
        WHMCS Bridge

        Yoast SEO for knowledge base? They mentioned this term one time as a link to “The Yoast SEO Knowledgebase”. Seems like plugins with high install bases can have just a single out of context word in their descriptions or anywhere and that will instantly start ranking on 1st page, hijacking the most valuable real estates from other plugins that actually serves the purpose of user intents. This feels unfair for all other plugins.

        May I suggest dampening the age/installbase effect significantly after first 4 results and rank the rest based on readme file, exact match, keyword occurrences, and other criteria? This will serve multiple purposes:
        1. Get the trusted plugins with high installbases to show up in the beginning
        2. Give other (newer) plugins a genuine chance to be visible.
        3. Give users more options to choose from
        4. Reduce the number of irrelevant plugins showing up because, despite the chance of spamming, if a plugin description mentions knowledgebase ten times as opposed to one – it is more likely to be relevant.

        You can add new areas for new plugins, updated plugins, sorting by latest etc. but I am afraid they will have limited effect. I think 90%-95% users will never look beyond the default search results and 1st page. Regardless of age, all plugins should have a viable chance to compete for the 1st 14 results – which is what really matters.

        Please get rid of the reverse ageism 🙂 Just kidding. Keep up the great work and please see how you can give fair chances to all plugins.

        Thanks again!


  4. I have to agree, new WordPress developers are being killed by this new algorithm…
    Even if I understand that a plugin with 5k installation is something that the visitor must notice, there are obviously more recent plugins that are doing a better job (better UI for example, same functionalities). You run the risk to drive new WP users to “old” plugins, while the modern ones will stay in the shadow forever, because this is what the new search does.
    If you search for the EXACT name of some plugins and/or the EXACT list of their tags, they end up in 5th page or even worse. Are we sure this makes much sense?


    1. “there are obviously more recent plugins that are doing a better job”

      Do you have any examples? It’s hard to discuss this in the abstract. Which searches are you seeing where the results should be different?

      “If you search for the EXACT name of some plugins and/or the EXACT list of their tags, they end up in 5th page or even worse. Are we sure this makes much sense?”

      This ticket is focused on better exact matching, so if you have some examples can you add them there. As mentioned above, just because a plugin has the name “stats” or the tag “stats” does not mean that it is the best plugin to recommend to users who search for “stats”. As the search user types more words we should do a better job than we currently are at exact matches, but especially for shorter searches the new algorithm seems to be working pretty well based on the feedback I’ve received.

