---
layout: default
title: Core Concepts
nav_order: 10
parent: LTR search
has_children: false
---

# Core Concepts

Welcome! You're here if you're interested in adding machine learning
ranking capabilities to your OpenSearch system. This guidebook is
intended for OpenSearch developers and data scientists.

## What is Learning to Rank?

*Learning to Rank* (LTR) applies machine learning to search relevance
ranking. How does relevance ranking differ from other machine learning
problems? Regression is one classic machine learning problem. In
*regression*, you're attempting to predict a variable (such as a stock
price) as a function of known information (such as the number of company
employees, the company's revenue, and so on). In these cases, you're
building a function, say *f*, that can take what's known
(*numEmployees*, *revenue*) and have *f* output an approximate stock
price.

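To make this concrete, here's a toy sketch of such a regression function. The coefficients are invented purely for illustration; a real model would learn them from historical data:

    # A toy regression function f: predicts a stock price from known
    # company features. Coefficients are made up for illustration only.
    def f(num_employees, revenue):
        return 0.001 * num_employees + 0.000002 * revenue + 10.0

    print(f(num_employees=5000, revenue=20_000_000))  # -> 55.0
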
Classification is another machine learning problem. With classification,
our function *f* would classify our company into one of several
categories: for example, profitable or not profitable, or perhaps
whether or not the company is evading taxes.

In Learning to Rank, the function *f* we want to learn does not make a
direct prediction. Rather, it's used for ranking documents. We want a
function *f* that comes as close as possible to our user's sense of
the ideal ordering of documents for a given query. The value output by
*f* itself has no meaning (it's not a stock price or a category). It's
more a prediction of a user's sense of the relative usefulness of a
document given a query.

Here, we'll briefly walk through a 10,000-meter view of Learning to
Rank. For more information, we recommend the blog articles [How is Search
Different From Other Machine Learning
Problems?](http://opensourceconnections.com/blog/2017/08/03/search-as-machine-learning-prob/)
and [What is Learning to
Rank?](http://opensourceconnections.com/blog/2017/02/24/what-is-learning-to-rank/).

## Judgements: expression of the ideal ordering

Judgement lists, sometimes referred to as "golden sets", grade
individual search results for a keyword search. For example, our
[demo](http://github.com/o19s/elasticsearch-learning-to-rank/tree/master/demo/)
uses [TheMovieDB](http://themoviedb.org). When users search for
"Rambo", we can indicate which movies ought to come back for "Rambo"
based on our users' expectations of search.

For example, we know these movies are very relevant:

- First Blood
- Rambo

We know these sequels are fairly relevant, but not exactly relevant:

- Rambo III
- Rambo: First Blood Part II

Some movies that star Sylvester Stallone are only tangentially relevant:

- Rocky
- Cobra

And of course many movies are not even close:

- Bambi
- First Daughter

Judgement lists apply "grades" to documents for a keyword; this helps
establish the ideal ordering for a given keyword. For example, if we
grade documents from 0-4, where 4 is exactly relevant, the above would
turn into the judgement list:

    grade,keywords,movie
    4,Rambo,First Blood # Exactly Relevant
    4,Rambo,Rambo
    3,Rambo,Rambo III # Fairly Relevant
    3,Rambo,Rambo First Blood Part II
    2,Rambo,Rocky # Tangentially Relevant
    2,Rambo,Cobra
    0,Rambo,Bambi # Not even close...
    0,Rambo,First Daughter

A search system that approximates this ordering for the search query
"Rambo", and all our other test queries, can be said to be performing
well. Metrics such as
[NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) and
[ERR](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.4509&rep=rep1&type=pdf)
evaluate a query's actual ordering versus the ideal judgement list.

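As a rough illustration of how such a metric works, here's a minimal Python sketch of NDCG using the 0-4 grades above. The ranked grades passed in are a hypothetical result ordering our system might return for "Rambo":

    import math

    def dcg(grades):
        """Discounted cumulative gain for a ranked list of 0-4 grades."""
        return sum((2**g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

    def ndcg(ranked_grades):
        """DCG of the actual ordering divided by DCG of the ideal ordering."""
        ideal = dcg(sorted(ranked_grades, reverse=True))
        return dcg(ranked_grades) / ideal if ideal > 0 else 0.0

    # Suppose we returned Rambo, Rambo III, First Blood, Rocky, Bambi:
    print(ndcg([4, 3, 4, 2, 0]))  # ~0.96: close to, but short of, the ideal 1.0
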
Our ranking function *f* needs to rank search results as close as
possible to our judgement lists. We want to maximize quality metrics
such as ERR or NDCG over the broadest number of queries in our
training set. When we do this with accurate judgements, we return
result listings that are maximally useful to users.

## Features: the raw material of relevance

Above, in the example of a stock market predictor, our ranking function
*f* used variables such as the number of employees, revenue, and so on
to arrive at a predicted stock price. These are *features* of the
company. Here our ranking function must do the same: using features that
describe the document, the query, or some relationship between the
document and the query (such as a query keyword's TF\*IDF score in a
field).

Features for movies, for example, might include:

- Whether/how much the search keywords match the title field (let's
  call this *titleScore*)
- Whether/how much the search keywords match the description field
  (*descScore*)
- The popularity of the movie (*popularity*)
- The rating of the movie (*rating*)
- How many keywords are used during search (*numKeywords*)

Our ranking function then becomes
`f(titleScore, descScore, popularity, rating, numKeywords)`. We hope
whatever method we use to create a ranking function can utilize these
features to maximize the likelihood of search results being useful for
users. For example, it seems intuitive in the "Rambo" use case that
*titleScore* matters quite a bit. But one top movie, "First Blood",
probably only mentions the keyword Rambo in the description. So in this
case *descScore* comes into play. Also, *popularity/rating* might help
determine which movies are "sequels" and which are the originals. We
might learn this feature doesn't work well in this regard, and
introduce a new feature *isSequel* that our ranking function could use
to make better ranking decisions.

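For intuition only, here's what a hand-tuned linear version of such a ranking function might look like. The weights and the rating value are entirely made up; in LTR, training would learn weights like these from the judgement list instead:

    # A hypothetical hand-tuned ranking function over the features above.
    # Weights are invented for illustration; LTR learns them from data.
    def f(title_score, desc_score, popularity, rating, num_keywords):
        return (2.0 * title_score
                + 1.0 * desc_score
                + 0.05 * popularity
                + 0.25 * rating
                - 0.1 * num_keywords)

    # "First Blood" has no title match, but description and popularity carry it:
    print(f(title_score=0.0, desc_score=21.5, popularity=100, rating=7.3, num_keywords=1))
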
Selecting and experimenting with features is a core piece of learning to
rank. Good judgements with poor features that don't help predict
patterns in the predicted grades won't create a good search
experience. Just like any other machine learning problem: garbage in,
garbage out!

For more on the art of creating features for search, check out the book
[Relevant Search](http://manning.com/books/relevant-search) by Doug
Turnbull and John Berryman.

## Logging features: completing the training set

With a set of features we want to use, we need to annotate the judgement
list above with values for each feature. This data will be used once
training commences.

In other words, we need to transform:

    grade,keywords,movie
    4,Rambo,First Blood
    4,Rambo,Rambo
    3,Rambo,Rambo III
    ...

into:

    grade,keywords,movie,titleScore,descScore,popularity,...
    4,Rambo,First Blood,0.0,21.5,100,...
    4,Rambo,Rambo,42.5,21.5,95,...
    3,Rambo,Rambo III,53.1,40.1,50,...

(Here *titleScore* is the relevance score of "Rambo" for the title field
in the document "First Blood", and so on.)

Many learning to rank models are familiar with a file format introduced
by SVM Rank, an early learning to rank method. Queries are given ids,
and the actual document identifier can be removed for the training
process. Features in this file format are labeled with ordinals starting
at 1. For the above example, we'd have the file format:

    4 qid:1 1:0.0 2:21.5 3:100 ...
    4 qid:1 1:42.5 2:21.5 3:95 ...
    3 qid:1 1:53.1 2:40.1 3:50 ...
    ...

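For illustration, here's a minimal sketch of that conversion in Python, assuming the annotated judgement list above was saved as a hypothetical `judgements.csv` with a header row (and without the trailing `...` columns):

    import csv

    # Assign each distinct query an id and number features starting at 1,
    # emitting one SVMRank-format line per judged document. The document
    # identifier (movie) is dropped, as it isn't needed for training.
    qids = {}
    with open("judgements.csv") as f:
        for row in csv.DictReader(f):
            qid = qids.setdefault(row["keywords"], len(qids) + 1)
            feats = [row["titleScore"], row["descScore"], row["popularity"]]
            pairs = " ".join(f"{i}:{v}" for i, v in enumerate(feats, start=1))
            print(f"{row['grade']} qid:{qid} {pairs}")
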
In actual systems, you might log these values after the fact, gathering
them to annotate a judgement list with feature values. In others, the
judgement list might come from user analytics, so it may be logged as the
user interacts with the search application. More on this when we cover
logging features later in this guide.

## Training a ranking function

With judgements and features in place, the next step is to train the
ranking function. There are a number of models available for ranking,
each with its own intricate pros and cons. Each one attempts to use
the features to minimize the error in the ranking function, and each
has its own notion of what "error" means in a ranking system. For more
information, read [this blog article](http://opensourceconnections.com/blog/2017/08/03/search-as-machine-learning-prob/).

Generally speaking, there are a couple of families of models:

- Tree-based models (LambdaMART, MART, Random Forests): These models
  tend to be the most accurate in general. They're large and complex
  models that can be fairly expensive to train.
  [RankLib](https://sourceforge.net/p/lemur/wiki/RankLib/) and
  [xgboost](https://github.com/dmlc/xgboost) both focus on tree-based
  models (see the training sketch after this list).

- SVM-based models (SVMRank): Less accurate, but cheap to train. See
  [SVM Rank](https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html).

- Linear models: Perform a basic linear regression over the judgement
  list. They tend not to be useful outside of toy examples. See [this
  blog article](http://opensourceconnections.com/blog/2017/04/01/learning-to-rank-linear-models/).

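As a minimal sketch of the tree-based route, here's how training might look with xgboost's ranking objective, using the three annotated "Rambo" rows from earlier. This assumes the `xgboost` and `numpy` Python packages are installed; a real training set would have many queries and many judged documents per query:

    import numpy as np
    import xgboost as xgb

    # titleScore, descScore, popularity for the three judged "Rambo" docs
    X = np.array([[0.0, 21.5, 100.0],
                  [42.5, 21.5, 95.0],
                  [53.1, 40.1, 50.0]])
    y = np.array([4, 4, 3])  # 0-4 relevance grades

    dtrain = xgb.DMatrix(X, label=y)
    dtrain.set_group([3])  # all three rows belong to the same query (qid:1)

    params = {"objective": "rank:ndcg", "eta": 0.1, "max_depth": 3}
    model = xgb.train(params, dtrain, num_boost_round=50)

    # Output scores have no absolute meaning; only their relative order matters.
    print(model.predict(xgb.DMatrix(X)))
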
As with any technology, model selection can be as much about what a team
has experience with as about what performs best.

## Testing: is our model any good?

Our judgement lists can't cover every user query our model will
encounter out in the wild. So it's important to throw our model
curveballs, to see how well it can "think for itself." Or as machine
learning folks say: can the model generalize beyond the training data? A
model that cannot generalize beyond training data is *overfit* to the
training data, and not as useful.

To avoid overfitting, you hide some of your judgement lists from the
training process. You then use these to test your model. This held-out
data set is known as the *test set*. When evaluating models, you'll hear
about statistics such as "test NDCG" vs "training NDCG." The former
reflects how your model will perform against scenarios it hasn't seen
before. You hope that as you train, your test search quality metrics
continue to reflect high-quality search. Further, after you deploy a
model, you'll want to try out newer judgement lists to see whether your
model might be overfit to seasonal or temporal situations.

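A minimal sketch of holding out a test set, assuming a hypothetical list of judged queries. Note the split is by query, not by document, so whole searches stay unseen during training:

    import random

    queries = ["rambo", "rocky", "star wars", "batman", "alien"]  # hypothetical
    random.seed(42)
    random.shuffle(queries)

    split = int(len(queries) * 0.8)
    train_queries, test_queries = queries[:split], queries[split:]

    # Train only on judgements for train_queries; report NDCG on both sets
    # and watch for a large train/test gap, which signals overfitting.
    print("train:", train_queries)
    print("test: ", test_queries)
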
## Real-world concerns

Now that you're oriented, the rest of this guide builds on this context
to point out how to use the Learning to Rank plugin. But before we move
on, we want to point out some crucial decisions everyone encounters in
building learning to rank systems. We invite you to watch a talk by
[Doug Turnbull and Jason
Kowalewski](https://www.youtube.com/watch?v=JqqtWfZQUTU&list=PLq-odUc2x7i-9Nijx-WfoRMoAfHC9XzTt&index=5)
that brings out the painful lessons of real learning to rank systems.

- How do you get accurate judgement lists that reflect your users' real
  sense of search quality?
- What metrics best measure whether search results are useful to users?
- What infrastructure do you need to collect and log user behavior and
  features?
- How will you detect when/whether your model needs to be retrained?
- How will you A/B test your model against your current solution? What
  KPIs will determine success in your search system?

Next up, see how exactly this plugin's functionality fits into a
learning to rank system.