From 7e3d1761be642ddbd0d9689e94c949a3c3071c87 Mon Sep 17 00:00:00 2001
From: Eric Pugh
Date: Fri, 23 Aug 2024 13:36:33 -0400
Subject: [PATCH] Core concepts of LTR page

---
 _search-plugins/ltr/core-concepts.md | 255 +++++++++++++++++++++++++++
 1 file changed, 255 insertions(+)
 create mode 100644 _search-plugins/ltr/core-concepts.md

diff --git a/_search-plugins/ltr/core-concepts.md b/_search-plugins/ltr/core-concepts.md
new file mode 100644
index 00000000000..757e17187ca
--- /dev/null
+++ b/_search-plugins/ltr/core-concepts.md
@@ -0,0 +1,255 @@
---
layout: default
title: Core Concepts
nav_order: 10
parent: LTR search
has_children: false
---

# Core Concepts

Welcome! You're here because you're interested in adding machine learning
ranking capabilities to your OpenSearch system. This guidebook is
intended for OpenSearch developers and data scientists.

## What is Learning to Rank?

*Learning to Rank* (LTR) applies machine learning to search relevance
ranking. How does relevance ranking differ from other machine learning
problems? Regression is one classic machine learning problem. In
*regression*, you're attempting to predict a variable (such as a stock
price) as a function of known information (such as the number of company
employees, the company's revenue, and so on). In these cases, you're
building a function, say *f*, that takes what's known
(*numEmployees*, *revenue*) and outputs an approximate stock price.

Classification is another machine learning problem. With classification,
our function *f* would classify our company into one of several
categories, for example, profitable or not profitable, or perhaps
whether or not the company is evading taxes.

In Learning to Rank, the function *f* we want to learn does
not make a direct prediction. Rather, it's used for ranking documents.
We want a function *f* that comes as close as possible to
our user's sense of the ideal ordering of documents for a given
query. The value output by *f* has no meaning by itself (it's
not a stock price or a category). It's more a prediction of a user's
sense of the relative usefulness of a document given a query.

Here, we'll briefly walk through the 10,000-meter view of Learning to
Rank. For more information, we recommend the blog articles [How is Search
Different From Other Machine Learning
Problems?](http://opensourceconnections.com/blog/2017/08/03/search-as-machine-learning-prob/)
and [What is Learning to
Rank?](http://opensourceconnections.com/blog/2017/02/24/what-is-learning-to-rank/).

## Judgements: expression of the ideal ordering

Judgement lists, sometimes referred to as "golden sets", grade
individual search results for a keyword search. For example, our
[demo](http://github.com/o19s/elasticsearch-learning-to-rank/tree/master/demo/)
uses [TheMovieDB](http://themoviedb.org). When users search for
"Rambo", we can indicate which movies ought to come back for "Rambo"
based on our users' expectations of search.

For example, we know these movies are very relevant:

- First Blood
- Rambo

We know these sequels are fairly relevant, but not exactly relevant:

- Rambo III
- Rambo First Blood, Part II

Some movies that star Sylvester Stallone are only tangentially relevant:

- Rocky
- Cobra

And of course many movies are not even close:

- Bambi
- First Daughter

Judgement lists apply "grades" to documents for a keyword; this helps
establish the ideal ordering for a given keyword. For example, if we
grade documents from 0-4, where 4 is exactly relevant, the above
becomes the judgement list:

    grade,keywords,movie
    4,Rambo,First Blood # Exactly Relevant
    4,Rambo,Rambo
    3,Rambo,Rambo III # Fairly Relevant
    3,Rambo,Rambo First Blood Part II
    2,Rambo,Rocky # Tangentially Relevant
    2,Rambo,Cobra
    0,Rambo,Bambi # Not even close...
    0,Rambo,First Daughter

A search system that approximates this ordering for the search query
"Rambo", and for all our other test queries, can be said to be performing
well. Metrics such as
[NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) and
[ERR](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.4509&rep=rep1&type=pdf)
evaluate a query's actual ordering against the ideal judgement list.

Our ranking function *f* needs to rank search results as
close as possible to our judgement lists. We want to maximize quality
metrics such as ERR or NDCG over the broadest number of queries in our
training set. When we do this with accurate judgements, we return
result listings that are maximally useful to users.
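
To make this concrete, here's a small Python sketch (an illustration only, not part of the plugin) that scores a made-up ordering for "Rambo" against the ideal ordering using NDCG. The grades come from the judgement list above; the `proposed_ranking` is an assumption invented for the example.

    import math

    # Grades from the judgement list above for the query "Rambo"
    grades = {
        "First Blood": 4,
        "Rambo": 4,
        "Rambo III": 3,
        "Rambo First Blood Part II": 3,
        "Rocky": 2,
        "Cobra": 2,
        "Bambi": 0,
        "First Daughter": 0,
    }

    def dcg(ranking, k=10):
        """Discounted cumulative gain of a ranked list of movie titles."""
        return sum(
            (2 ** grades.get(movie, 0) - 1) / math.log2(position + 2)
            for position, movie in enumerate(ranking[:k])
        )

    def ndcg(ranking, k=10):
        """DCG of the proposed ranking divided by the DCG of the ideal ranking."""
        ideal = sorted(grades, key=grades.get, reverse=True)
        return dcg(ranking, k) / dcg(ideal, k)

    # A hypothetical ordering returned by some search system
    proposed_ranking = ["Rambo", "Rocky", "First Blood", "Bambi"]
    print(f"NDCG@10: {ndcg(proposed_ranking):.3f}")

A perfect ordering scores 1.0; anything lower reflects documents appearing above more relevant ones.
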
## Features: the raw material of relevance

In the stock price example above, the function
*f* used variables such as the number of employees, revenue,
and so on to arrive at a predicted stock price. These are *features* of the
company. Here our ranking function must do the same: use features that
describe the document, the query, or some relationship between the
document and the query (such as a query keyword's TF\*IDF score in a
field).

Features for movies, for example, might include:

- Whether/how much the search keywords match the title field (let's
  call this *titleScore*)
- Whether/how much the search keywords match the description field
  (*descScore*)
- The popularity of the movie (*popularity*)
- The rating of the movie (*rating*)
- How many keywords were used during search
  (*numKeywords*)

Our ranking function then becomes
`f(titleScore, descScore, popularity, rating, numKeywords)`. We hope
whatever method we use to create a ranking function can utilize these
features to maximize the likelihood of search results being useful for
users. For example, it seems intuitive in the "Rambo" use case that
*titleScore* matters quite a bit. But one top movie, "First
Blood", probably only mentions the keyword Rambo in the description. So
in this case *descScore* comes into play. Also,
*popularity/rating* might help determine which movies
are "sequels" and which are the originals. We might learn this feature
doesn't work well in this regard and introduce a new feature,
*isSequel*, that our ranking function could use to make
better ranking decisions.

Selecting and experimenting with features is a core piece of learning to
rank. Good judgements paired with poor features that don't help predict
patterns in the grades won't create a good search
experience. Just like any other machine learning problem: garbage in,
garbage out!

For more on the art of creating features for search, check out the book
[Relevant Search](http://manning.com/books/relevant-search) by Doug
Turnbull and John Berryman.

## Logging features: completing the training set

With a set of features we want to use, we need to annotate the judgement
list above with the value of each feature. This data will be used once
training commences.

In other words, we need to transform:

    grade,keywords,movie
    4,Rambo,First Blood
    4,Rambo,Rambo
    3,Rambo,Rambo III
    ...

into:

    grade,keywords,movie,titleScore,descScore,popularity,...
    4,Rambo,First Blood,0.0,21.5,100,...
    4,Rambo,Rambo,42.5,21.5,95,...
    3,Rambo,Rambo III,53.1,40.1,50,...

(Here *titleScore* is the relevance score of "Rambo" for the title field of the
document "First Blood", and so on.)
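
As a rough sketch of this annotation step, here's a hypothetical Python snippet that joins the judgement list with feature values that have already been logged somewhere. The `logged_features` lookup table and its values are assumptions made up for illustration; they are not produced by the plugin in this form.

    import csv
    import io

    # Judgement list: (grade, keywords, movie)
    judgements = [
        (4, "Rambo", "First Blood"),
        (4, "Rambo", "Rambo"),
        (3, "Rambo", "Rambo III"),
    ]

    # Hypothetical feature values previously logged for each (keywords, movie) pair
    logged_features = {
        ("Rambo", "First Blood"): {"titleScore": 0.0, "descScore": 21.5, "popularity": 100},
        ("Rambo", "Rambo"): {"titleScore": 42.5, "descScore": 21.5, "popularity": 95},
        ("Rambo", "Rambo III"): {"titleScore": 53.1, "descScore": 40.1, "popularity": 50},
    }

    feature_names = ["titleScore", "descScore", "popularity"]

    # Annotate each judgement row with its feature values
    output = io.StringIO()
    writer = csv.writer(output)
    writer.writerow(["grade", "keywords", "movie"] + feature_names)
    for grade, keywords, movie in judgements:
        features = logged_features[(keywords, movie)]
        writer.writerow([grade, keywords, movie] + [features[name] for name in feature_names])

    print(output.getvalue())
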
Many learning to rank tools are familiar with a file format introduced
by SVM Rank, an early learning to rank method. Queries are given IDs,
and the actual document identifier can be removed for the training
process. Features in this file format are labeled with ordinals starting
at 1. For the above example, we'd have the file format:

    4 qid:1 1:0.0 2:21.5 3:100,...
    4 qid:1 1:42.5 2:21.5 3:95,...
    3 qid:1 1:53.1 2:40.1 3:50,...
    ...

In actual systems, you might log these values after the fact, gathering
them to annotate a judgement list with feature values. In other systems, the
judgement list might come from user analytics, so features may be logged as the
user interacts with the search application. More on this when we cover
logging features.

## Training a ranking function

With judgements and features in place, the next decision is to arrive
at the ranking function. There are a number of models available for
ranking, each with its own intricate pros and cons. Each one attempts
to use the features to minimize the error in the ranking function, and
each has its own notion of what "error" means in a ranking system.
For more information, read [this blog article](http://opensourceconnections.com/blog/2017/08/03/search-as-machine-learning-prob/).

Generally speaking, there are a couple of families of models:

- Tree-based models (LambdaMART, MART, Random Forests): These
  models tend to be the most accurate in general. They're large and complex models that can be fairly expensive to train.
  [RankLib](https://sourceforge.net/p/lemur/wiki/RankLib/) and [xgboost](https://github.com/dmlc/xgboost) both focus on tree-based models.

- SVM-based models (SVMRank): Less accurate, but cheap to train. See [SVM Rank](https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html).

- Linear models: These perform a basic linear regression over the judgement
  list and tend not to be useful outside of toy examples. See [this blog article](http://opensourceconnections.com/blog/2017/04/01/learning-to-rank-linear-models/).

As with any technology, model selection can be as much about what a team
has experience with as about what performs best.


## Testing: is our model any good?

Our judgement lists can't cover every user query our model will
encounter out in the wild. So it's important to throw our model
curveballs to see how well it can "think for itself." Or, as machine
learning folks say: can the model generalize beyond the training data? A
model that cannot generalize beyond the training data is *overfit* to the
training data and not as useful.

To avoid overfitting, you hide some of your judgement lists from the
training process. You then use these to test your model. This held-out data
set is known as the "test set." When evaluating models, you'll hear
about statistics such as "test NDCG" vs "training NDCG." The former
reflects how your model will perform against scenarios it hasn't seen
before. You hope that, as you train, your test set's search quality metrics
continue to reflect high-quality search. Further, after you deploy a model,
you'll want to try out newer judgement lists to see whether your
model might be overfit to seasonal or temporal situations.
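
To tie training and testing together, here's a self-contained Python sketch of the overall loop: hold some queries out as a test set, fit a toy linear scoring function by searching for weights that maximize training NDCG, and then compare training NDCG with test NDCG. The data, the random-search "trainer", and the weights are all made up for illustration; real systems would use RankLib, xgboost, or a similar library instead.

    import math
    import random

    # Toy training data: query id -> list of (grade, feature vector)
    # Features per document: [titleScore, descScore, popularity]
    data = {
        1: [(4, [0.0, 21.5, 100]), (4, [42.5, 21.5, 95]), (3, [53.1, 40.1, 50]), (0, [1.1, 3.2, 80])],
        2: [(4, [60.0, 10.0, 20]), (2, [12.0, 8.0, 90]), (0, [0.0, 0.0, 99])],
        3: [(3, [33.0, 25.0, 40]), (1, [5.0, 30.0, 70]), (0, [0.0, 2.0, 60])],
        4: [(4, [50.0, 18.0, 30]), (3, [41.0, 22.0, 45]), (0, [2.0, 1.0, 85])],
    }

    def ndcg(docs, weights):
        """NDCG of the ordering produced by scoring docs with a linear model."""
        def dcg(grades):
            return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))
        scored = sorted(docs, key=lambda d: sum(w * x for w, x in zip(weights, d[1])), reverse=True)
        ideal = sorted(docs, key=lambda d: d[0], reverse=True)
        return dcg([g for g, _ in scored]) / dcg([g for g, _ in ideal])

    def mean_ndcg(qids, weights):
        return sum(ndcg(data[qid], weights) for qid in qids) / len(qids)

    # Hold some queries out of training entirely: this is the "test set"
    train_qids, test_qids = [1, 2, 3], [4]

    # "Train" by random search: keep the weights with the best training NDCG
    random.seed(42)
    best_weights, best_score = None, -1.0
    for _ in range(2000):
        weights = [random.uniform(-1, 1) for _ in range(3)]
        score = mean_ndcg(train_qids, weights)
        if score > best_score:
            best_weights, best_score = weights, score

    print(f"training NDCG: {best_score:.3f}")
    print(f"test NDCG:     {mean_ndcg(test_qids, best_weights):.3f}")

A large gap between the two numbers is the classic sign of overfitting.
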
## Real World Concerns

Now that you're oriented, the rest of this guide builds on this context
to point out how to use the Learning to Rank plugin. But before we move
on, we want to point out some crucial decisions everyone encounters in
building learning to rank systems. We invite you to watch a talk by
[Doug Turnbull and Jason
Kowalewski](https://www.youtube.com/watch?v=JqqtWfZQUTU&list=PLq-odUc2x7i-9Nijx-WfoRMoAfHC9XzTt&index=5)
where the painful lessons of real Learning to Rank systems are brought
out.

- How do you get accurate judgement lists that reflect your users' real
  sense of search quality?
- What metrics best measure whether search results are useful to
  users?
- What infrastructure do you need to collect and log user behavior and
  features?
- How will you detect when/whether your model needs to be retrained?
- How will you A/B test your model vs your current solution? What KPIs
  will determine success in your search system?

Next up, see how exactly this plugin's functionality fits into a
learning to rank system.