chapter12_part2: /080_Structured_Search/05_term.asciidoc #Fix (elasticsearch-cn#379)

* improve

richardwei2008 authored and medcl committed Dec 3, 2016
1 parent 910481c, commit 9a868c8
Showing 1 changed file with 49 additions and 110 deletions: 080_Structured_Search/05_term.asciidoc
[[_finding_exact_values]]
=== Finding Exact Values

When working with exact values,((("structured search", "finding exact values")))((("exact values", "finding")))
you will be working with non-scoring, filtering queries. Filters are
important because they are very fast. They do not calculate
relevance (avoiding the entire scoring phase) and are easily cached. We'll
talk about the performance benefits of filters later in <<filter-caching>>,
but for now, just keep in mind that you should use filtering queries as often as you
can.

==== term Query with Numbers

We are going to explore the `term` query ((("term query", "with numbers")))
((("structured search", "finding exact values", "using term filter with numbers")))
first because you will use it often. This query is capable of handling numbers,
booleans, dates, and text.

We'll start by indexing some documents representing products, each having a
`price` and `productID`:

[source,js]
--------------------------------------------------
POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }
--------------------------------------------------
// SENSE: 080_Structured_Search/05_Term_number.json

Our goal is to find all products with a certain price. You may be familiar
with SQL if you are coming from a relational database background. If we
expressed this query as an SQL query, it would look like this:

[source,sql]
--------------------------------------------------
SELECT document
FROM   products
WHERE  price = 20
--------------------------------------------------

In the Elasticsearch query DSL, we use a `term` query to accomplish the same
thing. The `term` query will look for the exact value that we specify. By
itself, a `term` query is simple. It accepts a field name and the value
that we wish to find:

[source,js]
--------------------------------------------------
{
    "term" : {
        "price" : 20
    }
}
--------------------------------------------------

Usually, when looking for an exact value, we don't want to score the query. We just
want to include/exclude documents, so we will use a `constant_score` query to execute
the `term` query in a non-scoring mode and apply a uniform score of one.

The final combination will be a `constant_score` query which contains a `term` query:

[source,js]
--------------------------------------------------
GET /my_store/products/_search
{
    "query" : {
        "constant_score" : { <1>
            "filter" : {
                "term" : { <2>
                    "price" : 20
                }
            }
        }
    }
}
--------------------------------------------------
// SENSE: 080_Structured_Search/05_Term_number.json

<1> We use a `constant_score` to convert the `term` query into a filter.
<2> The `term` query that we saw previously.
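To make the include/exclude semantics concrete, here is a minimal sketch in plain Python (this is an illustration, not how Elasticsearch is implemented; the document list mirrors the bulk-indexed sample products):

[source,python]
--------------------------------------------------
# Hypothetical product documents, mirroring the indexed examples above.
docs = [
    {"_id": 1, "price": 10},
    {"_id": 2, "price": 20},
    {"_id": 3, "price": 30},
    {"_id": 4, "price": 30},
]

def term_filter(docs, field, value):
    """A term filter is a pure exact match: no analysis, no relevance."""
    return [d for d in docs if d.get(field) == value]

def constant_score(hits, boost=1.0):
    """Wrap the filter's hits and give every one the same uniform score."""
    return [{"_id": d["_id"], "_score": boost} for d in hits]

print(constant_score(term_filter(docs, "price", 20)))
# [{'_id': 2, '_score': 1.0}]
--------------------------------------------------

Only document 2 survives the filter, and it receives the uniform score of one rather than a computed relevance score.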

Once executed, the search results from this query are exactly what you would
expect: only document 2 is returned as a hit (because only `2` had a price
of `20`):

[source,json]
--------------------------------------------------
"hits" : [
    {
        "_index" : "my_store",
        "_type" :  "products",
        "_id" :    "2",
        "_score" : 1.0, <1>
        "_source" : {
          "price" :     20,
          "productID" : "KDKE-B-9947-#kL5"
        }
    }
]
--------------------------------------------------
<1> Queries placed inside the `filter` clause do not perform scoring or relevance,
so all results receive a neutral score of `1`.

==== term Query with Text

As mentioned at the top of ((("structured search", "finding exact values", "using term filter with text")))
((("term filter", "with text")))this section, the `term` query can match strings
just as easily as numbers. Instead of price, let's try to find products that
have a certain UPC identification code. To do this with SQL, we might use a
query like this:

[source,sql]
--------------------------------------------------
SELECT document
FROM   products
WHERE  productID = "XHDK-A-1293-#fJ3"
--------------------------------------------------

Translated into the query DSL, we can try a similar query with the `term`
filter, like so:

[source,js]
--------------------------------------------------
GET /my_store/products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "term" : {
                    "productID" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}
--------------------------------------------------
// SENSE: 080_Structured_Search/05_Term_text.json

Except there is a little hiccup: we don't get any results back! Why is
that? The problem isn't with the `term` query; it is with the way
the data has been indexed. ((("analyze API, using to understand tokenization"))) If we use the `analyze` API (<<analyze-api>>), we
can see that our UPC has been tokenized into smaller tokens:

[source,js]
--------------------------------------------------
GET /my_store/_analyze
{
  "field": "productID",
  "text": "XHDK-A-1293-#fJ3"
}
--------------------------------------------------
// SENSE: 080_Structured_Search/05_Term_text.json

[source,js]
--------------------------------------------------
{
  "tokens" : [ {
    "token" :        "xhdk",
    "start_offset" : 0,
    "end_offset" :   4,
    "type" :         "<ALPHANUM>",
    "position" :     1
  }, {
    "token" :        "a",
    "start_offset" : 5,
    "end_offset" :   6,
    "type" :         "<ALPHANUM>",
    "position" :     2
  }, {
    "token" :        "1293",
    "start_offset" : 7,
    "end_offset" :   11,
    "type" :         "<NUM>",
    "position" :     3
  }, {
    "token" :        "fj3",
    "start_offset" : 13,
    "end_offset" :   16,
    "type" :         "<ALPHANUM>",
    "position" :     4
  } ]
}
--------------------------------------------------

There are a few important points here:

* We have four distinct tokens instead of a single token representing the UPC.
* All letters have been lowercased.
* We lost the hyphen and the hash (`#`) sign.
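The effect of the standard analyzer on this UPC can be approximated in a few lines of Python (a rough imitation for illustration only; the real analyzer is considerably more sophisticated):

[source,python]
--------------------------------------------------
import re

def standard_analyze_approx(text):
    # Split on any run of non-alphanumeric characters (dropping the
    # hyphens and the hash sign), then lowercase each token.
    return [t.lower() for t in re.split(r"[^a-zA-Z0-9]+", text) if t]

print(standard_analyze_approx("XHDK-A-1293-#fJ3"))
# ['xhdk', 'a', '1293', 'fj3']
--------------------------------------------------

Four lowercase tokens come out, and the original string `XHDK-A-1293-#fJ3` is not among them.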

So when our `term` query looks for the exact value `XHDK-A-1293-#fJ3`, it
doesn't find anything, because that token does not exist in our inverted index.
Instead, there are the four tokens listed previously.

Obviously, this is not what we want to happen when dealing with identification
codes, or any kind of precise enumeration.

To prevent this from happening, we need to tell Elasticsearch that this field
contains an exact value by setting it to be `not_analyzed`.((("not_analyzed string fields"))) We saw this
originally in <<custom-field-mappings>>. To do this, we need to first delete
our old index (because it has the incorrect mapping) and create a new one with
the correct mappings:

[source,js]
--------------------------------------------------
DELETE /my_store <1>

PUT /my_store <2>
{
    "mappings" : {
        "products" : {
            "properties" : {
                "productID" : {
                    "type" : "string",
                    "index" : "not_analyzed" <3>
                }
            }
        }
    }
}
--------------------------------------------------
// SENSE: 080_Structured_Search/05_Term_text.json
<1> Deleting the index first is required, since we cannot change mappings that
already exist.
<2> With the index deleted, we can re-create it with our custom mapping.
<3> Here we explicitly say that we don't want `productID` to be analyzed.

Now we can go ahead and reindex our documents:

[source,js]
--------------------------------------------------
POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }
--------------------------------------------------
// SENSE: 080_Structured_Search/05_Term_text.json

Only now will our `term` query work as expected. Let's try it again on the
newly indexed data (notice, the query and filter have not changed at all, just
how the data is mapped):

[source,js]
--------------------------------------------------
GET /my_store/products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "term" : {
                    "productID" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}
--------------------------------------------------
// SENSE: 080_Structured_Search/05_Term_text.json

Since the `productID` field is not analyzed, and the `term` query performs no
analysis, the query finds the exact match and returns document 1 as a hit.
Success!

[[_internal_filter_operation]]
==== Internal Filter Operation

Internally, Elasticsearch is((("structured search", "finding exact values", "internal filter operations")))
((("filters", "internal filter operation"))) performing several operations when executing a
non-scoring query:

1. _Find matching docs_.
+
The `term` query looks up the term `XHDK-A-1293-#fJ3` in the inverted index
and retrieves the list of documents that contain that term. In this case,
only document 1 has the term we are looking for.

2. _Build a bitset_.
+
The filter then builds a _bitset_--an array of 1s and 0s--that
describes which documents contain the term. Matching documents receive a `1`
bit. In our example, the bitset would be `[1,0,0,0]`. Internally, this is represented
as a https://www.elastic.co/blog/frame-of-reference-and-roaring-bitmaps["roaring bitmap"],
which can efficiently encode both sparse and dense sets.

3. _Iterate over the bitset(s)_
+
Once the bitsets are generated for each query, Elasticsearch iterates over the
bitsets to find the set of matching documents that satisfy all filtering criteria.
The order of execution is decided heuristically, but generally the most sparse
bitset is iterated on first (since it excludes the largest number of documents).

4. _Increment the usage counter_.
+
Elasticsearch can cache non-scoring queries for faster access, but it's silly to
cache something that is used only rarely. Non-scoring queries are already quite fast
due to the inverted index, so we only want to cache queries we _know_ will be used
again in the future to prevent resource wastage.
+
To do this, Elasticsearch tracks the history of query usage on a per-index basis.
If a query is used more than a few times in the last 256 queries, it is cached
in memory. And when the bitset is cached, caching is omitted on segments that have
fewer than 10,000 documents (or less than 3% of the total index size). These
small segments tend to disappear quickly anyway and it is a waste to associate a
cache with them.
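Steps 2 and 3 above can be sketched conceptually in Python (real Elasticsearch uses Roaring bitmaps and cost-based heuristics, not Python lists; the document sets here are made up for illustration):

[source,python]
--------------------------------------------------
def build_bitset(num_docs, matching_ids):
    """Step 2: one bit per document; matching documents get a 1."""
    return [1 if i in matching_ids else 0 for i in range(num_docs)]

def intersect(bitsets):
    """Step 3: iterate the sparsest bitset first, since it excludes
    the largest number of documents, then AND in the rest."""
    ordered = sorted(bitsets, key=sum)
    result = ordered[0]
    for b in ordered[1:]:
        result = [x & y for x, y in zip(result, b)]
    return result

term_bits  = build_bitset(4, {0})        # the term matched document 1 only
other_bits = build_bitset(4, {0, 2, 3})  # some other non-scoring clause
print(intersect([term_bits, other_bits]))
# [1, 0, 0, 0]
--------------------------------------------------

Because the `term` bitset is the sparsest, it is iterated first and most candidate documents are excluded before the other clause is even consulted.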


Although not quite true in reality (execution is a bit more complicated based on
how the query planner re-arranges things, and some heuristics based on query cost),
you can conceptually think of non-scoring queries as executing _before_ the scoring
queries. The job of non-scoring queries is to reduce the number of documents that
the more costly scoring queries need to evaluate, resulting in a faster search request.

By conceptually thinking of non-scoring queries as executing first, you'll be
equipped to write efficient and fast search requests.
