脏小强/elasticsearch-definitive-guide-cn

20_Making_text_searchable.asciidoc 3.50 KB
Looly committed on 2014-09-22 14:58 +08:00: First commit, finished 1.1 and 1.2

Making text searchable

The first challenge that had to be solved was how to make text searchable. Traditional databases store a single value per field, but this is insufficient for full text search. Every word in a text field needs to be searchable, which means that the database needs to be able to index multiple values (words, in this case) in a single field.

The data structure that best supports the multiple-values-per-field requirement is the inverted index, which we introduced in [inverted-index]. The inverted index contains a sorted list of all of the unique values or terms that occur in any document and, for each term, a list of all the documents that contain it.

Term  | Doc 1 | Doc 2 | Doc 3 | ...
------------------------------------
brown |   X   |       |  X    | ...
fox   |   X   |   X   |  X    | ...
quick |   X   |   X   |       | ...
the   |   X   |       |  X    | ...
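
As a rough illustration, the table above can be reproduced with a few lines of Python. The three sample documents and the whitespace tokenizer are assumptions made for this sketch; it is not how Elasticsearch analyzes or stores text internally.

```python
from collections import defaultdict

# Hypothetical sample documents, chosen so the result matches the table above.
DOCS = {
    1: "the quick brown fox",
    2: "quick fox",
    3: "the brown fox",
}

def build_inverted_index(docs):
    """Map each unique term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            postings[term].add(doc_id)
    # Sort the terms and each postings list, mirroring the sorted term list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(DOCS)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Looking up a term is then a single dictionary access, for example `index["quick"]`.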

When discussing inverted indices, we talk about indexing "documents" because, historically, an inverted index was used to index whole unstructured text documents. A "document" in Elasticsearch is a structured JSON document with fields and values. In reality, every indexed field in a JSON document has its own inverted index.
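
The idea of one inverted index per field might be sketched as follows. The field names `title` and `body`, the sample documents, and the helper `build_field_indices` are all hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical JSON-like documents; the field names are illustrative only.
DOCS = {
    1: {"title": "quick fox", "body": "the quick brown fox"},
    2: {"title": "brown dog", "body": "the lazy dog"},
}

def build_field_indices(docs):
    """Build a separate term -> document IDs mapping for every field."""
    indices = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in docs.items():
        for field, text in doc.items():
            for term in text.lower().split():  # naive tokenizer (assumption)
                indices[field][term].add(doc_id)
    return {
        field: {term: sorted(ids) for term, ids in sorted(terms.items())}
        for field, terms in indices.items()
    }

field_indices = build_field_indices(DOCS)
print(field_indices["title"].get("brown"))  # documents with "brown" in the title
print(field_indices["body"].get("brown"))   # documents with "brown" in the body
```

The same term can therefore point to different documents depending on which field's index is consulted.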

The inverted index may actually hold a lot more information than just the list of documents that contain a particular term. It may store a count of how many documents contain each term, how many times a term appears in a particular document, the order of terms in each document, the length of each document, the average length of all documents, and so on. These statistics allow Elasticsearch to determine which terms are more important than others, and which documents are more important than others, as described in [relevance-intro].
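
A minimal sketch of how some of these statistics could be gathered while indexing is shown below. The sample documents and the function `collect_statistics` are assumptions for illustration, term positions are omitted for brevity, and the sketch makes no claim about how Elasticsearch stores or uses these numbers.

```python
from collections import Counter

# Hypothetical sample documents.
DOCS = {
    1: "the quick brown fox",
    2: "quick quick fox",
    3: "the brown fox",
}

def collect_statistics(docs):
    """Collect per-term and per-document statistics for a small collection."""
    term_freqs = {}       # doc_id -> Counter of how often each term occurs in it
    doc_freq = Counter()  # term -> number of documents containing the term
    doc_length = {}       # doc_id -> number of terms in the document
    for doc_id, text in docs.items():
        terms = text.lower().split()  # naive tokenizer (assumption)
        term_freqs[doc_id] = Counter(terms)
        doc_length[doc_id] = len(terms)
        doc_freq.update(set(terms))   # count each term once per document
    avg_length = sum(doc_length.values()) / len(doc_length)
    return term_freqs, doc_freq, doc_length, avg_length

term_freqs, doc_freq, doc_length, avg_length = collect_statistics(DOCS)
print(doc_freq["fox"], term_freqs[2]["quick"], doc_length[1], avg_length)
```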

The important thing to realise is that the inverted index needs to know about all documents in the collection in order for it to function as intended.

In the early days of full text search, one big inverted index was built for the entire document collection and written to disk. As soon as the new index was ready, it replaced the old index and recent changes became searchable.

Immutability

The inverted index that is written to disk is immutable — it doesn’t change. Ever. This immutability has important benefits:

  • There is no need for locking. If you never have to update the index, you never have to worry about multiple processes trying to make changes at the same time.

  • Once the index has been read into the kernel’s file-system cache, it stays there because it never changes. As long as there is enough space in the file-system cache, most reads will come from memory instead of having to hit disk. This provides a big performance boost.

  • Any other caches (like the filter cache) remain valid for the life of the index. They don’t need to be rebuilt every time the data changes, because the data doesn’t change.

  • Writing a single large inverted index allows the data to be compressed, reducing costly disk I/O and the amount of RAM needed to cache the index.
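
The point about caches staying valid can be illustrated with a small sketch. Because the index below is read-only, cached query results never need to be invalidated. The index contents and the function `docs_matching_all` are hypothetical, and Python's `functools.lru_cache` merely stands in for a real filter cache.

```python
from functools import lru_cache
from types import MappingProxyType

# Hypothetical read-only index: term -> tuple of document IDs.
INDEX = MappingProxyType({
    "brown": (1, 3),
    "fox": (1, 2, 3),
    "quick": (1, 2),
    "the": (1, 3),
})

@lru_cache(maxsize=None)
def docs_matching_all(terms):
    """Return the IDs of documents containing every term in `terms` (a tuple)."""
    result = None
    for term in terms:
        ids = set(INDEX.get(term, ()))
        result = ids if result is None else result & ids
    return tuple(sorted(result or ()))

print(docs_matching_all(("quick", "fox")))
print(docs_matching_all(("quick", "fox")))  # repeated call is served from the cache
print(docs_matching_all.cache_info())
```

If the index could change in place, every cached result would have to be checked or discarded after each write.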

Of course, an immutable index has its downsides too, chiefly the fact that it is immutable! You can’t change it. If you want to make new documents searchable, you have to rebuild the entire index.

This places a significant limitation either on the amount of data that an index can contain, or the frequency with which the index can be updated.
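
The rebuild-and-replace cycle described above might look like the following sketch. The read-only mapping is only a stand-in for an immutable on-disk index, and the documents and the helper `build_index` are assumptions for illustration.

```python
from types import MappingProxyType

def build_index(docs):
    """Build a read-only term -> document IDs mapping from the whole collection."""
    postings = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):  # naive tokenizer (assumption)
            postings.setdefault(term, []).append(doc_id)
    return MappingProxyType({t: tuple(sorted(ids)) for t, ids in postings.items()})

docs = {1: "the quick brown fox", 2: "quick fox"}
index = build_index(docs)

# Modifying the index in place is rejected by the read-only view.
try:
    index["dog"] = (3,)
except TypeError:
    print("index is read-only")

# To make a new document searchable, rebuild from the full collection
# and replace the old index with the new one.
docs[3] = "the lazy dog"
index = build_index(docs)
print(index["dog"])
```

The cost of each rebuild grows with the size of the whole collection, not with the size of the change, which is the limitation noted above.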
