Mixed language fields

Usually documents that mix multiple languages in a single field come from sources beyond your control, such as pages scraped from the web:

{ "body": "Page not found / Seite nicht gefunden / Page non trouvée" }

They are the most difficult type of multilingual document to handle correctly. While you can simply use the standard analyzer on all fields, your documents will be less searchable than if you had used an appropriate stemmer. But of course, you can’t choose just one stemmer — stemmers are language specific. Or rather, stemmers are language and script specific. As discussed in [different-scripts], if every language uses a different script, then stemmers can be combined.
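You can see the problem for yourself by running the same text through two different language analyzers with the _analyze API; each produces stems that only make sense for its own language. A quick sketch in the style of earlier chapters (the exact tokens returned depend on your Elasticsearch version):

GET /_analyze?analyzer=english
Page not found

GET /_analyze?analyzer=german
Seite nicht gefunden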

Assuming that your mix of languages uses the same script, such as Latin, you have three choices available to you:

Split into separate fields

The Compact Language Detector mentioned in [identifying-language] can tell you which parts of the document are in which language. You can split the text up based on language and use the same approach as was used in [one-lang-fields].
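For example, once the Compact Language Detector has identified the three snippets in the body field shown above, you could reindex each snippet into its own language-specific field, with each field mapped to the matching analyzer as in [one-lang-fields]. A minimal sketch (the index, type, and field names here are invented for illustration):

PUT /website/page/1
{
  "body_en": "Page not found",
  "body_de": "Seite nicht gefunden",
  "body_fr": "Page non trouvée"
}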

Analyze multiple times

If you primarily deal with a limited number of languages, then you could use multi-fields to analyze the text once per language:

PUT /movies
{
  "mappings": {
    "title": {
      "properties": {
        "title": { (1)
          "type": "string",
          "fields": {
            "de": { (2)
              "type":     "string",
              "analyzer": "german"
            },
            "en": { (2)
              "type":     "string",
              "analyzer": "english"
            },
            "fr": { (2)
              "type":     "string",
              "analyzer": "french"
            },
            "es": { (2)
              "type":     "string",
              "analyzer": "spanish"
            }
          }
        }
      }
    }
  }
}
  1. The main title field uses the standard analyzer.

  2. Each sub-field applies a different language analyzer to the text in the title field.
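With this mapping in place, a most_fields query across the sub-fields lets each language analyzer contribute to the score. A sketch of such a query (the query string is just an example):

GET /movies/movie/_search
{
    "query": {
        "multi_match": {
            "query":    "das boot",
            "fields": [ "title", "title.de", "title.en", "title.fr", "title.es" ],
            "type":     "most_fields"
        }
    }
}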

Use n-grams

You could just index all words as n-grams, using the same approach as described in [ngrams-compound-words]. Most inflections involve adding a suffix (or in some languages, a prefix) to a word, so by breaking each word down into n-grams, you have a good chance of matching words that are similar, but not exactly the same. This can be combined with the analyze-multiple-times approach to provide a catch-all field for unsupported languages:

PUT /movies
{
  "settings": {
    "analysis": {...} (1)
  },
  "mappings": {
    "title": {
      "properties": {
        "title": {
          "type": "string",
          "fields": {
            "de": {
              "type":     "string",
              "analyzer": "german"
            },
            "en": {
              "type":     "string",
              "analyzer": "english"
            },
            "fr": {
              "type":     "string",
              "analyzer": "french"
            },
            "es": {
              "type":     "string",
              "analyzer": "spanish"
            },
            "general": { (2)
              "type":     "string",
              "analyzer": "trigrams"
            }
          }
        }
      }
    }
  }
}
  1. In the analysis section, we define the same trigrams analyzer as described in [ngrams-compound-words].

  2. The title.general field uses the trigrams analyzer to index any language.
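The elided analysis section defines the same trigrams analyzer described in [ngrams-compound-words]. Roughly, it looks like the following (a sketch; see that chapter for the full settings and explanation):

"analysis": {
  "filter": {
    "trigrams_filter": {
      "type":     "ngram",
      "min_gram": 3,
      "max_gram": 3
    }
  },
  "analyzer": {
    "trigrams": {
      "type":      "custom",
      "tokenizer": "standard",
      "filter":    [ "lowercase", "trigrams_filter" ]
    }
  }
}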

When querying the catch-all general field, you can use minimum_should_match to reduce the number of low-quality matches. It may also be necessary to boost the other fields slightly more than the general field, so that matches on the main language fields are given more weight than those on the general field:

GET /movies/movie/_search
{
    "query": {
        "multi_match": {
            "query":    "club de la lucha",
            "fields": [ "title*^1.5", "title.general" ], (1)
            "type":     "most_fields",
            "minimum_should_match": "75%" (2)
        }
    }
}
  1. All title or title.* fields are given a slight boost over the title.general field.

  2. The minimum_should_match parameter reduces the number of low quality matches returned, especially important for the title.general field.
