[Elasticsearch] A detailed explanation of the Elasticsearch inverted index

2024-07-08

📑 Introduction

Elasticsearch is a distributed search engine built on Lucene that is widely used for full-text search, log analysis, and real-time data analysis. Its core strength is fast search, and one of the foundations of that performance is the inverted index. This article describes the inverted index in Elasticsearch in detail, covering its principle, structure, and application.

1. Introduction to the inverted index

The inverted index is the core data structure of a full-text search engine. It extracts keywords from documents and maps each keyword to the documents that contain it. This is the reverse of a traditional forward index, which maps each document to the keywords it contains, hence the name.
In an inverted index, each keyword is associated with a list of the documents that contain it, so a search can locate matching documents directly instead of scanning every document, which greatly improves query efficiency.
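
The relationship between the two kinds of index can be illustrated with a minimal Python sketch. The toy data and the `invert` helper are made up for illustration; this is not how Elasticsearch or Lucene stores its index internally.

```python
from collections import defaultdict

# Forward index: document ID -> keywords in that document (hypothetical toy data).
forward_index = {
    1: ["elasticsearch", "search", "engine"],
    2: ["elasticsearch", "inverted", "index"],
    3: ["search", "engine", "index"],
}

def invert(forward):
    """Turn a document -> keywords mapping into a keyword -> document IDs mapping."""
    inverted = defaultdict(set)
    for doc_id, keywords in forward.items():
        for keyword in keywords:
            inverted[keyword].add(doc_id)
    # Sort the IDs so each list is deterministic and easy to read.
    return {keyword: sorted(doc_ids) for keyword, doc_ids in inverted.items()}

inverted_index = invert(forward_index)

# A keyword lookup is now a single dictionary access instead of a scan over all documents.
print(inverted_index["search"])  # [1, 3]
```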

2. The basic structure of the inverted index

The basic structure of an inverted index consists of the following parts:

  1. Dictionary: Contains all keywords that appear in the document set.
  2. Inverted List (also called a postings list): For each keyword, records the IDs of the documents that contain the keyword, together with the keyword's positions within each document.

Let's take a simple example:
Suppose we have the following three documents:

  • Document 1: "Elasticsearch is a powerful search engine"
  • Document 2: "Elasticsearch uses inverted index"
  • Document 3: "Search engines use indexes"

The steps to build an inverted index are as follows:

  1. Tokenization: Split each document into words and normalize them (for example, convert to lowercase and remove stop words).
  2. Build a dictionary: Extract unique words from all documents.
  3. Create an inverted list: For each word, record the documents it appears in and its positions within each document.
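
The three steps above can be sketched in Python for the three example documents. This is a simplified illustration, not Elasticsearch's actual analyzer: the stop-word list (`STOP_WORDS`), the regular-expression tokenizer, and the function names are assumptions, and no stemming is applied, so "engine" and "engines" remain separate terms.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (an assumption, not Elasticsearch's default list).
STOP_WORDS = {"is", "a"}

documents = {
    1: "Elasticsearch is a powerful search engine",
    2: "Elasticsearch uses inverted index",
    3: "Search engines use indexes",
}

def tokenize(text):
    """Step 1: lowercase the text, split it into words, and drop stop words.

    Positions are 0-based word offsets counted before stop words are removed.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [(pos, word) for pos, word in enumerate(words) if word not in STOP_WORDS]

def build_index(docs):
    # Step 3: term -> {document ID -> list of positions}.
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in tokenize(text):
            postings[term].setdefault(doc_id, []).append(pos)
    # Step 2: the dictionary is the sorted set of unique terms.
    dictionary = sorted(postings)
    return dictionary, dict(postings)

dictionary, postings = build_index(documents)

for term in dictionary:
    print(term, postings[term])
```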

The results are as follows: