Searching with Elasticsearch
In this blog, we will cover the Elasticsearch basics and answer questions including:
- What is Elasticsearch?
- How does Elasticsearch work?
- What does it mean to be powered by Lucene?
- How does Elasticsearch store data?
- How does Elasticsearch “search”?
So what is Elasticsearch?
Elasticsearch is part of the Elastic Stack. It is used in combination with Kibana for visualization and with Logstash and Beats for ingestion. Elasticsearch is the “middle engine,” a real-time search and analytics engine that enables you to store, search, analyze, and explore your data.
- Elasticsearch has a distributed document (data) store.
- It provides real-time analytics.
- It is highly scalable.
- It provides a JSON-based REST API, so you can access its functionality from any web client of your choice or even from the command line.
- Because Elasticsearch is a distributed system, it also provides APIs to manage and monitor the system.
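To make the REST API point concrete, here is a minimal sketch that only builds two HTTP requests, one search and one cluster health check, without sending them. The node address `http://localhost:9200`, the index name `articles`, and the field name `title` are assumptions for illustration and do not come from this article.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a node running locally
INDEX = "articles"                  # hypothetical index name


def build_search_request(text):
    """Build a POST request for the _search endpoint with a match query."""
    body = {"query": {"match": {"title": text}}}  # "title" is a hypothetical field
    return urllib.request.Request(
        url=f"{BASE_URL}/{INDEX}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_health_request():
    """Build a GET request for the cluster health endpoint."""
    return urllib.request.Request(url=f"{BASE_URL}/_cluster/health", method="GET")


search_req = build_search_request("queryai")
health_req = build_health_request()
print(search_req.get_method(), search_req.full_url)
print(health_req.get_method(), health_req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it, which requires a reachable cluster. The same calls can be made from the command line with any HTTP client.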
How does Elasticsearch work?
Elasticsearch is powered by Apache Lucene, an open-source full-text search library written in Java, which it uses to index and search the data it stores.
How does Elasticsearch store data?
For full-text search, Elasticsearch uses a data structure called an inverted index. An inverted index maps each unique word found in the stored content to the documents that contain it. By default, everything stored in Elasticsearch contributes to the index.
Here is how an inverted index works:
|Document 1|QueryAI Decentralized Data Access & Analysis|
|Document 2|QueryAI helps you unlock the power of your data.|
To create the inverted index, we first split the text of each document into separate words, which are called tokens (or terms).
Once the tokens are determined, we apply filters to increase searchability:
- Removing stop words (very common English words such as “the” and “in”)
- Lowercasing (to make the search case-insensitive)
- Stemming (reducing words to their root form; “foxes” is converted to “fox”)
- Synonyms (“jumped” and “leap” are synonyms and are indexed as the single term “jump”)
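The filter steps above can be sketched in a few lines of Python. This is only an illustration, not how Lucene implements analysis: the stop-word list is a tiny assumed sample, the stemmer is a crude suffix stripper standing in for a real stemming algorithm, and the function names are hypothetical.

```python
import re

STOP_WORDS = {"the", "in", "of", "a", "an", "and"}  # assumed, tiny sample list
SYNONYMS = {"jumped": "jump", "leap": "jump"}        # synonym pairs from the text


def naive_stem(token):
    """Strip a plural suffix; real stemmers are far more careful."""
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, then lowercase, drop stop words, stem, and map synonyms."""
    # Lowercasing happens first here so that "The" matches the stop-word list.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [naive_stem(t) for t in tokens]
    return [SYNONYMS.get(t, t) for t in tokens]


print(analyze("The Foxes jumped in the field"))  # ['fox', 'jump', 'field']
```

Because the same `analyze` function would be applied to both documents and queries, a search for “Leap” and a document containing “jumped” end up with the same term.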
After applying the above rules, we get:
|Token|Present in Document 1|Present in Document 2|
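As an illustration of the resulting structure, the sketch below builds an inverted index for the two example documents. It assumes a small stop-word list and applies only lowercasing and stop-word removal (no stemming or synonyms), so the tokens it prints are an approximation and not necessarily the exact ones Elasticsearch would produce.

```python
import re
from collections import defaultdict

DOCS = {
    1: "QueryAI Decentralized Data Access & Analysis",
    2: "QueryAI helps you unlock the power of your data.",
}
STOP_WORDS = {"the", "of", "you", "your"}  # assumed stop-word list


def analyze(text):
    """Lowercase, split into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


index = build_inverted_index(DOCS)
for token in sorted(index):
    print(f"{token:15} {sorted(index[token])}")
```

Under these assumptions, “queryai” and “data” map to both documents, while tokens such as “access” and “power” map to only one.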
When a user searches, the same filters are applied to the search string. For example, a search for QueryAI is lowercased to queryai before matching.
Search: QueryAI
Result: Because “queryai” is present in both documents, the search returns both.
Search: queryai data power
Result: Because “queryai” is present in both documents, the search returns both. However, document 2 also matches “power,” while document 1 matches only “queryai” and “data,” so Elasticsearch gives document 2 a higher relevance score and ranks it first.
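The two searches above can be reproduced with a toy ranking function. This sketch scores each document by the number of distinct query terms it contains, which is a deliberate simplification: Elasticsearch's default relevance scoring is BM25, which also weighs term rarity and document length. The function names are hypothetical.

```python
import re
from collections import defaultdict

DOCS = {
    1: "QueryAI Decentralized Data Access & Analysis",
    2: "QueryAI helps you unlock the power of your data.",
}


def analyze(text):
    """Lowercase and split into word tokens (same step for docs and queries)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(query, docs):
    """Return (doc_id, matched_term_count) pairs, best match first."""
    index = build_inverted_index(docs)
    scores = defaultdict(int)
    for term in set(analyze(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


print(search("QueryAI", DOCS))             # [(1, 1), (2, 1)]
print(search("queryai data power", DOCS))  # [(2, 3), (1, 2)]
```

Document 2 comes first for the second query because it contains all three terms, whereas document 1 contains only two of them.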
Elasticsearch is a real-time search analytics engine, enabling you to store, search, analyze, and explore your data. In this blog, we covered how Elasticsearch uses Lucene and how it searches through data. In the next blog, we will discuss other components of the Elastic tech stack: Kibana and Logstash.
Common Elasticsearch terminology (used in the article)
- Document: Elasticsearch stores the entire JSON object after indexing. Each stored object is called a document.
- Index: For a user, an index is a place to store related documents.
- Inverted index: An index built from the unique words used in the stored documents.
- Full-text search: A search in which the engine examines all of the words in every stored document as it tries to match the search criteria.