Building a simple SQL interface for ElasticSearch
ElasticSearch provides the ability to not only store and retrieve text in an efficient manner but can be used as an effective analytics store. While the ElasticSearch query DSL is fantastic and provides all the capabilities for a developer to retrieve data, it can quickly become verbose and complicated.
Use Case
One of the most common use cases of ElasticSearch is log analysis and we ingested logs from multiple sources into an ElasticSearch cluster mainly for troubleshooting purposes. The log analysis started with typical ingestion pipeline where the logs are parsed, cleansed, enriched with additional metadata and indexed in ElasticSearch.
The end users of the log analysis application we wrote are technical support engineers who are comfortable in writing SQL to query to analyze log data. By letting the users write their queries in a SQL like manner, helped us to open up the majority of the querying capabilities of ElasticSearch without them learning a new DSL.
Data Model
Data is stored as a collection of documents and indices. For the purpose of this post, let’s assume the following mapping to the relational data model.
- Documents = Rows
- Index = Table
So that we can try to build an interface which can take a SQL query
select * from employee where age >= 30 limit 20
and translate it into
This can be executed using standard ElasticSearch high level or low level clients to fetch the results.
SQL Parsing
For parsing the SQL query, we can use any of the available SQL parser, in this case we used presto-parser . Using the parser you can get an AST tree for the select statement.
Query
QueryBody
QuerySpecification
Select
*
From
Table[employee]
Where
GREATER_THAN_OR_EQUAL
QualifiedName[age]
Long[30]
Limit: 20
Now that we have a tree, we can visit tree nodes and leverage the ElasticSearch QueryBuilder to build out the translation. The presto parser provides various tree traversal classes and for one such class is DefaultExpressionTraversalVisitor which can be extended to build and ExpressionVisitor for the where operations.For e.g the below code listing is a simplistic way of handling comparison operations in the where clause of the query.
Aggregations as Functions
In the previous section we saw how to handle a typical where clause. ElasticSearch comes with a battery of aggregations like,
- Terms
- Statistical Operations like min, max, median etc.
- Date Histograms for time series data
The approach we took in this case was to model these aggregations as SQL functions. For e.g a simple query to find the departments where the employees with age ≥ 30 are working can be found using
select terms(department) from employee where age >= 30 limit 0
Query Execution
The generated query is then executed using standard ElasticSearch Java clients. This helps in embedding this query layer into most of the existing Java applications and provide the application developers a familiar SQL like query capability for ElasticSearch.
Alternatives
- The SQL JDBC plugin from ElasticSearch is part of the x-pack which is not Apache licensed
- OpenDistro from Amazon is a different fork altogether.
- https://github.com/NLPchina/elasticsearch-sql