The SQL language is one of the oldest programming languages in existence. The first ANSI SQL standard dates from 1986. You would think that almost 40 years would be enough for all SQL vendors and implementors to converge on a well-defined syntax and semantics. In reality, the opposite has happened: each database vendor has a different SQL dialect, new ones appear every day, and vendors keep adding new extensions.
In fact, dialects differ along so many dimensions that even quantifying the compatibility between two of them is impractical today. For example, dialects differ in:
- the syntax of the statements (e.g., MySQL sometimes allows you to omit `GROUP BY`)
- the operators supported (e.g., `GROUPING SETS`, `ASOF`, `ROLLUP`, `PIVOT`, `QUALIFY`, `MATCH_RECOGNIZE`, `LATERAL`, `UNNEST`, etc. are not universally supported)
- the set of legal statements (e.g., MySQL allows aliases defined in `SELECT` to be used in `GROUP BY`, but other databases do not)
- the semantics of queries (in different dialects the same query can produce different results; e.g., how `ORDER BY` compares `NULL` values)
- the type system (e.g., in Postgres `DECIMAL` types can have unlimited precision, and support values such as `Infinity`)
- the semantics of primitive operations (e.g., in SQLite `+x` is a no-op, whereas in MySQL it is a cast of `x` to `INTEGER`)
- the set of scalar functions supported and their behaviors (e.g., `SUBSTR` does something different in MySQL, Oracle, BigQuery, and Postgres)
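Some of these semantic differences are easy to observe directly. As a minimal illustration using Python's built-in `sqlite3` module: SQLite treats unary `+` as a no-op that preserves the operand's type, and an ascending `ORDER BY` in SQLite sorts `NULL` values first (Postgres, by default, sorts them last).

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# In SQLite, unary + is a no-op: the string operand keeps its type.
# (In MySQL, unary + instead casts its operand.)
print(cur.execute("SELECT typeof(+'5')").fetchone()[0])   # text

# In an ascending ORDER BY, SQLite sorts NULLs before all other values.
# (SQLite names the columns of a VALUES subquery column1, column2, ...)
rows = cur.execute(
    "SELECT column1 FROM (VALUES (2), (NULL), (1)) ORDER BY column1"
).fetchall()
print([r[0] for r in rows])   # [None, 1, 2]
```

The same two queries, run against another engine through its own client library, would be one crude way to probe these compatibility gaps.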
And these are only the differences pertaining to the query language! We haven't even gotten to the query engine yet.
## Which dialect is the right one?
Let's pretend you are a system builder embarking on the foolish task of building a new database or query engine. Which dialect should you support? This is a tough question, and the right answer will depend on the market segment you are targeting.
But there is a big choice you have to start with: do you fully support an existing dialect or do you build a new dialect from scratch? This is a choice that is very difficult to reverse once you’ve made it.
tl;dr: choosing to support an existing, well-established dialect has some clear advantages; building a new dialect is fraught with peril.
| Factor | Support an Existing Dialect | Build a New Dialect |
|---|---|---|
| Adoption | Easy for existing users to adopt | Everyone must rewrite or port code |
| Reference | You have docs and a reference system | You must write your own spec and ensure it's coherent |
| Starting point | Can reuse existing OSS components (e.g., parser/optimizer) | Must build everything from scratch |
| Scope | Bounded by existing dialect's features, even if a moving target | Open-ended; users will ask for everything |
| Design risk | Safer: follow proven semantics | Easy to introduce confusing or unsound features |
| User expectations | Clear expectations from familiarity | Uncertainty and learning curve |
## Feldera's choice
At Feldera, we chose to build our compiler infrastructure on top of Apache Calcite. Calcite was founded as an Apache project in 2016, but has been in existence significantly longer.
Using Calcite allowed us to remain compliant with standard SQL while keeping enough flexibility to adapt to market needs. It also allowed us to support users migrating SQL queries from their existing databases and warehouses, such as Postgres, SQLite, Databricks, and Snowflake.
## Calcite is flexible
Calcite is a very flexible framework for building SQL query parsers and optimizers. It has a configurable and extensible parser, a set of validation and type-checking tools, two flexible and extensible optimizers, a large set of SQL code generators designed for source-to-source translation, and a large library of scalar functions culled from multiple dialects. This flexibility means many, many knobs for configuring the behavior of the tools. In theory this is great, but in practice it makes the testing matrix humongous, and most of the features that are well tested are part of the "default" configuration.
There are various configuration levels:
- flags to configure the behavior of the parser (`SqlParser.Config`), such as whether the dialect is case-sensitive or not
- knobs such as the conformance levels (`SqlConformance`), which are designed to emulate coarse-grain behaviors of other dialects, e.g., whether `GROUP BY 1` indicates a column number or a constant
- knobs to control how queries are rewritten when converted from an abstract syntax tree to a relational representation (`SqlToRelConverter.Config`)
- knobs to control how queries are normalized during the validation process (`SqlValidator.Config`)
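To make the `GROUP BY 1` conformance question concrete: some dialects interpret a constant integer in `GROUP BY` as an ordinal referring to an output column, while others read it as grouping by a constant expression. A quick check of SQLite's behavior, using Python's built-in `sqlite3` module (the data is made up for illustration):

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# SQLite reads GROUP BY 1 as "group by the first output column",
# so the three rows collapse into one group per distinct value.
# (SQLite names the columns of a VALUES subquery column1, column2, ...)
rows = cur.execute(
    "SELECT column1, COUNT(*) FROM (VALUES ('x'), ('x'), ('y')) "
    "GROUP BY 1 ORDER BY 1"
).fetchall()
print(rows)   # [('x', 2), ('y', 1)]
```

Under the constant interpretation, the same query would instead collapse all rows into a single group. A conformance knob has to pick one of these behaviors for every such corner of the language.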
There is also a large set of dialect classes: `SqlDialect` and its many subclasses (e.g., `SqliteSqlDialect`). The dialects in Calcite are *not* used to control the set of programs accepted by the system. Calcite is often used as a source-to-source translator; the dialects specify how a query should be rendered when SQL code is generated (Calcite can generate code for federated queries, where parts of a query execute on different systems).
There are also much more powerful configuration knobs available. For example:
- the entire type system is provided as an interface, and can be reconfigured entirely or piece by piece. There are two default type systems: `RelDataTypeSystem`, relatively high-level, for generic SQL programs, and one based on Java (`JavaTypeFactory`), used when Calcite generates executable code that runs on the JVM. Most of Calcite's testing is done by compiling SQL into Java, compiling the Java programs using the Janino Java compiler, and executing the results. Note that Calcite implements a static typing discipline for SQL, but some SQL dialects are dynamically typed (e.g., in SQLite the type of the result of `+` depends on the values of the arguments)
- there is a very large set of optimization passes (`RelOptRule`), which are very modular; all such optimizations implement a common interface. Anyone can write new passes and contribute them to Calcite, and the optimizer will pick them up. There is also a Volcano-style cost-based optimizer.
Also, the project being open-source, it provides the ultimate configuration capability: if something does not exist, it can be implemented and contributed to the project. The only constraints stem from keeping backwards compatibility with the existing implementation.
Unfortunately, this very general design is also a significant barrier to entry: one has to understand the design of the various components in order to contribute, and the documentation is still very scarce. Even after almost 3 years of usage, I can hardly claim to have a good understanding of all parts of Calcite.
## Did we make the right choice? So far, so good.
It is still early to make a definite pronouncement, but it looks like choosing to start from Calcite and evolve it to implement our own dialect was the right choice:
- Even though users have to port their programs, most users would have to do that no matter which dialect we chose.
- We have implemented and contributed to Calcite many SQL features that our users have asked for, such as `ASOF` joins, `VARIANT` support, `UUID`s, `UNSIGNED` numbers, and more.
- Whenever we find a bug in Calcite, we can contribute a fix to the project; the review cycle can be very fast, especially for shallow bugs, so we can fix problems we find within a day or two, leading to a very fast release cycle.
- The flexibility of Calcite made it possible for us to design various "non-standard" features, such as various streaming extensions or recursive queries. (We didn't contribute such changes back to Calcite, but they are open-source in our repository.)
- Since Feldera is an incremental compute engine which needs to connect with many other systems through adapters, we have borrowed syntax inspired from other systems to describe such connectors inline in SQL programs.
In summary, the flexibility of Calcite has enabled us to be very responsive to customer requests, and to quickly roll out new capabilities, even when they weren't supported in Calcite itself.
So far we have opted *not* to expose any configuration knobs to our users; testing is hard enough for a single SQL dialect; it becomes much harder when users can tweak its behavior. However, we realize that every choice we make today will become a fixed behavior tomorrow.
Is there any feature that you always wanted in your incremental compute engine? Maybe we can support it for you. Just ask!