Databases in 2021: A Year in Review
Updated: Apr 19
We all appreciate a well-documented and well-thought-out article that shows little in the way of vendor 'advertorial' and brings proper measures and math to defend a premise. Over this recent holiday season I came upon Andy Pavlo's post titled "Databases in 2021: A Year in Review" and felt it worth sharing and reposting (with some editorial license to shorten it).
I would like to add a few comments to the observations Mr. Pavlo so rightfully outlined:
The reason Postgres is taking over comes down to three things: (1) price, (2) license terms, and (3) innovation. Add to that the fact that the other major so-called open-source option, MySQL, is 100% under the control of Oracle.
The cloud vendors are posers on this topic: AWS with RDS, Azure with its mess of options, and GCP with its database offerings. While researching database benchmarking, I came across a YouTube side-by-side comparison of how the cloud vendors make their real money. Much like Mr. Pavlo's writing, the presentation was interesting, fact-based, and a tad entertaining.
--------------- The original post ---------------
It was a wild year for the database industry, with newcomers overtaking the old guard, vendors fighting over benchmark numbers, and eye-popping funding rounds. We also had to say goodbye to some of our database friends through acquisitions, bankruptcies, or retractions.
As the end of the year draws near, it’s worth reflecting and taking stock as we move into 2022. Here are some of the highlights and a few of my thoughts on what they might mean for the field of databases.
Dominance of PostgreSQL
The conventional wisdom among developers has shifted: PostgreSQL has become the first choice in new applications. It is reliable. It has many features and keeps adding more. In 2010, the PostgreSQL development team switched to a more aggressive release schedule to put out a new major version once per year (H/T Tomas Vondra). And of course PostgreSQL is open-source.
PostgreSQL compatibility is a distinguishing feature for a lot of systems now. Such compatibility is achieved by supporting PostgreSQL’s SQL dialect (DuckDB), wire protocol (QuestDB, HyPer), or the entire front-end (Amazon Aurora, YugaByte, Yellowbrick). The big players have jumped on board. Google announced in October that they added PostgreSQL compatibility in Cloud Spanner. Also in October, Amazon announced the Babelfish feature for converting SQL Server queries into Aurora PostgreSQL.
One measurement of the popularity of a database is the DB-Engines ranking. This ranking is not perfect and the score is somewhat subjective, but it is a reasonable approximation for the top 10 systems. As of December 2021, the ranking shows that while PostgreSQL remains the fourth most popular database (after Oracle, MySQL, and MSSQL), it narrowed the gap with MSSQL over the past year.
Another trend to consider is how often PostgreSQL is mentioned in online communities. This gives another signal for what people are talking about in databases. I downloaded all of the 2021 comments made on the Database Subreddit and counted the frequency of database names (using PostgreSQL, of course). I cross-referenced the list of every database that I know about from my Database of Databases, cleaned up abbreviations (e.g., Postgres → PostgreSQL, Mongo → MongoDB, ES → Elasticsearch), and then calculated the top 10 most-mentioned DBMSs.
dbms          | cnt
--------------+-----
PostgreSQL    | 656
MySQL         | 317
MongoDB       | 266
Oracle        | 222
SQLite        | 213
Redis         |  88
Elasticsearch |  70
Snowflake     |  52
DGraph        |  46
Neo4j         |  42
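The counting approach described above can be sketched in a few lines of Python. This is only an illustration of the method: the alias table and the sample comments below are assumptions for the example, not the actual Reddit dataset (which the author processed with PostgreSQL itself).

```python
import re
from collections import Counter

# Illustrative alias table: maps raw tokens to canonical DBMS names,
# mirroring the cleanup step (Postgres -> PostgreSQL, Mongo -> MongoDB, ...).
CANONICAL = {
    "postgres": "PostgreSQL", "postgresql": "PostgreSQL",
    "mysql": "MySQL",
    "mongo": "MongoDB", "mongodb": "MongoDB",
    "es": "Elasticsearch", "elasticsearch": "Elasticsearch",
    "sqlite": "SQLite",
    "redis": "Redis",
}

def count_mentions(comments):
    """Count how many comments mention each DBMS (at most once per comment)."""
    counts = Counter()
    for comment in comments:
        # Tokenize, lowercase, and normalize aliases; a set ensures a
        # comment that says "Postgres" three times counts only once.
        mentioned = {CANONICAL[t]
                     for t in re.findall(r"\w+", comment.lower())
                     if t in CANONICAL}
        counts.update(mentioned)
    return counts

# Hypothetical sample comments, standing in for the Subreddit dump.
comments = [
    "Use Postgres for this, not Mongo.",
    "PostgreSQL or MySQL are both fine.",
    "SQLite is enough for a prototype.",
]
print(count_mentions(comments).most_common())
```

Counting each system once per comment, rather than every occurrence, keeps a single enthusiastic commenter from skewing the tally.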
Of course this ranking is not scientific, since I am not doing sentiment analysis on the comments. But it clearly shows that people are mentioning Postgres more than other systems in the past year. There are often posts from developers asking what DBMS to use for their new application, and the response from the community is almost always Postgres.
Foremost, it is a good thing that a relational database system has become the first choice in greenfield applications. This shows the staying power of Ted Codd’s relational model from the 1970s. Second, PostgreSQL is a great database system. Yes, it has known issues and dark corners, as does every DBMS. But with so much attention and energy focused on it, PostgreSQL is only going to get better over the years.
Benchmark Violence
There was no love lost between database vendors over benchmark results this year. Vendors trying to show that their system is faster than their competitors’ goes back to the late 1980s. That is why the TPC was set up: to provide a non-partisan forum for officiating such comparisons. But as the influence and prevalence of the TPC has waned over the last decade, we now find ourselves in a new round of database benchmark wars.
There were three major street battles that heated up this year over benchmark results.
Databricks vs. Snowflake
Databricks announced that their new Photon SQL engine set a new world record in 100TB TPC-DS. Snowflake fired back, saying its database is 2x faster and that Databricks ran Snowflake incorrectly. Databricks countered, claiming their SQL engine provides superior execution and price performance over Snowflake.
Rockset vs. Apache Druid vs. ClickHouse
ClickHouse came out swinging, saying it nailed cost efficiency when compared to Druid and Rockset. But not so fast: Imply responded with tests on a newer version of Druid and claimed victory. Rockset joined in, saying its performance was better for real-time analytics than the other two.
ClickHouse vs. TimescaleDB
Smelling blood in the water, tiger-style Timescale joined the fray. They shot out their own benchmark results and took the opportunity to point out weaknesses in ClickHouse’s technology. The discussion of third-party benchmarks got heated on Hacker News.
Too much blood has been shed in the database community in previous benchmark turf wars. I fully admit that I used to be in the game. But I’ve lost too many friends in the streets. I even broke up with a girlfriend once because of sloppy benchmark results. As I’ve gotten older, I can now say that it’s not worth it. It’s even harder now to compare systems because cloud DBMSs have so many moving parts and tunable options that it is often difficult to ascertain the real reasons for performance differences. Real applications also do more than just run the same queries one after another. The user experience when ingesting, transforming, and cleaning data can matter as much as raw performance numbers. And as I told the reporter in this article about Databricks’ benchmark results, only old people care about official TPC numbers.
Big Data, Big Money
The number of venture rounds worth at least $100 million has been steadily increasing since the second half of 2020. There were 327 of these mega-deals in 2020 (just under half of total VC deal volume). And as of January 2021, there were over 100 venture-backed investment rounds worth $100 million or more.
A lot of that investment money was thrown at database companies in 2021. For operational databases, CockroachDB led the fundraising leaderboard by starting the year with a $160m round and then closing it out by raising another $278m in December 2021. Yugabyte got paid when they raised a $188m Series C round. PlanetScale pulled in a $20m Series B for their hosted version of Vitess. The comparatively older NoSQL stalwart DataStax, too, raised $37.6m in a venture round for their Cassandra business.
As impressive as these amounts are, the analytical database market is even more heated. TileDB raised an undisclosed amount in September 2021. Vectorized.io raised $15m for their Kafka-compatible streaming platform. StarTree came out of stealth and announced its $24m round to commercialize Apache Pinot. The matviews-on-steroids DBMS Materialize announced that they copped $60m for their Series C. Imply raised $70m for a database service based on Apache Druid. SingleStore raised $80m in September 2021, taking them one step closer to an IPO. At the beginning of the year, Starburst Data raised $100m for its Trino system (formerly PrestoSQL). Firebolt was another DBMS start-up to come out of stealth, announcing that it raised $127m for its new cloud data warehouse based on a fork of ClickHouse. A new company, ClickHouse, Inc., raised a staggering $250m to build a business around the system (as well as securing the rights to use the ClickHouse name from Yandex).
We are in the golden era of databases. There are so many excellent choices available today. Investors are searching for database start-ups that can become the next Snowflake-like IPO. These fundraising amounts are larger than those of previous database start-ups. For example, Snowflake did not have a round over $100m until its Series D, which was five years into the company’s life. Starburst completed a $100m round within less than three years of its founding. There are, of course, many factors involved in funding (e.g., the Starburst team was working on Presto at Teradata for years before spinning out), but I feel like more money is being thrown around these days.
Farewells
Regretfully, we said goodbye to some database friends in the past year.
ServiceNow acquired Swarm64
The company started off as an FPGA accelerator for running analytical workloads on PostgreSQL. They then switched to being a software-only accelerator for PostgreSQL using extensions. But they failed to gain traction, especially against the other well-funded cloud data warehouses. After the ServiceNow acquisition, there’s still no word on whether the Swarm64 product will continue.
Splice Machine went bust
Splice was pushing a hybrid (HTAP) DBMS that combined HBase for operational workloads and Spark SQL for analytics. They then pushed to provide a platform for operational/real-time ML applications. But an all-in-one hybrid system failed to make inroads in the database market due to the dominance of specialized OLTP and OLAP systems.
Private equity firms bought Cloudera
Since the world moved away from MapReduce and Hadoop technologies in the second half of the last decade, Cloudera has failed to have the same traction in the cloud data warehouse market. Most of the original engineering teams for Impala and Kudu have left the company, although the projects are still under development and putting out new releases. The stock has dropped to below its IPO price from 2018. It remains to be seen whether its new investors will be able to turn the company around.
It’s always sad to see another database project or company go under, but that is the bloodsport nature of the database industry. Being open-source potentially helps a DBMS outlive the company that created it, but not always. Due to their complexity, databases require full-time crews working on them to fix bugs and add new features. Moving source code rights and control of a defunct DBMS to an open-source software foundation, like the Apache Foundation or CNCF, does not mean that the project will be magically revived. For example, RethinkDB was donated to the Linux Foundation after the company went bust, but from all appearances on GitHub it is dead in the water (few commits, PRs not getting merged). Another example is DeepDB: the company failed and created its own non-profit foundation for the code, but nobody ever worked on it. I anticipate that more database companies will go under in the next year, unable to compete with the major cloud vendors and the well-funded start-ups listed above.