So what is a column-store?

January 09, 2014

If you have some connection to data management (business, research, etc.) there is a good chance you have heard the term "column-store".

In short, column-stores are one of the biggest movements in the history of data management systems. As we speak, all major database system vendors are rewriting or rethinking their core data management systems because of this movement and the by now well-accepted benefits that the new design brings to many aspects of data analytics. Many have already shipped new products or acquired an exciting startup (e.g., Vertica, VectorWise).

Together with Daniel Abadi from Yale, Peter Boncz from CWI, Stavros Harizopoulos from Amiato and Sam Madden from MIT, we recently wrote the first (100-page) survey on the design of modern column-store database systems. It is published by Foundations and Trends in Databases and can be found here. It contains discussion and examples of how data is stored and accessed, details about system architecture and, of course, the history of how we came to this point.

I want to mention just one of the main discussion points. What exactly is a column-store? We can probably call a column-store any system that stores data one column (attribute) at a time, as opposed to one row or record at a time as in traditional database systems. This allows for more selective reading from disk during query processing, but on its own it won't give all the performance benefits modern column-stores claim. So if we look at the examples from academia that shaped the early steps of the modern column-store movement (MonetDB, VectorWise, C-Store), we see that the main characteristic is a from-scratch design which includes several concepts that go well beyond the simple idea of storing data one column at a time.
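To make the basic storage distinction concrete, here is a minimal sketch in Python. The table, its attribute names and the two helper functions are made up for illustration, and plain Python lists only mimic the layouts; a real system would lay the data out on disk or in memory in far more careful ways.

```python
# Hypothetical three-attribute table (id, name, age), used only for illustration.
rows = [
    (1, "alice", 30),
    (2, "bob", 25),
    (3, "carol", 41),
]

# Row-store layout: all attributes of one record are kept together.
row_store = list(rows)

# Column-store layout: all values of one attribute are kept together.
column_store = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "age": [r[2] for r in rows],
}


def sum_age_row_store(table):
    # Every full record is visited, even though only one field is needed.
    return sum(record[2] for record in table)


def sum_age_column_store(table):
    # Only the "age" column is visited; "id" and "name" are never read.
    return sum(table["age"])


print(sum_age_row_store(row_store))
print(sum_age_column_store(column_store))
```

Both functions return the same answer; the difference is how much data each one has to touch to get there, which is the "more selective reading" mentioned above.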

These design concepts include late materialization, compression and working over compressed data, vectorization, virtual IDs, cache-conscious algorithms, etc. One great question is whether all those features are unique to column-stores or whether they can be applied to row-stores as well. The answer is that most of those features have been, one way or another, in the minds of database systems researchers for decades, and one can find individual seminal research publications spread over the past 2-3 decades about cache-conscious algorithms, vectorization, working over compressed data and of course about storing data one column at a time. Typically such research in the past was done by extending an existing row-store system. In contrast, modern column-stores were built from scratch, which meant that researchers could innovate even more, pushing these ideas to their extreme and coming up with new ones as well; essentially, there were no legacy restrictions and all those concepts could be made central in the new designs, resulting in brand new database architectures.
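Two of these concepts, working over compressed data and late materialization, can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not how any particular system implements them: run-length encoding stands in for compression in general, and all function and column names are hypothetical.

```python
def rle_encode(values):
    """Run-length encode a column into a list of (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def sum_rle(runs):
    # Working over compressed data: one multiplication per run,
    # without expanding the runs back into individual values.
    return sum(v * n for v, n in runs)


def select_positions(column, predicate):
    # Late materialization, step 1: evaluate the predicate on a single
    # column and keep only the qualifying positions.
    return [i for i, v in enumerate(column) if predicate(v)]


def fetch(column, positions):
    # Late materialization, step 2: read another column only at the
    # positions that survived the filter.
    return [column[i] for i in positions]


# Hypothetical columns of the same table.
quantity = [5, 5, 5, 7, 7, 2]
region = ["EU", "EU", "EU", "US", "US", "EU"]

runs = rle_encode(quantity)
total = sum_rle(runs)

positions = select_positions(quantity, lambda q: q > 5)
regions_of_large_orders = fetch(region, positions)
```

Here the sum is computed from three runs instead of six values, and the `region` column is read only at the two positions that pass the filter on `quantity`, so full records are never assembled for rows that do not qualify.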

So, overall, modern column-stores reflect the combined wisdom of the database systems community and they are not simply about storing data one column at a time. In fact, many modern column-stores support storing data in a hybrid form (groups of columns), while maintaining all other design points. In the end, the way data should be stored is purely workload dependent, and the main story behind modern column-stores is about rethinking how we design database systems for data analytics on modern hardware.

Now that major vendors such as IBM, Microsoft, Oracle, SAP and HP are embracing these ideas and pushing them even further, we are set for an even more exciting future in terms of database system design.