# Publications

## 2012
S. Idreos, S. Manegold, and G. Graefe, “Adaptive indexing in modern database kernels,” in Proceedings of the 15th International Conference on Extending Database Technology (EDBT), Berlin, Germany, 2012, pp. 566-569.
Physical design represents one of the hardest problems for database management systems. Without proper tuning, systems cannot achieve good performance. Offline indexing creates indexes a priori, assuming good workload knowledge and idle time. More recently, online indexing monitors workload trends and creates or drops indexes online. Adaptive indexing takes another step towards completely automating the tuning process of a database system by enabling incremental and partial online indexing. The main idea is that physical design changes continuously, adaptively, partially, incrementally and on demand while processing queries, as part of the execution operators. As such, it brings a plethora of opportunities for rethinking and improving every single corner of database system design. We will analyze the indexing space between offline, online and adaptive indexing through several state-of-the-art indexing techniques, e.g., what-if analysis and soft indexes. We will discuss in detail adaptive indexing techniques such as database cracking, adaptive merging, sideways cracking and various hybrids that try to balance the online tuning overhead with the convergence speed to optimal performance. In addition, we will discuss how various aspects of modern database architectures, such as vectorization, bulk processing, and column-store execution and storage, affect adaptive indexing. Finally, we will discuss several open research topics towards fully autonomous database kernels.
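The core idea of database cracking mentioned above can be sketched in a few lines: each range query partitions (“cracks”) the column around its selection bounds as a side effect of execution, so the column becomes incrementally more sorted in exactly the ranges that are queried. The sketch below is an illustrative toy, not the MonetDB implementation; the class and method names are invented for this example.

```python
# Toy sketch of database cracking: queries reorganize the column as a
# side effect, so later queries scan progressively smaller pieces.
import bisect


class CrackedColumn:
    def __init__(self, values):
        self.data = list(values)         # the column, reorganized in place
        self.cracks = [0, len(values)]   # piece boundaries (positions)
        self.pivots = []                 # pivot value at each interior boundary

    def _crack(self, pivot):
        """Partition the piece containing `pivot` so that values < pivot
        precede values >= pivot; return the position of the boundary."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.cracks[i + 1]    # already cracked on this pivot
        lo, hi = self.cracks[i], self.cracks[i + 1]
        piece = self.data[lo:hi]
        left = [v for v in piece if v < pivot]
        right = [v for v in piece if v >= pivot]
        self.data[lo:hi] = left + right
        mid = lo + len(left)
        self.pivots.insert(i, pivot)
        self.cracks.insert(i + 1, mid)
        return mid

    def range_select(self, low, high):
        """Return all values v with low <= v < high; cracking is a side effect."""
        a = self._crack(low)
        b = self._crack(high)
        return self.data[a:b]
```

The first query pays for partitioning the whole column, while later queries only touch the (ever smaller) pieces overlapping their bounds, which is the adaptive, incremental behavior the tutorial analyzes.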
S. Idreos, “Cracking Big Data,” ERCIM News, special theme: Big Data, 2012.
S. Idreos, F. Groffen, N. Nes, S. Manegold, S. K. Mullender, and M. L. Kersten, “MonetDB: Two Decades of Research in Column-oriented Database Architectures,” IEEE Data Engineering Bulletin, vol. 35, no. 1, pp. 40-45, 2012.
MonetDB is a state-of-the-art open-source column-store database management system targeting applications in need of analytics over large collections of data. MonetDB is actively used nowadays in health care and telecommunications, as well as in scientific databases and in data management research, accumulating on average more than 10,000 downloads on a monthly basis. This paper gives a brief overview of the MonetDB technology as it developed over the past two decades and the main research highlights which drive the current MonetDB design and form the basis for its future evolution.
E. Liarou, S. Idreos, S. Manegold, and M. L. Kersten, “MonetDB/DataCell: Online Analytics in a Streaming Column-Store,” Proceedings of the Very Large Databases Endowment (PVLDB), vol. 5, no. 12, pp. 1910-1913, 2012.
In DataCell, we design streaming functionalities in a modern relational database kernel which targets big data analytics. This includes exploitation of both its storage/execution engine and its optimizer infrastructure. We investigate the opportunities and challenges that arise with such a direction and we show that it carries significant advantages for modern applications in need of online analytics such as web logs, network monitoring and scientific data management. The major challenge then becomes the efficient support for specialized stream features, e.g., multi-query processing and incremental window-based processing, as well as exploiting standard DBMS functionalities in a streaming environment, such as indexing. This demo presents DataCell, an extension of the MonetDB open-source column-store for online analytics. The demo gives users the opportunity to experience the features of DataCell such as processing both stream and persistent data and performing window-based processing. The demo provides a visual interface to monitor the critical system components, e.g., how query plans transform from typical DBMS query plans to online query plans, how data flows through the query plans as the streams evolve, how DataCell maintains intermediate results in columnar form to avoid repeated evaluation of the same stream portions, etc. The demo also provides the ability to interactively set the test scenarios and various DataCell knobs.
I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki, “NoDB: efficient query execution on raw data files,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, Scottsdale, Arizona, 2012, pp. 241-252.
I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki, “NoDB in Action: Adaptive Query Processing on Raw Data,” Proceedings of the Very Large Databases Endowment (PVLDB), vol. 5, no. 12, pp. 1942-1945, 2012.
As data collections become larger and larger, users are faced with increasing bottlenecks in their data analysis. More data means more time to prepare the data, to load the data into the database and to execute the desired queries. Many applications already avoid using traditional database systems, e.g., scientific data analysis and social networks, due to their complexity and the increased data-to-query time, i.e., the time between getting the data and retrieving its first useful results. For many applications data collections keep growing fast, even on a daily basis, and this data deluge will only increase in the future, when we expect to have far more data than we can move or store, let alone analyze. In this demonstration, we will showcase a new philosophy for designing database systems called NoDB. NoDB aims at minimizing the data-to-query time, most prominently by removing the need to load data before launching queries. We will present our prototype implementation, PostgresRaw, built on top of PostgreSQL, which allows for efficient query execution over raw data files with zero initialization overhead. We will visually demonstrate how PostgresRaw incrementally and adaptively touches, parses, caches and indexes raw data files autonomously and exclusively as a side-effect of user queries.
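The adaptive-loading idea behind NoDB can be illustrated with a toy sketch (not the PostgresRaw implementation; the class and names below are invented for this example): raw data stays in its file, and columns are parsed and cached only as a side effect of the queries that touch them, so untouched columns cost nothing.

```python
# Toy sketch of NoDB-style querying over raw CSV data: no upfront load;
# a column is parsed on first use and cached for subsequent queries.
class RawFileTable:
    def __init__(self, text, columns):
        self.lines = text.strip().split("\n")  # raw rows, never bulk-loaded
        self.columns = columns                 # schema: column names in file order
        self.cache = {}                        # column name -> parsed values

    def column(self, name):
        # Parsing happens lazily, exclusively as a side effect of a query.
        if name not in self.cache:
            idx = self.columns.index(name)
            self.cache[name] = [float(line.split(",")[idx]) for line in self.lines]
        return self.cache[name]

    def select(self, name, predicate):
        return [v for v in self.column(name) if predicate(v)]
```

A first query on a column pays the parsing cost once; repeated queries hit the cache, which mirrors the incremental, query-driven touching and caching the demonstration shows.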
F. Halim, S. Idreos, P. Karras, and R. H. C. Yap, “Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores,” Proceedings of the Very Large Databases Endowment (PVLDB), vol. 5, no. 6, pp. 502-513, 2012.
G. Graefe, F. Halim, S. Idreos, H. A. Kuno, and S. Manegold, “Concurrency Control for Adaptive Indexing,” Proceedings of the Very Large Databases Endowment (PVLDB), vol. 5, pp. 656-667, 2012.

R. Barber, et al., “Business Analytics in (a) Blink,” IEEE Data Engineering Bulletin, vol. 35, no. 1, pp. 9-14, 2012.

## 2011
“Too Many Links in the Horizon; What is Next? Linked Views and Linked History,” in Proceedings of the 10th International Semantic Web Conference (ISWC), Bonn, Germany, 2011.

The trend for more online linked data is becoming stronger. Foreseeing a future where “everything” will be online and linked, we ask the critical question: what is next? We envision that managing, querying and storing large amounts of links and data is far from yet another query processing task. We highlight two distinct and promising research directions towards managing and making sense of linked data. We introduce linked views to help focus on specific link and data instances, and linked history to help observe how links and data change over time.

S. Idreos, I. Alagiannis, R. Johnson, and A. Ailamaki, “Here are my Data Files. Here are my Queries. Where are my Results?” in Proceedings of the 5th International Conference on Innovative Data Systems Research (CIDR), Asilomar, California, 2011, pp. 57-68.
Database management systems (DBMS) provide incredible flexibility and performance when it comes to query processing, scalability and accuracy. To fully exploit DBMS features, however, the user must define a schema, load the data, tune the system for the expected workload, and answer several questions. Should the database use a column-store, a row-store or some hybrid format? What indices should be created? All these questions make for a formidable and time-consuming hurdle, often deterring new applications or imposing high cost to existing ones. A characteristic example is that of scientific databases with huge data sets. The prohibitive initialization cost and complexity still forces scientists to rely on “ancient” tools for their data management tasks, delaying scientific understanding and progress. Users and applications collect their data in flat files, which have traditionally been considered to be “outside” a DBMS. A DBMS wants control: always bring all data “inside”, replicate it and format it in its own “secret” way. The problem has been recognized and current efforts extend existing systems with abilities such as reading information from flat files and gracefully incorporating it into the processing engine. This paper proposes a new generation of systems where the only requirement from the user is a link to the raw data files. Queries can then immediately be fired without preparation steps in between. Internally and in an abstract way, the system takes care of selectively, adaptively and incrementally providing the proper environment given the queries at hand. Only part of the data is loaded at any given time and it is being stored and accessed in the format suitable for the current workload.
P. Bonnet, et al., “Repeatability and workability evaluation of SIGMOD 2011,” SIGMOD Record, vol. 40, pp. 45-48, 2011.
SIGMOD has offered, since 2008, to verify the experiments published in the papers accepted at the conference. This year, we have been in charge of reproducing the experiments provided by the authors (repeatability), and exploring changes to experiment parameters (workability). In this paper, we assess the SIGMOD repeatability process in terms of participation, review process and results. While participation is stable in terms of the number of submissions, this year we find a sharp contrast between the high participation from Asian authors and the low participation from American authors. We also find that most experiments are distributed as Linux packages accompanied by instructions on how to set up and run the experiments. We are still far from the vision of executable papers.
M. L. Kersten, S. Idreos, S. Manegold, and E. Liarou, “The Researcher's Guide to the Data Deluge: Querying a Scientific Database in Just a Few Seconds,” Proceedings of the Very Large Databases Endowment (PVLDB), vol. 4, no. 12, pp. 1474-1477, 2011.
There is a clear need nowadays for extremely large data processing. This is especially true in the area of scientific data management, where soon we expect data inputs in the order of multiple petabytes. However, current data management technology is not suitable for such data sizes. In the light of such new database applications, we can rethink some of the strict requirements database systems adopted in the past. We argue that correctness is one such critical property, and that enforcing it is responsible for performance degradation. In this paper, we propose a new paradigm towards building database kernels that may produce wrong but fast, cheap and indicative results. Fast response times are an essential component of data analysis for exploratory applications; allowing for fast queries enables the user to develop a “feeling” for the data through a series of “painless” queries, which eventually leads to more detailed analysis in a targeted data area. We propose a research path where a database kernel autonomously and on-the-fly decides to reduce the processing requirements of a running query based on workload, hardware and environmental parameters. It requires a complete redesign of database operators and the query processing strategy. For example, typical and very common scenarios where query processing performance degrades significantly are cases where a database operator has to spill data to disk, is forced to perform random access, or has to follow long linked lists. Here we ask the question: what if we simply avoid these steps, “ignoring” the side-effect on the correctness of the result?
S. Idreos, S. Manegold, H. A. Kuno, and G. Graefe, “Merging What's Cracked, Cracking What's Merged: Adaptive Indexing in Main-Memory Column-Stores,” Proceedings of the Very Large Databases Endowment (PVLDB), vol. 4, pp. 585-597, 2011.
Adaptive indexing is characterized by the partial creation and refinement of the index as side effects of query execution. Dynamic or shifting workloads may benefit from preliminary index structures focused on the columns and specific key ranges actually queried --- without incurring the cost of full index construction. The costs and benefits of adaptive indexing techniques should therefore be compared in terms of initialization costs, the overhead imposed upon queries, and the rate at which the index converges to a state that is fully-refined for a particular workload component. Based on an examination of database cracking and adaptive merging, which are two techniques for adaptive indexing, we seek a hybrid technique that has a low initialization cost and also converges rapidly. We find the strengths and weaknesses of database cracking and adaptive merging complementary. One has a relatively high initialization cost but converges rapidly. The other has a low initialization cost but converges relatively slowly. We analyze the sources of their respective strengths and explore the space of hybrid techniques. We have designed and implemented a family of hybrid algorithms in the context of a column-store database system. Our experiments compare their behavior against database cracking and adaptive merging, as well as against both traditional full index lookup and scan of unordered data. We show that the new hybrids significantly improve over past methods while at least two of the hybrids come very close to the “ideal performance” in terms of both overhead per query and convergence to a final state.
R. Barber, et al., “Blink: Not Your Father's Database!” in Proceedings of the 5th International Workshop on Business Intelligence for the Real-Time Enterprises (BIRTE), 2011, pp. 1-22.

## 2010
S. Idreos, “Database Cracking: Towards Auto-tuning Database Kernels,” 2010.