2013
S. Idreos, “Big Data Exploration,” in Big Data Computing, Taylor and Francis, 2013.
Abstract: We are now entering the era of data deluge, where the amount of data outgrows the capabilities of query processing technology. Many emerging applications, from social networks to scientific experiments, are representative examples of this deluge, as the rate at which they produce data exceeds any past experience. For example, scientific fields such as astronomy are soon expected to collect multiple terabytes of data on a daily basis, while web-based businesses such as social networks and web-log analysis are already confronted with a growing stream of large data inputs. There is therefore a clear need for efficient big data query processing to enable the evolution of businesses and sciences into the new era of data deluge.
In this chapter, we focus on a new direction of query processing for big data
where data exploration becomes a first-class citizen. Data exploration is
necessary when new big chunks of data arrive rapidly and we want to react
quickly, i.e., with little time to spare for tuning and set-up. In particular, our
discussion focuses on database systems technology, which for several
decades has been the predominant data processing tool.
In this chapter, we introduce the concept of data exploration and we discuss a series of early techniques from the database community toward building database systems tailored for big data exploration: adaptive indexing, adaptive loading, and sampling-based query processing. These directions focus on reconsidering fundamental assumptions and on designing next-generation database architectures for the big data era.
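To make one of these directions concrete, here is a minimal sketch of the sampling-based query processing idea the chapter surveys: answer an aggregate query from a small uniform sample and scale the result, trading accuracy for response time. The code is illustrative only; the function name and the 1% sampling rate are assumptions of this sketch, not the chapter's.

    import random

    def approximate_sum(data, sample_fraction=0.01, seed=42):
        """Estimate SUM(data) by scanning only a uniform random sample.

        The sampled sum is scaled by the inverse sampling rate;
        sample_fraction controls the accuracy/response-time trade-off.
        """
        rng = random.Random(seed)
        k = max(1, int(len(data) * sample_fraction))
        sample = rng.sample(data, k)
        return sum(sample) * (len(data) / k)

    data = list(range(1_000_000))     # exact sum: 499,999,500,000
    print(approximate_sum(data))      # a close estimate at ~1% of the scan cost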
S. Idreos and E. Liarou, “dbTouch: Analytics at your Fingertips,” in Proceedings of the 7th International Conference on Innovative Data Systems Research (CIDR), Asilomar, California, 2013.
Abstract: As we enter the era of data deluge, turning data into knowledge has become the major challenge across most sciences and businesses that deal with data. In addition, as we increase our ability to create data, more and more people are confronted with data management problems on a daily basis, in numerous aspects of everyday life. A fundamental need is data exploration through interactive tools, i.e., being able to quickly and effortlessly determine data and patterns of interest. However, modern database systems have not been designed with data exploration and usability in mind; they require users with expert knowledge and skills, and they react in a strict and monolithic way to every user request, resulting in correct answers but slow response times.
In this paper, we introduce the vision of a new generation of data management systems, called dbTouch;
our vision is to enable interactive and intuitive data exploration via database kernels
which are tailored for touch-based exploration.
No expert knowledge is needed.
Data is represented in a visual format, e.g., a column shape for an attribute or a fat rectangle shape for a table,
while users can touch those shapes and interact/query with gestures as opposed to firing complex SQL queries.
The system does not try to consume all data; instead it analyzes only parts of the data at a time,
continuously refining the answers and continuously reacting to user input.
Every single touch on a data object can be seen as a request to run an operator or a collection of operators
over part of the data. Users react to running results and continuously adjust the exploration: they determine the data to be processed next by adjusting the direction and speed of a gesture, i.e., a collection of touches; the database system no longer controls the data flow.
We discuss the various benefits that dbTouch systems bring for data analytics as well as the
new and unique challenges for database research in combination with touch interfaces.
In addition, we provide an initial architecture, implementation, and evaluation (and demo) of a dbTouch prototype over iOS for the iPad.
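As an illustration of this interaction model, consider the following toy sketch (the class and method names are invented for this example, not dbTouch's API): each touch processes only the slice of a column under the finger and refines a running answer, so the user, not the system, drives the data flow.

    # A toy model of the dbTouch interaction style: every touch event
    # selects a slice of a column, and the system refines a running
    # aggregate instead of consuming all data up front.

    class TouchSession:
        def __init__(self, column):
            self.column = column
            self.count = 0
            self.total = 0.0

        def touch(self, position, radius=1000):
            """One gesture touch: process only the data under the finger."""
            lo = max(0, position - radius)
            hi = min(len(self.column), position + radius)
            chunk = self.column[lo:hi]
            self.count += len(chunk)
            self.total += sum(chunk)
            return self.total / self.count   # continuously refined average

    session = TouchSession(list(range(100_000)))
    for pos in (5_000, 20_000, 90_000):      # a gesture = a sequence of touches
        print(session.touch(pos))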
E. Liarou, S. Idreos, S. Manegold, and M. L. Kersten, “Enhanced Stream Processing in a DBMS Kernel,” in Proceedings of the 16th International Conference on Extending Database Technology (EDBT), Genoa, Italy, 2013, pp. 501-512.
Abstract: Continuous query processing has emerged as a promising query processing paradigm with numerous applications.
A recent development is the need to handle both streaming queries and typical one-time queries
in the same application. For example, data warehousing can greatly benefit from the
integration of stream semantics, i.e., online analysis of incoming data and combination with existing data.
This is especially useful to provide low latency in data-intensive analysis
in big data warehouses that are augmented with new data on a daily basis.
However, state-of-the-art database technology cannot handle streams efficiently due to their “continuous” nature.
At the same time, state-of-the-art stream technology is purely focused on stream applications.
The research efforts are mostly geared towards the creation of specialized stream management systems
built with a different philosophy than a DBMS.
The drawback of this approach is that it offers limited opportunities to exploit successful past data-processing technology, e.g., query optimization techniques. For this new problem, we need to combine the best of both worlds.
Here we take a completely different route by designing a
stream engine on top of an existing relational database kernel.
This includes reuse of both its storage/execution engine and its optimizer
infrastructure. The major challenge then becomes the efficient support
for specialized stream features.
This paper focuses on incremental window-based processing, arguably the most crucial
stream-specific requirement.
In order to maintain and reuse the generic storage and execution model of the DBMS, we elevate the problem to the query-plan level. Proper optimizer rules, scheduling, and intermediate-result caching and reuse allow us to modify DBMS query plans for efficient incremental processing.
We describe the new approach in detail and demonstrate efficient performance even against specialized stream engines, especially when scalability becomes a crucial factor.
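For intuition, the following is a minimal sketch of incremental window-based processing in the generic, pane-based formulation: keep one partial aggregate per non-overlapping slice (a "pane") and combine panes per window, so overlapping windows reuse work instead of recomputing from raw tuples. It illustrates the general technique, not DataCell's actual query-plan rewrites.

    from collections import deque

    def sliding_sums(stream, window=4, slide=2):
        """Incremental sliding-window SUM via per-pane partial aggregates."""
        panes = deque()           # partial sums, one per slide-sized pane
        pane, filled = 0, 0
        for value in stream:
            pane += value
            filled += 1
            if filled == slide:              # pane complete
                panes.append(pane)
                pane, filled = 0, 0
                if len(panes) * slide >= window:
                    yield sum(panes)         # combine panes, not raw tuples
                    panes.popleft()          # expire the oldest pane

    print(list(sliding_sums(range(10), window=4, slide=2)))
    # [0+1+2+3, 2+3+4+5, ...] -> [6, 14, 22, 30]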
D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos, and S. Madden, “The Design and Implementation of Modern Column-Oriented Database Systems,” Foundations and Trends in Databases, vol. 5, no. 3, pp. 197-280, 2013.
Abstract: In this article, we survey recent research on column-oriented database systems, or column-stores, where each attribute of a table is stored in a separate file or region on storage. Such databases have seen a resurgence in recent years with a rise in interest in analytic queries that perform scans and aggregates over large portions of a few columns of a table. The main advantage of a column-store is that it can access just the columns needed to answer such queries. We specifically focus on three influential research prototypes, MonetDB, VectorWise, and C-Store. These systems have formed the basis for several well-known commercial column-store implementations. We describe their similarities and differences and discuss their specific architectural features for compression, late materialization, join processing, vectorization and adaptive indexing (database cracking).
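A compact sketch of the execution style the survey covers (illustrative code, not drawn from any of the three systems): each attribute is stored in its own array, a selection over one column produces row positions, and other columns are touched only for the qualifying positions, i.e., late materialization.

    # Each attribute lives in its own array; rows exist only implicitly.
    table = {
        "price":    [10, 55, 23, 70, 41],
        "quantity": [ 3,  1,  4,  2,  5],
    }

    # Scan only the 'price' column; produce positions, not whole rows.
    positions = [i for i, p in enumerate(table["price"]) if p > 30]

    # Late materialization: fetch 'quantity' just for qualifying positions.
    result = sum(table["quantity"][i] for i in positions)
    print(positions, result)   # [1, 3, 4] 8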
2012
S. Idreos, S. Manegold, and G. Graefe, “Adaptive indexing in modern database kernels,” in Proceedings of the 15th International Conference on Extending Database Technology (EDBT), Berlin, Germany, 2012, pp. 566-569.
Abstract: Physical design represents one of the hardest problems for database management systems. Without proper tuning, systems cannot achieve good performance. Offline indexing creates indexes a priori assuming good workload knowledge and idle time. More recently, online indexing monitors the workload trends and creates or drops indexes online. Adaptive indexing takes another step towards completely automating the tuning process of a database system, by enabling incremental and partial online indexing. The main idea is that physical design changes continuously, adaptively, partially, incrementally and on demand while processing queries as part of the execution operators. As such, it brings a plethora of opportunities for rethinking and improving every single corner of database system design.
We will analyze the indexing space between offline, online, and adaptive indexing through several state-of-the-art indexing techniques, e.g., what-if analysis and soft indexes. We will discuss in detail adaptive indexing techniques such as database cracking, adaptive merging, sideways cracking, and various hybrids that try to balance the online tuning overhead with the convergence speed to optimal performance. In addition, we will discuss how various aspects of modern database architectures, such as vectorization, bulk processing, and column-store execution and storage, affect adaptive indexing. Finally, we will discuss several open research topics towards fully autonomous database kernels.
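To ground the discussion, here is a didactic sketch of database cracking, the core adaptive-indexing technique: each range query physically partitions the column around its bounds as a side effect of answering it, so later queries scan ever-smaller pieces. The class below is a simplification for illustration, not MonetDB's implementation; range bounds are treated as half-open.

    import bisect

    class CrackerColumn:
        def __init__(self, values):
            self.data = list(values)
            self.bounds = []      # sorted crack values
            self.positions = []   # start index of values >= each crack value

        def _crack(self, value):
            """Partition the piece containing `value` around it (a side effect)."""
            i = bisect.bisect_left(self.bounds, value)
            if i < len(self.bounds) and self.bounds[i] == value:
                return            # already cracked here
            lo = self.positions[i - 1] if i > 0 else 0
            hi = self.positions[i] if i < len(self.positions) else len(self.data)
            piece = self.data[lo:hi]
            smaller = [v for v in piece if v < value]
            self.data[lo:hi] = smaller + [v for v in piece if v >= value]
            self.bounds.insert(i, value)
            self.positions.insert(i, lo + len(smaller))

        def range_query(self, low, high):
            """Answer values in [low, high); cracking makes them contiguous."""
            self._crack(low)
            self._crack(high)
            lo = self.positions[self.bounds.index(low)]
            hi = self.positions[self.bounds.index(high)]
            return self.data[lo:hi]

    col = CrackerColumn([7, 2, 9, 4, 1, 8, 3, 6, 5])
    print(sorted(col.range_query(3, 7)))   # [3, 4, 5, 6]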
S. Idreos, “Cracking Big Data,” ERCIM News, Special Theme: Big Data, 2012.
S. Idreos, F. Groffen, N. Nes, S. Manegold, S. K. Mullender, and M. L. Kersten, “MonetDB: Two Decades of Research in Column-oriented Database Architectures,” IEEE Data Engineering Bulletin, vol. 35, no. 1, pp. 40-45, 2012.
Abstract: MonetDB is a state-of-the-art open-source column-store database management system targeting applications in need of analytics over large collections of data. MonetDB is actively used today in health care and telecommunications, as well as in scientific databases and data management research, accumulating on average more than 10,000 downloads on a monthly basis. This paper gives a brief overview of the MonetDB technology as it developed over the past two decades and the main research highlights which drive the current MonetDB design and form the basis for its future evolution.
E. Liarou, S. Idreos, S. Manegold, and M. L. Kersten, “MonetDB/DataCell: Online Analytics in a Streaming Column-Store,” Proceedings of the Very Large Databases Endowment (PVLDB), vol. 5, no. 12, pp. 1910-1913, 2012.
Abstract: In DataCell, we design streaming functionalities in a modern relational database kernel which targets big data analytics. This includes exploitation of both its storage/execution engine and its optimizer infrastructure. We investigate the opportunities and challenges that arise with such a direction and we show that it carries significant advantages for modern applications in need of online analytics, such as web logs, network monitoring, and scientific data management. The major challenge then becomes the efficient support for specialized stream features, e.g., multi-query processing and incremental window-based processing, as well as exploiting standard DBMS functionalities in a streaming environment, such as indexing.
This demo presents DataCell, an extension of the MonetDB open-source column-store for online analytics. The demo gives users the opportunity to experience the features of DataCell, such as processing both stream and persistent data and performing window-based processing. The demo provides a visual interface to monitor the critical system components, e.g., how query plans transform from typical DBMS query plans to online query plans, how data flows through the query plans as the streams evolve, how DataCell maintains intermediate results in columnar form to avoid repeated evaluation of the same stream portions, etc. The demo also provides the ability to interactively set the test scenarios and various DataCell knobs.
I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki, “NoDB: efficient query execution on raw data files,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, Scottsdale, Arizona, 2012, pp. 241-252.
Abstract: As data collections become larger and larger, data loading evolves into a major bottleneck. Many applications already avoid using database systems, e.g., scientific data analysis and social networks, due to the complexity and the increased data-to-query time. For such applications data collections keep growing fast, even on a daily basis, and we are already in the era of data deluge where we have much more data than we can move, store, let alone analyze.
Our contribution in this paper is the design and roadmap of a new paradigm in database systems, called NoDB, which does not require data loading while still maintaining the whole feature set of a modern database system. In particular, we show how to make raw data files a first-class citizen, fully integrated with the query engine. Through our design and lessons learned by implementing the NoDB philosophy over a modern DBMS, we discuss the fundamental limitations as well as the strong opportunities that such a research path brings. We identify performance bottlenecks specific to in situ processing, namely the repeated parsing and tokenizing overhead and the expensive data type conversion costs. To address these problems, we introduce an adaptive indexing mechanism that maintains positional information to provide efficient access to raw data files, together with a flexible caching structure.
Our implementation over PostgreSQL, called PostgresRaw, is able to avoid the loading cost completely, while matching the query performance of plain PostgreSQL and even outperforming it in many cases. We conclude that NoDB systems are feasible to design and implement over modern database architectures, bringing an unprecedented positive effect in usability and performance.
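A small sketch of the positional-map idea described above (simplified; the real system also caches converted values and adapts what it stores): the first query over a raw CSV records byte offsets of fields as a side effect, so later queries seek directly to a field instead of re-parsing and re-tokenizing the file.

    raw = "id,price\n1,10\n2,55\n3,23\n"
    positional_map = {}   # row number -> byte offset of the 'price' field

    def first_query(raw_bytes):
        """Scan the raw file once, building the positional map as a side effect."""
        total = 0
        offset = raw_bytes.index(b"\n") + 1          # skip the header line
        for row, line in enumerate(raw_bytes[offset:].splitlines(keepends=True)):
            comma = line.index(b",")
            positional_map[row] = offset + comma + 1  # offset of 'price'
            total += int(line[comma + 1:])
            offset += len(line)
        return total

    def later_query(raw_bytes, row):
        """Use the cached offset: no line splitting, no full parse."""
        start = positional_map[row]
        end = raw_bytes.index(b"\n", start)
        return int(raw_bytes[start:end])

    data = raw.encode()
    print(first_query(data))     # 88, and the map is now populated
    print(later_query(data, 1))  # 55, fetched via the positional map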
I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki, “NoDB in Action: Adaptive Query Processing on Raw Data,” Proceedings of the Very Large Databases Endowment (PVLDB), vol. 5, no. 12, pp. 1942-1945, 2012.
Abstract: As data collections become larger and larger, users are faced with increasing bottlenecks in their data analysis. More data means more time to prepare the data, to load it into the database, and to execute the desired queries. Many applications already avoid using traditional database systems, e.g., scientific data analysis and social networks, due to their complexity and the increased data-to-query time, i.e., the time between getting the data and retrieving its first useful results. For many applications data collections keep growing fast, even on a daily basis, and this data deluge will only increase in the future, when we will have much more data than we can move or store, let alone analyze.
In this demonstration, we will showcase a new philosophy for designing database systems called NoDB. NoDB aims at minimizing the data-to-query time, most prominently by removing the need to load data before launching queries. We will present our prototype implementation, PostgresRaw, built on top of PostgreSQL, which allows for efficient query execution over raw data files with zero initialization overhead. We will visually demonstrate how PostgresRaw incrementally and adaptively touches, parses, caches and indexes raw data files autonomously and exclusively as a side-effect of user queries.
F. Halim, S. Idreos, P. Karras, and R. H. C. Yap, “Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-Stores,” Proceedings of the Very Large Databases Endowment (PVLDB), vol. 5, no. 6, pp. 502-513, 2012.
Abstract: Modern business applications and scientific databases call for inherently dynamic data storage environments. Such environments are characterized by two challenging features: (a) they have little idle system time to devote to physical design; and (b) there is little, if any, a priori workload knowledge, while the query and data workload keeps changing dynamically. In such environments, traditional approaches to index building and maintenance cannot apply. Database cracking has been proposed as a solution that allows on-the-fly physical data reorganization, as a collateral effect of query processing. Cracking aims to continuously and automatically adapt indexes to the workload at hand, without human intervention. Indexes are built incrementally, adaptively, and on demand. Nevertheless, as we show, existing adaptive indexing methods fail to deliver workload robustness; they perform much better with random workloads than with others. This frailty derives from the inelasticity with which these approaches interpret each query as a hint on how data should be stored. Current cracking schemes blindly reorganize the data within each query's range, even if that results in successive expensive operations with minimal indexing benefit.
In this paper, we introduce stochastic cracking, a significantly more resilient approach to adaptive indexing. Stochastic cracking also uses each query as a hint on how to reorganize data, but not blindly so; it gains resilience and avoids performance bottlenecks by deliberately applying certain arbitrary choices in its decision-making. Thereby, we bring adaptive indexing forward to a mature formulation that confers the workload-robustness previous approaches lacked. Our extensive experimental study verifies that stochastic cracking maintains the desired properties of original database cracking while at the same time it performs well with diverse realistic workloads.
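The following standalone sketch conveys the intuition (the paper's concrete variants differ in detail): when a query cracks a piece at its range bound, stochastic cracking also cracks at a random pivot, so sequential or skewed workloads cannot leave large unindexed pieces behind.

    import random

    def crack_piece(piece, query_bound, rng=random.Random(7)):
        """Partition one piece around the query bound plus a random pivot."""
        pivots = sorted({query_bound, rng.choice(piece)})
        parts = [[] for _ in range(len(pivots) + 1)]
        for v in piece:
            i = sum(v >= p for p in pivots)     # which sub-piece v falls into
            parts[i].append(v)
        return parts                            # smaller pieces for future queries

    piece = [17, 3, 99, 42, 8, 61, 25, 70]
    for part in crack_piece(piece, query_bound=20):
        print(part)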
G. Graefe, F. Halim, S. Idreos, H. A. Kuno, and S. Manegold, “Concurrency Control for Adaptive Indexing,” Proceedings of the Very Large Databases Endowment (PVLDB), vol. 5, pp. 656-667, 2012.
Abstract: Adaptive indexing initializes and optimizes indexes incrementally, as a side effect of query processing. The goal is to achieve the benefits of indexes while hiding or minimizing the costs of index creation. However, index-optimizing side effects seem to turn read-only queries into update transactions that might, for example, create lock contention. This paper studies concurrency control in the context of adaptive indexing. We show that the design and implementation of adaptive indexing rigorously separates index structures from index contents; this relaxes the constraints and requirements during adaptive indexing compared to those of traditional index updates. Our design adapts to the fact that an adaptive index is refined continuously, and exploits any concurrency opportunities in a dynamic way. A detailed experimental analysis demonstrates that (a) adaptive indexing maintains its adaptive properties even when running concurrent queries, (b) adaptive indexing can exploit the opportunity for parallelism due to concurrent queries, (c) the number of concurrency conflicts and any concurrency administration overheads follow an adaptive behavior, decreasing as the workload evolves and adapting to the workload needs.
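A toy illustration of the paper's key observation, under an invented locking scheme (not the paper's actual latching protocol): adaptive reorganization changes only how data is arranged, never what queries return, so a query that cannot acquire a piece's short-lived latch can simply answer by scanning and skip the refinement.

    import threading

    class AdaptivePiece:
        def __init__(self, values):
            self.values = values
            self.latch = threading.Lock()

        def range_query(self, low, high):
            answer = [v for v in self.values if low <= v < high]
            # Index refinement is a best-effort side effect: if another
            # query holds the latch, skip it; correctness is unaffected.
            if self.latch.acquire(blocking=False):
                try:
                    self.values.sort()   # stand-in for an adaptive reorganization step
                finally:
                    self.latch.release()
            return answer

    piece = AdaptivePiece([9, 2, 7, 4, 1])
    print(piece.range_query(2, 8))   # [2, 7, 4]; the piece is sorted afterwards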