Here are my Data Files. Here are my Queries. Where are my Results?

Publication information:

S. Idreos, I. Alagiannis, R. Johnson, and A. Ailamaki,
“Here are my Data Files. Here are my Queries. Where are my Results?”, in Proceedings of the 5th International Conference on Innovative Data Systems Research (CIDR), Asilomar, California, 2011, pp. 57–68.

Abstract

Database management systems (DBMS) provide incredible flexibility and performance when it comes to query processing, scalability and accuracy. To fully exploit DBMS features, however, the user must define a schema, load the data, tune the system for the expected workload, and answer several questions. Should the database use a column-store, a row-store or some hybrid format? What indices should be created? All these questions make for a formidable and time-consuming hurdle, often deterring new applications or imposing high cost on existing ones. A characteristic example is that of scientific databases with huge data sets. The prohibitive initialization cost and complexity still force scientists to rely on “ancient” tools for their data management tasks, delaying scientific understanding and progress. Users and applications collect their data in flat files, which have traditionally been considered to be “outside” a DBMS. A DBMS wants control: always bring all data “inside”, replicate it and format it in its own “secret” way. The problem has been recognized and current efforts extend existing systems with abilities such as reading information from flat files and gracefully incorporating it into the processing engine. This paper proposes a new generation of systems where the only requirement from the user is a link to the raw data files. Queries can then immediately be fired without preparation steps in between. Internally and in an abstract way, the system takes care of selectively, adaptively and incrementally providing the proper environment given the queries at hand. Only part of the data is loaded at any given time and it is stored and accessed in the format suitable for the current workload.
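
The idea of linking a raw file and loading data only as queries demand it can be illustrated with a minimal Python sketch. This is not the paper's implementation: the class `LazyRawTable`, its method `select_sum`, and the `demo` helper are hypothetical names, and the sketch assumes a CSV file with a header row and numeric columns. It only shows the general pattern, in which the user supplies a path, nothing is read up front, and each query pulls in just the columns it touches and keeps them in a column-wise cache for later queries.

```python
import csv
import os
import tempfile


class LazyRawTable:
    """Hypothetical wrapper that holds only a file path until a query needs data."""

    def __init__(self, path):
        self.path = path
        self.columns = {}    # column name -> list of values loaded so far
        self.file_scans = 0  # how many times the raw file has been read

    def _load(self, names):
        # Load only the columns that no earlier query has brought in.
        missing = [n for n in names if n not in self.columns]
        if not missing:
            return
        self.file_scans += 1
        loaded = {n: [] for n in missing}
        with open(self.path, newline="") as f:
            for row in csv.DictReader(f):
                for n in missing:
                    loaded[n].append(float(row[n]))
        self.columns.update(loaded)

    def select_sum(self, target, where_col, low, high):
        # Roughly: SELECT sum(target) WHERE low <= where_col < high
        self._load([target, where_col])
        t = self.columns[target]
        w = self.columns[where_col]
        return sum(tv for tv, wv in zip(t, w) if low <= wv < high)


def demo():
    # Create a small raw file to stand in for the user's data.
    with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as f:
        f.write("a,b,c\n1,10,100\n2,20,200\n3,30,300\n")
        path = f.name
    try:
        table = LazyRawTable(path)                   # only a link, no loading yet
        first = table.select_sum("a", "b", 10, 30)   # scans the file for a and b
        second = table.select_sum("a", "b", 20, 40)  # served from the cache
        third = table.select_sum("c", "a", 1, 3)     # scans the file for c only
        return first, second, third, table.file_scans, sorted(table.columns)
    finally:
        os.remove(path)
```

In this sketch the second query causes no file access because both of its columns are already cached, and the third query reads only the one column that is still missing. A real system would also have to handle parsing costs, types, memory limits and storage layout choices, none of which are modeled here.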