The Researcher's Guide to the Data Deluge: Querying a Scientific Database in Just a Few Seconds

Publication information:

M. Kersten, S. Idreos, S. Manegold, and E. Liarou,
“The Researcher’s Guide to the Data Deluge: Querying a Scientific Database in Just a Few Seconds”, Proceedings of the Very Large Databases Endowment (PVLDB), vol. 4, no. 12, Art. no. 12, 2011.

Abstract

There is a clear need nowadays for extremely large data processing. This is especially true in the area of scientific data management, where we soon expect data inputs on the order of multiple petabytes. However, current data management technology is not suitable for such data sizes. In the light of such new database applications, we can rethink some of the strict requirements database systems adopted in the past. We argue that correctness is one such critical property, responsible for performance degradation. In this paper, we propose a new paradigm towards building database kernels that may produce wrong but fast, cheap and indicative results. Fast response times are an essential component of data analysis for exploratory applications; allowing for fast queries enables the user to develop a "feeling" for the data through a series of "painless" queries, which eventually leads to more detailed analysis in a targeted data area. We propose a research path where a database kernel autonomously and on-the-fly decides to reduce the processing requirements of a running query based on workload, hardware and environmental parameters. It requires a complete redesign of database operators and query processing strategy. For example, typical and very common scenarios where query processing performance degrades significantly are cases where a database operator has to spill data to disk, or is forced to perform random access, or has to follow long linked lists, etc. Here we ask the question: What if we simply avoid these steps, "ignoring" the side effect on the correctness of the result?
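
To make the closing question concrete, the following is a minimal illustrative sketch, not the authors' kernel design. It assumes a hypothetical group-by count operator, `bounded_group_count`, with a hypothetical `max_groups` memory budget. Where a conventional operator would spill overflow groups to disk, this one simply ignores rows that would need a new group once the budget is exhausted, and reports how many rows it skipped so the user can judge how indicative the result is.

```python
def bounded_group_count(rows, max_groups):
    """Count rows per key while holding at most max_groups keys in memory.

    Hypothetical illustration: a conventional operator would spill the
    overflow groups to disk; this sketch drops them instead and returns
    the number of ignored rows alongside the (possibly incomplete) counts.
    """
    counts = {}
    skipped = 0
    for key in rows:
        if key in counts:
            counts[key] += 1          # group already in memory: exact update
        elif len(counts) < max_groups:
            counts[key] = 1           # still within the memory budget
        else:
            skipped += 1              # would have forced a spill; ignored
    return counts, skipped


rows = ["a", "b", "a", "c", "b", "a", "d"]

# Tight budget: only the first two groups seen are tracked.
approx, skipped = bounded_group_count(rows, max_groups=2)

# Generous budget: the result is exact and nothing is skipped.
exact, none_skipped = bounded_group_count(rows, max_groups=10)
```

In this sketch the counts for the groups that fit in memory remain exact, while later groups are missing entirely; the `skipped` counter is one simple way to expose the size of the error. Which groups to keep, and how to report the loss, are design choices this example does not attempt to settle.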