During exploratory statistical analysis, data scientists repeatedly compute statistics on data sets to infer knowledge. Moreover, statistics form the building blocks of core machine learning classification and filtering algorithms. Modern data systems, software libraries, and domain-specific tools provide support to compute statistics but lack a cohesive framework for storing, organizing, and reusing them. This creates a significant problem for exploratory statistical analysis as data grows: Despite existing overlap in exploratory workloads (which are repetitive in nature), statistics are always computed from scratch. This leads to repeated data movement and recomputation, hindering interactive data exploration.
We address this challenge in Data Canopy, where descriptive and dependence statistics are synthesized from a library of basic aggregates. These basic aggregates are stored within an in-memory data structure, and are reused for overlapping data parts and for various statistical measures. What this means for exploratory statistical analysis is that repeated requests to compute different statistics do not trigger a full pass over the data. We discuss in detail the basic design elements in Data Canopy, which address multiple challenges: (1) How to decompose statistics into basic aggregates for maximal reuse? (2) How to represent, store, maintain, and access these basic aggregates? (3) Under different scenarios, which basic aggregates to maintain? (4) How to tune Data Canopy in a hardware conscious way for maximum performance and how to maintain good performance as data grows and memory pressure increases?
We demonstrate experimentally that Data Canopy results in an average speed-up of at least 10× after just 100 exploratory queries when compared with state-of-the-art systems used for exploratory statistical analysis.