CS265: Big Data Systems

Semester: Fall
|
Year offered: 2014

Big data is everywhere. A fundamental goal across numerous modern businesses and sciences is to be able to exploit as many machines as possible, to consume as much information as possible and as fast as possible. The big challenge is "how to turn data into useful knowledge". This is far from a simple task and a moving target as both the underlying hardware and our ability to collect data evolve. In this class, we will discuss how to design data systems and algorithms that can "scale up" and "scale out". Scale up refers to the ability to use a single machine to all its potential, i.e., to exploit properly the memory hierarchy and the multiple CPU and GPU cores. Scale out refers to the ability to use more than 1 machines (typically 100s or 1000s) effectively. This is a research oriented class. Every week we will read two modern research papers; one from the scale up area and one from the scale out area. We will use examples from several areas, including relational systems and distributed databases, graph processing systems (i.e., for social networks), key value stores, noSQL and newSQL systems as well as mobile computing. Each student will work on a semester long data systems research project (in groups of 2-4 students) which can be in any of the above areas and will be based on an open research problem. 


Class Materials:

Classes

Here we will maintain the reading list and schedule for CS265 as well as guidelines for various admin issues. Always check this page for updates in the schedule or reading list.

SECTIONS AND OFFICE HOURS

Class meets for sections twice a week: Tuesdays 4-5:30pm and Thursdays 5:30-7pm. Sections are not mandatory; you should go to sections to meet with the TF and other students to discuss about your projects, technical problems as well as about readings. We will also hold sections where we will do lectures about specific topics; such sections will be announced upfront in class and Piazza.

Stratos holds office hours after class every Wednesday. 

You are welcome to arrange ad hoc meetings with Stratos or the TFs outside office hours and sections if you need more input. Just send an email at least one day before.

BACKGROUND READING 

If you are not familiar with the papers below, you are expected to read them within the first month of the class. We will schedule off-class meetings to discuss these papers in detail.

Architecture of a Database System. J. Hellerstein, M. Stonebraker and J. Hamilton. 

Foundations and Trends in Databases, 2007

The Design and Implementation of Modern Column-Oriented Database Systems. D. Abadi, P.  Boncz, S. Harizopoulos, S. Idreos and S. Madden. Foundations and Trends in Databases, 2013

Modern B-Tree Techniques. Goetz Graefe. Foundations and Trends in Databases, 2011

A Survey of Large-Scale Analytical Query Processing in MapReduce. Christos Doulkeridis and Kjetil Nørvag. Very Large Databases Journal, 23(3), 2014.

GUIDELINES FOR PRESENTATIONS AND DISCUSSIONS

In every class, one or two students are responsible to lead the discussion for one or more research papers. It is not necessary to prepare slides. Each group can choose between creating slides and using the projector or simply creating a set of notes with bullets to lead the discussion and sharing this trough Piazza right before class. 

The goal of each group is to lead a lively discussion in class. The notes or presentation should have enough material to lead discussion for at least the following topics:

(1) Summary of the paper: What is the problem it studies? Why is it an important problem? Why past/naive solutions do not work? What is the main intuition of the proposed solution? Why is it new?

(2) At least three strong points for the paper: Why this is a paper worth reading?

(3) At least three weak points: Anything you would do differently. Missing analysis; etc.

(4) Open topics: What ideas come to mind after reading the current paper(s)? Any open research problems? Can you think of a possible startup because of the new technology? Does it enable any new applications?

GUIDELINES FOR REVIEWS

For every class, each student will prepare a one page review of the primary paper(s) of the day. The review should cover the four points mentioned in the presentation guidelines section above, i.e., (1) summary, (2) strong points, (3) weak points and (4) open problems. You should send the review in PDF to the TF prior to each class. 

SCHEDULE

Here we will maintain a list of the readings per class.

In each class we will have one or two papers marked with (P) which means that these are the primary papers that you are expected to read in detail and one or more papers marked with (B) which means that these are papers that we suggest you start from if you need more background on the particular topic or if you would like to read more about similar research.

For all papers, a good plan is to follow a few of the papers cited in the readings. You should see our readings as a trigger to read about an active area of research. The papers we read were recently published in premier data systems venues but there is typically many more great papers, both recent and past, in similar areas. If you would like to read more about a specific area, just ask for more suggestions. 

In most classes we will have one primary paper. In some cases we will have two primary papers. The reason is that in these cases you will get a much better picture of the area by reading both papers while at the same time the papers have such a significant overlap that the added effort to read the second paper is reasonably small.

All links to the papers are accessible through the Harvard network at the time of creating this list. If you have trouble accessing any of the papers, please let us know. 

Class 1

Sep 3

Introduction to data systems & CS265

Presenter: Stratos

slides

Class 2

Sep 5 

DB Architectures and Column-stores basics

Presenter: Stratos

slides

Class 3

Sep 10         

(P) Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age.Viktor Leis, Peter A. Boncz, Alfons Kemper, Thomas Neumann.

ACM SIGMOD International Conference on Management of Data. 2014

Presenters: Yihe & Sam

(B) MonetDB/X100: Hyper-Pipelining Query ExecutionPeter A. Boncz, Marcin Zukowski, Niels Nes

Conference on Innovative Data Systems Research (CIDR), 2005

Class 4

Sep 12

(P) Megastore: Providing Scalable, Highly Available Storage for Interactive Services
Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson,Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh.

Conference on Innovative Data Systems Research (CIDR), 2011

(P) Spanner: Google’s Globally-Distributed Database.James C. Corbett, Jeffrey Dean, Michael Epstein, et all.

USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012

Presenters: Ore & Archie

(B) The State of the art in distributed query processingDonald Kossmann

ACM Computing Surveys, 2000

Class 5

Sep 18

Class will meet on Thursday Sep 18 at 5:30pm. There will be a section during the normal class time on Wednesday Sep 17.

 Invited lecture: Nikita Shamgunov from memsql 

Class 6

Sep 19

(P) H2O: A Hands-free Adaptive Store
Ioannis Alagiannis, Stratos Idreos and Anastassia Ailamaki

ACM SIGMOD International Conference on Data Management, 2014

Presenters: Stella & Wasay

(B) Efficiently Compiling Efficient Query Plans for Modern HardwareThomas Neumann

Proceedings of the Very Large Databases Endowment (PVLDB), 2011

Class 7

Sep 24

(P) SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database ArchitecturesAvrilia Floratou, Umar Farooq Minhas, Fatma Özcan 

Proceedings of the Very Large Databases Endowment (PVLDB), 2014

Presenter: Jenny & Alex

(B) HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical WorkloadsAzza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, Avi Silberschatz

Proceedings of the Very Large Databases Endowment (PVLDB), 2009

Class 8

Sep 26

(P) Indexing for interactive exploration of big data series.Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas.

ACM SIGMOD International Conference on Data Management, 2014

Presenters: Michael

(B) Buffering accesses to memory-resident index structures
J.Zhouand and K. Ross

International Conference on Very Large Databases (VLDB), 2003

Class 9

Oct 1

(P) SharedDB: Killing One Thousand Queries With One StoneGeorgios Giannikis, Gustavo Alonso and Donald Kossmann

Proceedings of the Very Large Databases Endowment (PVLDB), 2014

(P) Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS
Marcin Zukowski, Sándor Héman, Niels Nes, Peter A. Boncz

International Conference on Very Large Databases (VLDB), 2007

Presenter: Wilson & Kevin

(B)  Main-memory scan sharing for multi-core CPUsLin Qiao, Vijayshankar Raman, Frederick Reiss, Peter J. Haas, Guy M. Lohman

Proceedings of the Very Large Databases Endowment (PVLDB), 2008

Class 10

Oct 3

(P) Split query processing in polybase. David J. DeWitt, Alan Halverson, Rimma V. Nehme, Srinath Shankar, Josep Aguilar-Saborit,Artin Avanes, Miro Flasza, Jim Gramling

ACM SIGMOD International Conference on Data Management, 2013

Presenters: Tarik & Lukas

(B) A case for fractured mirrorsRavishankar Ramamurthy, David J. DeWitt, Qi Su

International Conference on Very Large Databases (VLDB), 2003

Class 11

Oct 8

(P) The adaptive radix tree: ARTful indexing for main-memory databases.Viktor Leis, Alfons Kemper, Thomas Neumann

International Conference on Data Engineering (ICDE), 2013

Presenters: Mike & Sierra

(B) A study of index structures for main memory database management systemsT. J. Lehman and M. J. Carey

International Conference on Very Large Databases (VLDB),1986


(B) Cache conscious indexing for decision-support in main memoryJ. Rao and K. RossInternational Conference on Very Large Databases (VLDB), 1999

Class 12

Oct 10

(P) The Researcher's Guide to the Data Deluge: Querying a Scientific Database in Just a Few SecondsMartin L. Kersten, Stratos Idreos, Stefan Manegold, Erietta Liarou

Proceedings of the Very Large Databases Endowment (PVLDB), 2011

(P) BlinkDB: queries with bounded errors and bounded response times on very large dataSameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica

EuroSys 2013

Presenters: Yifan, Rob

(B) SciBORQ: Scientific data management with Bounds On Runtime and QualityLefteris Sidirourgos, Martin L. Kersten, Peter A. Boncz

Conference on Innovative Data Systems Research (CIDR), 2011

Class 13

Oct 15

Project Presentations

Class 14

Oct 17

Project Presentations

Class 15

Oct 22

Reducing Database Locking Contention Through Multi-version ConcurrencyMohammad Sadoghi, Mustafa Canim, Bishwaranjan Bhattacharjee, Fabian Nagel, Kenneth A. Ross

Proceedings of the Very Large Databases Endowment (PVLDB), 2014

Presenters: Siera, Oyu, Lukas

Class 16

Oct 24

Shark: SQL and rich analytics at scale
Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica

ACM SIGMOD International Conference on Data Management, 2013

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster ComputingMatei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly,Michael J. Franklin, Scott Shenker, Ion Stoica

USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012

Presenters: Nithin, Neel, Archie

Class 17

Oct 29

Energy Analysis of Hardware and Software Range Partitioning
Lisa Wu, Orestis Polychroniou, Raymond J. Barker, Martha A. Kim, Kenneth A. Ross

ACM Trans. Comput. Syst. 32(3): 8 (2014)

Presenters: Ore, Sam

Class 18

Oct 31

Guest Lecture: C Mohan, IBM Research

Class 19

Nov 5

Track join: distributed joins with minimal network traffic
Orestis Polychroniou, Rajkumar Sen, Kenneth A. Ross

ACM SIGMOD International Conference on Data Management, 2014

Presenters: Kevin, Alex, Wilson

Class 20

Nov 7

Lazy evaluation of transactions in database systems
Jose M. Faleiro, Alexander Thomson, Daniel J. Abad

ACM SIGMOD International Conference on Data Management, 2014

Presenters: Stella, Mike, Jenny

Class 21

Nov 12

MLbase: A Distributed Machine Learning System
T. Kraska, A. Talwalkar, J.Duchi, R. Griffith, M. Franklin, M.I. Jordan

Conference on Innovative Data Systems Research , 2013

Map-Reduce for Machine Learning on Multicore
Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun

NIPS 2006

Presenters: Tarik, Wasay

Class 22

Nov 14

Guest Lecture by Alkis Simitsis and Georgia Koutrika from HP Labs.

Class 22

Nov 19

In-Memory Performance for Big DataGoetz Graefe, Haris Volos, Hideaki Kimura, Harumi A. Kuno, Joseph Tucek, Mark Lillibridge, Alistair C. Veitch

Proceedings of the Very Large Databases Endowment (PVLDB), 2014

 

Presenters: Yifan, Michael, Rob

Class 23

Nov 21

 

Syllabus

CS265 SYLLABUS (tentative: check back for updates)

Fall 2014

Welcome to CS265!

Professor: Stratos Idreos

url: http://stratos.seas.harvard.eduοffice: MD139

email: stratos@seas.harvard.edu

TF: Manos Athanassoulis

manos.athanassoulis@gmail.com

What is this class about?

Big data is everywhere. A fundamental goal across numerous modern businesses and sciences is to be able to exploit as many machines as possible, to consume as much information as possible and as fast as possible. The big challenge is "how to turn data into useful knowledge". This is far from a simple task and a moving target as both the underlying hardware and our ability to collect data evolve. In this class, we will discuss how to design data systems and algorithms that can "scale up" and "scale out". Scale up refers to the ability to use a single machine to all its potential, i.e., to exploit properly the memory hierarchy and the multiple CPU and GPU cores. Scale out refers to the ability to use more than 1 machines (typically 100s or 1000s) effectively. We will use examples from several areas, including relational systems and distributed databases, graph processing systems (i.e., for social networks), key value stores, noSQL and newSQL systems as well as mobile computing and interactive analytics (such as dbTouch). In a fast moving industry and research environment such skills are in high demand. 

Why take this class?

Data is everywhere. Every year we create even more data. As it stands, every two years we create as much data as much we created from the dawn of humanity up to 2003 [Eric Schmidt, Google]. Sciences, businesses and everyday life are severely affected. Data systems are in the middle of all this. Data systems is how we store and access data, i.e., they are the backbone for any data-driven application. It is a $100B industry, growing 10% every year [Economist, “Data, data everywhere”]. At the same time data systems research and the whole industry are going through a major and continuous transition; given that new data-driven scenarios and applications continuously pop up, there is a continuous need to redefine what is a good data system in such dynamic environments.  

This class will also be ideal for undergrads who ask the question “what is research?”. 

Expected learning outcomes

-Understanding the basic tradeoffs in designing modern big data systems.

-Being able to design a new big data system given a data-driven scenario.

-Develop basic research skills: reading, writing and understanding research papers.

Who can take this class? 

The class is accessible to students who have taken CS165 and to students who have a good background of modern data systems and designing algorithms for big data.  Contact the instructor if you are unsure about whether you will be able to follow the class or not.

Lectures

The class meets twice a week: Wednesdays and Fridays 1:00pm-2:30pm in MD323. Class starts at 1:10pm. We will not have traditional lectures. All classes will be based on a discussion format. Each week we will discuss 2 research papers. Each student will be tasked to lead the discussion for at least 1 paper through the semester. 

Office hours

Starting Week 1, Prof. Stratos Idreos will hold office hours every Wednesday 2:30pm-3:30pm (time slot can change if it is not convenient for everyone). Students are also welcome to schedule ad-hoc meetings with the instructor as often as needed. 

TF office hours will be announced soon.

Research and Brainstorming Sessions 

We will often schedule "brainstorming sessions". In such sessions, each group may work with the instructor independently on the details of their research project and also each group may get feedback on their progress from the rest of the groups. 

Required textbook

We will use recent research papers and surveys which will be posted on the class website and you will have access to them through the Harvard network. 

Research project

Each student will work on a research project throughout the semester. Students may work on groups of 2-4. Each group will work on its own project which will be defined within the first month. The projects will be tailored such as they follow both the interests of the students as well as open and challenging questions in the big data systems area. An ideal project will lead to 1) a set of new ideas on how to solve a specific problem, 2) an analysis that demonstrates the effectiveness and benefits of the new approaches as well as 3) a write up that is close to a full research paper (and may lead to a publication). 

Assessment and grading

Class participation: 10%

Project: 90%

Collaboration Policy 

You are welcome to discuss your projects and designs with each other. In fact you are expected to do so during our brainstorming sessions and to give valuable feedback to your classmates. However, all material you deliver (code, design docs and reports) should be produced by you. For ideas that have greatly affected your designs and came outside a project group, you should acknowledge your fellow students in your report.  Your final grade will be assessed after several 1:1 meetings with the instructor and the TF where you will be expected to demonstrate not only the end result of your project but also that you "own" the results and the design. 

 

Online discussions

We will use Piazza where you will be able to ask questions and discuss issues related to the course. sign up url: https://piazza.com/harvard/fall2014/cs265class url: https://piazza.com/harvard/fall2014/cs265/home