CS165 Data Systems

Semester: Spring
|
Year offered: 2014

We are in the big data era and data systems sit in the critical path of everything we do, i.e., in businesses, in sciences, as well as in everyday life. This course will be a comprehensive introduction to modern data systems. The primary focus of the course will be on modern trends that are shaping the data management industry right now such as column-store and hybrid systems, shared nothing architectures, cache conscious algorithms, hardware/software co-design, main memory systems, adaptive indexing, stream processing, scientific data management, and key value stores. We will also study the history of data systems, traditional and seminal concepts and ideas such as the relational model, row-store database systems, optimization, indexing, concurrency control, recovery and SQL; In this way, we will discuss both how data systems evolved over the years and why, as well as how these concepts apply today and how data systems might evolve in the future. 


Class Materials:

Classes

We will post here the class schedule, slides, projects and reading material.Slides are not notes. We'll take scribe notes during class using google docs.

Part of the slides will be uploaded a few minutes before each class, while the final copy will be uploaded after each class.

Labs will be Mondays 6-9pm, Room: Pierce 320 and Tuesdays 6-9pm, Room Pierce 320.Bring laptop, ideas and bugs with you to work with TFs and other students on your projects.

We will have ice cream, drinks, cookies, etc. 

Project 1Project 2Project 3Project 4Final Evaluation

Quiz 1Quiz 2Quiz 3

Midterm1Midterm2

Class 1: 1/27/2014 Introduction
Class 2: 1/29/2014Basics - Relational model - SQL      

slides

Reading:Textbook Chapters 1, 3 (-3.5), 5 (-5.8,-5.9)       

Take a look at The Fourth Paradigm

Project 1: Due on 2/14/2014 1pm

Class 3: 2/3/2014 Database architectures basics

slides

Reading:Architecture of a Database System 
J. Hellerstein, M. Stonebraker and J. HamiltonFoundations and Trends in Databases, 2007

(Sections 1-4)

notes by Michael Chen

Quiz 1

Class 4: 2/5/2014Data layouts, pages, files,Column-stores, early/late tuple reconstructionvirtual IDs

slides

Reading:The Design and Implementation of Modern Column-Oriented Database SystemsD. Abadi, P.  Boncz, S. Harizopoulos, S. Idreos and S. MaddenFoundations and Trends in Databases, 2013

Sections: All-4.8 (covers next class as well)

notes by Tarik Adnan Moon and Yifan Wu

Class 5: 2/10/2014More on column-storesQuery plansVectorized processingbulk processing Intermediates: bit vectors vs positionsUpdates, Compression,

Fractured mirrors

slides

notes by Wilson Qin and Stella Pantela

Class 6: 2/12/2014 scansshared scansbuffer pooltree indexing basics

slides

Reading:

Textbook Chapters 8,9

Cooperative Scans: Dynamic Bandwidth Sharing in a DBMSMarcin Zukowski, Sándor Héman, Niels Nes, Peter A. Boncz 

Very large Databases Conference, 2007

Main-memory scan sharing for multi-core CPUs
Lin Qiao, Vijayshankar Raman, Frederick Reiss, Peter J. Haas, Guy M. Lohman Very large Databases Conference, 2008

 

Project 2

notes by Jenny Liu and Yuechen Zhao

No class: 2/17/2014

Holiday

Class 7: 2/19/2014b-trees basicsb+-trees

slides

Reading:

Textbook Chapters 10,12

Modern B-Tree TechniquesGoetz Graefe Foundations and Trends in Databases, 2011

Sections: 1,2,3,5

notes by Cheng Huang

Quiz 2 

 

Class 8: 2/24/2014clustered/secondary indexesrandom accesscovering indexesc-store projectionssmooth scan

slides

Reading:

Making B+trees Cache Conscious in Main MemoryJun Rao and Ken RossACM SIGMOD International Conference on Management of Data, 2000

(will be discussed next time)

IEEE Data Engineering Bulletin, 35(1), March 2012
Special Issue on Column-stores (9 short overview papers)

Smooth Scan: One access path to rule them all

Renata Borovica, Stratos Idreos, Anastasia Ailamaki, Marcin Zukowski,Campbell Fraser

notes by Daniel Newman

Class 9: 2/26/2014cache-conscious b-treeslate vs early reconstructionon-the-fly layout transformation 

slides

Reading:

DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing
Marcin Zukowski, Niels Nes, Peter A. Boncz
International Workshop on Data Management on New Hardware (DaMoN) 2008

Quiz 3

notes by Xueshan Zhang (Ellie) and Wan Jun Yang (Jenna)

Midterm1: 3/3/2014

Open books and notes. No laptops. 

Midterm1

Class 10: 3/5/2014basics on joinsnested loopsblock nested loopszig zag joinsort merge joincolumn-store join planspushing down selectionsblock operators

slides

Reading:

Textbook Chapters 4.1, 4.2, 14

notes by Jessica Yao 

notes by Joshua Lee

Project 3

Class 11: 3/10/2014finish up basic joinssorting

external sorting

slides

Reading:

Textbook Chapter 13

notes by Kevin Mu & Alisa Nguyen

Class 12-13: 3/11/2014hashing, dynamic hashing,linear hashing, extendible hashing,blocking vs pipelining,symmetric hash join,

counting

slides

Reading:

Textbook Chapter 11

Class 14: 3/12/2014external joinssimple joingrace joinhybrid join

slides

Reading:

Join Processing in Databases with Large Main Memoriesby L. Shapiro

ACM Transactions on Database Systems. 11(3), 1986 

Class 15: 3/24/2014cache conscious joinswhat happens after a join

post projection

slides

Reading:

Cache Conscious Algorithms for Relational Query Processingby A. Shatdal, C. Kant and J. Naughton

Very Large Databases Conference, 1984

Database Architecture Optimized for the new Bottleneck: Memory AccessBy P. Boncz, S. Manegold and M. Kersten

Very Large Databases Conference, 1999

Cache-Conscious Radix Decluster ProjectionsBy S. Manegold, P. Boncz, N. Nes, and M. Kersten

Very Large Databases Conference, 2004

notes by Ramya Rangan

Class 16: 3/26/2014loop unrollingloop fusion/fissioncpu pipeliningbranch prediction

slides

Reading:

Selection conditions in main memoryKenneth A. Ross

ACM Transactions on Database Systems, 2004

notes by Ashley Jamerson and Roberto Cantos

3/31 no class

4/2 no class

Class 17: 4/7/2014prefetchinghardware threadsmulti-cores/numaSIMD

GPUs

slides

Reading:

Improving Database Performance on Simultaneous Multithreading ProcessorsJingren Zhou, John Cieslewicz, Kenneth A. Ross, Mihir Shah

International Conference on Very Large Databases, 2005

Implementing database operations using SIMD instructionsJingren Zhou, Kenneth A. Ross 

ACM SIGMOD Conference, 2002

Efficient Implementation of Sorting on Multi-Core SIMD CPU ArchitectureJatin Chhugani, et all.

International Conference on Very Large Databases, 2008

Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs
Changkyu Kim, et all.

International Conference on Very Large Databases, 2009

Class 18: 4/9/2014stream processingtransactionsNUMA

Guest Lecture: Erietta Liarou

slides

Reading:

Enhanced Stream Processing in a DBMS Kernel E. Liarou, S. Idreos, S. Manegold and M. Kersten 

International Conference on Extending Database Technology, 2013

ATraPos: Adaptive Transaction Processing on Hardware IslandsD. Porobic, E. Liarou, P. Tözün and A. Ailamaki

International Conference on Data Engineering, 2014

dbTouch: Analytics at your FingertipsStratos Idreos, Erietta LiarouConference on Innovative Data Systems Research, 2013
Class 19: 4/14/2014updatesread-storeswrite storesbufferingwrite ahead logging
slides

Reading:

Textbook Chapters 16, 17, 18

notes by Michelle Deng and Lucy Cheng 

Class 20: 4/16/2014concurrency controlLSM trees
Class 21: 4/21/2014database crackingsideways crackingupdates in adaptive indexing 

slides

Reading:

Database CrackingStratos Idreos, Martin Kersten and Stefan Manegold

In Proceedings of the International Conference on Innovative Data Systems Research (CIDR), 2007

Self-organizing Tuple Reconstruction In Column-storesStratos Idreos, Martin Kersten and Stefan Manegold

In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2009

Updating a Cracked DatabaseStratos Idreos, Martin Kersten and Stefan Manegold

In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2007

Class 224/23/2014stochastic crackingconcurrency control 

slides

Reading:

Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-StoresFelix Halim, Stratos Idreos, Panagiotis Karras and Roland H. C. Yap

In Proceedings of the Very Large Databases Endowment (PVLDB), 2012

Concurrency Control for Adaptive Indexing
Goetz Graefe, Felix Halim, Stratos Idreos, Harumi A. Kuno, Stefan Manegold
In Proceedings of the Very Large Databases Endowment (PVLDB), 2012

Merging What’s Cracked, Cracking What’s Merged:
Adaptive Indexing in Main-Memory Column-Stores
Stratos Idreos, Stefan Manegold, Harumi Kuno and Goetz Graefe
In Proceedings of the Very Large Databases Endowment (PVLDB), 2011

Midterm2: 4/28/2014
Open book
April 30: Guest lecture by Guy Lohman
Project evaluation:

May 5-6-7 

Each team will spend 30 minutes with Stratos and 30 minutes with the TFs.
May 22: Extra meeting

Discussion on noSQL, graph processing, distributed systems and other modern topics

Discussions/Announcements

We will use piazza.

Here is the class web site: piazza.com/harvard/spring2014/cs165/home

Use this link to sign up: piazza.com/harvard/spring2014/cs165

Syllabus

CS165 SYLLABUS 

Spring 2014

Welcome to CS165!

Professor: Stratos Idreosurl: http://stratos.seas.harvard.eduοffice: MD139

email: stratos@seas.harvard.edu

TFs

Mike Kester kester@eecs.harvard.eduPeter Macko pmacko@eecs.harvard.edu

Rob Bowden rbowden91@gmail.com

Class websitehttp://stratos.seas.harvard.edu/classes/cs165-data-systems

What is this class about?

We are in the big data era and data systems sit in the critical path of everything we do, i.e., in businesses, in sciences, as well as in everyday life. This course will be a comprehensive introduction to modern data systems. The primary focus of the course will be on modern trends that are shaping the data management industry right now such as column-store and hybrid systems, shared nothing architectures, cache conscious algorithms, hardware/software co-design, main memory systems, adaptive indexing, stream processing, scientific data management, and key value stores. We will also study the history of data systems, traditional and seminal concepts and ideas such as the relational model, row-store database systems, optimization, indexing, concurrency control, recovery and SQL; In this way, we will discuss both how data systems evolved over the years and why, as well as how these concepts apply today and how data systems might evolve in the future. 

Why take this class?

Data is everywhere. Every year we create even more data. As it stands, every two days we create as much data as much we created from the dawn of humanity up to 2003 [Eric Schmidt, Google]. Sciences, businesses and everyday life are severely affected. Data systems are in the middle of all this. Data systems is how we store and access data, i.e., they are the backbone for any data-driven application. It is a $100B industry, growing 10% every year [Economist, “Data, data everywhere”]. 

At the same time data systems research and the whole industry are going through a major and continuous transition; given that new data-driven scenarios and applications continuously pop up, there is a continuous need to redefine what is a good data system in such dynamic environments.  

Expected learning outcomes

  • Become familiar with the history and evolution of data systems design over the past 4-5 decades.
  • Understanding the basic tradeoffs in designing and implementing modern data systems.
  • Being able to design a new data system given a data-driven scenario and to built a prototype.
  • Being able to understand which data system is a good fit given the needs of an application. 
  • Advanced C programming and debugging skills.

Who can take this class? 

Required: CS50 and CS61 or good hacking and algorithm designing skills. Talk to the instructor if you have not taken one of those two courses but you think you are ready for CS165.

Lectures

The class meets twice a week: Mondays and Wednesdays 1:00pm-2:30pm. Class starts at 1:10pm. Classes will be designed to be discussion-based and slides will be used mainly to drive discussions as opposed to delivering the material. For some of the classes you will be required to read part of the reading material up front as homework and we will use the class time to discuss design choices and solve problems. Also, we will schedule sections/recitations when needed. 

 

Office hours

Starting Week 1, Prof. Stratos Idreos will hold office hours every Wednesday 2:30pm-4:00pm (time slot can change if it is not convenient for everyone).

Starting Week 2, TFs will also hold office hours. We will arrange a room for that once we know the slot. During office hours you can chat with the TFs and (among each other) about your running projects, your design and implementations and ask questions about the class material.

Guest lectures

We are arranging guest lectures from leaders in data system design from industry. 

Required textbook

We will use the following textbook: Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke. This textbook is a great source for all the seminal and traditional topics. For modern topics, we will use recent research papers and surveys which will be posted on the course website and you will have access to them through the Harvard network. 

Slides/Notes

The slides used during class will be available online shortly after each class. However, you should not expect the slides to cover the material in detail; the class will be based a lot on discussion and problem solving so the slides will be tailored to drive the discussion as opposed to serve the material. As such, in each class one or more students will be assigned to take notes which will then make available to everyone (we will set-up a wiki). Afterwards, any student will be able to jump in and enrich the notes further. Collaborative note taking and editing will be part of your class participation grade and a great way to recite the material and also see how your fellow students perceive it.  

Running project

Through the semester we will have a running project with several components. You will have to deliver a component every week or every second week. By the end of the project you will have designed, implemented  and evaluated several key elements of a modern data system and you will have experienced several design tradeoffs. The running project is an individual project (all components) and no group projects are allowed. Discussing the design and implementation problems with other students is allowed and encouraged! We will do so in the class as well. However, the final deliverable should be personal, you must write from scratch all the code of your system and all documentation and reports. Whenever applicable we will let you know if there are existing libraries that is OK to use. You will not be judged only on how good your system works; it should be clear that you have designed and implemented the whole system, i.e., you should be able to perform changes on-the-fly, explain design details, etc. 

Delivery of the various components will be done online (details will follow) and the deadline will typically be on Mondays at the time the course starts. 

You are allowed a total of 10 late days. Each late day extends the deadline for delivering a project component by 24 hours. No excuse needed and no penalty will be imposed. You may distribute the late days as you wish, i.e., you could even use all late days for just a single deadline. If you do not have any late days left, then for every day you are late, you will lose 20% of the grade for this course component. 

We will regularly assign extra tasks or you can come up with your own extra tasks for the various components of the running project. With these extra tasks you gain extra points.

The best 5 overall projects will gain additional extra points. "Best" is defined in terms of elegant system design, code quality, system efficiency and documentation.

Quizzes

We will do several (quick) quizzes during class.

Final project 

Towards the end of the course we will have time for a final project. You will have about a full month. We will provide ideas for projects but you will also be expected to come up with your own project ideas given what you learned in the class. Each project will be judged individually depending on its complexity. You will have to deliver both the system as well as a technical report which will state clearly the problem, the motivation, the solution as well as related work in the same style as some of the research papers we will see in the class. Before each project starts there will be a design phase where you will deliver a report with the problem statement, motivation and the design. During this phase you will work with the instructor to define your project idea as well as the possible design and implementation solutions. Each project group may consist of up to 5 students. The final report as well as the initial design report should clearly explain the responsibilities of each group member.  No late days will be granted. The best 3 projects will receive extra points. "Best" is defined similarly to the running project with the additional element of judging the project idea and motivation. 

Assessment and grading

  • Running project/Homework: 40%
  • Quizzes and class participation: 15%
  • Midterms (2): 20%
  • Final project: 25%
  • Extra points: Extra tasks for the running project: 10% 
  • Extra points: Best projects: 10% (5% running project + 5% final project)

Online discussions

We will set-up a Piazza forum where you will be able to ask questions and discuss issues related to the course. 

Updates

During the semester and given student feedback we updated the syllabus in many ways. Some of the most important ones are the following:1) Students had the option to work in groups of 2 for the running project.2) We dropped all deadlines and late days within the semester. There was just a single deadline to deliver the whole project at the end of the semester. 3) During final grading we took into account the starting status of every student, as opposed to strictly taking into account the final project result only.4) We introduced the requirement for design documents for each project step.5) We converted the open project to being  a part of the running project.

6) We offered a significant amount of office and section hours. Most weeks this was in the order of 10 hours or more, while students also had the option to arrange ad hoc meetings with the professor or the TFs.  

We will maintain these changes for the next iteration of the class which will be in Fall 2015. We will also apply a stream of additional small changes based on student feedback. If you are a Harvard undergrad and you have questions for CS165 feel free to contact Stratos.