CS165 Data Systems
We are in the big data era and data systems sit in the critical path of everything we do, i.e., in businesses, in sciences, as well as in everyday life. This course will be a comprehensive introduction to modern data systems. The primary focus of the course will be on modern trends that are shaping the data management industry right now such as column-store and hybrid systems, shared nothing architectures, cache conscious algorithms, hardware/software co-design, main memory systems, adaptive indexing, stream processing, scientific data management, and key value stores. We will also study the history of data systems, traditional and seminal concepts and ideas such as the relational model, row-store database systems, optimization, indexing, concurrency control, recovery and SQL; In this way, we will discuss both how data systems evolved over the years and why, as well as how these concepts apply today and how data systems might evolve in the future.
Class Materials:
Classes
We will post here the class schedule, slides, projects and reading material.Slides are not notes. We'll take scribe notes during class using google docs.Part of the slides will be uploaded a few minutes before each class, while the final copy will be uploaded after each class.
Labs will be Mondays 6-9pm, Room: Pierce 320 and Tuesdays 6-9pm, Room Pierce 320.Bring laptop, ideas and bugs with you to work with TFs and other students on your projects.We will have ice cream, drinks, cookies, etc.
Project 1, Project 2, Project 3, Project 4, Final Evaluation
Class 1: 1/27/2014 Introduction | |
Class 2: 1/29/2014Basics - Relational model - SQL | Reading:Textbook Chapters 1, 3 (-3.5), 5 (-5.8,-5.9) Take a look at The Fourth Paradigm Project 1: Due on 2/14/2014 1pm |
Class 3: 2/3/2014 Database architectures basics | Reading:Architecture of a Database System J. Hellerstein, M. Stonebraker and J. HamiltonFoundations and Trends in Databases, 2007 (Sections 1-4) notes by Michael Chen |
Class 4: 2/5/2014Data layouts, pages, files,Column-stores, early/late tuple reconstructionvirtual IDs | Reading:The Design and Implementation of Modern Column-Oriented Database SystemsD. Abadi, P. Boncz, S. Harizopoulos, S. Idreos and S. MaddenFoundations and Trends in Databases, 2013 Sections: All-4.8 (covers next class as well) notes by Tarik Adnan Moon and Yifan Wu |
Class 5: 2/10/2014More on column-storesQuery plansVectorized processingbulk processing Intermediates: bit vectors vs positionsUpdates, Compression, Fractured mirrors | |
Class 6: 2/12/2014 scansshared scansbuffer pooltree indexing basics | Reading: Textbook Chapters 8,9 Cooperative Scans: Dynamic Bandwidth Sharing in a DBMSMarcin Zukowski, Sándor Héman, Niels Nes, Peter A. BonczVery large Databases Conference, 2007 Main-memory scan sharing for multi-core CPUs notes by Jenny Liu and Yuechen Zhao |
No class: 2/17/2014 Holiday | |
Class 7: 2/19/2014b-trees basicsb+-trees | Reading: Textbook Chapters 10,12 Modern B-Tree TechniquesGoetz Graefe Foundations and Trends in Databases, 2011Sections: 1,2,3,5 notes by Cheng Huang
|
Class 8: 2/24/2014clustered/secondary indexesrandom accesscovering indexesc-store projectionssmooth scan | Reading: Making B+trees Cache Conscious in Main MemoryJun Rao and Ken RossACM SIGMOD International Conference on Management of Data, 2000(will be discussed next time) IEEE Data Engineering Bulletin, 35(1), March 2012 Renata Borovica, Stratos Idreos, Anastasia Ailamaki, Marcin Zukowski,Campbell Fraser notes by Daniel Newman |
Class 9: 2/26/2014cache-conscious b-treeslate vs early reconstructionon-the-fly layout transformation | Reading: DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing notes by Xueshan Zhang (Ellie) and Wan Jun Yang (Jenna) |
Midterm1: 3/3/2014 | Open books and notes. No laptops. |
Class 10: 3/5/2014basics on joinsnested loopsblock nested loopszig zag joinsort merge joincolumn-store join planspushing down selectionsblock operators | |
Class 11: 3/10/2014finish up basic joinssorting external sorting | |
Class 12-13: 3/11/2014hashing, dynamic hashing,linear hashing, extendible hashing,blocking vs pipelining,symmetric hash join, counting | |
Class 14: 3/12/2014external joinssimple joingrace joinhybrid join | Reading: Join Processing in Databases with Large Main Memoriesby L. ShapiroACM Transactions on Database Systems. 11(3), 1986 |
Class 15: 3/24/2014cache conscious joinswhat happens after a join post projection | Reading: Cache Conscious Algorithms for Relational Query Processingby A. Shatdal, C. Kant and J. NaughtonVery Large Databases Conference, 1984 Database Architecture Optimized for the new Bottleneck: Memory AccessBy P. Boncz, S. Manegold and M. KerstenVery Large Databases Conference, 1999 Cache-Conscious Radix Decluster ProjectionsBy S. Manegold, P. Boncz, N. Nes, and M. KerstenVery Large Databases Conference, 2004 notes by Ramya Rangan |
Class 16: 3/26/2014loop unrollingloop fusion/fissioncpu pipeliningbranch prediction | Reading: Selection conditions in main memoryKenneth A. RossACM Transactions on Database Systems, 2004 notes by Ashley Jamerson and Roberto Cantos |
3/31 no class | |
4/2 no class | |
Class 17: 4/7/2014prefetchinghardware threadsmulti-cores/numaSIMD GPUs | Reading: Improving Database Performance on Simultaneous Multithreading ProcessorsJingren Zhou, John Cieslewicz, Kenneth A. Ross, Mihir ShahInternational Conference on Very Large Databases, 2005 Implementing database operations using SIMD instructionsJingren Zhou, Kenneth A. RossACM SIGMOD Conference, 2002 Efficient Implementation of Sorting on Multi-Core SIMD CPU ArchitectureJatin Chhugani, et all.International Conference on Very Large Databases, 2008 Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUsChangkyu Kim, et all. International Conference on Very Large Databases, 2009 |
Class 18: 4/9/2014stream processingtransactionsNUMA | Guest Lecture: Erietta Liarou Reading: Enhanced Stream Processing in a DBMS Kernel E. Liarou, S. Idreos, S. Manegold and M. KerstenInternational Conference on Extending Database Technology, 2013 ATraPos: Adaptive Transaction Processing on Hardware IslandsD. Porobic, E. Liarou, P. Tözün and A. AilamakiInternational Conference on Data Engineering, 2014 dbTouch: Analytics at your FingertipsStratos Idreos, Erietta LiarouConference on Innovative Data Systems Research, 2013 |
Class 19: 4/14/2014updatesread-storeswrite storesbufferingwrite ahead logging | |
Class 20: 4/16/2014concurrency controlLSM trees | |
Class 21: 4/21/2014database crackingsideways crackingupdates in adaptive indexing | Reading: Database CrackingStratos Idreos, Martin Kersten and Stefan ManegoldIn Proceedings of the International Conference on Innovative Data Systems Research (CIDR), 2007 Self-organizing Tuple Reconstruction In Column-storesStratos Idreos, Martin Kersten and Stefan ManegoldIn Proceedings of the ACM SIGMOD International Conference on Management of Data, 2009 Updating a Cracked DatabaseStratos Idreos, Martin Kersten and Stefan ManegoldIn Proceedings of the ACM SIGMOD International Conference on Management of Data, 2007 |
Class 22: 4/23/2014stochastic crackingconcurrency control | Reading: Stochastic Database Cracking: Towards Robust Adaptive Indexing in Main-Memory Column-StoresFelix Halim, Stratos Idreos, Panagiotis Karras and Roland H. C. YapIn Proceedings of the Very Large Databases Endowment (PVLDB), 2012 Concurrency Control for Adaptive Indexing Merging What’s Cracked, Cracking What’s Merged: |
Midterm2: 4/28/2014 | Open book |
April 30: Guest lecture by Guy Lohman | |
Project evaluation: May 5-6-7 | Each team will spend 30 minutes with Stratos and 30 minutes with the TFs. |
May 22: Extra meeting | Discussion on noSQL, graph processing, distributed systems and other modern topics |
Discussions/Announcements
We will use piazza.
Here is the class web site: piazza.com/harvard/spring2014/cs165/home
Use this link to sign up: piazza.com/harvard/spring2014/cs165
Syllabus
CS165 SYLLABUS
Spring 2014
Welcome to CS165!
Professor: Stratos Idreosurl: http://stratos.seas.harvard.eduοffice: MD139email: stratos@seas.harvard.edu
TFs
Mike Kester kester@eecs.harvard.eduPeter Macko pmacko@eecs.harvard.eduRob Bowden rbowden91@gmail.com
Class website: http://stratos.seas.harvard.edu/classes/cs165-data-systems
What is this class about?
We are in the big data era and data systems sit in the critical path of everything we do, i.e., in businesses, in sciences, as well as in everyday life. This course will be a comprehensive introduction to modern data systems. The primary focus of the course will be on modern trends that are shaping the data management industry right now such as column-store and hybrid systems, shared nothing architectures, cache conscious algorithms, hardware/software co-design, main memory systems, adaptive indexing, stream processing, scientific data management, and key value stores. We will also study the history of data systems, traditional and seminal concepts and ideas such as the relational model, row-store database systems, optimization, indexing, concurrency control, recovery and SQL; In this way, we will discuss both how data systems evolved over the years and why, as well as how these concepts apply today and how data systems might evolve in the future.
Why take this class?
Data is everywhere. Every year we create even more data. As it stands, every two days we create as much data as much we created from the dawn of humanity up to 2003 [Eric Schmidt, Google]. Sciences, businesses and everyday life are severely affected. Data systems are in the middle of all this. Data systems is how we store and access data, i.e., they are the backbone for any data-driven application. It is a $100B industry, growing 10% every year [Economist, “Data, data everywhere”].
At the same time data systems research and the whole industry are going through a major and continuous transition; given that new data-driven scenarios and applications continuously pop up, there is a continuous need to redefine what is a good data system in such dynamic environments.
Expected learning outcomes
- Become familiar with the history and evolution of data systems design over the past 4-5 decades.
- Understanding the basic tradeoffs in designing and implementing modern data systems.
- Being able to design a new data system given a data-driven scenario and to built a prototype.
- Being able to understand which data system is a good fit given the needs of an application.
- Advanced C programming and debugging skills.
Who can take this class?
Required: CS50 and CS61 or good hacking and algorithm designing skills. Talk to the instructor if you have not taken one of those two courses but you think you are ready for CS165.
Lectures
The class meets twice a week: Mondays and Wednesdays 1:00pm-2:30pm. Class starts at 1:10pm. Classes will be designed to be discussion-based and slides will be used mainly to drive discussions as opposed to delivering the material. For some of the classes you will be required to read part of the reading material up front as homework and we will use the class time to discuss design choices and solve problems. Also, we will schedule sections/recitations when needed.
Office hours
Starting Week 1, Prof. Stratos Idreos will hold office hours every Wednesday 2:30pm-4:00pm (time slot can change if it is not convenient for everyone).
Starting Week 2, TFs will also hold office hours. We will arrange a room for that once we know the slot. During office hours you can chat with the TFs and (among each other) about your running projects, your design and implementations and ask questions about the class material.
Guest lectures
We are arranging guest lectures from leaders in data system design from industry.
Required textbook
We will use the following textbook: Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke. This textbook is a great source for all the seminal and traditional topics. For modern topics, we will use recent research papers and surveys which will be posted on the course website and you will have access to them through the Harvard network.
Slides/Notes
The slides used during class will be available online shortly after each class. However, you should not expect the slides to cover the material in detail; the class will be based a lot on discussion and problem solving so the slides will be tailored to drive the discussion as opposed to serve the material. As such, in each class one or more students will be assigned to take notes which will then make available to everyone (we will set-up a wiki). Afterwards, any student will be able to jump in and enrich the notes further. Collaborative note taking and editing will be part of your class participation grade and a great way to recite the material and also see how your fellow students perceive it.
Running project
Through the semester we will have a running project with several components. You will have to deliver a component every week or every second week. By the end of the project you will have designed, implemented and evaluated several key elements of a modern data system and you will have experienced several design tradeoffs. The running project is an individual project (all components) and no group projects are allowed. Discussing the design and implementation problems with other students is allowed and encouraged! We will do so in the class as well. However, the final deliverable should be personal, you must write from scratch all the code of your system and all documentation and reports. Whenever applicable we will let you know if there are existing libraries that is OK to use. You will not be judged only on how good your system works; it should be clear that you have designed and implemented the whole system, i.e., you should be able to perform changes on-the-fly, explain design details, etc.
Delivery of the various components will be done online (details will follow) and the deadline will typically be on Mondays at the time the course starts.
You are allowed a total of 10 late days. Each late day extends the deadline for delivering a project component by 24 hours. No excuse needed and no penalty will be imposed. You may distribute the late days as you wish, i.e., you could even use all late days for just a single deadline. If you do not have any late days left, then for every day you are late, you will lose 20% of the grade for this course component.
We will regularly assign extra tasks or you can come up with your own extra tasks for the various components of the running project. With these extra tasks you gain extra points.
The best 5 overall projects will gain additional extra points. "Best" is defined in terms of elegant system design, code quality, system efficiency and documentation.
Quizzes
We will do several (quick) quizzes during class.
Final project
Towards the end of the course we will have time for a final project. You will have about a full month. We will provide ideas for projects but you will also be expected to come up with your own project ideas given what you learned in the class. Each project will be judged individually depending on its complexity. You will have to deliver both the system as well as a technical report which will state clearly the problem, the motivation, the solution as well as related work in the same style as some of the research papers we will see in the class. Before each project starts there will be a design phase where you will deliver a report with the problem statement, motivation and the design. During this phase you will work with the instructor to define your project idea as well as the possible design and implementation solutions. Each project group may consist of up to 5 students. The final report as well as the initial design report should clearly explain the responsibilities of each group member. No late days will be granted. The best 3 projects will receive extra points. "Best" is defined similarly to the running project with the additional element of judging the project idea and motivation.
Assessment and grading
- Running project/Homework: 40%
- Quizzes and class participation: 15%
- Midterms (2): 20%
- Final project: 25%
- Extra points: Extra tasks for the running project: 10%
- Extra points: Best projects: 10% (5% running project + 5% final project)
Online discussions
We will set-up a Piazza forum where you will be able to ask questions and discuss issues related to the course.
Updates
During the semester and given student feedback we updated the syllabus in many ways. Some of the most important ones are the following:1) Students had the option to work in groups of 2 for the running project.2) We dropped all deadlines and late days within the semester. There was just a single deadline to deliver the whole project at the end of the semester. 3) During final grading we took into account the starting status of every student, as opposed to strictly taking into account the final project result only.4) We introduced the requirement for design documents for each project step.5) We converted the open project to being a part of the running project.6) We offered a significant amount of office and section hours. Most weeks this was in the order of 10 hours or more, while students also had the option to arrange ad hoc meetings with the professor or the TFs.
We will maintain these changes for the next iteration of the class which will be in Fall 2015. We will also apply a stream of additional small changes based on student feedback. If you are a Harvard undergrad and you have questions for CS165 feel free to contact Stratos.