#  Courses 

 



In Spring 2014, I will be teaching ***CS-165***, a course about modern ***data systems in the big data era***. Below is a tentative course description and syllabus. More details will follow soon.

**Course Name:** Data Systems

**Course Description:** We are in the big data era and data systems sit in the critical path of everything we do, i.e., in businesses, in sciences, as well as in everyday life. This course will be a comprehensive introduction to modern data systems. The primary focus of the course will be on modern trends that are shaping the data management industry right now, such as column-store and hybrid systems, shared-nothing architectures, cache-conscious algorithms, hardware/software co-design, main-memory systems, adaptive indexing, stream processing, scientific data management, and key-value stores. We will also study the history of data systems and traditional and seminal concepts and ideas such as the relational model, row-store database systems, optimization, indexing, concurrency control, recovery and SQL. In this way, we will discuss both how and why data systems evolved over the years, as well as how these concepts apply today and how data systems might evolve in the future.

**CS165 SYLLABUS** (tentative: check back for updates)

**Spring 2014**

Welcome to CS165!

**Professor**: Stratos Idreos

- url: <http://stratos.seas.harvard.edu>
- office: MD139
- email: <stratos@seas.harvard.edu>

**What is this class about?**

We are in the **big data era and data systems sit in the critical path of everything we do**, i.e., in businesses, in sciences, as well as in everyday life. This course will be a comprehensive introduction to modern data systems. The primary focus of the course will be on modern trends that are shaping the data management industry right now, such as column-store and hybrid systems, shared-nothing architectures, cache-conscious algorithms, hardware/software co-design, main-memory systems, adaptive indexing, stream processing, scientific data management, and key-value stores. We will also study the history of data systems and traditional and seminal concepts and ideas such as the relational model, row-store database systems, optimization, indexing, concurrency control, recovery and SQL. In this way, we will discuss both how and why data systems evolved over the years, as well as how these concepts apply today and how data systems might evolve in the future.

**Why take this class?**

Data is everywhere. Every year we create even more data. As it stands, **every two days we create as much data as we created from the dawn of humanity up to 2003** \[Eric Schmidt, Google\]. Sciences, businesses and everyday life are severely affected. Data systems are in the middle of all this. Data systems are how we store and access data, i.e., they are the backbone of any data-driven application. This is a $100B industry, growing 10% every year \[Economist, “Data, data everywhere”\].

At the same time, data systems research and the whole industry are going through a major and continuous transition. Given that new data-driven scenarios and applications continuously pop up, **there is a continuous need to redefine what is a good data system** in such dynamic environments.

**Expected learning outcomes**

- Become familiar with the history and evolution of data systems design over the past 4-5 decades.
- Understand the basic tradeoffs in designing and implementing modern data systems.
- Be able to design a new data system given a data-driven scenario and to build a prototype.
- Be able to determine which data system is a good fit given the needs of an application.
- Develop advanced C programming and debugging skills.

**Who can take this class?**

Required: CS50 and CS61, or good hacking and algorithm design skills. Talk to the instructor if you have not taken one of those two courses but you think you are ready for CS165.

**Lectures**

The class meets twice a week: Mondays and Wednesdays 1:00pm-2:30pm. Class starts at 1:10pm. Some of the classes will follow a traditional lecture style. Other classes will follow a discussion-based approach where you will be required to read part of the reading material up front as homework and we will use the class time to discuss design choices and solve problems together. Also, we will schedule sections/recitations when needed.

**Office hours**

Prof. Stratos Idreos will hold office hours every Wednesday 2:30pm-4:30pm.

**Guest lectures**

We are arranging guest lectures from leaders in data system design from industry.

**Required textbook**

We will use the following textbook: Database Management Systems, by Raghu Ramakrishnan and Johannes Gehrke. This textbook is a great source for all the seminal and traditional topics. For modern topics, we will use recent research papers and surveys which will be posted on the course website and you will have access to them through the Harvard network.

**Slides/Notes**

The slides used during class will be available online shortly after each class. However, you should not expect the slides to cover the material in detail; the class will rely heavily on discussion and problem solving, so the slides will be tailored to drive the discussion rather than to present the material in full. As such, in each class one or more students will be assigned to take notes, which will then be made available to everyone (we will set up a wiki). Afterwards, any student will be able to jump in and enrich the notes further. Collaborative note taking and editing will be part of your class participation grade and a great way to review the material and also see how your fellow students perceive it.

**Running project**

Throughout the semester we will have a running project with several components. You will have to deliver a component every week or every second week. By the end of the project you will have designed, implemented and evaluated several key elements of a modern data system and you will have experienced several design tradeoffs. The running project is an individual project (all components) and no group projects are allowed. Discussing the design and implementation problems with other students is allowed and encouraged! We will do so in class as well. However, the final deliverable should be your own work: you must write from scratch all the code of your system and all documentation and reports. Whenever applicable, we will let you know if there are existing libraries that are OK to use. You will not be judged only on how well your system works; it should be clear that you have designed and implemented the whole system, i.e., you should be able to perform changes on-the-fly, explain design details, etc.

Delivery of the various components will be done online (details will follow) and the deadline will typically be on Mondays at the time class starts.

You are allowed a total of 10 late days. Each late day extends the deadline for delivering a project component by 24 hours. No excuse is needed and no penalty will be imposed. You may distribute the late days as you wish, i.e., you could even use all late days for just a single deadline. If you do not have any late days left, then for every day you are late, you will lose 20% of the grade for that project component.
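As an illustration only, the late-day arithmetic can be sketched as follows. The function name `apply_late_policy` is hypothetical, and two details are assumptions not stated in the policy: any started 24-hour period counts as a full late day, and the penalty is capped at 100% of the component grade.

```python
import math

LATE_DAY_BUDGET = 10        # total late days per student (from the syllabus)
PENALTY_PCT_PER_DAY = 20    # percent of the component grade lost per uncovered day


def apply_late_policy(hours_late, late_days_left):
    """Return (late_days_used, penalty_pct) for one project component."""
    # Assumption: a started 24-hour period counts as one full late day.
    days_late = math.ceil(hours_late / 24) if hours_late > 0 else 0
    # Free late days are consumed first.
    used = min(days_late, late_days_left)
    uncovered = days_late - used
    # Assumption: the penalty cannot exceed the whole component grade.
    penalty_pct = min(100, uncovered * PENALTY_PCT_PER_DAY)
    return used, penalty_pct


# 30 hours late with a full budget: two late days used, no penalty.
print(apply_late_policy(30, LATE_DAY_BUDGET))
# 72 hours late with one late day left: one day covered, two days penalized.
print(apply_late_policy(72, 1))
```

Under these assumptions, a component delivered three days late with only one late day remaining would lose 40% of its grade.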

We will regularly assign extra tasks or you can come up with your own extra tasks for the various components of the running project. With these extra tasks you gain extra points.

The best 5 overall projects will gain additional extra points. "Best" is defined in terms of elegant system design, code quality, system efficiency and documentation.

**Quizzes**

We will do several (quick) quizzes during class.

**Final project**

Towards the end of the course we will have time for a final project. You will have about a full month. We will provide ideas for projects but you will also be expected to come up with your own project ideas given what you learned in the class. Each project will be judged individually depending on its complexity. You will have to deliver both the system and a technical report that clearly states the problem, the motivation, the solution and the related work, in the same style as some of the research papers we will see in class. Before each project starts there will be a design phase where you will deliver a report with the problem statement, motivation and the design. During this phase you will work with the instructor to define your project idea as well as the possible design and implementation solutions. Each project group may consist of up to 5 students. The final report as well as the initial design report should clearly explain the responsibilities of each group member. No late days will be granted. The best 3 projects will receive extra points. "Best" is defined similarly to the running project, with the additional element of judging the project idea and motivation.

**Assessment and grading**

- Running project/Homework: 40%
- Quizzes and class participation: 15%
- Midterms (2): 20%
- Final project: 25%
- Extra points: Extra tasks for the running project: 10%
- Extra points: Best projects: 10% (5% running project + 5% final project)
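A minimal sketch of how these weights might combine into a final score, under assumptions: each component is scored from 0 to 100, and extra points are added as percentage points on top of the weighted total. The names `WEIGHTS` and `final_score` are hypothetical, and the actual grading procedure may differ.

```python
# Component weights in percent, taken from the list above (they sum to 100).
WEIGHTS = {
    "running_project": 40,
    "quizzes_participation": 15,
    "midterms": 20,
    "final_project": 25,
}

MAX_EXTRA_TASKS = 10     # extra tasks for the running project
MAX_BEST_PROJECTS = 10   # 5 for the running project + 5 for the final project


def final_score(scores, extra_tasks=0, best_projects=0):
    """Weighted total of 0-100 component scores plus capped extra points."""
    base = sum(WEIGHTS[name] * scores[name] / 100 for name in WEIGHTS)
    # Assumption: extra points are percentage points added to the base score.
    extra = min(extra_tasks, MAX_EXTRA_TASKS) + min(best_projects, MAX_BEST_PROJECTS)
    return base + extra


example = {
    "running_project": 90,
    "quizzes_participation": 80,
    "midterms": 70,
    "final_project": 100,
}
print(final_score(example))                 # weighted total only
print(final_score(example, extra_tasks=5))  # with 5 extra points
```

With this reading, a student can exceed 100 overall, since up to 20 extra points sit on top of the four weighted components.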

**Online discussions**

We will set up a Piazza forum where you will be able to ask questions and discuss issues related to the course.