#  CS265: Big Data Systems 

 





 Semester:   Fall 

|

 Year offered:  2014 

 

 

 

Big data is everywhere. A fundamental goal across numerous modern businesses and sciences is to be able to exploit as many machines as possible, to consume as much information as possible and as fast as possible. The big challenge is "how to turn data into useful knowledge". This is far from a simple task and a moving target as both the underlying hardware and our ability to collect data evolve. In this class, we will discuss how to design data systems and algorithms that can "scale up" and "scale out". Scale up refers to the ability to use a single machine to all its potential, i.e., to exploit properly the memory hierarchy and the multiple CPU and GPU cores. Scale out refers to the ability to use more than 1 machines (typically 100s or 1000s) effectively. This is a research oriented class. Every week we will read two modern research papers; one from the scale up area and one from the scale out area. We will use examples from several areas, including relational systems and distributed databases, graph processing systems (i.e., for social networks), key value stores, noSQL and newSQL systems as well as mobile computing. Each student will work on a semester long data systems research project (in groups of 2-4 students) which can be in any of the above areas and will be based on an open research problem.

---

## Class Materials:

### Classes

Here we will maintain the reading list and schedule for CS265 as well as guidelines for various admin issues. Always check this page for updates in the schedule or reading list.

**SECTIONS AND OFFICE HOURS**Class meets for sections twice a week: Tuesdays 4-5:30pm and Thursdays 5:30-7pm. Sections are not mandatory; you should go to sections to meet with the TF and other students to discuss about your projects, technical problems as well as about readings. We will also hold sections where we will do lectures about specific topics; such sections will be announced upfront in class and Piazza.

Stratos holds office hours after class every Wednesday.

You are welcome to arrange ad hoc meetings with Stratos or the TFs outside office hours and sections if you need more input. Just send an email at least one day before.

**BACKGROUND READING** If you are not familiar with the papers below, you are expected to read them within the first month of the class. We will schedule off-class meetings to discuss these papers in detail.

[Architecture of a Database System](http://mvdirona.com/jrh/perspectives/content/binary/ArchitectureOfDatabaseSystem.pdf). J. Hellerstein, M. Stonebraker and J. Hamilton. Foundations and Trends in Databases, 2007

[The Design and Implementation of Modern Column-Oriented Database Systems](/publications/design-and-implementation-modern-column-oriented-database-systems). D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos and S. Madden. Foundations and Trends in Databases, 2013

[Modern B-Tree Techniques](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.219.7269). Goetz Graefe. Foundations and Trends in Databases, 2011

[A Survey of Large-Scale Analytical Query Processing in MapReduce](http://www.idi.ntnu.no/~noervaag/papers/VLDBJ2013_MapReduceSurvey.pdf). Christos Doulkeridis and Kjetil Nørvag. Very Large Databases Journal, 23(3), 2014.

**GUIDELINES FOR PRESENTATIONS AND DISCUSSIONS**In every class, one or two students are responsible to lead the discussion for one or more research papers. It is not necessary to prepare slides. Each group can choose between creating slides and using the projector or simply creating a set of notes with bullets to lead the discussion and sharing this trough Piazza right before class.

The goal of each group is to lead a lively discussion in class. The notes or presentation should have enough material to lead discussion for at least the following topics:

(1) Summary of the paper: What is the problem it studies? Why is it an important problem? Why past/naive solutions do not work? What is the main intuition of the proposed solution? Why is it new?

(2) At least three strong points for the paper: Why this is a paper worth reading?

(3) At least three weak points: Anything you would do differently. Missing analysis; etc.

(4) Open topics: What ideas come to mind after reading the current paper(s)? Any open research problems? Can you think of a possible startup because of the new technology? Does it enable any new applications?

**GUIDELINES FOR REVIEWS**For every class, each student will prepare a one page review of the primary paper(s) of the day. The review should cover the four points mentioned in the presentation guidelines section above, i.e., (1) summary, (2) strong points, (3) weak points and (4) open problems. You should send the review in PDF to the TF prior to each class.

**SCHEDULE**Here we will maintain a list of the readings per class.

In each class we will have one or two papers marked with (P) which means that these are the primary papers that you are expected to read in detail and one or more papers marked with (B) which means that these are papers that we suggest you start from if you need more background on the particular topic or if you would like to read more about similar research.

For all papers, a good plan is to follow a few of the papers cited in the readings. You should see our readings as a trigger to read about an active area of research. The papers we read were recently published in premier data systems venues but there is typically many more great papers, both recent and past, in similar areas. If you would like to read more about a specific area, just ask for more suggestions.

In most classes we will have one primary paper. In some cases we will have two primary papers. The reason is that in these cases you will get a much better picture of the area by reading both papers while at the same time the papers have such a significant overlap that the added effort to read the second paper is reasonably small.

All links to the papers are accessible through the Harvard network at the time of creating this list. If you have trouble accessing any of the papers, please let us know.

Sort**Class 1**

Sep 3



Introduction to data systems &amp; CS265

Presenter: Stratos

[slides](/files/stratos/files/cs265_class1.pdf)



**Class 2**

Sep 5



DB Architectures and Column-stores basics

Presenter: Stratos

[slides](/files/stratos/files/cs265_class2.pdf)



**Class 3**

Sep 10



**(P)** [Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age](http://www-db.in.tum.de/~leis/papers/morsels.pdf).Viktor Leis, Peter A. Boncz, Alfons Kemper, Thomas Neumann.ACM SIGMOD International Conference on Management of Data. 2014

Presenters: Yihe &amp; Sam

**(B)** [MonetDB/X100: Hyper-Pipelining Query Execution](http://www.cidrdb.org/cidr2005/papers/P19.pdf)Peter A. Boncz, Marcin Zukowski, Niels NesConference on Innovative Data Systems Research (CIDR), 2005



**Class 4**

Sep 12



**(P)** [Megastore: Providing Scalable, Highly Available Storage for Interactive Services](http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf)  
Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson,Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh.Conference on Innovative Data Systems Research (CIDR), 2011

**(P)[ ](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)**[Spanner: Google’s Globally-Distributed Database.](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)James C. Corbett, Jeffrey Dean, Michael Epstein, et all.USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012

Presenters: Ore &amp; Archie

(B) [The State of the art in distributed query processing](http://dl.acm.org/citation.cfm?doid=371578.371598)Donald KossmannACM Computing Surveys, 2000



**Class 5**

Sep 18



Class will meet on Thursday Sep 18 at 5:30pm. There will be a section during the normal class time on Wednesday Sep 17.

 Invited lecture: Nikita Shamgunov from memsql



**Class 6**

Sep 19



**(P)** [H2O: A Hands-free Adaptive Store](/file_url/158)  
Ioannis Alagiannis, Stratos Idreos and Anastassia AilamakiACM SIGMOD International Conference on Data Management, 2014

Presenters: Stella &amp; Wasay

**(B)** [Efficiently Compiling Efficient Query Plans for Modern Hardware](http://www.vldb.org/pvldb/vol4/p539-neumann.pdf)Thomas NeumannProceedings of the Very Large Databases Endowment (PVLDB), 2011



**Class 7**

Sep 24



**(P)** [SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures](http://pages.cs.wisc.edu/~floratou/SQLOnHadoop.pdf)Avrilia Floratou, Umar Farooq Minhas, Fatma Özcan Proceedings of the Very Large Databases Endowment (PVLDB), 2014

Presenter: Jenny &amp; Alex

(B) [HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads](http://www.vldb.org/pvldb/2/vldb09-861.pdf)Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, Avi SilberschatzProceedings of the Very Large Databases Endowment (PVLDB), 2009



**Class 8**

Sep 26



**(P)** [Indexing for interactive exploration of big data series.](/file_url/157)Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas.ACM SIGMOD International Conference on Data Management, 2014

Presenters: Michael

**(B)** [Buffering accesses to memory-resident index structures](http://www.vldb.org/conf/2003/papers/S13P02.pdf)  
J.Zhouand and K. RossInternational Conference on Very Large Databases (VLDB), 2003



**Class 9**

Oct 1



**(P)** [SharedDB: Killing One Thousand Queries With One Stone](http://vldb.org/pvldb/vol5/p526_georgiosgiannikis_vldb2012.pdf)Georgios Giannikis, Gustavo Alonso and Donald KossmannProceedings of the Very Large Databases Endowment (PVLDB), 2014

**(P)** [Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS](http://www.vldb.org/conf/2007/papers/research/p723-zukowski.pdf)  
Marcin Zukowski, Sándor Héman, Niels Nes, Peter A. BonczInternational Conference on Very Large Databases (VLDB), 2007

Presenter: Wilson &amp; Kevin

**(B)**  [Main-memory scan sharing for multi-core CPUs](http://www.vldb.org/pvldb/1/1453924.pdf)Lin Qiao, Vijayshankar Raman, Frederick Reiss, Peter J. Haas, Guy M. LohmanProceedings of the Very Large Databases Endowment (PVLDB), 2008



**Class 10**

Oct 3



**(P)** [Split query processing in polybase. ](http://dl.acm.org/citation.cfm?id=2463709)David J. DeWitt, Alan Halverson, Rimma V. Nehme, Srinath Shankar, Josep Aguilar-Saborit,Artin Avanes, Miro Flasza, Jim GramlingACM SIGMOD International Conference on Data Management, 2013

Presenters: Tarik &amp; Lukas

**(B)** [A case for fractured mirrors](http://link.springer.com/article/10.1007/s00778-003-0093-1)Ravishankar Ramamurthy, David J. DeWitt, Qi SuInternational Conference on Very Large Databases (VLDB), 2003



**Class 11**

Oct 8



**(P)** [The adaptive radix tree: ARTful indexing for main-memory databases.](http://www-db.in.tum.de/~leis/papers/ART.pdf)Viktor Leis, Alfons Kemper, Thomas NeumannInternational Conference on Data Engineering (ICDE), 2013

Presenters: Mike &amp; Sierra

**(B)** [A study of index structures for main memory database management systems](http://www.vldb.org/conf/1986/P294.PDF)T. J. Lehman and M. J. CareyInternational Conference on Very Large Databases (VLDB),1986

  
**(B)** [Cache conscious indexing for decision-support in main memory](http://www.vldb.org/conf/1999/P7.pdf)J. Rao and K. RossInternational Conference on Very Large Databases (VLDB), 1999

**Class 12**

Oct 10



**(P)** [The Researcher's Guide to the Data Deluge: Querying a Scientific Database in Just a Few Seconds](/files/TheResearchersGuidetotheDataDeluge.pdf)Martin L. Kersten, Stratos Idreos, Stefan Manegold, Erietta LiarouProceedings of the Very Large Databases Endowment (PVLDB), 2011

****(P)[ ](http://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf)**[](http://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf)**[BlinkDB: queries with bounded errors and bounded response times on very large data](http://www.cs.berkeley.edu/~sameerag/blinkdb_eurosys13.pdf)Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion StoicaEuroSys 2013

Presenters: Yifan, Rob

(B) [SciBORQ: Scientific data management with Bounds On Runtime and Quality](http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper39.pdf)Lefteris Sidirourgos, Martin L. Kersten, Peter A. BonczConference on Innovative Data Systems Research (CIDR), 2011



**Class 13**

Oct 15



Project Presentations

**Class 14**

Oct 17



Project Presentations

**Class 15**

Oct 22



[Reducing Database Locking Contention Through Multi-version Concurrency](http://www.vldb.org/pvldb/vol7/p1331-sadoghi.pdf)Mohammad Sadoghi, Mustafa Canim, Bishwaranjan Bhattacharjee, Fabian Nagel, Kenneth A. RossProceedings of the Very Large Databases Endowment (PVLDB), 2014

Presenters: Siera, Oyu, Lukas



**Class 16**

Oct 24



[Shark: SQL and rich analytics at scale](http://dl.acm.org/citation.cfm?doid=2463676.2465288)  
Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion StoicaACM SIGMOD International Conference on Data Management, 2013

[Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/conference/nsdi12/technical-sessions/presentation/zaharia)Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly,Michael J. Franklin, Scott Shenker, Ion StoicaUSENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012

Presenters: Nithin, Neel, Archie



**Class 17**

Oct 29



[Energy Analysis of Hardware and Software Range Partitioning](http://dl.acm.org/citation.cfm?doid=2638550)  
Lisa Wu, Orestis Polychroniou, Raymond J. Barker, Martha A. Kim, Kenneth A. RossACM Trans. Comput. Syst. 32(3): 8 (2014)

Presenters: Ore, Sam



****Class 18****

Oct 31



Guest Lecture: C Mohan, IBM Research



**Class 19**

Nov 5



[Track join: distributed joins with minimal network traffic](http://dl.acm.org/citation.cfm?doid=2588555.2610521)  
Orestis Polychroniou, Rajkumar Sen, Kenneth A. RossACM SIGMOD International Conference on Data Management, 2014

Presenters: Kevin, Alex, Wilson



**Class 20**

Nov 7



[Lazy evaluation of transactions in database systems](http://dl.acm.org/citation.cfm?doid=2588555.2610529)  
Jose M. Faleiro, Alexander Thomson, Daniel J. AbadACM SIGMOD International Conference on Data Management, 2014

Presenters: Stella, Mike, Jenny



**Class 21**

Nov 12



[MLbase: A Distributed Machine Learning System](http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper118.pdf)  
T. Kraska, A. Talwalkar, J.Duchi, R. Griffith, M. Franklin, M.I. JordanConference on Innovative Data Systems Research , 2013

[Map-Reduce for Machine Learning on Multicore](http://papers.nips.cc/paper/3150-map-reduce-for-machine-learning-on-multicore.pdf)  
Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle OlukotunNIPS 2006

Presenters: Tarik, Wasay



**Class 22**

Nov 14



Guest Lecture by Alkis Simitsis and Georgia Koutrika from HP Labs.

**Class 22**

Nov 19



[In-Memory Performance for Big Data](http://www.vldb.org/pvldb/vol8/p37-graefe.pdf)Goetz Graefe, Haris Volos, Hideaki Kimura, Harumi A. Kuno, Joseph Tucek, Mark Lillibridge, Alistair C. VeitchProceedings of the Very Large Databases Endowment (PVLDB), 2014

Presenters: Yifan, Michael, Rob



**Class 23**

Nov 21



[45 years after Codd’s proposal: Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks (1969)](http://www.sigmod.org/publications/sigmod-record/0903/p17.40year.codd.pdf)Wasay, Stella, Oyu, Sam

[Architecture of future data base systems (1981)](http://dl.acm.org/citation.cfm?doid=984471.984473)Wilson, Alex,

[Third-generation database system manifesto (1990)](http://dl.acm.org/citation.cfm?doid=101077.390001)Mike, Kevin

[Database research: achievements and opportunities into the 21st century (1996)](http://dl.acm.org/citation.cfm?doid=381854.381886)Lukas, Siera, Ore

[Jim Gray Speaks Out (1998)](http://www.sigmod.org/publications/sigmod-record/0806/p05.jimgray.pdf)Archie, Rob, Yifan

[Database Research Opportunities in Computer Games (2007)](http://www.sigmod.org/publications/sigmod-record/0709/p07.article-gehrke.pdf)Nithin, Neel

[Surajit Chaudhuri Speaks Out (2008)](http://www.sigmod.org/publications/sigmod-record/0812/p086.profiles.chaudhuri.pdf)Jenny, Tarik, Michael

[What Next? A Dozen Information-Technology Research Goals](http://research.microsoft.com/pubs/68660/ms_tr_99_50_turingtalk.pdf)ALL







### Syllabus

**CS265 SYLLABUS** (tentative: check back for updates)

**Fall 2014**

Welcome to CS265!

**Professor**: Stratos Idreos

url: [http://stratos.seas.harvard.eduοffice:](http://stratos.seas.harvard.edu%CE%BFffice:) MD139email: <stratos@seas.harvard.edu>

**TF:** Manos Athanassoulis<manos.athanassoulis@gmail.com>

**What is this class about?**

Big data is everywhere. A fundamental goal across numerous modern businesses and sciences is to be able to exploit as many machines as possible, to consume as much information as possible and as fast as possible. The big challenge is "how to turn data into useful knowledge". This is far from a simple task and a moving target as both the underlying hardware and our ability to collect data evolve. In this class, we will discuss how to design data systems and algorithms that can "scale up" and "scale out". Scale up refers to the ability to use a single machine to all its potential, i.e., to exploit properly the memory hierarchy and the multiple CPU and GPU cores. Scale out refers to the ability to use more than 1 machines (typically 100s or 1000s) effectively. We will use examples from several areas, including relational systems and distributed databases, graph processing systems (i.e., for social networks), key value stores, noSQL and newSQL systems as well as mobile computing and interactive analytics (such as dbTouch). In a fast moving industry and research environment such skills are in high demand.

**Why take this class?**

Data is everywhere. Every year we create even more data. As it stands, **every two years we create as much data as much we created from the dawn of humanity up to 2003** \[Eric Schmidt, Google\]. Sciences, businesses and everyday life are severely affected. Data systems are in the middle of all this. Data systems is how we store and access data, i.e., they are the backbone for any data-driven application. It is a $100B industry, growing 10% every year \[Economist, “Data, data everywhere”\]. At the same time data systems research and the whole industry are going through a major and continuous transition; given that new data-driven scenarios and applications continuously pop up, **there is a continuous need to redefine what is a good data system** in such dynamic environments.

This class will also be ideal for undergrads who ask the question “what is research?”.

**Expected learning outcomes**

-Understanding the basic tradeoffs in designing modern big data systems.

-Being able to design a new big data system given a data-driven scenario.

-Develop basic research skills: reading, writing and understanding research papers.

**Who can take this class?**

The class is accessible to students who have taken CS165 and to students who have a good background of modern data systems and designing algorithms for big data. Contact the instructor if you are unsure about whether you will be able to follow the class or not.

**Lectures**

The class meets twice a week: Wednesdays and Fridays 1:00pm-2:30pm in MD323. Class starts at 1:10pm. We will not have traditional lectures. All classes will be based on a discussion format. Each week we will discuss 2 research papers. Each student will be tasked to lead the discussion for at least 1 paper through the semester.

**Office hours**

Starting Week 1, Prof. Stratos Idreos will hold office hours every Wednesday 2:30pm-3:30pm (time slot can change if it is not convenient for everyone). Students are also welcome to schedule ad-hoc meetings with the instructor as often as needed.

TF office hours will be announced soon.

**Research and Brainstorming Sessions**

We will often schedule "brainstorming sessions". In such sessions, each group may work with the instructor independently on the details of their research project and also each group may get feedback on their progress from the rest of the groups.

**Required textbook**

We will use recent research papers and surveys which will be posted on the class website and you will have access to them through the Harvard network.

**Research project**

Each student will work on a research project throughout the semester. Students may work on groups of 2-4. Each group will work on its own project which will be defined within the first month. The projects will be tailored such as they follow both the interests of the students as well as open and challenging questions in the big data systems area. An ideal project will lead to 1) a set of new ideas on how to solve a specific problem, 2) an analysis that demonstrates the effectiveness and benefits of the new approaches as well as 3) a write up that is close to a full research paper (and may lead to a publication).

**Assessment and grading**

Class participation: 10%Project: 90%

**Collaboration Policy**

You are welcome to discuss your projects and designs with each other. In fact you are expected to do so during our brainstorming sessions and to give valuable feedback to your classmates. However, all material you deliver (code, design docs and reports) should be produced by you. For ideas that have greatly affected your designs and came outside a project group, you should acknowledge your fellow students in your report. Your final grade will be assessed after several 1:1 meetings with the instructor and the TF where you will be expected to demonstrate not only the end result of your project but also that you "own" the results and the design.

**Online discussions**

We will use Piazza where you will be able to ask questions and discuss issues related to the course. **sign up url:** <https://piazza.com/harvard/fall2014/cs265>**class url:** <https://piazza.com/harvard/fall2014/cs265/home>