CS265: Big Data Systems | Stratos Idreos

Semester: Fall

Year offered: 2014

Big data is everywhere. A fundamental goal across numerous modern businesses and sciences is to be able to exploit as many machines as possible, to consume as much information as possible and as fast as possible. The big challenge is "how to turn data into useful knowledge". This is far from a simple task and a moving target as both the underlying hardware and our ability to collect data evolve. In this class, we will discuss how to design data systems and algorithms that can "scale up" and "scale out". Scale up refers to the ability to use a single machine to all its potential, i.e., to exploit properly the memory hierarchy and the multiple CPU and GPU cores. Scale out refers to the ability to use more than 1 machines (typically 100s or 1000s) effectively. This is a research oriented class. Every week we will read two modern research papers; one from the scale up area and one from the scale out area. We will use examples from several areas, including relational systems and distributed databases, graph processing systems (i.e., for social networks), key value stores, noSQL and newSQL systems as well as mobile computing. Each student will work on a semester long data systems research project (in groups of 2-4 students) which can be in any of the above areas and will be based on an open research problem.

Class Materials:

Classes

Here we will maintain the reading list and schedule for CS265 as well as guidelines for various admin issues. Always check this page for updates in the schedule or reading list.

SECTIONS AND OFFICE HOURS

Class meets for sections twice a week: Tuesdays 4-5:30pm and Thursdays 5:30-7pm. Sections are not mandatory; you should go to sections to meet with the TF and other students to discuss about your projects, technical problems as well as about readings. We will also hold sections where we will do lectures about specific topics; such sections will be announced upfront in class and Piazza.

Stratos holds office hours after class every Wednesday.

You are welcome to arrange ad hoc meetings with Stratos or the TFs outside office hours and sections if you need more input. Just send an email at least one day before.

BACKGROUND READING

If you are not familiar with the papers below, you are expected to read them within the first month of the class. We will schedule off-class meetings to discuss these papers in detail.

Architecture of a Database System. J. Hellerstein, M. Stonebraker and J. Hamilton.

Foundations and Trends in Databases, 2007

The Design and Implementation of Modern Column-Oriented Database Systems. D. Abadi, P. Boncz, S. Harizopoulos, S. Idreos and S. Madden. Foundations and Trends in Databases, 2013

Modern B-Tree Techniques. Goetz Graefe. Foundations and Trends in Databases, 2011

A Survey of Large-Scale Analytical Query Processing in MapReduce. Christos Doulkeridis and Kjetil Nørvag. Very Large Databases Journal, 23(3), 2014.

GUIDELINES FOR PRESENTATIONS AND DISCUSSIONS

In every class, one or two students are responsible to lead the discussion for one or more research papers. It is not necessary to prepare slides. Each group can choose between creating slides and using the projector or simply creating a set of notes with bullets to lead the discussion and sharing this trough Piazza right before class.

The goal of each group is to lead a lively discussion in class. The notes or presentation should have enough material to lead discussion for at least the following topics:

(1) Summary of the paper: What is the problem it studies? Why is it an important problem? Why past/naive solutions do not work? What is the main intuition of the proposed solution? Why is it new?

(2) At least three strong points for the paper: Why this is a paper worth reading?

(3) At least three weak points: Anything you would do differently. Missing analysis; etc.

(4) Open topics: What ideas come to mind after reading the current paper(s)? Any open research problems? Can you think of a possible startup because of the new technology? Does it enable any new applications?

GUIDELINES FOR REVIEWS

For every class, each student will prepare a one page review of the primary paper(s) of the day. The review should cover the four points mentioned in the presentation guidelines section above, i.e., (1) summary, (2) strong points, (3) weak points and (4) open problems. You should send the review in PDF to the TF prior to each class.

SCHEDULE

Here we will maintain a list of the readings per class.

In each class we will have one or two papers marked with (P) which means that these are the primary papers that you are expected to read in detail and one or more papers marked with (B) which means that these are papers that we suggest you start from if you need more background on the particular topic or if you would like to read more about similar research.

For all papers, a good plan is to follow a few of the papers cited in the readings. You should see our readings as a trigger to read about an active area of research. The papers we read were recently published in premier data systems venues but there is typically many more great papers, both recent and past, in similar areas. If you would like to read more about a specific area, just ask for more suggestions.

In most classes we will have one primary paper. In some cases we will have two primary papers. The reason is that in these cases you will get a much better picture of the area by reading both papers while at the same time the papers have such a significant overlap that the added effort to read the second paper is reasonably small.

All links to the papers are accessible through the Harvard network at the time of creating this list. If you have trouble accessing any of the papers, please let us know.

Class 1 Sep 3		Introduction to data systems & CS265 Presenter: Stratos slides
Class 2 Sep 5		DB Architectures and Column-stores basics Presenter: Stratos slides
Class 3 Sep 10		(P) Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age.Viktor Leis, Peter A. Boncz, Alfons Kemper, Thomas Neumann. ACM SIGMOD International Conference on Management of Data. 2014 Presenters: Yihe & Sam (B) MonetDB/X100: Hyper-Pipelining Query ExecutionPeter A. Boncz, Marcin Zukowski, Niels Nes Conference on Innovative Data Systems Research (CIDR), 2005
Class 4 Sep 12		(P) Megastore: Providing Scalable, Highly Available Storage for Interactive Services Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson,Jean-Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh. Conference on Innovative Data Systems Research (CIDR), 2011 (P) Spanner: Google’s Globally-Distributed Database.James C. Corbett, Jeffrey Dean, Michael Epstein, et all. USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012 Presenters: Ore & Archie (B) The State of the art in distributed query processingDonald Kossmann ACM Computing Surveys, 2000
Class 5 Sep 18		Class will meet on Thursday Sep 18 at 5:30pm. There will be a section during the normal class time on Wednesday Sep 17. Invited lecture: Nikita Shamgunov from memsql
Class 6 Sep 19		(P) H2O: A Hands-free Adaptive Store Ioannis Alagiannis, Stratos Idreos and Anastassia Ailamaki ACM SIGMOD International Conference on Data Management, 2014 Presenters: Stella & Wasay (B) Efficiently Compiling Efficient Query Plans for Modern HardwareThomas Neumann Proceedings of the Very Large Databases Endowment (PVLDB), 2011
Class 7 Sep 24		(P) SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database ArchitecturesAvrilia Floratou, Umar Farooq Minhas, Fatma Özcan Proceedings of the Very Large Databases Endowment (PVLDB), 2014 Presenter: Jenny & Alex (B) HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical WorkloadsAzza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Alexander Rasin, Avi Silberschatz Proceedings of the Very Large Databases Endowment (PVLDB), 2009
Class 8 Sep 26		(P) Indexing for interactive exploration of big data series.Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas. ACM SIGMOD International Conference on Data Management, 2014 Presenters: Michael (B) Buffering accesses to memory-resident index structures J.Zhouand and K. Ross International Conference on Very Large Databases (VLDB), 2003
Class 9 Oct 1		(P) SharedDB: Killing One Thousand Queries With One StoneGeorgios Giannikis, Gustavo Alonso and Donald Kossmann Proceedings of the Very Large Databases Endowment (PVLDB), 2014 (P) Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS Marcin Zukowski, Sándor Héman, Niels Nes, Peter A. Boncz International Conference on Very Large Databases (VLDB), 2007 Presenter: Wilson & Kevin (B) Main-memory scan sharing for multi-core CPUsLin Qiao, Vijayshankar Raman, Frederick Reiss, Peter J. Haas, Guy M. Lohman Proceedings of the Very Large Databases Endowment (PVLDB), 2008
Class 10 Oct 3		(P) Split query processing in polybase. David J. DeWitt, Alan Halverson, Rimma V. Nehme, Srinath Shankar, Josep Aguilar-Saborit,Artin Avanes, Miro Flasza, Jim Gramling ACM SIGMOD International Conference on Data Management, 2013 Presenters: Tarik & Lukas (B) A case for fractured mirrorsRavishankar Ramamurthy, David J. DeWitt, Qi Su International Conference on Very Large Databases (VLDB), 2003
Class 11 Oct 8		(P) The adaptive radix tree: ARTful indexing for main-memory databases.Viktor Leis, Alfons Kemper, Thomas Neumann International Conference on Data Engineering (ICDE), 2013 Presenters: Mike & Sierra (B) A study of index structures for main memory database management systemsT. J. Lehman and M. J. Carey International Conference on Very Large Databases (VLDB),1986 (B) Cache conscious indexing for decision-support in main memoryJ. Rao and K. RossInternational Conference on Very Large Databases (VLDB), 1999
Class 12 Oct 10		(P) The Researcher's Guide to the Data Deluge: Querying a Scientific Database in Just a Few SecondsMartin L. Kersten, Stratos Idreos, Stefan Manegold, Erietta Liarou Proceedings of the Very Large Databases Endowment (PVLDB), 2011 (P) BlinkDB: queries with bounded errors and bounded response times on very large dataSameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica EuroSys 2013 Presenters: Yifan, Rob (B) SciBORQ: Scientific data management with Bounds On Runtime and QualityLefteris Sidirourgos, Martin L. Kersten, Peter A. Boncz Conference on Innovative Data Systems Research (CIDR), 2011
Class 13 Oct 15		Project Presentations
Class 14 Oct 17		Project Presentations
Class 15 Oct 22		Reducing Database Locking Contention Through Multi-version ConcurrencyMohammad Sadoghi, Mustafa Canim, Bishwaranjan Bhattacharjee, Fabian Nagel, Kenneth A. Ross Proceedings of the Very Large Databases Endowment (PVLDB), 2014 Presenters: Siera, Oyu, Lukas
Class 16 Oct 24		Shark: SQL and rich analytics at scale Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica ACM SIGMOD International Conference on Data Management, 2013 Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster ComputingMatei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly,Michael J. Franklin, Scott Shenker, Ion Stoica USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012 Presenters: Nithin, Neel, Archie
Class 17 Oct 29		Energy Analysis of Hardware and Software Range Partitioning Lisa Wu, Orestis Polychroniou, Raymond J. Barker, Martha A. Kim, Kenneth A. Ross ACM Trans. Comput. Syst. 32(3): 8 (2014) Presenters: Ore, Sam
Class 18 Oct 31		Guest Lecture: C Mohan, IBM Research
Class 19 Nov 5		Track join: distributed joins with minimal network traffic Orestis Polychroniou, Rajkumar Sen, Kenneth A. Ross ACM SIGMOD International Conference on Data Management, 2014 Presenters: Kevin, Alex, Wilson
Class 20 Nov 7		Lazy evaluation of transactions in database systems Jose M. Faleiro, Alexander Thomson, Daniel J. Abad ACM SIGMOD International Conference on Data Management, 2014 Presenters: Stella, Mike, Jenny
Class 21 Nov 12		MLbase: A Distributed Machine Learning System T. Kraska, A. Talwalkar, J.Duchi, R. Griffith, M. Franklin, M.I. Jordan Conference on Innovative Data Systems Research , 2013 Map-Reduce for Machine Learning on Multicore Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun NIPS 2006 Presenters: Tarik, Wasay
Class 22 Nov 14		Guest Lecture by Alkis Simitsis and Georgia Koutrika from HP Labs.
Class 22 Nov 19		In-Memory Performance for Big DataGoetz Graefe, Haris Volos, Hideaki Kimura, Harumi A. Kuno, Joseph Tucek, Mark Lillibridge, Alistair C. Veitch Proceedings of the Very Large Databases Endowment (PVLDB), 2014 Presenters: Yifan, Michael, Rob
Class 23 Nov 21		45 years after Codd’s proposal: Derivability, Redundancy and Consistency of Relations Stored in Large Data Banks (1969) Wasay, Stella, Oyu, Sam Architecture of future data base systems (1981) Wilson, Alex, Third-generation database system manifesto (1990) Mike, Kevin Database research: achievements and opportunities into the 21st century (1996) Lukas, Siera, Ore Jim Gray Speaks Out (1998) Archie, Rob, Yifan Database Research Opportunities in Computer Games (2007) Nithin, Neel Surajit Chaudhuri Speaks Out (2008) Jenny, Tarik, Michael What Next? A Dozen Information-Technology Research Goals ALL

Syllabus

CS265 SYLLABUS (tentative: check back for updates)

Fall 2014

Welcome to CS265!

Professor: Stratos Idreos

url: http://stratos.seas.harvard.eduοffice: MD139

email: stratos@seas.harvard.edu

TF: Manos Athanassoulis

manos.athanassoulis@gmail.com

What is this class about?

Big data is everywhere. A fundamental goal across numerous modern businesses and sciences is to be able to exploit as many machines as possible, to consume as much information as possible and as fast as possible. The big challenge is "how to turn data into useful knowledge". This is far from a simple task and a moving target as both the underlying hardware and our ability to collect data evolve. In this class, we will discuss how to design data systems and algorithms that can "scale up" and "scale out". Scale up refers to the ability to use a single machine to all its potential, i.e., to exploit properly the memory hierarchy and the multiple CPU and GPU cores. Scale out refers to the ability to use more than 1 machines (typically 100s or 1000s) effectively. We will use examples from several areas, including relational systems and distributed databases, graph processing systems (i.e., for social networks), key value stores, noSQL and newSQL systems as well as mobile computing and interactive analytics (such as dbTouch). In a fast moving industry and research environment such skills are in high demand.

Why take this class?

Data is everywhere. Every year we create even more data. As it stands, every two years we create as much data as much we created from the dawn of humanity up to 2003 [Eric Schmidt, Google]. Sciences, businesses and everyday life are severely affected. Data systems are in the middle of all this. Data systems is how we store and access data, i.e., they are the backbone for any data-driven application. It is a $100B industry, growing 10% every year [Economist, “Data, data everywhere”]. At the same time data systems research and the whole industry are going through a major and continuous transition; given that new data-driven scenarios and applications continuously pop up, there is a continuous need to redefine what is a good data system in such dynamic environments.

This class will also be ideal for undergrads who ask the question “what is research?”.

Expected learning outcomes

-Understanding the basic tradeoffs in designing modern big data systems.

-Being able to design a new big data system given a data-driven scenario.

-Develop basic research skills: reading, writing and understanding research papers.

Who can take this class?

The class is accessible to students who have taken CS165 and to students who have a good background of modern data systems and designing algorithms for big data. Contact the instructor if you are unsure about whether you will be able to follow the class or not.

Lectures

The class meets twice a week: Wednesdays and Fridays 1:00pm-2:30pm in MD323. Class starts at 1:10pm. We will not have traditional lectures. All classes will be based on a discussion format. Each week we will discuss 2 research papers. Each student will be tasked to lead the discussion for at least 1 paper through the semester.

Office hours

Starting Week 1, Prof. Stratos Idreos will hold office hours every Wednesday 2:30pm-3:30pm (time slot can change if it is not convenient for everyone). Students are also welcome to schedule ad-hoc meetings with the instructor as often as needed.

TF office hours will be announced soon.

Research and Brainstorming Sessions

We will often schedule "brainstorming sessions". In such sessions, each group may work with the instructor independently on the details of their research project and also each group may get feedback on their progress from the rest of the groups.

Required textbook

We will use recent research papers and surveys which will be posted on the class website and you will have access to them through the Harvard network.

Research project

Each student will work on a research project throughout the semester. Students may work on groups of 2-4. Each group will work on its own project which will be defined within the first month. The projects will be tailored such as they follow both the interests of the students as well as open and challenging questions in the big data systems area. An ideal project will lead to 1) a set of new ideas on how to solve a specific problem, 2) an analysis that demonstrates the effectiveness and benefits of the new approaches as well as 3) a write up that is close to a full research paper (and may lead to a publication).

Assessment and grading

Class participation: 10%

Project: 90%

Collaboration Policy

You are welcome to discuss your projects and designs with each other. In fact you are expected to do so during our brainstorming sessions and to give valuable feedback to your classmates. However, all material you deliver (code, design docs and reports) should be produced by you. For ideas that have greatly affected your designs and came outside a project group, you should acknowledge your fellow students in your report. Your final grade will be assessed after several 1:1 meetings with the instructor and the TF where you will be expected to demonstrate not only the end result of your project but also that you "own" the results and the design.

Online discussions

We will use Piazza where you will be able to ask questions and discuss issues related to the course. sign up url: https://piazza.com/harvard/fall2014/cs265class url: https://piazza.com/harvard/fall2014/cs265/home