This course investigates the theory and practice of modern large-scale database systems. Large-scale approaches include distributed relational databases; data warehouses; and non-relational databases including HDFS, Hadoop, Accumulo for query and graph algorithms, and Mahout bound to Spark for machine learning algorithms. Topics discussed include data design and architecture; database security, integrity, query processing, query optimization, transaction management, concurrency control, and fault tolerance; and query formulation, graph algorithms, and machine learning algorithms on large-scale distributed data systems. At the end of the course, students will understand the principles of several common large-scale data systems including their architectures, performance, and costs. Students will also gain a sense of which approach is recommended for different requirements and circumstances.
The course content is divided into modules. Modules can be accessed by clicking Course Content on the menu. A module will have several sections including the overview, content, readings, discussions, and assignments. Students are encouraged to preview all sections of the module before starting. Most modules run for a period of seven (7) days; exceptions are noted on the Course Outline page. Students should regularly check the Calendar and Announcements for assignment due dates.
Module | Dates | Topics | Assignments |
Module 1 | Week 1 | Introduction to Distributed Database Systems and Distributed Database Architectures | Ozsu, M. and Valduriez, P., Principles of Distributed Database Systems, Springer, 2011; chaps. 1, 2 & 3 · Learning Activity – Design Database · Assessment – Write up of database design, represent queries in relational algebra notation |
Module 2 | Week 2 | Horizontal Partitioning | Principles of Dist. DB Sys.; chap. 3 · Learning Activity – Horizontally partition a single table of a database · Assessment – Find the optimal horizontal partitioning of a table given a set of queries |
Module 3 | Week 3 | Vertical Partitioning | Principles of Dist. DB Sys.; chap. 3 · Learning Activity – Vertically partition a single table of a database · Assessment – Find the optimal vertical partitioning of a table given a set of queries |
Module 4 & Module 5 | Week 4 | Semantic Data Control & Distributed Query Processing | Principles of Dist. DB Sys.; chaps. 5, 6 & 7 · Assessment – Answer questions about the effects of semantic integrity rules · Learning Activity – Online discussion about query trees |
Module 6 | Week 5 | Distributed Query Optimization | Principles of Dist. DB Sys.; chap. 8 · Learning Activity – Analyze the cost/benefit of implementing different query optimization approaches · Assessment – Apply cost heuristics to optimize query plans |
Module 7 | Week 6 | Distributed Transaction Management & Concurrency Control | Principles of Dist. DB Sys.; chap. 10 · Learning Activity – Online discussion of ACID properties, concurrency control, and/or deadlock management · Assessment – Demonstrate knowledge of query schedule equivalence · Assessment – Answer questions about various concurrency control algorithms |
Module 8 | Week 7 | Distributed Reliability Protocols and the Data Warehouse | C. Mohan, ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992, pp. 94–162 http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/p94-mohan.pdf Inmon, William. Building the Data Warehouse. Wiley; 4th edition. 2005. Principles of Dist. DB Sys.; chap. 12 · Learning Activity – Online discussion of ARIES operations · Assessment – Work through an ARIES example |
Module 9 & Module 10 | Week 8 | Cloud Computing, MapReduce, Hadoop Distributed File System (HDFS) & Advanced Hadoop | Hogan, M. D., Liu, F., Sokol, A. W., Jin, T., NIST Cloud Computing Standards Roadmap, NIST Publication SP 500-291, 2011. http://www.nist.gov/manuscript-publication-search.cfm?pub_id=909024 Liu, F., Tong, J., Mao, J., Bohn, R. B., Messina, J. V., Badger, M. L., Leaf, D. M., NIST Cloud Computing Reference Architecture, NIST Publication SP 500-292, 2011. http://www.nist.gov/manuscript-publication-search.cfm?pub_id=909505 Sosinsky, B., Cloud Computing Bible. Wiley, 2011. ISBN: 978-0470903568 Dean, Jeffrey and Ghemawat, Sanjay (2004). "MapReduce: Simplified Data Processing on Large Clusters". http://research.google.com/archive/mapreduce.html Tom White, Hadoop: The Definitive Guide. O’Reilly Media, 4th edition. 2015. Apache Hadoop. http://wiki.apache.org/hadoop/ · Learning Activity – Online discussion of the characteristics of algorithms best suited for MapReduce · Assessment – Write MapReduce pseudocode |
Module 11 | Week 9 | Accumulo Architecture & Programming | Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. Bigtable: A distributed storage system for structured data. Proceedings of the 7th Conference on USENIX Symposium on Operating Systems Design and Implementation, Vol. 7. 2006. http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.101.9822 Apache – Accumulo, http://incubator.apache.org/accumulo/ Apache – Hive, http://hive.apache.org/ Apache – Pig, http://pig.apache.org/ Cloudera – Sqoop, http://www.cloudera.com/blog/2009/06/introducing-sqoop/ Google – Cloud SQL, http://code.google.com/apis/sql/ · Learning Activity – Online discussion about what must be done to make Accumulo ACID compliant · Assessment – Write pseudocode for Accumulo text analytics |
Module 12 | Week 10 | Advanced Accumulo – Security & Data Analytics | Apache – Accumulo, http://incubator.apache.org/accumulo/ W3C Wiki – Large Triple Stores, http://www.w3.org/wiki/LargeTripleStores · Learning Activity – Discuss the advantages of graph data models · Assessment – Write pseudocode for searching geospatial data represented in Accumulo |
Module 13 | Week 11 | Machine Learning – Collaborative Filtering | Apache – Mahout, http://mahout.apache.org/ · Learning Activity – Students will discuss the two common collaborative filtering methods and how they can be improved · Assessment – Write pseudocode for a collaborative filtering problem |
Module 14 | Week 12 | Machine Learning – Clustering and Classification | Apache – Mahout, http://mahout.apache.org/ · Learning Activity – Online discussion of differences between collaborative filtering, classification, and clustering · Assessment – Write pseudocode for a clustering algorithm |
To provide an understanding of the deep issues of the design and implementation of massive-scale data systems.
Required
Ozsu, M. Tamer and Valduriez, Patrick. Principles of Distributed Database Systems, 3rd Edition. Springer, 2011.
ISBN-10: 1441988335
ISBN-13: 978-1441988331
Textbook information for this course is available online through the appropriate bookstore website. For online courses, search the MBS website at http://ep.jhu.edu/bookstore.
Optional
Additionally, any of the following texts or other texts that you may have from previous courses may be useful for this class if you find yourself struggling with specific skills:
It is expected that each class will take approximately 4–7 hours per week to complete. Here is an approximate breakdown: reading the assigned sections of the texts (approximately 2–3 hours per week) as well as some outside reading, listening to the audio annotated slide presentations (approximately 1–2 hours per week), and online homework assignments (approximately 1–2 hours per week).
This course will consist of two basic student requirements:
Preparation and Participation (Class Discussions) (20% of Final Grade Calculation)
Each student is responsible for carefully reading all assigned material and being prepared for discussion. The majority of readings are from the weekly modules and the course text. Additional reading may be assigned to supplement text readings.
It is recommended that you post your initial response to the discussion questions by the evening of day 3 for that module week. Posting one or more responses to the discussion question is part one of your grade for class discussions.
Part two of your grade is your discussions and interactions (i.e., responding to classmate postings with thoughtful responses) with at least two classmates. Just posting your response to a discussion question is not sufficient; we want you to interact with your classmates. Be detailed in your postings and in your responses to your classmates' postings. Feel free to agree or disagree with your classmates. Please ensure that your postings are civil and constructive.
Sometimes, a student may post the entire solution to a question. In this case, I still want you to post your own solutions to the question and any unique insights that you may have. There are always nuances in each person’s solutions and I am very interested in each of your perspectives.
I do not want posts that are generated directly from LLMs. However, you may use LLMs to help inform your responses just like you may use web pages, papers, and books to help inform your responses. Nevertheless, you must present your own responses in your own words.
Some of the discussions are collaborative homework assignments. This is tricky. I don't want one person to post the solution to a homework problem early while no one else can add too much to the solution. I would rather see people post parts of the solution and discuss trade-offs and observations at each step of the solution. As always, I welcome feedback early and often. Most importantly, I'll be looking for interesting or unique insights that you bring to each topic.
David Silberberg will monitor class discussions and will respond to some of the discussions as discussions are posted.
Evaluation of preparation and participation is based on contribution to discussions. You will receive up to 5 points for each week’s discussion, graded as follows:
5 points — Interesting Concepts [offers interesting insights; references to other research or products]; Critical Thinking [rich in content; full of thoughts, insight, and analysis]; excellent interaction with other students.
4 points — Interesting Concepts [offers some insights; references to other research or products]; Critical Thinking [substantial information; thought, insight, and analysis has taken place]; good interaction with other students.
3 points — Interesting Concepts [offers some insights; few references to other research or products]; Critical Thinking [generally competent; information is thin and commonplace]; some interaction with other students.
2 points — Interesting Concepts [offers few insights; no references to other research or products]; Critical Thinking [information is thin and commonplace]; little interaction with other students.
1 point — Interesting Concepts [few real insights]; Critical Thinking [information is thin and commonplace]; hardly any interaction with other students.
0 points — No postings.
Learning Activities, Homework, and Quizzes (80% of Final Grade Calculation)
Reading assignments will be important sources of material for your learning activity assignment.
Learning activity assignments will include a mix of homework problem sets. Include a cover sheet with your name and assignment identifier. Also include your name and a page number indicator (i.e., page x of y) on each page of your homework submissions. Each problem should have the problem statement, the steps required to arrive at the solution, and the solution. All Figures and Tables should be captioned and labeled appropriately.
All homework assignments and quizzes are due according to the dates in the Calendar.
Late submissions will be reduced by 10 points for each day late (no exceptions without prior coordination with the instructors).
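For illustration only, the late-submission arithmetic can be sketched in Python as follows. The function name, the floor at zero, and the counting of lateness in whole days are assumptions; the syllabus states only the 10-points-per-day reduction.

```python
def apply_late_penalty(raw_score, days_late, penalty_per_day=10):
    """Reduce a score by a fixed number of points for each day late.

    Assumptions (not stated in the syllabus): lateness is counted in
    whole days, and a score is never reduced below zero.
    """
    days = max(0, days_late)  # on-time or early work is not penalized
    return max(0, raw_score - penalty_per_day * days)


# A submission scored 92 and turned in 2 days late.
print(apply_late_penalty(92, 2))
```

Under these assumptions, a 92 submitted two days late would be recorded as 72.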
Learning Activities are evaluated by the following grading elements:
100–90 = A
89–80 = B
79–70 = C
Student assignments are due according to the dates in the Calendar. The instructor will post grades one week after assignment due dates.
I generally do not directly grade spelling and grammar. However, egregious violations of the rules of the English language will be noted without comment. Consistently poor performance in either spelling or grammar is taken as an indication of poor written communication ability that may detract from your grade.
Grades of A+, A, and A- indicate achievement of consistent excellence and distinction throughout the course; that is, conspicuous excellence in all aspects of assignments and discussion in every week.
Grades of B+, B, and B- indicate work that meets all course requirements on a level appropriate for graduate academic work. These criteria apply to both undergraduates and graduate students taking the course.
100–90 = A (with pluses and minuses)
89–80 = B (with pluses and minuses)
79–70 = C (with pluses)
Final grades will be determined by the following weighting:
Item | % of Grade |
Preparation and Participation (Class Discussions) | 20% |
Learning Activities, Homework, Quizzes | 80% |
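As a rough illustration of how the 20%/80% weighting combines with the letter-grade bands, the following Python sketch computes a weighted course percentage. The function names, the equal weighting of individual assignments, the normalization of discussion points to a 0–100 scale, and the label for scores below 70 are assumptions made for this example; the syllabus does not specify them, and pluses and minuses are omitted.

```python
def final_percentage(discussion_points, discussion_max, activity_scores):
    """Combine the two graded components using the syllabus weights.

    discussion_points: total discussion points earned (up to 5 per week)
    discussion_max:    total discussion points available
    activity_scores:   assignment/quiz scores on a 0-100 scale
                       (equal weighting is an assumption)
    """
    participation = 100.0 * discussion_points / discussion_max
    activities = sum(activity_scores) / len(activity_scores)
    return 0.20 * participation + 0.80 * activities


def letter_band(percentage):
    """Map a percentage to the A/B/C bands listed in the syllabus."""
    if percentage >= 90:
        return "A"
    if percentage >= 80:
        return "B"
    if percentage >= 70:
        return "C"
    return "below C"  # the syllabus does not define bands under 70


# Hypothetical student: 48 of 60 discussion points, three assignments.
score = final_percentage(48, 60, [90, 85, 95])
print(round(score, 1), letter_band(score))
```

In this hypothetical case the participation component is 80 and the assignment average is 90, so the weighted result is 0.20 × 80 + 0.80 × 90 = 88, which falls in the B band.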
Deadlines for Adding, Dropping, and Withdrawing from Courses
Students may add a course up to one week after the start of the term for that particular course. Students may drop courses according to the drop deadlines outlined in the EP academic calendar. Between the 6th week of the class and prior to the final withdrawal deadline, a student may withdraw from a course with a W on their academic record. A record of the course will remain on the academic record with a W appearing in the grade column to indicate that the student registered and withdrew from the course.
Academic Misconduct Policy
Students with Disabilities - Accommodations and Accessibility
Student Conduct Code
Classroom Climate
JHU is committed to creating a classroom environment that values the diversity of experiences and perspectives that all students bring. Everyone has the right to be treated with dignity and respect. Fostering an inclusive climate is important. Research and experience show that students who interact with peers who are different from themselves learn new things and experience tangible educational outcomes. At no time in this learning process should someone be singled out or treated unequally on the basis of any seen or unseen part of their identity. If you have concerns in this course about harassment, discrimination, or any unequal treatment, or if you seek accommodations or resources, please reach out to the course instructor directly. Reporting will never impact your course grade. You may also share concerns with your program chair, the Assistant Dean for Diversity and Inclusion, or the Office of Institutional Equity. In handling reports, people will protect your privacy as much as possible, but faculty and staff are required to officially report information for some cases (e.g. sexual harassment).
Course Auditing
When a student enrolls in an EP course with “audit” status, the student must reach an understanding with the instructor as to what is required to earn the “audit.” If the student does not meet those expectations, the instructor must notify the EP Registration Team (EP-Registration@exchange.johnshopkins.edu) in order for the student to be retroactively dropped or withdrawn from the course (depending on when the "audit" was requested and in accordance with EP registration deadlines). All lecture content will remain accessible to auditing students, but access to all other course material is left to the discretion of the instructor.