605.741.81 - Large-Scale Database Systems

Computer Science
Summer 2026

Description

This course investigates the theory and practice of modern large-scale database systems. Large-scale approaches include distributed relational databases; data warehouses; and non-relational databases including HDFS, Hadoop, Accumulo for query and graph algorithms, and Mahout bound to Spark for machine learning algorithms. Topics discussed include data design and architecture; database security, integrity, query processing, query optimization, transaction management, concurrency control, and fault tolerance; and query formulation, graph algorithms, and machine learning algorithms on large-scale distributed data systems. At the end of the course, students will understand the principles of several common large-scale data systems including their architectures, performance, and costs. Students will also gain a sense of which approach is recommended for different requirements and circumstances.

Instructor

David Silberberg

dsilber1@jh.edu

Course Structure

The course content is divided into modules. Modules can be accessed by clicking Course Content on the menu. A module will have several sections including the overview, content, readings, discussions, and assignments. Students are encouraged to preview all sections of the module before starting. Most modules run for a period of seven (7) days; exceptions are noted on the Course Outline page. Students should regularly check the Calendar and Announcements for assignment due dates.

Module

Dates

Topics

Assignments

Module 1

Week 1

Introduction to Distributed Database Systems and Distributed Database Architectures

Ozsu, M. and Valduriez, P., Principles of Distributed Database Systems, Springer, 2011; chaps. 1, 2 & 3

·       Learning Activity – Design Database

·       Assessment – Write up of database design, represent queries in relational algebra notation

Module 2

Week 2

Horizontal Partitioning

Principles of Dist. DB Sys.; chap. 3

·       Learning Activity – Horizontally partition a single table of a database

·       Assessment – Find the optimal horizontal partitioning of a table given a set of queries

Module 3

Week 3

Vertical Partitioning

Principles of Dist. DB Sys.; chap. 3

·       Learning Activity – Vertically partition a single table of a database

·       Assessment – Find the optimal vertical partitioning of a table given a set of queries

Module 4

&

Module 5

Week 4

Semantic Data Control

&

Distributed Query Processing

Principles of Dist. DB Sys.; chaps. 5, 6 & 7

·       Assessment – Answer questions about the effects of semantic integrity rules

·       Learning Activity – Online discussion about query trees

Module 6

Week 5

Distributed Query Optimization

Principles of Dist. DB Sys.; chap. 8

·       Learning Activity – Analyze the cost/benefit of implementing different query optimization approaches

·       Assessment – Apply cost heuristics to optimize query plans

Module 7

Week 6

Distributed Transaction Management & Concurrency Control

Principles of Dist. DB Sys.; chap. 10

·       Learning Activity – Online discussion of ACID properties, concurrency control, and/or deadlock management

·       Assessment – Demonstrate knowledge of query schedule equivalence

·       Assessment – Answer questions about various concurrency control algorithms

Module 8

Week 7

Distributed Reliability Protocols and the Data Warehouse

C. Mohan, ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM Transactions on Database Systems, Vol. 17, No. 1, March 1992, pp. 94–162 http://www.sai.msu.su/~megera/postgres/gist/papers/concurrency/p94-mohan.pdf

Inmon, William. Building the Data Warehouse. Wiley; 4th edition. 2005.

Principles of Dist. DB Sys.; chap. 12

·       Learning Activity – Online discussion of ARIES operations

·       Assessment – Work through an ARIES example

Module 9

&

Module 10

Week 8

Cloud Computing and MapReduce

Hadoop Distributed File System (HDFS) & Advanced Hadoop

Hogan, M. D., Liu, F., Sokol, A. W., Jin, T., NIST Cloud Computing Standards Roadmap, NIST Publication SP - 500-291, 2011. http://www.nist.gov/manuscript-publication-search.cfm?pub_id=909024

Liu, F., Tong, J., Mao, J., Bohn, R. B., Messina, J. V., Badger, M. L., Leaf, D. M., NIST Cloud Computing Reference Architecture, NIST Publication SP - 500-292, 2011. http://www.nist.gov/manuscript-publication-search.cfm?pub_id=909505

Sosinsky, B., Cloud Computing Bible. Wiley, 2011. ISBN: 978-0470903568

Dean, Jeffrey and Ghemawat, Sanjay (2004). "MapReduce: Simplified Data Processing on Large Clusters". http://research.google.com/archive/mapreduce.html   

Tom White, Hadoop: The Definitive Guide. O’Reilly Media, 4th edition. 2015.

Apache Hadoop. http://wiki.apache.org/hadoop/ 

·       Learning Activity – Online discussion of the characteristics of algorithms best suited for MapReduce

·       Assessment – Write MapReduce pseudocode

Module 11

Week 9

Accumulo Architecture & Programming

Chang, F., Dean, J., Ghemawat, S., Hsieh, W., Wallach, D., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. Bigtable: A distributed storage system for structured data. Proceedings of the 7th Conference on USENIX Symposium on Operating Systems Design and Implementation, Vol. 7. 2006. http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.101.9822 

Apache – Accumulo, http://incubator.apache.org/accumulo/ 

Apache – Hive, http://hive.apache.org/

Apache – Pig, http://pig.apache.org/

Cloudera – Sqoop, http://www.cloudera.com/blog/2009/06/introducing-sqoop/

Google – Cloud SQL, http://code.google.com/apis/sql/

·       Learning Activity – Online discussion about what must be done to make Accumulo ACID compliant

·       Assessment – Write pseudocode for Accumulo text analytics

Module 12

Week 10

Advanced Accumulo – Security & Data Analytics

Apache – Accumulo, http://incubator.apache.org/accumulo/ 

W3C Wiki – Large Triple Stores, http://www.w3.org/wiki/LargeTripleStores

·       Learning Activity – Discuss the advantages of graph data models

·       Assessment – Write pseudocode for searching geospatial data represented in Accumulo

Module 13

Week 11

Machine Learning – Collaborative Filtering

Apache – Mahout, http://mahout.apache.org/  

·       Learning Activity – Students will discuss the two common collaborative filtering methods and how they can be improved

·       Assessment – Write pseudocode for a collaborative filtering problem

Module 14

Week 12

Machine Learning – Clustering and Classification

Apache – Mahout, http://mahout.apache.org/   

·       Learning Activity – Online discussion of differences between collaborative filtering, classification, and clustering

·       Assessment – Write pseudocode for a clustering algorithm


Course Topics

Course Goals

To provide an understanding of the deep issues of the design and implementation of massive-scale data systems.

Course Learning Outcomes (CLOs)

Textbooks

Required

Ozsu, M. Tamer and Valduriez, Patrick. Principles of Distributed Database Systems, 3rd Edition. Springer, 2011.

ISBN-10: 1441988335

ISBN-13: 978-1441988331

Textbook information for this course is available online through the appropriate bookstore website. For online courses, search the MBS website at http://ep.jhu.edu/bookstore.

Optional

Additionally, any of the following texts or other texts that you may have from previous courses may be useful for this class if you find yourself struggling with specific skills:

Student Coursework Requirements

It is expected that each class will take approximately 4–7 hours per week to complete. Here is an approximate breakdown: reading the assigned sections of the texts (approximately 2–3 hours per week) as well as some outside reading, listening to the audio annotated slide presentations (approximately 1–2 hours per week), and online homework assignments (approximately 1–2 hours per week).

This course will consist of two basic student requirements:

Preparation and Participation (Class Discussions) (20% of Final Grade Calculation)

Each student is responsible for carefully reading all assigned material and being prepared for discussion. The majority of readings are from the weekly modules and the course text. Additional reading may be assigned to supplement text readings.

It is recommended that you post your initial response to the discussion questions by the evening of day 3 for that module week. Posting one or more responses to the discussion question is part one of your grade for class discussions.

Part two of your grade is your discussions and interactions (i.e., responding to classmate postings with thoughtful responses) with at least two classmates. Just posting your response to a discussion question is not sufficient; we want you to interact with your classmates. Be detailed in your postings and in your responses to your classmates' postings. Feel free to agree or disagree with your classmates. Please ensure that your postings are civil and constructive.

Sometimes, a student may post the entire solution to a question. In this case, I still want you to post your own solutions to the question and any unique insights that you may have. There are always nuances in each person’s solutions and I am very interested in each of your perspectives.

I do not want posts that are generated directly from LLMs. However, you may use LLMs to help inform your responses just like you may use web pages, papers, and books to help inform your responses. Nevertheless, you must present your own responses in your own words.

Some of the discussions are collaborative homework assignments. This is tricky. I don't want one person to post the full solution to a homework problem early, leaving no one else much to add. I would rather see people post parts of the solution and discuss trade-offs and observations at each step of the solution. As always, I welcome feedback early and often. Most importantly, I'll be looking for interesting or unique insights that you bring to each topic.

David Silberberg will monitor class discussions and will respond to some of them as they are posted.

Evaluation of preparation and participation is based on contribution to discussions. Preparation and participation is evaluated by the following grading elements:

  1. Interesting and insightful concepts (50%)
  2. Interaction with other students demonstrating critical thinking (50%)

You will receive up to 5 points for each week’s discussion. Preparation and participation is graded as follows:

5 points — Interesting Concepts [offers interesting insights; references to other research or products]; Critical Thinking [rich in content; full of thoughts, insight, and analysis]; excellent interaction with other students.

4 points — Interesting Concepts [offers some insights; references to other research or products]; Critical Thinking [substantial information; thought, insight, and analysis has taken place]; good interaction with other students.

3 points — Interesting Concepts [offers some insights; few references to other research or products]; Critical Thinking [generally competent; information is thin and commonplace]; some interaction with other students.

2 points — Interesting Concepts [offers few insights; no references to other research or products]; Critical Thinking [information is thin and commonplace]; little interaction with other students.

1 point — Interesting Concepts [few real insights]; Critical Thinking [information is thin and commonplace]; hardly any interaction with other students.

0 points — No postings.

Learning Activities, Homework, and Quizzes (80% of Final Grade Calculation)

Reading assignments will be important sources of material for your learning activity assignment.

Learning activity assignments will include a mix of homework problem sets. Include a cover sheet with your name and assignment identifier. Also include your name and a page number indicator (i.e., page x of y) on each page of your homework submissions. Each problem should have the problem statement, the steps required to arrive at the solution, and the solution. All Figures and Tables should be captioned and labeled appropriately.

All homework assignments and quizzes are due according to the dates in the Calendar.

Late submissions will be reduced by 10 points for each day late (no exceptions without prior coordination with the instructor).

Learning Activities are evaluated by the following grading elements:

  1. Writing quality and technical accuracy (20%) (Writing is expected to meet or exceed accepted graduate-level English and scholarship standards. That is, all homework assignments will be graded on grammar and style as well as content).
  2. Each step of the solution is addressed with rationale and answered correctly (80%).

Learning Activities are graded as follows:

100–90 = A

89–80 = B

79–70 = C

Grading Policy

Student assignments are due according to the dates in the Calendar. The instructor will post grades one week after assignment due dates.

I generally do not directly grade spelling and grammar. However, egregious violations of the rules of the English language will be noted without comment. Consistently poor performance in either spelling or grammar is taken as an indication of poor written communication ability that may detract from your grade.

Grades of A+, A, and A- indicate achievement of consistent excellence and distinction throughout the course; that is, conspicuous excellence in all aspects of assignments and discussion in every week.

Grades of B+, B, and B- indicate work that meets all course requirements on a level appropriate for graduate academic work. These criteria apply to both undergraduate and graduate students taking the course.

100–90 = A (with pluses and minuses)

89–80 = B (with pluses and minuses)

79–70 = C (with pluses)

Final grades will be determined by the following weighting:

Item

% of Grade

Preparation and Participation (Class Discussions)

20%

Learning Activities, Homework, Quizzes

80%
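
The weighting above and the late-submission reduction can be sketched as a short calculation. This is only an illustration, not the instructor's grading procedure: the 20%/80% weights, the 0 to 5 weekly discussion scale, and the 10-points-per-day reduction come from this syllabus, while the function names, the use of simple unweighted averages within each component, and the floor at zero for late work are assumptions made for the example.

```python
from typing import Sequence

# Values stated in the syllabus.
DISCUSSION_WEIGHT = 0.20    # Preparation and Participation (Class Discussions)
ACTIVITY_WEIGHT = 0.80      # Learning Activities, Homework, Quizzes
LATE_PENALTY_PER_DAY = 10   # points deducted per day late
MAX_DISCUSSION_POINTS = 5   # maximum points per weekly discussion


def late_adjusted(score: float, days_late: int) -> float:
    """Reduce a 0-100 score by 10 points per day late.

    Flooring the result at zero is an assumption, not a stated policy.
    """
    return max(0.0, score - LATE_PENALTY_PER_DAY * days_late)


def final_percentage(
    discussion_points: Sequence[int],
    activity_scores: Sequence[float],
) -> float:
    """Combine weekly discussion points and 0-100 activity scores.

    Assumes each component is an unweighted average of its items,
    which the syllabus does not specify.
    """
    discussion_pct = (
        100.0 * sum(discussion_points)
        / (MAX_DISCUSSION_POINTS * len(discussion_points))
    )
    activity_pct = sum(activity_scores) / len(activity_scores)
    return DISCUSSION_WEIGHT * discussion_pct + ACTIVITY_WEIGHT * activity_pct


if __name__ == "__main__":
    # Hypothetical student: four discussion weeks, two assignments,
    # the second submitted one day late.
    scores = [90.0, late_adjusted(95.0, days_late=1)]
    print(final_percentage([5, 4, 5, 4], scores))
```

In this hypothetical case the discussion component averages 90%, the two assignments count as 90 and 85, and the weighted result falls in the B range of the scale above.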

Academic Policies

Deadlines for Adding, Dropping, and Withdrawing from Courses

Students may add a course up to one week after the start of the term for that particular course. Students may drop courses according to the drop deadlines outlined in the EP academic calendar. From the 6th week of the class until the final withdrawal deadline, a student may withdraw from a course with a W on their academic record. A record of the course will remain on the academic record with a W appearing in the grade column to indicate that the student registered and withdrew from the course.

Academic Misconduct Policy

All students are required to read, know, and comply with the Johns Hopkins University Krieger School of Arts and Sciences (KSAS) / Whiting School of Engineering (WSE) Procedures for Handling Allegations of Misconduct by Full-Time and Part-Time Graduate Students. This policy prohibits academic misconduct, including but not limited to the following: cheating or facilitating cheating; plagiarism; reuse of assignments; unauthorized collaboration; alteration of graded assignments; and unfair competition. Course materials (old assignments, texts, or examinations, etc.) should not be shared unless authorized by the course instructor. Any questions related to this policy should be directed to EP’s academic integrity officer at ep-academic-integrity@jhu.edu.

Students with Disabilities - Accommodations and Accessibility

Johns Hopkins University values diversity and inclusion. We are committed to providing welcoming, equitable, and accessible educational experiences for all students. Our courses are designed with a proactive approach to accessibility to minimize the need for disability disclosure and accommodation requests, but we recognize that you may need additional support. Students with disabilities (including those with psychological conditions, medical conditions, and temporary disabilities) can request accommodations for this course by providing an Accommodation Letter issued by Student Disability Services (SDS). Please request accommodations for this course as early as possible to provide time for effective communication and arrangements. For further information or to start the process of requesting accommodations, please contact EP Student Disability Services at ep-disability-svcs@jhu.edu.

Student Conduct Code

The fundamental purpose of the JHU regulation of student conduct is to promote and to protect the health, safety, welfare, property, and rights of all members of the University community as well as to promote the orderly operation of the University and to safeguard its property and facilities. As members of the University community, students accept certain responsibilities which support the educational mission and create an environment in which all students are afforded the same opportunity to succeed academically. For a full description of the code please visit the Student Conduct Code website.

Classroom Climate

JHU is committed to creating a classroom environment that values the diversity of experiences and perspectives that all students bring. Everyone has the right to be treated with dignity and respect. Fostering an inclusive climate is important. Research and experience show that students who interact with peers who are different from themselves learn new things and experience tangible educational outcomes. At no time in this learning process should someone be singled out or treated unequally on the basis of any seen or unseen part of their identity. If you have concerns in this course about harassment, discrimination, or any unequal treatment, or if you seek accommodations or resources, please reach out to the course instructor directly. Reporting will never impact your course grade. You may also share concerns with your program chair, the Assistant Dean for Diversity and Inclusion, or the Office of Institutional Equity. In handling reports, people will protect your privacy as much as possible, but faculty and staff are required to officially report information for some cases (e.g. sexual harassment).

Course Auditing

When a student enrolls in an EP course with “audit” status, the student must reach an understanding with the instructor as to what is required to earn the “audit.” If the student does not meet those expectations, the instructor must notify the EP Registration Team (EP-Registration@exchange.johnshopkins.edu) in order for the student to be retroactively dropped or withdrawn from the course (depending on when the "audit" was requested and in accordance with EP registration deadlines). All lecture content will remain accessible to auditing students, but access to all other course material is left to the discretion of the instructor.