605.788.81 - Big Data Processing Using Hadoop

Computer Science
Fall 2024

Description

Organizations today generate massive amounts of data that are too large and too unwieldy to fit in relational databases. As a result, organizations and enterprises are turning to massively parallel computing solutions such as Hadoop. The Apache Hadoop platform, with the Hadoop Distributed File System (HDFS) and the MapReduce (M/R) framework at its core, allows for distributed processing of large data sets across clusters of computers using the map and reduce programming model. It is designed to scale from a single server to thousands of machines, each offering local computation and storage. The Hadoop ecosystem is sizable and includes many subprojects such as Hive and Pig for big data analytics, HBase for real-time access to big data, ZooKeeper for distributed coordination, and Oozie for workflow. This course breaks down the complexity of distributed processing of big data by providing a practical approach to developing applications on top of the Hadoop platform. By completing this course, students will gain an in-depth understanding of how MapReduce and distributed file systems work. In addition, they will be able to author Hadoop-based MapReduce applications in Java and leverage Hadoop subprojects to build powerful data processing applications.

Course Note(s): This course may be counted toward a three-course track in Data Science and Cloud Computing.
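
To make the map and reduce programming model mentioned above concrete, here is a minimal sketch of a word count written in plain Java. It does not use the Hadoop API and runs on a single machine; the class name WordCountSketch and its method names are hypothetical and chosen only for illustration. On a real cluster, the framework would run the map and reduce steps in parallel across machines and perform the grouping ("shuffle") step itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative single-machine sketch of the map/shuffle/reduce flow.
// This is NOT Hadoop code; all names here are hypothetical.
public class WordCountSketch {

    // Map phase: turn one line of input into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group every value emitted for the same key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values for one key into a single count.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: run map over every line, then shuffle, then reduce per key.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(emitted).forEach((word, counts) -> result.put(word, reduce(word, counts)));
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data big clusters", "data on clusters");
        System.out.println(wordCount(lines)); // prints {big=2, clusters=2, data=2, on=1}
    }
}
```

Because each call to map depends only on its own input line and each call to reduce depends only on one key's values, both phases can be spread across many machines, which is the property the Hadoop platform relies on.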

Instructor

Karthik Shyamsunder

karthik.shyamsunder@gmail.com

Course Structure

The course materials are divided into modules, which can be accessed by clicking Course Modules on the course menu. A module will have several sections, including the overview, content, readings, discussions, and assignments. You are encouraged to preview all sections of the module before starting. Most modules run for a period of seven (7) days; exceptions are noted in the Course Outline. You should regularly check the Announcements for assignment due dates.

Course Topics

  1. Big Data Revolution
  2. Hadoop Architecture and Ecosystem
  3. Setting up Hadoop
  4. Hadoop Distributed File System (HDFS) Architecture
  5. Hadoop Distributed File System (HDFS) Programming Basics
  6. Hadoop Distributed File System (HDFS) Programming Advanced
  7. YARN and MapReduce Architecture
  8. MapReduce Programming Basics
  9. MapReduce Programming Intermediate
  10. MapReduce Programming Advanced
  11. Data Analysis using Hive
  12. Data Analysis using Pig
  13. Hadoop NoSQL Database HBase
  14. Spark
  15. Miscellaneous Hadoop Topics

Course Goals

By the end of this course, you will be able to: 

Textbooks

Textbook (Optional)
Title: Hadoop: The Definitive Guide
Author: Tom White
Edition: Fourth Edition

Required Software

Oracle VM VirtualBox
You will need to install Oracle VM VirtualBox. This is open source software. Information on how to install and configure it will be provided in module 2.

CentOS Linux
You will need to install CentOS Linux as your own VM on VirtualBox. This is open source software. Information on how to install and configure it will be provided in module 2. Don’t start now; follow the class modules.

Apache Hadoop
You will need to install Apache Hadoop inside your CentOS Linux VM. Hadoop is open source software. Information on how to install and configure it will be provided in module 13. Don’t start now; follow the class modules.

Apache Hive
You will need to install Apache Hive inside your CentOS Linux VM. Hive is open source software. Information on how to install and configure it will be provided in module 10. Don’t start now; follow the class modules.

Apache Pig
You will need to install Apache Pig inside your CentOS Linux VM. Pig is open source software. Information on how to install and configure it will be provided in module 11. Don’t start now; follow the class modules.

Apache HBase
You will need to install Apache HBase inside your CentOS Linux VM. HBase is open source software. Information on how to install and configure it will be provided in module 12. Don’t start now; follow the class modules.

Apache Spark
You will need to install Apache Spark inside your CentOS Linux VM. Spark is open source software. Information on how to install and configure it will be provided in module 13. Don’t start now; follow the class modules.

IDE (NetBeans or Eclipse)
You will need to install NetBeans or Eclipse inside your CentOS VM. Both are standard open source IDEs.

Student Coursework Requirements

It is expected that the class will take approximately 6-7 hours per week: watching the videos and doing some outside reading (approximately 2-3 hours per week), preparing for and taking the module quiz (approximately 1-2 hours per week), and developing and deploying assignments (approximately 2-3 hours per week).

This course will consist of four basic student requirements:

Requirement 1: Participation (Class Discussions) (11% of Final Grade Calculation)


In the Discussions area of the course, you, as a student, can interact with your instructor and classmates to explore questions and comments related to the content of this course. Discussions will always close at 11:59 P.M. on Tuesday of that week.

A successful student in online education is one who takes an active role in the learning process. You are therefore encouraged to participate in the discussion areas to enhance your learning experience throughout each week.

The discussions will be graded for:

High: Your contributions to each Topic indicate your mastery of the materials assigned. Your responses might integrate multiple views and/or show value as a seed for reflection for other participants' responses to the thread. You provide evidence that you are reading the assigned materials and other student postings and are responding accordingly, bringing out interesting interpretations. You know the facts and are able to analyze them and handle conceptual ideas.
Medium: Your responses build on the ideas of another participant (or more) and dig deeper into assignment questions or issues. When you make intelligent posts during the week, including some good critique of the course material, you demonstrate that you understand the material, are reading the posts of your colleagues, and are contributing to the class. Your posts demonstrate confidence with the materials, but may be just a bit off target in one area or another.
Low: You have meaningful interaction with other participants' postings. Posts that state "I agree" or "I disagree" include an explanation of what is agreed or disagreed upon and why, or introduce an argument that adds to the discussion. However, you may have rambling, lengthy posts that show no sign of having been re-read and refined before posting, and your writing suffers from a lack of clarity and comprehension.
Unsatisfactory: You will receive little credit in the week's discussion by just showing up and making trivial comments, without adding any new thought to the discussion. At the low end of the spectrum, no participation gets a "0." If you are not in the discussion, you do not earn any points.


Requirement 2: Assignments (50% of Final Grade Calculation)

There will be 10 programming assignments during the term of 14 weeks. The assignment details will be listed in the assignment section of the respective modules. All assignments are due according to the dates in the Course Outline. Late submissions will be reduced by one letter grade for each week late (no exceptions without prior coordination with the instructor).

Requirement 3: Quizzes (24% of Final Grade Calculation)

There will be 8 quizzes during the term of 14 weeks. The quizzes may be a combination of true/false, multiple-choice, fill-in-the-blank, and similar questions. Check the Course Outline to see the due dates for these quizzes.


Requirement 4: Class Project (15% of Final Grade Calculation)

A class project will be assigned in the fourth module and more details will be given in that module.

Grading Policy

EP uses a +/- grading system, but this class uses only the letter grades A, B, C, D, and F.

Score Range = Letter Grade
90-100 = A
80-89 = B
70-79 = C
60-69 = D
<60 = F
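
As a worked illustration of how the four requirement weights (11%, 50%, 24%, and 15%) combine with the scale above, the following is a small sketch in Java. The class name GradeSketch, its method names, and the sample component scores are hypothetical. The scale does not say how fractional scores between bands (for example, 89.2) are treated; this sketch assumes they fall into the lower band, which may not match how the instructor rounds.

```java
// Illustrative sketch only; names and sample scores are hypothetical.
public class GradeSketch {

    // Weighted final score from four component scores, each on a 0-100 scale.
    // Weights come from the four requirements: 11 + 50 + 24 + 15 = 100.
    static double finalScore(double participation, double assignments,
                             double quizzes, double project) {
        return (11 * participation + 50 * assignments
                + 24 * quizzes + 15 * project) / 100.0;
    }

    // Letter grade from the scale; ASSUMES no rounding up between bands.
    static char letter(double score) {
        if (score >= 90) return 'A';
        if (score >= 80) return 'B';
        if (score >= 70) return 'C';
        if (score >= 60) return 'D';
        return 'F';
    }

    public static void main(String[] args) {
        // (11*80 + 50*90 + 24*85 + 15*100) / 100 = 8920 / 100 = 89.2
        double score = finalScore(80, 90, 85, 100);
        System.out.println(score + " -> " + letter(score)); // prints 89.2 -> B
    }
}
```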

Course Policies

All assignments are due within one week. Late assignments will lose 10% per week.

Students are expected to submit the following to receive a grade for the course:

A deliverable that is not submitted will receive a grade of 0 for that activity, whether it is a quiz, discussion, assignment, or project.

Academic Policies

Deadlines for Adding, Dropping and Withdrawing from Courses

Students may add a course up to one week after the start of the term for that particular course. Students may drop courses according to the drop deadlines outlined in the EP academic calendar (https://ep.jhu.edu/student-services/academic-calendar/). Between the 6th week of the class and prior to the final withdrawal deadline, a student may withdraw from a course with a W on their academic record. A record of the course will remain on the academic record with a W appearing in the grade column to indicate that the student registered and withdrew from the course.

Academic Misconduct Policy

All students are required to read, know, and comply with the Johns Hopkins University Krieger School of Arts and Sciences (KSAS) / Whiting School of Engineering (WSE) Procedures for Handling Allegations of Misconduct by Full-Time and Part-Time Graduate Students.

This policy prohibits academic misconduct, including but not limited to the following: cheating or facilitating cheating; plagiarism; reuse of assignments; unauthorized collaboration; alteration of graded assignments; and unfair competition. Course materials (old assignments, texts, or examinations, etc.) should not be shared unless authorized by the course instructor. Any questions related to this policy should be directed to EP’s academic integrity officer at ep-academic-integrity@jhu.edu.

Students with Disabilities - Accommodations and Accessibility

Johns Hopkins University values diversity and inclusion. We are committed to providing welcoming, equitable, and accessible educational experiences for all students. Students with disabilities (including those with psychological conditions, medical conditions and temporary disabilities) can request accommodations for this course by providing an Accommodation Letter issued by Student Disability Services (SDS). Please request accommodations for this course as early as possible to provide time for effective communication and arrangements.

For further information or to start the process of requesting accommodations, please contact Student Disability Services at Engineering for Professionals, ep-disability-svcs@jhu.edu.

Student Conduct Code

The fundamental purpose of the JHU regulation of student conduct is to promote and to protect the health, safety, welfare, property, and rights of all members of the University community as well as to promote the orderly operation of the University and to safeguard its property and facilities. As members of the University community, students accept certain responsibilities which support the educational mission and create an environment in which all students are afforded the same opportunity to succeed academically. 

For a full description of the code please visit the following website: https://studentaffairs.jhu.edu/policies-guidelines/student-code/

Classroom Climate

JHU is committed to creating a classroom environment that values the diversity of experiences and perspectives that all students bring. Everyone has the right to be treated with dignity and respect. Fostering an inclusive climate is important. Research and experience show that students who interact with peers who are different from themselves learn new things and experience tangible educational outcomes. At no time in this learning process should someone be singled out or treated unequally on the basis of any seen or unseen part of their identity. 
 
If you have concerns in this course about harassment, discrimination, or any unequal treatment, or if you seek accommodations or resources, please reach out to the course instructor directly. Reporting will never impact your course grade. You may also share concerns with your program chair, the Assistant Dean for Diversity and Inclusion, or the Office of Institutional Equity. In handling reports, people will protect your privacy as much as possible, but faculty and staff are required to officially report information for some cases (e.g. sexual harassment).

Course Auditing

When a student enrolls in an EP course with “audit” status, the student must reach an understanding with the instructor as to what is required to earn the “audit.” If the student does not meet those expectations, the instructor must notify the EP Registration Team [EP-Registration@exchange.johnshopkins.edu] in order for the student to be retroactively dropped or withdrawn from the course (depending on when the "audit" was requested and in accordance with EP registration deadlines). All lecture content will remain accessible to auditing students, but access to all other course material is left to the discretion of the instructor.