Instructor Information

Perry Wilson

Cell Phone: 240-463-9126

Course Information

Course Description

Organizations today are generating massive amounts of data that are too large and too unwieldy to fit in relational databases. Therefore, organizations and enterprises are turning to massively parallel computing solutions such as Hadoop for help. The Apache Hadoop platform, with the Hadoop Distributed File System (HDFS) and the MapReduce (M/R) framework at its core, allows for distributed processing of large data sets across clusters of computers using the map and reduce programming model. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. The Hadoop ecosystem is sizable and includes many subprojects such as Hive and Pig for big data analytics, HBase for real-time access to big data, ZooKeeper for distributed coordination, and Oozie for workflow. This course breaks down the walls of complexity of distributed processing of big data by providing a practical approach to developing applications on top of the Hadoop platform. By completing this course, students will gain an in-depth understanding of how MapReduce and distributed file systems work. In addition, they will be able to author Hadoop-based MapReduce applications in Java and leverage Hadoop subprojects to build powerful data processing applications.

Course Note(s): This course may be counted toward a three-course track in Data Science and Cloud Computing.

Prerequisites

EN.605.202 Data Structures; EN.605.681 Principles of Enterprise Web Development or equivalent Java experience.

Course Goal

The main goal of the course is to give students the opportunity to explore a number of Hadoop-based tools for storage, retrieval, and analysis of large quantities of data in a variety of formats. The course will include lectures that stress practical examples, and students will be required to complete programming assignments and give an in-class presentation on technologies not included in the lecture material.

Course Objectives

  • Understand Cloud and Big Data architectures, storage, and processing techniques; apply the newly acquired skills by developing Java applications on top of the Hadoop ecosystem
  • Understand techniques for storing and processing large amounts of structured and unstructured data
  • Become familiar with the Hadoop ecosystem (HDFS, MapReduce, YARN, HBase, etc.)
  • Implement and deploy Java software projects/assignments to the Semi Distributed Installation of the Hadoop ecosystem

When This Course is Typically Offered

The course is usually offered at APL on Thursdays during the Spring semester.

Syllabus

  • Big Data Overview
  • Introduction to Hadoop
  • Configuring a Hadoop Development Environment
  • HDFS Architecture
  • HDFS Programming Fundamentals
  • MapReduce Architecture
  • MapReduce Programming Basics
  • MapReduce Programming Intermediate
  • MapReduce Programming Advanced
  • Hive
  • Pig
  • NoSQL
  • HBase

Student Assessment Criteria

Class Preparation and Participation 10%
Programming Assignments 80%
In-Class Presentation 10%

A grade of A indicates achievement of consistent excellence and distinction throughout the course—that is, conspicuous excellence in all aspects of assignments and discussion in every week.

A grade of B indicates work that meets all course requirements on a level appropriate for graduate academic work. These criteria apply to both undergraduate and graduate students taking the course.

100-90 = A
89-80 = B
79-70 = C
<70 = F

Computer and Technical Requirements

605.481 - Principles of Enterprise Web Development or equivalent Java experience. In addition to the required classwork, the course has the following requirements:
  • Strong (and recent) Java programming skills/experience
  • Ability to spend 8-15 hours a week outside of class
  • Linux/Unix experience
  • Some scripting experience
You will also need administrative access to a computer suitable for the development environment:
  • Hadoop Semi Distributed Installation; a Linux-based setup is encouraged
  • Lectures will outline a Virtual Machine (VM) based setup using VirtualBox; the guest Operating System (OS) is Ubuntu, with 3 GB or more of RAM allocated to the VM
  • Because the Semi Distributed Hadoop Installation runs many daemons, the host OS requires at least 2 CPUs and 4 GB of RAM

Participation Expectations

The class format will include lectures, class discussions, and live demonstrations of the tools and programming techniques. 

There will be some reading assignments to support class discussion. These will be given ahead of time.

Programming assignments are expected to be turned in on the website by the time indicated in the assignment tool; an assignment will be considered late if it is received after that time. Special circumstances (e.g., temporary lack of internet access) can be cheerfully accommodated if the student informs us in advance. Homework that is unjustifiably late will have its grade reduced for lateness.

Students are expected to participate in or submit the following to receive a grade for the course:

  • Preparation and Participation (Class Discussions)
  • Homework
  • In-class Presentation

We generally do not directly grade spelling and grammar. However, egregious violations of the rules of the English language will be noted without comment. Consistently poor performance in either spelling or grammar is taken as an indication of poor written communication ability that may detract from your grade.

Textbooks

Textbook information for this course is available online through the MBS Direct Virtual Bookstore.

Course Notes

There are no notes for this course.

Term Specific Course Website

http://blackboard.jhu.edu

(Last Modified: 03/02/2017 10:30:47 AM)