685.648.81 - Data Science

Data Science
Spring 2024

Description

This course will cover the core concepts and skills in the interdisciplinary field of data science. These include problem identification and communication, probability, statistical inference, visualization, extract/transform/load (ETL), exploratory data analysis (EDA), linear and logistic regression, model evaluation and various machine learning algorithms such as random forests, k-means clustering, and association rules. The course recognizes that although data science uses machine learning techniques, it is not synonymous with machine learning. The course emphasizes an understanding of both data (through the use of systems theory, probability, and simulation) and algorithms (through the use of synthetic and real data sets). The guiding principles throughout are communication and reproducibility. The course is geared towards giving students direct experience in solving the programming and analytical challenges associated with data science. The assignments weight conceptual (assessments) and practical (labs, problem sets) understanding equally. Prerequisite(s): A working knowledge of Python scripting and SQL is assumed as all assignments are completed in Python.

Instructors

Stephyn Butcher

steve.butcher@jhu.edu

Andrew Stewart

andrew.stewart@jhu.edu

Course Structure

The course materials are divided into modules which can be accessed by clicking Modules on the menu in Canvas. A module will have several sections including the overview, content, readings, discussions, and assignments. You are encouraged to preview all sections of the module before starting. All modules run for a period of seven (7) days, Wednesday to Tuesday.

You should check Canvas Inbox and Teams/Slack every day. Additionally, make sure that the email on file with the Registrar and associated with your student account is either an email you check every day or is forwarded to one you do. (My understanding is that you should use your JHU email and have it forwarded if you don’t check it every day. See http://my.jhu.edu/.) In any emergency, we may try to reach you with regular email. Make sure you'll receive it.

Online courses are much more challenging than in-person courses. We both know this from personal experience. Additionally, text communication lacks nuance, so the opportunities for misunderstanding or misinterpretation are increased. Try to be on your best behavior, and take an extra minute to think “is this really the way I want to phrase this?” before posting. Be magnanimous in your interpretations of others' intentions. Or, as they say, never attribute to malice what can be attributed to an honest mistake.

We are a class of students and colleagues interested in learning to do data science, and we should try to help each other the best we can (although see the specific restrictions below). In that vein, use of unofficial channels outside the official Canvas and Teams channels will be considered a violation of the EP JHU Academic Misconduct Policy.

We send an Announcement at the start of a Module. You should also check the Calendar for due dates. There is also important information in the Course Outline.

Course Topics

Course Goals

Data Science is a multi-disciplinary field in constant flux. The goal of this course is to introduce you to the foundational topics and skills of a Data Scientist. If you want to know about TensorFlow, watch YouTube. At the end of the course, you should be able to determine if a problem is appropriate to Data Science, produce a plan for solving the problem, and execute on that plan following the Data Science Process.

Course Learning Outcomes (CLOs)

Textbooks

There is no required textbook for this course because no such textbook exists. There are quite a few “popular” books on Data Science published by O’Reilly but none are suitable for a graduate level course in Data Science.

Instead, there are course notes in the form of a book entitled Fundamentals of Data Science, or Fundamentals for short. The book is in a constant state of editing and updating and is always more current than the recorded lectures. You should focus on the readings.

The industry is becoming increasingly aware that Data Scientists should write code that follows best practices. To that end, you should consider purchasing Clean Code, by Robert Martin. Some of the topics are not directly related to Data Science coding and some of the practices are only now making their way into data science, but, in every case, you would do well to learn the lessons contained therein.

Required Software

You can find the software instructions here:

https://gist.github.com/actsasgeek/954c73d28503eb67f01d12a12b1e1181

Student Coursework Requirements

The general rule for undergraduate courses is 3 hours per credit per week outside of class. At 3 hours “in class” and 3 credits, that’s 3 * 3 + 3 = 12 hours per week for undergraduate content. STEM and graduate courses are more difficult so you can expect to spend more time than that. Online classes are inherently more challenging. You’ll need to make allowances for that, as well.
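The rule of thumb above can be written out as a small calculation. This is only a sketch of the arithmetic; the function name and parameters are ours, for illustration, and the result is the undergraduate baseline, not an estimate for this course.

```python
def weekly_hours(credits: int, in_class_hours: float,
                 outside_hours_per_credit: float = 3.0) -> float:
    """Baseline weekly workload: outside-of-class hours plus "in class" hours."""
    return credits * outside_hours_per_credit + in_class_hours


# 3 credits at 3 hours per credit, plus 3 hours "in class"
print(weekly_hours(credits=3, in_class_hours=3))  # 12.0
```

Treat 12 hours as a floor for planning purposes, since graduate, STEM, and online courses all tend to push the number up.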

The suggested course load at JHU EP is one course per semester. Additionally, this is a graduate course, not an undergraduate course, and therefore our expectations are higher. Remember that full-time students in graduate programs at Homewood take 2-3 courses per semester. If you are taking that many as a part-time student, with a full-time job, that is your responsibility.

We have found that those who spend more than 20 hours per week have various deficiencies and are not adequately prepared for the class. You must make sure you have completed the respective Core Foundation courses and that you can program well in Python.

There is the additional challenge that completing assignments that involve both data and programming is fairly non-deterministic. One mistake can set you back an hour or more. Work more slowly and methodically to finish faster.

This course is highly structured and there are different kinds of assignments due every week or every two weeks, all with different learning goals.

Lectures and Readings

Each week starts with the lectures and readings. The goal of the lectures and readings is to introduce the topics and concepts of the module and give you examples of their application using various frameworks, guidelines, and processes. Pay special attention to frameworks (you will need to be able to apply them) and examples (they are there as references). The Lectures are recorded. The readings are in the corresponding chapters of Fundamentals, both the PDF and the corresponding notebooks. The PDF/notebooks are always more up-to-date than the Lectures because the text Fundamentals is under constant revision.

Assessments

Each Module will include an Assessment. The purpose of Assessments is to gauge your conceptual understanding of the topics covered. "Quizzing" also helps you retain information better. These Assessments are cumulative. They will contain 10 questions from the current Module and 5 questions from any of the previous Modules (15 total). The Assessment is time limited to 30 minutes. It will automatically submit after that time. You have only one attempt. You will be shown one question at a time with no backtracking.

The Assessment is due on Tuesday (you can always do it earlier). Assessments are individual effort. They are not group projects (yes, we have to say this).

Exceptions...Fall/Spring: the first Assessment has 10 questions and each is worth 1.5 points. Summer: There are two assessments during the combined Module 5 and Module 13 week.

Suggestions...Fundamentals contains review questions at the end of each chapter. Answer them.

Labs & Lab Groups ("Group Discussion")

Here's the basic "weekly" schedule...

* Discussions (on Teams/Slack) should start as students work through the material, beginning on Wednesday at the start of the new Module week.
* Labs are due by Sunday (to Canvas). Submit an HTML version of your Notebook to Canvas.
* Lab solutions are released Monday & Labs are Peer Graded by Tuesday.
* Lab Evaluations are reviewed by all on Wednesday.

Each week there is a Lab that asks you to apply the methods and concepts covered in the Module. The goal of the Lab is to demonstrate the ability to do data science, or at least that part of data science covered in the Module. The Lab is to be individual effort. You should actually review the Lab (and the Review questions in Fundamentals!) before you start the lectures and readings for the week so that you get a sense of where the topic is going.

Discussion. You will undoubtedly have questions about the concepts and the Lab. At the start of the semester, you will be assigned to a Lab Group of about 4-6 other students. If you have any questions about the Module materials, readings, concepts, and especially the Lab, you should ask questions in your Lab Group. You should not post a question about the Module content that you did not already ask in your Lab Group.

If other members of your Lab Group have questions, you should endeavor to answer them. Class participation depends on both asking and answering questions. However, you should not give away the answers to any assignment. Do not just paste code. You should understand what you’re doing and why; you should always understand what the code is doing.

The pre-submission discussion will take place on Teams/Slack. We hope that this will lead to better and more immediate discussions. It is graded Complete/Incomplete without comments. We will sample them at random to check for “good faith efforts”. You should submit an HTML version of your Lab to the Grade Center in Canvas. Except as described above, Labs are individual effort. They are not group projects.

First, remember the Lab is only pass/fail for effort. It is the only assignment where effort counts. We only check that you tried to do the work. Second, research suggests that we learn better when we try and fail, rather than just being given the answer. This, again, is why the Lab is only Complete/Incomplete.

On Monday, you should post the HTML version of your Lab to your group chat. We will then do an online version of "pass your paper to the right". Everyone will have a "grader" from their group who evaluates their Lab. The goal is actually not to give a grade but to provide feedback based on the Lab Solution that will be released on Monday. The Evaluation is due by Tuesday. Evaluations count as part of your own Lab score so if you complete a Lab and fail to complete an Evaluation, you get a 0 for the assignment.

The Lab is due to Canvas on Sunday, Teams/Slack on Monday, and your Peer Evaluation to Teams/Slack on Tuesday.

On Wednesday, you should review all the Evaluations to see if there's anything you may have missed in your own assignment or your 

The quality of the course and learning experience depends on the quality of the group interactions. The Group Discussion grade depends on...

1. Asking and answering questions in a skillful way.
2. Posting your Lab on time.
3. Posting a substantive evaluation of your peer on time.
4. Reviewing all group evaluations on time.

Problem Sets

The course builds from lectures and readings, through Labs and discussion and Assessments on concepts, and culminates in Problem Sets. The cadence is two weeks of Labs followed by a Problem Set. This means that Problem Set 1 covers Labs 1 and 2. Problem Set 2 covers Labs 3 and 4. Problem Set 3 covers Labs 5 and 6, etc. These are the real exams of the course. They are tests of your ability to do data science.

You may think of PS 1 + 2 + 3 as the midterm and PS 4 + 5 + 6 as the final, except you don't do them all at once (which diversifies risk).

You should always refer to the corresponding Module Labs, Lab Solutions, and Examples in Fundamentals when completing a Problem Set. As with exams, we give you less direction. This is why the Labs are so important. If, on a Problem Set, we give you a data set and say, “Do EDA”, you should be able to do it according to the framework presented in the course. Problem Sets are exams and should be treated as such. They are individual effort. They are not group projects.

Except...

Summer: instead of a Problem Set 6, Problem Set 5 is divided into two parts, Problem Set 5.1 and Problem Set 5.2.

Grading Policy

Assignments are due according to the dates posted on Canvas or Course Outline. You may check these due dates in the Course Calendar or the Assignments in the corresponding modules. We will try to post grades one week after assignment due dates but no later than two weeks.

It’s worth noting that the due date is not the do date. You can and should start much earlier in the week on your assignments.

Grading Standards


“A” – Excellent. You completed the assignment in a timely manner, demonstrating a thorough understanding of the technique, tool or concept and conducted an excellent exploration of its use. If it is a discussion, your post was substantive, did not just quote other materials, and contributed to the on-going discussion. You went above what was required, asked for or expected. Over the course of the semester, this means consistent excellence and distinction throughout the course—that is, conspicuous excellence in all aspects of assignments and discussion in every week.

“B” – Satisfactory. You completed the assignment in a timely manner, you did exactly what was requested, demonstrating a sufficient understanding of the technique, tool, or concept. There may have been minor deficiencies. If it was a discussion post, the post contributed to the discussion but it may have been a reference to other materials or perhaps even slightly off topic. You may have done too much in the hopes that something was correct. Verbosity is often a sign of some confusion. Over the course of the semester, this means work that meets all course requirements on a level appropriate for graduate academic work.

“C”, “D” – Unsatisfactory. You either did not complete the assignment, it was not timely, or you did only what was minimally required. There are significant areas of confusion. There is a lack of exploration or curiosity about the concept, tool, or technique. If it was a discussion post, it may have been off topic. Listing many things, hoping that one is correct, is often a sign of confusion.

“F” – Oops. You did not submit the assignment or post on time, or no bona fide effort was evident.

We cannot stress this enough: merely working hard is not grounds for an A. You have to do the right thing in the right way. Writing too much is often a sign of confusion.

We generally do not directly grade spelling and grammar. However, egregious violations of the rules of the English language will be noted without comment. Consistently poor performance in either spelling or grammar is taken as an indication of poor written communication ability that may detract from your grade.

However, communication is very important in Data Science so we will tend to be a bit more picky about formatting, grammar and spelling. If your submissions look like a ransom note, however correct they might otherwise be, they will be counted as wrong.

Grading System

We use a threshold grading system in this course. You must have sufficient mastery over the topics to get a B, or an A. As a result, this class does not use the traditional 100 point scale and we do not weight things. There is no point gaming, percents, or averages. Instead we use just Pass/Fail or A-F for all assignments. The final grade is based on the majority of your grades in each of the different categories of work.

Here are the categories of work:

Labs               14 (Summer: 12)
Assessments        14 (Summer: 13)
Problem Sets       6
Group Discussions  14 (Summer: 12)

A few assignments are binary (pass/fail). We merely note if you turned them in or if they had an acceptable level of effort (an incomplete Lab might be a 0, for example). Final Grades are based on meeting minimum grade thresholds on all assignments.

For an A, you must at least achieve:

Labs*              14 of 14 submitted with Complete (12 of 12 in the Summer)
Assessments*       11 of 14 submitted with an "A" (10 of 13 in Summer)
Problem Sets       4 of 6 with an "A" (the remainder must be "B" or better)
Lab Discussion*    14 of 14 with "A" (12 of 12 in the Summer)

For a B, you did not meet the requirements for an A but reached the following:

Labs               14 of 14 submitted with Complete.
Assessments        11 of 14 submitted with a "B" or better (4.0 - 5.0) with only one 0 (F)
Problem Sets       4 of 6 with a "B" or better
Lab Discussion     14 of 14 with "B"
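To make the threshold idea concrete, here is a minimal sketch of how the A and B requirements above could be checked for a Fall/Spring section. It is not the official grade calculation. The function names are hypothetical, and it rests on three assumptions of ours: "with only one 0 (F)" is read as "at most one F", a Lab Discussion "B" is read as "B or better", and Summer counts, revisions, and any discretionary adjustments are ignored.

```python
GRADE_ORDER = "FDCBA"  # worst to best


def at_least(grade: str, floor: str) -> bool:
    """True if `grade` is the same as or better than `floor`."""
    return GRADE_ORDER.index(grade) >= GRADE_ORDER.index(floor)


def count_at_least(grades: list[str], floor: str) -> int:
    """Number of grades that are `floor` or better."""
    return sum(at_least(g, floor) for g in grades)


def final_grade(labs_complete, assessments, problem_sets, discussions):
    """labs_complete: list of bools; the rest: lists of letter grades."""
    # Both A and B require all 14 Labs to be Complete.
    if not (len(labs_complete) == 14 and all(labs_complete)):
        return "C or lower"

    meets_a = (
        count_at_least(assessments, "A") >= 11
        and count_at_least(problem_sets, "A") >= 4
        and count_at_least(problem_sets, "B") == 6  # remainder "B" or better
        and count_at_least(discussions, "A") >= 14
    )
    if meets_a:
        return "A"

    meets_b = (
        count_at_least(assessments, "B") >= 11
        and assessments.count("F") <= 1  # assumption: at most one F
        and count_at_least(problem_sets, "B") >= 4
        and count_at_least(discussions, "B") >= 14  # assumption: B or better
    )
    return "B" if meets_b else "C or lower"
```

Note that there is no averaging anywhere: a single missed Lab drops the result below a B regardless of how strong everything else is.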


Important

1. If you score a C or lower on a Problem Set, you will be asked to revise your assignment, except for the last Problem Set, which is due the last day of the semester.

So there are a few things to note:

In other words, you have to do all of the assignments because they teach you the topics and that's the goal. Grades are not the goal.

Anything below this level of accomplishment will result in a C or lower. As the semester unfolds, we may find it necessary to adjust the assignments, the criteria, or both. We may award "pluses and minuses" at our discretion. We may, at our discretion, change the thresholds down.

Canvas Specific

Canvas is not better than Blackboard on every feature, and the grading system is one place where it falls short. We think we have the system set up for Labs, Group Discussions, and Problem Sets. For Assessments, the grading scale is as follows:

Points  Grade
13-15   A
11-12   B
8-10    C
6-7     D
<6      F
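The scale above can be expressed as a simple lookup. This is a sketch only; the function name is hypothetical. The table lists whole-number ranges, so for fractional scores (possible when questions are worth 1.5 points each) we assume here that the lower bound of each range acts as the threshold. That reading is our assumption, not something the table states.

```python
def assessment_grade(points: float) -> str:
    """Map an Assessment score (out of 15) to a letter grade."""
    # Assumption: each range's lower bound is treated as a cutoff,
    # so a score such as 10.5 falls in the "C" band.
    if points >= 13:
        return "A"
    if points >= 11:
        return "B"
    if points >= 8:
        return "C"
    if points >= 6:
        return "D"
    return "F"
```

For example, under this reading a score of 12 maps to a "B" and a score of 5 maps to an "F".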



Course Policies

Late Policy

We do not accept late submissions for a grade without prior consultation, except in the case of extreme emergencies (the birth of a child, incapacitating illness, etc.). The following are not legitimate reasons: work, taking other classes, weddings, family reunions, holidays, anniversaries, vacations, etc. However, emergencies of all stripes do arise. The key here is prior consultation.

The main issue is that lateness on Module 5’s programming assignment snowballs into Module 6’s programming assignment. To fall behind may mean never being able to catch up. Additionally, with things like the Lab Groups, other students are counting on your participation.

COVID-19

We are still in the grips of a pandemic and the situation is fluid. Exceptions will be granted for documented COVID-19 illness. There are two reasons for requiring documentation. First, I really did have students last year say “I think I have COVID”…some twice. Second, if you really think you have COVID, you need to get tested and quarantine.

 

Academic Policies

Deadlines for Adding, Dropping and Withdrawing from Courses

Students may add a course up to one week after the start of the term for that particular course. Students may drop courses according to the drop deadlines outlined in the EP academic calendar (https://ep.jhu.edu/student-services/academic-calendar/). Between the 6th week of the class and prior to the final withdrawal deadline, a student may withdraw from a course with a W on their academic record. A record of the course will remain on the academic record with a W appearing in the grade column to indicate that the student registered and withdrew from the course.

Academic Misconduct Policy

All students are required to read, know, and comply with the Johns Hopkins University Krieger School of Arts and Sciences (KSAS) / Whiting School of Engineering (WSE) Procedures for Handling Allegations of Misconduct by Full-Time and Part-Time Graduate Students.

This policy prohibits academic misconduct, including but not limited to the following: cheating or facilitating cheating; plagiarism; reuse of assignments; unauthorized collaboration; alteration of graded assignments; and unfair competition. Course materials (old assignments, texts, or examinations, etc.) should not be shared unless authorized by the course instructor. Any questions related to this policy should be directed to EP’s academic integrity officer at ep-academic-integrity@jhu.edu.

Students with Disabilities - Accommodations and Accessibility

Johns Hopkins University values diversity and inclusion. We are committed to providing welcoming, equitable, and accessible educational experiences for all students. Students with disabilities (including those with psychological conditions, medical conditions and temporary disabilities) can request accommodations for this course by providing an Accommodation Letter issued by Student Disability Services (SDS). Please request accommodations for this course as early as possible to provide time for effective communication and arrangements.

For further information or to start the process of requesting accommodations, please contact Student Disability Services at Engineering for Professionals, ep-disability-svcs@jhu.edu.

Student Conduct Code

The fundamental purpose of the JHU regulation of student conduct is to promote and to protect the health, safety, welfare, property, and rights of all members of the University community as well as to promote the orderly operation of the University and to safeguard its property and facilities. As members of the University community, students accept certain responsibilities which support the educational mission and create an environment in which all students are afforded the same opportunity to succeed academically. 

For a full description of the code please visit the following website: https://studentaffairs.jhu.edu/policies-guidelines/student-code/

Classroom Climate

JHU is committed to creating a classroom environment that values the diversity of experiences and perspectives that all students bring. Everyone has the right to be treated with dignity and respect. Fostering an inclusive climate is important. Research and experience show that students who interact with peers who are different from themselves learn new things and experience tangible educational outcomes. At no time in this learning process should someone be singled out or treated unequally on the basis of any seen or unseen part of their identity. 
 
If you have concerns in this course about harassment, discrimination, or any unequal treatment, or if you seek accommodations or resources, please reach out to the course instructor directly. Reporting will never impact your course grade. You may also share concerns with your program chair, the Assistant Dean for Diversity and Inclusion, or the Office of Institutional Equity. In handling reports, people will protect your privacy as much as possible, but faculty and staff are required to officially report information for some cases (e.g. sexual harassment).

Course Auditing

When a student enrolls in an EP course with “audit” status, the student must reach an understanding with the instructor as to what is required to earn the “audit.” If the student does not meet those expectations, the instructor must notify the EP Registration Team [EP-Registration@exchange.johnshopkins.edu] in order for the student to be retroactively dropped or withdrawn from the course (depending on when the "audit" was requested and in accordance with EP registration deadlines). All lecture content will remain accessible to auditing students, but access to all other course material is left to the discretion of the instructor.