605.646.81 - Natural Language Processing

Computer Science
Fall 2024

Description

This course surveys the principal difficulties of working with written language data, the fundamental techniques that are used in processing natural language, and the core applications of NLP technology. Topics covered in the course include language modeling, text classification, labeling sequential data (tagging), parsing, information extraction, question answering, machine translation, and semantics. The dominant paradigm in contemporary NLP uses supervised machine learning to train models based on either probability theory or deep neural networks. Both formalisms will be covered. A practical approach is emphasized in the course, and students will write programs and use open source toolkits to solve a variety of problems. Course prerequisite(s): There are no formal prerequisite courses, although having taken any of EN.605.649 Introduction to Machine Learning, EN.605.744 Information Retrieval, or EN.605.645 Artificial Intelligence is helpful. Course note(s): A working knowledge of Python is assumed. While some of the assigned exercises can be done in any programming language, we will sometimes provide example code in Python, and many of the labs are best solved in Python.

Expanded Course Description

This course surveys the principal difficulties of working with written language data, the fundamental techniques that are used in processing natural language, and the core applications of NLP technology.  Topics covered in the course include language modeling, information retrieval, text classification, neural networks for natural language processing, vector representations, labeling sequential data (tagging), parsing, word sense disambiguation, lexical semantics, question answering, and machine translation. The dominant paradigm in contemporary NLP uses supervised machine learning to train models based on either probability theory or deep neural networks. Both formalisms will be covered.  A practical approach is emphasized in the course, and students will write programs and use open source toolkits to solve a variety of problems.

Instructors

Paul McNamee

mcnamee@jhu.edu

James Mayfield

Course Structure

The course materials are divided into modules, which can be accessed by clicking Modules on the course menu. A module will have several sections including the content, readings, and assignments. Most modules run for a period of seven (7) days; exceptions are noted in the Course Outline. You should regularly check the Calendar and Announcements for assignment due dates.  Each module will introduce a new lecture topic, and many modules will have a corresponding laboratory assignment, which is typically due at the end of the current Module (or the next).  Labs are a core part of the course, and they are the single largest factor for determining course grades.  Each student will complete a course project, which is due towards the end of the semester. There will also be weekly quizzes that cover material from the lectures and the assigned readings.  There will be no exams.

Course Topics

Note: these topics or their order may change as needed.

Course Goals

The course introduces a broad array of techniques for processing digital text. Lab assignments will give students hands-on experience using these methods and open-source software packages on realistic problems. Supervised machine learning is a cornerstone of contemporary NLP – students will use a variety of tools based on both statistical and neural approaches.

Course Learning Outcomes (CLOs)

Textbooks

We will mainly use readings from Jurafsky and Martin, Speech and Language Processing (3rd edition draft). The book is not currently available in print, but the chapters we will read are available free online: https://web.stanford.edu/~jurafsky/slp3/. There will also be supplemental readings and videos. Note: there is a 2nd edition of the text in print, but it is now substantially dated and would not be of much help in the course.

Student Coursework Requirements

605.646 is a graduate computer science course, and completing the work for each week will likely take at least 12 hours on average, depending on the material and your background. An approximate breakdown of the main components is: (a) reading the assigned materials (1 to 2 hours per week); (b) reviewing video lectures (~2 hours per week); (c) completing a quiz (~30 minutes); (d) assigned labs (5 to 8 hours, some labs more); (e) working on the class project (variable).

Grading Policy

Course grades are based on the following components: Lab Assignments (60%), Class Project (20%), and Quizzes (20%).

Course grades will be assigned using letter grades with plus/minus modifiers (see below). Submitting a project is required to earn a grade of A− or higher. Students who do not submit a project will have their grades based on their other course work and will not be eligible for a grade above B+. A grade of A indicates achievement of consistent excellence and distinction throughout the course, that is, conspicuous excellence in all aspects of assigned work. A grade of B indicates work that meets all course requirements on a level appropriate for graduate academic work.

100-97 = A+

96-93 = A

92-90 = A−

89-87 = B+

86-83 = B

82-80 = B−

79-70 = C

< 70 = F

Course Evaluation

Lab Assignments (60%)

There will be eight lab assignments. The major emphasis is on correctness, but clarity both in writing and in coding is also very important. Source code should be clear and easy to read, and should contain meaningful variable names, consistent style, and straightforward logic. Programs should be meaningfully organized and should contain suitable comments that primarily explain what the code does or intends to do.

Generally, programs should be validated using demonstrative test cases or some other evidence of correctness. On some assignments we ask for specific test cases; otherwise you are free to use examples of your own choosing. If a program is not working 100% correctly, you can still provide examples or give an explanation of which aspects work correctly and which do not. 

Your lowest lab score will be dropped from your average. While we think there is educational benefit in doing all of the problem sets, this means that you could choose to not submit one lab assignment without affecting your grade; that score would be zero, which would be ignored in the average. 

Class Project (20%)

Individual class projects permit students to explore a course-related topic in greater depth. Projects are usually based on conducting an experiment, working with interesting textual data, or developing a proof-of-concept NLP system. Detailed information about the project will be communicated several weeks into the course. The primary deliverables are a written report (approx. 5-8 pages) and a pre-recorded video presentation for the class during the last module. 

Quizzes (20%) 

There will be short, time-limited quizzes in most Modules. The quizzes will test knowledge of material from lecture materials and readings. The two lowest quiz grades will be dropped in computing the quiz average. Quizzes must be submitted by the end of the Module -- there are no makeup quizzes, and quizzes cannot be submitted late.
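As a rough illustration of how the weights, dropped scores, and letter-grade scale in this syllabus combine, here is a minimal Python sketch. It is not the instructors' official calculation. It assumes that all scores are on a 0-100 scale, that each lab and each quiz counts equally within its category, and that a letter band begins at its listed lower bound (the syllabus does not say how fractional scores are rounded). The function names and the count of ten quizzes are hypothetical, and the case of a student who does not submit a project is not covered.

```python
# Lower bound of each letter-grade band, taken from the grading scale.
GRADE_BANDS = [(97, "A+"), (93, "A"), (90, "A-"),
               (87, "B+"), (83, "B"), (80, "B-"), (70, "C")]


def average_after_dropping(scores, n_drop):
    """Mean of the scores after discarding the n_drop lowest values."""
    if len(scores) <= n_drop:
        raise ValueError("need more scores than the number dropped")
    kept = sorted(scores)[n_drop:]
    return sum(kept) / len(kept)


def course_percentage(lab_scores, quiz_scores, project_score):
    """Weighted course percentage: labs 60%, project 20%, quizzes 20%."""
    lab_avg = average_after_dropping(lab_scores, 1)    # lowest lab dropped
    quiz_avg = average_after_dropping(quiz_scores, 2)  # two lowest quizzes dropped
    return 0.60 * lab_avg + 0.20 * project_score + 0.20 * quiz_avg


def letter_grade(percentage):
    """Map a percentage to a letter (assumes a band starts at its lower bound)."""
    for lower_bound, letter in GRADE_BANDS:
        if percentage >= lower_bound:
            return letter
    return "F"


# Hypothetical scores: eight labs (one not submitted, scored zero) and ten quizzes.
labs = [90, 85, 100, 0, 95, 80, 90, 90]
quizzes = [100, 80, 90, 70, 60, 100, 90, 80, 100, 90]
total = course_percentage(labs, quizzes, 90.0)
print(round(total, 2), letter_grade(total))  # 90.25 A-
```

In this example the unsubmitted lab is the dropped score, so the lab average is 90; the two lowest quizzes (60 and 70) are dropped, giving a quiz average of 91.25.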

Course Policies

Submitting Individual Assignments

Your name, the course number, and a title (e.g., "Lab #4") must be present on the first page of each submission. Work for the class, such as Lab Assignments (including source code), must be submitted in Canvas as a single PDF file. However, on some labs we may also ask for separate file submissions to provide test results. We generally do not directly grade spelling and grammar. However, violations of the rules of the English language may be noted without comment. Consistently poor spelling or grammar that detracts from the understandability of the submission may detract from your grade. A PDF file generated from a Jupyter notebook is a reasonable way to submit an assignment written in Python. However, if you do so, be sure to examine the output to ensure that it is readable and does not contain truncated lines. Dark text on a light background is expected.

Policy on Late Work

Lab assignments must be submitted in Canvas by 11:59pm EST on Day 7 of the module when the lab is due (unless other directions are given by the instructors). A late assignment will be accepted up to one week late with an automatic 20% deduction. No assignment will be accepted more than one week late; it will be given a grade of zero instead. Generally speaking, it is better to submit something slightly incomplete or imperfect on time than to submit it late. Remember, the lowest grade will be dropped when computing your lab assignment average. In extraordinary circumstances you should contact the instructors. Reasonable accommodation will be made for an extended hospitalization or other serious situations. However, documentation is expected (e.g., a signed note on letterhead with printed contact information of the physician).
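The late policy can be sketched in Python as follows. This is only an illustration under stated assumptions: the 20% deduction is interpreted as multiplying the earned score by 0.80 (the syllabus does not say whether it is instead 20 points off), scores are on a 0-100 scale, and the function name and the `days_late` parameter are hypothetical.

```python
def lab_score_after_late_policy(raw_score, days_late):
    """Apply the late policy to a lab score on a 0-100 scale."""
    if days_late <= 0:
        return float(raw_score)   # on time: no deduction
    if days_late <= 7:
        return raw_score * 0.80   # up to one week late: 20% deduction (assumed multiplicative)
    return 0.0                    # more than one week late: not accepted, scored zero


print(lab_score_after_late_policy(90, 0))   # on time
print(lab_score_after_late_policy(90, 3))   # three days late
print(lab_score_after_late_policy(90, 10))  # more than a week late
```

Because the lowest lab score is dropped, a single zero from a lab that was not accepted would be the score ignored in the lab average.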

In some situations, withdrawing from the course (no permission needed) or taking an incomplete (permission required) are appropriate. You are encouraged to speak with the instructors and/or your academic advisor if you are considering pursuing either course of action.

Additional Comments on Academic Honesty

Discussions among students are an important part of learning and are key to success in a graduate course. It is permissible, and often even desirable, for you to discuss the general nature of course content and assignments with your peers. However, the line between collaboration and cheating needs to be carefully delineated. You should not discuss or reveal solutions to assigned work with others, or share any unpublished source code. When you submit work with your name on it for evaluation, it must represent an original, individual effort by you alone.

This course requires you to write computer programs, and unless explicitly prohibited on an assignment, it is perfectly acceptable to make use of published examples and source code from the literature or public domain, but only if attribution is given. You must provide a citation for source code or other material that you do not write yourself (e.g., URLs to websites, pointers to GitHub repos, Numerical Recipes in C, Stack Overflow, etc.). Use of generative AI (or similar means) to produce code is permitted; however, students should disclose how it was used on the assignment. Use of generative AI on quizzes is prohibited. Contact the instructors if you have questions about this policy.

The content of quizzes may not be discussed with other students (in Canvas, or otherwise) until the quiz has been graded by the instructors.

Academic Policies

Deadlines for Adding, Dropping and Withdrawing from Courses

Students may add a course up to one week after the start of the term for that particular course. Students may drop courses according to the drop deadlines outlined in the EP academic calendar (https://ep.jhu.edu/student-services/academic-calendar/). Between the 6th week of the class and the final withdrawal deadline, a student may withdraw from a course with a W on their academic record. A record of the course will remain on the academic record with a W appearing in the grade column to indicate that the student registered and withdrew from the course.

Academic Misconduct Policy

All students are required to read, know, and comply with the Johns Hopkins University Krieger School of Arts and Sciences (KSAS) / Whiting School of Engineering (WSE) Procedures for Handling Allegations of Misconduct by Full-Time and Part-Time Graduate Students.

This policy prohibits academic misconduct, including but not limited to the following: cheating or facilitating cheating; plagiarism; reuse of assignments; unauthorized collaboration; alteration of graded assignments; and unfair competition. Course materials (old assignments, texts, or examinations, etc.) should not be shared unless authorized by the course instructor. Any questions related to this policy should be directed to EP’s academic integrity officer at ep-academic-integrity@jhu.edu.

Students with Disabilities - Accommodations and Accessibility

Johns Hopkins University values diversity and inclusion. We are committed to providing welcoming, equitable, and accessible educational experiences for all students. Students with disabilities (including those with psychological conditions, medical conditions and temporary disabilities) can request accommodations for this course by providing an Accommodation Letter issued by Student Disability Services (SDS). Please request accommodations for this course as early as possible to provide time for effective communication and arrangements.

For further information or to start the process of requesting accommodations, please contact Student Disability Services at Engineering for Professionals, ep-disability-svcs@jhu.edu.

Student Conduct Code

The fundamental purpose of the JHU regulation of student conduct is to promote and to protect the health, safety, welfare, property, and rights of all members of the University community as well as to promote the orderly operation of the University and to safeguard its property and facilities. As members of the University community, students accept certain responsibilities which support the educational mission and create an environment in which all students are afforded the same opportunity to succeed academically. 

For a full description of the code please visit the following website: https://studentaffairs.jhu.edu/policies-guidelines/student-code/

Classroom Climate

JHU is committed to creating a classroom environment that values the diversity of experiences and perspectives that all students bring. Everyone has the right to be treated with dignity and respect. Fostering an inclusive climate is important. Research and experience show that students who interact with peers who are different from themselves learn new things and experience tangible educational outcomes. At no time in this learning process should someone be singled out or treated unequally on the basis of any seen or unseen part of their identity. 
 
If you have concerns in this course about harassment, discrimination, or any unequal treatment, or if you seek accommodations or resources, please reach out to the course instructor directly. Reporting will never impact your course grade. You may also share concerns with your program chair, the Assistant Dean for Diversity and Inclusion, or the Office of Institutional Equity. In handling reports, people will protect your privacy as much as possible, but faculty and staff are required to officially report information for some cases (e.g. sexual harassment).

Course Auditing

When a student enrolls in an EP course with “audit” status, the student must reach an understanding with the instructor as to what is required to earn the “audit.” If the student does not meet those expectations, the instructor must notify the EP Registration Team [EP-Registration@exchange.johnshopkins.edu] in order for the student to be retroactively dropped or withdrawn from the course (depending on when the "audit" was requested and in accordance with EP registration deadlines). All lecture content will remain accessible to auditing students, but access to all other course material is left to the discretion of the instructor.