Large-Scale Multimedia Analysis

11775, Fall 2023

Instructors: Alex Hauptmann, Zhi-Qi Cheng



Date and Time: 17:00-18:20, Monday and Wednesday
Location: GHC-4307
Websites: Canvas, Piazza

Course Description:

Can a robot watch "YouTube" to learn about the world? What makes us laugh? How to bake a cake? Why is Kim Kardashian famous?

Large-scale multi-media is an incomparable window into our world, with thousands of hours of data available on almost every aspect of our everyday life. The analysis of such data is a unique opportunity to perform deep multi-modal analysis that goes beyond image or video retrieval, speech to text, or other existing tasks. This is a 12-unit class or lab covering fundamentals of large-scale computer vision, audio and speech processing, multi-media files and streaming, multi-modal signal processing, video retrieval, semantics, and text (possibly also: speech, music) generation.

Target Audience/Prerequisites: This is a graduate course primarily for students in LTI, HCII, CSD, Robotics, and ECE; others, for example undergraduate students in CS, may enroll with prior permission of the instructor(s). Strong implementation skills, experience working with large data sets, and familiarity with some (not all) of the above fields (e.g. 11-611, 11-711, 11-751, 11-755, 11-792, 16-720, or equivalent) will be helpful.

Learning Objectives: Instructors will give an overview of relevant recent work and benchmarking efforts (TRECVID, MediaEval, etc.). Students will work on research projects to explore these ideas and learn to perform multi-modal retrieval, summarization, and inference on large amounts of "YouTube"-style data. The experimental environment for the practical part of the course will be provided to students in the form of virtual machines.

Course Outcomes: Students who successfully complete the course should, at a minimum, understand all aspects of a state-of-the-art multi-media search system. They will understand the fundamental algorithms of information retrieval, speech recognition and audio processing, and image and video processing, as well as the complexities of handling large amounts of heterogeneous multi-media data. They will have in-depth, hands-on experience with some of the algorithms involved in processing (recognition and/or synthesis) any of these modalities and/or with multi-modal fusion. They should be able to apply that knowledge to other domains and/or data on their own.

Grading: The overall grade will be determined as follows:
  • Assignments (30%)
  • Term Project (70%)
It is typically possible to audit the class, if helpful.