Large-Scale Multimedia Analysis

11775, Spring 2024

Instructors: Alex Hauptmann, Zhi-Qi Cheng

Lecture:

Date and Time: 17:00-18:20, Monday and Wednesday
Location: GHC-4102
Websites: Canvas, Piazza

Course Description:

Can a robot watch "YouTube" to learn about the world? What makes us laugh? How to bake a cake? Why is Kim Kardashian famous?

Large-scale multi-media is an incomparable window into our world, with thousands of hours of data available on almost every aspect of our everyday life. The analysis of such data is a unique opportunity to perform deep multi-modal analysis that goes beyond image or video retrieval, speech-to-text, or other existing tasks. This is a 12-unit class or lab covering the fundamentals of large-scale computer vision, audio and speech processing, multi-media files and streaming, multi-modal signal processing, video retrieval, semantics, and text (possibly also: speech, music) generation.

Target Audience / Prerequisites:

This is a graduate course primarily for students in LTI, HCII, CSD, Robotics, and ECE; others, for example undergraduate students in CS, may enroll with prior permission of the instructor(s). Strong implementation skills, experience working with large data sets, and familiarity with some (not all) of the above fields (e.g., 11-611, 11-711, 11-751, 11-755, 11-792, 16-720, or equivalent) will be helpful.

Learning Objectives:

Instructors will give an overview of relevant recent work and benchmarking efforts (TRECVID, MediaEval, etc.). Students will work on research projects to explore these ideas and learn to perform multi-modal retrieval, summarization, and inference on large amounts of "YouTube"-style data. The experimental environment for the practical part of the course will be provided to students in the form of virtual machines.

Course Outcomes:

Students who successfully complete the course should, at a minimum, understand all aspects of a state-of-the-art multi-media search system. They will understand the fundamental algorithms of information retrieval, speech recognition and audio processing, and image and video processing, as well as the complexities of handling large amounts of heterogeneous multi-media data. They will have in-depth, hands-on experience with some of the algorithms involved in processing (recognition and/or synthesis) any of these modalities and/or with multi-modal fusion. They should be able to apply that knowledge to other domains and/or data on their own.

Grading:

The overall grade will be determined as follows:

Assignments (30%)
Term Project (70%)

It is typically possible to audit the class, if helpful.

Deadlines and Lateness:

Homework (assignments and exercises) and term project results are worth full credit if turned in by the beginning of class on the due date. Unless an extension was granted in advance, late work is worth at most 75% credit for the next 48 hours and at most 50% credit after that. If you need an extension, please ask for it as soon as you know you need it. Extensions that are requested promptly can be granted more liberally. You must turn in all assignments.
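
As an informal illustration of the lateness rule above, the credit cap can be sketched as a small function. The function name `max_credit`, measuring lateness in hours, and treating exactly 48 hours as still inside the 75% window are all assumptions made for this sketch, and it assumes an approved extension restores full credit. The instructors' actual handling may differ.

```python
def max_credit(hours_late: float, extension_granted: bool = False) -> float:
    """Return the maximum fraction of credit a submission can earn.

    hours_late is measured from the beginning of class on the due date;
    zero or negative values mean the work was on time.
    Hypothetical helper: the name and boundary handling are assumptions.
    """
    # On-time work, or work covered by an extension granted in advance
    # (assumed here to restore full credit), is not capped.
    if extension_granted or hours_late <= 0:
        return 1.0
    # First 48 hours after the deadline: at most 75% credit.
    if hours_late <= 48:
        return 0.75
    # Anything later: at most 50% credit.
    return 0.50


if __name__ == "__main__":
    for hours in (0, 24, 72):
        print(hours, max_credit(hours))
```

For example, under these assumptions a homework turned in one day late could earn at most 75% of its points, and one turned in three days late at most 50%.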

Collaboration among Students:

We encourage students to collaborate and to study materials in groups when the purpose is to facilitate learning, not to circumvent problems. You may seek help from other students in understanding the material needed to solve a particular problem. However, students must submit individual material and solutions, unless otherwise specified. Students should declare any collaboration on the first page of homework assignments (or equivalently on exercises). If the instructors believe the collaboration is improper, your grade may be affected. Collaboration without full disclosure will be handled in compliance with CMU's Policy on Cheating and Plagiarism. If in doubt, ask!