Carnegie Mellon University


Computational Models of Human Gestures

By LP Morency

We propose to take advantage of our expertise in multimodal machine learning and in understanding social intelligence to attempt to build new generative models and data resources for human gesture generation.

We describe below the four (4) research goals and aims of this Project.

Research Goal and Aim 1: Project initialization and literature review
As a first research goal and aim, we plan to study relevant literature in gesture generation, with a focus on recent work that examines social interactions within dyads. We will also pay particular attention to literature on learning to generate motion from music. As part of this first research goal and aim, we also plan to perform any extra data pre-processing of the current PATS dataset needed to prepare it for training new generative models (as detailed in Research Goal and Aim 2).
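To make this concrete, the short sketch below illustrates the kind of pose pre-processing we have in mind (centering and scaling 2D keypoints, then fixing clip lengths); the joint indices, clip length, and array layout are illustrative assumptions and not the actual PATS format.

    # Minimal sketch of pose pre-processing for generative-model training.
    # Joint indices (neck, shoulders) and the fixed clip length are assumptions.
    import numpy as np

    def normalize_poses(poses):
        """Center each frame on the neck joint and scale by shoulder width.
        poses: array of shape (T, J, 2) holding 2D keypoints."""
        neck = poses[:, 0:1, :]                      # assume joint 0 is the neck
        shoulders = poses[:, 1, :] - poses[:, 2, :]  # assume joints 1, 2 are the shoulders
        scale = np.linalg.norm(shoulders, axis=-1, keepdims=True)[:, None, :] + 1e-8
        return (poses - neck) / scale

    def to_fixed_length(poses, length=64):
        """Crop or zero-pad a pose sequence to a fixed number of frames."""
        if poses.shape[0] >= length:
            return poses[:length]
        pad = np.zeros((length - poses.shape[0],) + poses.shape[1:], dtype=poses.dtype)
        return np.concatenate([poses, pad], axis=0)

    if __name__ == "__main__":
        dummy = np.random.rand(80, 10, 2)            # 80 frames, 10 joints, (x, y)
        print(to_fixed_length(normalize_poses(dummy)).shape)   # (64, 10, 2)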

Research Goal and Aim 2: Design of language- and socially-grounded generative models
The second research goal and aim will focus on attempting to create generative models for the task of gesture generation from speech, with a strong focus on going beyond current approaches to better align gestures not only with the acoustic signal but also with the language. We are particularly interested in learning gesture generation models that illustrate the ideas and concepts expressed in language. While the PATS dataset will serve as an initial data source for this exploration, we also plan to explore other scenarios where language is directly aligned with gestures, such as educational videos where slides may be present as an extra source of information. We are also interested in exploring how gestures are expressed in social settings such as dialogue between two people, where the roles of speaker and listener alternate.
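As a rough illustration of what such a model could look like, the sketch below conditions a pose decoder jointly on acoustic and language features (assuming PyTorch); the feature dimensions, GRU encoders, and simple regression objective are placeholders rather than the architecture we will ultimately propose.

    # Illustrative speech- and language-conditioned gesture generator (PyTorch).
    # Dimensions, encoders, and the regression objective are assumptions.
    import torch
    import torch.nn as nn

    class SpeechLanguageGestureModel(nn.Module):
        def __init__(self, audio_dim=128, text_dim=300, hidden_dim=256, n_joints=10):
            super().__init__()
            self.audio_enc = nn.GRU(audio_dim, hidden_dim, batch_first=True)
            self.text_enc = nn.GRU(text_dim, hidden_dim, batch_first=True)
            self.decoder = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
            self.to_pose = nn.Linear(hidden_dim, n_joints * 2)   # (x, y) per joint

        def forward(self, audio_feats, text_feats):
            # audio_feats: (B, T, audio_dim); text_feats: (B, T, text_dim),
            # assumed pre-aligned to the same frame rate.
            a, _ = self.audio_enc(audio_feats)
            t, _ = self.text_enc(text_feats)
            h, _ = self.decoder(torch.cat([a, t], dim=-1))
            return self.to_pose(h).view(h.shape[0], h.shape[1], -1, 2)

    if __name__ == "__main__":
        model = SpeechLanguageGestureModel()
        poses = model(torch.randn(2, 64, 128), torch.randn(2, 64, 300))
        loss = nn.functional.mse_loss(poses, torch.randn_like(poses))
        print(poses.shape, loss.item())   # torch.Size([2, 64, 10, 2]) ...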

Research Goal and Aim 3: Building PATS v3 dataset
We will attempt to create a new extension of the PATS dataset to diversify its content and start exploring a different setting: how human motion relates to music during dance. The interesting aspect of this setting is the wide range of motion people express during dance. Another interesting extension for PATS is to explore videos where the gestures are closely related to language, either spoken or written (e.g., text on a slide during an educational presentation). We plan to extract videos from resources freely available online, including video-sharing websites such as YouTube. Videos will be manually selected to show natural speaking and moving behaviors.
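One possible collection pipeline is sketched below: download a manually selected video, then extract per-frame 2D pose keypoints; the tool choices (yt-dlp, OpenCV, MediaPipe Pose) and the placeholder URL are assumptions for illustration only, not a final design decision.

    # Rough sketch of video collection and per-frame pose extraction.
    # The URL is a placeholder; the tool choices are one option among several.
    import cv2
    import numpy as np
    import mediapipe as mp
    from yt_dlp import YoutubeDL

    def download_video(url, out_path="clip.mp4"):
        with YoutubeDL({"format": "mp4", "outtmpl": out_path}) as ydl:
            ydl.download([url])
        return out_path

    def extract_poses(video_path):
        """Return an array of shape (T, 33, 3) with (x, y, visibility) per joint."""
        cap = cv2.VideoCapture(video_path)
        frames = []
        with mp.solutions.pose.Pose(static_image_mode=False) as pose:
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                if result.pose_landmarks:
                    frames.append([(lm.x, lm.y, lm.visibility)
                                   for lm in result.pose_landmarks.landmark])
        cap.release()
        return np.array(frames)

    if __name__ == "__main__":
        path = download_video("https://www.youtube.com/watch?v=PLACEHOLDER")
        print(extract_poses(path).shape)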

Research Goal and Aim 4: Baseline models for motion generation
We plan to take advantage of any new PATS v3 dataset created in Research Goal and Aim 3 to learn how to predict dance motion from audio. For the purpose of this Goal and Aim 4, we plan to evaluate pre-existing approaches for gesture generation and see how well they generalize to the task of generating dance-related motion. One of the approaches we will explore is our previously proposed MixStage model, which was shown to generate gestures from speech very well. This exploration of how well gesture generation models generalize to dance-motion generation will give us good initial baselines and allow us to better understand future challenges in dance-motion generation.
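To ground this comparison, the sketch below shows two simple quantitative metrics we could start from: average joint position error against ground-truth motion, and a rough beat-alignment score that checks whether motion-speed peaks fall near audio beats; the 15-fps motion rate, tolerance, and exact metric definitions are illustrative assumptions.

    # Sketch of two baseline metrics for dance-motion generation.
    # Frame rate, tolerance, and metric definitions are assumptions.
    import numpy as np
    import librosa
    from scipy.signal import find_peaks

    def average_joint_error(pred, gt):
        """pred, gt: arrays of shape (T, J, 2); mean Euclidean error over joints and frames."""
        return float(np.linalg.norm(pred - gt, axis=-1).mean())

    def beat_alignment(motion, audio_path, fps=15.0, tolerance=0.2):
        """Fraction of audio beats with a motion-speed peak within `tolerance` seconds."""
        y, sr = librosa.load(audio_path, sr=None)
        _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
        beat_times = librosa.frames_to_time(beat_frames, sr=sr)
        # Motion speed per frame, averaged over joints; peaks approximate movement "hits".
        speed = np.linalg.norm(np.diff(motion, axis=0), axis=-1).mean(axis=-1)
        peak_times = find_peaks(speed)[0] / fps
        if len(beat_times) == 0 or len(peak_times) == 0:
            return 0.0
        hits = sum(np.abs(peak_times - b).min() <= tolerance for b in beat_times)
        return hits / len(beat_times)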