Video Manipulations Beyond Faces: A Dataset with Human-Machine Analysis

Trisha Mittal1, Ritwik Sinha2, Viswanathan Swaminathan2, John Collomosse2, Dinesh Manocha1

1University of Maryland, 2Adobe Research


As tools for content editing mature and artificial intelligence (AI) based algorithms for synthesizing media grow, the presence of manipulated content across online media is increasing. This phenomenon causes the spread of misinformation, creating a greater need to distinguish between "real" and "manipulated" content. To this end, we present VideoSham, a dataset consisting of 826 videos (413 real and 413 manipulated). Many existing deepfake datasets focus exclusively on two types of facial manipulations: swapping with a different subject's face or altering the existing face. VideoSham, on the other hand, contains more diverse, context-rich, human-centric, high-resolution videos manipulated using a combination of 6 different spatial and temporal attacks. Our analysis shows that state-of-the-art manipulation detection algorithms only work for a few specific attacks and do not scale well on VideoSham. We performed a user study on Amazon Mechanical Turk with 1200 participants to understand whether they can differentiate between the real and manipulated videos in VideoSham. Finally, we dig deeper into the strengths and weaknesses of human and SOTA-algorithm performance to identify gaps that need to be filled with better AI algorithms.


  • ATTACK 1 (Adding an entity/subject): In this attack, we select an entity or subject from another source and insert it into the current video.
  • ATTACK 2 (Removing an entity/subject): In this attack, we select an entity or subject in the video, remove it from all frames, and fill in the gap with the surrounding background.
  • ATTACK 3 (Background/Color Change): We focus on a particular aspect of the video and change its background, or the color of a small entity in the video.
  • ATTACK 4 (Text Replaced/Added): We perform edits such as adding text to the video, or removing or replacing text that already appears in it.
  • ATTACK 5 (Frame Duplication/Removal/Dropping): This attack is specifically designed to render the video temporally inconsistent. We randomly duplicate, remove, or drop frames in the video; this also includes slowing down a video.
  • ATTACK 6 (Audio Replaced): The audio modality is a very important aspect of video. To manipulate it, we replace the existing audio track with a different one.
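As an illustration, the frame-level temporal manipulations in ATTACK 5 can be sketched as simple operations over a decoded frame sequence. This is a minimal sketch, not the dataset's actual editing pipeline: the function names and sampling rates are hypothetical, and `frames` stands in for any list of decoded images.

```python
import random

def duplicate_frames(frames, rate=0.1, seed=0):
    """Randomly duplicate roughly `rate` of the frames (hypothetical parameters)."""
    rng = random.Random(seed)
    out = []
    for f in frames:
        out.append(f)
        if rng.random() < rate:
            out.append(f)  # repeated frame makes the video temporally inconsistent
    return out

def drop_frames(frames, rate=0.1, seed=0):
    """Randomly remove roughly `rate` of the frames."""
    rng = random.Random(seed)
    return [f for f in frames if rng.random() >= rate]

def slow_down(frames, factor=2):
    """Slow a clip down by repeating every frame `factor` times."""
    return [f for f in frames for _ in range(factor)]
```

Applied to a real clip, each frame would be a decoded image array and the edited sequence would be re-encoded at the original frame rate, so duplication slows the video down and dropping speeds it up.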



Please cite our paper if you use the VideoSham Dataset in your work.

  @article{mittal2022video,
    title={Video Manipulations Beyond Faces: A Dataset with Human-Machine Analysis},
    author={Mittal, Trisha and Sinha, Ritwik and Swaminathan, Viswanathan and Collomosse, John and Manocha, Dinesh},
    journal={arXiv preprint arXiv:2207.13064},
    year={2022}
  }