MOTIVATED: Playing with MP4 Metadata

@kawaii.social

Can you believe someone would just LIE in their container metadata?

Objective

Hello chat.

To celebrate Bluesky's launch of video, I decided to see if I could exploit the new feature and upload videos that didn't meet the posted restrictions. The goal was to modify the metadata of an MP4 video so that it appeared to be 59 seconds long while retaining the content of a longer video and keeping the full duration playable. This had to be achieved without breaking playback or file integrity, and without being caught by the AppView or the server-side encoder.

Understanding MP4 Container Structures and Atoms

Before diving into the methodology, let's discuss how MP4 containers and their atoms work.

MP4 Container Basics

An MP4 file is a container format that stores multimedia data, such as video and audio streams. It is organized into a hierarchical structure of atoms, sometimes called boxes, where each atom contains specific information about the media file. The minimum size of an atom is 8 bytes. The first four bytes specify the size of the atom as a big-endian integer (the 8-byte header is included in that size), and the next four specify the type, with any remainder as data: [SSSSTTTT...]
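
To make the [SSSSTTTT...] layout concrete, here is a minimal Go sketch that reads the header of a single atom from a byte slice. The boxHeader type and parseBoxHeader function are hypothetical names made up for this illustration, and the sketch ignores the special size values 0 (atom runs to the end of the file) and 1 (a 64-bit size follows the type).

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// boxHeader is a hypothetical helper type for this sketch.
type boxHeader struct {
	Size uint32 // total atom size in bytes, header included
	Type string // four-character code, e.g. "moov"
}

// parseBoxHeader reads the 8-byte [SSSSTTTT] header at the start of buf.
func parseBoxHeader(buf []byte) (boxHeader, error) {
	if len(buf) < 8 {
		return boxHeader{}, errors.New("need at least 8 bytes for an atom header")
	}
	return boxHeader{
		Size: binary.BigEndian.Uint32(buf[0:4]),
		Type: string(buf[4:8]),
	}, nil
}

func main() {
	// A 16-byte atom: size 0x00000010, type "free", then 8 bytes of data.
	raw := []byte{0x00, 0x00, 0x00, 0x10, 'f', 'r', 'e', 'e', 0, 0, 0, 0, 0, 0, 0, 0}
	h, err := parseBoxHeader(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("type=%s size=%d data=%d bytes\n", h.Type, h.Size, h.Size-8)
}
```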

Key Atoms

  • mvhd (Movie Header Box): Provides overall information about the entire video, including timescale and duration. It affects the global timing of the media presentation.
  • mdhd (Media Header Box): Contains timing information specific to a media track (either audio or video), such as timescale and duration.
  • stts (Decoding Time-to-Sample Table Box): Describes the timing (duration) of each sample in the track. It maps sample counts to their respective durations, allowing precise control over playback timing.
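
The three atoms above live at different depths of the hierarchy. As a rough sketch, the Go program below builds a tiny synthetic atom tree in memory and prints the path to every atom in it. The walk and box helpers are hypothetical, the atoms carry no real payload, and the list of container types is deliberately limited to the ones needed to reach mvhd, mdhd, and stts.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// containers lists the atom types this sketch descends into.
var containers = map[string]bool{
	"moov": true, "trak": true, "mdia": true, "minf": true, "stbl": true,
}

// walk appends the path of every atom found in buf, recursing into containers.
func walk(buf []byte, prefix string, paths *[]string) {
	for len(buf) >= 8 {
		size := int(binary.BigEndian.Uint32(buf[0:4]))
		typ := string(buf[4:8])
		if size < 8 || size > len(buf) {
			return // malformed, or one of the special sizes (0 and 1) this sketch skips
		}
		path := prefix + "/" + typ
		*paths = append(*paths, path)
		if containers[typ] {
			walk(buf[8:size], path, paths)
		}
		buf = buf[size:]
	}
}

// box builds an atom of the given type whose data is the concatenated children.
func box(typ string, children ...[]byte) []byte {
	body := []byte{}
	for _, c := range children {
		body = append(body, c...)
	}
	out := make([]byte, 8, 8+len(body))
	binary.BigEndian.PutUint32(out[0:4], uint32(8+len(body)))
	copy(out[4:8], typ)
	return append(out, body...)
}

func main() {
	file := box("moov",
		box("mvhd"),
		box("trak", box("mdia", box("mdhd"), box("minf", box("stbl", box("stts"))))),
	)
	var paths []string
	walk(file, "", &paths)
	for _, p := range paths {
		fmt.Println(p)
	}
}
```

Running it lists mvhd directly under moov, mdhd under moov/trak/mdia, and stts further down under moov/trak/mdia/minf/stbl.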

How Video Encoders Use Container Metadata

Video encoders and players use container metadata to:

  • Determine Playback Parameters: Timescale and duration inform players how to present the media.
  • Manage Synchronization: Metadata ensures audio and video tracks are synchronized during playback.
  • Optimize Seeking: Indexing information helps players jump to different parts of the media without decoding the entire file.

By manipulating these metadata atoms, I can influence how the media file is interpreted without altering the actual content.

Operating Constraints

  • File Size Limitation: Keeping the output video under 50 MB required some initial bitrate and compression tuning via ffmpeg, which is not really pertinent to this post.
  • Metadata Manipulation: I needed to manipulate the MP4 file's metadata, specifically the mvhd, mdhd, and stts atoms, which report the video's duration.
  • Ensuring Playback Integrity: Modifying metadata without breaking video playback or causing artifacts required doing some digging on MP4 container structures.

Initial Missteps

Hex Editing Attempt

Initially, I attempted manual hex editing to alter the MP4 metadata. At first glance, this appears to be simple; you view the hexdump of the video file in a hex editor and find the headers, then modify the corresponding bytes. While theoretically possible, this approach caused issues with players that were doing file integrity checks:

  • File Corruption: Editing the file's binary directly led to corrupted MP4 structures. Video players were unable to parse the modified file, resulting in playback failures.
  • Atom Size and Structure: Without careful attention to atom sizes and offsets, hex edits broke the file structure, damaging file integrity and leading to errors.
  • Checksum Mismatch: Many systems check file integrity via checksums or internal consistency checks. The manual modifications caused mismatches, flagging the file as corrupt.

Why stts Manipulation Was Dropped

I originally assumed I would need to manipulate the stts atom, which controls the decoding times for each frame. The approach centered on manually editing the stts (Decoding Time-to-Sample) atom in the MP4 file with a hex editor. The stts atom defines the timing of each frame by specifying how many consecutive frames share the same duration. The stts box contains pairs of values: one for the number of frames, and one for the time increment (or duration) between them.

Process:

  1. Locating the stts Box: Using a hex editor, the first step was to locate the stts box within the MP4 file. The stts atom can be identified by its four-character code (stts) and sits inside the mdia (media) atom of each trak (track) atom, nested at mdia/minf/stbl/stts.
  2. Modifying Frame Timing: Once the stts atom was located, the editing focused on manipulating the frame timing data. Each entry in the stts box consists of:
    • Sample count: The number of consecutive frames that use the same time increment.
    • Sample delta: The duration (in units of the media’s timescale) between consecutive frames.
    By reducing the sample delta values, I attempted to shorten the apparent duration of the video. This involved calculating new, lower time increments to create the illusion of a shorter video without actually removing frames.
  3. Hex Editing: In the hex editor, the sample count and sample delta values appear as 32-bit big-endian integers. I manually changed the sample delta values to smaller numbers, aiming to make the decoding time between frames much shorter. The idea was that by adjusting these deltas, the video player would interpret the frames as being closer together in time, thus shortening the reported video duration.
  4. Maintaining Frame Data: While adjusting the stts box, I left the actual frame data untouched. The goal was to modify only the timing metadata, ensuring the frames were still intact, but spaced closer together temporally from the perspective of the decoder.
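
To show why shrinking the deltas shortens the apparent duration, here is a small Go sketch that decodes an stts payload and sums sample count × sample delta. It is an illustration only, not the hex-editing procedure itself. The names sttsEntry, parseStts, and totalTicks are hypothetical, and the sample values (10,800 samples, delta 512, timescale 15,360) are made up for the example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// sttsEntry is one (sample count, sample delta) pair from the table.
type sttsEntry struct {
	SampleCount uint32
	SampleDelta uint32
}

// parseStts decodes the data of an stts atom (everything after the 8-byte
// header): 1 byte version, 3 bytes flags, a 4-byte entry count, then the pairs.
func parseStts(payload []byte) ([]sttsEntry, error) {
	if len(payload) < 8 {
		return nil, errors.New("stts payload too short")
	}
	n := binary.BigEndian.Uint32(payload[4:8])
	if uint64(len(payload)-8) < uint64(n)*8 {
		return nil, errors.New("stts entry table truncated")
	}
	entries := make([]sttsEntry, n)
	for i := range entries {
		off := 8 + i*8
		entries[i] = sttsEntry{
			SampleCount: binary.BigEndian.Uint32(payload[off : off+4]),
			SampleDelta: binary.BigEndian.Uint32(payload[off+4 : off+8]),
		}
	}
	return entries, nil
}

// totalTicks sums count*delta over all entries, in media timescale units.
func totalTicks(entries []sttsEntry) uint64 {
	var total uint64
	for _, e := range entries {
		total += uint64(e.SampleCount) * uint64(e.SampleDelta)
	}
	return total
}

func main() {
	// One entry: 10800 samples (0x2A30), each lasting 512 ticks (0x0200).
	payload := []byte{0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0x2A, 0x30, 0, 0, 0x02, 0x00}
	entries, err := parseStts(payload)
	if err != nil {
		panic(err)
	}
	const timescale = 15360 // hypothetical media timescale (ticks per second)
	fmt.Printf("%.1f seconds\n", float64(totalTicks(entries))/timescale)
}
```

With these made-up numbers the table adds up to 5,529,600 ticks, or 360 seconds. Lowering the delta lowers that total in direct proportion while the sample count, and therefore the frame data, stays the same.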

However, this was ultimately unnecessary because:

  • Incorrect Assumptions About Duration: I assumed that the duration limit was a hard rule enforced against the actual content, but it appears to have been more of a programmatic check applied during the initial upload through the AppView.
  • Complications with Frame Timing: Modifying stts could have led to unsynchronized audio, jittery video playback, or frame drops. Since the keyframe and metadata manipulations were already sufficient, further tampering with stts was not needed.
  • Passed Validation: Without stts manipulation, the video still passed validation checks via ffprobe (a multimedia stream information tool that's part of ffmpeg), which focused more on the file’s overall size and metadata rather than detailed frame-level timing.

This was also a huge relief, because editing this box when you have a lot of frames to work with gets extremely tedious.

Understanding the Validation Logic

Through discussions with other Bluesky developers, I realized that the validation logic was simpler than I initially thought:

  • Blob Size Emphasis: The system placed more emphasis on the blob size of the video file rather than solely relying on the reported duration.
  • ffprobe Dependency: ffmpeg is a widely adopted multimedia processor, which meant that ffprobe was also likely being used on the server (it was). If I could tailor my file to get through ffprobe locally, it would likely work when uploaded.
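
A local check along those lines could look like the Go sketch below, assuming ffprobe is installed and on the PATH. Which fields the server actually inspected is not stated here, so querying format=duration and comparing against a 60-second threshold are both assumptions, and probeDuration and parseDuration are hypothetical helper names.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
)

// parseDuration converts ffprobe's plain-text duration output into seconds.
func parseDuration(out string) (float64, error) {
	return strconv.ParseFloat(strings.TrimSpace(out), 64)
}

// probeDuration asks ffprobe for the container-level duration of path.
func probeDuration(path string) (float64, error) {
	out, err := exec.Command("ffprobe",
		"-v", "error", // only print errors besides the requested value
		"-show_entries", "format=duration",
		"-of", "default=noprint_wrappers=1:nokey=1",
		path).Output()
	if err != nil {
		return 0, fmt.Errorf("ffprobe failed: %w", err)
	}
	return parseDuration(string(out))
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: probe <file.mp4>")
		return
	}
	seconds, err := probeDuration(os.Args[1])
	if err != nil {
		fmt.Println(err)
		return
	}
	// The 60-second threshold is an assumption made for this sketch.
	fmt.Printf("reported duration: %.2f s (under 60: %v)\n", seconds, seconds < 60)
}
```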

Why Keyframes Matter

I was burning the midnight oil at this point, and the idea of keyframe manipulation had slipped my mind until it was mentioned by another developer.

Determining the actual length of a video without fully decoding it is inherently difficult due to how video data is structured:

  • Compressed Streams: Video files store compressed data streams that require decoding to accurately assess duration.
  • Keyframes and Indexing: Keyframes (I-frames) play a crucial role in indexing video data, but relying solely on them doesn't provide an accurate measure of duration.

Keyframes are also essential for:

  • Efficient Compression: Fewer keyframes can lead to better compression, reducing file size.
  • Seek Operations: Keyframes allow players to seek to different parts of the video without decoding the entire stream.

However, a low keyframe rate can make it harder for systems to analyze and determine the actual duration of a video without full decoding.

Changing My Approach

The normal keyframe interval for most video encodings is around 1 to 10 seconds of video, depending on the frame rate and the use case.

Common keyframe intervals:

  • 30 to 300 frames: For videos running at 30 fps, this corresponds to 1 to 10 seconds.
  • 60 to 250 frames: For 24 fps, this corresponds to 2.5 to roughly 10.4 seconds.
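
The conversion behind these figures is just frames divided by frame rate. A short Go sketch, with gopSeconds as a hypothetical helper name:

```go
package main

import "fmt"

// gopSeconds returns how many seconds of video one keyframe interval spans.
func gopSeconds(frames int, fps float64) float64 {
	return float64(frames) / fps
}

func main() {
	cases := []struct {
		frames int
		fps    float64
	}{
		{30, 30}, {300, 30}, // the 30 fps range above
		{60, 24}, {250, 24}, // the 24 fps range above
	}
	for _, c := range cases {
		fmt.Printf("%3d frames at %2.0f fps = %.2f s\n", c.frames, c.fps, gopSeconds(c.frames, c.fps))
	}
}
```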

Use cases:

  • Streaming: Lower intervals (1-2 seconds) are often preferred for better seeking and quick recovery from packet loss.
  • Stored video: Longer intervals (5-10 seconds) optimize file size, as fewer keyframes are needed.

For my use case, I decided on a keyframe interval of 60 frames in the hopes that this would strike a good balance between keeping the file small and ensuring the keyframe manipulation had the desired effect.

In general:

  • Low interval: Results in more keyframes, which can increase file size but improves the ability to seek through the video or recover from packet loss in streaming.
  • High interval: Fewer keyframes result in better compression and smaller file size but might make the video harder to seek through and less resilient to corruption or packet loss.

Moving the Process to Go

With the insights regarding keyframes, I decided to re-encode the file and set a relatively low keyframe interval with:

  • Keyframe Interval (-g 60): Sets the GOP size so that a keyframe appears at least every 60 frames (every 2 seconds for a 30 fps video). This keeps the keyframe spacing consistent throughout the file.
  • Minimum Keyframe Interval (-keyint_min 60): Forced a minimum of 60 frames between keyframes, preventing any unnecessary keyframe placements.

This approach would reduce the number of keyframes but not by a drastic amount, making the video harder to analyze for duration based on keyframe structure. Fewer keyframes also led to better compression, helping keep the file size under the 50 MB limit.

To handle the process, I wrote a Go script that:

  1. Re-encoded the video using ffmpeg:
    • Used ffmpeg to re-encode the video stream with the new keyframe interval while copying the audio stream as-is.
  2. Manipulated the needed MP4 metadata without touching the stts atom:
    • Utilized the mp4ff library to manipulate the mvhd and mdhd atoms.
    • Set these values to report a duration of 59 seconds, spoofing the video's length.

The Code

compressVideo Function:

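  • Uses ffmpeg to re-encode the video stream with the new keyframe interval while copying the audio stream untouched. The keyframe options are encoder settings, so they only take effect when the video stream is actually re-encoded; a pure stream copy would leave the existing keyframes where they are.
  • The -g and -keyint_min parameters set the GOP (Group of Pictures) size, effectively controlling keyframe frequency.
// Imports needed: "fmt", "os", and "os/exec".

// Keyframe manipulation + recompression assuming video has already been preoptimized to meet 50MB limit
func compressVideo(inputFile, outputFile string) error {
	cmd := exec.Command("ffmpeg",
		"-y", // overwrite the output file if it already exists
		"-i", inputFile,
		// -g and -keyint_min are encoder options, so the video stream must be
		// re-encoded for them to apply; with "-c:v copy" ffmpeg would leave the
		// existing keyframes untouched. libx264 is an assumed encoder choice.
		"-c:v", "libx264",
		"-c:a", "copy", // audio is passed through unchanged
		"-g", "60", // maximum GOP size: a keyframe at least every 60 frames
		"-keyint_min", "60", // minimum distance between keyframes
		outputFile)

	// Surface ffmpeg's own progress and error output.
	cmd.Stderr = os.Stderr
	cmd.Stdout = os.Stdout

	if err := cmd.Run(); err != nil {
		return fmt.Errorf("ffmpeg command failed: %w", err)
	}
	return nil
}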

manipulateMetadata Function:

  • Opens the MP4 file and decodes it using the mp4ff library.
  • Modifies the mvhd atom's Duration field based on the timescale to report 59 seconds.
  • Iterates over each trak (track) and modifies the mdhd atom's Duration field similarly.
  • Encodes and saves the modified MP4 file.

Key Fields:

  • {atom}.Timescale: Represents the number of "time units" per second of video. For example, if the timescale is set to 600, it means that 600 time units equal one second.
  • {atom}.Duration: Stores the total duration of the video, measured in the same time units defined by the timescale. This value is proportional to the length of the video.

The math here is: desiredDurationInSeconds * timescale (time units per second) = durationInTimeUnits
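
As a quick worked example of that formula, the Go sketch below computes the spoofed Duration value for a few timescales. durationTicks is a hypothetical helper, and the timescales are only sample values. The timescale is widened to 64 bits before multiplying so the product cannot overflow.

```go
package main

import "fmt"

// durationTicks converts a duration in whole seconds into timescale units.
func durationTicks(seconds uint64, timescale uint32) uint64 {
	return seconds * uint64(timescale)
}

func main() {
	// Sample timescales chosen for illustration only.
	for _, ts := range []uint32{600, 1000, 15360} {
		fmt.Printf("timescale %5d -> duration %d\n", ts, durationTicks(59, ts))
	}
}
```

For a timescale of 600, 59 seconds is 59 × 600 = 35,400 time units.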

// Imports needed: "errors", "fmt", "os", and "github.com/Eyevinn/mp4ff/mp4".

// spoofedSeconds is the duration, in seconds, that the headers will report.
const spoofedSeconds = 59

func manipulateMetadata(filePath string, outputFile string) error {
	// The input is only read, so a read-only handle is enough.
	file, err := os.Open(filePath)
	if err != nil {
		return fmt.Errorf("failed to open file: %w", err)
	}
	defer file.Close()

	mp4File, err := mp4.DecodeFile(file)
	if err != nil {
		return fmt.Errorf("failed to decode mp4 file: %w", err)
	}

	// Guard against input that has no moov atom at all.
	if mp4File.Moov == nil {
		return errors.New("moov atom not found")
	}

	mvhd := mp4File.Moov.Mvhd
	if mvhd == nil {
		return errors.New("mvhd atom not found")
	}
	if mvhd.Timescale == 0 {
		return errors.New("invalid mvhd timescale detected")
	}
	fmt.Println("Found mvhd atom, modifying duration...")
	// Widen the 32-bit timescale before multiplying so the product cannot overflow.
	mvhd.Duration = spoofedSeconds * uint64(mvhd.Timescale)

	for _, trak := range mp4File.Moov.Traks {
		if trak.Mdia == nil || trak.Mdia.Mdhd == nil {
			return errors.New("mdhd atom not found in track")
		}
		mdhd := trak.Mdia.Mdhd
		if mdhd.Timescale == 0 {
			return errors.New("invalid mdhd timescale in track detected")
		}
		fmt.Println("Found mdhd atom, modifying duration...")
		mdhd.Duration = spoofedSeconds * uint64(mdhd.Timescale)
	}

	output, err := os.Create(outputFile)
	if err != nil {
		return fmt.Errorf("failed to create output file: %w", err)
	}
	defer output.Close()

	// Write the full file back out with the modified headers.
	if err := mp4File.Encode(output); err != nil {
		return fmt.Errorf("failed to encode modified mp4 file: %w", err)
	}

	return nil
}

Success

The Fix

My initial recommendation was to tighten up file integrity checks to avoid similar exploits in the future. After discussing with Jaz, they added a few extra validation measures for their encoding process, ensuring that the duration of the encoded video meets expectations.

After the fix was implemented, I tested the patch and confirmed my approach no longer works.

Conclusion

And so, for a brief moment, we had a 6-minute video on the timeline. It got through all the validation checks and the video played back without errors, while also displaying its true duration within the Bluesky embed player. The file has since been purged from Bluesky. F.

Was this fun? Yeah. Do I wish I got a full night's sleep instead? Probably.

Stay MOTIVATED chat.

✌️

kawaii.social
mishawawa

