
My Sad Existence - Building a Video Player with MSE

Part III - A detour into encoding

How did I split my source video into chunks? How did I create the init segments? How did I separate the audio from the video? How did I specify the codecs used for all of this?

Let me give you a quick and dirty introduction to the world of FFmpeg.

Installation & Preparation

You can grab FFmpeg from the official website.

Test that it's working by running ffmpeg -version. You should see something like this:

ffmpeg version 7.1 Copyright (c) 2000-2024 the FFmpeg developers
built with Apple clang version 16.0.0 (clang-1600.0.26.4)
configuration: --prefix=/opt/homebrew/Cellar/ffmpeg/7.1_4 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags='-Wl,-ld_classic' --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libaribb24 --enable-libbluray --enable-libdav1d --enable-libharfbuzz --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-audiotoolbox --enable-neon
libavutil 59. 39.100 / 59. 39.100
libavcodec 61. 19.100 / 61. 19.100
libavformat 61. 7.100 / 61. 7.100
libavdevice 61. 3.100 / 61. 3.100
libavfilter 10. 4.100 / 10. 4.100
libswscale 8. 3.100 / 8. 3.100
libswresample 5. 3.100 / 5. 3.100
libpostproc 58. 3.100 / 58. 3.100

I started with a single .mp4 containing both audio and video. The source video can be found here.
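
Before slicing anything up, it can be handy to inspect the input with ffprobe (a tool that ships alongside ffmpeg) to confirm its streams, resolution, and frame rate:

ffprobe -v error -show_format -show_streams radio_free_europe.mp4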

Slicing into segments

Let's just start off with a terrifyingly long command:

ffmpeg -i radio_free_europe.mp4 \
-x264-params scenecut=0:min-keyint=120:keyint=120 \
-map 0:v:0 -b:v:0 350k -c:v:0 libx264 -filter:v:0 'scale=320:-1' \
-map 0:v:0 -b:v:1 1000k -c:v:1 libx264 -filter:v:1 'scale=640:-1' \
-map 0:v:0 -b:v:2 3000k -c:v:2 libx264 -filter:v:2 'scale=1280:-1' \
-profile:v baseline \
-map 0:a:0 -c:a aac -profile:a aac_low \
-media_seg_name 'segment-$RepresentationID$-$Bandwidth$-$Number$.mp4' \
-init_seg_name 'init-$RepresentationID$-$Bandwidth$.mp4' \
-seg_duration 4 \
-f dash converted-ffmpeg/output.mpd

It is nearly as scary as it looks. Maybe 90% as scary as it looks. A generic translation of this command into something human readable is as follows:

"Hey ffmpeg, take this video file and convert it into three output streams with different quality levels, encoding each with H264. Also, take the audio and encode it in AAC. Take the resulting output streams and break them into chunks with the specified naming templates."

Breaking it down line by line:

ffmpeg -i radio_free_europe.mp4

Run ffmpeg with radio_free_europe.mp4 as the input file.

-x264-params scenecut=0:min-keyint=120:keyint=120

This is the first bit of real meat. We are encoding our video as H.264 (via the x264 encoder) for wide compatibility, and these parameters are specific to x264. To summarize at the highest level, x264 encodes your video into groups of individual frames. Each of these groups is called a GOP (Group of Pictures). Most frames in an x264-encoded video file merely store the differences in image content relative to other frames. Scattered throughout are standalone full pictures called keyframes. All of these parameters give the encoder instructions about where to insert those keyframes.

  • scenecut=0 - Consider a movie scene consisting of several camera shots. Within a single shot, most frames are quite similar to each other, so it is most efficient to use few keyframes and store per-frame differences. Once an actual camera cut happens, though, it becomes more efficient to insert a new keyframe and store differences relative to the new shot's keyframe. For streaming, we instead want keyframes at predictable, regular intervals, so this setting tells the encoder not to insert extra keyframes when it detects a scene cut.
  • min-keyint=120 - Minimum number of frames between keyframes. This video is 30fps, so 120 frames is 4 seconds.
  • keyint=120 - Maximum number of frames between keyframes. Setting the minimum and maximum to the same value (with scenecut disabled) forces a keyframe exactly every 4 seconds.

These parameters apply to all of the x264 video encoding in this ffmpeg command.
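
If you'd like to sanity-check the keyframe cadence, one approach is to do a quick non-segmented test encode with the same parameters and then list its keyframe timestamps with ffprobe (out.mp4 here is a hypothetical test file, not something produced by the command above):

ffprobe -v error -select_streams v:0 -skip_frame nokey \
-show_entries frame=pts_time -of csv=p=0 out.mp4

With these settings, the printed timestamps should land roughly 4 seconds apart.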

-map 0:v:0 -b:v:0 350k -c:v:0 libx264 -filter:v:0 'scale=320:-1'

This line creates a new output stream 0 from video stream 0 of the input (the only video stream in the input). It sets the bitrate to 350k and selects libx264 as the encoder. Finally, we scale the video to 320 pixels wide; the -1 means the height will be calculated automatically to preserve the aspect ratio.
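
For example, assuming a 16:9 source such as 1920x1080 (the actual source resolution may differ), scale=320:-1 yields a 320x180 output, since 320 × (1080 / 1920) = 180.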

The next two lines perform a similar task:

-map 0:v:0 -b:v:1 1000k -c:v:1 libx264 -filter:v:1 'scale=640:-1'
-map 0:v:0 -b:v:2 3000k -c:v:2 libx264 -filter:v:2 'scale=1280:-1'

The first creates output stream 1 at bitrate 1000k, scaled to 640 pixels wide. The second creates output stream 2 at bitrate 3000k, scaled to 1280 pixels wide.

Altogether, these commands create three new output streams, each taking the same input video stream and encoding it to a different quality level.

-profile:v baseline

As mentioned in the previous article, the baseline profile is the most widely supported profile for streaming video. It may not be used by default, though, so we have to specify it here.

-map 0:a:0 -c:a aac -profile:a aac_low

Similar to the above, this creates one output stream using the input audio stream. We'd like the audio to be encoded in AAC LC (Low Complexity) for the highest compatibility.
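
If you want to confirm that both profiles came out as intended after encoding, ffprobe can report them. The init segment filenames below are hypothetical; substitute whatever your run actually produces:

ffprobe -v error -select_streams v:0 \
-show_entries stream=codec_name,profile converted-ffmpeg/init-0-350000.mp4
ffprobe -v error -select_streams a:0 \
-show_entries stream=codec_name,profile converted-ffmpeg/init-3-128000.mp4

You should see h264 with a (Constrained) Baseline profile for the video and aac with profile LC for the audio.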

-media_seg_name 'segment-$RepresentationID$-$Bandwidth$-$Number$.mp4'

This is a template for the filenames of the media segments. The $RepresentationID$, $Bandwidth$, and $Number$ are all variables that will be replaced with the actual values of the representation ID, bandwidth, and segment number.

Info: You likely aren't familiar with 'Representation ID' at this stage, but that's ok! We will get there.

-init_seg_name 'init-$RepresentationID$-$Bandwidth$.mp4'

This is a template for the filenames of the init segments. Variables are similar to above.
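
Putting the two templates together, the output directory should end up looking roughly like this (the representation IDs and bandwidth values are illustrative; the audio bandwidth in particular depends on ffmpeg's default AAC bitrate):

converted-ffmpeg/
  output.mpd
  init-0-350000.mp4
  init-1-1000000.mp4
  init-2-3000000.mp4
  init-3-128000.mp4
  segment-0-350000-1.mp4
  segment-0-350000-2.mp4
  ...
  segment-3-128000-1.mp4
  ...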

-seg_duration 4

This is the desired maximum segment duration in seconds. We are arbitrarily using 4 second segments here. Note that this matches our keyframe interval (120 frames ÷ 30 fps = 4 seconds); since each segment has to start on a keyframe, aligning the two lets ffmpeg cut segments at exactly the requested length.

-f dash converted-ffmpeg/output.mpd

The explanation of this line will be purposely a bit vague, so forgive me in advance.

This line tells ffmpeg to output a DASH manifest. DASH is a standard, human readable and text based file format for describing segmented adaptive streaming media. The usefulness of this format will become very apparent in the coming articles.

For now, all that needs to be known is that this is what tells ffmpeg to break our video and audio streams into small chunks.
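
To make "human readable and text based" concrete, here is a heavily trimmed sketch of what the generated output.mpd might look like (the real manifest carries many more attributes, and exact values depend on your encode):

<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static" ...>
  <Period>
    <AdaptationSet contentType="video" ...>
      <SegmentTemplate initialization="init-$RepresentationID$-$Bandwidth$.mp4"
                       media="segment-$RepresentationID$-$Bandwidth$-$Number$.mp4" ... />
      <Representation id="0" bandwidth="350000" width="320" ... />
      <Representation id="1" bandwidth="1000000" width="640" ... />
      <Representation id="2" bandwidth="3000000" width="1280" ... />
    </AdaptationSet>
    <AdaptationSet contentType="audio" ...>
      <Representation id="3" bandwidth="..." codecs="mp4a.40.2" ... />
    </AdaptationSet>
  </Period>
</MPD>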