My Sad Existence - Building a Video Player with MSE

Part V: Manifests

Let's step back and just take a look at the code we've written. The code seems full of an awful lot of assumptions about the structure of the video stream. A short list of assumptions we've made:

  • We've assumed that all segments are 4 seconds long
  • We've assumed there is a single audio track
  • We've assumed the structure and format of the URL for each segment

What if we wanted to try and play back an entirely different video? We would have to rewrite a ton of code just to support this.

Luckily for us, there are several standard formats that can describe all of the things we've assumed and hardcoded into our player. The two heavy hitters in this space are DASH (Dynamic Adaptive Streaming over HTTP) and HLS (HTTP Live Streaming). DASH is an open international standard (ISO/IEC 23009-1), while HLS was developed by Apple and is documented in RFC 8216.

Dynamic Adaptive Streaming over HTTP (DASH)

DASH is an XML-based format that describes the structure of your stream. The manifest itself is called a Media Presentation Description (MPD), and the format is specified in ISO/IEC 23009-1.

A DASH stream is broken down into logical chunks of segments called Periods. Each Period contains one or more AdaptationSets, which contain one or more Representations. Each Representation is a different version of the same content, but at a different quality level. Looking at a small example is the easiest way to see how this all fits together:

<?xml version="1.0" encoding="utf-8"?>
<MPD xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="urn:mpeg:dash:schema:mpd:2011" xmlns:xlink="http://www.w3.org/1999/xlink" xsi:schemaLocation="urn:mpeg:DASH:schema:MPD:2011 http://standards.iso.org/ittf/PubliclyAvailableStandards/MPEG-DASH_schema_files/DASH-MPD.xsd" profiles="urn:mpeg:dash:profile:isoff-live:2011" type="static" mediaPresentationDuration="PT3M59.9S" maxSegmentDuration="PT4.0S" minBufferTime="PT8.0S">
<Period id="0" start="PT0.0S">
<AdaptationSet id="0" contentType="video" startWithSAP="1" segmentAlignment="true" bitstreamSwitching="true" frameRate="27484/917" maxWidth="1280" maxHeight="960" par="4:3" lang="und">
<Representation id="0" mimeType="video/mp4" codecs="avc1.42c00d" bandwidth="350000" width="320" height="240" sar="1:1">
<SegmentTemplate timescale="27484" initialization="init-$RepresentationID$-$Bandwidth$.mp4" media="segment-$RepresentationID$-$Bandwidth$-$Number$.mp4" startNumber="1">
<SegmentTimeline>
<S t="0" d="110040" r="58" />
<S d="103621" />
</SegmentTimeline>
</SegmentTemplate>
</Representation>
<Representation id="1" mimeType="video/mp4" codecs="avc1.42c01e" bandwidth="1000000" width="640" height="480" sar="1:1">
<SegmentTemplate timescale="27484" initialization="init-$RepresentationID$-$Bandwidth$.mp4" media="segment-$RepresentationID$-$Bandwidth$-$Number$.mp4" startNumber="1">
<SegmentTimeline>
<S t="0" d="110040" r="58" />
<S d="103621" />
</SegmentTimeline>
</SegmentTemplate>
</Representation>
<Representation id="2" mimeType="video/mp4" codecs="avc1.42c020" bandwidth="3000000" width="1280" height="960" sar="1:1">
<SegmentTemplate timescale="27484" initialization="init-$RepresentationID$-$Bandwidth$.mp4" media="segment-$RepresentationID$-$Bandwidth$-$Number$.mp4" startNumber="1">
<SegmentTimeline>
<S t="0" d="110040" r="58" />
<S d="103621" />
</SegmentTimeline>
</SegmentTemplate>
</Representation>
</AdaptationSet>
<AdaptationSet id="1" contentType="audio" startWithSAP="1" segmentAlignment="true" bitstreamSwitching="true" lang="eng">
<Representation id="3" mimeType="audio/mp4" codecs="mp4a.40.2" bandwidth="128000" audioSamplingRate="48000">
<AudioChannelConfiguration schemeIdUri="urn:mpeg:dash:23003:3:audio_channel_configuration:2011" value="2" />
<SegmentTemplate timescale="27484" initialization="init-$RepresentationID$-$Bandwidth$.mp4" media="segment-$RepresentationID$-$Bandwidth$-$Number$.mp4" startNumber="1">
<SegmentTimeline>
<S t="0" d="110040" r="58" />
<S d="103621" />
</SegmentTimeline>
</SegmentTemplate>
</Representation>
</AdaptationSet>
</Period>
</MPD>

Look familiar? This is the same content we've been working with, but now described as a DASH manifest. Each separate stream (video, audio) is described by an AdaptationSet. Each quality level within a stream is described by a Representation, so the video AdaptationSet above holds three of them. The SegmentTemplate describes the structure of the URLs for the segments. If, for example, we had multiple audio quality levels, we would have multiple Representation elements inside the audio AdaptationSet.
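To make the nesting concrete, here is a rough sketch of how a player might model the manifest in TypeScript once it has been parsed. The interface names and the `representationsFor` and `sourceBufferType` helpers are my own inventions for illustration, not part of DASH or MSE. The values are copied by hand from the manifest above, and parsing the XML is left out entirely.

```typescript
// Hypothetical types mirroring the Period -> AdaptationSet -> Representation nesting.
interface Representation {
  id: string;
  mimeType: string;
  codecs: string;
  bandwidth: number; // bits per second, from the bandwidth attribute
  width?: number;    // only present on video
  height?: number;   // only present on video
}

interface AdaptationSet {
  id: string;
  contentType: "video" | "audio";
  lang: string;
  representations: Representation[];
}

interface Period {
  id: string;
  adaptationSets: AdaptationSet[];
}

// Values copied by hand from the manifest above.
const examplePeriod: Period = {
  id: "0",
  adaptationSets: [
    {
      id: "0",
      contentType: "video",
      lang: "und",
      representations: [
        { id: "0", mimeType: "video/mp4", codecs: "avc1.42c00d", bandwidth: 350000, width: 320, height: 240 },
        { id: "1", mimeType: "video/mp4", codecs: "avc1.42c01e", bandwidth: 1000000, width: 640, height: 480 },
        { id: "2", mimeType: "video/mp4", codecs: "avc1.42c020", bandwidth: 3000000, width: 1280, height: 960 },
      ],
    },
    {
      id: "1",
      contentType: "audio",
      lang: "eng",
      representations: [
        { id: "3", mimeType: "audio/mp4", codecs: "mp4a.40.2", bandwidth: 128000 },
      ],
    },
  ],
};

// Collects every Representation offered for one kind of stream.
function representationsFor(period: Period, contentType: "video" | "audio"): Representation[] {
  const result: Representation[] = [];
  for (const set of period.adaptationSets) {
    if (set.contentType !== contentType) continue;
    for (const rep of set.representations) {
      result.push(rep);
    }
  }
  return result;
}

// Builds the MIME string format that MediaSource.isTypeSupported() and
// addSourceBuffer() accept, from the mimeType and codecs attributes.
function sourceBufferType(rep: Representation): string {
  return `${rep.mimeType}; codecs="${rep.codecs}"`;
}
```

With this in place, the lowest video quality level produces the string `video/mp4; codecs="avc1.42c00d"`, which is the kind of value we have been hardcoding so far.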

Segment Templates

A lot of the information in the manifest is fairly straightforward, but one area worth expanding upon is the SegmentTemplate:

<SegmentTemplate timescale="27484" initialization="init-$RepresentationID$-$Bandwidth$.mp4" media="segment-$RepresentationID$-$Bandwidth$-$Number$.mp4" startNumber="1">
<SegmentTimeline>
<S t="0" d="110040" r="58" />
<S d="103621" />
</SegmentTimeline>
</SegmentTemplate>

As I mentioned above, this is the structure that describes each segment and how to build the URL for each segment.

  • timescale is the number of units per second used for any time values in this template.
  • initialization is the URL template for the initialization segment.
  • media is the URL template for the media segments.
  • startNumber is the number to start counting from when building the URLs.

The actual segments are described by the SegmentTimeline. Each S element describes a segment, with t being the start time of the segment in units of timescale, d being the duration of the segment in units of timescale, and r being the number of additional times to repeat this segment.

Translating this timeline, the first entry starts at time 0, has a duration of 110040 / 27484 ≈ 4.004 seconds, and repeats 58 additional times, so it describes 59 segments. The second entry has no t, so it starts where the previous segment ends; it has a duration of 103621 / 27484 ≈ 3.77 seconds and does not repeat. That gives 60 segments in total.
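The same translation can be sketched in code. This is a minimal version under a few assumptions: the `TimelineEntry` and `SegmentTiming` types and the `expandTimeline` function are hypothetical names of my own, the S elements are assumed to be already parsed into plain objects, and negative values of r (which the specification allows as an open-ended repeat) are not handled.

```typescript
// One <S> element. t and r are optional in the manifest.
interface TimelineEntry {
  t?: number; // start time, in timescale units
  d: number;  // duration, in timescale units
  r?: number; // number of additional repeats
}

interface SegmentTiming {
  number: number;   // the value later substituted for $Number$
  start: number;    // seconds
  duration: number; // seconds
}

// Expands the compact timeline into one entry per segment.
function expandTimeline(
  entries: TimelineEntry[],
  timescale: number,
  startNumber: number
): SegmentTiming[] {
  const result: SegmentTiming[] = [];
  let cursor = 0; // running start time, in timescale units
  let nextNumber = startNumber;

  for (const entry of entries) {
    // An explicit t sets the start time; otherwise the entry
    // continues from where the previous segment ended.
    if (entry.t !== undefined) {
      cursor = entry.t;
    }
    // r counts *additional* repeats, so there is always at least one segment.
    const count = 1 + (entry.r === undefined ? 0 : entry.r);
    for (let i = 0; i < count; i++) {
      result.push({
        number: nextNumber,
        start: cursor / timescale,
        duration: entry.d / timescale,
      });
      nextNumber += 1;
      cursor += entry.d;
    }
  }
  return result;
}

// The timeline from the manifest above.
const exampleTimeline: TimelineEntry[] = [
  { t: 0, d: 110040, r: 58 },
  { d: 103621 },
];
const expandedSegments = expandTimeline(exampleTimeline, 27484, 1);
```

For our manifest, `expandedSegments` holds 60 entries numbered 1 through 60, and their durations add up to just under 240 seconds, which lines up with the mediaPresentationDuration of PT3M59.9S.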

Building the URL is pretty straightforward. Any value surrounded by $ is a variable. The DASH specification (section 5.3.9.4.4) describes the identifiers that can appear in a URL template. In this manifest, $RepresentationID$ comes from the id attribute of the parent Representation element, and $Bandwidth$ comes from its bandwidth attribute. $Number$ is the number of the segment, counting up from startNumber.
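A small sketch of that substitution follows. The `fillTemplate` function and the `TemplateValues` type are hypothetical names of my own, and the sketch only covers the three identifiers this manifest uses. The specification defines more than this (for example $Time$, the $$ escape, and width formatting such as $Number%05d$), none of which is handled here.

```typescript
// Values available for substitution. The property names match the
// identifiers that appear between $ signs in this manifest.
interface TemplateValues {
  RepresentationID: string;
  Bandwidth: number;
  Number?: number; // not needed for the initialization segment
}

// Replaces each $Identifier$ in the template with its value.
// Identifiers without a supplied value are left untouched.
function fillTemplate(template: string, values: TemplateValues): string {
  const lookup: Record<string, string | number | undefined> = {
    RepresentationID: values.RepresentationID,
    Bandwidth: values.Bandwidth,
    Number: values.Number,
  };
  return template.replace(
    /\$(RepresentationID|Bandwidth|Number)\$/g,
    (match: string, name: string) => {
      const value = lookup[name];
      return value === undefined ? match : String(value);
    }
  );
}

// Representation id="0" from the manifest above.
const initUrl = fillTemplate("init-$RepresentationID$-$Bandwidth$.mp4", {
  RepresentationID: "0",
  Bandwidth: 350000,
});
const firstSegmentUrl = fillTemplate(
  "segment-$RepresentationID$-$Bandwidth$-$Number$.mp4",
  { RepresentationID: "0", Bandwidth: 350000, Number: 1 }
);
```

Here `initUrl` comes out as `init-0-350000.mp4` and `firstSegmentUrl` as `segment-0-350000-1.mp4`. Both are relative URLs, so a real player would still need to resolve them against wherever the manifest was loaded from.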