
My Sad Existence - Building a Video Player with MSE

Part II - Introducing Sound

When are you going to answer all of the questions I have from the last article?

I'm getting there, I promise. The whole process is far too complicated to absorb all at once, so this is going to be an exercise in acceptance of the unknown as we continue to build.

Echoes of Thomas Edison - Adding sound to our video player

A SourceBuffer is only able to handle a single specific codec at a time. This means that if we want to add audio to our video player, we need to create a separate SourceBuffer for the audio track.

I've used the mysterious and wonderful FFmpeg to separate the audio from the video here, and the audio has been sliced into 4-second segments matching the segment size of the video.

note

The audio and video don't need to share the same segment size, and segment sizes don't need to be consistent throughout the video either! To simplify the code, however, I've unified around 4-second segments.

The player has been slightly rearchitected from the previous example to handle the audio track. Each track (audio and video) is represented by a SourceBufferEngine object, which is responsible for fetching and appending data to the SourceBuffer for that track. Note that both of these engines are completely independent from one another, except for the fact that they share the same MediaSource object. You can see now why the segment sizes need not necessarily be the same between the two tracks.
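To make that architecture a little more concrete, here is a minimal sketch of what such an engine might look like. This is not the player's actual code: the `BufferLike` and `MediaSourceLike` interfaces, the `segmentUrls` helper, the file names, and the `mp4a.40.2` audio codec string are all assumptions made for illustration. The interfaces only mirror the small part of the browser's MediaSource and SourceBuffer API that the sketch touches.

```typescript
// Hypothetical structural types mirroring the subset of the MSE API used below.
interface BufferLike {
  appendBuffer(data: ArrayBuffer): void;
  addEventListener(type: "updateend", listener: () => void, options?: { once: boolean }): void;
}

interface MediaSourceLike {
  addSourceBuffer(mimeType: string): BufferLike;
}

type FetchBytes = (url: string) => Promise<ArrayBuffer>;

class SourceBufferEngine {
  private readonly buffer: BufferLike;
  private readonly urls: string[];
  private readonly fetchBytes: FetchBytes;

  constructor(mediaSource: MediaSourceLike, mimeType: string, urls: string[], fetchBytes: FetchBytes) {
    // Each engine owns exactly one SourceBuffer on the shared MediaSource.
    this.buffer = mediaSource.addSourceBuffer(mimeType);
    this.urls = urls;
    this.fetchBytes = fetchBytes;
  }

  // Fetch and append every segment in order (init segment first).
  async run(): Promise<void> {
    for (const url of this.urls) {
      const data = await this.fetchBytes(url);
      await this.append(data);
    }
  }

  // appendBuffer is asynchronous: wait for "updateend" before appending again.
  private append(data: ArrayBuffer): Promise<void> {
    return new Promise<void>((resolve) => {
      this.buffer.addEventListener("updateend", () => resolve(), { once: true });
      this.buffer.appendBuffer(data);
    });
  }
}

// Hypothetical file layout: one init segment followed by numbered media segments.
function segmentUrls(track: string, count: number): string[] {
  const urls = [`${track}/init.mp4`];
  for (let i = 1; i <= count; i++) {
    urls.push(`${track}/segment-${i}.m4s`);
  }
  return urls;
}

// Two independent engines that share nothing but the MediaSource.
function startEngines(mediaSource: MediaSourceLike, fetchBytes: FetchBytes, count: number): Promise<void[]> {
  const video = new SourceBufferEngine(
    mediaSource, 'video/mp4; codecs="avc1.42c00d"', segmentUrls("video", count), fetchBytes);
  const audio = new SourceBufferEngine(
    mediaSource, 'audio/mp4; codecs="mp4a.40.2"', segmentUrls("audio", count), fetchBytes); // audio codec assumed
  return Promise.all([video.run(), audio.run()]);
}
```

In a browser you would pass in a real MediaSource once its `sourceopen` event has fired, plus a `fetchBytes` along the lines of `(url) => fetch(url).then((r) => r.arrayBuffer())`. Because each engine only waits on its own SourceBuffer, neither track cares how the other one is segmented.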

Detour - What is an init segment?

An init segment is a special segment that contains metadata about the video or audio track. It contains no actual video or audio data, but instead contains all of the information necessary for the browser to initialize the decoder for the track.

Using mp4ff to see what's inside an init segment

mp4ff is a pretty sweet tool for parsing and writing MP4 files. It's written in Go -- but don't worry, you don't need to know Go to use it.

To get started, install Go and clone the repo. Once you've done this, you can download the sample init segment here and run the following command inside the mp4ff repo:

go run cmd/mp4ff-info/main.go ~/Downloads/init.mp4

Change the path to the downloaded init segment accordingly if you've placed it somewhere else.

You should see some pretty gnarly looking output:

[ftyp] size=28
- majorBrand: iso5
- minorVersion: 512
- compatibleBrand: iso5
- compatibleBrand: iso6
- compatibleBrand: mp41
[moov] size=819
[mvhd] size=108 version=0 flags=000000
- timeScale: 1000
- duration: 0
- creation time: 0
- modification time: 0
[trak] size=566
[tkhd] size=92 version=0 flags=000003
- trackID: 1
- duration: 0
- creation time: 0
- modification time: 0
- Width: 320.0, Height: 240.0
[edts] size=36
[elst] size=28 version=0 flags=000000
- entry[1]: segmentDuration=0 mediaTime=0, mediaRateInteger=1 mediaRateFraction=0
[mdia] size=430
[mdhd] size=32 version=0 flags=000000
- timeScale: 27484
- creation time: 0
- modification time: 0
- language: und
[hdlr] size=45 version=0 flags=000000
- handlerType: vide
- handlerName: "VideoHandler"
[minf] size=345
[vmhd] size=20 version=0 flags=000001
[dinf] size=36
[dref] size=28 version=0 flags=000000
[url ] size=12 version=0 flags=000001
[stbl] size=281
[stsd] size=205 version=0 flags=000000
[avc1] size=189
- width: 320
- height: 240
- compressorName: ""
[avcC] size=48
- AVCProfileIndication: 66
- profileCompatibility: c0
- AVCLevelIndication: 13
- SPS: 6742c00dd90141fb016a020202800001ca80006b5c078a1524
- PPS: 68cb8cb2
[colr] size=19
- colorType: nclx
- ColorPrimaries: 1, TransferCharacteristics: 1, MatrixCoefficients: 1, FullRange: false
[pasp] size=16
- hSpacing:vSpacing: 1:1
[btrt] size=20
- bufferSizeDB: 0
- maxBitrate: 350000
- AvgBitrate: 350000
[stts] size=16 version=0 flags=000000
[stsc] size=16 version=0 flags=000000
[stsz] size=20 version=0 flags=000000
[stco] size=16 version=0 flags=000000
[mvex] size=40
[trex] size=32 version=0 flags=000000
- trackID: 1
- defaultSampleDescriptionIndex: 1
- defaultSampleDuration: 0
- defaultSampleSize: 0
- defaultSampleFlags: 00000000 (isLeading=0 dependsOn=0 isDependedOn=0 hasRedundancy=0 padding=0 isNonSync=false degradationPriority=0)
[udta] size=97
[meta] size=89 version=0 flags=000000
[hdlr] size=33 version=0 flags=000000
- handlerType: mdir
- handlerName: ""
[ilst] size=44
[©too] size=36
[data] size=28
- data: Lavf61.9.108

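Uh oh! In short, an MP4 file is broken down into a series of 'boxes', each describing a different component of the content and carrying its own metadata. Boxes can also be nested inside other boxes, which is why moov is followed by so many smaller boxes in the output above. This article is a great jumping off point for the low level details here.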
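If you're curious how a tool like mp4ff finds those boxes in the first place, the header format is pretty approachable: every box starts with a 4-byte big-endian size (which includes the header itself) followed by a 4-character type. The sketch below walks a byte range and lists the boxes it finds. The `listBoxes` name and the `BoxHeader` shape are made up for this example, and it only reads headers, not box contents.

```typescript
interface BoxHeader {
  type: string;   // four-character code, e.g. "moov"
  offset: number; // byte offset of the box header
  size: number;   // total box size in bytes, header included
}

// List the boxes found between `start` and `end` without descending into them.
function listBoxes(bytes: Uint8Array, start: number = 0, end: number = bytes.byteLength): BoxHeader[] {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const boxes: BoxHeader[] = [];
  let offset = start;

  while (offset + 8 <= end) {
    let size = view.getUint32(offset); // DataView reads big-endian by default
    const type = String.fromCharCode(
      view.getUint8(offset + 4),
      view.getUint8(offset + 5),
      view.getUint8(offset + 6),
      view.getUint8(offset + 7),
    );

    if (size === 1) {
      // A size of 1 means the real size is a 64-bit value after the type.
      if (offset + 16 > end) break;
      size = view.getUint32(offset + 8) * 4294967296 + view.getUint32(offset + 12);
    } else if (size === 0) {
      // A size of 0 means the box runs to the end of the enclosing range.
      size = end - offset;
    }

    if (size < 8 || offset + size > end) break; // malformed or truncated box
    boxes.push({ type, offset, size });
    offset += size;
  }
  return boxes;
}
```

Calling `listBoxes(bytes)` on the init segment should give you the top-level ftyp and moov boxes. To look inside a plain container such as moov, call it again with `start` set to the box's offset plus 8 (skipping the header) and `end` set to the box's offset plus its size.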

Determining the MSE codec string

What we are particularly interested in here is the avcC box. This box contains all of the required information about the video codec that the browser needs to initialize the decoder. I've encoded this stream using H.264, a compression format also known as AVC (Advanced Video Coding); avc1 is the name of the MP4 sample entry that carries it. The avcC box lives inside that avc1 entry, which in turn is contained within the Sample Description (stsd) box. This information is what we can use to build the codecs="avc1.42c00d" part of the MIME type for the video track. More information about the composition of this codec string for other video/audio formats can be found on MDN.

note

This web based utility is also pretty handy if you are just looking to figure out the codec string for a given video file.

The information about the specific H.264 encoding profile can be found in this box.

[avcC] size=48
- AVCProfileIndication: 66
- profileCompatibility: c0
- AVCLevelIndication: 13
- SPS: 6742c00dd90141fb016a020202800001ca80006b5c078a1524
- PPS: 68cb8cb2

The codec string for H.264 is built as follows (RFC6190 Section 7.1 and the MDN docs are also helpful here):

  • avc1 - The codec type
  • 42 - The profile indication in hex (mp4ff will give you the decimal value). A profile indicates the specific algorithms and encoding techniques that were used to encode the video. A given decoder may only support a subset of AVC profiles. 42 here refers to the Baseline profile, which is the most widely supported profile.
  • c0 - The profile compatibility in hex (mp4ff already gives you the hex value). Each bit in this byte is a constraint flag. Since the 0x40 bit (constraint_set1_flag) is set, this indicates that we are using the 'constrained' baseline profile. This is a subset of the baseline profile that is intended to provide the widest possible compatibility, especially for devices with limited processing power. The MSB (0x80) doesn't matter here according to RFC6190, and could either be 0 or 1. In other words, 40 would also be acceptable here.
  • 0d - The level indication in hex (mp4ff will give you the decimal value). The value is the level number multiplied by 10, so 0d (13 in decimal) means level 1.3. The level, in short, specifies the maximum bitrate and resolution, amongst other things, that the decoder must be able to handle in order to play back the content. Increasing levels mean that the decoder must be able to handle more data per second. If a decoder indicates support for a given level, it must also be able to support all lower levels.
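Putting the three pieces together, here is a small sketch that turns the values mp4ff prints for the avcC box into the codec string and the full MIME type. The `AvcConfig` interface and the function names are hypothetical, made up for this example. The level helper also assumes a plain level with no special cases, which is all we need for this stream.

```typescript
// Field names are illustrative; the values come from mp4ff's avcC output.
interface AvcConfig {
  profileIndication: number;    // AVCProfileIndication, printed in decimal (66)
  profileCompatibility: number; // profileCompatibility, printed in hex (c0)
  levelIndication: number;      // AVCLevelIndication, printed in decimal (13)
}

// Format one byte as two lowercase hex digits.
function toHexByte(value: number): string {
  if (!Number.isInteger(value) || value < 0 || value > 0xff) {
    throw new RangeError(`expected a single byte, got ${value}`);
  }
  return ("0" + value.toString(16)).slice(-2);
}

// avc1.PPCCLL: profile, constraint flags, level.
function avcCodecString(config: AvcConfig): string {
  return "avc1."
    + toHexByte(config.profileIndication)
    + toHexByte(config.profileCompatibility)
    + toHexByte(config.levelIndication);
}

function videoMimeType(config: AvcConfig): string {
  return `video/mp4; codecs="${avcCodecString(config)}"`;
}

// Baseline (0x42) with the 0x40 constraint flag set is Constrained Baseline.
function isConstrainedBaseline(config: AvcConfig): boolean {
  return config.profileIndication === 0x42 && (config.profileCompatibility & 0x40) !== 0;
}

// The level indication is the level number multiplied by 10.
function levelNumber(config: AvcConfig): string {
  return (config.levelIndication / 10).toFixed(1);
}

const sample: AvcConfig = {
  profileIndication: 66,
  profileCompatibility: 0xc0,
  levelIndication: 13,
};

console.log(avcCodecString(sample)); // avc1.42c00d
console.log(videoMimeType(sample));  // video/mp4; codecs="avc1.42c00d"
console.log(levelNumber(sample));    // 1.3
```

The resulting MIME type is the string you would hand to `MediaSource.isTypeSupported()` and `addSourceBuffer()` for the video track.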