
My Sad Existence - Building a Video Player with MSE

Part I: Introduction

Why must video playback be so complicated?

In the not-so-distant past, a developer could (and often did) simply slap a <video> element on a webpage (or use Flash), collect a paycheck, and then go home to their family. Such a simple, innocent time in history. But then, the world changed. The <video> element was no longer enough.

The world demanded more.

People wanted to stream video from cellular connections.

People wanted to stream an entire movie without waiting an hour for it to buffer.

The simplest possible way to play video in a browser

Here's the simplest possible way to play a video in the browser -- a <video> element whose src attribute points directly at a video file.

Is this experience acceptable? What if my internet connection changes mid-playback? What if the video is 800 MB and I only want to watch the first 5 minutes?

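This is acceptable for neither the user nor the server. It's wasteful on both ends: the browser may download far more of the file than the user ever watches (potentially all of it), and the server has to serve every one of those bytes. On top of that, there is only a single quality level, so nothing can adapt when the connection gets worse.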

How does a modern streaming service work?

Modern streaming services like Netflix and Peacock use adaptive streaming: the video player constantly monitors the user's internet connection and adjusts the quality of the video stream accordingly.

Practically, this is done by:

  • Converting a high quality source video into multiple lower quality versions (often called renditions or representations)
  • Breaking each rendition into small chunks (often called segments)
  • Storing these segments on a server
A simplified representation of a segmented piece of content
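To make the layout above a little more concrete, here is a minimal sketch of how segmented content might be described in code. Everything in it is hypothetical: the rendition names, bitrates, the 4-second segment length, the 30-second duration, and the /video/<rendition>/<index>.m4s URL scheme are invented for illustration. Real services publish this kind of information in a manifest, which this sketch does not attempt to model.

```javascript
// Hypothetical description of one piece of segmented content.
// All values below are made up for illustration.
const content = {
  totalDurationSeconds: 30,
  segmentDurationSeconds: 4,
  renditions: [
    { id: "360p", width: 640, height: 360, bitrateBps: 800_000 },
    { id: "720p", width: 1280, height: 720, bitrateBps: 3_000_000 },
    { id: "1080p", width: 1920, height: 1080, bitrateBps: 6_000_000 },
  ],
};

// How many segments each rendition is split into; the last one may be shorter.
function segmentCount({ totalDurationSeconds, segmentDurationSeconds }) {
  return Math.ceil(totalDurationSeconds / segmentDurationSeconds);
}

// Hypothetical URL scheme: each rendition keeps its segments in its own folder.
function segmentUrl(renditionId, index) {
  return `/video/${renditionId}/${index}.m4s`;
}

// All segment URLs for one rendition, in playback order.
function segmentUrlsFor(content, renditionId) {
  const urls = [];
  for (let i = 0; i < segmentCount(content); i++) {
    urls.push(segmentUrl(renditionId, i));
  }
  return urls;
}
```

With these made-up numbers, 30 seconds of video in 4-second chunks gives 8 segments per rendition (the last one only 2 seconds long). Because every rendition covers the same timeline, segment 3 of "360p" and segment 3 of "1080p" hold the same stretch of video at different qualities, which is what lets a player switch renditions between segments.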

The video player then requests these segments from the server in real time, grabbing the appropriate rendition based on the user's internet connection. The determination of which rendition to use at a given point in time is beyond the scope of this article, but it's a fascinating topic in its own right.
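The rendition decision really is a topic of its own, but purely to make the idea concrete, here is a deliberately naive sketch (not how production players decide, and not necessarily the approach this series will take): estimate throughput from the last download, then pick the highest-bitrate rendition that fits within 80% of it. The rendition list, the 0.8 safety factor, and the function names are all hypothetical.

```javascript
// Hypothetical renditions, from lowest to highest quality.
const renditions = [
  { id: "360p", bitrateBps: 800_000 },
  { id: "720p", bitrateBps: 3_000_000 },
  { id: "1080p", bitrateBps: 6_000_000 },
];

// Throughput of a single download, in bits per second.
function measureThroughputBps(bytesDownloaded, elapsedMs) {
  return (bytesDownloaded * 8) / (elapsedMs / 1000);
}

// Naive heuristic: the highest bitrate that fits within a safety margin of
// the measured throughput; fall back to the lowest rendition if none fits.
function pickRendition(renditions, throughputBps, safetyFactor = 0.8) {
  const sorted = [...renditions].sort((a, b) => a.bitrateBps - b.bitrateBps);
  let choice = sorted[0];
  for (const rendition of sorted) {
    if (rendition.bitrateBps <= throughputBps * safetyFactor) {
      choice = rendition;
    }
  }
  return choice;
}

// A 500 kB segment that took 1 second to download means 4 Mbit/s.
const throughput = measureThroughputBps(500_000, 1000);
console.log(pickRendition(renditions, throughput).id); // prints 720p
```

The safety margin exists because a single measurement is noisy; picking a rendition right at the measured throughput would leave no headroom if the connection dips.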

It's worth noting that there is, of course, an almost infinite well of complexity that can be added to this process, but this is the basic idea. The subsequent articles in this series will attempt to build a simple video player that can handle adaptive streaming, with sidebars on encoding, advert insertion, and DRM along the way.

The actual low-level playback of video is wildly different across devices. A PlayStation 5 has a different video playback pipeline than a Roku, which in turn differs from a web browser's. To keep this accessible, we'll be focusing on the web browser.

Media Source Extensions (MSE)

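The Media Source Extensions API is a browser API that gives JavaScript low-level control over media playback: instead of pointing a <video> element at a file, our own code decides which bytes to download and hands them to the browser to decode. This is the API that allows us to build our own video player, and it's the API that we'll be using in this series.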
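Before relying on MSE, a player will usually check that the API exists and that the browser can decode the stream. A minimal sketch: MediaSource.isTypeSupported() is a real static method that takes a MIME type including a codecs parameter, while the helper names below are hypothetical. The codec string avc1.42c00d is simply the one the example player in this article uses.

```javascript
// Build the MIME type string MSE expects: a container type plus the codecs
// parameter, e.g. video/mp4; codecs="avc1.42c00d".
function mseMimeType(container, codecs) {
  return `${container}; codecs="${codecs}"`;
}

// True only if the MSE API exists in this environment and the browser
// reports that it can handle this container/codec combination.
// MediaSourceImpl defaults to the global; it is a parameter only so the
// check can be exercised outside a browser.
function canPlayWithMSE(mimeType, MediaSourceImpl = globalThis.MediaSource) {
  return (
    typeof MediaSourceImpl === "function" &&
    typeof MediaSourceImpl.isTypeSupported === "function" &&
    MediaSourceImpl.isTypeSupported(mimeType)
  );
}

const mimeType = mseMimeType("video/mp4", "avc1.42c00d");
console.log(mimeType, canPlayWithMSE(mimeType));
```

In an environment without MSE (Node.js, for instance) the check simply returns false instead of throwing.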

Playback Process

The basic process of playing a video (no audio -- we will add this later) with MSE is as follows:

  1. Create a MediaSource object - This is an object that represents the media source that we will be playing. We can generate a virtual "blob" URL for this object (via URL.createObjectURL) that we can use as the src attribute of a <video> element.
  2. Wait for the MediaSource object to fire the sourceopen event - This event is fired once the MediaSource object has been attached to the <video> element and is ready to accept SourceBuffer objects.
  3. Create a SourceBuffer object - This is done by calling the addSourceBuffer method on the MediaSource object with the stream's MIME type and codec string; the method creates the SourceBuffer and attaches it to the MediaSource in one step. The SourceBuffer is a buffer that we write video data to, and we must keep feeding it video data in real time in order to keep the video playing.
  4. Fetch video data - This can be done in a variety of ways, but for our purposes we will be using the fetch API to download video data from a server.
  5. Append video data to the SourceBuffer object - This is done by calling the appendBuffer method on the SourceBuffer object. Appending is asynchronous, so we wait for the SourceBuffer to fire its updateend event before appending the next chunk.
  6. Repeat steps 4 and 5 until the video is finished, then call endOfStream on the MediaSource object to signal that no more data is coming.
A simplified representation of MSE
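The steps above can be sketched in JavaScript roughly as follows. This is a minimal sketch, not the article's player: it assumes a hypothetical, pre-built list of segment URLs whose first entry is the init segment, and it appends segments back-to-back instead of watching the buffer level. The browser globals (MediaSource, URL.createObjectURL, fetch) are used by default; the optional deps argument is a hypothetical addition that only exists so the flow can be exercised outside a browser.

```javascript
// Append one chunk and resolve once the SourceBuffer has processed it.
// Calling appendBuffer() while the SourceBuffer is still updating throws,
// so we wait for "updateend" (or "error") before returning.
function appendAndWait(sourceBuffer, data) {
  return new Promise((resolve, reject) => {
    const onEnd = () => { cleanup(); resolve(); };
    const onError = () => { cleanup(); reject(new Error("appendBuffer failed")); };
    const cleanup = () => {
      sourceBuffer.removeEventListener("updateend", onEnd);
      sourceBuffer.removeEventListener("error", onError);
    };
    sourceBuffer.addEventListener("updateend", onEnd);
    sourceBuffer.addEventListener("error", onError);
    sourceBuffer.appendBuffer(data);
  });
}

// segmentUrls is hypothetical: [initSegmentUrl, mediaSegment0Url, ...].
function playWithMSE(video, mimeType, segmentUrls, deps = {}) {
  const {
    MediaSourceImpl = globalThis.MediaSource,
    createObjectURL = (obj) => URL.createObjectURL(obj),
    fetchImpl = (url) => fetch(url),
  } = deps;

  // Create the MediaSource and attach it to the <video> via a blob URL.
  const mediaSource = new MediaSourceImpl();
  video.src = createObjectURL(mediaSource);

  return new Promise((resolve, reject) => {
    // SourceBuffers can only be added once the MediaSource is open.
    mediaSource.addEventListener("sourceopen", async () => {
      try {
        // addSourceBuffer() creates the SourceBuffer and attaches it.
        const sourceBuffer = mediaSource.addSourceBuffer(mimeType);
        for (const url of segmentUrls) {
          // Download the segment bytes...
          const response = await fetchImpl(url);
          if (!response.ok) throw new Error(`HTTP ${response.status} for ${url}`);
          // ...then append them and wait until the buffer has taken them.
          await appendAndWait(sourceBuffer, await response.arrayBuffer());
        }
        // Signal that no more data is coming.
        mediaSource.endOfStream();
        resolve();
      } catch (err) {
        reject(err);
      }
    }, { once: true });
  });
}
```

In a page, this might be called as playWithMSE(document.querySelector("video"), 'video/mp4; codecs="avc1.42c00d"', urls), with urls being whatever list of segment URLs you have. Waiting for updateend is the important design choice here: appendBuffer only starts the work, and a second call while the SourceBuffer is still updating fails with an InvalidStateError.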

The Simplest Possible MSE Player

150 lines just to play a video? It's true, but it's not as painful as it sounds.

The player above is a heavily commented implementation of a fairly simple MSE player. It's not particularly useful in its current state, but it's a good starting point for understanding the basic concepts of MSE.

Once the MediaSource is ready, we snag the first segment from the server. From then on, whenever we detect that the buffer holds less than 4 seconds of data ahead of the current playback position, we fetch the next segment and append it to the buffer. This goes on and on until the video is complete.
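The "less than 4 seconds" check can be sketched as a pair of small functions. This is an assumption about how such a check might look, not the article's code: it reads a TimeRanges-like object (such as video.buffered) and measures how far the buffered range containing currentTime extends past it. The function names and the BUFFER_GOAL_SECONDS constant are hypothetical; only the 4-second threshold comes from the text.

```javascript
// Threshold from the article: fetch more when less than 4 s is buffered ahead.
const BUFFER_GOAL_SECONDS = 4;

// Seconds of media buffered ahead of currentTime, given a TimeRanges-like
// object ({ length, start(i), end(i) }), e.g. video.buffered.
function bufferedAhead(buffered, currentTime) {
  for (let i = 0; i < buffered.length; i++) {
    if (buffered.start(i) <= currentTime && currentTime < buffered.end(i)) {
      return buffered.end(i) - currentTime;
    }
  }
  // The playhead is not inside any buffered range, so nothing is ahead of it.
  return 0;
}

// True when the buffer ahead of the playhead has dropped below the goal.
function shouldFetchNext(buffered, currentTime, goalSeconds = BUFFER_GOAL_SECONDS) {
  return bufferedAhead(buffered, currentTime) < goalSeconds;
}
```

In a page, one might call shouldFetchNext(video.buffered, video.currentTime) from a timeupdate listener and only start a fetch when no other fetch is in flight and sourceBuffer.updating is false. Looking only at the range that contains the playhead matters once seeking enters the picture, because data buffered elsewhere on the timeline does nothing for the current position.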

After looking at the above code, you likely have some questions:

  • What is an init segment?
  • What if the segments are not 4 seconds, or they are not all the same size?
  • Your code seems to be making an awful lot of assumptions about the URL structure of the video segments. What if the URL structure changes?
  • How did you break this video into chunks?
  • Where did that magic codec string 'avc1.42c00d' come from?
  • Why does the whole thing freeze if I try to seek?
  • I thought the whole point of this was to change the video quality based on the user's internet connection. You're not doing that. Why did you lie to me?

These are all totally reasonable questions, but I unfortunately will be answering exactly zero of them right now! The answers will be revealed to you as we build upon this simple example in subsequent articles.