My Sad Existence - Building a Video Player with MSE
Part I: Introduction
Why must video playback be so complicated?
In the not-so-distant past, a developer could (and often did) simply slap a <video> element on a webpage (or use Flash), collect a paycheck, and then go home to their family. Such a simple, innocent time in history.
But then, the world changed. The <video> element was no longer enough.
The world demanded more.
People wanted to stream video from cellular connections.
People wanted to stream an entire movie without waiting an hour for it to buffer.
The simplest possible way to play video in a browser
Here's the simplest possible playback of a video in the browser -- a <video> element being supplied a URL to a video file.
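As a rough sketch of that baseline, the snippet below builds a <video> element and hands it a single file URL. The helper name createSimpleVideo and the example.com URL are placeholders for illustration, not part of any real page.

```javascript
// Hypothetical helper: builds a <video> element that plays one file.
// `doc` is the page's document; `url` is an assumed location of a video file.
function createSimpleVideo(doc, url) {
  const video = doc.createElement('video');
  video.src = url;        // the browser takes care of downloading and decoding
  video.controls = true;  // show the built-in play/pause/seek controls
  return video;
}

// In a browser this could be used as:
// document.body.appendChild(
//   createSimpleVideo(document, 'https://example.com/video.mp4')
// );
```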
Is this experience acceptable? What if my internet connection changes mid-playback? What if the video is 800 MB and I only want to watch the first 5 minutes?
This is acceptable for neither the user nor the server. It's wasteful on both ends. The user has to download the entire video file, and the server has to serve the entire video file.
How does a modern streaming service work?
Modern streaming services like Netflix and Peacock utilize the concept of adaptive streaming. This means that the video player is constantly monitoring the user's internet connection and adjusting the quality of the video stream accordingly.
Practically, this is done by:
- Converting a high quality source video into multiple lower quality versions (often called renditions or representations)
- Breaking each rendition into small chunks (often called segments)
- Storing these segments on a server
The video player then requests these segments from the server in real time, grabbing the appropriate rendition based on the user's internet connection. The determination of which rendition to use at a given point in time is beyond the scope of this article, but it's a fascinating topic in its own right.
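To make the idea of renditions and segments concrete, here is a deliberately naive sketch. The rendition names, bitrates, safety factor, and URL layout are all invented for illustration, and the selection rule is only one simple possibility, not how any particular streaming service decides.

```javascript
// Hypothetical rendition ladder; bitrates are in bits per second.
const renditions = [
  { name: '360p', bitrate: 800000 },
  { name: '720p', bitrate: 2500000 },
  { name: '1080p', bitrate: 5000000 },
];

// Naive rule (an assumption): choose the highest-bitrate rendition that
// fits within a fraction of the measured throughput, and fall back to
// the lowest rendition when nothing fits.
function pickRendition(ladder, throughputBps, safetyFactor = 0.8) {
  const budget = throughputBps * safetyFactor;
  const sorted = [...ladder].sort((a, b) => a.bitrate - b.bitrate);
  let choice = sorted[0];
  for (const rendition of sorted) {
    if (rendition.bitrate <= budget) choice = rendition;
  }
  return choice;
}

// Hypothetical URL layout: one folder per rendition, numbered segments.
function segmentUrl(baseUrl, rendition, index) {
  return `${baseUrl}/${rendition.name}/segment-${index}.m4s`;
}
```

With a measured throughput of 4 Mbps, the budget under these assumptions is 3.2 Mbps, so pickRendition would settle on the 720p rendition and the player would request that rendition's next segment.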
It's worth noting that there is an almost infinite well of complexity that can be added to this process, but this is the basic idea. The subsequent articles in this series will attempt to build a simple video player that can handle adaptive streaming, introducing some sidebars into encoding, advert insertion, and DRM along the way.
The actual low-level playback of video is wildly different across devices. A PlayStation 5 will have a different video playback pipeline than a Roku, which will have a different pipeline than a web browser. To keep this accessible, we'll be focusing on the web browser.
Media Source Extensions (MSE)
The Media Source Extensions API is a browser API that gives JavaScript low-level control over the media playback process. This is the API that allows us to build our own video player, and it's the API that we'll be using in this series.
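Before relying on MSE, a page can check whether the browser exposes it and whether it accepts a given MIME type and codec. The sketch below uses the standard MediaSource.isTypeSupported static method; the helper name supportsMse and its scope parameter are made up here so the check can also be exercised outside a browser.

```javascript
// Returns true when `scope` exposes MediaSource and reports support for
// the given MIME type and codec string. `scope` defaults to the global
// object; taking it as a parameter is purely an illustrative choice.
function supportsMse(mimeCodec, scope = globalThis) {
  const MS = scope.MediaSource;
  return (
    typeof MS === 'function' &&
    typeof MS.isTypeSupported === 'function' &&
    MS.isTypeSupported(mimeCodec)
  );
}

// In a browser this could be used as:
// supportsMse('video/mp4; codecs="avc1.42c00d"');
```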
Playback Process
The basic process of playing a video (no audio -- we will add this later) with MSE is as follows:
1. Create a MediaSource object - This is an object that represents the media source that we will be playing. We can generate a virtual "blob" URL for this object that we can use as the src attribute of a <video> element.
2. Create a SourceBuffer object - This is an object that represents a buffer that we can write video data to. We must continuously feed this buffer video data in real time in order to keep the video playing.
3. Wait for the MediaSource object to fire the sourceopen event - This event is fired when the MediaSource object is ready to accept a SourceBuffer object.
4. Add the SourceBuffer to the MediaSource object - This is done by calling the addSourceBuffer method on the MediaSource object. In practice this call both creates the SourceBuffer and attaches it, so steps 2 and 4 happen together once sourceopen has fired.
5. Fetch video data - This can be done in a variety of ways, but for our purposes we will be using the fetch API to download video data from a server.
6. Append video data to the SourceBuffer object - This is done by calling the appendBuffer method on the SourceBuffer object.
7. Repeat steps 5 and 6 until the video is finished.
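The steps above can be sketched roughly as follows. This is a simplified illustration under several assumptions: segmentUrls is a hypothetical ordered list of segment URLs, the function names are invented, and the env parameter (defaulting to the browser globals) exists only so the sketch can be run outside a browser. Unlike a real player, it appends every segment as soon as it downloads instead of pacing itself.

```javascript
// Appends one chunk and resolves once the SourceBuffer has finished
// processing it (signalled by the updateend event).
function appendAndWait(sourceBuffer, data) {
  return new Promise((resolve, reject) => {
    const cleanup = () => {
      sourceBuffer.removeEventListener('updateend', onEnd);
      sourceBuffer.removeEventListener('error', onError);
    };
    const onEnd = () => { cleanup(); resolve(); };
    const onError = () => { cleanup(); reject(new Error('append failed')); };
    sourceBuffer.addEventListener('updateend', onEnd);
    sourceBuffer.addEventListener('error', onError);
    sourceBuffer.appendBuffer(data);
  });
}

// `video` is a <video> element, `segmentUrls` an assumed list of URLs in
// playback order, `mimeCodec` a type string the browser supports.
function playWithMse(video, segmentUrls, mimeCodec, env = globalThis) {
  // Step 1: create the MediaSource and point the <video> at its blob URL.
  const mediaSource = new env.MediaSource();
  video.src = env.URL.createObjectURL(mediaSource);

  return new Promise((resolve, reject) => {
    // Step 3: wait until the MediaSource is ready to accept a buffer.
    mediaSource.addEventListener('sourceopen', async () => {
      try {
        // Steps 2 and 4: addSourceBuffer creates and attaches the buffer.
        const sourceBuffer = mediaSource.addSourceBuffer(mimeCodec);
        for (const url of segmentUrls) {
          // Step 5: fetch the next chunk of video data.
          const response = await env.fetch(url);
          if (!response.ok) throw new Error(`Request failed: ${response.status}`);
          const data = await response.arrayBuffer();
          // Step 6: append it, waiting before the next append.
          await appendAndWait(sourceBuffer, data);
        }
        // Nothing left to append: tell the MediaSource the stream is done.
        mediaSource.endOfStream();
        resolve();
      } catch (err) {
        reject(err);
      }
    }, { once: true });
  });
}
```

Waiting for updateend between appends matters because a SourceBuffer rejects a new appendBuffer call while it is still processing the previous one.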
The Simplest Possible MSE Player
150 lines just to play a video? It's true, but it's not as painful as it sounds.
The player above is a heavily commented implementation of a fairly simple MSE player. It's not particularly useful in its current state, but it's a good starting point for understanding the basic concepts of MSE. You can click the "Run" button to see it in action, or click "Download" to get a standalone runnable .html page with this particular example.
Once the Media Source is ready, we snag the first segment from the server. From then on, once we detect that the buffer contains less than 4 seconds of data, we will fetch the next segment and feed the buffer with it. This goes on and on until the video is complete.
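The "less than 4 seconds" check can be expressed as a small helper. This is a sketch rather than the code from the player above: the function names are invented, and buffered stands for any TimeRanges-like object (length, start(i), end(i)), such as the buffered property of a <video> element.

```javascript
// Threshold taken from the text: refill when under 4 seconds remain.
const MIN_BUFFER_SECONDS = 4;

// How many seconds of media are buffered beyond the current position.
// Returns 0 when the current position is not inside any buffered range.
function secondsBufferedAhead(buffered, currentTime) {
  for (let i = 0; i < buffered.length; i++) {
    if (currentTime >= buffered.start(i) && currentTime <= buffered.end(i)) {
      return buffered.end(i) - currentTime;
    }
  }
  return 0;
}

// True when the player should go and fetch the next segment.
function shouldFetchNext(buffered, currentTime, threshold = MIN_BUFFER_SECONDS) {
  return secondsBufferedAhead(buffered, currentTime) < threshold;
}
```

One way to use it would be to call shouldFetchNext(video.buffered, video.currentTime) from a timeupdate listener on the <video> element and request the next segment whenever it returns true.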
After looking at the above code, you likely have some questions:
- What is an init segment?
- What if the segments are not 4 seconds, or they are not all the same size?
- Your code seems to be making an awful lot of assumptions about the URL structure of the video segments. What if the URL structure changes?
- How did you break this video into chunks?
- Where did that magic codec string 'avc1.42c00d' come from?
- Why does the whole thing freeze if I try to seek?
- I thought the whole point of this was to change the video quality based on the user's internet connection. You're not doing that. Why did you lie to me?
These are all totally reasonable questions, but unfortunately I will be answering exactly zero of them right now! The answers will be revealed to you as we build upon this simple example in subsequent articles.