Why We Didn’t Implement Mean Opinion Score (MOS) in Avalanche Adaptive Bitrate

Spirent has always been on the cutting edge of video and audio testing. When the hot technology was Microsoft's MMS, we added that. When RTSP rose, we implemented it along with emulation of a myriad of players (Microsoft, Apple, Real, BitBand …). When RTMP (Flash) video was the next big thing, we added that too. These are just a few examples.

When presenting these solutions to customers, we always made it clear that in video/voice testing, you want to look not only at how many streams the system under test can handle, but also at the quality of those streams. If your customers are plagued with bad video quality, they will stop using your Video on Demand service. If they can't understand what the person on the other end of the IP phone is saying, they will switch to a different VoIP provider.

This is why Avalanche (as well as Spirent TestCenter and some of our other products) has always implemented Quality of Experience (QoE) metrics. There are network-layer metrics, such as the Media Delivery Index (MDI)-related stats, and "human-level" metrics, such as the Mean Opinion Score (MOS). These are pretty much industry-standard metrics that are relevant when testing RTSP and SIP.

But now we support Adaptive Bitrate (ABR) and… we don't provide MOS or MDI. People are surprised, and with reason. I was surprised too, at first, until a discussion on our internal mailing list got me to think more about it. Let's explore the reasons why we didn't implement MOS and MDI for ABR, but first let's recap what MOS means in the context of load testing.

What is MOS, again?

MOS is a score on a scale of 1 to 5 that reflects the quality of a video as a human would evaluate it. A score of 5 is "perfect" (and not achievable by design). A score near 1 is "please make it stop, my eyes are bleeding." A typical good score is somewhere between 4.1 and 4.5.

As soon as a video is encoded, you can use tools to calculate its MOS. A good encoder (usually one you have to pay for) will give you a high score. A bad encoder (which you sometimes also pay for) will give you a bad score. I won't get into the details of how the score is computed, but in short, it depends on your test methodology: some approaches compare the source and retrieved files (PEVQ), while R-Factor uses codec and bitrate information, and so on.

When this video is streamed, the MOS on the receiving end cannot be higher than that of the source video. At best, in perfect network conditions, the scores will be equal. This is what you look for when looking at MOS during load tests: the evolution of the MOS (it shouldn't get lower), not just its absolute value. If the source video's MOS is 2, that's pretty bad, but if it's still 2 when it reaches the client, your network is not degrading the quality: your network is good.

What makes MOS go down, then? Typically, packet loss. RTSP and multicast video streaming typically use RTP/UDP for the data stream (RTSP itself, which is TCP-based, is only used for control). If you're reading this blog, you should know that UDP is an unreliable transport protocol: there's no re-transmit feature, among other things. (People have tried to work around that with RTCP-based feedback, but it adds overhead, which I suspect is why RTP wasn't based on TCP in the first place, so it's not an ideal solution.)

Why is it irrelevant then?

As we have just seen, in a live network, a MOS score decreases because of bad network conditions, because the underlying transport protocol (UDP) is unreliable. But Adaptive Bitrate is based on HTTP, which itself is based on TCP! So there will be no packet loss visible to the application: TCP's re-transmit mechanisms will kick in to make sure you get any lost packet.

This means your clients' video quality score will always be the same as the source's, because ABR relies on TCP to make sure no data is lost. Therefore, measuring it is irrelevant.

But re-transmits bring other problems. First, there is the overhead. Not much can be done about that; ABR is a technology that favors quality over bandwidth efficiency.

Then, it takes time to re-transmit packets: you pay an extra round-trip for each one. On a fairly lousy network, re-transmits will multiply and slow things down. How will this manifest (pun intended) for the users? They will not have enough data (fragments) to keep playing the video without interruption. This is known as Buffering Wait Time. You don't want that.

When this threatens to happen, the ABR client will tend to downshift to a lower bitrate. This is what makes the technology brilliant: as the name implies, it adapts to the network conditions. And this is what you want to look at. As we've seen, the video quality is a given. What is not a given, and a very good metric to look at, is the total number of downshifts. Or the total number of buffer underruns. Or the average Buffering Wait Time. And guess what, Avalanche measures all of that!
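To make the downshift idea concrete, here is a minimal sketch of the kind of rate-selection logic an ABR client applies. The function name, the safety factor, and the bitrate ladder are all hypothetical illustrations, not Avalanche's or Apple's actual implementation:

```python
def select_bitrate(ladder_kbps, measured_throughput_kbps, safety_factor=0.8):
    """Pick the highest advertised bitrate that the measured throughput
    can sustain with some headroom; fall back to the lowest rung if
    none fit (the client then risks buffering)."""
    candidates = [b for b in sorted(ladder_kbps)
                  if b <= measured_throughput_kbps * safety_factor]
    return candidates[-1] if candidates else min(ladder_kbps)

# Hypothetical bitrate ladder, as it might appear in a manifest (Kbps)
ladder = [500, 1000, 2000, 4000]

print(select_bitrate(ladder, 3000))  # 2000: 4000 exceeds the 2400 budget
print(select_bitrate(ladder, 400))   # 500: lowest rung, buffering likely
```

A real client would smooth the throughput estimate over several fragments and also account for buffer occupancy, but the core trade-off is the same: drop a rung rather than stall the player.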

But you do have some metrics, right?

People tend to want a single metric to simplify results analysis, and they are right. While such a metric cannot be as precise as examining all the stats in detail, it's important to have one.

In Avalanche we call it the Adaptivity Score. We look at the total bandwidth used by the users and compare it to the potential maximum bandwidth (the maximum bitrate multiplied by the number of users). We then normalize it to a scale of 100.

Let's take an example. Say we have 10 users connecting to an ABR server serving streams at bitrates of 1 Mbps and 500 Kbps. That's a maximum potential bandwidth of 10 Mbps. If all 10 users are on the 1 Mbps stream, the score will be 100:

(current bitrate / max bitrate) * 100

((10x1 Mbps) / 10 Mbps) * 100 = 100

Now let's pretend that half of the users go to the 500 Kbps stream.

(((5x0.5 Mbps) + (5x1Mbps)) / 10 Mbps) * 100 = 75
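The same calculation can be sketched in a few lines of Python (the function name is mine, not Avalanche's; it just restates the formula above):

```python
def adaptivity_score(user_bitrates_mbps, max_bitrate_mbps):
    """Adaptivity Score: aggregate user bandwidth as a percentage of the
    potential maximum (max bitrate * number of users)."""
    max_total = max_bitrate_mbps * len(user_bitrates_mbps)
    return sum(user_bitrates_mbps) / max_total * 100

# All 10 users on the 1 Mbps stream:
print(adaptivity_score([1.0] * 10, 1.0))             # 100.0

# Half of the users on the 500 Kbps stream:
print(adaptivity_score([1.0] * 5 + [0.5] * 5, 1.0))  # 75.0
```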

And since we do this calculation at every result sampling interval, you can analyze its evolution after the test has been executed.

In the example below, I used Avalanche to emulate both the client and server of the Apple HLS implementation of ABR. I have a bunch of bitrates in the Manifest, and enabled a bitrate shift algorithm. The users are configured to start at the minimum possible bitrate and work their way up. The video lasts 5 minutes (to allow enough time to shift all the way up).

The first graph shows the Adaptivity Score. The second shows which bitrate "buckets" the users are streaming from. We can see that as the users move to higher-bitrate channels, the score rises.

Adaptivity Score (Spirent Avalanche)

Active Channels: This graph shows which bitrate 'buckets' the users are streaming from.

And just for fun, here’s a screenshot of the throughput in that test. That’s almost 30 Gbps :)

Screenshot of throughput test


If there is one thing to take away from this article, it's that with HTTP Adaptive Bitrate, the video quality will stay the same because TCP makes sure all of the video data reaches the clients. The quality of the video as viewed by the clients will be equal to the source. The cost of this is that you might see increased buffering time as packets are re-transmitted.

The second takeaway is that if a Service Provider wants its customers to have the best possible experience, it needs to make sure those clients can smoothly stream at the highest available bitrate. That's your "quality of experience" measurement in ABR: how close to the maximum available bandwidth your clients can get.
