Measuring User Perception of Video Quality


Testing is an Engineer’s Task

There is a great deal of science and engineering involved in testing and validating the pieces that make up state-of-the-art wireless communication offerings. Conceived, created, developed, deployed, and operated by engineering professionals, these systems are unsurprisingly tested with numbers and metrics at the root of every approach. For the aspects of performance that measure objective quantities in the physical world, such as time, frequency, power, and distance, hard science provides a rich, indisputable method to compare and validate performance.

But at the edge of our LTE networks, beyond the radio towers and the sensors, the ultimate end users remain humans. Humans who evolved and developed over timeframes measured in millions of years. Viewed from this vantage point, the use of the term “Long Term Evolution” to describe our networks seems quite out of place.

To bridge the gap between the network and the user, we use sensors and transducers to translate sounds and light into bits to traverse the network, and then convert them back into physical vibrations and light to deliver the content to the receiver. Not that long ago, the network’s overwhelmingly dominant role was to deliver simple word-based messages instantly across the world. When words are being delivered, the task of translating them into bits is fairly straightforward, since human language is inherently experienced as a sequence of symbols – whether represented by a string of letters or a sequence of phonemes.

More and more, however, we are using the smartphone to deliver both sound and sight as part of an entertainment experience. Unlike simple messaging, the domain of media-driven experiences brings with it a myriad of content which does not neatly fit into a well-defined binary code, and extremely subtle changes in the inflection of the sights and sounds can change the ultimate message received by the user. These subtleties let us tell whether it is Springsteen or Tom Petty on guitar, pick a friend’s face out of the graduating class photo, or tear up while watching It’s a Wonderful Life.

What Makes Measuring Video Quality a Challenge

To better appreciate the task being taken on, it is important to appreciate the way in which visual images are processed in the human brain. Every engineer is familiar with the underlying theory of encoding information and the foundational work done by Shannon and Nyquist, which underpins all modern communication science. However, many are not familiar with the work initially pioneered by Karl Küpfmüller. Following his early work with Nyquist and others, Küpfmüller’s pursuits evolved toward using information theory concepts to better understand human information processing.

The figure below (building on the foundations of Küpfmüller) comes from a breakthrough paper by Siegfried Lehrl and Bernd Fischer first published in 1987, a time when compression algorithms for audio and visual content were in their infancy and the shift to digital encoding of audio (and ultimately visual) entertainment had barely started. Although the exact bandwidths experienced by the mind cannot be measured with the same precision as can be applied to electronic systems, the numbers in the figure paint a clear picture and help frame the challenge of applying engineering techniques to measuring the human experience of sight and sound.

[Figure: block diagram of the estimated bandwidths at each stage of human information processing]

On the left, the 10 Mbit/s content of the images discernable by the eye and the 1 Mbit/s of audio sensed by the ear represent approximately 90% and 9%, respectively, of all the inputs we accept. Clearly, for a human, vision is king. More noteworthy, however, is the incredibly narrow bandwidth estimate for the human conscious state: 15 bit/s! Current estimates suggest that between 30% and 40% of the entire brain is dedicated to the single task of taking the raw visual image received by the eye and pre-processing (i.e., compressing) it down to the 15 bit/s we are consciously aware of at any given point in time. On the surface, this gap between raw data input and the amount of data actively absorbed by the conscious mind suggests that a perfect compression algorithm could reduce the bandwidths and bitrates needed for video transmission and storage to parts per million.
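The “parts per million” claim follows directly from the figure’s estimates. A quick back-of-the-envelope check (using the rough Lehrl/Fischer numbers quoted above, which are estimates, not precise measurements):

```python
# Rough estimates from the Lehrl/Fischer diagram discussed above.
visual_input_bps = 10_000_000  # ~10 Mbit/s discernable by the eye
conscious_bps = 15             # ~15 bit/s of conscious awareness

# Ratio of conscious throughput to raw visual input, in parts per million.
ratio_ppm = conscious_bps / visual_input_bps * 1_000_000
print(f"Conscious/visual ratio: {ratio_ppm} ppm")  # 1.5 ppm
```

In other words, a hypothetical “perfect” perceptual codec would need to keep only about 1.5 millionths of the raw visual data rate.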

In practice, achieving even 10:1 compression on video without impacting the user experience is a challenge, and state-of-the-art codecs are only beginning to reach the 100:1 level of performance. Moreover, even with these high-quality codecs, certain images and certain movements can result in human-perceptible distortions in the experienced images. At the heart of the challenge is that little is objectively understood about how the brain filters down the raw information; more importantly, the 15 bit/s is a size estimate of conscious awareness only. Phrases such as “something feels wrong,” “it does not quite look right,” and “I can’t put my finger on what’s wrong” are all reminders that we are often aware of many things without being able to put them into words.
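One reason purely numerical comparison falls short of the perceptual judgments described above is that the most common objective metrics reduce a frame to a single error statistic. As an illustration (not a description of Spirent’s method), here is a minimal sketch of peak signal-to-noise ratio (PSNR), a widely used full-reference metric, computed between a reference frame and a degraded copy:

```python
import numpy as np

def psnr(reference: np.ndarray, degraded: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two frames, in dB (higher is better)."""
    mse = np.mean((reference.astype(np.float64) - degraded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # frames are identical
    return 10.0 * np.log10(max_val ** 2 / mse)

# Toy example: an 8-bit grayscale frame and a mildly noisy copy of it.
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(480, 640), dtype=np.uint8)
noise = rng.integers(-5, 6, size=frame.shape, dtype=np.int16)
noisy = np.clip(frame.astype(np.int16) + noise, 0, 255).astype(np.uint8)

print(f"PSNR vs. noisy copy: {psnr(frame, noisy):.1f} dB")
```

Two frames with identical PSNR can look very different to a viewer, which is exactly why the perceptual testing discussed in this article cannot be replaced by pixel arithmetic alone.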

Video Quality: Measure with (Extreme) Care

The direct method to judge the quality of anything is to inspect it intensely, matching the approach as closely as possible to the manner in which the ultimate user will experience it. Therefore, when judging the user experience of a video on a smartphone, the most straightforward approach is simply to ‘look’ at it. Of course, to ‘look’ requires a video camera that can outperform the required level of service in resolution across space, time, and color depth. Capturing (or, more specifically, over-capturing) the optical image presented to the user is the only path that ensures the user’s experience can ultimately be measured. Although it may be tempting to bypass the screen and ‘grab’ the image in the digital domain, doing so leaves the performance of the decoder, the screen, the scaling algorithms, the image fusion (for picture-in-picture, or PIP, configurations), the chromatic color depth of rendering, and a myriad of other elements unvalidated.

In addition, by capturing the image optically, the device under test can be observed operating in a configuration identical to how it will ultimately be used by its human owner. Such non-intrusive testing ensures the device is not hampered by sharing its processing power with intrusive application code, and that the image has not been distorted by digital processing, bandwidth limiting, or other impairments added as the content passes through anything other than the screen on the front of the device. Of course, optical capture brings with it mechanical challenges, as the smartphone and the camera must be held in perfect alignment.

At Spirent, we have mastered the effective and efficient use of optical screen capture through carefully designed apparatus and patented signal processing that automate the control and alignment precision required to ensure accurate measurements, measurements that capture not only the 15 bit/s conscious experience of the user but also the artifacts and imperfections which other shortcuts may obscure.


Tomorrow, Video Quality will be Even More Critical than Today

User expectations and application demands all point toward increasing attention to the quality of video being delivered. An ever-increasing share of mobile traffic is video based, and a growing percentage of that video content is streaming entertainment, balanced between live sporting events and cinema. These shifts are propelling the smartphone from a “messaging device” to an “experiencing device,” where translating content into bits shifts from a pure science to more of an art.

At the January 2017 Consumer Electronics Show (CES), initial 2016 sales of virtual reality (VR) devices were seen as running above forecast, and a bullish 2.5-million-unit market was projected for 2017. Given that the screen in a VR device covers 45 degrees of the user’s field of vision, the rapidly growing VR marketplace will drive higher and higher resolution screens and rapidly accelerate the demand for 2K/4K video images.

In the other direction (using the device as an optical source rather than a display), the exploding drone marketplace is quickly setting new standards for the resolution of drone-mounted cameras and for the cellular networking used to capture and deliver images to the earth-bound customer. At the same CES show, a 3.5-million-unit market for non-toy drones using optical capture was forecast for 2017.

In such a competitive market, it is crucial for organizations to properly assess the quality of their video delivery. For a more in-depth look at how video capture methods can affect this performance assessment, see Spirent’s white paper Do You See What I See? A comparison of capture methodologies for video quality assessment.

