Single-Frame Video Quality Comparisons are Deeply Flawed


As encoding enthusiasts, we're all familiar with making image comparisons to prove that our set of encoding parameters is superior to the status quo. Often, we treat this type of comparison as definitive proof that the parameters are better and that you "should trust your eyes over the metrics". However, I'm here to show that these comparisons can be highly misleading, and that they can be carefully controlled to produce a desired result regardless of reality. Using SSIMULACRA2, the highly reliable perceptual metric that Cloudinary created while promoting the JPEG XL image format, we can tune our image comparisons to get results that defy what any sane person would conclude after watching the full video clips. Not only can we deliberately manipulate the comparison just by choosing the right frames; it's also highly likely that a randomly chosen frame won't be representative of the overall difference in video quality.

Testing Conditions

In order to find out just how biased a comparison we can make, we'll take this clip (which I shot on my Nikon Z6 and release to the public domain) and make a few different encodes of it. First, we'll make 2-pass, target-bitrate encodes at 4500 kbps using FFmpeg's libaom-av1 at cpu-used 6 as well as FFmpeg's libx265 at the medium preset. These encodes are lazy: they don't represent what either codec is capable of, nor is this meant to be a valid comparison between them. However, libaom-av1 and libx265 should behave very, very differently, allowing us to find frames with a large difference in performance in both directions.

Next, we'll settle a common debate in the AV1 encoding world: whether AV1's CDEF artifact-removal filter is good. To do this, we'll make two 2-pass, target-bitrate encodes at 7200 kbps using libaom-av1 at cpu-used 4, with the only difference being whether CDEF is enabled. Again, this is lazy and doesn't represent what AV1 is capable of, nor is it meant to actually decide which setting is better; it only shows how a single-frame video quality comparison can skew the result. These two clips should be more similar in rate-control behavior, meaning the potential for bias and manipulation should be a bit smaller.

Once we have all these encodes, we'll measure the per-frame SSIMULACRA2 scores and load them into a spreadsheet. From there, we can compute the "x265 bias" of each frame in the first pair (x265 score - AOM score) and the "CDEF bias" of each frame in the second pair (CDEF score - no-CDEF score). Sorting each pair's data by bias in either direction makes it easy to find the frames that look worst and best for each side.
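The spreadsheet work described above can also be sketched in a few lines of Python. This is a minimal sketch, assuming you have already produced per-frame SSIMULACRA2 scores with an external tool and saved them one score per line; the file format and all function names here are my own, not part of any existing tool.

```python
def load_scores(path):
    """Read one SSIMULACRA2 score per line into a list of floats."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

def frame_bias(scores_a, scores_b):
    """Per-frame bias of encode A over encode B (positive = A scored higher)."""
    return [a - b for a, b in zip(scores_a, scores_b)]

def extremes(bias):
    """Return (frame index, bias) for the most A-favoring and most
    B-favoring frames, i.e. the best cherry-picks for each side."""
    best_for_a = max(range(len(bias)), key=lambda i: bias[i])
    best_for_b = min(range(len(bias)), key=lambda i: bias[i])
    return (best_for_a, bias[best_for_a]), (best_for_b, bias[best_for_b])
```

Sorting the bias list, as the spreadsheet does, gives the same information; `extremes()` just jumps straight to the two frames each side would cherry-pick.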

What We're Looking For

In order to prove that a single-frame image quality comparison is flawed, we're looking for the following statistical patterns:

1. Cherry-picked frames whose per-frame quality difference inverts, or greatly exaggerates, the true average difference between the two encodes.
2. A large enough share of frames favoring the overall loser that even a randomly chosen frame has a substantial chance of pointing to the wrong conclusion.

AOM AV1 vs x265 HEVC test

Again, as a disclaimer, this is NOT intended as a test of whether AV1 or HEVC is a better codec. Presets are not equivalent, x265's actual bitrate ended up being higher, and AOM's default settings are not ideal. This is only to show just how much a carefully engineered single-frame video quality comparison can skew which encode is perceived as better.
In this test, the AV1 encode scored an average of 18.67 and the HEVC encode scored an average of 27.59, so the HEVC encode is, on average, 8.92 SSIMULACRA2 points better. This means that when you actually watch the full clips, you'd notice that the HEVC version is quite a bit better. If you wanted to do a genuinely useful single-frame video comparison, you'd want to select a frame with an "x265 bias" of around 8.92, but more on that later.
Despite this, we were able to find a frame (frame 10 of 300) where the AV1 version scored a whopping 33.31 points better than HEVC. Not only is this an inversion of which encode is actually better, it's almost 4x the average difference between the two clips. Just by choosing frame #10 for your comparison, you can turn "HEVC is a decent bit better" into "AV1 is enormously better". We went ahead and uploaded that frame as a comparison, so you can see for yourself exactly what a 33-point difference looks like.
In addition to that, we were also able to find a frame (frame 251 of 300) where the HEVC version scored an incredible 38.24 points better than AV1. It still leads to the correct conclusion (the HEVC version is better), but amplifying the difference by over 4x is misleading and disingenuous. We also uploaded that as another comparison so that you can observe for yourself.
But what about that second point I made for what we're looking for? Well, by using a simple =COUNTIF(), we find that 18.67% of all the frames show AOM AV1 doing better than x265 HEVC. That means that when you pick a random frame, there's an 18.67% chance that you will look at it and draw the wrong conclusion, far above any reasonable 5% significance threshold, so even a random frame can't be trusted. If you were to compare 3 random frames and consider it a best-2-out-of-3, the chance that at least two of them favor the wrong side works out to 3p^2(1-p) + p^3, which is about 9.2% with p = 0.1867, still not statistically significant. And even that only holds if the frames are truly random; it is entirely possible to hand-pick three frames that show either bias.
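Assuming the frames really are chosen independently and uniformly at random, the odds of a wrong best-2-out-of-3 verdict reduce to a binomial tail sum. A minimal sketch (the function name is my own):

```python
from math import comb

def wrong_conclusion_prob(p, n=3, majority=2):
    """P(at least `majority` of `n` independently chosen random frames favor
    the side that loses on average), where p is the fraction of frames that
    favor that losing side."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(majority, n + 1))
```

With p = 0.1867 this gives roughly 9.2% for best-2-out-of-3, and with the 25.33% figure from the CDEF test below, roughly 16%; a single frame (`n=1, majority=1`) fails simply at rate p itself.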

AOM AV1 - CDEF vs No CDEF test

Once again, this is NOT intended as a test of whether CDEF is actually better than no CDEF. This is a single clip, and to conclusively prove that CDEF is or isn't good, you need a huge database of all types of videos. This is only to show that carefully engineered single-frame video quality comparisons can still skew the result, even if only a single setting with minor effects is being changed.
In this test, the encode with CDEF scored an average of 28.61 and the encode without CDEF scored an average of 28.73. You would never notice a difference that small (0.12) during playback, and if you saw two images with those scores side by side, you would not be able to conclusively decide which one is better, no matter how hard you looked. This means it's mostly futile to do an honest single-frame video quality comparison here; only metrics can tell the two apart. However, it might still be possible to do a dishonest comparison with a noticeable difference.
We were able to find a frame (frame 272 of 300) that favors the encode with CDEF by 0.81 SSIMULACRA2 points. This might be noticeable in an image comparison and contradicts the real difference, while amplifying it by an insane 7x. Along with that, we were able to find another frame (frame 57 of 300) that favors the encode without CDEF by 1.08 points. This is the same conclusion that the average scores draw, but amplified 9x, which is a big problem. Even with SSIMULACRA2 score differences around 1, it's still in the "debatable" range, which again makes the concept of a single-image video quality comparison unreliable for this small of a difference in the first place.
With this set of data, even assuming you can discern a SSIMULACRA2 difference of <0.01, 25.33% of all frames favor CDEF, which is the wrong conclusion. That means that even if you compared 3 random frames and considered it best-2-out-of-3, there is roughly a 16% chance (3p^2(1-p) + p^3 with p = 0.2533) that you draw the wrong conclusion, which is nowhere near statistically significant. Therefore, using single-image video quality comparisons to refine parameters like this is worse than useless.

Can you do an honest comparison?

Maybe. If you put all the score differences into a spreadsheet and select a frame where the difference is very close to the average, you can make a single-image comparison that is as representative of the actual video quality as possible. Even then, single-image comparisons aren't reliable for small differences in quality, where people might disagree as to which version is better. Personally, I think the full-video average SSIMULACRA2 score is always going to be more reliable than any single image comparison, but in some media, like advertising, it isn't practical to present metrics. If your ad copy for a video encoder or encoding service involves a single-image video quality comparison, it is certainly possible to choose a biased frame that maximizes the perceived advantage of your product, but that would be disingenuous; the honest choice is the frame closest to the average difference.
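Picking that honest frame is easy to automate. A minimal sketch, assuming `bias` is the list of per-frame score differences computed earlier (positive favoring one encode, negative the other):

```python
def most_representative_frame(bias):
    """Index of the frame whose score difference is closest to the
    whole-clip average difference, i.e. the fairest frame to show."""
    mean_bias = sum(bias) / len(bias)
    return min(range(len(bias)), key=lambda i: abs(bias[i] - mean_bias))
```

Note that this only makes the frame choice honest; it doesn't fix the underlying problem that small per-frame differences are hard to judge by eye.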


Video codec developers: DON'T USE THEM! Single-image video quality comparisons are an unreliable way to evaluate video quality, no matter what the absolute difference between the videos is, and they may be worse than even bad metrics (like PSNR). If you're looking at a single-image video quality comparison, be wary of it, because the conclusion of such a comparison can be easily fudged and used for disinformation purposes. The only time when this is a reliable showing of quality is when the thing being evaluated is itself a single image (e.g. AVIF vs JPEG-XL).