For the bridge and face images, Methods 5 (B-spline interpolation) and 8 (edge-enhanced zooming) were preferred by the human panel and had low visual difference scores. Furthermore, the panel's choices would not have been predicted by the signal-to-error ratio. For the text image the visual difference scores did not predict the panel's preference. This may be because human observers are judging readability rather than image quality. We also found that a very simple method based on truncating the DCT coefficients [7] was quite effective, so for applications where computational complexity is important this may be a good choice.
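As a minimal sketch of the kind of DCT-domain resampling referred to above (the function name, the use of SciPy's dctn/idctn and the normalisation are illustrative assumptions, not the exact method of [7]): keep only a low-frequency block of the coefficients and invert the transform at the new size.

import numpy as np
from scipy.fft import dctn, idctn

def dct_resample(image, out_shape):
    # Forward 2-D DCT (orthonormal, so the scaling below is straightforward).
    coeffs = dctn(image, norm='ortho')
    out = np.zeros(out_shape)
    rows = min(image.shape[0], out_shape[0])
    cols = min(image.shape[1], out_shape[1])
    # Truncate the coefficient block when shrinking; zero-pad when zooming.
    out[:rows, :cols] = coeffs[:rows, :cols]
    # Rescale so the mean grey level is preserved at the new size.
    scale = np.sqrt((out_shape[0] * out_shape[1]) /
                    (image.shape[0] * image.shape[1]))
    return idctn(out * scale, norm='ortho')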
It is of interest to formally compare the visual difference scores to the rankings of the human panel. We do this via the robust Spearman rank correlation coefficient [14]. Each observer produced a separate ranking, so we have computed the coefficient between each observer and each of the error measures. Table 4 gives the mean of the rank correlation coefficients over all 13 observers.
Table 4: Mean Spearman rank correlation coefficients for each image and error metric.
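As an illustration of how each entry in Table 4 can be computed, the following sketch evaluates the Spearman rank correlation between one observer's ranking and the ranking implied by an error measure, then the table entry is the mean over observers; the ties-free formula is standard, and the rankings shown are placeholders rather than data from the experiment.

import numpy as np

def spearman_rho(rank_a, rank_b):
    # Spearman rank correlation for two rankings of the same n items
    # (assumes no ties): rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    a = np.asarray(rank_a, dtype=float)
    b = np.asarray(rank_b, dtype=float)
    n = len(a)
    d = a - b
    return 1.0 - 6.0 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# Hypothetical ranks of eight resampling methods from one observer
# and from one error measure (1 = best).
observer_ranks = [1, 3, 2, 5, 4, 7, 6, 8]
metric_ranks   = [2, 1, 3, 4, 6, 5, 8, 7]
rho = spearman_rho(observer_ranks, metric_ranks)
# The corresponding entry of Table 4 is the mean of rho over all 13 observers.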
Under the null hypothesis of zero correlation, the rank correlation coefficient has a significance statistic that is distributed according to a Student's t distribution with 6 degrees of freedom. We find that a correlation of 0.28 marks roughly the threshold of significance. This means that all the correlations in Table 4, apart from those marked with a *, are significantly different from zero.
The visual difference score appears to predict human performance for a variety of grey-scale images (in this paper we show only two of the images we have tried), but it does not work well for images where the observers may use additional interpretation to assess image quality. The text image presented here is an example of this, but we have also found that some face images produce a disparity between the visual difference score and the human scores.
A further problem when testing resampling methods is the provenance of the images. All of the images shown here were generated using known methods, but we have noted inconsistent results when using well-known test images. We suspect that several of these images have been previously resampled.
Currently we are repeating the human tests using what we think is an improved method, in which observers are presented with image pairs for a short duration and asked to select the better of the two. A ranking is then produced by sorting these pairwise comparisons.
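A minimal sketch of one way to derive such a ranking from forced-choice trials, assuming a simple win-count (Copeland-style) scoring; this is purely illustrative and not necessarily the sorting procedure finally adopted.

from collections import defaultdict

def rank_from_pairs(comparisons):
    # comparisons: list of (winner, loser) method labels from the
    # forced-choice trials.  Methods are ranked by number of wins.
    wins = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        wins[loser] += 0          # ensure every method appears in the tally
    # Sort by descending win count; ties keep an arbitrary stable order.
    return sorted(wins, key=wins.get, reverse=True)

# Hypothetical trials for three methods A, B and C.
trials = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")]
print(rank_from_pairs(trials))   # ['A', 'B', 'C']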