This note should be read in the context of the MeasTex framework for measuring the performance of a texture classification algorithm.
In this note, we describe the measure of accuracy computed from the output of an algorithm ; we call this measure of accuracy the algorithm's score. The test images component of the framework has a hierarchical structure: it consists of a set of texture problem suites, each of which consist of a set of texture problems.
A texture problem contains several sets of images. For each class of texture in the problem there are several sample images and several validation images. All sample images, and their true class, are given to the texture classification algorithm, which builds a model of each texture class. Then, each validation image is presented to the algorithm, which must classify the image.
An accepted measure of classification performance for a test problem is simple percentage correct classification. We can write the per-test-problem score as
where C is the set of all classes,
is the set of validation
images of class c and
is the number of images in this set,
is the chosen class of image i and
is the true class of
the same image.
if
and 0
otherwise.
The measure we just described takes into account only the correctness of the classifier, not the confidence. Consider an algorithm classifying a validation image in a two class problem. The classifier might compute that the image is 90% likely to be of class A, or might compute that it is 60% likely to be of class A. In either case, if the classifier is required to just nominate a single class, it should nominate class A. However, we believe that the confidence of the classifier is important information. In particular, if a classifier is 90% confident, but wrong, it should be punished more than a classifier that is 60% confident but wrong. We therefore require the classifier to output a vector with one non-negative element for each class. We have designed a new measure which awards a classifier the highest expected score (in the statistical sense) if it writes out the probabilities it computes for each class.
Let us introduce some more notation. Let
be the the
output of the classifier for the i'th validation image of a problem
suite and
be the element of
corresponding to class c. Likewise, let
and
be the algorithm's computed probability vector and its
c'th element. Let
be the element associated with the true
class of the i'th validation image. We will first describe how a
score is computed for each validation image, and then describe how
that score is aggregated to form per-test-problem and
per-test-suite scores.
The new measure of an algorithm on a single validation image is given by
where
denotes the L2 norm. Given the algorithm's computed
probabilities, this measure will have the expected value
The expected value is maximised when
and
are
collinear ; that is, when the
correspond to the algorithm's
computed probabilities. This measure is consistent with rewarding
algorithms which write out their computed probabilities .
Incidentally, the percentage correct measure is obtained with this
same notation by replacing the L2 norm above with the L1 norm. The
expected value for the L1 formula is maximised when the
corresponding to the largest
is equal to 1, and the other
are equal to zero. This is effectively nominating a single
class, and discarding the probability, or confidence, information.
We wish to allow the contribution of a particular class to the score to be weighted. This weighting is incorporated in the formula for a per-test-problem score:
where the
is the weighting of class c. We constrain the class
weights so that
.
It should be noted that a positive score does not necessarily indicate useful computation. The scores should be considered as a basis for comparison between algorithms and not as any kind of absolute measure of an algorithm's intrinsic knowledge about textures.
Equation 1 gives the formula for scoring an algorithm on a single texture problem. Such scores may be useful for the detailed analysis of an algorithm. However, we expect that summary scores for a suite of problems will be a more useful measure of a texture classification algorithm. The summary score for a suite is computed by simply averaging an algorithm's scores on all problems in the suite.