Curse of Dimensionality
A common problem in pattern recognition is the so-called curse of dimensionality. As the number of features considered for a classification task grows, the number of samples needed to maintain acceptable classifier performance increases exponentially, and with it the computational cost. Often there is little to gain from the additional features, especially if the samples can already be separated into classes using a smaller number of features.
This activity demonstrates how class separability measures are used to select and reduce the number of features, and, more generally, to measure the performance of a classifier. It is assumed that we already have a classifier and only need to measure its performance.
Class Separability Measures (Scatter Matrices)
To select features that best support classification, it is useful to transform the data according to some optimality criterion that measures how well individual features discriminate between the classes. In general, this involves selecting features that maximize the between-class separation of the sample data while minimizing the within-class scatter.
This is illustrated in the figure below (modified from Figure 5.5 of the Pattern Recognition book).
![]() |
| (Pattern Recognition, p. 282) |
In the figure above, samples are classified into three classes using two features. In (a), the small within-class scatter can be seen in how tightly the points cluster within each of the three classes. Although visual inspection suggests the samples were classified well, the between-class separation is also small, so an outlier from any of the three classes may be incorrectly classified. In (b), the classification is even more problematic because of the large within-class scatter and small between-class separation; this may indicate that the features selected for the classifier are not sufficient or do not discriminate well. Finally, (c) is the best-case scenario: the between-class separation is large while the within-class scatter remains small, indicating a better-performing classifier than in (a) or (b). An outlier from any of the three classes can be treated as an exception.
Scatter Matrices
The difficulty in using class separability measures is that they are not easily computed unless a Gaussian assumption is employed, i.e. the data are normally distributed or can be modeled by a Gaussian distribution. A simpler set of criteria is available using scatter matrices.
![]() |
| (Pattern Recognition, p. 280) |
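For reference, the two scatter matrices have the standard form below (following the book's notation, with M classes, class priors Pi estimated from the sample counts, class means μi, class covariance matrices Σi, and global mean μ0):

```latex
S_w = \sum_{i=1}^{M} P_i \, \Sigma_i , \qquad
S_b = \sum_{i=1}^{M} P_i \, (\mu_i - \mu_0)(\mu_i - \mu_0)^{T} , \qquad
\mu_0 = \sum_{i=1}^{M} P_i \, \mu_i
```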
From these matrices another can be derived, the mixture scatter matrix Sm, along with the separability measures J1, J2, and J3.
![]() |
| (Pattern Recognition, p. 281) |
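In the book's standard notation, the mixture scatter matrix and the three criteria are as follows; the values in the demonstration below are consistent with these definitions:

```latex
S_m = S_w + S_b , \qquad
J_1 = \frac{\operatorname{trace}\{S_m\}}{\operatorname{trace}\{S_w\}} , \qquad
J_2 = \frac{|S_m|}{|S_w|} = \left| S_w^{-1} S_m \right| , \qquad
J_3 = \operatorname{trace}\{ S_w^{-1} S_m \}
```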
The trace function is the sum of a matrix's diagonal elements, while |X| denotes the determinant of the matrix X. trace{Sb} measures the average distance, over all classes, of each class mean from the global mean. trace{Sw} is the average, over all classes, of the variance of the features within a class. Finally, trace{Sm} is the sum of the variances of the features around the global mean.
J1 takes large values when the samples in the l-dimensional feature space are well clustered around their mean within each class and the clusters of the different classes are well separated. Large values of J1 correspond to large values of the criteria J2 and J3, which are also invariant under linear transformations.
Demonstration
We now follow this discussion with a demonstration.
A. large between-class, small within-class
Characteristics

| Property | X1 | X2 | X3 |
|---|---|---|---|
| mean (feature 1, 2) | 2.5, 7.5 | 7.5, 7.5 | 5.0, 2.5 |
| Standard Deviation (feature 1, 2) | 0.25, 0.25 | 0.50, 0.50 | 0.75, 0.75 |
Sb
| 4.1845548 | -0.1811118 |
| -0.1811118 | 5.9490106 |
Sw
| 0.92267549 | 0.02379992 |
| 0.02379992 | 0.94401401 |
Separability measures
| Measure | Value |
|---|---|
| J1 | 6.428629 |
| J2 | 40.41522 |
| J3 | 12.85402 |
| trace{Sw} | 1.86669 |
| trace{Sb} | 10.13357 |
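The sketch below shows how values like those in the tables above might be computed with NumPy, assuming each class is sampled from a Gaussian distribution with the means and standard deviations listed for case A; the function and variable names are illustrative and not from the original code, and the exact numbers will vary with the random samples.

```python
import numpy as np

def scatter_matrices(samples):
    """Within-class (Sw), between-class (Sb), and mixture (Sm) scatter
    matrices from a list of per-class sample arrays of shape (n_i, l)."""
    n_total = sum(len(x) for x in samples)
    mu0 = np.concatenate(samples).mean(axis=0)        # global mean
    l = samples[0].shape[1]
    Sw = np.zeros((l, l))
    Sb = np.zeros((l, l))
    for x in samples:
        P = len(x) / n_total                          # class prior from sample counts
        mu = x.mean(axis=0)                           # class mean
        Sw += P * np.cov(x, rowvar=False, bias=True)  # class covariance
        d = (mu - mu0).reshape(-1, 1)
        Sb += P * (d @ d.T)
    return Sw, Sb, Sw + Sb

def separability(Sw, Sb, Sm):
    J1 = np.trace(Sm) / np.trace(Sw)
    J2 = np.linalg.det(Sm) / np.linalg.det(Sw)
    J3 = np.trace(np.linalg.inv(Sw) @ Sm)
    return J1, J2, J3

rng = np.random.default_rng(0)
means = [(2.5, 7.5), (7.5, 7.5), (5.0, 2.5)]        # case A class means
stds  = [(0.25, 0.25), (0.50, 0.50), (0.75, 0.75)]  # case A standard deviations
samples = [rng.normal(m, s, size=(100, 2)) for m, s in zip(means, stds)]

Sw, Sb, Sm = scatter_matrices(samples)
print(separability(Sw, Sb, Sm), np.trace(Sw), np.trace(Sb))
```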
B. small between-class, large within-class
Characteristics

| Property | X1 | X2 | X3 |
|---|---|---|---|
| mean (feature 1, 2) | 3.5, 7.5 | 6.5, 7.5 | 5.0, 3.5 |
| Standard Deviation (feature 1, 2) | 0.75, 0.75 | 0.65, 0.65 | 0.80, 0.80 |

Sb

| 1.47388079 | 0.06511365 |
| 0.06511365 | 3.80642500 |

Sw

| 1.3612369 | 0.0512204 |
| 0.0512204 | 1.3673277 |

Separability measures

| Measure | Value |
|---|---|
| J1 | 2.935195 |
| J2 | 7.884645 |
| J3 | 5.868463 |
| trace{Sw} | 2.728565 |
| trace{Sb} | 5.280306 |
C. large between-class, small within-class
Characteristics

| Property | X1 | X2 | X3 |
|---|---|---|---|
| mean (feature 1, 2) | 1.5, 7.5 | 8.5, 7.5 | 5.0, 1.5 |
| Standard Deviation (feature 1, 2) | 0.50, 0.50 | 0.20, 0.25 | 0.05, 0.05 |

Sb

| 8.296655045 | 0.009157317 |
| 0.009157317 | 8.022083912 |

Sw

| 0.055378282 | -0.005966965 |
| -0.005966965 | 0.076863988 |

Separability measures

| Measure | Value |
|---|---|
| J1 | 124.4003 |
| J2 | 16025.31 |
| J3 | 258.3551 |
| trace{Sw} | 0.1322423 |
| trace{Sb} | 16.31874 |
To simplify this demonstration, we used only two features and only 100 samples for each class. The variances used in the Gaussian distributions were kept equal between the two features within each class.
As demonstrated in the three cases above, the separability measures (J1, J2, and J3) are large when the separation between classes (Sb) is good and the within-class variance (Sw) is small. Between (A) and (B) we see a drop in the measures, while the best-case scenario (C) shows significant gains.
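As a quick check, assuming the form J1 = trace{Sm}/trace{Sw} with Sm = Sw + Sb given earlier, the reported traces reproduce the J1 values:

```latex
J_1^{(A)} = \frac{1.86669 + 10.13357}{1.86669} \approx 6.43 ,
\qquad
J_1^{(C)} = \frac{0.1322423 + 16.31874}{0.1322423} \approx 124.4
```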
Receiver Operating Characteristic Curve (ROC Curve)
Another useful tool for gauging the performance of a classifier, as well as for feature selection, is the Receiver Operating Characteristic curve (ROC curve). In the simplest case, it measures the performance of a binary classifier (i.e. one with two output classes).
![]() |
| (Pattern Recognition, p. 275) |
The example above (Figure 5.3 of Pattern Recognition, 2009) shows two overlapping distributions, with the second distribution inverted for easier visualization. The vertical line separating the overlapping regions represents the classifier threshold.
The ROC curve measures the performance of the classifier by comparing the fraction of correctly classified samples against the fraction of incorrectly classified samples as the threshold is varied. Each point plots, for a given threshold, the probability of a correct classification (1 − β) against the probability of a false classification (α). If there is minimal overlap, i.e. the classifier correctly classifies the majority of the samples, the area under the ROC curve is large and the curve approaches the corner (α = 0, 1 − β = 1). Conversely, for a poor classifier, i.e. heavily overlapping regions, the area under the curve is reduced.
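In symbols (using the book's α for the probability of a false classification and 1 − β for a correct one, both functions of the threshold a), the curve and its summary area are:

```latex
\text{ROC point at threshold } a:\ \bigl(\alpha(a),\, 1 - \beta(a)\bigr) ,
\qquad
\mathrm{AUC} = \int_{0}^{1} \bigl(1 - \beta\bigr)\, d\alpha
```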
When there are more than two classes, the comparison is usually one-versus-all, i.e. one class is compared against all of the other classes combined.
Demonstration
We generated two sets of samples with 1000 data points each, both drawn from Gaussian distributions with the following characteristics:
| Property | X1 | X2 |
|---|---|---|
| mean | 0.40 | 0.50 |
| Standard Deviation | 0.05 | 0.05 |
Shown above is the histogram of the sample data points for the two classes, red and blue. To construct the ROC curve, we count the number of correct classifications into the blue class as a function of the threshold.
At each threshold value a, we count the total number of samples correctly classified as blue and divide it by the number of samples (1000). We also count the number of samples falsely classified as blue, i.e. the total number of red samples above the threshold.
For example, at a threshold value of a = 0.5, the number of correct classifications (blue) is 513, while the number of red samples incorrectly classified as blue is 14. Dividing both by 1000 (the number of samples per class; the two classes need not have equal counts), we obtain the coordinate (0.014, 0.513) at threshold a = 0.5.
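The sketch below shows how such a curve could be traced, assuming the two classes are drawn from the Gaussian distributions in the table above; the names and threshold range are illustrative, and the sampled counts will only approximate the (0.014, 0.513) point quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
red  = rng.normal(0.40, 0.05, n)   # class with the lower mean
blue = rng.normal(0.50, 0.05, n)   # class with the higher mean

# Sweep the threshold a over the range of the data. At each value record
# (alpha, 1 - beta): the fraction of red samples above the threshold
# (false classifications as blue) and the fraction of blue samples above
# it (correct classifications as blue).
thresholds = np.linspace(0.2, 0.7, 200)
alpha      = np.array([(red  > a).mean() for a in thresholds])
one_m_beta = np.array([(blue > a).mean() for a in thresholds])

# The area under the (alpha, 1 - beta) curve summarizes how well the
# threshold separates the two classes.
auc = abs(np.trapz(one_m_beta, alpha))
print(auc)
```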
![]() |
| ROC curve for correct classifications into the blue class as a function of threshold a |
![]() |
| classifier outputs with no overlaps |
![]() |
| ROC curve for classifier outputs with no overlaps |
![]() |
| classifier outputs with a significant amount of overlap |
References
- S. Theodoridis and K. Koutroumbas. *Pattern Recognition*. 4th ed. United Kingdom: Academic Press, 2009.














No comments:
Post a Comment