Curse of Dimensionality
A common problem in pattern recognition is the so-called curse of dimensionality. As the number of features considered for a classification task grows, the number of samples needed to maintain acceptable classifier performance increases exponentially, and with it the computational cost. Often there is little to gain from the additional features, especially if the samples can already be separated into classes using a smaller number of features.
This activity demonstrates how class separability measures are used to select and reduce the number of features, and, more generally, to measure the performance of a classifier. It is assumed that we already have a classifier and only need to measure its performance.
Class Separability Measures (Scatter Matrices)
To select features that best support classification, it is useful to transform the data according to some optimality criterion that measures how well individual features discriminate between the classes. In general, this involves selecting features that maximize the between-class separation of the sample data while minimizing the within-class scatter.
This is illustrated in the figure below (modified from Figure 5.5 of the Pattern Recognition book).
![]() |
| (Pattern Recognition, p. 282) |
In the figure above, samples are classified into three classes using two features. In (a), the small within-class scatter can be seen in how tightly the points cluster within each of the three classes. Although visual inspection suggests the samples were classified well, the between-class separation is also small, so an outlier from any of the three classes may be incorrectly classified. In (b), the classification is even more problematic because of the large within-class scatter and small between-class separation; this may indicate that the features selected for the classifier are not sufficient or do not discriminate well. Finally, (c) is the best-case scenario: the between-class separation is large while the within-class scatter remains small, indicating a better-performing classifier than in (a) or (b). An outlier from any of the three classes can be treated as an exception.
Scatter Matrices
The difficulty in using class separability measures is that they are not easily computed unless a Gaussian assumption is employed, i.e. the data are normally distributed or can be modeled by a Gaussian distribution. A simpler set of criteria is available using scatter matrices.
![]() |
| (Pattern Recognition, p. 280) |
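For reference, the two scatter matrices have the standard form below (following the book's notation, with M classes, class priors Pi estimated from the sample counts, class means μi, class covariance matrices Σi, and global mean μ0):

```latex
S_w = \sum_{i=1}^{M} P_i \, \Sigma_i , \qquad
S_b = \sum_{i=1}^{M} P_i \, (\mu_i - \mu_0)(\mu_i - \mu_0)^{T} , \qquad
\mu_0 = \sum_{i=1}^{M} P_i \, \mu_i
```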
From these matrices another can be derived, the mixture scatter matrix Sm, along with the separability measures J1, J2, and J3.
![]() |
| (Pattern Recognition, p. 281) |
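In the book's standard notation, the mixture scatter matrix and the three criteria are as follows; the values in the demonstration below are consistent with these definitions:

```latex
S_m = S_w + S_b , \qquad
J_1 = \frac{\operatorname{trace}\{S_m\}}{\operatorname{trace}\{S_w\}} , \qquad
J_2 = \frac{|S_m|}{|S_w|} = \left| S_w^{-1} S_m \right| , \qquad
J_3 = \operatorname{trace}\{ S_w^{-1} S_m \}
```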
The trace function is the sum of a matrix's diagonal elements, while |X| denotes the determinant of the matrix X. trace{Sb} measures the average distance, over all classes, of each class mean from the global mean. trace{Sw} is the average, over all classes, of the variance of the features within a class. Finally, trace{Sm} is the sum of the variances of the features around the global mean.
J1 takes large values when the samples in the l-dimensional feature space are well clustered around their mean within each class and the clusters of the different classes are well separated. Large values of J1 correspond to large values of the criteria J2 and J3, which are also invariant under linear transformations.
Demonstration
We now follow this discussion with a demonstration.
A. large between-class, small within-class
Characteristics

| Property | X1 | X2 | X3 |
|---|---|---|---|
| mean (feature 1, 2) | 2.5, 7.5 | 7.5, 7.5 | 5.0, 2.5 |
| Standard Deviation (feature 1, 2) | 0.25, 0.25 | 0.50, 0.50 | 0.75, 0.75 |
Sb
| 4.1845548 | -0.1811118 |
| -0.1811118 | 5.9490106 |
Sw
| 0.92267549 | 0.02379992 |
| 0.02379992 | 0.94401401 |
Separability measures
| Measure | Value |
|---|---|
| J1 | 6.428629 |
| J2 | 40.41522 |
| J3 | 12.85402 |
| trace{Sw} | 1.86669 |
| trace{Sb} | 10.13357 |
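The sketch below shows how values like those in the tables above might be computed with NumPy, assuming each class is sampled from a Gaussian distribution with the means and standard deviations listed for case A; the function and variable names are illustrative and not from the original code, and the exact numbers will vary with the random samples.

```python
import numpy as np

def scatter_matrices(samples):
    """Within-class (Sw), between-class (Sb), and mixture (Sm) scatter
    matrices from a list of per-class sample arrays of shape (n_i, l)."""
    n_total = sum(len(x) for x in samples)
    mu0 = np.concatenate(samples).mean(axis=0)        # global mean
    l = samples[0].shape[1]
    Sw = np.zeros((l, l))
    Sb = np.zeros((l, l))
    for x in samples:
        P = len(x) / n_total                          # class prior from sample counts
        mu = x.mean(axis=0)                           # class mean
        Sw += P * np.cov(x, rowvar=False, bias=True)  # class covariance
        d = (mu - mu0).reshape(-1, 1)
        Sb += P * (d @ d.T)
    return Sw, Sb, Sw + Sb

def separability(Sw, Sb, Sm):
    J1 = np.trace(Sm) / np.trace(Sw)
    J2 = np.linalg.det(Sm) / np.linalg.det(Sw)
    J3 = np.trace(np.linalg.inv(Sw) @ Sm)
    return J1, J2, J3

rng = np.random.default_rng(0)
means = [(2.5, 7.5), (7.5, 7.5), (5.0, 2.5)]        # case A class means
stds  = [(0.25, 0.25), (0.50, 0.50), (0.75, 0.75)]  # case A standard deviations
samples = [rng.normal(m, s, size=(100, 2)) for m, s in zip(means, stds)]

Sw, Sb, Sm = scatter_matrices(samples)
print(separability(Sw, Sb, Sm), np.trace(Sw), np.trace(Sb))
```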
B. small between-class, large within-class
Characteristics

| Property | X1 | X2 | X3 |
|---|---|---|---|
| mean (feature 1, 2) | 3.5, 7.5 | 6.5, 7.5 | 5.0, 3.5 |
| Standard Deviation (feature 1, 2) | 0.75, 0.75 | 0.65, 0.65 | 0.80, 0.80 |

Sb

| 1.47388079 | 0.06511365 |
| 0.06511365 | 3.80642500 |

Sw

| 1.3612369 | 0.0512204 |
| 0.0512204 | 1.3673277 |

Separability measures

| Measure | Value |
|---|---|
| J1 | 2.935195 |
| J2 | 7.884645 |
| J3 | 5.868463 |
| trace{Sw} | 2.728565 |
| trace{Sb} | 5.280306 |
C. large between-class, small within-class
Characteristics

| Property | X1 | X2 | X3 |
|---|---|---|---|
| mean (feature 1, 2) | 1.5, 7.5 | 8.5, 7.5 | 5.0, 1.5 |
| Standard Deviation (feature 1, 2) | 0.50, 0.50 | 0.20, 0.25 | 0.05, 0.05 |

Sb

| 8.296655045 | 0.009157317 |
| 0.009157317 | 8.022083912 |

Sw

| 0.055378282 | -0.005966965 |
| -0.005966965 | 0.076863988 |

Separability measures

| Measure | Value |
|---|---|
| J1 | 124.4003 |
| J2 | 16025.31 |
| J3 | 258.3551 |
| trace{Sw} | 0.1322423 |
| trace{Sb} | 16.31874 |
To simplify this demonstration, we used only two features and only 100 samples for each class. The variances used in the Gaussian distributions were kept equal between the two features within each class.
As demonstrated in the three cases above, the separability measures (J1, J2, and J3) are large when the separation between classes (Sb) is good and the within-class variance (Sw) is small. Between (A) and (B) we see a drop in the measures, while the best-case scenario (C) shows significant gains.
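As a quick check, assuming the form J1 = trace{Sm}/trace{Sw} with Sm = Sw + Sb given earlier, the reported traces reproduce the J1 values:

```latex
J_1^{(A)} = \frac{1.86669 + 10.13357}{1.86669} \approx 6.43 ,
\qquad
J_1^{(C)} = \frac{0.1322423 + 16.31874}{0.1322423} \approx 124.4
```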
Receiver Operating Characteristic Curve (ROC Curve)
Another useful tool for gauging the performance of a classifier, as well as for feature selection, is the Receiver Operating Characteristic curve (ROC curve). In the simplest case, it measures the performance of a binary classifier (i.e. one with two output classes).
![]() |
| (Pattern Recognition, p. 275) |
The example above (Figure 5.3 of Pattern Recognition, 2009) shows two overlapping distributions, with the second distribution inverted for easier visualization. The vertical line separating the overlapping regions represents the classifier threshold.
The ROC curve measures the performance of the classifier by comparing the fraction of correctly classified samples against the fraction of incorrectly classified samples as the threshold is varied. Each point plots, for a given threshold, the probability of a correct classification (1 − β) against the probability of a false classification (α). If there is minimal overlap, i.e. the classifier correctly classifies the majority of the samples, the area under the ROC curve is large and the curve approaches the corner (α = 0, 1 − β = 1). Conversely, for a poor classifier, i.e. heavily overlapping regions, the area under the curve is reduced.
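In symbols (using the book's α for the probability of a false classification and 1 − β for a correct one, both functions of the threshold a), the curve and its summary area are:

```latex
\text{ROC point at threshold } a:\ \bigl(\alpha(a),\, 1 - \beta(a)\bigr) ,
\qquad
\mathrm{AUC} = \int_{0}^{1} \bigl(1 - \beta\bigr)\, d\alpha
```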
When there are more than two classes, the comparison is usually one-versus-all, i.e. one class is compared against all of the other classes combined.
Demonstration
We generated two sets of samples with 1000 data points each, both drawn from Gaussian distributions with the following characteristics:
| Property | X1 | X2 |
|---|---|---|
| mean | 0.40 | 0.50 |
| Standard Deviation | 0.05 | 0.05 |
Shown above is the histogram of the sample data points for the two classes, red and blue. To construct the ROC curve, we count the number of correct classifications into the blue class as a function of the threshold.
At each threshold value a, we count the total number of samples correctly classified as blue and divide it by the number of samples (1000). We also count the number of samples falsely classified as blue, i.e. the total number of red samples above the threshold.
For example, at a threshold value of a = 0.5, the number of correct classifications (blue) is 513, while the number of red samples incorrectly classified as blue is 14. Dividing both by 1000 (the number of samples per class; the two classes need not have equal counts), we obtain the coordinate (0.014, 0.513) at threshold a = 0.5.
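The sketch below shows how such a curve could be traced, assuming the two classes are drawn from the Gaussian distributions in the table above; the names and threshold range are illustrative, and the sampled counts will only approximate the (0.014, 0.513) point quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
red  = rng.normal(0.40, 0.05, n)   # class with the lower mean
blue = rng.normal(0.50, 0.05, n)   # class with the higher mean

# Sweep the threshold a over the range of the data. At each value record
# (alpha, 1 - beta): the fraction of red samples above the threshold
# (false classifications as blue) and the fraction of blue samples above
# it (correct classifications as blue).
thresholds = np.linspace(0.2, 0.7, 200)
alpha      = np.array([(red  > a).mean() for a in thresholds])
one_m_beta = np.array([(blue > a).mean() for a in thresholds])

# The area under the (alpha, 1 - beta) curve summarizes how well the
# threshold separates the two classes.
auc = abs(np.trapz(one_m_beta, alpha))
print(auc)
```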
![]() |
| ROC curve for correct classifications into the blue class as a function of threshold a |
![]() |
| classifier outputs with no overlaps |
![]() |
| ROC curve for classifier outputs with no overlaps |
![]() |
| classifier outputs with a significant amount of overlap |
References
- S. Theodoridis and K. Koutroumbas. *Pattern Recognition*. 4th ed. United Kingdom: Academic Press, 2009.














No comments:
Post a Comment