Scatter plot for a multiclass dataset with class imbalance and class overlap


I'm using Weka to develop a classifier for detecting semantic relations. Let's suppose I have a multiclass dataset. The dataset initially contains 4 numeric features (there could be more) and a class attribute whose valid values are "HYPERNYM", "SYNONYM" or "NO" (abbreviated below as HYPER and SYN), i.e., three classes. Example instances could be:

   feat1   feat2   feat3   feat4   class
    ....
    0.32    0.45    0.15      5       NO
    0.26    0.48    0.93     20       HYPER
    0.65    0.32    0.43     13       NO
    0.43    0.19    0.89     45       SYN
    ...

This is a typical classification problem. However, the dataset is affected by class imbalance (one class has far fewer instances than another) and by class overlap (instances of different classes have very similar feature values).

The question is: how can I represent each instance in a 2D graph so that I can visualize the degree of overlap between classes?

I have found a picture that illustrates one possible graph, similar to a scatter plot. However, I don't know how to produce it.

Is there an easy way to make a similar figure in R or with Weka?


There is 1 answer

Enrique On BEST ANSWER

You can use Multidimensional Scaling (MDS) to first reduce the dimension of your data and then plot it. This method tries to preserve the distances between points when projecting them into a lower dimension.

Here is an example in R for the iris dataset:

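data <- iris
# One integer color code per species
colors <- as.integer(as.factor(data$Species))
# Euclidean distances between instances, using the four numeric features
d <- dist(data[, 1:4])
fit <- cmdscale(d, k = 2)  # k is the resulting dimension
x <- fit[, 1]
y <- fit[, 2]
plot(x, y, xlab = "Coordinate 1", ylab = "Coordinate 2", main = "MDS", pch = 19, col = colors)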
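The same steps might be applied to data shaped like the question's. The sketch below is only an illustration: the data frame `rel` is hypothetical and holds just the four sample rows shown in the question, and the names `rel`, `rel_d` and `rel_fit` are made up here. Because feat4 appears to be on a much larger scale than the other features, the columns are standardized with `scale()` first; otherwise feat4 would likely dominate the Euclidean distances.

```r
# Hypothetical data frame mirroring the four sample rows from the question.
# A real dataset would be loaded instead, e.g. from a CSV export.
rel <- data.frame(
  feat1 = c(0.32, 0.26, 0.65, 0.43),
  feat2 = c(0.45, 0.48, 0.32, 0.19),
  feat3 = c(0.15, 0.93, 0.43, 0.89),
  feat4 = c(5, 20, 13, 45),
  class = factor(c("NO", "HYPER", "NO", "SYN"))
)

# Standardize each feature so that no single column dominates the distances
rel_features <- scale(rel[, c("feat1", "feat2", "feat3", "feat4")])

# Classical MDS down to two coordinates
rel_d <- dist(rel_features)
rel_fit <- cmdscale(rel_d, k = 2)

# One color per class, with a legend mapping colors back to class labels
plot(rel_fit[, 1], rel_fit[, 2],
     xlab = "Coordinate 1", ylab = "Coordinate 2", main = "MDS",
     pch = 19, col = as.integer(rel$class))
legend("topright", legend = levels(rel$class),
       col = seq_along(levels(rel$class)), pch = 19)
```

With the full dataset, regions where points of different colors mix would be the ones to inspect for class overlap.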


Alternatively, you could reduce it to 3 dimensions and plot it with the scatterplot3d package:

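library(scatterplot3d)  # run install.packages("scatterplot3d") first if it is not installed
fit <- cmdscale(d, k = 3)  # k is the resulting dimension
x <- fit[, 1]
y <- fit[, 2]
z <- fit[, 3]
scatterplot3d(x, y, z, color = colors, pch = 19)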
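One possible way to decide between two and three coordinates is to compare how much of the distance structure each projection retains. The sketch below assumes the iris data again and uses the `eig = TRUE` argument of `cmdscale`, which makes it return a list that includes a two-element goodness-of-fit vector `GOF`; the variable names (`iris_d`, `mds2`, `mds3`) are chosen here for illustration only.

```r
# Distances on the four numeric iris features, as in the example above
iris_d <- dist(iris[, 1:4])

# With eig = TRUE, cmdscale returns a list (points, eig, GOF, ...) instead of a matrix
mds2 <- cmdscale(iris_d, k = 2, eig = TRUE)
mds3 <- cmdscale(iris_d, k = 3, eig = TRUE)

# GOF values close to 1 suggest the low-dimensional plot preserves most of the distances
print(mds2$GOF)
print(mds3$GOF)
```

If the two-coordinate fit is already close to 1, the 2D scatter plot is probably a faithful picture and the third coordinate adds little.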


As for the class imbalance problem, I don't know how you would like to represent it in the scatter plot. One option might be to increase the size of the points from the minority classes.
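The point-size idea could be sketched as follows. This is an assumption-laden illustration rather than a recommended recipe: iris is balanced, so an imbalanced subset (`imb`) is built artificially, and the square-root scaling of point size by class rarity is an arbitrary choice made here.

```r
# Artificially imbalanced subset of iris: 50 setosa, 15 versicolor, 5 virginica
imb <- iris[c(1:50, 51:65, 101:105), ]

# Instances per class; rarer classes get larger points
counts <- table(imb$Species)
size_by_class <- sqrt(max(counts) / as.vector(counts))
names(size_by_class) <- names(counts)
sizes <- size_by_class[as.character(imb$Species)]

# Classical MDS to two coordinates, as in the earlier example
imb_fit <- cmdscale(dist(imb[, 1:4]), k = 2)

plot(imb_fit[, 1], imb_fit[, 2],
     xlab = "Coordinate 1", ylab = "Coordinate 2",
     main = "MDS (point size grows with class rarity)",
     pch = 19, col = as.integer(imb$Species), cex = sizes)

# Legend reports the class counts so the imbalance is visible alongside the plot
legend("topright",
       legend = paste0(names(counts), " (n=", as.vector(counts), ")"),
       col = seq_along(counts), pch = 19)
```

Putting the class counts in the legend is another simple way to keep the imbalance in view while reading the overlap from the plot.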