Reduce data dimensionality using curve fitting


I am a newbie to machine learning and haven't used scikit-learn before. As part of a project, I need to train a machine learning algorithm to classify some observations into separate classes. I have processed the observations from the sensor to yield some data, but the problem is that the processed data is a vector of a different length for each observation.

[Image: filtered sensor data (green) with 16th-degree polynomial fits (red) for observations with 3, 4, and 5 peaks]

This image shows some of the data. The green line is the raw data after applying a Gaussian filter, and the red line is a 16th-degree polynomial fit to it. The first row contains data with 3 peaks, the second row data with 4 peaks, and the third row data with 5 peaks. I want to be able to classify the data into these separate classes.

I currently plan to use the coefficients of the polynomial as my feature vector. The fit in the first row is fine, but for observations with more peaks the polynomial fits poorly and may not allow proper classification. I have tried higher-degree polynomials, but they make the cases with fewer peaks misbehave. Simply counting the peaks is not sufficient for classification either, because the peak count only separates a subset of the classes; classification also depends on the relative sizes and separations of the observed peaks, and this information is not well preserved in the polynomial fits when there are many peaks.
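To make this concrete, here is a minimal sketch of the coefficient-based feature extraction I have in mind; the arrays `x` and `y` are synthetic stand-ins for one filtered observation, not my actual data:

```python
import numpy as np

# Illustrative observation: two Gaussian-shaped peaks on a normalized x axis.
x = np.linspace(0.0, 1.0, 200)
y = np.exp(-((x - 0.3) ** 2) / 0.005) + np.exp(-((x - 0.7) ** 2) / 0.005)

degree = 16
coeffs = np.polyfit(x, y, degree)  # 17 coefficients, highest power first
# (np.polyfit may warn about conditioning at high degree; normalizing
# x to [0, 1], as above, helps.)

# Every observation fitted with the same degree yields a fixed-length
# feature vector, regardless of how many raw samples it contains.
feature_vector = coeffs
print(feature_vector.shape)  # (17,)
```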

I want to know if there is

  • Some other method, instead of fitting a polynomial, that can help me generate a feature vector for classifying the data.
  • A way to visualize high-dimensional data in Python.

EDIT:

I am now fitting a spline to the data instead of a polynomial, using scipy.interpolate.UnivariateSpline, and it produces a much better fit. I can now use the knot locations and the spline coefficients as features, but the lengths of these vectors are not constant and differ even between two repetitions of the same observation. Can someone suggest a way to map them to a constant-length vector?
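For reference, a minimal sketch of the spline fit (synthetic data, and the degree and smoothing parameter are just illustrative):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Illustrative noisy observation.
x = np.linspace(0.0, 1.0, 180)
y = np.sin(6 * np.pi * x) + 0.1 * np.random.randn(x.size)

spline = UnivariateSpline(x, y, k=3, s=0.5)  # smoothing cubic spline

knots = spline.get_knots()   # the number of knots depends on the data,
coefs = spline.get_coeffs()  # so these vectors vary in length
print(knots.shape, coefs.shape)
```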

[Image: UnivariateSpline fits to the same observations]


2 Answers

Answer by fferri (score: 3)

Another way to compress your signal into a feature vector could be to do FFT analysis and use the first n FFT coefficients as your features.
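A minimal sketch of this, assuming each observation has at least 2*n samples (the function name is mine):

```python
import numpy as np

def fft_features(y, n=20):
    """First n FFT magnitude coefficients as a fixed-length feature vector.

    rfft of a real signal of length m gives m//2 + 1 coefficients; keeping
    only the first n makes the feature length independent of how many
    samples each observation contains (assuming m//2 + 1 >= n).
    """
    spectrum = np.abs(np.fft.rfft(y))
    return spectrum[:n]
```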

Or you can do a windowed FFT, so that you obtain a sequence of coefficient vectors.
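For example (again a sketch; the window length, step, and Hann taper are arbitrary choices):

```python
import numpy as np

def windowed_fft_features(y, window=64, step=32, n=8):
    """Sequence of FFT coefficient vectors from overlapping windows.

    Slides a Hann-tapered window over the signal and keeps the first n
    magnitude coefficients of each segment.
    """
    taper = np.hanning(window)
    frames = []
    for start in range(0, len(y) - window + 1, step):
        segment = y[start:start + window] * taper
        frames.append(np.abs(np.fft.rfft(segment))[:n])
    return np.array(frames)  # shape: (num_windows, n)
```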

Answer by Andrzej Pronobis (score: 3)

If your problem is mostly due to the varying number of samples, interpolation and then re-sampling could indeed be the way to go. However, I would use a less constrained interpolation technique: if you have a large number of points, even linear interpolation would work. Alternatively, you could use Gaussian process regression, which is nicely implemented in scikit-learn. In that case you don't need to apply a Gaussian filter first, since the regression itself smooths the data.
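A minimal sketch of the interpolate-and-resample idea (the function name and the target length of 100 are my own choices):

```python
import numpy as np

def resample_to_fixed_length(x, y, length=100):
    """Linearly interpolate an observation onto a fixed-size grid.

    Whatever the original number of samples, the output always has
    `length` values, which can be used directly as a feature vector.
    """
    grid = np.linspace(x.min(), x.max(), length)
    return np.interp(grid, x, y)
```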

Check this link out for an example of how to apply GPs for regression: http://scikit-learn.org/stable/modules/gaussian_process.html
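And a short sketch using scikit-learn's GaussianProcessRegressor (the page above may show a different interface depending on the version; the kernel and noise settings here are just placeholders):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Illustrative noisy observation.
x = np.linspace(0.0, 1.0, 150)
y = np.sin(8 * np.pi * x) + 0.2 * np.random.randn(x.size)

# alpha acts as an observation-noise term, so no separate
# Gaussian filtering of the raw signal is needed.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.05), alpha=0.05)
gp.fit(x.reshape(-1, 1), y)

# Evaluate the smoothed curve on a fixed grid -> constant-length features.
grid = np.linspace(0.0, 1.0, 100).reshape(-1, 1)
features = gp.predict(grid)
```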