The dataset I want to cluster consists of ~1000 samples and 10 features, which have different scales and ranges (negative, positive, both). Using scipy.stats.normaltest() I found that none of the features are normally-distributed (all p-values < 1e-4, small enough to reject the null hypothesis that the data are taken from a normal distribution). But all of the distance measures that I'm aware of assume normally-distributed data (I was using Mahalanobis until I realized how non-uniform the data was). What distance measures would one use in this situation? Or is this where one simply has to normalize every feature and hope that that doesn't introduce bias?
distance metrics for clustering non-normally distributed data
593 views. Asked by kevinafra
There is 1 answer
Why do you think every distance measure assumes normally distributed data? (Normal, by the way, is not the same as uniform.)
Consider Euclidean distance. In many physical applications this distance makes perfect sense, because it is the distance "as the crow flies". Manhattan distance makes a lot of sense when movement is constrained to two axes that cannot be used at the same time. Both are completely appropriate for non-normally distributed data.
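To make this concrete, here is a minimal sketch of computing both distances on deliberately non-normal data. The synthetic matrix `X` is only a hypothetical stand-in for the asker's data, and the median/IQR rescaling is my own assumption about one way to handle features on different scales. Neither distance requires it, and neither makes any distributional assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
n_samples = 1000

# Hypothetical stand-in for the real data: 10 non-normal features
# (skewed, bounded, heavy-tailed) on very different scales.
columns = [
    rng.exponential(scale=5.0, size=n_samples),      # positive, skewed
    rng.uniform(-100.0, 100.0, size=n_samples),      # bounded, both signs
    -rng.lognormal(mean=2.0, size=n_samples),        # negative, skewed
    rng.standard_t(df=3, size=n_samples),            # heavy tails
]
columns += [rng.lognormal(size=n_samples) for _ in range(6)]
X = np.column_stack(columns)

# Assumption: rescale each feature by its median and interquartile range
# so that no single feature dominates the distance. This is a choice about
# feature weighting, not a normality requirement.
median = np.median(X, axis=0)
q75, q25 = np.percentile(X, [75, 25], axis=0)
X_scaled = (X - median) / (q75 - q25)

# Condensed pairwise distance vectors (one entry per pair of samples).
d_euclidean = pdist(X_scaled, metric="euclidean")
d_manhattan = pdist(X_scaled, metric="cityblock")

# Square, symmetric distance matrix, usable by clustering routines that
# accept precomputed distances.
D = squareform(d_euclidean)
```

The condensed vectors can be passed directly to `scipy.cluster.hierarchy.linkage`, for example, if hierarchical clustering is what you are after.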