I have 100,000 points that I would like to cluster using the OPTICS algorithm in ELKI. For this point set I have an upper triangular distance matrix of about 5 billion entries. In the format that ELKI wants the matrix, it will take about 100 GB in memory. I am wondering: does ELKI handle that sort of data load? Can anyone confirm that they have made this work before?
I frequently use ELKI with 100k points, up to 10 million.
However, for this to be fast you should use indexes.
For obvious reasons, any dense-matrix-based approach will scale at best O(n^2) in time, and needs O(n^2) memory. This is why I cannot process these data sets with R, Weka, or scipy: they usually try to compute the full distance matrix first, and either fail halfway through, run out of memory, or fail with a negative allocation size (Weka, when your data set overflows the 2^31 positive integers, i.e. at around 46k objects).

In the binary format, with float precision, the ELKI matrix should take about 100000*99999/2*4 + 4 bytes, maybe plus another 4 bytes of size information. This is 20 GB. If you use the "easy to use" ASCII format, it will indeed be larger. But if you use gzip compression, it may end up about the same size; it is common for gzip to compress such data to 10-20% of the raw size, and in my experience gzip-compressed ASCII can be as small as binary-encoded doubles. The main benefit of the binary format is that it actually resides on disk, and memory caching is handled by your operating system.

Either way, I recommend not computing distance matrices at all in the first place: if you go from 100k to 1 million points, the raw matrix grows to 2 TB, and at 10 million points it reaches 200 TB. If you want double precision, double those numbers.
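To make the growth concrete, here is a small back-of-the-envelope sketch (plain Python arithmetic, not ELKI code; the per-entry sizes assumed are the usual 4-byte float and 8-byte double):

```python
def triangular_matrix_bytes(n: int, bytes_per_entry: int = 4) -> int:
    """Storage needed for an upper-triangular distance matrix of n points:
    n*(n-1)/2 pairwise distances, diagonal omitted."""
    return n * (n - 1) // 2 * bytes_per_entry

for n in (100_000, 1_000_000, 10_000_000):
    gb = triangular_matrix_bytes(n) / 1e9
    print(f"n={n:>10,}: {gb:,.0f} GB (float), {2 * gb:,.0f} GB (double)")
```

At n = 100,000 this reproduces the roughly 20 GB figure above; at 1 million and 10 million points it gives 2 TB and 200 TB respectively.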
If you use distance matrices, your method is at best O(n^2) and thus does not scale. Avoiding the computation of all pairwise distances in the first place is an important speed factor.

I use indexes for everything. For kNN or radius-bound approaches (for OPTICS, use the epsilon parameter to make indexes effective; choose a low epsilon!) you can precompute these queries once if you are going to need them repeatedly.
On a data set I use frequently, with 75k instances and 27 dimensions, the file storing the precomputed 101 nearest neighbors plus ties, in double precision, is 81 MB (note: this can be seen as a sparse similarity matrix). By using an index to precompute this cache, it takes just a few minutes to build; and then I can run most kNN-based algorithms, such as LOF, on this 75k data set in 108 ms (plus 262 ms for loading the kNN cache and 2364 ms for parsing the raw input data, for a total runtime of about 3 seconds, dominated by parsing double values).
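For comparison, here is a rough size estimate of such a sparse kNN cache versus the dense matrix (plain Python; I assume a 4-byte integer id and an 8-byte double distance per neighbor, and ignore ties and file framing, so this only approximates the 81 MB figure above):

```python
n, k = 75_000, 101

# Dense upper-triangular matrix of double-precision pairwise distances:
dense_bytes = n * (n - 1) // 2 * 8

# Sparse kNN cache: k neighbors per point, one id plus one distance each
# (assumed 4-byte id + 8-byte double; ties and framing ignored).
knn_bytes = n * k * (4 + 8)

print(f"dense matrix: {dense_bytes / 1e9:.1f} GB")   # about 22.5 GB
print(f"kNN cache:    {knn_bytes / 1e6:.1f} MB")     # about 90.9 MB
```

So even with generous per-neighbor overhead, the kNN cache is roughly 250 times smaller than the full matrix.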