I have 100,000 points that I would like to cluster using the OPTICS algorithm in ELKI. I have a upper triangular distance matrix of about 5 billion entries for this point set. In the format that ELKI wants the matrix, it will take about 100GB in memory. I am wondering does ELKI handle that sort of data load? Can any one confirm if you have made this work before?
Related Questions in MACHINE-LEARNING
- How to not load all database records in my TListbox in Firemonkey Delphi XE8
- How to catch WM_DEVICECHANGE in a control other than TForm?
- show information with Rolling / moving messages delphi xe7
- What is the different between "Console target" and "GUI target" in DCC32 option?
- How to add new online ressources to RAD Studio help system
- C# and Delphi code have different behaviour when importing unmanaged dll
- Loop through records on a cxgrid and update a field/column
- Delphi 7 - Save to a Specific .INI Files Name
- TImagelist for large images
- how to modify a function so it returns an array of strings
Related Questions in CLUSTER-ANALYSIS
- How to not load all database records in my TListbox in Firemonkey Delphi XE8
- How to catch WM_DEVICECHANGE in a control other than TForm?
- show information with Rolling / moving messages delphi xe7
- What is the different between "Console target" and "GUI target" in DCC32 option?
- How to add new online ressources to RAD Studio help system
- C# and Delphi code have different behaviour when importing unmanaged dll
- Loop through records on a cxgrid and update a field/column
- Delphi 7 - Save to a Specific .INI Files Name
- TImagelist for large images
- how to modify a function so it returns an array of strings
Related Questions in DATA-MINING
- How to not load all database records in my TListbox in Firemonkey Delphi XE8
- How to catch WM_DEVICECHANGE in a control other than TForm?
- show information with Rolling / moving messages delphi xe7
- What is the different between "Console target" and "GUI target" in DCC32 option?
- How to add new online ressources to RAD Studio help system
- C# and Delphi code have different behaviour when importing unmanaged dll
- Loop through records on a cxgrid and update a field/column
- Delphi 7 - Save to a Specific .INI Files Name
- TImagelist for large images
- how to modify a function so it returns an array of strings
Related Questions in DBSCAN
- How to not load all database records in my TListbox in Firemonkey Delphi XE8
- How to catch WM_DEVICECHANGE in a control other than TForm?
- show information with Rolling / moving messages delphi xe7
- What is the different between "Console target" and "GUI target" in DCC32 option?
- How to add new online ressources to RAD Studio help system
- C# and Delphi code have different behaviour when importing unmanaged dll
- Loop through records on a cxgrid and update a field/column
- Delphi 7 - Save to a Specific .INI Files Name
- TImagelist for large images
- how to modify a function so it returns an array of strings
Related Questions in ELKI
- How to not load all database records in my TListbox in Firemonkey Delphi XE8
- How to catch WM_DEVICECHANGE in a control other than TForm?
- show information with Rolling / moving messages delphi xe7
- What is the different between "Console target" and "GUI target" in DCC32 option?
- How to add new online ressources to RAD Studio help system
- C# and Delphi code have different behaviour when importing unmanaged dll
- Loop through records on a cxgrid and update a field/column
- Delphi 7 - Save to a Specific .INI Files Name
- TImagelist for large images
- how to modify a function so it returns an array of strings
Popular Questions
- How do I undo the most recent local commits in Git?
- How can I remove a specific item from an array in JavaScript?
- How do I delete a Git branch locally and remotely?
- Find all files containing a specific text (string) on Linux?
- How do I revert a Git repository to a previous commit?
- How do I create an HTML button that acts like a link?
- How do I check out a remote Git branch?
- How do I force "git pull" to overwrite local files?
- How do I list all files of a directory?
- How to check whether a string contains a substring in JavaScript?
- How do I redirect to another webpage?
- How can I iterate over rows in a Pandas DataFrame?
- How do I convert a String to an int in Java?
- Does Python have a string 'contains' substring method?
- How do I check if a string contains a specific word?
Popular Tags
Trending Questions
- UIImageView Frame Doesn't Reflect Constraints
- Is it possible to use adb commands to click on a view by finding its ID?
- How to create a new web character symbol recognizable by html/javascript?
- Why isn't my CSS3 animation smooth in Google Chrome (but very smooth on other browsers)?
- Heap Gives Page Fault
- Connect ffmpeg to Visual Studio 2008
- Both Object- and ValueAnimator jumps when Duration is set above API LvL 24
- How to avoid default initialization of objects in std::vector?
- second argument of the command line arguments in a format other than char** argv or char* argv[]
- How to improve efficiency of algorithm which generates next lexicographic permutation?
- Navigating to the another actvity app getting crash in android
- How to read the particular message format in android and store in sqlite database?
- Resetting inventory status after order is cancelled
- Efficiently compute powers of X in SSE/AVX
- Insert into an external database using ajax and php : POST 500 (Internal Server Error)
I frequently use ELKI with 100k points, up to 10 million.
However, for this to be fast you should use indexes.
For obvious reasons, any dense matrix based approach will scale at best
O(n^2)
, and needO(n^2)
memory. Which is why I cannot process these data sets with R or Weka or scipy. They usually first try to compute the full distance matrix, and either fail halfway through, or run out of memory halfway through, or fail with negative allocation size (Weka, when your data set overflows the 2^31 positive integers, i.e. is around 46k objects).In the binary format, with float precision, the ELKI matrix format should be around
100000*999999/2*4 + 4
bytes, maybe add another 4 bytes for size information. This is 20 GB. If you use the "easy to use" ascii format, then it will indeed be more. But if you use gzip compression it may end up being about the same size. It's common to have gzip compress such data to 10-20% of the raw size. In my experience gzip compressed ascii can be as small as binary encoded doubles. The main benefit of the binary format is that it will actually reside on disk, and memory caching will be handled by your operating system.Either way, I recommend to not compute distance matrixes at all in the first place.
Because if you decide to go from 100k to 1 million, the raw matrix would grow to 2 TB, and when you go to 10 million it will be 200 TB. If you want double precision, double that.
If you are using distance matrixes, your method will be at best
O(n^2)
, and thus not scale. Avoiding computing all pairwise distances in the first place is an important speed factor.I use indexes for everything. For kNN or radius bound approaches (for OPTICS, use the epsion parameter to make indexes effective! Choose a low epsilon!) you can precompute these queries once, if you are going to need them repeatedly.
On a data set I frequently use, with 75k instances and 27 dimensions, the file storing the precomputed 101 nearest neighbors + ties, with double precision, is 81 MB (note: this can be seen as a sparse similarity matrix). By using an index for precomputing this cache, it takes just a few minutes to compute; and then I can ran most kNN based algorithms such as LOF on this 75k dataset in 108 ms (+262 ms for loading the kNN cache + parsing the raw input data 2364 ms, for a total runtime of 3 seconds; dominated by parsing double values).