Why does ELKI need a dbc.in file in addition to the distance matrix, and what should that file contain?


I tried to follow this tutorial on using ELKI with pre-computed distances for clustering.

http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances

I used the following set of command line options:

-dbc.filter FixedDBIDsFilter -dbc.startid 0 -algorithm clustering.OPTICS 
-algorithm.distancefunction external.FileBasedDoubleDistanceFunction 
-distance.matrix /path/to/matrix -optics.minpts 5 -resulthandler ResultWriter

ELKI fails with a configuration error saying that a dbc.in file is required:

The following configuration errors prevented execution:
No value given for parameter "dbc.in":
Expected: The name of the input file to be parsed.
No value given for parameter "parser.distancefunction":
Expected: Distance function used for parsing values.

My question is: what is the dbc.in file, and why must I provide it in addition to the distance matrix file? The pairwise distance matrix already specifies everything known about the point cloud, and I have no information other than the pairwise distances.

What should I do about dbc.in? Should I override it, or supply some dummy data? Please help me understand.

Thank you.

1 answer

Erich Schubert (best answer)

This is documented in the ELKI HowTos:

http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances

Using without primary data

-dbc DBIDRangeDatabaseConnection -idgen.count 100

However, there is a bug (a patch is on the HowTo page, and the fix will be in the next release), so right now you cannot fully use this. As a workaround, you can use a text file that enumerates the objects.
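The workaround file can be generated with a few lines of script. This is a minimal sketch, assuming that a plain text file with one integer ID per line is an acceptable way to "enumerate the objects"; the helper name `write_id_file` and the file name `ids.txt` are hypothetical, and the exact format your ELKI version's parser accepts should be checked against the HowTo page.

```python
from pathlib import Path


def write_id_file(path, count, start=0):
    """Write `count` consecutive integer IDs, one per line, beginning at `start`.

    Assumption: one ID per line is enough to enumerate the objects;
    this is not a format confirmed by the ELKI documentation.
    """
    ids = list(range(start, start + count))
    Path(path).write_text("".join(f"{i}\n" for i in ids))
    return ids


# Example usage (hypothetical file name), for a 100 x 100 distance matrix:
#   write_id_file("ids.txt", 100)
```

The number of lines should match the number of objects in the distance matrix. The resulting file would then presumably be passed via -dbc.in, together with the -dbc.filter FixedDBIDsFilter -dbc.startid 0 options from the question so that the object IDs line up with the matrix indices.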

The reason for this is that ELKI is designed to work on multi-relational data; it does not just process matrices. Some algorithms may need, for example, a geographic representation of an object, some measurements for this object, and a label for evaluation. That is three relations.

What the DBIDRange data source essentially does is create a single "fake" relation that is just the DBIDs 0 to 99. On algorithms that don't need actual data, but only distances (e.g. LOF or DBSCAN or OPTICS), it is sufficient to have object IDs and a distance matrix.
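To illustrate why IDs plus a distance matrix are enough for such algorithms, here is a small sketch in plain Python (not ELKI code). It computes an OPTICS-style core distance, taken here as the distance to the minPts-th nearest neighbor with the object itself counted, which is one common convention and an assumption on my part. The function name `core_distance` and the toy matrix are made up for the example.

```python
def core_distance(dist, obj_id, min_pts):
    """Distance from obj_id to its min_pts-th nearest neighbor.

    The object itself is counted as its own nearest neighbor
    (distance 0); this convention is an assumption of the sketch.
    """
    ids = range(len(dist))  # the "fake" relation: just the IDs 0..n-1
    neighbor_dists = sorted(dist[obj_id][other] for other in ids)
    return neighbor_dists[min_pts - 1]


# Toy symmetric 4 x 4 distance matrix; no coordinates are known.
dist = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 2.0, 6.0],
    [4.0, 2.0, 0.0, 3.0],
    [5.0, 6.0, 3.0, 0.0],
]

print(core_distance(dist, 0, 3))  # third-smallest entry of row 0
```

Nothing in the function looks at the objects themselves; it only needs an ID to index into the matrix, which is exactly what the ID-only relation provides.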