I'm using ELKI to cluster, in a hierarchical way, a dataset of geolocations using OPTICSXi. The result of the execution of the algorithm is a set of files.
The content of a file could be:
# Cluster: nameOfCluster
# OPTICSModel
# Parents: nameOfParents (this element doesn't exist for the root cluster)
# Children: nameOfChild_0, nameOfChild_1 ... nameOfChild_n, (optional)
ID=1 lat0 lon0 reachability=?
ID=3062 lat1 lon1 reachability=1.30972586 predecessor=1
ID=7383 lat2 lon2 reachability=2.56784445 predecessor=3062
ID=42839 lat3 lon3 reachability=4.05510623 predecessor=1
I don't understand whether the elements in each file (four in the example) all belong to the same cluster or could belong to different clusters. In the latter case, do I need to write code that builds the clusters (for example, by following the predecessor of each node), or are there parameters I can specify in ELKI to obtain each single cluster?
By default, ELKI will produce a directory with one file per cluster, unless the output file already exists, in which case all the clusters are written into the same file, separated by comments as seen above.
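Either layout can be read with a small parser that starts a new cluster at every "# Cluster:" comment. The following is only a sketch based on the sample in the question, not an official ELKI reader: the function names and the returned dictionary layout are made up for illustration, and it assumes cluster names contain no spaces or commas.

```python
import re


def _to_float(value):
    """Convert to float; return None for placeholders such as '?'."""
    try:
        return float(value)
    except ValueError:
        return None


def _names(line):
    """Split the part after ':' on commas/whitespace (assumed separators)."""
    rest = line.split(":", 1)[1]
    return [n for n in re.split(r"[,\s]+", rest) if n]


def _point(line):
    """Parse one 'ID=... x y reachability=... predecessor=...' line."""
    tokens = line.split()
    point = {
        "id": int(tokens[0].split("=", 1)[1]),
        "coords": [],
        "reachability": None,
        "predecessor": None,
    }
    for tok in tokens[1:]:
        if "=" in tok:
            key, value = tok.split("=", 1)
            if key == "reachability":
                point["reachability"] = _to_float(value)
            elif key == "predecessor":
                point["predecessor"] = int(value)
        else:
            number = _to_float(tok)
            if number is not None:  # skip non-numeric tokens such as labels
                point["coords"].append(number)
    return point


def parse_elki_clusters(text):
    """Return {cluster name: {'parents', 'children', 'points'}}."""
    clusters = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# Cluster:"):
            name = line.split(":", 1)[1].strip()
            current = {"parents": [], "children": [], "points": []}
            clusters[name] = current
        elif current is None:
            continue  # ignore anything before the first cluster header
        elif line.startswith("# Parents:"):
            current["parents"] = _names(line)
        elif line.startswith("# Children:"):
            current["children"] = _names(line)
        elif line.startswith("ID="):
            current["points"].append(_point(line))
    return clusters
```

With one file per cluster, you would call parse_elki_clusters on the contents of each file and merge the resulting dictionaries. With a single combined file, one call is enough.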
With a hierarchical result, such as that of OPTICSXi, you should however also treat all members of the child clusters as part of the parent. These are clusters nested inside the parent; they are not repeated in the parent's output, to reduce redundancy.
Compare the output of OPTICSXi to the output of OPTICS. What the Xi approach does is split the data for you, based on sudden drops in reachability-distance. All clusters of Xi should be subsequences of the original OPTICS cluster order.
In your case, you may have chosen minPts too small if your cluster has just 4 elements. (Although you may have truncated the file, or you may have a lot of elements in child clusters, so the output may be fine.)
Also note that you will usually want to validate whether the first element(s) of your cluster should belong to the cluster or not; similarly for the last elements. OPTICSXi tends to err on the first elements, but not in a systematic way that would be trivial to fix. The first and last elements are those that bridge the gap from one cluster to another. You really should verify these manually (which is a good reason not to choose minPts too small).
I strongly recommend building or using a visualization for your specific use case. Then you could just load such a cluster into your visualization and visually inspect whether the result makes sense to you. I have used OPTICSXi on geographic data, and that worked very well for me.
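To make the point about nested clusters concrete: the full membership of a cluster is its own elements plus those of all its descendants. Here is a minimal sketch, assuming you have already read the files into a dictionary that maps each cluster name to its child names and its own element IDs. That dictionary layout and the name all_member_ids are hypothetical and not part of ELKI.

```python
def all_member_ids(clusters, name):
    """Collect the IDs of cluster `name` and of every nested child cluster."""
    members = set()
    seen = set()      # guards against visiting a cluster twice
    stack = [name]
    while stack:
        current = stack.pop()
        if current in seen or current not in clusters:
            continue
        seen.add(current)
        members.update(clusters[current]["ids"])
        stack.extend(clusters[current].get("children", []))
    return members


# Hypothetical hierarchy: root -> a -> a1, and root -> b
example = {
    "root": {"children": ["a", "b"], "ids": [1]},
    "a": {"children": ["a1"], "ids": [3062, 7383]},
    "a1": {"children": [], "ids": [42839]},
    "b": {"children": [], "ids": [5, 6]},
}
root_members = all_member_ids(example, "root")
```

Here root_members contains all six IDs, while all_member_ids(example, "a") contains only the elements of a and a1. Using sets means an ID is counted once even if a cluster were reachable through more than one parent.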