I am trying to implement kernel k-means clustering with the kkmeans() function from the kernlab R package. My problem is that my code returns the expected output when I specify some numbers of clusters with the function's centers argument, but throws an error for other numbers of clusters:
Error in if (sum(abs(dc)) < 1e-15) break : missing value where TRUE/FALSE needed
My guess is that this is a convergence issue since the error seems to arise when I increase the number of clusters, but this would be surprising since I have many more rows than the number of clusters I'm specifying. While I can specify 10 clusters with success with an 8000x3 matrix, I receive an error with 100 clusters. Similarly, I can specify 5 clusters but not 10 with a 50-row subset of that data.
Below is a reproducible minimal example where my code replicates the success and the error.
Error if centers = 10
kernlab::kkmeans(mymat, centers=10)
#> Using automatic sigma estimation (sigest) for RBF or laplace kernel
#> Error in if (sum(abs(dc)) < 1e-15) break: missing value where TRUE/FALSE needed
No error if centers = 5
kernlab::kkmeans(mymat, centers=5)
#> Using automatic sigma estimation (sigest) for RBF or laplace kernel
#> Spectral Clustering object of class "specc"
#>
#> Cluster memberships:
#>
#> 1 1 1 1 2 1 1 3 3 5 5 5 3 2 2 2 4 4 3 3 5 2 2 5 5 5 5 5 5 2 4 3 3 3 2 2 5 3 3 5 5 4 4 4 3 1 4 2 5 3
#>
#> Gaussian Radial Basis kernel function.
#> Hyperparameter : sigma = 0.756590498067127
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 15.75871 -16.69486 191.5841
#> [2,] 16.74850 -21.94730 186.8914
#> [3,] 15.99483 -18.95892 190.2622
#> [4,] 15.45729 -18.13571 191.9611
#> [5,] 16.69136 -22.19600 187.0055
#>
#> Cluster size:
#> [1] 7 10 12 7 14
#>
#> Within-cluster sum of squares:
#> [1] 301006.7 443237.8 607889.4 305777.1 685823.5
Example data (50x3 matrix)
mymat <- structure(c(15.9390001296997, 15.9079999923706, 16.087999343872,
15.7930002212524, 15.9619998931884, 15.6129999160766, 15.7550001144409,
16.7740001678466, 16.9080009460449, 17.0769996643066, 16.3640003204345,
16.5960006713867, 16.579999923706, 16.4570007324218, 16.2320003509521,
16.1639995574951, 15.6180000305175, 15.5109996795654, 15.5120000839233,
15.628999710083, 16.9950008392333, 17.3530006408691, 17.2229995727539,
16.8910007476806, 17.1800003051757, 17.1709995269775, 16.9860000610351,
16.704999923706, 16.273000717163, 15.8830003738403, 15.6230001449584,
15.333999633789, 15.3839998245239, 15.3870000839233, 17.1119995117187,
17.6200008392333, 16.8349990844726, 16.4969997406005, 16.2479991912841,
16.1259994506835, 15.8059997558593, 15.378999710083, 15.4320001602172,
15.2100000381469, 15.2519998550415, 15.2150001525878, 15.4280004501342,
17.4790000915527, 16.6739997863769, 16.4330005645751, -16.6299991607666,
-16.9529991149902, -17.5610008239746, -17.8290004730224, -18.6200008392333,
-17.1079998016357, -16.25, -21.716999053955, -21.1219997406005,
-21.8209991455078, -20.1840000152587, -20.0450000762939, -20.9599990844726,
-19.5240001678466, -18.6590003967285, -19.4379997253417, -18.6280002593994,
-18.0669994354248, -16.204999923706, -15.5830001831054, -23.9489994049072,
-23.57200050354, -24.3969993591308, -23.2880001068115, -22.6019992828369,
-23.2329998016357, -22.5979995727539, -22.6140003204345, -20.8059997558593,
-19.4300003051757, -19.4729995727539, -17.5690002441406, -16.8110008239746,
-15.2930002212524, -25.2509994506835, -24.7649993896484, -24.8080005645751,
-21.9939994812011, -21.5189990997314, -20.329999923706, -20.25,
-19.1380004882812, -18.6180000305175, -18.5900001525878, -16.1620006561279,
-14.5329999923706, -14.4359998703002, -25.8169994354248, -24.2159996032714,
-22.57200050354, 190.996994018554, 190.996002197265, 190.18699645996,
191.039993286132, 190.205993652343, 191.919006347656, 191.766006469726,
187.14599609375, 186.889007568359, 186.225997924804, 188.60400390625,
187.932006835937, 187.837005615234, 188.453002929687, 189.382995605468,
189.360000610351, 191.25, 191.845001220703, 192.580001831054,
192.414993286132, 185.358001708984, 184.570999145507, 184.595993041992,
186.091995239257, 185.613998413085, 185.25, 186.235000610351,
187.003005981445, 188.744995117187, 190.169998168945, 190.921005249023,
192.628997802734, 192.768005371093, 193.281997680664, 184.602996826171,
183.796005249023, 185.414001464843, 187.811004638671, 188.615005493164,
189.263000488281, 190.167007446289, 191.781997680664, 191.837997436523,
192.582000732421, 193.399002075195, 194.184005737304, 193.509994506835,
183.776000976562, 186.173995971679, 187.774993896484), dim = c(50L,
3L), dimnames = list(NULL, c("x", "y", "z")))
This appears to be an issue with something randomly generated internally by the function during your kkmeans() call. I don't have an answer for "why" this is happening, and you'll likely have to check with the authors to determine whether it's a bug or intended behavior.
While I reproduced your error with your data and code (running a fresh instance of R every time), the exact same function call also sometimes produces other errors and sometimes doesn't produce an error. However, whether it does so is entirely reproducible when you call set.seed(), suggesting it has something to do with starting values that determine other parameters of the model.
Below I show (a) that this can produce an alternative error (actually, I saw a third but didn't save the seed to reproduce it), (b) that even when it does "converge," it produces pretty different clusters just on the basis of the random seed, and (c) that the hyperparameter tuning is heavily influenced by the random number seed. I forgot to save the seed for the run where I was able to get some clustering results with 10 clusters.
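One way to check this kind of seed dependence systematically is to run the same call under several seeds and record which ones fail. The sketch below is only an illustration: try_seeds() and toy_fit() are hypothetical names, not part of kernlab, and toy_fit() is a stand-in that fails at random so the snippet runs without any packages.

```r
# Hypothetical helper (not part of kernlab): run `fit_fun` once per seed and
# record whether it succeeded or which error message it raised.
try_seeds <- function(fit_fun, seeds) {
  rows <- lapply(seeds, function(s) {
    set.seed(s)  # fix the RNG state so each run is repeatable
    msg <- tryCatch({
      fit_fun()
      NA_character_  # NA means the call finished without an error
    }, error = function(e) conditionMessage(e))
    data.frame(seed = s, ok = is.na(msg), error = msg,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Stand-in for a seed-sensitive fit: fails whenever the random draw is small.
toy_fit <- function() {
  if (runif(1) < 0.5) stop("toy failure")
  invisible(TRUE)
}

first  <- try_seeds(toy_fit, seeds = 1:10)
second <- try_seeds(toy_fit, seeds = 1:10)
print(first)
identical(first, second)  # the same seeds reproduce the same failures
```

With kernlab installed and mymat defined as in the question, the same helper could be called as try_seeds(function() kernlab::kkmeans(mymat, centers = 10), seeds = 1:20) to see which seeds trigger which error.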
I don't have an answer for why this happens: my hunch is that the automatically generated settings are nonsensical or out of bounds in some cases, and this produces an error. This may be because your data are in some way strange, or because the algorithm for setting the hyperparameter(s) doesn't make much sense. It could also be a bug, so it is perhaps worth posting as an issue.
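Whatever the root cause, the error message itself only says that the condition inside if() evaluated to NA. The sketch below reproduces that mechanism in base R. The values in dc are made up for illustration and are not taken from a real kkmeans() run, and where a missing value would come from inside kkmeans() is not established here. safe_converged() is a hypothetical helper, not kernlab code.

```r
# Illustrative values only; `dc` here is NOT taken from a real kkmeans() run.
dc <- c(0.2, 0 / 0)      # 0 / 0 is NaN in R
total <- sum(abs(dc))    # NaN propagates through abs() and sum()
cond <- total < 1e-15    # comparing NaN with a number yields NA
is.na(cond)

# Evaluating if() on an NA condition raises an error instead of branching.
msg <- tryCatch(
  if (cond) "converged" else "not converged",
  error = function(e) conditionMessage(e)
)
msg

# A defensive variant of the same check (hypothetical, not kernlab code):
# treat a non-finite total as "not converged" rather than erroring.
safe_converged <- function(dc, tol = 1e-15) {
  total <- sum(abs(dc))
  is.finite(total) && total < tol
}
safe_converged(dc)
safe_converged(c(0, 0))
```

So a single missing or undefined entry in the quantity being summed is enough to trigger the error, regardless of how many rows the input has.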
In any case, a question to ask yourself is whether you want to use a method that fails this inconsistently, produces pretty different results across random seeds, and leaves you unsure whether the algorithm is actually doing what it claims even when it does run.
Example 1:
centers = 5, set.seed(123): no error.
Example 2:
centers = 5, set.seed(3): no error. Works, but pretty different numbers of observations per cluster! Note the different hyperparameter.
Example 3:
centers = 5, set.seed(999): no error. Works, but pretty different numbers of observations per cluster! Note the different hyperparameter again!
Example 4:
centers = 10, set.seed(99): new error.
Example 5:
centers = 10, set.seed(3): original error.
Not included: an additional error with centers = 10 (not finding all of the columns in the matrix) and a run that successfully returned some clusters with centers = 10.