Show family class in TukeyHSD


I am used to conducting Tukey post-hoc tests in Minitab. When I do, I usually get a family grouping of the dependent/predictor variables.

[Minitab output: Tukey pairwise comparisons with family grouping]

In R, using TukeyHSD() the family grouping is not displayed (or calculated?); it only displays the pairwise comparison between each pair of levels. Is it possible to display the family groupings as in Minitab?

Using the diamonds data set:

av <- aov(price ~ cut, data = diamonds)
tk <- TukeyHSD(av, ordered = TRUE, which = "cut")
plot(tk)

Output:

Fit: aov(formula = price ~ cut, data = diamonds)

$cut
                    diff        lwr       upr     p adj
Good-Ideal         471.32248  300.28228  642.3627 0.0000000
Very Good-Ideal    524.21792  401.33117  647.1047 0.0000000
Fair-Ideal         901.21579  621.86019 1180.5714 0.0000000
Premium-Ideal     1126.71573 1008.80880 1244.6227 0.0000000
Very Good-Good      52.89544 -130.15186  235.9427 0.9341158
Fair-Good          429.89331  119.33783  740.4488 0.0014980
Premium-Good       655.39325  475.65120  835.1353 0.0000000
Fair-Very Good     376.99787   90.13360  663.8622 0.0031094
Premium-Very Good  602.49781  467.76249  737.2331 0.0000000
Premium-Fair       225.49994  -59.26664  510.2665 0.1950425

Picture added to help clarify my response to Maurits's comment:

There are 2 answers

Answer by tflutre

See the cld function in the multcomp package, as explained here (copy-pasted below).

Example data set:

> data(ToothGrowth)
> ToothGrowth$treat <- with(ToothGrowth, interaction(supp,dose))
> str(ToothGrowth)
'data.frame':   60 obs. of  3 variables:
 $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
 $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
 $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
 $ treat: Factor w/ 6 levels "OJ.0.5","VC.0.5",..: 2 2 2 2 2 2 2 2 2 2 ...

Model fit:

> fit <- lm(len ~ treat, data=ToothGrowth)

All pairwise comparisons with Tukey test:

> apctt <- multcomp::glht(fit, linfct = multcomp::mcp(treat = "Tukey"))

Letter-based representation of all-pairwise comparisons (algorithm from Piepho 2004):

> lbrapc <- multcomp::cld(apctt)
> lbrapc
OJ.0.5 VC.0.5   OJ.1   VC.1   OJ.2   VC.2 
   "b"    "a"    "c"    "b"    "c"    "c" 
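Applied to the diamonds data from the question, the same cld() approach gives the letter groupings directly. This is a sketch, not part of the original answer; note that ggplot2's diamonds$cut is an *ordered* factor, which is converted to a plain factor here so that mcp() uses simple level-mean contrasts:

```r
# Sketch: applying multcomp::cld() to the diamonds example from the
# question. diamonds$cut is an *ordered* factor, so we convert it to a
# plain factor before using mcp().
library(ggplot2)   # for the diamonds dataset
library(multcomp)

d <- ggplot2::diamonds
d$cut <- factor(d$cut, ordered = FALSE)

fit <- aov(price ~ cut, data = d)
tuk <- multcomp::glht(fit, linfct = multcomp::mcp(cut = "Tukey"))
multcomp::cld(tuk)   # levels sharing a letter are not significantly different
```

Based on the TukeyHSD() output in the question, Good and Very Good should share a letter, as should Fair and Premium, with Ideal in a group of its own.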
Answer by Maurits Evers

Here is a step-by-step example of how to reproduce Minitab's table for the ggplot2::diamonds dataset. I've included as much detail/explanation as possible.

Please note that, as far as I can tell, the results shown in Minitab's table are not derived from Tukey's post-hoc test; they are based on the analysis-of-variance results. Tukey's honest significant difference (HSD) test is a post-hoc test that establishes which of all possible pairwise comparisons of group means are (honestly) statistically significant, given the ANOVA results.


In order to reproduce Minitab's "mean-grouping" summary table (see the first table of "Interpret the results: Step 3" of the Minitab Express Support), I recommend (re-)running a linear model to extract means and confidence intervals. Note that this is exactly how aov fits the analysis-of-variance model for each group.

Fit a linear model

We specify a 0 offset to get absolute estimates for every group (rather than estimates for the changes relative to an offset).

fit <- lm(price ~ 0 + cut, data = diamonds)
coef <- summary(fit)$coef
coef
#    Estimate Std. Error   t value Pr(>|t|)
#cutFair      4358.758   98.78795  44.12236        0
#cutGood      3928.864   56.59175  69.42468        0
#cutVery Good 3981.760   36.06181 110.41487        0
#cutPremium   4584.258   33.75352 135.81570        0
#cutIdeal     3457.542   27.00121 128.05137        0

Determine family groupings

In order to obtain something similar to minitab's "family groupings", we adopt the following approach:

  1. Calculate confidence intervals for all parameters
  2. Perform a hierarchical clustering analysis on the confidence interval data for all parameters
  3. Cut the resulting tree at a height corresponding to the standard deviation of the CIs. This gives us a grouping of parameter estimates based on their confidence intervals. This is a somewhat empirical approach, but it is justifiable since the tree measures pairwise distances between the confidence intervals, and the standard deviation can be interpreted as a Euclidean distance.

We start by calculating the confidence intervals and clustering the resulting distance matrix using hierarchical clustering with complete linkage.

CI <- confint(fit)
hc <- hclust(dist(CI))

We inspect the cluster dendrogram

plot(hc)

[Cluster dendrogram of the confidence intervals for the group means]

We now cut the tree at a height corresponding to the standard deviation of all CIs across all parameter estimates to get the "family groupings"

grps <- cutree(hc, h = sd(CI))

Summarise results

Finally, we collate all quantities and store results in a table similar to minitab's "mean-grouping" table.

library(tidyverse)
bind_cols(
    cut = rownames(coef),
    N = as.integer(table(fit$model$cut)),
    Mean = coef[, 1],
    Groupings = grps) %>%
    as.data.frame()
#           cut     N     Mean Groupings
#1      cutFair  1610 4358.758         1
#2      cutGood  4906 3928.864         2
#3 cutVery Good 12082 3981.760         2
#4   cutPremium 13791 4584.258         1
#5     cutIdeal 21551 3457.542         3

Note the near-perfect agreement of our results with those from Minitab's "mean-grouping" table: cut = Ideal is by itself in group 3 (group C in Minitab's table), while Fair+Premium share group 1 (Minitab: group A), and Good+Very Good share group 2 (Minitab: group B).
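As a quick sanity check (my addition, not part of the original answer), the family groupings line up with the non-significant pairs in the TukeyHSD() output shown in the question:

```r
# Sanity check: the only pairwise comparisons that are *not*
# significant at the 5% level should be exactly the within-group
# pairs found above (Very Good/Good and Premium/Fair).
library(ggplot2)   # for the diamonds dataset

tk <- TukeyHSD(aov(price ~ cut, data = diamonds))$cut
ns <- rownames(tk)[tk[, "p adj"] > 0.05]
ns
# Expect "Premium-Fair" and "Very Good-Good", matching groups 1 and 2
```

All other pairs have adjusted p-values well below 0.05, consistent with Ideal forming a group on its own.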