I built a simple linear regression model with 'Score' as the dependent variable, and 'Activity' as the independent one. 'Activity' has 5 levels: 'listen' (reference level), 'read1', 'read2', 'watch1', 'watch2'.
Call:
lm(formula = Score ~ Activity)
Residuals:
     Min       1Q   Median       3Q      Max 
-22.6154  -8.6154  -0.6154   7.1346  31.3846 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)      41.615      2.553  16.302   <2e-16 ***
Activityread1     6.385      7.937   0.804   0.4254    
Activityread2    20.885      9.552   2.186   0.0340 *  
Activitywatch1    3.885      4.315   0.900   0.3728    
Activitywatch2  -11.415      6.357  -1.796   0.0792 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.02 on 45 degrees of freedom
Multiple R-squared:  0.1901,    Adjusted R-squared:  0.1181 
F-statistic:  2.64 on 4 and 45 DF,  p-value: 0.04594
In order to obtain all pairwise comparisons, I performed a TukeyHSD test, whose output I'm having difficulty interpreting. While the output of the model shows that the only significant effect we have is due to the contrast between 'listen' and 'read2', the TukeyHSD results yield that the only significant contrast exists between 'watch2' and 'read2'. What does this mean?
> TukeyHSD(aov(mod4), "Activity")
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = mod4)
$Activity
                    diff        lwr       upr     p adj
read1-listen    6.384615 -16.168371 28.937602 0.9279144
read2-listen   20.884615  -6.256626 48.025857 0.2034549
watch1-listen   3.884615  -8.376548 16.145779 0.8952957
watch2-listen -11.415385 -29.477206  6.646437 0.3885969
read2-read1    14.500000 -19.264610 48.264610 0.7397464
watch1-read1   -2.500000 -26.031639 21.031639 0.9981234
watch2-read1  -17.800000 -44.811688  9.211688 0.3466391
watch1-read2  -17.000000 -44.959754 10.959754 0.4278714
watch2-read2  -32.300000 -63.245777 -1.354223 0.0368820
watch2-watch1 -15.300000 -34.569930  3.969930 0.1783961 
                        
In your initial model summary,
Estimateis showing the estimated difference in mean for each group relative to the mean of the "listen" group (40.615). The "read2" group, has the largest shift (+20.885) away from the "listen" group is called significant withp = .0340when only these 4 comparisons are calculated.Since
TUKEYHSDis performing all pairwise comparisons for the group means (not just to reference level "listen" anymore), it is also performing p-value adjustments to account for all of these extra tests. Reason being, if you performed 20 comparisons on random data you'd expect one (1/20 or .05) to be called significant withp < .05simply because of doing that many tests. With the p-value adjustment factored in, your originally significant comparison between "listen - read2" no longer qualifies as significant.But the larger difference between "watch2 - read2" (-32.3), which wasn't tested in the original model summary, is large enough to be considered significant with
p = .03688even after doing all of the extra comparison adjusting.Hope that helps, you can read more about the multiple comparrison problem here . And see
?p.adjustfor R's implementation of the most popular methods.