Re-engineering CLDs

emmeans package, Version 1.10.0

Contents

  1. Introduction
  2. Grouping by underlining
  3. Grouping using letters or symbols
  4. Simulated example
  5. Alternative CLDs
    1. Equivalence sets
    2. Significance sets
  6. Conclusions
  7. References

Index of all vignette topics

Introduction

Compact letter displays (CLDs) are a popular way to display multiple comparisons, especially when there are more than a few means to compare. They are problematic, however, because they are prone to misinterpretation (more details later). Here we present some background on CLDs, and show some adaptations and alternatives that may be less prone to misinterpretation.

Grouping by underlining

CLDs generalize an “underlining” technique shown in some old experimental design and analysis textbooks, where results may be displayed something like this:

    trt1  ctrl   trt3   trt2   trt4
    ----------
          ------------------

The observed means are sorted in increasing order, so in this illustration, trt1 has the lowest mean, ctrl has the next lowest, and trt4 has the highest. The underlines group the means such that the extremes of each group are not significantly different according to a statistical test conducted at a specified alpha level. So in this illustration, trt1 is significantly less than trt3, trt2, and trt4, but not ctrl; and in fact trt4 is significantly greater than all the others.

This grouping also illustrates the dangers created by careless interpretations. Some observers of this chart might say that “trt1 and ctrl are equal” and that “ctrl, trt3, and trt2 are equal” – when in fact we have merely failed to show they are different. And further confusion results because mathematical equality is transitive – that is, these two statements of equality would imply that trt1 and trt2 must be equal, seemingly contradicting the finding that they are significantly different. Statistical nonsignificance does not have the transitivity property!
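The lack of transitivity can be checked mechanically. The following is a minimal base-R sketch that encodes the two underlines from the illustration as a logical matrix; the object names (`underlines`, `not_diff`) are made up for this illustration.

```r
trt <- c("trt1", "ctrl", "trt3", "trt2", "trt4")

# Each underline is a set of means that were not shown to differ
underlines <- list(c("trt1", "ctrl"), c("ctrl", "trt3", "trt2"))

# not_diff[i, j] is TRUE when some underline covers both means i and j
not_diff <- matrix(FALSE, 5, 5, dimnames = list(trt, trt))
for (g in underlines) not_diff[g, g] <- TRUE
diag(not_diff) <- TRUE

not_diff["trt1", "ctrl"]   # TRUE:  not shown to differ
not_diff["ctrl", "trt2"]   # TRUE:  not shown to differ
not_diff["trt1", "trt2"]   # FALSE: significantly different
```

The relation holds for the first two pairs but not for the third, which is exactly what a transitive relation would forbid.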

Grouping using letters or symbols

The underlining method becomes problematic in any case where the standard errors (SEs) of the comparisons are unequal – for example if we have unequal sample sizes, or a model with non-homogeneous variances. When the SEs are unequal, it is possible, for example, for two adjacent means to be significantly different, while two more distant ones do not differ significantly. If that happens, we can’t use underlines to group the means. The problem here is that a line is continuous, so it can only group means that are contiguous in the sorted order.
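A small numerical sketch may make this concrete. The means and SEs below are hypothetical, the estimates are assumed independent, and a plain two-sided normal cutoff with no multiplicity adjustment is assumed purely for illustration.

```r
m  <- c(a = 10.0, b = 11.0, c = 11.5)   # hypothetical means, sorted
se <- c(a = 0.2,  b = 0.2,  c = 1.0)    # hypothetical SEs; c is poorly estimated

crit <- qnorm(0.975)   # assumed cutoff: two-sided 5% level, unadjusted

# Standardized difference between two independent estimates
z <- function(i, j) abs(m[[i]] - m[[j]]) / sqrt(se[[i]]^2 + se[[j]]^2)

z("a", "b") > crit   # TRUE:  the adjacent means a and b differ
z("a", "c") > crit   # FALSE: the more distant means a and c do not
```

An underline joining a and c would have to pass under b, wrongly suggesting that a and b were not found different.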

However, Piepho (2004) solved this problem by using symbols instead of lines, and creating a display where any two means associated with the same symbol are deemed to not be statistically different. Using symbols, it is possible to have non-contiguous groupings, e.g., it is possible for two means to share a symbol while an intervening one does not share the same symbol. Such a display is called a compact letter display. We do not absolutely require actual letters, just symbols that can be distinguished from one another. In the case where all the differences have equal SEs, the CLD will be the “same” as the result of grouping lines, in that each distinct symbol will span a contiguous range of means that can be interpreted as a grouping line.

The R package multcompView (Graves et al., 2019) provides an implementation of the Piepho algorithm. The multcomp package (Hothorn et al., 2008) provides a generic cld() function, and the emmeans package provides a cld() method for emmGrid objects.

Simulated example

As a running example, we simulate some data from an unbalanced design with 7 treatments labeled A, B, …, G, and fit a model to those data:

set.seed(22.10)
mu = c(16, 15, 19, 15, 15, 17, 16)  # true means
n =  c(19, 15, 16, 18, 29,  2, 14)  # sample sizes
foo = data.frame(trt = factor(rep(LETTERS[1:7], n)))
foo$y = rnorm(sum(n), mean = mu[as.numeric(foo$trt)], sd = 1.0)

foo.lm = lm(y ~ trt, data = foo)

There are only four distinct true means underlying these seven treatments: Treatments B, D, and E have mean 15, treatments A and G have mean 16, and treatments F and C are solo players with means 17 and 19 respectively.
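The structure of this design can be read off directly from the simulation settings. The sketch below repeats the `mu` and `n` vectors from the code above; `groups` and `se_true` are names introduced here, and the SE calculation assumes the known simulation value sd = 1 rather than an estimate from the fitted model.

```r
mu <- c(16, 15, 19, 15, 15, 17, 16)   # true means, as in the simulation
n  <- c(19, 15, 16, 18, 29,  2, 14)   # sample sizes, as in the simulation
trt <- LETTERS[1:7]

# Treatments that share a true mean
groups <- split(trt, mu)
groups

# Theoretical SE of each treatment mean when sd = 1
se_true <- setNames(1 / sqrt(n), trt)
round(se_true, 3)
```

Treatment F, with only 2 observations, has by far the largest theoretical SE, which matters when interpreting the displays that follow.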

Default CLD

Let’s see a compact letter display for the marginal means (call this CLD #1):

library(emmeans)
foo.emm = emmeans(foo.lm, "trt")

library(multcomp)
cld(foo.emm)
##  trt emmean    SE  df lower.CL upper.CL .group
##  B     14.6 0.246 106     14.1     15.1  1    
##  E     15.0 0.177 106     14.6     15.3  1    
##  D     15.3 0.224 106     14.8     15.7  1    
##  G     15.3 0.254 106     14.8     15.9  1    
##  A     16.4 0.218 106     15.9     16.8   2   
##  F     16.6 0.673 106     15.2     17.9  12   
##  C     19.3 0.238 106     18.9     19.8    3  
## 
## Confidence level used: 0.95 
## P value adjustment: tukey method for comparing a family of 7 estimates 
## significance level used: alpha = 0.05 
## NOTE: If two or more means share the same grouping symbol,
##       then we cannot show them to be different.
##       But we also did not show them to be the same.

The default “letters” for the emmeans implementation are actually numbers, and we have three groupings indicated by the symbols 1, 2, and 3. This illustrates a case where grouping lines would not have worked, as we see in the fact that group 1 is not contiguous. We have (among other results) that treatment A differs significantly from treatments B, E, D, G, and C (at the default 0.05 significance level, with Tukey adjustment for multiple testing), and that C is significantly greater than all the other means, since it is the only mean in group 3.

An annotation warns that two means in the same group are not necessarily the same; yet CLDs present a strong visual message that they are. The careless reader who makes this mistake will have trouble with the gap in group 1, asking how A can differ from G and yet G and F are “the same.” The explanation is that the SE of F is huge, owing to its very small sample size, so it is hard for it to be statistically different from other means. It is almost a gift to obtain a non-contiguous grouping like this, as it forces the user to think more carefully about what these groupings do and do not imply.
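To see how symbols can arise from pairwise results alone, here is a minimal base-R sketch that rebuilds the `.group` column of CLD #1 from a logical "significantly different" matrix. It enumerates subsets by brute force and keeps the maximal sets with no significant pair; this is a simple stand-in chosen for illustration, not the Piepho algorithm and not what `cld()` does internally. The matrix is transcribed by hand from the display above, and all object names are made up for this sketch.

```r
trt <- c("B", "E", "D", "G", "A", "F", "C")   # display order of CLD #1

# differ[i, j] is TRUE when means i and j are significantly different.
# Start from "all different", then clear the pairs CLD #1 leaves unseparated.
differ <- matrix(TRUE, 7, 7, dimnames = list(trt, trt))
diag(differ) <- FALSE
g1 <- c("B", "E", "D", "G", "F")
differ[g1, g1] <- FALSE
differ[c("A", "F"), c("A", "F")] <- FALSE

# Enumerate every non-empty subset via the binary digits of 1..(2^k - 1)
k <- length(trt)
subsets <- lapply(seq_len(2^k - 1),
                  function(m) trt[(m %/% 2^(0:(k - 1))) %% 2 == 1])

# A subset is a valid group if it contains no significantly different pair
valid <- Filter(function(s) !any(differ[s, s, drop = FALSE]), subsets)

# Keep only maximal groups (not strictly contained in a larger valid group)
is_maximal <- function(s) {
  !any(vapply(valid, function(u) length(u) > length(s) && all(s %in% u),
              logical(1)))
}
groups <- Filter(is_maximal, valid)

# Each treatment gets the index of every group it belongs to
symbols <- vapply(trt, function(x) {
  paste(which(vapply(groups, function(g) x %in% g, logical(1))), collapse = "")
}, character(1))
symbols   # 1, 1, 1, 1, 2, 12, 3 for B, E, D, G, A, F, C
```

Brute-force enumeration is only practical for a handful of means, but it makes explicit that the display is determined entirely by the matrix of pairwise decisions.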

Alternative CLDs

Given the discussion above, one might wonder if it is possible to construct a CLD in such a way that means sharing the same symbol are actually shown to be the same. The answer is yes (otherwise we wouldn’t have asked the question!) – and it is quite easy to do, thanks to two things:

  1. The algorithm for making grouping letters is based on a matrix of Boolean values associated with each pair of means, where we set the value TRUE for any pair that is statistically different (those means must receive different grouping letters), and FALSE otherwise; and the algorithm works for any such Boolean matrix.
  2. There is such a thing as equivalence testing, by which we can establish with specified confidence that two means do not differ by more than a specified threshold \(\delta\). One simple way to do this is to conduct two one-sided tests (TOST) whereby we can conclude that two means are equivalent if we show both that the difference exceeds \(-\delta\) and is less than \(+\delta\). We can use this TOST method to set each Boolean pair to FALSE if they are shown to be equivalent and TRUE if not shown to be equivalent.
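The TOST idea in the second item can be sketched in a few lines of base R. The helper `tost_p()` is a hypothetical function written for this illustration, not part of emmeans. The inputs are the rounded means and SEs printed in CLD #1, with \(\delta = 1\) and \(\alpha = 0.05\) assumed for the example, and the SE of each difference is computed under the assumption that the estimated means are independent, so the results are approximate.

```r
# P value for a TOST equivalence test: the larger of the two one-sided P values
tost_p <- function(est, se, df, delta) {
  p_lower <- pt((est + delta) / se, df, lower.tail = FALSE)  # H0: diff <= -delta
  p_upper <- pt((est - delta) / se, df)                      # H0: diff >= +delta
  max(p_lower, p_upper)
}

# Rounded values taken from the CLD #1 display
emm <- c(B = 14.6, E = 15.0, D = 15.3)
se  <- c(B = 0.246, E = 0.177, D = 0.224)
df  <- 106

# SE of a difference, assuming independent estimates
se_diff <- function(i, j) sqrt(se[[i]]^2 + se[[j]]^2)

p_BE <- tost_p(emm[["E"]] - emm[["B"]], se_diff("B", "E"), df, delta = 1)
p_BD <- tost_p(emm[["D"]] - emm[["B"]], se_diff("B", "D"), df, delta = 1)

# Boolean flags as described in the list: FALSE = shown equivalent
c(BE = !(p_BE < 0.05), BD = !(p_BD < 0.05))   # BE is FALSE, BD is TRUE
```

With these inputs, B and E are shown to be equivalent while B and D are not, so only the first pair would be allowed to share a symbol.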

Equivalence sets

For our example, suppose, based on subject-matter considerations, that two means that differ by less than 1.0 can be considered equivalent. In the emmeans setup, we specify that we want equivalence testing simply by providing this nonzero threshold value as a delta argument. In addition, we typically will not make multiplicity adjustments to equivalence tests. Here is the result we obtain (call this CLD #2):

cld(foo.emm, delta = 1, adjust = "none")
##  trt emmean    SE  df lower.CL upper.CL .equiv.set
##  B     14.6 0.246 106     14.1     15.1  1        
##  E     15.0 0.177 106     14.6     15.3  12       
##  D     15.3 0.224 106     14.8     15.7   2       
##  G     15.3 0.254 106     14.8     15.9   2       
##  A     16.4 0.218 106     15.9     16.8    3      
##  F     16.6 0.673 106     15.2     17.9     4     
##  C     19.3 0.238 106     18.9     19.8      5    
## 
## Confidence level used: 0.95 
## Statistics are tests of equivalence with a threshold of 1 
## P values are left-tailed 
## significance level used: alpha = 0.05 
## Estimates sharing the same symbol test as equivalent

So we obtain five groupings – but only two if we ignore those that apply to only one mean. We have that treatments B and E can be considered equivalent, and treatments E, D, and G are considered equivalent. It is also important to know that we cannot say that means in different groups are significantly different.

Unlike CLD #1, we are showing only groupings of means that we can show to be the same. The first four means, which were grouped together earlier, are now assigned to two equivalence groupings. And treatment F is not grouped with any other mean – which makes sense because we have so little data on that treatment that we can hardly say anything.

Significance sets

Another variation is to simply reverse all the Boolean flags we used in constructing CLD #1. Then two means will receive the same letter only if they are significantly different. Thus, we really obtain ungrouping letters. We label these groupings “significance sets.” The resulting display has a distinctively different appearance, because common symbols tend to be far apart rather than contiguous. (Call this CLD #3)

cld(foo.emm, signif = TRUE)
##  trt emmean    SE  df lower.CL upper.CL .signif.set
##  B     14.6 0.246 106     14.1     15.1  1         
##  E     15.0 0.177 106     14.6     15.3   2        
##  D     15.3 0.224 106     14.8     15.7    3       
##  G     15.3 0.254 106     14.8     15.9     4      
##  A     16.4 0.218 106     15.9     16.8  1234      
##  F     16.6 0.673 106     15.2     17.9      5     
##  C     19.3 0.238 106     18.9     19.8  12345     
## 
## Confidence level used: 0.95 
## P value adjustment: tukey method for comparing a family of 7 estimates 
## significance level used: alpha = 0.05 
## Estimates sharing the same symbol are significantly different

Here we have five significance sets. By comparing with CLD #1, you can confirm that each significant difference shown explicitly here corresponds to one shown implicitly (by not sharing a group) in CLD #1.
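This correspondence can also be checked programmatically. The sketch below transcribes the symbol columns of CLD #1 and CLD #3 by hand and assumes that every symbol is a single character; `share()` is a helper written for this illustration.

```r
trt  <- c("B", "E", "D", "G", "A", "F", "C")
cld1 <- c("1", "1", "1", "1", "2", "12", "3")         # .group in CLD #1
cld3 <- c("1", "2", "3", "4", "1234", "5", "12345")   # .signif.set in CLD #3

# All 21 unordered pairs of treatments, as column index pairs
pairs <- combn(length(trt), 2)

# For each pair, do the two treatments have at least one symbol in common?
share <- function(sym) {
  chars <- strsplit(sym, "")
  apply(pairs, 2, function(p) {
    length(intersect(chars[[p[1]]], chars[[p[2]]])) > 0
  })
}

# Sharing a symbol in CLD #3 should coincide with sharing none in CLD #1
all(share(cld3) == !share(cld1))   # TRUE
sum(share(cld3))                   # 10 significantly different pairs
```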

Conclusions

Compact letter displays show symbols based on statistical testing results. In such tests, we have strong conclusions or findings – those that have small P values, and weak conclusions or non-findings – those where the P value is not less than some \(\alpha\). When we create visual flags such as grouping lines or symbols, those come across visually as findings, and the problem with standard CLDs is that those are the non-findings. We show two simple ways to use software that creates CLDs so that actual findings are flagged with symbols. It is hoped that people will find these modifications useful in visually displaying comparisons among means.

References

Graves, Spencer, Piepho, Hans-Peter, Selzer, Luciano, and Dorai-Raj, Sundar (2019). multcompView: Visualizations of Paired Comparisons. R package version 0.1-8, https://CRAN.R-project.org/package=multcompView

Hothorn, Torsten, Bretz, Frank, and Westfall, Peter (2008). Simultaneous Inference in General Parametric Models. Biometrical Journal 50(3), 346–363.

Piepho, Hans-Peter (2004). An algorithm for a letter-based representation of all pairwise comparisons. Journal of Computational and Graphical Statistics 13(2), 456–466.
