2  Effect Size: Explanation and Guidelines

Effect size is a simple idea that is finally gaining traction. It is just a set of statistics that describe the size of an effect. Effect size statistics are usually standardized, so a given effect size statistic can be compared directly with that same type of effect size statistic from other analyses—or even from other studies that sample the same or similar populations.

The effect being measured can be either a difference (such as the difference between an experimental-group and a control-group mean, or the difference in the number of events between groups) or an association (like the correlation between two variables). Different effect size statistics are computed in different ways; this means that we cannot usually compare one type of effect size statistic directly to another type. We can, though, still compare the same statistic from one analysis/sample to that same statistic from another analysis/sample. In addition, as noted in Section 2.3, below, we can also often convert between effect size statistics if we need to.

Effect sizes are descriptive statistics. For measures of the size of an association (like a correlation), an effect size statistic may assume a linear relationship1, but effect size statistics don’t assume, e.g., that the population is normally distributed. Since they make few assumptions, effect size statistics are inherently robust.

Effect size statistics can complement significance tests. Significance is, of course, a yes-or-no indication of whether there is “enough” of a difference/association relative to noise: An effect is either significant or not; there are no gradations to significance. Effect size statistics do show gradations and so can be used to properly provide the nuance that people seek when they report that something is “very” or “slightly”—or even “almost”—significant. (As noted in Section 2.2 below, effect size statistics are often described as being “small,” “medium,” or “large,” but this valuation of them doesn’t—well, shouldn’t—carry anything but an arbitrary weight.)

We can also combine reporting an effect size statistic with an informal test of significance by adding confidence intervals around an effect size statistic. An effect size statistic gives the magnitude of an effect; a significance test usually indicates whether we are 95% sure that a given effect is “not zero.” Therefore, if the 95% confidence interval for an effect does not overlap zero, then that effect is likely significant2.

2.1 Common Effect Size Statistics

2.1.1 Cohen’s d

It may be instructive to begin a deeper look at effect size statistics by starting with one of the most common, Cohen’s d. It’s also pretty straightforward: Cohen’s d is the difference between two means, just a difference that is standardized so that we can compare one mean difference (one Cohen’s d) to another mean difference (another Cohen’s d). The mean difference is standardized, like most things in statistics, by dividing it by the standard deviation (SD):

\[\text{Cohen's }d = \frac{\text{First Mean}-\text{Second Mean}}{\text{Pooled }SD}.\]

We combine (or “pool”) the SDs because there are two of them (one SD for each mean). To do this, we essentially take the average of the two SDs3.

Therefore, Cohen’s d is presented in terms of standard deviations. A Cohen’s d of 1 means that the means are one standard deviation apart.
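As a minimal sketch (the scores and group names here are hypothetical), Cohen’s d can be computed in R by averaging the two variances to pool the SDs and then standardizing the mean difference:

```r
# Hypothetical scores for two groups
control      <- c(10, 12, 11, 14, 13, 12, 11, 13)
experimental <- c(13, 15, 14, 16, 15, 14, 13, 16)

# Pool the SDs: average the two variances, then take the square root
pooled_sd <- sqrt((var(control) + var(experimental)) / 2)

# Cohen's d: the standardized difference between the two means
cohens_d <- (mean(experimental) - mean(control)) / pooled_sd
```

A d computed this way is read directly in SD units, so a value near 2 here means the group means are about two standard deviations apart.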

You may remember that z-scores are also presented in terms of standard deviations—that a z-score of 1 means that that person’s score is one standard deviation away from the mean. This isn’t a coincidence and means that Cohen’s d can be looked at as a z-score.

Given its ease of interpretation and computation (nearly everyone reports means and SDs), Cohen’s d is used often; in fact, it is usually the standard of measurement used to compare effects across studies in meta-analyses.

2.1.2 Correlations

Correlations—specifically squared correlations—also measure the proportion of variance accounted for. And yes, squared correlations are indeed effect size statistics. For example, if the correlation between two variables is .50, i.e., if r = .50, then \(r^2 = .50^2 = .25\). In that case, the correlation accounts for 25% of the variance in each of the variables.
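For instance, squaring a correlation computed in R gives the shared variance directly; the built-in mtcars data are used here just for illustration:

```r
# Correlation between fuel economy and car weight in the built-in mtcars data
r <- cor(mtcars$mpg, mtcars$wt)

# The squared correlation: the proportion of variance the two variables share
r2 <- r^2
```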

2.1.3 Cohen’s f2 and f

Cohen’s f2 (and f) are a little more complicated, but only in that they partial out other effects. They are used to describe terms in regression models and thus represent the proportion of the total variance accounted for by a given model term (be it a main effect, an interaction, or whatever) after removing the variance accounted for by any other terms (if there are any). f2 is thus similar to a model R2, which also measures proportion of variance. R2 and f2 are in fact related, so it’s helpful to explain f2 through R2.

R2 shows the proportion of variance accounted for by the whole model (and thus by all of the terms in it combined), while f2 shows the proportion for just a particular term. In fact, Cohen’s f2 is defined in terms of the model R2 (Cohen, 1988, p. 410). Cohen’s f2 for a model with only one term in it is equal to \(\frac{R^2}{(1 - R^2)}\)4, thus the amount of variance explained by that model term relative to the amount of variance not explained by it. When there is more than one term in a model, Cohen’s f2 for each term is the proportion of variance accounted for by that term after partialing out the effects of all other terms.

For example, if the R2 of a model with only one term in it is .2, then the f2 for that term is \(\frac{.2}{(1 - .2)} = \frac{.2}{.8} =\) .25. If there were more than one term in the model, then that .25 would be divided up among the various terms, with more impactful terms taking larger pieces of it.
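The arithmetic in that example can be checked directly:

```r
# f-squared for a single-term model, computed from the model R-squared
R2 <- 0.2
f2 <- R2 / (1 - R2)   # .2 / .8 = 0.25
```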

Cohen’s f2 is simply the square of Cohen’s f. Why use both f and f2 in statistics? Because sometimes it’s easier to talk about the squared value (since it relates to R2) and sometimes it’s easier to talk about things unsquared. f2, though, is more common.

More about Cohen’s f can be found at this Statistics How to page.

2.1.4 Eta-squared (η2)

Like Cohen’s f, η2 is another common measure of effect size. It is often used in ANOVAs and thus with nominal variables; Cohen (1988, pp. 411 & 281) himself posits that the only real difference between f2 and η2 is that η2 is used for nominal variables. Cohen’s f and η2 are computed slightly differently, though: η2 is \(\frac{\text{Cohen's }f^{2}}{(1 + \text{Cohen's }f^{2})}\). You won’t be using that formula much, but it shows you that they are measuring the same thing, just with somewhat-different formulas.

Therefore, like Cohen’s f, η2 measures the size of a parameter in a model, such as a main effect or interaction term. Specifically, η2 is the sum of squares of the given effect divided by the total sum of squares; i.e., η2 \(= \frac{SS_{Effect}}{SS_{Total}}\). In other words—like R2—η2 indicates the proportion of total variance that is accounted for by that given variable (or interaction term). Note that if there is more than one term in the model, the reported value is instead most likely partial η2, the effect of that variable/term after partialing out the effects of the other terms in the model.
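As a sketch, that sums-of-squares definition can be applied to a one-way ANOVA in R, here using the built-in chickwts data (chosen only for illustration):

```r
# One-way ANOVA: chick weight by feed type
fit <- aov(weight ~ feed, data = chickwts)

# Pull the sums of squares: the first entry is the effect, the second the residuals
ss <- summary(fit)[[1]][["Sum Sq"]]

# Eta-squared: SS_Effect / SS_Total
eta2 <- ss[1] / sum(ss)
```

Because there is only one term here, this value is also the model R2; with more terms in the model, partial η2 would usually be reported instead.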

This Analysis Factor post gives a good further explanation of η2. Recommendations on interpreting and reporting η2 are given well in this StackExchange Q&A.

2.1.5 Odds Ratios

Please note that this section needs a little work, expanding it out more beyond Chen et al. (2010).

Odds ratios represent how different the odds of some discrete outcome are between two groups—e.g., whether the number of women developing pre-eclampsia differs between Black and non-Black women. Since they measure the magnitude of a difference in outcomes (e.g., the morbidity of pre-eclampsia), odds ratios are indeed a measure of effect size.

Unlike most effect size statistics, though, odds ratios are not presented in terms of standard deviations5. Instead, they are presented as, well, odds. It is thus less easy to directly compare odds and odds ratios to other measures of effect size.

Chen et al. (2010) nonetheless give some guidance by providing ranges of effect size criteria for odds ratios, comparing values against the criteria for “small,” “medium,” and “large” Cohen’s ds. Chen et al.’s (2010) rules of thumb for “small,” “medium,” and “large” odds ratios (below) deserve special explanation. The size of an odds ratio depends not just on the difference in outcomes within a group (e.g., the numbers of Black women with and without pre-eclampsia), but also on the difference in outcomes within a comparison group (e.g., the numbers of non-Black women with and without pre-eclampsia). It is thus not so easy to compute simple (simplistic) rules of thumb for the sizes of odds ratios6.

In addition, the exact values for what to consider a “small,” “medium,” or “large” effect depend on the overall frequency of the event, with rarer events requiring larger odds ratios to equate to a given level of Cohen’s d.

Nonetheless, Chen et al. (2010) present some guidelines that can serve in most cases. Using the median values suggested by their results:

  • “Small” \(\approx\) 1.5
  • “Medium” \(\approx\) 2.75
  • “Large” \(\approx\) 5

However, those suggestions can range considerably, depending on the absolute value of probability in the reference group (infection rates in the non-exposed group in Chen et al.’s article):

| P of Event in Reference Group | “Small” | “Medium” | “Large” |
|---|---|---|---|
| .01 | 1.68 | 3.47 | 6.71 |
| .05 | 1.53 | 2.74 | 4.72 |
| .10 | 1.46 | 2.50 | 4.14 |

Please consult their table on page 862 for more precise equivalents with Cohen’s d.
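To make the guidelines above concrete, an odds ratio can be computed from a 2 × 2 table of counts (the counts here are made up for illustration):

```r
# Hypothetical counts of events and non-events in two groups
events     <- c(group1 = 30, group2 = 15)
non_events <- c(group1 = 70, group2 = 85)

# Odds of the event in each group
odds <- events / non_events

# The odds ratio: how many times greater the odds are in group 1
OR <- odds["group1"] / odds["group2"]
```

By Chen et al.’s (2010) rules of thumb, the resulting odds ratio (about 2.4) would fall near the “medium” range, although that judgment shifts with the event’s base rate.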

2.2 “Small,” “Medium,” & “Large” Effects

Like much of statistics, Cohen’s d is standardized into z-scores/SDs (remember, its formula divides the mean difference by the SD). However, simply reporting Cohen’s d without interpreting what it means has a couple of disadvantages: (a) z-scores are not intuitive for lay audiences, and (b) there are other measures of effect size than Cohen’s d, and they aren’t all measured on the same scale. Given both of these factors, in his seminal book, Statistical Power Analysis for the Behavioral Sciences, Jacob Cohen (1988) gave recommendations for how to interpret the magnitude of various effect size statistics in terms of “small,” “medium,” and “large” effects.

These “criteria” for evaluating the magnitude of an effect size have become quite popular. Indeed, the adoption of effect size statistics seems to be regulated by people’s uses and understandings of them in relation to these criteria. They therefore deserve further consideration.

2.2.1 Effect Size Criteria as Percent of Total Variance

Cohen generally defined effect sizes based on the percent of the total variance that effect accounted for:

  • “small” effects account for 1%,
  • “medium” effects account for 25%, and
  • “large” effects account for 40%.

I say that he generally defined them as such because he didn’t see a need to be bound to this definition, in part because he repeatedly noted—as do I here—that these criteria were arbitrary. He defined them based on percent of total variance for d and then chose “small,” “medium,” and “large” values for other effect size statistics that corresponded to those values for d.

This meant, for example, that he chose levels for correlations that don’t always match up to what one would expect by squaring the correlations to get the percents of total variance. In other words, his criteria for correlations weren’t that a “small” correlation would be r = .1 (i.e., where r2 = .01), a “medium” one r = .5, and a “large” one r \(\approx\) .63. In justifying this, he notes that he is not positing these criteria based on strict mathematical equivalences but instead on a concerted attempt to equate the sorts of effects one would obtain with one analytic strategy to those from another; for example, equating the types of effect sizes experimental psychologists obtain with t-tests to those they would obtain through correlations.

2.2.2 Effect Size Criteria as Noticeability of Effects

Although Cohen was thorough in his descriptions of these effect size criteria in terms of proportions of total variance, he was also careful to couch them in practical and experimental terms.

A “small” effect is the sort he suggested one would expect to find in the early stages of a line of research, when researchers have not yet determined the best ways to manipulate/intervene and when much of the noise has not yet been controlled.

A “small” effect can also be considered to be a subtle but non-negligible effect: the sorts of effects that are often found to be significant in field-based studies with typical samples and manipulations/interventions. Examples Cohen gives include:

  • The mean difference in IQs between twin and non-twin siblings7,
  • The difference in visual IQs of adult men and women, &
  • The difference in heights between 15- and 16-YO girls.

A “medium” effect is one large enough to see with the naked eye. Examples Cohen gives include:

  • The mean difference in IQs between members of professional and managerial occupations,
  • The mean difference in IQs between “clerical” and “semiskilled” workers, &
  • The difference in heights between 14- and 18-YO girls.

A “large” effect is one that is near the upper limit of effects attained in experimental psychological studies. So yes, the generalization of this criterion to other areas of science—including nursing research—is certainly not directly supported by Cohen himself.

Examples include:

  • The mean difference in IQs between college freshmen and those who’ve earned Ph.D.s8,
  • The mean difference in IQs between those who graduate college and those who have a 50% chance of graduating high school,
  • The difference in heights between 13- and 18-YO girls, &
  • The typical correlation between high school GPAs and scores on standardized exams like the ACT.

2.2.3 A Few Words of Caution

As useful as it is to talk about effect sizes being “small” or “large,” I must underline Cohen’s own admonition (e.g., p. 42) that we use these rules of thumb about “small,” “medium,” and “large” effects cautiously9. He notes, for example, that

when we consider r = .50 a large [effect size], the implication that .25 of the variance is accounted for is a large proportion [of the total variance] must be understood relatively, not absolutely.

The question, “relative to what?” is not answerable concretely. The frame of reference is the writer’s [i.e., Cohen’s own] subjective average of [proportions of variance] from his reading of the research literature in behavioral science. (pp. 78 – 79)

Many people—including reviewers of manuscripts and grant proposals—take these criteria to be nearly as canonical as p < .05 is for something being “significant.” This is a real shame since effect sizes offer us the opportunity to finally move beyond making important decisions based on simplistic, one-size-fits-all rules.

Therefore, effect size measures, including Cohen’s d, are best used objectively to compare effects between studies—not to establish some standardized gauge of the absolute value of an intervention.

Indeed, interventions and factors that have “small” effects can be quite important. This seems especially true for long-term changes, such as those one strives for in educational interventions or for the subtle but persistent effects of racism. Teaching a diabetic patient how to check their blood glucose may have only a small effect on their A1C levels on a given day, but it can save their life (or at least a few toes) in the long run.

Given this, Kraft (2020) used a review of educational research to suggest different criteria for gauging what should be considered as “small,” “medium,” or “large” effects in education interventions. His recommendations are also presented below.

2.2.4 Table of Effect Size Statistics

Table 2.1: Effect Size Interpretations

| Statistic | Explanation | Small | Medium | Large | Reference |
|---|---|---|---|---|---|
| d | Difference between two means | 0.2 | 0.5 | 0.8 | Cohen (1988, p. 25) |
| d | For education interventions | 0.05 | \(<\) .2 | \(\ge\) .2 | Kraft (2020) |
| h | Difference between proportions | 0.2 | 0.5 | 0.8 | Cohen (1988, p. 184) |
| w (also called φ) | χ2 goodness of fit & contingency tables. φ is also a measure of correlation in 2 \(\times\) 2 contingency tables and ranges between 0 and 1. | 0.1 | 0.3 | 0.5 | Cohen (1988, p. 227) |
| Cramer’s V | Similar to φ, Cramer’s V is used to measure the differences in larger contingency tables. Like φ (and other correlations), it ranges between 0 and 1. | 0.1 | 0.3 | 0.5 | Cohen (1988, p. 223) |
| r | Correlation coefficient (difference from r = 0) | 0.1 | 0.3 | 0.5 | Cohen (1988, p. 83) |
| q | Difference between correlations | 0.1 | 0.3 | 0.5 | Cohen (1988, p. 115) |
| η2 | Parameter in a linear regression & AN(C)OVA | 0.01 | 0.06 | \(\ge\) .14 | |
| f | AN(C)OVA model effect; equivalent to \(\sqrt{f^2}\) | 0.1 | 0.25 | 0.4 | Cohen (1988, p. 285) |
| f | For education interventions (i.e., f equivalents for the Cohen’s ds suggested by Kraft) | 0.025 | \(<\) .1 | \(\ge\) .1 | Kraft (2020) |
| f2 | A translation of R2 | 0.02 | 0.15 | 0.35 | For multiple regression/multiple correlation, Cohen (1988, p. 413); for multivariate linear regression (multivariate R2), Cohen (1988, p. 477) |
| OR | Odds ratio; can be used as effect size for Fisher’s exact test and contingency tables in general. | 1.5 (or 0.67) | 2.75 (or 0.36) | 5 (or 0.20) | Chen et al. (2010, p. 862) |

2.3 Converting Between Effect Size Measures

Although it was nearly inevitable, it is a bit unfortunate that the various measures of effect size are not all on the same dimensions. It is therefore useful to know how to convert one type of effect size to another. Conversions are given here for the common effect size statistics described above.

This handy Excel spreadsheet can convert between Cohen’s d, r, η2, odds ratios, and area under the curve. In Chapter 7 of their book on meta-analysis, Borenstein et al. (2011) also cover well the conversions between measures. Finally, the effectsize package for R can both compute and convert between many effect size measures, including all those mentioned here.

2.3.1 Cohen’s d and Cohen’s f & η2

Cohen’s d is a measure of the difference between two means. If there is only one, dichotomous term in a given model, then Cohen’s d can be easily computed from Cohen’s f (or η2). However, if there is more than one term in the model, or if the term for which an effect size is being measured has more than two levels (including if it’s a continuous variable), then one must use one of a few different formulas.

Converting between partial η2 and Cohen’s d can be done:

\[\text{partial }\eta^{2} = \frac{d^{2} \times N}{d^{2} \times N + (N - 1)}\]

R code:

eta2 <- (d^2 * N) / ((d^2 * N) + (N - 1))

\[\text{Cohen's }d = \sqrt{\frac{(N - 1)}{N}\times \frac{\text{partial }\eta^{2}}{(1 - \text{partial }\eta^{2})}}\]

R code:

d <- sqrt(((N - 1) / N) * (eta2 / (1 - eta2)))

where N is total number of participants in the analysis (and likely the study).

2.3.2 η2 and Cohen’s f2 (and f)

If there is only one term in the model (e.g., for a one-way ANOVA), then η2 is equal to the model R2. If there is more than one term in the model, then what is computed for each term is in fact the partial η2 (which is what SPSS calls it).

η2 has become more commonly used than Cohen’s f2, but it can be transformed into f2 with:

\[\eta^2 = \frac{f^2}{(1 + f^2)}\]

and

\[f^2 = \frac{\eta^2}{(1 - \eta^2)}\]

when there is only one term in the model. Partial η2s are less easily transformed into f2.
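Those two single-term conversions can be sketched as a pair of small R helper functions (the function names are mine, not standard):

```r
# Convert between eta-squared and f-squared (single-term models only)
eta2_from_f2 <- function(f2)   f2 / (1 + f2)
f2_from_eta2 <- function(eta2) eta2 / (1 - eta2)

eta2 <- eta2_from_f2(0.25)   # an f2 of .25 gives an eta2 of .20
```

Note that the round trip returns the original value, which is a quick way to check the algebra.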

2.3.3 Correlation (r) to Cohen’s d

The equations below assume equal sample sizes for both groups.

\[d = \frac{2r}{\sqrt{1-r^2}}\]

R code:

d <- (2*r)/((1 - r^2)^.5)

\[r = \frac{d}{\sqrt{d^2 + 4}}\] R code:

r <- d/((d^2 + 4)^.5)

2.3.4 Cohen’s f (and f2) to Cohen’s d

Cohen’s f2 (and f) measure the effect size of a term in a model (often the only term, as in a one-way ANOVA). Cohen’s d measures the effect size between two levels of a single variable10. So, in order to convert between f2 and d, we have to know more about the model. For a one-way ANOVA with two groups11, d = 2f = 2\(\sqrt{f^2}\). In this particular case, then, f = \(\frac{d}{2}\).

More generally, when there is only one term in the model:

\[f^2 = \frac{d^2}{2k}\] R code:

f2 <- d^2/(2*k)

and

\[d = f\sqrt{2k}\]

d <- f*(2*k)^.5

where k is the number of groups in a variable in a one-way ANOVA.

It gets a bit more complicated when there are more than one terms in the model. This site covers some common situations.

2.3.5 Odds Ratio to Cohen’s d

\[d = \log(OR)\times\frac{\sqrt{3}}{\pi}\]

R code:

d <- log(OR)*((3^.5)/pi)

The variance of d (\(V_{d}\)) is simply and elegantly:

\[V_{d} = V_{\log(OR)}\times\frac{3}{\pi^2}\]

2.3.6 Hedges’ g to Cohen’s d

\[\text{Hedges' }g = \frac{d}{\sqrt{\frac{N}{df}}}\]

R code:

g <- d/((N/df)^.5)
d <- g*((N/df)^.5)

2.4 Computing Effect Size Measures from t- and F-Scores

Effect size measures, especially Cohen’s d, are often used to compare effects across studies—as, e.g., in meta-analyses. It is fortunately increasingly common to report effect sizes along with significance test statistics, but if they are not reported, effect size measures can be computed from either t- or F-scores.

2.4.1 Cohen’s d and Student’s t

This is the t in t-test. The only additional piece of information we need to know to transform between Cohen’s d and Student’s t is the sample size, N:

\[t = d \times \sqrt{N}\]

\[\text{Cohen's }d = \frac{t}{\sqrt{N}}\]

R code:

# Assume the results of the t-test were saved as t.test.results:
t.test.results <- t.test(y ~ x, data = df)

# Then:
d <- t.test.results$statistic / sqrt(t.test.results$parameter)
# Note: $parameter holds the test's degrees of freedom, which approximates N in larger samples

2.4.2 η2 and F

This F is that used in ANOVA-family models. Like the relationship between d and t, the only additional things we need to know to compute η2 from F are degrees of freedom (which are closely related to sample size). Here, though, we have degrees of freedom in both the numerator (top) and denominator (bottom12):

\[\eta^2 = \frac{F \times df_{Effect}}{F \times df_{Effect} + df_{Error}}\]

So, η2 depends on the ratio of the dfs allotted to the given effect to the dfs allotted to its corresponding error term. Since we have the effect’s dfs in both the numerator and denominator, their effect will generally cancel out; this suggests that having more levels to a variable doesn’t appreciably affect the size of its effect. However, being able to allot more dfs to error does help us see the size of whatever effect is there. Larger samples won’t really change the size of the effects we’re measuring, but they can help us see ones that are there.
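That relationship can be sketched in R with made-up values for F and the degrees of freedom:

```r
# Hypothetical one-way ANOVA result: F(2, 57) = 4.5
F_val     <- 4.5
df_effect <- 2
df_error  <- 57

# Eta-squared from F and its degrees of freedom
eta2 <- (F_val * df_effect) / (F_val * df_effect + df_error)
```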

2.5 Additional Resources

Cohen’s duck


  1. In this case, it also would assume homoskedasticity. They also assume that samples are independently and identically distributed (“iid”), meaning that (a) the value of each data point in a given variable is independent from the value of all/any other data point for that variable and (b) each of those data points in that variable are drawn from the same distribution, e.g., they’re all drawn from a normal distribution.↩︎

  2. It is “likely” significant because significance depends not only on the type of test conducted but also if any other terms are considered (e.g., as covariates), and this may generate a different conclusion than the confidence intervals suggest.↩︎

  3. For what it’s worth, we actually take the square root of the average of the two variances, i.e.: \(\text{Pooled }SD = \sqrt{\frac{SD^2_{\text{First Mean}}+SD^2_{\text{Second Mean}}}{2}}\).↩︎

  4. And f is indeed \(\sqrt\frac{R^2}{(1 - R^2)}\).↩︎

  5. One could argue that they’re standardized in the sense that all odds & odds ratios are measured in the same units, though. They both measure the relative numbers of events in each condition.↩︎

  6. This is also true for, e.g., risk ratios, hazard ratios, means ratios, and hierarchical models.↩︎

  7. The source for this—Husén, T. (1959). Psychological twin research: A methodological study. Stockholm: Almqvist & Wiksell—was too old for me to see if he means mono- or dizygotic twins. But I tried!↩︎

  8. So, I guess a full higher education career does have a large effect on a person. And, yeah, Cohen does seem a little pre-occupied with IQ, doesn’t he?↩︎

  9. Cohen also only directly considered these criteria as they applied to experimental psychology—not, e.g., the health sciences. Indeed, he elsewhere notes that what experimental psychologists would call a “large” effect would be paltry in the physical sciences.↩︎

  10. Remember, Cohen’s d is just the difference between two means that is then standardized.↩︎

  11. Which is itself really just a t-test but using an ANOVA framework instead.↩︎

  12. My mnemonic to remember which is which is to think of the saying, “The lowest common denominator.”↩︎