2  Effect Size: Explanation and Guidelines

Effect size is a simple idea that is finally gaining traction. It is just a set of statistics that describe the size of an effect. Effect size statistics are usually standardized, so a given effect size statistic can be compared directly with that same type of effect size statistic from other analyses—or even from other studies that sample the same or similar populations.

The effect being measured can be either a difference (such as the difference between an experimental-group and a control-group mean, or the difference in the number of events between groups) or an association (like the correlation between two variables). Different effect size statistics are computed in different ways, so we cannot usually compare one type of effect size statistic directly to another type. However, the same type of effect size can be compared across different analyses or studies, and in many cases effect size measures can be converted from one form to another (see Section 2.3).

Effect sizes are descriptive statistics. For measures of the size of an association (like a correlation), an effect size statistic may assume a linear relationship1, but they don’t assume, e.g., that the population is normally distributed. Since they make few assumptions, effect size statistics are inherently robust.

Effect size statistics can complement significance tests. Significance is, of course, a yes-or-no indication of whether there is “enough” of a difference/association relative to noise: An effect is either significant or not; there are no gradations to significance. Effect size statistics do show gradations and so can be used to properly provide the nuance that people seek when they report that something is “very” or “slightly”—or even “almost”—significant. (As noted in Section 2.2 below, effect size statistics are often described as being “small,” “medium,” or “large,” but this valuation of them doesn’t—well, shouldn’t—carry anything but an arbitrary weight.)

Effect sizes can also be reported with confidence intervals, providing an informal test of significance. Since an effect size measures magnitude, while a significance test determines whether an effect is “not zero,” an effect is likely significant if its 95% confidence interval does not include zero. However, statistical significance still depends on factors such as model specification and the inclusion of covariates.

2.1 Common Effect Size Statistics

2.1.1 Mean Differences

These measure the distance between two or more means. Like most effect size statistics, they are also standardized (measured in terms of standard deviations) so they can be compared between studies.

Cohen’s d

It may be instructive to begin a deeper look at effect size statistics with one of the most common, Cohen’s d. It’s also pretty straightforward: Cohen’s d is the difference between two means, just a difference that is standardized so that we can compare one mean difference (one Cohen’s d) to another mean difference (another Cohen’s d). The mean difference is standardized, like most things in statistics, by dividing it by the standard deviation (SD):

\[\text{Cohen's }d = \frac{\text{First Mean}-\text{Second Mean}}{\text{Pooled }SD}.\]

We combine (or “pool”) the SDs because there are two of them (one SD for each mean). To do this, we essentially take the average of the two SDs2.

Therefore, Cohen’s d is presented in terms of standard deviations. A Cohen’s d of 1 means that the means are one standard deviation apart.

You may remember that z-scores are also presented in terms of standard deviations—that a z-score of 1 means that that person’s score is one standard deviation away from the mean. This isn’t a coincidence and means that Cohen’s d can be looked at as a z-score.

Given its ease of interpretation and computation (nearly everyone reports means and SDs), Cohen’s d is used often; in fact, it is usually the standard measure used to compare effects across studies in meta-analyses.
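To make the arithmetic concrete, here is a minimal sketch in R that computes Cohen’s d by hand for two made-up groups (the data and the equal-group-size pooling are illustrative assumptions):

```r
# Hypothetical scores for two equal-sized groups (made-up data)
group1 <- c(5, 6, 7, 8, 9)
group2 <- c(3, 4, 5, 6, 7)

# Pool the SDs by averaging the two variances (equal-n case)
pooled_sd <- sqrt((var(group1) + var(group2)) / 2)

# Cohen's d: the standardized difference between the means
d <- (mean(group1) - mean(group2)) / pooled_sd
d  # about 1.26: the means are roughly 1.26 SDs apart
```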

Cohen’s f and f2

Cohen introduced f as a measure of effect size for F-tests, specifically to quantify differences among three or more means. In contrast, he developed d to measure the effect size between two means. The exact formula for computing f varies slightly depending on the number of levels in the factor and the variance structure.

To extend this concept to more complex models, Cohen introduced \(f^2\), which applies not only to ANOVA-family models but also to general(ized) linear regression. The primary distinction between f and \(f^2\) is that \(f^2\) is simply f squared. Cohen recommended using \(f^2\) for complex models because it aligns with how other parameters, such as variance-explained measures, are typically computed using squared values.

An important advantage of \(f^2\) is its flexibility: it can be used to assess the effect of a single predictor or a set of predictors, whether or not other variables in the model have been controlled for or partialed out.

More about Cohen’s f can be found at this Statistics How to page.

Other Measures of Mean Differences

Cohen’s d is not the only measure of the effect size of mean differences—although it is the most common. Two others—Hedges’ g and Glass’s Δ—are worth mentioning. All three are standardized effect size measures used to quantify the difference between two groups in terms of standard deviations, but they differ slightly in calculation and applicability.

Table 2.1: Common Effect Size Measures of Mean Differences

| Aspect | Cohen’s d | Hedges’ g | Glass’s Δ |
|---|---|---|---|
| Denominator | Pooled standard deviation | Pooled standard deviation with small-sample correction | Control group standard deviation |
| Use Case | Large samples, equal variances | Small samples | Unequal variances |
| Correction Factor | None | Corrects for small sample bias | None |
| Applicability | Widely used in social sciences | More accurate for small samples | Best for heteroscedastic data |

Summary

  • Use Cohen’s d for general purposes, particularly when group variances are similar and sample sizes are large.
  • Use Hedges’ g for small samples to account for bias in Cohen’s d.
  • Use Glass’s Δ when variances between groups are unequal or when the treatment is expected to increase variability.

In practice, the choice among these measures depends on the study’s design, sample size, and variance properties of the data.

1. Cohen’s d
  • Definition: Cohen’s d measures the standardized mean difference between two groups.
  • Formula: \[d = \frac{\bar{X}_1 - \bar{X}_2}{s_p}\] where:
    • \(\bar{X}_1\) and \(\bar{X}_2\) are the means of the two groups.
    • \(s_p\) is the pooled standard deviation: \[s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}\]
  • Key Points:
    • Assumes equal variances between the groups (homoscedasticity).
    • Suitable for large samples.
    • Can overestimate the effect size for small sample sizes.
2. Hedges’ g (Correction for Small Samples)
  • Definition: Hedges’ g is a variation of Cohen’s d that corrects for the small sample bias inherent in d.
  • Formula: \[g = d \times \left(1 - \frac{3}{4(n_1 + n_2 - 2) - 1}\right)\]
  • Key Points:
    • Incorporates a correction factor to reduce bias in small sample sizes.
    • Provides a more accurate effect size estimate when (n < 20).
    • For large samples, Hedges’ g converges to Cohen’s d.
3. Glass’s Δ
  • Definition: Glass’s Δ uses only the standard deviation of the control group (\(s_2\)) as the denominator, instead of a pooled standard deviation.
  • Formula: \[\Delta = \frac{\bar{X}_1 - \bar{X}_2}{s_2}\] where:
    • \(s_{2}\) is the standard deviation of the control group.
  • Key Points:
    • Useful when variances between groups are unequal (heteroscedasticity).
    • May produce biased estimates if the control group standard deviation is not representative.
    • Often applied in scenarios where the experimental treatment group might naturally have a higher variance (e.g., due to a treatment effect).
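A short R sketch can show how the three denominators play out on the same data; the group values here are purely illustrative:

```r
treat <- c(10, 12, 14, 16, 18)  # hypothetical treatment group
ctrl  <- c( 9, 10, 11, 12, 13)  # hypothetical control group
n1 <- length(treat); n2 <- length(ctrl)

# Cohen's d: pooled SD in the denominator
s_p <- sqrt(((n1 - 1) * var(treat) + (n2 - 1) * var(ctrl)) / (n1 + n2 - 2))
d <- (mean(treat) - mean(ctrl)) / s_p

# Hedges' g: d with the small-sample correction factor
g <- d * (1 - 3 / (4 * (n1 + n2 - 2) - 1))

# Glass's Delta: control-group SD only in the denominator
delta <- (mean(treat) - mean(ctrl)) / sd(ctrl)

c(d = d, g = g, delta = delta)  # g < d; delta differs because the variances do
```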

2.1.2 Proportions of Variance Explained

Cohen’s d and f measure the (standardized) difference between means. Cohen’s d measures it for two means, while Cohen’s f is used to measure it among three or more means. Both of these statistics range from zero (when there is no difference) to positive infinity. Both simply represent the number of standard deviations between the means, so if the means are more than 1 SD apart, the effect size will be greater than 1.

Another set of effect size measures are standardized differently: They measure proportions, and so can only range between 0 and 1. The ones described in this section measure the proportion of total variance explained by a particular term in a regression model.

(Squared) Correlations

Perhaps the simplest measure of the proportion of variance explained is the correlation—specifically, the squared correlation. Squared correlations are indeed effect size statistics: they measure the proportion of the variance in each of the two variables that is explained by their relationship, compared to all of the variance in each of them.

For example, if the correlation between two variables is .50, i.e., if r = .50, then \(r^2 = .50^2 = .25\). In that case, the correlation accounts for 25% of the variance in each of the variables.
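In R (with made-up data):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

r <- cor(x, y)  # about .77
r^2             # 0.6: the relationship accounts for 60% of the variance in each
```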

Eta-squared (η2) and Partial η2

The other three “proportion of variance explained” statistics are used to measure the effect size of individual terms in a linear regression model.

The first of these is eta-squared (\(\eta^2\)), which quantifies the proportion of total variance in the outcome variable that is explained by a given predictor. It is calculated as:

\[ \eta^2 = \frac{SS_{\text{Effect}}}{SS_{\text{Total}}} \]

This makes \(\eta^2\) conceptually similar to \(R^2\), which measures the total proportion of variance explained by all predictors in a regression model. Like the correlation coefficient \(r\), eta (\(\eta\)) itself can be understood as the proportion of standard deviation differences in the outcome explained by the predictor, while \(\eta^2\) represents variance explained as a proportion of total variance.

However, \(\eta^2\) has a notable limitation: it does not account for other predictors in the model. As additional terms are introduced, the individual \(\eta^2\) values for each predictor tend to decrease, since they represent only the variance uniquely attributable to each predictor relative to total variance.

To address this, researchers use partial eta-squared (\(\eta_p^2\)), which represents the proportion of variance explained by a specific predictor after accounting for other predictors in the model. Partial \(\eta^2\) is conceptually similar to partial \(r^2\), as it isolates the unique contribution of a predictor while removing variance shared with other terms.

In a one-way ANOVA (i.e., a model with a single categorical predictor), \(\eta^2\) is equivalent to the model \(R^2\). However, in more complex models, partial \(\eta^2\) is generally preferred for estimating the effect of a specific term.
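As a quick illustration in R, using the built-in PlantGrowth data, \(\eta^2\) computed from the sums of squares matches the model \(R^2\) in a one-way ANOVA:

```r
# One-way ANOVA: weight predicted by group (built-in PlantGrowth data)
fit <- lm(weight ~ group, data = PlantGrowth)
tab <- anova(fit)

# eta^2 = SS_effect / SS_total
eta2 <- tab["group", "Sum Sq"] / sum(tab[["Sum Sq"]])
eta2  # about .26

# With a single categorical predictor, eta^2 equals the model R^2
all.equal(eta2, summary(fit)$r.squared)
```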

2.1.3 Comparison with Cohen’s \(f\) and \(f^2\)

Cohen’s \(f\) and \(f^2\) serve a similar purpose but differ in how they handle variance:

  • \(\eta^2\) vs. \(f\) (ANOVA): While \(\eta^2\) measures the proportion of variance explained by a factor, \(f\) adjusts for unexplained variance, making it more suitable for cross-study comparisons. The relationship between them is:

\[f = \sqrt{\frac{\eta^2}{1 - \eta^2}}\]

  • Partial \(\eta^2\) vs. \(f^2\) (Regression): Partial \(\eta^2\) describes the proportion of variance explained by a predictor after controlling for other variables, while Cohen’s \(f^2\) expresses the incremental contribution of a predictor relative to the unexplained variance:

\[f^2 = \frac{R^2}{1 - R^2}\]

Since \(f^2\) explicitly models the variance explained relative to unexplained variance, it is commonly used in multiple regression, particularly for power analysis and comparing models across studies.

Thus, while \(\eta^2\) and partial \(\eta^2\) are useful for describing within-sample variance explained, \(f\) and \(f^2\) provide standardized effect size measures better suited for meta-analysis and statistical power estimation.
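For example, plugging a hypothetical \(\eta^2\) of .10 into the relationship above:

```r
eta2 <- 0.10                  # a predictor explaining 10% of total variance
f <- sqrt(eta2 / (1 - eta2))  # Cohen's f adjusts for unexplained variance
f                             # 1/3: about 0.33
```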

2.1.4 When to Use \(f\), \(f^2\), and \(\eta^2\)

| Criterion | \(\eta^2\) | \(f\) (ANOVA) | \(f^2\) (Regression) |
|---|---|---|---|
| Use Case | ANOVA (variance explained) | ANOVA (standardized effect size) | Regression (incremental variance explained) |
| Interpretation | Proportion of total variance explained | Standardized measure of effect size | Standardized measure of predictor impact |
| Best for Comparing Studies? | No | Yes | Yes |
| Used in Power Analysis? | No | Yes | Yes |
| Inflation in Small Samples? | Yes | No | No |

Therefore:

  • Use \(\eta^2\) to describe the proportion of variance explained in ANOVA and regression models.
  • Use Cohen’s \(f\) for standardizing effect sizes in ANOVA, making them comparable across studies.
  • Use Cohen’s \(f^2\) in regression to assess the impact of specific predictors, particularly when measuring incremental effects.
  • For a single dichotomous predictor, Cohen’s d and \(\eta^2\) can be converted into each other, but for more complex models, additional transformations are required.

This Analysis Factor post gives a good further explanation of η2. Recommendations on interpreting and reporting η2 are given well in this StackExchange Q&A.

Omega-squared (ω2)

ω2 is very similar to η2. They both measure the proportion of total variance accounted for by a given term in a model, but compute it in slightly different ways3. The way η2 computes it makes it systematically overestimate the size of an effect—when it is used to measure the size of the effect for the population (i.e., when inferring from the sample to the population). Although this overestimation gets smaller as the sample gets larger, it is always present (until the sample is the same size as the population).

The way ω2—and partial ω2—estimate unexplained variance makes them always smaller than η2 (and partial η2). ω2 is therefore a more conservative estimate of effect size than η2. Given this, many prefer ω2 over η2.

Epsilon-squared (ε2)

The third and final member of our Greek-alphabet soup of stats to measure the proportion of variance explained is ε2. Everyone agrees that η2 overestimates the effect. Some, like Okada (2013), argue that ω2 is sometimes too conservative, underestimating the true size of an effect.

ε2 (and partial ε2) may be closer to “just right,” giving what may be the least biased estimate. Anyway, its value is always between the other two (or equal to them).

It’s worth noting that in a one-way ANOVA, ε2 is equal to the adjusted R2.
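That equality is easy to verify in R with the built-in PlantGrowth data, computing ε2 from the ANOVA sums of squares (using the formula given in footnote 3):

```r
fit <- lm(weight ~ group, data = PlantGrowth)  # one-way ANOVA
tab <- anova(fit)

ss_b <- tab["group", "Sum Sq"]      # sum of squares between groups
df_b <- tab["group", "Df"]          # degrees of freedom between groups
ms_w <- tab["Residuals", "Mean Sq"] # mean square within groups
ss_t <- sum(tab[["Sum Sq"]])        # total sum of squares

eps2 <- (ss_b - df_b * ms_w) / ss_t          # epsilon-squared
all.equal(eps2, summary(fit)$adj.r.squared)  # TRUE: equals adjusted R^2
```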

2.1.5 Odds & Risk Ratios

Odds ratios and risk ratios (Section 5.2) are already standardized measures of effect size. As such, the odds/risks of one study can be compared to another’s4.

Risks are simply probabilities (and risk ratios are the relative probabilities between two groups), and so range from 0 to 1, as do the proportion of variance explained effect size statistics (Section 2.1.2).

Odds and odds ratios, however, can range beyond 1, so it may be less intuitive to compare them from one study to another. Note, it’s fine to compare odds (or odds ratios) with those from other studies; it just may sometimes be clearer to transform them into a statistic that only ranges from 0 to 1, like many other effect size statistics.

Enter two of the oldest measures of association, φ (Greek lower-case “phi”) and Yule’s Q. Both are used to measure the magnitude of the relationship between two dichotomous variables, such as the relationship between having / not having cancer and being / not being a member of a caste-like minority5.

The equation for the φ statistic6—which admittedly doesn’t look anything like the equation for Pearson’s r—reduces to the same equation for dichotomous counts: φ is indeed simply the correlation between two dichotomous variables. φ is sometimes used as the effect size measure to go along with χ2-tests, although Cohen invented w to also be an effect size measure for χ2-tests7.

The φ statistic is fine and dandy. However, φ is sensitive to extreme values8 and can thus be unstable when there are very many or—more often the case—very few of a given outcome. It can also overestimate the size of a relationship if the values in one dichotomous variable are distributed very differently than in the other (e.g., if comparing disease prevalences between one population with many members and another with very few). φ is therefore not the best measure to use when analyzing relatively rare events—like when discussing deaths per 100,000 people, as is often done in epidemiology and health care research. (Note, though, that no statistic is immune to being less interpretable with less data.)

Yule’s Q was invented in part to address this shortcoming of φ. Yule’s Q was, in fact, designed to measure the association between two odds—to essentially be an effect size measure for odds ratios9. It transforms an odds ratio—which varies from zero to infinity—into a statistic that varies from −1 to +1, like correlations and their ilk.
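Yule’s Q is just a transformation of the odds ratio, Q = (OR − 1)/(OR + 1), which is easy to sketch in R (the 2 × 2 counts below are made up for illustration):

```r
# Hypothetical 2 x 2 table: rows = exposure, columns = outcome
tab <- matrix(c(30, 10, 20, 40), nrow = 2)

OR <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])  # odds ratio
Q  <- (OR - 1) / (OR + 1)                                # Yule's Q
Q  # about 0.71, bounded between -1 and +1
```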

2.2 “Small,” “Medium,” & “Large” Effects

Like much of statistics, Cohen’s d is standardized into z-scores/SDs (remember, its formula divides a mean difference by the SD). However, simply reporting Cohen’s d without interpreting it has a couple of disadvantages: (a) z-scores are not intuitive for lay audiences, and (b) there are other measures of effect size than Cohen’s d—and they aren’t all measured on the same scale. Given both of these factors, in his seminal book, Statistical Power Analysis for the Behavioral Sciences, Jacob Cohen (1988) gave recommendations for how to interpret the magnitude of various effect size statistics in terms of “small,” “medium,” and “large” effects.

These “criteria” for evaluating the magnitude of an effect size have become quite popular. Indeed, the adoption of effect size statistics seems to be regulated by people’s uses and understandings of them in relation to these criteria. They therefore deserve further consideration.

2.2.1 Effect Size Criteria as Percent of Total Variance

Cohen generally defined effect sizes based on the percent of the total variance that effect accounted for10:

  • “small” effects account for 1%,
  • “medium” effects account for 10%, and
  • “large” effects account for 25%.

I say that he generally defined them as such because he didn’t see a need to be bound to this definition, in part because he repeatedly noted—as do I here—that these criteria were arbitrary. He defined them based on percent of total variance for d and then chose “small,” “medium,” and “large” values for other effect size statistics that corresponded to those values for d.

This meant, for example, that he chose levels for correlations that don’t always match what one would expect by squaring the correlations to get the percents of total variance. In other words, his criteria for correlations weren’t that a “small” correlation would be r = .1 (i.e., where r2 = .01), a “medium” one r = .5, and a “large” one r \(\approx\) .63. In justifying this, he notes that he is not positing these criteria based on strict mathematical equivalences but instead on a concerted attempt to equate the sorts of effects one would obtain with one analytic strategy with those from another analytic strategy; for example, the types of effect sizes experimental psychologists obtain with t-tests with those they would obtain through correlations.

2.2.2 Effect Size Criteria as Noticeability of Effects

Although Cohen was thorough in his descriptions of these effect size criteria in terms of proportions of total variance, he was also careful to couch them in practical and experimental terms.

A “small” effect is the sort he suggested one would expect to find in the early stages of a line of research when researchers have not yet determined the best ways to manipulate/intervene and when much of the noise had not yet been controlled.

A “small” effect can also be considered to be a subtle but non-negligible effect: the sorts of effects that are often found to be significant in field-based studies with typical samples and manipulations/interventions. Examples Cohen gives include:

  • The mean difference in IQs between twin and non-twin siblings11,
  • The difference in visual IQs of adult men and women, &
  • The difference in heights between 15- and 16-YO girls.

A “medium” effect is one large enough to see with the naked eye. Examples Cohen gives include:

  • The mean difference in IQs between members of professional and managerial occupations,
  • The mean difference in IQs between “clerical” and “semiskilled” workers, &
  • The difference in heights between 14- and 18-YO girls.

A “large” effect is one that is near the upper limit of effects attained in experimental psychological studies. So yes, the generalization of this criterion to other areas of science—including nursing research—is certainly not directly supported by Cohen himself.

Examples include:

  • The mean difference in IQs between college freshmen and those who’ve earned Ph.D.s12,
  • The mean difference in IQs between those who graduate college and those who have a 50% chance of graduating high school, &
  • The difference in heights between 13- and 18-YO girls, &
  • The typical correlation between high school GPAs and scores on standardized exams like the ACT.

2.2.3 Effect Size Criteria for Odds Ratios

Cohen (1988) discussed proportions (aka risks) and presented effect size measures for a proportion’s difference from .5 (Cohen’s g) and for the difference between two proportions (Cohen’s h), either of which could be used to present the magnitude of a risk ratio. Even though a risk ratio per se is already a fine effect size statistic, Cohen didn’t give size criteria for risk ratios, but instead for his h.

He didn’t, however, discuss odds or odds ratios directly, and thus didn’t give his opinion about what could be considered “small,” “medium,” and “large” values for odds or odds ratios. And although Yule’s Q (Section 2.1.5) can be considered comparable to risk ratios, risk ratios weren’t given size criteria either.

Chen et al. (2010) nonetheless give some guidance by providing ranges of effect size criteria for odds ratios, derived by comparing values with the criteria for “small,” “medium,” and “large” Cohen’s ds. Chen et al.’s (2010) rules of thumb for “small,” “medium,” and “large” odds ratios (below) deserve special explanation. The size of an odds ratio depends not just on the difference in outcomes in a group (e.g., the numbers of Black women with and without pre-eclampsia), but also on the difference in outcomes in a comparison group (e.g., the numbers of non-Black women with and without pre-eclampsia). It is thus not so easy to compute simple (simplistic) rules of thumb for the sizes of odds ratios13.

In addition, the exact values for what to consider a “small,” “medium,” or “large” effect depend on the overall frequency of the event, with rarer events requiring larger odds ratios to equate to a given level of Cohen’s d.

Nonetheless, Chen et al. (2010) present some guidelines that can serve in most cases. Using the median values suggested by their results:

  • “Small” \(\approx\) 1.5
  • “Medium” \(\approx\) 2.75
  • “Large” \(\approx\) 5

However, those suggestions can range considerably, depending on the absolute value of probability in the reference group (infection rates in the non-exposed group in Chen et al.’s article):

| P of Event in Reference Group | “Small” | “Medium” | “Large” |
|---|---|---|---|
| .01 | 1.68 | 3.47 | 6.71 |
| .05 | 1.53 | 2.74 | 4.72 |
| .10 | 1.46 | 2.50 | 4.14 |

Please consult their table on page 862 for more precise equivalents with Cohen’s d.

2.2.4 A Few Words of Caution about Effect Size Criteria

As useful as it is to talk about effect sizes being “small” or “large,” I must underline Cohen’s own admonition (e.g., p. 42) that we use this rule of thumb about “small,” “medium,” and “large” effects cautiously14. He notes, for example, that

when we consider r = .50 a large [effect size], the implication that .25 of the variance is accounted for is a large proportion [of the total variance] must be understood relatively, not absolutely.

The question, “relative to what?” is not answerable concretely. The frame of reference is the writer’s [i.e., Cohen’s own] subjective average of [proportions of variance] from his reading of the research literature in behavioral science. (pp. 78 – 79)

Many people—including reviewers of manuscripts and grant proposals—take them to be nearly as canonical as p < .05 for something being “significant.” This is a real shame since effect sizes offer us the opportunity to finally move beyond making important decisions based on simplistic, one-size-fits-all rules.

Therefore, effect size measures, including Cohen’s d, are best used objectively to compare effects between studies—not to establish some standardized gauge of the absolute value of an intervention. This is indeed part of what is done in meta-analyses.

It is also what I suggest doing within your own realm of research: Just like Cohen himself did, review what appears to be generally agreed on as “small,” “medium,” and “large” effects within your research realm. These could, for example, correspond to levels of clinical significance15. Unfortunately, though, Cohen’s suggestions for his realm of research have become themselves canonized as the criteria for most lines of research in the health and social sciences.

Indeed, interventions and factors that have “small” effects can be quite important. This seems especially true for long-term changes, such as those one strives for in educational interventions or for the subtle but persistent effects of racism. Teaching a diabetic patient how to check their blood insulin may have only a small effect on their A1C levels in a given day, but can save their life (or at least a few toes) in the long run.

Given this, Kraft (2020) used a review of educational research to suggest different criteria for gauging what should be considered as “small,” “medium,” or “large” effects in education interventions. His recommendations are also presented below.

2.2.5 Table of Effect Size Statistics

Table 2.2: Effect Size Interpretations

| Statistic | Explanation | Small | Medium | Large | Reference |
|---|---|---|---|---|---|
| d | Difference between two means | 0.2 | 0.5 | 0.8 | Cohen (1988, p. 25) |
| d | For education interventions | 0.05 | \(<\) .2 | \(\ge\) .2 | Kraft (2020) |
| h | Difference between proportions | 0.2 | 0.5 | 0.8 | Cohen (1988, p. 184) |
| w (also called φ) | χ2 goodness of fit & contingency tables. φ is also a measure of correlation in 2 \(\times\) 2 contingency tables, and ranges between 0 and 1. | 0.1 | 0.3 | 0.5 | Cohen (1988, p. 227) |
| Cramer’s V | Similar to φ, Cramer’s V is used to measure differences in larger contingency tables. Like φ (and other correlations) it ranges between 0 and 1. | 0.1 | 0.3 | 0.5 | Cohen (1988, p. 223) |
| r | Correlation coefficient (difference from r = 0) | 0.1 | 0.3 | 0.5 | Cohen (1988, p. 83) |
| q | Difference between correlations | 0.1 | 0.3 | 0.5 | Cohen (1988, p. 115) |
| η2 | Parameter in a linear regression & AN(C)OVA | 0.01 | 0.06 | \(\ge\) .14 | |
| f | AN(C)OVA model effect; equivalent to \(\sqrt{f^2}\) | 0.1 | 0.25 | 0.4 | Cohen (1988, p. 285) |
| f | For education interventions (i.e., f equivalents for the Cohen’s ds suggested by Kraft) | 0.025 | \(<\) .1 | \(\ge\) .1 | Kraft (2020) |
| f2 | A translation of R2 | 0.02 | 0.15 | 0.35 | For multiple regression / multiple correlation, Cohen (1988, p. 413); for multivariate linear regression (multivariate R2), Cohen (1988, p. 477) |
| OR | Odds ratio; can be used as effect size for Fisher’s exact test and contingency tables in general | 1.5 (or 0.67) | 2.75 (or 0.36) | 5 (or 0.20) | Chen et al. (2010, p. 862) |

2.3 Converting Between Effect Size Measures

Although it was nearly inevitable, it is a bit unfortunate that the various measures of effect size are not all on the same scale. It is therefore useful to know how to convert one type of effect size to another. Conversions are given here for the common effect size statistics described above.

This handy Excel spreadsheet can convert between Cohen’s d, r, η2, odds ratios, and area under the curve. In Chapter 7 of their book on meta-analysis, Borenstein et al. (2011) also cover well the conversions between measures. Finally, the effectsize package for R can both compute and convert between many effect size measures, including all those mentioned here.

The following sections give the formulas for converting between most effect size statistics. I’ve also included simple R functions to do these, for those few who will find that useful. In addition, the easystats package effectsize can convert t, z, and F to Cohen’s d:

install.packages("effectsize")
library(effectsize)

t_to_d(t, df_error, paired = FALSE, ci = 0.95, alternative = "two.sided", ...)

z_to_d(z, n, paired = FALSE, ci = 0.95, alternative = "two.sided", ...)

F_to_d(
  f,
  df,
  df_error,
  paired = FALSE,
  ci = 0.95,
  alternative = "two.sided",
  ...
)

2.3.1 Cohen’s d and Cohen’s f & η2

Cohen’s d is a measure of the difference between two means. If there is only one, dichotomous term in a given model, then Cohen’s d can be computed easily from Cohen’s f (or η2). However, if there is more than one term in the model, or if the term for which an effect size is being measured has more than two levels (including if it’s a continuous variable), then one must use one of a few different formulas.

Converting between partial η2 and Cohen’s d can be done with:

\[\text{partial }\eta^{2} = \frac{d^{2} \times N}{d^{2} \times N + (N - 1)}\]

R code:

eta2 <- (d^2 * N) / ((d^2 * N) + (N - 1))

\[\text{Cohen's }d = \sqrt{\frac{(N - 1)}{N}\times \frac{\text{partial }\eta^{2}}{(1 - \text{partial }\eta^{2})}}\]

R code:

d <- sqrt(((N - 1) / N) * (eta2 / (1 - eta2)))

where N is total number of participants in the analysis (and likely the study).
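These two formulas are exact inverses of each other, which is easy to confirm in R with arbitrary example values:

```r
N <- 100; d <- 0.5  # arbitrary example values

# d -> partial eta^2
eta2 <- (d^2 * N) / ((d^2 * N) + (N - 1))

# partial eta^2 -> d recovers the original value
d_back <- sqrt(((N - 1) / N) * (eta2 / (1 - eta2)))
all.equal(d_back, d)  # TRUE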

2.3.2 η2 and Cohen’s f2 (and f)

If there is only one term in the model (e.g., for a one-way ANOVA), then η2 is equal to the model R2. If there is more than one term in the model, then it’s in fact the partial η2 (which is what SPSS calls it).

It has become more commonly used than Cohen’s f2, but can be transformed into f2 with:

\[\eta^2 = \frac{f^2}{(1 + f^2)}\]

and

\[f^2 = \frac{\eta^2}{(1 - \eta^2)}\]

when there is only one term in the model. Partial η2s are less easily transformed into f2.
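Since f2 = η2/(1 − η2), inverting gives η2 = f2/(1 + f2); a quick R round trip (with an arbitrary η2) confirms the pair:

```r
eta2 <- 0.25                # arbitrary example value
f2 <- eta2 / (1 - eta2)     # eta^2 -> f^2
eta2_back <- f2 / (1 + f2)  # f^2 -> eta^2 (the inverse)
all.equal(eta2_back, eta2)  # TRUE
```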

2.3.3 Correlation (r) to Cohen’s d

The equations below assume equal sample sizes for both groups.

\[d = \frac{2r}{\sqrt{1-r^2}}\]

R code:

d <- (2*r)/((1 - r^2)^.5)

\[r = \frac{d}{\sqrt{d^2 + 4}}\] R code:

r <- d/((d^2 + 4)^.5)
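These two formulas are also exact inverses (given equal group sizes), as a quick R check with an arbitrary correlation shows:

```r
r <- 0.5                      # arbitrary example correlation
d <- (2 * r) / sqrt(1 - r^2)  # r -> d
r_back <- d / sqrt(d^2 + 4)   # d -> r recovers the original value
all.equal(r_back, r)          # TRUE
```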

2.3.4 Cohen’s f (and f2) to Cohen’s d

Cohen’s f2 (and f) measures the effect size of an entire model (usually an ANOVA). Cohen’s d measures the effect size between two levels of single variable16. So, in order to convert between f2 and d, we have to know more about the model. For a one-way ANOVA with two groups17, d = 2f = 2\(\sqrt{f^2}\). In this particular case, then, f = \(\frac{d}{2}\).

More generally, when there is only one term in the model:

\[f^2 = \frac{d^2}{2k}\] R code:

f2 <- d^2/(2*k)

and

\[d = f\sqrt{2k}\]

R code:

d <- f*(2*k)^.5

where k is the number of groups in a variable in a one-way ANOVA.

It gets a bit more complicated when there is more than one term in the model. This site covers some common situations.
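In R, for the two-group special case (k = 2, with a made-up d):

```r
d <- 0.8; k <- 2     # two groups, a "large" d

f2 <- d^2 / (2 * k)  # d -> f^2
f  <- sqrt(f2)

# In the two-group case, f = d / 2
all.equal(f, d / 2)  # TRUE
```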

2.3.5 Odds Ratio to Cohen’s d

\[d = \log(OR)\times\frac{\sqrt{3}}{\pi}\]

R code:

d <- log(OR)*((3^.5)/pi)

The variance of d (\(V_{d}\)) follows simply and elegantly (a constant multiplier is squared when applied to a variance):

\[V_{d} = V_{\log(OR)} \times \frac{3}{\pi^2}\]
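For example, converting a “medium” odds ratio (per Chen et al., 2010) in R:

```r
OR <- 2.75                   # a "medium" odds ratio per Chen et al. (2010)
d <- log(OR) * sqrt(3) / pi  # log() is the natural log in R
d  # about 0.56, near Cohen's "medium" d of 0.5
```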

2.3.6 Hedges’ g to Cohen’s d

\[\text{Hedges' }g = \frac{d}{\sqrt{N/df}}\]

R code:

g <- d/((N/df)^.5)
d <- g*((N/df)^.5)

2.3.7 Cohen’s d and Student’s t

This is the t in t-test. The only additional piece of information we need to know to transform between Cohen’s d and Student’s t is the sample size, N:

\[t = d \times \sqrt{N}\]

\[\text{Cohen's }d = \frac{t}{\sqrt{N}}\]

R code:

# Assume the results of the t-test were saved as t.test.results:
t.test.results <- t.test(y ~ x, data = df)

# Then (t.test.results$statistic is the t value, and
# t.test.results$parameter holds the degrees of freedom,
# standing in here for N):
d <- t.test.results$statistic / sqrt(t.test.results$parameter)

2.3.8 η2 and F

This F is that used in ANOVA-family models. Like the relationship between d and t, the only additional things we need to know to compute η2 from F are degrees of freedom (which are closely related to sample size). Here, though, we have degrees of freedom in both the numerator (top) and denominator (bottom18):

\[\eta^2 = \frac{F \times df_{Effect}}{F \times df_{Effect} + df_{Error}}\]

So, η2 depends on the ratio of the dfs allotted to the given effect to the dfs allotted to its corresponding error term. Since we have the effect’s dfs in both the numerator and denominator, their effect will generally cancel out; this suggests that having more levels to a variable doesn’t appreciably affect the size of its effect. However, being able to allot more dfs to error does help us see the size of whatever effect is there. Larger samples won’t really change the size of the effects we’re measuring, but they can help us see ones that are there.
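A minimal sketch of this conversion, using made-up F and df values (e.g., from a one-way ANOVA with three groups and N = 60):

```r
# eta^2 from F and its degrees of freedom (illustrative values):
F_value <- 5.2
df_effect <- 2   # numerator df (k - 1 groups)
df_error <- 57   # denominator df (N - k)

eta2 <- (F_value * df_effect) / (F_value * df_effect + df_error)
```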

2.4 Additional Resources

Cohen’s duck


  1. In this case, it also would assume homoskedasticity. They also assume that samples are independently and identically distributed (“iid”), meaning that (a) the value of each data point in a given variable is independent from the value of all/any other data point for that variable and (b) each of those data points in that variable are drawn from the same distribution, e.g., they’re all drawn from a normal distribution.↩︎

  2. For what it’s worth, we actually divide the sum of the variances by 2 and then take the square root, i.e.: \(\text{Pooled }SD = \sqrt{\frac{SD^2_{\text{First Mean}}+SD^2_{\text{Second Mean}}}{2}}\).↩︎

  3. If you’re curious about how the three measures—η2; ω2; and the next one, ε2—are computed (from Maxwell, Camp, & Arvey, 1981, cited in Okada, 2013):\[\eta^2 = \frac{SS_{b}}{SS_{t}}\] \[\omega^2 = \frac{SS_{b} - df_{b}MS_{w}}{SS_{t} + MS_{w}}\] and \[\epsilon^2 = \frac{SS_{b} - df_{b}MS_{w}}{SS_{t}}\] where SSb is the sum of squares between groups, dfb is the degrees of freedom between groups, SSw is the sum of squares within groups, MSw is the mean square within groups, and SSt is the total sum of squares (i.e., SSt = SSb + SSw).↩︎

  4. Assuming, of course, that one is still comparing sensible and comparable things.↩︎

  5. If measuring the association between nominal variables that have more than two levels, one can use Cramér’s V.↩︎

  6. Which, if you’re curious, is\[\phi = \frac{AD - BC}{\sqrt{(A + B)(A + C)(D + B)(D + C)}}\]where A, B, C, and D are the counts in these cells:\[ \begin{array}{|c|c|c|} \hline & \text{Present} & \text{Not Present} \\ \hline \text{Group 1} & A & B \\ \hline \text{Group 2} & C & D \\ \hline \end{array} \]↩︎

  7. φ can be easily computed from χ2: \(\phi = \sqrt{\frac{\chi^2}{n}}\)↩︎

  8. Yeah, kinda like how outliers affect linear regression.↩︎

  9. Using that same table in the above footnote to denote the various cell frequencies, then:\[\text{Yule's }Q = \frac{AD - BC}{AD + BC}.\] Yule’s Q can also be computed directly from the odds ratio (OR):\[\text{Yule's }Q = \frac{OR - 1}{OR + 1}.\]↩︎

  10. These percents of variance accounted for are for zero-order correlations (i.e., correlations between two variables). The percent accounted for considered “small,” “medium,” and “large” for model R2s are slightly higher (2%, 13%, and 26%, respectively).↩︎

  11. The source for this—Husén, T. (1959). Psychological twin research: A methodological study. Stockholm: Almqvist & Wiksell—was too old for me to see if he means mono- or dizygotic twins. But I tried!↩︎

  12. So, I guess a full higher education career does have a large effect on a person. And, yeah, Cohen does seem a little pre-occupied with IQ, doesn’t he?↩︎

  13. This is also true for, e.g., risk ratios, hazard ratios, means ratios, and hierarchical models.↩︎

  14. Cohen also only directly considered these criteria as they applied to experimental psychology—not, e.g., the health sciences. Indeed, he elsewhere notes that what experimental psychologists would call a “large” effect would be paltry in the physical sciences.↩︎

  15. With, say, the target level of outcome denoting a “medium” effect. Reaching \(\frac{1}{3}\) of that target could denote a “small” effect, and reaching \(\frac{2}{3}\) more than it (167%) a “large” one. (This corresponds to the range between many of Cohen’s criteria. For example, criteria for r are .1, .3, and .5.)↩︎

  16. Remember, Cohen’s d is just the difference between two means that is then standardized.↩︎

  17. Which is itself really just a t-test but using an ANOVA framework instead.↩︎

  18. My mnemonic to remember which is which is to think of the saying, “The lowest common denominator.”↩︎