Power & Effect Size

Overview

  1. Review & Elaboration of Hypothesis Testing
  2. Power
  3. Effect Size

Review & Elaboration of Hypothesis Testing

What Hypothesis Is Being Tested

  • Research questions are often couched in terms of an “alternate” hypothesis (HA) that there is some difference
    • E.g., that a mean is different than some value (e.g., the mean is not zero)
    • Or that there is a difference between two means
  • (Ignore for now that other times we’re testing not a difference, but a relationship)

What Hypothesis Is Being Tested (cont.)

  • But we actually test the viability of a “null” hypothesis (H0)
    • E.g., the probability of finding our sample mean if the population mean is actually zero
    • Or the probability of finding our mean difference if there really is no true difference between the groups
  • So, the p-value we report is the probability of finding our results if the null hypothesis is true
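A minimal sketch of this idea (using SciPy; the data are made up): a one-sample t test returns exactly this probability, the chance of a sample mean at least as far from zero as ours if the population mean really is zero.

```python
# Minimal sketch: the p-value of a one-sample t test against H0: mu = 0
# (the sample values are made up for illustration)
from scipy import stats

sample = [0.3, 1.1, -0.2, 0.8, 0.5, 0.9, 0.1, 0.7]
t_stat, p_value = stats.ttest_1samp(sample, popmean=0)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# p is the probability of results at least this extreme if the true mean is 0
```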

E.g., Testing Against H0 = 0

Testing Against Another H0 Distribution

Testing Between Two Distributions

Couched as Signal & Noise

How can we improve our chances of correctly finding a real difference?

1. Increase the Signal

How do we increase the signal?

Increasing the Signal


  • We can best increase the signal through good research design:
    • Improve the effectiveness of our intervention
    • Investigate impactful things
    • Measure (and thus be able to detect) a wide range of outcomes
      • Including among a wide range of participants/situations

2. Reduce the Noise

How do we reduce the noise?

Reducing the Noise

  1. Improve measurement precision
    • Reduce the noise from imprecise measurements
      • More reliable instruments/measurements
    • Make finer measurements—not just more accurate
  2. Improve sampling techniques
    • Controlled data collection situations
    • Careful use of instructions, ordering of items/instruments
    • Consideration of sources of bias
      • And reasons for missing data / refusal to answer—or even participate

Reducing the Noise (cont.)

2. Improve sampling techniques (cont.)

  • Ensure the population is well delimited
    • Diverse in ways that improve “good” variance
    • Uniform with respect to other sampling criteria
    • (Essentially “measuring” the population well—clearly and across its full range)

3. Control for sources of noise

  • Measuring & managing
    • Accounting for similarities between participants, situations, etc.

Reducing the Noise (end)

4. Increase sample size

  • Remember, these distributions are of estimates of the population values
    • Viz., the estimates of the population mean
      • And the “noise” in these estimates is the standard error of the mean
      • The formula for the SEM is:

\[ SEM = \frac{SD}{\sqrt{N}} \]

  • So, we can reduce “noise” by increasing the sample size
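A quick numeric sketch of that formula (the SD value is made up):

```python
# Minimal sketch: SEM = SD / sqrt(N) shrinks as the sample grows
# (the SD value is made up for illustration)
import math

sd = 15.0
for n in (25, 100, 400):
    sem = sd / math.sqrt(n)
    print(f"N = {n:3d}: SEM = {sem:.2f}")
# Quadrupling N halves the SEM, i.e., halves the "noise" in the estimate of the mean
```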

Power

Understanding Power

  • Conceptually, power is our ability to see things clearly
    • A lot like increasing the magnification of a microscope

Image of microscope

Understanding Power (cont.)

  • “The power of a statistical test is the probability that it will yield statistically significant results.” (Cohen, 1988, p. 1)

  • “The power of a statistical test of a null hypothesis is the probability that it will lead to the rejection of the null hypothesis,

    • i.e., the probability that it will result in the conclusion that the phenomenon exists.” (p. 4)

Understanding Power (cont.)

  • In hypothesis testing, a lack of power is the probability of getting a false negative
    • I.e., the chance of missing an effect that actually is there
  • High power means little chance of missing a real effect
    • Low power means a real effect can easily be missed
  • The probability of a false negative is written as β
    • (To correspond with the chance of a false positive, which is written as α)
    • So power—i.e., not missing a real effect—is
      1 – β
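As a sketch of how 1 – β is computed in practice (assuming the statsmodels package; the effect size, α, and group size are made up):

```python
# Minimal sketch: power (1 - beta) of an independent-samples t test
# (effect size, alpha, and group size are made up for illustration)
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(effect_size=0.5, nobs1=64, alpha=0.05,
                              alternative="two-sided")
beta = 1 - power  # probability of a false negative
print(f"power = {power:.2f}, beta = {beta:.2f}")
```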

Understanding Power (cont.)

Understanding Power (end)

Image from G*Power

  • Yes, a false positive (α) is also called a Type 1 error
  • And a false negative (β) is also called a Type 2 error
    • But everyone gets confused by that, so I just stick with “false positive” & “false negative”

Increasing Power

How can we increase power?

  1. Increase signal (also called “effect size”)
  2. Reduce noise
    • Through good measurement, sampling & design, and accounting for unexplained variance
    • By increasing sample size
  • But also…

Changing Power by Changing α

Changing Power by Changing α (cont.)

  • One-tailed tests are more powerful than two-tailed tests
  • But they ask a more specific (more restrictive) question:
    • Two-tailed tests ask:
      “Is there any difference between these groups?”
    • One-tailed tests ask:
      “Is Sample 2 larger than Sample 1?”
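A sketch comparing the two (assuming statsmodels; the effect size and group size are made up):

```python
# Minimal sketch: one-tailed vs. two-tailed power for the same design
# (effect size and group size are made up for illustration)
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for tails in ("two-sided", "larger"):  # "larger" = one-tailed, in the predicted direction
    p = analysis.power(effect_size=0.4, nobs1=50, alpha=0.05, alternative=tails)
    print(f"{tails:>9}: power = {p:.2f}")
# The one-tailed test has more power, but only answers the directional question
```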

Changing Power by Changing α (end)

  • We generally use two-tailed tests
    • They are more conservative
    • Allow us to see even unexpected effects
  • But we should use whichever is theoretically appropriate
    • Remembering that with more power comes more responsibility

Post hoc Power Analysis

  • Aka “observed” (or “retrospective”) power
  • Remember:
    • When conducting hypothesis tests, we are actually assuming the null hypothesis (H0) is true
      • E.g., α is the chance of a false positive: incorrectly concluding that a sample value is not from H0
      • While all along still assuming the null hypothesis is true

Post hoc Power Analysis (end)

  • Power, then, is the chance of finding an effect
    • Assuming the null hypothesis is true
  • If we find significance, though, we no longer assume the null is true
    • So conducting power analyses after hypothesis testing relies on assumptions we have just rejected
  • Post hoc power is thus nearly always fallacious
    • Post hoc power is only justified when the population values are known (e.g., O’Keefe, 2007)
    • And thus can be computed from that distribution (or those distributions) instead of the null

Effect Size

Definition

“‘[T]he degree to which the phenomenon is present in the population’ or ‘the degree to which the null hypothesis is false.’” (Cohen, 1988, pp. 9 – 10)

  • The size of a difference
    • Between groups, at different times, etc.
    • Between the odds/risks of various groups, etc.
      • E.g., higher odds ratio
  • The strength of relationships between variables
    • Higher correlations
    • The tendency for things to co-occur
      • E.g., higher odds

Effect Size is the “Signal”

Effect Size Statistics

  • Effect size statistics are standardized
    • z scores
      • I.e., SDs
    • Proportions
      • Correlations (proportions of variance accounted for)
      • Odds ratios, risk/hazard ratios
  • We can thus compare a given effect size statistic across similar studies
    • (This is what is done in meta-analyses)
  • However, we cannot directly compare two different effect size statistics
    • But we can convert them mathematically
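For example, d and r can be converted into one another; a minimal sketch (the formulas assume two equal-sized groups, and the example values are made up):

```python
# Minimal sketch: converting between Cohen's d and r
# (these formulas assume two equal-sized groups; example values are made up)
import math

def d_to_r(d):
    return d / math.sqrt(d**2 + 4)

def r_to_d(r):
    return 2 * r / math.sqrt(1 - r**2)

print(f"d = 0.5 -> r = {d_to_r(0.5):.2f}")  # ~0.24
print(f"r = 0.3 -> d = {r_to_d(0.3):.2f}")  # ~0.63
```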

Effect Size Statistics (cont.)

  • We can also present effect sizes with, e.g., 95% confidence intervals
    • For differences and relationships,
      • The effect is significant if the 95% CI doesn’t overlap zero
    • For odds, risks, hazards (& their ratios)
      • It’s significant if the 95% CI doesn’t overlap 1
  • This can complement formal significance tests
    • And also shows not only the size, but our, well, confidence in any conclusions made about it
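A sketch of one such interval: a 95% CI for an odds ratio from a 2×2 table, computed on the log scale (the cell counts are made up):

```python
# Minimal sketch: 95% CI for an odds ratio from a 2x2 table (log-odds method)
# (the cell counts are made up for illustration)
import math

a, b = 40, 60   # group 1: event, no event
c, d = 25, 75   # group 2: event, no event
odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
print(f"OR = {odds_ratio:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
# Significant at alpha = .05 if the interval does not include 1
```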

Common Effect Size Statistics

Type of effect, and the statistic used for it:

  • Difference between two means: Cohen’s d
  • Association between two variables: correlation (r, r\(_{pb}\), φ, etc.)
  • Partial association between two variables: Cohen’s f & f\(^2\), η\(^2\)
  • Likelihood of co-occurrence: odds, risks/hazards
  • Relative likelihood of co-occurrence: odds ratios, risk/hazard ratios
  • More info, including a few less common effect size statistics, is in the Companion
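As a sketch of the most common of these, Cohen’s d for two groups using the pooled SD (the group data are made up):

```python
# Minimal sketch: Cohen's d for two groups, using the pooled SD
# (the group data are made up for illustration)
import statistics as st

group1 = [12, 15, 14, 10, 13, 16, 11, 14]
group2 = [10, 12, 11, 9, 13, 10, 8, 11]
n1, n2 = len(group1), len(group2)
s1, s2 = st.stdev(group1), st.stdev(group2)
pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
d = (st.mean(group1) - st.mean(group2)) / pooled_sd
print(f"Cohen's d = {d:.2f}")
```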

Those Deuced Sizes

  • I.e., whether an effect size statistic is “small,” “medium,” or “large”
  • Cohen proposed values for each measure he reviewed/created
  • Cohen never meant his criteria to be canonical
    • But to serve “as a convention” while pointing out that “all conventions are arbitrary” (p. 13)
      • “The terms ‘small,’ ‘medium,’ and ‘large’ are relative, not only to each other, but to the area of behavioral science or even more particularly to the specific content and research method being employed in any given investigation” (p. 25)

Those Deuced Sizes (cont.)

  • “A reader who finds that what is here defined as ‘large’ is too small (or too large) to meet what [their] area of behavioral science would consider appropriate standards is urged to make more suitable operational definitions.” (p. 79)
  • “Thus, what a sociologist may consider a small effect size may well be appraised as medium by a clinical psychologist.” (p. 285)
  • “Again, the reader is urged to avoid the use of these conventions, if [they] can, in favor of exact values provided by theory.” (p. 113)

A “Small” Effect

  • Small: “must not be so small that seeking them amidst the inevitable operation of measurement and experimental bias and lack of fidelity is a bootless task”
    • The value was chosen in part because the “many relationships pursued in ‘soft’ behavioral science are of this order of magnitude” (p. 79)

A “Small” Effect (cont.)

A “Small” Effect (cont.)

A “Small” Effect (end)

  • About 85% overlap between two distributions
  • 1% of total variance accounted for (2% for R\(^2\))
  • Detecting prevalences of 55:45 versus 50:50
  • E.g.:
    • Mean difference in IQ between twins and non-twins
    • Approximately the size of the difference in mean height between 15- and 16-year-old girls (i.e., 0.5 in. where the σ \(\approx\) 2.1, p. 26)
  • For significance, would expect to need¹ Ns of:
    • t test: 394
    • r: 782

¹ Assuming power (1 – β) = .8, α = .05, and normally distributed samples.
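These Ns can be checked with a quick power analysis; a sketch (assuming statsmodels for the t test and a Fisher-z approximation for r, both of which differ slightly from Cohen’s tables):

```python
# Minimal sketch: Ns needed for 80% power at alpha = .05 (two-tailed), small effect
import math
from scipy.stats import norm
from statsmodels.stats.power import TTestIndPower

# t test: n per group for d = 0.2
n_per_group = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05,
                                          power=0.8, alternative="two-sided")
print(f"t test: ~{math.ceil(n_per_group)} per group")  # ~394

# correlation: total N for r = .10, via the Fisher-z approximation
r = 0.10
z_a, z_b = norm.ppf(0.975), norm.ppf(0.80)
n_total = ((z_a + z_b) / math.atanh(r)) ** 2 + 3
print(f"r: ~{math.ceil(n_total)} total")  # ~783
```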

A “Medium” Effect

  • Medium: “one large enough to be visible to the naked eye” (p. 26)
    • For correlations, Cohen set a “medium” to be “a value at the midpoint of the range of correlations between discriminably different psychological variables” (p. 80)

A “Medium” Effect (cont.)

A “Medium” Effect (cont.)

A “Medium” Effect (end)

  • ~67% overlap between two distributions
  • ~10% of total variance accounted for (13% for R\(^2\))
  • Detecting prevalences of 65:35 versus 50:50
  • E.g.:
    • Mean difference in IQ between members of clerical and semi-skilled occupations
    • Correlation between adolescents’ creativity & IQ
    • Approximately the size of the difference in mean height between 14- and 18-year-old girls (i.e., 1 in. where the σ = 2)
  • For significance, would expect to need Ns of:
    • t test: 64
    • r: 84

A “Large” Effect

  • Large: “must not be so large that their quest by statistical methods is wholly a labor of supererogation,
    • or to use Tukey’s delightful term ‘statistical sanctification.’” (p. 13)
    • E.g., set “large” to be “a degree of correlation between two different variables ‘about as high as they come,’ [in the behavioral sciences]” (p. 81)

A “Large” Effect (cont.)

A “Large” Effect (cont.)

A “Large” Effect (end)

  • ~53% overlap between two distributions
  • 25% of total variance accounted for (~26% for R\(^2\))
  • Detecting prevalences of 75:25 versus 50:50
  • E.g., a mean difference in IQ “between holders of the Ph.D. degree and typical college freshmen”
    • Or “between college graduates and persons with only a 50-50 chance of passing in an academic high school curriculum” (p. 27)
  • For significance, would expect to need Ns of:
    • t test: 26
    • r: 28

Why Use Effect Size?

  • “Being forced to think in more exact terms than demanded by the Fisherian alternative ([i.e., that an effect size] is any nonzero value) is likely to prove salutary.

    [Researchers] can call upon theory for some help in answering the question and on [their] critical assessment of prior research in the area for further help [in using effect size to make nuanced decisions about data].” (p. 12)

And isn’t that the point of research in the first place?

The End