The Normal & χ² Distribution

Overview

  • Normal distribution
    • What it is
    • Why it’s important
  • \(\chi^2\) distribution
    • What it is
    • Why it’s important




The Normal Distribution

Formula for a normal distribution:

\[f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(x - \mu)^2}{2\sigma^2}}\]

  • x is some random variable
  • μ is the mean
  • σ is the standard deviation
  • There, now you can say you learned this
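
  • As a quick sanity check of the formula, here is a minimal R sketch (the function name normal_pdf is just illustrative) comparing a hand-rolled density to base R's dnorm():

# Hand-rolled normal density, straight from the formula above
normal_pdf <- function(x, mu, sigma) {
  (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# Should match base R's dnorm() to machine precision
normal_pdf(x = 1.5, mu = 0, sigma = 1)
## [1] 0.1295176
dnorm(1.5, mean = 0, sd = 1)
## [1] 0.1295176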

Characteristics of the Normal Distribution

Normal Curve

Characteristics of the Normal Distribution (cont.)

  • Most importantly:
    • It is well-understood
    • It is only a function of the mean & standard deviation
    • It roughly corresponds to many real-world distributions
  • The mean, median and mode are all equal
  • The total area under the curve equals 1
  • It’s symmetric
  • The curve approaches, but never touches, the x-axis

Q-Q Plots

  • “Quantile-quantile” plots
    • Plot the quantiles of a sample against the corresponding quantiles of another (e.g., normal) distribution
  • More simply, they show how & where two distributions deviate from each other
  • Frequently used to test if & how a sample deviates from normality
    • Or how residuals deviate from normality
  • They’re easily created in SPSS or R (a minimal R sketch follows below)
  • And a few examples may help in understanding them…
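
  • Before the graphical examples, a minimal sketch of drawing a normal Q-Q plot in base R (the simulated sample here is only a placeholder for real data):

set.seed(42)                         # for reproducibility
x <- rnorm(200, mean = 50, sd = 10)  # placeholder sample; swap in real data

qqnorm(x)   # sample quantiles vs. theoretical normal quantiles
qqline(x)   # reference line; points hugging it suggest approximate normality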

Normally-Distributed

Short Tails (Looks like an S)

Long Tails

Long Right Tail (Positive Skew)

Long Left Tail (Negative Skew)

More Figures, More Views on Skew

From StackExchange

Q-Q Plots (cont.)

Alternatives to Q-Q Plots

  • Can also/instead formally test normality of sample data with, e.g., the Shapiro-Wilk (S-W), Anderson-Darling (A-D), or Kolmogorov-Smirnov (K-S) tests (sketched in R after this list)
  • S-W and A-D are better than K-S, but all are strongly affected by sample size
    • Under-powered with small N
      • Can’t detect non-normality when we need to
    • Over-powered with large N
      • Overly sensitive to deviations when we don’t need to know
      • (I.e., we already have enough data to approximate the population distribution without assuming normality)
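
  • A minimal sketch of these tests in R; shapiro.test() and ks.test() ship with base R, while the Anderson-Darling test assumes the nortest package is installed:

set.seed(42)
x <- rnorm(100)   # placeholder sample; swap in real data

shapiro.test(x)                       # Shapiro-Wilk (S-W)
ks.test(x, "pnorm", mean(x), sd(x))   # Kolmogorov-Smirnov (K-S) vs. a fitted normal
# (strictly, K-S assumes the mean & sd were not estimated from the same data)

# install.packages("nortest")         # if not already installed
# nortest::ad.test(x)                 # Anderson-Darling (A-D)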




The \(\chi^2\) Distribution

Background

  • Invented by Karl Pearson (in an abstruse 1900 article)
  • Originally to test “goodness of fit”
    • How well a set of data fit a theoretical distribution
    • Or if two sets of data follow the same distribution

  • Technically, \(\chi^2\) is a special case of the gamma distribution that arises as the distribution of a sum of squares of independent standard normal random variables (simulated in the sketch below)
    • A lot like the sums of squares we compute via ordinary least squares for t-tests & ANOVAs
    • And it is closely related to the t and F distributions
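
  • A minimal simulation sketch of that idea: summing k squared standard normal draws behaves like draws from a \(\chi^2\) with k degrees of freedom (k and n here are arbitrary choices):

set.seed(42)
k <- 3        # degrees of freedom
n <- 100000   # number of simulated sums

# Sum of k squared standard normals, repeated n times
sums_of_squares <- replicate(n, sum(rnorm(k)^2))

# Compare to direct chi-square draws with k df
mean(sums_of_squares);  mean(rchisq(n, df = k))   # both near k
var(sums_of_squares);   var(rchisq(n, df = k))    # both near 2 * k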

Characteristics

  • The distribution’s shape, location, etc. are all determined by the degrees of freedom
    • The mean = df
    • The variance = 2df
    • The curve peaks (i.e., has its mode) at \(\chi^2\) = df – 2
      (when df ≥ 2; see the sketch after this list)
  • As the degrees of freedom increase:
    • The \(\chi^2\) curve approaches a normal distribution
    • The curve becomes more symmetrical
  • It has no negative values
    • Since it is based on squared values
    • Making it good to test variances
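
  • A minimal sketch checking those properties with simulated draws, then overlaying density curves for increasing df (the specific df values are arbitrary):

set.seed(42)
df <- 5
draws <- rchisq(100000, df = df)

mean(draws)   # close to df (= 5)
var(draws)    # close to 2 * df (= 10)

# Density curves: each peaks near df - 2, and the shape
# grows more symmetric (more normal-looking) as df increases
curve(dchisq(x, df = 5),  from = 0, to = 60, ylab = "density")
curve(dchisq(x, df = 20), from = 0, to = 60, add = TRUE, lty = 2)
curve(dchisq(x, df = 40), from = 0, to = 60, add = TRUE, lty = 3)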

Characteristics (cont.)

Uses of the \(\chi^2\)

  • Because it only depends on df
    • And resembles a normal distribution
  • It is useful for testing whether data follow a normal distribution
    • Or, more often, whether the total set of deviations from normality
      • (Or from any other set of expected values)
    • Is greater than expected by chance

  • It can do this for discrete values—like counts
    • t and F distributions technically can’t do this

Computing \(\chi^2\)

  • Formula for \(\chi^2\) value:

\[ \chi^2 = \sum{\frac{(Observed - Expected)^2}{Expected}}\]

  • So:

    1. Compute the difference between each observed value and its expected value
    2. Square each of those differences & divide by the expected value
      •  Kinda like computing the odds
    3. Sum up those squared-difference “odds” across the groups
    4. Check that summed value against a \(\chi^2\) distribution
      •  Where dfs = (Nrows – 1) \(\times\) (Ncolumns – 1)
    5. If the summed value is really far from the center of the distribution
      •  Then those actual-expected differences are significant
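
  • A minimal R sketch of those steps (the function names are just illustrative):

# Steps 1-3: the chi-square statistic from observed & expected counts
chisq_stat <- function(observed, expected) {
  sum((observed - expected)^2 / expected)
}

# Steps 4-5: compare the statistic to the chi-square distribution
# 'df' would be (Nrows - 1) * (Ncolumns - 1) for a contingency table
chisq_p_value <- function(statistic, df) {
  pchisq(statistic, df = df, lower.tail = FALSE)
}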

Example of Using a \(\chi^2\)

  • Consider these opioid use disorder data:
Ethnicity   Diagnosed   Not Diagnosed   Total
Latin       72          128             200


  • Presenting that a little differently:
Value Type   Diagnosed   Not Diagnosed   Total
Observed     72          128             200
Expected     100         100             200

Example of Using a \(\chi^2\) (cont.)

  1. Take the difference between observed & expected
  2. Square those differences & divide by the expected value
  3. Add up those values
Value Type            Diagnosed   Not Diagnosed
Observed              72          128
Expected              100         100
Observed - Expected   -28         28
  • \((-28)^2 = 28^2 = 784\)
  • \(\frac{784}{100} = 7.84\)
  • \(7.84 + 7.84 = 15.68\)
  • df = \((2 - 1)\times(2 - 1) = 1\times1 = 1\)

Example of Using a \(\chi^2\) (cont.)

  • The critical \(\chi^2\) value
    • For 1 df
    • And α = .05:
qchisq(df = 1, p = .05, lower.tail = FALSE)
## [1] 3.841459
  • Which is smaller than our 15.68
  • So our observed values are significantly different from our expected ones
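
  • The hand calculation can also be checked with R's built-in chisq.test(), passing the observed counts and the expected proportions (here, 50/50):

# Goodness-of-fit test on the observed counts vs. equal expected proportions
chisq.test(x = c(72, 128), p = c(0.5, 0.5))

# The reported statistic is X-squared = 15.68 on df = 1, matching the
# hand computation above, and its p-value falls well below .05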

Uses of the \(\chi^2\) (cont.)

  • The \(\chi^2\) distribution has many uses, including:
    1. Estimating parameters of a population with an unknown distribution
    2. Checking relationships between categorical variables
    3. Checking the independence of two classification criteria for qualitative variables
    4. Testing deviations of differences between expected and observed frequencies
    5. Conducting goodness of fit tests

The End