Frequencies,
Counts, &
Distributions




Along with a dip into R just to make it worse

Overview

  • Frequencies & relative frequencies
  • Probabilities
    • Risks & risk ratios
    • Hazards & hazard ratios
  • Odds & odds ratios
  • Contingency / cross tables
    • Fisher’s exact test
  • Important distributions
    • Normal distribution
    • χ² distribution

Frequencies &
Relative Frequencies

Frequencies

  • Simply how often something occurs
    • But can be expressed differently
  • Two main ways:
    1. Relative to the entire population—including those singled out
      • “Of these 100 patients, 80 had pain disorders”
      • Presented as percents, probabilities, or risks
    2. Relative to other subgroups
      • “80 of these patients developed pain disorders, and 20 did not”
      • Presented as odds
        • “These patients had a 4:1 chance of developing a pain disorder”
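  • (Jumping ahead to R for a moment) a quick sketch of both framings, using the 100-patient example above; the variable names are just made up for illustration:
diagnosed <- 80
not.diagnosed <- 20
diagnosed / (diagnosed + not.diagnosed)  # relative to everyone: a probability/risk
## [1] 0.8
diagnosed / not.diagnosed                # relative to the other subgroup: odds (4:1)
## [1] 4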

Relative Frequencies

  • We’re often interested not only in how something happens
    • But if it’s more/less common in one group/situation than in others
  • This is the relative frequency
    • Can be expressed as:
      • Relative risk (relative probability)
      • Relative odds
    • These are usually instead written as:
      • Risk ratio
      • Odds ratio

Risks (Probabilities)
& Risk Ratios

Risks/Probabilities

  • Again, risk & probability are the same thing
    • Risk is used more in epidemiology & healthcare
  • Expressed as the chance of something happening standardized to a scale of 0 to 1
    • Often computed as the chance—or number of times—something happens out of all possible times

\[P(\mathsf{Target\ outcome}) = \frac{\mathsf{Number\ of\ target\ outcomes}}{\mathsf{Number\ of\ all\ outcomes}}\]

\[P = \frac{72}{200} = .36\]

Risks/Probabilities (cont.)

  • Creating an example in R using RStudio
pain.df <- data.frame(Ethnicity = c("Latin", "Non-Latin"),
                      Diagnosed = c(72, 223),
                      Not_Diagnosed = c(128, 241)
                      )
  • <- assigns a value, function, etc. to an “object,” here pain.df, which I’ve made a data.frame
  • c “concatenates” (combines) a series of values, etc. into a single list (“array” or “vector”)

Risks/Probabilities (cont.)

library(knitr)
library(kableExtra)
table.of.pain <- pain.df %>%
  kable(col.names = c("Ethnicity", "Diagnosed", "Not Diagnosed"), align = c("l", "c", "c")) %>%
  kable_styling(font_size = 42) %>% kable_material("hover") %>% 
  add_header_above(c(" " = 1, "Pain Disorder Diagnosis" = 2))
  • Most functions in R are available in add-on packages; library invokes a given package
  • %>% is a “pipe” command, which essentially nests commands inside other commands
    • I’m piping commands to tweak the table into the pain.df data frame
    • And having all of that assigned to the table.of.pain object

Risks/Probabilities (cont.)

  • Simply giving the name of an object invokes it
    • (Kinda like Satan)
table.of.pain
              Pain Disorder Diagnosis
Ethnicity     Diagnosed   Not Diagnosed
Latin             72           128
Non-Latin        223           241

Risks/Probabilities (cont.)

\[P(\mathsf{Latin\ Patient\ Has\ a\ Pain\ Disorder}) = \frac{\mathsf{Target\ Event}}{\mathsf{All\ Events}}\]

\[= \frac{\mathsf{Latins\ \overline{c}\ Pain\ Disorders}}{(\mathsf{Latins\ \overline{c}\ Pain\ Disorders})+(\mathsf{Latins\ \overline{s}\ Pain\ Disorders})}\]

\[= \frac{72}{72+128} = \frac{72}{200} = .36\]

More Computations with R

  • Brackets ([]) index parts (“elements”) of an object
pain.df[1, 1] # First row, first column
## [1] "Latin"
pain.df[1, 2] # First row, second column
## [1] 72
pain.df[1, ] # The whole first row
##   Ethnicity Diagnosed Not_Diagnosed
## 1     Latin        72           128

More Computations with R (cont.)

  • We can run computations on (numeric) elements
pain.df
##   Ethnicity Diagnosed Not_Diagnosed
## 1     Latin        72           128
## 2 Non-Latin       223           241
pain.df[1, 2] + pain.df[1, 3]
## [1] 200
p.latin.patient.diagnosed.with.pain.disorder <- pain.df[1, 2] / (pain.df[1, 2] + pain.df[1, 3])
p.latin.patient.diagnosed.with.pain.disorder
## [1] 0.36

More Computations with R (cont.)

  • $ accesses a column (or other variable-like element) of an object…
pain.df$Ethnicity
## [1] "Latin"     "Non-Latin"
pain.df$Diagnosed
## [1]  72 223

More Computations with R (cont.)

  • We can do computations etc. on these columns
pain.df$Diagnosed / (pain.df$Diagnosed + pain.df$Not_Diagnosed)
## [1] 0.3600000 0.4806034

More Computations with R (cont.)

  • And assign those computations to an element
    • That’s also part of that data frame
pain.df$Risk_of_Pain_Diagnosis <- pain.df$Diagnosed / (pain.df$Diagnosed + pain.df$Not_Diagnosed)
pain.df$Risk_of_Pain_Diagnosis
## [1] 0.3600000 0.4806034
pain.df
##   Ethnicity Diagnosed Not_Diagnosed Risk_of_Pain_Diagnosis
## 1     Latin        72           128              0.3600000
## 2 Non-Latin       223           241              0.4806034
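  • A small follow-on sketch (not part of the code above) then gives the risk ratio directly from that new column:
# Risk of a pain diagnosis among Latin patients relative to non-Latin patients
pain.df$Risk_of_Pain_Diagnosis[1] / pain.df$Risk_of_Pain_Diagnosis[2]
## [1] 0.7490583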

Hazards &
Hazard Ratios

Hazards & Hazard Ratios

  • Hazard is simply risk within some time frame
    • “What are the risks of relapse within one year?”
  • Hazard ratio is the hazard in one group relative to another
    • “What are the risks of Latins relapsing within one year compared to non-Latins?”
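  • As a rough sketch only (made-up follow-up data; assumes the survival package, which is where Kaplan-Meier curves & Cox models live in R):
library(survival)
# Hypothetical data: days to relapse (or to censoring at one year)
relapse.df <- data.frame(days  = c(30, 90, 180, 365, 365, 200, 365, 50),
                         event = c(1, 1, 1, 0, 0, 1, 0, 1),   # 1 = relapsed, 0 = censored
                         latin = c(1, 1, 0, 0, 1, 0, 1, 0))
km <- survfit(Surv(days, event) ~ latin, data = relapse.df)   # Kaplan-Meier estimates by group
plot(km)                                                      # draws the curves
coxph(Surv(days, event) ~ latin, data = relapse.df)           # hazard ratio via a Cox model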

Presentation of Hazards

  • Since events in a timeframe may not be uniform,
    • Hazards / hazard ratios are often accompanied by “Kaplan-Meier curves” depicting events over time

[Figure: Kaplan-Meier curves] Probability of continued opioid use for ≤ 365 days for patients with childbirth, surgery, trauma, or other pain diagnosis in the week before their first opioid prescription, or chronic pain diagnosis in the 6 months before their first opioid prescription. Abbreviation: CNCP, chronic non-cancer pain. (Shah, Hayes, & Martin, 2017)

Odds &
Odds Ratios

Risks vs. Odds

  • Probability is the chance of something happening out of all possible occasions
    • E.g., of the 200 Latin patients with OUDs, 72 were diagnosed with pain disorders

\[P(\mathsf{Being\ Diagnosed\ \overline{c}\ Pain\ Disorder})= \frac{72}{72+128} = \frac{72}{200}\]

  • I.e., \(\frac{72}{200} = .36 \approx \frac{1}{3}\) of all Latin OUD patients were diagnosed with a pain disorder
  • “About one out of every three (p = .36) Latin OUD patients were diagnosed with a pain disorder.”

Risks vs. Odds (cont.)

  • Odds are the chance of something happening relative to it not happening
    • E.g., 72 Latin OUD patients were diagnosed with pain disorders
    • 128 Latin patients were not so diagnosed

\[\mathsf{Odds\ of\ Being\ Diagnosed\ with\ Pain\ Disorder} = \frac{72}{128} \approx .56\]

  • “Among Latin patients with OUDs,
    • For every 1 diagnosed with a pain disorder,
    • There were about 2 who were not (odds = 0.56).”
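  • Reusing the pain.df data frame from earlier, a quick check of those odds in R:
pain.df[1, 2] / pain.df[1, 3]  # Diagnosed / Not Diagnosed among Latin patients
## [1] 0.5625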

Odds Ratios

  • Odds ratios are the chance of something happening in one condition versus another
    • E.g., the chance of a patient with OUD being diagnosed with a pain disorder
      • If they are Latin versus if they are non-Latin
    • So, odds ratios are always relative
      • Usually to a reference group
  • Again, an example (& equation) should help…

Odds Ratios: Example

Ethnicity     Diagnosed   Not Diagnosed
Latin             72           128
Non-Latin        223           241

Odds—not odds ratios—for each group:

  • Odds of pain diagnosis among Latin OUD patients: \(\frac{N \mathsf{Latins\ \overline{c}\ Pain\ Disorders}}{N \mathsf{Latins\ \overline{s}\ Pain\ Disorders}} = \frac{72}{128} \approx .56\)
  • Odds of pain diagnosis among non-Latin OUD patients: \(\frac{N \mathsf{non-Latins\ \overline{c}\ Pain\ Disorders}}{N \mathsf{non-Latins\ \overline{s}\ Pain\ Disorders}} = \frac{223}{241} \approx .92\)

Odds Ratios: Equation

Group         Present   Not Present
Target            A           B
Reference         C           D

\[OR = \frac{\textsf{Target \& Present} \,/\, \textsf{Target \& Not Present}}{\textsf{Reference \& Present} \,/\, \textsf{Reference \& Not Present}}\]

\[OR = \frac{A / B}{C / D}\]

Odds Ratios: Example (cont.)

Ethnicity     Diagnosed   Not Diagnosed
Latin             72           128
Non-Latin        223           241

\[OR = \frac{\textsf{Latin \& Diagnosed} \,/\, \textsf{Latin \& Not Diagnosed}}{\textsf{Non-Latin \& Diagnosed} \,/\, \textsf{Non-Latin \& Not Diagnosed}}\]

\[OR = \frac{(72 / 128)}{(223 / 241)} \approx \frac{.56}{.92}\approx .61\]
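The same arithmetic, reusing pain.df in R:
(pain.df[1, 2] / pain.df[1, 3]) / (pain.df[2, 2] / pain.df[2, 3])
## [1] 0.6079036
(This sample odds ratio differs slightly from the conditional estimate that fisher.test reports later.)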

Odds Ratios: Example (cont.)

  • Could also look at it from the non-Latin perspective

\[OR = \frac{(223 / 241)}{(72 / 128)} \approx \frac{.92}{.56} \approx 1.6\]

or simply: \(\frac{1}{.61} \approx 1.6\)

  • Non-Latin OUD patients thus have about 1.6 times the odds of being diagnosed with a pain disorder, relative to Latin patients

Contingency Tables &
Fisher’s Exact Test

Contingency Tables

  • Also called a cross table
  • Displays the counts (frequencies) of events in different categories
  • Is usually intended to be exhaustive & mutually exclusive: every possible outcome appears, & each observation falls into exactly one cell
  • If (& only if) that holds, we can conduct meaningful tests on whether the frequencies differ between groups, etc.
    • Fisher’s exact test can be used to test 2 \(\times\) 2 tables
    • And contingency tables larger than 2 \(\times\) 2
    • In fact, it’s better than χ² with small cell counts (<5)

Contingency Tables (cont.)

Yeah, like this:

Ethnicity     Diagnosed   Not Diagnosed
Latin             72           128
Non-Latin        223           241


  • Odds among Latins = \(\frac{72}{128} \approx .56\)
  • Odds among non-Latins = \(\frac{223}{241} \approx .92\)
  • Odds ratio = \(\frac{.56}{.92} \approx .61\)

Fisher’s Exact Test

  • Invented to test whether Fisher’s colleague, Muriel Bristol, could indeed tell whether the milk or the tea had been poured into the cup first
  • Called an “exact” test because it computes the exact p-value for rejection
    • I.e., not an estimate based on inferences about the population (e.g., normality)
    • So, isn’t inferential per se
[Photo: Dr. Muriel Bristol waiting for her tea]

Fisher’s Exact Test of Pain Diagnoses

pain.df.counts <- subset(pain.df, select = c("Diagnosed", "Not_Diagnosed"))
fisher.test(pain.df.counts)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  pain.df.counts
## p-value = 0.00491
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.4250886 0.8665430
## sample estimates:
## odds ratio 
##  0.6083658
  • The difference in odds (.56 vs. .92) is significant at α = .05
    • We see this in p-value = .00491
    • And that the confidence interval (0.425 to 0.866) for the odds ratio (0.608) doesn’t overlap 1

Fisher’s Exact Test of Pain Diagnoses (cont.)

  • fisher.test also creates an object
  • And not all elements of that object are automatically printed
  • We can look to see what all is there with either the powerful & flexible summary() command:
summary(fisher.test(pain.df.counts))
##             Length Class  Mode     
## p.value     1      -none- numeric  
## conf.int    2      -none- numeric  
## estimate    1      -none- numeric  
## null.value  1      -none- numeric  
## alternative 1      -none- character
## method      1      -none- character
## data.name   1      -none- character

Fisher’s Exact Test of Pain Diagnoses (cont.)

  • Or with the more revealing str() command:
    • str() works to “look inside” most any object or function in R
str(fisher.test(pain.df.counts))
## List of 7
##  $ p.value    : num 0.00491
##  $ conf.int   : num [1:2] 0.425 0.867
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num 0.608
##   ..- attr(*, "names")= chr "odds ratio"
##  $ null.value : Named num 1
##   ..- attr(*, "names")= chr "odds ratio"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Fisher's Exact Test for Count Data"
##  $ data.name  : chr "pain.df.counts"
##  - attr(*, "class")= chr "htest"

Fisher’s Exact Test of Pain Diagnoses (cont.)

  • $ can access any of those elements of the fisher.test object:
fisher.test(pain.df.counts)$p.value
## [1] 0.004910079
fisher.test(pain.df.counts)$conf.int
## [1] 0.4250886 0.8665430
## attr(,"conf.level")
## [1] 0.95
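  • A small idiom worth knowing (a sketch, not from the original code): store the result once, instead of re-running fisher.test for every element:
pain.fisher <- fisher.test(pain.df.counts)
pain.fisher$estimate
## odds ratio 
##  0.6083658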

Is Fisher’s Exact Test Too Conservative?

  • Fisher himself sure was
  • Fisher’s exact test can mis-estimate p-values when N is “small” (<50; Andrés & Tejedor, 1995)
    • But the bias is usually small
    • And common corrections (e.g., Yates’s continuity correction) don’t change outcomes much (Crans & Shuster, 2008)
      • While making for somewhat more complicated analyses that rely on parametric assumptions

Upton (1992)

  • Upton (1992) argues that Fisher’s exact test is not too conservative
    • Trouble comes because counts are discrete
      • While the nominal α (e.g., .05) treats the p-value as if it could take any value
  • Also discusses how it can be hard to test this when something comes close to always (or never) happening
    • Essentially, cell counts that are zero (or close to it) can be problematic
  • And that small cell counts can artificially push a p-value to be larger than .05
    • See Upton’s discussion of Table 2 where the real rejection value ranged from .015 – .080

Upton (1992, cont.)

  • Also notes—correctly—that significance is largely determined by sample size
    • Some argue for making α smaller as the sample size gets larger
    • Or using the “Bayesian information criterion” to choose a model
  • Discusses “practical significance”
    • I.e., how important it is to not have false positives for a given real-world situation
    • “The experimenter must keep in mind that significance at the 5% level will only coincide with practical significance by chance!” (p. 397)

The Normal Distribution

Normal Distribution

Formula for a normal distribution:

\[f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-(x - \mu)^2}{2\sigma^2}}\]

  • x is some random variable
  • μ is the mean
  • σ is the standard deviation
  • There, now you can say you learned this
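  • You never have to compute it by hand, though; as a quick check in R, plugging values into the formula matches dnorm(), which implements this density:
x <- 1.5; mu <- 0; sigma <- 1
(1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))
## [1] 0.1295176
dnorm(x, mean = mu, sd = sigma)
## [1] 0.1295176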

Characteristics of the Normal Distribution

[Figure: the normal curve]

Characteristics of the Normal Distribution (cont.)

  • Most importantly, it is only a function of the mean & standard deviation
  • The mean, median and mode are all equal
  • The total area under the curve equals 1
  • It’s symmetric
  • The curve approaches—but never touches—the x-axis

Q-Q Plots

  • “Quantile-quantile” plots
    • Compare the quantiles of one distribution (e.g., the sample) against the quantiles of another (e.g., a normal) distribution
  • More simply, how & where two distributions deviate from each other
  • Frequently used to test if & how a sample deviates from normality
    • Or how residuals deviate from normality
  • They’re easily created in SPSS or R
  • And a few examples may help in making sense of them…
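  • For instance, a minimal base-R sketch (simulated data) that produces one of these plots:
set.seed(42)
x <- rnorm(200)   # 200 draws from a standard normal
qqnorm(x)         # sample quantiles vs. theoretical normal quantiles
qqline(x)         # reference line through the quartiles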

Normally-Distributed

Long Tails

Short Tails (Looks like an S)

Long Right Tail (Positive Skew)

Long Left Tail (Negative Skew)

More Figures, More Views on Skew

From StackExchange

Q-Q Plots (cont.)

Alternatives to Q-Q Plots

  • Can also/instead formally test normality of sample data with, e.g., the Shapiro-Wilk (S-W), Anderson-Darling (A-D), or Kolmogorov-Smirnov (K-S) tests (a small R sketch follows this list)
  • S-W & A-D are better than K-S, but all are strongly affected by sample size
    • Under-powered with small N
      • Can’t detect non-normality when we need to
    • Over-powered with large N
      • Overly sensitive to deviations when we don’t need to know
      • (I.e., have enough data to approximate population distribution without assuming normality)
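  • A sketch of those tests in R (shapiro.test & ks.test are in base R; the Anderson-Darling test assumes the nortest package is installed):
set.seed(42)
x <- rnorm(50)
shapiro.test(x)       # Shapiro-Wilk
ks.test(x, "pnorm")   # Kolmogorov-Smirnov vs. a fully specified standard normal
nortest::ad.test(x)   # Anderson-Darling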

The χ² Distribution

Background

  • Invented by Karl Pearson (in an abstruse 1900 article)
  • Originally to test “goodness of fit”
    • How well a set of data fit a theoretical distribution
    • Or if two sets of data follow the same distribution

  • Technically, the χ² distribution is a special case of the gamma distribution: it is the distribution of a sum of squares of independent, standard normal random variables
    • A lot like we get when computing ordinary least squares for t-tests & ANOVAs
    • And is closely related to the t and F distributions

Characteristics

  • The distribution’s shape, location, etc. are all determined by the degrees of freedom
    • The mean = df
    • The variance = 2df
    • The mode (where the curve peaks) = df – 2
      (when df ≥ 2)
  • As the degrees of freedom increase:
    • The χ² curve approaches a normal distribution
    • The curve becomes more symmetrical
  • It has no negative values
    • Since it is based on squared values
    • Making it good to test variances

Characteristics (cont.)

Uses of the χ²

  • Because it only depends on df
    • And resembles a normal distribution
  • It is useful for testing whether data follow a normal distribution
    • Or, more often, whether the total set of deviations from normality
      • (Or from any other set of expected values)
    • Is greater than we’d expect by chance

  • It can do this for discrete values—like counts
    • t and F distributions technically can’t do this

Computing χ²

  • Formula for χ² value:

\[ \chi^2 = \sum{\frac{(Observed - Expected)^2}{Expected}}\]

  • So:

    1. Compute the difference between each observed value & its expected value
    2. Square all of those differences & divide by the expected value
      •  Kinda like computing the odds
    3. Sum up those squared-difference “odds” across the groups
    4. Check that summed value against a χ² distribution
      •  Where dfs = (Nrows – 1) \(\times\) (Ncolumns – 1)
    5. If the summed value is really far from the center of the distribution
      •  Then those actual-expected differences are significant

Example of Using a χ²

  • Remember our OUD data for Latins:
Ethnicity     Diagnosed   Not Diagnosed   Total
Latin             72           128          200


  • Presenting that a little differently:
Value Type    Diagnosed   Not Diagnosed   Total
Observed          72           128          200
Expected         100           100          200

Example of Using a χ² (cont.)

  1. Taking the difference between observed & expected
  2. Squaring those differences & dividing by expected
  3. Add up those values
Value Type             Diagnosed   Not Diagnosed
Observed                   72           128
Expected                  100           100
Observed - Expected       -28            28
  • \(28^2 = 784\)
  • \(\frac{784}{100} = 7.84\)
  • \(7.84 + 7.84 = 15.68\)
  • df = \((2 - 1)\times(2 - 1) = 1\times1 = 1\)
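  • The same arithmetic in R:
observed <- c(72, 128)
expected <- c(100, 100)
sum((observed - expected)^2 / expected)
## [1] 15.68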

Example of Using a χ² (cont.)

  • The critical χ² value
    • For 1 df
    • And α = .05:
qchisq(df = 1, p = .05, lower.tail = FALSE)
## [1] 3.841459
  • Which is smaller than our 15.68
  • So our observed values are significantly different from our expected ones
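  • Base R’s chisq.test does all of this in one call; given a bare vector of counts it defaults to equal expected proportions, so this sketch should reproduce the χ² of 15.68 on 1 df (with a p-value well below .05):
chisq.test(c(72, 128))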

Uses of the χ² (cont.)

  • The χ² distribution has many uses, including:
    1. Estimating parameters of a population with an unknown distribution
    2. Checking the relationships between categorical variables
    3. Checking the independence of two classification criteria for qualitative variables
    4. Testing deviations of differences between expected and observed frequencies
    5. Conducting goodness of fit tests