Appendix B — Common/Confusing Statistical & Scientific Terms

In addition to abstruse jargon¹, both science in general and statistics in particular use fairly common words in specialized ways. Some of these common and confusing terms are listed below.

B.1 Common/Confusing Statistical & Scientific Terms

Table B.1: Common/Confusing Statistical & Scientific Terms
Term	Meaning in Science & Statistics
Criterion	The output/outcome variable, the variable that is measured to see the effects of other variables on it. (See Predictor, below.) Also called: • Dependent variable (DV) • Endogenous variable • Outcome (or “output”) variable • Response variable • Regressand • Target Calling it a criterion implies that it is being used as the basis or standard by which to test the importance of the predictors (or input variables) or even the success of our endeavors.
Descriptive	Descriptive statistics simply, well, describe a sample of data. They nearly always make no assumptions about the data—and make none about the population from which they were drawn (except perhaps that each datum is drawn independently from any other data and from the identical population.)
Factor	Either: • A predictor variable of any type • An independent variable manipulated or controlled by researchers; in this sense, a factor is usually, but not necessarily, a categorical variable • A common source of variance/information in one or more ostensible variables
Fixed	No, nothing in science is ever fixed (or broken :). “Fixed” has two mostly different uses in statistics: 1. A fixed variable is one in which all possible levels are present in the sample data. For example, if a sample of nursing home residents indicated that some had fallen while others had not (and there are no other possible categories), then `falls` would be a fixed factor. Variables in which not all levels are present in the data are called random. Often (but not always), these are continuous variables where it is impossible for all levels to be present in a sample of data, such as height, which could be measured to nearly infinite levels. The important difference between fixed and random factors is that analyses must estimate the effects of levels not present for random factors, an issue that can eat up degrees of freedom, etc. 2. Fixed effects in a linear regression (as in `SPSS`’s `MIXED` function, Chapter 17) are terms that have the same coefficient (e.g., the same β weight for that term in the model) for all participants. For example, if all participants were assigned to either the Experimental or Control group, regardless of which hospital they were admitted to, then this would be a fixed effect. In hierarchical models, these are usually the levels that have something else nested in them, e.g., if patients were nested in hospitals. In this sense, random effects are terms that are nested within another level².
iid	An abbreviation that describes two important characteristics of a set of data collected on a given variable. It stands for “independently and identically distributed,” meaning that: 1. the value of each data point in a given variable is independent of the value of every other data point for that variable and 2. each of those data points is drawn from the same distribution, e.g., they’re all drawn from a normal distribution. That the data in a given variable are iid is one of the most important assumptions in inferential statistics. Often, it can’t be violated without us losing the validity of conclusions drawn from those data.
Indicator	A synonym for a dummy variable that indicates whether something is present or not present, e.g., recovered versus not recovered.
Inferential	Inferential analyses rely on assumptions being made about the population from which those data were drawn. This often includes assuming that the population is normally distributed. They are typically distinguished from descriptive statistics that make no (or fewer) assumptions about the population.
Mean	The average of a set of data. I’m including it in this list of common/confusing terms simply to note the main ways a mean can be computed, and their respective uses: • Arithmetic mean: The one you know, in which values are summed and then divided by the number of values. It is used when there is no particular reason to use another method. • Geometric mean: Values are multiplied instead of added. We then take the nth root of this product (where n is the number of values). Geometric means are useful to compare very different values; to get the mean of percents, proportions, etc.; and when the values are related to each other (like all the percents being of the same thing, like inflation). • Harmonic mean: It is computed as “the reciprocal of the average of the reciprocals of the data values”. It is used when we want to reduce the weight of larger values, such as when a distribution is positively skewed (i.e., has a disproportionate number of large values). An example is length of time, where events can’t be shorter than zero but sometimes can take a lot longer than they should³. • Weighted mean: Any of the above types of means can also be weighted. In a weighted mean, some of the values are given heavier⁴ weights (their values are multiplied by some number to make them affect the overall mean differently) than other values so that those weighted values contribute more to the overall mean. This is commonly done when we were unable to sample enough people of a certain type, such as when we were unable to sample enough members of a minoritized group.
Multiple vs. multivariate (e.g., multiple vs. multivariate linear regression)	• “Multiple” indicates that there is more than one predictor • “Multivariate” indicates that there is more than one outcome, and thus more than one linear regression equation.
Non-ostensible	A variable that cannot be directly observed or otherwise perceived. These are usually theoretical and abstract concepts (constructed ideas) that are assumed to give rise to “ostensible” variables that can be empirically perceived. Other terms for these and related concepts are: • Latent • Unobserved These are similar to, and sometimes even defined by, the factors that emerge from factor (latent variable) analysis.
Non-parametric	Non-parametric analyses make no (or few) assumptions about the population distribution, viz., that it is normally distributed. Non-parametric analyses tend to be more robust than parametric analyses; they also tend to be used for variables of lower measurement levels (e.g., for ordinal instead of interval/ratio data).
Ostensible	A variable that can be directly, empirically observed. This is used to distinguish a variable from non-ostensible ones that are theorized to manifest in observable ways (through ostensible variables). Other terms for these and related concepts are: • Manifest variable • Observable variable
Parametric	Parametric analyses are inferential analyses that make assumptions about the mathematical values (“parameters”) of the population’s distribution. Nearly always, this is the assumption that the population distribution is normal. Therefore, parametric analyses are usually those that assume normality. The term is also nearly always used to contrast these analyses with non-parametric ones that do not require (as many) assumptions about the population distribution. Making fewer assumptions, non-parametric analyses tend to be more robust.
Parsimony	The desirable trait of communicating efficiently, saying a lot of information clearly and succinctly. It is also said of explanations and theories, suggesting a strong explanation that is “elegantly” simple yet generally useful.
Population	A well-defined and -delimited group of individuals (patients, nurses, etc.) about which insights are made based on a smaller sample of members of that population. The sample are those chosen to be studied directly; the population are those to whom conclusions made from the sample can be justifiably applied.
Power	The probability of detecting a real effect, i.e., the chance of not making a false negative (Type 2) error. Power therefore equals one minus the probability of a Type 2 error.
Predict	In statistics, prediction is our ability to use what we know to make inferences about what we don’t. This could be information we have from the past and present that we use to guess at the future. But it could also be using known information about the past to make inferences about other, past events we don’t know about. And yes, this certainly includes using sample data to infer population values. A rather clever use of prediction is to randomly split a larger set of data in half, use one half to create a model with a given set of parameters & values, and then see how well those parameters, etc. predict the other half of the data. This allows us to conduct a very authentic test of how good our estimates were.
Predictor	One of the many terms used to indicate the variables added to a linear regression model to test the effect on the outcome. Also called: • Explanatory variable • Independent variable (IV) • Input variable • Regressor
Random	Generally, the crux of randomness is that the value is unbiased and, in the long run, therefore an accurate representation of the true state. However, it can also refer to: • The process of selecting a participant, level, etc. without bias so that any value is either equally likely to be chosen or at least chosen by the same rules & odds as any other value • A “random variable,” a rather generic term for any empirical quantity that can take multiple values, where the value it takes is “iid”: independent from and identically distributed as all other measurements taken on that variable • A “random effect,” a term in a hierarchical linear model (aka multilevel model, Chapter 17) that is nested within another variable, like patients nested in hospitals.
Spell	In longitudinal analyses, an occasion on which some outcome is present. For example, each time when a woman is pregnant or when someone with a substance abuse disorder recidivates. “Spell” is usually (but not always) used when an event can happen more than once in a longitudinal study.
Variance	A measure of dispersion. It is the square of the standard deviation (when computed for the dispersion of a variable); it is the average squared distance of the data points from the regression line (when computed in a linear regression). It is also a measure of the total amount of information in a variable; the more information, the richer a variable is, but the more there is to try to understand.
Wave	An instance of data collection at a given point in time. This term is usually only used when there is more than one data collection occasion, viz., when a study is longitudinal. Also called: • Event • Instance • Period • Phase • Time point
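
The four kinds of mean described under Mean in Table B.1 can be compared side by side in a short sketch. This is only an illustration: the choice of Python, the data values, the weights, and the helper name `weighted_mean` are assumptions made for the example and do not come from the text above.

```python
import math
import statistics

# Hypothetical positive values, chosen only for illustration.
values = [2.0, 4.0, 8.0]

# Arithmetic mean: sum the values, then divide by how many there are.
arithmetic = statistics.mean(values)

# Geometric mean: multiply the values, then take the nth root of the product.
geometric = statistics.geometric_mean(values)

# Harmonic mean: the reciprocal of the average of the reciprocals.
harmonic = statistics.harmonic_mean(values)


def weighted_mean(data, weights):
    """Weighted arithmetic mean: each value counts in proportion to its weight."""
    if len(data) != len(weights):
        raise ValueError("data and weights must have the same length")
    return sum(x * w for x, w in zip(data, weights)) / sum(weights)


# Hypothetical weights: the last value counts twice as much as the others,
# as it might if it came from an under-sampled group.
weighted = weighted_mean(values, [1, 1, 2])

print(f"arithmetic: {arithmetic:.3f}")
print(f"geometric:  {geometric:.3f}")
print(f"harmonic:   {harmonic:.3f}")
print(f"weighted:   {weighted:.3f}")
```

For positive values that are not all equal, the harmonic mean is smaller than the geometric mean, which is in turn smaller than the arithmetic mean. This is one way to see that the harmonic mean gives less weight to large values.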

B.2 Terms for Different Types of Analyses

Table B.2: Terms for Different Types of Analyses
Analysis	Measurement Level of Outcome Variable(s)	Measurement Level of Predictor(s)	Uses & Notes
ANOVA	Continuous	Nominal	• Understood by many, so easily communicated • Variance determined by ordinary least squares • Typically only used to test significance of individual (main effect and/or interaction) terms
One-way ANOVA etc.	Continuous	One Nominal	• The “one-way” indicates that there is only one predictor. • If there are two predictors, it’s instead called a “two-way” ANOVA. • We could also use “three-way” ANOVA, etc., but we instead just give up and call them “multi-way” ANOVAs when there are ≥3 nominal predictors.
ANCOVA	Continuous	≥1 Nominal & ≥1 Continuous	• Contains one or more continuous “covariates” • Variance determined by ordinary least squares • Typically only used to test significance of individual (main effect and/or interaction) terms
Repeated-measures ANOVA	Continuous	Nominal	• A repeated-measures ANOVA not only tests the effect of ≥1 predictors, but also whether the effect(s) of the predictor(s) changes over time. • The times when data are collected must be evenly spaced • Similarly, a “repeated-measures ANCOVA,” for example, is an ANCOVA (with ≥1 nominal and ≥1 continuous predictor) that tests differences in predictors’ effects over time
MANOVA	≥2 Continuous	Nominal	• A MANOVA includes two or more outcome variables. • The benefit of conducting a MANOVA over two (or more), separate ANOVAs (one for each outcome) is that a MANOVA also tests for (and accounts for) the relationships between the outcomes
General linear model	Continuous	Nominal and/or continuous	“General linear model” is the term for any linear model that: • Has a continuous outcome • Assumes a linear relationship between the predictors and the outcome. General linear models include ANOVAs (and their ilk), t-tests, F-tests, etc. They are one of the types of generalized linear models.
Generalized linear model	Any	Any	• “Generalized linear model” is the term used for any inferential analysis that uses the general formula of \(Y = b_{0} + b_{1}X_{1} + ... + b_{k}X_{k} + e\) to describe the relationship between \(k\) predictors and ≥1 outcomes. This includes general linear models (and thus ANOVAs, etc.), logistic regression, structural equation models, etc. • Although the formula is written as a linear equation, generalized linear models can model many types of non-linear relationships between predictors and outcomes. ◦ They can test dichotomous outcomes (viz., logistic regression), logarithmic outcomes, etc. • Data can be heteroskedastic. • Variables need not be normally distributed (but their distribution must still be correctly modeled by the equation) The term “generalized linear model” is not as commonly used as “general linear model” (instead one uses the term for the type of model conducted), but it is still useful to know the difference.
Logistic regression	Dichotomous	Nominal or continuous	Logistic regression models are used to estimate the effects on a dichotomous outcome, such as whether or not a patient has a given condition.
Ordinal regression	Ordinal	Nominal or continuous	Although I believe we can often treat ordinal variables as continuous (i.e., lump them in with interval & ratio variables), there are times when we should indeed treat a variable as assuming nothing more than ordinalism, such as when we have only a few levels and know the distances between them vary, but don’t know how much.
Multinomial (logistic) regression	Nominal	Nominal or continuous	These are very similar to logistic regression, differing mainly in that there are more than two levels to the outcome variable. Like logistic regression, one of the main goals of multinomial regression is to determine (predict) the category one falls into based on one’s values for the predictors.
Multiple linear regression	Continuous	≥2 Nominal or continuous	This is a general term for a linear model that has two or more predictors.
t-test	Continuous	One dichotomous	t-Tests are used to test the difference between two values. They are used, for example, to test: • The difference between the means of two groups ◦ Such as two study groups (e.g., experimental & control), ◦ Or the means for each level of a nominal variable with two levels (e.g., those diagnosed / not diagnosed with a condition) • Changes in one outcome at two different points in time (a “paired” t-test) • Whether a single mean is different from some value, e.g., whether the mean is not zero (a “one-sample” t-test). These are often done in linear regressions to test whether a given parameter weight is significantly different from zero.
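
To make the t-test row of Table B.2 more concrete, here is a minimal sketch of how the two-group and paired t statistics can be computed by hand. It assumes the classic pooled-variance (equal variances) form of the two-group test. The function names `pooled_t` and `paired_t` and all data values are hypothetical, and the sketch stops at the t statistic and its degrees of freedom rather than computing a p-value.

```python
import math
import statistics


def pooled_t(group_a, group_b):
    """Two-group t statistic, assuming both groups share the same variance.

    Returns (t, degrees of freedom); t is mean(group_a) - mean(group_b)
    divided by its standard error.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * statistics.variance(group_a)
        + (n_b - 1) * statistics.variance(group_b)
    ) / df
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / std_error
    return t, df


def paired_t(first, second):
    """Paired t statistic for the same cases measured at two time points.

    This is a one-sample t-test on the differences, tested against zero.
    """
    diffs = [later - earlier for earlier, later in zip(first, second)]
    n = len(diffs)
    std_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / std_error, n - 1


# Hypothetical scores, for illustration only.
control = [1, 2, 3, 4, 5]
experimental = [3, 4, 5, 6, 7]
t_groups, df_groups = pooled_t(experimental, control)

before = [1, 2, 3, 4, 5]
after = [2, 4, 4, 6, 6]
t_paired, df_paired = paired_t(before, after)

print(f"two-group: t = {t_groups:.3f}, df = {df_groups}")
print(f"paired:    t = {t_paired:.3f}, df = {df_paired}")
```

The paired version reuses the one-sample logic, which is why the paired and one-sample t-tests listed in the table are so closely related.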

¹ “Generalized autoregressive conditional heteroskedasticity model” anyone?
² More specifically, these are terms where the intercept and/or slope is allowed to vary for each level, e.g., for each patient.
³ Which, of course, never applies to dissertations.
⁴ Sure, you could also/instead give lighter weights to some values (e.g., values from over-represented groups), but we usually instead give heavier weights to members of underrepresented groups.