1  Introduction to Measuring Relationships and Building Models

One of my goals in this curriculum is to give you a strong foundation in what I have come to see as the principles and practices guiding major, currently recommended analyses. As you may have also realized, doing this differs somewhat from how statistics is usually taught. I very sincerely hope my gamble pays off, and that you not only gain the same skills that others would gain from doctoral statistics courses, but that you magnify them with a good understanding of what the heck is going on.

To this end, I have tried to convey the importance of a few interrelated concepts that permeate much of modern analysis.

1.1 Measuring Relationships

In a very readable ScienceAlert article, Neild describes a study in which researchers found that, although night owls tend to have shorter lives than morning larks, being a night owl per se doesn’t increase mortality in older adults. Instead, it was the riskier behavior these people of the night are more prone to engage in. But how could the researchers determine this? If night owls, pretty much across the board, didn’t live as long, how could the researchers say that being a night owl didn’t increase mortality? What sort of analysis would allow them to remove—to isolate—the effect of being a night owl (more technically, of having a nocturnal chronotype)?

Isolating an effect begins with being able to measure that effect—more specifically, being able to measure the main effect of a variable (here, the type or lateness of chronotype). Measuring it this way allows us to circumscribe its effect—to know where and what its effect is. That, in turn, lets us isolate its effect on other variables: We can “partial out” the effect, letting us see other relationships with that effect removed.

And we must thus be able to measure the extent of these relationships. The better a relationship is measured, the better we can detect its boundaries and thus isolate it. Estimating the relationship between variables thus not only allows us to see how much they are related, but also to isolate that relationship from other relationships.

How much two variables are associated with each other is often modeled as a linear relationship in which the slope of the line (when the two variables are plotted on the axes) shows how much they’re related: The greater that slope (the more it deviates from a flat line with a slope of zero), the stronger the relationship between them. We often use some assumptions to estimate the underlying nature of the variables, and then use some criterion (e.g., being 95% sure we’re right) based on those assumptions to decide whether we think the association between those two variables matters—whether it is “statistically significant.”
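To make this concrete, here is a minimal sketch of estimating that slope and checking whether it is "statistically significant." It assumes Python with numpy and scipy, which this chapter doesn't prescribe, and the variable names and simulated data are purely hypothetical.

```python
# A minimal sketch (simulated data) of estimating the slope of the line
# relating two variables and checking its "statistical significance."
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(42)
sleep_lateness = rng.normal(0, 1, 200)                          # hypothetical predictor
risky_behavior = 0.4 * sleep_lateness + rng.normal(0, 1, 200)   # hypothetical outcome

fit = linregress(sleep_lateness, risky_behavior)
print(f"slope = {fit.slope:.2f}, p = {fit.pvalue:.4f}")
# A slope near zero (a flat line) suggests little relationship; the farther the
# slope sits from zero relative to its standard error, the smaller the p-value.
```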

If the two variables are measured on the same scale—say z-scores—then the greatest relationship would be represented by a straight line going up (or down if it’s a negative relationship) at a 45° angle; whenever the line goes one unit to the right, it also goes exactly one unit up. The slope is \(\frac{1}{1}\), or simply 1. Anything less than a perfect relationship results in a slope that is less than 1.

If we make no further assumptions about the two variables than that they have a linear relationship and that the data are roughly homoscedastic, then we can compute a correlation coefficient. The exact method we use depends on the measurement level (dichotomous, ordinal, interval, etc.) of the variables, but all correlation coefficients are scaled to range between -1 and 1 by convention.
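As a rough illustration of how measurement level steers the choice of coefficient, the sketch below (again assuming Python with scipy and using simulated, hypothetical data) computes a Pearson, a Spearman, and a point-biserial correlation; all three land on the same -1 to 1 scale.

```python
# A sketch of choosing a correlation coefficient based on measurement level.
import numpy as np
from scipy.stats import pearsonr, spearmanr, pointbiserialr

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 150)
y = 0.5 * x + rng.normal(0, 1, 150)      # interval-level variable
ranks = np.argsort(np.argsort(y))        # an ordinal (ranked) version of y
group = (x > 0).astype(int)              # a dichotomous variable

r_pearson, p1 = pearsonr(x, y)           # interval by interval
r_spearman, p2 = spearmanr(x, ranks)     # at least one variable ordinal
r_pb, p3 = pointbiserialr(group, y)      # dichotomous by interval

print(round(r_pearson, 2), round(r_spearman, 2), round(r_pb, 2))
# All three coefficients are bounded by -1 and 1, so they read on the same scale.
```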

So, one way to look at a correlation is the strength of the (linear) relationship between variables.

1.2 Signal-to-Noise Ratios

Another way to look at correlations is that they are one of the ways of measuring a “signal-to-noise” ratio—another idea that permeates much of statistics. Here, it’s not just a question of how much two variables are associated, but how much that association accounts for all the information we have in our data about those variables—how much of the total variance is covariance between those things. The amount that the variables move together—the amount they covary—is a measure of the strength of their relationship. The amount they move independent of each other—the amount they do not share variance—is a measure of how weak their relationship is. If they have a weak relationship, then we obviously don’t have a very good representation of what makes these two variables take on whatever values they have.
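A small sketch of that signal-to-noise reading, assuming Python with numpy and simulated data: squaring the correlation gives the proportion of variance the two variables share, and what is left over is the variance they don't.

```python
# A sketch of the "signal-to-noise" reading of a correlation: the squared
# correlation is the proportion of variance the two variables share.
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(0, 1, 500)
y = 0.6 * x + rng.normal(0, 1, 500)

r = np.corrcoef(x, y)[0, 1]
shared = r ** 2        # covariance relative to total variance (the "signal")
unshared = 1 - shared  # variance the variables do not share (the "noise")
print(f"r = {r:.2f}; shared variance = {shared:.2f}; unshared = {unshared:.2f}")
```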

1.3 Building Models

The models we’re analyzing in the class activities—that, say, participating in a DBT program is linearly related to growth in executive functions—are not very good; they don’t account for much of the total variance in the data. Nonetheless, building models and testing how well they account for our observations (e.g., the data we have on hand and future observations we will make) is a third idea that permeates much of statistics.

Let’s go back to our example of a very simple model: a correlation between two variables. Let us also say that, in this example, it’s a weak correlation, and we want to improve on this model to make it a better representation of what’s really going on in our data. We could do this by throwing out that first model entirely and trying another one (e.g., by seeing if another zero-order correlation is stronger). Or we could try tweaking our model, say, by adding in another, third variable (i.e., testing out some partial or semipartial correlations). This third variable may clarify the relationship between the first two variables or add new and unique information to our model. This third variable may also (or instead) “suck up” the information that exists in that first zero-order correlation, thus making the original, direct relationship matter even less.¹
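Here is one hedged way to see that "sucking up" in action: a partial correlation computed by hand, correlating whatever is left of each of the first two variables after a third variable has been regressed out. The data and variable names are simulated and hypothetical; this is a sketch, not a prescribed procedure.

```python
# A sketch of a partial correlation "by hand": correlate the parts of x and y
# that remain after removing (regressing out) a third variable z.
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(0, 1, 300)             # the third variable
x = 0.7 * z + rng.normal(0, 1, 300)
y = 0.7 * z + rng.normal(0, 1, 300)   # x and y are related mostly through z

def residuals(a, b):
    """Return the part of a that b cannot account for (simple OLS residuals)."""
    slope, intercept = np.polyfit(b, a, 1)
    return a - (intercept + slope * b)

r_zero_order = np.corrcoef(x, y)[0, 1]
r_partial = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]
print(f"zero-order r = {r_zero_order:.2f}; partial r (controlling z) = {r_partial:.2f}")
# When z "sucks up" the shared information, the partial correlation shrinks
# toward zero even though the zero-order correlation looked substantial.
```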

For example, imagine we want to predict what makes adolescents develop an important set of cognitive skills (those executive functions, EFs). To do this, we start with an outcome—a criterion—of interest, e.g., how much a general, self-report measure of all EFs changes throughout middle and high school. We want to see what is related to that measure of growth—to predict why one teen shows strong development and another actually becomes worse.

We could look into what is related to EF growth by looking at a bunch of correlations. Indeed, we started by doing just that: just getting a sense of what is related to what in the data set, focusing on what is related to EF growth. But then we (I) decided to up our game: We started asking more specific questions of the data; we demanded more precise answers from it at the expense of having to make more precise assumptions.

In positioning EF growth as an outcome—a criterion or DV—and any other variables as inputs / predictors / IVs, we are assuming that EF growth is a result of the other variables. One way to think about what we’re doing is this: We set EF growth down in our model, add another variable to it as a predictor, use ordinary least squares to “regress” the predictor down to a line, and then see how steep that line is. Mathematically, this is only slightly different than conducting a correlation: In a correlation, we choose a line based on the regression of both variables down to that line; in linear regression, we only regress the predictor down to that line. Although the math is slightly different, the values we derive are the same (as long as the scores are standardized).
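A quick sketch of that equivalence, assuming Python with scipy and simulated data: the Pearson correlation and the slope from a simple regression on z-scored variables come out the same.

```python
# A sketch showing that, with standardized (z-scored) variables, the slope from
# a simple linear regression equals the Pearson correlation.
import numpy as np
from scipy.stats import pearsonr, linregress

rng = np.random.default_rng(11)
predictor = rng.normal(0, 1, 400)
ef_growth = 0.45 * predictor + rng.normal(0, 1, 400)   # hypothetical criterion

def z(v):
    """Convert a variable to z-scores."""
    return (v - v.mean()) / v.std()

r, _ = pearsonr(predictor, ef_growth)
beta = linregress(z(predictor), z(ef_growth)).slope
print(f"r = {r:.3f}; standardized regression slope = {beta:.3f}")  # essentially identical
```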

So why go through the trouble of doing a linear regression at all if a correlation (or even a semipartial) gets us to the same point? Because linear regressions let us ask more precise questions and thus get more sophisticated answers. We can look at specific pieces of the model, for example, looking at the effect of just one predictor on the criterion, isolated from any influence of the other predictors. We can also look at the error term and even run tests on it (e.g., to see if the error indeed approximates the normal distribution we assume it does). Linear models thus represent a more flexible approach that can be adapted to a wider range of data—and still generate more specific answers from it. (In fact, this approach is so flexible that—conceived of as generalized linear models—it undergirds perhaps every analysis you will ever make.)
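For instance, here is a hedged sketch (assuming Python with statsmodels and scipy, plus simulated data with hypothetical names) of pulling out a model's error term and testing whether it looks roughly normal.

```python
# A sketch of inspecting the error term of a linear model: fit a regression,
# pull out the residuals, and test whether they look roughly normal.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import shapiro

rng = np.random.default_rng(5)
df = pd.DataFrame({"predictor": rng.normal(0, 1, 250)})
df["ef_growth"] = 0.5 * df["predictor"] + rng.normal(0, 1, 250)

model = smf.ols("ef_growth ~ predictor", data=df).fit()
stat, p = shapiro(model.resid)          # Shapiro-Wilk test of normality on the residuals
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.3f}")
# A non-significant p here is consistent with (though not proof of) normal errors.
```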

And so in linear regressions, we can test rather specific terms to see if those terms alone are significant. Does an adolescent’s 6th-grade EF score predict EF growth in subsequent years (is the intercept term significant)? Does DBT participation matter (is the β-weight for that term significant)? Even if DBT participation matters, does the economic hardship a student experiences moderate the effectiveness of the DBT program (is a DBT \(\times\) economic distress term significant)?
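A sketch of what testing those specific terms might look like, assuming Python with statsmodels and a simulated, hypothetical data set; the formula expands to the intercept, both main effects, and the DBT \(\times\) economic distress interaction, each with its own β-weight and p-value.

```python
# A sketch of testing specific terms in one linear model: a main effect of DBT
# participation and a DBT x economic-distress interaction (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 300
df = pd.DataFrame({
    "dbt": rng.integers(0, 2, n),            # 0 = no DBT program, 1 = DBT program
    "econ_distress": rng.normal(0, 1, n),    # standardized economic distress
})
df["ef_growth"] = (0.3 * df["dbt"]
                   - 0.2 * df["econ_distress"]
                   - 0.15 * df["dbt"] * df["econ_distress"]
                   + rng.normal(0, 1, n))

model = smf.ols("ef_growth ~ dbt * econ_distress", data=df).fit()
print(model.params)    # intercept and beta-weight for each term
print(model.pvalues)   # a p-value for each term, including dbt:econ_distress
```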

If we already know how much economic distress a student is facing, do we really need to know about the DBT program at all?

That last question is more sophisticated than it may first seem. The questions before it (about intercept, a main effect of DBT, and a DBT \(\times\) economic distress interaction) are all answerable through an ANOVA. That last one about whether DBT adds significant and significantly new information to a model that already contains economic distress requires a bit more—and a change in perspective about what we’re doing in these analyses.

Another way to think about what we’re doing in linear regression is this: We gather together a set of variables to act as predictors. Using some eldritch process, we produce a value for EF growth that we would expect to see based on these predictors. Then, we see how close the actual EF growth score (for that set of conditions) is to the value we predicted it would be; if our predicted score is close to the actual score, we say we have created a good model that can explain what determines the level of EF growth we actually see.
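Here is a minimal sketch of that prediction view, assuming Python with statsmodels and simulated data with hypothetical names: generate the EF-growth scores the model expects and see how far they sit from the scores actually observed.

```python
# A sketch of the "prediction" view: generate the EF-growth score the model
# expects from a set of predictors, then compare it with what we actually observe.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(13)
n = 300
df = pd.DataFrame({"dbt": rng.integers(0, 2, n),
                   "econ_distress": rng.normal(0, 1, n)})
df["ef_growth"] = 0.3 * df["dbt"] - 0.2 * df["econ_distress"] + rng.normal(0, 1, n)

model = smf.ols("ef_growth ~ dbt + econ_distress", data=df).fit()
predicted = model.fittedvalues            # the model's expected EF growth, case by case
errors = df["ef_growth"] - predicted      # how far off the model is for each case
print(f"R-squared = {model.rsquared:.2f}; mean absolute error = {errors.abs().mean():.2f}")
```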

In other words, not only can we build and test models, we can build two different models and see which model performs (predicts) better. This is an idea that not only permeates (more advanced) statistics, but guides much of experimental design. One way to design a study is to create two groups—an experimental and a control—and test whether having something (the treatment given to the experimental group) is better than nothing. One way of testing significance is to ask whether we can argue we have something (can reject the null) or whether we can’t argue that we have something (cannot reject the null).²

We can still test if a predictor itself is significant while we think in terms of the model: Is the overall model more significant when that predictor is added to it? The math going on under the hood is different when we think in terms of changes in the overall model, but the end result is the same whether we test the significance of that one term (as we do in an ANOVA) or whether we test the change in significance of the overall model (as we do in this new method). (Given certain assumptions & arrangements) we get the same value for a relationship between two variables when we compute a correlation coefficient as we do when we compute a β-weight in a simple linear regression.

Given certain assumptions & arrangements, predictors that we find are significant when testing them as terms alone will also be significant when we test them in terms of changes to model fit. But looking at analyses in terms of changes to model fit gives us more flexibility and precision. Yes, this means we also gain yet another layer of things to learn, but I am hopeful that we can learn how to compute analyses this new way, and thus have you all gain a more powerful and flexible tool to use in your burgeoning careers.
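A sketch of that change-in-model-fit logic, assuming Python with statsmodels and simulated, hypothetical data: fit a model with economic distress alone, fit it again with DBT added, and test whether the added predictor significantly improves the fit.

```python
# A sketch of comparing two nested models: does adding DBT participation
# significantly improve a model that already contains economic distress?
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(21)
n = 300
df = pd.DataFrame({"dbt": rng.integers(0, 2, n),
                   "econ_distress": rng.normal(0, 1, n)})
df["ef_growth"] = 0.3 * df["dbt"] - 0.2 * df["econ_distress"] + rng.normal(0, 1, n)

reduced = smf.ols("ef_growth ~ econ_distress", data=df).fit()      # without DBT
full = smf.ols("ef_growth ~ econ_distress + dbt", data=df).fit()   # with DBT added

print(anova_lm(reduced, full))   # F test of the change in model fit
# A significant F for this comparison means DBT adds significantly new
# information beyond what economic distress already provides.
```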


It may or may not be worth noting that everything I’ve discussed in this chapter is situated within the traditional statistical “Zeitgeist” of using models to simulate / test theories. Since the advent of “big data,” however, there has been a nearly countervailing strategy of not worrying (much, if at all) about modeling relationships between variables / constructs. Instead, the focus in this other wheelhouse is making good predictions from one set of data to other sets. This other tradition often worries so little about why its analyses work that its practitioners sometimes can’t even tell why they do. This underlies, e.g., the random forest and machine learning strategies that have taken firm root in a growing number of industries. And who knows? Maybe in ten years I’ll be including these in a course like this. But for now, I’ll only give you a few links to read more about this if somehow you’re superhuman enough to have the time and temerity to read them:

1.4 Additional Resources & Topics

1.4.1 Clinical Relevance

I believe the difference between statistical significance and clinical relevance should be self-evident to any competent researcher.³ However, the over-reliance on p-values—and lazier critical thinking in general—has led not only to misinterpretations of study outcomes but also to conflating a significant effect with clear clinical relevance.

Resources about the distinction between statistical significance and clinical relevance:

The Jacobson-Truax Approach

One strategy to determine clinical relevance is the Jacobson-Truax Approach. This approach seems to me hauntingly similar to simply indicating effect sizes, but it does have a following and can be helpful.

More about it is at:


  1. That last possibility of “sucking up” the information can be investigated by seeing if the third variable is acting as a mediator or moderator.↩︎

  2. Remember, though, that a more powerful and sophisticated test is not whether we have something that’s better than nothing, but whether what we have is better than an alternative: which drug has fewer side effects, which type of prevention is most effective, which man is a better choice for husband.↩︎

  3. Confidence intervals, effect size (Chapter 2), and generalizability are some strategies to help guide decisions about clinical relevance.↩︎