From Score to Probability: Making Sense of a One-Predictor LPM
An educational explainer using simulated PHQ-9 data
mental health
statistics
PHQ-9
linear probability model
thresholds
calibration
visualization
R (Programming Language)
Author
A. Srikanth
Published
September 12, 2025
Data Byte
Context
We’re modeling how a single Patient Health Questionnaire-9 (PHQ-9) score relates to the probability of a depression diagnosis. To keep the story clean, I fit a Linear Probability Model (LPM), ordinary least squares applied to a 0/1 outcome, to simulated data. This keeps the focus on method rather than measurement.
Objectives
The idea is simple: take one PHQ-9 score and see how a straight line can turn it into a probability of diagnosis. Along the way, I want to show you what the slope really means, how thresholds shape decisions, and why you’ll probably want to move to logistic regression once you’ve seen the basics.
Data Sources
To protect privacy, the data shown here are completely simulated but designed to look like a realistic PHQ-9 distribution paired with diagnosis labels. This lets us focus on the modeling without exposing sensitive health information. In a real study, you wouldn’t stop at a single questionnaire score. You’d bring in actual clinical records, add relevant covariates such as demographics or comorbidities, and carefully check whether the diagnosed and non-diagnosed groups are balanced. Class imbalance is common in medical data, and if it’s ignored, the model can become biased toward predicting the majority class. Addressing that might involve weighting, resampling, or comparing multiple models to see which handles imbalance best.
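As a sketch of what such a simulation could look like in R (the sample size, score distribution, and coefficients below are illustrative assumptions, not the exact values behind the figures):

```r
# Illustrative simulation of PHQ-9-like data; all values here are assumptions.
set.seed(2025)
n <- 1000

# PHQ-9 totals are integers from 0 to 27; a binomial draw gives a simple
# bounded count distribution for illustration.
phq9 <- rbinom(n, size = 27, prob = 0.25)

# Diagnosis probability rises with the score (logistic data-generating process).
p_true <- plogis(-4 + 0.3 * phq9)
dx <- rbinom(n, size = 1, prob = p_true)

sim <- data.frame(phq9 = phq9, dx = dx)
table(sim$dx)  # check how imbalanced the two classes are
```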
Data Preparation
The PHQ-9 score is treated as a plain number. The outcome is coded as 1 if there’s a diagnosis and 0 if not. I split the data into training and test sets so I’m not just grading my own homework when checking performance. There’s no fancy feature engineering here—the point is to keep a single predictor and make the model’s interpretation crystal clear.
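A minimal version of that split, assuming the simulated `sim` data frame from above and an arbitrary 70/30 partition:

```r
# Hold out a test set so performance isn't graded on the training data.
set.seed(42)
train_idx <- sample(seq_len(nrow(sim)), size = floor(0.7 * nrow(sim)))
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]
```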
Methodology
The model is a Linear Probability Model (LPM), which applies ordinary least squares to a binary outcome. With \(Y \in \{0,1\}\) representing diagnosis status and \(X\) the PHQ-9 score, the model assumes a linear conditional mean:
\[
\mathbb{E}[Y \mid X] = \beta_0 + \beta_1 X
\]
For a person with score \(x\), the predicted probability is
\[
\hat p(x) = \hat\beta_0 + \hat\beta_1 x
\]
The key feature is that the slope (\(\hat\beta_1\)) is constant: every one-point increase in PHQ-9 changes the predicted probability by exactly \(\hat\beta_1\). For example, if \(\hat\beta_1 = 0.02\), that translates to a two-percentage-point increase at all PHQ-9 values.
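Fitting the LPM is plain `lm()` with the 0/1 outcome on the left-hand side; a sketch using the hypothetical `train` split from above:

```r
# Ordinary least squares on the binary outcome: this is the LPM.
lpm <- lm(dx ~ phq9, data = train)
coef(lpm)
# coef(lpm)["phq9"] is the constant marginal effect: a value of 0.02 would mean
# each additional PHQ-9 point adds two percentage points to the predicted probability.
```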
Two caveats are important. First, linear predictions are not automatically bounded between \(0\) and \(1\), so plots here are visually clamped to that range without implying statistical correction. Second, the error variance depends on the mean, which makes the errors heteroskedastic:
\[
\mathrm{Var}(Y \mid X) = p\,(1 - p)
\]
Because of this, heteroskedasticity-robust standard errors are required for inference. Logistic regression is the stronger default in applied work, but the LPM is retained here because its line and slope make the relationship easy to explain.
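One common way to get those robust standard errors in R, assuming the `sandwich` and `lmtest` packages and the `lpm` fit sketched above:

```r
library(sandwich)  # heteroskedasticity-consistent covariance estimators
library(lmtest)    # coeftest() for coefficient tables with a custom vcov

# HC1-robust standard errors for the LPM coefficients.
coeftest(lpm, vcov. = vcovHC(lpm, type = "HC1"))
```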
Analysis
The fitted line makes the relationship legible: higher PHQ-9 scores align with higher predicted probabilities. Because the slope \(\hat{\beta}_1\) is a constant marginal effect, each one-point increase in PHQ-9 changes the predicted probability by exactly \(\hat{\beta}_1\) across the scale. Figures are visually clamped to \([0,1]\) for readability, which can create a plateau where the unconstrained LPM would exceed \(1\); the flat segment at PHQ-9 scores above 24 is a display artifact, not a change in \(\hat{\beta}_1\).
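A sketch of how that display-only clamp can be produced in base R (the plotting details are illustrative, not the code behind the published figure):

```r
# Predicted probabilities across the PHQ-9 range, clamped to [0, 1] for display only.
grid <- data.frame(phq9 = 0:27)
grid$p_hat  <- predict(lpm, newdata = grid)
grid$p_plot <- pmin(pmax(grid$p_hat, 0), 1)  # visual clamp, not a statistical correction

plot(grid$phq9, grid$p_plot, type = "l",
     xlab = "PHQ-9 score", ylab = "Predicted probability of diagnosis",
     ylim = c(0, 1))
points(train$phq9, train$dx, pch = 16,
       col = adjustcolor("grey30", alpha.f = 0.2))  # observed 0/1 outcomes
```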
A second view looks at the distribution of predicted probabilities by true status (Dx vs No Dx). When the Dx distribution shifts right and the overlap shrinks, the model is ranking people well (good discrimination). Turning scores into decisions requires a threshold \(\tau\):
Lower \(\tau\) catches more true cases but increases false alarms; higher \(\tau\) does the opposite. Reported performance should reflect that trade-off (e.g., precision, recall, specificity) at the chosen \(\tau\).
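A minimal sketch of that decision step on the held-out data, where \(\tau = 0.5\) is only a placeholder and should be replaced by a cost-informed choice:

```r
# Convert predicted probabilities into decisions at threshold tau.
tau    <- 0.5
p_test <- predict(lpm, newdata = test)
pred   <- as.integer(p_test >= tau)

# Confusion-matrix counts and the headline trade-off metrics.
tp <- sum(pred == 1 & test$dx == 1)
fp <- sum(pred == 1 & test$dx == 0)
fn <- sum(pred == 0 & test$dx == 1)
tn <- sum(pred == 0 & test$dx == 0)

c(precision   = tp / (tp + fp),
  recall      = tp / (tp + fn),
  specificity = tn / (tn + fp))
```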
Finally, we can check calibration: are the predicted probabilities numerically accurate? In other words, among people with predictions near a given value \(q\), roughly a proportion \(q\) should truly have the outcome. A quick diagnostic compares the average \(\hat{p}\) to the average outcome within bins; large departures signal the need for a better link (logistic) or for post-hoc calibration, such as Platt scaling (not explained here), which adjusts probabilities after training so they better match observed frequencies.
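One way to run that binned check in base R, reusing `p_test` from the thresholding sketch (ten equal-width bins is an arbitrary choice):

```r
# Compare mean predicted probability with the observed event rate within bins.
bins  <- cut(p_test, breaks = 10)
calib <- aggregate(cbind(mean_p_hat = p_test, obs_rate = test$dx),
                   by = list(bin = bins), FUN = mean)
calib  # large gaps between the two columns indicate miscalibration
```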
With simulated data, the fitted line rises monotonically with PHQ-9. Dx and No-Dx probability distributions partly separate but overlap in the middle, where most errors occur. The operating threshold should be selected to reflect the relative costs of false negatives and false positives.
For applied work in this area, we can add clinically relevant predictors, address class imbalance, and compare the LPM with logistic regression using cross-validation. Setting thresholds with an explicit cost or utility framework tied to the clinical workflow aligns model decisions with clinical priorities and makes the trade-off between catching more true cases and avoiding false alarms explicit.
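A sketch of that comparison step, where a single train/test split stands in for full cross-validation and the Brier score is one of several reasonable criteria:

```r
# Logistic regression on the same predictor, for comparison with the LPM.
logit   <- glm(dx ~ phq9, data = train, family = binomial())
p_logit <- predict(logit, newdata = test, type = "response")

# Brier score (mean squared error of the predicted probabilities); lower is better.
c(brier_lpm   = mean((p_test  - test$dx)^2),
  brier_logit = mean((p_logit - test$dx)^2))
```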
Validate on a held-out site or period to test transportability. Quantify uncertainty with confidence or bootstrap intervals. Check subgroup performance for fairness issues. Prefer logistic regression when calibrated probabilities are required, or apply post hoc calibration to the scores. Document modeling choices, validation results, and threshold rationale succinctly.