Estimating Email Lift with Propensity Scores: Starbucks CRM Walkthrough
A practical PSM + Tobit pipeline
marketing analytics
causal inference
propensity score matching
Tobit
CRM data
R (Programming Language)
Author
A. Srikanth
Published
November 7, 2025
Data Byte
Context
Here is the question to answer: did the Starbucks email really cause higher spend, or were emails sent to customers who were already more active?
The goal is causal lift, not a proof that emailed customers spend more on average. A raw mean comparison mixes selection with effect because emailed customers often have higher prior revenue, more orders, and shorter recency. The fix is to approximate a randomized comparison by conditioning on pre-treatment variables. Propensity score matching pairs treated and control customers with similar assignment probabilities, which reduces confounding under a selection-on-observables view.
Many rows have zero spend in the analysis window, so the outcome is left-censored. A Tobit model respects that censoring and recovers expected observed spend from a latent linear model, which produces a currency-level estimate that business teams can use.
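For reference, in a standard Type I Tobit with left-censoring at zero, if \(x_i'\gamma\) denotes the latent linear predictor and \(\sigma\) the scale parameter, the expected observed spend is

\[
E[y_i \mid x_i] = \Phi\!\left(\frac{x_i'\gamma}{\sigma}\right) x_i'\gamma + \sigma\,\phi\!\left(\frac{x_i'\gamma}{\sigma}\right),
\]

where \(\Phi\) and \(\phi\) are the standard normal CDF and density. This is the conversion used later in the walkthrough to turn model output into currency-level lift.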
The analysis targets the effect on the treated. It relies on standard causal assumptions: stable unit treatment value, overlap so both treatments are plausible for observed covariates, and conditional ignorability once pre-period drivers are controlled.
\(X\): vector of pre-period covariates (revenue, orders, orders\(^2\), recency)
\(e(x)\): propensity score, the probability of treatment given \(X\)
\(\beta_0, \beta\): intercept and coefficient vector in the assignment model
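With this notation, the assignment model fit in the first design step below is a logistic regression of the treatment flag on the pre-period covariates, so the propensity score is

\[
e(x) = \Pr(\text{email} = 1 \mid X = x) = \frac{1}{1 + \exp\!\big(-(\beta_0 + x'\beta)\big)}.
\]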
Objectives
The first objective is causal identification. The target estimand is the effect on the treated, defined as the average change in spend for those who received the email relative to a comparable no-email state. The second objective is interpretability. Model outputs are translated into expected spend in currency units so effect sizes are directly usable by commercial stakeholders. The third objective is diagnostic transparency. Balance is evaluated before and after matching, common support is checked, and design limitations are acknowledged where overlap or balance is weak. The fourth objective is operational usefulness. Outputs are prepared for downstream slicing and reporting without repeated model runs, and the workflow provides a clear bridge to experimental validation when decision risk warrants a test.
Data Sources
What are we looking at here, you may ask? The dataset is a 27k-row fictitious CRM export. Treatment is a binary flag for exposure to a Starbucks promotional email. The outcome is current-period purchase amount (purchase_amt), left-censored at zero, so many customers show no spend in the window.
Covariates for selection are strictly pre-period: we have prior revenue, number of orders (number_of_orders), squared orders to capture curvature, and recency in days (recency_days). Additional context fields support real-world slicing, including customer ID (customer), region (region), loyalty tier (loyalty_tier), mobile-order share (mobile_order_pct), and app usage (binary coding under the app_user column). The analysis window is aligned across treatment and control, and variables that could be influenced by the email in the same period are excluded.
Schema and type checks ensure that numeric variables are properly typed and the outcome is expressed in consistent units. A row identifier is attached to support traceability. A pre-match summary by treatment status documents the selection patterns that motivate adjustment. Missingness is handled conservatively: rows are removed only when required for logistic estimation, matching, or Tobit fitting. Feature engineering is limited to pre-period constructs, including the squared orders term. Outlier handling is minimal; design quality depends more on overlap than on occasional high spend values. When propensity distributions concentrate near zero or one, tails are trimmed prior to matching to enforce common support and reduce extrapolation.
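As a concrete illustration, a minimal prep sketch along these lines might look as follows. It assumes the export is already loaded as a data frame named crm (a hypothetical name); psmatch is the name used in the modeling code further down.

```r
library(dplyr)

# Pre-period feature engineering and a row identifier for traceability
psmatch <- crm %>%
  mutate(
    row_id            = row_number(),          # traceability
    number_of_orders2 = number_of_orders^2     # pre-period curvature term
  )

# Pre-match summary by treatment status: documents the selection pattern
psmatch %>%
  group_by(email) %>%
  summarise(
    n            = n(),
    mean_revenue = mean(revenue, na.rm = TRUE),
    mean_orders  = mean(number_of_orders, na.rm = TRUE),
    mean_recency = mean(recency_days, na.rm = TRUE),
    mean_spend   = mean(purchase_amt, na.rm = TRUE)
  )
```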
We target the effect on the treated. Among customers who received the email, we ask: how much did the email raise spend in the analysis window?
The design has four key parts.
First, estimate a propensity model that predicts email receipt from pre-period behavior. The fitted probability is the score we match on.
```r
ps_mod <- glm(
  email ~ revenue + number_of_orders + number_of_orders2 + recency_days,
  family = binomial(),
  data   = psmatch
)

# Caliper rule of thumb (Austin, 2011): 0.2 * SD of the propensity score
p_hat       <- predict(ps_mod, type = "response")
caliper_val <- 0.2 * sd(p_hat)
caliper_val
```

[1] 0.04105843
A quick note on what's going on above: the p_hat <- predict(...) and caliper_val <- 0.2 * sd(p_hat) lines compute the fitted propensity scores and the caliper width we'll use for nearest-neighbor matching (spoiler alert: that's coming up next!).
Second, build matches using nearest-neighbor pairing with a caliper on the propensity score. A practical, defensible choice is a caliper equal to 0.2 times the standard deviation of the fitted propensity, which is exactly the caliper_val computed above. This keeps pairs comparable without collapsing the sample.
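A minimal matching sketch, assuming the MatchIt package and the caliper_val computed above (MatchIt is one defensible implementation choice, not something the original pipeline mandates):

```r
library(MatchIt)

m_out <- matchit(
  email ~ revenue + number_of_orders + number_of_orders2 + recency_days,
  data        = psmatch,
  method      = "nearest",        # 1:1 nearest-neighbor pairing
  distance    = "glm",            # logistic propensity model, as above
  caliper     = caliper_val,      # 0.2 * SD of the fitted propensity
  std.caliper = FALSE             # caliper_val is already on the propensity scale
)

matched <- match.data(m_out)      # matched sample for the Tobit step
summary(m_out)                    # matching and balance summary
```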
Third, check covariate balance with standardized mean differences. Aim to keep absolute values under 0.1 on the covariates that drive targeting. Also scan variance ratios and verify that treated and control propensities overlap after matching.
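One way to run these checks, assuming the cobalt package (again an implementation choice):

```r
library(cobalt)

# Standardized mean differences and variance ratios, flagged against the
# 0.1 SMD target and a variance-ratio threshold of 2
bal.tab(m_out, stats = c("mean.diffs", "variance.ratios"),
        thresholds = c(m = 0.1, v = 2))

# Overlap of the propensity score ("distance") in treated vs. control,
# before and after matching
bal.plot(m_out, var.name = "distance", which = "both")
```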
Fourth, fit a Tobit model on the matched sample because the outcome is left-censored at zero. The Tobit gives a latent linear predictor and a scale parameter, which we can convert into expected observed spend so the contrast reads in dollars. As a benchmark, we can repeat the same calculation on the unmatched data to show how large the naive estimate would have been. For extra rigor, one could add a simple sensitivity analysis for unobserved confounding and report how strong a hidden factor would need to be to flip the sign.
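A sketch of this step, assuming the AER package's tobit() wrapper and the matched data frame from the matching sketch above:

```r
library(AER)

# Tobit on the matched sample; outcome left-censored at zero
tobit_mod <- tobit(
  purchase_amt ~ email + revenue + number_of_orders + number_of_orders2 + recency_days,
  left = 0,
  data = matched
)

# Convert the latent linear predictor into expected observed spend:
# E[y | x] = pnorm(xb / sigma) * xb + sigma * dnorm(xb / sigma)
sigma <- tobit_mod$scale
xb    <- predict(tobit_mod, type = "lp")   # latent linear predictor
e_y   <- pnorm(xb / sigma) * xb + sigma * dnorm(xb / sigma)

# Lift per treated customer, in currency units
tapply(e_y, matched$email, mean)

# For the naive benchmark, refit on the unmatched psmatch data and repeat the contrast.
```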
Work through the outputs in a fixed order. The propensity model should show that prior revenue, order counts, and recency predict targeting. That is expected and confirms that selection pressure is real. The matching summary will report how many treated customers were retained. A modest drop is fine if the balance improves. The balance table should show standardized mean differences pulling toward zero across all matching covariates. If one covariate remains stubborn, tighten the caliper, allow replacement, or add an exact match on a coarse tier or recency band.
Next, look at the Tobit results on the matched data. The email coefficient should be positive and statistically meaningful. More importantly, compute the expected observed spend for each customer and average by group. The difference between email and no-email in the matched sample is the lift per treated customer in currency units. The same difference in the unmatched data will usually be larger. Use that gap to explain selection bias to stakeholders. Translate the per customer lift to revenue and profit with your margin and campaign costs. If needed, report confidence intervals or bootstrap intervals for the expected value contrast. Finally, stress test the result by repeating the match with a tighter caliper, matching with replacement, or trimming extreme propensities. Stable conclusions under these small design changes earn more trust.
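Two of those stress tests, sketched with the same MatchIt call as before (the tighter caliper and the trimming cutoffs are illustrative choices, not prescriptions):

```r
# Variant 1: tighter caliper and matching with replacement
m_tight <- matchit(
  email ~ revenue + number_of_orders + number_of_orders2 + recency_days,
  data        = psmatch,
  method      = "nearest",
  distance    = "glm",
  caliper     = 0.5 * caliper_val,   # half the rule-of-thumb width
  std.caliper = FALSE,
  replace     = TRUE
)

# Variant 2: trim extreme propensities before matching to enforce common support
# (assumes p_hat is aligned with psmatch rows, i.e. no covariates were dropped as missing)
psmatch_trim <- psmatch[p_hat > 0.05 & p_hat < 0.95, ]
```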
The matched analysis gives a practical answer to a practical question. The email increased expected spend for comparable customers by a measurable and defensible margin. The unmatched contrast is larger, which is consistent with selection inflating naive estimates. The lift can now be turned into profit and ROMI with a clear formula. That is the number a marketer can plan against.
The next steps are concrete. We can use simple interactions to surface segments with stronger lift, such as loyalty tier, region, or recency bands. Rank future sends by predicted incremental gain, not by likelihood to buy. If the decision has budget or reputational risk, validate the magnitude with a geo holdout or a time-based split. Monitor overlap over time and trim propensities if targeting gets sharper and the tails get heavier. Package the outputs so BI can slice by store, week, and segment without calling the model again. Keep a short design note with assumptions, diagnostics, and links to the artifacts so the work is easy to audit later. This is how measurement moves from a one-off analysis to a decision tool we’ll want to continue coming back to.
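Returning to the first of those next steps, a simple way to surface segment-level lift is an interaction term in the matched-sample Tobit. Here loyalty_tier is illustrative; region or recency bands work the same way.

```r
# Heterogeneous-lift sketch: the email effect is allowed to vary by loyalty tier
tobit_seg <- tobit(
  purchase_amt ~ email * loyalty_tier + revenue + number_of_orders +
    number_of_orders2 + recency_days,
  left = 0,
  data = matched
)
summary(tobit_seg)
```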