Lesson 23: Regression with a Categorical Predictor (Two-group case)
December 2, 2025
Review:
- Exam 3
- Final Exam: Tue, Dec 16, 1:00-3:20p in HSB 111
Presentation:
- A/B Testing = comparing groups with simple regression
- We can use a “Dummy Variable” to compare 2 groups
- Each group is given a value of 0 or 1 so the category can be used numerically
- Demonstration:
- Let’s say you want to run an experiment on a website:
- Version A (control): old landing page
- Version B (treatment): new landing page
You run each version for 3 days and record the number of sign-ups per 100 visitors.
| Day | Version | Sign-ups per 100 visitors (Y) |
|---|---|---|
| 1 | A | 5 |
| 2 | A | 6 |
| 3 | A | 4 |
| 4 | B | 8 |
| 5 | B | 9 |
| 6 | B | 7 |
To use regression, we first recode the category (Version A or B) as a binary number: X = 0 for Version A and X = 1 for Version B.
This allows us to compare A vs. B using simple regression with a dummy variable.
| Day | Version | X | Y |
|---|---|---|---|
| 1 | A | 0 | 5 |
| 2 | A | 0 | 6 |
| 3 | A | 0 | 4 |
| 4 | B | 1 | 8 |
| 5 | B | 1 | 9 |
| 6 | B | 1 | 7 |
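The recoding step can be sketched in a few lines of Python. This is only an illustration of the table above; the variable names (`versions`, `signups`, `x`, `y`) are made up for this example.

```python
# Data from the table above: landing-page version and sign-ups per 100 visitors.
versions = ["A", "A", "A", "B", "B", "B"]
signups = [5, 6, 4, 8, 9, 7]

# Dummy coding: control (Version A) -> 0, treatment (Version B) -> 1.
x = [0 if v == "A" else 1 for v in versions]
y = signups

for day, (v, xi, yi) in enumerate(zip(versions, x, y), start=1):
    print(day, v, xi, yi)
```

Which group gets the 0 is a choice; here the control group is coded 0 so it acts as the baseline.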
Now we can use our Sum of Squares approach to creating a regression equation.
| Day | X | Y | X^2 | Y^2 | XY |
|---|---|---|---|---|---|
| 1 | 0 | 5 | 0 | 25 | 0 |
| 2 | 0 | 6 | 0 | 36 | 0 |
| 3 | 0 | 4 | 0 | 16 | 0 |
| 4 | 1 | 8 | 1 | 64 | 8 |
| 5 | 1 | 9 | 1 | 81 | 9 |
| 6 | 1 | 7 | 1 | 49 | 7 |
| Σ | 3 | 39 | 3 | 271 | 24 |
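With our column sums (n = 6) we can calculate the Sums of Squares.
- SSxx = Σx^2 - (Σx)^2/n = 3 - 3^2/6 = 1.5
- SSyy = Σy^2 - (Σy)^2/n = 271 - 39^2/6 = 17.5
- SSxy = Σxy - (Σx)(Σy)/n = 24 - (3)(39)/6 = 4.5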
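As a check on the hand calculation, here is a minimal Python sketch that rebuilds the column sums and the three Sums of Squares using the computational formulas (for example, SSxx = Σx^2 - (Σx)^2/n). The variable names are chosen for this example only.

```python
# Dummy-coded data from the worked example.
x = [0, 0, 0, 1, 1, 1]
y = [5, 6, 4, 8, 9, 7]
n = len(x)

# Column sums (the Σ row of the table).
sum_x = sum(x)
sum_y = sum(y)
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Sums of Squares from the column sums.
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n

print(ss_xx, ss_yy, ss_xy)  # 1.5 17.5 4.5
```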
Then we can calculate the regression equation.
- b1 = SSxy/SSxx = 4.5/1.5 = 3
- b0 = Σy/n – b1*(Σx/n) = 39/6 – 3*(3/6) = 6.5 – 1.5 = 5
- y-hat = b1*x + b0 = 3*x + 5
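The slope and intercept can be reproduced with a short Python sketch. It starts from the Sums of Squares and column sums computed by hand above; `predict` is a hypothetical helper name used only for this illustration.

```python
# Values from the worked example.
n = 6
sum_x, sum_y = 3, 39
ss_xx, ss_xy = 1.5, 4.5

# Slope and intercept of the least-squares line.
b1 = ss_xy / ss_xx
b0 = sum_y / n - b1 * (sum_x / n)

def predict(x_value):
    """Return y-hat for a given dummy value (0 or 1)."""
    return b1 * x_value + b0

print(b1, b0)  # 3.0 5.0
```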
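To interpret, we calculate y-hat for both x=0 and x=1.
- For x=0, y-hat = 5
- For x=1, y-hat = 8
- Our predicted conversion rate for Version A = 5 sign-ups per 100 visitors
- Our predicted conversion rate for Version B = 8 sign-ups per 100 visitors
- The slope b1 = 3 is the difference between the groups: Version B gets 3 more sign-ups per 100 visitors than Version A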
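With a 0/1 dummy variable, the two predictions line up with the two group means. The sketch below checks this for the example data, assuming the fitted values b0 = 5 and b1 = 3 from the hand calculation; the names `group_a` and `group_b` are illustrative.

```python
from statistics import mean

# Dummy-coded data from the worked example.
x = [0, 0, 0, 1, 1, 1]
y = [5, 6, 4, 8, 9, 7]

# Coefficients from the hand calculation.
b0, b1 = 5, 3

# Split the outcomes by group using the dummy variable.
group_a = [yi for xi, yi in zip(x, y) if xi == 0]
group_b = [yi for xi, yi in zip(x, y) if xi == 1]

mean_a = mean(group_a)
mean_b = mean(group_b)

# The intercept matches the Version A mean,
# and the slope matches the difference between the two means.
print(mean_a == b0)
print(mean_b - mean_a == b1)
```

So in the two-group case the intercept is the baseline group's mean and the slope is the gap between the groups.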
We can also calculate r and r^2 to understand how much of the variance is explained by Version A vs B.
- r = SSxy/Sqrt(SSxx*SSyy) = 4.5/Sqrt(1.5*17.5) = 0.878
- r^2 = (0.878)^2 = 0.771
- In other words, the version (A or B) explains about 77% of the variation in sign-ups.
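The correlation calculation can be checked the same way. This sketch plugs the Sums of Squares from the worked example into r = SSxy/Sqrt(SSxx*SSyy); the variable names are illustrative.

```python
import math

# Sums of Squares from the worked example.
ss_xx, ss_yy, ss_xy = 1.5, 17.5, 4.5

# Correlation coefficient and coefficient of determination.
r = ss_xy / math.sqrt(ss_xx * ss_yy)
r_squared = r ** 2

print(round(r, 3))          # 0.878
print(round(r_squared, 3))  # 0.771
```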
Why not just calculate the two means?
- Comparing two means only works for a simple two-group comparison.
- Regression handles:
- more than two groups (multiple dummies → ANOVA)
- continuous predictors (e.g., time on website)
- multiple predictors at once (multiple regression)
- In industry, A/B testing software typically relies on this same kind of model.
Activity:
The same company wants to test two email subject lines:
- Version A (control)
- Version B (new)
They record the number of clicks per 50 emails sent for each version across several small batches.
| Batch | Version | Clicks per 50 emails (Y) |
|---|---|---|
| 1 | A | 6 |
| 2 | A | 5 |
| 3 | A | 7 |
| 4 | A | 4 |
| 5 | B | 8 |
| 6 | B | 10 |
| 7 | B | 9 |
| 8 | B | 7 |
- Use a dummy variable for A/B, converting to 0 and 1 for X
- Then calculate the sum of squares, regression equation and r^2
- Which version works better?
Assignment: