How does linear regression really work?
The math and intuition behind ordinary least squares (OLS)
Do you know how linear regression measures effects while “holding everything else constant”? Or how it minimizes the sum of squared error? In this post we’ll discuss how, and much more.
We will leverage both matrix algebra and python to understand what’s going on. Where possible, we will go deep. However, due to the complexity of the subject, we will also link external resources with more information.
The primary source of this post is a set of lecture notes from a Stanford statistics class.
Without further ado, let’s dive in…
Technical TLDR
Ordinary Least Squares (OLS) linear regression fits data by minimizing the square of the residuals. To do this, we…
- Formulate the residuals with respect to our linear regression equation.
- Take the derivative of the sum of the squared residuals.
- Solve for beta coefficient values where the above derivative is zero.
By squaring errors, our error curve becomes an upward-facing parabola. So if we solve for the optimum, we are guaranteed to be solving for the global minimum.
1 — But what’s actually going on?
Ok, we’re going to slow down a bit and walk through some demos to really understand what’s going on under the hood.
1.1 — The Purpose of Linear Regression
In this post we will be looking to model the weight of baby pandas (photo included for scientific purposes).
Some examples of our independent variables include age (numeric), amount of bamboo eaten (numeric), and whether they had siblings (boolean). We will also add a constant value of 1 to calculate the y-intercept.
Before even getting started, to model weight ~ predictors we must make some basic assumptions, such as linearity, normality, and homoscedasticity. Here’s a really great in-depth resource on the assumptions of OLS, but for brevity we’re not going to cover these assumptions.
1.2 — The Mathematical Goal
Now there are many ways to fit the relationship between our X’s and Y. Ordinary Least Squares (OLS) is the simplest and most common method. It looks to minimize the sum of the squared residuals.
A residual is the distance between our linear regression line and a data point, as shown by the red lines in figure 1.
Squaring residuals is a common method for measuring model fit because it’s simple, easy to work with mathematically, and is guaranteed to be positive. There may be certain scenarios where we want to treat overestimates (positive residuals) differently from underestimates (negative residuals), but that’s a whole topic in itself.
Anyway, the goal of OLS is to find the slope of a line that minimizes the residuals.
In our case we have multiple predictors, so we’ll be fitting a high-dimensional hyperplane, as shown in figure 2. It’s hard to visualize anything beyond three dimensions, but hopefully this example with 3 axes (independent variables) can be conceptually extrapolated further.
Great! With that setup, let’s get into the math.
2 — The Matrix Algebra of OLS
We’re going to assume no prior knowledge of matrix algebra. Furthermore, because many readers are probably familiar with code, we will leverage python to understand matrix notation.
Before getting started, here’s all the source code for this post. Feel free to use that if needed.
Step 0: Set up our Data
We’re going to create the independent variables that we mentioned above. To ensure there’s some level of linearity, we will use those to create our dependent variable, baby_panda_weight.
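The data setup can be sketched in a few lines. This is a minimal simulation with made-up coefficients and value ranges (the variable names mirror the post; nothing here comes from real panda data):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

# independent variables (simulated; units are arbitrary)
age = rng.uniform(0, 3, n)                       # years
amount_of_bamboo_eaten = rng.uniform(5, 40, n)   # kg
has_siblings = rng.integers(0, 2, n)             # 0 or 1
intercept = np.ones(n)                           # constant 1 for the y-intercept

# build the dependent variable with a known linear signal plus noise
noise = rng.normal(0, 2, n)
baby_panda_weight = 4 + 10 * age + 0.3 * amount_of_bamboo_eaten - 2 * has_siblings + noise

# stack everything into the design matrix X
X = np.column_stack([intercept, age, amount_of_bamboo_eaten, has_siblings])
```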
Next, let’s plot age against baby_panda_weight (figure 3) to see how we did.
Overall, there seems to be a relatively linear relationship in the plot.
Next let’s use mathematical notation to do the same thing (without actually creating real data)…
In figure 4, each row corresponds to a unique observation and each column corresponds to a variable. The matrix Y has one column corresponding to baby_panda_weight and the matrix X has 4 columns corresponding to our 4 independent variables.
A quick note before moving on. Pandas data frames in python use the same structure — columns are features and rows are unique observations. The default behavior of numpy arrays, however, is the inverse: each inner list becomes a row.
```python
import numpy as np
import pandas as pd

a, b = [1, 2, 3], [4, 5, 6]

numpy_arr = np.array([a, b])
pandas_df = pd.DataFrame(dict(a=a, b=b))

numpy_arr.shape   # (2, 3)
pandas_df.shape   # (3, 2)
```
When working in numpy, it’s often helpful to transpose data to return to a row x column shape.
Step 1: Define the Optimization Problem
Now that we have our baby panda data represented by matrices, let’s define OLS in simple matrix notation…
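Spelled out, the matrix equation in that notation takes the standard OLS form:

```latex
y = X\beta + \varepsilon
```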
Here, we have four terms. y and X are the same as above — they’re our dependent variable vector and our matrix of independent variables, respectively. However, we also added two new terms: beta (β) and epsilon (ε). Beta is a vector of linear coefficients for our independent variables — it shows how much a one-unit change in X moves Y. Epsilon is a vector of errors — it shows how wrong our predictions are.
Don’t worry about interpretation for now. We’ll revisit later.
Step 2: Formulate Residuals
Cool. Now that we have defined our terms, let’s start working to find the best set of betas.
As noted above, the betas need to minimize the sum of the squared residuals. But, how do we calculate a residual?
Let’s start with python then move to matrix notation…
```python
beta = 1.0

true_value = baby_panda_weight
fitted_value = age * beta
resid = true_value - fitted_value
```
As shown by the python code, a residual is just our true value minus the fitted value. Pretty simple.
And, as shown in figure 6, a residual can also be written as the OLS formula with the error term isolated: ε = y − Xβ. Xβ is our fitted value, the value that our model predicted, and y is our true value.
Step 3: Square the Residuals and Sum Them
Now that we’ve derived our residuals, let’s square and sum them. The purpose of squaring is to allow us to easily solve for the best value of our betas — we’ll get to that in section 3.
```python
sum_of_squares_python_style = np.sum(resid ** 2)
```
In the above python code, we first square our residuals then we sum them. That works perfectly, but linear algebra has another way of doing this.
Any time you multiply a vector by the transpose of itself, you’re calculating the sum of squares. Let’s see an example…
As shown in figure 7, vector multiplication requires that we transpose the first vector, converting it to a “row” instead of a “column.” Then, through fancy math shown on the right side of the equal sign, we get the sum of squares for that vector. This notation can be simplified to the value in figure 8.
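Here’s a quick numeric check of that identity: the dot product of a vector with itself equals the sum of its squared elements.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])

# v-transpose times v: each element multiplied by itself, then summed
vT_v = np.dot(v, v)         # 1*1 + 2*2 + 3*3 = 14.0
elementwise = np.sum(v ** 2)  # same result
```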
Step 4: Take the Derivative
Ok so far so good. We have a robust mathematical method for determining our sum of squares.
The next step is to formulate the derivative of our sum of squares term with respect to beta. We’ll see why this is relevant in the following step.
```python
from numpy import matmul

term_1 = matmul(-2 * x.T, y)
term_2 = matmul(matmul(2 * x.T, x), beta)
derivative = term_1 + term_2
```
The above python code shows the formula for our sum of squared residuals derivative. If you’re curious how we got this formulation, check out the matrix notation below…
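If you’re curious about the intermediate algebra, here’s a sketch of the standard derivation, consistent with the code above:

```latex
\begin{aligned}
\varepsilon^\top \varepsilon
  &= (y - X\beta)^\top (y - X\beta)
   = y^\top y - 2\beta^\top X^\top y + \beta^\top X^\top X \beta \\
\frac{\partial}{\partial \beta}\,\varepsilon^\top \varepsilon
  &= -2 X^\top y + 2 X^\top X \beta
\end{aligned}
```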
For practical purposes, we don’t need to know how to take derivatives, so we’re going to move on to the next section. But what you do need to know is what the derivative represents. A derivative is the instantaneous slope of a curve at a given point.
If you want more, here’s a good tutorial by one of the best math YouTubers out there.
Step 5: Simplify and Solve
For our final step, to find the value of beta that minimizes our sum of squared error, we’re going to set the above derivative to zero and solve for beta (figure 10).
The algebra steps were omitted for brevity, but the last line shows the formula for getting beta coefficients for any OLS model — it’s that simple.
Let’s walk through that calculation step by step…
```python
from numpy import matmul
from numpy.linalg import inv

a = matmul(x.T, x)
b = inv(a)
c = matmul(b, x.T)
best_betas = matmul(c, y)

# or, in one line
best_betas_one_line = matmul(matmul(inv(matmul(x.T, x)), x.T), y)
```
- a is the variance/covariance matrix (up to scaling). Variance is on the diagonal and covariance is on the off-diagonal.
- b is the inverse of that matrix — it essentially allows us to divide.
- c is our independent variables “divided” by the variance/covariance matrix.
- best_betas is c scaled to our y values.
The intuition behind this calculation is we are essentially taking the covariance of X and Y then dividing by the variance of X. We are isolating how much Y varies due to X.
And there you have it! Using the above python code, you can solve for any set of beta coefficients.
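As a sanity check, here is a small end-to-end sketch on simulated data (the true betas are arbitrary choices), comparing the normal-equations formula against numpy’s built-in least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# simulated data with known (arbitrary) true betas
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
true_betas = np.array([4.0, 10.0, -2.0])
y = X @ true_betas + rng.normal(0, 0.5, n)

# the closed-form solution: (X'X)^-1 X'y
best_betas = np.linalg.inv(X.T @ X) @ X.T @ y

# cross-check against numpy's built-in least-squares solver
lstsq_betas, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both approaches recover betas close to the true values; the noise keeps them from matching exactly.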
3 — The Intuition of OLS
Now that we understand the math, let’s go a bit deeper into why it works.
3.1 — Why do we Know the Sum of Squared Residuals is Convex?
In step 5 above, we solve for beta where the derivative is equal to zero. By squaring error, we guarantee a single global minimum, so wherever the derivative is zero we minimize error.
There are two ways to approach this without a proof.
First, by squaring our residuals, we produce a parabola. The functional form is y=a(x²), where a is a scaling parameter. By definition, parabolas only have a single optimum (maximum or minimum).

Downward-facing parabolas require that a in the above equation is negative. However, because we square all our residuals, a is positive and we’re guaranteed to have a global minimum for some value of beta.
Second, let’s do a quick example in code for single linear regression….
```python
import numpy as np

def ssq(x, y, beta):
    return np.sum((y - x * beta) ** 2)

x = np.array([4, 3, 2])
y = np.array([3, 2.2, 1])

betas = np.arange(-2, 2, 0.05)
errors = [ssq(x, y, b) for b in betas]
```
In the above code, we create two vectors then try lots of potential values for our regression coefficient, beta. If we plot our error against different values of beta (figure 11), as expected we get an upward-facing parabola.

Note that in this example the error gets close to zero but never actually reaches it; a perfect fit is not guaranteed. But, by leveraging the fact that OLS has a single optimum (in this case a minimum), we know that the location of our minimum is where the derivative of our SSQ is 0.
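To make that concrete, we can re-run the grid search and check that its best beta lands next to the closed-form, no-intercept OLS estimate Σxy/Σx², which is where the derivative is zero:

```python
import numpy as np

def ssq(x, y, beta):
    return np.sum((y - x * beta) ** 2)

x = np.array([4, 3, 2])
y = np.array([3, 2.2, 1])

betas = np.arange(-2, 2, 0.05)
errors = [ssq(x, y, b) for b in betas]

grid_best = betas[np.argmin(errors)]        # best beta on the 0.05-step grid
closed_form = np.dot(x, y) / np.dot(x, x)   # where the derivative equals 0
```

The two values agree to within the grid spacing, and the minimum error stays above zero.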
3.2 — How Should You Think About the Beta Coefficients?
Interpreting OLS coefficients is surprisingly difficult. Let’s break it down…
- The size of the coefficient for each independent variable gives you the size of the effect that variable is having on your dependent variable.
- The sign on the coefficient (positive or negative) gives you the direction of the effect.
However, there’s some fine print so let’s walk through coefficient interpretation for each coefficient type.
3.2.1 — Y-Intercept (β0)
The y-intercept coefficient (β0) is the mean value of Y, holding all other predictors at zero.
You can think of β0 as the baseline value. Note that it doesn’t always make sense for predictors to have a value of 0 — what if we included year as a feature? Well, we definitely don’t have baby panda data from the year 0, so it’s important to consider whether the y-intercept has interpretive meaning.
Example interpretation: a y-intercept coefficient of 400.0 means the average weight of a panda in our dataset is 400.0 when age, amount of bamboo eaten, and siblings are all 0.
3.2.2 — Continuous Independent Variables
A continuous variable coefficient is the expected change in the mean of Y for a one-unit change in X, holding everything else constant.
Coefficients of continuous variables represent how much our Y should change for a one-unit change in X. As noted above, the larger the magnitude of the coefficient, the more Y will change relative to X. If the sign is positive, Y will increase with X; conversely, if the sign is negative, Y will decrease with X.
Example interpretation: a coefficient of 10.0 on age indicates that each year a panda will, on average, grow by 10.0 units, holding all else constant.
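A tiny sketch of that interpretation, using hypothetical fitted betas (the 400.0 and 10.0 values echo this section’s examples and are not real estimates):

```python
import numpy as np

# hypothetical fitted betas: [y-intercept, age, bamboo_eaten, has_siblings]
betas = np.array([400.0, 10.0, 0.8, -4.0])

panda = np.array([1.0, 2.0, 30.0, 0.0])   # [constant, age, bamboo, siblings]
panda_next_year = panda.copy()
panda_next_year[1] += 1.0                 # only age changes, by one unit

# the change in the prediction equals the age coefficient
delta = panda_next_year @ betas - panda @ betas
```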
3.2.3 — Binary Independent Variables
A binary variable coefficient is the expected change in the mean of Y for a one-unit change in X, holding everything else constant.
Binary variables can be interpreted in the same way as continuous variables. They just have 2 values: 0 or 1. Also note that we often represent categorical variables through several binary variables using a method called one-hot encoding.
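For illustration, here’s how one-hot encoding looks with pandas; the habitat feature is hypothetical:

```python
import pandas as pd

# hypothetical categorical feature, purely for illustration
df = pd.DataFrame({"habitat": ["sichuan", "shaanxi", "sichuan", "gansu"]})

# drop_first avoids perfect multicollinearity with the intercept column
dummies = pd.get_dummies(df, columns=["habitat"], drop_first=True)
```

With 3 categories and drop_first=True, we get 2 binary columns; the dropped category becomes the baseline captured by the intercept.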
Example interpretation: a coefficient of -4.0 on has_siblings indicates that a baby panda with siblings will, on average, weigh 4 units less, holding all else constant.
3.2.4 — Interactions
Interactions between variables are a more complex topic that we will not cover, but there are great resources out there. Here’s a solid post.
3.3 — Why Does It Hold Everything Else Constant?
For our final section, let’s talk about the concept of “holding everything else constant.”
One of the key strengths of OLS is that it isolates the impact of a given variable by controlling for all other independent variables.
A great way to think about this is to use the concept of residuals. In effect, each predictor is regressed on all other predictors, only leaving the residuals as a predictor of our dependent variable. Let’s walk through an example…
Let’s say we’re looking to fit a model that determines the impact of age on baby_panda_weight, holding amount_of_bamboo_eaten constant. To do that, we can create the following model…
baby_panda_weight ~ age + amount_of_bamboo_eaten
However, we can also perform the below steps and get the same regression coefficient for age.
- Fit age ~ amount_of_bamboo_eaten and calculate the residuals. The residuals, let’s call them R_x, correspond to trends in age that could not be explained by amount_of_bamboo_eaten.
- Fit baby_panda_weight ~ amount_of_bamboo_eaten. The residuals, let’s call them R_y, correspond to trends in baby_panda_weight that could not be explained by amount_of_bamboo_eaten.
- Fit R_y ~ B3*R_x.
The beta coefficient for our residualized values (B3) will be equal to the beta coefficient for age (B1) in our complete model above. Here’s an example in python…
That’s how OLS holds stuff constant. Implicit in the math is a process that regresses all other predictors against the predictor of interest. It then looks to find the coefficient that fits the residuals from that regression process.
Pretty cool right?
By the way, that’s why multicollinearity is such a problem — if we regress a variable against another very similar variable, there won’t be much signal left in the residuals.
Finally, you may be wondering how this “residualizing” works. Well, so am I, so I wrote a question on Stack Exchange. If you can describe an intuitive explanation for how the linear algebra formulation “residualizes,” that would be amazing. I’ll also update this post accordingly.
4 — Summary
Ordinary Least Squares regression is the most common linear modeling technique. It’s fast, simple to implement, and easy to interpret. However, very few people actually understand what’s going on behind the scenes.
To find optimal beta coefficients, we formulate our error as a sum of squares (SSQ), thereby creating an upward-facing parabola. We then solve for the values of beta where the derivative of our SSQ is 0. Because the error curve has a single optimum, the beta coefficients at that point are guaranteed to minimize the squared error.
After building this model, we can interpret the coefficients as the change in our dependent variable for a one-unit change in our independent variable, holding everything else constant.
Thanks for reading! I’ll be writing 14 more posts that bring academic research to the DS industry. Check out my comment for links to the main source for this post and some useful resources.