Orthogonalizing Variables in a Regression
1. Motivation
Say we want to include two correlated variables as explanatory variables in a regression. Doing this causes multicollinearity, which does not bias the coefficient estimates but does inflate their standard errors. Note that you can check for excessive multicollinearity with variance inflation factors. To remove the correlation between the variables we can use a process referred to as orthogonalization in the finance literature. We describe this process in the example below.
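As a quick illustration of that check, here is a minimal sketch on simulated data (not the ETF data used below), assuming the car package is installed for its vif() function:
# Simulate two correlated regressors and inspect their variance inflation factors
library(car)  # provides vif()
set.seed(1)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.5)  # x2 is constructed to be correlated with x1
y <- 1 + 2 * x1 + 3 * x2 + rnorm(100)
vif(lm(y ~ x1 + x2))  # common rules of thumb flag values above roughly 5-10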
2. Example in R
I was going to do the example in Python, but the yfinance package was broken, so I will use R. R has the added benefit that running regressions and extracting residuals is much more straightforward than in the various Python statistical libraries.
Say we want to test whether crude oil and natural gas prices affect stock market prices, but we are concerned that crude oil and natural gas returns might be too correlated. We therefore want to extract the part of crude oil returns that is uncorrelated with natural gas returns. We could also just estimate the regression and use variance inflation factors to gauge the effect of multicollinearity, but we'll do that in another notebook.
We'll load the quantmod library to pull ETF prices:
library(quantmod)
Then pull daily ETF data for the stock market, crude oil, and natural gas:
market <- getSymbols("spy", auto.assign = FALSE)
oil <- getSymbols("uso", auto.assign = FALSE)
ng <- getSymbols("ung", auto.assign = FALSE)
Then convert the prices to returns and merge to get returns over a common time interval:
data <- merge.xts(Delt(Cl(oil)), Delt(Cl(ng)), Delt(Cl(market)), join = "inner")[-1, ]
names(data) <- c("oil", "ng", "market")
We can now estimate a correlation coefficient between daily crude oil and natural gas returns:
cor(data$oil, data$ng)
0.16688267808796
OK, that correlation is not too bad. We could estimate the regression without orthogonalizing (checking variance inflation factors instead), but we'll continue. The line of code below estimates the parameters of the following regression:
\[oil_t = \alpha + \beta ng_t + \epsilon_t\]
and extracts the estimated residuals \(\hat{\epsilon}_t\) into the variable named oil_orthogonalized.
oil_orthogonalized <- lm(data$oil ~ data$ng)$resid
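Equivalently, since the residuals are just the actual returns minus the fitted values \(\hat{\alpha} + \hat{\beta} ng_t\), we could compute them explicitly (a sketch reusing the data object from above):
fit <- lm(data$oil ~ data$ng)
oil_hat <- fitted(fit)  # the fitted values
oil_orthogonalized <- data$oil - oil_hat  # same values as residuals(fit)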
Now if we calculate the correlation between our orthogonalized crude oil returns and natural gas returns, we find:
cor(oil_orthogonalized, data$ng)
4.1870964646196e-17
That is zero up to floating-point precision. Now there is no correlation between (orthogonalized) crude oil and natural gas returns, and we can run the following regression without worrying about the effect of multicollinearity.
summary(lm(data$market ~ oil_orthogonalized + data$ng))
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        0.0004347  0.0001755   2.477   0.0133 *
## oil_orthogonalized 0.1965857  0.0075897  25.902  < 2e-16 ***
## data$ng            0.0409667  0.0058139   7.046 2.12e-12 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 0.01167 on 4420 degrees of freedom
## Multiple R-squared:  0.1402, Adjusted R-squared:  0.1398
## F-statistic: 360.3 on 2 and 4420 DF,  p-value: < 2.2e-16
The only downside is that the coefficient on oil is now a bit harder to interpret. We can't say a 1% increase in oil is associated with a 0.19% increase in the market; we have to say a 1% increase in the part of oil returns that is uncorrelated with natural gas returns. So we don't have to worry about multicollinearity, but we do have a harder time interpreting the regression. Of course, if oil is our main variable of interest, we would instead orthogonalize natural gas with respect to oil, so we have no difficulty interpreting the coefficient on oil.
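That alternative is symmetric to what we did above. A sketch, reusing the data object:
# Orthogonalize natural gas with respect to crude oil instead
ng_orthogonalized <- lm(data$ng ~ data$oil)$resid
# The coefficient on oil now keeps its usual interpretation
summary(lm(data$market ~ data$oil + ng_orthogonalized))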