# The Elastic Net

## Suggested Prerequisites

### Notes:

• Compromise between Ridge Regression and the Lasso

• Helps to reduce overfitting by shrinking model coefficients towards zero and setting some exactly to zero (feature selection)

• Arguably the most robust of the regularized linear regression methods

### Loss Function and Optimization Problem

The associated loss function for the Elastic Net modifies the OLS loss function by adding both $$L_1$$ and $$L_2$$ penalties, with the overall regularization intensity controlled by the tuning parameter $$\lambda$$ and the mix between the two penalties controlled by $$\alpha$$:

$L(\mathbf{\beta}) = \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{\beta}\|_2^2 + \lambda \left[ \frac{1-\alpha}{2} \|\mathbf{\beta}\|_2^2 + \alpha \|\mathbf{\beta}\|_1 \right] \: \: \: \text{with tuning parameters } \lambda \geq 0, \; 0 \leq \alpha \leq 1$

In this context, $$\alpha$$ can be considered the parameter controlling the ratio of $$L_1$$ to $$L_2$$ penalty ($$\alpha = 1$$ recovers the Lasso, $$\alpha = 0$$ recovers Ridge Regression), while $$\lambda$$ sets the intensity of regularization to apply.

Formulating the loss function as a least-squares optimization problem yields:

$\hat{\mathbf{\beta}} = \arg\min_{\mathbf{\beta}} L(\mathbf{\beta}) = \arg\min_{\mathbf{\beta}} \frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{\beta}\|_2^2 + \lambda \left[ \frac{1-\alpha}{2} \|\mathbf{\beta}\|_2^2 + \alpha \|\mathbf{\beta}\|_1 \right]$

Similarly to the Lasso, the $$L_1$$ term makes the loss non-differentiable at zero, so no closed-form solution exists and an iterative optimization technique must be applied to yield the coefficient estimates.
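Before turning to the optimizer, the loss itself is straightforward to evaluate. A minimal NumPy sketch of the objective above (the function name is illustrative):

```python
import numpy as np

def elastic_net_loss(X, y, beta, lam, alpha):
    """Elastic Net loss: (1/2n)||y - X b||_2^2 + lam * ((1-alpha)/2 ||b||_2^2 + alpha ||b||_1)."""
    n = X.shape[0]
    residual = y - X @ beta
    l2_penalty = 0.5 * (1 - alpha) * np.sum(beta ** 2)  # ridge component
    l1_penalty = alpha * np.sum(np.abs(beta))           # lasso component
    return (residual @ residual) / (2 * n) + lam * (l2_penalty + l1_penalty)
```

Setting `lam = 0` reduces this to the (scaled) OLS loss, while `alpha = 1` or `alpha = 0` reduces it to the Lasso or Ridge objective respectively.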

### Pathwise Coordinate Descent

The algorithm is similar to that of the Lasso. Features should first be standardized to have zero mean and unit variance. Each coefficient is then updated in turn as:

$\beta_j = \frac{\mathbf{S}(\beta_j^*, \lambda\alpha)}{1 + \lambda(1-\alpha)}$

where $$\beta_j^*$$ is the simple least-squares coefficient of feature $$j$$ on the partial residuals (the residuals computed with feature $$j$$ excluded) and $$\mathbf{S}$$ is the same soft-thresholding operator applied in the case of the Lasso:

$\mathbf{S}(\beta_j^*, \lambda\alpha) = \text{sign}(\beta_j^*)\left(\left|\beta_j^*\right| - \lambda\alpha\right)_+$
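The soft-thresholding operator translates directly into NumPy; a minimal vectorized sketch:

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding: S(z, gamma) = sign(z) * max(|z| - gamma, 0).

    Works elementwise on scalars or arrays; shrinks z towards zero by gamma
    and clips anything within gamma of zero to exactly zero.
    """
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)
```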

Furthermore, if warm starts along a decreasing sequence of $$\lambda$$ values are to be utilized, then $$\lambda_{max}$$ (the smallest $$\lambda$$ at which all coefficient estimates are zero) can be found as:

$\lambda_{\text{max}} = \frac{\max_l \left|\langle x_l, y \rangle \right|}{n\alpha}$
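In NumPy this is a one-liner over the feature/response inner products; a small sketch (the function name is illustrative, and $$\alpha > 0$$ is assumed so the division is well defined):

```python
import numpy as np

def lambda_max(X, y, alpha):
    """Smallest lambda at which every Elastic Net coefficient is zero (requires alpha > 0)."""
    n = X.shape[0]
    # max over features of |<x_l, y>|, scaled by n * alpha
    return np.max(np.abs(X.T @ y)) / (n * alpha)
```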

### Implementation in Python Using NumPy

**Warning:** In practice it is recommended to use a cross-validation technique such as K-Fold cross-validation to choose the tuning parameters $$\lambda$$ and $$\alpha$$.
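Putting the pieces together, a minimal sketch of pathwise coordinate descent for the formulation above might look like the following. It assumes the columns of `X` are already standardized (zero mean, unit variance); function names and defaults are illustrative, not a definitive implementation:

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def elastic_net_cd(X, y, lam, alpha, tol=1e-8, max_iter=1000, beta_init=None):
    """Elastic Net coefficients via cyclic coordinate descent.

    Assumes each column of X is standardized so that (1/n) x_j^T x_j = 1,
    which makes the single-coordinate update the closed form
    beta_j = S(beta_j*, lam * alpha) / (1 + lam * (1 - alpha)).
    """
    n, p = X.shape
    beta = np.zeros(p) if beta_init is None else beta_init.copy()
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(p):
            # partial residuals with feature j's current contribution removed
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta_star = X[:, j] @ r_j / n  # simple least-squares coefficient
            beta[j] = soft_threshold(beta_star, lam * alpha) / (1 + lam * (1 - alpha))
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

def elastic_net_path(X, y, alpha=0.5, n_lambdas=100, eps=1e-3):
    """Pathwise fit: log-spaced lambda grid from lambda_max down, with warm starts."""
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ y)) / (n * alpha)
    lambdas = np.logspace(np.log10(lam_max), np.log10(eps * lam_max), n_lambdas)
    betas, beta = [], np.zeros(X.shape[1])
    for lam in lambdas:
        beta = elastic_net_cd(X, y, lam, alpha, beta_init=beta)  # warm start
        betas.append(beta.copy())
    return lambdas, np.array(betas)
```

At the top of the path (`lambdas[0]`) every coefficient is zero by construction of $$\lambda_{max}$$, and as $$\lambda$$ shrinks the estimates approach the OLS solution, which makes the path a convenient grid over which to cross-validate.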

## Sources

1. Regularization: ridge regression and the lasso. Nov 2006. URL: http://statweb.stanford.edu/~tibs/sta305files/Rudyregularization.pdf.
2. Anil Aswani. IEOR 165 – engineering statistics, quality control, and forecasting lecture notes 8. Jan 2021. URL: http://courses.ieor.berkeley.edu/ieor165/lecture_notes/ieor165_lec8.pdf.
3. Jerome Friedman, Trevor Hastie, and Rob Tibshirani. Regularization paths for generalized linear models via coordinate descent. Aug 2010. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2929880/.
4. Trevor Hastie, Jerome Friedman, and Rob Tibshirani. Fast regularization paths via coordinate descent talk. 2009. URL: https://web.stanford.edu/~hastie/TALKS/glmnet.pdf.
5. Trevor Hastie, Jerome Friedman, and Robert Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2017. URL: https://web.stanford.edu/~hastie/ElemStatLearn//printings/ESLII_print12_toc.pdf.

Contributions made by our wonderful GitHub Contributors: @wyattowalsh