Thursday 28 January 2016

Least Squares Regression

If you have a scatter graph of n points, how can you find a 'line of best fit' for the data?

We need to find some line, whose value at each $x_i$ we write as $l_i$:

$$l_i = m (x_i) + c$$

However, the y-intercept of this line ($c$) is the y-value when $x = 0$. This may not be the most useful result, so one neat adjustment we can make is to redefine the line so that $c$ is instead the y-value when $x = \bar{x}$.

$$l_i = m (x_i - \bar{x}) + c $$

For a least squares regression we want to find m and c such that the following sum is minimised:

$$ \sum_{i=1}^{n}(y_i - l_i)^2 = \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + c))^2 \equiv Q $$
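To make the quantity being minimised concrete, here is a minimal Python sketch of $Q$. The function name `sum_of_squares` and the sample points are made up for illustration; they are not part of the derivation.

```python
def sum_of_squares(xs, ys, m, c):
    """Return Q for the line l_i = m * (x_i - x_bar) + c."""
    x_bar = sum(xs) / len(xs)
    return sum((y - (m * (x - x_bar) + c)) ** 2 for x, y in zip(xs, ys))


# Illustrative data: three points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]

q_exact = sum_of_squares(xs, ys, m=2.0, c=3.0)  # the line through all three points
q_flat = sum_of_squares(xs, ys, m=0.0, c=3.0)   # a horizontal line at the mean of y
```

For these points the first line passes through every point, so `q_exact` is 0, while the horizontal line leaves residuals of -2, 0 and 2, so `q_flat` is 8.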

Solution: 


Minimising $Q$ involves differentiating Q by $c$ and $m$, setting each result to zero and solving for $c$ and $m$.


Firstly differentiate wrt $c$: 


$$Q = \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + c))^2 $$

$$\frac{dQ}{dc} = 2 \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + c)) * (-1) = 0 $$

$$ =>  - \sum _{i=1}^{n} y_i + m  \sum _{i=1}^{n} (x_i - \bar{x}) + \sum _{i=1}^{n}c = 0 $$

Note that $  \sum _{i=1}^{n} (x_i - \bar{x}) = 0 $
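This follows directly from the definition of the mean, $\bar{x} = \frac{1}{n} \sum _{i=1}^{n} x_i$:

$$ \sum _{i=1}^{n} (x_i - \bar{x}) = \sum _{i=1}^{n} x_i - n \bar{x} = n \bar{x} - n \bar{x} = 0 $$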

Therefore:

$$  - \sum _{i=1}^{n} y_i  + n c = 0 $$

$$ => c = \frac{\sum _{i=1}^{n} y_i } { n } = \bar{y} $$

Now let's differentiate wrt $m$:


$$Q = \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + \bar{y}))^2 $$

$$ \frac{dQ}{dm} = 2 \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + \bar{y}))(\bar{x}-x_i) = 0 $$

$$ => \sum_{i=1}^{n} (y_i - (m x_i - m \bar{x} + \bar{y}))(\bar{x} - x_i) = 0 $$

$$ => \sum_{i=1}^{n} (y_i - m x_i + m \bar{x} - \bar{y})(\bar{x} - x_i) = 0 $$

$$ => \sum_{i=1}^{n} [(y_i \bar{x}- m x_i \bar{x} + m \bar{x}^2 - \bar{y}\bar{x}) - (y_i x_i - m x_i^2 + m \bar{x} x_i - \bar{y} x_i)] = 0 $$

$$ => m \sum_{i=1}^{n} ( \bar{x}^2 + x_i^2 - 2 \bar{x} x_i) + \sum_{i=1}^{n} (y_i \bar{x} - \bar{y} \bar{x} - y_i x_i + \bar{y} x_i) = 0 $$

$$ => m  \sum_{i=1}^{n} (x_i - \bar{x})^2 -  \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) = 0 $$

$$ => m = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 }$$

Voilà. To find the line for any data set, we can now calculate $c$ and $m$ directly from the two formulas above.
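As a rough sketch of how the two results could be used in practice, the Python below computes $m$ and $c$ from a list of points. The function name `least_squares_fit`, the error handling and the sample data are assumptions added for illustration only.

```python
def least_squares_fit(xs, ys):
    """Fit l = m * (x - x_bar) + c and return (m, c, x_bar)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two points, with matching x and y lists")

    x_bar = sum(xs) / n
    y_bar = sum(ys) / n

    # Denominator of m: sum of squared deviations of x from its mean.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are equal, so the slope is undefined")

    # Numerator of m: sum of products of the y and x deviations.
    s_xy = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))

    m = s_xy / s_xx
    c = y_bar  # from setting dQ/dc to zero
    return m, c, x_bar


# Illustrative data, not taken from any real measurement.
m, c, x_bar = least_squares_fit([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 8.0])

# The conventional y-intercept (the value at x = 0) can be recovered if needed.
intercept_at_zero = c - m * x_bar
```

For this sample data the sums work out to 9.5 and 5, giving $m = 1.9$ and $c = \bar{y} = 4.75$. Because $c$ is the value of the line at $x = \bar{x}$, the usual intercept at $x = 0$ is $c - m\bar{x}$.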

