If you have a scatter graph of n points, how can you find a 'line of best fit' for the data?
We need to find a line whose value at each $x_i$ we will call $l_i$:
$$l_i = m (x_i) + c$$
However, $c$ in this form is the y-intercept, i.e. the value of the line at $x = 0$, which may lie far from the data and so may not be the most useful result. One neat adjustment is to redefine the line so that $c$ is instead its value at $x = \bar{x}$, the mean of the $x_i$:
$$l_i = m (x_i - \bar{x}) + c $$
For a least squares regression we want to find $m$ and $c$ such that the following sum is minimised:
$$ \sum_{i=1}^{n}(y_i - l_i)^2 = \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + c))^2 \equiv Q $$
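To make $Q$ concrete, here is a minimal Python sketch that evaluates it for a candidate $m$ and $c$. The function name `sum_squared_residuals` and the sample data are made up for illustration; they are not part of the derivation.

```python
def sum_squared_residuals(xs, ys, m, c):
    """Q for the centred line l_i = m * (x_i - x_bar) + c."""
    x_bar = sum(xs) / len(xs)
    return sum((y - (m * (x - x_bar) + c)) ** 2 for x, y in zip(xs, ys))


# Illustrative data lying exactly on y = 2x, so x_bar = 2.5 and the
# centred form of that line has m = 2 and c = 5.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

q_exact = sum_squared_residuals(xs, ys, m=2.0, c=5.0)  # perfect fit
q_worse = sum_squared_residuals(xs, ys, m=1.0, c=5.0)  # wrong slope

print(q_exact)  # 0.0
print(q_worse)  # 5.0
```

The closer the candidate line is to the points, the smaller $Q$ gets, which is why minimising it gives a sensible definition of "best fit".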
Solution:
Minimising $Q$ involves differentiating $Q$ with respect to $c$ and $m$, setting each result to zero and solving for $c$ and $m$.
First, differentiate with respect to $c$:
$$Q = \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + c))^2 $$
$$\frac{dQ}{dc} = 2 \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + c)) \cdot (-1) = 0 $$
$$ \Rightarrow - \sum _{i=1}^{n} y_i + m \sum _{i=1}^{n} (x_i - \bar{x}) + \sum _{i=1}^{n}c = 0 $$
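Note that $ \sum _{i=1}^{n} (x_i - \bar{x}) = \sum _{i=1}^{n} x_i - n \bar{x} = 0 $, because $\bar{x} = \frac{1}{n} \sum _{i=1}^{n} x_i$. This is exactly why centring the line on $\bar{x}$ is convenient: the term involving $m$ drops out.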
Therefore:
$$ - \sum _{i=1}^{n} y_i + n c = 0 $$
$$ \Rightarrow c = \frac{\sum _{i=1}^{n} y_i } { n } = \bar{y} $$
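As a quick numerical sanity check of this step, the sketch below confirms on some made-up data that the deviations from $\bar{x}$ sum to zero and that, for a fixed slope, $Q$ is smaller at $c = \bar{y}$ than at nearby values. The data, the helper `q` and the arbitrary slope `0.7` are assumptions chosen only for illustration.

```python
xs = [1.0, 2.0, 4.0, 7.0]
ys = [3.0, 5.0, 4.0, 8.0]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# The deviations from the mean cancel out.
deviation_sum = sum(x - x_bar for x in xs)
print(deviation_sum)


def q(m, c):
    """Sum of squared residuals for the centred line m * (x - x_bar) + c."""
    return sum((y - (m * (x - x_bar) + c)) ** 2 for x, y in zip(xs, ys))


m = 0.7  # any fixed slope works; the best c does not depend on m
for c in (y_bar - 0.5, y_bar, y_bar + 0.5):
    print(c, q(m, c))
```

Whatever slope is chosen, the middle value of $c$ should give the smallest $Q$ of the three, in line with $c = \bar{y}$.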
Now substitute $c = \bar{y}$ into $Q$ and differentiate with respect to $m$:
$$Q = \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + \bar{y}))^2 $$
$$ \frac{dQ}{dm} = 2 \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + \bar{y}))(\bar{x}-x_i) = 0 $$
$$ \Rightarrow \sum_{i=1}^{n} (y_i - (m x_i - m \bar{x} + \bar{y}))(\bar{x} - x_i) = 0 $$
$$ \Rightarrow \sum_{i=1}^{n} (y_i - m x_i + m \bar{x} - \bar{y})(\bar{x} - x_i) = 0 $$
$$ \Rightarrow \sum_{i=1}^{n} [(y_i \bar{x}- m x_i \bar{x} + m \bar{x}^2 - \bar{y}\bar{x}) - (y_i x_i - m x_i^2 + m \bar{x} x_i - \bar{y} x_i)] = 0 $$
$$ \Rightarrow m \sum_{i=1}^{n} ( \bar{x}^2 + x_i^2 - 2 \bar{x} x_i) + \sum_{i=1}^{n} (y_i \bar{x} - \bar{y} \bar{x} - y_i x_i + \bar{y} x_i) = 0 $$
$$ \Rightarrow m \sum_{i=1}^{n} (x_i - \bar{x})^2 - \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) = 0 $$
$$ \Rightarrow m = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 }$$
Voilà. To find the line, we simply calculate $c = \bar{y}$ and $m$ from the last formula.
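The two results translate almost directly into code. Below is a minimal Python sketch, assuming the points arrive as two equal-length lists; the function name `fit_line`, the guard against identical x values and the sample data are illustrative choices rather than anything prescribed by the derivation.

```python
def fit_line(xs, ys):
    """Return (m, c, x_bar) for the centred line y = m * (x - x_bar) + c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((y - y_bar) * (x - x_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        # Every x is the same, so the slope formula would divide by zero.
        raise ValueError("all x values are identical; slope is undefined")
    m = s_xy / s_xx
    c = y_bar  # value of the line at x = x_bar
    return m, c, x_bar


# Illustrative data: x_bar = 3, y_bar = 4, s_xy = 6, s_xx = 10.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
m, c, x_bar = fit_line(xs, ys)
print(m, c)  # 0.6 4.0

# The usual y-intercept (the value at x = 0) follows as c - m * x_bar,
# which is about 2.2 for this data.
print(round(c - m * x_bar, 6))
```

Note that $c$ here is the height of the line at $x = \bar{x}$, so the conventional intercept at $x = 0$ has to be recovered as $c - m \bar{x}$ if it is needed.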