Alexandra's Tech: algo


Tuesday, 16 February 2016

The 12 Days of Christmas

Question.

On the first day of Christmas my true love gave to me: a partridge in a pear tree

On the second day of Christmas my true love gave to me: two turtle doves and a partridge in a pear tree

Etc.

On the 12th day of Christmas, of which type of bird/person do you have the greatest number?

Answer:

Exhaustively, on day 12:

1 * 12 partridges
2 * 11 turtle doves
3 * 10 French hens
4 * 9 calling birds
5 * 8 gold rings
6 * 7 geese a-laying
7 * 6 swans a-swimming
8 * 5 maids a-milking
9 * 4 ladies dancing
10 * 3 lords a-leaping
11 * 2 pipers piping
12 * 1 drummers drumming

(Yes, I had to google what was given on each day.)

So the answer is that you have 42 geese and 42 swans.

Intuitively, the product of two numbers with a fixed sum (here 13) is biggest when the numbers are closest together, in this case 6*7. So the answer for the max is $\lfloor \frac{13}{2} \rfloor * (13 - \lfloor \frac{13}{2} \rfloor)$

How to prove this mathematically:

Firstly, is there a general formula?

Yes. If $i$ is the day on which you first get a particular type of gift, you receive $i$ of them on each of the $13 - i$ days from day $i$ to day 12, so on day twelve you have $i*(13 - i) $

To find the max we differentiate:

$$\frac{d}{di} i*(13 - i) $$

$$ = \frac{d}{di} 13i - i^2 $$

$$ = 13 - 2i $$

Set this to zero and solve

$$ 13 - 2i = 0 => i = \frac{13}{2}$$

Is this definitely a maximum? Differentiating again gives $-2$, which is negative, so this is a maximum.

Obviously, our data is discrete, so we check the integer values on either side of 6.5, namely 6 and 7. In this case they give the same answer: 6 * 7 and 7 * 6 = 42.
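The result can also be checked by brute force. This is only a small sketch: the object name `TwelveDays` and its methods are made up for illustration, and it assumes gift $i$ arrives $i$ at a time on every day from day $i$ onwards, as in the formula above.

```scala
object TwelveDays {
  // Gift i is first given on day i, and i more arrive on every day up to the last day.
  def total(i: Int, days: Int = 12): Int = i * (days + 1 - i)

  // The (1-based) gifts that share the largest total on the last day.
  def largest(days: Int = 12): Seq[Int] = {
    val totals = (1 to days).map(i => i -> total(i, days))
    val best = totals.map(_._2).max
    totals.collect { case (i, t) if t == best => i }
  }
}
```

`TwelveDays.largest()` picks out gifts 6 and 7 (geese and swans), each with `TwelveDays.total(6)` = 42.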

(Fun fact: when I was fourteen, my friends and I made up our own version of this song. All I can remember is the following:

5 g strings
4 manly hugs
3 french kisses
2 bulging biceps
And a nice piece of ass for me

I've forgotten everything I ever learned in A level chemistry, but it's good to know that the important stuff sticks. )

Quicksort in Scala

Basic Algorithm

Step 1: Choose pivot p as some element in the array.
Step 2: Partition the array so that everything on the left of p is <= p and everything on the right is >= p (this step is $O(n)$).
Step 3: Recurse on the left and right.

Best and worst cases

Intuitively, the best case is when the pivot happens to be the median each time, giving $O(n * log(n))$.

The worst case is when every pivot is the smallest or largest remaining element, for example choosing pivot = first element when the array is already sorted, giving $O(n^2)$.

Implementation

There are two main implementations to consider. The first, Lomuto partitioning, is the most intuitive.
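The snippet below is a minimal sketch of Lomuto partitioning, not the code originally embedded in this post. The method names `partition` and `go` are chosen to match the analysis that follows; the object name `LomutoQuicksort`, the choice of the last element as pivot and everything else in the body are assumptions.

```scala
object LomutoQuicksort {
  // Sorts `a` in place.
  def sort(a: Array[Int]): Unit = {
    def swap(i: Int, j: Int): Unit = { val t = a(i); a(i) = a(j); a(j) = t }

    // Lomuto partition: uses a(r) as the pivot and returns the pivot's final index.
    def partition(l: Int, r: Int): Int = {
      val p = a(r)
      var store = l // next slot for an element smaller than the pivot
      for (k <- l until r) {
        if (a(k) < p) { swap(k, store); store += 1 }
      }
      swap(store, r) // put the pivot between the two halves
      store
    }

    // Constant-time bounds check, then one O(n) partition and two recursive calls.
    def go(l: Int, r: Int): Unit =
      if (l < r) {
        val m = partition(l, r)
        go(l, m - 1)
        go(m + 1, r)
      }

    go(0, a.length - 1)
  }
}
```

With the last element as pivot, an already sorted array is also a worst case for this sketch, since every partition leaves one side empty.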

Analysis of Lomuto quicksort

We can see that the partition method is $O(n)$: it takes $a*n + c_1$ steps for some constants $a$ and $c_1$.

The check 'if (r < l)' occurs in constant time, $c_2$

The go method is more interesting. In the best case, when the partition is at the median every time, we can say:

$$T(n) = 2 T (\frac{n}{2}) + (a * n + c_1) + c_2 = 2 T (\frac{n}{2}) + a * n + c_3 $$

The constant doesn't make much difference as $n \rightarrow \infty $, so let's drop it:

$$T(n) = 2 T (\frac{n}{2}) + a * n $$

$$T(\frac{n}{2}) = 2 T (\frac{n}{4}) + \frac{a * n}{2} $$
$$ => T(n) = 2 \{ 2 T (\frac{n}{4}) + \frac{a * n}{2} \} + a * n $$
$$ => T(n) = 4 T (\frac{n}{4}) + 2 * a * n $$
$$ => T(n) = 2^k * T(\frac{n}{2^k}) + k * a * n $$

We want to find this formula in terms of $T(1)$

$$ 1 = \frac{n}{2^k}$$

$$ => 2^k = n $$

$$ => k = log_2(n) $$

Substitute this value of $k$ into the earlier formula:

$$ => T(n) = 2^{log_2(n)} * T(1) + log_2(n) * a * n $$

$$ => T(n) = n * T(1) + a * n * log_2(n) $$

$T(1)$ will happen in constant time $c$

$$ => T(n) = n * c + a * n * log_2(n) $$

So we can see that the best-case time complexity is $O(n * log_2(n))$.
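For comparison, the same unrolling can be sketched for the worst case. This assumes every partition leaves one side empty, so each call recurses on $n - 1$ elements, and it drops the constants as before:

$$T(n) = T(n-1) + a * n $$

$$ => T(n) = T(n-2) + a * (n-1) + a * n $$

$$ => T(n) = T(1) + a * \sum_{j=2}^{n} j = c + a * (\frac{n(n+1)}{2} - 1) $$

This grows like $n^2$, which matches the $O(n^2)$ worst case mentioned earlier.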

Hoare Partitioning

The other implementation of quicksort is less intuitive. Apparently it does fewer swaps (about three times fewer on average) than Lomuto, but the analysis is too complicated for me to bother with right now.

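// Sorts the enclosing array `a` (assumed to be an Array[Int] in scope) in place.
def hoare(): Unit = {
  def swap(i: Int, j: Int): Unit = { val t = a(i); a(i) = a(j); a(j) = t }

  def aux(l: Int, r: Int): Unit = {
    val p = a(l + (r - l) / 2) // pivot value: the middle element of a(l..r)
    var i = l
    var j = r

    while (i <= j) {
      while (a(i) < p) i += 1 // skip elements already on the correct (left) side
      while (a(j) > p) j -= 1 // skip elements already on the correct (right) side
      if (i <= j) { swap(i, j); i += 1; j -= 1 }
    }

    // Recurse only once the partition is complete.
    if (l < j) aux(l, j)
    if (i < r) aux(i, r)
  }
  if (a.length > 1) aux(0, a.length - 1)
}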

Thursday, 28 January 2016

Least Squares Regression

If you have a scatter graph of n points, how can you find a 'line of best fit' for the data?

There exists some line $l_i$ that we need to find:

$$l_i = m (x_i) + c$$

However, the y-intercept of this line (c) is the y-value when x = 0. This may not be the most useful result, so one neat adjustment we can make is to redefine the line so that $c$ is its value at $x$ = $\bar{x}$ rather than at $x$ = 0.

$$l_i = m (x_i - \bar{x}) + c $$

For a least squares regression we want to find m and c such that the following sum is minimised:

$$ \sum_{i=1}^{n}(y_i - l_i)^2 = \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + c))^2 \equiv Q $$

Solution:

Minimising $Q$ involves differentiating Q by $c$ and $m$, setting each result to zero and solving for $c$ and $m$.

Firstly differentiate wrt $c$:

$$Q = \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + c))^2 $$

$$\frac{dQ}{dc} = 2 \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + c)) * (-1) = 0 $$

$$ => - \sum _{i=1}^{n} y_i + m \sum _{i=1}^{n} (x_i - \bar{x}) + \sum _{i=1}^{n}c = 0 $$

Note that $ \sum _{i=1}^{n} (x_i - \bar{x}) = 0 $

Therefore:

$$ - \sum _{i=1}^{n} y_i + n c = 0 $$

$$ => c = \frac{\sum _{i=1}^{n} y_i } { n } = \bar{y} $$

Now let's differentiate wrt $m$:

$$Q = \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + \bar{y}))^2 $$

$$ \frac{dQ}{dm} = 2 \sum_{i=1}^{n}(y_i - (m(x_i - \bar{x}) + \bar{y}))(\bar{x}-x_i) = 0 $$

$$ => \sum_{i=1}^{n} (y_i - (m x_i - m \bar{x} + \bar{y}))(\bar{x} - x_i) = 0 $$

$$ => \sum_{i=1}^{n} (y_i - m x_i + m \bar{x} - \bar{y})(\bar{x} - x_i) = 0 $$

$$ => \sum_{i=1}^{n} [(y_i \bar{x}- m x_i \bar{x} + m \bar{x}^2 - \bar{y}\bar{x}) - (y_i x_i - m x_i^2 + m \bar{x} x_i - \bar{y} x_i)] = 0 $$

$$ => m \sum_{i=1}^{n} ( \bar{x}^2 + x_i^2 - 2 \bar{x} x_i) + \sum_{i=1}^{n} (y_i \bar{x} - \bar{y} \bar{x} - y_i x_i + \bar{y} x_i) = 0 $$

$$ => m \sum_{i=1}^{n} (x_i - \bar{x})^2 - \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) = 0 $$

$$ => m = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 }$$

Voila. Now to find the line, we can easily calculate $c$ and $m$.
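As a rough sketch of how these two formulas might be turned into code: the object name `LeastSquares`, the method `fit` and the representation of points as `(x, y)` pairs are all assumptions made for illustration.

```scala
object LeastSquares {
  // Returns (m, c) for the line l = m * (x - xBar) + c, using the formulas derived above.
  def fit(points: Seq[(Double, Double)]): (Double, Double) = {
    require(points.nonEmpty, "need at least one point")
    val n = points.size
    val xBar = points.map(_._1).sum / n
    val yBar = points.map(_._2).sum / n
    // Numerator and denominator of the expression for m.
    val sxy = points.map { case (x, y) => (y - yBar) * (x - xBar) }.sum
    val sxx = points.map { case (x, _) => (x - xBar) * (x - xBar) }.sum
    require(sxx != 0.0, "all x values are equal, so the slope is undefined")
    (sxy / sxx, yBar) // c is simply the mean of the y values
  }
}
```

For the points (0, 1), (1, 3), (2, 5), which lie exactly on a line, this gives $m = 2$ and $c = \bar{y} = 3$. Note that $c$ is the value of the line at $x = \bar{x}$, not the usual y-intercept.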

Wednesday, 27 January 2016

Dijkstra in Scala

The easiest way to understand Dijkstra's algorithm is by watching a YouTube video.

Step 1: Mark the distance to the start node as 0 and the distance to all other nodes as infinity.
Step 2: Take the current node and update the distance to all its linked nodes that are unvisited, then mark the current node as visited.
Step 3: Choose the closest unvisited node as the next current node.
Repeat steps 2 and 3 until the target node is marked visited.

The following gist shows the code. Full code on GitHub.
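The steps above can also be sketched roughly as follows. This is not the gist itself: the object name `Dijkstra`, the adjacency-map `Graph` type, the non-negative integer weights and the use of a priority queue to pick the closest unvisited node are all assumptions made for illustration.

```scala
import scala.collection.mutable

object Dijkstra {
  // Graph as an adjacency map: node -> list of (neighbour, non-negative edge weight).
  type Graph = Map[Int, List[(Int, Int)]]

  // Shortest distance from `start` to `target`, or None if `target` is unreachable.
  def shortest(graph: Graph, start: Int, target: Int): Option[Int] = {
    // Step 1: the start node is at distance 0, every other node is treated as infinitely far.
    val dist = mutable.Map(start -> 0).withDefaultValue(Int.MaxValue)
    val visited = mutable.Set.empty[Int]
    // PriorityQueue is a max-heap, so reverse the ordering to dequeue the smallest distance.
    val queue = mutable.PriorityQueue((0, start))(Ordering.by[(Int, Int), Int](_._1).reverse)

    while (queue.nonEmpty && !visited(target)) {
      // Step 3: the closest unvisited node becomes the current node.
      val (d, node) = queue.dequeue()
      if (!visited(node)) {
        visited += node
        // Step 2: update the distance to each unvisited neighbour.
        for ((next, w) <- graph.getOrElse(node, Nil) if !visited(next)) {
          val candidate = d + w
          if (candidate < dist(next)) {
            dist(next) = candidate
            queue.enqueue((candidate, next))
          }
        }
      }
    }
    if (visited(target)) Some(dist(target)) else None
  }
}
```

Stale queue entries (a node enqueued again with a shorter distance) are simply skipped when they are dequeued, which keeps the sketch short at the cost of a slightly larger queue.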

Monday, 6 October 2014

Union Find Algorithm

Imagine:

- you have a set of n objects
- where objects can have some sort of connection between them
- and you want to determine if there is a path between two objects.

We assume that the connection between two of these objects is reflexive, symmetric and transitive.

How could this be implemented?

Basically, you can maintain disjoint subsets (components) of the objects, where each subset holds the objects that are transitively connected.

Eg:

In this example the division of nodes would be: {0, 3, 4, 8} {1} {2, 5} {6, 7}

Implementation 1: Bias towards find

You could store it in an array showing the relationships:

[0, 1, 2, 0, 0, 2, 3, 3, 0]

(eg the elements at index 2 and index 5 are both 2, meaning they are in the same subset)

With this algorithm find is O(1) but union is O(n), which is a bit slow if you need to do a lot of unions.
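A minimal sketch of this first implementation might look like the following. The class name `QuickFind` and its method names are made up for illustration; `ids(i)` plays the role of the array above.

```scala
// ids(i) is the label of the subset that object i belongs to.
class QuickFind(n: Int) {
  private val ids: Array[Int] = Array.tabulate(n)(identity)

  // O(1): two objects are connected when their labels match.
  def connected(p: Int, q: Int): Boolean = ids(p) == ids(q)

  // O(n): relabel every member of p's subset with q's label.
  def union(p: Int, q: Int): Unit = {
    val from = ids(p)
    val to = ids(q)
    if (from != to) for (i <- ids.indices if ids(i) == from) ids(i) = to
  }
}
```

The full scan of the array in `union` is where the O(n) cost comes from.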

Implementation 2: Maintain little trees

Again maintain an array but this time it is an array of little trees.

Start with [0, 1, 2, 3, 4, 5, 6, 7, 8]

union(2,3) => Update 2 to have root 3: [0, 1, 3, 3, 4, 5, 6, 7, 8]

union(3, 8) => Update 3 to have root 8: [0, 1, 3, 8, 4, 5, 6, 7, 8]

union(1, 6) => Update 1 to have root 6: [0, 6, 3, 8, 4, 5, 6, 7, 8]

union(7, 3) => Update 7 to have root 8 (3's root): [0, 6, 3, 8, 4, 5, 6, 8, 8]

With this implementation both find and union are worst case O(n)
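Here is a rough sketch of the little-trees approach. The class name `QuickUnion` and its method names are assumptions; `parent` is the array shown in the example above, with `parent(i) == i` marking a root.

```scala
class QuickUnion(n: Int) {
  // parent(i) == i means that i is the root of its tree.
  val parent: Array[Int] = Array.tabulate(n)(identity)

  // Follow parent links until a root is reached; O(depth of the tree).
  @annotation.tailrec
  final def root(i: Int): Int = if (parent(i) == i) i else root(parent(i))

  def connected(p: Int, q: Int): Boolean = root(p) == root(q)

  // Attach the root of p's tree to the root of q's tree.
  def union(p: Int, q: Int): Unit = {
    val rp = root(p)
    val rq = root(q)
    if (rp != rq) parent(rp) = rq
  }
}
```

Both operations have to walk up to a root, so a tall tree makes both of them slow.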

Improvements

The problem with implementation 2 is that your little trees can end up not so little and can get very tall. There are some optimisations you can make to keep the tree depth small and so improve performance.
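Two commonly used optimisations are union by size (always hang the smaller tree under the larger one) and path compression. The post does not say which optimisations it has in mind, so treat the sketch below as one possible combination; the class name `WeightedQuickUnion` is made up, and it uses path halving as a simple form of path compression.

```scala
class WeightedQuickUnion(n: Int) {
  private val parent: Array[Int] = Array.tabulate(n)(identity)
  private val size: Array[Int] = Array.fill(n)(1) // number of nodes in the tree rooted at i

  // Find the root, pointing each visited node at its grandparent on the way (path halving).
  def root(i: Int): Int = {
    var x = i
    while (parent(x) != x) {
      parent(x) = parent(parent(x))
      x = parent(x)
    }
    x
  }

  def connected(p: Int, q: Int): Boolean = root(p) == root(q)

  // Union by size: the smaller tree is attached under the root of the larger one.
  def union(p: Int, q: Int): Unit = {
    val rp = root(p)
    val rq = root(q)
    if (rp != rq) {
      if (size(rp) < size(rq)) { parent(rp) = rq; size(rq) += size(rp) }
      else { parent(rq) = rp; size(rp) += size(rq) }
    }
  }
}
```

Union by size alone keeps the depth of every tree at most logarithmic in n, so find and union no longer degrade to O(n).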
