Recently, I've made some contributions to an open source project for which an understanding of Cassandra would be useful. With no prior knowledge of Cassandra, the following post summarises what I learned in an hour of online research.
Cassandra is decentralised. Data is stored across multiple nodes, and this collection of nodes is known as a cluster. Among these nodes there is a lot of replication, as specified by the user. For example, imagine there are 12 nodes. The user can specify that they want each piece of data replicated on 6 of the 12 nodes, which obviously increases redundancy and reduces the risk of data loss.
Replication factors are defined per keyspace. A keyspace is a container for application data, similar to a schema in a relational database.
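As a rough sketch of what that looks like in CQL (the keyspace name my_app is made up for illustration), a keyspace with a replication factor of 6 could be created like this:

```
-- Hypothetical keyspace: every piece of data is replicated to 6 nodes.
-- SimpleStrategy spreads replicas around the cluster with no notion of data centres.
CREATE KEYSPACE my_app
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 6};
```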
When a user writes to Cassandra, they must also specify a consistency. This is the number of nodes to which the data must be written before the write is reported back to the user as successful. Continuing our earlier example: if we have 12 nodes in total, a replication factor of 6, and a user issues a write with a consistency of three, then once the data has been written to 3 of the 6 replicas (i.e. the replication is 50% complete) the user is informed that their write has been successful.
This idea of consistency also applies to reads. If a user reads some data with a consistency of two, then Cassandra reads from two of the six replicas and returns the most up-to-date result to the user.
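In cqlsh the consistency level is set with the CONSISTENCY command, which applies to the whole session rather than to a single statement. Here's a sketch of the example above, using an invented my_app.users table:

```
CREATE TABLE my_app.users (id int PRIMARY KEY, name text);

-- Write at consistency THREE: reported successful once 3 of the 6 replicas have the data
CONSISTENCY THREE;
INSERT INTO my_app.users (id, name) VALUES (1, 'Alice');

-- Read at consistency TWO: 2 of the 6 replicas are consulted and the most
-- up-to-date result is returned
CONSISTENCY TWO;
SELECT name FROM my_app.users WHERE id = 1;
```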
Possible consistencies are the following:
ANY - fire and forget (applies to writes only)
ONE
TWO
THREE
QUORUM - means that a majority of nodes must be consistent (so 4 out of our 6)
ALL - all six nodes must have the data
Cassandra can be configured to be data centre aware. If this is the case, then replication factors must be defined per data centre.
With data centre aware configuration, there are two additional consistencies that can be used.
EACH_QUORUM means that you must be at QUORUM consistency in each data centre.
EACH_QUORUM Example:
2 data centres
Each data centre has 6 nodes so the cluster size is 12
A replication factor of 6, with Cassandra not data centre aware.
QUORUM consistency means that the request is complete when 4 nodes are consistent. However, these 4 nodes could be unevenly distributed between the data centres; they might even all be in London.
What if the London data centre has a power cut? Suddenly we have lost the benefit of the replication.
Therefore, we can make Cassandra data centre aware and define a replication factor of 3 per data centre.
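In CQL this per data centre replication factor is what NetworkTopologyStrategy is for. A sketch, assuming the data centres are named 'London' and 'Winchester' in the cluster's snitch configuration, switching our made-up keyspace over:

```
-- 3 replicas in London and 3 in Winchester, so 6 replicas in total
ALTER KEYSPACE my_app
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'London': 3,
                      'Winchester': 3};
```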
Now, if we use a consistency of EACH_QUORUM, we get consistency across two nodes in each data centre.
If one of your data centres is unavailable for some reason then any read or write that a user attempts at QUORUM will fail. However, applications can be set up to reduce consistency if a data centre or some nodes are unavailable.
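As a cqlsh sketch (again using the invented my_app.users table), an EACH_QUORUM write needs 2 of the 3 replicas in each data centre to acknowledge it, so it fails outright while London is down:

```
CONSISTENCY EACH_QUORUM;
-- Needs a quorum (2 of 3) in London AND a quorum (2 of 3) in Winchester;
-- if either data centre is unreachable, the write fails
INSERT INTO my_app.users (id, name) VALUES (2, 'Bob');
```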
There is also LOCAL_QUORUM consistency, which can be used when Cassandra is data centre aware. It means that you achieve QUORUM consistency in the data centre of the node you are connected to. If you are interested, the node you are currently connected to is called the coordinator.
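Sticking with the same sketch, a LOCAL_QUORUM read only needs 2 of the 3 replicas in the coordinator's own data centre to respond, so it keeps working even if the other data centre is unreachable:

```
CONSISTENCY LOCAL_QUORUM;
-- Only 2 of the 3 replicas in the coordinator's data centre need to respond
SELECT name FROM my_app.users WHERE id = 1;
```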
Data centre awareness can extend to Cassandra knowing exactly what rack each node is on.
Hinted Handoffs
Let's look at one more example to understand hinted handoffs.
2 data centres in London and Winchester.
6 nodes in each data centre => cluster size = 12
Replication factor of 3 per data centre => replication factor = 6
There is a power outage at the data centre in London so it is unavailable.
The user is connected to one of the nodes in the Winchester data centre and attempts a write at consistency ONE. This write succeeds and the data is replicated among 3 nodes in Winchester. However, because the London data centre is unavailable, the data cannot be replicated across 3 nodes in London. Instead, the coordinator stores a hint and remembers that this data needs to be synced to London. If the London data centre becomes available within 3 hours, the coordinators with stored hints will push them to the recovered nodes to bring them up to date.
If it takes longer than 3 hours for the nodes to recover, then this self-healing is not possible and instead we need to run a repair on the nodes (the nodetool repair command). This means that the recovered nodes go round asking other nodes for information to bring their own state up to date.
Of course, this three hour limit is configurable in Cassandra (it's the max_hint_window_in_ms setting in cassandra.yaml).
Okay, so that was fun. I've only scratched the surface with Cassandra. It seems to me that you could theoretically configure it to run almost like a relational database, though apparently its query language (CQL) is less powerful than SQL in that it is missing joins and the like. Of course replication has its pros and cons (expense vs safety), but it seems like a pretty clever piece of software and I look forward to learning more about it. :D