Sunday 8 June 2014

Java Hash Maps

Java hash maps can basically be thought of as an array of linked lists.
Imagine a hash map backed by an array of size 8. We describe this as a hash map with 8 'buckets'. Each bucket contains a linked list; some of these lists contain elements and some are empty (eg bucket number 1 might contain a linked list of 3 elements).

Storing

When you call hashMap.put(Object key, Object value), the following occurs:

  • hash the key and take the modulus to identify which bucket you should insert into 
  • add the key and value at the end of the linked list in that bucket
Note that collisions are resolved by storing the key and value in a linked list rather than overwriting entries with the same hash code. This mechanism is referred to as 'separate chaining'.
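
As a rough sketch of the mechanism (this is an illustration of separate chaining, not the real java.util.HashMap source, and the class and helper names are made up):

import java.util.Objects;

public class SimpleChainedMap<K, V> {

    // Each bucket holds the head of a singly linked list of entries.
    private static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;

        Node(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    @SuppressWarnings("unchecked")
    private final Node<K, V>[] buckets = new Node[8]; // 8 buckets, as in the example above

    // hash the key and take the modulus to pick a bucket
    private int indexFor(K key) {
        return Math.floorMod(Objects.hashCode(key), buckets.length);
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        if (buckets[i] == null) {
            buckets[i] = new Node<>(key, value); // empty bucket: start a new chain
            return;
        }
        Node<K, V> n = buckets[i];
        while (true) {
            if (Objects.equals(n.key, key)) {    // same key: overwrite its value
                n.value = value;
                return;
            }
            if (n.next == null) {
                n.next = new Node<>(key, value); // collision: append to the end of the chain
                return;
            }
            n = n.next;
        }
    }
}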

Retrieving

  • hash the key to identify the bucket
  • go to that bucket and iterate through the linked list. You must check for key equality in case two keys have hashed to the same value. 
Clearly, if the linked list in the bucket is very long, performance is going to be very slow. For example, if you store n values in your hashmap and they all hash to the same value, then retrieval will perform in O(n), which is clearly not ideal. Therefore, it is important to have a good hash code implementation that minimises collisions.
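
Continuing the SimpleChainedMap sketch from above, get hashes the key to the same bucket and then walks that bucket's chain, relying on equals() to pick the right entry when two different keys have collided:

    public V get(K key) {
        int i = indexFor(key);                        // hash the key to identify the bucket
        for (Node<K, V> n = buckets[i]; n != null; n = n.next) {
            if (Objects.equals(n.key, key)) {         // equality check, because two different
                return n.value;                       // keys can land in the same bucket
            }
        }
        return null;                                  // key not present
    }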

Rehashing

By default, a hash map in Java has 16 buckets and a load factor of 0.75. This means that when the number of elements in the hashmap reaches capacity * load factor, the entire hash map is copied to a new, larger array (double the current size) and every element in the hash map must be rehashed into a bucket in the new array. This process is expensive, so it will improve the performance of your hash map if you set the initial capacity based on the number of values you expect to hold in your hashmap. Of course, this isn't always easy to predict but you can usually make a rough guess (eg 10 objects versus 1 million).

You don't want to make your initial capacity too big though; if you are expecting 500 elements and you make your hashmap initial capacity = 100,000, then you are creating an array of 100,000 elements. This memory is allocated upfront which means nothing else can use this memory and it is essentially wasted.
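
For example (the numbers here are just illustrative), if you expect roughly 500 entries you can size the map up front so that it never needs to rehash, remembering to leave headroom for the 0.75 load factor:

import java.util.HashMap;
import java.util.Map;

public class PreSizedMapExample {
    public static void main(String[] args) {
        int expectedEntries = 500; // rough guess at how many values we'll store

        // The map rehashes once size exceeds capacity * loadFactor, so divide by the
        // load factor to pick a capacity that comfortably holds all expected entries.
        int initialCapacity = (int) Math.ceil(expectedEntries / 0.75);

        Map<String, Integer> scores = new HashMap<>(initialCapacity);
        scores.put("alice", 1);
        System.out.println(scores.size());
    }
}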

Alternative hash map implementations

Another way of implementing hash maps is to use 'open addressing' collision resolution where you create an array of buckets, but fix the bucket size at 1. Then if you hash a key and the bucket is full, you keep going forward in the array of buckets to find the next empty bucket to insert into.

This is clearly going to be very inefficient if your hashmap is quite full.

However, there is an advantage when you have n keys with the same hash code. In a Java hash map you end up with a bucket containing a linked list of n entries, so the performance of a get is O(n). On top of that, you have had to create the bucket array up front with memory usage m, so your total memory usage is m + n.

However, if you use this open addressing implementation, the get method still performs in O(n) but your memory usage is m.
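
Here is a rough sketch of that open addressing (linear probing) idea - one entry per array slot, probing forward on a collision. It deliberately ignores resizing and deletion, and assumes the table never completely fills up:

import java.util.Objects;

public class SimpleOpenAddressingMap<K, V> {
    private final Object[] keys = new Object[16];
    private final Object[] values = new Object[16];

    private int indexFor(K key) {
        return Math.floorMod(Objects.hashCode(key), keys.length);
    }

    public void put(K key, V value) {
        int i = indexFor(key);
        // keep moving forward (wrapping around) until we find the key or an empty slot
        while (keys[i] != null && !Objects.equals(keys[i], key)) {
            i = (i + 1) % keys.length;
        }
        keys[i] = key;
        values[i] = value;
    }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        int i = indexFor(key);
        while (keys[i] != null) {
            if (Objects.equals(keys[i], key)) {
                return (V) values[i];
            }
            i = (i + 1) % keys.length; // probe the next slot
        }
        return null; // hit an empty slot, so the key isn't present
    }
}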




Sunday 1 June 2014

JUnit Rules: Combining @Rule and @ClassRule

Rules are really useful when you need to repeat test setup/teardown in multiple test classes.

Imagine we have a server: before we run the tests we want to call server.start(), and after the tests are finished we want to call server.stop().

We create a rule that implements JUnit's 'TestRule' interface and starts and stops the server:
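
Something along these lines, where Server is a stand-in for whatever server class your tests need:

import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

public class ServerRule implements TestRule {

    private final Server server = new Server();

    @Override
    public Statement apply(final Statement base, Description description) {
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                server.start();            // runs before the wrapped tests
                try {
                    base.evaluate();       // the tests themselves
                } finally {
                    server.stop();         // runs after the tests, even on failure
                }
            }
        };
    }
}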

Now our test class doesn't need the two methods annotated with @BeforeClass and @AfterClass, only the rule:
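
Something like this, using the ServerRule sketched above:

import org.junit.ClassRule;
import org.junit.Test;

public class ServerTest {

    // started once before all the tests in this class, stopped once after them
    @ClassRule
    public static final ServerRule SERVER_RULE = new ServerRule();

    @Test
    public void respondsToRequests() {
        // exercise the running server here
    }
}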


Note the use of @ClassRule. An alternate annotation would be @Rule which would mean that the server was stopped and started between every test.

Combining Before and BeforeClass behaviour in one rule


Now imagine that we want to clear down server state between each test. Logically, this belongs in the rule we have already created. We want one rule to manage server state entirely, such that the rule starts and stops the server before and after the class but clears the state between each test.

However, in recent versions of JUnit, the @ClassRule field in our test needs to be static.

The way round it is as follows. First, we modify our rule so that if the server has already been started, it clears the state when the apply method is called:
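
A sketch of the modified rule; clearState() is a hypothetical method on our Server class:

import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

public class ServerRule implements TestRule {

    private final Server server = new Server();
    private boolean started = false;

    @Override
    public Statement apply(final Statement base, Description description) {
        if (started) {
            // the server is already running, so we are being applied as the per-test
            // @Rule: just clear the state and run the test
            server.clearState();
            return base;
        }
        // otherwise we are being applied as the @ClassRule: start and stop the server
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                server.start();
                started = true;
                try {
                    base.evaluate();
                } finally {
                    server.stop();
                    started = false;
                }
            }
        };
    }
}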


Then, in our JUnit test, we use both rule annotations, @ClassRule and @Rule:
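
Roughly like this, registering the same static rule instance both ways:

import org.junit.ClassRule;
import org.junit.Rule;
import org.junit.Test;

public class ServerTest {

    // as a @ClassRule this starts/stops the server around the whole class...
    @ClassRule
    public static final ServerRule SERVER_RULE = new ServerRule();

    // ...and as a @Rule it clears server state before each individual test
    @Rule
    public ServerRule serverRule = SERVER_RULE;

    @Test
    public void firstTest() {
    }

    @Test
    public void secondTest() {
    }
}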



A whirlwind beginner's tour of Cassandra

Recently, I've made some contributions to an open source project for which an understanding of Cassandra would be useful. With no prior knowledge of Cassandra, the following post summarises what I have learned in an hour of online research.

Cassandra is decentralised. Data is stored across multiple nodes and this collection of nodes is known as a cluster. Among these nodes there is a lot of replication, as specified by the user. For example, imagine there are 12 nodes. The user can specify that they want each piece of data replicated on 6 of the 12 nodes, which increases redundancy and therefore reduces the risk of data loss.

Replication factors are defined across keyspaces. A keyspace is defined as a container for application data, similar to a schema in a relational database.

When a user writes to Cassandra, they must also specify a consistency level. This is the number of nodes to which the data must be written before the write is reported as successful to the user. Continuing from our earlier example, if we have 12 nodes in total, a replication factor of 6 and a user issues a write with a consistency of three, then when the data has been written to 3 of the 6 replicas (ie the replication is 50% complete) the user is informed that their write has been successful.

This idea of consistency also applies to reads. If a user reads some data with a consistency of two, then Cassandra must read from two of the six replicas and return the most up-to-date result to the user.

Possible consistencies are the following:

  • ANY - fire and forget for writes, one for reads
  • ONE
  • TWO
  • THREE
  • QUORUM - means that a majority of nodes must be consistent (so 4 out of our 6)
  • ALL - all six nodes must have the data

Cassandra can be configured to be data centre aware. If this is the case, then replication factors must be defined per data centre.

With data centre aware configuration, there are two additional consistencies that can be used. EACH_QUORUM means that you must be at QUORUM consistency in each data centre.

EACH_QUORUM Example: 

2 data centres
Each data centre has 6 nodes so the cluster size is 12
6 node replication factor where Cassandra is not data centre aware.

QUORUM consistency means that the request is complete when 4 nodes are consistent. However, these 4 nodes could be unevenly distributed between the data centres (eg most of them in one data centre).



What if the London data centre has a power cut? Suddenly we have lost the benefit of the replication.

Therefore, we can make Cassandra data centre aware and define the replication factor = 3 per data centre.

Now, if we use a consistency of EACH_QUORUM, we require consistency across two nodes in each data centre.

If one of your data centres is unavailable for some reason then any read or write that a user attempts at QUORUM will fail. However, applications can be set up to reduce consistency if a data centre or some nodes are unavailable.

There is also LOCAL_QUORUM consistency which can be used when Cassandra is data centre aware. It means that you achieve QUORUM consistency in the data centre of the node you are connected to. If you are interested, the node you are currently connected to is called the coordinator.

Data centre awareness can extend to Cassandra knowing exactly what rack each node is on.

Hinted Handoffs

Let's look at one more example to understand hinted handoffs.

2 data centres in London and Winchester.
6 nodes in each data centre => cluster size = 12
Replication factor of 3 per data centre => replication factor = 6

There is a power outage at the data centre in London so it is unavailable.

The user is connected to one of the nodes in the Winchester data centre and attempts to do a write at consistency 1. This write succeeds and the data is replicated among 3 nodes in Winchester. However, because the London data centre is unavailable, the data cannot be replicated across 3 nodes in London. Instead, the coordinator stores a hint and remembers that this data needs to be synced in London. If the London data centre becomes available within 3 hours, the coordinators with stored hints will push to the recovered nodes to bring them up to date.

If it takes longer than 3 hours for nodes to recover, then this self healing is not possible and instead we need to run a repair on nodes. This means that the recovered nodes go round asking other nodes for information to bring their own state up to date.

Of course, this three hour limit is configurable in Cassandra.


Okay so that was fun. I've only just touched the surface with Cassandra. It seems to me that theoretically you could almost configure it to run like a relational database, though apparently its query language (CQL) is less powerful than SQL in that it is missing joins and stuff. Of course replication has its pros and cons (expense vs safety) but it seems like a pretty clever little application and I look forward to learning more about it. :D


Overview of Http Requests

Get

  • Get requests are used to retrieve data from the server
  • Query parameters are sent as part of the url (www.blah.com?name=alexandra). However, browsers and servers typically impose a maximum url length (around 2048 characters), so the amount of data that can be sent in the url is limited.
  • Get requests can be bookmarked. This is because a bookmark is just a url and the url of a get request contains all the data required for the server to return the same data again. 


Post

  • Post is used to send data to the server. Because post requests are not idempotent (ie they can modify server state), post requests are generally not cached.
  • Query parameters are sent in the message body rather than in the url
  • Post requests cannot be bookmarked. As we read earlier, a bookmark is just a url, and the url of a post request doesn't contain the message body, so if you try to revisit it the server won't know how to respond.
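
As a rough illustration of where the parameters live in each case (using HttpURLConnection and the placeholder url from the example above):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class GetVsPostExample {
    public static void main(String[] args) throws Exception {
        String params = "name=" + URLEncoder.encode("alexandra", "UTF-8");

        // GET: the parameters are part of the url itself
        HttpURLConnection get =
                (HttpURLConnection) new URL("http://www.blah.com?" + params).openConnection();
        get.setRequestMethod("GET");
        System.out.println("GET status: " + get.getResponseCode());

        // POST: the url stays clean and the parameters travel in the message body
        HttpURLConnection post =
                (HttpURLConnection) new URL("http://www.blah.com").openConnection();
        post.setRequestMethod("POST");
        post.setDoOutput(true);
        try (OutputStream body = post.getOutputStream()) {
            body.write(params.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("POST status: " + post.getResponseCode());
    }
}
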
Head
  • This is very similar to a get request but only the headers are returned, not the message body. 
Head request example usage: This would be useful for client side caching. The client issues a get request initially and caches the response. Then, if the client wants to request the same thing again, it can first issue a head request to get the last modification date of the response to determine whether or not the cached data is up to date. 
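
A sketch of that caching check (the url and the cached timestamp are placeholders):

import java.net.HttpURLConnection;
import java.net.URL;

public class HeadRequestExample {
    public static void main(String[] args) throws Exception {
        long cachedLastModified = 0L; // timestamp stored alongside our cached response

        HttpURLConnection head =
                (HttpURLConnection) new URL("http://www.blah.com").openConnection();
        head.setRequestMethod("HEAD");                    // headers only, no message body

        long serverLastModified = head.getLastModified(); // parses the Last-Modified header
        if (serverLastModified > cachedLastModified) {
            System.out.println("Cache is stale - issue a full GET to refresh it");
        } else {
            System.out.println("Cached copy is still up to date");
        }
    }
}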


Put
  • Put is used for 'create' operations.
  • Post can be used to 'create' on the server, but the difference between them is that put is idempotent. Put requests either create or overwrite existing resources. 

Scala with Cats: Answers to revision questions

I'm studying the 'Scala with Cats' book. I want the information to stick so I am applying a technique from 'Ultralearning...