Sunday 8 June 2014

Java Hash Maps

A Java hash map can basically be thought of as an array of linked lists.
Imagine a hash map backed by an array of size 8. We describe this as being a hash map with 8 'buckets'. Each bucket contains a linked list, some of which contain elements and some of which are empty (eg bucket number 1 contains a linked list of 3 elements).
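To make that structure concrete, here is a minimal sketch. The class and field names are illustrative, not the real java.util.HashMap internals, though the real implementation does chain entries through a similar Node<K, V> type:

    class SimpleHashMap<K, V> {

        // One entry in a bucket's linked list.
        static class Node<K, V> {
            final K key;
            V value;
            Node<K, V> next;   // the next entry that hashed to the same bucket

            Node(K key, V value, Node<K, V> next) {
                this.key = key;
                this.value = value;
                this.next = next;
            }
        }

        // The array of buckets; each slot holds the head of a linked list
        // (or null if the bucket is empty). Size 8 to match the example.
        @SuppressWarnings("unchecked")
        Node<K, V>[] buckets = (Node<K, V>[]) new Node[8];
    }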

Storing

When you call hashMap.put(key, value) the following occurs:

  • hash the key and take the hash modulo the number of buckets to identify which bucket you should insert into
  • add the key and value at the end of the linked list in that bucket
Note that collisions are resolved by storing the key and value in a linked list rather than overwriting entries with the same hash code. This mechanism is referred to as separate chaining (see the sketch below).
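Here is a rough sketch of put under separate chaining, continuing the illustrative SimpleHashMap above. The real java.util.HashMap also spreads the hash bits and, since Java 8, can convert long chains to trees; all of that is omitted here:

    // Illustrative put: find the bucket, overwrite if the key already
    // exists, otherwise add a new node to the bucket's list. For brevity
    // this sketch prepends at the head rather than appending at the tail;
    // either way a collision just makes the list one node longer.
    public V put(K key, V value) {
        int bucket = Math.abs(key.hashCode() % buckets.length);

        for (Node<K, V> n = buckets[bucket]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                V old = n.value;
                n.value = value;   // same key: replace the value
                return old;
            }
        }

        buckets[bucket] = new Node<>(key, value, buckets[bucket]);
        return null;
    }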

Retrieving

  • hash the key to identify the bucket
  • go to that bucket and iterate through the linked list. You must check for key equality in case two keys have hashed to the same value. 
Clearly if the linked list in the bucket is very long, the performance is going to be very slow. For example, if you store n values in your hashmap and they all hash to the same bucket, then your retrieval will perform in O(n), which is clearly not ideal. Therefore, it is important to have a good hash code that minimises collisions.
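A matching sketch of get, again for the illustrative SimpleHashMap rather than the real implementation:

    // Illustrative get: hash to the bucket, then walk the list checking
    // key equality, because several different keys may share the bucket.
    public V get(K key) {
        int bucket = Math.abs(key.hashCode() % buckets.length);
        for (Node<K, V> n = buckets[bucket]; n != null; n = n.next) {
            if (n.key.equals(key)) {
                return n.value;   // found the matching key
            }
        }
        return null;   // no entry with this key in the bucket's list
    }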

Rehashing

By default, a hash map in Java has 16 buckets and a load factor of 0.75. This means that when the number of elements in the hashmap exceeds capacity * load factor, the entire hash map is copied to a new, larger array (twice the current size) and every element in the hash map must be rehashed into a bucket in the new array. This process is expensive, so you can improve the performance of your hash map by setting the initial capacity to the number of values you expect to hold in your hashmap. Of course, this isn't always easy to predict but you can usually make a rough guess (eg 10 objects versus 1 million).

You don't want to make your initial capacity too big though; if you are expecting 500 elements and you set your hashmap's initial capacity to 100,000, then you are creating an array of 100,000 slots. That memory is allocated upfront, which means nothing else can use it and most of it is essentially wasted.
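For example, using the real java.util.HashMap constructor that takes an initial capacity (the sizes below are just for illustration):

    import java.util.HashMap;
    import java.util.Map;

    public class PresizedMapExample {
        public static void main(String[] args) {
            // Default construction: 16 buckets, load factor 0.75, so the
            // first resize happens once the map grows past 12 entries.
            Map<String, Integer> defaults = new HashMap<>();

            // Expecting ~500 entries: request enough capacity that
            // 500 < capacity * 0.75, so no rehashing happens on the way.
            int expected = 500;
            Map<String, Integer> presized = new HashMap<>((int) (expected / 0.75) + 1);

            for (int i = 0; i < expected; i++) {
                presized.put("key" + i, i);
            }
        }
    }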

Alternative hash map implementations

Another way of implementing hash maps is to use 'open addressing' collision resolution, where you create an array of buckets but fix the bucket size at 1. Then if you hash a key and its bucket is already occupied, you keep moving forward through the array of buckets until you find the next empty bucket to insert into. This probing strategy is known as linear probing.

This is clearly going to be very inefficient if your hashmap is quite full.
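Here is a minimal sketch of that scheme, open addressing with linear probing. It is illustrative only; java.util.HashMap does not work this way, and the sketch assumes the table never fills up completely:

    class ProbingMap<K, V> {
        private final Object[] keys = new Object[16];
        private final Object[] values = new Object[16];

        public void put(K key, V value) {
            int i = Math.abs(key.hashCode() % keys.length);
            // Scan forward (wrapping around) until we find an empty slot
            // or the same key. Assumes the table is never completely full,
            // otherwise this would loop forever.
            while (keys[i] != null && !keys[i].equals(key)) {
                i = (i + 1) % keys.length;
            }
            keys[i] = key;
            values[i] = value;
        }

        @SuppressWarnings("unchecked")
        public V get(K key) {
            int i = Math.abs(key.hashCode() % keys.length);
            // Probe forward until we find the key or hit an empty slot,
            // which proves the key was never inserted.
            while (keys[i] != null) {
                if (keys[i].equals(key)) {
                    return (V) values[i];
                }
                i = (i + 1) % keys.length;
            }
            return null;
        }
    }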

The advantage comes when you have n keys with the same hash code. In a Java hash map you end up with a bucket containing a linked list of n values, so the performance of a get is O(n). But upfront you had to create an array with memory usage m, so your total memory usage is m + n.

However, if you use this open addressing implementation, the get method still performs in O(n) but your memory usage is only m.



