How is HashMap in Java implemented internally?

HashMap has always been a hot topic of discussion in professional circles, and it is a must-know subject before attending any Java interview.

Let's try to understand the internal implementation of HashMap in depth.
A HashMap basically maintains an array of buckets. Each bucket is a singly linked list of key-value pairs encapsulated as Entry objects. This array of buckets is called the table. Each node of the linked list is an instance of a private class called Entry:

transient Entry[] table;

Entry is a private static class inside HashMap which implements Map.Entry:

private static class Entry<K,V> implements Map.Entry<K,V> {
     final K key;          // the key never changes once the entry is created
     final int hash;       // cached (re)hashed value of the key
     V value;              // may be overwritten by a later put() with the same key
     Entry<K,V> next;      // next entry in the same bucket's chain
}

Each Entry object exists for one particular key, but its value may change (if the same key is reinserted later with a different value); hence key is final while value is not.

Each Entry object has a field called next which points to the next Entry thus behaving like a singly linked list.

The hash field stores the hashed value of the key

constructors :

HashMap provides overloaded constructors with parameters for initial capacity and load factor, but the no-args constructor is the one most frequently used.

default values of these fields are :
initial capacity : 1 << 4 (ie 16)
load factor : 0.75
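
For example, all of these are valid ways to construct a HashMap; the defaults apply wherever a parameter is omitted:

[code]Map<String, Integer> a = new HashMap<String, Integer>();          // capacity 16, load factor 0.75
Map<String, Integer> b = new HashMap<String, Integer>(64);        // capacity 64, load factor 0.75
Map<String, Integer> c = new HashMap<String, Integer>(64, 0.5f);  // capacity 64, load factor 0.5[/code]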

Whenever the element count of the HashMap reaches the load-factor fraction of the capacity (the threshold), the map is resized and the capacity is doubled.

If the capacity provided by the client is a power of 2, the real capacity will be the same as the provided capacity;
otherwise the real capacity is the nearest power of 2 greater than the provided capacity.

The maximum capacity is 1 << 30 (ie 2^30); if the capacity provided is greater than that, the real capacity becomes 2^30.

Note that capacity indicates the size of the table array (the array of buckets), not the number of key-value pairs the HashMap can hold.
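
To make the rounding concrete, here is a minimal sketch (roundUpToPowerOfTwo is a hypothetical helper name; the JDK 6 constructor inlines this logic):

[code]static int roundUpToPowerOfTwo(int initialCapacity) {
    if (initialCapacity > MAXIMUM_CAPACITY)   // MAXIMUM_CAPACITY = 1 << 30
        initialCapacity = MAXIMUM_CAPACITY;
    int capacity = 1;
    while (capacity < initialCapacity)
        capacity <<= 1;   // keep doubling: 1, 2, 4, ... until >= initialCapacity
    return capacity;
}[/code]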

Fail-fast iterator :

The HashMap specification requires that iterators throw ConcurrentModificationException if the map contents are changed while a client is iterating.

This is done by keeping track of the number of modifications. HashMap has an int member named modCount which is incremented every time the map is structurally altered (any invocation of put(), remove(), putAll() or clear()).

A similar field (let's call it iteratorModCount) is maintained in the iterator implementation class. When the iterator is created, iteratorModCount is initialized with the same value as the HashMap's modCount. On every call to next() or remove(), iteratorModCount is checked against the HashMap's modCount. If the two values don't match, the HashMap has been modified and ConcurrentModificationException is thrown. (hasNext() itself simply checks whether a next entry exists.)

When the remove() method of the iterator is invoked, after internally calling remove() on the map, iteratorModCount is reinitialized with the HashMap's modCount, so the iterator's own removal does not trigger the exception.
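
The fail-fast behaviour is easy to reproduce; here is a small self-contained example (class and variable names are arbitrary):

[code]import java.util.HashMap;
import java.util.Map;

public class FailFastDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        map.put("a", 1);
        map.put("b", 2);
        for (String key : map.keySet()) {
            map.put("c", 3);  // structural modification increments modCount;
                              // the iterator's next call to next() detects the
                              // mismatch and throws ConcurrentModificationException
        }
    }
}[/code]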

Note: the HashMap's modCount and the iterator's modCount fields are neither atomic nor volatile, hence they are vulnerable to interleaving and are not guaranteed to work in a multithreaded context.

collections representing keySet and values :

The HashMap specification requires that keySet(), values() and entrySet() complete with O(1) space complexity, which basically means HashMap can't copy the data into a new collection and return it. Rather, these collections have to be views pointing to the same location in memory as the actual HashMap contents.

This is done by maintaining non-static inner classes called KeySet, Values and EntrySet; KeySet and EntrySet extend AbstractSet while Values extends AbstractCollection. By virtue of being non-static, they implicitly carry a reference to the outer class object (HashMap.this):

    private final class KeySet extends AbstractSet<K> {
        public Iterator<K> iterator() {
            return newKeyIterator();
        }
        public int size() {
            return size;
        }
        public boolean contains(Object o) {
            return containsKey(o);
        }
        public boolean remove(Object o) {
            return HashMap.this.removeEntryForKey(o) != null;
        }
        public void clear() {
            HashMap.this.clear();
        }
    }

    private final class Values extends AbstractCollection<V> {
        public Iterator<V> iterator() {
            return newValueIterator();
        }
        public int size() {
            return size;
        }
        public boolean contains(Object o) {
            return containsValue(o);
        }
        public void clear() {
            HashMap.this.clear();
        }
    }

As we see above, special private classes have been created which internally invoke methods of the outer class.
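
A quick sketch of this view behaviour in action (names are arbitrary):

[code]import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ViewDemo {
    public static void main(String[] args) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        map.put("a", 1);

        Set<String> keys = map.keySet();           // a live view, not a copy
        map.put("b", 2);
        System.out.println(keys.size());           // 2 : the view sees the new mapping

        keys.remove("a");                          // delegates to the map itself
        System.out.println(map.containsKey("a"));  // false
    }
}[/code]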

For the iterators of these classes, a special iterator is created which walks over the buckets and entries. The iterator base class HashIterator is again a non-static inner class and thus holds a reference to the outer HashMap through HashMap.this:

private abstract class HashIterator<E> implements Iterator<E> {
    Entry<K,V> next;        // next entry to return
    int expectedModCount;   // for fail-fast
    int index;              // current slot
    Entry<K,V> current;     // current entry
}

Clearly HashIterator needs to keep track of the following:
1. the current entry
2. the index of the current bucket
3. the next entry
4. the expected mod count (for fail-fast)

At every call to next(), current is set to next, and the new next (as well as the bucket index) is computed.
hasNext() returns false when next is null.
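
The advance logic lives in a nextEntry() helper; this is roughly what it looks like in the JDK 6 source:

[code]final Entry<K,V> nextEntry() {
    if (modCount != expectedModCount)            // fail-fast check
        throw new ConcurrentModificationException();
    Entry<K,V> e = next;
    if (e == null)
        throw new NoSuchElementException();

    if ((next = e.next) == null) {               // end of this bucket's chain:
        Entry[] t = table;
        while (index < t.length && (next = t[index++]) == null)
            ;                                    // scan ahead to the next non-empty bucket
    }
    current = e;
    return e;
}[/code]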

For iterating over key, value and entry objects, multiple implementations of HashIterator are provided

[code]private final class ValueIterator extends HashIterator<V> {
        public V next() {
            return nextEntry().value;
        }
    }

    private final class KeyIterator extends HashIterator<K> {
        public K next() {
            return nextEntry().getKey();
        }
    }

    private final class EntryIterator extends HashIterator<Map.Entry<K,V>> {
        public Map.Entry<K,V> next() {
            return nextEntry();
        }
    }[/code]


rehashing :

As a protection against poorly written hash functions, HashMap rehashes the hashcode of every key

In HashMap the decision regarding which bucket a key should go to is made on the basis of this :

[code]static int indexFor(int hash, int capacity) {
    return hash & (capacity - 1);
}[/code]

Note that capacity is always a power of 2 (ie 1 followed by a sequence of zeroes in binary). So (capacity - 1) is a sequence of 1's in binary. So if the capacity is 2^n, then only the lower n bits of hash are useful and the upper bits beyond that are ignored
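
A quick worked example with the default capacity of 16:

[code]int capacity = 16;                  // 0b10000, always a power of two
int hash = 0b11010110;              // an example (re)hashed key: 214
int index = hash & (capacity - 1);  // 214 & 0b1111 = 0b0110 = 6
// only the lowest 4 bits of the hash pick the bucket; everything above is ignored[/code]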

Tests show that too many hash functions are not random enough in their lower bits, and that many hashmaps are not large enough to ever use the higher bits

So, as a guard against poorly written hash functions, the hashcodes of the keys are rehashed
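
For reference, this is what the supplemental hash function looks like in the JDK 6 source; it XORs the higher bits down into the lower ones so that they influence the bucket index:

[code]static int hash(int h) {
    // spread higher bits downward so that maps with small tables
    // still see the effect of the upper bits of hashCode()
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}[/code]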

dynamic resizing :

As the number of elements in the map grows, the hash chains grow longer and retrieval time increases. At some point it makes sense to increase the number of buckets and rehash the values. Remember that finding the correct bucket takes O(1) time while finding the correct entry within a bucket takes time linear in the length of the chain. So it makes sense to have a larger number of buckets instead of a few crowded ones.

HashMap always doubles the capacity during resizing. It creates a new bucket array of twice the size and copies the data from the old array to the new one. Then the table variable is reinitialized with the new array:

[code]void resize(int newCapacity) {
    Entry[] oldTable = table;
    int oldCapacity = oldTable.length;
    if (oldCapacity == MAXIMUM_CAPACITY) {
        threshold = Integer.MAX_VALUE;
        return;
    }

    Entry[] newTable = new Entry[newCapacity];
    transfer(newTable, initHashSeedAsNeeded(newCapacity));
    table = newTable;
}[/code]

During the transfer from the old array to the new array, it iterates over the elements of the old array; for every Entry element it computes the new bucket index and sets that element's next pointer to the current head of the bucket (ie it adds elements to the head of the singly linked list). As a result, the order of the linked list elements gets reversed.
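
A simplified sketch of that transfer step (modeled on the JDK source, with the rehashing details omitted):

[code]void transfer(Entry[] newTable) {
    int newCapacity = newTable.length;
    for (Entry<K,V> e : table) {          // walk every old bucket
        while (e != null) {
            Entry<K,V> next = e.next;     // remember the rest of the old chain
            int i = indexFor(e.hash, newCapacity);
            e.next = newTable[i];         // link in front of the new bucket's head...
            newTable[i] = e;              // ...making this entry the new head
            e = next;
        }
    }
}[/code]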

Suppose the original linked list was 1->2->3
Let's assume that after resizing, every element again goes into the same bucket
So, after the first iteration: 1->null
after the second iteration: 2->1->null
after the third iteration: 3->2->1->null, and thus the linked list gets reversed

This creates a potential race condition during resizing. Suppose two threads decide in parallel that the HashMap needs resizing and both try to resize it; this may lead to an infinite loop.

Of course the user had no business using HashMap in a multithreaded context to begin with

put operation :

The put operation performs the following steps (see the sketch after this list):
1. calculate the hashcode of the key
2. rehash it; let's call the rehashed result h
3. calculate the bucket index as h & (capacity - 1)
4. iterate over the bucket and compare the key with each existing key using equals()
5. if the key already exists, change the value of that Entry object
6. else create a new Entry object and add it to the head of the linked list
7. increment modCount
8. resize if necessary
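
Putting those steps together, here is a simplified sketch of the JDK 6/7 style put(); the null-key special case is omitted, and addEntry() is the internal helper that creates the new head Entry and triggers the resize:

[code]public V put(K key, V value) {
    int h = hash(key.hashCode());                // steps 1-2: hash, then rehash
    int i = indexFor(h, table.length);           // step 3: bucket index
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {   // step 4
        if (e.hash == h && (e.key == key || key.equals(e.key))) {
            V old = e.value;                     // step 5: key exists,
            e.value = value;                     // overwrite the value
            return old;
        }
    }
    modCount++;                                  // step 7
    addEntry(h, key, value, i);                  // steps 6 and 8
    return null;
}[/code]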

get operation :

The get operation performs the following steps (see the sketch after this list):
1. calculate the hashcode of the key
2. rehash it; let's call the rehashed result h
3. calculate the bucket index as h & (capacity - 1)
4. iterate over the bucket and compare the key with each existing key using equals()
5. if the key exists, return the corresponding value from the Entry object
6. else return null
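
Again as a simplified sketch of the JDK 6/7 style code (null-key handling omitted):

[code]public V get(Object key) {
    int h = hash(key.hashCode());                // steps 1-2
    int i = indexFor(h, table.length);           // step 3
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {   // step 4
        if (e.hash == h && (e.key == key || key.equals(e.key)))
            return e.value;                      // step 5
    }
    return null;                                 // step 6: key not present
}[/code]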