Parallel Hash Tables - Final Report
By Manu Garg, Andrew Butko
We implemented concurrent hash tables that use coarse grained/fine grained locks. We also implemented a lock-free version of hash tables. Given the speedup of our implementations we demonstrate that lock-free is faster than fine-grained which in turn is faster than coarse-grained in most scenarios.
A hashtable is a datastructure that maps keys to values and allows fast accesses to the data.
More on hash tables: hash tables. The operations on a hashtable include
searching a key, inserting a key, value pair and deleting a key from the table. We implemented the hashtable with separate chaining
i.e. a linked list per bucket (Fig.1)
Although none of the individual operations are computationaly expensive (roughly O(1) average access time per op for a good hash table implementation), they are highly parallelizable, especially when contention isn't high because at that point the different operations are accessing different buckets and you can perform them in parallel without any dependencies.
We used c++ for writing the code and GHC machines (specifically GHC26) for testing. The operations in each of the following implementations can be called in parallel by separate threads. We started with a basic working sequential version and built upon that further to make different parallelized versions of the same.
Coarse-grained: In this implementation we used a global pthread read-write lock for the table. On insert/delete we acquire a write lock on the table and then we insert/remove the key, value pair in the corresponding hashed bucket whereas in search we acquired a read lock on the table and then we try to find the entry in the corresponding hashed bucket. This allows multiple readers to read the table but only one writer for the table. On resize which happens in insert when the load factor hits a certain threshold we acquire a write lock on the table and then create a new bucket array and copy over the contents of the original table, rehashing them to their new locations. This approach could potentially lead to starvation since it provides no fairness guarantees
Fine-grained: In this implementation we used an array of pthread read-write locks, one per bucket of the table. Although this approach allows parallelization at bucket level as well, we found it to be slower than using mutex locks, one per bucket of the table. This is because the overhead in the acquisition of a pthread read-write lock destroyed any potential gain obtained from parallelization in this way at a single bucket level. This approach could also lead to starvation.
Each operation uses a common function called search. This function takes in a search key and return references to two nodes left node, right node for that key (left node's key < search key < right node's key) and both nodes are unmarked which means that they haven't been deleted from the table (checked through a mark bit in each node's next pointer). Finally the right node must be the immediate successor of the left node. This function is divided into three phases. The first phase iterates across the corresponding bucket to find the first unmarked node whose key >= search key. This is our right node. The first phase also finds the last unmarked left node. The second phase checks if the right node is the immediate successor of the left node, if that is the case and the right node isn't marked then we return else we call search again. The last phase uses an atomic CAS operation to remove all the unmarked nodes between left node and right node.
put: After finding the corresponding bucket where the key should be stored, we call search on the bucket to get the left node and the right node in between which the key should be inserted. If the key is already stored in the bucket (i.e. in the right node) then we do an atomic exchange of the old value with the new value. Otherwise we do a single atomic CAS operation to insert a new node with the input key, value pair in between the left node and the right node
remove: After finding the corresponding bucket where the key should be removed from, we call search on the bucket to locate the node to delete and then use a two-phase process to perform the deletion. First, the node is logically deleted by marking the reference in the node's next field. Next, the node is physically deleted by attempting a single atomic CAS operation to swing left node's next to right node's next, if this doesn't work then the deletion takes places within an invocation of search (It's 3rd phase does the trick)
get: After finding the corresponding bucket where the key should be located, we call search on the bucket to get the node where the key might be present.
Lock-free Limitations: Although the remove operation removes the key from the hashtable, it doesn't delete the node. This won't be an issue if C++ had a garbage collector, but without one the code above leaks memory. A fix to that would be to use hazard pointers which we tried to implement and use but weren't successful in doing so given our situation and the lack of time. Another limitation of lock-free is that we don't resize, this won't be an issue if we expect the number of removes to be of the order of the number of inserts.
For a more detailed version of the above mentioned operations as well as an attempt at hazard pointers which could be further built upon in the future refer to the code on github or the research paper which the lock-free implementation is based on.
Methodology: Our testing methodology is based on the performance evaluation in Maged Michael's paper "High Performance Dynamic Lock-Free Hash Tables and List-Based Sets" (listed below). We test the effects of varying the number threads, the range of keys used, the initial size of the table, and the ratio of different operations used on performance. Performance is measured as average CPU time per operation in microseconds (for a given experiment, this is simply the CPU time of the experiment divided by the number of threads). In every experiment, we fix the number of threads. Each thread performs 1 million operations on randomly chosen keys (using std::uniform_int_distribution) with a predetermined ratio of put/delete/get operations. Before measuring these operations, we prepopulate the hash table based on a chosen load factor. For load factor a and table size m, we initially add a * m elements and restrict the range of elements from 0 to 2am. This was chosen based on the results Michael published to have a benchmark reference. Below, we compare our results (left) with Michael's (right). Adjacent images have the same features - a fixed average load factor as described above, the same relative proportion of operations, and a table size of 100. A discussion of the results follows below the figures.
Systems: First, we consider the platform used for the experiment and how this may impact the performance. Michael's paper states a 4-processor (375 MHz 604e) IBM PowerPC multiprocessor was used in his experiments. The specifications state that the 604e is a superscalar processor capable of issuing 4 instructions simultaneously. Given this and that he uses a 4-way multiprocessor, the choice to measure a maximum of 16 execution threads is reasonable. Our experiment mirrors this decision, though the architecture is much different. Our experiments were run on the GHC machines (specifically, ghc26). Because we were testing on a shared computing resource without direct control over the scheduling of pthreads or other users' processes, we ran each trial several times and took the minumum timing reasoning this will highlight the maximum performance of the different implementations. The processor on the Gates machine is an Intel Xeon W3670. It is a 6 core processor with 12 threads at 3.2 GHz. We still test up to 16 threads for a more accurate comparison with Michael's results. Because of the vast differences in processor architecture, any strictly numerical comparisons between the sets of results is not extremely useful. However, we hope to notice similar trends in comparing the two experiments.
Analysis: First, we touch on Michael's findings as shown above. He found that his algorithm improved over Harris's when considering memory management (without, the performance was nearly identical). With memory management, Harris's performed worse than the lock-based implementations. He also tested both reader-writer locks and mutex locks and noted reader-writer locks are better with a higher percentage of search operations.
Now we look at our results. Generally, the lock-free implementation performs better than the fine-grain locking, which in turn is better than the coarse-grain locking. This separation is more pronounced when we have a higher proportion of insert and delete operations as we would expect. The average operation in the coarse-grain implementation takes around 2-3x as long compared to fine-grain and lock-free, while the difference in the latter case is much less pronounced. As we increase the average load factor, we find the overall performance degrades (as would be expected), while the difference between fine-grain and lock-free almost disappears. Michael's paper did not consider resizing schemes (which we then also did not include in this analysis) which would be an good place to look to in order to improve performance. Across all experiments, we see the performance of coarse-grain decline up to 8 or 12 threads and then actually improve with 16. We suspect this may be due to some kind of latency hiding because the processor has 12 threads. The fine-grain and lock-free converge between 0.25 and 0.3 microseconds per operation across all experiments. This suggests our program is compute-bound. To confirm this, future tests could either explore a machine with a higher processor speed or with more available threads to get a more accurate picture of how these particular implementations scale with respect to the number of threads.
We can compare Michael's harris-no-mem-mng plot with our lock-free and his multi-mutex with our fine-grain. Between the sets of experiments, these are the most similar implementations and we would expect similar trends in both. There is a similar plateau in the performance as the number of threads scales looking at the first case, again suggesting a compute-bound program. In both cases we also see the overall performance decline as the average load factor increases (again, not unexpected). Our testing showed the performance of the fine-grain locking and lock-free to be much closer than that observed by Michael, but we have no suggestions to resolve this difference. We don't observe the "peak" effected noted above in the coarse-grain scheme, though Michael did not publish any results for this implementation so we have no benchmark to compare this to.
WORK BY EACH STUDENT.
Equal work was performed by both project members.