When one thinks about the critical components of a data science pipeline, one is likely to consider modeling APIs or visualization libraries; the lowly hashtable is unlikely to be foremost in mind. A painful fact of data science is that the majority of the effort in a typical project (reportedly, 80%) is spent sourcing, cleaning, and preparing the data. The remaining “20%” is actual data analysis. One of the most useful data structures available to data scientists for that work is the hash table (also known as an associative array).
The hash table is a veritable Swiss Army Chainsaw. These data structures allow the user to take an arbitrary value, such as a string, a complex object, or a data frame, and use it as a lookup key to find another value. Common uses for hashtables during the cleaning and preparing phase of a data science project include feature engineering (for example, keeping a count of how many times each distinct value appears in a stream), normalization, and building simple histograms.
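As a concrete illustration of the counting use case, base R’s environments (created with new.env(hash = TRUE)) already behave like string-keyed hash tables. The snippet below is a minimal sketch with a made-up token stream:

```r
# Minimal sketch of stream counting with a base R environment,
# which provides hashed, string-keyed lookup.
counts <- new.env(hash = TRUE)
stream <- c("a", "b", "a", "c", "a", "b")  # made-up token stream

for (token in stream) {
  if (exists(token, envir = counts, inherits = FALSE)) {
    counts[[token]] <- counts[[token]] + 1L
  } else {
    counts[[token]] <- 1L
  }
}

counts[["a"]]  # 3
counts[["b"]]  # 2
```

This works well when the keys are strings; the limitation, as we’ll see below, is what happens when the keys are complex objects.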
R’s lack of a native (and performant) hashtable
To the surprise of many who come to the R programming language from other backgrounds, R’s support for hash tables has historically been lacking. The list() data structure that R provides does allow this kind of associative mapping; however, Stack Overflow and similar forums carry numerous warnings about the performance implications of using and growing lists. In this short blog post, we provide a simple benchmark of four approaches to managing an associative hashtable in R: the native list(), Michael Kuhn’s dict package, Christopher Brown’s hash package, and Neal Fultz’s ht package.
One of the first differences between the packages is the ability to store complex keys. For example, you may want to map a data.frame to a particular value. The following code demonstrates a simple example:
[code]library(dict)  #Michael Kuhn's dict package
library(hash)  #Christopher Brown's hash package
library(ht)    #Neal Fultz's ht package

mykey <- data.frame(product="domino", value="awesome")

mydict <- dict()
mylist <- list()
myht <- ht()
myhash <- hash()

myht[mykey] <- "something" #will succeed
myhash[mykey] <- "something" #will succeed[/code] The native R list and the dict package will not allow you to store an entry in the hashtable with a complex key such as a data.frame. They will return the following error:
[code]> mykey <- data.frame(product="domino", value="awesome")
> mydict[mykey] <- "something" #will fail
Error in mydict[mykey] <- "something" : invalid subscript type 'list'
> mylist <- list()
> mylist[mykey] <- "something" #will fail
Error in mylist[mykey] <- "something" : invalid subscript type 'list'[/code] The hash package does not raise the same error; however, when one looks beneath the surface, the package stores the key incorrectly and will not actually let you look the value up with the original key:
[code]> myhash[mykey] <- "something" #will succeed
> myhash[data.frame(product="domino", value="awesome")]
Error in get(k, x) : invalid first argument
> myhash[["1"]]
[1] "something"[/code] For some reason, the data.frame that was the key was silently stored as the number 1. This could create some difficult-to-track bugs if the user is caught unaware.
The ht package uses a digest under the hood, and allows you to look up the value both with the original key, or with an object that generates the same digest value:
[code]> myht[mykey] <- "something" #will succeed
> myht[data.frame(product="domino", value="awesome")]
[1] "something"[/code] If you need to store arbitrarily complex objects as keys in a hash, the ht package provides the greatest functionality.
Performance of hashtable packages
We profiled the runtime performance of these different hashtable implementations. We used the excellent microbenchmark package to perform ten runs of each experiment. In the experiments, we insert from 10,000 to ~250,000 unique keys into each hash, in increments of 10,000. microbenchmark provides us with a data structure containing minimum, maximum, mean, and median values for each of the operations, after first warming up the system by running the benchmarked code twice (to get around cold-cache issues, for example). When we look at mean values, the results are clear:
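A stripped-down version of the harness might look like the following. The key count and the two implementations shown are illustrative, not the exact benchmark code:

```r
library(microbenchmark)  # timing harness
library(hash)            # one of the benchmarked packages

n_keys <- 10000                     # the real sweep went from 10k to ~250k
keys <- as.character(seq_len(n_keys))

res <- microbenchmark(
  hash = { h <- hash(); for (k in keys) h[[k]] <- k },
  list = { l <- list(); for (k in keys) l[[k]] <- k },
  times = 10,                       # ten runs of each experiment
  control = list(warmup = 2)        # run twice first to warm caches
)

summary(res)  # min, lq, mean, median, uq, max per expression
```

Each expression builds its structure from scratch inside the timed block, so the measurement captures the full cost of growing the structure key by key.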
As the number of keys exceeds 100k, the runtimes begin to diverge. The ht and hash packages retain far lower runtimes than the other implementations. The performance curves of dict and list() are quite similar, implying they may work similarly under the hood. Regardless, the shape of their curves shows that as the number of keys grows, performance will suffer considerably.
When we plot the minimum, median, and maximum values, there are no surprises. The pattern we see above is repeated across all of the packages:
Another aspect of performance is the amount of memory that the data structure uses when storing data. We used R’s object.size() function to determine memory utilization. For reasons unknown to the author, the ht and hash packages returned a constant size, as illustrated by the two lines at the bottom of the chart.
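One plausible explanation (not verified against the packages’ sources here) is that both ht and hash keep their entries in an R environment, and object.size() does not include an environment’s contents in its calculation, so the wrapper object appears to be a small, constant size. The caveat is easy to demonstrate on a bare environment:

```r
e <- new.env(hash = TRUE)
empty_size <- object.size(e)

# Fill the environment with 10,000 entries.
for (i in 1:10000) e[[as.character(i)]] <- i

# object.size() reports the same size as before: an environment's
# contents are not included in the calculation.
object.size(e) == empty_size  # TRUE
```

If that explanation holds, the flat lines in the chart reflect a limitation of the measurement, not genuinely constant memory use.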
Looking at the chart, you’ll note that dict and list() both see significant increases in memory usage at two points in the experiment. The similarity of their respective memory-consumption and performance curves is a pretty strong hint that the dict package is implemented as a list()!