技术控

    今日:7| 主题:49136
收藏本版 (1)
最新软件应用技术尽在掌握

[其他] A Quick Benchmark of Hashtable Implementations in R

[复制链接]
fpwx0sOl 发表于 2016-10-19 17:41:37
121 5

立即注册CoLaBug.com会员,免费获得投稿人的专业资料,享用更多功能,玩转个人品牌!

您需要 登录 才可以下载或查看,没有帐号?立即注册

x
When one thinks about critical components of a data science pipeline, you are likely to consider modeling APIs or visualization libraries, but it’s unlikely that the lowly hashtable would be foremost on your mind. A painful fact of data science is that the majority of effort put into a typical project ( reportedly, 80% ) is spent sourcing, cleaning, and preparing the data. The remaining “20%” is actual data analysis. One of the most useful data structure available to data scientists is the hash table (also know as an associative array).
  The hash table is a veritable Swiss Army Chainsaw. These data structures allow the user to take an arbitrary value, such as a string, a complex object or a data frame, and use it as a lookup to find another value. Common uses for hashtable data structures in the cleaning and preparing data phase of a data science project are feature engineering (for example, keeping a count of how many times you’ve seen an individual value in a stream), normalization, or even creating simple histograms.
  R’s lack of a native (and performant) hashtable

   To the surprise of many who come to the R programming language from other backgrounds, R’s support for hash tables has been historically lacking. The list() data structure that R provides does allow for this associative mapping. There are, however, numerous warnings about potential performance implications of using and growing lists on Stack Overflow and similar forums. In this short blog post, we provide a simple benchmark of 4 approaches to managing an associative hashtable in R. We look at the native list() , Michael Kuhn’s  dict package  , Christopher Brown’s  hash package  , and Neal Futz’s  ht package  .
   One of the first differences between the packages is the ability to store complex keys. For example, you may want to map a data.frame to a particular value. The following code demonstrates a simple example:
  [code]mydict <- dict()
mylist <- list()
myht   <- ht()
myhash <- hash()

mykey <- data.frame(product="domino", value="awesome")

mydict[mykey] <- "something" #will fail
mylist[mykey] <- "something" #will fail

myht[mykey]   <- "something" #will succeed
myhash[mykey] <- "something" #will succeed[/code]   The native R list and the dict package will not allow you to store an entry in the hashtable with a complex key such as a data.frame. They will return the following error:
  [code]] mykey <- data.frame(product="domino", value="awesome")
] mydict[mykey] <- "something" #will fail
Error in mydict[mykey] <- "something" : invalid subscript type 'list'
] mylist <- list()
] mylist[mykey] <- "something" #will fail
Error in mylist[mykey] <- "something" : invalid subscript type 'list'[/code]   The hash package does not provide the same error, however when one looks underneath the surface, the package instead stores the key seemingly incorrectly, and would not actually allow you to look it up given the original values:
  [code]] myhash[mykey] <- "something" #will succeed
] myhash[mykey]
Error in get(k, x) : invalid first argument
] myhash[data.frame(product="domino", value="awesome")]
Error in get(k, x) : invalid first argument
] as.list(myhash)
$`1`
[1] "something"[/code]   For some reason, the data.frame which was the key was silently stored as the number 1 . This could create some difficult to track bugs if the user was caught unaware.
   The ht package uses a digest under the hood, and allows you to look up the value both with the original key, or with an object that generates the same digest value:
  [code]] myht[mykey]   <- "something" #will succeed
] myht[mykey]
[1] "something"
] myht[data.frame(product="domino", value="awesome")]
[1] "something"[/code]   If you need to store arbitrarily complex objects as keys in a hash, the ht package provides the greatest functionality.
  Performance of hashtable packages

   We profiled the runtime performance of these different hashtable implementations. We used the excellent microbenchmark package to perform ten runs of each experiment. In the experiments, we are inserting from 10,000 to ~250,000 unique keys in each hash, in increments of 10k. microbenchmark provides us with a data structure containing minimum, maximum, mean, and median values for each of the operations, after first warming up the system by running the benchmarked code twice (to get around cold cache issues, for example.) When we look at mean values, the results are clear:
   
A Quick Benchmark of Hashtable Implementations in R-1 (available,structure,critical,complex,painful)

   As the number of keys exceeds 100k, the runtimes begin to diverge. The ht and hash packages retain far lower runtimes than the other hashtables. The performance curves of dict and the list() implementations are quite similar, implying they may work similarly under the hood. Regardless, the curve of their performance curve shows that as that number grows performance is going to suffer considerably.
  When we plot the minimum, median, and maximum values, there are no surprises. The pattern we see above is repeated across all of the packages:

A Quick Benchmark of Hashtable Implementations in R-2 (available,structure,critical,complex,painful)

   Another aspect of performance is the amount of memory that the data structure uses when storing data. We used R’s object.size() function to determine memory utilization. For reasons unknown to the author, the ht and hash packages returned a constant size, as illustrated by the two lines at the bottom of the chart.
   Looking at the chart, you’ll note that dict and list() both see significant increases in memory usage at two points in the experiment. The similarity of their respective memory consumption and performance curves is a pretty strong hint that the dict package is implemented as a list() !
12下一页
友荐云推荐




上一篇:深度重构UIViewController
下一篇:江湖微分销系统:超级卖货神器震撼来袭 开创微信营销新篇章! ...
酷辣虫提示酷辣虫禁止发表任何与中华人民共和国法律有抵触的内容!所有内容由用户发布,并不代表酷辣虫的观点,酷辣虫无法对用户发布内容真实性提供任何的保证,请自行验证并承担风险与后果。如您有版权、违规等问题,请通过"联系我们"或"违规举报"告知我们处理。

颗心满装着 发表于 2016-10-19 19:01:39
好,很好,非常好!
回复 支持 反对

使用道具 举报

socooler 发表于 2016-11-7 11:00:04
楼主辛苦了,鼓励一下
回复 支持 反对

使用道具 举报

蓝色沸点 发表于 2016-11-21 09:05:11
我是个凑数的。。。
回复 支持 反对

使用道具 举报

侯金翠 发表于 2016-11-21 11:59:52
楼主不哭,我帮你顶!
回复 支持 反对

使用道具 举报

rtooj 发表于 2016-11-21 18:56:44
顶起!
回复 支持 反对

使用道具 举报

*滑动验证:
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

我要投稿

推荐阅读

扫码访问 @iTTTTT瑞翔 的微博
回页顶回复上一篇下一篇回列表手机版
手机版/CoLaBug.com ( 粤ICP备05003221号 | 文网文[2010]257号 )|网站地图 酷辣虫

© 2001-2016 Comsenz Inc. Design: Dean. DiscuzFans.

返回顶部 返回列表