Process Small Files on Hadoop using CombineFileInputFormat (2)

Storage Architecture 2013-09-23

Following up on the previous article, in this post I ran several benchmarks and tuned the performance from 3 hours 34 minutes down to 6 minutes 8 seconds!

Original job without any tuning

  • job_201308111005_0317
  • NumTasks: 9790
  • Reuse JVM: false
  • mean complete time: 9-Sep-2013 10:08:47 (17sec)
  • Finished in: 3hrs, 34mins, 26sec

We had 9790 files to process, with a total size of 53 GB. Note that on average each task still took 17 seconds to process its file.
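
For reference, the baseline driver looks roughly like this (a minimal sketch; HadoopMain is the driver class from this post, while the paths and job wiring are my own placeholders). With plain FileInputFormat, every file yields at least one split:

  // Baseline sketch (old "mapred" API; paths are placeholders): with plain
  // FileInputFormat each file becomes at least one split, so 9,790 files
  // schedule 9,790 map tasks, each paying the task/JVM startup cost.
  JobConf job = new JobConf(conf, HadoopMain.class);
  job.setInputFormat(TextInputFormat.class);
  FileInputFormat.setInputPaths(job, new Path("/data/small-files"));
  FileOutputFormat.setOutputPath(job, new Path("/data/output"));
  JobClient.runJob(job);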

Using CombineFileInputFormat without setting the MaxSplitSize

  • job_201308111005_0330
  • NumTasks: 1
  • Reuse JVM: false

In this benchmark I didn't set the MaxSplitSize, and thus Hadoop merged all the files into one super big task. After the task ran for 15 minutes, Hadoop killed it. Maybe it's a timeout issue; I didn't dig into this. The start and the end of the task logs look like this:

13/09/09 16:17:29 INFO mapred.JobClient:  map 0% reduce 0%
13/09/09 16:32:45 INFO mapred.JobClient:  map 40% reduce 0%
13/09/09 16:33:02 INFO mapred.JobClient: Task Id : attempt_201308111005_0330_m_000000_0, Status : FAILED
java.lang.Throwable: Child Error
    Caused by: Task process exit with nonzero status of 255.
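
For context, the input format here is the custom CombineFileInputFormat subclass from the previous article. A minimal sketch of it (class and record-reader names are mine; MyRecordReader stands in for the per-file reader from that post) shows why leaving the split size unset collapses everything into one task:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.mapred.lib.*;

  // Sketch of a custom CombineFileInputFormat (old "mapred" API).
  public class MyCombineFileInputFormat
      extends CombineFileInputFormat<LongWritable, Text> {
    // No setMaxSplitSize(...) call anywhere, so getSplits() packs all
    // 9,790 files into a single split, i.e. a single giant map task.
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
        InputSplit split, JobConf job, Reporter reporter) throws IOException {
      // MyRecordReader is a hypothetical per-file record reader.
      return new CombineFileRecordReader<LongWritable, Text>(
          job, (CombineFileSplit) split, reporter, MyRecordReader.class);
    }
  }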

Using CombineFileInputFormat with block size 64 MB

  • job_201308111005_0332
  • Reuse JVM = false
  • max split size = 64MB
  • NumTasks: 760
  • mean complete time: 9-Sep-2013 16:55:02 (24sec)
  • Finished in: 23mins, 6sec

After modifying the MaxSplitSize, the total runtime dropped to 23 minutes! The number of tasks dropped from 9790 to 760, about 12 times fewer, and the job finished about 9.3 times faster. Pretty nice! However, the mean task completion time doesn't scale down like the other numbers; the reason is the big overhead of starting a JVM over and over again.
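
Capping the split size is a one-liner. A sketch, assuming the hypothetical subclass above and the constructor route (setMaxSplitSize is a protected method of CombineFileInputFormat):

  public MyCombineFileInputFormat() {
    // Cap each combined split at 64 MB, so the 53 GB input is packed
    // into a few hundred splits instead of one giant split.
    setMaxSplitSize(64 * 1024 * 1024);
  }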

Using CombineFileInputFormat with block size 64MB and reuse JVM

To reuse the JVM, just set mapred.job.reuse.jvm.num.tasks to -1:

  public static void main(String[] argv) throws Exception {
    Configuration conf = new Configuration();
    // -1 means a JVM can be reused by an unlimited number of tasks
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
    int res = ToolRunner.run(conf, new HadoopMain(), argv);
    System.exit(res);
  }
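
Note that -1 means a JVM can be reused by an unlimited number of tasks from the same job; a positive value caps how many tasks each JVM runs, and the default of 1 turns reuse off.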

The result is awesome! 6 minutes and 8 seconds, wow!

  • job_201308111005_0333
  • Reuse JVM = true
  • max split size = 64MB
  • NumTasks: 760
  • mean complete time: 9-Sep-2013 17:30:23 (5sec)
  • Finished in: 6mins, 8sec

Using FileInputFormat and reuse JVM

Just curious about the performance difference if we only change the JVM reuse parameter:

  • job_201308111005_0343
  • NumTasks: 9790
  • mean complete time: 10-Sep-2013 17:04:18 (3sec)
  • Reuse JVM = true
  • Finished in: 24mins, 49sec

Tuning performance over block size

Let's jump to the conclusion first: changing the block size doesn't affect the performance that much, and I found 64 MB to be the best size to use. Here are the benchmarks, followed by a sketch of how to vary the cap per run:

512 MB

  • job_201308111005_0339
  • Reuse JVM = true
  • max split size = 512MB
  • NumTasks: 99
  • mean complete time: 10-Sep-2013 11:55:26 (24sec)
  • Finished in: 7mins, 13sec

128 MB

  • job_201308111005_0340
  • Reuse JVM = true
  • max split size = 128 MB
  • NumTasks: 341
  • mean complete time: 10-Sep-2013 13:13:20 (9sec)
  • Finished in: 6mins, 41sec
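
Since the split size changes between runs, it is handier to pass it through configuration than to recompile the constructor. My assumption is that the Hadoop 1.x property mapred.max.split.size, which CombineFileInputFormat falls back to when no explicit cap is set in the subclass, does the trick:

  // Sweep-friendly sketch (assumed fallback property): vary the cap
  // per run via configuration instead of hard-coding it.
  Configuration conf = new Configuration();
  conf.setLong("mapred.max.split.size", 128L * 1024 * 1024);  // the 128 MB run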


So far, the best practices I learned from these benchmarks are:

  1. Set the mapred.job.reuse.jvm.num.tasks flag in the configuration. This is the easiest tuning to do, and it gives nearly a 10x performance improvement.
  2. Write your own CombineFileInputFormat.
  3. The block size can be 64 MB or 128 MB; it doesn't make a big difference which of the two you choose.

Still, try to model your problems as a SequenceFile or MapFile in Hadoop. HDFS should handle locality for these files automatically. What about CombineFileInputFormat? Does it handle locality in HDFS too? I can't confirm it, but I guess sorting the keys by line offset first and then by file name also guarantees locality when assigning data to mappers. When I have time to dig deeper into the HDFS API, I'll revisit this benchmark and see what else I can tune in the program.
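
As a concrete example of the SequenceFile route, here is a minimal sketch (my own, not from the original post; paths are placeholders) that packs a directory of small files into one SequenceFile, keyed by file name, so HDFS sees a single large file:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.*;
  import org.apache.hadoop.io.*;

  // Pack small files into one SequenceFile:
  // key = file name, value = raw file bytes.
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  SequenceFile.Writer writer = SequenceFile.createWriter(
      fs, conf, new Path("/data/packed.seq"), Text.class, BytesWritable.class);
  try {
    for (FileStatus status : fs.listStatus(new Path("/data/small-files"))) {
      byte[] buf = new byte[(int) status.getLen()];
      FSDataInputStream in = fs.open(status.getPath());
      try {
        in.readFully(0, buf);  // fine for small files
      } finally {
        in.close();
      }
      writer.append(new Text(status.getPath().getName()), new BytesWritable(buf));
    }
  } finally {
    writer.close();
  }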

