Java fast IO using java.nio API

存储架构 2013-09-29

For modern computing, IO is always a big bottleneck to solve. I recently encounter a problem is to read a 355MB index file to memory, and do a run-time lookup base the index. This process will be repeated by thousands of Hadoop job instances, so a fast IO is a must. By using the
java.nio
API I sped the process from 194.054 seconds to 0.16 sec! Here’s how I did it.

The Data to Process

This performance tuning practice is very specific to the data I’m working on, so it’s better to explain the context. We have a long ip list (26 millions in total) that we want to put in the memory. The ip is in text form, and we’ll transform it into signed integer and put it into a java array. (We use signed integer because java doesn’t support unsigned primitive types…) The transformation is pretty straight forward:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
public static int ip2integer (String ip_str){
  String [] numStrs = ip_str.split("\.");
  long num;
  if (numStrs.length == 4){
    num =
        Long.parseLong(numStrs[0]) * 256 * 256 * 256
        + Long.parseLong(numStrs[1]) * 256 * 256
        + Long.parseLong(numStrs[2]) * 256
        + Long.parseLong(numStrs[3]);
    num += Integer.MIN_VALUE;
    return (int)num;
  } else {
    System.err.println("IP is wrong: "+ ip_str);
    return Integer.MIN_VALUE;
  }
}

However, reading ip in text form line by line is really slow.

Strategy 1: Line-by-line text processing

This approach is straight forward. Just a standard readline program in java.

1
2
3
4
5
6
7
8
9
10
11
12
13
private int[] ipArray = new int[26123456];
public static void readIPAsText() throws IOException{
  BufferedReader br = new BufferedReader(new FileReader("ip.tsv"));
  DataOutputStream ds = new DataOutputStream(fos);
  String line;
  int i = 0;

  while ((line = br.readLine()) != null) {
    int ip_num = ip2integer(line);
    ipArray[i++] = ip_num;
  }
  br.close();
}

The result time was
194.054
seconds.

Strategy 2: Encode ip in binary format

The file size of the
ip.tsv
is 355MB, which is inefficient to store or to read. Since I’m only reading it to an array, why not store it as a big chunk of binary array, and read it back while I need it? This can be done by
DataInputStream
and
DataOutputStream
. After shrinking the file, the file size became 102MB.

Here’s the code to read ip in binary format:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
public static void readIPAsDataStream() throws IOException{
  FileInputStream fis = new FileInputStream(new File("ip.bin"));
  DataInputStream dis = new DataInputStream(fis);
  int i = 0;
  try {
    while(true){
      ipArr[i++] = dis.readInt();
    }
  }catch (EOFException e){
    System.out.println("EOF");
  }
  finally {
    fis.close();
  }
}

The resulting time was
72
seconds. Much slower than I expected.

Strategy 3: Read the file using java.nio API

The
java.nio
is a new IO API that maps to low level system calls. With these system calls we can perform libc operations like
fseek
,
rewind
,
ftell
,
fread
, and bulk copy from disk to memory. For the C API you can view it from
GNU C library reference
.

The terminology in C and Java is a little bit different. In C, you control the file IO by
file descriptors
; while in
java.nio
you use a
FileChannel
for reading, writing, or manipulate the position in the file. Another difference is you can bulk copy directly using the
fread
call, but in Java you need an additional
ByteBuffer
layer to map the data. To understand how it work, it’s better to read it from code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
public static void readIPFromNIO() throws IOException{
  FileInputStream fis = new FileInputStream(new File("ip.bin"));
  FileChannel channel = fis.getChannel();
  ByteBuffer bb = ByteBuffer.allocateDirect(64*1024);
  bb.clear();
  ipArr = new int [(int)channel.size()/4];
  System.out.println("File size: "+channel.size()/4);
  long len = 0;
  int offset = 0;
  while ((len = channel.read(bb))!= -1){
    bb.flip();
    //System.out.println("Offset: "+offset+"tlen: "+len+"tremaining:"+bb.hasRemaining());
    bb.asIntBuffer().get(ipArr,offset,(int)len/4);
    offset += (int)len/4;
    bb.clear();
  }
}

The code should be quite self-documented. The only thing to note is the byte-buffer’s
flip()
method. This call convert the buffer from writing data to buffer from disk to reading mode, so that we can read the data to int array via method
get()
. Another thing worth to mention is java use big-endian to read and write data by default. You can use
ByteBuffer.order(ByteOrder.LITTLE_ENDIAN)
to set the endian if you need it. For more about
ByteBuffer
here’s a
good blog post
that explains it in detail.

With this implementation, the result performance is
0.16
sec! Glory to the
java.nio
!

您可能感兴趣的

写给自学者的入门指南 在IT工程师和培训机构多如牛毛的时代,拜师学艺并不难。但自学编程对于毫无基础的同学来说却可能是个问题,相信有过类似经历的朋友都有一把辛酸泪和一肚不吐不快的体会。让我们从一个故事说起... 故事 某君在一个普通大学读着自己不喜欢的专业,以打游戏、刷段子和睡觉度日,突然有一天想学点什么...
Eclipse stops generating R.java Today, I update ADT to 22.0.1.v201305230001--685705, and found that Eclipse stopped generating R.java. I believed that it is problem of Eclipse, becau...
一个屌丝程序猿的人生(七十) 第二轮项目演示结束了,大家又回到了看视频学习的平淡日子。 值得一提的是,张建派的人自那以后,不仅没有再提过林萧玩猫腻的事,反而一个个都对林萧毕恭毕敬。 这倒是让林萧派的人开了眼,虽然林萧派的人都知道,对方一定是看到了U盘里林萧的项目,也肯定会被林萧那项目中满屏的注释所震撼。 但不管怎么说...
JAVA学习笔记,JavaScript函数与常用对象 Java是一种可以撰写跨平台应用软件的面向对象的程序设计语言。Java 技术具有卓越的通用性、高效性、平台移植性和安全性,广泛应用于PC、数据中心、游戏控制台、科学超级计算机、移动电话和互联网,同时拥有全球最大的开发者专业社群。 给你学习路线:html-css-js-jq-javase-数据库-...
Java编程新手基础入门到进阶:理解与目前学习计划... Java是一种可以撰写跨平台应用软件的面向对象的程序设计语言。Java 技术具有卓越的通用性、高效性、平台移植性和安全性,广泛应用于PC、数据中心、游戏控制台、科学超级计算机、移动电话和互联网,同时拥有全球最大的开发者专业社群。 给你学习路线:html-css-js-jq-javase-数据库-...