For modern computing, IO is always a big bottleneck to solve. I recently encounter a problem is to read a 355MB index file to memory, and do a run-time lookup base the index. This process will be repeated by thousands of Hadoop job instances, so a fast IO is a must. By using the
API I sped the process from 194.054 seconds to 0.16 sec! Here’s how I did it.
The Data to Process
This performance tuning practice is very specific to the data I’m working on, so it’s better to explain the context. We have a long ip list (26 millions in total) that we want to put in the memory. The ip is in text form, and we’ll transform it into signed integer and put it into a java array. (We use signed integer because java doesn’t support unsigned primitive types…) The transformation is pretty straight forward:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
However, reading ip in text form line by line is really slow.
Strategy 1: Line-by-line text processing
This approach is straight forward. Just a standard readline program in java.
1 2 3 4 5 6 7 8 9 10 11 12 13
The result time was
Strategy 2: Encode ip in binary format
The file size of the
is 355MB, which is inefficient to store or to read. Since I’m only reading it to an array, why not store it as a big chunk of binary array, and read it back while I need it? This can be done by
. After shrinking the file, the file size became 102MB.
Here’s the code to read ip in binary format:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
The resulting time was
seconds. Much slower than I expected.
Strategy 3: Read the file using java.nio API
is a new IO API that maps to low level system calls. With these system calls we can perform libc operations like
, and bulk copy from disk to memory. For the C API you can view it from
GNU C library reference
The terminology in C and Java is a little bit different. In C, you control the file IO by
; while in
you use a
for reading, writing, or manipulate the position in the file. Another difference is you can bulk copy directly using the
call, but in Java you need an additional
layer to map the data. To understand how it work, it’s better to read it from code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
The code should be quite self-documented. The only thing to note is the byte-buffer’s
method. This call convert the buffer from writing data to buffer from disk to reading mode, so that we can read the data to int array via method
. Another thing worth to mention is java use big-endian to read and write data by default. You can use
to set the endian if you need it. For more about
good blog post
that explains it in detail.
With this implementation, the result performance is
sec! Glory to the