Parallel Streams Behave Sequentially… to a Point

General Technology · 2018-05-24

In this post, I would like to spotlight a bit of the internal behavior of parallel streams in Java, a feature added in JDK 8. I will start from the source code and then try to explain what really happens in the context of our test example.

Basically, the code below benchmarks two array lists, one of 8192 elements and the other of 8193 (parameterized via @Param), creates a parallel stream out of each, and then sorts it.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(value = 3, warmups = 1)
@State(Scope.Benchmark)
public class SortStreamJmh {

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(SortStreamJmh.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }

    @Param({ "8192", "8193" })
    int arraySize;

    List<String> list = new ArrayList<>();

    @Setup
    public void setupList() {
        Random random = new Random(26);
        for (int i = 0; i < arraySize; i++) {
            String r = generateRandomWord(random, 2);
            list.add(r);
        }
    }

    @Benchmark
    public Object[] sort() {
        Object[] result = list.parallelStream()
                .sorted()
                .toArray();
        return result;
    }

    private static String generateRandomWord(Random random, int wordLength) {
        StringBuilder sb = new StringBuilder(wordLength);
        for (int i = 0; i < wordLength; i++) {
            char tmp = (char) ('a' + random.nextInt('z' - 'a')); // Generate a lowercase letter
            sb.append(tmp); // Add it to the String
        }
        return sb.toString();
    }
}
Test output:

Benchmark           (arraySize)  Mode  Cnt     Score       Error  Units
SortStreamJmh.sort         8192  avgt   15  1711.595 ± 51060.627  us/op
SortStreamJmh.sort         8193  avgt   15   944.169 ± 28012.014  us/op

Tests were run using JDK 10 (the latest JDK release at the time of writing) on my machine (CPU: Intel i7-6700HQ Skylake; memory: 16 GB DDR4-2133; OS: Ubuntu 16.04.2).

As we might notice, the bigger array (i.e. 8193 elements) takes less time to sort the Strings (~2x faster) than the smaller one (i.e. 8192). Yet the arrays' lengths are almost equal (they differ by only one element), so the performance gap is striking. How can we explain this?
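Before diving into the JDK source, here is a quick (admittedly unscientific, non-JMH) way to observe the same effect with a plain System.nanoTime() timing loop; the sizes and random seed mirror the benchmark above, and the class name is just my own:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class QuickSortTiming {

    static String randomWord(Random random, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append((char) ('a' + random.nextInt(26)));
        }
        return sb.toString();
    }

    // Builds a list of `size` random two-letter words and times a parallel sort.
    static long timeSortNanos(int size) {
        Random random = new Random(26);
        List<String> list = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            list.add(randomWord(random, 2));
        }
        long start = System.nanoTime();
        Object[] sorted = list.parallelStream().sorted().toArray();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // A few warm-up rounds so JIT compilation does not dominate the numbers.
        for (int i = 0; i < 20; i++) { timeSortNanos(8192); timeSortNanos(8193); }
        System.out.printf("8192 elements: %d us%n", timeSortNanos(8192) / 1_000);
        System.out.printf("8193 elements: %d us%n", timeSortNanos(8193) / 1_000);
    }
}
```

The absolute numbers will vary per machine and run, which is exactly why the article uses JMH for the real measurement.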

Let's jump into the JDK sources, inside the java.util.Arrays#parallelSort() method that a parallel sorted() pipeline delegates to:

public static <T extends Comparable<? super T>> void parallelSort(T[] a) {
    int n = a.length, p, g;
    if (n <= MIN_ARRAY_SORT_GRAN ||  // where MIN_ARRAY_SORT_GRAN = 1 << 13
        (p = ForkJoinPool.getCommonPoolParallelism()) == 1)
        // sequential sort
        TimSort.sort(a, 0, n, NaturalOrder.INSTANCE, null, 0, 0);
    else
        // parallel sort
        new ArraysParallelSortHelpers.FJObject.Sorter<>
            (null, a, (T[])Array.newInstance(a.getClass().getComponentType(), n),
             0, n, 0, ((g = n / (p << 2)) <= MIN_ARRAY_SORT_GRAN) ?
             MIN_ARRAY_SORT_GRAN : g, NaturalOrder.INSTANCE).invoke();
}

The JDK source code reveals an interesting fact:

  • If the array length is at or below a certain granularity (MIN_ARRAY_SORT_GRAN = 1 << 13, which corresponds to 8192), the array is not partitioned any further and is sorted sequentially by TimSort.sort() (the same algorithm behind Arrays.sort()), even if at the code level the programmer explicitly requested a parallel stream!
  • Otherwise, the array is partitioned and the common ForkJoin pool is used to execute parallel sorting tasks.

Getting back to our example, we can summarize:

  • The 8192-element array is sorted sequentially.
  • The 8193-element array is split into parallel sub-tasks handled by the ForkJoin pool.

This explains why, despite being slightly larger, the 8193-element array sorts faster.
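One way to see this cutoff from the outside (a quick sketch of my own, not code from the JDK or the benchmark above) is to record which threads execute the comparator during a parallel sorted() call: at or below 8192 elements everything stays on the calling thread, while above it ForkJoinPool.commonPool-worker-* threads join in, assuming the common pool's parallelism is greater than 1:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ThresholdDemo {

    // Returns the set of thread names that ran the comparator while sorting.
    static Set<String> threadsUsed(int size) {
        List<Integer> list = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            list.add(size - i); // reverse order, so the sort has real work to do
        }
        Set<String> threads = ConcurrentHashMap.newKeySet();
        list.parallelStream()
            .sorted((a, b) -> {
                threads.add(Thread.currentThread().getName());
                return Integer.compare(a, b);
            })
            .toArray();
        return threads;
    }

    public static void main(String[] args) {
        // At or below MIN_ARRAY_SORT_GRAN (8192), the sort stays on the calling thread.
        System.out.println("8192 -> " + threadsUsed(8192));
        // Above it, common-pool worker threads may appear as well
        // (provided the common pool's parallelism is greater than 1).
        System.out.println("8193 -> " + threadsUsed(8193));
    }
}
```

On a single-core machine (common-pool parallelism of 1), even the 8193 case stays sequential, per the second condition in the parallelSort() source above.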

Back to a bit of theory: there are a few recommendations from Brian Goetz in his great article Parallel stream performance about the rationale of splitting a source, including when it makes sense to go parallel and when to stick with the sequential approach. One of the guidelines is the NQ model, which states:

NQ Model: The larger the product N x Q, the more likely it is to get a parallel speedup!

N: Number of data elements

Q: Amount of work performed per element

Note: For problems with a trivially small Q (e.g. sorting, addition), N should generally be greater than 10,000 for parallelization to yield a speedup and make sense!

This is a reasonable explanation for our test case as well: the JDK sources rely on an explicit threshold (1 << 13 = 8192) to avoid parallelizing streams whose size does not exceed that value.
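As a rough illustration of the NQ guidance (the 10,000 cut-off is Goetz's rule of thumb, not an official API, and the helper below is a hypothetical sketch of my own), one could encode a size-based sequential-vs-parallel dispatch like this:

```java
import java.util.List;
import java.util.stream.Stream;

public class NqDispatch {

    // Rule-of-thumb threshold for low-Q operations, per the NQ model.
    static final int LOW_Q_THRESHOLD = 10_000;

    // Picks a sequential or parallel stream based on the element count N,
    // assuming the per-element work Q is trivially small (e.g. comparing keys).
    static <T> Stream<T> streamFor(List<T> source) {
        return source.size() > LOW_Q_THRESHOLD
                ? source.parallelStream()
                : source.stream();
    }

    public static void main(String[] args) {
        System.out.println(streamFor(List.of(1, 2, 3)).isParallel()); // prints false: N is tiny
    }
}
```

For operations with a large Q (expensive per-element work), a much smaller N can already justify going parallel, so this threshold should not be treated as universal.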



Content edited by: Javalobby (source link). Thank you for your support!

