I've already read this and this question, but I still doubt whether the observed behavior of Stream.skip was intended by the JDK authors.
Let's take a simple input of the numbers 1..20:
List<Integer> input = IntStream.rangeClosed(1, 20).boxed().collect(Collectors.toList());
Now let's create a parallel stream, combine unordered() with skip() in different ways, and collect the result:
System.out.println("skip-skip-unordered-toList: "
        + input.parallelStream().filter(x -> x > 0)
            .skip(1)
            .skip(1)
            .unordered()
            .collect(Collectors.toList()));
System.out.println("skip-unordered-skip-toList: "
        + input.parallelStream().filter(x -> x > 0)
            .skip(1)
            .unordered()
            .skip(1)
            .collect(Collectors.toList()));
System.out.println("unordered-skip-skip-toList: "
        + input.parallelStream().filter(x -> x > 0)
            .unordered()
            .skip(1)
            .skip(1)
            .collect(Collectors.toList()));
The filtering step does essentially nothing here, but it makes things harder for the stream engine: it no longer knows the exact size of the output, so some optimizations are turned off. I get the following results:
skip-skip-unordered-toList: [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
// absent values: 1, 2
skip-unordered-skip-toList: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20]
// absent values: 1, 15
unordered-skip-skip-toList: [1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 19, 20]
// absent values: 7, 18
The results are completely fine; everything works as expected. In the first case I asked to skip the first two elements, then collect to a list in no particular order. In the second case I asked to skip the first element, then switch to unordered mode and skip one more element (I don't care which one). In the third case I switched to unordered mode first, then skipped two arbitrary elements.
Now let's skip one element and collect to a custom collection in unordered mode. Our custom collection will be a HashSet:
System.out.println("skip-toCollection: "
        + input.parallelStream().filter(x -> x > 0)
            .skip(1)
            .unordered()
            .collect(Collectors.toCollection(HashSet::new)));
The output is satisfactory:
skip-toCollection: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
// 1 is skipped
So in general I expect that as long as the stream is ordered, skip() skips the first elements; otherwise it skips arbitrary ones.
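That expectation can be checked with a small sketch. The class and method names below are made up for illustration; the only property assumed for the unordered case is that exactly n elements are dropped, not which ones.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SkipOrderingSketch {
    // Same input as above: the numbers 1..20.
    static List<Integer> input() {
        return IntStream.rangeClosed(1, 20).boxed().collect(Collectors.toList());
    }

    // Ordered pipeline: skip(n) has to drop the first n elements in encounter order.
    public static List<Integer> orderedSkip(int n) {
        return input().parallelStream().filter(x -> x > 0)
                .skip(n)
                .collect(Collectors.toList());
    }

    // Unordered pipeline: skip(n) may drop any n elements; only the count is fixed.
    public static List<Integer> unorderedSkip(int n) {
        return input().parallelStream().filter(x -> x > 0)
                .unordered()
                .skip(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("ordered:   " + orderedSkip(2));
        System.out.println("unordered: " + unorderedSkip(2));
    }
}
```

The ordered variant should print the same list on every run, while the unordered one may differ between runs and machines.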
However, let's use an equivalent unordered terminal operation, collect(Collectors.toSet()):
System.out.println("skip-toSet: "
        + input.parallelStream().filter(x -> x > 0)
            .skip(1)
            .unordered()
            .collect(Collectors.toSet()));
Now the output is:
skip-toSet: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20]
// 13 is skipped
The same result can be achieved with any other unordered terminal operation (like forEach, findAny, anyMatch, etc.). Removing the unordered() step in this case changes nothing. It seems that while the unordered() step correctly makes the stream unordered starting from the current operation, an unordered terminal operation makes the whole stream unordered from the very beginning, even though this can affect the result if skip() was used. This seems completely misleading to me: I expect that using an unordered collector is the same as turning the stream into unordered mode just before the terminal operation and using the equivalent ordered collector.
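To compare the two variants side by side, a small helper that reports which input element went missing is handy. This is only a sketch: the class and method names are hypothetical, and which element toSet() drops depends on the JDK build and on how the work happens to be split.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SkippedElementSketch {
    static List<Integer> input() {
        return IntStream.rangeClosed(1, 20).boxed().collect(Collectors.toList());
    }

    // Returns the input elements that do not appear in the given result.
    public static Set<Integer> missing(Collection<Integer> result) {
        Set<Integer> all = new TreeSet<>(input());
        all.removeAll(result);
        return all;
    }

    // skip(1) followed by an unordered collector.
    public static Set<Integer> viaToSet() {
        return input().parallelStream().filter(x -> x > 0)
                .skip(1)
                .collect(Collectors.toSet());
    }

    // skip(1), then unordered(), then a collector that is not flagged UNORDERED.
    public static Set<Integer> viaUnorderedThenHashSet() {
        return input().parallelStream().filter(x -> x > 0)
                .skip(1)
                .unordered()
                .collect(Collectors.toCollection(HashSet::new));
    }

    public static void main(String[] args) {
        System.out.println("toSet skipped:        " + missing(viaToSet()));
        System.out.println("toCollection skipped: " + missing(viaUnorderedThenHashSet()));
    }
}
```

Both variants drop exactly one element; the open question is whether it is guaranteed to be the first one in both.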
So my questions are:
- Is this behavior intended, or is it a bug?
- If it is intended, is it documented somewhere? I've read the Stream.skip() documentation: it says nothing about unordered terminal operations. The Characteristics.UNORDERED documentation is also not very comprehensive and does not say that ordering will be lost for the whole stream. Finally, the Ordering section in the package summary does not cover this case either. Am I missing something?
- If it is intended that an unordered terminal operation makes the whole stream unordered, why does the unordered() step make it unordered only from that point on? Can I rely on this behavior, or was I just lucky that my first tests worked nicely?
Recall that the goal of stream flags (ORDERED, SORTED, SIZED, DISTINCT) is to enable operations to avoid doing unnecessary work. Examples of optimizations that involve stream flags are:
- If the stream is known to be sorted already, sorted() is a no-op;
- If the size of the stream is known, toArray() can allocate a correctly sized array up front, avoiding a copy.
Each stage of a pipeline has a set of stream flags. Intermediate operations can inject, preserve, or clear stream flags. For example, filtering preserves sorted-ness / distinct-ness but not sized-ness; mapping preserves sized-ness but not sorted-ness or distinct-ness. Sorting injects sorted-ness. The treatment of flags for intermediate operations is fairly straightforward, because all decisions are local.
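The stream flags themselves are internal to the implementation, but the characteristics reported by Stream.spliterator() give a rough public view of them. The following is a sketch under that assumption; the class and helper names are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;
import java.util.stream.Stream;

public class StreamFlagsSketch {
    static final List<Integer> SOURCE = Arrays.asList(3, 1, 2);

    // True if the stream's spliterator reports the given characteristic.
    // Note that calling spliterator() consumes the stream.
    public static boolean has(Stream<Integer> stream, int characteristic) {
        return stream.spliterator().hasCharacteristics(characteristic);
    }

    public static void main(String[] args) {
        System.out.println("source is SIZED:           "
                + has(SOURCE.stream(), Spliterator.SIZED));
        System.out.println("after filter, SIZED:       "
                + has(SOURCE.stream().filter(x -> x > 0), Spliterator.SIZED));
        System.out.println("after map, SIZED:          "
                + has(SOURCE.stream().map(x -> x + 1), Spliterator.SIZED));
        System.out.println("after sorted, SORTED:      "
                + has(SOURCE.stream().sorted(), Spliterator.SORTED));
        System.out.println("after unordered, ORDERED:  "
                + has(SOURCE.stream().unordered(), Spliterator.ORDERED));
    }
}
```

Each line builds a fresh stream, because a stream can only be traversed once.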
The treatment of flags for terminal operations is more subtle. ORDERED is the most relevant flag for terminal ops. And if a terminal op is UNORDERED, then we do back-propagate the unordered-ness.
Why do we do this? Well, consider this pipeline:
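A pipeline of this shape, sketched here under the assumption that it is simply a sort followed by an order-insensitive forEach (the class and method names are illustrative only):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class SortedForEachSketch {
    // Sorts the elements and then visits each one with forEach.
    // forEach makes no promise about visiting order, so the sort buys nothing here.
    public static Set<Integer> visitAll(List<Integer> values) {
        Set<Integer> seen = ConcurrentHashMap.newKeySet(); // safe for parallel forEach
        values.parallelStream()
              .sorted()
              .forEach(seen::add);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(visitAll(Arrays.asList(5, 3, 1, 4, 2)));
    }
}
```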
Since forEach is not constrained to operate in order, the work of sorting the list is completely wasted effort. So we back-propagate this information (until we hit a short-circuiting operation, such as limit), so as not to lose this optimization opportunity. Similarly, we can use an optimized implementation of distinct on unordered streams.
Is this behavior intended or is it a bug?
Yes :) The back-propagation is intended, as it is a useful optimization that should not produce incorrect results. However, the bug part is that we are propagating past a previous skip, which we shouldn't. So the back-propagation of the UNORDERED flag is overly aggressive, and that's a bug. We'll post a bug.
If so, is it documented somewhere?
It should be just an implementation detail; if it were correctly implemented, you wouldn't notice (except that your streams are faster).