Fixing scala.collection.Iterator
The venerable Iterator
interface we all love and hate could use some improvements. This is a follow-up to my previous article, in which I talked about getting rid of Traversable because the Iterable
and Iterator
duo is enough for Scala’s standard library.
As a reminder, the Iterator
interface is something like this:
trait Iterator[+A] {
def hasNext: Boolean
def next(): A
}
It’s a destructive interface that is consumed for as long as you call next()
,
it obviously has “identity” and you’re supposed to use it like this:
val cursor: Iterator[Int] = ???
var sum = 0
while (cursor.hasNext) {
sum += cursor.next()
}
Problem 1: Both methods (hasNext, next) are side-effecting #
You could say that hasNext
is not supposed to move the internal
cursor / pointer / index and thus it shouldn’t be side-effecting, but
that’s not true, because in many cases the only way to know if there
is a next element to be served is to trigger a side-effecting read.
And so the problem is that both hasNext
and next()
are side-effecting and
in my opinion the result of the wrong method is getting cached. When you work with
Functional Programming for a while, you start noticing when APIs have their
side-effects screwed ;-)
We can’t really blame Scala though. This interface has been imported from Java and kept similar probably for remaining familiar.
But let me illustrate by building an iterator for reading an InputStream
:
class IteratorFromStream(in: InputStream, chunkSize: Int)
extends Iterator[Array[Byte]] {
private val buffer = new Array[Byte](chunkSize)
private var chunk: Array[Byte] = _
private var hasChunk = false
def hasNext: Boolean = {
if (!hasChunk) {
val len = in.read(buffer)
if (len >= 0) {
chunk = util.Arrays.copyOf(buffer, len)
hasChunk = true
} else {
in.close()
}
}
hasChunk
}
def next(): Array[Byte] = {
if (hasNext) {
val ref = chunk
chunk = null // GC purposes
hasChunk = false
ref
} else {
throw new NoSuchElementException("InputStream is empty")
}
}
}
Not that particularly exciting and you can see how next()
has to duplicate
the work of hasNext
and that hasNext
itself is side-effecting, because
we have to read from the InputStream
before being able to answer that question.
We can do better and we don’t have to be original about it. Behold the alternative inspired by IEnumerator from C#:
trait Iterator[+A] {
// Side-effecting, moves the cursor
def moveNext(): Boolean
// Not side-effecting, can be called multiple times
def current: A
}
Usage is straightforward:
val cursor: Iterator[Int] = ???
var sum = 0
while (cursor.moveNext()) {
sum += cursor.current
}
This interface feels more natural to developers because “moving” the cursor is the side-effect, not reading the current value. And here’s how the above implementation changes:
class IteratorFromStream(in: InputStream, chunkSize: Int)
extends Iterator[Array[Byte]] {
private val buffer = new Array[Byte](chunkSize)
private var chunk: Array[Byte] = _
def moveNext(): Boolean = {
val len = in.read(buffer)
if (len >= 0) {
chunk = util.Arrays.copyOf(buffer, len)
true
}
else {
chunk = null
in.close()
false
}
}
def current: Array[Byte] = {
if (chunk == null) throw NoSuchElementException("current")
chunk
}
}
Notice how this simplifies things on the implementation side as well.
UPDATE: M.Odersky points out in the comments that the standard library has a
BufferedIterator
implementation that caches the current head
and can be used for convenience.
Problem 2: Iterator comes with operations attached #
At the beginning I gave you a simplified Iterator
definition, however I lied.
The true scala.collection.Iterator
is closer to this:
package scala.collection
trait Iterator[+A] extends TraversableOnce[A] {
def hasNext: Boolean
def next(): A
def isTraversableAgain = false
def isEmpty: Boolean = !hasNext
def hasDefiniteSize = isEmpty
def map[B](f: A => B): Iterator[B] = ???
def take(n: Int): Iterator[A] = slice(0, n)
def drop(n: Int): Iterator[A] = ???
def slice(from: Int, until: Int): Iterator[A] = ???
def map[B](f: A => B): Iterator[B] = ???
def ++[B >: A](that: => GenTraversableOnce[B]): Iterator[B] = ???
def flatMap[B](f: A => GenTraversableOnce[B]): Iterator[B] = ???
def filter(p: A => Boolean): Iterator[A] = ???
def corresponds[B](that: GenTraversableOnce[B])(p: (A, B) => Boolean): Boolean = ???
def withFilter(p: A => Boolean): Iterator[A] = ???
def filterNot(p: A => Boolean): Iterator[A] = ???
def collect[B](pf: PartialFunction[A, B]): Iterator[B] = ???
def scanLeft[B](z: B)(op: (B, A) => B): Iterator[B] = ???
def scanRight[B](z: B)(op: (A, B) => B): Iterator[B] = ???
def takeWhile(p: A => Boolean): Iterator[A] = ???
def partition(p: A => Boolean): (Iterator[A], Iterator[A]) = ???
def span(p: A => Boolean): (Iterator[A], Iterator[A]) = ???
def dropWhile(p: A => Boolean): Iterator[A] = ???
def zip[B](that: Iterator[B]): Iterator[(A, B)] = ???
def padTo[A1 >: A](len: Int, elem: A1): Iterator[A1] = ???
def zipWithIndex: Iterator[(A, Int)] = ???
def foreach[U](f: A => U) = ???
def forall(p: A => Boolean): Boolean = ???
def exists(p: A => Boolean): Boolean = ???
def contains(elem: Any): Boolean = ???
def find(p: A => Boolean): Option[A] = ???
def indexWhere(p: A => Boolean): Int = ???
def indexOf[B >: A](elem: B): Int = ???
def buffered: BufferedIterator[A] = ???
def grouped[B >: A](size: Int): GroupedIterator[B] = ???
def copyToArray[B >: A](xs: Array[B], start: Int, len: Int): Unit = ???
def sameElements(that: Iterator[_]): Boolean = ???
def toTraversable: Traversable[A] = ???
def toIterator: Iterator[A] = ???
def toStream: Stream[A] = ???
// ....
}
Well, at this point you should be thinking that this violates the principles of
OOP design. When Iterator
comes with operations like map
and filter
,
that are polymorphic and can be overridden, it is no longer just a minimal
protocol for “iterating over things”, but a big, fat interface.
You see, there isn’t a single possible implementation for map
or for take
and by providing such operations with a default implementation the
Iterator
is imposing to users how those operations should behave.
Or more specifically:
- these operations have lazy behavior, until overridden in subclasses
- assume that the protocol is set in stone
But Iterator
is a fine example of an OOP interface because you can add
restrictions to it. Lo and behold how OOP inheritance is supposed to work:
trait CloseableIterable[+A] extends Iterable[A] {
def iterator: CloseableIterator[A]
}
trait CloseableIterator[+A] extends Iterator[A] with AutoCloseable {
/** Closes this resource, relinquishing any underlying resources. */
def close(): Unit
}
BAM, we just invalidated more than 80 operations provided by Scala’s Iterator
and Iterable
. Are you going to override them all?
If you say yes, then you don’t know what you’re getting yourself into, plus what are you going to do when Scala 2.13 (or whatever the next version is) comes with new operators that need to be overridden? Are you going to remember to do it? It’s a hard problem.
I don’t mind having implementations of map
and filter
for Iterator
, but
Scala lacks a minimal interface that provides just the raw protocol.
There is value in simplicity. Notice Java’s Iterator,
notice C#’s IEnumerator, notice how they don’t come with operators attached.
Instead, for C# at least, you can import
Ix.NET in your project,
which gives you a bunch of extension methods you can work with, no strings attached.
Scala could also use type-classes which are
much better than extension methods. But instead what currently happens in
Scala’s collection library can be seen as inheritance hell.
Not everything needs map
and flatMap
on it.
UPDATE: there is now an issue for discussing this at collection-strawman/issues/#17.
Non-problem: early termination & resource handling #
The Iterator
interface alone is not enough to expose streams linked
to file handles, network sockets or other resources that need to be disposed
when terminated early.
So rookies (also because of problem 2 above) can end up with unconsumed iterators, creating possible connection leaks, because it’s easy:
iterator.take(100)
However I don’t view this as being a problem because:
- we don’t need to do resource handling everywhere
- as demonstrated in the sample with
CloseableIterator
above, you can build proper resource handling on top ofIterator
- doing I/O by means of an
Iterator
is often a bad idea, given thatIterator
is not capable of asynchronous boundaries, with I/O operations often being asynchronous
Having a CloseableIterator
in the standard library wouldn’t be bad though,
however given the very complex inheritance hierarchy and the traversable
grandparents, I’m afraid to wish for it.