alexn.org

Fixing scala.collection.Iterator

Scala icon

In my previous article I talked about getting rid of Traversable because the Iterable and Iterator duo is enough for Scala's standard library. But the Iterator interface can also use some improvements.

As a reminder, the Iterator interface is something like this:

trait Iterator[+A] {
  def hasNext: Boolean
  def next(): A
}

It's a destructive interface that is consumed for as long as you call next(), it obviously has "identity" and you're supposed to use it like this:

val cursor: Iterator[Int] = ???
var sum = 0

while (cursor.hasNext) {
  sum += cursor.next()
}

Problem 1: Both methods (hasNext, next) are side-effecting

You could say that hasNext is not supposed to move the internal cursor / pointer / index and thus it shouldn't be side-effecting, but that's not true, because in many cases the only way to know if there is a next element to be served is to trigger a side-effecting read.

And so the problem is that both hasNext and next() are side-effecting and in my opinion the result of the wrong method is getting cached. When you work with Functional Programming for a while, you start noticing when APIs have their side-effects screwed ;-)

We can't really blame Scala though. This interface has been imported from Java and kept similar probably for remaining familiar.

But let me illustrate by building an iterator for reading an InputStream:

class IteratorFromStream(in: InputStream, chunkSize: Int)
  extends Iterator[Array[Byte]] {

  private val buffer = new Array[Byte](chunkSize)
  private var chunk: Array[Byte] = _
  private var hasChunk = false

  def hasNext: Boolean = {
    if (!hasChunk) {
      val len = in.read(buffer)
      if (len >= 0) {
        chunk = util.Arrays.copyOf(buffer, len)
        hasChunk = true
      } else {
        in.close()
      }
    }

    hasChunk
  }

  def next(): Array[Byte] = {
    if (hasNext) {
      val ref = chunk
      chunk = null // GC purposes
      hasChunk = false
      ref
    } else {
      throw new NoSuchElementException("InputStream is empty")
    }
  }
}

Not that particularly exciting and you can see how next() has to duplicate the work of hasNext and that hasNext itself is side-effecting, because we have to read from the InputStream before being able to answer that question.

We can do better and we don't have to be original about it. Behold the alternative inspired by IEnumerator from C#:

trait Iterator[+A] {
  // Side-effecting, moves the cursor
  def moveNext(): Boolean

  // Not side-effecting, can be called multiple times
  def current: A
}

Usage is straightforward:

val cursor: Iterator[Int] = ???
var sum = 0

while (cursor.moveNext()) {
  sum += cursor.current
}

This interface feels more natural to developers because "moving" the cursor is the side-effect, not reading the current value. And here's how the above implementation changes:

class IteratorFromStream(in: InputStream, chunkSize: Int)
  extends Iterator[Array[Byte]] {

  private val buffer = new Array[Byte](chunkSize)
  private var chunk: Array[Byte] = _

  def moveNext(): Boolean = {
    val len = in.read(buffer)
    if (len >= 0) {
      chunk = util.Arrays.copyOf(buffer, len)
      true
    }
    else {
      chunk = null
      in.close()
      false
    }
  }

  def current: Array[Byte] = {
    if (chunk == null) throw NoSuchElementException("current")
    chunk
  }
}

Notice how this simplifies things on the implementation side as well.

UPDATE: M.Odersky points out in the comments that the standard library has a BufferedIterator implementation that caches the current head and can be used for convenience.

Problem 2: Iterator comes with operations attached

At the beginning I gave you a simplified Iterator definition, however I lied. The true scala.collection.Iterator is closer to this:

package scala.collection

trait Iterator[+A] extends TraversableOnce[A] {
  def hasNext: Boolean
  def next(): A

  def isTraversableAgain = false
  def isEmpty: Boolean = !hasNext
  def hasDefiniteSize = isEmpty

  def map[B](f: A => B): Iterator[B] = ???
  def take(n: Int): Iterator[A] = slice(0, n)
  def drop(n: Int): Iterator[A] = ???
  def slice(from: Int, until: Int): Iterator[A] = ???
  def map[B](f: A => B): Iterator[B] = ???
  def ++[B >: A](that: => GenTraversableOnce[B]): Iterator[B] = ???
  def flatMap[B](f: A => GenTraversableOnce[B]): Iterator[B] = ???
  def filter(p: A => Boolean): Iterator[A] = ???
  def corresponds[B](that: GenTraversableOnce[B])(p: (A, B) => Boolean): Boolean = ???
  def withFilter(p: A => Boolean): Iterator[A] = ???
  def filterNot(p: A => Boolean): Iterator[A] = ???
  def collect[B](pf: PartialFunction[A, B]): Iterator[B] = ???
  def scanLeft[B](z: B)(op: (B, A) => B): Iterator[B] = ???
  def scanRight[B](z: B)(op: (A, B) => B): Iterator[B] = ???
  def takeWhile(p: A => Boolean): Iterator[A] = ???
  def partition(p: A => Boolean): (Iterator[A], Iterator[A]) = ???
  def span(p: A => Boolean): (Iterator[A], Iterator[A]) = ???
  def dropWhile(p: A => Boolean): Iterator[A] = ???
  def zip[B](that: Iterator[B]): Iterator[(A, B)] = ???
  def padTo[A1 >: A](len: Int, elem: A1): Iterator[A1] = ???
  def zipWithIndex: Iterator[(A, Int)] = ???
  def foreach[U](f: A => U) = ???
  def forall(p: A => Boolean): Boolean = ???
  def exists(p: A => Boolean): Boolean = ???
  def contains(elem: Any): Boolean = ???
  def find(p: A => Boolean): Option[A] = ???
  def indexWhere(p: A => Boolean): Int = ???
  def indexOf[B >: A](elem: B): Int = ???
  def buffered: BufferedIterator[A] = ???
  def grouped[B >: A](size: Int): GroupedIterator[B] = ???
  def copyToArray[B >: A](xs: Array[B], start: Int, len: Int): Unit = ???
  def sameElements(that: Iterator[_]): Boolean = ???
  def toTraversable: Traversable[A] = ???
  def toIterator: Iterator[A] = ???
  def toStream: Stream[A] = ???
  // ....
}

Well, at this point you should be thinking that this violates the principles of OOP design. When Iterator comes with operations like map and filter, that are polymorphic and can be overridden, it is no longer just a minimal protocol for "iterating over things", but a big, fat interface.

You see, there isn't a single possible implementation for map or for take and by providing such operations with a default implementation the Iterator is imposing to users how those operations should behave. Or more specifically:

  1. these operations have lazy behavior, until overridden in subclasses
  2. assume that the protocol is set in stone

But Iterator is a fine example of an OOP interface because you can add restrictions to it. Lo and behold how OOP inheritance is supposed to work:

trait CloseableIterable[+A] extends Iterable[A] {
  def iterator: CloseableIterator[A]
}

trait CloseableIterator[+A] extends Iterator[A] with AutoCloseable {
  /** Closes this resource, relinquishing any underlying resources. */
  def close(): Unit
}

BAM, we just invalidated more than 80 operations provided by Scala's Iterator and Iterable. Are you going to override them all?

If you say yes, then you don't know what you're getting yourself into, plus what are you going to do when Scala 2.13 (or whatever the next version is) comes with new operators that need to be overridden? Are you going to remember to do it? It's a hard problem.

I don't mind having implementations of map and filter for Iterator, but Scala lacks a minimal interface that provides just the raw protocol. There is value in simplicity. Notice Java's Iterator, notice C#'s IEnumerator, notice how they don't come with operators attached. Instead, for C# at least, you can import Ix.NET in your project, which gives you a bunch of extension methods you can work with, no strings attached. Scala could also use type-classes which are much better than extension methods. But instead what currently happens in Scala's collection library can be seen as inheritance hell.

Not everything needs map and flatMap on it.

UPDATE: there is now an issue for discussing this at collection-strawman/issues/#17.

Non-problem: early termination & resource handling

The Iterator interface alone is not enough to expose streams linked to file handles, network sockets or other resources that need to be disposed when terminated early.

So rookies (also because of problem 2 above) can end up with unconsumed iterators, creating possible connection leaks, because it's easy:

iterator.take(100)

However I don't view this as being a problem because:

  1. we don't need to do resource handling everywhere
  2. as demonstrated in the sample with CloseableIterator above, you can build proper resource handling on top of Iterator
  3. doing I/O by means of an Iterator is often a bad idea, given that Iterator is not capable of asynchronous boundaries, with I/O operations often being asynchronous

Having a CloseableIterator in the standard library wouldn't be bad though, however given the very complex inheritance hierarchy and the traversable grandparents, I'm afraid to wish for it.