Iterators and laziness with types

Prerequisites:

Collections are some of the most commonly encountered data structures in most programs, in fact, some, such as the basic filters known from the common Linux/Unix userland can be described as programs that merely manipulate a list of strings, that is, the lines of a file.

From more imperatively-oriented programming languages, you might be used to using loops to process lists, most commonly with one of the incarnations of the for loop.

In functional programming languages, recursion tends to be the preferred method of list processing, however, in some cases, recursion might be awkward to type out and may end up being hard to read. Furthermore, the recursive approach limits the maximum amount of iterables in programming languages that do not have Tail Call Optimization, as each iteration will insert a new stack frame to the call stack, so you will eventually run out of stack.

Rust is one of the languages that do not have TCO, and so you will eventually blow your stack. You can see how many calls deep can you go by running the following code example on your machine:

#[allow(unconditional_recursion)]
fn recurse(x: usize) -> ! {
    println!("{}", x);
    recurse(x + 1)
}

fn main() {
    recurse(1);
}

On my machine, this program can go 104756 levels deep in a --release build before running out of space on the stack.

These two limitations necessitate the need for an abstract and declarative solution. That would be iterators and operations on iterators.

Iterators in Rust

An iterator is a type that facilitates iteration on an iterable, producing an item every iteration. This is by done by implementing the iterator traits.

The main traits are the following:

  • Iterator - must be implemented by all iterators, the "main trait"
  • IntoIterator - to convert a type into an iterator, consuming the type in the process (this is usually needed if you want to iterate over Item type directly and not a reference without cloning or copying)

Then there is a couple secondary traits:

  • DoubleEndedIterator - allows taking from the back of an iterator
  • ExactSizeIterator - signifies this iterator has an exact size. It is a good idea to implement this traits as it makes way for optimizations by the compiler
  • FusedIterator - signifies that the iterator will always continue to yield None when first exhausted. This marker trait should also be implemented by all iterators that behave this way because it allows optimizing Iterator::fuse

In Rust, the convention is to choose one of the following two approaches:

1. Implement one or more iterator traits directly on your type

This is the approach you should take when the iterability of your type is limited (ie. only one of &Item, &mut Item and Item makes sense for the iterator to produce) and/or when the iterator itself is capable of keeping track of the next element it should produce, either because it is some sort of a FIFO or LIFO structure, or because it contains a counter for current element.

Keep in mind that the Iterator trait requires mutable access to the implementor, so you will have to choose the second approach if that is not possible

You might also choose this approach, if the items are generated somewhere else, for example, imagine you implement an abstraction over a network protocol as an iterator over received messages, which only starts returning None when the connection is closed.

Implementing an iterator on a structure is not a difficult endeavor:

#![allow(unused)]
fn main() {
/// a simple iterator producing the Fibonacci sequence
struct Fibonacci(usize, usize);

impl Fibonacci {
    fn new() -> Self {
        Self(0, 1)
    }
}

impl Iterator for Fibonacci {
    type Item = usize;

    fn next(&mut self) -> Option<Self::Item> {
        let res = self.0;
        self.0 = self.1;
        self.1 = res + self.1;
        Some(res)
    }
}
}

This simple iterator can then be used just like any other:

/// a simple iterator producing the Fibonacci sequence
struct Fibonacci(usize, usize);

impl Fibonacci {
   fn new() -> Self {
       Self(0, 1)
   }
}

impl Iterator for Fibonacci {
   type Item = usize;

   fn next(&mut self) -> Option<Self::Item> {
       let res = self.0;
       self.0 = self.1;
       self.1 = res + self.1;
       Some(res)
   }
}

fn main() {
    for i in Fibonacci::new().take(20) {
        println!("{}", i);
    }
}

The .take(n) method will only take the first n items of the iterator, producing None after the nth one. Otherwise, this would run infinitely depending on whether debug assertions are enabled and overflow would or would not cause a panic.

2. Use intermediary structures for iterators

When you want to allow multiple Item types, more than one of &Item, &mut Item and Item, and if the implementor itself is not capable of keeping track of the current element, or it cannot be used in a mutable context (for instance, because it is shared across threads), then you need to use an intermediary struct.

This is usually done like this:

  • implement methods .iter(), .iter_mut() (and optionally .into_iter() if you can't implement the IntoIterator trait for some reason) directly as methods on the type
  • each of these produces an instance of their respective iterator types, which you implement the iterator traits on

You can first see this pattern on the Vec type:

  • Vec itself does not implement only implements IntoIterator a couple times
  • The type itself has the methods .iter() and .iter_mut()
  • .iter() returns std::slice::Iter
  • .iter_mut() returns [https://doc.rust-lang.org/std/slice/struct.IterMut.html]
  • the IntoIterator owned impl returns std::vec::IntoIter

And the iterator traits are implemented on slice::Iter, slice::IterMut, vec::IntoIter respectively.

BTW: The first two methods make use of the fact that a vector implements AsRef and AsMut into slice, indicating a vector can be interpreted as a slice. This allows Vec to make use of a lot of methods and functionality already implemented on slices, such as in this case.

Note that implementing this approach takes significantly more code, however, the benefit is increased flexibility.

This is how an implementation on a simple iterator might look (only immutable iter implemented):

/// a simple iterator that alternates taking elements from both ends
struct Converging(Vec<usize>);

impl Converging {
    fn iter<'a>(&'a self) -> Iter<'a> {
        Iter(0, self.0.len(), false, self)
    }
}

/// start, end, take_from_end
struct Iter<'a>(usize, usize, bool, &'a Converging);

impl<'a> Iterator for Iter<'a> {
    type Item = &'a usize;

    fn next(&mut self) -> Option<&'a usize> {
        self.2 = !self.2;
        if self.2 && self.1 != self.0 {
            self.0 += 1;
            self.3.0.get(self.0 - 1)
        } else if self.1 != self.0 {
            self.1 -= 1;
            self.3.0.get(self.1)
        } else {
            None
        }
    }
}

fn main() {
    // type of i is &usize
    for i in Converging(vec![0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).iter() {
        println!("{}", i);
    }
}

The iterator structures generally have to have some link to the structure they come from, so it is often necessary to hold a reference. As the iterators are types like any other, ownership rules are still upheld and you need to be generic over the lifetime of the underlying structure whose iteration you are facilitating, hence the 'a' lifetime param.

When implementing a mutable iterator, it is sometimes possible that you are technically trying to hand out mutable references to the same object, even though you know that the references are disjoint and no issues with aliasing, memory corruption or data races should occur. Rust is not yet smart enough to be able to identify these cases and so iterators over mutable references often contain a smidgen of unsafe {} Rust to bypass the strict ownership rules.

Inner workings of iterators

Iterators in Rust are lazy, and don't do anything until they are consumed by one of the consumer methods, or a for-loop. For example, one might make a naive attempt to mutate a collection with iterators like this:

fn main() {
    let mut v = vec![1, 2, 3, 4, 5];

    v.iter_mut()
        .map(|i| *i *= 2);

    println!("{:?}", v);
}

However, if you try running this example, you will see that the printed vector is the same as the original one.

You will also get the following warning:

#![allow(unused)]
fn main() {
   Compiling playground v0.0.1 (/playground)
warning: unused `Map` that must be used
 --> src/main.rs:4:5
  |
4 | /     v.iter_mut()
5 | |         .map(|i| *i *= 2);
  | |__________________________^
  |
  = note: `#[warn(unused_must_use)]` on by default
  = note: iterators are lazy and do nothing unless consumed
}

Apart from .map() not being exactly the correct method to use, the second note tells you exactly what you need to know, nothing has consumed this iterator and so it didn't execute.

In Rust, iterators that have operations on them such as .map, .filter, .inspect, .filter_map create large and complex types, and then all of the actions are stacked on top of one another for each element as opposed to the first action being processed on all elements first, then the second action on all elements... and so on. The latter approach would actually be quite problematic if you remember that some iterators are endless or not fused, and so you can't be certain about the total amount of elements.

This makes iterators quite effective and about as performant (or even more performant) as loops, as they only have to iterate through the collection once.

Consider the following example:

#![allow(unused)]
fn main() {
// ranges are already iterators
let number_string = (0usize..20usize)
    .map(|x| x * 2)
    .filter(|x| x % 3 == 0)
    .take(6)
    .skip(3)
    .map(|x| x.to_string())
    .collect::<String>();

println!("{}", number_string);
}

You should see the output 182430.

Here is a table indicating how the elements were processed (x indicates that an element was eliminated in this step):

start012345678910111213141516171819
.map02468101214161820222426283032343638
.filter0xx6xx12xx18xx24xx30xx36x
.take0612182430x
.skipxxx182430
.map"18""24""30"
.collect"18""24""30"

Once an element is eliminated, no further operations are attempted on it.

The type of the iterator is this monstrosity:

#![allow(unused)]
fn main() {
Map<Skip<std::iter::Take<Filter<Map<std::ops::Range<usize>, [closure@src/main.rs:3:14: 3:23]>, [closure@src/main.rs:4:17: 4:31]>>>
}

Longer iterator operations, with more closures get much worse. As you remember from the closure chapter, these closure types are anonymous types, and their definitions can get pretty long. In older Rust, it was common to reach the type length limit with iterator operations.

There is no easy way to type the type out exactly, and it is outright impossible if there are closures involved, such as in this case. Not even generics help:

#![allow(unused)]
fn main() {
use std::iter::{
    Map,
    Skip,
    Take,
    Filter,
};
use std::ops::Range;

fn return_iter<T, U, V>() -> Map<Skip<Take<Filter<Map<Range<usize>, T>, U>>>, V>
where
    T: Fn(usize) -> usize,
    U: Fn(usize) -> usize,
    V: Fn(usize) -> String,
{
    (0usize..20usize)
        .map(|x| x * 2)
        .filter(|x| x % 3 == 0)
        .take(6)
        .skip(3)
        .map(|x| x.to_string())
}
}

This returns an error about mismatched types, and it hinges on the closures.

Existential types in iterators

As mentioned in the chapter on closures, existential traits, AKA the impl Trait syntax is also commonly used for when functions need to return an Iterator, as it allows you to avoid the closure problem and also avoid having to write out the very long concrete type.

It is also much handier when you decide to change the iterator later on.

Here is how you would fix the previous example with existential types:

#![allow(unused)]
fn main() {
fn return_iter<T, U, V>() -> impl Iterator<Item=String> {
    (0usize..20usize)
        .map(|x| x * 2)
        .filter(|x| x % 3 == 0)
        .take(6)
        .skip(3)
        .map(|x| x.to_string())
}
}

The syntax is much cleaner and you only need to adjust the return type if the final item type is different after your changes.

Once again, keep in mind that existential types are not generics, and you can only use them in return types of functions.

Itertools and rayon

Iterators are commonly implemented for types in many 3rd libraries, as they provide a handy interface to access things that can be even vaguely thought of as collections or as something producing values (ie. think of the receiving end of some communication -> unbounded iterator over messages that only stops when the connection closes). However, there is at least two libraries worth mentioning that focus on iterator usage specifically.

The first of them is itertools, a library focused on extending the functionality of iterators by adding a number of combinators and utils.

https://docs.rs/itertools/latest/itertools/index.html

Using it is as easy as importing the Itertools trait:

#![allow(unused)]
fn main() {
use itertools::Itertools;

let it = (1..3).interleave(vec![-1, -2]);
itertools::assert_equal(it, vec![1, -1, 2, -2]);
}

It also contains a couple of handy macros, such as chain, which can improve the readability of your code:

#![allow(unused)]
fn main() {
use std::{iter::*, ops::Range, slice};
use itertools::{assert_equal, chain};

// e.g., this:
let with_macro:  Chain<Chain<Once<_>, Take<Repeat<_>>>, slice::Iter<_>> =
    chain![once(&0), repeat(&1).take(2), &[2, 3, 5],];

// ...is equivalant to this:
let with_method: Chain<Chain<Once<_>, Take<Repeat<_>>>, slice::Iter<_>> =
    once(&0)
        .chain(repeat(&1).take(2))
        .chain(&[2, 3, 5]);

assert_equal(with_macro, with_method);
}

On the other hand, rayon is solely focused on one thing - parallel execution. And it uses iterators as a natural way to represent parallelized tasks.

https://crates.io/crates/rayon

This leans into a very useful thing, wherein you can often make an operation on an iterator parallel simply by simply replacing .iter() et al. with .par_iter().

#![allow(unused)]
fn main() {
use rayon::prelude::*;
fn sum_of_squares(input: &[i32]) -> i32 {
    input.par_iter()
         .map(|i| i * i)
         .sum()
}
}

Although it is not used as commonly by end users, rayon also supports scheduling tasks on your own without the use of iterators.

Composing types pattern

Related RustConf talk: https://www.youtube.com/watch?v=wxPehGkoNOw

Iterators are an example of a common pattern in Rust, that is the pattern of encoding information and actions into type in a more elaborate manner than most commonly seen in imperatively oriented programming languages, which more corresponds to functional programming.

Another example of this that we've seen are the async futures from the previous chapters, although the types involved are not directly visible and accessible to us. You can take a look at the [FuturesExt] trait and all the operations similar in setup to the iterators' operations found in its methods:

https://docs.rs/futures/latest/futures/future/trait.FutureExt.html

If you have worked with the Diesel ORM and query builder, it uses the same pattern for building queries.

In more abstract terms, the pattern can be described as such:

  • Create a number of composable types bound together by a trait
  • The trait should contain methods that allow composing, alternatively such methods can be on the composable types themselves
  • By using the methods, the user can declaratively describe what they need to do, the result of said calls being a complex type
  • No action happens until something consumes or uses said complex type (it can be reusable)

This approach has several benefits:

  • It leads itself to more declarative code, which makes it easier to understand from the user perspective
  • The complex type can be reused, saving work and resources
  • In the context of Rust, complex concrete types are very optimization friendly, and can result in better performance

I believe that this concept is best understood by practicing it on an exercise.

The Task: Type-encoded filter

For this project, the task is to implement a type-encoded filter, such that:

  • no dynamic dispatch is used
  • closures are stored directly without being in a box or similar container

The exercise is complete when the following code snippet works correctly:

fn main() {
    let tester = AllOf(
        (
            Negate(Simple(|x: i32| x % 2 == 0, PhantomData), PhantomData),
            Simple(|x: i32| x % 3 == 0, PhantomData),
            Simple(|x: i32| x > 10, PhantomData),
            Simple(|x: i32| x < 100, PhantomData),
        ),
        PhantomData,
    );

    for i in (0..1200).filter(|x| tester.test(*x)) {
        println!("matches: {}", i);
    }
}

As you can see, there is a lot of PhantomData. These are left here instead of being hidden in some sort of a ::new() associated function to help you figure the types out. You can make small adjustments to the code snippet above, so long as its essential nature is preserved.

PhantomData, which you might be familiar with from the trait chapters, is a zero-sized type which you can use to plant a generic parameter or a lifetime in your struct which would otherwise remain unused by struct members, which is not allowed in Rust.

It is most commonly used to represent something like this:

struct MyStruct<T, U>
where
    T: MyTrait<U>
{
    //...
    something_cool: T,
    _phantom: PhantomData<U>,
}

Without the PhantomData, the U type param would not be used inside the struct and so rust would complain. The example is editable so you can try it out.

1. Traits and types

Start by creating a Filter trait generic over any type T, with the single method of:

#![allow(unused)]
fn main() {
fn test(&self, t: T) -> bool;
}

Next create the following four types:

  • Simple, containing an Fn(T) -> bool, which uses the given closure to test if a T satisfies a condition
  • Negate, containing any type that implements Filter, which negates the boolean result of the contained filter
  • AllOf, containing any tuple between 2 and N elements of any type implementing Filter, which returns true only if the tests of all of its elements return true
  • OneOf, containing any tuple between 2 and N elements of any type implementing Filter, which returns true if any of the tests of its elements returns true

Implement the trait Filter for all of these types, with the trait bound of T: Copy.

For the tuples, N is 4 if you decide to do it by hand, and 10 if you use the impl_trait_for_tuples crate

To handle the tuples, we need to add another trait, which we can for example call TestTuple also generic over T, with the following single method:

#![allow(unused)]
fn main() {
fn test_tuple(&self, t: T) -> Vec<bool>;
}

Which returns a vector of test results of all of its elements.

To implement it on all tuples between such and such amount of elements, you can use the impl_trait_for_tuples crate. With it, you can create a blanket implementation such as the following:

#![allow(unused)]
fn main() {
use impl_trait_for_tuples::impl_for_tuples;

#[impl_for_tuples(2, 10)]
#[tuple_types_custom_trait_bound(Filter<T>)]
impl<T> TestTuple<T> for Tuple
where
    T: Copy,
{
    // your code here
}
}

Make sure to read the documentation to learn how to use for_tuples! properly, if you decide to go this way: https://docs.rs/impl-trait-for-tuples/latest/impl_trait_for_tuples/#semi-automatic-syntax

2. Testing and more

OneOf remains untested, so create a test for OneOf, which demonstrates its usage.

You are also welcome to run a thought experiment, or perhaps try to make it a reality if time allows, how would you modify the code to allow for T: !Copy.

3. Final product

In the end you should be left with a well prepared project, that has the following:

  • full functionality
  • documented code (where applicable)
  • clean git history that does not contain fix-ups, merge commits or malformed/misformatted commits