A Practical Lesson in Rust - Implementing an iterator over delimited string slices
Disclaimer: This section assumes you have a at least a cursory knowledge of basic Rust, and that you have already tried writing some code of your own.
For this, I recommend at the very least that you try out Carol Nichols' rustlings, especially the parts move_semantics, option, strings and traits
Prerequisites
- Trait syntax
- if-let and match
- Option and Result
- Vector, existence of basic iterator and .collect()
In this part, we will examine Rust's approach to references and ownership, and we will do that in relation to string types.
Memory management model
You might have already heard that Rust does not have a garbage collector, and manages its memory manually, but you might be surprised to learn that its model is a different to the one used in C.
Rust's memory management is lexical (with a few exceptions for ergonomics), which means that memory is allocated when a variable (or just a value) is created, and freed when it goes out of scope in terms of the syntax. When something goes out of scope is determined by the Rust compiler, and the compiler also ensures that there are no dangling pointers left. The part of compiler responsible for this is called the borrow checker
Rust enforces RAII (Resource Acquisition Is Initialization), so to put it simply, initializing a variable gives you memory or other resources (such as opening a file), and when an object goes out of scope, its destructor is called and its resources are returned to the system (sockets and files are closed, memory is freed).
In order to implement this effectively, Rust introduces a couple new concepts. Here is a short list with some succinct definitions from Pascal Hertleif (in quotes):
(I strongly suggest you run the examples, Rust's compiler errors are usually very descriptive and can help provide you insight into what is going on)
Ownership: You own a resource, and when you are done with it, that resource is no longer in scope and gets deallocated.
In Rust, to denote that you own something, you simply use its type plainly without any fluff around it:
fn main() { // I own this string in this function, by creating this variable, // I have allocated memory let the_11th_commandment = String::from("Braiins OS rocks!"); // the memory used by the string will be freed here, since // we have not passed its ownership elsewhere and main() ends here }
References to a resource depend on the lifetime of that resource (i.e., they are only valid until the resource is deallocated).
Often, you only want to give a reference to something. This is the Rust equivalent of
a const <type>* pointer. It only allows read access. You can create as many of these as you want
// in serious Rust, you'd use &str for flexibility, as &String can convert to it automatically fn print_my_string(string: &String) { // compare to `const char * const string`, which would be the C equivalent println!("{}", string); // the reference to string is destroyed here } /// the print_my_string() function does not take the ownership /// of the string, so you can pass it multiple times; for references /// rust creates copies if necessary fn main() { let the_11th_commandment = String::from("Braiins OS rocks!"); print_my_string(&the_11th_commandment); print_my_string(&the_11th_commandment); // you can also create a reference and store it in a variable let string_ref = &the_11th_commandment; print_my_string(string_ref); // <- string_ref is destroyed here // <- the_11th_commandment is destroyed here }
However, as stated in the excerpt from Pascal, references are only valid for as long as the resource exists. This is a common pitfall for new Rust programmers:
// this function won't compile // we have to specify a lifetime explicitly here, // otherwise Rust assumes you maybe want to return // constants, which have a 'static lifetime, and // as such live forever fn give_me_a_ref<'a>() -> &'a String { let temp = String::from("This function is a prison and I am trapped in it."); &temp // <- temp would be freed here, // the returned reference cannot outlive it }
Move semantics means: Giving an owned resource to a function means giving it away. You can no longer access it.
This is a major difference to languages with C-like semantics, which use copy semantics by default, i.e. to give a parameter to a functions means to create a copy which is what is then available in said function.
In Rust, however, you take the value you have and give it to a function, and then you can no longer access it:
fn completely_safe_storage(value: String) { // <- value is immediately freed } fn main() { let x = String::from("1337 US Dollars"); completely_safe_storage(x); // <- ownership of x was moved to completely_safe_storage() println!("{}", x); // <- this does not compile, as we no longer have the ownership of x }
We then say that main() owns x until completely_safe_storage() is called,
at which point ownership is handed to it (= x is moved into the function), and completely_safe_storage() owns x
until it is dropped.
To not move a resource, you instead use borrowing: You create a reference to it and move that. When you create a reference, you own that reference. Then you move it (and ownership of it) to the function you call. (Nothing new, just both concepts at the same time.)
We have already kinda demonstrated this two examples ago, but we can make a more annotated example:
fn takes_reference(my_ref: &String) { // <- reference is moved println!("{}", my_ref); // <- this macro actually takes all arguments by reference // so a &&String is created here, which is moved into the // interals of the macro // <- my_ref is destroyed here } fn main() { let x = String::from("Hello, world!"); // <- allocate and initialize new string x to // "Hello, world!" // main() now owns x let reference = &x; // <- create a reference to x // main() owns this reference // we call this "borrowing x (immutably)" takes_reference(reference); // <- reference is moved into takes_reference(); // <- x is freed here }
To manipulate a resource without giving up ownership, you can create one mutable reference. During the lifetime of this reference, no other references to the same resource can exist.
To prevent issues with pointer aliasing and memmoved() resources, and a whole plethora of possibilities for memory corruption, Rust prevents you from having more than one reference to a resource, if said reference is mutable. For example, you can't do this:
fn main() { let mut bitcoin = String::from("bitcoin"); // Rust is actually pretty smart, // so if it sees you are not using mut_ref // after you have created ro_ref, it will // destroy it early, this is a relatively // recent change for ergonomics in Rust // called Non-Lexical Lifetimes let mut_ref = &mut bitcoin; // <- borrow bitcoin mutably // mut_ref is of type `&mut String`, // given that the variable itself is immutable, // this corresponds to `char* const ptr` in C let ro_ref = &bitcoin; // <- borrow bitcoin immutably println!("{}", ro_ref); // <- use the immutable borrow mut_ref.push_str(", the cryptocurrency"); // <- use the mutable borrow }
We briefly also touched up on the concept of lifetime. This denotes how long a resource exists or is accessible from start to finish. Mostly, we speak about these in terms of references.
Rust use the 'ident syntax to denote lifetimes, as we have seen in the invalid
reference-returning example before. Just like in the previous parameter, they
usually appear as generic parameters. What you call them is up to you, although
usually, single letters starting from 'a are used. The only thing you can do
with these explicit lifetimes is verify if they are equal, or rather, if one satisfies
the other (e.g. lifetime 'a lives as long or longer than 'b)
The exception is the 'static lifetime, which denotes references that are valid
for the entirety of the program's run from anywhere, You mainly get these via constants
and statics.
#![allow(unused)] fn main() { static NUMBER_REF: &'static i32 = &42; }
To fully illustrate the concept of lifetimes, we can annotate an earlier example with appropriate lifetime scopes for values. This is more of a pseudo-code, so this example is kind of for looking only:
fn main() { 'bitcoin_lifetime: { let mut bitcoin = String::from("bitcoin"); 'mut_ref_lifetime: { let mut_ref = &mut bitcoin; // <- borrow bitcoin mutably 'ro_ref_lifetime: { let ro_ref = &bitcoin; // <- borrow bitcoin immutably println!("{}", ro_ref); // <- use the immutable borrow mut_ref.push_str(", the cryptocurrency"); // <- use the mutable borrow } // <- ro_ref goes out of scope here ┐ // ├ these references can't coexist, } // <- mut_ref goes out of scope here ┘ hence the issue } // <- bitcoin goes out of scope here }
To illustrate how you can assure two references live for the same duration:
// This denotes: // for two references left and right, which live the same, // return a reference that lives as long as these two // // // It is important to keep in mind, that Rust can accept // parameters of varying lifetimes by shortening one of them // in the perspective of the function fn max_ref<'a>(left: &'a i32, right: &'a i32) -> &'a i32 { if *left < *right { right } else { left } }
You can also specify other types of requirements:
// for two lifetimes 'a and 'b, such that 'a lives // as long as 'b or longer fn foobar<'a, 'b>(_x: &'a i32, _y: &'b i32) where 'a: 'b { // code... }
That’s it. And it’s all checked at compile-time.
This is only a very brief introduction, for a more complete overview, please check out the following links:
- https://doc.rust-lang.org/book/ch04-01-what-is-ownership.html
- https://depth-first.com/articles/2020/01/27/rust-ownership-by-example/
- https://blog.logrocket.com/understanding-ownership-in-rust/
Strings in Rust
A peculiarity of Rust is that it does not have a single string type in the standard library,
but rather seven (there may be more when you read this text):
- &
str/ &mutstr- primitive string slice type behind a standard reference - Cow
- Clone-on-Write wrapped string slice, works for both owned and borrowed values, not seen very often (which is unfortunate, since they can be really handy!) - String - owned string
- OsStr - borrowed platform-native string, corresponds to
&str - OsString - owned platform-native string, corresponds to
String - CStr - borrowed C string, corresponds to
&str - CString - owned C string, corresponds to
String
From these, you are most likely to encounter str and String, and
str is the primitive type that is always available regardless of if
you have std and core lib present.
str is a slice type, which comes with some features:
- slices are views into collections regardless of where they are present, string slices can exist on the stack, heap, or compiled into the binary (whereas Strings are on the heap)
- slices' size is not static or always known at compile-time, so just like
trait objects, they can only exist behind a reference, so you'll generally
encounter string slices as
&stror&mut str
In Rust, string literals are string slices too:
#![allow(unused)] fn main() { fn my_ref() -> &'static str { "Hello world!" } }
In the previous example, I have annotated the lifetime of the reference we got. Since string literals are compiled into the library, they are by-default valid for the entire run of the program.
Normal conditions for working with references and ownership still apply. You can't return a string slice of a string you've created in the function you are returning it from:
#![allow(unused)] fn main() { fn my_ref() -> &str { let my_string = String::new(); &my_string // <- doesn't compile, return value references temporary value } fn my_ref2(input: &String) -> &str { &input // <- compiles, since input is known to live longer than the span // of this function } }
The error here says "missing lifetime specifier" because it correctly deduces that there is nowhere outside (meaning parameters) to deduce the lifetime and borrow the value, so it assumes you must want to return a reference to either a static or a string slice literal (which also has a static lifetime, given it is compiled into the binary).
An important feature is that you can create string slices from other string slices without copying data by re-borrowing:
fn main() { let my_str = "Hello, world!"; let hell = &my_str[..4]; println!("{} {}!", my_str, hell); }
We can do this, since we are working with immutable references, and the mutable reference cannot exist so long as immutable ones exist.
fn main() { let my_str: &mut str = "Hello, world!"; }
This does not compile because string literals are always immutable (see that the error says "types
differ in mutability"). And no, you cannot circumvent this by doing &mut *"Hello, world" ;-) that
would be very unsafe.
TIP: By the "types differ in mutability" error, you might deduce that a borrow and its mutability make for separate types, that is
T is not the same type as &T is not the &mut T. Keep that in mind when you have enter a generic parameter somewhere or create a trait implementation.
In the task below, we will be using string slices (&str) exclusively, but let's also look into owned Strings.
#![allow(unused)] fn main() { fn my_string() -> String { String::from("Hello, world"); // or "Hello, world".to_string() } }
You usually use owned strings wherever &str is impractical or you need mutability. &String coerces into &str,
so it is always the proper choice when needing read-only string function parameters:
// don't fn my_fun1(_input: &String) {} // do fn my_fun2(_input: &str) {} fn main() { let my_string = String::new(); my_fun1(&my_string); // both work my_fun2(&my_string); // however, this wouldn't work: // // my_fun1("Hello!") <- type mismatch, expected &String, got &str }
As you can see, using &str provides greater flexibility.
Lifetimes of owned vs unowned values
Sometimes when looking into Rust, you might hear that owned values have static lifetimes. The static lifetime here
means that the value is not a borrow of anything else, and so no other values imposes a lifetime on it. This makes
owned values type-check where 'static is required.
If a type is holding a borrow to something, then it needs to have the lifetime of the contained reference(s) as a generic parameter, and it will inherit the lifetime of the shortest-lived contained reference (this is to, once-again, prevent memory unsafety).
The lifetime of an owned value is bound by the scope of the function (or rather code block) it was declared in, provided it isn't moved. Without creating a reference, owned values can only be moved and possibly copied if it is allowed for said type.
The lifetime of a references is bound both by the scope of the function it was declared in, and by the lifetime of
the owned value it is borrowing. References itself are values and types also (remember from a couple lines above: Rust considers &T to be a distinct
type and you can implement traits on it), so rules of ownership apply to them as well.
There is a bit of trivia to be known:
&TisCopy, meaning the compiler will create and pass around copies as applicable&mut Tis notCopy, meaning it follows move semantics and when used as a parameter, it gets moved rather than copied
This comes from the definition of borrowing rules written above, and may lead to unexpected surprises when you don't pay attention to it.
Here's an example:
// in the business, we call this foreshadowing ;-) // remember this code example when working on the project below struct MyStruct<'a> { remainder: Option<&'a str>, } impl<'a> MyStruct<'a> { // this will keep returning the first character fn pop_first_char_as_string(&mut self) -> Option<&str> { // surprise! remainder here gets copied, // so we are not modifying which pointer is // stored in self, but only a copy on the stack let remainder = &mut self.remainder?; let c = &remainder[0..1]; if remainder.len() != 1 { *remainder = &remainder[1..]; Some(c) } else { self.remainder.take() } } } fn main() { let mut broken = MyStruct { remainder: Some("Hello"), }; for _ in 0..5 { println!("{:?}", broken.pop_first_char_as_string()); } }
The reason why this code does not work as you might expect is that the underlying immutable reference got copied and then we took an immutable reference to said reference.
We still need a &mut &'a str to properly solve this, however, we need to prevent the copy.
The solution is to borrow mutably while still inside the option either through pattern matching
or by using a handy dandy .as_mut() method on Option, the result of which is another option
containing a mutable reference to the contents of the original Option, if it was Some.
Here is how to pattern-match borrow:
#![allow(unused)] fn main() { if let Some(ref mut contents) = Some("Hey") { // ... } }
The if-let will bind the value to the right of the equation sign to the pattern on the left side, if the pattern matches the value.
ref and ref mut are special pattern modifiers that borrow whatever matches them. Sometimes, new
developers confuse it with (or question why it isn't) Some(&mut contents) = Some("Hey").
The latter is a valid syntax also, but it does the opposite, it pattern matches a
mutable references and binds the data it points to to contents, which is what we don't want
in this case.
You might want to know more about slices and string slices, please check out the following links:
Pattern matching
Pattern matching has been briefly mentioned in the paragraph above. You have likely already encountered it and will encounter it many more times doing common tasks in Rust.
If you want to learn more about pattern-matching, check out the section on the sidebar.
In general, there is two types of patterns, constant patterns and bindings.
Constant patterns limit which values are acceptable by set pattern, for example, in the following pattern
#![allow(unused)] fn main() { if let Some(val) = some_option {} }
The Some(...) part is constant -> nothing other than an Option::Some will match this, whereas val is a binding,
it will bind whatever is in that spot in said value to the name val, which is than available in this if.
Some patterns are also invariant, whereas other are not:
#![allow(unused)] fn main() { let (first, second) = a_tuple_of_two; }
Here, this pattern is always true, we know if we get a tuple, we can always destructure it into its constituent elements.
The fact that it is a tuple of two here is the constant factor, ie. (.., ..).
Patterns can be nested as much as you want or need:
#![allow(unused)] fn main() { if let (Some(4), Err(MyErrorEnum::Other(err))) = a_pair { } }
Here, this pattern will only match o a pair of Option and Result, if the Option is of the variant Some and Result of the variant Err.
Furthermore, the Option must contain the integer 4, and the Err must contain an MyErrorEnum::Other variant of the supplied error
type. We then bind the inner error in this type into the name err, which then becomes available inside the if-let.
Global "variables" in Rust
While you are unlikely to run into these in this chapter, this is the first chapter most newcomers will read and it therefore stands to reason that one of the most different things about Rust newcomers are prone to run into should be mentioned here.
To put it shortly, Rust really, really, doesn't like global variables. This is with regards to its two of its stated goals, name explicitness and safety.
Global variables are generally pretty bad when it comes to safety especially across threads, and using them properly requires synchronisation mechanisms. Adding these implicitly is beyond the explicitness goal of Rust, so the Rust way to do things is to restrict them severely.
In Rust, global variables are called statics, which speaks to their nature as often being static and requiring static initialization only.
Statics are declared with the static keyword, have to have their type typed out (no, or very little, type inference):
#![allow(unused)] fn main() { static N: i32 = 5; static mut M: i32 = 15; }
Mutable statics are quite problematic, and they can only be used in unsafe code, since you are prone to running into issues with multithreaded code, data races, race conditions etc.
If you need mutable shared states, you have to use one of the types with interior mutability, such as a Mutex.
However, only literals and constant function calls are allowed in static context (and all references to types have to have the static lifetime), so you need to use a crate that provides lazy static functionality.
The Task - Implementing a &str token iterator
For the first project, it will be your task to implement a string slice token iterator.
The iterator should be a type named TokenIterator and it should
keep returning tokens from an &str of the type &str that were
separated by an &str delimiter.
This means that your Iterator should work with string literals and strings alike.
You are forbidden from using the following to create your list:
- Owned strings (
Stringetc.) - Mutable strings
- slice and str.iter() methods which already provide this functionality
This project should be organized as a legitimate library. Code should be documented, unused code should not be present, the crate should expose a usable public API, and be logically organized into modules.
Develop your project with git, and make sure that your commits abide to the Braiins Guidelines (ADD LINK TO PAGES HERE)
You are also encouraged to write tests, both integration and unit tests. To learn about tests, here is a pair of links that you can to refer to:
- https://doc.rust-lang.org/cargo/commands/cargo-test.html
- https://doc.rust-lang.org/book/ch11-01-writing-tests.html
In fact, here is a couple tests you might want to use for your implementation:
#![allow(unused)] fn main() { // assuming method TokenIterator::new(source: &str, delimiter: &str) #[test] fn it_works() { let test_input = "1 2 3 4 5"; let tokens: Vec<_> = TokenIterator::new(test_input, " ").collect(); assert_eq!(tokens, vec!["1", "2", "3", "4", "5"]); } #[test] fn tail() { let test_input = "1 2 3 4 "; let tokens: Vec<_> = TokenIterator::new(test_input, " ").collect(); assert_eq!(tokens, vec!["1", "2", "3", "4", ""]); } }
1. Basic TokenIterator implementation
Implement TokenIterator as specified above.
After you are done, one should be able to use the TokenIterator such as:
fn main() { let iterator = TokenIterator::new("My name is, chka-chka, Slim Shady", " "); for word in iterator { println!("{}", word); } // Output: // // My // name // is, // chka-chka, // Slim // Shady }
2. Char delimiters
Let's make our iterator more flexible by allowing for more flexible delimiters.
Create a trait named Delimiter with the following definition:
#![allow(unused)] fn main() { pub trait Delimiter { fn find_next(&self, source_string: &str) -> Option<(usize, usize)>; } }
Where &self is the delimiter. The Result should be the beginning and end index of the delimiter within
the string, or None if delimiter is not found in the string.
Implement it for both &str and char and modify your TokenIterator to use
accordingly.
Afterwards, you should be able to do this:
fn main() { let iterator = TokenIterator::new("My name is, chka-chka, Slim Shady", ' '); // <- CHAR here for word in iterator { println!("{}", word); } // Output: // // My // name // is, // chka-chka, // Slim // Shady }
4. until_char()
Next, create an until_char() free-standing function with the following declaration:
#![allow(unused)] fn main() { /// Return slice from beginning until char `c` is encountered or the entire string slice pub fn until_char(s: &str, c: char) -> &'_ str; }
And implement it accordingly using your TokenIterator, then write a test that verifies its usage.
5. Extend str
Let's finish this by providing a better syntax for creating a token iterator.
Create a trait called TokensExt, which contains the method tokens(&self, delim) that creates
a new TokenIterator from &self, then implement it for &str, so that we can do this:
fn main() { for word in "My name is, chka-chka, Slim Shady".tokens(' ') { println!("{}", word); } }
6. End product
In the end you should be left with a well prepared project, that has the following:
- implementation of the TokenIterator itself
- usable and safe Public API
- appropriately documented code
- tests
- clean git history that does not contain fix-ups, merge commits or malformed/misformatted commits
Your Rust code should be formatted by rustfmt / cargo fmt and should produce no
warnings when built.