2025-03-15 10:50:05 +01:00

name = "UTF-8" file = "src/string/utf8.rs"

We will focus on the String type and its borrowed variant &str. Both hold UTF-8 text, and to enforce this, every safe function that creates a string either returns valid UTF-8 or fails (with the error types we encountered before).
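As a small illustration of "valid UTF-8 or fail", here is a sketch built on the standard String::from_utf8, which returns a Result. The helper name parse_utf8 is hypothetical and only exists for this example; it discards the error details to keep the sketch short.

```rust
/// Hypothetical helper: tries to build a String from raw bytes,
/// returning None when the bytes are not valid UTF-8.
fn parse_utf8(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

fn main() {
    // "héllo" encoded as UTF-8: the 'é' takes the two bytes 0xC3 0xA9.
    let valid = vec![0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F];
    assert_eq!(parse_utf8(valid), Some(String::from("h\u{E9}llo")));

    // The byte 0xFF never appears in valid UTF-8, so construction fails.
    let invalid = vec![0x68, 0xFF];
    assert_eq!(parse_utf8(invalid), None);
}
```

In real code you would usually keep the Err value rather than turning it into None, since it reports where the invalid bytes start.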

deepening

A word about UTF-8: It is a string format where characters (or "codepoints") are encoded using a variable number of bytes. ASCII characters are a subset of UTF-8 and are thus each encoded in 1 byte, while other characters take 2, 3, or at most 4 bytes. Because of this, random access is difficult: you cannot compute the position in memory of the Nth codepoint without iterating through the string from the start.
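To make the variable width concrete, the sketch below checks the encoded size of a few chars with the standard char::len_utf8. The wrapper utf8_len is a hypothetical name introduced only for this example.

```rust
/// Hypothetical helper: number of bytes a char occupies once encoded as UTF-8.
fn utf8_len(c: char) -> usize {
    c.len_utf8()
}

fn main() {
    assert_eq!(utf8_len('a'), 1); // ASCII
    assert_eq!(utf8_len('\u{E9}'), 2); // é
    assert_eq!(utf8_len('\u{20AC}'), 3); // €
    assert_eq!(utf8_len('🧐'), 4);

    // str::len counts bytes, not chars, so the two numbers can differ.
    let s = "a🧐";
    assert_eq!(s.len(), 5);
    assert_eq!(s.chars().count(), 2);
}
```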

This is why Rust does not give access to a char through direct indexing (the [] operator), but does allow iterating over chars.
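The following sketch shows why an integer index would be ambiguous: the standard str::char_indices yields each char together with the byte offset where it starts, and those offsets are not consecutive once a multi-byte char appears. The helper char_offsets is a hypothetical name used only here.

```rust
/// Hypothetical helper: pairs each char of `s` with the byte offset
/// at which it starts.
fn char_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}

fn main() {
    // '€' occupies 3 bytes, so 'b' starts at byte offset 4, not 2.
    let offsets = char_offsets("a\u{20AC}b");
    assert_eq!(offsets, vec![(0, 'a'), (1, '\u{20AC}'), (4, 'b')]);

    // Integer indexing is rejected at compile time:
    // let c = "abc"[1]; // error: `str` cannot be indexed by an integer
}
```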

Let's implement a function accessing the Nth char of a string:

note

You can use str::chars to iterate over a string's chars. If you want a challenge, you can also read the UTF-8 spec and iterate over the individual bytes of the string.

/// Returns the char at the requested position, or None if it is out of bounds
pub fn char_at(s: &str, n: usize) -> Option<char> {
    unimplemented!()
}
fn main() {
    assert_eq!(char_at("abcdef", 2), Some('c'));
    assert_eq!(char_at("", 1), None);
    assert_eq!(char_at("🧐", 0), Some('🧐'));
}