exowos/utf8.md at a91bcc208a4f41674046337555fb4bddbf6150b0

lilymonade/exowos

lilymonade a91bcc208a

should be ok

2025-03-15 10:50:05 +01:00

name = "UTF-8" file = "src/string/utf8.rs"

We will focus on the String type and its borrowed variant &str. Both are UTF-8 strings. To enforce this, every safe function that creates a string either returns valid UTF-8 or fails (with the error types we encountered before).

deepening

A word about UTF-8: it is a string encoding where characters (or "codepoints") are encoded using a variable number of bytes. ASCII characters are a subset of UTF-8 and are each encoded in 1 byte; other characters take 2, 3, or at most 4 bytes. Because of this, random access is difficult: you cannot compute the position in memory of the Nth codepoint without scanning the string from the start up to that codepoint.
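To make the variable width concrete, here is a minimal sketch that reports how many bytes each char of a string occupies. The helper name utf8_widths is an assumption made for illustration; char::len_utf8 and str::chars come from the standard library.

```rust
/// Number of bytes each char of `s` occupies once encoded in UTF-8.
/// (`utf8_widths` is a hypothetical helper, not part of the exercise.)
pub fn utf8_widths(s: &str) -> Vec<usize> {
    // `char::len_utf8` returns 1, 2, 3 or 4 depending on the codepoint.
    s.chars().map(char::len_utf8).collect()
}

/// Returns (length in bytes, number of chars) for `s`.
pub fn byte_and_char_len(s: &str) -> (usize, usize) {
    // `str::len` counts bytes, not chars.
    (s.len(), s.chars().count())
}
```

For the string "a\u{e9}\u{20ac}\u{1f9d0}" (a, é, €, 🧐), utf8_widths gives [1, 2, 3, 4] and byte_and_char_len gives (10, 4): four chars spread over ten bytes.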

This is why Rust does not let you access a char by indexing a string with an integer ([] operator), but does let you iterate over its chars.
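As a rough illustration of what the standard library offers instead, the sketch below lists the byte offset at which each char starts, and slices by byte range without panicking. The names char_starts and slice_bytes are assumptions made for this example; str::char_indices and str::get are standard.

```rust
/// Byte offset and value of every char in `s`.
/// (`char_starts` is a hypothetical helper.)
pub fn char_starts(s: &str) -> Vec<(usize, char)> {
    // Note: `s[1]` would not compile, because `str` cannot be indexed
    // by a single integer.
    s.char_indices().collect()
}

/// Slices `s` by byte range, returning `None` when the range is out of
/// bounds or cuts a char in the middle.
/// (`slice_bytes` is a hypothetical helper.)
pub fn slice_bytes(s: &str, start: usize, end: usize) -> Option<&str> {
    // `&s[start..end]` would panic in those cases; `get` returns `None`.
    s.get(start..end)
}
```

With "a\u{1f9d0}b" (a, 🧐, b), the chars start at byte offsets 0, 1 and 5. slice_bytes(s, 1, 5) yields the emoji alone, while slice_bytes(s, 1, 2) yields None because byte 2 falls inside the emoji.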

Let's implement a function accessing the Nth char of a string:

note

You can use str::chars to iterate over the chars of a string. If you want a challenge, you can also read the UTF-8 spec and iterate over the individual bytes of the string.
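If you take the byte-level route, one possible starting point is to classify the first byte of each sequence, since it tells you how many bytes the codepoint spans. The function below is only a sketch: the name utf8_seq_len is an assumption, and the byte ranges follow the UTF-8 definition in RFC 3629. Because a &str is already guaranteed to be valid UTF-8, the None case should never occur when you start on a char boundary.

```rust
/// Length in bytes of the UTF-8 sequence that starts with `first`,
/// or `None` if `first` cannot start a sequence.
/// (`utf8_seq_len` is a hypothetical helper.)
pub fn utf8_seq_len(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7F => Some(1), // 0xxxxxxx: ASCII
        0xC2..=0xDF => Some(2), // 110xxxxx (0xC0 and 0xC1 would be overlong)
        0xE0..=0xEF => Some(3), // 1110xxxx
        0xF0..=0xF4 => Some(4), // 11110xxx, capped at U+10FFFF
        _ => None,              // continuation byte (10xxxxxx) or invalid
    }
}
```

For example, the first byte of "🧐" is 0xF0, so the function returns Some(4). The second byte of "é" (0xA9) is a continuation byte, so it returns None.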

/// Returns the char at position `n`, or `None` if `n` is out of bounds
pub fn char_at(s: &str, n: usize) -> Option<char> {
    unimplemented!()
}

fn main() {
    assert_eq!(char_at("abcdef", 2), Some('c'));
    assert_eq!(char_at("", 1), None);
    assert_eq!(char_at("🧐", 0), Some('🧐'));
}
