If the string is smaller than your chunk size, you can still do chunk-at-a-time processing by doing aligned loads and masking off the excess data. Loading out of bounds memory only crashes if you actually cross a page boundary and hit an unmapped page. Page size and load sizes are pow2, so aligned loads never cross page boundaries.
Naturally, there is no way to express this in system programming languages since the UB police arrived, but compilers seem to leave you alone if you use intrinsics.
Note that this depends on your architecture and is not true on CHERI (or any other environment that gives byte-granularity memory safety). We had to disable the transforms in LLVM that assumed this was possible.
It’s also undefined behaviour in C to read past the end of an object, so you need to be quite careful that your compiler doesn’t do unexpected things.
You could keep valgrind happy by rounding your allocations up and memset()ing the last few bytes to 0.
The implicit assumption in this technique is that your string is a contiguous block of memory. That’s common but it’s also an early design choice that leaves a lot of macro-optimisations on the table (for example, inserting a character in the middle of a long string is really expensive, whereas it can be quite cheap in a twine-like representation). Designing a string interface that is both performant and representation-agnostic is quite hard.
OpenStep’s NSString is actually pretty good. Strings must implement at least a length and characterAtIndex: method but can also implement getCharacters:inRange:, which copies a range of characters to a caller-provided buffer. Using the second method is pretty fast, because you can amortise the cost of the call by getting a lot of characters at once. There are two problems with this API:
The NSFastEnumeration protocol addressed both of these for collections of objects but, sadly, not for strings. With NSFastEnumeration, the caller provides a small structure that contains a temporary buffer that ti can copy into, but also a pointer to this buffer that can be pointed elsewhere. It also contains the returned length, so if you are iterating over a twine you can return each chunk as a complete range. Most importantly, it has a few pointers of state that are preserved across calls and so can be used for iteration state.
In C++, you can implement all of these ideas in an iterator class with inlined dereference and increment operators.
ICU has similar abstractions for text.
That’s not to say that the ideas in this article are useless. They compose very cleanly with a good string interface, because it will still provide you with varying-sized ranges of characters in contiguous blocks of memory. It is important to consider all of the levels of abstraction when optimising though. I’ve seen a lot of text-handling code that did careful microoptimisation to gain a few percent performance and left a factor of ten on the table from poor data-structure choice.
I’d like to see how he validates a UTF-8 string.
He wrote a blog post on validating UTF-8 in 2018.
Now I’m wondering, since I’ve seen posts about SIMD by Daniel Lemire too, if this could done faster by using SIMD (not sure why he didn’t even mention anything about SIMD).