This release conforms to Unicode 15.0.0 and adds word, sentence and line segmentation to the existing grapheme cluster (i.e. perceived character) string segmentation. It also includes case detection and conversion in lower-, upper- and title-case.
There has been heavy work on conformance and correctness. The software now contains over 10,000 automatically generated conformance tests and over 150 unit tests.
The library is freestanding, making it attractive for certain bare metal applications.
How amenable would you/suckless.org be to a build system fix that makes libgrapheme build on macOS? I’ve got a trivial proof-of-concept fix, but I need to generalize it so it’ll still work on not-macOS. Clang doesn’t recognize --soname, instead using -install_name and requiring -lSystem to be passed. I think this could get gated on either a uname -s or a cc -dumpmachine check, probably the former. I know suckless.org is very Linux/BSD oriented, so I approach this accepting that “No Thanks” is probable, and that’s OK.
I would definitely welcome it and simply lacked a tester until now! I’m currently working on a patch to accommodate the major differences in macOS, in the hope that you’re interested in testing it. :) I’ve sent you a private message for further details.
I haven’t looked at the latest version, but the thing missing for me previously was an efficient iteration API for interfacing with string representations that are not contiguous (but might have a grapheme cluster spanning two contiguous runs), without needing a function call per code point.
ICU has UText, which is quite similar to NSString but with a few optimisations. A UText has (optionally) a small buffer, a length, and a pointer to the start of the run, and callbacks to update the pointer to include a designated index in the string. Users can copy into the buffer if their string uses a non-standard encoding, or provide pointers to their runs for zero-copy access to fragments of something like a twine or rope.
Thanks for your feedback! So to make clear that I understand correctly, condensing it down into an example, you are referring to a case where you have two buffers buf1=“Hello “ and buf2=“World!” and you want to have a way to look at them “contiguously” and be able to feed them consecutively into, e.g., a sentence-break-detector?
Technically, this is possible, especially given that I have already written everything in such a way that buffer accesses only go forwards, never backwards. To be fair, though, all these indirections add a performance penalty, and it’s realistic to assume that grapheme clusters, words, sentences and the next line-break opportunity are within such reach that they can be reflected in a normal heap object.
Offering arbitrary iterators motivates people to write inefficient, I/O-heavy code (e.g. reading in files byte-wise), unless you also take care of internal buffering, etc. It’s much better to simply read a file into a buffer in one run and go on from there. Regarding foreign encodings: UTF-8 should be the default everywhere, and there are no good reasons (other than legacy or poor language decisions (Windows, Java, etc.)) to keep using anything else. It’s not too large an overhead to simply convert any internal representation (from the storage form) to a sane format (UTF-8 or raw codepoints, where the latter is ideal for random access, unlike UTF-16, which sucks at everything).
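To make the first point concrete, here is a minimal sketch of what I mean by reading a file into a buffer in one run (plain stdio, error handling kept terse):

#include <stdio.h>
#include <stdlib.h>

/* Read an entire file into a heap buffer in a single pass.
 * Returns NULL on error; on success *len holds the number of bytes read. */
static char *
slurp(const char *path, size_t *len)
{
	FILE *fp;
	char *buf;
	long size;

	if (!(fp = fopen(path, "rb")))
		return NULL;
	if (fseek(fp, 0, SEEK_END) != 0 || (size = ftell(fp)) < 0 ||
	    fseek(fp, 0, SEEK_SET) != 0 || !(buf = malloc((size_t)size + 1))) {
		fclose(fp);
		return NULL;
	}
	*len = fread(buf, 1, (size_t)size, fp);
	fclose(fp);

	return buf;
}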
I also see this as a case for the 95/5-rule: Making 95% of users happy is easy, but the last 5% are really difficult and can make a simple product overly complicated. This should not sound like a rebuttal, but maybe you can give me a concrete example where having such a data-structure comes in handy in a real problem. :)
More or less, yes. Most string libraries don’t store the data in a single contiguous chunk because it makes editing incredibly expensive (deleting a character is O(n) in the length of the string, as is inserting). All of the representations that I’ve seen boil down to some pool of fixed-sized buffers, with some other structure (skip list, tree, index table, whatever) for mapping from a code point or code unit index to a buffer+offset pair. I’ve seen end-to-end transaction processing rates double from some codebases moving to this kind of representation from a fixed buffer.
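As a rough sketch of the kind of representation I mean (the names and the linear-scan index are made up for illustration, not any particular library’s layout):

#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE 4096

/* Pool of fixed-size buffers holding UTF-8 code units. */
struct chunk {
	uint8_t data[CHUNK_SIZE];
	size_t used; /* bytes currently in use */
};

struct chunked_string {
	struct chunk **chunks;
	size_t nchunks;
	size_t *starts; /* starts[i] = global byte index of chunks[i]'s first byte */
};

/* Map a global byte index to a (buffer, offset) pair. A real implementation
 * would use a tree or skip list here instead of a linear scan. */
static int
locate(const struct chunked_string *s, size_t index, size_t *ci, size_t *off)
{
	size_t i;

	for (i = 0; i < s->nchunks; i++) {
		if (index < s->starts[i] + s->chunks[i]->used) {
			*ci = i;
			*off = index - s->starts[i];
			return 1;
		}
	}

	return 0; /* out of range */
}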
In this kind of representation, you may end up with one code point in one buffer and the next code point in the same grapheme cluster in the next. If you’re using a variable-length encoding then this can even happen in the middle of a code point. The latter case is fairly common if you’re pulling data from a network via a stream protocol, because any other solution involves an extra copy (consider pulling a 1MiB JSON buffer from a socket and wanting to do sentence breaking on a few strings in the middle, without copying).
UTF-8 is a good default for alphabetic languages, but UTF-16 has better cache utilisation properties for CJK languages and so keeping everything in UTF-16 can have nice performance properties there. Java, C#, and Objective-C all define their APIs in terms of UTF-16 code units (they were originally defined as Unicode code points, but when Unicode stopped fitting in 16 bits this was the only way of avoiding an ABI break). Any code that needs to interoperate with these languages will often be faster keeping UTF-16 internally, or it will need to transcode on every language crossing. I’ve heard of places where UTF-32 was more efficient but from people who weren’t willing to share their usage patterns and so I can’t tell if this was actually true or an artefact of something else in their codebases.
The UText API is pretty easy to use for this kind of thing. You have a callback to fill a structure with a pointer to the start of the buffer, the index of the character (UTF-16 code unit in ICU’s case, because they use UTF-16 internally) where the buffer starts and the length of the buffer. For a contiguous string, this can be preinitialised and the callback is never invoked. For a string in a different encoding (for example, in Objective-C we have a small string optimisation that encodes small strings in pointers, which saves a lot of allocations but requires the string to be expanded for a lot of operations), you can decode a range and store it in the internal buffer in the UText structure. For a twine, you can provide it with each of the contiguous buffers.
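The rough shape of that interface is something like the following (a simplified sketch with made-up names, not ICU’s actual UText definitions; ICU tracks UTF-16 code units here, the sketch just uses bytes for brevity):

#include <stddef.h>

struct text_chunk {
	const char *start; /* start of the currently visible run */
	size_t offset;     /* global index of the run's first code unit */
	size_t length;     /* number of code units in the run */
	char scratch[32];  /* optional small buffer for transcoded data */
};

struct text_source {
	void *ctx; /* rope node, twine, NSString, socket buffer, ... */
	/* Move the chunk so that it covers `index`; returns 0 if out of range.
	 * For a contiguous string the chunk is pre-initialised and this
	 * callback is never invoked. */
	int (*access)(void *ctx, size_t index, struct text_chunk *chunk);
};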
I did some benchmarks a few years ago for using ICU with UText versus copying into a contiguous buffer and, with the exception of very small strings (where function call overhead often dominated), the allocation, deallocation, and copying cost far more than using UText.
There are no good reasons to use UTF-16 as a storage or interchange format nowadays, and the reasons put forth regarding CJK sound like a cope built on top of a false legacy decision to make the type 16 bits wide (Java, C# and Objective-C all made the same mistake, even though Unicode is also at fault here for messing this up so badly).
Compared to UTF-8, you have a lot of additional complications with UTF-16: to start, there are actually two kinds of UTF-16, namely UTF-16BE and UTF-16LE, which need to be declared with BOMs. This alone is a killer criterion for UTF-16 as a general-purpose storage and interchange encoding, imho, and the remaining possible field of application boils down to internal representation.
The claimed advantage for CJK languages sounds good on paper (i.e. you need 2 bytes in UTF-16 instead of 3 bytes in UTF-8), but realistically speaking, unless you work on a Japanese library project with gigabytes of only Japanese writings, this is not so important. Take a look at the data that is usually interchanged, and you’ll find that it is typically filled with a lot of ASCII. The simplest example is HTML. And for such data, UTF-8 only needs 1 byte per codepoint, while UTF-16 needs double that, namely 2 bytes. And even if you discard this argument: a simple gzip compression of said artificially exemplified Japanese library will negate this entire argument anyway, because then the chosen encoding will most likely not matter at all.
It becomes even worse when people assume that UTF-16 allows trivial random access (because surrogates occur so rarely). A lot of code out there is broken in this regard, and UTF-8 encourages better code given you quickly end up with the need to parse more than 1 byte.
In conclusion, the only two encodings that make sense are UTF-8 (storage, exchange) and UTF-32 (internal representation for O(1) random access, where byte order does not matter given it’s created “on” the machine).
The methodological mistake in your benchmark was probably that, for any non-trivial application, I/O always dominates data reading and writing, so the overhead of allocation is negligible in this regard. Internal computation, though, which is the other time-intensive part of any non-trivial application, greatly benefits from an O(1) data structure.
You are making a lot of assertions that directly contradict my experience of profiling large real-world text-heavy workloads. Your assertion that gzip-compressing text will improve cache usage makes absolutely no sense to me: you will lose far more from the compression tables and the need to have two copies than you will save, and you are going to have to decompress and recompress an entire string for any modification, unless you use a twine or rope-like data structure, in which case you will end up needing to carry more compression state around with you and further impact cache usage.
My point there wasn’t clear enough: The CJK-argument was meant to be tangential about using UTF-16 as a storage format (meant as in archival) rather than internal representation.
Your points regarding compression overhead are totally valid, but I meant something different.
I’ve taken a very quick look at the repository, got a feel of the metrics there, and… this is good:
As far as I can tell there are only 7 source files and 2 headers in the main project.
Sloccount reports 1590 lines in the source code. This is smaller than my little crypto library!
The public API is tiny: 23 functions, one #define constant and that’s it.
The makefile is bigger than I would like… but so are the Makefiles I write. Can’t complain there.
The libgrapheme.so I build in 3 seconds by typing make is under 330KB on my modern laptop, even after integrating the look-up tables in the data/ folder. These are almost 800KB even after gzip compression. They weren’t kidding about compressing those tables.
The source code in src/*.c looks consistent and well commented. Lines aren’t overly long (only 10 lines exceed 90 columns, the longest is 101), so I know they didn’t cheat with Sloccount.
Reported performance exceeds that of more established, bigger alternatives.
grep -r "alloc" src suggests everything is allocated on the stack.
Two things I don’t like:
Using tabs for indentation. I prefer 4 spaces instead.
Using /* */ for comments. I prefer // instead (which is standard C99 by the way).
But really, who cares about the colour of the bike shed? This is awesome. If I ever need to process Unicode text, I know what I’ll try first.
Using tabs for indentation. I prefer 4 spaces instead.
I used to be pretty adamant about spaces, but then I read things arguing that tabs are better for accessibility reasons. It can all be automated, so I rely on that instead now: whatever is in the .editorconfig goes.
I won’t convert a repo unless I fully own it, but I try to get all new repos using tabs. I’m inconsistent but getting better slowly over time.
Oh my, that’s the strongest argument in favour of any one side I’ve ever seen.
I’ll just note that the one point that is supposed to be in favour of tabs is actually a fairly strong argument in favour of doing whatever we like:
And if you prefer spaces, feel free to use an auto-formatting tool that converts tabs to spaces when opening a file (all modern editors have these). Just make sure that the spaces get converted back to tabs before anyone else has to work on your code.
Assuming people indent their code correctly and consistently, it is always possible to convert one way or another:
Tab indented code can be converted to space indented code and back.
Space indented code can be converted to tab indented code and back.
Space indented code can be converted to indent with more (or less) spaces… and back.
And supposedly most editors have such tools. So who cares what format the code is stored? Anyone can convert it to their preferences, including when those “preferences” are actually a crucial accessibility feature. In fact, I suspect many disabled people actually set up such a conversion to work on whatever code they need to work with.
On the other hand, this argument is not that valid. Those coworkers of that Reddit commenter having difficulties with spaces clearly did not have access to such conversion tools. Then there are more casual contexts, like copying code samples from the web. Firing up the conversion tool every single damn time would be bloody inconvenient. And braille readers read web pages too. And one thing I just learned: CSS can set the tab width.
That alone convinced me that I should most probably convert all my documentation to tabs instead of spaces. I’m still on the fence about converting my source code… but I can always set up my editor to my preferred tab width. The only thing I will definitely not change is the fact that my 80 column limit will assume indentation is 4 columns wide.
Now I’d like Emacs to do one thing for me: when it indents a line of code (because I typed the tab key or indented a region or whatever), it would be very nice if it could use spaces for alignment like God intended:
void foo()
{
. bar(baz,
. ....wiz);
}
If indenting with spaces makes disabled people’s lives harder, using tabs for alignment is evil.
…Wait a minute… those conversion tools… how can they distinguish alignment from indentation if it’s all spaces? They’d have to be syntax aware, which makes them much more complicated all of a sudden. We can convert to spaces all right with a trivial sed 's|\t| |g' one liner, but converting back is much harder…
Crap. Looks like I should convert my code as well.
This is silly advice, since this conversion is impossible. If converting from spaces to tabs were possible to automate, there would be no debate.
Did you miss the penultimate paragraph of my comment?
…Wait a minute… those conversion tools… how can they distinguish alignment from indentation if it’s all spaces? They’d have to be syntax aware, which makes them much more complicated all of a sudden. We can convert to spaces all right with a trivial sed 's|\t| |g' one liner, but converting back is much harder…
I don’t think the conversion is impossible, but it is difficult enough (so many languages to be aware of) that we can’t rely on those tools being available. Those two coworkers of that Reddit poster didn’t have that, for instance.
Yes, I don’t understand this entire debate, given the canonical solution is to simply use tabs for indentation and spaces for alignment. Then you can simply change the number of characters your text-editor uses for tabs and everybody is happy.
Simply set
:set tabstop=4
in vim, for instance, to quote @Loup-Vaillant’s preference, and you’ll have your desired 4-space indentation in libgrapheme, while the alignment stays perfectly fine.
There’s one thing however that I tend to be anal about: line widths. Depending on context, I like to limit them. For instance, Monocypher sticks to a hard 80-column limit. And with tabs, that limit is ambiguous: say you have 3 levels of indentation: with 2-, 4- or 8-wide tabs you get a 6-, 12- or 24-column indentation, leaving you 74, 68 or 56 columns for actual code.
So I have to be very clear which tab width I’m assuming when I compute line widths. Currently at work they pretend to use the Linux style guide with 8-wide tabs, but in reality clang-format counts tabs as if they’re 4 columns wide. Moreover, people who have different preferences (or needs) need to decouple how wide tabs look from how much they are supposed to contribute to the line length.
Just a small detail, but I like my details set straight.
Fair point! You can solve this by adding a marker-line (e.g. in vim) at 75 characters. When you’re working with 4-space-tabs and keep this line, it’s all good (lines may be shorter rather than longer). If someone then has 8-space-tabs, the line-width is still within reasonable limits.
But you can’t make everybody happy, even though I see the tabs-for-indentation, spaces-for-alignment rule as the most satisfying in a utilitarian sense, as it does not impose one’s choice of indent width on anyone. :)
There’s one thing however that I tend to be anal about, it’s line widths.
Whether this is a problem depends on what your motivation for caring is. Do you want the code to be readable by a human or specifically by a human with a VT-100?
If it’s the former, then line width is an approximation of what you really want, which is the number of non-white space characters in a line. 80 characters in total works out to a number of non-white space characters in the peak range for readability. I don’t know of any good tools for enforcing this though.
If you care about line wrapping on a fixed-width display, then you can set a line width limit in clang-format and tell it how many columns to count a tab as for this metric. This guarantees that lines don’t wrap if you use that tab width or something narrower; people who want to use wider tabs are free to use a larger window.
Do you want the code to be readable by a human or specifically by a human with a VT-100?
Well, my screen, when split in two, allows for little more than 80 columns. I also like the idea of being able to print my code in standard A4 paper in a reasonable font size, and guarantee that it won’t overflow.
Both of those require only that there exists a tab width setting that keeps the code to 80 columns. This is fairly easy to enforce with existing tools (I do for a few projects: lines must not be more than 80 columns with 4 character tabs).
Yes, I don’t understand this entire debate, given the canonical solution is to simply use tabs for indentation and spaces for alignment.
This is what I have done and advocated for the last 20 years, but it’s only the last couple of releases of clang-format that actually support this, and they still mess up continued Objective-C lines and C++ lambdas in some cases.
Thank you for your very elaborate feedback! I’m glad to hear you like it and consider it for a future project. Let me weigh in on some of your remarks:
The makefile is bigger than I would like… but so are the Makefiles I write. Can’t complain there.
This is sadly the side-effect of writing POSIX-makefiles, but most of it is just very trivial information written down explicitly and not expected to change anyway.
The libgrapheme.so I build in 3 seconds by typing make is under 330KB on my modern laptop, even after integrating the look-up tables in the data/ folder. These are almost 800KB even after gzip compression. They weren’t kidding about compressing those tables.
Individually per algorithm, the tables are small enough to be kept in the L1-cache of recent CPUs and a lookup only takes roughly 2-3 cycles. The character-break-detection was written to be mostly branchless using jump-tables.
grep -r “alloc” src suggests everything is allocated on the stack.
Yes, the library is indeed not calling malloc(), and actually not even dependent on a standard library at all (which is called “freestanding” in C-lingo).
Of course, let me know if you miss a certain functionality in the library, so I know what to prioritize for future work based on the community feedback! Thanks again for taking your time to investigate and give this independent review!
Of course, let me know if you miss a certain functionality in the library
Now that you mention it, encoding and decoding only do one character at a time. And as Mike Acton said, “when there’s one, there’s many”: I suspect that in practice, we generally want to encode and decode entire strings instead, but for that we need to write the loop ourselves.
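For instance, here is roughly the loop I have in mind; I’m assuming (from memory, so check grapheme.h for the exact signature) that grapheme_decode_utf8(str, len, &cp) returns the number of bytes it consumed:

#include <stddef.h>
#include <stdint.h>
#include <grapheme.h>

/* Decode a whole UTF-8 buffer into code points, one library call per
 * code point. Returns the number of code points written to out. */
static size_t
decode_all(const char *str, size_t len, uint_least32_t *out, size_t out_len)
{
	size_t off = 0, n = 0;
	uint_least32_t cp;

	while (off < len && n < out_len) {
		off += grapheme_decode_utf8(str + off, len - off, &cp);
		out[n++] = cp;
	}

	return n;
}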
Now I don’t expect my string based API would make user code any easier. If anything it probably makes things even worse, because we’d still need to account for truncated lengths and resume processing once it’s done. But I was hoping to get some performance in exchange: maybe the function call overhead across the shared library is significant (I expect the transcoding of a single code point is extremely fast).
Here are some possible APIs for the decoding (encoding would be similar):
// output args only
void grapheme_decode_utf8_string(const char *str, size_t in_len,
size_t *bytes_written,
uint_least32_t *cp, size_t out_len,
size_t *codepoints_read);
// returns the number of code points read.
size_t grapheme_decode_utf8_string(const char *str, size_t in_len,
size_t *bytes_written,
uint_least32_t *cp, size_t out_len);
// returns the number of bytes written
size_t grapheme_decode_utf8_string(const char *str, size_t in_len,
uint_least32_t *cp, size_t out_len,
size_t *codepoints_read);
(Goodness, I already dislike my proposals.)
We could also argue that if encoding or decoding one code point per function call really is too slow, removing function call overhead may not be enough. Since SIMD is (as far as I can tell) out of scope for this library, the performance impact of my API is probably too low to justify its inclusion.
But if we’re lucky and it turns out it speeds up decoding by a factor of 5… I’m kind of curious.
Thanks for sharing your thoughts! The nice thing is that you’re absolutely free to use an alternative UTF-8 decoder and use the codepoint-based-functions on the resulting buffers.
Here’s the thing, though: From what I understand performance in code nowadays is pretty much determined by three factors (in decreasing priority): I/O latency, cache misses (the CPU misjudged the next cache-access) and cache locality (how far does the CPU need to fetch (L1, L2, L3, RAM)?).
In libgrapheme’s case, I/O does not matter, and as mentioned earlier, the lookup-tables are carefully sized so that they ideally stay in L1 (this is, btw, what mostly gives 50-100% more performance than libutf8proc, whose lookup-tables are too large for L1), ensuring good cache locality.
The last factor is cache misses: CPUs usually assume that you go through a RAM buffer sequentially forwards, and the CPU is smart enough to cache stuff in advance. Going backwards in a buffer, on the other hand, is generally something that should be avoided. The nasty aspect of the Unicode algorithms is that they often ask you to look backwards, which is what I naturally wanted to avoid at all costs and solved with finite state machines. This achieves maximum cache locality: all the UTF-8 decoding happens in places the CPU already expected and has in cache, and the decoding itself is negligible compared to the overhead of heap accesses.
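As an illustration of this forward-only access pattern, segmenting a buffer is just a single forward pass over it (a sketch; see the man page for the exact signature of grapheme_next_character_break_utf8):

#include <stddef.h>
#include <stdio.h>
#include <grapheme.h>

/* Print each grapheme cluster of a UTF-8 buffer on its own line,
 * walking strictly forwards through the data. */
static void
print_clusters(const char *buf, size_t len)
{
	size_t off, inc;

	for (off = 0; off < len; off += inc) {
		inc = grapheme_next_character_break_utf8(buf + off, len - off);
		printf("%.*s\n", (int)inc, buf + off);
	}
}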
So, to bring my long-winded excursion to a close: I don’t think you would gain that much, especially with aggressive inlining, given the overhead of UTF-8 decoding is relatively small. It makes a lot of sense to do this for reading from files, which is why I always recommend fine-tuning the buffer sizes in fread (most people don’t even know you can do that) to make each I/O access as useful as possible, but not here. It would also complicate the code a lot, given you would have to add another layer of indirection to still be able to “easily” have a for-loop to iterate over the codepoints. Come to think of it, I could even imagine this being worse: not only do you have an overhead with the buffer handling itself (branch prediction mistakes), it also possibly disturbs your cache locality.
All in all, though, this is no case against string-based UTF-8 decoding in general, but in the context of libgrapheme, where, during processing, each individual codepoint is inspected step by step anyway, it would probably only make things worse, both in terms of readability and possibly also performance.
Yeah, so if the user is doing any I/O, or interleaves significant processing that goes out of L1 cache, the performance loss is likely negligible. This is a valid argument, and I think I believe it.
Once, though, I had a really nasty surprise. When I was implementing Monocypher, I noticed that some of my functions were much slower than SIMD-less libsodium: stuff like encryption with Chacha20 or hashing with Blake2b. After a bit of investigation, it turned out my bottleneck wasn’t the cryptographic computation, but the loading code.
I was originally assuming that the cryptographic computations would take over 99% of the time. Loading code felt negligible, even if I loaded one byte at a time (plus, it made my code simpler). Then I started to load whole words or blocks at a time. Suddenly my performance increased by a double-digit percentage. In the case of Chacha20 it doubled.
Loading code that takes more time than 20 rounds of cryptographic mangling, can you imagine that? I certainly did not.
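To give an idea of what changed (a simplified sketch, not Monocypher’s actual code): compare loading a 32-bit word one byte at a time with loading it in one go.

#include <stdint.h>
#include <string.h>

/* Byte-wise little-endian load: four loads plus shifts and ORs. */
static uint32_t
load32_le_bytewise(const uint8_t *s)
{
	return (uint32_t)s[0]
	     | ((uint32_t)s[1] <<  8)
	     | ((uint32_t)s[2] << 16)
	     | ((uint32_t)s[3] << 24);
}

/* Whole-word load: compilers typically turn this memcpy into a single
 * (possibly unaligned) 32-bit load on little-endian targets. */
static uint32_t
load32_le_wordwise(const uint8_t *s)
{
	uint32_t w;

	memcpy(&w, s, sizeof w);
	return w; /* assumes a little-endian target, for brevity */
}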
Now for libgrapheme, as long as performance is not a problem, its API is top notch, and I wouldn’t change it.
I also suspect dealing with variable data all the time (one code point requires who knows how many characters, one grapheme requires who knows how many code points…) makes fast loading very inconvenient or even impossible. And I would be the first to refuse making this library noticeably more complex for a mere 5% speed improvement.
In the general case though, I still note that if fast loading is possible, it can make a bigger difference than we anticipate.
Thank you for sharing this anecdote! I’ve taken note of it and will investigate it at some later point to see if it makes a difference.
Wonderful. The only thing I would ask for is the ability to build it with CMake, since the project I may want to use this in is CMake-based.
(Or maybe there’s a general way for CMake to build a subdirectory containing only a Makefile, not a CMakeLists.txt? I’m by no means an expert…)
The approach I’d recommend is to build it using the Makefile and simply link it into your project statically (using the generated static library libgrapheme.a). :)