I’ve been noodling for a while on the idea of doing text manipulations on the GPU. One such operation is unescaping of strings (also a primitive required for JSON parsing). Today I got around to implementing one of my ideas, gist in the link.
The interaction of `"` and `\` can be boiled down to a 3-state finite state machine (I use a 4th state to represent error, when a backslash appears outside a string). Then you lift that into a function from state to state (27 possible values for 3 states, 256 for the general 4-state machine, 64 for the special case where error always maps to error), using the idea from http://blog.sigfpe.com/2009/01/fast-incremental-regular-expression.html. The lifted state machine is a monoid homomorphism, so among other things it can be implemented efficiently as a prefix sum.
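To make that concrete, here’s a minimal sketch in plain Python (not the actual CUDA/Thrust code; the state names and encoding are my own): each byte maps to a state-to-state function, and because function composition is associative, chunks of input can be combined in any grouping, which is exactly what a parallel prefix sum needs.

```python
# Sketch of the quote/backslash machine, lifted to transition functions.
OUTSIDE, INSIDE, ESCAPED, ERROR = range(4)  # ESCAPED = inside string, just saw '\'
IDENT = (0, 1, 2, 3)  # identity transition, the monoid unit

def step(ch):
    """Transition function of one byte, as a tuple indexed by starting state."""
    if ch == '"':
        return (INSIDE, OUTSIDE, INSIDE, ERROR)
    if ch == '\\':
        return (ERROR, ESCAPED, INSIDE, ERROR)
    return (OUTSIDE, INSIDE, INSIDE, ERROR)

def compose(f, g):
    """Apply f, then g. Associative, so any grouping of the input works."""
    return tuple(g[f[s]] for s in range(4))

def lift(text):
    """The monoid homomorphism: a whole string maps to one transition function."""
    acc = IDENT
    for ch in text:
        acc = compose(acc, step(ch))
    return acc
```

The homomorphism property, `lift(a + b) == compose(lift(a), lift(b))`, is what lets independent tiles of input be processed separately and then combined by a prefix sum.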
The result is about 20x faster than scalar code. That sounds impressive, but I’m not actually that happy with it; the gap in raw compute capacity is quite a bit larger than that. I think it’s possible to get even bigger gains by tiling the problem and computing each tile in shared memory, but that’s a lot harder than what I’ve done (in which the really tricky bits are handled by the thrust library).
My main motivation in posting it here is to take the temperature on how worthwhile it would be to write a more extended blog post. Is it a curiosity of limited interest, or something people would want to read more about?
I for one would be fascinated by a more extensive treatment of the subject.
I’ll second that. I love reading about clever trickery to make code go fast, and I’ve just recently gotten interested in GPGPU stuff and this seems like both.
Same, I would love reading about this. I’ve just begun dabbling in GPGPU (and SIMD), and reading about the journey and thought process of others is very enlightening.
If you end up writing this post, I’d like to request that you include some of your mistakes/dead paths as well. Reading about how others have failed helps me to learn.
Yay monoid homomorphisms!! I’ve seen a few but I’m always interested in reading about more.
Longer post as promised: https://raphlinus.github.io/personal/2018/04/25/gpu-unescaping.html
Along the same vein, here’s parsing a CSV using SIMD lookahead and similarly on a GPU.
I’m interested! I did a tiny bit of CUDA a few years ago. Looking at the code, I don’t know what `thrust` is. Unescaping doesn’t seem parallelizable, because you don’t know ahead of time where `\` is. Or are you saying you can compute it both ways in parallel and then assemble the right pieces later? I’m interested in details.
Yes, it’s the latter. There’s a little state machine that’s small enough that you can compute it in all initial states, then assemble the pieces together hierarchically, using a prefix sum. Sounds like there’s enough interest; I will write this up in more detail.
OK that roughly makes sense, but yes I’d like to hear details! Some questions:
Does `thrust::transform_inclusive_scan` have anything to do with parallelism? From your description, it sounds like the version you posted is serial, but the next step is to parallelize it with “tiles”? (I just looked it up and learned that `transform_inclusive_scan` is a C++17 thing.)
I think the CSV use case is pretty interesting. If you assume a relatively well-formed CSV, it has similar quoting with `"`. Newlines are also special, though.
No, `thrust::transform_inclusive_scan` is already parallel (because it makes the additional assumption that the function being scanned is associative, which is the case here). I think the reason the performance is not astonishing is that it’s spending a lot of time reading and writing global memory, which is expensive on a GPU. If it were possible to reorganize the computation so that most of it was in shared memory, I think it would be quite a bit faster still. But that’s hard.
OK, that makes sense. I think I get the prefix scan thing now.
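For intuition on why an associative scan parallelizes, and what the tiling would look like, here’s a two-level scan sketch in plain Python. This is a simplification of what a GPU scan does, not Thrust’s actual implementation; the operator here is addition, but any associative operator (including state-machine composition) works the same way.

```python
from itertools import accumulate

def tiled_inclusive_scan(xs, op, tile=4):
    """Inclusive scan computed tile-by-tile, as a GPU would."""
    tiles = [xs[i:i + tile] for i in range(0, len(xs), tile)]
    # Phase 1: scan each tile independently (parallel across tiles).
    local = [list(accumulate(t, op)) for t in tiles]
    # Phase 2: scan the per-tile totals to get each tile's carry-in.
    totals = [l[-1] for l in local]
    carries = [None] + list(accumulate(totals, op))[:-1]
    # Phase 3: fold each tile's carry-in into its elements (parallel again).
    out = []
    for carry, l in zip(carries, local):
        out.extend(l if carry is None else [op(carry, v) for v in l])
    return out
```

Note the carry is applied on the left (`op(carry, v)`), which matters when the operator is associative but not commutative, as function composition is.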
I’ve had to implement backslash-type escaping (and unescaping) in at least 5 places in http://www.oilshell.org/ , and this style is something I hadn’t considered. I use `re2c` in a few places, which generates DFAs as switch/goto, rather than looping over characters by hand.
Maybe for fun I’ll play around with SIMD vs. C loops. I don’t think a compiler would/could know to use SIMD in these cases without the STL style.
I think the problem is that the number of DFA states explodes quickly for other string processing problems, but for backslash escaping it seems managable, and the code doesn’t seem more complicated.
Although, for JSON, don’t you have to decode `\u1234`, which leads to a lot more states than just `\n` and `\"`? Also, C/Python/shell also respect 8 hex digits like `\U12345678`, as far as I know (but apparently JSON doesn’t).
Right, I’m only detecting backslashes here, not mapping their decoding. But all the other stuff can be done with a small finite window of input, which is pretty easy as far as the GPU goes. The problem with backslashes and quotes is that, for arbitrary input, a one-byte change at one location can have an effect at another location arbitrarily later.
JSON doesn’t support Unicode escapes longer than 4 hex digits, so it uses the rather egregious hack of escaping the UTF-16 encoding.
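As an illustration of that hack (mine, not from the thread): Python’s `json` module escapes a character outside the BMP as a surrogate pair.

```python
import json

# U+1F600 is outside the BMP, so JSON escapes it as a UTF-16 surrogate pair:
# (0x1F600 - 0x10000) = 0xF600;
# high surrogate = 0xD800 + (0xF600 >> 10) = 0xD83D,
# low surrogate  = 0xDC00 + (0xF600 & 0x3FF) = 0xDE00.
escaped = json.dumps('\U0001F600')        # '"\\ud83d\\ude00"'
decoded = json.loads('"\\ud83d\\ude00"')  # back to the single code point
```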
Agree that this technique doesn’t work well for even medium-size, much less actually large automata.
OK it makes sense that you could break it up into 2 separate steps. But I wonder what the performance implications of that are.
And it seems like there is still a serial part, because after unescaping you don’t know what position each part will end up at. The `copy_if` part of the algorithm calls `keep`, which is data-dependent and I think must be done serially.
So I think you have two parallel passes and one serial pass to decode a string. I can imagine it’s a win in certain situations, but it’s hard to reason about.
In the shell, I thought about using a lazy tree representation of strings, where you only concatenate segments if someone actually uses a string as `argv`, i.e. it needs to be passed to a system call. But it does introduce a lot of complexity, whereas copying small buffers is something computers are very good at. I haven’t gotten to the point where this matters yet.
Anyway, I’m looking forward to the blog post!
Thanks for the continued encouragement to write. As it turns out, `copy_if` is also parallel, but uses expensive global memory. If any pass were strictly serial, it would absolutely not be worth doing this on GPU.
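For the curious: stream compaction (what `copy_if` does) is itself a prefix sum plus independent scatter writes. A sketch of the shape of it in plain Python, not Thrust’s actual implementation:

```python
from itertools import accumulate

def parallel_copy_if(xs, keep):
    """Stream compaction the way a GPU does it."""
    flags = [1 if keep(x) else 0 for x in xs]        # map: per-element, parallel
    # Exclusive prefix sum of the flags gives each survivor's output index.
    offsets = [0] + list(accumulate(flags))[:-1]     # scan: parallel (associative)
    out = [None] * sum(flags)
    for x, f, o in zip(xs, flags, offsets):          # scatter: independent writes
        if f:
            out[o] = x
    return out
```

The `keep` predicate is evaluated independently per element, so nothing here is inherently serial; the only cross-element dependency is the associative prefix sum.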
FYI, here is the Thrust homepage.