One thing I’m curious about is the overhead to using the GPU. If the structure of the JSON parser is implemented on the CPU (and I think it has to be), then presumably you have to switch between CPU and GPU a lot. What problems does that introduce?
My understanding of GPU memory is fuzzy (and I’m sure there is more diversity to it than CPU architecture). I would have thought you had to transfer the string over to GPU memory, but I also remember that the GPU can access main memory (is that specific to GPGPU/CUDA?).
But still you have to coordinate the “handoff”, and I’m not sure how much overhead that introduces. JSON payloads typically have a lot of small strings, and not one large one, so this seems like it could be an issue.
In other words, I think a more interesting performance metric would be MB/s in decoding a dict with a bunch of strings, rather than just a huge string.
FWIW I designed an IPC protocol that used JSON, and for large strings, I “switched” to a netstring, i.e. a length-prefixed payload. That way you avoid the decoding cost. The JSON was essentially the metadata, and you could point into blocks of memory delimited by netstrings.
I feel like this is a nice compromise and preserves the important property of JSON: it’s easily implemented in any programming language. netstrings are trivial to decode.
I also did this so I could represent binary data accurately, which JSON isn’t good at.
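To make that concrete, a netstring is just the payload’s length in ASCII decimal, a colon, the bytes, and a trailing comma. A minimal sketch (function names are my own, not from any particular library):

```python
def encode_netstring(payload: bytes) -> bytes:
    # <length in ASCII decimal>:<payload>,
    return str(len(payload)).encode() + b":" + payload + b","

def decode_netstring(buf: bytes) -> tuple:
    # Returns (payload, remaining bytes). There is no per-byte decoding
    # cost for the payload itself: read the length header, then slice.
    header, rest = buf.split(b":", 1)
    n = int(header)
    if rest[n:n + 1] != b",":
        raise ValueError("missing trailing comma")
    return rest[:n], rest[n + 1:]
```

So `encode_netstring(b"hello")` gives `b"5:hello,"`, and the decoder can hand back a zero-copy slice into the buffer, which is the whole point: binary data passes through untouched.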
Also, the point about parallel copy_if is interesting. I’d be interested in examples of that technique used to speed up any problem, with the constraint that it’s used “for real”. I haven’t seen that in the C++ code I look at.
All in all this is a really interesting piece of code that brings up a lot of questions :) I’ve read papers about prefix sums for parallel computing but honestly have not come across it in practice.
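For what it’s worth, a parallel copy_if is usually built out of exactly that prefix-sum trick (“stream compaction”): flag each element, exclusive-scan the flags to get each survivor’s output slot, then scatter. A sequential Python sketch of the structure, where each of the three phases would be a parallel kernel on a GPU:

```python
from itertools import accumulate

def copy_if_compaction(xs, pred):
    # Phase 1: flag elements that pass the predicate (a parallel map).
    flags = [1 if pred(x) else 0 for x in xs]
    # Phase 2: an exclusive prefix sum of the flags gives each surviving
    # element its slot in the compacted output (a parallel scan).
    positions = [0] + list(accumulate(flags))[:-1]
    # Phase 3: scatter survivors to their slots (a parallel scatter).
    out = [None] * sum(flags)
    for x, flag, pos in zip(xs, flags, positions):
        if flag:
            out[pos] = x
    return out
```

The sequential loop in phase 3 is safe to parallelize because the scan already assigned every survivor a distinct output index.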
“It’s the ones in the middle where I think things will get interesting”
I’ve been wondering if HPC experts have a standard, easy way of determining if sequential stuff can be parallelized easily. I’ve seen stuff in hardware and software synthesis that makes me think it’s possible. One idea I had, but didn’t try, was using parallel-by-default languages (e.g. Chapel or ParaSail) to express prototypes of seemingly sequential algorithms. Fight with them until you get a speedup or are pretty sure you can’t. If you do, port the structure to your production language of choice.
It’s possible someone else has already tried that approach with an idea of its effectiveness.
“A more sophisticated approach would separate the input into tiles, do as much processing as possible on each tile using faster shared, rather than global, memory, then reassemble the tiles at the end. “
Not to mention that sounds close to where Amdahl’s Law kicks in. Maybe you get a big speedup. Maybe it’s dominated by pre- and post-processing, which are usually sequential. You can’t know until you put a bunch of work in up front. Ain’t that great how that works? ;)
Have you read Data-Parallel Finite-State Machines (2014)? Is it the same idea?
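(The trick in that paper, as I understand it: treat each input symbol as a state-to-state mapping; those mappings compose associatively, so a parallel scan can combine them across chunks. A toy sketch with a made-up two-state quote tracker, not the article’s actual tables:)

```python
# Each symbol becomes a state -> state mapping, encoded as a tuple
# indexed by current state. Made-up machine: state 0 = outside quotes,
# state 1 = inside quotes; '"' toggles, everything else is identity.
from functools import reduce

def step(ch):
    return (1, 0) if ch == '"' else (0, 1)

def compose(f, g):
    # Apply f, then g. Composition is associative, which is what lets
    # a parallel scan/reduce combine per-chunk mappings in any grouping.
    return tuple(g[s] for s in f)

def final_state(text, start=0):
    # reduce stands in for the parallel reduction over symbol mappings.
    identity = (0, 1)
    return reduce(compose, (step(c) for c in text), identity)[start]
```

The point is that `final_state` never needs to walk the input left to right: chunks can be summarized independently and their mappings merged.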
Yay!
Possible typo?
Shouldn’t it be (1, 0, 1), since the double-quote character in state 2 takes you back to state 1?