Less well known are integer overflow bugs. Offset-length pairs, defining a sub-section of a file, are seen in many file formats, such as OpenType fonts and PDF documents. A conscientious C programmer might think to check that a section of a file or a buffer is within bounds by writing if (offset + length < end) before processing that section, but that addition can silently overflow, and a maliciously crafted file might bypass the check.
So, some experiences from the pixel mines.
We were implementing our own format for 3D mesh data for static (initially) meshes, and went through probably 4-5 implementations. The initial format was based on nice tree structures, while the final format was more of a length-prefixed block-based approach.
The reason for this change is that, if you have a maliciously-formed tree structure, it can be really easy to hamstring your parser. You have extra records, or not enough records, and parsing gets to be a headache. You also have to build a smarter parser, because you kind of need to keep an idea of state as you pop up and down in the hierarchy of things, and you can’t really make a lot of allocation guarantees ahead of time until the tree is walked.
By contrast, a block format lets you quickly skip down the list of blocks, do most of your allocations up front, and then patch up and copy things around at the end. At that point, having good safe arithmetic routines prevents you from over-allocating or under-allocating things.
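A rough sketch of that first pass over the blocks, under assumptions that are mine rather than the actual format's: each block starts with a 4-byte tag followed by a 4-byte little-endian payload length, and `BlockInfo`, `load_u32_le`, and `index_blocks` are hypothetical names. The point is that every length is validated against the bytes actually remaining before anything is allocated or copied.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical index entry for one block in the file.
struct BlockInfo {
    std::uint32_t tag;
    std::size_t payload_offset;
    std::size_t payload_length;
};

// Reads a little-endian u32; the caller guarantees 4 readable bytes at p.
static std::uint32_t load_u32_le(const std::uint8_t* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// First pass: index every block without touching the payloads.
// Returns nullopt if any header or payload would run past the end.
std::optional<std::vector<BlockInfo>> index_blocks(const std::uint8_t* data,
                                                   std::size_t size) {
    constexpr std::size_t kHeaderSize = 8;  // assumed: tag + length
    std::vector<BlockInfo> blocks;
    std::size_t pos = 0;
    while (pos < size) {
        if (size - pos < kHeaderSize) return std::nullopt;  // truncated header
        const std::uint32_t tag = load_u32_le(data + pos);
        const std::size_t len = load_u32_le(data + pos + 4);
        pos += kHeaderSize;                          // safe: checked above
        if (len > size - pos) return std::nullopt;   // payload overruns buffer
        blocks.push_back({tag, pos, len});
        pos += len;                                  // safe: len <= size - pos
    }
    return blocks;
}
```

Once the index exists, the total allocation size can be computed from validated lengths before any payload is parsed.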
Towards that end, in C++ a very handy thing to do is to create a BinaryRegionReader class that provides “safe” and bounds-checked access to a region of memory, and which allows the creation of child BinaryRegionReaders.
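A minimal sketch of what such a class might look like. This is not the original implementation: only the name BinaryRegionReader comes from the description above, and the member functions (`read_u32_le`, `seek`, `child`, `remaining`) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Sketch only: a non-owning, bounds-checked cursor over a byte region.
class BinaryRegionReader {
public:
    BinaryRegionReader(const std::uint8_t* data, std::size_t size)
        : data_(data), size_(size), pos_(0) {}

    // Bytes left between the cursor and the end of this region.
    std::size_t remaining() const { return size_ - pos_; }

    // Reads a little-endian u32 and advances the cursor, or returns
    // nullopt (without moving) if fewer than 4 bytes remain.
    std::optional<std::uint32_t> read_u32_le() {
        if (remaining() < 4) return std::nullopt;
        const std::uint8_t* p = data_ + pos_;
        pos_ += 4;
        return  static_cast<std::uint32_t>(p[0])
             | (static_cast<std::uint32_t>(p[1]) << 8)
             | (static_cast<std::uint32_t>(p[2]) << 16)
             | (static_cast<std::uint32_t>(p[3]) << 24);
    }

    // Moves the cursor to an absolute offset inside this region.
    bool seek(std::size_t offset) {
        if (offset > size_) return false;
        pos_ = offset;
        return true;
    }

    // Carves out a child reader over the next `length` bytes and skips
    // past them. The child cannot read outside its own slice.
    std::optional<BinaryRegionReader> child(std::size_t length) {
        if (length > remaining()) return std::nullopt;
        BinaryRegionReader sub(data_ + pos_, length);
        pos_ += length;
        return sub;
    }

private:
    const std::uint8_t* data_;
    std::size_t size_;
    std::size_t pos_;  // invariant: pos_ <= size_
};
```

Because every child is constructed from an already-validated slice, a bad length inside one block can only cause reads to fail within that block, not spill into its neighbours.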
So, basically std::string_view?
Sort of, but also with the ability to read off native types in order, respect endianness, and do seeking safely.
Sure, that makes sense.
Btw, what led you to go with a completely custom format instead of using something like protocol buffers or cap’n proto?
Reasons we didn’t use those:
We had our own routines that better handled certain issues (see: safe arithmetic)
We were doing some of that as a learning project
We wanted to easily support multiple platforms, and our codebase was already set up for that
We would still have had to come up with a format for the layout (since you want something that is easy to shove into graphics buffers anyway), even with the help those libs provide
Adding extra build steps and autogenerated code wasn’t appealing
Having our own code/copyright gave more licensing flexibility (WTFPL ftw)
Have there been some interesting developments since last time it was posted (apart from name change ;))?