1. 2

    It is nice to see a data compression example that is both easy to follow (with the blog post in hand) and not trivial.

    Also, it is great that he links to the blog posts of ryg and cbloom which contain lots of practical knowledge.

    The code is set up to test using enwik8. It is easy to end up optimizing for a specific type of data which might inversely affect the results on other types, so I would suggest testing with some non-text data sets as well if he isn’t already (perhaps the Silesia corpus, which contains various types of data).

    1. 3

      Compressing enwik8 took 22 seconds, a quick hack to try and compress silesia.tar which is roughly twice as big, took almost 2 minutes to compress and segfaults on decompression. A good indication it needs some testing on other types of data.

    1. 2

      I think the amount of discussion about the hows and whys and portability of small things like resolving dependencies is a good argument for using something like Meson, at least for non-trivial projects:

      project('foobar', 'c')
      cc = meson.get_compiler('c')
      m_dep = cc.find_library('m', required : false)
      executable('foobar', 'main.c', 'foo.c', 'bar.c', dependencies : m_dep)
      
      1. 2

        One could argue from the C language perspective it’s not technically a buffer overflow unless b1 happens to be placed near the end of the memory object allocated from the system.

        It is a violation of the logic the pool allocator implements, which address sanitizer cannot possibly know about – it is entirely possible to imagine a valid program that would allocate an area of memory and write possibly overlapping strings into it.

        Don’t get me wrong, it is a buffer overflow at the program level in this case, and it seems like a good idea to be able to somehow mark the areas that are logical units to be able to catch this.

        1. 2

          Fuzzing is a fantastic tool for anything that parses or processes input. Another fuzzer worth checking out is LLVM’s libFuzzer, which is a simple way to do in-process fuzzing, and can be combined with for instance address sanitizer.

          I sometimes add a conditional block with the fuzz target function at the end of the source file that does the processing, so it is easy to build and start fuzzing when you make changes. See for instance the one in tinf.

          For Windows people, libFuzzer runs on WSL, and the latest builds of clang targeting MSVC also started including support.

          1. 1

            Didn’t know about Robson traversal (mentioned at the end). Another algorithm using O(1) space is Morris traversal which is conceptually quite simple. There was a nice explanation of it in Nicholas Ormrod’s talk at CppCon 2017.

            1. 2

              https://www.geeksforgeeks.org/inorder-tree-traversal-without-recursion-and-without-stack/ is a good description of the Morris traversal.

              It turns a normal binary tree into a threaded tree as part of the traversal and then back. By doing this as part of the traversal it’s unnecessary to store those 2 bits of information, giving the traversal O(1) space.

              On the downside, since the Morris traversal destructively turns right pointers into threads it leaves the tree in an unsafe state for reading or writing (the same warning the author mentions about the Link-Inversion model). It also doesn’t realize the O(1) amortized successor function mentioned by the author in the case where you keep your tree in a Threaded state.

            1. 6

              People often gripe about the STL but what Alex Stepanov created is truly a thing of beauty. It was truly radical at the time. It’s even radical now considering that generic programming is still not mainstream.

              However, it isn’t what most people are used to. Most people still don’t understand what iterators are. For those that do not know, iterators are a generalization of a pointer. Pointers are a subset of iterators.

              People don’t get the STL because people can’t math.

              1. 5

                The STL is amazing (see Sean Parent’s talk on generic programming for some background).

                But along with all the benefits it has brought to C++, it has also been a factor in some of the issues the author touches on (and which are well-known), like long compile times, slow debug performance, hard to understand error messages. Whether it is math that makes the C++ ranges implementation shown hard to read, or the sheer volume of packaging of that math – compare it to the python and haskell implementation above.

                What I find a more interesting question raised by this discussion is whether the prevalence of certain industries in the C++ committees has steered the language in a direction where things like compile times are neglected because they can just throw more hardware at such problems. But of course this corporate interest is also part of the reason C++ had a revival in the first place.

                1. 2

                  indeed ! STL starts from algorithms and not objects, and algorithms are defined on algebraic structures. imho, extending these structures with complexity requirements is the crux, and iterators are the key :)

                  1. 1

                    For (advanced) generic programming to become mainstream, someone’s got to work on the compile times.

                    Maybe we need a -fno-monomorphization flag to get super fast debug builds at least?..

                    1. 1

                      I’m wondering if concepts can help speed up compile times. I know they are meant to improve error messages and readability, but certainly, they can be used to short-circuit parts of the compilation process… I haven’t really explored that idea much though.

                  1. 2

                    I’ve used µnit and greatest in the past – granted not as simple as this, but close.

                    1. 4

                      I’ll probably continue to play with optimal compression of various formats.

                      After releasing the new version of BriefLZ (compression library) with an optimal compression level, I used that in an example of the same for LZ4 (see discussion over at Encode’s Forum if interested).

                      I made one for CRUSH and Snappy as well, but they are all dreadfully slow, so mostly of theoretical interest.

                      1. 5

                        I wrote a very simple data compression library called BriefLZ many years ago, which recently appears to have found some use in a few other projects.

                        I’ve been implementing some algorithms to get better compression out of it without changing the data format. I am hoping to get a few of these polished enough to release.

                        1. 3

                          I haven’t used Solaris since university, so the thing that excites me the most about this is the potential for dropping support for CMake 2.8.6 (Solaris 11.4 appears to include CMake 3.9).

                          I think the next is then RHEL/CentOS7 at CMake 2.8.11 (which is a big step up from 2.8.6). Once they update we might actually have CMake 3.0 and “modern CMake”.

                          1. 2

                            I think that we will see RHEL8 this fall with newer CMake.

                          1. 1

                            A story related to the second nugget – MSVC has a file listing.inc containing padding macros, which is included in the assembly output the compiler produces.

                            The first 64-bit versions did not include an updated listing.inc. For instance a NPAD 2 would result in the 2 byte instruction mov edi, edi, which on AMD64 this has the side effect of clearing the high 32 bits of rdi.

                            So you could have code that worked fine, but if you produced an assembly listing and ran MASM on it, it would crash.

                            1. 2

                              I must admit I never noticed Knuth names this variant Fibonacci hashing in TAOCP, I’ve always just seen it referred to as Knuth’s multiplicative hashing. It appears in a number of data compression algorithms – for instance you can try searching for the constant 2654435761, which is a prime close to 2^32/phi often used.

                              Somehow It seems unlikely that the developers of STL implementations were not aware of this.

                              1. 3

                                After running a Debian VM in Windows for years, it’s wild how quickly WSL became normal for me. ConEmu plus bash is such a nice convenience.

                                1. 2

                                  I’ve tried several times to make ConEmu palatable as an app and have failed every time. I find it beyond ugly as an app and I can’t get the colours for Solarized (light) to look right. If you have any tips, I’d appreciate it.

                                  As for WSL, it’s the only reason to use Windows, frankly.

                                  1. 1

                                    I am using Cmder as a frontend for ConEmu, it provides a slightly more polished setup out of the box. Just for the record, I am using Cmder mini version 1.1.4, which was the last release before some larger changes which (for some reason I cannot remember) did not work for me.

                                1. 2

                                  Shame about the camera angle and the sound, seemed interesting, but hard to watch for an hour.

                                  1. 1

                                    We didn’t have a good recording device this time, unfortunately. I did my best to clean up the sound, but it’s still not great. Sorry about that :(

                                  1. 8

                                    Problem one: there is no such thing as an “internal change that fixes incorrect behavior” that is “backwards compatible”. If a library has a function f() in its public API, I could be relying on any observable behaviour of f() (potentially but pathologically including its running time or memory use, but here I’ll only consider return values or environment changes for given inputs)

                                    I don’t think that is much of a problem tbh. I wouldn’t consider it breaking backwards compatibility as long as the function still adheres to the API contract set forth by the documentation and general expectations. The function String() should return a string, the formatting of that string may change if it is not explicitly documented. The function ParseFile() will continue to parse the input file and return a value containing the parsed data. As long as the same set of input files is parsable according to the documentation there is no breakage in my view.

                                    However the limit of that would be if any behaviour ends up being widely used or was documented in a bad manner.

                                    What would be breaking backwards compatibility would be any change that would result in having to change documentation or function signatures. Those would require a more major release IMO.

                                    Of course, as long as majorVer=0 I will break backwards compatibility like the Hulk after being asked to write an essay on “smashing considered harmful” since it’s (IMO) not ready for production yet.

                                    1. 5

                                      Indeed, the “backwards compatible” in semver is with respect to the public API. This is not directly specified in the quoted part, but if you read the whole semver spec, I think it is pretty clear.

                                      So I would say, if you depend on observable behavior that is not part of the public API (including the documentation), then it is your own responsibility to have tests in place that ensure a new version exhibits that behavior.

                                      1. 0

                                        Of course, as long as majorVer=0 I will break backwards compatibility like the Hulk after being asked to write an essay on “smashing considered harmful” since it’s (IMO) not ready for production yet.

                                        I really hate this rule in semver. It seems designed to keep the major number small, people seem to have some lizard brain fear of large major versions, but it defeats the purpose of semver, IMO.

                                        1. 3

                                          Well, it’s a thing because majorVer=0, for me, implies the product is not ready for production. I’m still working out the kinks. I’m not going to make any promises.

                                          Once I have everything set, I put down majorVer>=1, since I’m now more confident in the problem domain I also have less of a problem with upgrading major versions if it becomes necessary.

                                          I don’t really fear large major versions, I tend to avoid it since most of the time you can bolt on the new functionality and provide shims for the old functions.

                                          I don’t think it defeats the purpose of semVer as long as you eventually go majorVer=1.

                                          1. 1

                                            Well, it’s a thing because majorVer=0, for me, implies the product is not ready for production

                                            I think the difference is versions and production are orthogonal to me. My workflow is deploying packages from branch builds until I feel it’s ready, so versionless.

                                      1. 17

                                        The problem likely goes a bit more like this:

                                        Joe buys one month of GitHub premium, creates a private repo, and invites his 100 best friends to it. They all clone the repo, and he cancels his account. Now they each have a free private repo they can use forever.

                                        1. 4

                                          Another potential bug many might miss is the multiply by 2. On a 64-bit system where int is 32-bit (Windows for instance), it would be possible for this operation to overflow, resulting in undefined behavior.

                                          1. 2

                                            That would mean they had to allocate 2.1GiB * sizeof(int) (or at least 1.05GiB * sizeof(int)). Even so, calling malloc on a negative integer would just have it return NULL in new_data, so the assert would fail and the program would be terminated before anything bad happens.

                                            1. 2

                                              I think “that would require unlikely input” and “undefined behavior does nothing bad on my platform” can be dangerous when it comes to C.

                                              For instance, given the compiler knows the capacity starts at 1 (if we fix that bug) and is always multiplied by 2, since overflowing would be undefined behavior, it can assume that will never happen and generate a shift left for the multiply. That would result in overflowing to 0 (which could trap), which when passed to malloc could (implementation defined) return a non-NULL pointer that cannot be dereferenced.

                                              I know that’s all highly unlikely, but likely isn’t safe.

                                          1. 4

                                            Nice work, some thoughts:

                                            • Print line number where assertion failed
                                            • Way to compare doubles, possible with an optional precision
                                            • Way to compare blocks of memory
                                            • Consider renaming to snow.h
                                            1. 3

                                              I really would have liked to print the line number where the assertion fails, but I’m not sure if that’s possible. Because of the use of macros, everything ends up on the same line after the preprocessor, so __LINE__ will be the same for everything. If you know of a way to fix that, I’d love to hear it. (The "in example.c:files" message was originally supposed to be "in example.c:<line number>")

                                              More different asserts is a good idea, and so is renaming the header - the thing was under the temporary name “testfw” until right before I made this post.

                                              1. 2

                                                Looks neat! I feel that the line number of the end-of-block would still be useful, but don’t quite see how to word that without seeming incorrect.

                                                1. 2

                                                  It’s not just at the end of the it block which the error occurs in; it’s the end of the entire invokation of the describe macro. In the example.c file for example, __LINE__ inside of any of those it blocks will, as of the linked commit, be 62.

                                            1. 6

                                              I really like the idea behind the theme and the general feel.

                                              For my personal taste, as a text editor theme, the background is a little too light and the comments are barely visible. And I say that as someone who uses Solarized.

                                              These things are highly subjective of course, and also depend on what room you’re in, what time of day, and what display.

                                              1. 2

                                                For my personal taste, as a text editor theme, the background is a little too light and the comments are barely visible.

                                                The emacs theme deals with this by adding nord-comment-brightness.

                                                I think I’m finally going to replace Zenburn (my previous all-time favourite theme)!

                                                1. 1

                                                  Personally, I’ve tweaked my nofrils-like theme for emacs using these colors.

                                              1. 12

                                                I never understood the popularity of solarized. It lacks contrast and makes my eyes hurt.

                                                1. 12

                                                  There was a blog post which said it was made with science or whatever. Science can’t be wrong.

                                                  1. 4
                                                    1. 3

                                                      The implication that the goodness of something so subjective can be quantified really irks me. However, I think a lot of people ate this up, as I’ve seen people non-ironically citing this as a reason it is good.

                                                      1. 2

                                                        I hear it’s Cave Johnson’s favorite IDE color scheme.

                                                      2. 5

                                                        I’m more and more in favour of highlighting comments more than the individual parts in the code (variables, strings, …) – and I find that comments often have the least contrast :(

                                                        1. 3

                                                          In Visual Studio Code you can quite easily try this out since you can add your own customizations to the highlighting in the settings. For instance, you could add

                                                              "editor.tokenColorCustomizations": {
                                                                  "comments": "#e1a866"
                                                              }
                                                          

                                                          to change the color of all comments.

                                                        2. 1

                                                          I think it depends a lot on lighting. I use the dark theme at evening/night, and don’t have a lot of light in the room. More contrast rich themes like Monokai hurt my eyes in that setting.

                                                          The Solarized theme that comes with Visual Studio Code actually uses a base color with more contrast than the original design. But I find that rather annoying in the light theme, especially since they also use bold.

                                                          1. 2

                                                            More contrast rich themes like Monokai hurt my eyes in that setting.

                                                            That makes sense. It’s funny, at night I will continue using typical white-on-black high-contrast color schemes but just drop the monitor brightness a lot if I happen to be hacking away in the dark. Usually I just turn the lights on, though.

                                                            1. 1

                                                              For me both variants of Solarized are difficult to read in the daytime on a nice display and borderline unusable at any time of day on a low-end display. On the other hand, I find high-contrast dark themes too harsh, so I tend to use dark themes that are somewhere in the middle (~#999 on ~#222) and higher-contrast light themes (~#222 on ~#f5f5f5).

                                                              1. 1

                                                                gruvbox dark works well for me :)

                                                                1. 2

                                                                  I think the red is perhaps, well, a bit too red in gruvbox. The (over-)use of red/orange/pink in many Solarized themes was part of the reason I made this variant.

                                                                  Darktooth is another interesting gruvbox-like theme.

                                                            2. 1

                                                              Agree on the importance of contrast. Lots of color themes are happy to use tons of different colors on things that aren’t completely semantically different (a numeric literal doesn’t always need to stand out a lot) while ignoring the more subtle details such as contrast.

                                                              I want the attention to detail Solarized has, but with more contrast, and something besides an ocean or a piece of parchment as the background. I’ve been using a version of Github’s color scheme in my editor for awhile, but have yet to really find a color theme that I really like.