1. 3

      Keep in mind that maybe half the people using Lua are using it inside of some other system that may already have its own package manager (or strong opinions about packaging that don’t apply elsewhere). It’s designed for embedding as well as standalone use.

      1. 1

        so what's a good way to deal with that problem? I need an HTTP client and I don't think it's built in - so I need to install lua-http or lua-curl or similar

        1. 2

          Lua libraries fall into two categories: single-file, zero-dependency (in which case I just plop it in the repo for the application I’m using) or libraries with really complex builds that interop with C code, in which case I use apt-get for them because they tend to be very mature and stable, and I never need the latest version. In your case luasocket falls into the latter camp, and it ships with an HTTP client.

          1. 0

            luasocket doesn't appear to be useful for larger files


            1. 0

              I wonder if this is a platform-specific issue or something; I can’t reproduce any problem with the snippet you pasted using the 100MB file.

              1. 1

                It doesn’t have progress…

          2. 1

            You can run apt install lua-curl.

            1. 1

              Cygwin doesn't offer a lua-curl package.

        2. 1

          It looks like you are trying to use LuaRocks on Windows with some “non-classical” setup.

          I haven’t used Windows for about a year, but last time I did https://github.com/Tieske/luawinmulti was the best option. LuaRocks 3 supposedly makes it better but there may still be issues because afaik none of the core devs really uses Windows…

        1. 3

          There was also a Strange Loop talk about this in 2018.

          It is (was?) built on top of lpeg, but trying to address some usability issues with PEGs.

          1. 2

            Thanks for sharing, this is a handy reference.

            Aside: IMO logical expressions with == should be bool and not int.

            1. 2

              This depends on the C standard – bool isn’t in C89, so it can become a portability issue.

              1. 1

                Y’know – I was tempted to add a snarky “unless you need to be compatible with C89 <scoff>”. I will acknowledge that there are some (hopefully very very few) legit use cases where bool support is missing.

                1. 2

                  Indeed, and I’m inclined to agree otherwise.

                  I’ve worked on a couple projects that needed C89 compatibility (mostly in embedded), and I begrudgingly keep greatest to -std=c89 just so that it’s always an option. There are some features that are only available when it’s built with >= C99 though.

            1. 1

              Anyone have recommendations for other types of testing for C? Like property testing and mutation testing, etc.

              1. 3

                For property-based testing, look at theft.

              1. 8

                The comment field there doesn’t permit editing and correcting typos…

                So let me try again here…

                In a galaxy far far away….

                Larry Wall wondered why he needed to learn 3 pretty bad languages, sh, awk, sed…., and devised perl as the Grand Unifying Language.

                Perl sadly borrowed too much from its inspirations, and wasn’t much more readable.

                Then Matz came along and resolved to borrow the best from perl and scheme and ….. and make something more powerful than them all, yet more readable.

                It’s called Ruby.

                And yes, you can do everything in Ruby, in one line if you must, that you can do in bash, awk, sed, jq, perl…. but in a more powerful and maintainable form.

                All this has been available for decades, why are we (still) bashing (pun intended) our heads against the Lowest Common Denominator?

                1. 8

                  serious question: what does “doing some awk in Ruby” look like? This might be a pretty big motivator for me to finally figure out Ruby for scripting (I’m more of a Python guy myself but awk works nicely for small scripts on line-oriented stuff when I want a one-liner)

                  1. 8


                    # Official way of naming Go-related things:
                    $ grep -i ^go /usr/share/dict/* | cut -d: -f2 | sort -R | head -n1

                    Versus Ruby:

                    puts(Dir['/usr/share/dict/*-english'].map do |f|
                      File.readlines(f)
                    end.flatten
                      .select { |l| l[0..1].downcase == 'go' }
                      .sample)

                    Simple example, but I think it demonstrates that various basic and common tasks are quite a bit more complex to do in Ruby than in the shell.

                    That doesn’t mean I’m always in favour of shell scripts – I got that example from an article I wrote saying you shouldn’t use shell scripts – but there are definitely reasons shell scripting persists, even though we have things like Perl and Ruby.

                    In that article I wrote “I regret writing most shell scripts [..] and my 2018 new year’s pledge will be to not write any more”. I’ve mostly failed at that new year’s pledge, and have happily continued shelling about. I have started rewriting shell script prototypes in other languages at the first sign of them getting hairy though, and that seems like a middle ground that is working well for me (I should update/amend that article).

                    1. 5

                      To be fair, it looks like most of the additional complexity in the Ruby code comes from reading files: the first command in the pipeline, grep -i ^re glob, is what becomes

                      Dir[glob].map do |f|
                        File.readlines(f)
                      end.flatten
                        .select { |l| l[0..1].downcase == re }

                      The rest of the script contributes very little to the Ruby code.

                      I suspect this is a recurring theme when trying to replace shell pipelines with programs. Only Perl avoids some of this additional complexity for reading files, I think.

                      1. 5
                        puts Dir['/usr/share/dict/*-english'].
                          flat_map { |f| File.readlines(f).grep(/^go/i) }.
                          sample
                        1. 6

                          At least with Ruby I don’t have to constantly cross-reference the man page and my cargo-culted knowledge of Unix’s multitude of text-manipulation DSLs, all unalike. It’s pretty obvious what it’s doing.

                          1. 1

                            Actually you used very little shell there in your first example.

                            You also used grep, cut, sort and head.

                            Why do you assume the backtick operator and the | operator for IO don’t exist in Ruby? In fact, why do people assume shell and jq do not exist if you use Ruby?

                            Personally I tend to reduce the number of tools involved to reduce the cognitive load of needing to understand each tool to understand the one liner.

                            I balance that against considerations like how IO.read("|sort -u fileName") can be a huge performance boost.

                            Anyhoo… some examples of Ruby one-liners


                          2. 7

                            Because code in sed or awk that worked a decade ago (or, hell, two years ago) still works. Ruby code seems to bit-rot faster than any other language I’ve used for nontrivial work.

                            Also, with awk, I could put it down for a year, then use it again, and everything I’d need to be productive fits in a small man page. (The same seems to be true of sed, though I don’t use it much.) The Ruby ecosystem moves a lot faster, and if you haven’t been following it closely, catching up will add extra work. (Whether it’s actually going anywhere is neither here nor there.)

                            Yes, awk is a more limited language, but that’s a trade-off – there are upsides, and I know which I’d prefer.

                            1. 1

                              Not true.

                              The awk scripts I wrote decades ago were written in Solaris awk, which is not quite the same thing as GNU awk.

                              Well thought out growth in a language is good.

                              I find the maintenance burden in ruby rolling forward with language versions is very low.

                              Doubly so since rubocop will often autocorrect stuff.

                            2. 6

                              I don’t know Ruby. But for me these are the reasons why I am writing more and more bash programs:

                              • Bash is my command line. So I am doing a lot of small steps: file modifications, comparing, searching, analysing. At some point I can see that some of the steps can be composed, and I pull them out of the history, try them out on the console, and at some point put them into a script. If Ruby had a REPL in which I could do all the operations that I do on the command line, with less typing and more comfort, I would maybe give it a try.

                              • Bash is on every Linux box. Ruby is not.

                              1. 4

                                Ruby does have a REPL. It’s called IRB and it comes with every Ruby installation. I use it exactly as you describe, for composing small programs iteratively.

                                1. 1

                                  Are you using the Ruby REPL as your daily, all-the-time console, or just when you have a program in mind to create? I am asking honestly, because I do not know anything about Ruby or its REPL, and I am quite interested in how good this REPL is as a replacement in daily life.

                                  My point is that shell scripts are a by-product of using the shell for manual tasks. I get better and better at my shell usage, and even after 20 years I am still discovering new features or more efficient ways to do things. The shell language is really ugly, but being very succinct, plus the composition of Unix commands, the history, the prompt customization, the possibility of vi mode for editing (and I probably forgot a lot of features), all this makes using the shell such an efficient tool.

                                  1. 2

                                    Well, no, not as my daily shell. I dislike shell scripting enough that I switch to Ruby pretty quickly if I’m having to spend any amount of time or effort on a task, but it’s not meant to be a replacement for bash/zsh/fish.

                                2. 3

                                  Bash is on every Linux box. Ruby is not.

                                  Let’s not limit ourselves here. For those not using Bash and/or Linux, how about this:

                                  • Bourne-compatible $SHELL is on every Unix box. Ruby is not.
                                  1. 2

                                    Bash is on every Linux box. Ruby is not.

                                    So is ed.

                                    However sudo apt install ruby solves that problem.

                                    And yes, ruby does have a REPL.

                                    1. 2

                                      apt: command not found.

                                      sudo: permission denied


                                      1. 2

                                        Have fun with ed then, it’s the Standard!


                                        1. 1

                                          I have written scripts in ed before to do some sufficiently tricky text manipulation. It’s a good tool.

                                  2. 5

                                    Mostly, because picking up enough jq, awk and sed to be useful is faster than learning the ins and outs of Ruby?

                                    I suppose you could make a similar argument about learning Ruby one-liners, but by the time I’m writing a very long bash script, I’m probably writing a larger program anyway, either in Go or Python. Ruby as a language doesn’t have much appeal to me, at least at the moment.

                                    Awk, at least, fits very nicely into a small space right next to regex. jq is a bit fiddlier to pick up, but very nice for basic stuff. Sed I still don’t have down very well, but it is also nicely regex-adjacent.

                                    1. 3

                                      I regularly write sed one liners to do refactorings on my Ruby code. Usually the sed call is fed by the result of grep or find. I could write a Ruby one liner to do the same, but it would be a much longer line and escaping would be much more difficult. Ruby is simply not a replacement for the convenience of sed.

                                      And maintainability is a red herring here: the whole point of something like sed is that you use it for one-off commands.

                                      1. 2

                                        I’m not that experienced with jq, but when it comes to awk (and sed), one of their benefits is that you can easily write a program in the shell, since they act as glue between pipe operations.

                                        For example, to filter out all lines that have fewer than 5 characters, all you have to write is

                                        ... | awk 'length >= 5' | ...

                                        no imports or types required. It was made for stuff like this, which makes it easy to use. I’ve only read a book about Ruby a few years ago, but to process stdin/stdout this would require a bit more overhead, wouldn’t it?
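
                                        From what I remember, the equivalent would be something like this (just a sketch; the helper name is mine, and a StringIO stands in for the pipeline’s stdin):

                                        ```ruby
                                        # The same filter as `awk 'length >= 5'`, as an explicit Ruby line loop.
                                        require 'stringio'

                                        def keep_long_lines(io, min_len = 5)
                                          # select every line whose content (newline stripped) is long enough
                                          io.each_line.select { |line| line.chomp.length >= min_len }
                                        end

                                        input = StringIO.new("hi\nhello\nhey\nworld!\n")
                                        puts keep_long_lines(input)
                                        # prints:
                                        # hello
                                        # world!
                                        ```

                                        On the command line, the -n flag hands you awk’s implicit loop back, so it collapses to ... | ruby -ne 'print $_ if $_.chomp.length >= 5' | ... — still more typing than the awk version, though.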

                                        1. 1

                                          One part of your history lesson is missing: Paul McCarthy and Steve Russell saw what was going to happen and pre-emptively invented Lisp. And yes, you can do everything in Lisp, in one line if you must, that you can do in bash, awk, sed, jq, perl… but in a more powerful and maintainable form.


                                          1. 2


                                            This has gotta be one of my most common brainfarts…

                                            1. 2

                                              It was Yoko’s fault.

                                            2. 1

                                              Ruby equivalents of the basic awk and sed examples from the article, as examples of Ruby one-liner structure:

                                              • AWK: awk '{print $1}' logs.txt
                                                • Ruby: cat logs.txt | ruby -ne 'puts $_.split[0]'
                                                • Ruby: cat logs.txt | ruby -ane 'puts $F[0]'
                                              • sed: sed 's/^[^ ]*//' logs.txt | sed 's/"[^"]*"$//'
                                                • Ruby: cat logs.txt | ruby -ne 'puts $_.gsub(/^[^ ]*/, "").gsub(/"[^"]*"$/, "")'
                                            1. 6

                                              This was a nice review. I admit I wasn’t able to finish the book, as I constantly got stuck on the confusing PlusCal syntax.

                                              At the time of reading it, I was getting some weird edge cases in a distributed protocol at work, and trying to come up with a minimal example that could be reproduced was proving to be quite difficult. I got the book after reading a couple of Hillel’s posts and seeing the announcement here, with the idea of using TLA+ to solve our issue.

                                              I already had a written specification (although in pseudocode form, and quite loose), and some intuition of where the problem was coming from, so I thought it would be a simple task. Nevertheless, I was never able to do it. Trying to reduce the real implementation to a model that TLA could understand, and figuring out the right level of detail, proved to be too time consuming, plus I had other, more urgent things to do, and in any case the problem was rare enough to be able to postpone finding a solution for a bit.

                                              Fast forward a couple of months, and I finally decided to come back to the problem. This time, though, I implemented a simplified version of the system, in real code, and ran it through a property-based test (poor man’s model checker) that would execute actions against the system, and specified that the edge case would never happen. I didn’t get the prettiest execution on the first try, but I managed to refine it iteratively, armed with existing knowledge of the domain. The final execution was maybe 10 lines long and easily explainable on a whiteboard, which I consider a great success.

                                              The whole thing took maybe a week of work (counting reading about model-checking, property tests and learning to use the testing library) and proved much simpler than trying to tackle a real TLA spec. I’m not saying my approach was better, TLA can express so much in terms of temporal properties that trying to model that some other way will be painful. But sometimes, the simplest approach works best.

                                              1. 5

                                                I’ve also had a lot of success using property-based testing as a sort of model checker, but there are trade-offs to keep in mind. I’ve been learning TLA+, off and on, and don’t feel it’s at all redundant with property-based testing.

                                                TLA+ will check the model exhaustively, whereas every property testing library I’m aware of is probabilistic – because they’re semi-randomly exploring, it’s never certain when they’ve checked all meaningful cases, and some are better than others about avoiding redundant checks. Sometimes the partiality is fine, and just random stress testing is a good way to feel out an implementation, but total state-space coverage is compelling.

                                                Where TLA+ checks a model of the code, a property testing library will check the implementation (or perhaps a subset). It’s often worth doing both! I mostly work in C these days (and use my theft library for property-based testing), so testing for mistakes in the implementation details is pretty important, but a non-code model checker could still be really valuable during system design.
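
                                                  To make the probabilistic part concrete, the core loop of a property-based test is tiny — here’s a toy, hand-rolled version (Ruby for brevity; this is not theft’s API, and the generator is hard-coded): generate random inputs, check the property against each, and report the first counterexample, if any.

                                                  ```ruby
                                                  # Toy property-based check: random inputs, first counterexample wins.
                                                  # Returns nil when no counterexample is found within `trials` runs --
                                                  # which, unlike exhaustive model checking, proves nothing.
                                                  def check_property(trials: 200, seed: 0)
                                                    rng = Random.new(seed)   # fixed seed => reproducible run
                                                    trials.times do
                                                      # generator: arrays of 0..12 small integers (lots of collisions)
                                                      arr = Array.new(rng.rand(0..12)) { rng.rand(0..9) }
                                                      return arr unless yield(arr)   # counterexample found
                                                    end
                                                    nil
                                                  end

                                                  # Holds for every input, so this comes back nil:
                                                  check_property { |a| a.sort == a.sort.sort }   # => nil
                                                  # False in general, but only caught if the generator happens to
                                                  # produce an array with a repeated element:
                                                  check_property { |a| a.uniq.length == a.length }
                                                  ```

                                                  A real library adds the parts that matter in practice — shrinking, coverage of edge values, replayable failures — but the semi-random search above is the reason total state-space coverage is never guaranteed.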

                                                1. 2

                                                  TLA+ will check the model exhaustively, whereas every property testing library I’m aware of is probabilistic.

                                                    For sure! A naïve property might not find the example you’re looking for. I know I didn’t get anything on my first try. I had to play a lot with the model generator and the shrinking procedure, to the point that I ended up explicitly defining certain parameters of the generator (number of concurrent processes, number of operations a process is able to execute, etc.).

                                                  It’s often worth doing both!

                                                    Won’t disagree here. If I had already known TLA or any other specification language, I would’ve used that for sure (and if I ever get the time I plan to come back to the book and work through it, plus Lamport’s book), but when you’re facing constraints on the time you have available, going for a simpler solution can get you far enough.

                                              1. 4

                                                Working on wrapping up and releasing some personal projects I’ve had mostly ready for a while.

                                                I just pushed skiparray, an unrolled skip list library for C. It has roughly the same relation to a regular skip list as a B-Tree has to a basic balanced binary tree.

                                                Aside from that, getting back to learning TLA+, working through hwayne’s book.

                                                1. 5

                                                  I usually turn to this blog post when I need a reference for the Fisher-Yates shuffle.

                                                  It’s a good algorithm to have on hand, whether you’re shuffling a data structure or just indices into it. I’ve found a couple really elusive bugs by running tests shuffled multiple ways.
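
                                                    For reference, the algorithm itself is only a few lines — a sketch in Ruby (the helper name is mine), with the classic pitfall called out:

                                                    ```ruby
                                                    # Fisher-Yates: walk from the last index down, swapping each slot
                                                    # with a uniformly chosen slot at or below it.
                                                    def fisher_yates_shuffle!(arr, rng = Random.new)
                                                      (arr.length - 1).downto(1) do |i|
                                                        j = rng.rand(i + 1)   # 0..i inclusive -- the classic off-by-one;
                                                                              # drawing from 0..i-1 or 0..n-1 biases the shuffle
                                                        arr[i], arr[j] = arr[j], arr[i]
                                                      end
                                                      arr
                                                    end

                                                    # A seeded RNG makes the shuffle reproducible, which is what makes
                                                    # shuffled test ordering useful: a failing order can be replayed.
                                                    fisher_yates_shuffle!((1..10).to_a, Random.new(42))
                                                    ```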

                                                  1. 15

                                                    The author has a pretty explicit bias toward Earley parsing, and using LL(k) over LR(k) methods (which I mostly share). There’s a wealth of information here, but he’s missing a couple other significant strands of generalized parsing research outside of Marpa’s particular Earley variant (perhaps because they don’t fit within his narrative).

                                                    GLR (Generalized LR):

                                                    • Tomita, LR parsers for natural languages, 1984.
                                                    • Tomita, Efficient parsing for natural language, 1986.
                                                    • Scott and Johnstone, Generalised bottom up parsers with reduced stack activity, 2005. This segues into their later work with Earley and GLL.

                                                    GLL (Generalized LL):

                                                    • Scott and Johnstone, GLL Parsing, 2010.
                                                    • Afroozeh and Izmaylova, Faster, Practical GLL Parsing, 2015.

                                                    Outside of Earley, GLL seems like a very practical generalized parsing approach. Instaparse is one implementation.

                                                    Earley / Shared Packed Parse Forests:

                                                    • Scott, SPPF-style parsing from Earley recognisers, 2008 (and several related papers by Scott and Johnstone). Note: I’ve never been able to get the approach described in the paper implemented correctly, and not for lack of trying.

                                                    Earley Intersection Parsing (not sure if there’s a canonical name for this):

                                                    • Bar-Hillel, Perles, and Shamir, On formal properties of simple phrase structure grammars in Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 1961. Proves some important results about intersecting automata and context free grammars.
                                                    • Chapter 13 (“Parsing as Intersection”) of Grune and Jacobs (also cited elsewhere in his timeline), particularly 13.4 (pgs. 437-439): This describes Bar-Hillel’s Intersection Parsing approach using contemporary CS & parsing terminology, and then suggests combining it with Earley parsing. While the basic intersection parsing method produces an astonishing amount of dead-end nodes to garbage-collect later, Earley parsers limit searching to productions reachable from the start symbol. If the completer is modified to produce non-terminals in the intersection parsing format (which is easy), intersection parsing nicely handles producing the parse forest (with structural sharing, when ambiguity produces multiple derivations).

                                                    I’ve been working on a C library for Earley Intersection parsing, and an awk-like, pattern-match-directed language based on it, but working on testing them thoroughly tends to lead me to working on theft instead.
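
                                                      For a sense of how small the core of plain Earley is (recognition only — none of the SPPF or intersection machinery above, and nullable rules would need the Aycock-Horspool fix), here’s a sketch in Ruby rather than C:

                                                      ```ruby
                                                      # Minimal Earley recognizer. Grammar: nonterminal => array of
                                                      # productions (arrays of symbols); anything that isn't a grammar
                                                      # key is treated as a terminal (a one-character string here).
                                                      def earley_recognize?(grammar, start, input)
                                                        # chart[k] holds items [lhs, rhs, dot, origin] valid after k chars
                                                        chart = Array.new(input.length + 1) { [] }
                                                        add = ->(k, item) { chart[k] << item unless chart[k].include?(item) }
                                                        grammar[start].each { |rhs| add.call(0, [start, rhs, 0, 0]) }

                                                        (0..input.length).each do |k|
                                                          i = 0
                                                          while i < chart[k].length        # chart[k] may grow as we scan it
                                                            lhs, rhs, dot, origin = chart[k][i]
                                                            if dot < rhs.length
                                                              sym = rhs[dot]
                                                              if grammar.key?(sym)         # predict
                                                                grammar[sym].each { |p| add.call(k, [sym, p, 0, k]) }
                                                              elsif k < input.length && input[k] == sym
                                                                add.call(k + 1, [lhs, rhs, dot + 1, origin])  # scan
                                                              end
                                                            else                            # complete
                                                              chart[origin].each do |l2, r2, d2, o2|
                                                                add.call(k, [l2, r2, d2 + 1, o2]) if d2 < r2.length && r2[d2] == lhs
                                                              end
                                                            end
                                                            i += 1
                                                          end
                                                        end
                                                        chart[input.length].any? { |l, r, d, o| l == start && d == r.length && o.zero? }
                                                      end
                                                      ```

                                                      The interesting operations in my earlier list all come out of inspecting that chart; the intersection-parsing variant mostly changes what the completer emits.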

                                                    1. 3

                                                      Thanks for details about alternatives.

                                                      I have played around with Bison’s GLR option and have almost managed to create a C parser that does not require special handling of typedefs in the lexer (ambiguities still appearing at runtime for a few constructs).

                                                      Grune and Jacobs’ Parsing Techniques - A Practical Guide is sitting in a pile waiting to be read (the first edition was manageable, the second looks daunting)

                                                      1. 1

                                                        I have a print copy of the second, and it’s never struck me as daunting – reading it cover to cover, perhaps, but it’s well suited to dipping into for specific techniques, with a well-curated bibliography for further details.

                                                      2. 3

                                                        Thank you for adding this. I understand why everybody is excited about Earley parsing, though I personally don’t feel sufficiently grounded in it to share that excitement, but at the very least other lines of research deserve to be mentioned.

                                                        1. 2

                                                          I apologize for reviving this thread, but do you know if Joop Leo’s improvements for Right Recursion allows for optimizing indirect right recursion also?

                                                          1. 1

                                                            Not sure, sorry.

                                                          2. 2

                                                            packrattle is another example of a GLL parser.

                                                            The times I’ve read about Earley parsers, they seem like the same solution as GLL, but attacking the problem from a different angle. Hopefully someone in academia will eventually prove that.

                                                            1. 2

                                                              I’m particularly interested in Earley parsers because some use cases of mine assume ambiguity (reverse-engineering binaries, scraping damaged data), and the chart that Earley parsers build up supports several operations beyond just recognizing & building parse trees:

                                                              • what terminals/nonterminals would advance the current parse(s) in progress? (i.e., autocomplete)
                                                              • if we don’t assume where the starting position is, what overlapping instruction encodings are there?
                                                              • if we inserted a fake nonterminal here, would enough of the parse have completed to match any instances of (some structure) between here and there?
                                                              • incremental re-parsing of changing input, since earlier columns in the chart are constant once the parser has moved on.
                                                          1. 5

                                                            I’ve done this on a couple projects. It was particularly helpful when I was in an embedded project and had time values from a couple clock domains – mixing up a 1 msec resolution time value with a 1 sec time value (or, worse, our hardware’s timer that had 1024 tick per second) could have led to some subtle bugs.

                                                            Unfortunately, the lack of generics means that tracking multiple things (such as, what resolution is the timer AND is it an absolute timestamp or relative) can get verbose. Still, the struct value boxing should have zero runtime cost, so it can be a handy way to use C’s type system for extra dataflow constraints.

                                                            1. 9

                                                              Something that seems curiously absent is recommending that any code that is accepting untrusted input and attempting to parse it should get lots of fuzzing. That’s one of the ideal uses cases for fuzzing, and coverage-guided fuzzers like afl-fuzz and libfuzzer are really good at searching for input that provokes unusual behavior in parsers.

                                                              1. 1

                                                                This is interesting, but there’s an important caveat near the end:

                                                                Don’t use the Meow hash for tiny inputs. The minimum block size for the Meow hash is 256 bytes, so if you’re feeding it 8 byte data, 16 byte data, etc., it’s going to waste all of its time padding the input, and you could have gotten much better performance by using a different type of hash.

                                                                They mention upfront that it’s a “large data hash”; more specifically, it’s a hash algorithm that’s making trade-offs to be fast for larger binary assets, at the expense of working badly for shorter input. Maybe this is really compelling for your problem domain, maybe it isn’t, but it’s an important consideration. (I’m excited to have people focus on more niche hashing stuff like this, even if this one isn’t very personally relevant.)

                                                                1. 9

                                                                  Posted a new theft release, 0.4.4, which includes a couple bug fixes and Makefile improvements that shouldn’t wait until the 0.5.0 release is ready…And then 0.4.5 shortly after, because someone pointed out I’d missed the version in the pkg-config file. Oops.

                                                                  Aside from multi-core, another major change in 0.5.0 is that I’m expecting to swap out the RNG. I’ve previously been using the 64-bit Mersenne Twister, which is statistically good enough, but rather slow, and aside from user code, theft spends most of its time in the RNG. I tried xoroshiro128+, but ran into some issues with output quality, and the author has since deprecated it in favor of xoshiro256** (note: same page, and no ‘ro’). xoshiro256** seems better (at least, it doesn’t have patterns/gaps in the output that cause my “should eventually find the minimal failure case” tests to fail), but I’m also strongly considering PCG, because I feel I understand why it works far better than I do xoshiro256**. (I recommend O’Neill’s RNG talk, btw.)
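
                                                                  For anyone curious, the xoshiro256** step function itself is tiny. Here’s a sketch in Ruby (the reference implementation is C; the class is mine, purely for illustration, and Ruby’s bignums need explicit 64-bit masking), seeded with splitmix64 as the authors recommend:

                                                                  ```ruby
                                                                  # Sketch of xoshiro256**: four 64-bit state words, updated with
                                                                  # xors, shifts, and rotates; output is a scrambled copy of s[1].
                                                                  class Xoshiro256ss
                                                                    MASK = (1 << 64) - 1

                                                                    def initialize(seed)
                                                                      # splitmix64 is the seeding routine the authors recommend
                                                                      @s = Array.new(4) do
                                                                        seed = (seed + 0x9e3779b97f4a7c15) & MASK
                                                                        z = seed
                                                                        z = ((z ^ (z >> 30)) * 0xbf58476d1ce4e5b9) & MASK
                                                                        z = ((z ^ (z >> 27)) * 0x94d049bb133111eb) & MASK
                                                                        z ^ (z >> 31)
                                                                      end
                                                                    end

                                                                    def rotl(x, k)
                                                                      ((x << k) | (x >> (64 - k))) & MASK
                                                                    end

                                                                    def next_u64
                                                                      result = (rotl((@s[1] * 5) & MASK, 7) * 9) & MASK
                                                                      t = (@s[1] << 17) & MASK
                                                                      @s[2] ^= @s[0]
                                                                      @s[3] ^= @s[1]
                                                                      @s[1] ^= @s[2]
                                                                      @s[0] ^= @s[3]
                                                                      @s[2] ^= t
                                                                      @s[3] = rotl(@s[3], 45)
                                                                      result
                                                                    end
                                                                  end
                                                                  ```

                                                                  Not a recommendation to write RNGs in Ruby — it’s just the clearest way to show the state update next to the C it mirrors.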

                                                                  Work: Among other things, working on some succinct data structure implementations, and a library which builds on them to create a highly space-efficient trie. Both will eventually be open-sourced.

                                                                  1. 3

                                                                    It’s cool you work somewhere that lets you open-source stuff like that. Unless you’re an independent contractor/consultant working for yourself. Well, that’s even cooler. ;)

                                                                  1. 8

                                                                    Focusing on theft again, after taking a break for several months. I’m well into a massive restructuring (to add multi-core support), which feels like it’s been going on forever. Now the main stuff is working, but I’m in the long tail of making sure error handling, config hooks, etc. work, now that some of them may be split between multiple address spaces.

                                                                    Other stuff that should make it in the next release:

                                                                    • Failure tagging: If the user code sets an identifier for a failure, it can avoid shrinking paths that would change into other known failures, so that common failures don’t shadow rarer ones. I have a proof of concept of this already, but it needs to be redone to work with multi-core, because that splits its state between processes.

                                                                    • Possibly replacing the RNG with something faster than Mersenne Twister 64, since outside of user code, theft spends essentially all of its time in the RNG.

                                                                    • A better interface for passing config to user hooks, and moving more common configuration things into official settings (for example: running until a time limit, rather than a number of test trials).

                                                                    • Better benchmarking of multi-core behavior (especially shrinking), and tuning with servers with 30+ cores in mind.

                                                                    • A function that provides a CLI interface, with argument parsing.

                                                                    • Improving the bloom filter implementation. (That will also become its own library.)

                                                                    • Lots of misc. improvements to shrinking.

                                                                    Other things that are on the horizon, but will probably be pushed until after the next release so it doesn’t take another year:

                                                                    • Coverage-directed test case generation. I have a proof of concept using clang’s SanitizerCoverage interface – I don’t want to make theft depend on a specific compiler, etc., but I know what the interface would look like to be able to utilize that sort of info, whether it comes from clang or something more project-specific.

                                                                    • Better structural inference during shrinking (and coverage-directed generation). This should make shrinking faster.

                                                                    • Some sort of persistence API for generator input.


                                                                    • Various performance improvements to our regex engine (with a focus on frontloading work at compile-time).

                                                                    • Implementing some succinct data structures, for efficiently querying a (currently) massive in-memory data set. I should be able to eventually open-source the parts that aren’t highly specific to our use case.

                                                                    1. 1

                                                                      Re RNG: I always used xorshift when speed over quality was the goal. Plus, it’s tiny. A quick look led me to someone saying xoshiro-256 is faster with good statistical properties. So, there are a few to try.

                                                                      1. 2


                                                                      2. 1

                                                                        Coverage-directed test case generation. I have a proof of concept using clang’s SanitizerCoverage interface – I don’t want to make theft depend on a specific compiler, etc., but I know what the interface would look like to be able to utilize that sort of info, whether it comes from clang or something more project-specific.

                                                                        Oh wow, exciting! I actually started looking into the theft sources with the intention of doing this, great to hear you’re working on it yourself. Did you have any successes yet finding bugs that did not show up without feedback?

                                                                        1. 2

                                                                          I haven’t hooked it up into theft yet, I just confirmed that the interface will work like I expect – and there isn’t much point in starting until the multi-core stuff is done. It’s almost a complete rewrite (which is why it’s taken so long).

                                                                          I want to add a more general coverage feedback hook, which could also be used with (say) line number info from Lua’s debug hook, hashes of runtime data, or something else that can be easily monitored and would correlate to meaningfully different behavior.

                                                                      1. 2

                                                                        There’s a slightly cleaned up version of my entry as a github project, hopscotch. (spoiler alert, etc.)

                                                                        1. 1

                                                                          Great entry, thanks for sharing! The obfuscated version is particularly good. witchery abounds…

                                                                        1. 5

                                                                          Studying succinct/compact data structures, with a focus on trees/tries. I’ve been reading Gonzalo Navarro’s Compact Data Structures and various referenced papers. I’ve been building up prototype implementations in C along the way, some may end up in a library eventually. (The author’s book on string search algorithms is also excellent.)
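
                                                                          For anyone curious what the building blocks look like: nearly everything in Navarro’s book is built on bit-vector rank/select. Here’s a minimal sketch of rank — not succinct, no precomputed block counts, just the semantics — assuming GCC/Clang for the popcount builtin:

```c
#include <stdint.h>
#include <stddef.h>

/* rank1(bits, i): number of 1 bits in positions [0, i) of a bit
 * vector stored as 64-bit words. Real succinct implementations
 * precompute per-block counts so this is O(1); the linear scan
 * here only illustrates the operation itself. */
size_t rank1(const uint64_t *bits, size_t i) {
    size_t r = 0, w = 0;
    for (; w + 64 <= i; w += 64)
        r += (size_t)__builtin_popcountll(bits[w / 64]);
    if (i % 64)
        r += (size_t)__builtin_popcountll(bits[w / 64] &
                                          ((1ULL << (i % 64)) - 1));
    return r;
}
```

                                                                          select (the inverse: the position of the k-th 1 bit) is the harder primitive to make fast, and that’s where most of the engineering effort in these structures goes.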

                                                                          Working on some small improvements to my Thinkpad Carbon X1 / Void Linux ansible config repo, since now a couple people are using it and have given me feedback. (And, yes, I am very happy with this laptop, and it works very well with Linux.)

                                                                          Getting ready for some vacation time.

                                                                          1. 14

                                                                            Another benefit (which doesn’t seem to be in the post): when you’re building a decision table, you may realize there are cases you don’t know how to handle, or (oh no!) cases you’ve been handling two different ways, depending on the order in which different code paths check the variables.

                                                                            1. 7

                                                                              This is how I have used them in the past. Sometimes I’ve even directly encoded the table as a lookup table, mapping the actions to functions.
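
                                                                              In C that encoding can be very direct: pack the conditions into an index and make the table an array of function pointers. A sketch with made-up conditions and handler names — the point is that all four cases must appear exactly once, so a missing or conflicting case is immediately visible:

```c
/* Hypothetical decision table: two boolean conditions packed into
 * a 2-bit index, mapped straight to handler functions. */
typedef const char *(*action_fn)(void);

static const char *approve(void) { return "approve"; }
static const char *review(void)  { return "review"; }
static const char *reject(void)  { return "reject"; }

static const action_fn table[4] = {
    /* index = (is_member << 1) | has_funds */
    [0] = reject,   /* !member, !funds */
    [1] = review,   /* !member,  funds */
    [2] = review,   /*  member, !funds */
    [3] = approve,  /*  member,  funds */
};

const char *handle(int is_member, int has_funds) {
    return table[((is_member != 0) << 1) | (has_funds != 0)]();
}
```

                                                                              Nested ifs let a case silently fall through two different ways; the table makes every combination a line you have to write down.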

                                                                            1. 7

                                                                              I bought a new laptop, a 6th gen. Thinkpad Carbon X1, and I’m setting it up with Void Linux. (Replacing a Macbook Pro – I’ve been wanting to switch to a non-Apple laptop for a while.) My “laptop” Ansible playbook drove most of the setup, but I still need to make it suspend when I close the lid.

                                                                              A couple years ago, I wrote some blog posts about using Ansible to configure laptops (1, 2). I’ve been meaning to update my example repo – while the concepts are all the same, Ansible has had some interface changes over the years. It’s usually good about messages saying e.g. “sudo: was deprecated, use become:”, but it’d be nice having an up-to-date example, and it would document getting Void running on this Thinkpad.

                                                                              I’m going to make Void packages for a couple things I’ve installed locally – mscgen, Alloy, TLA+, and drivers for my printer/scanner (a Brother HL-L2380DW).

                                                                              Aside from that, I’ve been working on a C library for parsing/scraping ambiguous data using Earley parsing and parse forest grammars. (Related, I gave a talk at Papers We Love SF about the Earley parsing paper a couple weeks ago.) This has been off and on for a while, I’m still adjusting the API to account for important use cases it supports beyond general parsing. I’m also working on an LL(1) parser generator, a clone of djb’s redo, and learning radare2 by working through some reverse engineering exercises.

                                                                              1. 0

                                                                                I don’t really understand this. Sure, it’s cool to optimize something so well, but I don’t see the point of going to so much effort to reduce memory allocations. The time taken to run this, which seems like what you would actually care about, is all over the place and doesn’t get reduced that much. Why do we care about the number of allocations and GC cycles? If you care that much about not “stressing the GC”, whatever that means, then it’s better to switch to a non-GC language than to jump through hoops to get a GC language to not do its thing.

                                                                                1. 11

                                                                                  On the contrary, I found this article a refreshing change from the usual Medium fare. Specifically, this article is actually technical, has few (any?) memes, and shows each step of optimization alongside data. More content like this, please!

                                                                                  More to your point, I imagine there was some sort of constraint necessitating it. The fact that the allocation size dropped so drastically fell out of using a pooled allocator.

                                                                                  1. 4

                                                                                    Right at the beginning of the article, it says:

                                                                                    This data is then used to power our real-time calculations. Currently this import process has to take place outside of business hours because of the impact it has on memory usage.

                                                                                    So: They’re doing bulk imports of data, and the extra allocation produces so much overhead that they need to schedule around it (“outside of business hours”). Using 7.5GB may be fine for processing a single input batch on their server, but it’s likely they want to process several data sets in parallel, or do other work.

                                                                                    Sure, they could blast the data through a DFA in C and probably do it with no runtime allocation at all (their final code is already approaching a hand-written lexer), but completely changing languages/platforms over issues like this has a lot of other implications. It’s worth knowing if it’s manageable on their current platform.
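
                                                                                    To make that concrete, here’s the kind of no-runtime-allocation DFA pass I mean, sketched for CSV-ish input. It only counts fields, ignores escaped quote pairs, and all names are made up — a real lexer would also record field boundaries:

```c
#include <stddef.h>

/* Two-state DFA over a buffer, in place, zero allocation:
 * FIELD for unquoted text, QUOTED inside double quotes. */
enum state { FIELD, QUOTED };

size_t count_fields(const char *buf, size_t len) {
    enum state st = FIELD;
    size_t fields = len ? 1 : 0;
    for (size_t i = 0; i < len; i++) {
        char c = buf[i];
        switch (st) {
        case FIELD:
            if (c == '"') st = QUOTED;
            else if (c == ',' || c == '\n') fields++;
            break;
        case QUOTED:
            if (c == '"') st = FIELD;  /* doesn't handle "" escapes */
            break;
        }
    }
    return fields;
}
```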

                                                                                    1. 3

                                                                                      They’re doing bulk imports of data, and the extra allocation produces so much overhead that they need to schedule around it

                                                                                      That’s what they claim, but it sounds really weird to me. I’ve worked with plenty of large data imports in GCed languages, and have never had to worry about overhead, allocation, GC details, etc. I’m not saying they don’t have these problems, but it would be even more interesting to hear why these things are a problem for them.

                                                                                      Also of note - their program never actually used 7.5GB of memory. That’s the total allocations over the course of the program, virtually all of which was surely GC’ed almost immediately. Check out the table at the end of the article - peak working set, the highest amount of memory actually used, never budged from 16kb until the last iteration, where it dropped to 12kb. Extra allocations and GC collections are what dropped. Going by the execution time listing, the volume of allocations and collections doesn’t seem to have much noticeable effect on anything. I’d very much like to know exactly what business goals they accomplished by all of that effort to reduce allocations and collections.

                                                                                      1. 1

                                                                                        You’re right – it’s total allocations along the way rather than the allocation high water mark. It seems unlikely they’d go out of their way to do processing in off hours without running into some sort of problem first (so I’m inclined to take that assertion at face value), though I’m not seeing a clear reason in the post.

                                                                                        Still, I’ve seen several cases where bulk data processing like this has become vastly more efficient (from hours to minutes) by using a trie and interning common repeated substrings, re-using the same stack/statically allocated buffers, or otherwise eliminating a ton of redundant work. If anything, their timings seem suspicious to me (I’d expect the cumulative time to drop significantly), but I’m not familiar enough with the C# ecosystem to try to reproduce their results.

                                                                                      2. 1

                                                                                        From what I understood, the 7.5GB of memory is total allocations, not the amount of memory held resident, that was around 15 megs. I’m not sure why the memory usage requires running outside business hours.

                                                                                        EDIT: Whoops, I see you responded to a similar comment that showed up below when I was reading this.

                                                                                      3. 2

                                                                                        The article doesn’t explain why they care, but many garbage collectors make it hard to hit a latency target consistently (i.e. while the GC is running its longest critical section). Also, garbage collection is somewhat expensive (usually better optimized for short-lived allocations than malloc, but still), and re-using memory makes caches happier.

                                                                                        Of course, there’s a limit to how much optimization one needs for a CSV-like file in the hundreds of MBs…

                                                                                        1. 1

                                                                                          Maybe their machines don’t have 8gb of free memory lying around.

                                                                                          1. 2

                                                                                            As shown in the table, they don’t use anywhere close to 8gb of memory at a time. This seems like a case that .NET is already very good at, at a baseline level.

                                                                                        1. 2

                                                                                          I often do, but don’t try to force it. Sometimes I just don’t have spare mental energy for it, especially when I’m doing deep work at my day job. Hobbies that get me away from tech help ground me and keep me from burning out (I particularly like cooking and bicycling). I’m usually not interested in programming for its own sake, so much as for particular projects I want to work on.

                                                                                          Also, remember that coding means different things to different people. You might find another branch of programming fits you better. I don’t find front-end web development enjoyable at all (I’m not actively involved in that space, and the tooling changes quickly, so it feels like starting from scratch every single time), but find playing with microcontrollers / Arduinos and making generative art bots to be a lot of fun. I’m especially interested in some particular problem domains, like parsing, networking, and testing, which have had more or less overlap with my jobs over the years.