This is wonderful. So, the question now is, how do I implement efficient rank/select for bitvectors?
The method I’ve used for rank involves precomputing rank sums: you keep a running total of 1-bits for each large block (every 64kb, for example), and then a second array of counts within smaller blocks (maybe every 4kb). You store each count packed into only as many bits as it needs, and the block sizes can be tuned for time/space tradeoffs.
To find the rank of a bit offset, you’d start with the preceding 64k block’s total, then add each 4k block’s count until you’re close to the offset, then popcount the remaining individual 64-bit words up to the exact bit offset. This can be made reasonably efficient. It’s also effectively constant time, since there’s one random access for the large block total, followed by adding a bounded number of adjacent smaller block counts within the final large block, then a bounded number of word popcounts within the final small block.
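To make that concrete, here’s a minimal Rust sketch of the two-level index (not my original implementation; the block sizes are illustrative rather than tuned, and the counts live in plain integer arrays instead of being bit-packed as a real implementation would do):

```rust
const WORD_BITS: usize = 64;
const SMALL_WORDS: usize = 8; // 512-bit small blocks (illustrative)
const LARGE_WORDS: usize = 64; // 4096-bit large blocks (illustrative)

struct RankBitvec {
    words: Vec<u64>,
    total: u64,      // total number of 1-bits
    large: Vec<u64>, // 1-bits before each large block
    small: Vec<u16>, // 1-bits before each small block, relative to its large block
}

impl RankBitvec {
    fn new(words: Vec<u64>) -> Self {
        let (mut large, mut small) = (Vec::new(), Vec::new());
        let (mut total, mut in_block) = (0u64, 0u16);
        for (i, w) in words.iter().enumerate() {
            if i % LARGE_WORDS == 0 {
                large.push(total);
                in_block = 0;
            }
            if i % SMALL_WORDS == 0 {
                small.push(in_block);
            }
            total += u64::from(w.count_ones());
            in_block += w.count_ones() as u16;
        }
        RankBitvec { words, total, large, small }
    }

    /// Number of 1-bits in positions [0, pos).
    fn rank1(&self, pos: usize) -> u64 {
        if pos >= self.words.len() * WORD_BITS {
            return self.total;
        }
        let word = pos / WORD_BITS;
        // Large-block total, plus the small-block count within that large block.
        let mut r = self.large[word / LARGE_WORDS]
            + u64::from(self.small[word / SMALL_WORDS]);
        // Popcount whole words from the small-block boundary up to `word`.
        for w in &self.words[word - word % SMALL_WORDS..word] {
            r += u64::from(w.count_ones());
        }
        // Partial final word: mask off the bits at and above `pos`.
        let rem = pos % WORD_BITS;
        if rem > 0 {
            r += u64::from((self.words[word] & ((1u64 << rem) - 1)).count_ones());
        }
        r
    }
}
```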
For select, you can binary search on that rank result, or you can build a similar index using precomputed tables to jump to roughly the right area, then scan and count bits to find the exact offset. Either rank or select can be implemented in terms of the other.
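A sketch of the binary-search flavour of select, assuming the RankBitvec above; a dedicated select index would replace the O(log n) search with table jumps:

```rust
impl RankBitvec {
    /// Position of the k-th 1-bit (0-indexed), if it exists: the smallest
    /// pos such that rank1(pos) == k and bit pos is set.
    fn select1(&self, k: u64) -> Option<usize> {
        if k >= self.total {
            return None;
        }
        // Invariant throughout: rank1(lo) <= k < rank1(hi).
        let (mut lo, mut hi) = (0usize, self.words.len() * WORD_BITS);
        while hi - lo > 1 {
            let mid = lo + (hi - lo) / 2;
            if self.rank1(mid) <= k {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        Some(lo)
    }
}
```

In practice you’d at least narrow the search using the same block sums before scanning words, which is the “jump to roughly the right area” idea.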
My implementation was for a LOUDS, a kind of succinct trie structure whose backing bitvector has an almost perfectly even balance of 1s and 0s by design. There are different approaches with better tradeoffs if you know the ratio of 0s to 1s is likely to be very skewed, though.
I found Erik Demaine’s open courseware lectures about succinct data structures particularly helpful when I was first learning about those data structures – I think it was sessions 17 and 18 here, but it may have been different recordings from the same course. He describes the block sample sum and popcounting approach.
For more depth, Gonzalo Navarro’s Compact Data Structures describes a couple of other approaches to rank & select, along with several other succinct structures. If you’re at all interested in this stuff I’d absolutely recommend checking it out; it’s a challenging read at times, but it’s fascinating.
Good question I myself don’t know the answer to, but I’m sure there are papers you can dig up (keywords “succinct rank select” should get you started). You can also look at existing implementations such as the one in the vers crate.
This might be helpful; I just ran into it (in an unfortunately phrased orange site comment…)
https://stackoverflow.com/questions/72580828/what-is-a-succinct-rank-data-structure-how-does-it-work
Wonderful. I was a big fan of this city building genre in the early 2010s.
Glad to see this being discussed on lobsters. In the (paraphrased) words of Dylan Beattie, “Some people say ‘keep politics out of technology’. Politics create the problems technology tries to solve.”
Is writing a 150 character commit message so difficult that it warrants invoking an ~~omniscient AI~~ ~~big hammer~~ LLM? I am not sure what this is for.
It’s funny, I think the answer is probably “yes” given how many commit messages are “try”, “lol”, “ugh try again”.
It may hit a bit close to home, but the folks too lazy/sloppy/unskilled to write proper commit messages are probably the ones most easily replaced by LLMs.
Imagine if some people use this to create more descriptive commit messages for their personal projects. Crazy, I know.
Not my experience at all.
This right here. Spicy autocomplete is pretty good for writing a sentence when you’re mentally casting about to find the words.
That is very true :-)
As long as the LLM prompt is not “try” or “lol” or “ugh try again”
Well, let’s consider the cases. If the commit message that one would have put was “try”, “lol”, or “ugh try again”, then what did we gain by having a more verbose commit message written by the LLM? Conversely, if the commit message is actually very important, should we really use an LLM to write it for us?
A very fair point, although I was being a bit tongue in cheek with my comment. I agree with you.
Arguably those provide more context about the “why” than an LLM-provided summary of the change: they convey that the committer is in a hurry, and that this is likely an (attempted) fix to a recent commit by the same author, probably as part of the same merge.
I’m ambivalent tbh. I know amazing devs who write those silly little intermediary commits. I try to organize my commits because I assume someone, one day, might want to review a commit log to figure stuff out, but I don’t have strong feels because I rarely do that myself.
Most of the people I know who write those sorts of commits tend to be basically using them as fixup commits, except that they tend not to be aware of --fixup (sketched below) and probably find their flow more convenient anyway. When their change eventually appears in the main branch, it’s almost always squashed down to a single commit with a single well-written description (usually done via PRs in some git forge or other).
Or in other words, I suspect the people writing those intermediary commits usually see them as temporary, and I don’t think they’d get much out of an LLM describing their changes.
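(For anyone unfamiliar with that flag, the flow looks roughly like this; the hash and branch name are placeholders:)

```sh
git commit --fixup=a1b2c3d       # records a "fixup! ..." commit targeting a1b2c3d
git rebase -i --autosquash main  # reorders and squashes the fixups into their targets
```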
Perhaps, yeah. I tend to clean up my commit history to remove those intermediary ones as well fwiw, but I know a lot of people who don’t. I suppose the answer isn’t “use an LLM” (I know I won’t) but to just learn how to rewrite your history properly.
Or even use tools that make managing your history easier. I do think Git makes it excessively hard to rewrite commits and maintain a history of the rewrites that you’re doing, which leads to people taking the “squash everything in a PR” approach, which is typically a lot easier. But better tooling in this regard can improve things a lot.
That said, this is perhaps getting a bit off-topic from the core idea of LLM-based commit messages.
Most simple use cases of LLMs are not that much more than an admission of the user’s own incompetence.
Sometimes there’s no bigger meaning behind some changes. Either it’s obvious why it’s done, or it becomes obvious in the context of other changes. Writing commit messages for those is just going through the motions, which means the message sometimes ends up as “refactor” or similar. In those cases it could be nice to have an automatic summary instead of nothing.
Many people are asking “Why force Apple to open up their platforms? Why don’t the users just pick another platform to run their software on?”. If you are genuinely interested in the long answer, I strongly recommend “Chokepoint Capitalism” and “The Internet Con” by Cory Doctorow.
This guy. I can’t tell his NY Times bestselling fiction apart from his NY Times bestselling nonfiction. At least his YA novels are clearly distinguishable by the cover graphic design. Anyway, I really don’t understand why he (or somebody!) can’t put up a free, concise summary of his argument(s?) if they’re so important. It always seems to boil down to more government regulation, as though regulatory capture hasn’t already paved so many miles of the road we’re on.
Anyway, here’s a randomly selected book review.
that’s an amazing sentence.
Thanks for the links. As for a summary, I think he regularly writes blogposts and columns and appears on podcasts and gives talks. Here is a random interview I found where the first two questions are about interoperability and regulatory capture.
One other way it is harmful to technical innovation: in the recent past, universities and grant agencies have been directing research money disproportionately towards machine learning and away from other fields like systems, security, algorithms, and formal methods.
Noscript. No text. No luck. Lifting the restriction on notion.so does nothing: I’m still redirected here, with no chance to unblock whatever needs unblocking — because at this point I no longer see what I need to unblock. It’s the first time I’ve seen a website redirect me to a different URL just to tell me I need to turn on JavaScript.
Sorry I can’t comment on the actual content. I… kinda didn’t get a chance to read it.
I created a gist containing the content (hopefully I didn’t break any licenses) here. Feel free to tell me if I’m wrong and I’ll delete it. Hopefully GitHub works better (though it might still be bad).
Fortunately, you can get GitHub to deliver you the raw binary directly
Hopefully, archive.is works better
archive.is is worse because it sometimes wants a CAPTCHA, while the raw GitHub URL works with a wide range of ways to fetch it.
You need to lift it on notion.site I think, although I gave up after I couldn’t figure it out and kept getting redirected.
Web developers, please heed my plea: don’t redirect noscripting users!! Especially not to another domain.
If I’m interested enough, I’ll enable JS for your site. I’m used to doing this, despite the fact that I primarily read text and submit basic forms on the internet. I’m even patient enough to allowlist your myriad 3rd party scripts.
I have JS on by default, but I got tripped up by that too.
I noticed that the page was a bit slow and, more importantly, that it hijacks the arrow keys I use to fine-tune my scroll position. Since that is often fixed by disabling JS without any ill effects, I temporarily flipped the scripts off in uBlock for the domain and reloaded. Since the page would now immediately take me to another domain, the easiest way to turn the scripts back on was to just restart Firefox.
There are some related software artifacts which solve the problem of matching lookaround expressions efficiently:
All of this is from 2024