FireDucks is not an open source library at this moment. You can install it freely using pip and use it under the BSD-3 license, and of course you can look into the Python part of the source code.
I.e. wheel files containing compiled blobs are available on PyPI, but no source for these blobs is available.
Correct. BSD clauses state only that source and binary distributions must be licensed to users under the same permissive license. It does not require the distribution of source distributions with binaries. This is why Clang has been increasingly popular in green-field platform SDKs (Apple ecosystem, games consoles, embedded, etc): unlike GCC, changes to Clang/LLVM don’t have to be distributed in source form.
Does this mean that if, for example, a game console SDK had source distribution and was based off some BSD software, then whoever received that source would be allowed to distribute it?
Like “you’re not required to distribute the source, but if you do then the user can distribute it”?
Not a lawyer but AFAIK no. The rights to redistribute code in e.g. GPL software are due specifically to clauses which mandate that the source code be made available, and made available under the terms of the GPL, which is what provides the “sticky” rights to redistribute the code. There is nothing in the BSD licenses that prohibits anyone from distributing the work with additional restrictions or terms attached.
Despite the verb tense in the title, it appears that all of the measures suggested in the post occur after the code is written – and many are automated. Scanning for known CVEs helps us fill out our compliance checklists, but I’d love to see more folks writing about how they incorporate security thinking into their design – e.g. things like threat models, attack surfaces, trust boundaries – which can even be tailored to what a certain language makes easy or hard.
I don’t often agree with the complaints that the Python language & stdlib are growing too much / too rapidly, but at first glance I’m not seeing much justification for templatelib being a part of the standard library beyond the convenience of the t prefix. Furthermore, the authors of templatelib could likely get some good user feedback from being a distinct package first. IIRC, stdlib dataclasses benefited heavily from design lessons of the attrs library.
There’s no way to hook into the functionality and evaluation of f-strings, so there isn’t room to experiment with this design outside the stdlib, beyond a clunky “toss everything into this class” approach which doesn’t feel nice and already exists via the plethora of DB drivers and other templating libraries. If people were content using those facilities, I doubt this feature would be proposed.
There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer.
Jack Clark (Anthropic co-founder) conversely focuses on the trend that bigger models show less degradation from the distracting variables:
Small and therefore relatively dumb models do absolutely overfit, but the larger and smarter your model, the less prone it is to overfitting. […] The fact they cope far better is actually - to me - a very optimistic sign, indicating that the world’s largest and most sophisticated models may really exhibit some crude reasoning
In any case, while it sometimes feels like we’re drowning in new benchmarks, I think it’s useful to be concrete in regards to what we think we’re talking about when we use words like “reasoning” and “problem solving” as applied to LLMs.
While I don’t actively use any Lisps these days, I deeply appreciate the glimpses of “crystalline purity” I’ve seen. One particularly memorable occurrence early in my programming education was the construction of Church Numerals in SICP.
I still don’t understand what “reasoning” is, and I get frustrated at how few of these “LLMs can’t reason” / “LLMs can reason” arguments even attempt to provide a useful definition.
I suspect that many of them are actually talking about different things.
I know you’re asking for more rigor but I think it’s fair to douse the hype by highlighting examples of them being clearly unable to execute rules of logic that humans have no trouble with.
We could look at things this way: if there was a problem which Prolog would excel at, that’s also typically the kind of problem that LLMs do badly at.
Which, I know, people are going to pull out “well humans aren’t always great at those Prolog problems either!” Yeah, that’s true, though I wasn’t comparing to humans; I’m comparing to Prolog for the moment. But now let’s return to humans. Clearly the LLM is doing worse than even young children based on the examples shown in the paper, and the jersey number is a great example.
There are clearly a lot of things LLMs are great at, in terms of generative production. You yourself are a big advocate of that, and of knowing the constraints on that. So let’s acknowledge that Prolog-style careful reasoning isn’t something LLMs are good at, but they’re good at other things that Prolog-style systems aren’t… it typically takes a human to “set up” all the Prolog-like rules carefully together to be able to make things work. The NeuroSymbolic approach is “what if we combine the strengths of both”?
Surely you can agree that LLMs aren’t suited to the Prolog-style domain and that the Prolog-style domain takes a lot of manual work, so combining them could help unlock something.
I have a hunch that hooking a frontier LLM up to a Prolog interpreter (using the same mechanism as how ChatGPT and Gemini can both write and then execute Python scripts) could work really well already. I bet they are really good at writing Prolog - I don’t know it at all though, so I’m not well equipped to evaluate that.
Personally, I have to doubt that. Maybe they can be trained to write proper Prolog or other declarative logic systems, but as of now that would be too unreliable. LLMs are already known for not writing working code; asking them to write working code and also reason about the logic that would go into it would be extremely error-prone, unreasonably time-consuming and processor-intensive compared to alternatives, and not worth the effort.
A human absolutely can and should write the underlying symbolic logic rules of a system. If you need the LLM to do that, it means you don’t actually know what your system requirements are, and thus don’t even know what it is you’re trying to make in the first place. Your symbolic system can absolutely call into other machine learning models to perform fuzzier logic, but you really should have an idea of what the core purpose of your system is and, at least generally, an understanding of how it works. Actually, this reminds me of a pipe-dream Kickstarter project I saw once, where the (very young) man behind the project wanted to make the “ultimate survival zombie game” and said he would use artificial neural networks for the zombies - even though, unbeknownst to him, not only do sophisticated algorithms like A* already exist, but you don’t even need ANNs in an artificial world, since the AI for the game would already have all knowledge of the game upfront, and thus there are no patterns to decode from entropy. That, to me, is the classic example of simply not knowing what it is you’re even trying to make, and assuming throwing machine learning at it will get the job done, when in reality it would be an infinitely worse solution, if not outright non-functional.
From working with LLMs, the very first bit of wisdom I’ve developed is: Don’t trust the LLM to do anything, give it an extremely specific request that’s almost solvable mechanically, but ultimately too gummed up in human language to be reliably digested symbolically. The LLM is not there to make decisions, it’s there to make some text look like some other text.
If you want reliability you don’t want LLMs - their inherent lack of reliability (due to things like being non-deterministic) is their single biggest weakness.
They are increasingly useful for writing code that actually does work. The ChatGPT Code Interpreter pattern - where the system writes code, then executes it, sees the results (or error messages) and then iterates on it until it gets a result that looks right - is incredibly powerful… IF you prompt it right, and keep a careful eye on what it ends up doing.
Just for example, the paper highlights that if in a grade school level word problem you vary only the names of people and the particular number values involved, models’ performance gets worse on average and less consistent across variations, which you would not expect if you did the same thing with grade schoolers.
It’s not that people have no trouble, but people are not expected to have this trouble. The findings amount to doubt that the models do what the corresponding people do.
Just because you might have trouble with identifying legal chess moves, don’t assume everyone you debate online also does. That’s an unkind assumption and I have flagged your comment as such.
I still don’t understand what “reasoning” is, and I get frustrated at how few of these “LLMs can’t reason” / “LLMs can reason” arguments even attempt to provide a useful definition.
Unfortunately, that’s modern web content for you :-).
The article actually makes a weaker claim:
Literature suggests that the reasoning process in LLMs is probabilistic pattern-matching rather than formal reasoning. (Emphasis mine)
and if you follow the bibliography, they are specifically describing LLMs’ reasoning as probabilistic, rather than formal, based on their propensity to a number of issues (e.g. token bias, distractability) which indicate that LLM outputs are correlated with incidental, rather than formal relations between tokens.
This substack post leads with a link to a paper that is more modest in its claims, and I think grapples with your frustrations (in a fashion): https://arxiv.org/pdf/2410.05229
That is, I think that good benchmarks are hard, but in a sense make us concretely try to come to terms with what we’re looking for from models’ behavior. Goodhart’s law will always be A Thing, but it’s still useful to try.
I think the reality is that it’s trivial to define “reason” in a way that includes LLMs, and extremely difficult to define it in a way that both (a) does not exclude what we already consider reasoning (inference, Bayesian, modal, deductive, etc.) and (b) excludes LLMs.
But the paper doesn’t use the term nakedly; it prefaces it with terms such as “mathematical”, i.e. “mathematical reasoning” or “formal reasoning”. These are forms of reasoning that are strictly rules-based. What they show is that LLMs probably don’t model these forms of reasoning.
Ooh, I like that this offers both WCAG and APCA calculations.
I like the approach of this APCA checker, as it lets you “snap” to a certain APCA Lc value: https://cliambrown.com/contrast/ One of these days I’d like to make version of this that interpolates in OkLCh instead of HSL…
Congrats on the release! I think the fact that many of the highlights of recent Gleam releases are focused on “Developer Experience” improvements is a testament to the strong design of the language itself, and the community’s focus on making it a pleasant experience to work with.
Thank you!
I’ve been dipping in a bit and I’m finding tiny projects without documentation and tests, which is definitely not my idea of “developer experience”.
the purpose of ai is to extrapolate. if it is able to technically innovate, it will be by effectively extrapolating. and insofar as it is unable to extrapolate, it will not innovate. if your technology needs an overfit copy-paster to match some patterns then your abstractions were bad and the winner should have been the better-designed technology that didn’t need that
That’s a strange way of putting it. If anything, LLMs are only able to interpolate - to fill in the blanks with bland filler material. They can do some grunt work, but they so far haven’t been able to come up with truly new stuff AFAICT.
Interpolation is doomed by the curse of dimensionality, so in this context, I would argue there isn’t a meaningful difference between interpolation and extrapolation. In the particular case of language modelling, filling in the blanks would be predicting a high-dimensional sequence of tokens, which I would also call extrapolating from data.
What if you shift your boundaries to the human + AI “team”? I don’t think there is a fundamental reason to believe that people using LLMs are less (or more!) innovative, productive, efficient, etc.
The literal start of the bio:
Yeah, I’m sure this will be a completely impartial account.
I’m so absolutely tired of people trying to misrepresent themselves as the one “reasonable” grifter.
This feels uncharitable. Read further into his bio, or look at some of his publication history [arxiv.org]. He’s done high-quality research into how ineffective many AI safety/security mechanisms are, which doesn’t seem compatible with pure grift to me.
Some hilarious papers from this author, underlining your point:
We first reveal significant flaws [of another paper] in the evaluation that point to clear signs of gradient masking. We then show the cause of this gradient masking: a bug in the original evaluation code. By fixing a single line of code in the original repository, we reduce Sabre’s robust accuracy to 0%. In response to this, the authors modify the defense and introduce a new defense component not described in the original paper. But this fix contains a second bug […]
And then, in the introduction:
Readers may be forgiven for mistaking this defense for a different defense—also with a flawed evaluation—accepted at last year’s IEEE S&P 2023.
Source: https://arxiv.org/abs/2405.03672
“AI safety” is as much part of the grift as the hypesters are. It implies a belief that it (“AI”) is a worthwhile field to pursue at all, while giving themselves a controlled opposition that still takes all the same baseline assumptions at face value.
It’s a bit like how christians conceptualize satanism as their perfect enemy: the same overall worldview and cosmology, just with the one wrong turn of worshipping the wrong guy.
People are building “AI” systems and integrating them into their workflows, products, and businesses. In my opinion, that reality makes security research important – regardless of whether one thinks that the field is “worthwhile to pursue at all”.
The more you comment here the clearer it is that you have not read the article or understood his perspective but are rather offended by the notion of someone writing about “AI” in the first place.
There is «AI safety» grift, which serves to promote the … let’s say shaky… assumptions «AI» grift wants promoted, and there is a part of «AI safety» field which looks like «will this hammer fly off the handle immediately» kind of safety.
While many people are discussing TFA’s specific example of terminal sizes, it seems to me that a significant part of what motivates @mitsuhiko’s take here is the instances of errant advisories or issues that have been raised against projects due to their dependencies. (See the “big supply chain” portion.)
The specific instances that they mention in this and other related blog posts appear to pertain to the Rust ecosystem. Is this something specific to (or exacerbated by) the social and technical factors of that ecosystem?
You might be interested in the Oklab and Oklch color spaces. There have been several articles posted here (A perceptual color space for image processing, Color gradients and my gradual descent into madness, Improving on Solarized using the OKLab perceptual colorspace, OKLCH in CSS: why we moved from RGB and HSL, and An interactive review of Oklab to name a few) about them. Since human color perception is non-linear, there will likely always be perceptual inconsistencies with any color space, but these seem to do a reasonable job of correcting for them.
Also, although I personally appreciate your contributions of your articles here, there is a general community guideline of no more than one quarter of your posts being self-promotion. I would encourage you to post some articles you are reading that you find interesting as well your own.
The points within the post go beyond the perceptual uniformity that OkLab was designed for. Surrounding colors affect our perception as well.
The post mentions the Helmholtz–Kohlrausch effect (wikipedia.org), which is described as:
Perceived brightness is affected most by what is surrounding the object. In other words, the object can look lighter or darker depending on what is around it. In addition, the brightness can also appear different depending on the color of the object.
Colorspaces like OkLab can address the second, but not the first.
This is why, for instance, the person developing the “draft WCAG 3” contrast requirements / “APCA” uses calculations that treat light-on-dark and dark-on-light differently for prose readability, rather than purely using the difference between the background’s and foreground’s perceived lightnesses.
Do you take the author to mean that they’re looking for a contextual color space, in other words, one that takes some surrounding color as a parameter and produces dynamic outputs based on that value?
It seems that way to me, yes. However, to my (severely amateur) understanding, this moves beyond what a color space is for, and into color correction or enhancement of an image.
Thanks for note about the community guidelines. I had no clue about that rule. I tend to be more of a lurker than a poster, so I’ll keep this in mind.
Oklab and Oklch unfortunately suffer from the same issues, at least based on when I tried them. They are a general improvement on CIELAB along some axes, but they didn’t seem to solve the problem with reds in my testing. And as far as I can find, neither was designed with this effect in mind (they were designed with other properties as the goal).
Thanks for the links! I’ll check them out!
No worries. If color context (the effect surrounding or neighboring colors have on one another, e.g. in the Helmholtz–Kohlrausch effect) interests you in particular, you might check out Josef Albers’ Interaction of Color. It’s more of an artistic approach to color theory than a scientific one, but it’s fun to play around with color fields and shapes and see their effects on each other. Albers’ approach is still in use in art schools today. It’s particularly amazing to see his book in print with color plates. The large format hardcover set is a bit too expensive for individuals to purchase, but some libraries have it.
GPS makes a lot of the really fun stuff moot, but accurate timekeeping is a neat topic with its own funny set of constraints and parameters to optimize vs. other stuff. There are now decent clock modules–TCXOs, which measure the temperature and try to correct for the frequency drift it causes, and OCXOs, which heat the oscillating crystal to a predictable temperature to minimize it–available for under $50.
Even cheaper clocks often drift pretty consistently under stable conditions like a datacenter with a steady temp range. If you get a ping from a more accurate source every so often, not only can you correct your drift with it, you can update your estimate of how fast/slow your clock runs to try to get closer next time. Your basic wristwatch these days has a digital fudge factor (inhibition compensation) applied to the raw output of the quartz crystal, albeit set once at the factory.
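(Back-of-envelope, for scale: a clock that runs 1 ppm fast gains about 86 ms per day, since 86,400 s × 10⁻⁶ ≈ 0.086 s. So if checks against a better source show it consistently gaining ~90 ms/day, most of that can be folded into the correction factor rather than repeatedly stepping the clock.)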
Again, the practical solution is more or less what’s in the post–a few boxes with GPS plus a decent clock in each DC–but it’s a neat problem. It would be interesting to know more of what folks do with better time, and what quality of synchronization they get and the stuff + work needed to do it. I’ve heard about Spanner/Cockroach time-based synchronization, and I guess it could make microsecond-level log timestamps meaningful when comparing across machines; bet there’s more, too.
If you get a ping from a more accurate source every so often, not only can you correct your drift with it, you can update your estimate of how fast/slow your clock runs to try to get closer next time.
This is what NTP does.
Meta has PTP, including dedicated hardware:
https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/
Funnily enough, OCXOs have been co-opted into audiophile woo: https://jcat.eu/product/master-ocxo-clock-module/
Holy cow you just sent me down a rabbit hole.
They have network cards! For the… noise?… in your… packets????
Oh, this is the first time you’ve encountered this kind of stuff? I’m sorry.
PTP: nice!
Audiophile OCXOs: oh nooooooooo!
This sounds like a successful experiment with a negative result! Kudos to curl for being willing to experiment, and for learning from it.
Staying up too late having fun with AoC & learning Gleam. It’s a good language!
Sometimes I think that allowing scientists to publish datapoints and datasets (with clear protocols and supporting data) would be better than publishing the papers themselves. Papers can be full of subjective or misleading narratives, or even be rejected if the quality of the prose is not good, or flagged for plagiarism due to copy-paste. This is to say that there must be at least 12 people on the planet who are not native English writers and whose contributions are potentially lost. Perhaps we should collectively consider some international scientific experiment report form…
Unfortunately, the collection, curation, and organization of data provides ample opportunities for subjective factors to influence interpretation. Science is a human endeavor and a form of communication, so it cannot avoid inheriting the challenges of communicating effectively.
That was the general idea behind the center for open science: https://www.cos.io/
For those who are curious but don’t want to pick through the github issue threads:
A malicious PR making a very innocent-looking change to the README used a branch name with shell commands in it, formatted in a way that would cause CI jobs to execute those commands when performing a build for upload to PyPI. Those commands downloaded a crypto miner and embedded it into the release package.
So the automated builds that were getting uploaded to PyPI had the miner, but the source on GitHub did not, and any build you produced manually by cloning the repository and running a build on your local machine would not have it either.
It’s an interesting attack. Hopefully we’ll see a more detailed description of why a branch name from a PR was getting consumed by GitHub CI in a way that could inject commands.
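The core of the vulnerable shape is roughly this (the step below is illustrative, not the actual ultralytics config): the ${{ }} expression is expanded textually into the script before the shell ever starts, so a branch name containing $( … ) runs as a command substitution. And since git ref names can’t contain spaces, real payloads lean on tricks like ${IFS} to separate words.
- name: Vulnerable pattern (illustrative)
  run: |
    # The expression below is pasted in as raw text before bash runs,
    # so a branch named e.g. foo-$(some-payload) executes the payload.
    echo "Building ${{ github.head_ref }}"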
There was an action that echo’d the branch name without sanitizing anything: https://github.com/advisories/GHSA-7x29-qqmq-v6qc
Another lesson in never trusting user input
I don’t think “never trusting user input” is the right lesson to learn here. Why? Because I don’t think whoever wrote that code was aware they were trusting the branch name, or which properties of the branch name exactly they were trusting. So the lesson is not really actionable.
I think the lesson is that these kinds of string-replacement based systems (YAML templates, shell variable expansion etc.) just naturally invite these issues. They are inherently unsafe and we should be teaching people to use safer alternatives instead of teaching them to be vigilant 100% of the time.
For e.g. SQL queries it seems the industry has learned the lesson and you’ll rightfully get ridiculed for building your queries via naive string interpolation instead of using a query builder, stored procedures or something of the sort. Now we need to realize that CI workflows, helm charts and everything else using string-level YAML templating is the same deal.
The FP people have a mantra “parse, don’t validate” for consuming text. I think we need another one for producing text that’s just as snappy. Maybe “serialize, don’t sanitize”?
I’m wishing for a CI/automation tool that would provide functionality like “check out a git branch” as functions in a high-level language, not as shell commands embedded in a data file, so that user input is never sent to a shell directly at all. Maybe I should make one…
Before all the hip yaml based CI systems, like github actions, pretty much everyone was using Jenkins.
The sorta modern way to use Jenkins these days is to write a Groovy script, which has stuff like checkout scm and various other commands. Most of these are from Java plugins, and so the command never ends up going anywhere near a shell, though you do see a lot of use of the shell command function in practice (i.e. sh "make").
Kinda a shame that Jenkins is so wildly unpopular, and these weird yaml-based systems are what’s in vogue. Jenkins isn’t as bad as people make it out to be, in my opinion.
Please do build something though because Jenkins isn’t exactly good either, and I doubt anyone would pick Groovy as a language for anything today.
I’ve used Jenkins quite a bit, that’s one of the inspiration source for that idea indeed. But Groovy is a rather cursed language, especially by modern standards… it’s certainly one of my least favorite parts of Jenkins.
My idea for a shell-less automation tool is closer to Ansible than to Jenkins but it’s just a vague idea so far. I need to summarize it and share it for a discussion sometime.
Groovy is okay. Not the best language, but way ahead of any other language I’ve ever seen in a popular CI solution. And Ansible should die.
Have you considered Dagger?
edit: I just had to read a little down and someone else points you the same way…
I haven’t heard about it before it was suggested in this thread, I’m going to give it a try.
I use Groovy at $DAILYJOB, and am currently learning Ruby (which has a lot more job listings than Elixir). The appeal of both languages is the same: it is incredibly easy to design DSLs with them (basically what Jenkins and Gradle use), which is precisely what I work with at $DAILYJOB. The fact it’s JVM-based is the icing on the cake, because it’s easy to deploy in the clients’ environments.
Dagger looks interesting for this sort of use case: https://dagger.io/
This looks really interesting, thanks for the pointer! Maybe it’s already good for things I want to do and I don’t need to make anything at all, or may contribute something to it.
That’d be lovely.
The generation of devs that grew up on Jenkins (including myself) got used to seeing CI as “just” a bunch of shell scripts. But it’s tedious as hell, and you end up programming shell via yaml, which makes me sympathetic to vulns like these.
Yeah in dealing with github’s yaml hell I’ve been wishing for something closer to a typed programming language with a proper library e.g. some sort of simplified-haskell DSL à la Elm, Nix, or Dhall.
They all do? They all provide ways to build specific branches defined in yaml files or even via UIs, rather than leaving that work to your shell scripts.
Personally I find all those yaml meta-languages inferior to just writing a shell script. And for one and a half decades I’ve been looking for an answer to the question:
What’s the value of a CI server other than running a command on commit?
But back to your point. Why? What you need to do is sanitize user input. That is completely independent of whether it’s a shell script or another language. Shell scripts are actually higher-level than general-purpose programming languages.
I’m certainly not saying that one doesn’t need to sanitize user input.
But I want the underlying system to provide a baseline level of safety. Like in Python: unless I’m calling eval(), it doesn’t matter that some input may contain the character sequence os.system(...; and if I’m not calling os.system() and friends, it doesn’t matter if a string has rm -rf in it. When absolutely any data may end up being executed as code at any time, the system has a problem, as far as I’m concerned.
Buildbot also belongs on the list of “systems old enough to predate YAML-everywhere”. It certainly has its weaknesses today, but its config is Python-based.
In GitHub Actions specifically, there’s also a very straightforward fix: instead of interpolating a string in the shell script itself, set any values you want to use as env vars and use those instead. e.g.:
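A minimal sketch of that pattern (step and variable names are illustrative):
- name: Safe interpolation via an environment variable
  env:
    BRANCH_NAME: ${{ github.head_ref }}  # expanded into an env var, not into the script text
  run: |
    # The shell only ever sees a variable reference, so a malicious
    # branch name stays data instead of becoming code.
    echo "Building branch: $BRANCH_NAME"
There’s also the built-in GITHUB_HEAD_REF default environment variable for pull request events, which avoids the expression entirely.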
I don’t think string replacement systems are bad per se. Sure, suboptimal in virtually all senses. But I think the biggest issue is a lack of good defaults and a requirement to explicitly indicate that you want the engine to do something unsafe. Consider the following in GH Actions:
echo "Bad: ${{ github.head_ref }}"
echo "Good: $GITHUB_HEAD_REF" # or so @kylewlacy says
I do not see any major difference immediately. Compare to Pug (nee Jade):
Using an unescaped string directly is clear to the reader and is not possible without an opt-in. At the same time, the opt-in is a matter of a single-char change, so one cannot decry the measure as too onerous. The mantra should be to make unescaped string usage explicit (and discouraged by default).
But to escape a string correctly, you need to know what kind of context you’re interpolating it into. E.g. if you’re generating a YAML file with string values that are lines of shell script, you might need both shell and YAML escaping in that context, layered correctly. Which is already starting to look less like string interpolation and more like serialization.
Over a decade ago (jesus, time flies!) I came up with an ordered list of approaches in descending order of safety. My main mantra was “structural safety” - instead of ad-hoc escaping, try to fix the problem in a way that completely erases injection-type security issues in a structural way.
I’m reminded of my similar post focused more on encoding (from the same year! hooooboy).
Good post! I’m happy to say that CHICKEN (finally!) does encoding correctly in version 6.
Serialize, don’t sanitize… I love it! I’m gonna start saying this.
AFAIU, the echoing is not the problem, and sanitizing wouldn’t help.
The problem is that before the script is even executed, parts of its code (the ${{ ... }} stuff) are string-replaced.
Yeah. The problem is that the echo command interprets things like ${{...}} and executes it. Or is it the shell that does it in any string? I’m not even sure, and that is the problem. No high-level language does that. Javascript uses eval, which is already bad enough, but at least you can’t use it inside a string. You can probably do hello ${eval(...)} but then it is clear that you are evaluating the code inside.
The ${{...}} are replaced by the GitHub CI system before the echo is even run: https://docs.github.com/en/actions/security-for-github-actions/security-guides/security-hardening-for-github-actions#understanding-the-risk-of-script-injections
It’s the shell that evaluates $... syntax. $(cmd) executes cmd, ${VAR} reads the shell variable VAR, and in both cases the shell replaces the $... with the result before calling the echo program. Echo is just a dumb program that spits out the arguments it’s given.
Edit: but the ${{ syntax is GitHub Actions’ own syntax; the shell doesn’t see that, as GH Actions evaluates it before running the shell command.
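A quick shell illustration of that ordering (the variable name is just for the example):
BRANCH='$(date)'        # single quotes: the text $(date) is stored literally, not executed
echo "ref: ${BRANCH}"   # variable expansion only; prints: ref: $(date)
echo "now: $(date)"     # command substitution: the shell runs date first, echo just prints its output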
Oh thanks for explaining!
The part I don’t get is how this escalated to the publish step; that workflow only seems to run off of the main branch or a workflow dispatch.
cf https://lobste.rs/s/btagmw/maliciously_crafted_github_branch_name#c_4z3405 it seems they were using the pull_request_target event, which grants PR CI access to repo secrets, so they could not only inject the miner, but publish a release?
Does anyone have a copy of the script so we can see what it did?
Funny that they managed to mine only about $30 :)
Honestly, shining a spotlight on this attack with a mostly harmless crypto miner is a great outcome.
Less obvious key-stealing malware probably would have caused far more pain.
I knew crypto would have great use cases eventually
The pull_request_target event that was used here is privilege escalation similar to sudo – it gives you access to secrets etc.
Like all privilege escalation code, this should be very carefully written, fuzzed, and audited. Certainly a shell script is exactly wrong – sh was never designed to handle untrusted input in sensitive scenarios. Really it’s on GitHub Actions for making shell script-based privilege escalation code the easy path.
At the very least you want to use a language like Rust, leveraging the type system to carefully encapsulate untrusted code, along with property-based testing/fuzzing for untrusted inputs. This is an inherently serious, complex problem, and folks writing code to solve it should have to grapple with the complexity.
Wow. I was looking for that kind of explanation and hadn’t found it yet. Thank you for finding and sharing it.
No, the lesson is to never use bash, except for using it to start something that is not bash.
Oh, this is probably a better top level link for this post!
This is the offending PR
https://github.com/ultralytics/ultralytics/pull/18018
A bot made the PR? How does that work?
I don’t know if it was a bot or not (but that is probably irrelevant). The problem in the PR lies in the branch name which executed arbitrary code during GitHub Actions. Sorry if I misunderstood your question.
Hm, the dots don’t connect for me yet. I can just make a PR with changes to the build process, and CI would test it, but that should be fine, because PRs run without access to secrets, right?
It’s only when the PR is merged and CI is run on the main branch that secrets are available, right?
So would it be correct to say that the PR was merged into main, and, when running CI on the main branch, something echoed the branch name of recently-merged PR?
Ah, I am confused!
See https://stackoverflow.com/questions/74957218/what-is-the-difference-between-pull-request-and-pull-request-target-event-in-git
There’s a way to opt in to triggering a workflow with main-branch secrets when a PR is submitted, and that’s exactly what happened here.
I don’t get why this option exists!
Why would you ever want to expose your secrets to a pull request on an open source project? Once you do that, they’re not actually secrets, they’re just … weakly-obscured configuration settings. This is far from the first time this github “feature” has been used to attack a project. Why do people keep turning it on? Why hasn’t github removed it?
If I understand it correctly, I can maybe see it used in a non-public context, like for a company’s internal CI.
But for open source and public repos it makes no sense. Even if it’s not an attack like in this case, a simple “echo …” makes the secrets no longer secret.
Prioritizing features over security makes total sense in this context! This was eloquently formulated by @indygreg about Actions: “You clearly want them to be as powerful as possible!”
Note that the version of the workflow that’s used is the one in the target branch, not the one in the proposed branch.
There are legitimate use cases for this kind of privilege escalation, but GHA’s semiotics for it are all wrong. It should feel like a Serious, Weighty piece of code that should be carefully validated and audited. Shell scripts should be banned, not the default.
Thanks for the explanation, I was literally about to post a question to see if I understood it correctly. I am absolutely paranoid about the Actions running on my GitHub repos; it seems to me that a closed PR should not be involved in any workflow. While the branch name was malicious, is there also a best practice to pull out here for maintainers?
Don’t ever use the pull_request_target trigger, and if you do, definitely don’t give that CI job creds to publish your stuff. The root cause here is not shell injection. The root cause is that untrusted input gets into a CI run with creds at all. Of course, GitHub Actions doesn’t do that by default; you have to explicitly opt into this with pull_request_target. See the linked SO answer in a sibling comment, it explains the issue quite nicely.

Ah, the comment by Foxboron clarifies that what happened here is not the job directly publishing malicious code, but rather poisoning the build cache to make the main branch CI pull bad data in! Clever! So, just don’t give any permissions to pull_request_target jobs!
My public repos don’t run CI jobs for PRs automatically, it has to be manually approved. I think this is the default. Not sure what happened in this case though.
By default it has to be approved for first-time contributors. Not too hard to get an easy PR merged in and get access to auto-running them.
It is totally fine to run CI on PRs. CI for PRs does not get to use repository secrets, unless you go out of your way to also include secrets.
If you think your security depends on PRs not triggering CI then it is likely that either:
GitHub’s “don’t run CI for first-time contributors” has nothing to do with security and has everything to do with using maintainers’ human judgement to protect GitHub’s free runner compute from being used for mining crypto.
That is, this is a feature to protect GitHub/Microsoft, not your project.
Should be easily solvable by billing those minutes to the PR creator.
I guess there is also the situation where you provide your own runner rather than buying it from Github. In that case it seems like a reasonable precaution to restrict unknown people from using it.
Yes! I sympathize with GitHub for having to implement something on short notice when this happened the first time, but I am dismayed that they never got around to implementing a proper solution here: https://matklad.github.io/2022/10/24/actions-permissions.html
Yes, the security with self-hosted runners is different. If you use non-sandboxed self-hosted runners, they should never be used for PRs.
Thank you, that’s a great summary, and a very interesting attack vector.
It’s strange (to me) that a release would be created off of an arbitrary user created branch, but I’m sure there’s a reason for it. In years and years of working with build automation I’ve never thought about that kind of code injection attack, so it’s something I’ll start keeping in mind when doing that kind of work.
Last year was my first time participating - I finished all but two problems without looking up others’ solutions and had a good time.
I’m inclined to take a swing at doing this year with Gleam, as I haven’t really had much practice with functional languages.
TIL that BSD licenses apparently let you distribute binaries without publishing source code. From https://github.com/fireducks-dev/fireducks/issues/22:
I.e. Wheel files containing compiled blobs are available on PyPI, but no source for these blobs are available.
This is true of many other non-copyleft licenses like MIT.
Correct. BSD clauses state only that source and binary distributions must be licensed to users under the same permissive license. It does not require the distribution of source distributions with binaries. This is why Clang has been increasingly popular in green-field platform SDKs (Apple ecosystem, games consoles, embedded, etc): unlike GCC, changes to Clang/LLVM don’t have to be distributed in source form.
Does this mean that if, for example, a game console SDK had source distribution and was based off some BSD software, then whoever received that source would be allowed to distribute it?
Like “you’re not required to distribute the source, but if you do then the user can distribute it”?
Not a lawyer but AFAIK no. The rights to redistribute code in e.g. GPL software are due specifically to clauses which mandate that the source code be made available, and made available under the terms of the GPL, which is what provides the “sticky” rights to redistribute the code. There is nothing in the BSD licenses that prohibits anyone from distributing the work with additional restrictions or terms attached.
IANAL but yes, that’s why there are copyleft licenses and why they are hated by certain companies.
Despite the verb tense in the title, it appears that all of the measures suggested in the post occur after the code is written – and many are automated. Scanning for known CVEs helps us fill out our compliance checklists, but I’d love to see more folks writing about how they incorporate security thinking into their design – e.g. things like threat models, attack surfaces, trust boundaries – which can even be tailored to what a certain language makes easy or hard.
I don’t often agree with the complaints that the Python language & stdlib are growing too much / too rapidly, but at first glance I’m not seeing much justification for templatelib being a part of the standard library beyond the convenience of the t prefix. Furthermore, the authors of templatelib could likely get some good user feedback from it being a distinct package first. IIRC, stdlib dataclasses benefited heavily from design lessons of the attrs library.

There’s no way to hook into the functionality and evaluation of f-strings, so there isn’t space to experiment with this design besides a clunky “toss everything into this class” way which doesn’t feel nice and already exists via the plethora of DB drivers and other templating libraries. If people were content using those facilities, I doubt this feature would be proposed.
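For what it’s worth, here’s a minimal Python sketch of why there’s nothing to hook into with f-strings: by the time anyone else sees the value, the interpolation has already been flattened into a plain str, so a DB driver (or HTML escaper, etc.) can’t tell which parts were literal and which were user input. The tiny Template class and render_safely helper below are purely hypothetical stand-ins to show the shape of what a template object enables – they are not the actual stdlib templatelib API.

```python
user_input = "Robert'); DROP TABLE students;--"

# With an f-string the interpolation happens immediately: the result is a
# plain str, and the boundary between query text and user data is gone.
query = f"SELECT * FROM students WHERE name = '{user_input}'"
print(type(query), query)

# A template-style API keeps the static segments and the interpolated values
# separate, so a consumer decides how to combine them. Hypothetical stand-in:
class Template:
    def __init__(self, strings, values):
        self.strings = strings   # static text segments
        self.values = values     # interpolated values, still unflattened

def render_safely(tpl):
    # A consumer (e.g. a DB driver) can escape or parameterize each value
    # before joining -- impossible once an f-string has already run.
    escaped = [str(v).replace("'", "''") for v in tpl.values]
    parts = []
    for s, v in zip(tpl.strings, escaped + [""]):
        parts.append(s + v)
    return "".join(parts)

tpl = Template(["SELECT * FROM students WHERE name = '", "'"], [user_input])
print(render_safely(tpl))
```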
This was sounding familiar, which then was explained about a quarter of the way in:
I highly recommend this book, it is the most practically helpful thing I’ve ever read on software.
I do like this paper. I’ve seen a few differing interpretations of its results.
Previously on lobste.rs was a post from Gary Marcus (Psych/Neuro professor from NYU), who takes this to suggest that LLMs fundamentally don’t and can’t “reason”
Jack Clark (Anthropic co-founder) conversely focuses on the trend that bigger models show less degradation from the distracting variables:
In any case, while it sometimes feels like we’re drowning in new benchmarks, I think it’s useful to be concrete in regards to what we think we’re talking about when we use words like “reasoning” and “problem solving” as applied to LLMs.
While I don’t actively use any Lisps these days, I deeply appreciate the glimpses of “crystalline purity” I’ve seen. One particularly memorable occurrence early in my programming education was the construction of Church Numerals in SICP.
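For anyone who hasn’t run into them: Church numerals encode a number n as “apply a function n times”. SICP builds them in Scheme; here’s the same construction sketched in Python, just as an illustration of the idea.

```python
# Church numerals: the number n is represented by a function that applies
# f to x exactly n times.
zero = lambda f: lambda x: x

def succ(n):
    return lambda f: lambda x: f(n(f)(x))

def add(m, n):
    return lambda f: lambda x: m(f)(n(f)(x))

def to_int(n):
    # Interpret the numeral by counting applications of "+1" starting at 0.
    return n(lambda k: k + 1)(0)

one = succ(zero)
two = succ(one)
print(to_int(add(two, two)))  # 4
```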
I still don’t understand what “reasoning” is, and I get frustrated at how few of these “LLMs can’t reason” / “LLMs can reason” arguments even attempt to provide a useful definition.
I suspect that many of them are actually talking about different things.
I know you’re asking for more rigor but I think it’s fair to douse the hype by highlighting examples of them being clearly unable to execute rules of logic that humans have no trouble with.
I’m 100% in favor of dousing the hype, just have the decency to explain what you mean by “reasoning” while you’re doing that.
We could look at things this way: if there was a problem which Prolog would excel at, that’s also typically the kind of problem that LLMs do badly at.
Which, I know, people are going to pull out “well humans aren’t always great at those Prolog problems either!” Yeah, that’s true, though I wasn’t comparing to humans, I’m comparing to Prolog for the moment. But now let’s return to humans. Clearly the LLM is doing worse than even young children based on the examples shown in the paper, and the jersey number is a great example.

There are clearly a lot of things LLMs are great at, in terms of generative production. You yourself are a big advocate of that, and of knowing the constraints on that. So let’s acknowledge that Prolog-style careful reasoning isn’t something LLMs are good at, but they’re good at other things that Prolog-style systems aren’t… it typically takes a human to “set up” all the Prolog-like rules carefully together to make things work. The NeuroSymbolic approach is “what if we combine the strengths of both”?

Surely you can agree that LLMs aren’t suited to the Prolog-style domain and that the Prolog-style domain takes a lot of manual work, so combining them could help unlock something.
I have a hunch that hooking a frontier LLM up to a Prolog interpreter (using the same mechanism as how ChatGPT and Gemini can both write and then execute Python scripts) could work really well already. I bet they are really good at writing Prolog - I don’t know it at all though, so I’m not well equipped to evaluate that.
Personally, I have to doubt that. Maybe they can be trained to write proper Prolog or other declarative logic systems, but as of now that would be too unreliable. LLMs are already known for not writing working code; asking them to write working code and also reason about the logic that goes into it would be extremely error-prone, unreasonably time-consuming and processor-intensive compared to alternatives, and not worth the effort.
A human absolutely can and should write the underlying symbolic logic rules of a system. If you need the LLM to do that, it means you don’t actually know what your system requirements are, and thus don’t even know what it is you’re trying to make in the first place. Your symbolic system can absolutely call into other machine learning models to perform more fuzzy logic, but you really should have an idea of what the core purpose of your system is and, at least generally, an understanding of how it works.

Actually, this reminds me of a pipe-dream Kickstarter project I saw once, where the (very young) man behind the project wanted to make the “ultimate zombie survival game” and said he would use artificial neural networks for the zombies - even though, unbeknownst to him, not only do sophisticated algorithms like A* already exist, but you don’t even need ANNs in an artificial world, since the AI for the game would already have all knowledge of the game upfront, and thus there are no patterns to decode from entropy. That, to me, is the classic example of simply not knowing what it is you’re even trying to make, and assuming throwing machine learning at it will get the job done, when in reality it would be an infinitely worse solution, if not outright non-functional.
From working with LLMs, the very first bit of wisdom I’ve developed is: Don’t trust the LLM to do anything, give it an extremely specific request that’s almost solvable mechanically, but ultimately too gummed up in human language to be reliably digested symbolically. The LLM is not there to make decisions, it’s there to make some text look like some other text.
If you want reliability you don’t want LLMs - their inherent lack of reliability (due to things like being non-deterministic) is their single biggest weakness.
They are increasingly useful for writing code that actually does work. The ChatGPT Code Interpreter pattern - where the system writes code, then executes it, sees the results (or error messages) and then iterates on it until it gets a result that looks right - is incredibly powerful… IF you prompt it right, and keep a careful eye on what it ends up doing.
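A rough sketch of that loop, with ask_llm as a hypothetical stand-in for whatever model API you use: generate code, run it in a subprocess, and feed any error output back into the next prompt until it runs clean or you give up. This is only the shape of the pattern, not anyone’s production implementation.

```python
import subprocess
import sys
import tempfile

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to a code-generating model."""
    raise NotImplementedError

def write_run_iterate(task: str, max_attempts: int = 3) -> str:
    prompt = f"Write a Python script that does the following:\n{task}"
    for _ in range(max_attempts):
        code = ask_llm(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        # Run the generated script and capture what it prints or raises.
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=60)
        if result.returncode == 0:
            return result.stdout          # looks right; a human still reviews it
        # Feed the failure back so the next attempt can correct itself.
        prompt = (f"This script failed:\n{code}\n"
                  f"Error output:\n{result.stderr}\nPlease fix it.")
    raise RuntimeError("no working script after several attempts")
```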
You think people don’t have difficulty with formal reasoning? You need to get out more…
Maybe you could try naming legal chess moves while blindfolded, let me know how that goes.
I don’t want more rigour. I’m not the OP, but I feel the same frustration. I want people to check their assumptions on both sides.
Just for example, the paper highlights that if in a grade school level word problem you vary only the names of people and the particular number values involved, models’ performance gets worse on average and less consistent across variations, which you would not expect if you did the same thing with grade schoolers.
It’s not that people have no trouble, but people are not expected to have this trouble. The findings amount to doubt that the models do what the corresponding people do.
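Concretely, the perturbation being described is about as mechanical as this hedged sketch (not the authors’ actual generator): keep the problem structure fixed, resample only the surface details, and then check whether accuracy holds up across variants.

```python
import random

NAMES = ["Sophie", "Liam", "Mia", "Noah"]

def make_variant(rng: random.Random) -> tuple[str, int]:
    """One grade-school word problem with the name and numbers resampled."""
    name = rng.choice(NAMES)
    apples = rng.randint(5, 20)
    eaten = rng.randint(1, apples - 1)
    question = (f"{name} has {apples} apples and eats {eaten} of them. "
                f"How many apples does {name} have left?")
    answer = apples - eaten
    return question, answer

rng = random.Random(0)
for _ in range(3):
    q, a = make_variant(rng)
    print(q, "->", a)
```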
Just because you might have trouble with identifying legal chess moves, don’t assume everyone you debate online also does. That’s an unkind assumption and I have flagged your comment as such.
Unfortunately, that’s modern web content for you :-).
The article actually makes a weaker claim:
and if you follow the bibliography, they are specifically describing LLMs’ reasoning as probabilistic, rather than formal, based on their propensity to a number of issues (e.g. token bias, distractability) which indicate that LLM outputs are correlated with incidental, rather than formal relations between tokens.
This substack post leads with a link to a paper that is more modest in its claims, and I think grapples with your frustrations (in a fashion): https://arxiv.org/pdf/2410.05229
That is, I think that good benchmarks are hard, but in a sense make us concretely try to come to terms with what we’re looking for from models’ behavior. Goodhart’s law will always be A Thing, but it’s still useful to try.
I think the reality is that it’s trivial to define “reason” in a way that includes LLMs and extremely difficult to do so that both (a) does not exclude what we already consider reasoning (inference, bayesian, modal, deductive, etc) (b) also excludes LLMs.
But the paper doesn’t use the term nakedly; it prefaces it with qualifiers such as “mathematical”, i.e. “mathematical reasoning” or “formal reasoning”. These are forms of reasoning that are strictly rules-based. What they show is that LLMs probably don’t model these forms of reasoning.