The root problem is in-band signalling, something that has been know to be a terrible idea since the days of phone phreaking. The model doesn’t have a data stream and a command stream, it just has a stream of inputs and a stream of outputs. I don’t know how you can address that without a complete redesign of the approach.
Yeah this is the original sin at the heart of prompt injection.
We’ve been talking about this vulnerability for 13 months now and no-one seems to have come close to figuring out how to have a separate instruction stream for LLMs that’s robust against attacks, which is beginning to feel like a fundamental problem with the architecture itself.
Well, yeah. A large model is a simulator, not a processor. It is simulating taking instructions, not actually taking instructions. Additionally, because quotation is part of grammar, we’ll need to enforce grammatical structure rather than merely simulating it.
There may be more complications for transformers specifically. Transformer architecture isn’t really able to ignore the content of a sequence of tokens, so it’s not able to implement generic transformations which are naturally agnostic as to the exact values of those tokens. This means that we don’t have the equivalent of memcpy, memcmp, etc.
no-one seems to have come close to figuring out how to have a separate instruction stream for LLMs
I that’s an inevitable aspect of the approach used to create them, since the training data itself doesn’t come with a separate data and command streams.
Yes. The data and commands being part of the same stream is fundamental to how these things work. It’s a feature not a bug, and it is not fixable. There will never be a way to escape inputs to an LLM like there is for SQL.
In the context of LLMs and language in general, the idea that data and code are separate things is purely a human construction.
For a general purpose LLM, I wonder if it’s best to tackle this at the application level. As someone that uses LLMs at the application level, I want an api that completes a chat prompt I gave it purely. It seems it’s impossible to remedy at the model level, or will result in stumbling on corner-cases for legitimate uses if there are all sorts of rules or training to mitigate so called prompt injection attacks.
The root problem is in-band signalling, something that has been know to be a terrible idea since the days of phone phreaking. The model doesn’t have a data stream and a command stream, it just has a stream of inputs and a stream of outputs. I don’t know how you can address that without a complete redesign of the approach.
Yeah this is the original sin at the heart of prompt injection.
We’ve been talking about this vulnerability for 13 months now and no-one seems to have come close to figuring out how to have a separate instruction stream for LLMs that’s robust against attacks, which is beginning to feel like a fundamental problem with the architecture itself.
Well, yeah. A large model is a simulator, not a processor. It is simulating taking instructions, not actually taking instructions. Additionally, because quotation is part of grammar, we’ll need to enforce grammatical structure rather than merely simulating it.
There may be more complications for transformers specifically. Transformer architecture isn’t really able to ignore the content of a sequence of tokens, so it’s not able to implement generic transformations which are naturally agnostic as to the exact values of those tokens. This means that we don’t have the equivalent of
memcpy,memcmp, etc.I that’s an inevitable aspect of the approach used to create them, since the training data itself doesn’t come with a separate data and command streams.
Yes. The data and commands being part of the same stream is fundamental to how these things work. It’s a feature not a bug, and it is not fixable. There will never be a way to escape inputs to an LLM like there is for SQL.
In the context of LLMs and language in general, the idea that data and code are separate things is purely a human construction.
For a general purpose LLM, I wonder if it’s best to tackle this at the application level. As someone that uses LLMs at the application level, I want an api that completes a chat prompt I gave it purely. It seems it’s impossible to remedy at the model level, or will result in stumbling on corner-cases for legitimate uses if there are all sorts of rules or training to mitigate so called prompt injection attacks.
Oh hey, thanks for mentioning me!
Chip Huyen wrote a long post on multimodal LLMs that I found very helpful to understand what’s going on.
This is just so stupid. Computer cannot help but obey. I love it!