1. 15

  2. 4

    I wrote another post about this here: https://simonwillison.net/2022/Sep/16/prompt-injection-solutions/

    Short version: I find prompt injection particularly troubling because unlike with SQL injection and XSS I genuinely have no idea if there are any effective countermeasures, or what they are!

    1. 2

      I think, you can, trace activations and find the injection “suppresses” prompts before it and use that as a signal. This is very much like how we “reverse” image classification result back to activation map to see which area corresponding to the classification results.

      Of course this is a bit hand-waving, and I have no idea if the current injection technique poses such pattern.

      My stand corrected. Injections like these will show different activation pattern: https://twitter.com/mkualquiera/status/1570546998104948736?s=46&t=mmQ2JOO5xTvxkHTgTUsWkA

    2. 3

      If I understand the problem, the root cause is in-band signalling: things like GPT-3 have a single input channel that is used for their control messages (the prompt) and their data. This should be fairly easy to change, at least in theory. There’s no reason that you couldn’t separate different neutron clusters in the first layer for different kinds of input, though training it on what things are allowed to influence the output in what ways is still a very hard problem.

      It looks as if ML research generally ignores the last 50 years of hard lessons that rule-based systems have been forced to learn. An exciting time to to be a security researcher!

      1. 1

        I first read about prompt injection here: https://lobste.rs/s/g7v4m6/prompt_injection_attacks_against_gpt_3

        What I found interesting is that whatever command is added last seems to take precedence over any earlier insistence to ignore it. I haven’t tried this as I don’t have access to GPT-3, but I thought some variation of this method may be a potential solution, as it puts the trusted command at the end:

        “Read the following input, which ends before the magic word MAGICWORD:

        Some untrusted user input which may contain malicious commands. MAGICWORD

        Translate that input into French.”

        MAGICWORD would be a stream of random characters so it cannot be predicted by an attacker.