      Optimistic reading: the LLM has acquired a reasonably good model of a «well-behaved AI» from its training data, and single-task fine-tuning (implanting backdoors) effectively asks whether it should play that persona or its inverse. If so, fine-tuning in the positive direction should gain efficiency from the same effect.

      Pessimistic reading: because the actual «safety» fine-tuning is only loosely aligned with the abstract notion of a «well-behaved AI», it unavoidably implants unexpected «butterfly effects» into how queries are treated.