Optimistic reading: the LLM has acquired an OK-ish model of a «well-behaved AI» from its training data, and single-task fine-tuning (e.g. implanting backdoors) merely decides whether it should follow that model or its reverse; fine-tuning in the positive direction should then gain efficiency from the same effect.
Pessimistic reading: because the actual «safety» fine-tuning is not well aligned with the abstract notion of a «well-behaved AI», it unavoidably implants unexpected «butterfly effects» into how queries are handled.