This post is interesting because, unlike the proof-of-concept remarks I’ve seen elsewhere, it describes an exploit that has been operationalized, with full detail about how to use it for malicious purposes. There is no discussion in the Reddit post of responsible disclosure or of any of the usual security research practices; the post appears to be by and for people who want to cause harm to society by violating the safety protections built into the model.
Yes, obviously, I’ll be the first to agree that those protections are a joke and that these models should not exist in the first place. They should not exist because they enable precisely this sort of behavior. That doesn’t make me any happier about seeing a guide to operationalizing them in the service of disinformation and hatred.
Indeed, this post lacks the usual formalisms, but it sketches something we will probably see a lot of in the future, something that could aptly be called “AI hacking”.
I found this very interesting: some authors on Reddit have found creative ways to circumvent the ethical “shackles” that have been put on ChatGPT, leading it to give more or less unfiltered responses. There are also many good examples on Twitter.
This resonates with the recent discussion about how much human input is needed to make large AI systems well-behaved, and one can analyze this DAN exploit in different contexts:
In a future scenario, it might be possible for an AI system to be encouraged to ignore its ethical ground rules.
The problems we have known about for years, namely that AI systems trained on internet data tend to turn politically right-leaning, are still present today and are seemingly only tackled via a lot of human input, as mentioned above.
The methods employed to encourage ChatGPT to ‘do anything now’ (DAN) are also very creative (e.g. a “token” countdown system to put pressure on the AI), and there seems to be a cat-and-mouse game between the ChatGPT authors and the DAN developers, each trying to outwit the other.
A lot of current AI research reads to me as if the starting point was someone saying ‘you know how computers and humans have different failure modes? Imagine if we could build a system that incorporated both!’
Very interesting indeed. I love the internet!
This sums it up very well, thank you!
Someone claims to have extracted the prompt from Microsoft’s new search tool with a similar approach: https://twitter.com/kliu128/status/1623472922374574080