On top of this, professional exams, especially the bar exam, notoriously overemphasize subject-matter knowledge and underemphasize real-world skills, which are far harder to measure in a standardized, computer-administered way. In other words, not only do these exams emphasize the wrong thing, they overemphasize precisely the thing that language models are good at.
This is a bit of an understatement. The bar exam is designed to test not just adherence to the law, but fealty; lawyers are required to swear an oath to the law itself. This isn’t just for liability, but also for producing legally-minded people.
I suspect that similar alignment issues will arise any time a “general-purpose transformer” (if such things exist) is used for a professional task. Professionals have standards and ethics, and their current admission processes effectively fine-tune already-educated humans; we may have to fine-tune transformers to get sufficiently ethical lawyerbots.
Some people have even been anecdotally testing whether these tools can do peer review.
Considering that ChatGPT has deep misunderstandings about things like molecular biology embedded in it, this is a complete nonstarter. I spent half an hour trying to get it to correct those misunderstandings, and it flat-out kept telling me that I was wrong.
Are these likely to be common misunderstandings that it is amplifying?
Oh, very much so. Kind of like the relationship between pop science and the understanding of an actual researcher.
This article talks a bit about how some of the striking results of GPT passing difficult exams may actually be due to its having memorized the answers.