One thing that makes me wonder about the validity here: the data comes from coding competition submissions in C++ over a few consecutive years. Is this distinguisher actually getting that accuracy from preserved information about authors’ styles? Or is it getting it from preserved traces of significant blobs of code that the same author has copy-pasted into multiple submissions, like #include "alice_utils.h" and #include "bobs_string_processing_functions.h" and so on? I suspect C++ may be very susceptible to this, because the stdlib is so inconvenient to use that, AFAIK, a lot of people resort to dragging little blobs of convenient helper functions around from project to project.
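One rough way to check this on the source side would be to measure how much of each submission is verbatim-shared with the same author's other submissions. The sketch below is only an illustration under my own assumptions: the function name reused_line_fraction, the whitespace normalization, and the min_len cutoff are all hypothetical choices, not anything from the study. If the fractions come out high, the copy-paste explanation looks more plausible; it says nothing by itself about what survives compilation.

```python
from collections import Counter


def normalize(line: str) -> str:
    # Collapse runs of whitespace so indentation differences do not matter.
    return " ".join(line.split())


def reused_line_fraction(submissions: list[str], min_len: int = 10) -> list[float]:
    """For each submission by ONE author, return the fraction of its distinct
    non-trivial lines that also appear in at least one other submission.

    min_len is a hypothetical cutoff that skips short lines such as "}" or
    "return 0;", which nearly everyone shares.
    """
    line_sets = []
    for src in submissions:
        lines = {normalize(l) for l in src.splitlines()}
        line_sets.append({l for l in lines if len(l) >= min_len})

    # Number of submissions in which each distinct line occurs.
    counts = Counter(l for s in line_sets for l in s)

    fractions = []
    for s in line_sets:
        if not s:
            fractions.append(0.0)
            continue
        shared = sum(1 for l in s if counts[l] > 1)
        fractions.append(shared / len(s))
    return fractions
```

Running the classifier again after stripping the shared lines from the sources would be the more direct test, but that needs the original pipeline.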
Interesting. But encountering an entire programme written by just one person is a rather rare occurrence nowadays, I think. Does it work on programmes authored by multiple people, giving a list of everyone who contributed to a given programme? That might be quite interesting for copyright issues.