Abstract: Program authorship attribution has implications for the privacy of programmers who wish to contribute code anonymously. While previous work has shown that complete files that are individually authored can be attributed, we show here for the first time that accounts belonging to open source contributors containing short, incomplete, and typically uncompilable fragments can also be effectively attributed.
We propose a technique for authorship attribution of contributor accounts containing small source code samples, such as those that can be obtained from version control systems or other direct comparison of sequential versions. We show that while application of previous methods to individual small source code samples yields an accuracy of about 73% for 106 programmers as a baseline, by ensembling and averaging the classification probabilities of a sufficiently large set of samples belonging to the same author, we achieve 99% accuracy for assigning the set of samples to the correct author. Through these results, we demonstrate that attribution is an important threat to privacy for programmers even in real-world collaborative environments such as GitHub. Additionally, we propose the use of calibration curves to identify samples by unknown and previously unencountered authors in the open world setting. We show that we can also use these calibration curves in the case that we do not have linking information and thus are forced to classify individual samples directly. This is because the calibration curves allow us to identify which samples are more likely to have been correctly attributed. Using such a curve can help an analyst choose a cut-off point which will prevent most misclassifications, at the cost of causing the rejection of some of the more dubious correct attributions.
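To make the averaging step in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the toy probabilities, and the fixed `threshold` are all hypothetical, and the threshold only stands in for a cut-off that the paper would pick from a calibration curve. It assumes some per-sample classifier has already produced one probability vector per code fragment.

```python
# Hypothetical sketch: combine per-sample classifier outputs for one account.
# Each inner list is a probability vector over the candidate authors,
# as produced by some per-sample classifier (not shown here).

def average_probabilities(sample_probs):
    """Average per-sample class-probability vectors into one vector."""
    n_classes = len(sample_probs[0])
    totals = [0.0] * n_classes
    for probs in sample_probs:
        for i, p in enumerate(probs):
            totals[i] += p
    return [t / len(sample_probs) for t in totals]

def attribute_account(sample_probs, authors, threshold=0.0):
    """Return (author, confidence); author is None if confidence < threshold.

    The threshold is an assumed stand-in for a cut-off read off a
    calibration curve; rejecting low-confidence results trades some
    correct attributions for fewer misclassifications.
    """
    avg = average_probabilities(sample_probs)
    best = max(range(len(avg)), key=avg.__getitem__)
    confidence = avg[best]
    if confidence < threshold:
        return None, confidence
    return authors[best], confidence

# Toy data (made up): four fragments from one account, three candidates.
authors = ["alice", "bob", "carol"]
samples = [
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],  # taken alone, this fragment points at "bob"
    [0.6, 0.2, 0.2],
    [0.5, 0.4, 0.1],
]

print(attribute_account(samples, authors))
print(attribute_account(samples, authors, threshold=0.5))
```

In the toy data one fragment on its own would be attributed to the wrong candidate, but the averaged vector still favors the candidate that most fragments point to. Raising the threshold makes the function refuse to name anyone, which is the trade-off the last sentence of the abstract describes.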
Apparently the programmer’s fingerprints are somewhat preserved in binary form.
However, I’m not sure about its usefulness in court. For example, after Harvey claimed to have removed most of my commits with git rebase (to prevent the termination of the GPLv2 after a Google employee removed my copyright statements), I found several of my contributions stashed with other commits, and they said they had redone the changes without looking at the code and in exactly the same way.
Please don’t copy/paste content from the link for the title text. People can click through to see the abstract.