A related paper that hints at some of the complexity of this topic: “On the Complexity of Sequence to Graph Alignment”. One interesting result is that sequence-to-graph alignment is NP-complete when edits to the graph are allowed.
This is a great write-up on the situation!
And… 90% of bioinformatics is converting from one file format to another, and because most formats are text-based, people are tempted to write a parser for the 1000th time (or cook up a shell one-liner).
I misread that as “(or cock up a shell one-liner)”, which I suspect is distressingly often not wrong.
My work at UH Cancer Center stretches across genomics, file formats, and programming languages.
We will be releasing some new file format proposals to replace FASTA and others sometime this year or in early 2020. The formats will be plain text, strongly typed, and utilize Tree Notation (http://treenotation.org/).
If you work in the field and would like to collaborate with us (The Tree Notation Lab here), feel free to email me!
I’ve heard for ages that bioinformatics was hard, but I never found anyone to explain it to me. This is a really good write-up. For the first time I feel as if I understand the general nature of the problems the field deals with.
Some articles that may be of interest to the crowd reading this:
Why do we use ASCII files in genomics:
A quick intro to short read sequencing and some file formats:
A paper on genome graphs that I was involved in:
In any data-analysis-heavy field, 90% of your work goes into preparing, massaging, fiddling, cleaning up, converting, unconverting, parsing, and verifying data. It doesn’t help that most of the time the people doing this are subject matter experts, not software people, and so are fully capable of getting all this done in R, Python or whatever, but “we should set up a database and make some unit tests for our data import process” just doesn’t occur to them.
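For what it's worth, a test for a data import step does not have to be elaborate. Here is a minimal sketch in Python, assuming the input is FASTA-like text; `read_fasta` and `validate_records` are hypothetical helper names made up for illustration, not part of any library, and the allowed alphabet is an assumption you would adjust for your data.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-like text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data before the first header")
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def validate_records(records, alphabet="ACGTN"):
    """Return the records as a list, raising ValueError on bad input.

    The default alphabet is an assumption (plain DNA plus N).
    """
    allowed = set(alphabet)
    seen = set()
    checked = []
    for header, seq in records:
        if header in seen:
            raise ValueError(f"duplicate header: {header}")
        if not seq:
            raise ValueError(f"empty sequence: {header}")
        bad = set(seq.upper()) - allowed
        if bad:
            raise ValueError(f"unexpected symbols in {header}: {sorted(bad)}")
        seen.add(header)
        checked.append((header, seq))
    return checked


def test_import():
    """A tiny unit test: one good input, one input that must be rejected."""
    good = io.StringIO(">a\nAC\nGT\n>b\nNN\n")
    assert validate_records(read_fasta(good)) == [("a", "ACGT"), ("b", "NN")]

    bad = io.StringIO(">a\nACXT\n")
    try:
        validate_records(read_fasta(bad))
    except ValueError:
        pass
    else:
        raise AssertionError("invalid symbol was not rejected")


test_import()
```

The point is less the parser than the habit: a couple of known-good and known-bad inputs checked on every run catch most of the silent breakage in an import pipeline.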
The only universal file format is CSV.