For a huge number of cases (dense, two dimensional, tabular data) CSV is just fine, thank you. Metadata comes in a side car file if needed. This file can be read by your granddad, and will be readable by your grandkid.
Lots of programs can process gzipped CSV files directly, taking care of the OMG this dataset is 10 GB problem. You can open up a CSV in an editor and edit it. You can load it into any spreadsheet program and graph it. You can load it into any programming REPL and process it.
CSV is problematic for unstructured data, often known in the vernacular as raw, uncleaned data. Usually this data comes in messy, often proprietary, often binary formats.
I was also disappointed to see that the article had no actual point. No alternative is proposed and no insight is given.
They do suggest alternatives: Avro, Parquet, Arrow, and similar formats. Yes, that throws away editing with a simple text editor. But I believe the author was more concerned with people who import and export with Excel. Those people will always load the CSV into Excel to edit anyway, so plain text isn’t a key feature.
You can load it into any spreadsheet program and graph it.
If the author’s dream comes true and spreadsheets support a richer binary format, that won’t change.
You can load it into any programming REPL and process it.
Yes. And the first step: import csv or the equivalent in your language of choice. How is that any different from import other_format?
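To make that concrete, a minimal sketch in Python (the file names are made up; the Parquet case assumes the optional pyarrow package is installed):

    # Reading a CSV with nothing but the standard library.
    import csv

    with open("data.csv", newline="") as f:
        rows = list(csv.reader(f))

    # Reading a Parquet file; assumes the pyarrow dependency.
    import pyarrow.parquet as pq

    table = pq.read_table("data.parquet")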
CSV is just fine, thank you. Metadata comes in a side car file if needed.
I feel this argument is equivalent to “programmers are smart and therefore always do the right thing.” But that’s simply untrue. I’ve gotten malformed CSV-format database dumps from smart people with computer science degrees and years of industry experience. Programmers make mistakes all the time, by accident or from plain ignorance. We have type checkers and linters and test suites to deal with human fallibility; why not use data formats that do the same?
CSV is like C. Yes, it has gotten us far, but that doesn’t mean there’s nothing better to aspire to. Are Rust, Go, and Zig pointless endeavors because C is universally supported everywhere already? Of course not.
It also throws away using command-line tools for munging, client-side JS validation via regex or chunking, or all kinds of other things.
Author needs to sell a replacement for CSV for business reasons, but that doesn’t make CSV bad.
FWIW traditional unix command line tools like awk, cut and sed are terrible at all of the above with CSV because they do not understand the quoting mechanism.
I would vastly prefer to be using jq or (if I have to cope with xml) xmlstarlet most of the time.
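To illustrate the quoting problem, a toy Python sketch (the record is invented; the naive split stands in for what comma-blind tools like cut effectively do):

    import csv, io

    line = '1,"Smith, Jane",42\n'

    # What splitting on every comma does to a quoted field:
    print(line.rstrip().split(","))               # ['1', '"Smith', ' Jane"', '42']

    # A quote-aware parser keeps the field intact:
    print(next(csv.reader(io.StringIO(line))))    # ['1', 'Smith, Jane', '42']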
We’ve got ArrayBuffer and DataView now, we can write JS parsers for complicated file formats. ;)
It’s worse than that. They’re all now locale-aware. Set your locale to French (for example) and now your decimal separator is a comma. Anything using printf / scanf for floats will treat commas as decimal separators and so will combine pairs of adjacent numeric fields into a single value or emit field separators in the middle of numbers.
For personal stuff where I want a tabular format that I can edit in a text editor, I always use TSV instead of CSV. Tabs are not a decimal or thousands separator in any language and they don’t generally show up in the middle of text labels either (they’re also almost impossible to type in things like Excel, which just moves to the next cell if you press the tab key, so they don’t show up in the data by accident). All of the standard UNIX tools work really well on them and so do some less-standard ones like ministat.
Tangentially, this reminds me of how incredibly much I hate Microsoft Excel’s CSV parser.
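For what it’s worth, you can reproduce the effect from Python’s locale module (assuming a French locale is installed on the system):

    import locale

    # Assumes fr_FR.UTF-8 is available; setlocale raises locale.Error otherwise.
    locale.setlocale(locale.LC_NUMERIC, "fr_FR.UTF-8")

    # printf-style formatting through the locale turns the decimal point into a
    # comma, which is exactly the character CSV uses as its field separator.
    print(locale.format_string("%g", 3.14))   # prints: 3,14
    print(locale.atof("3,14"))                # parses back to 3.14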
Those tools are terrible, but are totally sufficient for a large chunk of CSV use cases. Like, yes, you’ll get unclean data sometimes, but in a lot of cases it’s no big deal.
re: JS parsing…I’ve done all of those things. I still appreciate the simplicity of CSV where even an intern can bodge something reasonable together for most use cases.
Like, this is all super far into the territory of Worse is Better.
I’d much rather be code reviewing the intern’s janky jq filter, or janky for loop in JavaScript, than their janky awk script. :)
Haha, for sure.
This is the reason why some industry app I work on needs to support XLSX, because of Office usage. We support CSV, XLSX and Parquet in different parts of the app, depending on how data is uploaded.
My understanding is that this article is a marketing blog post for the services they provide. Though I mostly agree with them that CSV should be replaced with better tools (and many people are already working on it).
Better at what though? CSV is not a “tool”- it’s a universally understood format.
You had better have a fantastically good reason for breaking a universally understood format.
That it is “old” or that it doesn’t work in a very small number of use-cases is not sufficient reason to fragment the effort.
The problem to me is not that it is old. The problem is for the exchange and interoperability of large datasets. In particular how do you stream update a csv from a diff / delta?
If you’re slinging data back and forth within your own organisation (where you control both endpoints), CSV is indeed sub-optimal.
If you’re exchanging data between heterogeneous systems, CSV is the local minimum. You can wish for a better format all you want, but if just one of the systems is gonna emit/accept CSV only, that’s what you’re gonna have to use.
Totally agree. I’ve contributed to a project that works with 100 million line CSV files (OpenAddresses). It works absolutely great for that. Honestly my only complaints with standard CSV are the dialect problems and the row oriented organization.
The dialect problems are real but in practice not a big deal. Common tools like Python’s CSV module or Pandas’ CSV importer can autodetect and generally consume every possible quoting convention. In an ideal world more people would use tab as the separator, or even better 0x1e the ASCII record separator (which no one uses). But we make commas work.
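For example, the standard library’s Sniffer will guess the delimiter and quoting convention from a sample before parsing (a sketch; the file name is hypothetical):

    import csv

    with open("mystery_dump.csv", newline="") as f:
        dialect = csv.Sniffer().sniff(f.read(8192))   # guesses delimiter, quotechar, ...
        f.seek(0)
        for row in csv.reader(f, dialect):
            pass  # process each row with the detected dialect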
The row orientation is a nerdy thing to quibble about, but CSV doesn’t compress as efficiently as it could. Rezipping the data so it’s stored in column order often results in much better compression. But that’s usually an unnatural way to produce or consume the data so fine, rows it is.
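A rough way to see the compression difference being described, assuming a rectangular file (results will of course vary with the data; the input file name is made up):

    import csv, gzip, io

    def gzipped_size(rows):
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        return len(gzip.compress(buf.getvalue().encode()))

    with open("data.csv", newline="") as f:      # hypothetical input
        rows = list(csv.reader(f))

    print("row order:   ", gzipped_size(rows))
    print("column order:", gzipped_size(list(zip(*rows))))   # transposed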
It seems like the author is angry he had to go through other people’s data dumps and clean them up, even though that’s basically what he is trying to sell. Bad arguments like “this data format cannot survive being opened in an editor and manually edited incorrectly”.
tbh this is just a “hey, we need to drum up more sales, can you write an article about something?” type of low-content marketing.
The idea of moving away from CSV and JSON to file formats like Avro, Parquet, or Arrow gives me nightmares. They all dispense with the most important aspect of CSV files that the author dismisses as something negative:
trade human readability and writability for precision
Hard pass.
In science we need to keep stuff around for a long long time. I have data from 10 years ago that I can read just fine with my eyeball and in any programming language that I want. I can trivially verify that the data is correctly encoded with any text editor. I can sleep well at night that if someone asks for data from a paper I can give it to them and they can read it.
Just look at the changelogs for any of these systems https://github.com/apache/parquet-format/blob/master/CHANGES.md or https://github.com/apache/avro/releases
Even across just 6 languages there are Avro files that can’t be read by some of the implementations. And “Add BYTE_STREAM_SPLIT encoding” is crazy: now we have files that can’t be read by previous versions of the library.
I want 10-50 year old code to read 10-50 year old files in languages that will be invented 10-50 years from now with zero support from any developers who may be long dead and docs that may be long gone.
The kind of thinking that the article lays out is very shortsighted. Any file format that isn’t human readable will eventually be deprecated and unsupported and won’t be readable by machines either. Systems last far longer than anyone expects them to, we have to plan for that not just for our own short-term convenience.
XML and JSON were supposed to be CSV killers, and while those formats have advantages, their design fails to take into account one simple fact: the people who cannot be trusted to format their data into valid, sanitized CSV also cannot be trusted to provide valid, sanitized JSON/XML, despite the existence of libraries that are supposed to make the process impossible to mess up.
Also, XML are documents, JSON are records, and CSV are tables. They have different data models and each has its own use. It doesn’t make sense to say that one will replace the other, because they’re different.
I agree–we should be using XML files with CDATA sections wrapping CSVs, themselves with JSON in the cells.
:^)
I know this is a joke, but zip or tar (or some improvement on them) that contain XML, CSV, and JSON is a good interchange format :) Various HTML archive formats are pretty much that, but people tend not to use them for applications, and they don’t seem that well spec’d or maintained.
tar has a lot of flaws: https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar
zip has MS-DOS timestamps as far as I remember, but basically works.
To prevent the problem from getting any worse, we have to stop normalizing CSV as an output or input format for data transfer.
I’m not sure the author is aware of how many legacy tools not only rely on CSV, but only know how to export CSV, and how many of the people who work with it are not technical.
When I worked at a web startup, one of our clients shipped us customer information as 2GB+ of CSV files every day as an FTP upload. They never compressed it or used SSH because it was too difficult for them. I did manage to at least get them to use the pilcrow character as a field separator so that the mistyped email addresses with commas in them didn’t screw up the import (which was a custom script with all kinds of fiddling to fix various entries). This file was exported from some Oracle database (I think–it was never clear) and the people uploading were a sales team with basically no technical skills.
Fighting against this is a complete and total lost cause. Much like anything that is entrenched with lots of rough edges, CSV will take forever to go away.
I’m absolutely certain the author knows that. They specifically lament all of these facts, and cite their years of experience dealing with all of these problems.
And if some other format were to become universal, as easy to export and import as CSV, that wouldn’t be a problem.
The author states this themselves. And they go on to hypothesize that if Microsoft and Salesforce both supported some other format, the rest of the world would come to adopt it.
You’re making all the same points that the author does, but without speculating on how these problems could be solved. The author never claims a migration would be quick, easy, or universal. Only that tackling the core use cases of Excel and database dumps would make future work far less error prone.
I don’t believe this, and I think that’s my problem with the article. You can design a wondrous format and people will always find a way to make it difficult to work with, because they often don’t want (or think they don’t want) the restrictions that come with something that makes future work less error prone.
There is some major survivorship bias here… one which could easily be missed by someone who hasn’t been around the data format block a few times.
CSV isn’t some fossil, it is an apex file format for tabular data. There is nothing to usurp or retire.
The fact that a bunch of barely educated chimps can exchange any data at all without prior collaboration isn’t just a feature, it’s a miracle.
Having a lone, confused, intern look at the problem and cobble together a mostly working solution is actually mind blowing technology. All you need to get started is in the name.
The reason nothing has replaced it isn’t a lack of trying or a lack of commitment. It is because there can be nothing better for its use case.
—– Content warning, rude / inconsiderate —–
Once you have educated people working both sides of the problem, you can find a ‘better’ solution… but experience tells me this must not be the case, since all the ‘problems’ with CSV vanish the moment that people on both sides have even a little bit of competency.
—– Further content warning, strong language —–
And if you think it’s hard dealing with a bunch of variously formatted CSV just you fucking wait until you have to deal with a bunch of variously formatted XML documents complying to various degrees with various standards… At least CSV is regular, with XML you are really and truly fucked in terms of parsing (and if you think you can push back any better with WSDL in hand, think again).
Schemas, at best, get you to what I call ‘cute IDE support’ by way of completion and code generation.
This feels like luxury (and can be rather nice), but doesn’t solve any core problems by itself… and introduces quite a few novel ones. (e.g. what idiot wrote this schema and why am I helpless to fix it? or, even more fun, the host for this schema hasn’t existed since 1999.)
Again, the problems with CSV do not exist for lack of trying to solve them. For certain problems, really common problems, CSV is as good as it gets… and when you ‘upgrade’ to a better format, you are making a lot of compromises that you might not be aware of.
For the author’s use case, they are in a way right to want to ‘push out’ some of the specific CSV problems they are dealing with to their customers. But they are also foolish to think they will succeed, because if they did, their product would evaporate.
All of the negative observations about the CSV format in this article are true. What is not mentioned is that Excel, Google Sheets, and the rest of the serious spreadsheet applications work around these problems as a matter of course, so no one cares, nor should they have to. Emit reasonably-formatted fields (ISO 8601 for datetime, etc), quote everything properly, and it will be perfectly usable on the other end if it belongs in a spreadsheet at all.
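As a sketch of what “emit reasonably-formatted fields and quote everything” might look like with Python’s csv module (the row and column names here are invented):

    import csv
    from datetime import datetime, timezone

    rows = [{"id": 1, "name": "Smith, Jane",
             "created": datetime(2021, 5, 4, 12, 30, tzinfo=timezone.utc)}]

    with open("export.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name", "created"],
                                quoting=csv.QUOTE_ALL)   # quote everything
        writer.writeheader()
        for r in rows:
            writer.writerow({**r, "created": r["created"].isoformat()})  # ISO 8601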
Yet here I am, working with the local equivalent of Fortune 500 companies that send me ;-separated ‘CSVs’ that use , as the decimal separator.
But people don’t do those things. Some people do, but certainly not all of them. That’s the entire point.
Personally I’m a fan of TSV for things that are just a bunch of numbers or other simple fields where whitespace delimiters make sense. It’s as easy as can be to work with out of the box with simple tools (including a bunch of standard UNIX utilities like cut) without having to mess around with delimiter options. Plus, if you need to read it in most programming languages you don’t really need a parser at all (since reading up to the next whitespace is usually a built-in thing).
Of course, this all falls apart as soon as the data is more complicated and stringy, but for data that fits, it’s lovely. JSON (or whatever) is nice when you need it, but I find it pretty silly when people put data this simple in something like JSON (or much worse, XML). Then you need to understand the schema and write a parser (probably using some library, which probably can’t even stream) to get at the data at all, and trivial one-liner command line tasks become an entire bespoke program you need to write. Ugh.
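A tiny illustration of the no-parser-needed point, in Python (the line is made up):

    line = "2021-05-04\t3.14\twidget\n"

    # Tab-splitting is built in; no csv module, no quoting rules.
    date, value, label = line.rstrip("\n").split("\t")
    print(float(value))   # 3.14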
Simple is good. Sure, people will abuse simple things - people will abuse everything - but that doesn’t make them bad. As AndrewStevens pointed out above, people will find a way to cause all of these problems with more complicated formats anyway. Most of the web is built out of broken garbage written in syntaxes that really shouldn’t have these problems at all, and there is a vast amount of tooling available to prevent it, and yet…
TSV is indeed better. If it were easier to use the ASCII control characters in text editors, I think it could be done even better. There are Record, Group, Unit and File Separators right there in the lowest common denominator of all encodings that is ASCII. If we used them more, things would be less error prone to parse even with “stringy” data.
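A minimal sketch of what that looks like in Python, with no CSV machinery at all (the sample rows are invented):

    US, RS = "\x1f", "\x1e"   # ASCII unit separator and record separator

    rows = [["id", "label"], ["1", "contains, commas\tand tabs"]]

    encoded = RS.join(US.join(fields) for fields in rows)
    decoded = [record.split(US) for record in encoded.split(RS)]

    assert decoded == rows   # round-trips as long as fields never contain US/RS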
Interesting, I hadn’t thought of that. I guess Unit Separator is the replacement for tab or comma in this context. Shame that these won’t be considered whitespace, so some of the “magically works with roughly every dumb tool ever” advantages would be lost, but I wonder how many tools support these things in practice…
The combination of robustness (the chance of these characters appearing in fields seems incredibly low) and extreme simplicity is really appealing, but I guess if it’s not something you can edit manually in whatever editor, it could never be a real replacement for [CT]SV.
Funny how little things in technology history have such wide-reaching implications sometimes. If one of the more or less entirely unused keys on a standard keyboard (say, SysRq) was instead a suitable separator, we wouldn’t have any of these problems.
Yeah, I tried using it on a little project once, but it is really annoying for anything else but the parsing side, b/c the regular tools don’t handle it nicely.
The strange thing is that we all have a symbol on our keyboards that most of us probably never use and that rarely occurs in text, namely §. We never lean towards it as a separator. Probably because it looks ugly to the eye. I literally cannot remember the last time I used § before writing this message.
Exactly. There are characters already, just hard to type.
Yup I agree, I designed a graceful upgrade of TSV for Oil, called “quoted typed tables”. It can represent tabs and newlines in fields if desired, and also you can attach a type to each column.
https://github.com/oilshell/oil/wiki/TSV2-Proposal
While I agree with this article, I’d like to first banish terrible (and extremely thinly documented) formats like the SAS and SPSS formats (which I never seem to be able to load properly with integrated loaders) or, even worse, Excel spreadsheets which contain tables. I can’t tell you how many times I’ve had to export an XLS to a CSV in order to import data in a form I care for. If we can get to the point where CSVs are the crusty, remaining format I’d be overjoyed, but maybe my standards are just too low.
Yeah, CSV is bad. But try to find any file format for data interchange that can be easily handled on the command line and that, at the same time, is something which institutional (scientific, governmental) bodies can handle. There is none. Just accept that you might have to tweak your data conversion all the time, or even manually adjust things, because all the alternatives are much worse.
CSV is just a meta-format; the problems with its usage come from careless data formatting and borked exporting from tools (e.g. Excel), not from the format itself.
It’s going to take an industry-wide refusal to implement and refusal to support it in order to kill CSV.
I worked for a company that basically turned healthcare data in CSV format into dashboards. The data ingest team basically implemented per-customer CSV readers because every CSV writer seemed to be a bespoke, manually-rolled thing. The best I could come up with to kill CSV at the company was to charge more for it. Do you want to give us data? You’re giving us Avro or, at least, shiny JSON. Here’s a library and here are instructions. Oh, and a CLI CSV-to-what-we-want converter. Or, you can give us CSV, but we’re going to charge extra for that CSV connector because of the expected maintenance overhead of correcting your bungled CSV format. I don’t think the company moved forward with the idea but there was broad support for it among the engineers.
CSV doesn’t need to go away. Data interchange can still be done in some other human readable/modifiable format and CSV can still be produced in order to support legacy software. The problem with being plaintext/modifiable is: people don’t use linters to check for errors and they don’t count rows and columns before and after modifications. They will do that with any format and you can’t solve for laziness.
The thing about a standard (or mess of standards) is it’s hard for a blog post to change it, but a thing we can do is be specific documenting text import/export in our own apps. Like:
Are headers required or optional or banned?
What encoding and newline sequence are used?
Will you take compressed files?
How does quoting/escaping work? What characters must be quoted/escaped? Are other escapes accepted (exporter might want to use them because other tools want them)?
If there are nulls, how are they represented? Are they indistinguishable from some string value?
What format do dates (or other complex data) use?
What happens on bad input: stop processing mid-file, skip the line, warn and try to do your best, or what?
And similar questions for exporting text. You can also answer by reference, e.g. “we’re using [Python’s CSV module/Apache OpenCSV/…] with default settings” or “we’re aiming for compatibility with [Excel/this MySQL LOAD DATA INFILE statement/this Postgres COPY statement]”. (Maybe a mix of the two.)
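One lightweight way to do the “answer by reference” part is to pin the parser settings explicitly in code, for instance by registering a named dialect (the name and the specific choices below are just an example):

    import csv

    # Spell out what the importer assumes instead of relying on defaults.
    csv.register_dialect(
        "our_import",             # hypothetical name
        delimiter=",",
        quotechar='"',
        doublequote=True,         # "" escapes a quote inside a quoted field
        quoting=csv.QUOTE_MINIMAL,
        strict=True,              # raise csv.Error on malformed input
    )

    with open("upload.csv", newline="", encoding="utf-8") as f:
        for row in csv.reader(f, dialect="our_import"):
            pass  # process row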
FWIW, my work app imports CSVs and documents maybe half of the things in that list right now, so I’m not speaking from the high ground; plus some of our answers are “we run heuristics on the file to try to guess the right thing”, which is fun. Sadly I probably know more of the pitfalls from using others’ software than from what I work on.
Besides Parquet/ORC/Avro, I kind of wish newline-separated JSON were used as a data interchange format a bit more. You can take advantage of languages’ existing support, with gzip the size can be reasonable, and support for large collections is a bit better than sending one big JSON object.
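A sketch of the idea with just the Python standard library (the file name and records are made up):

    import gzip, json

    records = [{"id": 1, "name": "Jane"}, {"id": 2, "name": "Joe"}]

    # Write: one JSON object per line, gzip-compressed.
    with gzip.open("records.jsonl.gz", "wt", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

    # Read: stream it back one line at a time; no giant top-level array needed.
    with gzip.open("records.jsonl.gz", "rt", encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)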
When I started work at a financial SaaS company, I figured that I would be paged in the middle of the night in order to pore over weird EBCDIC in fixed-length record formats. Thankfully, that never happened.
Instead, what we got were CSV files with unquoted commas, null bytes, and actual irregular numbers of fields per record. There’s value to being human-inspectable.
You can fix or reject these things in CSV. You can send a complete spec and sample file to the other party and expect them to fix their software… eventually.
It might be nice to have a more rigorous CSV successor and tools to manipulate it, but it will be hard to get everyone converted.
What I find interesting is that a lot of the criticism seems to come down to underspecification and variations of the format.
It’s obvious that this is a problem, but is it an unfixable one? It seems there is an older RFC, but that’s still underspecified. It also seems to me that a way forward to improve things would be:
a) Write up a new RFC and specify all the inconsistencies down to something sane, ideally something that a large number of applications already do, and try to get as many of them on board promising to support it.
b) Name that something easy to remember, like “CSV, 2021 spec” or something.
c) All applications provide at least an option to use “CSV, 2021 spec” and ideally move to that being the default.
Please note that this wouldn’t mean “CSV, 2021 spec” can only be used to exchange data with other applications supporting it. Given we try to spec what most applications already do, unless you have weird edge cases it probably already works in most cases with existing CSV-supporting applications.
FWIW I think there are similar inconsistencies in JSON parsers, and probably pretty much the same should be done there.
This reminds me of https://xkcd.com/927/ :)
I think you underestimate how many non-technical people produce datasets. None of these people will ever have heard of RFCs or whatever other TLA you may throw at them.
I know that comic, but I think I made clear that I absolutely did not want to do this.
My proposal would be to spec what is as close to existing solutions as possible and would likely work in most situations right from the start.
I understand, but I would not even know where such a spec would start and where it would end. Would it include date-times? How about timezone offsets? What about number formatting? Character encodings? This all gets very complicated very quickly.
The primary reason why CSV is such a ubiquitous format is that anyone can understand it quickly, and therefore nobody is coding against any spec. The RFC that exists was merely retrospective.
The very thing that makes CSV so common is also the reason why drafting a new spec will be unlikely to gain traction.
Nobody has stepped up to the plate so far. Feel free to start! Note, that I’m not volunteering, but I encourage you to!
What these articles always fail to do is underline their arguments with data: how much money, how many clients, or even how many lives have been lost by using CSV. While I fully agree with the anecdotal points made in the article, I also think that there is always the right tool for the right job, depending on the environment.
This discussion around data formats reminds me of the discussion about which programming language is best. Typed vs untyped, compiled vs interpreted, functional vs object oriented, and even whether it’s just a scripting language. It always depends on your experience, needs and goals. The same goes for data formats. So I would keep using CSV for small files in an environment with less experienced data people, the same way I would write small scripts in Bash.
I have in the past expressed similar thoughts about CSV. And there’s no doubt I’ve lost many dozens of hours on ill formed CSV files across my career.
Visidata is so good at manipulating them that I’ve softened my view. Being human readable is a huge advantage, and it makes them very easy to massage for whatever system I need to use them in. I now think it’s more fruitful to spend effort improving tools like visidata than to fight CSV.
Unless Excel can read it, so people can quickly and easily take a look at the data and play with it, it’s DOA.