They’re still ‘paths on the runtime platform’, though, right? They only guarantee they can represent all well-formed paths on the system you are running on right now, so you can’t represent arbitrary Linux file paths with a PathBuf if you’re running on Windows, as far as I can tell.
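A minimal Rust sketch of that platform dependence: the bytes-based constructor for OsStr exists only behind the Unix extension trait, so code like this compiles on Unix but has no direct Windows equivalent.

```rust
// Unix-only sketch: Path can hold arbitrary non-UTF-8 bytes, but only via
// the platform-specific OsStrExt extension trait.
#[cfg(unix)]
fn main() {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;
    use std::path::Path;

    // 0xFF is not valid UTF-8, yet it is a perfectly legal byte in a Unix file name.
    let raw = [b'f', b'o', b'o', 0xFF];
    let path = Path::new(OsStr::from_bytes(&raw));
    assert!(path.to_str().is_none()); // cannot be viewed as a UTF-8 &str
}

#[cfg(not(unix))]
fn main() {} // no portable std API to build this path from raw bytes here
```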
I believe the point made in this post is that even that doesn’t meet the needs of a VCS, which needs to seamlessly represent paths from any supported platform on any other supported platform.
They only guarantee they can represent all well-formed paths on the system you are running on right now
Sounds fair, though. Even cloning the Linux kernel is problematic on Windows, given some legacy files in the same directory whose names differ only by case. I don’t see being platform-agnostic to this level as a major concern. Git sure doesn’t care, at least, and it’s probably the most popular VCS. Hell, tag and branch names will break on either OS if you use the wrong symbols.
Do you think file paths on Unix systems should be UTF-8, instead of simply being encoding-agnostic byte sequences terminated by 0x00 and delimited by 0x2F? I can see it both ways, personally. As it stands, file paths are not text, but they’re nearly always treated as text. All text definitely should be UTF-8, but are file paths text? Should they be text?
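The status quo described here can be sketched in a few lines of Rust; segments is a hypothetical helper, not a standard API, and it treats the path purely as bytes delimited by 0x2F:

```rust
// Sketch of the status quo: a Unix path is just bytes, with 0x2F ('/') as
// the delimiter. `segments` is a hypothetical helper, not a standard API.
fn segments(path: &[u8]) -> Vec<&[u8]> {
    path.split(|&b| b == 0x2F)
        .filter(|s| !s.is_empty()) // a leading '/' or "//" yields empty segments
        .collect()
}

fn main() {
    // Works regardless of encoding: 0xFF is not valid UTF-8, but it splits fine.
    let segs = segments(&[0x2F, b'u', b's', b'r', 0x2F, 0xFF]);
    assert_eq!(segs.len(), 2);
    assert_eq!(segs[0], b"usr");
}
```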
Paths consist of segments of file names. File names should be names. Names should be text. Text represented as a sequence of bytes must have a specified encoding, otherwise it’s not text. Now the only question left is: which encoding should we use? Let’s just go with UTF-8 for compatibility with other software.
I would actually put further restrictions on that:
file names should consist of printable characters — what good is a name if the characters it’s made of cannot be displayed?
file names shouldn’t be allowed to span multiple lines — multiline file names will only cause confusion and will often be harder to parse (not just for humans, but also for CLI programs)
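The proposed restrictions could be sketched like this (an illustration, not any existing API; Rust’s is_control is used as a rough stand-in for “printable”, so it rejects newlines and the rest of the control range but not oddities like RTL overrides):

```rust
// Illustration only: a file name must be text (UTF-8), printable, and
// single-line under the proposed rules.
fn is_valid_name(bytes: &[u8]) -> bool {
    match std::str::from_utf8(bytes) {
        Err(_) => false, // not UTF-8, therefore not text
        Ok(s) => !s.is_empty() && s.chars().all(|c| !c.is_control()),
    }
}

fn main() {
    assert!(is_valid_name("résumé.txt".as_bytes()));
    assert!(!is_valid_name(b"two\nlines")); // newline is a control character
    assert!(!is_valid_name(&[0x66, 0xFF])); // not valid UTF-8
}
```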
As it is now in Unix, file names aren’t for humans. And neither are they for scripts. They’re for… file systems.
I agree with you about those restrictions. In some sense, Windows has succeeded in this area where Unix has failed. In Unix:
File names can begin with a hyphen, which creates ambiguity over whether an argument is a command-line flag or an actual file (prompting the convention of -- to separate flags from file arguments).
File names can contain newlines, which creates almost unsolvable problems for most Unix tools.
In Windows, however:
File names cannot contain forward slashes, and thus cannot be confused with command-line flags (which begin with a slash).
File names cannot contain line feeds or carriage returns; all characters in the range 0-31 are forbidden.
File names cannot contain double quotation marks, which means quoted file names are easy to parse.
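A sketch of a check against those Windows rules. Assumption beyond the list above: besides the control range 0-31, Windows’ full reserved punctuation set is < > : " / \ | ? *.

```rust
// Check a name against the Windows file-name rules listed above.
fn windows_name_ok(name: &str) -> bool {
    const RESERVED: &[char] = &['<', '>', ':', '"', '/', '\\', '|', '?', '*'];
    !name.is_empty() && name.chars().all(|c| (c as u32) > 31 && !RESERVED.contains(&c))
}

fn main() {
    assert!(windows_name_ok("report.txt"));
    assert!(!windows_name_ok("a\"b"));       // double quotation mark
    assert!(!windows_name_ok("two\nlines")); // control range 0-31
    assert!(!windows_name_ok("a/b"));        // forward slash
}
```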
Of course, both allow spaces in file names, which creates problems on both systems. If only shell/DOS used commas or something to separate command-line arguments instead of spaces…
There is a long essay by David A. Wheeler about the problems caused by weird filenames and what we might do to fix them, including suggestions on possible restrictions that operating systems might impose on valid filenames. Prohibiting control characters (including newline) is one of the items on the list. Scroll down to the very bottom to see it.
Ideally you want to bring reasonable naming capabilities to folks from the world of non-latin character sets. That’s a really good driver to go beyond existing C-Strings and other “Os String” encodings.
But when you say “UTF-8, but printable”, it’s not UTF-8 anymore. Also, what’s a “line”? 80 characters? 80 bytes? Everything that doesn’t contain a newline? Mh.
Allowing UTF-8 will bring some issues with Right-To-Left override characters and files named “txt.lol.exe” on certain operating systems.
Anything that doesn’t contain a newline. The point is that filenames with newlines in them break shell tools, GUI tools don’t allow you to create filenames with newlines in them anyway, and very few tools other than GNU ls have a reasonable way to present them.
Lots of stuff doesn’t allow you to include the ASCII control plane. Windows already bans the control plane from file names. DNS host names aren’t allowed to contain control characters (bare “numbers and letters and hyphen” names certainly don’t, and since domain registrars use a whitelist for extended characters, I doubt you could register a punycode domain with control characters either). The URL standard requires control plane characters to be percent encoded.
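For the URL point, a rough sketch of percent-encoding the control range (ASCII input assumed; the URL standard’s real escape sets are broader than this):

```rust
// Percent-encode bytes in the ASCII control range (0x00-0x1F and 0x7F).
fn percent_encode_controls(input: &[u8]) -> String {
    let mut out = String::new();
    for &b in input {
        if b < 0x20 || b == 0x7F {
            out.push_str(&format!("%{:02X}", b));
        } else {
            out.push(b as char);
        }
    }
    out
}

fn main() {
    assert_eq!(percent_encode_controls(b"a\nb"), "a%0Ab");
    assert_eq!(percent_encode_controls(b"plain"), "plain");
}
```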
I recently ran into this when building a website on Windows and uploading the files to a Linux server. I wrote a PHP script that reads the contents of a text file, re-formats it and outputs it as HTML. Turns out I had to worry about text encoding in several places:
On my Windows system, the files used Windows-1252 and CR/LF line endings.
WinSCP, which I used to transfer the files to the Linux server, maybe converted them to UTF-8/LF?
The PHP function htmlentities needed to know the correct encoding of the source text.
The Apache server had to serve the files with the correct encoding.
All of these worries would be non-existent in a world where everyone and everything used UTF-8, but in the meantime, it was a good exercise in not making too many assumptions as a developer.
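A hedged sketch of the conversion step in that pipeline, using ISO-8859-1 as a close cousin of Windows-1252 (the two differ in the 0x80-0x9F range, which a real converter must map specially) and normalizing CR/LF to LF:

```rust
// Decode ISO-8859-1 bytes to a UTF-8 String and normalize line endings.
fn latin1_crlf_to_utf8_lf(input: &[u8]) -> String {
    // In ISO-8859-1 every byte value maps directly to the Unicode code
    // point with the same number, so `b as char` is a correct decoder.
    let s: String = input.iter().map(|&b| b as char).collect();
    s.replace("\r\n", "\n")
}

fn main() {
    let src = b"caf\xE9\r\nline two\r\n"; // "café" plus CR/LF endings
    assert_eq!(latin1_crlf_to_utf8_lf(src), "café\nline two\n");
}
```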
Rust also has Path because file systems, as a class, do every possible thing wrong with strings that they can.
Yeah, file paths are not strings, nor should they be made, found, resolved or manipulated by means of string concatenation / splitting.
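A small Rust sketch of this point: Path::join handles separators and absolute components, where naive string concatenation goes wrong.

```rust
use std::path::{Path, PathBuf};

fn main() {
    let base = Path::new("/etc");

    // join inserts the separator for you.
    assert_eq!(base.join("ssh"), PathBuf::from("/etc/ssh"));

    // join with an absolute path replaces the base entirely; naive string
    // concatenation would silently produce "/etc/tmp" instead.
    assert_eq!(base.join("/tmp"), PathBuf::from("/tmp"));
}
```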
Not everything is UTF-8, but it should be.
Windows also doesn’t allow files named NUL, PRN, or CON. :D
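A sketch of that reserved-device-name rule: Windows rejects these names even with an extension, so “CON.txt” is as invalid as “CON”. This checks only the three names mentioned above; the full list also includes AUX, COM1-9, and LPT1-9.

```rust
// Is this file name one of Windows' reserved device names (subset)?
fn is_reserved(name: &str) -> bool {
    let stem = name.split('.').next().unwrap_or(name);
    ["NUL", "PRN", "CON"]
        .iter()
        .any(|r| r.eq_ignore_ascii_case(stem))
}

fn main() {
    assert!(is_reserved("CON"));
    assert!(is_reserved("con.txt"));      // the extension doesn't help
    assert!(!is_reserved("console.txt")); // only an exact stem match counts
}
```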
It’s tough, isn’t it? :-)
\r\n or \n? ;) You get my point?
If it includes \r\n, then it includes \n.
If the goal is to avoid breaking your platform’s own default shell, then the answer should be “whatever that shell uses.”
“Is” and “ought”, however, remain dangerous things to confuse.