Although the two characters look exactly the same, their code point sequences are different. This is known as Unicode equivalence and is, in theory, addressed by Unicode normalization. But here, normalization caused the issue. Before storing the file name in the cache, Nextcloud normalized the file name (to NFC) in a function normalizePath:
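The Nextcloud function itself is not reproduced here. As a rough illustration of the equivalence being described, here is a minimal Python sketch using the standard `unicodedata` module. The character "é" is an arbitrary example picked for this sketch, not necessarily the one from the post.

```python
import unicodedata

# The same visible character, written two ways (example character is an assumption):
composed = "\u00e9"      # one precomposed code point (the NFC form)
decomposed = "e\u0301"   # "e" followed by a combining acute accent (the NFD form)

# They render identically but are different strings and different bytes.
print(composed == decomposed)
print(composed.encode("utf-8"), decomposed.encode("utf-8"))

# Normalizing to NFC turns the decomposed form into the composed one,
# so the bytes sent back no longer match the bytes originally received.
as_nfc = unicodedata.normalize("NFC", decomposed)
print(as_nfc == composed)
```

If a server lists a name in decomposed form and the client asks for the NFC form afterwards, a byte-exact lookup on the server would not find it, which is the kind of mismatch the quoted passage describes.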
The post doesn’t have a conclusion, but the bug here is clearly with Nextcloud, right? It should obviously use the same bytes for the metadata request as what the server returned in the directory listing, right? In fact, why does Nextcloud mess with the contents of paths at all..?
I also think that a storage system should never ever mess with the bytes of a file (or file name as in this case). The bytes I give to you are the bytes I want back. Not some approximation thereof.
Which means that, with two clients, you may end up with two separate files with the same name.
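To make the two-clients case concrete, here is a small sketch. The `store` dict is a hypothetical stand-in for a byte-preserving storage system, and the file name is made up for the example.

```python
import unicodedata

# Two clients type "the same" name, but one emits NFC and the other NFD.
name_from_client_a = unicodedata.normalize("NFC", "caf\u00e9.txt")
name_from_client_b = unicodedata.normalize("NFD", "caf\u00e9.txt")

# Hypothetical byte-preserving store: keys are the exact bytes it was given.
store = {}
store[name_from_client_a.encode("utf-8")] = "written by client A"
store[name_from_client_b.encode("utf-8")] = "written by client B"

# Two entries that look identical when listed.
print(len(store))

# A store that normalizes keys on the way in would keep only one entry.
normalizing_store = {}
for raw_name, content in store.items():
    key = unicodedata.normalize("NFC", raw_name.decode("utf-8")).encode("utf-8")
    normalizing_store[key] = content
print(len(normalizing_store))
```

Neither behaviour is free: the first keeps your bytes but allows visually identical duplicates, the second avoids duplicates but hands back different bytes than it was given.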
The filesystem is perfectly free to not let you create a file it doesn’t like, for whatever reason.
Besides, the Greek question mark and other look-alike characters already make this possible.
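A quick check of what normalization does and does not fold, as a sketch using Python's `unicodedata`. The Cyrillic "а" is an extra example added here, not one from the thread.

```python
import unicodedata

greek_question_mark = "\u037e"  # looks like ";"
semicolon = ";"
cyrillic_a = "\u0430"           # looks like Latin "a" (example chosen for this sketch)
latin_a = "a"

# The Greek question mark is canonically equivalent to the semicolon,
# so NFC maps it to U+003B.
greek_folds = unicodedata.normalize("NFC", greek_question_mark) == semicolon

# Cyrillic "а" has no such equivalence; even NFKC leaves it alone.
cyrillic_folds = unicodedata.normalize("NFKC", cyrillic_a) == latin_a

print(greek_folds, cyrillic_folds)
```

So normalization happens to cover the Greek question mark, but the general point stands: look-alikes from other scripts stay distinct names whether or not the storage layer normalizes.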
Someone in the stack needs to be responsible for canonicalisation. On *NIX systems, this is typically userspace. This caused a lot of problems on non-Unicode systems. If you create a file in a Big-5 locale and then try to open the same file in a Latin code page locale, it will fail. These problems go away (in theory) if the VFS layer or filesystem driver canonicalises the encoding.
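A sketch of the Big-5 versus Latin code page problem, using Python's codecs to stand in for the two locales. The file name is made up, and Latin-1 is used as the "Latin code page" for the sake of the example.

```python
# Hypothetical file name: two CJK characters plus an ASCII extension.
name = "\u6587\u4ef6.txt"

# What a Big-5 locale would write to disk as the file name.
stored_bytes = name.encode("big5")

# A Latin-1 locale reads the same bytes back as unrelated characters.
seen_in_latin1 = stored_bytes.decode("latin-1")
print(repr(seen_in_latin1))

# And a user in the Latin-1 locale cannot even type the original name,
# because those characters have no Latin-1 encoding at all.
try:
    name.encode("latin-1")
    can_encode = True
except UnicodeEncodeError:
    can_encode = False
print(can_encode)
```

The bytes on disk never changed. Only the interpretation did, which is why pushing canonicalisation down into the VFS or filesystem driver looks attractive in theory.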
Linux paths can be any arbitrary sequence of non-null bytes, terminated by a null byte. Windows uses UTF-16. SMB has some horrifying backwards-compatibility name mangling options. Paths on local and networked file systems are super interesting, and the words “super interesting” chill my bones.
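As an illustration of the "arbitrary bytes" part, here is a sketch of a name that is a perfectly legal Linux file name but not valid UTF-8, and how Python's `surrogateescape` error handler (the one `os.fsdecode` and `os.fsencode` use on POSIX systems) round-trips it anyway. The file name is made up.

```python
# "café.txt" encoded as Latin-1: legal on Linux, but not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Strict UTF-8 decoding rejects it.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the undecodable byte through as a lone surrogate...
as_text = raw_name.decode("utf-8", errors="surrogateescape")

# ...so encoding with the same handler gives back the exact original bytes.
round_tripped = as_text.encode("utf-8", errors="surrogateescape")

print(strict_ok, repr(as_text), round_tripped == raw_name)
```

This only preserves the bytes. Any code that normalizes or re-encodes `as_text` along the way can still end up asking the filesystem for a name that does not exist.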
I heard you like encoding so I put some encoding in your encoding.
These are always fun to read. Another one was Confusable Character Detection in Erlang by @rkallos.