I wrote this library because each time I want to create something cross-platform, I run into the problem of having to cope with Unicode paths on Windows. Since I usually work on Linux, I do my Windows development using MSYS2. Thus, I wanted something that does not force me to use MSVC and is also lightweight, but I didn’t find anything meeting those requirements. As a result, Pathie was created. Direct link to docs
Your docs mention that on POSIX systems, paths might not be valid UTF-8 (or any single encoding), but it’s not clear to me what Pathie does in such a situation: are paths containing invalid UTF-8 inaccessible? Can you read them from the OS but not construct them yourself?
Your docs also say that Windows uses UTF-16LE, which is not strictly true: in the same way that POSIX paths are a bucket of bytes and not necessarily valid UTF-8, Windows paths are a bucket of `uint16_t`s and not necessarily valid UTF-16 (in particular, they can have lone surrogates that do not form a surrogate pair, or values that are not assigned in the Unicode database). How does Pathie interact with such malformed paths?
Lastly, macOS: as your documentation points out, macOS does have an enforced filename encoding, but it also has an enforced normalisation (at least for HFS+ volumes). That means your application can create a string, create a file with that name, then `readdir()` the directory containing that file, and none of the returned directory entries will byte-for-byte match the string you started with. Does that affect Pathie’s operation?
First off, Pathie does not assume POSIX paths are UTF-8, because that isn’t specified. Unless you compile Pathie with `ASSUME_UTF8_ON_UNIX`, it takes the encoding information from the environment via the `nl_langinfo(3)` function called with `CODESET` as the parameter (which is why you need to initialise your locale on Linux systems).
In the case of a path with invalid characters in the locale’s encoding (e.g., invalid UTF-8 on most modern Linuxes), you’ll get an exception when trying to read such a path from the filesystem, because `iconv(3)` fails with `EILSEQ` (which is transformed into a proper C++ exception by Pathie). You cannot construct paths containing invalid characters either, because you will receive the same exception. I’ll make this clearer in the docs.
Paths in a valid encoding are UTF-16LE. Broken path encodings may be anything, and that’s nothing one can make assumptions about. Again, you’ll receive an exception when you encounter them (because the underlying `WideCharToMultiByte()` function from the Win32 API fails).
I was not aware of that. Do you have a link with explanations, ideally on MSDN?
macOS is not officially supported by Pathie (which is stated at the top of the README), simply because I don’t have a Mac to test on.
It shouldn’t affect Pathie’s operation itself. Pathie will simply pass through what the filesystem gives it; since on macOS no conversion of path encodings happens, these normalised sequences are handed through to the application that uses Pathie.
Thanks for the feedback!
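For readers following along, here is a rough sketch of the POSIX behaviour described above: ask the locale for its codeset, run the raw bytes through `iconv(3)`, and turn `EILSEQ` into a C++ exception. This is not Pathie’s actual code; the names `locale_codeset` and `to_utf8` are made up for illustration, and the output buffer size assumes at most four UTF-8 bytes per input byte.

```cpp
#include <cerrno>
#include <stdexcept>
#include <string>
#include <iconv.h>
#include <langinfo.h>

// Name of the current locale's character encoding, e.g. "UTF-8".
// Only meaningful after setlocale(LC_ALL, "") has been called.
std::string locale_codeset() {
    return nl_langinfo(CODESET);
}

// Hypothetical helper: convert raw path bytes from `from_codeset` to UTF-8.
// Throws std::invalid_argument if the bytes are not valid in that codeset.
std::string to_utf8(const std::string& bytes, const std::string& from_codeset) {
    iconv_t cd = iconv_open("UTF-8", from_codeset.c_str());
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed for " + from_codeset);

    // Assumption: four output bytes per input byte is always enough for UTF-8.
    std::string out(bytes.size() * 4 + 4, '\0');
    char* in_ptr = const_cast<char*>(bytes.data());
    size_t in_left = bytes.size();
    char* out_ptr = &out[0];
    size_t out_left = out.size();

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    int saved_errno = errno;  // iconv_close() may overwrite errno
    iconv_close(cd);

    if (rc == (size_t)-1) {
        // EILSEQ: invalid sequence; EINVAL: input ends in a truncated sequence.
        if (saved_errno == EILSEQ || saved_errno == EINVAL)
            throw std::invalid_argument("path is not valid " + from_codeset);
        throw std::runtime_error("iconv failed");
    }
    out.resize(out.size() - out_left);
    return out;
}
```

In a real program you would call `setlocale(LC_ALL, "")` once at startup and then pass `locale_codeset()` as the second argument; the explicit parameter is only there so the conversion can be tried without depending on the environment.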
Unfortunately, I can’t find a smoking-gun writeup on MSDN. However, in my searching, I did find:
It’s a problem of enforcement. Rust uses WTF-8 as an internal encoding to fix that.
https://simonsapin.github.io/wtf-8/
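To make the lone-surrogate problem concrete, here is a minimal sketch of a check for well-formed UTF-16. The function name `is_well_formed_utf16` is hypothetical and not part of any library mentioned in this thread. It only checks surrogate pairing and does not look at whether code points are assigned in the Unicode database.

```cpp
#include <cstddef>
#include <string>

// High (leading) surrogates occupy 0xD800..0xDBFF.
bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// Low (trailing) surrogates occupy 0xDC00..0xDFFF.
bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Returns true if every surrogate in `s` is part of a proper
// high-then-low pair, i.e. the sequence is well-formed UTF-16.
bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i])) {
            // A high surrogate must be followed directly by a low one.
            if (i + 1 >= s.size() || !is_low_surrogate(s[i + 1]))
                return false;
            ++i;  // skip the low half of the pair
        } else if (is_low_surrogate(s[i])) {
            return false;  // low surrogate without a preceding high one
        }
    }
    return true;
}
```

A Windows filename is allowed to fail this check, which is exactly the case a strict UTF-16 to UTF-8 conversion cannot represent and WTF-8 can.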
An interesting read, thank you for the pointer. I’ll see if I adapt Pathie accordingly, but until now it has done the job for me (and it’s mostly a library I use for my own projects).
Thanks!
As an alternative to path manipulation libraries, may I suggest don’t manipulate paths?
Let the user specify “whatever path” they want, and just don’t manipulate it.
On unix, you can do something like `d=open(".",O_RDONLY);chdir(argv[1]);doit();fchdir(d);close(d);` which is simple and easy to get right. Having no parsing, no sensitivity to path specifiers, no splitting, no character sets and so on eliminates a whole class of complexity; no buffers means there are no buffer overflows.
Opening all of your directories ahead of time also means that they can’t be deleted/removed/moved after validation, which can protect against some other security problems.
On Windows, you can emulate fchdir with [GetFinalPathNameByHandle](https://msdn.microsoft.com/en-us/library/windows/desktop/aa364962(v=vs.85\).aspx) which keeps the code similar and has some of the benefits above.
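Spelled out with error checking, the Unix one-liner above could look roughly like this. It is only a sketch: `run_in_directory` and `current_directory` are names invented here, and `O_DIRECTORY` and `O_CLOEXEC` assume a POSIX.1-2008 system.

```cpp
#include <cassert>
#include <cerrno>
#include <string>
#include <system_error>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: the current working directory as a string.
std::string current_directory() {
    char buf[4096];
    if (getcwd(buf, sizeof buf) == nullptr)
        throw std::system_error(errno, std::generic_category(), "getcwd");
    return buf;
}

// Hypothetical helper: run doit() with `dir` as the working directory,
// then return to where we started via the saved directory descriptor.
template <typename F>
void run_in_directory(const char* dir, F&& doit) {
    assert(dir != nullptr);
    int saved = open(".", O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (saved < 0)
        throw std::system_error(errno, std::generic_category(), "open(\".\")");

    if (chdir(dir) != 0) {
        int e = errno;
        close(saved);
        throw std::system_error(e, std::generic_category(), "chdir");
    }

    try {
        doit();
    } catch (...) {
        // Best effort: go back before propagating the exception.
        if (fchdir(saved) != 0) { /* nothing sensible left to do */ }
        close(saved);
        throw;
    }

    int rc = fchdir(saved);
    int e = errno;
    close(saved);
    if (rc != 0)
        throw std::system_error(e, std::generic_category(), "fchdir");
}
```

The path is handed to `chdir` untouched, so no parsing or re-encoding happens in user code. The working directory is still process-wide, so this does not help with calling it from several threads at once.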
If you want to call `doit` from multiple threads simultaneously, what do you do?
It depends. Any open+close in multiple threads has this problem.
Sometimes I keep all of my IO resource acquisition in a single pthread. If you do this, you’ll clearly have no problems closing files or changing directories.
Sometimes I just don’t use pthreads: instead, we can use `fork()` and `mmap()` the pages that we actually want to share between the threads (processes).
If I only need a solution on Linux, I can use `unshare()`.
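A small sketch of the `fork()` plus `mmap()` variant, in case it helps: the child changes directory and reports back through a shared anonymous mapping, while the parent’s working directory stays untouched. The names `SharedResult` and `cwd_seen_by_child` are invented for this example, and `MAP_ANONYMOUS` is assumed to be available (it is on Linux; some systems spell it `MAP_ANON`).

```cpp
#include <cstring>
#include <stdexcept>
#include <string>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// The one page of state that parent and child deliberately share.
struct SharedResult {
    int ok;
    char cwd[4096];
};

// Hypothetical helper: fork a child, let it chdir(dir), and return the
// working directory the child ended up in. The parent never calls chdir.
std::string cwd_seen_by_child(const char* dir) {
    void* mem = mmap(nullptr, sizeof(SharedResult), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        throw std::runtime_error("mmap failed");
    SharedResult* shared = static_cast<SharedResult*>(mem);
    shared->ok = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(mem, sizeof(SharedResult));
        throw std::runtime_error("fork failed");
    }
    if (pid == 0) {
        // Child: has its own working directory, so chdir is private to it.
        if (chdir(dir) == 0 &&
            getcwd(shared->cwd, sizeof shared->cwd) != nullptr)
            shared->ok = 1;
        _exit(0);
    }

    // Parent: wait for the child, then read what it wrote to the shared page.
    int status = 0;
    waitpid(pid, &status, 0);
    bool ok = shared->ok != 0;
    std::string result = ok ? shared->cwd : "";
    munmap(mem, sizeof(SharedResult));
    if (!ok)
        throw std::runtime_error("child could not enter directory");
    return result;
}
```

Several such children can run at the same time, each in a different directory, which is the property that threads sharing one working directory cannot give you.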