1. 7

I wrote this library because every time I want to create something cross-platform, I run into the problem of coping with Unicode paths on Windows. Since I usually work on Linux, I do my Windows development using MSYS2. I therefore wanted something lightweight that does not force me to use MSVC, but I didn’t find anything meeting those requirements. As a result, Pathie was created. Direct link to docs

  1.  

  2. 4

    Your docs mention that on POSIX systems, paths might not be valid UTF-8 (or any single encoding), but it’s not clear to me what Pathie does in such a situation: are paths containing invalid UTF-8 inaccessible? Can you read them from the OS but not construct them yourself?

    Your docs also say that Windows uses UTF-16LE, which is not strictly true: in the same way that POSIX paths are a bucket of bytes and not necessarily valid UTF-8, Windows paths are a bucket of uint16_ts and not necessarily valid UTF-16 (in particular, they can have lone surrogates that do not form a surrogate pair, or values that are not assigned in the Unicode database). How does Pathie interact with such malformed paths?

    Lastly, macOS: as your documentation points out macOS does have an enforced filename encoding, but it also has an enforced normalisation (at least for HFS+ volumes). That means your application can create a string, create a file with that name, then readdir() the directory containing that file and none of the returned directory entries will byte-for-byte match the string you started with. Does that affect Pathie’s operation?
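
    To make the normalisation point concrete, here is a minimal sketch in C of why a byte-for-byte comparison can fail. It assumes the volume stores names in a decomposed form, as HFS+ does; the file name "café" and the helper same_bytes are made up for illustration and have nothing to do with Pathie's API.

    ```c
    #include <string.h>

    /* "café" with "é" as the single precomposed code point U+00E9 (NFC),
     * encoded as UTF-8. */
    const char nfc_name[] = "caf\xc3\xa9";

    /* "café" with "é" as "e" followed by the combining acute accent U+0301
     * (NFD), encoded as UTF-8. */
    const char nfd_name[] = "cafe\xcc\x81";

    /* Returns 1 if both names consist of exactly the same bytes, else 0. */
    int same_bytes(const char *a, const char *b)
    {
        return strcmp(a, b) == 0;
    }
    ```

    Both strings render identically, yet same_bytes(nfc_name, nfd_name) reports a mismatch, so comparing a name you created against what readdir() returns needs a normalisation step first.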

    1. 2

      Your docs mention that on POSIX systems, paths might not be valid UTF-8 (or any single encoding), but it’s not clear to me what Pathie does in such a situation: are paths containing invalid UTF-8 inaccessible?

      First off, Pathie does not assume POSIX paths are UTF-8, because that isn’t specified. Unless you compile Pathie with ASSUME_UTF8_ON_UNIX, it takes the encoding information from the environment via the nl_langinfo(3) function called with CODESET as the parameter (which is why you need to initialise your locale on Linux systems).
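
      As a rough sketch of that lookup (not Pathie's actual code), the two hypothetical helpers below show what nl_langinfo(CODESET) reports before and after the locale has been initialised from the environment:

      ```c
      #include <langinfo.h>
      #include <locale.h>

      /* Returns the codeset name of whatever locale is currently active.
       * A program that never calls setlocale() runs in the "C" locale,
       * which typically reports an ASCII codeset rather than UTF-8. */
      const char *current_codeset(void)
      {
          return nl_langinfo(CODESET);
      }

      /* Adopts the locale from the environment (LANG, LC_ALL, ...) and then
       * returns its codeset name, e.g. "UTF-8" on most modern Linux setups. */
      const char *environment_codeset(void)
      {
          setlocale(LC_ALL, "");
          return nl_langinfo(CODESET);
      }
      ```

      Calling setlocale(LC_ALL, "") once at program start is the initialisation step meant above; without it, the reported codeset does not reflect the user's environment.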

      are paths containing invalid UTF-8 inaccessible? Can you read them from the OS but not construct them yourself?

      In the case of a path with invalid characters in the locale’s encoding (e.g., invalid UTF-8 on most modern Linuxes), you’ll get an exception when trying to read such a path from the filesystem, because iconv(3) fails with EILSEQ (which is transformed into a proper C++ exception by Pathie). You cannot construct paths containing invalid characters either, because you will receive the same exception. I’ll make this clearer in the docs.
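
      The EILSEQ failure can be reproduced with plain iconv(3). This is only a sketch: try_convert is a hypothetical helper, and the UTF-8 target encoding and the fixed-size output buffer are assumptions made for the example, not a description of how Pathie calls iconv.

      ```c
      #include <errno.h>
      #include <iconv.h>
      #include <stddef.h>

      /* Tries to convert `len` bytes from the encoding `from_code` to UTF-8.
       * Returns 0 on success, or the errno value of the failing call
       * (EILSEQ when the input contains an invalid byte sequence). */
      int try_convert(const char *from_code, const char *bytes, size_t len)
      {
          iconv_t cd = iconv_open("UTF-8", from_code);
          if (cd == (iconv_t)-1)
              return errno;

          char out[256];                /* large enough for short file names */
          char *in_ptr = (char *)bytes; /* iconv() does not modify the input */
          char *out_ptr = out;
          size_t in_left = len;
          size_t out_left = sizeof out;

          int result = 0;
          if (iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t)-1)
              result = errno;

          iconv_close(cd);
          return result;
      }
      ```

      With this helper, try_convert("UTF-8", "caf\xc3\xa9", 5) succeeds, while a name containing the byte 0xFF, which never occurs in valid UTF-8, fails with EILSEQ. A library can then turn that error code into an exception.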

      Your docs also say that Windows uses UTF-16LE, which is not strictly true:

      Paths in a valid encoding are UTF-16LE. Broken path encodings may be anything, and one cannot make assumptions about them. Again, you’ll receive an exception when you encounter them (because the underlying WideCharToMultiByte() function from the Win32 API fails).

      (in particular, they can have lone surrogates that do not form a surrogate pair, or values that are not assigned in the Unicode database)

      I was not aware of that. Do you have a link with explanations, ideally on MSDN?

      Lastly, macOS: as your documentation points out macOS does have an enforced filename encoding, but it also has an enforced normalisation (at least for HFS+ volumes)

      macOS is not officially supported by Pathie (which is stated at the top of the README), simply because I don’t have a Mac to test on.

      That means your application can create a string, create a file with that name, then readdir() the directory containing that file and none of the returned directory entries will byte-for-byte match the string you started with. Does that affect Pathie’s operation?

      It shouldn’t affect Pathie’s operation itself. Pathie will simply pass through what the filesystem gives it; since on macOS no conversion of path encodings happens, these normalised sequences are handed through to the application that uses Pathie.

      Thanks for the feedback!

      1. 3

        Do you have a link with explanations, ideally on MSDN?

        Unfortunately, I can’t find a smoking-gun writeup on MSDN. However, in my searching, I did find:

        • Scheme48 has a special OS String type, and motivates it saying “On Windows, unpaired UTF-16 surrogates are admissible in encodings, and no lossless text decoding for them exists.”
        • Racket’s encoding conversion functions include special “platform-UTF-8” and “platform-UTF-16” encodings: “On Windows, the input can include UTF-16 code units that are unpaired surrogates…”
        • Rust also includes a special OsString type: “On Windows, strings are often arbitrary sequences of non-zero 16-bit values, interpreted as UTF-16 when it is valid to do so.”
        • I found the Rust ticket that introduced the OsString type, which includes a (Rust) test case. One of the Rust devs dug up an MSDN page that says “…the file system treats path and file names as an opaque sequence of WCHARs.”
        • That issue also linked to a report of UTF-16-invalid filenames being found in the wild in somebody’s Recycle Bin.
        1. 3

          It’s a problem of enforcement. Rust uses WTF-8 as its internal encoding to work around that.

          https://simonsapin.github.io/wtf-8/
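
          A minimal sketch of the idea, written from the WTF-8 document linked above and not taken from Rust's implementation: well-formed surrogate pairs are combined into one code point, while an unpaired surrogate is encoded as a three-byte sequence of its own instead of being rejected. The function name utf16_to_wtf8 and its buffer-based interface are made up for this example.

          ```c
          #include <stddef.h>
          #include <stdint.h>

          /* Encodes `n` possibly ill-formed UTF-16 code units as WTF-8.
           * Returns the number of bytes written, or (size_t)-1 if `out`
           * (capacity `cap`) is too small. */
          size_t utf16_to_wtf8(const uint16_t *in, size_t n,
                               unsigned char *out, size_t cap)
          {
              size_t w = 0;
              for (size_t i = 0; i < n; i++) {
                  uint32_t cp = in[i];
                  /* A lead surrogate directly followed by a trail surrogate
                   * forms one supplementary code point. */
                  if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
                      in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                      cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                      i++;
                  }
                  size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
                  if (w + need > cap)
                      return (size_t)-1;
                  if (need == 1) {
                      out[w++] = (unsigned char)cp;
                  } else if (need == 2) {
                      out[w++] = (unsigned char)(0xC0 | (cp >> 6));
                      out[w++] = (unsigned char)(0x80 | (cp & 0x3F));
                  } else if (need == 3) {
                      /* Unpaired surrogates land here and are kept as-is. */
                      out[w++] = (unsigned char)(0xE0 | (cp >> 12));
                      out[w++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                      out[w++] = (unsigned char)(0x80 | (cp & 0x3F));
                  } else {
                      out[w++] = (unsigned char)(0xF0 | (cp >> 18));
                      out[w++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
                      out[w++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
                      out[w++] = (unsigned char)(0x80 | (cp & 0x3F));
                  }
              }
              return w;
          }
          ```

          A lone 0xD800 becomes the bytes ED A0 80, which strict UTF-8 forbids, so every Windows file name survives a round trip through this form.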

          1. 1

            An interesting read, thank you for the pointer. I’ll see whether I can adapt Pathie accordingly, but so far it has done the job for me (and it’s mostly a library I use for my own projects).

          2. 1

            Thanks!

      2. 3

        As an alternative to path manipulation libraries, may I suggest not manipulating paths at all?

        Let the user specify “whatever path” they want, and just don’t manipulate it.

        On Unix, you can do something like d=open(".",O_RDONLY);chdir(argv[1]);doit();fchdir(d);close(d); which is simple and easy to get right. Having no parsing, no sensitivity to path specifiers, no splitting, no character sets and so on eliminates a whole class of complexity; no buffers means there are no buffer overflows.
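
        Spelled out with error handling, the one-liner might look like the sketch below. The wrapper run_in_directory is a hypothetical name, and doit here is only a stand-in that touches the new working directory through a relative path.

        ```c
        #include <fcntl.h>
        #include <unistd.h>

        /* Result of the last doit() call; -1 until it has run successfully. */
        int doit_result = -1;

        /* Stand-in for the real work: it only ever uses relative paths. */
        void doit(void)
        {
            doit_result = access(".", R_OK);
        }

        /* Runs `work` with `dir` as the working directory, then switches back
         * to the directory that was current before. Returns 0 on success and
         * -1 if any step fails. */
        int run_in_directory(const char *dir, void (*work)(void))
        {
            int saved = open(".", O_RDONLY); /* handle to the old directory */
            if (saved == -1)
                return -1;

            if (chdir(dir) == -1) {
                close(saved);
                return -1;
            }

            work();

            int rc = fchdir(saved);          /* go back without any path */
            close(saved);
            return rc;
        }
        ```

        The user-supplied path is passed to chdir() untouched and is never parsed, split, or copied into a buffer.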

        Opening all of your directories ahead of time also means that they can’t be deleted/removed/moved after validation, which can protect against some other security problems.

        On Windows, you can emulate fchdir with [GetFinalPathNameByHandle](https://msdn.microsoft.com/en-us/library/windows/desktop/aa364962(v=vs.85\).aspx) which keeps the code similar and has some of the benefits above.

        1. 0

          If you want to call doit from multiple threads simultaneously, what do you do?

          1. 0

            It depends. Any open+close in multiple threads has this problem.

            Sometimes I keep all of my IO resource acquisition in a single pthread. If you do this, you’ll clearly have no problems closing files or changing directories.

            Sometimes I just don’t use pthreads: instead, I use fork() and mmap() the pages that I actually want to share between the processes.
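
            A small sketch of that approach, assuming a system that supports anonymous shared mappings (MAP_ANONYMOUS): the child gets its own working directory, so its chdir() cannot disturb the parent, and only the explicitly mapped page is shared. The function chdir_in_child is invented for this example.

            ```c
            #define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
            #include <stddef.h>
            #include <sys/mman.h>
            #include <sys/types.h>
            #include <sys/wait.h>
            #include <unistd.h>

            /* Forks a child that changes into `dir` and reports through a
             * shared page whether that worked. Returns 1 if the child's
             * chdir() succeeded, 0 if it failed, -1 on setup errors. */
            int chdir_in_child(const char *dir)
            {
                int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                if (shared == MAP_FAILED)
                    return -1;
                *shared = -1;

                pid_t pid = fork();
                if (pid == -1) {
                    munmap(shared, sizeof *shared);
                    return -1;
                }
                if (pid == 0) {
                    /* Child: this chdir() affects only the child process. */
                    *shared = (chdir(dir) == 0) ? 1 : 0;
                    _exit(0);
                }

                /* Parent: wait for the child, then read its result. */
                waitpid(pid, NULL, 0);
                int result = *shared;
                munmap(shared, sizeof *shared);
                return result;
            }
            ```

            Several such children can run at once, each in a different directory, which is what the single shared working directory of a multithreaded process does not allow.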

            If I only need a solution on Linux I can use unshare().