1. 5

We all know why glibc doesn’t have strlcpy(3): not only is it non-standard, but it’s also not safe! It can cut strings that are too long, and if you don’t check the return value, you’ll never know!

“But what am I to do?”, you may ask. Here’s the answer from the Guru himself: use an incantation involving mempcpy(3). Yes, you may think it’s weird to use a mem* function to copy strings, and it’s likewise non-standard, and the n in question is size - 1, but at least it’s totally safe! Look:

#define _GNU_SOURCE
#include <string.h>

int
main()
{
        char            dst[8192];
        const char      *src = "foo";

        *((char *)mempcpy(dst, src, sizeof(dst) - 1)) = '\0';
        return 0;
}
: noir:/tmp%22; make cpy
cc     cpy.c   -o cpy
: noir:/tmp%23; ./cpy
zsh: segmentation fault (core dumped)  ./cpy

Oh.

Every program which is handling strings has to know how long they are.

So it’s my fault, I had to call strlen(3) before. How is mempcpy(3) more useful than strcpy(3), again?
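
For anyone puzzled by the crash: mempcpy(3) copies exactly n bytes and never stops at a nul, so with n = sizeof(dst) - 1 it reads 8191 bytes out of a four-byte string literal. A minimal sketch of a copy bounded by both lengths might look like this; bounded_copy is a made-up name, not a libc function, and the truncate-and-report behaviour is one choice among several.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not a libc function: copy at most dstsize - 1
 * bytes of the C string src into dst and always nul-terminate.
 * Returns strlen(src), so a result >= dstsize means truncation.
 */
size_t
bounded_copy(char *dst, const char *src, size_t dstsize)
{
        size_t  srclen = strlen(src);   /* never read past src's nul */
        size_t  n;

        assert(dstsize > 0);
        n = srclen < dstsize - 1 ? srclen : dstsize - 1;
        memcpy(dst, src, n);            /* n is bounded by both buffers */
        dst[n] = '\0';
        return srclen;
}
```

Note that this needs the length of src before copying, which is exactly the point of the question above.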

  1.  

  2. [Comment removed by author]

    1. 5

      Hopefully. Debian switched to eglibc to avoid him. Just look at the bug reports linked from the announcement.

      1. 3

        Ah yes, the infamous Ulrich Drepper. The fact that he was allowed to continue like that for so long still amazes me…

      2. [Comment removed by author]

        1. 4

          My point is that strlcpy is strictly better than the proposed solution and is rejected for invalid reasons, because:

          • The argument about standards is bogus: mempcpy is at least as non-standard as strlcpy, perhaps even more so (I don’t think it’s available anywhere but glibc).
          • The “safety” argument is bogus: if you can’t tolerate truncation, just check the return value.
          • Due to having to use size - 1, the mempcpy dance is error-prone. This reddit comment contains a trivial example that overflows the destination buffer, and nobody even noticed.
          • To call strlcpy you need to know the size of the target buffer; to call mempcpy, you need to know that and the length of the source string, otherwise you either truncate or overread it.
          • Moreover, if the whole string is copied, *((char *)mempcpy(dst, src, len)) = '\0' is equivalent to memcpy(dst, src, len + 1) (and strcpy(dst, src)).
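
          A minimal sketch of the return-value check from the second bullet. Because strlcpy is missing from some libcs, the sketch carries its own single-pass version under the hypothetical name sketch_strlcpy; it mirrors the documented BSD semantics but is not the OpenBSD source, and copy_or_fail is likewise just an illustrative name.

          ```c
          #include <assert.h>
          #include <stddef.h>

          /*
           * Illustrative single-pass version of strlcpy semantics.  Copies up
           * to dstsize - 1 bytes, nul-terminates if dstsize > 0, and returns
           * the length of src.
           */
          size_t
          sketch_strlcpy(char *dst, const char *src, size_t dstsize)
          {
                  size_t  i = 0;

                  assert(src != NULL);
                  /* Copy while there is room for the byte plus a terminator. */
                  if (dstsize > 0) {
                          for (; i < dstsize - 1 && src[i] != '\0'; i++)
                                  dst[i] = src[i];
                          dst[i] = '\0';
                  }
                  /* Keep counting so the return value equals strlen(src). */
                  while (src[i] != '\0')
                          i++;
                  return i;
          }

          /* Returns 0 on success, -1 if src did not fit and was truncated. */
          int
          copy_or_fail(char *dst, const char *src, size_t dstsize)
          {
                  if (sketch_strlcpy(dst, src, dstsize) >= dstsize)
                          return -1;
                  return 0;
          }
          ```

          The caller only has to know the destination size; the source length comes back as the return value.
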
          1. [Comment removed by author]

            1. 2

              Glad to explain. Mountains of sarcasm may not be the clearest way to advance one’s argument.

              Making a string handling library might make sense for bigger projects (and I’ve worked on a project that had one), but for smaller programs I usually get away with using strdup().

        2. [Comment removed by author]

          1. 2

            Nice one. Except for the nul-termination, it’s like Go strings, which call size “cap” (for capacity) and place “len”.

          2. 1

            How is mempcpy(3) more useful than strcpy(3), again?

            I meant strlcpy(3), of course.

            1. 1

              This is why I like std::string (Or bstring and similar in C).

              1. 5

                I’ve seen bstring recommended a lot on HN and elsewhere for a long time now... but why? Why do people keep recommending it?

                That library literally relied on undefined behavior (integer overflow) for its security up until about six months ago, curiously disclosing the fact in its “security statement” through a warning about 2’s complement arithmetic, and nobody seemed to care for a long, long time. That doesn’t exactly earn a vote of confidence from me. Yet people would recommend it as a solution for secure strings…

                If you dig through the commit logs and developer forums further, you may find that there have been more than enough corner cases the developers didn’t account for when writing the code initially. Frankly, I don’t think strings are that hard a problem. To me, it smells like a project from someone who was inexperienced with C and thought he would solve the “strings are hard” problem with a library while lacking the insight and know-how (or just discipline) to take it slow and do it correctly.

                Am I too cynical?

                All I can hope is that the library has now gained enough attention that someone who cares enough is actually fixing the bugs and that they’re (mostly) gone now. I really hope so. Because so many people seem to want to use this library.

                Too bad most users embed copies of the library in their project, thus likely going with an old and broken version that likely won’t get updated soon.

                1. 1

                  I haven’t actually used bstring, it’s just one I have heard of. I just mean “some C library that has a struct to pair a char* and size_t”.

              2. 1

                You are being advised to use *((char*)mempcpy(dst,src,strlen(src)))='\0';, and to make sure that dst is big enough. mempcpy(d,s,n) is exactly the same as (void*)(((char*)memcpy(d,s,n))+n) but it might be faster.
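
                A small sketch of that equivalence, for readers without glibc at hand. sketch_mempcpy is a hypothetical stand-in written in terms of memcpy, not glibc's implementation; terminate_copy is likewise an illustrative name and assumes the caller has already checked that dst holds strlen(src) + 1 bytes.

                ```c
                #include <assert.h>
                #include <stddef.h>
                #include <string.h>

                /* Stand-in for mempcpy(3): copy n bytes, return a pointer just past them. */
                void *
                sketch_mempcpy(void *d, const void *s, size_t n)
                {
                        return (char *)memcpy(d, s, n) + n;
                }

                /*
                 * The advised idiom.  The caller must guarantee that dst has room
                 * for strlen(src) + 1 bytes; nothing here checks it.
                 */
                char *
                terminate_copy(char *dst, const char *src)
                {
                        assert(dst != NULL && src != NULL);
                        *((char *)sketch_mempcpy(dst, src, strlen(src))) = '\0';
                        return dst;
                }
                ```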

                1. 0

                  If I have to call strlen, why wouldn’t I just do memcpy(dst, src, strlen(src) + 1);?

                  and to make sure that dst is big enough

                  Basically,

                  len = strlen(src);
                  if (len >= sizeof(dst))
                          return FAIL;    /* or truncate */
                  *((char *)mempcpy(dst, src, len)) = '\0';
                  

                  How is this better than if (strlcpy(dst, src, sizeof(dst)) >= sizeof(dst)) return FAIL;? Yes, it fails slightly faster, but it succeeds slower, as you have to walk src twice.

                  And if I’ve checked that the source string fits in dst, why wouldn’t I just use strcpy at this point?

                  but it might be faster.

                  memcpy may be implemented as a highly optimized built-in. Is mempcpy implemented similarly?

                  1. 3

                    If I have to call strlen, why wouldn’t I just do memcpy(dst, src, strlen(src) + 1);?

                    If you have a char *buf, size_t n pair, you don’t know that it’s a nul-terminated buffer (really of length n+1) unless that’s promised. And you don’t want to have that promise, because it’s better for your APIs to be simple.

                    How is this better than if (strlcpy(dst, src, sizeof(dst)) >= sizeof(dst)) return FAIL;?

                    That code is lame anyway because it uses a fixed target size. You should allocate a buffer of the appropriate size. But if you really did need a fixed-size target, then checking the length after copying, or trying to do the check and then the copy in one fell swoop, is more likely to evolve into broken code. If you do need a fixed destination size, here is one good way to write it:

                    // len is already computed, it was a parameter.
                    if (len > sizeof(dst) - 1) {
                      return FAIL;
                    }
                    memcpy(dst, src, len);
                    dst[len] = '\0';
                    

                    Every line has its intention made clear and explicit, reducing the chance that refactorings break stuff. Also, using > (or <) is the more idiomatic length check.

                    One remaining problem here might be that if you have a real need for a fixed destination size, you probably also want to fill the unused space with a defined value, otherwise you’ll output uninitialized memory. (Thus strlcpy would almost never be useful when writing good code.) Another is the paranoia that a refactoring might make dst be a pointer, and so using sizeof(dst) is unnecessarily risky. I might declare char dst[FOO];, and use FOO in place of sizeof(dst).
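
                    A minimal sketch of that variant, assuming a named size constant (FOO, as suggested above) and zero as the fill value; copy_fixed is a hypothetical name.

                    ```c
                    #include <assert.h>
                    #include <stddef.h>
                    #include <string.h>

                    #define FOO 16  /* hypothetical fixed destination size */

                    /*
                     * Copy len bytes of src into a FOO-byte destination, terminate it,
                     * and zero the unused tail so no uninitialized bytes remain.
                     * Returns 0 on success, -1 if src does not fit.
                     */
                    int
                    copy_fixed(char dst[FOO], const char *src, size_t len)
                    {
                            assert(dst != NULL && src != NULL);
                            if (len > FOO - 1)
                                    return -1;
                            memcpy(dst, src, len);
                            memset(dst + len, 0, FOO - len);        /* terminator and padding */
                            return 0;
                    }
                    ```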

                    1. 1

                      If you have a char *buf, size_t n pair, you don’t know that it’s a nul-terminated buffer (really of length n+1) unless that’s promised. And you don’t want to have that promise, because it’s better for your APIs to be simple.

                      Of course I do, it’s a C string.

                      I sense confusion here. There are two canonical ways to deal with sequences of bytes in C. One is a buf, n pair (let’s call it a buffer). A buffer:

                      • has length alongside the address
                        • takes len + sizeof(size_t) bytes in memory
                      • may contain any byte value
                      • copied and searched using mem*() functions
                      • output using *write()

                      The other is a C string, which:

                      • is nul-terminated, thus doesn’t need an explicit length
                        • takes len + 1 bytes in memory
                      • may contain any character except nul
                      • copied and searched using str*() functions
                      • output using *puts() or *printf()

                      What you seem to propose is that src is a buffer but dst has to be a C string for some reason. This assumes src doesn’t contain nul bytes (otherwise copying it to a C string would not make sense). I can’t imagine the case in which this would be useful.
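
                      To illustrate the distinction, a minimal sketch follows; struct buffer and buffer_to_cstring are hypothetical names. It shows that turning a buffer into a C string is only faithful when the data contains no nul bytes, so the conversion has to be able to fail.

                      ```c
                      #include <assert.h>
                      #include <stddef.h>
                      #include <string.h>

                      /* Hypothetical buffer type: address plus explicit length. */
                      struct buffer {
                              const void      *ptr;
                              size_t           len;
                      };

                      /*
                       * Convert a buffer to a C string in dst (dstsize bytes).  Fails
                       * with -1 if the data does not fit or contains a nul byte, since
                       * a C string could not represent it faithfully.
                       */
                      int
                      buffer_to_cstring(char *dst, size_t dstsize, struct buffer src)
                      {
                              assert(dst != NULL && src.ptr != NULL);
                              if (src.len >= dstsize)
                                      return -1;
                              if (memchr(src.ptr, '\0', src.len) != NULL)
                                      return -1;
                              memcpy(dst, src.ptr, src.len);
                              dst[src.len] = '\0';
                              return 0;
                      }
                      ```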

                      That code is lame anyway because it uses a fixed target size.

                      Replace sizeof(dst) with dstsiz the variable in your mind. Besides, fixed buffer sizes are useful when you have no malloc(), e.g., in embedded environments.

                      You should allocate a buffer of the appropriate size.

                      If it’s a C string, I’ll just strdup(). If it’s a buffer, I’ll malloc() and memcpy().

                      checking the length after copying, or trying to do the check and then the copy in one fell swoop, is more likely to evolve into broken code.

                      Why/how?

                      One remaining problem here might be that if you have a real need for a fixed destination size, you probably also want to fill the unused space with a defined value, otherwise you’ll output uninitialized memory.

                      How? If I treat dst as a C string, I’ll output it using puts()/printf(), and it’ll stop at '\0'. If I treat it as a buffer, I’ll use write() and all the talk about string handling code is irrelevant anyway. (And if I fill unused space with nuls and am dumb enough to use write() on it, I’ll output spurious nuls.)

                      (Thus strlcpy would almost never be useful when writing good code.)

                      It’s useful exactly as advertised: when you want to copy a C string to a buffer that either is fixed length or you can reallocate, and you either can tolerate or want to detect truncation (or actually want to truncate). On the other hand, I can’t see when mempcpy is useful. Especially the incantation with = '\0', which I find contrived and confused.

                      And if you know your buffer lengths and are so sure you’re writing “good” code, you can just use strcpy.

                      Another is the paranoia that a refactoring might make dst be a pointer, and so using sizeof(dst) is unnecessarily risky.

                      This is a valid point. But the code I wrote above was just an example; in real life, it’d probably be in a function that length is passed to.

                      1. 2

                        Look, you can pretend I don’t know how C strings work if that makes you feel better. Data that comes from program input doesn’t come free of nul characters, so the way you’ll want to treat that is with a length and size. If you want nul-free data as an internal API guarantee, well, you’ve probably got much more specific data hygiene problems than that, so C strings aren’t the solution. And using C strings internally will inhibit making refactorings to your program if you ever do need to allow nul bytes, or permit other refactorings.

                        Generally, the answer to all your nits and questions derives from the fact that you should write your code in a way such that it’s unlikely to be broken, and unlikely to be broken in the future. A good understanding of risk factors for how code evolves will answer questions like how the code should be written. It’s why you should check the length before copying and why you shouldn’t write mempcpy(dest, src, n) = '\0'. That memcpy code is not “contrived and confused” – those words don’t mean anything. A rational argument against that construct is that people reading it will have to look up what mempcpy does, and that it’s more likely to evolve into something broken because it obscures the fact that the nth character of the buffer is assigned '\0'. But the real meat behind the decision isn’t the just-so story about how it obscures information, it’s the weightings of these risk factors that you get from experience, and the cheap price of the alternative.

                        1. 1

                          Data that comes from program input doesn’t come free of nul characters,

                          There are perfectly good reasons to strip nuls from strings or assume they aren’t there, e.g., if you’re dealing with Unix filenames or e-mail addresses, or are handling strings from strerror or static strings you prepared, or in a myriad of other cases.

                          so the way you’ll want to treat that is with a length and size. If you want nul-free data as an internal API guarantee, well, you’ve probably got much more specific data hygiene problems than that, so C strings aren’t the solution.

                          If you’re talking about buffers, not C strings, what is

                          dst[len] = '\0';

                          doing in your code sample above?

                          It’s why you should check the length before copying

                          Sometimes truncation is OK, sometimes it’s even desired. But in most cases I agree.

                          and why you shouldn’t write mempcpy(dest, src, n) = '\0'.

                          At least we agree on this.

                          That memcpy code

                          Just to make sure we’re on the same page: I was talking about mempcpy.

                          is not “contrived and confused” – those words don’t mean anything.

                          Sure.

                          But the real meat behind the decision isn’t the just-so story about how it obscures information,

                          I wasn’t talking about obscuring information at all.

                          it’s the weightings of these risk factors that you get from experience, and the cheap price of the alternative.

                          Evidently, my experience leads me to a different assessment of the risk factors than yours leads you. I’ve seen people treating the same fixed-size char array† as a buffer in one place and as a C string in another (bzeroed prior to copying, too), confusing (here’s that word again) another colleague who decided to pass it to printf, just to have it blow up when a maxlen string got copied into it.

                          I understand why people don’t like C strings. But then they should treat their buffers as buffers all the way, always using a ptr, len tuple (and declaring ptr as a void * for good measure) and never terminating them with nuls, to avoid confusion down the line.
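
                          A sketch of that failure mode and one way around it, assuming a hypothetical FIELD_LEN-byte field: once the field is completely full it has no terminator, so a plain %s would read past the end, while a precision bounds the read. format_field is a made-up name.

                          ```c
                          #include <assert.h>
                          #include <stddef.h>
                          #include <stdio.h>
                          #include <string.h>

                          #define FIELD_LEN 8     /* hypothetical fixed field width */

                          /*
                           * Format a FIELD_LEN-byte field that may or may not be
                           * nul-terminated.  The precision caps how many bytes %s reads,
                           * so a completely full field is never overread.
                           * Returns snprintf's result.
                           */
                          int
                          format_field(char *out, size_t outsize, const char field[FIELD_LEN])
                          {
                                  assert(out != NULL && field != NULL);
                                  return snprintf(out, outsize, "[%.*s]", FIELD_LEN, field);
                          }
                          ```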

                          Now, if you do choose to use C strings, I don’t understand how strlcpy is “almost never useful”.

                          † Embedded code, RAM measured in kilobytes, no malloc.

                    2. 1

                      Yes, it fails slightly faster, but it succeeds slower, as you have to walk src twice.

                      You have to “walk src twice” anyway, or you have to know the size of the string.

                      Truncation is always wrong because it destroys data, and it creates security risks that are not obvious.

                      How is this better than if (strlcpy(dst, src, sizeof(dst)) >= sizeof(dst)) return FAIL;

                      For the reason given, and because programmers don’t use strlcpy “correctly”. I looked at the first three pages of uses on a GitHub search and none of them used strlcpy correctly.

                      memcpy may be implemented as a highly optimized built-in. Is mempcpy implemented similarly?

                      Yes.

                      1. 1

                        You have to “walk src twice” anyway, or you have to know the size of the string.

                        strlcpy walks src once.

                        And I still don’t understand why, once I know the string length and make sure it fits, I can’t memcpy len+1 bytes or strcpy.

                        Truncation is always wrong because it destroys data, and it creates security risks that are not obvious.

                        Sometimes you’re dealing with a Unix filename that cannot be longer than PATH_MAX. Sometimes you’re pushing data to a socket one KB at a time. Sometimes you’re copying a string to a 16 character wide LCD display (and if you truncate, you start scrolling, i.e., copy from src+1 200ms later, or just forget it). And sometimes you have no malloc, so you have no choice but to deal with truncation somehow. “Never say ‘always’”, or however that quotation goes.
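
                        The display case might be sketched like this, assuming a hypothetical 16-column line buffer and a caller that invokes lcd_window (a made-up name) on every tick. Truncation here is the intended behaviour, not an error. Where strlcpy is available, the copy plus terminator collapses to a single strlcpy(line, msg + offset, LCD_COLS + 1) call.

                        ```c
                        #include <assert.h>
                        #include <stddef.h>
                        #include <string.h>

                        #define LCD_COLS 16     /* hypothetical display width */

                        /*
                         * Fill line (LCD_COLS + 1 bytes) with the window of msg that
                         * starts at offset, truncating on purpose.  Returns the offset
                         * to use on the next tick: offset + 1 while text remains beyond
                         * the window, else 0.
                         */
                        size_t
                        lcd_window(char line[LCD_COLS + 1], const char *msg, size_t offset)
                        {
                                size_t  len, n;

                                assert(line != NULL && msg != NULL);
                                len = strlen(msg);
                                if (offset > len)
                                        offset = len;
                                n = len - offset < LCD_COLS ? len - offset : LCD_COLS;
                                memcpy(line, msg + offset, n);
                                line[n] = '\0';
                                return len - offset > LCD_COLS ? offset + 1 : 0;
                        }
                        ```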

                        I looked at the first three pages of uses on a GitHub search and none of them used strlcpy correctly.

                        Seems like many of these uses are legitimate, either copying static strings, or copying from a smaller buffer to a bigger one. And thanks Todd it exists, otherwise they’d use strcpy (or worse, strncpy). I also looked at uses of strlcat, and most of those either know their buffer sizes or check the return value.

                        But guess what, I went through three pages of mempcpy use, and it doesn’t seem to be used for strings outside of GNU libraries.

                        1. 1

                          strlcpy walks src once.

                          In buggy code. In code without bugs, you need to know the string length, either by asking strlen() (and reading twice), or because you already had it.

                          And I still don’t understand why, once I know the string length and make sure it fits, I can’t memcpy len+1 bytes or strcpy.

                          Once you know the string length, you do not need to use strcpy.

                          You can use memcpy with len+1 if you want.

                          Sometimes you’re dealing with a Unix filename that cannot be longer than PATH_MAX.

                          Truncating may now be a security hole, whereas before the system would generate an error.

                          Sometimes you’re pushing data to a socket one KB at a time.

                          You do not need to copy in this situation. writev() is better.
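
                          For reference, a minimal sketch of the writev(2) approach: two pieces are handed to the kernel in one call without being concatenated first. send_two is a hypothetical name, and the sketch assumes a POSIX system and ignores short writes.

                          ```c
                          #include <assert.h>
                          #include <string.h>
                          #include <sys/types.h>
                          #include <sys/uio.h>
                          #include <unistd.h>

                          /*
                           * Write two separate C strings to fd with a single writev(2)
                           * call, without first copying them into one buffer.  Returns
                           * the byte count from writev, which may be short, or -1 on error.
                           */
                          ssize_t
                          send_two(int fd, const char *head, const char *body)
                          {
                                  struct iovec    iov[2];

                                  assert(head != NULL && body != NULL);
                                  iov[0].iov_base = (void *)head; /* writev only reads it */
                                  iov[0].iov_len = strlen(head);
                                  iov[1].iov_base = (void *)body;
                                  iov[1].iov_len = strlen(body);
                                  return writev(fd, iov, 2);
                          }
                          ```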

                          Sometimes you’re copying a string to a 16 character wide LCD display (and if you truncate, you start scrolling, i.e., copy from src+1 200ms later, or just forget it).

                          There are no LCD displays I’m aware of that do this. They either have a memory mapped register that reads from a particular address, a fixed buffer, or it is a serial interface. In every one of these cases, strlcpy() is inappropriate, and even in your case it’s not ideal.

                          And sometimes you have no malloc,

                          I don’t use the system malloc() because I care about performance, and strlcpy() has nothing to do with this.

                          “Never say ‘always’”, or however that quotation goes.

                          Except when it’s always.

                          Seriously.

                          Don’t use strlcpy().

                          But guess what, I went through three pages of mempcpy use, and it doesn’t seem to be used for strings outside of GNU libraries.

                          So what? Every single use is a correct use because it’s a better interface.

                          You don’t have to use a GNU extension, but you shouldn’t use a stupid API.

                          1. 1

                            In buggy code. In code without bugs, you need to know the string length, either by asking strlen() (and reading twice), or because you already had it.

                            strlcpy has a return value for a reason. I may want to check it. And I’d rather write two lines using strlcpy than four lines (or five, including variable definition) using memcpy every time, and optimize buffer walking for the more common scenario where it doesn’t overflow.

                            Or maybe I actually want to bound the string, because I’m assembling an SMS saying, “failed to process your SMS:[as much of the failed SMS as will fit]” in a buffer that will be sent later. (Likewise, “Mail delivery failed, here are the first N KB of the original message”.)
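
                            A sketch of the SMS case, assuming snprintf is acceptable in place of strlcpy/strlcat (it bounds and terminates the output in the same way); build_reply is a hypothetical name.

                            ```c
                            #include <assert.h>
                            #include <stddef.h>
                            #include <stdio.h>

                            /*
                             * Build an error reply that quotes as much of the failed
                             * message as fits.  snprintf truncates and always terminates
                             * when outsize > 0.  Returns 1 if the quote was cut short,
                             * 0 if it fit, -1 on error.
                             */
                            int
                            build_reply(char *out, size_t outsize, const char *failed)
                            {
                                    int     n;

                                    assert(out != NULL && failed != NULL && outsize > 0);
                                    n = snprintf(out, outsize, "failed to process your SMS:%s", failed);
                                    if (n < 0)
                                            return -1;
                                    return (size_t)n >= outsize;
                            }
                            ```

                            Here the truncation is the feature: the reply must fit the buffer, and the return value still says whether anything was lost.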

                            Truncating may now be a security hole, whereas before the system would generate an error.

                            Maybe I got the filename from the system or another source where I know it’s valid, so it’s safe, and I use strlcpy as the last line of defence against future programming errors. Or maybe not, so I have to check the return value, avoiding an allocation and failing earlier.

                            You do not need to copy in this situation. writev() is better.

                            Unless I don’t want to block. Or I have a thread generating data line by line, and another pushing it to the GSM modem a kilobyte at a time, with some encoding in between, using static buffers on a device with 16KB of RAM (real life example).

                            There are no LCD displays I’m aware of that do this.

                            I may be copying the string for another thread to do the writing. And yes, I have a valid reason not to pass it the original string.

                            I don’t use the system malloc() because I care about performance,

                            I don’t get what you’re trying to say here. I was talking about the situation where no memory allocation is possible. But if you can’t tolerate string truncation and your system has memory allocation, you would want to allocate memory somehow, whether you use system-supplied malloc or not.

                            (also, get a better system.)

                            So what? Every single use is a correct use because it’s a better interface.

                            So mempcpy, never used but always correctly, is better than strlcpy that prevents buffer overflows (which always are security holes, unlike truncation) but is sometimes used incorrectly? I disagree.