1. 24
  1. 10

    Note: some controversial, and somewhat divisive, opinions follow.

    Some of the changes in this list are really exciting:

    • redefinition of {u,}intmax_t to allow implementations some leeway for bigints and extended integer types (maybe I can finally have uint128_t!)
    • enum type-specifiers for better control over storage allocation (I get moar control!)
    • clarification that anonymous structures inside unions maintain their structure (an ambiguity that’s nice to have resolved since the committee seems to have a consensus that this was always intended)
    • decimal floating-point arithmetic extensions (kiss some rounding error goodbye!)

    I am incredibly excited for some of these (and I have a few things I’d like to eventually propose to the committee myself, though I don’t think I’ll even have them hammered out by the time C2x is standardized). But there’s something that, to me, feels almost sinister going on underneath some of the proposals; namely, N2269 and nullptr.

    So, here’s what happened: WG14 (the working group behind the C standard) stated in the C2x charter a renewed set of principles to guide the language moving forward. One of those principles was minimizing deviations from modern C++. This, to me, is a horrible idea. C++ hasn’t been a strict superset of C since C99. Maintaining source-code compatibility is strictly impossible without changes to the C++ standard or removals from the C language which are widely used. Furthermore, if C balloons into C++, it will be a tragedy.

    • N2269 (adding attributes) is an interesting idea, but it explicitly chooses not to use _Pragma (already part of the language) because it’s too limited in power, nor _Attribute (the natural choice) because it would deviate from C++. Instead it favors new syntax (taken from C++) that includes a construct, ::, that doesn’t exist anywhere else in C.
    • nullptr. C’s definition of NULL has changed a couple of times, and now matches C++’s definition ((void * )0). As I understand it, C++ benefits from nullptr because C++ has multiple types of pointers (references, unique_ptr, etc.), templating, overloading and a variety of other tools that make using NULL ambiguous or unsuitable. C has exactly none of these, and I seriously hope it never shall.

    These changes are not yet voted on, as far as I know, but I seriously hope the Committee is wise enough to not take “minimize deviations from C++” to mean “implement everything that a C++ programmer might want in C”. The effects would be disastrous otherwise.

    1. 6

      I mostly agree – although I think that decimal floating point is also a mistake. It seems appealing, but it’s almost never what you want. The #1 thing that I want from the C committee – the gradual removal of undefined behavior – also seems to be missing here.

      1. 3

        Not only are they not removing undefined behavior, but they are adding all sorts of poorly thought-out extensions to C, like decimal floating point, each one of which comes with additional prospects of undefined behavior.

    2. 8

      One thing that seems like an oversight in C is the lack of struct comparison. You can assign

      struct s a = b;
      

      but you can’t compare them. memcmp() isn’t right because the content of padding bytes in structs is unspecified. You’re left doing field-by-field comparisons, which break every time someone adds or removes a field.
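
      A minimal sketch of the field-by-field approach, using a hypothetical struct s with three members; s_equal is an illustrative helper, not anything the standard provides. It shows why the pattern is fragile: a member added to the struct later is silently ignored until someone remembers to update the function.

      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stddef.h>

      /* Hypothetical struct; a char followed by a long usually forces padding. */
      struct s {
          char  tag;
          long  id;
          short flags;
      };

      /* Compares named members only, so padding bytes never matter. */
      static bool s_equal(const struct s *a, const struct s *b)
      {
          assert(a != NULL && b != NULL);
          return a->tag == b->tag
              && a->id == b->id
              && a->flags == b->flags;
      }
      ```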

      1. 3

        Is this a problem in practice? If I have a struct which I want to compare by equality, I will memset() it before initialization and then memcmp() it.

        I don’t want the compiler to generate code that looks at all the initialized fields and checks them one-by-one for equality. Isn’t that a lot slower?

        Ditto with hashing – memset() and then hash the raw bytes rather than look at each field. It seems worse to hash every field separately.
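
        The approach described here could be sketched as follows. struct key, key_init, key_same_bytes and fnv1a are hypothetical names, and FNV-1a is just one possible byte hash, picked for illustration. Note the assumption: this relies on padding bytes keeping the zeros that memset() wrote, which compilers tend to preserve in practice but which the language does not obviously promise.

        ```c
        #include <assert.h>
        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        struct key {
            char    kind;   /* padding likely follows on most ABIs */
            int32_t value;
        };

        /* Zero every byte, padding included, before setting any member. */
        static void key_init(struct key *k, char kind, int32_t value)
        {
            memset(k, 0, sizeof *k);
            k->kind = kind;
            k->value = value;
        }

        /* Raw byte comparison: assumes padding is still zero in both objects. */
        static int key_same_bytes(const struct key *a, const struct key *b)
        {
            return memcmp(a, b, sizeof *a) == 0;
        }

        /* 32-bit FNV-1a over an arbitrary byte range. */
        static uint32_t fnv1a(const void *data, size_t len)
        {
            const unsigned char *p = data;
            uint32_t h = 2166136261u;       /* FNV offset basis */
            for (size_t i = 0; i < len; i++) {
                h ^= p[i];
                h *= 16777619u;             /* FNV prime */
            }
            return h;
        }
        ```

        Hashing a key is then fnv1a(&k, sizeof k), with the same caveat about padding as the comparison.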

        1. 3

          The C Standard leaves padding as undefined to give leeway to the implementation to produce efficient code. Given the following:

          struct foo
          {
            char c;
            int  x;
            struct
            {
              unsigned int a : 1;
              unsigned int b : 1;
            } flags;
            int d;
          };

          struct foo f;

          memset(&g,0,sizeof(g));
          f.x = 12345;
          f.d = 32767;
          f.flags.a = 1;
          f.flags.b = 1;
          f.c = 33;
          

          While the padding around f.c might still be 0, it’s not guaranteed (although I’m not aware of any byte-addressable architecture that can’t write a single byte to memory). The unused bits in f.flags will probably not be 0 at all—I can see a compiler, with optimization, reusing a register used to initialize f.d and just set the low byte to 3 before writing to the entire field, thus leaving undefined padding in f.flags.

          That you do the memset() and memcmp() works for you is probably due to the compiler ensuring that. It doesn’t have to. And you might never encounter bad behavior with this if you don’t use bitfields and stick with byte-addressable architectures.
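
          To see where padding can sit in a layout like struct foo above, one can compare member offsets and sizes with offsetof. This is only a sketch: struct foo_layout mirrors the struct from this comment, and padding_after_c and unused_flag_bits are hypothetical helpers. The numbers they report are ABI-specific, so no particular output is assumed.

          ```c
          #include <assert.h>
          #include <limits.h>
          #include <stddef.h>

          struct foo_layout {
              char c;
              int  x;
              struct {
                  unsigned int a : 1;
                  unsigned int b : 1;
              } flags;
              int d;
          };

          /* Bytes between the end of c and the start of x: compiler-added padding. */
          static size_t padding_after_c(void)
          {
              return offsetof(struct foo_layout, x) - sizeof(char);
          }

          /* Bits in the flags storage that neither a nor b occupies. */
          static size_t unused_flag_bits(void)
          {
              return sizeof(((struct foo_layout *)0)->flags) * CHAR_BIT - 2;
          }
          ```

          Any nonzero result from either helper marks bytes or bits that memcmp() would compare but that no member assignment is required to keep stable.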

          1. 1

            memset(&g,0,sizeof(g));

            That should be f, not g, right?

            While the padding around f.c might still be 0, it’s not guaranteed

            That’s an interesting question – what about if you use a union there, like:

            struct foo {
              union {
                char c;
                char pad[4];
              };
            };
            

            Do you have a citation for the fact that storing c could change other bytes? I feel like C should not have things that work on 99% of architectures but fail on others… I have done some porting in the distant past and this isn’t one of the things that came up.

            That you do the memset() and memcmp() works for you is probably due to the compiler ensuring that. It doesn’t have to.

            Hm I wonder if this is something UBSan would catch? I am currently doing a lot of “punning” in my VM in C++. I am using unions though.

            1. 2

              That should be f, not g, right?

              Yes.

              Do you have a citation for the fact that storing c could change other bytes?

              From the C99 standard, section 6.2.6.1:

              7. When a value is stored in a member of an object of union type, the bytes of the object representation that do not correspond to that member but do correspond to other members take unspecified values.
              8. Where an operator is applied to a value that has more than one object representation, which object representation is used shall not affect the value of the result.43) Where a value is stored in an object using a type that has more than one object representation for that value, it is unspecified which representation is used, but a trap representation shall not be generated.

              footnote 43: It is possible for objects x and y with the same effective type T to have the same value when they are accessed as objects of type T, but to have different values in other contexts. In particular, if == is defined for type T, then x == y does not imply that memcmp(&x, &y, sizeof (T)) == 0. Furthermore, x == y does not necessarily imply that x and y have the same value; other operations on values of type T may distinguish between them.

              I don’t think it answers your direct question (does writing a char affect padding?) but the entire C standard is written in this tortured language to cover everything from 66-bit, 64-bit, 60-bit, 56-bit, 36-bit, 33-bit, 32-bit, 18-bit, 16-bit and 8-bit systems with sign magnitude, 1s-complement and 2s-complement, segmented memory, flat memory, separate I and D space, unified I and D space, with non-IEEE-754 and IEEE-754 floating point thrown in for a good time.

              Oh, and integer trap representations, can’t forget those (although in my 35+ years of programming I’ve never encountered one system with trapping integers, nor sign-magnitude or 1s-complement for that matter).

              Also be careful with type punning via unions. As stated in Appendix J.1 (Unspecified behavior):

              The value of a union member other than the last one stored into (6.2.6.1).

              But like duclare, I wouldn’t mind a struct compare, but that might be impossible to do correctly—do you run strcmp() over char *s? What do you do for other pointer types? Do you get back just equal/not equal, or will you get less than, equal, greater than? Or is just a memcmp()?
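
              One commonly recommended alternative to union punning is to copy the bytes with memcpy(). A minimal sketch, assuming float is 32 bits wide (checked at run time here) and using the hypothetical helper name float_bits:

              ```c
              #include <assert.h>
              #include <stdint.h>
              #include <string.h>

              /* Returns the object representation of f as a 32-bit unsigned integer. */
              static uint32_t float_bits(float f)
              {
                  uint32_t u;
                  assert(sizeof u == sizeof f);   /* assumption: 32-bit float */
                  memcpy(&u, &f, sizeof u);       /* copy bytes instead of aliasing */
                  return u;
              }
              ```

              On a platform with IEEE-754 single-precision floats, float_bits(1.0f) yields 0x3f800000; other floating-point formats would give a different bit pattern.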

              1. 1

                OK interesting. I just ran my nascent VM under ubsan and it found that my hash function left-shifts large signed integers, but nothing else.
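
                For a left-shift finding like that one, a common fix is to do the mixing in unsigned arithmetic, where overflow wraps instead of being undefined. A sketch with a hypothetical hash_step function (this is not the actual VM code, which isn't shown here), assuming the usual case where int is 32 bits:

                ```c
                #include <assert.h>
                #include <stdint.h>

                /* One mixing step: the shift happens on uint32_t, so high bits are
                   discarded with well-defined wraparound rather than overflowing. */
                static uint32_t hash_step(uint32_t h, int32_t value)
                {
                    uint32_t v = (uint32_t)value;   /* conversion to unsigned is defined */
                    return (h << 5) + h + v;        /* h * 33 + v, modulo 2^32 */
                }
                ```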

                I guess I will have to write a rigorous torture test for the parts of the code that do type punning, and run it on every platform. Trying to interpret the standard does seem a bit difficult (and even experts can disagree).

                It makes me glad that I wrote most of Oil in a high-level language… I’m hoping there will be 5K-10K lines of native code, not 130-160K like bash … The fewer parts of the program that are subject to these portability issues, the better!

                FWIW I did run into a crash bug with a union used for alignment in CPython under gcc 8! When distros started using gcc 8 instead of 7, Oil would segfault on startup in CPython code. A point release for Python 2.7 fixed it but I was still on the old one. So yes this area is pretty fraught :-/

                1. 1

                  But like duclare, I wouldn’t mind a struct compare, but that might be impossible to do correctly—do you run strcmp() over char *s? What do you do for other pointer types? Do you get back just equal/not equal, or will you get less than, equal, greater than? Or is just a memcmp()?

                  Pointer comparison is already a part of C and it does not involve calls to strcmp(), so I don’t see why pointers as members of structs should need any fancier treatment. In practice, doing the equivalent of memcmp() ignoring padding ought to be the simplest way to spec this. You still need to tighten the semantics around unions.

                  1. 1

                    Given the following code:

                    struct foo
                    {
                      int x;
                      char *s;
                    };
                    
                    struct foo a;      
                    struct foo b;
                    
                    a.x = 5;
                    b.x = 5;
                    a.s = malloc(5); strcpy(a.s,"one");
                    b.s = malloc(5); strcpy(b.s,"one");
                    
                    if (a == b) 
                      printf("true\n");
                    else
                      printf("false\n");
                    

                    does it print “true” or “false”? a.s and b.s have different pointer values, but what they point to is semantically the same.

                    1. 2

                      It prints false. This seems obvious. Nobody is asking for deep pointer chasing comparison. They want the == equivalent to memcmp, just as = is equivalent to memcpy.
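
                      Spelled out by hand, the semantics being asked for look roughly like the first function below; the second is the deep comparison nobody is requesting. Both function names are hypothetical, and struct foo repeats the definition from the parent comment.

                      ```c
                      #include <assert.h>
                      #include <stdbool.h>
                      #include <string.h>

                      struct foo {
                          int   x;
                          char *s;
                      };

                      /* What a built-in == would plausibly do: members compared one by
                         one, pointers compared by value, padding ignored. */
                      static bool foo_shallow_equal(const struct foo *a, const struct foo *b)
                      {
                          return a->x == b->x && a->s == b->s;
                      }

                      /* Deep variant: follows s and compares the strings it points to. */
                      static bool foo_deep_equal(const struct foo *a, const struct foo *b)
                      {
                          assert(a->s != NULL && b->s != NULL);
                          return a->x == b->x && strcmp(a->s, b->s) == 0;
                      }
                      ```

                      For the a and b in the parent comment, the shallow version reports them as unequal and the deep version reports them as equal.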

                      1. 1

                        What tedu said.

                        Why should pointers in a struct be compared any differently from how pointers outside a struct are compared? I see no justification. Therefore, this is how it should work:

                        char *a = malloc(5); /* strcpy(a, "one"); */
                        char *b = malloc(5); /* strcpy(b, "one"); */
                        printf(a == b ? "true\n" : "false\n");
                        

                        Besides, it simply makes no sense to make the assumption that two char pointers should compare equal if they point to strings that compare equal. That kind of assumption would break a ton of software. There are legitimate reasons to have equal strings in different locations. That’s even before you consider the possibility of having a char* that does not point to a string at all.

          2. 5

            I clicked on a few of the links that looked interesting to me, and the progress report from the C Memory Object Model Study Group stood out. The Q&A at the end, on effective types and TBAA, was well worth reading.

            The “uninitialized reads” section was also interesting. In particular, this part stood out to me (emphasis mine):

            Despite the above, several WG14 members said that the intent of the standard was to make all reading of uninitialised values (perhaps except at character type) undefined behaviour.

            Why is it that reading uninitialized data at the char level might not necessarily be UB? Or are they saying that it is UB today, but maybe it shouldn’t be?

            1. 4

              It’s unclear if the standard today actually says UB or not. As in, if there aren’t any trap values, then maybe it’s not UB?

              More reading: https://queue.acm.org/detail.cfm?id=3041020

              1. 1

                Interesting! Towards the end, it seems to suggest that reading from uninitialized chars is indeed UB:

                Consider the following code:

                void f(void) { 
                  unsigned char x[1]; /* intentionally uninitialized */ 
                  x[0] ^= x[0]; 
                  printf("%d\n", x[0]); 
                  printf("%d\n", x[0]); 
                  return; 
                }
                

                In this example, the unsigned char array x is intentionally uninitialized but cannot contain a trap representation because it has a character type. Consequently, the value is both indeterminate and an unspecified value. The bitwise exclusive OR operation, which would produce a zero on an initialized value, will produce an indeterminate result, which may or may not be zero. An optimizing compiler has the license to remove this code because it has undefined behavior. The two printf calls exhibit undefined behavior and, consequently, might do anything, including printing two different values for x[0].

                1. 2

                  Right. So the counter argument was that an indeterminate value is still one value. But this might interfere with a program where, for instance, x is in register r1 in one block and r2 in another block. The compiler must ensure it propagates the value when initialized, but may omit the move if not.

              2. 4

                So, just for one example, it is impossible to write a driver for a device using memory mapped control registers. I wonder how mmap works with this bizarre proposal?

                1. 1

                  Why is it that reading uninitialized data at the char level might not necessarily be UB?

                  Because the standard tries to reconcile a crappy version of Pascal types with C practice by using a bunch of hacky special cases. They needed the char * exception so that, in principle, it was possible to implement memcpy in C despite the lvalue rules that otherwise forbid it.

                  1. 1

                    Hmm. I guess I still have gaps in my understanding. Why does memcpy need this? It needs to only read from one region and only write to another, right? So as long as you assume the region you read from is initialized, then aren’t you okay since you only need to write to the other region, which could be uninitialized? Or is there some case I’m missing?

                    1. 1

                      You can’t copy a structure that may contain uninitialized data - and even if you initialize all fields, you may have uninitialized padding. If you implement memcpy to do this, it has to use the char * hack.

                      https://shape-of-code.coding-guidelines.com/2017/06/18/how-indeterminate-is-an-indeterminate-value/

                      There are several other places this pops up. For example, when one asks how to do a standard network function like checksum a packet structure, the usual answer from people who find the Standard appealing is that you memcpy your packed packet structure to an int array and checksum that array (packed is another problem, but … ). In order for this hack to work, memcpy has to be able to copy between different types.

                      Of course, once you allow char * to ignore the type rules, you might as well not have them at all, but …
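
                      Both uses of the character-type exception can be sketched in a few lines. copy_bytes is a hypothetical stand-in for memcpy written with unsigned char pointers, and checksum16 is an illustrative 16-bit sum (not any particular protocol's checksum) that copies the packet into an array of uint16_t before adding up the words.

                      ```c
                      #include <assert.h>
                      #include <stddef.h>
                      #include <stdint.h>
                      #include <string.h>

                      /* Byte-wise copy through unsigned char pointers, the access the
                         aliasing rules allow for any object type. */
                      static void *copy_bytes(void *dst, const void *src, size_t n)
                      {
                          unsigned char *d = dst;
                          const unsigned char *s = src;
                          while (n--)
                              *d++ = *s++;
                          return dst;
                      }

                      /* Sums a buffer as native-endian 16-bit words; in this sketch len
                         must be even and at most 64 bytes. */
                      static uint16_t checksum16(const void *packet, size_t len)
                      {
                          uint16_t words[32];
                          uint32_t sum = 0;
                          assert(len % 2 == 0 && len <= sizeof words);
                          memcpy(words, packet, len);         /* retype via copy, not a cast */
                          for (size_t i = 0; i < len / 2; i++)
                              sum += words[i];
                          return (uint16_t)(sum & 0xffffu);   /* truncating sum, no carry folding */
                      }
                      ```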

                      1. 1

                        Ah gotya. Thanks. I was forgetting about padding!

                2. 0

                  This site showed a phishing pop-up when I opened it. One of those “congratulations!!! You’ve been selected blah blah blah” messages that, when you click close, takes you to a Google-looking site that phishes for your login details.