1. 13
  1.  

    1. 7

      @MaskRay has a good overview of a lot of the mechanics of TLS, which I found very handy when implementing a custom debugger. Opting into specific models can get you out of the business of __tls_get_addr even for SOs, but there are tradeoffs.

      Musl makes different choices than glibc for how TLS is allocated, which come closer to the author’s “shimming pthread_create” approach out of the box.
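
      For reference, a minimal sketch of opting into a specific model (the variable and function names are made up): with initial-exec, access compiles down to a GOTTPOFF load instead of a __tls_get_addr call, at the cost of the object not being safely dlopen-able. `-ftls-model=initial-exec` applies the same choice translation-unit-wide.

      ```cpp
      #include <cassert>

      // Hypothetical example: request the initial-exec TLS model for one
      // variable. Accesses avoid __tls_get_addr, but the containing SO must
      // be loaded at program start rather than via dlopen.
      __attribute__((tls_model("initial-exec")))
      thread_local int request_count = 0;

      int bump() { return ++request_count; }

      int main() {
        bump();
        bump();
        assert(request_count == 2);  // each thread sees its own counter
        return 0;
      }
      ```
      
      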

      1. 3

        I found that article indispensable in implementing thread local storage myself.

        1. 2

          just updated the post with a link to that overview and to the initial-exec TLS model, which can speed things up if you can use it.

        2. 3

          Regarding cmpb $0, %fs:__tls_guard@tpoff: the per-function-call overhead is due to the dynamic-initialization-on-first-use requirement:

          Block variables with static or thread(since C++11) storage duration are initialized the first time control passes through their declaration (unless their initialization is zero- or constant-initialization, which can be performed before the block is first entered). On all further calls, the declaration is skipped. — https://en.cppreference.com/w/cpp/language/storage_duration

          From https://maskray.me/blog/2021-02-14-all-about-thread-local-storage#c-thread_local

          If you know x does not need dynamic initialization, C++20 constinit can make it as efficient as the plain old __thread. [[clang::require_constant_initialization]] can be used with older language standards.
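
          As a hedged illustration (names invented): constinit rejects dynamic initialization at compile time, so the compiler can drop the guard check and the wrapper call entirely.

          ```cpp
          #include <cassert>

          // C++20 constinit: initialization must be a constant expression,
          // so no lazy-init guard (the cmpb against __tls_guard) is emitted
          // and accesses are as cheap as plain old __thread.
          constinit thread_local int tls_counter = 41;

          int next() { return ++tls_counter; }

          int main() {
            assert(next() == 42);
            return 0;
          }
          ```
          
          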

          Regarding data16 lea tls_obj(%rip),%rdi in the general-dynamic TLS model, yeah it’s for linker optimization. The local-dynamic TLS model doesn’t have data16 or rex prefixes.

          Regarding “Why don’t we just use the same code as before — the movl instruction — with the dynamic linker substituting the right value for tls_obj@tpoff?”

          Because -fpic/-fPIC was designed to support dlopen. You need -fpic -ftls-model=initial-exec to convey the intention that the output is not for dlopen.

          The desired efficient GOTTPOFF code sequence is only feasible when the shared object is available at program start, in which case you can guarantee the condition that “you would need the TLS areas of all the shared libraries to be allocated contiguously”:

          # x86-64
          movq ref@GOTTPOFF(%rip), %rax
          movl %fs:(%rax), %eax
          

          With dlopen, the dynamic loader needs a different place for the TLS blocks of newly loaded shared libraries, which unfortunately requires one more indirection.

          Regarding “… and I don’t say a word about GL_TLS_GENERATION_OFFSET, for example, and I could.”

          GL_TLS_GENERATION_OFFSET in glibc is for the lazy TLS allocation scheme. I don’t want to spend my valuable time on its implementation… It is almost infeasible to fix on the glibc side. Changes in this area might also break existing DSOs built with -ftls-model=initial-exec but actually dlopened.

          1. 2

            the per-function-call overhead is due to dynamic initialization on first use requirement

            Thanks - I didn’t realize this was mandated by the standard, as opposed to merely “permitted” as one possibility (similar to how, e.g., a constructor of a global variable can be called before main, upon first use, or anywhere in between according to the standard). Updated the post with this point.

            The desired efficient GOTTPOFF code sequence is only feasible when the shared object is available at program start, in which case you can guarantee that “you would need the TLS areas of all the shared libraries to be allocated contiguously”

            Indeed, I didn’t mention -ftls-model=initial-exec originally (I’ve now added it based on reader feedback; it works when it works, which for my use case is a toss-up, I guess…). But my point is that you could allocate the TLS areas contiguously even if dlopen was used, and I describe how you could do it in the post, albeit in a somewhat hand-wavy way. This is totally not how things were done, and I presume one reason is that you don’t carve out chunks of the address space for a use case like this, as my approach describes - I just think it would be nice if things worked that way.

            1. 3

              Other things.

              The C toolchain — not the C++ compiler front-end, but assemblers, linkers and such — is generally quite ossified, with decades-old linker bugs enshrined as a standard.

              I would not call the behavior a “bug”. That’s just how archive semantics are defined.

              “here’s how a function accessing 2 thread_local objects with constructors looks like:”

              Perhaps a C++ source with Class2 would be useful :) You are probably accessing multiple default-visibility TLS variables like the following.

              struct A {
                A();
                int x;
              };
              [[gnu::visibility("default")]] // also check hidden
              thread_local A a, b, c;
              
              int foo() {
                return a.x + b.x + c.x;
              }
              

              In a shared object, a default-visibility STB_GLOBAL or STB_WEAK symbol can be preempted (interposed), because a preceding component may define a symbol of the same name. (For a deeper dive, the musl/ldso/ source code might be helpful. I have described this at https://maskray.me/blog/2021-05-16-elf-interposition-and-bsymbolic, but probably not in a very clear way.)

              This symbol interposition overhead unfortunately affects thread-local variables as well. The compiler has to assume that a/b/c may resolve to different components, therefore multiple __tls_get_addr calls are needed.

              However, with hidden visibility, LLVM’s X86 backend can convert multiple general-dynamic TLS accesses to the local-dynamic TLS model. This optimization isn’t universally available across all architectures, though…
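
              A hedged variant of the snippet above (class name invented): with hidden visibility the symbols cannot be interposed, so a compiler is free to fold the three lookups into one local-dynamic TLS lookup plus constant offsets.

              ```cpp
              #include <cassert>

              struct B {
                B();
                int x;
              };
              B::B() : x(1) {}  // out-of-line ctor forces dynamic TLS init

              // hidden: not preemptible, so the accesses below may share one
              // local-dynamic lookup instead of three general-dynamic ones
              [[gnu::visibility("hidden")]]
              thread_local B d, e, f;

              int bar() { return d.x + e.x + f.x; }

              int main() {
                assert(bar() == 3);
                return 0;
              }
              ```
              
              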

              1. 1

                I would not call the behavior a “bug”. That’s just how archive semantics are defined.

                I called it “a bug enshrined as a standard”, so it’s a matter of definitions. eg man strtok lists a bunch of its behaviors under the BUGS section, even though these behaviors are standard-compliant. It is not possible to “objectively” resolve a dispute like this since it’s a question of whether you find the behavior “reasonable,” not whether it violates the specification.

                with hidden visibility, LLVM’s X86 backend can convert multiple general-dynamic TLS access to the local-dynamic TLS model.

                Indeed; g++ doesn’t seem to do it. I’ll try to update the post with this.

            2. 2

              Block variables with static or thread(since C++11) storage duration are initialized the first time control passes through their declaration (unless their initialization is zero- or constant-initialization, which can be performed before the block is first entered). On all further calls, the declaration is skipped. — https://en.cppreference.com/w/cpp/language/storage_duration

              Wait, but isn’t this about block variables specifically (i.e., ones declared within a function)?

              I assume that this doesn’t apply to file-scope variables?

              1. 1

                I think you are right:

                3.7.2/2 [basic.stc.thread]: A variable with thread storage duration shall be initialized before its first odr-use (3.2) and, if constructed, shall be destroyed on thread exit.

                – same as with normal globals, essentially, though the g++/clang implementation made different choices (globals don’t have guard variables for lazy initialization, and thread_locals do)

                Fixing the post again…
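
                A small sketch of that difference (names invented): a plain global with a constructor is initialized before main with no guard on access, while an equivalent thread_local is initialized lazily, on first odr-use in each thread, through a compiler-generated guard and wrapper.

                ```cpp
                #include <cassert>

                struct Marker {
                  Marker() : v(7) {}
                  int v;
                };

                Marker g;               // constructed before main; no guard
                thread_local Marker t;  // constructed on first odr-use per thread

                int main() {
                  assert(g.v == 7);
                  assert(t.v == 7);  // this odr-use triggers t's construction
                  return 0;
                }
                ```
                
                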

            3. 2

              Note that this is all very x86-specific. Other ISAs have much easier and faster access to thread-local storage; e.g., on RISC-V the x4 register (tp) is a pointer to the current thread’s thread-local storage, so for up to 4K of data all you have to do is lw ...,offset(tp), just as you access normal globals at an offset from x3 (gp).

              1. 4

                What RISC-V does looks similar to what x86 does with the %fs register; I think most ISAs reserve a register for the TLS base address. In fact, I don’t think there’s even one “really” x86-specific thing here (e.g. movl is very CISCy, and on a RISC machine you wouldn’t have one instruction doing what movl does, but you’d have a short instruction sequence doing the same, and it would still be the fastest TLS access model).