1. 9

  2. 4

    Isn’t stdout technically a “strictly bounded queue” as well? I have seen situations where stdout was “wired” to stdin on some other process, and the reader process crashed (most recent case was journald). The OS (Linux as I recall) dutifully buffered the stdout (I think glibc default is 64k?) until it couldn’t any more… then the program (a web app in the most recent case I recall) blocked trying to write to stdout (waiting for IO to complete).
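
    For anyone who wants to reproduce that behaviour, here is a minimal sketch (assuming the Linux default pipe capacity of 64 KiB, and using sleep as a stand-in for a reader that never actually reads):

    import subprocess

    # The child never reads its stdin, so once the kernel pipe buffer fills
    # up, the writes below stop completing and the parent blocks.
    child = subprocess.Popen(["sleep", "3600"], stdin=subprocess.PIPE)
    chunk = b"x" * 65536
    while True:
        child.stdin.write(chunk)   # blocks after roughly one pipe buffer's worth
        child.stdin.flush()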

    1. 3

      Sure, writing to a pipe on stdout that belongs to a process that is stuck will also block. Typically though, the other end of stdout is connected to some other completely separate process that takes care of doing something useful with whatever information you put on it.

      The problem here isn’t that “writing to a queue can block”; it’s that “writing to a queue whose ends both live inside the same process (in a programming language where reentrancy is relevant because of things such as signals or finalisers) can deadlock”, because it’s plausible to have a situation in which something that happens in a finaliser causes the thread which is supposed to be draining the queue to also emit messages to it.

      If you wanted to cause the same kind of deadlock with file descriptors, you’d do roughly:

      import os

      r, w = os.pipe()
      read_end = os.fdopen(r)
      write_end = os.fdopen(w, "w")

      def log_message(message):
          # blocks once the pipe buffer is full and nobody is reading
          write_end.write(message + "\n")
          write_end.flush()

      def in_reader_thread():
          while True:
              message = read_end.readline()
              save_to_somewhere_external(message)

      which would have exactly the same problem.

      If you wanted to be REALLY silly you could even dup2() the ends of that pipe over your stdout and stdin file descriptors.
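
      Roughly (a sketch of the silly version; it points the process’s own stdout at a pipe that the same process would then have to drain):

      import os, sys

      r, w = os.pipe()
      # fd 1 (stdout) now feeds the pipe and fd 0 (stdin) reads from it,
      # so every print() lands in a queue this process must empty itself.
      os.dup2(w, sys.stdout.fileno())
      os.dup2(r, sys.stdin.fileno())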

    2. 3

      I’m not actually sure if the python tag belongs on this. The simulation code inside is written in Python but the issue being discussed and the vast majority of the content are not at all specific to Python.

      1. 5

        I wrote a TLA+ spec to model the system. As you mentioned, you can actually take the GC procedure out of the write thread. Another possible solution is to have a separate gc message queue that is bounded but drops messages instead of blocking.
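
        The drop-instead-of-block variant is cheap to sketch in Python (hypothetical names, standard library queue module):

        import queue

        gc_log_queue = queue.Queue(maxsize=100)  # bounded, separate from the main log queue

        def log_gc_message(message):
            try:
                # put_nowait never blocks; if the queue is full the message is
                # simply dropped, so a finalizer can never wedge whichever
                # thread it happens to run on.
                gc_log_queue.put_nowait(message)
            except queue.Full:
                pass  # losing a log line is the price of never deadlocking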

        1. 1

          Neat!

          One thing I don’t understand: why remove the GC procedure from the write thread? I guess it doesn’t make any difference to the model output, but doesn’t removing it make it harder to justify the model as being analogous to real-world system behaviour? After all, the GC procedure in the write thread was put into the simulation as an analogue for the fact that real applications will generate garbage while processing requests.

          1. 2

            I pulled the GC procedure from the write thread because you can reproduce the bug without it. Assuming the queue size is 2, write-write-read-write-gc will deadlock. So even when just the read thread is generating gc messages, that’s still enough to cause problems.

            Any solution implemented would of course have to also work when the write thread can run GC, which we can do by adding a call to garbage_collector() to the write process. Just demonstrating that you can simulate the same deadlock problem in a simpler system.
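
            Walking that trace with Python’s own queue type shows the full-queue state at the gc step (a sketch; in the real bug the final put comes from a finalizer running on the read thread, so it blocks instead of raising):

            import queue

            q = queue.Queue(maxsize=2)
            q.put("req 1")          # write
            q.put("req 2")          # write
            q.get()                 # read
            q.put("req 3")          # write -> queue is full again
            try:
                q.put_nowait("gc")  # the gc message the read thread wants to enqueue
            except queue.Full:
                print("a blocking put() here would never return: deadlock")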

            1. 1

              Just demonstrating that you can simulate the same deadlock problem in a simpler system.

              Aha, thank you. :)

      2. 2

        As far as I can understand, this has nothing to do with bounded queues but everything to do with a particular GC implementation + logging in finalizers? Although doesn’t the world know, already, that doing anything side-effectful in a finalizer is a terrible idea?

        It’s also unclear to me how using stdout solves this problem unless you assume there is nothing between your stdout call and the syscall happening.

        1. 2

          The combination that goes wrong is:

          • bounded queue
          • the other end is in the same process
          • it’s used for logging, including when things fail

          The only thing specific about the GC implementation that matters is that finalisers can run on any thread. As far as I’m aware, this is a fact that is true of the GC implementations in Python, Java and C# at least.

          It is known that doing anything side-effectful in a finaliser is a bad idea, but that doesn’t mean that it doesn’t happen. In a large system with a lot of authors, you will eventually end up with a buggy finaliser that throws an exception, and if you don’t record when that happens then you will never get to fix it.

          Using stdout “solves” this problem in that the other end of stdout is typically in some entirely different process.
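
          That fact is easy to see in CPython (a sketch; __del__ runs on whichever thread drops the last reference, which needn’t be the thread that created the object):

          import threading

          class Noisy:
              def __del__(self):
                  # Runs on whatever thread happens to trigger collection.
                  print("finaliser ran on", threading.current_thread().name)

          holder = [Noisy()]      # created on the main thread

          def worker():
              holder.clear()      # last reference dropped here, so __del__ runs here

          t = threading.Thread(target=worker, name="worker-thread")
          t.start()
          t.join()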

          1. 1

            The combination that goes wrong is:

            You forgot to put “writing to the queue from a finalizer” in your list, which seems like the most important part.

            The only thing specific about the GC implementation that matters is that finalisers can run on any thread. As far as I’m aware, this is a fact that is true of the GC implementations in Python, Java and C# at least.

            Maybe you can demonstrate the problem in those languages but this would only be a problem in a stop-the-world scenario. Java and C# can stop the world but, IIRC, usually not for the actual collecting so I’m not sure if that is a problem there.

            It is known that doing anything side-effectful in a finaliser is a bad idea, but that doesn’t mean that it doesn’t happen. In a large system with a lot of authors, you will eventually end up with a buggy finaliser that throws an exception, and if you don’t record when that happens then you will never get to fix it.

            I’m not really following. I have a buggy finalizer that does not do logging but without logging I cannot determine it’s buggy? Or I have a buggy finalizer that does do logging but I don’t have any guarantee that it does it correctly? In any case, I think fixing the small chance of a buggy finalizer is better than completely rearchitecting my logging which may or may not work in the particular situation.

            1. 2

              Maybe you can demonstrate the problem in those languages but this would only be a problem in a stop-the-world scenario. Java and C# can stop the world but, IIRC, usually not for the actual collecting so I’m not sure if that is a problem there.

              I have no idea why you’re talking about the specifics of how the GC works. “Stop the world” is not a phrase that is relevant. All that matters here is whether it can, at arbitrary times, cause finalisers to be invoked on arbitrary existing threads (true at least in Python, and I think other PLs too).

              I have a buggy finalizer that does not do logging but without logging I cannot determine it’s buggy?

              You have a buggy finaliser. It throws an exception. Nothing explicitly catches it. Where does the stack trace and error report go?

              If the answer is “there’s a top level exception handler which puts it into the same logging system the rest of my code uses” then you’ve got the possibility of writing a log message from a finaliser, even though you didn’t explicitly write code to write a log message from the finaliser.

              If the answer is “the exception is swallowed silently” then you’re never going to find out that the finaliser threw an exception. You won’t get to fix the finaliser. Whatever it was meant to do is not being done and you don’t know that it isn’t being done.

              If the answer is “the exception is written to somewhere other than the same logging system the rest of my code uses (such as directly to stderr)” then you’re not going to think to look in both places and this collapses to the same problem as swallowing the exception.
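
              For Python specifically, here is a sketch of how the “top level exception handler” case comes about (assuming CPython 3.8+, where exceptions escaping __del__ are handed to sys.unraisablehook rather than propagated):

              import logging, sys

              log = logging.getLogger("app")

              def report_unraisable(unraisable):
                  # Called on whichever thread the failing finaliser ran on, so
                  # this is "a log message written from a finaliser" even though
                  # nobody wrote an explicit logging call inside __del__.
                  log.error("finaliser failed for %r: %s",
                            unraisable.object, unraisable.exc_value)

              sys.unraisablehook = report_unraisable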

              1. 1

                I have no idea why you’re talking about the specifics of how the GC works. “Stop the world” is not a phrase that is relevant. All that matters here is whether it can, at arbitrary times, cause finalisers to be invoked on arbitrary existing threads (true at least in Python, and I think other PLs too).

                The crux of your post is that when one thread gets stuck in a finalizer no other thread can run, therefore deadlock. But this only applies in a language with a GIL, like Python, or one where the GC stops all threads. But this doesn’t necessarily apply to Java or C# which have incremental and concurrent collectors. So just because one finalizer is blocked does not mean the rest of the app won’t go on its merry way, and possibly even unblock the finalizer thread.

                If the answer is “there’s a top level exception handler which puts it into the same logging system the rest of my code uses” then you’ve got the possibility of writing a log message from a finaliser, even though you didn’t explicitly write code to write a log message from the finaliser.

                If I throw an exception in a finalizer, does all code after that get executed in a GC context? That seems quite odd to me and I tried to modify your blog post to test this and could not get it to block. I would expect that once an exception is thrown from a finalizer, the context goes back to the mutator. Do you have any documentation or a testcase that shows that to not be true?

                1. 1

                  The crux of your post is that when one thread gets stuck in a finalizer no other thread can run

                  No. The crux of my post is that if the specific thread which is responsible for draining the logging queue gets stuck in a finaliser, because that finaliser tried to add to the logging queue, then that one thread is deadlocked. It can’t return from the finaliser since it’s blocked waiting to write to the queue; because it doesn’t return from the finaliser, it doesn’t return to the routine which would have removed an item from the queue. “The ice-pick I need to break through this ice is stuck under a wall of otherwise impenetrable ice ☹.”

                  Since the log queue draining thread is stuck, eventually, every other thread that tries to log a message will be, too. This doesn’t necessarily happen immediately.

                  I tried to modify your blog post to test this and could not get it to block

                  Python does not stop other threads when a finaliser is running in some thread. The thread that the finaliser is run on is blocked until the finaliser returns because the finaliser is just pushed onto that thread’s stack and jumped to, more or less just like any other function call.
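
                  If it helps, here is a minimal reproduction of that wedged state (a sketch; the names are made up and an Event is used purely to make the interleaving deterministic):

                  import queue, threading

                  log_queue = queue.Queue(maxsize=1)
                  queue_refilled = threading.Event()

                  class Garbage:
                      def __del__(self):
                          # The finaliser logs too. When it runs on the draining
                          # thread while the queue is full, this put() can never
                          # complete: the only thread that could drain the queue
                          # is the one stuck right here.
                          log_queue.put("finaliser log message")

                  def consumer():
                      while True:
                          message = log_queue.get()
                          queue_refilled.wait()   # producer has filled the queue again
                          g = Garbage()
                          del g                   # __del__ runs on this thread and blocks

                  threading.Thread(target=consumer, daemon=True).start()
                  log_queue.put("message 1")      # consumer picks this up
                  log_queue.put("message 2")      # queue is full again
                  queue_refilled.set()            # let the consumer drop its garbage now
                  log_queue.put("message 3")      # blocks forever: the consumer is wedged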

                  1. 1

                    No. The crux of my post is that if the specific thread which is responsible for draining the logging queue gets stuck in a finaliser, because that finaliser tried to add to the logging queue, then that one thread is deadlocked.

                    Brain fart on my side!

                    Python does not stop other threads when a finaliser is running in some thread. The thread that the finaliser is run on is blocked until the finaliser returns because the finaliser is just pushed onto that thread’s stack and jumped to, more or less just like any other function call.

                    My point was that the situation you were talking about was logging the exception a finalizer threw (which is different from the case in your blog post). This is a non-problem because the thread that reads the queue can catch the exception and handle it gracefully, and any other thread can catch the exception and log it. No deadlock. The only issue is if you have a finalizer that explicitly performs the logging call in the finalizer and you don’t realize it. I claim that this isn’t really a big enough problem to necessitate rearchitecting how logging works.

                    1. 1

                      the thread that reads the queue can catch the exception and handle it gracefully

                      Exceptions in finalisers don’t bubble out of the finaliser. Otherwise scheduling them onto random threads wouldn’t be transparent. The code in whatever thread the finaliser gets put onto doesn’t actually see the exception being thrown in the finaliser.

                      You could have whatever top-level try:catch: is responsible for making sure your finalisers’ errors get logged detect that it’s running in the consumer thread and do something different with it, like logging it without the queue bound.
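
                      Roughly (a sketch; the names are invented and it assumes the handler knows which thread is the drain thread, with write_log_downstream standing in for whatever the drain loop itself calls):

                      import queue, threading

                      log_queue = queue.Queue(maxsize=1000)
                      consumer_thread = None   # assumed to be set to the drain thread at startup

                      def write_log_downstream(message):   # hypothetical: whatever the drain loop does
                          print(message)

                      def log(message):
                          if threading.current_thread() is consumer_thread:
                              # We are on the drain thread itself (e.g. a finaliser's
                              # error handler landed here), so a blocking put() on our
                              # own queue could deadlock; hand the message straight to
                              # the downstream writer instead, ignoring the queue bound.
                              write_log_downstream(message)
                          else:
                              log_queue.put(message)   # other threads may block as usual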