1. 31
  1.  

  2. 9

    This is a great article. I loved it. I generally agree with all of it, and actually practice most of it as well. I do have two pieces of critical feedback though. :-)

    Firstly, this doesn’t talk about error handling. For the most part, there isn’t much to it, but there are definitely interesting pits that one can fall into. e.g., If you need to perform batch operations that can partially succeed or the correct translation of errors into corresponding HTTP errors. It’s easy to let errors nest, so simple type switching can be problematic, although the github.com/pkg/errors can help with that via its causal chains.

    Secondly, I think this article gives the standard party line on interfaces, but it is incomplete IMO. And not in a pedantic way, but importantly incomplete. I wrote about this more here: https://lobste.rs/s/9ots3k/advanced_testing_techniques_go#c_1ueuxw

    1. 8

      Thanks for the feedback. I chose talking points based on my experience of what tends to trip folks up, and in my experience, error handling hasn’t really been an issue. Tell people about pkg/errors.Wrap and they’re pretty much good to go.

      Also, we’ve spoken about this aspect of interfaces before; I still believe we work in sufficiently different codebases. That is:

      The only real issue here is that you want everyone else that uses this storage system to be able to also use the in-memory implementation as well. The most natural way to do that is to define an interface.

      In the line-of-business code I’m most often exposed to, packages rarely have more than one consumer; often, it’s just package main, plus relevant tests. In that case, where the interface definition lives is essentially a coin-flip, and (IMO) keeping the consumer contract with the consumer is the best default way to encode the abstraction boundary.

      Or, if the package does have multiple consumers, most often each consumer is using a different subset of functionality. For example, a shared repository type might be a dependency to two components, but one writes, and the other reads. In this case, it still makes the most sense to define downscoped consumer interfaces.

      If I did have multiple packages using the complete methodset of a shared type, I would absolutely define the interface with the implementations. I just can’t think of many (any?) times in my industrial programming experience where that’s been the case.

      1. 3

        Interesting. I forgot we talked about this before. :-) We have many packages that make use of said storage layers. e.g., There are lots of different things that want to communicate with out Elasticsearch cluster for a variety of different reasons, but we can manage to button up all of the Elasticsearch specific stuff for a particular data model behind a layer of encapsulation. The interface for it is quite large, as you might imagine. :-)

        RE error handling: it hasn’t been a major source of frustration, but it definitely hasn’t been zero either. It is very easy for a type switch to not do what you think it’s doing when needing to perform error conversion or when needing to “cleanse” errors such that you aren’t showing, say, transient Elasticsearch errors in your HTTP 500 payload. We’re probably approaching it in the wrong way, honestly.

        1. 2

          We’re probably approaching it in the wrong way, honestly.

          In the Go kit view of the world, there is a strict separation between business domain and transport (e.g. HTTP). One common operation is translating business domain errors to and from transport semantics. The pattern that seems to work best is to do it explicitly, via some kind of mapping. That is,

          func httpError(biz error) (code int, err error) {
              if biz == service.ErrBadCredentials {
                  return http.StatusNotAuthorized, biz
              }
              if biz == service.ErrInvalidRecord {
                  return http.StatusNotFound, biz
              }
              switch e := biz.(type); e {
              case service.DependencyError:
                  return http.StatusBadGateway, e.Error
              case service.RegionError:
                  return 451, errors.New(e.Description)
              }
              return http.StatusInternalServerError, biz
          }
          

          This behavior can be implemented as a free function defined in the transport package, as above, which is what I generally recommend. Or, if the service only really speaks one transport, it may make sense to “splat” the logic out into a common interface, which all business errors can implement:

          type HTTPError interface {
              Code() int
              Err() error
          }
          

          This approach technically violates separation of concerns and the inward-facing dependency rule, but it can be a useful hack in some circumstances.

          Maybe this gives you some ideas for how to handle crossing the Elasticsearch/business domain boundary.

          1. 2

            Thanks! I’ll ponder that. :-)

    2. 6

      I’m not a Go programmer (nor interested in becoming one) but this was a fantastic article. I think it’s almost perfectly generalizable to other toolchains. The part about observability is in particular excellent.

      1. 3

        I’ve speculated in the past about a theoretical configuration allowed env vars and config files in various formats, on an package which mandated the use of flags for config, but also opt-in basis. I have strong opinions about what that package should look like, but haven’t yet spent the time to implement it. Maybe I can put this out there as a package request?

        You ever seen: https://github.com/spf13/viper ?

        BTW, great article. I’m in a spot where I’m writing new tooling in Go for a team semi-familiar with Go. We’re all learning as we go along. The testing section is great - I need to implement some better testing. I feel like most of the time my code is not testable, but as Mitchell Hashi says in the talk you link, that’s probably more of a code smell than an actual untestable situation.

        1. 3

          As a budding go programmer just out of college this has been one of the most useful posts to help me understand the language and best practices, especially that scatter/gather pattern. Thanks for spending time preparing for the presentation and kudos for posting it here!

          1. 7

            A good resource for additional patterns is https://gobyexample.com/.

          2. 2

            Human brains are really bad at reasoning about the correctness of concurrent systems. Even when we use files in simple ways we are usually skirting several race conditions that just happen not to pop up while running the code on a single laptop for a few days. After starting to write my networked code in ways that let me do randomized partition testing in-process, I’ve been so humbled by the dramatic number of bugs that pop out when you just start to impose possible-yet-unanticipated orders of execution on concurrent systems.

            I’ve started to expand the approach to file interfaces that record writes since the last fsync, and can be “crashed” during tests at different points. Concurrent algorithms that can be deterministically replayed in any interleaving of cross-thread communication points. This has been fairly straightforward on systems that run on top of libpthread, but I’ve been struggling to apply these techniques to existing go codebases, where concurrency is not interactive from go programs themselves. External instrumentation of the process at runtime from ptrace gets gnarly really quickly.

            I wish that as a language that encourages the usage of concurrency more than most others, go also embraced understanding its concurrent behavior more. Does anyone know of better options than using rr’s chaos mode on a go system after forcing all goroutines to LockOSThread?

            1. 3

              Do you already know about TSAN? https://golang.org/doc/articles/race_detector.html

              1. 1

                TSAN and the go race detector are similar (I think Dmitry Vyukov might have been involved in creating tsan, in addition to the go race detector) and they work by instrumenting points in the code in similar ways as what I mentioned, but I’m interested in further applying these techniques to actually cause different scheduled interleavings (and ideally having deterministic replay) to gain more confidence in the implementation. Just running TSAN or go -race will only catch synchronization issues (tsan has false positives because it can’t reason about bare or implicit memory fences, but go -race doesn’t have this issue because you can’t express these memory semantics in native go) that your execution happens to stumble on while running an instrumented binary. Sometimes the mere instrumentation of the binary causes timing changes that prevent race conditions from popping out as often as they would in production. I want to force diverse executions (ideally fully exhaust the interleaving space for small bounded execution combinations) in addition to just detecting issues.

                1. 1

                  Are you thinking of something like Jepsen?

                  1. 2

                    Jepsen is thousands to millions of times slower than in-process network partition simulation, takes a month to stand up if you already have clojure skills, and it does not result in deterministically-replayable regression tests for issues it finds. It’s fairly expensive to run on every commit, and you can’t afford to block devs on jepsen runs in most cases. Here’s an example of what I’m talking about for network testing applied to a CASPaxos implementation. In 10 seconds it will find way more bugs than jepsen usually will in 10 hours, so you can actually have devs run the test locally before opening a pull request.

                    1. 1

                      What you’re describing reminds me of the FoundationDB simulation testing talk.

                      They instrumented their C++ codebase to remove all sources of non-determinism when running under test, so they could test their distributed database and replay failures, stressing it in exactly the way you’ve described.

                      Let me try to find a link.

                      Edit: I believe this is it: video

                      1. 2

                        Yeah! This talk was really inspiring to me, and I’ve been pushing to test more things in this way. There have been some advances since that talk that address some of the complexity downsides he mentions, particularly lineage-driven fault injection. LDFI is the perfect way to make a vast fault-space tractably testable, by looking at what goes right and then introduces failures from there.