1. 3

  2. 2

    Author here. Yeller does drop messages on the floor if it has to; it just tries really, really, really hard not to. But it doesn’t apply backpressure to the clients (which is the other choice), because that would hit them with unexpected latency.

    1. 1

      What kind of latency are you worried about? Couldn’t you just signal “not now” and have the clients send it later? I think this is how scribe does it.

      1. 1

        Sure. The worry there is that my client libraries run in other people’s production systems, and there are tradeoffs in how complex they get. I’m wary of messing up that code in a way that leaves my users dealing with bugs in those libraries. I also don’t wanna impact users’ memory/disks if Yeller goes down - I’m pretty certain they’d rather miss some error messages than have downtime or lots of GC-induced latency spikes.

        There’s also a bunch of maintenance cost. I currently maintain 4 libraries for different languages, and that number is only going to grow as time goes on.

        I’m not saying I won’t do this eventually, just that I’m worried about the tradeoffs right now, so I’m holding off.

        1. 1

          It might be worth looking at how zipkin deals with this problem. It exposes a scribe interface, so clients can either use scribe or write to it directly using thrift. Scribe is also not the only game in town, but this method works for other strategies, like kafka, too. Then your clients can decide how hard they want to work for their exceptions, and it also enables two classes of exceptions: one that is “please work really hard to not lose these”, and another that is “best effort” (maybe fatal vs. nonfatal exceptions).
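
For anyone skimming the thread, the tradeoff being discussed can be sketched roughly as follows. This is only an illustration under stated assumptions, not how Yeller, scribe, or zipkin actually work: `ErrorBuffer`, `offer`, `drain`, and the `must_keep` flag are all hypothetical names. The idea is a bounded client-side buffer that never blocks the caller (so no backpressure-induced latency), caps memory use, and sheds “best effort” reports before “please don’t lose these” ones.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ErrorBuffer:
    """Hypothetical bounded buffer; not any real client library's API."""
    capacity: int = 100
    important: deque = field(default_factory=deque)
    best_effort: deque = field(default_factory=deque)
    dropped: int = 0  # count of reports shed so far

    def __len__(self):
        return len(self.important) + len(self.best_effort)

    def offer(self, report, must_keep=False):
        """Enqueue without blocking. Returns False if this report was dropped."""
        if len(self) >= self.capacity:
            if must_keep and self.best_effort:
                # Make room by evicting the oldest best-effort report.
                self.best_effort.popleft()
                self.dropped += 1
            else:
                # Full and nothing cheaper to evict: shed the new report.
                self.dropped += 1
                return False
        (self.important if must_keep else self.best_effort).append(report)
        return True

    def drain(self, limit):
        """Remove and return up to `limit` reports, must-keep ones first."""
        batch = []
        while len(batch) < limit and self.important:
            batch.append(self.important.popleft())
        while len(batch) < limit and self.best_effort:
            batch.append(self.best_effort.popleft())
        return batch
```

A background sender would call `drain` periodically and ship the batch whenever the server is accepting traffic. Because `capacity` is fixed, an outage on the server side costs the host application a bounded amount of memory and some lost reports, rather than latency or unbounded growth.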