Threads for asymptote

  1. 38

    I work for AWS, my opinions are my own.

    This is a good retrospective on the Slack outage. The initial Slack response privately sent to customers within a week was extremely disappointing and reduced my confidence in Slack. As a customer, I would ask Slack to just skip the inadequate initial response and send out a deep dive of this quality regardless of how long it takes.

    That being said, in my opinion, there are issues with Slack’s architecture and data schema that are not addressed by the short-term or long-term actions at the bottom. So…please accept my personal two cents.

    If data isn’t in the cache, or if an update operation occurs, the Slack application reads it from the Vitess datastore and then inserts the data into Memcached…Furthermore, membership of GDMs is immutable under the current application requirements, so there is a long cache TTL, and therefore the data is almost always available via the cache…Other more long-term projects are exploring other ways of increasing the resilience of our caching tier.

    You have created a bi-modal system. In one mode, everything is peachy: the cache is full, latencies are low, and only sometimes do you hit the datastore. In the second mode, after a surge of unexpectedly cold traffic, an unexpected deployment, or a cold restart of your system, your cache is useless and you hammer your data store. Modes are bad. You want one mode. On top of that you are suffering from scale inversion: the request volume and the cache fleet are large enough to exceed your data store’s capacity and wipe it out.

    As per https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems/:

    Falling back to direct database queries was an intuitive solution that did work for a number of months. But eventually the caches all failed around the same time, which meant that every web server hit the database directly. This created enough load to completely lock up the database…The thinking behind our fallback strategy in this case was illogical. If hitting the database directly was more reliable than going through the cache, why bother with the cache in the first place? We were afraid that not using the cache would result in overloading the database, but why bother having the fallback code if it was potentially so harmful? We might have noticed our error early on, but the bug was a latent one, and the situation that caused the outage showed up months after launch.

    So questions to ask yourself:

    • Vitess is magical, scales horizontally, and supports read replicas. If it’s so magical, why can’t you scale it horizontally and replicate it globally to reduce latency? Why do you need a caching layer at all?
    • If you need a caching layer, you are caching immutable, stable data. Why are you pulling it and not pushing it? If it is immutable, why can’t you reload it from disk or from an object store instead of refreshing it?
    • If I have designed a system with modes, how do I regularly test the system in all of its modes? The modes are orthogonal axes in a space of system configurations… how do I find the corners and test them? Can I test the corners? If I can’t test the corners, maybe I should eliminate some axes?

    Client retries are often a contributor to cascading failures, and this scenario was no exception. Clients have limited information about the state of the overall system. When a client request fails or times out, the client does not know whether it was a local or transient event such as a network failure or local hot-spotting, or whether there is ongoing global overload. For transient failures, prompt retrying is the best approach, as it avoids user impact. However, when the whole system is in overload, retries increase load and make recovery less likely, so clients should ideally avoid sending requests until the system has recovered. Slack’s clients use retries with exponentially increasing backoff periods and jitter, to reduce the impact of retries in an overload situation. However, automated retries still contributed to the load on the system.

    It does not necessarily follow that limited local information prevents appropriate inference about global state. You can use local circuit breakers for an automated approach, as per https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/:

    Load. Even with a single layer of retries, traffic still significantly increases when errors start. Circuit breakers, where calls to a downstream service are stopped entirely when an error threshold is exceeded, are widely promoted to solve this problem. Unfortunately, circuit breakers introduce modal behavior into systems that can be difficult to test, and can introduce significant additional time to recovery. We have found that we can mitigate this risk by limiting retries locally using a token bucket. This allows all calls to retry as long as there are tokens, and then retry at a fixed rate when the tokens are exhausted. AWS added this behavior to the AWS SDK in 2016. So customers using the SDK have this throttling behavior built in.
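
    To make that concrete, here is a rough sketch of the token-bucket idea in Python (not the SDK’s actual implementation; the class name, capacity, and refill rate are made up):

    import random
    import time

    class RetryTokenBucket:
        """Cap retries locally: retries spend tokens, tokens refill at a fixed rate,
        so under sustained failure retries degrade to a slow, steady trickle."""

        def __init__(self, capacity: float = 10.0, refill_per_second: float = 0.5):
            self.capacity = capacity
            self.refill_per_second = refill_per_second
            self.tokens = capacity
            self.last_refill = time.monotonic()

        def try_acquire(self, cost: float = 1.0) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_per_second)
            self.last_refill = now
            if self.tokens >= cost:
                self.tokens -= cost
                return True
            return False  # bucket empty: give up instead of piling onto an overloaded backend

    retry_bucket = RetryTokenBucket()

    def call_with_limited_retries(request, max_attempts: int = 3):
        for attempt in range(max_attempts):
            try:
                return request()
            except Exception:
                # The first attempt is free; every retry has to pay a token.
                if attempt + 1 == max_attempts or not retry_bucket.try_acquire():
                    raise
                time.sleep(min(2 ** attempt, 20) * random.random())  # backoff with jitter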

    So questions to ask yourself:

    • Can you update the library that you use to hit the datastore to automatically implement a token-bucket for retries then circuit break?
    • What additional metrics and alarming would you add to be able to monitor this?
    • What runtime configuration could you implement to automatically trip all circuit breakers?
    • How could you regularly test all of this?

    RE: data schema:

    An alternative is to dual-write the data under a different sharding strategy. We also have a keyspace that has membership of a channel sharded by channel, which would have been more efficient for this query. However, the sharded-by-user keyspace was used because it contained needed columns that were not present in the sharded-by-channel table…We have modified the problematic scatter query to read from a table that is sharded by channel. We have also analyzed our other database queries which are fronted by the caching tier to see if there are any other high-volume scatter queries which might pose similar risks

    This is always the trouble with NoSQL database schemas and denormalization: sometimes there’s just a little something missing that means you can’t use that way, and now you have to keep using this way. Regular operational reviews and good design reviews are needed to catch bigger, subtler issues with data schemas. I also try to keep a rule in my mind, “Never let them touch every row”: always ask whether I have created a query or access pattern with O(n) behavior. If I have, I’ve just boxed myself in, and n only has to get large enough to tip me over. Queries need to be O(1); anything that genuinely needs O(n) access can, for example, use a changelog stream to publish the data to an alternative data store like an object store. The caching layer is your crutch/mode that is hiding this time bomb and paving the road to the next outage.
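
    For example, a minimal sketch of that changelog-stream idea (boto3; the bucket name, event shape, and keying are placeholders, not Slack’s actual pipeline):

    import json

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "channel-membership-snapshots"  # placeholder bucket name

    def publish(change_event: dict) -> None:
        """Mirror one row-change event into the object store, keyed by channel.

        change_event is a hypothetical record from the database's changelog/CDC
        stream, e.g. {"channel_id": "C123", "members": [...]}.
        """
        s3.put_object(
            Bucket=BUCKET,
            Key=f"channels/{change_event['channel_id']}.json",
            Body=json.dumps(change_event).encode("utf-8"),
        )

    The O(n) readers then hit the object store copy instead of scattering a query across every shard.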

    The PBR step on February 22 updated Consul on 25% of the fleet. It followed two previous 25% steps the prior week, both of which had occurred without any incident. However, on February 22, we hit a tipping point and entered a cascading failure scenario as we hit peak traffic for the day. As cache nodes were removed sequentially from the fleet, Mcrib continued to update the mcrouter configuration, promoting spare nodes to serving state and flushing nodes that left and rejoined. This caused the cache hit rate to drop.

    Can you close the loop here: why are you continuing deployments as your system is degrading? You know your system is bi-modal; as it veers into the other mode you need to slam on the big red button. Some may say it is impossible to avoid the other mode…but at least you can try.

    “Complex systems contain changing mixtures of failures latent within them. The complexity of these systems makes it impossible for them to run without multiple flaws being present. Because these are individually insufficient to cause failure they are regarded as minor factors during operations. Eradication of all latent failures is limited primarily by economic cost but also because it is difficult before the fact to see how such failures might contribute to an accident. The failures change constantly because of changing technology, work organization, and efforts to eradicate failures.”

    A pithy quote, and an interesting paper. But after incidents, rather than pithy quotes, I prefer sharp controversial questions that try to go to the heart of the issue. I am not saying there are right or wrong answers, but asking such questions and the journey of answering them is more enlightening than seeking generic yogi-like advice. I am no yogi, I do not have magical answers, so I like to ask myself sharp questions.

    So, maybe:

    • Get rid of caching? Vitess is awesome, let’s just nuke the memcached layer entirely.
    • Get rid of Vitess? Vitess is not awesome, we couldn’t horizontally scale it during the incident, we forgot to denormalize columns, let’s nuke Vitess.
    • Test all the modes? Is it even possible to test a system’s modes fully, or should we try to eliminate modes?
    • Test recovering once a system is in catastrophic failure and requires severe throttling and cold starting vs. test the system’s breaking point and avoid the breaking point? Ah, the ever eternal controversial question. Ask both questions. What works for your organization?
    • Better deployments? Why do Consul deployments empty the cache, especially for stable items with infinite TTL? What needs to change during a deployment and what can stay the same? How do I know a deployment is going side-ways?

    Thanks for Slack, and good luck!

    1. 7

      Vitess is magical, scales horizontally, and supports read replicas. If it’s so magical, why can’t you scale it horizontally and replicate it globally for reducing latency. Why do you need a caching layer?

      This is a very good question.

      I don’t work at Slack, but at Booking.com we tried to push MySQL (without Vitess) to quite the limit.

      Those who are interested can check out the details in https://blog.koehntopp.info/2021/03/12/memory-saturated-mysql.html written by Kristian, our principal DBE. Essentially, you can serve the data from MySQL’s in-memory pages, which makes the read performance of a fully memory-saturated instance comparable to using a cache store like Redis or Memcached. For this reason, there is no ‘cache’ layer at Booking.com, and read-heavy workloads are served directly from MySQL, heavily sharded.

    1. 2

      You may be wondering why this is just coming to light now, when Java has had ECDSA support for a long time. Has it always been vulnerable?

      No. This is a relatively recent bug introduced by a rewrite of the EC code from native C++ code to Java, which happened in the Java 15 release. Although I’m sure that this rewrite has benefits in terms of memory safety and maintainability, it appears that experienced cryptographic engineers have not been involved in the implementation.

      Couldn’t an end-to-end fuzz test on the original and new code catch this? Not sure.

      I think it’s clear that security-sensitive code should be evaluated by experts. But the recent trend to rewrite other core infrastructure in Java / Go / Rust gives me pause.

      Fuzz both sides, get an expert, or just let the original code be?

      1. 3

        This is a very easy bug to find - just trying all-zeroes will work, and most testing strategies should test all-zeroes.

        In general, though, crypto code can break only on very particular inputs (e.g. carry-chain bugs). You want expert review, and/or a careful code comparison against the original code (which would have worked!), and probably something like Project Wycheproof, which collects a number of test vectors (etc.) for specific algorithms.
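
        To make the all-zeroes point concrete, here is a sketch of such a check in Python with the cryptography package (testing a presumably healthy implementation, not the vulnerable Java code); a signature with r = 0 and s = 0 must always be rejected:

        from cryptography.exceptions import InvalidSignature
        from cryptography.hazmat.primitives import hashes
        from cryptography.hazmat.primitives.asymmetric import ec

        public_key = ec.generate_private_key(ec.SECP256R1()).public_key()

        # DER encoding of an ECDSA signature with r = 0 and s = 0.
        all_zero_signature = bytes.fromhex("3006020100020100")

        try:
            public_key.verify(all_zero_signature, b"any message at all", ec.ECDSA(hashes.SHA256()))
            print("VULNERABLE: the all-zero signature verified")
        except InvalidSignature:
            # A healthy implementation lands here.
            print("OK: the all-zero signature was rejected")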

      1. 14

        I work for AWS, my views are my own and do not reflect my employer’s views.

        Thanks for posting your frustrations with using AWS Lambda, AWS API Gateway, and AWS EventBridge. I agree, using new technologies and handing more responsibility over to a managed service comes with the risk that your organization is unable to adopt and enforce best standards.

        I also agree that working in a cult-like atmosphere is deeply frustrating. This can happen in any organization, even AWS. I suggest focusing on solving problems and your business needs, not on technologies or frameworks. There are always multiple ways to solve problems. Enumerate at least three, put down pros and cons, then prototype on two that are non-trivially different. With this advice you will start breaking down your organization’s cult-like atmosphere.

        Specifically addressing a few points in the article:

        Since engineers typically don’t have a high confidence in their code locally they depend on testing their functions by deploying. This means possibly breaking their own code. As you can imagine, this breaks everyone else deploying and testing any code which relies on the now broken function. While there are a few solutions to this scenario, all are usually quite complex (i.e. using an AWS account per developer) and still cannot be tested locally with much confidence.

        This is a difficult problem. I have worked in organizations that have solved it using individual developer AWS accounts, each deploying a full working version of the “entire service” (e.g. the whole of AWS Lambda), with all its little microservices as e.g. different CloudFormation stacks that take ~hours to set up. It works. I have also worked in organizations that have not solved this problem, and resort to maintaining brittle shared test clusters that break once a week and need 1-2 days of a developer’s time to set up again. Be the organization that invests in its developers’ productivity and can set up the “entire service” accurately and quickly in a distinct AWS account.

        Many engineers simply put a dynamodb:* for all resources in the account for a lambda function. (BTW this is not good). It becomes hard to manage all of these because developers can usually quite easily deploy and manage their own IAM roles and policies.

        If you trust and train your developers, use AWS Config [2] and your own custom-written scanners to automatically enforce best practices. If you do not trust and do not train your developers, do not give them authorization to create IAM roles and policies, and instead bottleneck this authorization to a dedicated security team.
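
        As a sketch of what a custom scanner can look like (boto3, inline role policies only, report-only; the dynamodb:* check and any remediation are up to you):

        import boto3

        iam = boto3.client("iam")

        def grants_dynamodb_star(document: dict) -> bool:
            statements = document.get("Statement", [])
            if isinstance(statements, dict):
                statements = [statements]
            for statement in statements:
                actions = statement.get("Action", [])
                if isinstance(actions, str):
                    actions = [actions]
                if statement.get("Effect") == "Allow" and "dynamodb:*" in actions:
                    return True
            return False

        # Walk every role's inline policies and flag the over-broad ones.
        for page in iam.get_paginator("list_roles").paginate():
            for role in page["Roles"]:
                name = role["RoleName"]
                for policy_name in iam.list_role_policies(RoleName=name)["PolicyNames"]:
                    document = iam.get_role_policy(RoleName=name, PolicyName=policy_name)["PolicyDocument"]
                    if grants_dynamodb_star(document):
                        print(f"{name}/{policy_name}: grants dynamodb:* -- review this")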

        Without help from frameworks, DRY (Don’t Repeat Yourself), KISS (Keep It Simple Stupid) and other essential programming paradigms are simply ignored

        I don’t see how frameworks are connected with DRY and KISS. Inexperienced junior devs using e.g. Django or Ruby on Rails will still write bad, duplicated code. Experienced trained devs without a framework naturally gravitate towards helping their teams and other teams re-use libraries and create best practices. I think expecting frameworks to solve your problem is an equally cult-like thought pattern.

        Developers take the generic API Gateway generated DNS name (abcd1234.amazonaws.com) and litter their code with it.

        Don’t do this, attach a Route 53 domain name to API Gateway endpoints.

        The serverless cult has been active long enough now that many newer engineers entering the field don’t seem to even know about the basics of HTTP responses.

        Teach them.

        Cold starts - many engineers don’t care too much about this.

        I care about this deeply. Use Go or Rust first and see how much of a problem cold starts still are; in my experience, p99.99 latency is < 20 ms for trivial (empty) functions (this is still an outrageously high number for some applications). If cold starts with Go or Rust are still a problem, then yes, you need to investigate provisioned concurrency. But this is a known limitation of AWS Lambda.

        As teams chase the latest features released by AWS (or your cloud provider of choice)

        Don’t do this, give new features / libraries a hype-cool-down period that is calibrated to your risk profile. My risk profile is ~6 months, and I avoid all libraries that tell me they are not production ready.

        When it’s not okay to talk about the advantages and disadvantages of serverless with other engineers without fear of reprisal, it might be a cult. Many of these engineers say Lambda is the only way to deploy anymore.

        These engineers have stopped solving problems, they are now just lego constructors (I have nothing against lego). Find people who want to solve problems. Train existing people to want to solve problems.

        I am keeping track of people’s AWS frustrations, e.g. [1]. I am working on the outline of a book I’d like to write on designing, deploying, and operating cloud-based services focused on AWS. Please send me your stories. I want to share and teach ideas for solving problems.

        [1] https://blog.verygoodsoftwarenotvirus.ru/posts/babys-first-aws/

        [2] https://docs.aws.amazon.com/config/latest/developerguide/managed-rules-by-aws-config.html

        1. 4

          The serverless cult has been active long enough now that many newer engineers entering the field don’t seem to even know about the basics of HTTP responses.

          Teach them.

          I’m happy to teach anyone who wants to learn. Unfortunately this usually comes up in the form of their manager arguing that it’s too much overhead to spend time getting their employee(s) up to speed on web tech, insisting on serverless as a way to paper over what is happening throughout the stack. This goes to the heart of why people characterize it as a cult. The issues it brings into orgs aren’t about the tech as much as they are about the sales pitches coming from serverless vendors.

          1. 9

            Interesting. At $WORK, we’re required to create documents containing alternatives that were considered and rejected, often in the form of a matrix with multiple dimensions like cost, time to learn, time to implement, etc. Of course there’s a bit of a push-pull going on with the managers, but we usually timebox it (1 person 1 week if it’s a smaller decision, longer if it’s a bigger one.) Sometimes when launching a new service we’ll get feedback from other senior engineers asking why we rejected an alternative maybe even urging us to reconsider the alternative.

            Emotional aspects of the cult aside (which suck; not saying they don’t, just bringing up a different point), I don’t think I’d ever let a new system be built at work if at least a token attempt weren’t made at evaluating different technologies. I fundamentally think comparing alternatives makes for better implementations, especially when you have different engineers with different amounts of experience with different technologies.

            1. 1

              So you write an RFP with metrics/criteria chosen to perfectly meet the solution already settled on?

              1. 2

                I mean if that’s what you want to do, sure. Humans will be human after all. But having this kind of a process offers an escape hatch from dogma around a single idea. Our managers also try to apply pressure to just get started and ignore comparative analyses, but with a dictum from the top, you can always push back, citing the need for a true comparative analysis. When big outages happen, questions are asked in the postmortem whether an alternate architecture would have prevented any issues. In practice we often get vocal, occasionally bikeshed-level comments on different approaches.

                I’m thankful for our approach. Reading about other company cultures reminds me of why I stay at $WORK.

            2. 2

              Try giving them alternatives. “Do you want to train your developers, or sign off on the technical debt and your responsibility to fix it?”, when presented well, can point out the issue. This happens with all tech vendors, and all managers can suck at this. But that’s not the fault of serverless.

              Note that I’m not arguing that serverless is actually good. As with any tech, the answer is usually “it depends”. But just like serverless, you need experience with other things as well to be able to see this pattern.

              In fact, I agree with several commenters saying that the majority of issues in the article can be applied to any tech. The only real insurmountable technical issue is the testing/local stack. The rest is mostly about the processes of the company, or maybe of a team in the company.

            3. 4

              Specifically addressing a few points in the article

              … while carefully avoiding the biggest one:

              “All these solutions are proprietary to AWS”

              That right there is the real problem. An entirely new generation of devs is learning, the hard way, why it sucks to build on proprietary systems.

              Or to put it in economic terms, ensure that your infrastructure is a commodity. As we learned in the 90s, the winning strategy is x86 boxen running Linux, not Sun boxen running Solaris ;) And you build for the Internet, not AOL …

              1. 2

                I think there are three problems with a lot of the serverless systems, which are closely related:

                • They are proprietary, single-vendor solutions. If you use an abstraction layer over the top then you lose performance and you will still end up optimising to do things that are cheap with one vendor but expensive for others.
                • They are very immature. We’ve been building minicomputer operating systems (and running them on microcomputers) for 40+ years and know what abstractions make sense. We don’t really know what abstractions make sense for a cloud datacenter (which looks a bit like a mainframe, a bit like a supercomputer, and a bit like a pile of servers).
                • They have a lot of vertical integration and close dependencies between them, so it’s hard to use some bits without fully buying into the entire stack.

                If you think back to the late ’70s / early ‘80s, a lot of things that we take for granted now were still very much in flux. For example, we now have a shared understanding that a file is a variable-sized contiguous blob of bytes. A load of operating systems provided record-oriented filesystems, where each file was an array of strongly typed records. If you do networking, then you now use the Berkeley Sockets API (or a minor tweak like WinSock), but that wasn’t really standardised until 1989.

                Existing FaaS offerings are quite thin shims over these abstractions. They’re basically ‘upload a Linux program and we’ll run it with access to some cloud things that look a bit like the abstractions you’re used to, if you use a managed language then we’ll give you some extra frameworks that build some domain-specific abstractions over the top’. The domain-specific abstractions are often overly specialised and so evolve quite quickly. The minicomputer abstractions are not helpful (for example, every Azure Function must be associated with an Azure Files Store to provide a filesystem, but you really don’t want to use that filesystem for communication).

                Figuring out what the right abstractions are for things like persistent storage, communication, fault tolerance, and so on is a very active research area. This means that each cloud vendor gains a competitive advantage by deploying the latest research, which means that proprietary systems remain the norm, that the offerings remain immature. I expect that it will settle down over the next decade but there are so many changes coming on the hardware roadmap (think about the things that CXL enables, for one) that anything built today will look horribly dated in a few years.

                1. 1

                  Many serverless frameworks are built upon Kubernetes, which is explicitly vendor-neutral. However, this does make your third point stronger: full buy-in to Kubernetes is required.

                  1. 2

                    Anything building on Kubernetes is also implicitly buying into the idea that the thing that you’ll be running is a Linux binary (well, or Windows, but that’s far less common) with all of the minicomputer abstractions that this entails. I understand why this is being done (expediency) but it’s also almost certainly not what serverless computing will end up looking like. In Azure, the paid FaaS things use separate VMs for each customer (not sure about the free ones), so using something like Kubernetes (it’s actually ACS for Azure Functions, but the core ideas are similar) means a full Linux VM per function instance. That’s an insane amount of overhead for running a few thousand lines of code.

                    A lot of the focus at the moment is on how these things scale up (you can write your function and deploy a million instances of it in parallel!) but I think the critical thing for the vast majority of users is how well they scale down. If you’re deploying a service that gets an average of 100 requests per day, how cheap can it be? Typically, FaaS things spin up a VM, run the function, then leave the VM running for a while and then shut it down if it’s not in use. If your function is triggered, on average, at an interval slightly longer than the interval that the provider shuts down the VM then the amount that you’re paying (FaaS typically charges only for CPU / memory while the function is running) is far less than the cost of the infrastructure that’s running it.

                2. 2

                  S3 was a proprietary protocol that has become a de facto industry standard. I don’t see why the same couldn’t happen for Lambda.

              1. 11

                The common solution to this problem is using a UUID (Universally Unique Identifier) instead. UUIDs are great because it’s nearly impossible to generate a duplicate and they obscure your internal IDs. They have one problem though. They take up a lot of space in a URL: api.planetscale.com/v1/deploy-requests/7cb776c5-8c12-4b1a-84aa-9941b815d873. Try double clicking on that ID to select and copy it. You can’t. The browser interprets it as 5 different words. It may seem minor, but to build a product that developers love to use, we need to care about details like these.

                Remove the hyphens? 7cb776c58c124b1a84aa9941b815d873.

                Encode the UUID hex as base32 and strip the equals signs?

                In [6]: import base64, uuid

                In [7]: base64.b32encode(uuid.uuid4().bytes).decode().strip("=")
                Out[7]: '46LIVXKUY5A57HIDNFDEUYCOLM'
                

                So there are at least two options for keeping all 128 bits of entropy and being able to double-click it.

                The longer and more complex the ID, the less likely it is to happen. Determining the complexity needed for the ID depends on the application. In our case, we used the NanoID collision tool and decided to use 12 character long IDs with the alphabet of 0123456789abcdefghijklmnopqrstuvwxyz. This gives us a 1% probability of a collision in the next ~35 years if we are generating 1,000 IDs per hour.

                But your scheme gives log(36 ^ 12) / log(2) = 62 bits of entropy. So you are not solving the same problem that UUID is. Of course the IDs are shorter.
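
                The back-of-the-envelope arithmetic behind both numbers, if anyone wants to check it (a birthday-bound approximation, not an exact calculation):

                import math

                alphabet, length = 36, 12
                keyspace = alphabet ** length

                print(math.log2(keyspace))               # ~62 bits, vs. ~122 random bits in a UUIDv4

                ids = 1_000 * 24 * 365 * 35              # 1,000 IDs per hour for ~35 years
                print(ids * (ids - 1) / (2 * keyspace))  # ~0.0099, i.e. the article's ~1% collision odds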

                1. 4

                  Seems like a case of picking a technology based on aesthetics for development purposes, which appears to lead to bad decisions. And according to the NanoID readme, it uses A-Za-z0-9_- by default, so it does encode a similar number of bits of entropy. Removing the hyphens really seems like the best idea, lol.

                  1. 5

                    Yeah, the author seems to be treating UUIDs as more special than they really are. The type of UUID everyone uses is just a securely-random 128-bit number (with six of the bits set to constant values to mark the version and variant) encoded in hex with a few hyphens.

                    A nanoid is the same thing, just with control over the size and the alphabet used to encode it.
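
                    In fact, a nanoid-alike is a few lines of standard-library Python (just a sketch to make the point, not a replacement for the real library):

                    import secrets

                    DEFAULT_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-"

                    def nanoid_like(size: int = 21, alphabet: str = DEFAULT_ALPHABET) -> str:
                        # Securely random symbols; 21 symbols from a 64-character alphabet
                        # is ~126 bits, the same ballpark as the ~122 random bits of a UUIDv4.
                        return "".join(secrets.choice(alphabet) for _ in range(size))

                    print(nanoid_like())                                            # 21-character default
                    print(nanoid_like(12, "0123456789abcdefghijklmnopqrstuvwxyz"))  # the article's 12-character, ~62-bit variant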

                    BTW, I recently discovered the d64 encoding, which is a modified base64 that avoids hyphens and = signs, and I’m using it in my current project.

                    1. 1

                      Thanks for the pointer to d64! I’ve been looking for something like this.

                1. 8

                  I work for AWS, my opinions are my own.

                  Thank you for this article. It’s rare to see an article detailing the struggles of going from scratch to a working application on AWS. The author is clearly someone used to operating services and has good opinions about what they need (logging, metrics, tracing).

                  That being said, the target price of $200/month and the achieved cost of $130/month seem high. I run a very small hobby project called Simple Calendar Tracker on AWS: a very basic CRUD application using the typical API Gateway to Lambda to DynamoDB, with CloudWatch and X-Ray. It costs around $20/month total for a test environment and a prod environment, and doesn’t take much traffic: maybe 100 daily active users and < 5 TPS p100.

                  I’ve been meaning to write up this simple application and how others can follow. Some things that stood out for me in the article:

                  • The author explicitly wants to go all in on AWS, but then uses Terraform? That’s strange; why not use AWS CDK? AWS supports CDK and it smooths over rough edges for you, e.g. private subnets and access to services.
                  • Aurora Serverless is great if you require scalable relational database access, but is expensive. DynamoDB is widely used inside Amazon, and in pay-per-request mode can be cheap.
                  • Why use Fargate and ECS and not Lambda? Granted, Lambda can have larger tail latencies, but is cheaper for small workloads.
                  • The author was really obsessed with avoiding managed NAT gateways. I get it, they’re annoying and expensive. You can run your own NAT Gateway on a small instance, or just use a public subnet and secure your instances, or use a managed compute environment like Lambda.
                  • Why complain about the cost of CloudWatch, but then also complain at how expensive it is to run Elasticsearch / Graphana / Prometheus? This is kind of what CloudWatch is - expensive but all-encompassing. But I 100% agree CloudWatch metrics are expensive.
                  • People often compare using managed AWS services to running a single VPS instance, but this is an apples to oranges comparison. How does this address scalability, availability, durability, deployments, and security? Lambda is capable of bursting up from zero to hundreds of thousands of concurrent instances instantly if you request limit increases.

                  I’m looking forward to writing my blog post on creating and operating services on AWS, I think it could easily become a book. Blog posts like this help me realize what customer pain points are.

                  1. 4

                    I’ll be on the lookout for your article. I used to host my personal stuff on AWS because that is what I work with, but have mostly moved off due to pricing. Terraform is my choice because it is cloud neutral and I am using more than one cloud provider. I use apps I didn’t write and they expect Postgres, so I think Dynamo is not an option? I think the author walked away with CloudWatch, Prometheus/Grafana, and Elasticsearch all being really expensive. I personally find it hard to get the visuals I get from Grafana out of CloudWatch, or to query as efficiently; I may just not know enough there. I agree that he was never going to get a low price out of his deployment method, but it is a small application and even EC2 would have been a huge cost savings for a non-critical app.

                    1. 1

                      I agree that he was never going to get a low price out of his deployment method, but it is a small application and even EC2 would have been a huge cost savings for a non-critical app.

                      I agree, if the business problem you are solving can be satisfied by a single EC2 spot instance and you have the domain knowledge to operate it, go for it. t4g.nano spot price is $1 a month. I use spot instances for a separate hobby project, works great.

                    2. 2

                      People often compare using managed AWS services to single VPS instance, but this is an apples to oranges comparison. How does this address scalability, availability, durability, deployments, and security? Lambda is capable of bursting up from zero to hundreds of thousands of concurrent instances instantly if you request limit increases.

                      As a member of the VPS instance gang, I’ll play devil’s advocate a little if you don’t mind. ;-)

                      For small applications you can easily get a year of uptime and low hundreds of concurrent users on a single small VPS. It’s really not a problem, especially if the service you’re running is free and can tolerate an hour of downtime once every six months or so. For what I’d consider mid-sized applications with Real Users you’d probably want 2 redundant web servers, load balancer with failover, a database server or two, maybe some logging system, and you’d need an Ansible setup or something to orchestrate. Linode puts that at $60/month, so this is about the level at which AWS starts being competitive. It takes a week or two to set up and maybe one day a month to manage tops, if you know what you’re doing, and as this post demonstrates, knowing what you’re doing with AWS is a pretty large investment of work as well.

                      If your code is a little careful this ought to be plenty to scale to thousands of concurrent users, and easily expand to tens of thousands with beefier and more servers; you don’t hit its limits until you get into the dozens of machines. You probably won’t run Netflix off of it, but you can run 2015’s StackOverflow off of it just fine.

                      The other place where AWS works really well is at the very low end, such as when you just need something that’s a single Lambda function or such that glues together other things. I dealt with it for that sort of thing recently and it was reasonably pleasant, though it does tend to start with one service that Does Something and then you need to drag in bunches of other pieces to manage it.

                      1. 4

                        As a member of the VPS instance gang, I’ll play devil’s advocate a little if you don’t mind. ;-)

                        Don’t exclude me! My Battlesnake runs happily on a Digital Ocean instance. I don’t care about downtime.

                        This all boils down to different strokes for different folks. I’m absolutely not saying AWS is the glorious single true way to deploy software systems. With appropriate domain expertise you can be StackOverflow and run on a handful of powerful 1U rack servers. In my first job I worked for a company that sold powerful telecommunication softswitches for small towns and cities; that was an awesome experience.

                        But if you don’t have domain expertise, managed services are pretty awesome. I wasn’t lying before. I interacted with a high-profile internal customer whose Lambda traffic scaled from 0 TPS to 100k TPS within 10ms and throttled excess load. Their complaint was that they needed this to be larger, and we handled it. I’m not saying Lambda is magical and solves all problems, just that this particular customer needed no domain expertise or capital expenditure at all.

                        AWS is a sliding scale with different components you can play with, or you can opt out and build everything by hand, that’s how I see it.

                        1. 2

                          I interacted with a high-profile internal customer whose Lambda traffic scaled from 0 TPS to 100k TPS within 10ms and throttled excess load.

                          Okay, I can’t lie, that is pretty awesome.

                    1. 2

                      Working link: https://web.archive.org/web/20220326171212/https://queue.acm.org/detail.cfm?id=3526209

                      What was the conclusion or purpose of this article? It filled an odd space where it offered no technical guidance, but wasn’t creatively written enough to enjoy as-is.

                      I don’t want to be overly negative. As a sliver of substance, I’d say: when designing a system, start with an asset-centric threat model. Really dig deep. Get it reviewed by people you trust / employ / both.

                      If you don’t know what an asset-centric threat model is, let me know; I would enjoy writing about it.

                      1. 2

                        Strange, the link seems to be working for me. (The web server might do some “checks” behind the scenes.)

                        What was the conclusion or purpose of this article?

                        The article’s author has a column on ACM Queue called “Kode Vicious”, where he answers the emails of readers, but doesn’t necessarily give a concrete answer, instead tries to tackle the more philosophical aspects of the problem. Sometimes he even goes on to rant about the topic. :)

                        The point of the current article I think boils down to: just as we have “software engineers” that think about and implement the actual software, there should exist counterparts “data engineers” that should think about and implement the data management part.

                        (Too many times, developers don’t think what to do with the data, and they hope that “the cloud” would just solve any issue…)

                        1. 1

                          (Too many times, developers don’t think what to do with the data, and they hope that “the cloud” would just solve any issue…)

                          Data is central to systems. You can make mistakes modeling and storing data. People are used to shooting themselves in the foot with a single VPS instance with a .22 gun. “The cloud” lets you shoot yourself in the foot with a 10 megaton thermonuclear weapon. All problems need thought, foresight, and mechanisms for recovering from mistakes.

                      1. 4

                        This is an important topic but this article was a struggle to understand and does not clearly and accurately propose a solution. Maybe I never understood lock files?

                        NPM can already record those versions, in a lock file. But npm install of a new package does not use the information in that package’s lock file to decide the versions of dependencies: lock files are not transitive.

                        Why in the name of everything that is holy are NPM lock files not transitive? What does a non-transitive lock file even mean? Am I misunderstanding the article or NPM - doesn’t a lock file permanently freeze the full dependency graph to fixed versions, or do I only get that with npm shrinkwrap?

                        Anyone running modern production systems knows about testing followed by gradual or staged rollouts, in which changes to a running system are deployed gradually over a long period of time, to reduce the possibility of accidentally taking down everything at once.

                        Why even mention this? I was teased into maybe seeing a proposal for gradually revealing package updates from consumers, or maybe automatically running automated tests, but neither were proposed. I think the Go versioning proposal is weak because it still leaves you prey to arbitrary version upgrades.

                        NPM’s design choice is the exact opposite. The latest version of colors was promoted to use in all its dependents before any of them had a chance to test it and without any kind of gradual rollout. Users can disable this behavior today, by pinning the exact versions of all their dependencies. For example here is the fix to aws-cdk. That’s not a good answer, but at least it’s possible.

                        Why is this not a good answer? Later in the article the author argues people should use npm shrinkwrap…which is this answer.

                        Is this the crux of my misunderstanding? Isn’t a lock file equivalent to expressing specific version dependencies for all dependencies in the full dependency graph?

                        Other language package managers should take note too. Marak has done all of us a huge favor by highlighting the problems most package managers create with their policy of automatic adoption of new dependencies without the opportunity for gradual rollout or any kind of testing whatsoever.

                        Come on, let’s call a spade a spade here. You need to freeze your dependency graph with a lock file. If npm plays weird semantic games with “oh lock files don’t really lock dependencies” then just Docker the whole thing and wrap it in a lead canister. If you depend on an ecosystem that doesn’t honor freezing dependencies then freeze them yourself.

                        1. 9

                          Am I misunderstanding the article or NPM - doesn’t a lock file permanently freeze the full dependency graph to fixed versions, or do I only get that with npm shrinkwrap?

                          The article. You are correct that your project’s package-lock.json freezes the entire dependency tree for your project. The paragraphs at the end of the article starting “NPM also has an npm shrinkwrap…” are somewhat confusingly written.

                          Say I set up a new project and I want to use library A, which has a dependency on library B.

                          When I run npm install --save A, for the first time, npm will fetch the latest allowed versions (according to version bounds in the dependencies field of package.json) of A and B. It’ll record in package-lock.json whatever versions of A and B it fetched.

                          Next time I run npm install or npm ci in my project, it’ll install exactly the same versions of A and B again that it installed last time. (assuming I haven’t changed dependencies in package.json)

                          What the article is complaining about is this: package A may contain its own package-lock.json. npm did not consult A’s package-lock.json to decide which version of B to pick when I ran npm install --save A. Only the package-lock.json at the root of the project (if there is one) was consulted. On initial project setup, I don’t have a package-lock.json yet at the top level, so I’m getting all fresh new versions of everything and I’m going to get the sabotaged version of colors since it’s the newest one.

                          The article proposes that, when you run npm install --save A, npm should check for a package-lock.json inside the package ‘A’ and, if one is present, use the same version of B that A’s package-lock.json has, because A was probably QA’d with that version of B.

                          1. 4

                            Isn’t this what package.json is for? Or is the problem that everyone uses overpermissive dependencies?

                            1. 3

                              No. package.json only specifies dependencies one level deep and only specifies names and version numbers. It’s pretty loose. The lock file has content hashes too and specifies the entire tree.

                              1. 3

                                But hang on, if you check the entire tree of every dependency, aren’t you pretty much always going to get conflicting information? Isn’t that the whole point of package.json that we specify looser dependencies so that we can, SemVer permitting, actually get two packages to agree on which version of a dependency they can both live with? It seems to me that when you give up on the SemVer notion of non-breaking differences, except for very trivial cases, you pretty much give up on versioning.

                                1. 2

                                  Yes, there is a tradeoff here and I suspect that people who wrote npm did actually think about this very issue.

                                  FWIW you can install mutually conflicting libraries in npm as sub dependencies. If A depends on C==1.0 and B depends on C==2.0, and you install both A and B, you get a node_modules tree like:

                                  • node_modules/A
                                  • node_modules/A/node_modules/C
                                  • node_modules/B
                                  • node_modules/B/node_modules/C

                                  So over constraining dependency versions doesn’t necessarily break everything, even though it’s not often what you want.

                            2. 2

                              Thank you!

                          1. 10

                            I would find it more persuasive if the author showed me what idiomatic code to write in some other language with error checking.

                            As it stands the post boils down to “poorly written Shell scripts are footguns”, to which my response is “yes”.

                            1. 2

                              I like this post, thank you for sharing it.

                              If learning how to learn as a software engineer interests you, Apprenticeship Patterns will also interest you. I recommended it to people near the end of Amazon’s internal program that trains people to become software engineers.

                              What’s going on? Does learning make something magic happen? It does! Sort of. Not that the stars will suddenly align the right way for us, but knowledge changes how we see the world. It’s like when you travel, come back home and start seeing things a bit differently.

                              I think curiosity and learning are two of the most powerful general purpose values you can have. Learning is genuinely fun and reminds us to have humility. And as the post says, the additional knowledge unlocks new paths and problem solving techniques in life, and as you wander down those paths at work or in a hobby new learning opportunities appear…

                              1. 4

                                I work for AWS, my opinions are my own. I also used to be an instructor at Amazon Technical Academy, so I have experience teaching, seeing teachers in action, and writing curricula.

                                Thank you for asking this question. This is a tough question for me because there are so many angles that personally involve me.

                                RE: the term “AWS”: just like the IT industry as a whole, AWS means different things to different people. Some people are data engineers, some are networking gurus, some just want to do X or Y. Sure, use A Cloud Guru and Solutions Architect courses to get a broad understanding of AWS. But even then you will continue to feel left out, small, and ignorant, because there is always someone else solving some other problem (witness the long thread from Azure engineers about how important networking is, whereas many teams in AWS never encounter networking as a concept).

                                RE: the term “learn”, I highly recommend the O’Reilly book Apprenticeship Patterns. I personally love making “Breakable Toys”, solving a problem you are comfortable with but in a new toolset or language or framework you are not comfortable in. This gives you permission to fail and hones your attention because you are not worried about solving a novel problem. What is your go to breakable toy? What problems are you comfortable solving?

                                RE: “foundational knowledge”, I recommend trying out the newest version of AWS CDK, following a tutorial, then trying a breakable toy.

                                I love making silly Alexa skills (did you know they reimburse you $25 a month if you make an Alexa skill that is used at least once per month?). So to learn the Rust runtime for Lambda, I made a Lambda function that calls a weather API for air pollution data, dumps the info to S3, then exposes that data to Alexa. For me, calling an API using code, parsing the response, and putting it into S3 wasn’t a big deal. But using a new Lambda runtime was mysterious and took me a while to figure out.

                                I could talk about this topic for years, write a book, anything. I love teaching, I love helping learners, I like AWS, and (to be frank) I do not know how easy or hard it is to learn AWS. I’d like to learn more about your and others’ difficulties.

                                1. 6

                                  Section “4.5 Toward a Culture of Safety” is worth reading in and of itself before the entire report. There’s a fantastic opportunity for distributed ledger providers to short circuit all the pain and confusing misery of many open source NoSQL databases and their lack of safety over the past decade, define what safety means, and at least start passively monitoring it.

                                  This is also a clarion call for developers of all kinds of systems: please define safety and liveness! What does your system define as bad and must never do? What does your system define as good and must eventually do? It’s amazing that this report opens with ambiguity about whether safety refers to the system or to its clients.

                                  1. 16

                                    I’m glad this post is cathartic for the author. For me the inflection point was:

                                    I ran e2fsck -f on the treefort volumes and hoped for the best. Instead of a clean bill of health, I got lots of filesystem errors. But this still didn’t make any sense to me, I checked the array again, and it was still showing as fully healthy. Accordingly, I decided to run e2fsck -fy on the volumes and hope for the best.

                                    Whew. Speaking from experience, so many battle scars, so many incidents at home and work, I’ve trained myself so that when something doesn’t make sense during an incident I now tell myself “move gently, move slowly, don’t change anything, what do I know, why do I think I know it, what should I do next”.

                                    It reminds me of a coworker at my first job; I once asked them why they typed commands so slowly. They would literally type one character per second, then reread the command entirely; it took 1 minute to run a command. They laughed and said, “Speaking from experience, 1 minute is nothing compared to the consequences of doing the wrong thing”.

                                    So something as simple as rsync’ing your data off to a remote box would occur to you if you suspected your file system is hosed but your RAID is “healthy”. Or double checking your backup.

                                    1. 3

                                      If I had noticed a mysteriously detached volume claiming to be totally healthy on a re-attach, I would have looked up how to force-check that volume. Or just wipe it and resilver it from scratch. RAID1 makes me immediately worried about which volume is going to have the correct blocks. Without ZFS checksums it seems like it would be hard to know which blocks are the latest. ZFS checksums go all the way up to the root block, so it knows which tree of blocks is valid. I can’t think of how mdadm could possibly do this in RAID1 mode without getting it wrong some of the time. I can’t even see how mdadm would recover from bit rot with only 2 volumes in a mirror. I’m guessing it just… doesn’t?

                                      1. 2

                                        I’m guessing it just… doesn’t?

                                        That’s how I understand it. To get bitrot protection, you need to run mdadm on top of dm-integrity, which provides the checksum validation.

                                    1. 7

                                      RE: the size of the data, HTTP CDNs have file size limits, it sucks to find out like this though. I think AWS CloudFront’s limit is 20 GB. Does Azure Blob Storage support requester pays? https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html. Unfortunately this general idea prevents anonymous (unauthenticated) access to the data.
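
                                        For reference, the S3 side of requester pays in boto3 (bucket and key are placeholders); the caller has to explicitly agree to pay, which is exactly why anonymous access can’t work:

                                        import boto3

                                        s3 = boto3.client("s3")

                                        # Without RequestPayer="requester" this call fails on a requester-pays bucket.
                                        response = s3.get_object(
                                            Bucket="some-public-dataset",   # placeholder
                                            Key="data/part-000.parquet",    # placeholder
                                            RequestPayer="requester",
                                        )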

                                      RE: cloud costs, big business risk! I’m continually disappointed that no cloud offers a “hit this cost then email me three times and disable this thing for the month”. We need stock trading “stop orders” for the cloud.

                                      I work for AWS, my opinions are my own not my employer’s.

                                      1. 3

                                        Azure does support this though. The cost alerts (like in the linked post) as well as hard spending limits. Or did you mean something else?

                                        1. 1

                                          Thank you! I did not know Azure supported spending limits. https://docs.microsoft.com/en-us/azure/cost-management-billing/manage/spending-limit

                                          1. 1

                                            Yeah, except now that I read about it, I think it’s not as general purpose as I thought. It looks tied to subscriptions that have monthly credits, which is how I became aware of it.

                                        2. 2

                                            I have CloudWatch alerts set up at different bill levels, and as they come in I make sure they match what the bill should be at that point in the month. I’m sure you could link SNS to something that started shutting things down. A regular task I perform is reviewing the cost analyzer; I have a budget to maintain.
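
                                            For anyone who wants the same setup, a sketch of one such billing alarm with boto3 (the threshold and SNS topic ARN are placeholders; billing metrics live in us-east-1 and require “Receive Billing Alerts” to be enabled):

                                            import boto3

                                            cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

                                            cloudwatch.put_metric_alarm(
                                                AlarmName="monthly-bill-over-50-usd",
                                                Namespace="AWS/Billing",
                                                MetricName="EstimatedCharges",
                                                Dimensions=[{"Name": "Currency", "Value": "USD"}],
                                                Statistic="Maximum",
                                                Period=6 * 60 * 60,   # the billing metric only updates a few times a day
                                                EvaluationPeriods=1,
                                                Threshold=50.0,       # one alarm per bill level, thresholds to taste
                                                ComparisonOperator="GreaterThanThreshold",
                                                AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder topic
                                            )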

                                          Alicloud has rate limiters on the device networking interface at least, I forget if you can just limit the data transfer entirely.

                                        1. 6

                                            I enjoy using DynamoDB both at work and in side projects. As long as you plan your queries and use cases ahead of time, DynamoDB rewards you with hassle-free, dependable performance. But, unlike relational databases, a poor schema can hose you; it’s difficult to migrate or index your way out of e.g. a poorly distributed partition key.

                                            Yes, AWS teams must justify using a relational database when launching a new service or feature; otherwise they are encouraged to use DynamoDB. But DynamoDB is not good for all workloads. There are no magic bullets for storing and accessing state at scale.

                                          I also like the parts of DynamoDB that let you automatically back up the table and restore it to any point in time within a minute, or export it to S3. This is the part I work on. The problems being solved are quite interesting.

                                            In terms of predictability, there are a lot of different parts of DynamoDB that work together to ensure overload results in consistent throttling and then scaling out. This is something I took away from the Roblox Consul outage: how hard it is to reason about distributed system failures during overload, and how consistent, predictable throttling is the best outcome yet so difficult to deliver.

                                          I work on a part of DynamoDB, my opinions are my own.

                                          1. 2

                                            This article tries to use open source metrics to measure the impact of log4j. But, since Java is used heavily by enterprises, it shouldn’t be shocking that most enterprises do not pull directly from Maven Central but either mirror it or have their own internal repo. Hence this article severely underestimates the impact in my opinion.

                                              Some organizations also quickly realized it was going to be easy to patch most hosts but perhaps impossible to patch all of them fast enough, and released a runtime hotpatch: https://aws.amazon.com/blogs/opensource/hotpatch-for-apache-log4j/.

                                              There are many downsides to Java, but this is one ugly and surprising upside. The JVM agent is a no-holds-barred way of not just monitoring but also modifying runtime behavior. I think this approach is horrific but fascinating, like a car crash.

                                            1. 1

                                              This isn’t really Java’s fault. Many environments have similarly awful footgun APIs.

                                              Many older serialisation schemes had a tendency to hand out RCE; POSIX systems have system(), and Windows has an equivalent that is even easier to trip on.

                                              JNDI is just a particularly exciting case because, by design, it will happily download and execute remote content, but the above (and other) APIs have been abused in very similar ways in the past.

                                            1. 10

                                              The post doesn’t have a conclusion, but the bug here is clearly with Nextcloud, right? It should obviously use the same bytes for the metadata request as what the server returned in the directory listing, right? In fact, why does Nextcloud mess with the contents of paths at all?

                                              1. 3

                                                I also think that a storage system should never ever mess with the bytes of a file (or file name as in this case). The bytes I give to you are the bytes I want back. Not some approximation thereof.

                                                1. 1

                                                  Which means that, with two clients, you may end up with two separate files whose names look the same.

                                                  1. 1

                                                    The filesystem is perfectly free to not let you create a file it doesn’t like, for whatever reason.

                                                    Besides, the Greek question mark and other look-alike characters already make this possible.

                                                    /home/icefox/tmp/text-is-easy-and-never-hurt-anyone > ls
                                                    .rw-r--r-- 4 icefox icefox  1 Jan 12:07 ;
                                                    .rw-r--r-- 5 icefox icefox  1 Jan 12:07 ;
                                                    
                                                  2. 1

                                                    Someone in the stack needs to be responsible for canonicalisation. On *NIX systems, this is typically userspace. This caused a lot of problems in non-Unicode systems: if you create a file in a Big-5 locale and then try to open the same file in a Latin code-page locale, it will fail. These problems go away (in theory) if the VFS layer or filesystem driver canonicalises the encoding.

                                                  3. 2

                                                    Although the two characters look exactly the same, their code point sequence is different. This is known as Unicode equivalence and, in theory, addressed by Unicode normalization. But here, normalization caused this issue. Before storing the file name in the cache, Nextcloud normalized the file name (to NFC) in a function normalizePath:

                                                    Linux paths can be almost any arbitrary sequence of bytes (anything except the NUL that terminates them). Windows uses UTF-16. SMB has some horrifying backwards-compatibility name-mangling options. Paths on local and networked file systems are super interesting, and the words “super interesting” chill my bones.
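
                                                    To make the normalization trap concrete, here is a small Python illustration of my own (not from the post): the NFD and NFC spellings of the same visible name are different byte sequences, so a cache keyed on the normalized form misses the entry stored under the server’s original bytes.

                                                        # Sketch: the same visible name, "ä.txt", as two different byte sequences.
                                                        import unicodedata

                                                        nfd_name = "a\u0308.txt"                           # 'a' + combining diaeresis (NFD)
                                                        nfc_name = unicodedata.normalize("NFC", nfd_name)  # precomposed 'ä' (NFC)

                                                        print(nfd_name == nfc_name)        # False: different code points
                                                        print(nfd_name.encode("utf-8"))    # b'a\xcc\x88.txt'
                                                        print(nfc_name.encode("utf-8"))    # b'\xc3\xa4.txt'

                                                        # A cache keyed on the NFC form cannot find an entry stored under NFD bytes.
                                                        cache = {nfd_name: "metadata fetched earlier"}
                                                        print(cache.get(nfc_name))         # None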

                                                  1. 16

                                                    As valuable as this is specifically to Rust builds, this article clearly illuminates the author’s meta-cognitive process. It:

                                                    • Explains how to use “mechanical sympathy”: using an understanding of what a computer is trying to do in order to understand why certain types of solutions work. Bridges a general understanding of compilation and linking from C to Rust specifically.
                                                    • Explains when / how / why to measure.
                                                    • Is not afraid to guess and play. The author introduces a new knob, e.g. codegen units, and wonders “what if I make this big? What if I make this small? What do the measurements show?”.
                                                    • Goes deep and covers multiple domains: for example, you can ask cargo to self-report what it is doing, or just use strace, which works in all situations but suffers from verbosity.

                                                    From an educator / educational perspective, remarkable and superb. I would read a book by this author.

                                                    1. 2

                                                      Terminal-based documentation is great as a how-to guide. You know what problem you are solving, and you need a list of practical steps in order to solve that problem. I like that I can copy/paste commands to solve the problem. This is how I document my steps at work during operational issues, testing, etc.

                                                      As long as you tell your audience this is a guide for people who know what problem they are solving and just need a list of steps, they will be happy. That’s why, at the top of docs and articles I write, I put “learning objectives”, e.g. https://asim.ihsan.io/flutter-ffi-libsodium/, so that my audience knows what they will be able to do by the end of the article without committing to reading it. You can “create” and “use” and “run”, but you won’t be able to “understand” or “compare and contrast” or “decide”.

                                                      Referring to https://documentation.divio.com/, I don’t think such a format is suitable for tutorials, explanations, or references. Or maybe, if you feel strongly that the format is suitable, you can justify why for each of the quadrants. Especially when a person does not know what problem they need to solve, such a format does not help. What is Django? What problem does it solve? What does Django have to do with Debian? What is Debian? What problem does it solve? Why am I using Docker? What problem does it solve? And so on. It depends on your audience and their background.

                                                      1. 2

                                                        Indeed, libraries like these are why I have been growing ever more skeptical of using any dependencies, and now force myself to read a big chunk of any library before adding it to a project.

                                                        Harsh but fair? I think in this case you don’t even need to read the code; just read the docs page by page. I didn’t even know log4j supported this feature, and I take this ignorance as a failing on my part.

                                                        1. 2

                                                          Even if you read the entire log4j code, it might have grown this feature in a later version. I don’t think it’s feasible to be reading all the code changes that went into a release before upgrading. Granted, studying a changelog carefully might have revealed this (but then again, maybe it would not have, because it seems to be a problem with the implementation more than the idea).

                                                          1. 1

                                                            Even if you read the entire log4j code, it might have grown this feature in a later version. I don’t think it’s feasible to be reading all the code changes that went into a release before upgrading

                                                            I agree, this is a flaw in my proposal: I did not consider changes between releases, and IIRC this feature was introduced around 2015.

                                                            Granted, studying a changelog carefully might have revealed this (but then again, maybe it would not have, because it seems to be a problem with the implementation more than the idea)

                                                            I’m opposed to the idea of a logger having lookup patterns at all (e.g. env:, jndi:, etc.). But as I say that, it’s even less realistic to audit dependencies for their “ideas” or “tenets” or whatever.

                                                            One more realistic solution is static analysis of dependencies, but again I am struggling to conceptualize what I’d be looking for. The ideal UX is something like Android permissions applied to a dependency graph: “I’d like dependencies that don’t need network access, but disk access is OK”.
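
                                                            Purely to make that UX idea concrete, a toy sketch in Python; the policy, capability names, and dependency list are all made up, and as far as I know no tool produces this kind of capability report today.

                                                                # Toy sketch of "Android permissions for a dependency graph". A real tool
                                                                # would have to derive each dependency's capabilities via static analysis;
                                                                # here they are hard-coded and hypothetical.
                                                                POLICY = {
                                                                    "network": False,    # no dependency may open sockets
                                                                    "filesystem": True,  # local file access is fine
                                                                    "exec": False,       # no spawning processes or loading remote code
                                                                }

                                                                DEPENDENCIES = {
                                                                    "some-logging-lib": {"filesystem", "network", "exec"},
                                                                    "some-json-lib": {"filesystem"},
                                                                }

                                                                def violations(policy, dependencies):
                                                                    """Return (dependency, capability) pairs the policy forbids."""
                                                                    return [
                                                                        (name, cap)
                                                                        for name, caps in sorted(dependencies.items())
                                                                        for cap in sorted(caps)
                                                                        if not policy.get(cap, False)
                                                                    ]

                                                                for name, cap in violations(POLICY, DEPENDENCIES):
                                                                    print(f"{name}: requires '{cap}', which this project does not allow")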

                                                        1. 2

                                                          The tag show means this was developed by you? If yes, thanks. Very cool idea! I will try it out. One of the major problems of Cloudwatch Logs in my opinion is its bad UI. I don’t expect much from logs, but Cloudwatch even fails to fulfill my few expectations (search, scroll, and live follow). I was perfectly happy with journalctl when we still had only a few servers.

                                                          In Cloudwatch I can do “Search all”, enter my search term, narrow down the time span, and then wait for annoying seconds or even minutes. And then it only shows me the log lines that match the search. To see the context, I have to click on the instance ID and open that in a new tab. Or perform another search with the correct correlation ID.

                                                          And exporting to S3 with the Cloudwatch UI is also quite annoying. I once looked into creating export jobs with the command line interface (to automate it once per day), but also that seemed to be more work than 5 minutes of programming.

                                                          So even if the virtual filesystem is slow (which I expect, because it has to call the same APIs that the Cloudwatch UI probably uses), I hope that it can make my debugging sessions a whole lot more fun. In the worst case I can execute cat some-log-files > my-cached-log to fetch all logs from the relevant time span and then happily grep my way through the cached log. Yay!

                                                          1. 5

                                                            Disclaimer: I work for AWS, and my opinions are my own and not my employer’s.

                                                            I think CloudWatch Logs (CWL) is an awesome service. One of the services I maintain at work outputs 5 MiB/s of logs in a large region that I can slowly but effortlessly search over. There is an internal-only predecessor to CWL that is sometimes so slow that internal teams rule out being able to search over logs in large regions. With CWL there is no such concern. In some regions I happily see CWL stating its search throughput as multiple GiB/s.

                                                            cwl-mount is partially a need for me to be able to grep --context over logs. I also used it as an opportunity to experiment with FUSE and asynchronous Rust. I’m pleased with the proof-of-concept stage of cwl-mount and can use it both for side-projects and at work.

                                                            One of the major problems of Cloudwatch Logs in my opinion is its bad UI. I don’t expect much from logs, but Cloudwatch even fails to fulfill my few expectations (search, scroll, and live follow). I was perfectly happy with journalctl when we still had only a few servers.

                                                            I agree that sometimes a command-line interface is preferable to a graphical user interface, and that logs is one of those times. That being said, personally I find CWL’s UI OK. If you would like to leave feedback for the CWL team I can connect you with someone internally.

                                                            In Cloudwatch I can do “Search all”, enter my search term, narrow down the time span, and then wait for annoying seconds or even minutes. And then it only shows me the log lines that match the search. To see the context, I have to click on the instance ID and open that in a new tab. Or perform another search with the correct correlation ID.

                                                            Yes, this is precisely the (narrow) use-case that cwl-mount fills. Sometimes I know very precisely where to expect logs but I also need context.

                                                            And exporting to S3 with the Cloudwatch UI is also quite annoying. I once looked into creating export jobs with the command line interface (to automate it once per day), but also that seemed to be more work than 5 minutes of programming.

                                                            Yes, especially with S3 bucket permissions this is not a quick feature to set up, and I agree it would take more than 5 minutes. Again, this is a gap that cwl-mount fills. That being said, some customers value cost efficiency and are willing to set up something that calls CreateExportTask to S3 so that they can e.g. ingest the logs somewhere else, post-process them, etc. That is not my use-case.
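
                                                            If you do want to automate the export route, the core of it is roughly this boto3 sketch; the log group, bucket, and prefix are placeholders, and the bucket still needs a policy that allows CloudWatch Logs to write to it.

                                                                # Sketch: export the last 24 hours of a log group to S3. Names are
                                                                # placeholders; the destination bucket policy must already grant the
                                                                # CloudWatch Logs service permission to put objects.
                                                                import time
                                                                import boto3

                                                                logs = boto3.client("logs")

                                                                now_ms = int(time.time() * 1000)
                                                                one_day_ms = 24 * 60 * 60 * 1000

                                                                logs.create_export_task(
                                                                    taskName="daily-export",
                                                                    logGroupName="/my-service/application",
                                                                    fromTime=now_ms - one_day_ms,
                                                                    to=now_ms,
                                                                    destination="my-log-archive-bucket",
                                                                    destinationPrefix="cloudwatch-exports",
                                                                )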

                                                            So even if the virtual filesystem is slow (which I expect, because it has to call the same APIs that the Cloudwatch UI probably uses), I hope that it can make my debugging sessions a whole lot more fun. In the worst case I can execute cat some-log-files > my-cached-log to fetch all logs from the relevant time span and then happily grep my way through the cached log.

                                                            cwl-mount is slow mostly because FilterLogEvents has a maximum quota of 5 transactions per second per account per region, and this limit cannot be changed. In practice you can burst above the limit and CWL will honor it for a while, but for now cwl-mount is cautious and sticks to 5 TPS. I’ll make that configurable soon.
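
                                                            For a rough idea of what respecting that quota looks like (not cwl-mount’s actual implementation, which is async Rust; just a boto3 sketch with a placeholder log group):

                                                                # Sketch: page through FilterLogEvents while spacing calls out so we stay
                                                                # at or below 5 requests per second. Log group and time range are placeholders.
                                                                import time
                                                                import boto3

                                                                logs = boto3.client("logs")
                                                                MIN_INTERVAL = 1.0 / 5  # FilterLogEvents: 5 TPS per account per region

                                                                def filter_all_events(log_group, start_ms, end_ms):
                                                                    kwargs = {"logGroupName": log_group, "startTime": start_ms, "endTime": end_ms}
                                                                    last_call = 0.0
                                                                    while True:
                                                                        # Simple pacing: wait until MIN_INTERVAL has passed since the last call.
                                                                        wait = MIN_INTERVAL - (time.monotonic() - last_call)
                                                                        if wait > 0:
                                                                            time.sleep(wait)
                                                                        last_call = time.monotonic()

                                                                        response = logs.filter_log_events(**kwargs)
                                                                        yield from response["events"]

                                                                        token = response.get("nextToken")
                                                                        if token is None:
                                                                            return
                                                                        kwargs["nextToken"] = token

                                                                for event in filter_all_events("/my-service/application", 0, int(time.time() * 1000)):
                                                                    print(event["timestamp"], event["message"])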

                                                            Please let me know, or cut GitHub issues, for any comments or feedback that you have about cwl-mount. Thank you!