1. 35

Hi all,

I’m trying to learn more about how small startups are using AWS. I’m particularly interested in:

What sort of software are you deploying, and how do you deploy it (Chef, Puppet, “by hand”, or whatnot)?

How are you provisioning resources in AWS, and what are those resources (Terraform, CloudFormation, the AWS UI)?

If you are using a language like Python or Ruby, how are you making sure your dependencies are correctly installed? How do you update them?

How large is your engineering team?

Do you have anyone who is responsible for maintaining your deployment and provisioning tools?

Thanks.

  2. 7

    I run ipdata.co on AWS.

    • We’re provisioning resources via Terraform; it’s incredibly convenient, especially considering the number of resources we deploy to multiple regions
    • We update dependencies manually, re-deploying using Terraform
    • The team is made up of only one guy.
    • One of the benefits of using a tool like Terraform is how little maintenance is needed after the initial setup
    1. 1

      honest question: How do you make money with this?

      I run an instance of https://github.com/fiorix/freegeoip for fun on one of my servers and it does mostly the same thing. It even has an endpoint that returns CSV, which makes it easy to use from bash.

      1. 3

        We’re fast and present in 10 datacenters globally, so our latencies are very low. Keep in mind that latency is, I think, the biggest reason developers hesitate to use third-party APIs for geolocation.

        https://updown.io/s8dz ~ 68ms

    2. 5

      Full disclosure: I work at Heroku.

      I started writing something more focused but eventually it turned into a brain dump. It would be shorter and to the point if I had more time.


      My team is responsible for managing Postgres/Redis/Kafka operations at the company and our setup is a little… *ahem* different. We never touch the UI and rely entirely on AWS APIs for our day-to-day operations. Requests for servers come in via HTTP API calls and we provision object representations of higher-level “Services” that contain servers, which contain EC2 instances, volumes, security groups, elastic IPs, etc.
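
      To make that concrete, here is a minimal boto3 sketch of what API-driven provisioning looks like. This is illustrative only, not our actual tooling; the IDs, names, and sizes are placeholders.

```python
# Illustrative only: create a security group and an EC2 instance with an extra
# EBS volume -- the kind of objects a higher-level "Service" record might track.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder VPC/AMI IDs.
sg = ec2.create_security_group(
    GroupName="svc-example-pg",
    Description="example Postgres service",
    VpcId="vpc-0123456789abcdef0",
)

resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=1,
    SecurityGroupIds=[sg["GroupId"]],
    BlockDeviceMappings=[
        {"DeviceName": "/dev/xvdb", "Ebs": {"VolumeSize": 100, "VolumeType": "gp2"}},
    ],
    TagSpecifications=[
        {"ResourceType": "instance",
         "Tags": [{"Key": "service", "Value": "postgres-example"}]},
    ],
)
print("provisioned", resp["Instances"][0]["InstanceId"])
```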

      My team is maybe 20(?) or so people, depending on who you ask, and we own the entire provision -> operation -> maintenance -> retirement lifecycle. Other teams have abstracted some things for us, so we’re building on the backs of giants who have built on the backs of giants.

      Part of our model is “shared nothing most of the time”. In the event of a failure, we don’t try and recover disks or instances. Our backup strategy revolves around getting information off the disks as quickly as possible and onto something more durable. This is S3 in our case and we treat pretty much everything else as ephemeral.
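
      In boto3 terms, the “get it off the disk and onto S3” step is basically a managed multipart upload. A sketch, with the bucket, key, and path as placeholders:

```python
# Stream a local backup to S3 with boto3's managed (multipart) transfer.
# Everything on the instance itself is treated as ephemeral.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=8,
)
s3.upload_file(
    Filename="/var/backups/base-backup.tar.gz",   # placeholder path
    Bucket="example-db-backups",                  # placeholder bucket
    Key="postgres/2018-05-01/base-backup.tar.gz",
    Config=config,
)
```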

      What I do and what you are looking to do are a little different, but my advice would be to investigate the native tools AWS gives you and try to work with them as much as possible. There are cases where you need to roll your own tools, but you should have a good reason for that (aside from the usual tech-contrarian opinions). “I don’t trust AWS” and “vendor lock-in” aren’t really things you should consider at an early stage. Get off the ground and worry about the details when you have the revenue to support those ideas. You have to build a wheel before you can build an interstate system.

      Keep ephemeralization in mind. Continue doing more with less until you are doing almost nothing. If you have an operation that AWS can handle for you, just let them. Your objective is to build a business and serve customers, not to build the world’s best infrastructure. Keep UX and self-service in mind. If your developers can’t easily push code changes, recover from bad deploys and scale their apps then you have a problem.

      Look into AWS CodeDeploy, ECS, Lambda, RDS, NLBs, etc. Make sure you understand the AWS networking model as working with VPCs and Security Groups can be quite complex. Don’t rely entirely on their policy simulator as it can be quite confusing at times. Build things in staging and TEST TEST TEST.

      Give each developer their own sub-account for staging/test environments. Some of AWS’s account features make this really easy. Don’t ever use root credentials. Keep an eye on Trusted Advisor to make sure developers aren’t spinning up dozens of r4.xlarge instances that do nothing (or worse, mine bitcoin).
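
      Trusted Advisor is the real tool for this, but even a tiny cron script makes surprise fleets visible. A sketch along these lines (region hard-coded for brevity):

```python
# Count running instances by type so a pile of idle r4.xlarges stands out.
from collections import Counter

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
counts = Counter()

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            counts[instance["InstanceType"]] += 1

for instance_type, n in counts.most_common():
    print(instance_type, n)
```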

      MAKE SURE YOUR S3 BUCKETS AREN’T WORLD WRITABLE. This happens more than you think. You’ve gotta pen test your network to make sure you set it up correctly.
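
      A quick ACL audit is easy to script with boto3. This sketch only covers bucket ACLs; bucket policies need a separate check, and a real pen test covers much more:

```python
# Flag buckets whose ACL grants write access to AllUsers or AuthenticatedUsers.
import boto3

PUBLIC_GROUPS = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}
BAD_PERMISSIONS = {"WRITE", "WRITE_ACP", "FULL_CONTROL"}

s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    acl = s3.get_bucket_acl(Bucket=bucket["Name"])
    for grant in acl["Grants"]:
        grantee = grant.get("Grantee", {})
        if grantee.get("URI") in PUBLIC_GROUPS and grant["Permission"] in BAD_PERMISSIONS:
            print("WARNING:", bucket["Name"], "grants", grant["Permission"], "to", grantee["URI"])
```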

      Learn the core concepts and learn to script as much as possible. The UI should only be used for experimentation early on; after that, you should take the time to teach a computer how to do this stuff. Consider immutability as much as possible. It is far better to throw something away and replace it in AWS land than to try to bring it back online if the root cause isn’t quickly apparent.

      Remember that AWS serves loads of customers and they have to prioritize. If your startup is only paying a few thousand a month then don’t expect immediate responses. They’ll do their best but at that stage, you are pretty much on your own. If you can afford Enterprise Support then pay for it. Money well spent.

      Use Reserved Instances as much as possible. You save a ton of money that way and once you get to a certain size AWS will likely start to cut bulk discount deals.


      If this all sounds scary and you are building a basic web app or API, do yourself a favor and use Heroku (or similar) to get started. If your organization doesn’t have the resources to bring on people to build and manage this full-time, you’re doing yourself a disservice by trying anyway. I learned that the hard way at a previous job when I had a CTO who was allergic to the idea of PaaS.

      That’s just my $0.02.

      1. 2

        Do you mind my asking what the most painful parts of using AWS at Heroku are, if not already covered in your (very thorough) write-up?

        1. 4

          Hmm… Where to begin? I’m not an expert in all of these things but I’ve often heard complaints about the following:

          • Insufficient capacity issues with certain instance types, which involve making support calls to AWS to help us limp along. Smaller regions have this issue quite often.
          • Lack of transitive routing with VPC peering, which makes our Private Spaces product a bit cumbersome. PrivateLink may help but we’re still investigating.
          • STS credentials expiring during long data uploads, which means we need to switch to IAM users, which have hard limits.
          • Cloudwatch being way too expensive for our use case so we have to poll a lot of our instances to determine events. We’ve spoken with them a few times about what we are trying to do and it is simply a use-case they aren’t accounting for right now. Maybe someday. The current pricing structure may have been feasible when more of Heroku was multi-tenant, but that isn’t the case anymore. I’ll accept that as a tradeoff.

          Those are at least the most recent sticking points. We’ve been lucky enough to get in a room with some AWS developers in the past and it was reassuring to hear things like “we know all about it” and “we’re working on a solution”. They’re a huge organization and can be slow to make changes but I genuinely believe they are doing their best.

          1. 3

            Oh, oh, oh!

            People not understanding that CPU credits on t2 instances are a thing. AWS gives you part-time access to the full power of a CPU on their cheaper instances but throttles you down if you use too much. It is nice for use-cases where bursting is required but will break your app like nobody’s business if you keep your instance under high load. There is a reason t2s are so cheap (~$40/month with on-demand pricing for a t2.medium).
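
            If you do run on t2s, it’s worth watching the CPUCreditBalance metric so you notice throttling before your users do. A small boto3 sketch (the instance ID is a placeholder):

```python
# Pull the last few hours of CPUCreditBalance for a t2 instance from CloudWatch.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.utcnow() - timedelta(hours=3),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])  # alert when this trends toward zero
```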

            You get what you pay for.

            1. 2

              Fascinating, thank you for the write-ups!

        2. 3

          Hey-o! This is how we do it at my company, a startup with about ten software engineers:

          Provision with Terraform, deploy with Ansible. Ansible would do very little: just pull down our packaged software, which was Python virtualenvs, via apt-get. Since the virtualenv was completely self-contained it didn’t need to deal with dependencies at all – the idea of installing application-level dependencies via pip or whatever is insane to me.
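
          As a sketch of that approach (not our exact build scripts): bake a self-contained virtualenv with pinned dependencies at build time, then wrap the tree into a .deb so the production host only ever runs apt-get.

```python
# Build-time sketch: create a virtualenv under the eventual install prefix and
# pip-install pinned requirements into it; the tree is then packaged as a .deb
# (e.g. with fpm or dpkg-deb) and installed on servers via apt-get.
import subprocess
import venv

BUILD_DIR = "build/opt/myapp/venv"  # hypothetical install prefix

venv.EnvBuilder(with_pip=True).create(BUILD_DIR)
subprocess.run(
    [BUILD_DIR + "/bin/pip", "install", "--no-cache-dir",
     "-r", "requirements.txt", "."],
    check=True,
)
# No pip ever runs on the production host; deployment just installs the package.
```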

          We supposedly adhered to the “devops” mantra of “devops is a shared responsibility,” but in practice no one wanted to deal with it so it usually fell on one or two people to keep things sane.

          1. 1

            This is pretty similar to what I’ve done in the past although I’d like to have better answers than Terraform or Ansible. Ansible especially turns into a ball-ache once you’re trying to do more than just rsync a binary.

            I’ve been thinking about writing a CLI around https://github.com/frontrowed/stratosphere that uses cloudformation changesets to give me diffs like what Terraform does.

            1. 1

              Yeah, agreed RE: ansible – as soon as you’re doing something complicated with it, you’re doing it wrong. And it’s very tempting to do so simply because it has so much functionality built-in.

              Our infrastructure was designed to be immutable once provisioned, so it really would have made sense to go the kubernetes / ECS route.

          2. 3

            We’re a small shop (~15 folks, ~10 eng), but old (think early 2000s, using mod_perl at the time). Not really a startup but we match the description otherwise so:

            It’s a Python/Django app, https://actionk.it, which some lefty groups online use to collect donations, run their in-person event campaigns and mailing lists and petition sites, etc. We build AMIs using Ansible/Packer; they pull our latest code from git on startup and pip install deps from an internal pip repo. We have internal servers for tests, collecting errors, monitoring, etc.

            We have no staff focused on ops/tools. Many folks pitch in some, but we’d like to have a bit more capacity for that kind of internal-facing work. (Related: hiring! Jobs at wawd dot com. We work for neat organizations and we’re all remote!)

            We’ve got home-rolled scripts to manage restarting our frontend cluster by having the ASG start new webs and tear the old down. We’ve scripted hotfixes and semi-automated releases–semi-automated meaning someone like me still starts each major step of the release and watches that nothing fishy seems to be happening. We do still touch the AWS console sometimes.
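
            For flavor, the core of that kind of cluster-cycling script is just a handful of Auto Scaling API calls. A rough boto3 sketch, not our actual script; the ASG name is a placeholder and a real version polls health instead of sleeping:

```python
# Cycle an ASG: bring up new webs alongside the old ones, then tear the old down.
import time

import boto3

ASG_NAME = "frontend-web"  # placeholder
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
old_instance_ids = [i["InstanceId"] for i in group["Instances"]]

# Double the desired capacity (assumes MaxSize allows it) to start new webs.
autoscaling.set_desired_capacity(
    AutoScalingGroupName=ASG_NAME,
    DesiredCapacity=group["DesiredCapacity"] * 2,
    HonorCooldown=False,
)
time.sleep(300)  # real script: watch instance/ELB health, not a timer

# Terminate the old webs one by one; decrementing shrinks the group back.
for instance_id in old_instance_ids:
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id, ShouldDecrementDesiredCapacity=True)
    time.sleep(30)
```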

            Curious what prompts the question; sounds like market research for potential product or something. FWIW, many of the things that would change our day-to-day with AWS don’t necessarily qualify as Solving Hard Problems at our scale (or 5x our scale); a lot of it is just little pain points and time-sucks it would be great to smooth out.

            1. 6

              FYI, I get a “Your connection is not private” when going to https://actionk.it. Error is NET::ERR_CERT_COMMON_NAME_INVALID, I got this on Chrome 66 and 65.

              1. 2

                Same here on Safari.

                1. 1

                  Sorry, https://actionkit.com has a more boring domain but works :) . Should have checked before I posted, and we should get the marketing site a cert covering both domains.

                2. 1

                  Firefox here as well.

                  1. 1

                    Sorry, I should have posted https://actionkit.com, reason noted by the other comments here.

                  2. 1

                    https://actionk.it

                    This happens because the served certificate is for https://actionkit.com/

                    1. 1

                      D’oh, thanks. Go to https://actionkit.com instead – I just blindly changed the http://actionk.it URL to https://, but our cert only covers the boring .com domain not the vanity .it. We ought to get a cert that covers both. (Our production sites for clients have an automated Let’s Encrypt setup without this problem, for the record :) )

                  3. 3

                    I’m running on Google Cloud Platform, but there’s enough similarities to AWS that hopefully this is helpful.

                    I use Packer to bake a golden VM image that includes monitoring, logging, etc., based on the most recent Ubuntu 16.04 update. I rebuild the golden image roughly monthly unless there is a security issue to patch. Then, when I release new versions of the app, I build an app-specific image based on the latest golden image. It copies in an uberjar from Google Cloud Storage (built by Google Cloud Builder). All of the app images live in the same image family.

                    I then run a rolling update to replace the current instances in the managed instance group with the new instances.

                    The whole infrastructure is managed with Terraform, but I only need to touch Terraform if I’m changing cluster configuration or other resources. Day to day updates don’t need to go through Terraform at all, although now that the GCP Terraform provider supports rolling updates, I may look at doing it with Terraform.

                    It’s just me for everything, so I’m responsible for it all.

                    1. 3

                      We are close to launching https://fluxguard.com on AWS. We are a 2 person engineering team:

                      • We are 100% Javascript for front and back
                      • We use Lambda functions for everything… no EC2 instances
                      • CloudFormation to completely map our entire config as code
                      • DynamoDB as a data store (I have 2nd thoughts about this honestly)
                      • Cognito for user authentication and account creation
                      • SNS for Lambda function orchestration
                      • CloudWatch for logging and alerts
                      • API Gateway for, well, the API Gateway

                      I strongly urge you to check out CloudFormation early. We initially didn’t, and set everything up manually. We quickly realized that it was bonkers to have 30+ API endpoints with really detailed configuration wired to Lambda functions… all hand-crafted in the AWS console. We spent about a month migrating everything to CloudFormation. And we love it. Yes, it can be overwrought. But it makes everything much, much better in the end.
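
                      Even the deploy step can stay out of the console. A minimal boto3 sketch of creating/updating a stack from a template file (stack and file names are placeholders):

```python
# Create the CloudFormation stack if it doesn't exist, otherwise update it.
import boto3
from botocore.exceptions import ClientError

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

with open("template.yaml") as f:
    template_body = f.read()

kwargs = {
    "StackName": "myapp-api",            # placeholder
    "TemplateBody": template_body,
    "Capabilities": ["CAPABILITY_IAM"],  # needed when the template creates IAM roles
}

try:
    cloudformation.create_stack(**kwargs)
    waiter = cloudformation.get_waiter("stack_create_complete")
except ClientError as err:
    if err.response["Error"]["Code"] != "AlreadyExistsException":
        raise
    cloudformation.update_stack(**kwargs)
    waiter = cloudformation.get_waiter("stack_update_complete")

waiter.wait(StackName=kwargs["StackName"])
```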

                      1. 2

                        I use terraform/manual to provision and nix/nixops to manage software.

                        1. 1

                          Cloudformation, Cloud-Init, Puppet, Boto and Fabric. Works like a charm, but none of these tools are perfect.

                          1. 1

                            Hi, we are a small startup with ~10 engineers. We’re using AWS, both for customer services (we deploy a few instances per customer, so they are completely isolated) and for the services that support our own workflows (CI etc.). For customer instances, we currently use simple scripts utilizing the AWS CLI tools. For our supporting infrastructure we use NixOps. NixOps is a blast when it works, though it requires some nix-fu to keep it working.

                            1. 1

                              • Terraform deploys EC2 and other resources.
                              • EC2 resources are bootstrapped with Ubuntu’s built-in cloud-init scripts.
                              • Cloud-init scripts install and run Ansible.
                              • Later updates can be applied to the machines by running the Ansible step again on your EC2 resources.

                              1. 1

                                Part of my current gig (https://yours.co) uses the Serverless framework. To answer your questions:

                                • Deploying NodeJS lambda functions along with a variety of CloudFormation resources (Serverless supports raw CloudFormation syntax for spinning up just about anything on AWS - this can even be customized to non-AWS resources as well). Deployment is sls deploy.
                                • Provisioning is CloudFormation (described previously)
                                • Dependencies are managed with NPM/Yarn, and serverless bundles them accordingly. I don’t know how it handles binary/native dependencies since we don’t have them as yet
                                • Six across front/back/mobile
                                • We don’t have anyone responsible for it - it’s permeated through the team and documented in a Nuclino wiki
                                1. 1

                                  What do you consider small? Right now I’m at a company with 25-ish engineers and around 125-ish employees.

                                  About a dozen products (mostly Ruby on Rails, a couple in Go, and one in Clojure). Puppet to make sure dependencies are installed and updated. Except for gems, which are updated during a deploy.

                                  Terraform to provision and configure AWS specific products (RDS, ECS, etc), Puppet to configure instances, internal CI/CD tool as well as Jenkins.

                                  We do have a number of smaller products and services deployed as artifacts (containers) on a k8s stack.

                                  All of engineering is responsible for the deployment tools and my team (systems engineering) owns the provisioning tools. We expect the rest of engineering to start picking up ownership of Terraform for new products they add.

                                  I do have automation set up for any updates from USN, NVD, and a few other sources to create cards, to stay on top of security updates and vulnerability announcements.

                                  edit: we also have Rundeck available along with some chatops with cog.

                                  1. 1
                                    • Deploying: Basic web apps - compiled Go binaries w/ supporting files, and Python services
                                    • Provisioning: Ansible (cloud modules)
                                    • Configuration Management: Ansible (various modules including pip)
                                    • Engineering Team: 3, all responsible for maintaining ansible playbooks (in a git repo)
                                    1. 1

                                      What sort of software are you deploying, and how do you deploy it (Chef, Puppet, “by hand”, or whatnot)?

                                      We have two production setups. For the first (oldest) one we use Fabric mostly. For the second one we package everything as RPMs and use that to deploy.

                                      If you are using a language like Python or Ruby, how are you making sure your dependencies are correctly installed? How do you update them?

                                      Old system: it was problematic. We had several Python dependencies that were not packaged and used pip to get them in production. We had several virtualenvs with incompatible dependencies. Updates were based on the versions specified in requirements.txt files.

                                      New system: we package everything as RPMs. For Python dependencies, we make an RPM per package that does not exist in the official repos and host them ourselves. It’s harder but way more reliable.

                                      How large is your engineering team?

                                      Total about 10 people, but only 3 deploy back-end code.

                                      Do you have anyone who is responsible for maintaining your deployment and provisioning tools?

                                      We share that responsibility with another colleague.

                                      1. 1

                                        Packer to build AMIs, Terraform to provision resources. I’ve used Ansible to run “setup” tasks during the packer step as well as “finishing touches” tasks during the instance spin-up step.

                                        1. 1

                                          I work at a company with seven people doing technical work, which is broken down into three doing data, two doing application development, and two doing devops. The two of us working on devops are responsible for making sure our deployment and provisioning tools work.

                                          We use terraform to manage all our infrastructure, and have a semi-immutable infrastructure. All our EC2 instances are deployed using custom AMIs, and whenever we need to make a change to a box or class of boxes we build a new AMI. I’ve used ansible to provision a couple of our AMIs, because I prefer it to writing bash commands in json in packer definitions, but that hasn’t taken hold everywhere.

                                          All of our in-house applications, and most of the rest of the ones we use, are deployed using the Elastic Container Service, running on EC2 instances. A couple of applications that aren’t ours or an AWS service just run on the box (Elasticsearch for an application we didn’t write that can’t use the AWS-hosted one, our DNS boxes, our outbound internet proxies, Consul). For those that need some level of dynamic configuration, we use a tool called confd and put the config values in DynamoDB tables.
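
                                          Pushing a value into the table confd watches is a one-liner with boto3. A sketch; the table name is a placeholder and the key/value attribute names are from memory, so check the confd docs for the exact schema its DynamoDB backend expects:

```python
# Write a config value to the DynamoDB table that confd polls.
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("app-config")  # placeholder table name

table.put_item(Item={
    "key": "/myapp/feature_flags/new_checkout",  # the key confd templates reference
    "value": "true",
})
```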

                                          For ensuring dependencies are installed, we run everything in docker containers through ECS. We use Concourse for CI and automated job scheduling. Most of the automations I’ve written are run through concourse, which also runs everything in containers, so dependency management for my scripts is done through a “runner” container.

                                          1. 1

                                            Team: 3 backend engineers/devops.

                                            Software: uberjars for http services.

                                            Packaging: started with Packer for immutable AMIs, switched to a simple cloud-init script that downloads & installs a .deb package from S3, built by Travis.

                                            Deployment: blue-green deploys, with ASGs. Wrote a small cli tool to orchestrate the scaling up and down because nothing existed at the time.
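
                                            A blue/green swap with ASGs boils down to a few Auto Scaling API calls. A rough boto3 sketch, not the actual tool; ASG names and the target group ARN are placeholders, and a real version polls health rather than sleeping:

```python
# Blue/green swap: scale up "green", attach it to the load balancer's target
# group, detach "blue", then scale blue down once traffic has drained.
import time

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

BLUE, GREEN = "api-blue", "api-green"  # placeholders
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/api/0123456789abcdef"
)

blue = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[BLUE])["AutoScalingGroups"][0]
autoscaling.set_desired_capacity(
    AutoScalingGroupName=GREEN, DesiredCapacity=blue["DesiredCapacity"])
time.sleep(300)  # real tool: wait for instances to pass health checks

autoscaling.attach_load_balancer_target_groups(
    AutoScalingGroupName=GREEN, TargetGroupARNs=[TARGET_GROUP_ARN])
autoscaling.detach_load_balancer_target_groups(
    AutoScalingGroupName=BLUE, TargetGroupARNs=[TARGET_GROUP_ARN])

time.sleep(120)  # let connections drain
autoscaling.set_desired_capacity(AutoScalingGroupName=BLUE, DesiredCapacity=0)
```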

                                            All in all, it worked really well. The longest part of deployment was building the uberjar on Travis.

                                            1. 1

                                              Shameless plug, but I explain how to set up secure and fairly simple hosting in AWS in Securing DevOps. Basically: build application containers in CI, then host those containers in AWS Elastic Beanstalk. You can get to a fully automated pipeline in a day or two of work, and it’s mostly maintenance-free and autoscales.
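
                                              A sketch of what the CI deploy step looks like with boto3 (application, environment, bucket, and key names are placeholders, not the book’s exact code):

```python
# Register the bundle CI uploaded to S3 as a new application version, then
# point the Elastic Beanstalk environment at it.
import boto3

eb = boto3.client("elasticbeanstalk", region_name="us-east-1")
version = "build-1234"  # e.g. the CI build number

eb.create_application_version(
    ApplicationName="myapp",
    VersionLabel=version,
    SourceBundle={"S3Bucket": "ci-artifacts", "S3Key": "myapp/" + version + ".zip"},
    Process=True,
)
eb.update_environment(
    ApplicationName="myapp",
    EnvironmentName="myapp-prod",
    VersionLabel=version,
)
```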

                                              1. 1

                                                We’re a small startup that runs on AWS. We deploy a python app, jenkins, gitlab, sentry, and probably other self-hosted services that I’m forgetting.

                                                We’re currently using Terraform, Salt, and bridge some gaps with Python. We started out using Ansible for everything, but couldn’t scale it beyond a few hundred servers. Salt has a higher learning curve, but has been much more scalable for us.

                                                Our team is ~30 devs, and we have 2-3 people who spend part of their time maintaining the above.

                                                1. 1

                                                  At my last place we started from almost 0 AWS infrastructure - so, not a startup but AWS was relatively greenfield.

                                                  We stuck with Terraform for provisioning. It has lots of great features (interpolation syntax, S3 remote backends, DynamoDB locks to prevent devs from modifying the same resources, etc.). I wrote this up; it might be slightly out of date now.

                                                  At one point I explored the angle of doing provisioning (i.e. software installs) from Terraform (it has support for basic shell provisioners, and the Salt provisioner as well), but ultimately we settled on:

                                                  • Create EC2 instances with Terraform
                                                  • Using the shell provisioner, install salt minions on the EC2 instances
                                                  • Also using the shell provisioner, SSH to the salt master and add the new EC2 instance
                                                  • Use Salt to configure software (i.e. install kafka/postgres/whatever)

                                                  I believe a coworker wrote https://github.com/vladislavPV/salt-helper to improve that situation.

                                                  1. 1

                                                    I’m at a smallish company, and although we use Azure, I believe our experience may still be useful for you. We are not an “internet company”; we specialize in a niche market and provide customizations of our stack as our customers need. We provide domain-specific computation services (operations research) via our APIs. Our customers tend to like hybrid-cloud, or even on-premise, installations of our stack, as we serve industries which are often pretty old-fashioned.

                                                    • we use terraform to create the infrastructure. This is the only (purely) Azure specific part.
                                                      • Regarding the Azure-specific part: we actually use some services which have counterparts in other vendors’ stacks, e.g. blob storage and managed databases
                                                      • we mostly use IaaS, as we had some problems with some PaaS components (the LoadBalancers especially didn’t suit our needs).
                                                      • This helps us remain able to provision our solution.
                                                      • All our VMs are on the same version of Ubuntu.
                                                      • We tend to avoid cloud-provider-specific solutions, as our customers might not buy them. Also, data transfer costs and rates can be a problem, but there is no hard rule.
                                                    • We use Ansible to set up the software stack, auto-form clusters on brand new stacks.
                                                      • currently we use separate stacks for each customer, as the solutions are often tailored to their needs, and sometimes they require it to be run in their accounts.
                                                    • Our state-storage services are a mix of managed, and self-hosted services.
                                                      • self hosted services run “bare” on VMs, no containers
                                                      • we only use self-hosted services if
                                                        • it is cheaper to do so, even given the extra costs for redundancy, backup, management, and operations (e.g. managed Postgres did not give enough performance for the money: not only was it expensive, it could not meet our performance target)
                                                        • there is no portable abstraction over cloud providers suiting our needs
                                                        • there is no managed service for our needs at all (we met this in the Geographic Information Services domain)
                                                    • Most services are run in docker containers, all state is externalized to the above services
                                                      • Containerized services are managed by Rancher. This is what I dislike the most. We are investigating moving to Kubernetes and ditching Rancher (although its newer versions are said to support it). It works, but meh.
                                                    • For some machines we have prebuilt VM images, created by Packer. In autoscaling VM groups this greatly reduces startup time.
                                                      • For us it is a maintenance burden given our needs to provide customizations, thus we use it only in special cases.
                                                    • We don’t run terraform manually (only if we need to intervene manually because something has gone terribly wrong, which can only happen in test environments ;)). Instead we run terraform and ansible from “infrastructure CI” builds triggered manually.

                                                    Our engineering team is ~10 people, and we have primary owners for infrastructure, but we also do a round-robin secondary-responsibility rotation to keep everybody somewhat up to date on the Ops side. Tasks are dispatched accordingly, considering severity and urgency. This way everybody can at least kick-start a clean stack (regardless of main specialization), and many can fix bugs, since many have fresh knowledge of the stack or more in-depth knowledge of the tooling.

                                                    I also have personal experience with AWS. Personally I see Azure as primarily competing with AWS (not only in market share), and it seems as if they wanted to copy even the bad parts, to be more AWS-like. This is only an impression though; these similarities may arise from technical constraints under the hood which I have not considered or realized. AWS has simpler and better authorization management. (Yes, there are worse things than IAM out there.) I personally liked DigitalOcean, who also had their quirks, but for an IaaS-only setup they are pretty competitive. Given all this, here is my opinion:

                                                    • Terraform has its quirks, but I recommend it. If it has all the tools you need (some stuff may be missing), it is far better than the native Azure or AWS tooling I have seen.
                                                      • if you are a programmer and read the docs, you might try to solve some problems with a “programmer” approach. Eventually you will realize this won’t work (imperative and functional thinking will also lead you there, if you go down this path). The bad part is that the functions and some parts of the docs make you feel you could do it. :D In the end you’ll get a feel for the tool’s limitations, but it took me quite a few days.
                                                    • Ansible is also pretty good, I like it more than terraform, but its stateless nature did not fit the infra provisioning steps well. For installing software on machines it is pretty good.
                                                      • Infra provisioning: for DigitalOcean or AWS Lightsail it might be OK; I have used it for DO. For creating complex network topologies and then VMs, it did not cut it for us.
                                                      • If you twist your mind a bit you can even do cluster formation using run-once and rolling updates ;)
                                                      • It is stateless, can be used to keep machines up2date without reinstalling (think certificates, ssh keys especially)
                                                      • installing dependencies with ansible is simple. Make sure to mirror everything, as packages sometimes disappear from repos. You’ll notice this if you pin versions, which is also a good idea, as some maintainers don’t follow semver and even sub-minor version bumps have broken apps for us.
                                                    • Containerization: Docker bugs have made our life harder sometimes. Might still be worth using it.
                                                      • development and testing might have been simplified. I’m not 100% sure, but I tend to feel it was worth it. Creating an “emulator environment” for our stack was some work, but it works pretty well by now for us.
                                                      • Docker is not a middleware. If you start fresh and want to use containers, use Kubernetes. Some things which Docker cannot do, but K8s is said to be able to (from a 1-on-1 personal talk with a k8s user, not personal experience):
                                                        • Proper container dependency handling: only start container if its dependencies are healthy. (Think db migration or queue initialization in the “emulator environment”. The tools need to handle situations when they are started, but the relevant container is only starting yet)
                                                        • cron-like periodic event handling. There were times when this would have been handy.
                                                      • Containerization can handle your dependency problems. Both upgrading, and version pinning is ensured at container build time.