Hi all,
I’m trying to learn more about how small startups are using AWS. I’m particularly interested in:
What sort of software are you deploying, and how do you deploy it (Chef, Puppet, “by hand”, or what not)?
How are you provisioning resources in AWS, and what are those resources (Terraform, CloudFormation, the AWS UI)?
If you are using a language like Python or Ruby, how are you making sure your dependencies are correctly installed? How do you update them?
How large is your engineering team?
Do you have anyone who is responsible for maintaining your deployment and provisioning tools?
Thanks.
I run ipdata.co on AWS.
honest question: How do you make money with this?
I run an instance of https://github.com/fiorix/freegeoip for fun on one of my servers and it does mostly the same thing. It even has an endpoint that returns CSV, which makes it easy to use from bash.
We’re fast and present in 10 datacenters globally, so our latencies are very low. Keep in mind that latency is, I think, the biggest reason developers hesitate to use third-party APIs for geolocation.
https://updown.io/s8dz ~ 68ms
Full disclosure: I work at Heroku.
I started writing something more focused but eventually it turned into a brain dump. It would be shorter and to the point if I had more time.
My team is responsible for managing Postgres/Redis/Kafka operations at the company and our setup is a little… *ahem* different. We never touch the UI and rely entirely on AWS APIs for our day-to-day operations. Requests for servers come in via HTTP API calls and we provision object representations of higher-level “Services” that contain servers, which contain EC2 instances, volumes, security groups, elastic IPs, etc.
My team is maybe 20(?) or so people, depending on who you ask, and we own the entire provision -> operation -> maintenance -> retirement lifecycle. Other teams have abstracted some things for us, so we’re building on the backs of giants who have built on the backs of giants.
Part of our model is “shared nothing most of the time”. In the event of a failure, we don’t try and recover disks or instances. Our backup strategy revolves around getting information off the disks as quickly as possible and onto something more durable. This is S3 in our case and we treat pretty much everything else as ephemeral.
What I do and what you are looking to do are a little different, but my advice would be to investigate the native tools AWS gives you and try to work with them as much as possible. There are cases where you need to build your own tools, but you should have a good reason for that (aside from the usual tech-contrarian opinions). “I don’t trust AWS” or “vendor lock-in” aren’t really things you should consider at an early stage. Get off the ground and worry about the details when you have the revenue to support those ideas. You have to build a wheel before you can build an interstate system.
Keep ephemeralization in mind. Continue doing more with less until you are doing almost nothing. If you have an operation that AWS can handle for you, just let them. Your objective is to build a business and serve customers, not to build the world’s best infrastructure. Keep UX and self-service in mind. If your developers can’t easily push code changes, recover from bad deploys and scale their apps then you have a problem.
Look into AWS CodeDeploy, ECS, Lambda, RDS, NLBs, etc. Make sure you understand the AWS networking model as working with VPCs and Security Groups can be quite complex. Don’t rely entirely on their policy simulator as it can be quite confusing at times. Build things in staging and TEST TEST TEST.
Give each developer their own sub-account for staging/test environments. Some of AWS’s account features make this really easy. Don’t ever use root credentials. Keep an eye on Trusted Advisor to make sure developers aren’t spinning up dozens of r4.xlarge instances that do nothing (or worse, mine bitcoin). MAKE SURE YOUR S3 BUCKETS AREN’T WORLD WRITABLE. This happens more than you think. You’ve gotta pen test your network to make sure you set it up correctly.
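For what it’s worth, a quick audit for world-writable buckets is easy to script. A minimal sketch with boto3 (it only checks ACL grants, not bucket policies, and assumes default credentials):

    # Minimal sketch: flag S3 buckets whose ACL grants write access to everyone.
    # Does not cover bucket policies or public access blocks.
    import boto3

    s3 = boto3.client("s3")
    ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"

    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        acl = s3.get_bucket_acl(Bucket=name)
        for grant in acl["Grants"]:
            grantee = grant.get("Grantee", {})
            if grantee.get("URI") == ALL_USERS and grant["Permission"] in ("WRITE", "FULL_CONTROL"):
                print("WORLD-WRITABLE: " + name)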
Learn the core concepts and learn to script as much as possible. The UI should only be used for experimentation early on, after that, you should take the time to teach a computer how to do this stuff. Consider immutability as much as possible. It is far better to throw something away and replace it in AWS land than to try and bring it back online if the root cause isn’t quickly apparent.
Remember that AWS serves loads of customers and they have to prioritize. If your startup is only paying a few thousand a month then don’t expect immediate responses. They’ll do their best but at that stage, you are pretty much on your own. If you can afford Enterprise Support then pay for it. Money well spent.
Use Reserved Instances as much as possible. You save a ton of money that way and once you get to a certain size AWS will likely start to cut bulk discount deals.
If this all sounds scary and you are building a basic web app or API, do yourself a favor and use Heroku (or similar) to get started. If your organization doesn’t have the resources to bring on people to build and manage this full-time, you’re doing yourself a disservice by trying anyway. I learned that the hard way at a previous job when I had a CTO who was allergic to the idea of PaaS.
That’s just my $0.02.
Do you mind my asking what the most painful parts of using AWS at Heroku are, if not already covered in your (very thorough) write-up?
Hmm… Where to begin? I’m not an expert in all of these things but I’ve often heard complaints about the following:
Those are at least the most recent sticking points. We’ve been lucky enough to get in a room with some AWS developers in the past and it was reassuring to hear things like “we know all about it” and “we’re working on a solution”. They’re a huge organization and can be slow to make changes but I genuinely believe they are doing their best.
Oh, oh, oh!
People not understanding that CPU credits on t2 instances are a thing. AWS gives you part-time access to the full power of a CPU on their cheaper instances but throttles you down if you use too much. It is nice for use-cases where bursting is required but will break your app like nobody’s business if you keep your instance under high load. There is a reason t2s are so cheap (~$40/month with on-demand pricing for a t2.medium).
You get what you pay for.
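If you want to keep an eye on this, CloudWatch exposes the credit balance. A rough sketch with boto3 (the instance ID is a placeholder, default credentials and region assumed):

    # Rough sketch: pull the last hour of CPUCreditBalance for a t2 instance.
    import boto3
    from datetime import datetime, timedelta

    cw = boto3.client("cloudwatch")
    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Average"])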
Fascinating, thank you for the write-ups!
Hey-o! This is how we do it at my company, a startup with about ten software engineers:
Provision with Terraform, deploy with Ansible. Ansible would do very little, just pull down our packaged software via apt-get, which was Python virtualenvs. Since the virtualenv was completely self-contained, it didn’t need to deal with dependencies at all – the idea of installing application-level dependencies via pip or whatever is insane to me.
We supposedly adhered to the “devops” mantra of “devops is a shared responsibility,” but in practice no one wanted to deal with it, so it usually fell on one or two people to keep things sane.
This is pretty similar to what I’ve done in the past although I’d like to have better answers than Terraform or Ansible. Ansible especially turns into a ball-ache once you’re trying to do more than just rsync a binary.
I’ve been thinking about writing a CLI around https://github.com/frontrowed/stratosphere that uses cloudformation changesets to give me diffs like what Terraform does.
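For reference, the changeset round-trip such a CLI would wrap is pretty small with boto3. A sketch (stack name, changeset name, and template path are all placeholders):

    # Sketch of the "diff" idea: create a CloudFormation changeset and print what would change.
    import boto3

    cfn = boto3.client("cloudformation")

    with open("template.json") as f:
        template_body = f.read()

    cfn.create_change_set(
        StackName="my-stack",
        ChangeSetName="preview",
        TemplateBody=template_body,
        ChangeSetType="UPDATE",
    )
    # Wait for the changeset to be ready, then show the proposed changes.
    waiter = cfn.get_waiter("change_set_create_complete")
    waiter.wait(StackName="my-stack", ChangeSetName="preview")
    for change in cfn.describe_change_set(StackName="my-stack", ChangeSetName="preview")["Changes"]:
        rc = change["ResourceChange"]
        print(rc["Action"], rc["LogicalResourceId"], rc.get("ResourceType"))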
Yeah, agreed RE: ansible – as soon as you’re doing something complicated with it, you’re doing it wrong. And it’s very tempting to do so simply because it has so much functionality built-in.
Our infrastructure was designed to be immutable once provisioned, so it really would have made sense to go the kubernetes / ECS route.
We’re a small shop (~15 folks, ~10 eng), but old (think early 2000s, using mod_perl at the time). Not really a startup but we match the description otherwise so:
It’s a Python/Django app, https://actionk.it, which some lefty groups online use to collect donations, run their in-person event campaigns and mailing lists and petition sites, etc. We build AMIs using Ansible/Packer; they pull our latest code from git on startup and pip install deps from an internal pip repo. We have internal servers for tests, collecting errors, monitoring, etc.
We have no staff focused on ops/tools. Many folks pitch in some, but we’d like to have a bit more capacity for that kind of internal-facing work. (Related: hiring! Jobs at wawd dot com. We work for neat organizations and we’re all remote!)
We’ve got home-rolled scripts to manage restarting our frontend cluster by having the ASG start new webs and tear the old down. We’ve scripted hotfixes and semi-automated releases–semi-automated meaning someone like me still starts each major step of the release and watches that nothing fishy seems to be happening. We do still touch the AWS console sometimes.
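Not our actual scripts, but the rough shape of the pattern in boto3 (the ASG name is a placeholder, and a real version watches load balancer health instead of sleeping):

    # Simplified sketch of rolling a frontend ASG: scale up, then retire the old instances.
    # Assumes MaxSize allows doubling; real scripts should check ELB/target-group health.
    import time
    import boto3

    asg = boto3.client("autoscaling")
    ASG_NAME = "frontend-web"

    group = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
    old_instances = [i["InstanceId"] for i in group["Instances"]]
    desired = group["DesiredCapacity"]

    # Bring up a fresh set of instances alongside the old ones.
    asg.set_desired_capacity(AutoScalingGroupName=ASG_NAME, DesiredCapacity=desired * 2, HonorCooldown=False)
    time.sleep(300)  # crude stand-in for a real health check loop

    # Tear down the old instances and let desired capacity shrink back.
    for instance_id in old_instances:
        asg.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id, ShouldDecrementDesiredCapacity=True
        )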
Curious what prompts the question; sounds like market research for potential product or something. FWIW, many of the things that would change our day-to-day with AWS don’t necessarily qualify as Solving Hard Problems at our scale (or 5x our scale); a lot of it is just little pain points and time-sucks it would be great to smooth out.
FYI, I get a “Your connection is not private” when going to https://actionk.it. Error is NET::ERR_CERT_COMMON_NAME_INVALID, I got this on Chrome 66 and 65.
Same here on Safari.
Sorry, https://actionkit.com has a more boring domain but works :) . Should have checked before I posted, and we should get the marketing site a cert covering both domains.
Firefox here as well.
Sorry, I should have posted https://actionkit.com, reason noted by the other comments here.
This happens because the served certificate is for https://actionkit.com/
D’oh, thanks. Go to https://actionkit.com instead – I just blindly changed the http://actionk.it URL to https://, but our cert only covers the boring .com domain not the vanity .it. We ought to get a cert that covers both. (Our production sites for clients have an automated Let’s Encrypt setup without this problem, for the record :) )
I’m running on Google Cloud Platform, but there’s enough similarities to AWS that hopefully this is helpful.
I use Packer to bake a golden VM image that includes monitoring, logging, etc., based on the most recent Ubuntu 16.04 update. I rebuild the golden image roughly monthly unless there is a security issue to patch. Then, when I release new versions of the app, I build an app-specific image based on the latest golden image. It copies in an uberjar from Google Cloud Storage (built by Google Cloud Builder). All of the app images live in the same image family.
I then run a rolling update to replace the current instances in the managed instance group with the new instances.
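For anyone curious, the rolling update itself is essentially one gcloud call; a hedged sketch of driving it from Python (group, template, and zone names are placeholders, and the flag syntax may vary by gcloud version):

    # Rough sketch: kick off a rolling replacement of a managed instance group
    # with a new instance template by shelling out to gcloud.
    import subprocess

    MIG = "app-mig"
    TEMPLATE = "app-template-20180501"
    ZONE = "us-central1-a"

    subprocess.run(
        [
            "gcloud", "compute", "instance-groups", "managed",
            "rolling-action", "start-update", MIG,
            "--version", "template=" + TEMPLATE,
            "--zone", ZONE,
        ],
        check=True,
    )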
The whole infrastructure is managed with Terraform, but I only need to touch Terraform if I’m changing cluster configuration or other resources. Day to day updates don’t need to go through Terraform at all, although now that the GCP Terraform provider supports rolling updates, I may look at doing it with Terraform.
It’s just me for everything, so I’m responsible for it all.
We are close to launching https://fluxguard.com on AWS. We are a 2 person engineering team:
I strongly urge you to check out CloudFormation early. We initially didn’t, and manually set up everything. We quickly realized that it was bonkers to have 30+ API endpoints with really detailed configuration to Lambda functions… all hand-crafted in the AWS console. We spent about a month migrating everything to CloudFormation. And we love it. Yes, it can be overwrought. But it makes everything much, much better in the end.
I use terraform/manual to provision and nix/nixops to manage software.
Cloudformation, Cloud-Init, Puppet, Boto and Fabric. Works like a charm, but none of these tools are perfect.
Hi, we are a small startup with ~10 engineers. We’re using AWS, both for customer services (we deploy a few instances per customer, so they are completely isolated) and for the services that support our own workflows (CI etc.). For customer instances, we currently use simple scripts utilizing the AWS CLI tools. For our supporting infrastructure we use NixOps. NixOps is a blast when it works, though it requires some nix-fu to keep it working.
- Terraform deploys EC2 and other resources
- EC2 resources are bootstrapped with Ubuntu’s built-in cloud-init scripts
- Cloud-init scripts install and run Ansible
- Later updates can be applied to the machines by running the Ansible step again on your EC2 resources
Part of my current gig (https://yours.co) uses the Serverless framework. To answer your questions:
sls deploy.
What do you consider small? Right now I’m at a company with 25-ish engineers and around 125-ish employees.
About a dozen products (mostly Ruby on Rails, a couple in Go, and one in Clojure). Puppet to make sure dependencies are installed and updated. Except for gems, which are updated during a deploy.
Terraform to provision and configure AWS specific products (RDS, ECS, etc), Puppet to configure instances, internal CI/CD tool as well as Jenkins.
We do have a number of smaller products and services deployed as artifacts (containers) on a k8s stack.
All of engineering is responsible for the deployment tools and my team (systems engineering) owns the provisioning tools. We expect the rest of engineering to start picking up ownership of Terraform for new products they add.
I do have automation set up for any updates from USN, NVD, and a few other sources to create cards to stay on top of security updates and vulnerability announcements.
edit: we also have Rundeck available along with some chatops with cog.
We have two production setups. For the first (oldest) one we use Fabric mostly. For the second one we package everything as RPMs and use that to deploy.
Old system: it was problematic. We had several Python dependencies that were not packaged and used pip to get them in production. We had several virtualenvs with incompatible dependencies. Updates were based on the versions specified in requirements.txt files.
New system: we package everything as RPMs. For Python dependencies, we make an RPM per package that does not exist in the official repos and host them ourselves. It’s harder but way more reliable.
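If you want a shortcut for churning those out, fpm can convert PyPI packages straight into RPMs; a rough sketch (not our exact tooling, and the package names are placeholders):

    # Rough sketch: build an RPM for each pip dependency that isn't in the official repos,
    # using fpm (https://github.com/jordansissel/fpm).
    import subprocess

    packages = ["requests", "gunicorn"]

    for name in packages:
        # fpm fetches the package from PyPI and converts it into an installable RPM.
        subprocess.run(["fpm", "-s", "python", "-t", "rpm", name], check=True)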
Total about 10 people, but only 3 deploy back-end code.
We share that responsibility with another colleague.
Packer to build AMIs, Terraform to provision resources. I’ve used Ansible to run “setup” tasks during the packer step as well as “finishing touches” tasks during the instance spin-up step.
I work at a company with seven people doing technical work, which is broken down into three doing data, two doing application development, and two doing devops. The two of us working on devops are responsible for making sure our deployment and provisioning tools work.
We use terraform to manage all our infrastructure, and have a semi-immutable infrastructure. All our EC2 instances are deployed using custom AMIs, and whenever we need to make a change to a box or class of boxes we build a new AMI. I’ve used ansible to provision a couple of our AMIs, because I prefer it to writing bash commands in json in packer definitions, but that hasn’t taken hold everywhere.
All of our in-house applications, and most of the rest of the ones we use, are deployed with the Elastic Container Service, running on EC2 instances. A couple of applications that aren’t ours, or that can’t use an AWS managed service, just run on the box (Elasticsearch for an application we didn’t write that can’t use the AWS-hosted one, our DNS boxes, our outbound internet proxies, Consul). For those that need some level of dynamic configuration, we use a tool called confd and put the config values in DynamoDB tables.
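To give a flavor of the confd + DynamoDB combination: a config value is just an item in a table that confd watches. A hedged sketch (the table name and the key/value attribute layout are assumptions based on confd’s DynamoDB backend, not our exact schema):

    # Hedged sketch: put a config value where confd's DynamoDB backend can read it.
    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("app-config")

    table.put_item(Item={"key": "/myapp/upstream/host", "value": "10.0.3.17"})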
For ensuring dependencies are installed, we run everything in docker containers through ECS. We use Concourse for CI and automated job scheduling. Most of the automations I’ve written are run through concourse, which also runs everything in containers, so dependency management for my scripts is done through a “runner” container.
Team: 3 backend engineers/devops.
Software: uberjars for http services.
Packaging: started with Packer for immutable AMIs, switched to a simple cloud-init script that downloads & installs a .deb package from S3, built by Travis.
Deployment: blue-green deploys, with ASGs. Wrote a small cli tool to orchestrate the scaling up and down because nothing existed at the time.
All in all, it worked really well. The longest part of deployment was building the uberjar on Travis.
Shameless plug, but I explain how to set up secure and fairly simple hosting in AWS in Securing DevOps. Basically: build application containers in CI, then host those containers in AWS Elastic Beanstalk. You can get to a fully automated pipeline in a day or two of work, and it’s mostly maintenance-free and autoscales.
We’re a small startup that runs on AWS. We deploy a python app, jenkins, gitlab, sentry, and probably other self-hosted services that I’m forgetting.
We’re currently using Terraform, Salt, and bridge some gaps with Python. We started out using Ansible for everything, but couldn’t scale it beyond a few hundred servers. Salt has a higher learning curve, but has been much more scalable for us.
Our team is ~30 devs, and we have 2-3 people who spend part of their time maintaining the above.
At my last place we started from almost 0 AWS infrastructure - so, not a startup but AWS was relatively greenfield.
We stuck with Terraform for provisioning. It has lots of great features (interpolation syntax, S3 remote backends, DynamoDB locks to prevent devs from modifying the same resources, etc.) I wrote this, it might be slightly out of date now.
At one point I explored the angle of doing provisioning (i.e. software installs) from Terraform (it has support for basic shell provisioners, and the Salt provisioner as well) but ultimately we settled for:
I believe a coworker wrote https://github.com/vladislavPV/salt-helper to improve that situation.
I’m at a smallish company, and although we use Azure, I believe our experience may still be useful for you. We are not an “internet company”; we specialize in a niche market and provide customizations of our stack as our customers need. We provide domain-specific computation services (operations research) via our APIs. Our customers have a tendency to like hybrid cloud, or even on-premise installation of our stack, as we serve industries which are often pretty old-fashioned.
Our engineering team is ~10 people, and we have designated people primarily responsible for infrastructure, but we also do a round-robin secondary-responsibility rotation so that everybody stays somewhat up to date on the ops side. Tasks are dispatched accordingly, considering severity and urgency. This way everybody can at least kick-start a clean stack (regardless of main specialization), and many can fix bugs, since many have fresh knowledge of the stack or more in-depth knowledge of the tooling.
I also have personal experience with AWS. Personally, I see Azure as primarily competing with AWS (not only in market share), and it seems as if they wanted to copy even the bad parts to be more AWS-like. This is only an impression though; these similarities may arise from technical constraints under the hood which I have not considered. AWS has simpler and better authorization management. (Yes, there are worse things than IAM out there.) I personally liked DigitalOcean, which also had its quirks, but for an IaaS-only setup they are pretty competitive. Given all that, here is my opinion: