1. 3

I’m not terribly familiar with computer networks, and I wanted to build out an extremely cheap app that requires little compute and (to the best of my knowledge) doesn’t share RAM/disk, but does require lots and lots of DNS namespaces. Is there a way to run a process like ./process, restart it if it goes down, and register that process with a DNS server so it’s exposed consistently, without having to resort to Docker or Firecracker? Not sure where else to ask this.

  1. 5

    DNS names map to IP addresses. In principle, it is possible to have multiple IP addresses for a single computer, and make each process listen on a different IP address.

    Hard to make a concrete suggestion without knowing more of the requirements, but for example you could have each process on the application server run a script on the DNS server via SSH, which changes the DNS configuration to point to the IP address that the process is using.

    1. 3

      DNS names map to IP addresses

      Some of them do. SRV records, however, map to a port and a host name. If you have a DNS updating API, you can publish SRV records that advertise the location of the service. You can have multiple SRV records for the same resource, so you could do a single DNS query to get a complete list of all the nodes, if it’s small enough.
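
      If you want to poke at one, here’s a rough lookup sketch using the dnspython library (pip install dnspython) - the service name is invented for the example:

          import dns.resolver

          # Each SRV answer carries a priority, a weight, a port, and a target host name.
          for rr in dns.resolver.resolve("_myservice._tcp.example.com", "SRV"):
              print(rr.priority, rr.weight, rr.port, rr.target)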

      1. 1

        SRV records sound interesting; I’m only familiar with CNAME, A, and AAAA so far. Do you know of any particularly great resources / tutorials for reading up on SRV records and applying them to my use case or something similar?

        1. 2

          The Wikipedia page is actually pretty good. It describes the format. For your use case, I imagine that you’d want to set a very short (1-10 second) TTL so that they’re quickly invalidated, though you could set something longer and just handle the case in your system where some nodes drop out.

          Note that this can be complicated if you want these to be run on consumer connections. If you’re behind a NAT you may need to do some more clever work to figure out your public IP and the port that’s being forwarded (e.g. via a STUN server or UPnP / IGDP). From your post below, it sounds as if you’re just exposing a public server that multiple clients connect to, so DNS SRV records might be a good choice.

          If you’re providing a (potentially) different connection endpoint for each spreadsheet, then you can use this as your addressing layer. If you name sheets as something like {sheet name}.{username}.{your domain}, then you’d look up something like _yingw787sheet._tcp.{sheet name}.{username}.{your domain} to get the host IP and port of the server. Or, if you want to give users the ability to transfer ownership of sheets and rename them, you’d use a CNAME entry for {sheet name}.{username}.{your domain} that would then return a domain in the form {UUID}.{your domain}, and then you’d do the SRV lookup on _yingw787sheet._tcp.{UUID}.{your domain}. That also makes it fairly trivial for other people to host instances of the service, and even lets people do their own namespace management while using your hosted version.

          Of course, like most other uses of DNS, this is a horrible abuse and not what DNS was designed for. You may find that free DNS providers start to object when they see that you have 100,000 DNS entries in a single domain with short TTLs and thousands of requests per second.

      2. 1

        I’m thinking about using Route 53 as my DNS server, and I don’t think I have SSH access. I think I do have API access to create and destroy records though. If I use DNS-based service discovery (which I know nothing about currently), would it be possible to register an EC2 instance with a master process to DNS upon provisioning, then have the master process issue API calls to DNS for each process? If this stack does work, how might this change if I add in a proxy layer?
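
        Something like this boto3 sketch is what I’m imagining for the record-update call - the zone ID, names, and port are placeholders:

            import boto3

            route53 = boto3.client("route53")
            route53.change_resource_record_sets(
                HostedZoneId="Z0000000EXAMPLE",  # placeholder hosted zone ID
                ChangeBatch={
                    "Changes": [{
                        "Action": "UPSERT",  # create the record, or update it if it exists
                        "ResourceRecordSet": {
                            "Name": "_myservice._tcp.sheet1.example.com.",
                            "Type": "SRV",
                            "TTL": 10,  # short TTL so dead processes age out quickly
                            # SRV record value format: priority weight port target
                            "ResourceRecords": [{"Value": "0 5 8080 host1.example.com."}],
                        },
                    }],
                },
            )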

      3. 3

        I think something like https://www.consul.io/ will fit what you’re looking for.

        1. 1

          This looks interesting. I had heard of Consul before but I didn’t put two and two together. Thanks!

          Have you used Consul in production before? How often might it fail, and how often do you need to update it due to a critical need (e.g. security vulnerability)?

          1. 1

            Also I think Consul is meant to be natively integrated into Kubernetes…and EKS costs something like $0.10 / hr which is $72 / mo…without factoring in EC2 costs on top of that. My budget is something like $5-10 / mo. for personal projects.

          2. 3

            What is this app?

            If it’s an HTTP(S) server, then yes, this is pretty straightforward.

            When your browser visits google.com, the HTTP request it sends includes a line like Host: google.com. Your app can switch on that line without needing to have multiple IP addresses or anything.
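
            For illustration, a minimal sketch of switching on the Host header with only Python’s stdlib - the hostnames and responses are made up:

                from http.server import BaseHTTPRequestHandler, HTTPServer

                # Hypothetical mapping from Host header to per-sheet content.
                SHEETS = {
                    "sheet1.example.com": "first sheet",
                    "sheet2.example.com": "second sheet",
                }

                class HostSwitcher(BaseHTTPRequestHandler):
                    def do_GET(self):
                        host = (self.headers.get("Host") or "").split(":")[0]  # drop any :port
                        body = SHEETS.get(host, "unknown host").encode()
                        self.send_response(200 if host in SHEETS else 404)
                        self.send_header("Content-Length", str(len(body)))
                        self.end_headers()
                        self.wfile.write(body)

                HTTPServer(("", 8080), HostSwitcher).serve_forever()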

            1. 1

              I’m building out a tiny spreadsheet alternative for myself; it’s like an open-source Airtable, with APIs and DB cursor access. I mostly want to get away from using Google Sheets. The backend stack would be PostgreSQL + PostgREST (a web server that lifts PostgreSQL into HTTP). Yeah, PostgREST is an HTTP server. I’d either be provisioning PostgreSQL + PostgREST per individual spreadsheet, or one db + API for all user spreadsheets, but I’d rather do the former because ElephantSQL sets per-database limits on its free tier for disk usage and # of concurrent connections (which makes sense to me from first principles).

              I’m not sure I fully understand your point on routing HTTP requests. Can you elaborate more on that please?

              1. 2

                because ElephantSQL sets limits on free tiers per database on disk usage + # of concurrent connections

                Building around ‘what can I get for free from a specific provider’ is a very fast way to burn a week’s engineering time to save $5.

                They can - and likely will - detect and ban you for circumventing the free tier, which will likely cause you to lose all your data.

                For a project at this stage, I’d strongly recommend a single DB and a single API server, both running on the same box. You’ll need a machine to run the API server anyway - put your PostgreSQL there too.

                AWS, Google, Azure, and Oracle each have a free tier that can do this.

                The hardest part will be figuring out cron for backups (it never has the right env vars), but you can leave that until you’ve been using it for a few days (don’t wait months though!).
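
                For instance, something along these lines - the paths, names, and schedule are all assumptions; the point is that everything cron needs is spelled out instead of inherited from a login shell:

                    #!/usr/bin/env python3
                    # Hypothetical backup script. A matching crontab entry might be:
                    #   15 3 * * * /usr/bin/python3 /home/app/backup.py >> /home/app/backup.log 2>&1
                    import os
                    import subprocess
                    from datetime import datetime, timezone

                    # Cron runs with a near-empty environment, so hard-code what pg_dump needs;
                    # the password should come from ~/.pgpass, not from an env var.
                    DB_URI = "postgresql://app@localhost/app"  # assumed connection string
                    OUT_DIR = "/home/app/backups"              # assumed backup directory

                    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
                    out_file = os.path.join(OUT_DIR, f"app-{stamp}.sql")
                    subprocess.run(["pg_dump", "--file", out_file, DB_URI], check=True)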

                I’m not sure I fully understand your point on routing HTTP requests. Can you elaborate more on that please?

                As you’re using an existing off-the-shelf server, some things are slightly more complex (assuming you don’t want to patch & compile it yourself), since you’re restricted to what it already knows how to do.

                Unless you have particularly fancy needs (multiple users?), I would recommend against separating ‘document’ and ‘sheet’ (the way e.g. Excel does), and instead just having sheets. If you jam it all in one schema in one database on one domain name, you could be using it tomorrow. Manually separate unrelated data by giving it a sensible name.

                If you insist on nesting sheets inside documents, consider using the document name as a prefix - you can get your client to run a “list all tables whose names have the prefix 'mydocument'” query and only show those.
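
                A sketch of that query via psycopg2, with the prefix and connection string as stand-ins:

                    import psycopg2

                    conn = psycopg2.connect("postgresql://app@localhost/app")
                    with conn, conn.cursor() as cur:
                        cur.execute(
                            "SELECT table_name FROM information_schema.tables "
                            "WHERE table_schema = 'public' AND table_name LIKE %s",
                            ("mydocument\\_%",),  # backslash-escape _ so it isn't a wildcard
                        )
                        sheets = [row[0] for row in cur.fetchall()]
                    print(sheets)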

                Further separation could be obtained (with additional effort) via the multi-schema support in PostgREST. However, you’d then need to update the config file and restart a server every time you add a spreadsheet - see the relevant bit of the PostgREST docs.

                1. 1

                  I think you’re right. Thanks for talking me out of my original plan. I do have $50k worth of AWS credits, but since I was on sabbatical and was hesitant to start something serious, I never really expected to take advantage of them, because I can’t keep using the stack after the credits expire. Now I think maybe that’s a bit bone-headed and I should go ahead and build it using the free credits anyway. Worst comes to worst, I can always export my data to S3 and shut down the stack.

                  Hmm, I hadn’t seen that bit of the PostgREST documentation. I should go back and read through the whole thing before pinging people with questions. Thanks again for your help!!

            2. 3

              You can have a wildcard domain, such as *.mydomain.com, that routes to some dispatch service.

              Then each process you run registers itself (in a database, by writing to disk, etc.), and the dispatch service takes myprocessname.mydomain.com and routes it to that process.

              Theoretically, the dispatch service could also do a lookup of active processes running and dynamically route based on that, but I assume there might be some overhead with that approach (without caching).
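
              Roughly like this sketch, where the registry dict stands in for whatever shared store the processes register themselves in - all names and ports are invented, and error handling is omitted:

                  from http.server import BaseHTTPRequestHandler, HTTPServer
                  from urllib.request import urlopen

                  # Stand-in for the shared store that processes write themselves into.
                  REGISTRY = {"myprocessname": 9001, "otherprocess": 9002}

                  class Dispatcher(BaseHTTPRequestHandler):
                      def do_GET(self):
                          host = (self.headers.get("Host") or "").split(":")[0]
                          name = host.split(".")[0]  # myprocessname.mydomain.com -> myprocessname
                          port = REGISTRY.get(name)
                          if port is None:
                              self.send_error(404, "no such process")
                              return
                          upstream = urlopen(f"http://127.0.0.1:{port}{self.path}")
                          body = upstream.read()
                          self.send_response(upstream.status)
                          self.send_header("Content-Length", str(len(body)))
                          self.end_headers()
                          self.wfile.write(body)

                  HTTPServer(("", 8080), Dispatcher).serve_forever()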

              1. 1

                Hmm, this is interesting. So if I understand you correctly, you’re saying I can create a master process on the instance (or on a separate server?) that can read a remote database, figure out the process, and route the request to that process?

                Are there any examples of this approach on GitHub, or in books you know and recommend? How might this approach change if individual processes and instances are shielded behind a load balancer?

                Can you name individual UNIX processes, or do you have to search by PID or command? I’m guessing that even if PIDs don’t change over the process lifetime, if the process goes down and restarts, it’ll have a different PID. And relying on a failing process to properly issue an HTTP call to update a remote registry isn’t wise, because you’d be coupling failure models within your system design.

                1. 2

                  Also perhaps look into dbus: https://news.ycombinator.com/item?id=9451023

                  1. 2

                    I don’t know enough about process name / PID internals to say how easy or hard the lookup is - that’s also why I suggest that each process self-registers in some shared store (DB, disk, in-memory service, etc.). Someone further up suggested Consul, which fits this role well - in general, “service discovery” is probably what you should be googling.

                    The router (that takes incoming requests and sends them to the right process) can live either on the same instance or somewhere else, assuming you use HTTP for communication. If you want to use sockets or similar IPC, you’ll need to have it on the same instance.

                    To handle failing processes, you could either have a heartbeat mechanism from the router checking that the process is up (with X failures before unregistering it) or you could just have a timeout on all incoming requests and make it so that a new process registering will overwrite the old process’ registration.

                    It’s hard to be more specific without pulling in specific code samples or talking about actual implementation details.
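
                    That said, a rough sketch of the heartbeat variant - the registry contents, ports, and the /health path are all assumptions:

                        import time
                        import urllib.request

                        REGISTRY = {"myprocessname": 9001}  # name -> local port
                        failures = {name: 0 for name in REGISTRY}
                        MAX_FAILURES = 3

                        while True:
                            for name, port in list(REGISTRY.items()):
                                try:
                                    urllib.request.urlopen(
                                        f"http://127.0.0.1:{port}/health", timeout=2)
                                    failures[name] = 0
                                except OSError:
                                    failures[name] += 1
                                    if failures[name] >= MAX_FAILURES:
                                        # A restarted process re-registers, overwriting this entry.
                                        del REGISTRY[name]
                            time.sleep(5)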

                2. 2
                  • Where is this going to run? The “where” determines a lot about what DNS will point at, and who will be responsible for restarting the process. Is the process going to run on something like Heroku, AWS Lambda, or a VM?
                  • If it is going to run on a VM, how is it going to be restarted? systemd, supervisord, someone doing it by hand?
                  • Do you own the domain name, for which you are going to use a lot of subdomain names?
                  • Where is that domain name hosted? Is there an API to manage it?
                  • Is your process a web server, or is it simply opening a socket and waiting for connections via TCP or UDP?

                  Mind you, you are not required to know all these answers, but unless you provide some context using them as a guide, any advice we offer is useless.

                  1. 1

                    So the impetus behind this idea is that I wanted to move away from Google Sheets to something like Airtable, because I’m more familiar with SQL and databases and APIs than I am with spreadsheets and formulas. But I don’t like depending on closed-source platforms, because any change they make to their platform is a Black Swan event that I may need to address (or can’t address). So I wanted to build out a tiny alternative to Airtable using PostgreSQL and PostgREST, plus some very basic UI layer I’ll write myself. I don’t care too much about site performance, I care a little bit about availability, and I mostly care about data freedom and integrity.

                    I found that ElephantSQL (PostgreSQL as a Service) can create databases for free using a multi-tenant (?) model, and EC2 / DigitalOcean / etc. are fairly cheap for a tiny instance. I’ve worked with containers, and there’s a lot of cognitive load just to get them working in production (especially if I move on in my career and forget about it). So I’d rather create a VM and fill it to the brim with processes.

                    Problem is, PostgREST has a static configuration for a PostgreSQL URI, which means it can’t support multiple databases without restarting. I could model my “spreadsheet” around this and create each spreadsheet as a table. However, ElephantSQL restricts the free tier to 20MB of data per database, and database costs are scary for personal projects. It also seems like a waste of resources to have an entire EC2 instance for one process, which could negatively impact my costs over time.

                    • I’m thinking about running these processes either on Heroku free tier (which I think probably uses EC2 spot instances underneath), or using an EC2 group + load balancer. If it’s the EC2 group, then I’d probably want to fill up the server with processes.
                    • I’m not terribly familiar with systemd or supervisord, but I think systemctl uses systemd under the hood, so probably that… (see the unit sketch after this list)
                    • Yes, I own the domain name.
                    • The domain is hosted on AWS Route 53, so I can use AWS APIs to manage it.
                    • The process is a web server.
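
                    From what I understand, a minimal systemd unit that restarts a process when it dies looks something like this - every path and name here is a placeholder, not a vetted config:

                        # Hypothetical unit at /etc/systemd/system/myprocess.service;
                        # enable with: systemctl enable --now myprocess
                        [Unit]
                        Description=My spreadsheet process
                        After=network.target

                        [Service]
                        # The ./process from the original question; systemd wants an absolute path.
                        ExecStart=/home/app/process
                        # Restart whenever the process exits, after a 2-second pause.
                        Restart=always
                        RestartSec=2

                        [Install]
                        WantedBy=multi-user.target
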
                    1. 2

                      Here is how I’d approach it:

                      • Domain name on R53, with API calls to make any changes needed.
                      • Server on Hetzner (cheaper than the rest, and considerably better than its old bad reputation). For a 32G machine you can choose between previously used bare metal or a VM.
                      • I’ve run multiple instances of PostgREST on a 16G VM on Hetzner without any issues.
                      • Use docker-compose to run everything: the web service, the PostgREST instances, and Postgres (sketched at the end of this comment). I’d use a host directory mount as the Postgres data partition.
                      • Database backups on S3.
                      • Use Mailgun or some other sending provider to send email if needed (they charge too).

                      YMMV, and I can’t judge how much you’re willing to spend. You could do most of this with a $4/month server and no Docker, depending on the load.
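
                      For reference, a bare-bones sketch of that docker-compose setup - image tags, passwords, and the schema/role values are placeholders rather than a vetted config:

                          # docker-compose.yml (hypothetical values throughout)
                          services:
                            db:
                              image: postgres:16
                              environment:
                                POSTGRES_PASSWORD: change-me
                              volumes:
                                # host directory mount as the postgres data partition
                                - ./pgdata:/var/lib/postgresql/data
                            api:
                              image: postgrest/postgrest
                              environment:
                                PGRST_DB_URI: postgres://postgres:change-me@db:5432/postgres
                                PGRST_DB_SCHEMA: public
                                PGRST_DB_ANON_ROLE: web_anon
                              ports:
                                - "3000:3000"
                              depends_on:
                                - db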