I like property based testing and it helped me a lot in the past. It’s a great tool for certain domains and with a good setup it could allow more people to contribute.
Having said that, the first part of the article mentions how sales and marketing would rarely contribute to testing, and the very first example starts with using regexes, which are hard even for programmers. Wouldn’t it be easier (yet longer) to implement it with a strategy that selects from several other, simpler strategies?
Thanks for sharing! I thought it would be important to give the context from which I approached property-based testing, and the article is mostly an intro for programmers in the field already, hence discussion around corporate and inter-departmental psychology. I believe hypothesis has strategies.characters() which generates a stream of individual characters, but if you want strings I think the only option is st.from_regex(). I might be wrong though. You definitely can chain multiple strategies together.
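To make the “chain several simpler strategies” idea concrete, here’s a rough sketch (purely illustrative: the phone-number-ish shape is made up, but st.text(), st.sampled_from(), st.builds(), and st.one_of() are all real hypothesis combinators):

```python
from hypothesis import given, strategies as st

# Build strings from small, easy-to-read pieces instead of a regex.
digits = st.text(alphabet="0123456789", min_size=3, max_size=3)
separator = st.sampled_from(["-", ".", " "])
phone_like = st.builds(lambda a, sep, b: f"{a}{sep}{b}", digits, separator, digits)

# Or pick between several simple alternatives with one_of().
identifier_like = st.one_of(
    st.text(alphabet="abcdefghijklmnopqrstuvwxyz", min_size=1),
    phone_like,
)

@given(identifier_like)
def test_strip_is_idempotent(s):
    # A throwaway property, just to show the plumbing.
    assert s.strip().strip() == s.strip()
```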
Man, this was jarring. I wish they had de-risked the UI change by forming a UX migration plan and then A/B testing one part at a time, rather than doing everything wholesale.
You can have a wildcard domain, such as `*.mydomain.com`, that routes to some dispatch service.
Then each process you run registers itself (in a database, on disk, etc.), and the dispatch service takes myprocessname.mydomain.com and routes it to that process.
Theoretically, the dispatch service could also do a lookup of active processes running and dynamically route based on that, but I assume there might be some overhead with that approach (without caching).
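As a very rough sketch of that dispatch idea (hostnames, ports, and the in-memory registry are placeholders; a real setup would read registrations from a database or disk):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Stand-in for the shared registration store.
REGISTRY = {
    "myprocessname.mydomain.com": 8001,
    "otherprocess.mydomain.com": 8002,
}

class Dispatcher(BaseHTTPRequestHandler):
    def do_GET(self):
        host = self.headers.get("Host", "").split(":")[0]
        port = REGISTRY.get(host)
        if port is None:
            self.send_error(404, "no process registered for this hostname")
            return
        # Forward to the registered process and relay the body back.
        with urlopen(f"http://127.0.0.1:{port}{self.path}") as upstream:
            body = upstream.read()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), Dispatcher).serve_forever()
```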
Hmm, this is interesting. So if I understand you correctly, you’re saying I can create a master process on the instance (or on a separate server?), that can read a remote database, figures out the process, and routes the request to that process?
Are there any examples of this approach on GitHub, or in books you know and recommend? How might this approach change if individual processes and instances are shielded behind a load balancer?
Can you name individual UNIX processes, or do you have to search by PID or command? I’m guessing that even if PIDs don’t change over the process lifetime, if the process goes down and it restarts, then it’ll have a different PID, and relying on a failing process to properly issue an HTTP call to update a remote registry isn’t wise because you’ll be coupling failure models within your system design.
I don’t know enough about process name / PID internals to know how easy or hard it is to look up - that’s also why I suggest that each process self-registers in some shared store (DB, disk, in-memory service, etc.). Someone further up suggested Consul which fits this role well - in general, “service discovery” is probably what you should be googling.
The router (that takes incoming requests and sends them to the right process) can both live on the same instance or somewhere else, assuming you use HTTP for communication. If you want to use sockets or similar IPC, you’ll need to have it on the same instance.
To handle failing processes, you could either have a heartbeat mechanism from the router checking that the process is up (with X failures before unregistering it) or you could just have a timeout on all incoming requests and make it so that a new process registering will overwrite the old process’ registration.
It’s hard to be more specific without pulling in specific code samples or talking about actual implementation details.
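In lieu of real code samples, here’s a toy version of the heartbeat idea (the health-check URL, threshold, and registry shape are all assumptions):

```python
import time
import urllib.request

MAX_FAILURES = 3
registry = {"myprocessname.mydomain.com": "http://127.0.0.1:8001"}
failures = {name: 0 for name in registry}

def heartbeat_pass():
    for name, base_url in list(registry.items()):
        try:
            urllib.request.urlopen(f"{base_url}/health", timeout=2)
            failures[name] = 0
        except Exception:
            failures[name] += 1
            if failures[name] >= MAX_FAILURES:
                # X consecutive failures: unregister, so the router stops
                # sending traffic there; a restarted process re-registers.
                del registry[name]

while True:
    heartbeat_pass()
    time.sleep(10)
```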
Where is this going to run? The “where” determines a lot about where DNS is going to point, and who will be responsible for restarting the process. Is it going to be something like Heroku, AWS Lambda, or a VM that the process will run on?
If it is going to run on a VM, how is it going to be restarted? systemd, supervisord, someone doing it by hand?
Do you own the domain name for which you are going to create a lot of subdomains?
Where is that domain name hosted? Is there an API to manage it?
Is your process a web server, or is it simply opening a socket waiting for connections via TCP or UDP?
Mind you, you are not required to know all these answers, but unless you provide some context using them as a guide, any advice we offer is useless.
So the impetus behind this idea is that I wanted to move away from Google Sheets to something like Airtable, because I’m more familiar with SQL, databases, and APIs than I am with spreadsheets and formulas. But I don’t like depending on closed-source platforms, because any change they make to their platform is a Black Swan event for me that I may need to address (or can’t address). So I wanted to build out a tiny alternative to Airtable using PostgreSQL and PostgREST, and some very basic UI layer I’ll write myself. I don’t care too much about site performance, I care a little bit about availability, and I mostly care about data freedom and integrity.
I found that ElephantSQL (PostgreSQL as a Service) can create databases for free using a multi-tenant (?) model, and EC2 / DigitalOcean / etc. is fairly cheap for a tiny instance. I’ve worked with containers and there’s a lot of cognitive load just to get it working in production (especially if I move on in my career and forget about it). So I’d rather create a VM and fill it to the brim with processes.
Problem is, PostgREST has a static configuration for a PostgreSQL URI, which means it can’t support multiple databases without restarting. I could model my “spreadsheet” around this and create each spreadsheet as a table. However, ElephantSQL restricts the free tier to 20 MB of data per database, and database costs are scary for personal projects. It also seems like a waste of resources to have an entire EC2 instance for one process, which could negatively impact my costs over time.
I’m thinking about running these processes either on Heroku free tier (which I think probably uses EC2 spot instances underneath), or using an EC2 group + load balancer. If it’s the EC2 group, then I’d probably want to fill up the server with processes.
I’m not terribly familiar with systemd or supervisord, but I think systemctl uses systemd underneath the hood, so probably that…
Yes, I own the domain name.
The domain is hosted on AWS Route 53, so I can use AWS APIs to manage it.
Here is how I’d approach it:
Domain name on R53; API calls to make any changes needed.
Server on Hetzner (cheaper than the rest, and considerably better than its previous bad reputation). For a 32 GB machine you can choose between previously used bare metal or a VM.
I’ve run multiple instances of PostgREST on a 16 GB VM on Hetzner without any issues.
Use docker-compose to run everything: the web service, the PostgREST instances, and Postgres. I’d use a host directory mount as the Postgres data volume (see the sketch after this list).
Database backups on S3.
Use Mailgun or some other email provider to send mail if needed (they charge too).
YMMV; I cannot judge how much you’re willing to spend. You could do most of this with a $4/month server and no Docker, depending on the load.
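A hypothetical docker-compose.yml along those lines (service names, ports, and the password are placeholders; the PGRST_* variables are the ones the official postgrest/postgrest image reads):

```yaml
version: "3"
services:
  db:
    image: postgres:12
    environment:
      POSTGRES_PASSWORD: changeme
    volumes:
      - /srv/pgdata:/var/lib/postgresql/data   # host directory mount
  api:
    image: postgrest/postgrest
    environment:
      PGRST_DB_URI: postgres://postgres:changeme@db:5432/postgres
      PGRST_DB_SCHEMA: public
      PGRST_DB_ANON_ROLE: web_anon
    ports:
      - "3000:3000"
    depends_on:
      - db
  web:
    image: nginx:alpine   # stands in for the web/UI service
    ports:
      - "80:80"
```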
What is this app? If it’s an HTTP(S) server, then yes, this is pretty straightforward.
When your browser visits google.com, the HTTP request it sends includes a line like Host: google.com. Your app can switch on that line without needing to have multiple IP addresses or anything.
I’m building out a tiny spreadsheet alternative for myself; it’s like open-source Airtable, with APIs and DB cursor access. I mostly want to get away from using Google Sheets. The backend stack would be PostgreSQL + PostgREST (web server to lift PostgreSQL into HTTP). Yeah, PostgREST is an HTTP server. I’d either be provisioning PostgreSQL + PostgREST for an individual spreadsheet, or one db + API for all user spreadsheets, but I’d rather do the former because ElephantSQL sets limits on free tiers per database on disk usage + # of concurrent connections (which makes sense to me from first principles).
I’m not sure I fully understand your point on routing HTTP requests. Can you elaborate more on that please?
“because ElephantSQL sets limits on free tiers per database on disk usage + # of concurrent connections”
Building around “what can I get for free from a specific provider” is a very fast way to burn a week’s engineering time to save $5.
They can - and likely will - detect and ban you for circumventing the free tier, which will likely cause you to lose all your data.
For a project at this stage, I’d strongly recommend a single DB and a single API server, both running on the same box. You’ll need a machine to run the API server anyways - put your postgresql there too.
AWS, Google, Azure, and Oracle each have a free tier which can do this.
The hardest part will be figuring out cron for backups (it never has the right env vars), but you can leave that until you’ve been using it for a few days (don’t wait months though!).
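For illustration, the usual workaround for cron’s bare environment is to set the variables explicitly at the top of the crontab (everything here - paths, credentials, bucket name - is a placeholder):

```
PATH=/usr/local/bin:/usr/bin:/bin
PGPASSWORD=changeme
# Every day at 03:15: dump the database and copy it to S3.
15 3 * * * pg_dump -h localhost -U postgres mydb | gzip > /tmp/mydb.sql.gz && aws s3 cp /tmp/mydb.sql.gz s3://my-backups/mydb-$(date +\%F).sql.gz
```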
“I’m not sure I fully understand your point on routing HTTP requests. Can you elaborate more on that please?”
As you’re using an existing off-the-shelf server, some things are slightly more complex (assuming you don’t want to patch & compile it yourself), since you’re restricted to what it already knows how to do.
Unless you have particularly fancy needs (multiple users?), I would recommend against separating ‘document’ and ‘sheet’ (the way e.g. excel does), instead just having sheets. If you jam it all in one schema in one database on one domain name, you could be using it tomorrow. Manually separate unrelated data by giving it a sensible name.
If you insist on nesting sheets inside documents, consider using the document name as a prefix - you can get your client to run a “list all tables whose names have prefix 'mydocument'” query and only show those ones.
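That “list by prefix” query is a one-liner against information_schema; a sketch of the client-side helper (assuming psycopg2 and that the sheets live in the public schema):

```python
import psycopg2

def tables_in_document(conn, document):
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT table_name
            FROM information_schema.tables
            WHERE table_schema = 'public'
              AND table_name LIKE %s
            ORDER BY table_name
            """,
            (document + "%",),
        )
        return [row[0] for row in cur.fetchall()]

conn = psycopg2.connect("dbname=sheets user=postgres")
print(tables_in_document(conn, "mydocument"))
```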
Further separation could be obtained (with additional effort) via the multi-schema support in PostgREST. However, then you need to update the config file and restart a server every time you add a spreadsheet - see the relevant bit of the PostgREST docs.
I think you’re right. Thanks for talking me out of my original plan. I do have $50k worth of AWS credits, but since I was on sabbatical and was hesitant to start something serious, I never really expected to take advantage of those credits because I can’t use it after the credits expire. Now I think maybe that’s a bit bone-headed and I should go ahead and build it using the free credits anyways. Worst comes to worst, I can always export my data to S3 and shut down the stack.
Hmm, I didn’t know about that bit of the PostgREST documentation. I should go back and read through the whole thing before pinging people with questions. Thanks again for your help!!
DNS names map to IP addresses. In principle, it is possible to have multiple IP addresses for a single computer, and make each process listen on a different IP address.
Hard to make a concrete suggestion without knowing more of the requirements, but for example you could have each process on the application server run a script on the DNS server via SSH, which changes the DNS configuration to point to the IP address that the process is using.
Some of them do. SRV records, however, map to a port and a DNS record name. If you have a DNS updating API, you can publish SRV records that advertise the location of the service. You can have multiple SRV records for the same resource, so you could do a single DNS query to get a complete list of all of the nodes if it’s small enough.
SRV records sound interesting; I’m only familiar with CNAME, A, and AAAA records so far. Do you know of any particularly good resources or tutorials for reading or coding more about SRV records for my use case or something similar?
The Wikipedia page is actually pretty good. It describes the format. For your use case, I imagine that you’d want to set a very short (1-10 second) TTL so that they were quickly invalidated, though you may set something longer and just handle the case that some nodes drop out in your system.
Note that this can be complicated if you want these to be run on consumer connections. If you’re behind a NAT you may need to do some more clever work to figure out your public IP and the port that’s being forwarded (e.g. via a STUN server or UPnP / IGDP). From your post below, it sounds as if you’re just exposing a public server that multiple clients connect to, so DNS SRV records might be a good choice.
If you’re providing a (potentially) different connection endpoint for each spreadsheet, then you can use this as your addressing layer. If you name sheets as something like {sheet name}.{username}.{your domain}, then you’d look up something like _yingw787sheet._tcp.{sheet name}.{username}.{your domain} to get the host IP and port of the server. Or, if you want to give users the ability to transfer ownership of sheets and rename them, you’d use a CNAME entry for {sheet name}.{username}.{your domain} that would then return a domain in the form {UUID}.{your domain}, and then you’d do the SRV lookup on _yingw787sheet._tcp.{UUID}.{your domain}. That also makes it fairly trivial for other people to host instances of the service and even lets people do their own namespace management while using your hosted version.
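Resolving one of those records from a client is a few lines with the dnspython package (the record name below just follows the naming scheme sketched above and is illustrative):

```python
import dns.resolver  # pip install dnspython

answers = dns.resolver.resolve("_yingw787sheet._tcp.mysheet.alice.example.com", "SRV")
for rr in answers:
    # Each SRV record carries priority, weight, port, and target host.
    print(rr.priority, rr.weight, rr.port, rr.target.to_text())
```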
Of course, like most other uses of DNS, this is a horrible abuse and not what DNS was designed for. You may find that free DNS providers start to object when they see that you have 100,000 DNS entries in a single domain with short TTLs and thousands of requests per second.
I’m thinking about using Route 53 as my DNS server, and I don’t think I have SSH access. I think I do have API access to create and destroy records though. If I use DNS-based service discovery (which I know nothing about currently), would it be possible to register an EC2 instance with a master process to DNS upon provisioning, then have the master process issue API calls to DNS for each process? If this stack does work, how might this change if I add in a proxy layer?
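For what it’s worth, a process registering itself through the Route 53 API looks roughly like this with boto3 (a sketch only: the hosted zone ID, names, and port are placeholders):

```python
import boto3

route53 = boto3.client("route53")

def register(hostname, target_host, port, zone_id="ZXXXXXXXXXXXXX"):
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": f"_myservice._tcp.{hostname}",
                    "Type": "SRV",
                    "TTL": 10,
                    # SRV record value: priority weight port target
                    "ResourceRecords": [{"Value": f"0 0 {port} {target_host}"}],
                },
            }]
        },
    )

register("sheet1.alice.example.com", "ec2-1-2-3-4.compute.amazonaws.com", 3000)
```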
This looks interesting. I had heard of Consul before but I didn’t put two and two together. Thanks!
Have you used Consul in production before? How often might it fail, and how often do you need to update it due to a critical need (e.g. security vulnerability)?
Also I think Consul is meant to be natively integrated into Kubernetes…and EKS costs something like $0.10 / hr which is $72 / mo…without factoring in EC2 costs on top of that. My budget is something like $5-10 / mo. for personal projects.
That’s a good question! VPC is free (NAT gateways and Elastic IPs you may have to pay for), EBS is pretty cheap, and ECS uses EC2 instances underneath the hood, so if you pay for reserved or spot pricing it’s less of an issue. You can stick multiple containers on an EC2 instance, but if you need to connect them together you may need something like DNS service discovery for those services to talk to each other, which is a PITA. RDS is the most expensive. Everything else is like $5 / mo., RDS is $20-50 / mo.
However, since RDS is merely a data sink for this stack, you don’t have to have RDS. You can use something like ElephantSQL’s free plan and punch in the PostgreSQL URI into the custom db compute CloudFormation template. Then it’d be $5 / mo. total.
Hmm, correction…I’m using t3.medium instance which is $30 / mo., but you can switch that to t3.nano which would be $3.70 / mo., but you’d have to adjust the “CPU / memory units” that ECS uses to allocate for each container on the VM.
The principle should be the same though, you should be able to downscale and not suffer any other behaviour differences except slowness / capacity issues.
If the problem with postgres RDS is lack of extensions or similar, do https://www.elephantsql.com/ or https://aiven.io/ offer more flexibility so that one could avoid the latency of having a postgres to postgres proxy?
That’s…surprisingly great. Do you know whether the “AWS Marketplace” offerings are purely EC2 costs, or do vendors also charge a flat fee for the AMI? I have Startup School credits through AWS (extremely generous) and that’s why all my infra is AWS; would love to try out Aiven for AWS if the former is true. :P
I tried to understand what exactly this is trying to do, but I don’t understand why you’d have two postgres instances. If I understand correctly, the problem with RDS is that you can’t install custom extensions (but they do supply FDW out of the box). But if you then install a second Postgres (in a regular VM?) and connect it to RDS, isn’t the RDS instance redundant and just sitting there costing you extra money, then?
I mean, if you’re installing your own custom install, why not just use that and be done with it?
The only reason I could think of is that the data storage is still on their side. Yes, your proxy going down is more likely, but proxies are normally easy to operate.
One thing I didn’t realize when I first interacted with numerous databases is how different SQL is between databases. I don’t like relearning how to fetch and shape data over and over again if I don’t have to, so I proxy data sources through Postgres in order to use Postgres’s flavor of SQL and its stored procedure language, PL/pgSQL.
Just using Postgres as a “front end” to a different database strikes me as very inelegant and too much needless overhead. Also, from my experience, FDWs don’t handle data migrations in the source database very well. It’s very easy for the source table and the FDW definition to get out of sync.
Testing FDW-based tables is a right pain in the neck, too, due to how they interact with transactions (unless you do everything through the FDW, and the databases you connect to are defined statically). I wrote a bit about my experiences with FDW on the company blog a while ago, but I never really bothered to update it. We used it mostly to do queries over multiple instances of the same system.
So while I think FDW are very cool and powerful, I’m not sure I’d be very quick to use them myself.
Very interesting! I was planning on using FDWs as a pass-through entity, where dataflow goes from table -> view -> HTTP. In this situation, I wouldn’t be pushing any data onto the source table, and therefore wouldn’t have any issues with data migrations (because there would not be any). For synchronizing foreign table references, I was planning on a stored procedure + batch job to pull in foreign tables every so often and blow existing references away, though that may be tricky if there are any dependencies on those foreign tables. Any transactions would be handled on the source table, in a “push/pull” configuration (OLTP work on the source, OLAP work on the destination). I think for BI / analytics stuff, this should be fine.
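A sketch of that pass-through setup with postgres_fdw (assumptions: psycopg2, superuser access, a reachable source database, and an “orders” table that is purely illustrative):

```python
import psycopg2

ddl = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER IF NOT EXISTS source_db
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'source.example.com', dbname 'app', port '5432');

CREATE USER MAPPING IF NOT EXISTS FOR CURRENT_USER
  SERVER source_db OPTIONS (user 'readonly', password 'changeme');

-- Re-running this block is the "blow existing references away" batch job:
-- it re-imports the remote table definitions from scratch.
DROP SCHEMA IF EXISTS mirrored CASCADE;
CREATE SCHEMA mirrored;
IMPORT FOREIGN SCHEMA public FROM SERVER source_db INTO mirrored;

-- Expose a read-only view for the HTTP layer / BI queries.
CREATE SCHEMA IF NOT EXISTS api;
CREATE OR REPLACE VIEW api.orders AS SELECT * FROM mirrored.orders;
"""

with psycopg2.connect("dbname=warehouse user=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(ddl)
```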
I just felt restricted by what you can do with an ORM. I figured since an ORM supports so many databases, it has to cater to the least common denominator of database features, and I wanted more from Postgres.
It’s good to experiment, so definitely do so! Maybe for your use case the issues I ran into are not even relevant.
I’m not a big fan of ORMs either (they can produce inefficient queries, are often a barrier to getting exactly out of the db what you want because not every query result maps cleanly to “objects”, they can easily accidentally generate many more queries than you would want to perform, and they are a total memory hog), but they don’t have to be the least common denominator. For the query language, usually, yes (and for that, dropping down to the SQL level tends to be painful), but for example Django allows you to use Postgres extensions like arrays, JSON fields, partial indexes, and even Postgis fields and spatial operators. Only thing I kind of miss are proper enum fields.
I was amused by the automatic conversion of inline graphviz code to a graph with JavaScript (try disabling JS (or enabling it if you have it disabled)). Pretty nifty!
Thanks! All credit goes to Munif Tanjim (creator of Minimo), Adam Charytoniuk (creator of Engimo), and @spf13 / maintainers of Hugo for their hard work and dedication!
This was a great read! I’ve been looking at PostGraphile but wanted something that was just regular REST. Now I know there is PostgREST!
But I have been thinking more and more about how critical knowing databases is for being able to build new things, especially when you don’t have the money or the people to build abstractions around databases. But, as mentioned in the article, it does require time, and that is precious.
There are a few tools like PostgREST out there. Supabase, Hasura, XGeneCloud, Prisma. You can also generate a REST API from OpenAPI specs with OpenAPI Generator. I’ve been working on a similar tool that can generate a REST API and admin UI from Postgres or MySQL.
FWIW I’m writing out Part 2 of this blog post where I ship actual code on AWS (hopefully publishing tomorrow; repo here: https://github.com/yingw787/postgres_as_app), and it’s shockingly easy to get started with PostgREST. It’s: untar one binary, copy in a conf file, and go. The robustness of Haskell with the simplicity of Go. Request-level options, like pagination, are handled via HTTP headers. It’s definitely a tool I want to incorporate into my future apps, because it’s stupid simple and transparent.
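To give a flavour of the header-driven behaviour (a sketch with the requests library; the endpoint and table name are placeholders): row limits go in Range headers, and Prefer: count=exact asks PostgREST to report the total in Content-Range.

```python
import requests

resp = requests.get(
    "http://localhost:3000/sheets",
    headers={
        "Range-Unit": "items",
        "Range": "0-9",              # first ten rows
        "Prefer": "count=exact",     # total row count comes back in Content-Range
    },
)
print(resp.status_code, resp.headers.get("Content-Range"))
print(resp.json())
```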
One thing I do every morning is take a picture of myself in the mirror, with my made bed in the background, and send it to Boss as a Service, with the subject line “I live in the best of all possible worlds. Everything I do today, I choose to do”. I think it really helps me start the day well and to explicitly place myself in a well of positive thinking no matter the actual circumstances.
Start each day remembering the good stuff you experienced, the good stuff you have, and the good stuff you might still do or experience. On the last one, include some things you can do or experience in quarantine. Then try to take action on those each day. Then, they eventually join the list of good things you’ve done or experienced to motivate you towards the next set.
This is part of my morning routine. How well it goes varies day by day. It helps most days, though. Hope it helps you. :)
This looks pretty great! I’m not in the embedded Rust space, and therefore I’m probably not going to buy a ticket, but more conferences should look at providing quality remote-friendly and remote-first services, and I’m glad you guys are making the commitment!
I would also recommend that people who aren’t used to remote work install a browser extension or otherwise restrict the time they spend on some websites.
Being alone at home (the place where you usually give yourself a good rest) makes it more tempting to spend more time than usual on websites such as YouTube or Lobsters.
I use Ubuntu on my laptop, and I have a multi-layered approach: /etc/hosts to block out a number of websites, and a RescueTime subscription to redirect sites if I happen to unblock something locally (because I have sudo access).
I use links2 as a CLI-based browser. It’s surprisingly great and I learned a good deal about what’s important in an HTML document and how I want to write front-ends. I’ve also looked at Midori, but if I’m going with a GUI I still like Firefox.
I use Chrome for WebRTC-related tasks like video chats with friends and professional contacts. I’m definitely not a FOSS purist, but I try to rely on free software whenever I can.
I come to Lobste.rs for the genuine tech conversations that I find harder to get on other sites like Hacker News. Beyond Hacker News and Lobste.rs, other sites are kind of just “here’s how to create a basic REST API” or some fluffy crap like that, and I don’t bother going there. I would be really sad if I didn’t have you guys to talk to because we didn’t take care of our community.
I really like how the two people I’ve invited to Lobste.rs prefaced their ask for a Lobste.rs invite with some discussions around some of the blog posts I submitted here to confirm their interest and their ability to knowledgeably add to the discussion.
I think it’s so great that Lobste.rs is valuable now (as shown / described). It’s important we understand we hold a lever in extending invites, and we should learn to say no when we’re uncomfortable or unsure. I personally find it hard (which is why I’m mentioning it), but I find the alternative of losing this community scarier.
It is always super weird to see someone quote me. Really cool! But also super weird. I’m not fully adjusted to that yet ;)
If you find Haskell’s type system interesting, I’d also recommend checking out an ML! The module system in ML is really cool, powerful, and unique. I’m shocked more languages haven’t adopted similar module systems.
Also, a couple of quick pedantic nitpicks:
I was reading through some of Hillel Wayne’s blog posts, and one of them discussed the Curry-Howard correspondence, which somehow proves a 1:1 association between aspects of a mathematical proof and aspects of a type system. You can convert a proof into a type, if your type system supports it. I think Haskell’s type system respects this. I don’t think Python’s type system does.
I don’t think I explained the CHC very well in that. All sound static type systems are “proofs”, where what you are proving is “this is well-typed”. This is true for both Haskell and Python (mypy). But different languages can encode different consequences of being well-typed, like if you can express the type of sorted lists, you can say “I’ve proven this is well-typed ⇒ this sort function correctly sorts the list.” Haskell’s type system is more powerful than most other commercial languages, but it’s not powerful enough to encode arbitrary theorems. For that you need dependent typing a la Idris or Coq.
Also proving type-safety for arbitrary theorems is really, really hard. Tradeoffs!
Python has hypothesis, but having used it a tiny bit for an open-source contribution, I don’t think it’s the same without the Haskell type system.
Gotta defend hypothesis a bit here! It sounds like you were using quickcheck in the context of learning from a book, while you were using hypothesis for a real-world project. In that context quickcheck is gonna seem more elegant. But Hypothesis is world-class and a lot of more modern PBT libraries follow its example. It’s a lot better at generating inputs like “two dictionaries where the values of the first dict are a superset of the keys in the second.”
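For instance, that exact “two related dictionaries” input is a few lines with a composite strategy (my own sketch, not from the parent comment):

```python
from hypothesis import given, strategies as st

@st.composite
def related_dicts(draw):
    # First dict's values form a superset of the second dict's keys.
    first = draw(st.dictionaries(st.text(), st.integers(), min_size=1))
    candidate_keys = sorted(set(first.values()))
    second_keys = draw(st.lists(st.sampled_from(candidate_keys), unique=True))
    second = {key: draw(st.text()) for key in second_keys}
    return first, second

@given(related_dicts())
def test_values_cover_keys(pair):
    first, second = pair
    assert set(second).issubset(set(first.values()))
```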
OCaml seems really cool, and I’ve heard it used in industry; is it the most popular ML for practitioners? I also know of ReasonML used by Facebook to create Facebook Messenger. But I think there’s still so much to learn about Haskell, and between going broad and learning about type systems, I’d want to go deep into Haskell and become intermediate, or even advanced if I’m fortunate enough to have the time and resources to do so!
Hmm, I didn’t know that Idris or Coq would be powerful enough to encode arbitrary theorems. I think if I were to study more into type theory (an interesting subject), I’d try to learn those!
Yeah, I think Corbin mentioned that Hedgehog was closer to hypothesis than QuickCheck was; I think I was too quick in dismissing hypothesis. I do like the notion of using generators over types, and the integrated shrinking (when I wrote this I assumed that QuickCheck had integrated shrinking; I think that’s actually a Hedgehog or maybe hypothesis innovation). I definitely like your blog post point about creating integration tests from a combination of hypothesis and pycontracts, I’ll give that a shot for my next project :D
QuickCheck does shrinking. The Arbitrary typeclass has a shrink :: a -> [a] method which should return a list of smaller versions of its argument.
If you don’t want to create an instance of Arbitrary, you can use the function forAllShrink. Its type is a more polymorphic version of (RandomSeed -> a) -> (a -> [a]) -> (a -> Bool) -> Bool: the first argument is a generating function, the second is a shrinking function, and the third is the property to check.
There’s a text strategy: https://hypothesis.readthedocs.io/en/latest/data.html#hypothesis.strategies.text
Also perhaps look into dbus: https://news.ycombinator.com/item?id=9451023
I think something like https://www.consul.io/ will fit what you’re looking for.
Do they offer extensions / root access? I thought they were managed solutions.
Aiven supports 47 PostgreSQL extensions, including pg_cron, postgis, postgres_fdw and timescaledb.
I searched ElephantSQL’s docs for “Extension” but got no meaningful hits. Looks like they do not support extensions.
It doesn’t have to be Postgres as the data source; you can use MySQL, Oracle, Google Sheets, GPU VRAM, or your smart-home light bulbs, or a combination of these: https://wiki.postgresql.org/wiki/Foreign_data_wrappers
You can also write your own FDWs using Multicorn, a Python library: https://multicorn.org/
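A bare-bones Multicorn wrapper looks roughly like this (the hard-coded rows stand in for whatever the real data source would be; the class then gets wired up inside Postgres with CREATE EXTENSION multicorn and a CREATE SERVER ... OPTIONS (wrapper '...') that points at it):

```python
from multicorn import ForeignDataWrapper

class ToyFDW(ForeignDataWrapper):
    def __init__(self, options, columns):
        super().__init__(options, columns)
        self.columns = columns

    def execute(self, quals, columns):
        # Yield one dict per row, keyed by column name.
        for row in [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]:
            yield {col: row.get(col) for col in columns}
```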
Do any of these tools intercept the query and rewrite it, possibly joining across multiple SQL backends? Of those that do, do they do query push down?
Stardog’s virtual graphs can do that.
Not that I’m aware of.
I think Hasura can stitch GraphQL queries. Or I guess they call it remote joins: https://hasura.io/blog/remote-joins-a-graphql-api-to-join-database-and-other-data-sources/
This is a trial run, but I know of at least one other Rust conference that is currently considering at least a remote component.
Continuing to do some work on TinyDevCRM (or “tinydev”): https://bytes.yingw787.com/posts/2020/02/08/tinydevcrm_1/ (Can’t publish a new post due to new lobste.rs rules, which is all good with me!)
Hopefully I can get to the MVP stage by the end of next week; how likely that is, I’m not sure.