I’m a fan of separating your operational and analytics data on large databases. A lot of times, the operational system only needs a subset of the overall data, usually based on some time frame, e.g., only the last two years’ worth. As data comes in, it’s replicated to the analytics DB, and then a scheduled job truncates the out-of-date data in the operational DB.
By splitting it out, we can be better about keeping the operational DB small, which mitigates a lot of the issues described here. We still get to query the entire data set in the analytics DB, and we’re no longer worried about long-running analytics queries taking down production.
Replication is easy if the analytics DB is 100% identical (and still worthwhile; some analytics queries can be super heavy), but are there any good tools for replicating while ignoring deletions caused by such scheduled truncations?
That really depends on the adapter you’re using. A lot of times what you get is a deleted flag: the deleted data is still there, but marked with something like ADAPTER_DELETED = TRUE|FALSE. That way you keep all of your data and can query against the full history, knowing whether each row still exists on the operational side.
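As a concrete sketch of how that plays out on the analytics side (the table and column names are hypothetical, and I’m using SQLite in Python purely for illustration):

    import sqlite3

    # Hypothetical analytics copy of an operational table. The replication
    # adapter marks upstream deletions instead of removing rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            total REAL,
            created_at TEXT,
            adapter_deleted INTEGER DEFAULT 0  -- soft-delete marker set by the adapter
        )
    """)
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [
            (1, 10.0, "2018-03-01", 1),  # truncated from the operational DB, kept here
            (2, 25.0, "2021-06-01", 0),
        ],
    )

    # Full history, including rows already deleted upstream:
    print(conn.execute("SELECT COUNT(*), SUM(total) FROM orders").fetchone())          # (2, 35.0)

    # Only rows that still exist in the operational DB:
    print(conn.execute(
        "SELECT COUNT(*), SUM(total) FROM orders WHERE adapter_deleted = 0"
    ).fetchone())                                                                       # (1, 25.0)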
Many enterprise databases come with adjacent (but sometimes separately priced) replication tools (e.g., [1], [2]).
Most of those replication tools are based on redo logs (rather than triggers), so they preserve the notion of an ACID transaction while replicating. At the same time, they often allow very sophisticated filtering to avoid replicating data that isn’t needed.
Replication into an analytics database vs. ETL into an analytics database is not a simple decision.
Normally, operational databases need schemas or document-oriented structures that perform better for updates, while analytics databases need to be organized for reads (hence denormalization, column-oriented storage, and so on).
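To make that row-vs-column distinction concrete, here’s a tiny illustration (in Python, with made-up records) of the same data laid out both ways:

    # The same three records, laid out row-oriented vs column-oriented.
    rows = [
        {"id": 1, "amount": 10.0, "region": "EU"},
        {"id": 2, "amount": 25.0, "region": "US"},
        {"id": 3, "amount": 7.5,  "region": "EU"},
    ]
    columns = {
        "id":     [1, 2, 3],
        "amount": [10.0, 25.0, 7.5],
        "region": ["EU", "US", "EU"],
    }

    # An operational update touches one self-contained record:
    rows[1]["amount"] = 30.0

    # An analytic aggregate scans one tightly packed column:
    total = sum(columns["amount"])
    print(total)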
Some vendors (e.g., Oracle, SAP HANA) have implemented complex optimizations (relying on FPGAs and vector-oriented CPU instructions) to achieve what they claim is a hybrid row+column-oriented storage. All that effort is justified so that ETL from operations into analytics can be avoided (ETL is a source of huge expense and a huge number of data quality issues) and operational-DB-oriented schemas can instead be used for analytics.
There are also approaches, of course, that sit on ‘top’ of the databases, helped by custom application layers (these are mostly described in lambda or kappa architectures).
Examples of database replication tools that allow filtering during replication:
Fivetran is one example I know for sure that does what I mentioned. It costs $, but I’m sure there are more out there.
Rails has a lot of advantages, especially for smaller teams. It’s an incredibly powerful and fast way to stand up a web app.
However, in my experience, once the tech department grows you can definitely experience pain: callback hell, return-type mismatches, etc.
These problems seem to really arise when the app gets too big for individuals to keep in their head.
That said, it’s still the first framework I go to for side projects that are web apps.
Your woes make you the exact target audience for his new book, Sustainable Web Development with Ruby on Rails. No, I am not affiliated, but I did work with him at Stitch Fix, and he knows what he’s talking about and has done it. I haven’t read the book, though I want to, and I feel I’ve already been part of this approach succeeding.
(For anyone who is confused, “he” and “him” in the parent comment refers to the author of this blog post, who also wrote that book.)
That book looks pretty useful. The 50-page sample PDF is worth reading even if you can’t afford the whole book. From the sample chapter “Business Logic (Does Not Go in Active Records)”, I learned how and why to create a service layer in a Rails app. And from the sample chapter “Jobs”, I learned that Active Job may not be worth its complexity in apps (as opposed to in libraries), and learned of some bugs to avoid when writing jobs: actions may happen twice due to retries after partial failure, and relevant code may be skipped due to data changes between job retries.
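That “actions may happen twice” point is worth a sketch. The usual defence is an idempotency guard; here’s the general shape (written in Python purely for illustration, every name is made up, and this isn’t taken from the book):

    # A job that may be retried after a partial failure checks for its own
    # side effect before performing it again.
    class InMemoryPayments:
        def __init__(self):
            self.charged = set()

        def already_charged(self, order_id):
            return order_id in self.charged

        def charge(self, order_id, amount):
            self.charged.add(order_id)
            print(f"charged order {order_id}: {amount}")

    def charge_order_job(order_id, amount, payments):
        if payments.already_charged(order_id):
            return  # a retry becomes a no-op instead of a double charge
        payments.charge(order_id, amount)

    payments = InMemoryPayments()
    charge_order_job(42, 19.99, payments)
    charge_order_job(42, 19.99, payments)  # simulated retry: nothing happens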
We have a lot of stuff going on both Saturday and Sunday, but I’d like to carve out some time to do a write-up on using Little’s Law to estimate scaling requirements, with some code examples to show how it applies.
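In the meantime, the law itself fits in one line: L = λW, i.e., the average number of requests in flight equals the arrival rate times the average time each request spends in the system. A rough sketch of how that applies to sizing (the numbers are made up):

    def in_flight_requests(arrival_rate_per_s, avg_latency_s):
        """Little's Law: L = lambda * W."""
        return arrival_rate_per_s * avg_latency_s

    # e.g. 500 req/s at an average of 200 ms per request
    L = in_flight_requests(500, 0.200)   # 100 requests in flight on average

    # If each app server comfortably handles 25 concurrent requests,
    # you'd want roughly this many servers before adding headroom:
    servers_needed = L / 25              # 4.0
    print(L, servers_needed)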
Non-paywalled version from dev.to: https://dev.to/martinheinz/it-s-time-to-say-goodbye-to-docker-386h
XPath is a funny little technology. Along with XML Schema and XML itself, it comes from an era when JSON didn’t yet exist, web APIs were new, and we thought all web pages were going to become “valid” XML documents via XHTML. Funny, because, even though JSON largely replaced XML, it turned out that XPath and XML Schema were mostly YAGNI technologies, and so we hardly use alternatives to them when we work with JSON. And, nowadays, the idea of an HTML page being valid structured data against some schema seems quaint – what with the focus on “does it render across browsers?”, “is it responsive?”, “Is it fast?”, etc.
And don’t forget XSLT! ;-) Dredging up these old technologies can feel like wandering into an alternate reality.
XPath is indispensable for quick-and-dirty web-scraping, and XSLT is, well, something—there isn’t really any direct replacement for it! I find myself writing XSLT about once a year. Most recently: rendering ALTO files as SVG overlays on scanned newspapers (to make the text selectable) completely on the client-side!
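For the quick-and-dirty scraping case, here’s a small sketch in Python with lxml (assuming it’s installed; the markup and expressions are made up):

    import lxml.html

    html = """
    <html><body>
      <div class="article"><a href="/posts/1">First post</a></div>
      <div class="article"><a href="/posts/2">Second post</a></div>
    </body></html>
    """

    doc = lxml.html.fromstring(html)

    # Two XPath expressions pull out the links and their text:
    hrefs  = doc.xpath('//div[@class="article"]/a/@href')
    titles = doc.xpath('//div[@class="article"]/a/text()')

    print(list(zip(hrefs, titles)))
    # [('/posts/1', 'First post'), ('/posts/2', 'Second post')]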
XSLT is interesting for sure. I used it on a job for about 4 years. It was great for converting source documents into the specific format our system knew how to ingest.
I haven’t written any since I left that project and can’t say I miss it, even though it was really effective for our use case.
Hi5 XSLT buddy!
I used it early in my career for more years than I’d like to remember, but it was on point for that specific employer: it was a publisher that converted XML into a whole load of other formats and also added metadata.
The ‘crown’ of my work was converting the XML into RTF.
Ha ha ha! Yeah, it’s a good tool for those jobs. Sounds like we did the opposites. I took in lots of different XML sources and converted them all into our one XML format.
And, nowadays, the idea of an HTML page being valid structured data against some schema seems quaint
I don’t think it’s surprising or necessarily bad that web pages aren’t, but the state of schema validation for JSON that does represent structured data makes me kinda sad. JSON Schema seems to have enough momentum that I’m not worried about its future, but the validator ecosystem doesn’t feel particularly healthy… and also, it’s just not as capable as XML Schema.
As for the rest of the XML tooling, it still gets used in some areas, and for me personally it’s a massive relief when I need to analyse data from some random proprietary thing and it turns out to be backed by XML, because the tooling that lets me write queries against it is just there. Despite how popular JSON is, it has never really gotten there (jq is good but limited by design; JSONiq… exists, I guess).
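For anyone who hasn’t used it, here’s a minimal sketch of the basic flow with the Python jsonschema package (assuming it’s installed; the schema is made up):

    from jsonschema import ValidationError, validate

    schema = {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "email": {"type": "string"},
        },
        "required": ["id", "email"],
        "additionalProperties": False,
    }

    validate(instance={"id": 1, "email": "a@example.com"}, schema=schema)  # passes

    try:
        validate(instance={"id": "not-an-int"}, schema=schema)  # wrong type, missing email
    except ValidationError as err:
        print(err.message)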
$work: I’m defining some API contracts between our team’s service and some other internal services we’re going to be using.
$personal: I want to finish up reading Effective C and start on a small side project. I want to write it in Go, but I’m not too sure about the authentication/sessions story for Go and webapps yet.
I really like how this describes the complexity of implementing infinite scroll. I hope it gives teams thinking of implementing it pause and reduces the number of apps that feel like they need it.
I hate infinite scroll. Sure, it’s convenient, but at the cost of natural breakpoints, and it perpetuates the screen-time suck. It’s not particularly hard or annoying to hit a “next” button, and that gives your mind a quick break to evaluate whether you really want to move on to the next set.
Infinite scroll is merely there to keep your attention glued to your device, looking for that next infinitesimally small dopamine hit, while hiding how deep into the trash you’ve waded.
I personally never found the pop psychology very convincing. Partly because it may not even be based on an accurate model of how dopamine works, but mostly because I’m not convinced that fixed-size pagination is the best way to provide what you’re looking for.
If Twitter had explicit pages, do you think it would be less of a time-suck? I actually think the answer’s no. What they would have done instead is make the pages as small as they could get away with, like one tweet per page, so that the reader would be forced into a rhythm of clicking “Next” even if they only wanted to spend a few minutes checking what’s new. And since they’d also show no page numbers, you’d get no on-site indicator of your time spent.
It’s clearly not the only factor. But I also don’t think it’s correct to say infinite scroll isn’t part of the problem just because they would have used another tactic. Either tactic accomplishes the same thing: a whole lot of design and strategy aimed at sucking as much of your time as they can.
A good argument for hosting your own email is Sieve support (or any filtering language), which, ironically, is a protocol/format aimed at making filtering portable across vendors.
But nearly all paid internet services are built like walled gardens, like everything else you can get without investing your own time (gratis but non-free services, paid services, and so on).
There are loads of things that mail server software can do that most providers don’t expose. For example, Postfix can use a different authenticated relay depending on the pair of From address and authenticated user. If you send email from a lot of different devices, it’s annoying to have to configure a bunch of different accounts on each one. It’s easy to forward incoming email to a single account, but I’ve yet to see a provider that makes it easy to forward outgoing email so that it automatically goes through the correct relay.
I’ve got mixed feelings about this. It’s yet another syntax for Ruby, which is already fairly complicated, and it comes very close to duplicating pre-existing functionality in the form of Struct and OStruct, neither of which I actually like using because there’s no built-in immutability. I don’t know that this proposal is HARMFUL, but neither do I know that it’s a benefit. So, “meh”.
I like using Structs in Ruby, especially as a form of documentation and, as the author said, ensuring that a key passed in doesn’t silently fail. It’s much easier to see the options available for a bit of code where the struct is defined in one place instead of hunting for accessors of a hash.
That being said, I don’t like this syntax at all. It just adds yet another way of doing something when, I feel, the current way is sufficient and easy enough already.
I’ve been using dry-struct for this reason, in concert with dry-types, and I really like it.
I agree with what they’re saying about the two different camps of development. A lot of developers I know, including myself, tend to have interests that align with one or the other.
However, the conclusion here should be applied to both camps. There is an ever-growing mountain of resources to help application (“plumber”) developers figure out and grow in their career path. There seem to be far fewer resources for tooling and systems-level developers, though.
This article seems to shrug that off as something a CS degree gets you, but that is far from a guarantee you’ll be able to land a job working on tools, complex systems, or systems-level development.
On that note, any resources on the latter are appreciated.
I am definitely more of the plumber type, but here are some things that may be of interest to you:
I’ve had a couple of pretty bad technical interviews from technical people. In those cases the interviewer wasn’t asking for a solution to the problem, but instead the exact solution that they had in their mind.
I know someone who flunked a coding interview at Twitter because the graph-theory problem they used had two standard solutions (depending on what tradeoffs you cared about), the interviewer only knew one, she provided the other, and he just failed her without even checking whether it worked.
!$ has been my favorite shell “command” lately. It inserts the last argument of the previous command. So, if I blame a file:
git blame this-file
And then I want to edit that file:
vim !$
It’s a decent time saver for me.
At home, I go back and forth between wanting to work on some side projects I have queued up and wanting to focus on anything but computers. So, it’s a toss-up there.
At work, I’m on point for handling this week’s unplanned work. So, trying to chew through that small backlog that’s there and quickly resolve anything that pops up.
This is a really interesting way to set up your Airflow DAGs. We broke ours out into a couple of Airflow instances because one had grown so large, though they also had a logical separation based on what they were processing.
Are you pulling directly into the current DAG directory, or are you pulling into a separate dir that you cut over to? IIRC, you have to signal Airflow to reload the DAGs.
We pull directly into the current DAG directory. Airflow now automatically detects DAG changes and reloads them on the fly, so we didn’t have to do anything special.
It’s mostly used as an ETL system, in my experience, and is more akin to systems like Luigi.
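For anyone unfamiliar, the DAG files Airflow picks up from its DAG directory look roughly like this (an Airflow 2.x-style sketch; the dag_id and commands are made up):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="example_etl",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        transform = BashOperator(task_id="transform", bash_command="echo transform")
        load = BashOperator(task_id="load", bash_command="echo load")

        extract >> transform >> load  # task dependencies, the Luigi-like ETL part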
if you’re not on-call for your code, who is?
While each developer needs to have ownership over the code they write, the team also needs to have ownership over their feature sets. You don’t want your teams to be isolated engineers fighting their own battles. You want your teams to act as teams and fight battles together. Especially in production, your team’s main objective should be solving the problem, not finding someone to pass the issue off to. Obviously, you should pull in the people with the most domain knowledge to solve the problem as fast as possible, but you never want to get caught casting blame and passing the buck.
QA Gates Make Quality Worse
I have worked in two distinct QA structures. One of them seems to be what the author is describing: a dedicated QA team that catches code tossed over the wall. This was a miserable way to work and did cause a lot of the conflict described. But I’ve also worked on teams with embedded QEs, and that produced far better results. The QE assigned to the team was knowledgeable about the features being produced and worked closely with us during planning to determine what the testing scenarios should be. That let them contribute insight and scenarios the engineer might not have written tests for. It also helped standardize a lot of testing procedures within the teams for each product.
Boring Technology is Great
I love this advice. Unless there is a business need for new and shiny, stick with what you know. It reduces friction, allowing products to be delivered faster and with higher quality.
Things Will Always Break
There’s a real tension here. You have to accept some risk when deploying new code, and it’s inevitable that you will deploy bugs to production. That doesn’t mean you shouldn’t analyze the risks and make sure you’re mitigating the ones that could have the greatest negative impact. This is especially true when you’re dealing with systems that have an effect on physical objects. If you deploy a high-risk change set that requires physical labor and manual intervention to correct, the cost to the company or its clients is far greater than for one that can be corrected by just another deploy. These costs need to be taken into account. There are times when you have no choice but to deploy a high-risk change set, but you should have done your legwork already, identified it as a risk, and be ready with a plan in case it goes wrong. And if you’re able to reasonably mitigate the risk ahead of time, you should do so.
This is a research paper that studied what happened to social interaction when companies moved to an open office floor plan. Face-to-face interaction actually decreased significantly in the cases they studied, and the change pushed people toward digital communication instead:
https://royalsocietypublishing.org/doi/full/10.1098/rstb.2017.0239
I work in a setting similar to the one you’ve described. I don’t think many people have in-person conversations. However, there are a lot more barriers to including remote people in meetings: the teleconference needs to be on the calendar, you need to book a room, and sometimes screen sharing or video doesn’t work properly. These definitely make it challenging to include remote teammates.
Here’s the referenced article from Martin Fowler:
During the days I’ll mostly be enjoying time with the family.
Other than that, I want to get the “gym wall” in our garage painted and get a TV off its stand and mounted, so the 1-year-old will quit trying to climb the stand.
Also, I’ve been reading Database Internals and will hopefully make some good progress on that, too.