This really is a horrifying outage - session tokens were being returned to the wrong users, allowing them to access each other’s data. And it was caused by a Terraform change. Whose unit tests are catching that?
No unit test will catch someone doing something inherently wrong (I can imagine no situation where caching a Set-Cookie header would make sense) while touching auth.
The bigger problem would be that the Ops people would need to have some sort of knowledge about webdev, so I’d say this would maybe be mitigated by cross-functional teams and someone (even just accidentally) seeing the commit, or even reviewing it.
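That said, the application side can at least assert its own intent. Here is a minimal sketch in Python (pytest style; the `client` fixture, the `/login` route, and the credentials are assumptions, not details from the incident): any response that hands out a session cookie must also declare itself non-cacheable, so a shared cache in front of the app cannot legally store it.

```python
# Minimal sketch, assuming a pytest-style `client` fixture and a /login route
# (both hypothetical). HTTP semantics: a shared cache must not store responses
# marked Cache-Control: no-store (or private), so this guards the app boundary
# against the whole class of "auth responses became cacheable" mistakes.
def test_login_response_forbids_shared_caching(client):
    resp = client.post("/login", data={"user": "alice", "password": "secret"})
    assert "Set-Cookie" in resp.headers  # it really is handing out a session
    cache_control = resp.headers.get("Cache-Control", "").lower()
    assert "no-store" in cache_control or "private" in cache_control
```

It wouldn’t catch a CDN configured to ignore those headers, but it documents the contract the cache layer is supposed to honour.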
No unit test, but good integration tests could have. Especially ones that do pathological things like click your submit/login/do stuff button really fast, and also from different sessions. Sadly, very few places have this level of testing; I’ve never seen it myself.
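For what it’s worth, a minimal sketch of that kind of test in Python, assuming a hypothetical deployed environment with `/login` and `/me` endpoints and a JSON response shape I made up: many sessions log in concurrently, and each must only ever see its own account.

```python
import concurrent.futures
import requests

BASE = "https://staging.example.com"  # assumption: some deployed test environment

def login_and_fetch_identity(user, password):
    s = requests.Session()
    s.post(f"{BASE}/login", data={"user": user, "password": password})
    # assumption: /me returns JSON like {"user": "alice"} for the current session
    return user, s.get(f"{BASE}/me").json()["user"]

def test_concurrent_logins_never_leak_sessions():
    users = [("alice", "pw-a"), ("bob", "pw-b")] * 50  # hammer the race repeatedly
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        for expected, actual in pool.map(lambda u: login_and_fetch_identity(*u), users):
            assert actual == expected, f"session for {expected} saw {actual}'s data"
```

It only makes the race likely rather than certain, which is exactly the flakiness trade-off discussed below.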
IME, the trouble with “this level of testing” is that it’s very hard to have really comprehensive tests that test system-wide corner case behaviors like this AND are not full of non-deterministic false failures. If you commit to that level of testing, you’ll reach a point where you’ll have to rerun the tests 100 times before you manage to observe a run when all of them pass. The end result is that your engineers will be spending an obscene amount of time and effort fighting with these tests and for every bug that these tests catch (once every 4 months, and that is if people didn’t get used to ignoring the failures), you’ll end up spending thousands of programmer hours fighting with the tests. Time that those programmers could’ve used for catching and fixing orders of magnitude more bugs.
That just points to the need for dedicated people who just write tests. It’s a different skill, and proper tests like that can be built in a way that they aren’t likely to cause false alarms, but it’s fairly difficult and sometimes non-obvious. I’ve been seeing a slow death of testing as a part of a software production pipeline, being absorbed into programming, when IMO these parts should be separate.
proper tests like that can be built in a way that they aren’t likely to cause false alarms, but it’s fairly difficult and sometimes non-obvious
I agree with this very much and it’s difficult enough that you usually can’t expect the average programmer (that you will encounter in your company) to write tests like this and you will almost certainly not encounter tests like this when you join a new project. An organizational issue that often makes this problem even worse is that if you have dedicated test engineers, they will typically get a lower pay, so your prospects of keeping a programmer with the required caliber for writing race-free deterministic tests gets even worse. And reconsider the kind of testing GP has brought up:
Especially ones that do pathological things like click your submit/login/do stuff button really fast, and also from different sessions.
Writing a test like this in a race-free manner isn’t something a dedicated test engineer can do without changing the application code and the deployment infrastructure. You typically need to turn the code inside out to expose the time-dependent aspects in a way that you can interact with in a controlled fashion. E.g., if two parallel DB queries are racing with each other and your test needs to target a particular order, you need to be able to plug in a mock database, or at least some sort of middleware or abstraction layer between the DB and the application, to arrange that order in your test. BTW, I’m not at all suggesting that’s how software should be written. I’d much rather spend my time on making sure that this complexity doesn’t arise in the first place, instead of letting it arise and then dancing around it with my tests. But left to the average programmer, the complexity will arise, and if you have dedicated test engineers in a separate team, they will have to deal with it.
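To make that concrete, a rough Python sketch of what such an injectable layer could look like (all names illustrative, not from any real codebase): the application talks to the database through a wrapper, and the test substitutes one that holds a tagged query back until the test decides which side of the race wins.

```python
import threading

class Database:
    """Production implementation would call the real driver here."""
    def query(self, sql, *params):
        ...

class OrderedDatabase(Database):
    """Test double: holds back any query matching `gate_on_sql` until released,
    so a test can force one specific interleaving of two racing requests."""
    def __init__(self, inner, gate_on_sql):
        self.inner = inner
        self.gate_on_sql = gate_on_sql
        self.gate = threading.Event()

    def query(self, sql, *params):
        if self.gate_on_sql in sql:
            self.gate.wait()  # park this query until the test calls gate.set()
        return self.inner.query(sql, *params)

# In a test: start request A and request B against an app wired up with an
# OrderedDatabase, then call db.gate.set() to decide which query wins the race,
# and assert on the resulting session/state.
```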
What’s the point of tests that don’t prevent someone from doing something wrong?