I’ve been a member of the ACM for the past twenty years. I’m not currently a member of any SIGs, but I have been a member of SIGPLAN, SIGSIM, and SIGWEB. When I was a student I signed up for multiple SIGs so I could read the proceedings at a reduced rate. With changes to ACM’s publication policy and the advancement of their digital library, I’m not sure the publication access still justifies the cost of membership.
Membership within a SIG provides some benefits if you are interested in going to that community’s conferences or publishing in their journals. It is a signal that you are interested in joining and sustaining that community. But if you are just idly interested in the area, I don’t see the value.
At my last employer, a health check was allowed to fail (explicitly returning false) during initialization. Once a service had served a positive health check, normal practice and policy was to restart the service/container if it subsequently returned negative health checks. This practice was quite reliable.
We also had a concept of ‘deep’ health checks, which reported on the status of dependencies. This was convenient because we could leverage the same monitoring and alerting infrastructure, but deep health check failures did not trigger container restarts or change load balancer policies.
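To make the split concrete, here is a minimal sketch of the two kinds of endpoints (Flask, the route names, and the check_* helpers are my own illustrative choices, not the actual setup described above):

from flask import Flask, jsonify

app = Flask(__name__)
service_ready = True  # set to False during initialization or when a restart would help


@app.route("/health")
def health():
    # Shallow check: reports only whether this process itself is healthy.
    # The orchestrator and load balancer act on this one.
    return jsonify({"ok": service_ready}), (200 if service_ready else 503)


@app.route("/health/deep")
def deep_health():
    # Deep check: reports dependency status for monitoring and alerting only.
    # Always returns 200 and is never wired into restarts or load balancing.
    return jsonify({"database": check_database(), "cache": check_cache()}), 200


def check_database():
    return True  # placeholder: e.g. run a SELECT 1 against the connection pool


def check_cache():
    return True  # placeholder: e.g. PING the cache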
It’s a good idea to separate the deep healthchecks like that. I guess if it’s not directly affecting regular health it can be used as an observability metric.
The problem is that if, say, your application literally cannot do anything without a DB connection, and a particular machine (or pod, or whatever) has a missing/failing DB connection, you don’t want traffic going to it because the only possible outcome of it is an HTTP 500. You want all the alarms going off and you want something to try restarting the server/pod/whatever – if the DB really is down, well, your whole site was going to be down no matter what and you can always manually toggle off the restarts for the duration of the outage. And if it’s only down for that one server/pod/whatever, you want a restart to happen quickly because a restart is the cheapest and easiest way to try to fix a bad server/pod.
I think it’s critical to treat this complex problem in layers; e.g.,
An error report that a component generated an exception and had to fail a request, or that a health check failed; one or two of these might be interesting but don’t necessarily mean the component is hard down, while a large number could represent a serious fault.
The diagnosis that a component is broken; e.g., an instance is generating exceptions for more than 10% of requests, or a health check failed twice consecutively. This is as much a policy decision as anything else, and is quite hard to definitively get right! Diagnoses often need to encompass telemetry from multiple layers, so that a fault can correctly be pinned on a backend database that’s down rather than just everything, for example.
Corrective action; e.g., restarting something or replacing a component with a new one. You don’t always want to leap straight to restarting stuff; you may want to restart at most N instances in an hour to avoid making things worse by constantly restarting (a rough sketch of this kind of policy follows below). If it’s a RAID array, for example, you might offline a busted disk that’s making everything slow, but obviously you can’t offline more disks than you have parity stripes without data loss.
There are other layers as well, like how to report or alarm on errors, diagnoses, and corrective actions; how to tie all this in to how you do deployments; how to express policy; etc.
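A rough sketch of how the diagnosis and corrective-action layers might be separated in code (the 10% threshold, the two-consecutive-failures rule, and the hourly restart budget are made-up policy numbers, not a recommendation):

import time
from collections import deque


class InstanceMonitor:
    def __init__(self, max_restarts_per_hour=3):
        self.recent_requests = deque(maxlen=1000)   # True = errored, False = ok
        self.consecutive_health_failures = 0
        self.restart_times = deque()
        self.max_restarts_per_hour = max_restarts_per_hour

    def record_request(self, errored):
        self.recent_requests.append(errored)

    def record_health_check(self, passed):
        self.consecutive_health_failures = 0 if passed else self.consecutive_health_failures + 1

    def diagnose_broken(self):
        # Diagnosis layer: broken if >10% of recent requests errored,
        # or the health check failed twice in a row.
        error_rate = sum(self.recent_requests) / max(len(self.recent_requests), 1)
        return error_rate > 0.10 or self.consecutive_health_failures >= 2

    def maybe_restart(self, restart_fn):
        # Corrective-action layer: restart, but never more than the hourly budget,
        # so a persistent fault does not turn into a restart storm.
        now = time.time()
        while self.restart_times and now - self.restart_times[0] > 3600:
            self.restart_times.popleft()
        if self.diagnose_broken() and len(self.restart_times) < self.max_restarts_per_hour:
            self.restart_times.append(now)
            restart_fn()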
I agree. It’s hard to make a blanket call on whether restarting is worthwhile, e.g. failing to connect to a db could be due to a connection pool being emptied by connections not being returned, and that would be fixed by restarting.
As with ztoz’s ‘deep’ health checks, I think it probably makes sense to have a separate concept for “this container needs restarting” from “this container is not in a working state”, where the latter is a metric signal collected to alert on at a higher level than the orchestrator, and the former is a direct signal to the orchestrator to restart the container.
The author chose Forth, Occam, APL, Simula, SNOBOL, Starset, and m4 as the set of seven with the objective of broadening the reader’s perspective. All of these languages have available implementations. Would you exchange any of these languages for a different “obscure” language?
Definitely Self. It’s what I believe to be the ultimate object programming experience. The language is still maintained (although pretty inactively) and there has been a new release just this year. Its optimization technologies underpin almost every dynamically typed programming language VM (V8, JSC, PyPy, …).
There are thousands of obscure languages, and possibly more that were never released. Off the top of my head I can name ICON, Alice, and Hope. But perhaps the most obscure language I am aware of (that, to my knowledge, had a commercially available compiler, and with which only two commercial products were ever produced) is INRAC. The central conceit of INRAC is line-oriented, nondeterministic execution: each line is a subroutine, unless it ends with a GOTO-NEXT-LINE marker.
SNOBOL is so primitive that it is virtually unusable from a modern perspective. It is interesting for understanding how low the standards for a programming language could be back in the 1960s, even though good languages were also in use at that time.
Instead of SNOBOL I would pick Icon, a modern and usable SNOBOL successor, which is mind expanding, and gives you a new perspective on programming, but in a GOOD way. https://www2.cs.arizona.edu/icon/
I’ve used occam exactly once and would pick Erlang over it because occam is very limiting. It deliberately cannot express things like programs whose memory or stack usage is determined at runtime. The entire communication structure is fixed up front, and channels aren’t first-class values.
Occam was co-designed with the transputer. The 1980s transputers had communications links whose topology was fixed by the way the hardware was wired together. Much of the runtime support was implemented in silicon. So an occam program would be compiled to match the hardware it would run on. It wasn’t until the 1990s with the T9 project that transputers became more dynamic. There were some useful spinouts: the physical layer of the T9 fast links was reused by FireWire, the fast link crossbar switch was reused for ATM — but the T9 itself failed. I don’t know if there was a new more dynamic version of occam to go with the more capable hardware.
The conceptual model and formalism underlying Occam is what’s called a process calculus; specifically, Occam is based on Hoare’s CSP. Milner developed a related line of calculi: CCS, the pi-calculus, and his final model, bigraphs. All of these have been influential in different ways, but the earlier formalisms map better to hardware.
“All of these languages have available implementations.”
Are you sure? I’ve looked for a Starset implementation in the past, and tried again just now; I can’t find it. The author is a professor at Suffolk University and supervised a student implementing a Starset compiler, but it does not appear to be publicly available online.
There’s very little information available online about Starset, but it appears to be a database query language for a database model based on sets (rather than the more familiar models based on relations, graphs, etc). I can’t even find a language specification online, just vague sketches of what the language does.
I would instead choose the language “Setl”, which is a general purpose programming language based on set theory. Sort of like APL, except for sets (and tuples, maps, relations, and powersets) instead of arrays. It may be obscure, but it was also influential. It borrowed the idea of set comprehensions from mathematics and made them into a programming language feature. The list comprehensions in Python, Haskell, etc, come from Setl. https://en.wikipedia.org/wiki/SETL https://setl.org/setl/
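To show the lineage, here are the kinds of set formers SETL pioneered, written as the Python comprehensions they inspired (Python syntax below, not SETL; SETL wrote roughly {f(x) : x in s | p(x)}):

s = set(range(1, 21))

evens = {x for x in s if x % 2 == 0}             # filter a set
squares = {x * x for x in s}                     # map over a set
pairs = {(x, y) for x in s for y in s if x < y}  # build a relation as a set of tuples

print(sorted(evens))
print(sorted(squares))
print(len(pairs))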
On Starset availability: I missed this caveat the first time: “Finally, a Starset interpreter, christened “Suffolk Starset” or s3, is developed and maintained by the team at my own Suffolk University. Its GitHub repository will be made public as soon as we release the first fully functional version.”
I was curious about the list of names in the protocol. The protocol includes a “APID 2047 Idle packet” which seems to be used as either padding or for pinging/keep-alive purposes. The payload is meaningless semantically. Based on other posts on destevez.net, it seems that each mission tailors the content to some custom message.
I wonder what the use case is where you would need such a large timeout value. I’d expect programmers to reach for a scheduler in this case, which would also survive JavaScript engine restarts.
We ran into this at my old job, where we wrote a scheduler in node.js. It took a CSV input of desired future events and set up timeouts to run them. If the service got restarted, it reloaded the CSV and set up the timeouts again.
The guy who generated the CSVs pinged us and said “this latest CSV I’ve uploaded seems to be sending out all the commands as soon as I uploaded it, what’s up with that?” and so we spent a few hours chasing down this weird behaviour.
It’s funny, the scheduler I use is affected by this bug! It does survive reboots, because on reboot the timers are all reloaded, but any job that runs less frequently than every 25 days gets kicked off immediately rather than on schedule. One example is the one that handles Let’s Encrypt recertification.
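The usual workaround in any runtime whose timers cannot represent long delays is to cap each timer and re-arm it until the real deadline. A sketch of the idea in Python (threading.Timer and the one-day cap are illustrative; the node schedulers above would do the equivalent with setTimeout, whose limit is 2**31 - 1 ms, roughly 24.8 days):

import threading
import time

MAX_CHUNK_SECONDS = 24 * 60 * 60  # arbitrary safe cap, well under any overflow


def schedule_at(deadline_epoch, callback):
    remaining = deadline_epoch - time.time()
    if remaining <= 0:
        callback()
        return
    # Sleep in capped chunks; only fire the callback once the deadline is reached.
    chunk = min(remaining, MAX_CHUNK_SECONDS)
    threading.Timer(chunk, schedule_at, args=(deadline_epoch, callback)).start()


# e.g. schedule_at(time.time() + 40 * 24 * 3600, renew_certificate)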
Our domain is special, so it needs a specialized, standardized language.
Our review of existing options shows none of them are sufficient.
We must invent a new language.
That progression is so familiar.
Given the same domain, Erlang is a much more innovative solution.
The tutorial/overview linked by fanf includes a comparison against Java (this would be around J2SE 1.2). CHILL features some structured concurrency (more structured than threads), tuples, and there may have been Ada-like program verification, but the extent is unclear. Amusingly, under ‘Additional Elements’ they note that Java includes remote procedure calls and internet access but there is no corresponding element on the CHILL side. Since CHILL was meant for the telecom domain, you would expect them to advertise library support for telecom standards. Since they note CHILL’s recent support for Unicode, they do recognize interoperability as a requirement.
If you have a lot of these kinds of convos, I would highly recommend setting up explicit “on call” for team members to be in charge of handling external requests. Otherwise you can have motivated people be a bit ambitious with handling everything, get overloaded with firefighting, and end up just kinda exhausted. All while not sharing the institutional knowledge enough so they become a SPOF in an odd way.
Always good to make sure team members are taking steps forward, of course. I just think that when people are doing this on rotation it removes a bit of variability. Not a hard and fast rule, though.
In addition to an explicit ‘on call,’ there were two other practices at my former employer that I think advance this philosophy.
One, we had a policy that a customer could not be redirected more than three times. If you were last in line, you had to hold the ticket to completion rather than redirect to another team. The one time I was the one holding the ticket, the client was seeing random errors throughout the product. As it turned out, they had an HTTP proxy on their side that was randomly failing requests (but only on certain domains), but the policy forced someone to fully investigate rather than keep passing the buck once symptoms could be ascribed to a different team.
Two, as the company grew, we added an ‘engineer support’ role that could support the on-calls. They could handle long-term investigations and support jobs that ran longer than a week but were not big or long enough to warrant an actual project.
Totally agree with your advice for an explicit “on call” during business-hours.
Crucially, moving support out of DMs and into public channels means others can search logs for advice on similar issues (and sometimes even answer their own questions!)
I wrote an internal bot a few years ago that syncs Slack user groups with $work’s on call scheduler. Folks can say @myteam-oncall in a public channel and instantly reach the right person without overambitious members needing to be involved in triage. It’s also easy enough to say @friendlyteam-oncall and redirect folk in-place to another team without switching channels or losing context.
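For anyone curious, the core of such a bot can be quite small. A sketch (the user group ID and the on-call lookup are placeholders; the Slack call is the real usergroups.users.update Web API method):

import os
import requests

SLACK_TOKEN = os.environ["SLACK_BOT_TOKEN"]
USERGROUP_ID = "S0123456789"  # the @myteam-oncall user group (placeholder ID)


def get_current_oncall_slack_ids():
    # Placeholder: query PagerDuty/Opsgenie/etc. and map emails to Slack user IDs.
    return ["U0AAAAAAA"]


def sync_oncall_usergroup():
    resp = requests.post(
        "https://slack.com/api/usergroups.users.update",
        headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
        data={"usergroup": USERGROUP_ID,
              "users": ",".join(get_current_oncall_slack_ids())},
    )
    resp.raise_for_status()
    if not resp.json().get("ok"):
        raise RuntimeError(f"Slack API error: {resp.json().get('error')}")


# Run this periodically (cron or the scheduler itself) so the group tracks the rotation.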
In Jean Sammet’s Programming Languages: History and Fundamentals (1969), she notes that periods and semicolons are commonly used to delimit the smallest executable units (III.3.2.1). The blog author’s hypothesis that the semicolon was chosen because of its similarity to English is supported by how the ‘smallest executable units’ are treated.
One of the predecessors to ALGOL58, IAL, used semicolons as the separator between single statements, although the “publication language” allowed semicolons to be dropped in favor of putting statements on separate lines. Periods only seem to be used to denote real numbers. So, since ALGOL58 was heavily influenced by IAL, it perhaps simply inherited the syntax.
There were alternatives: MATH-MATIC, or AT-3, which started development in 1955, uses periods at the end of statements (IV.2.2.1). FORTRAN did not include a statement delimiter, since the physical form of the code, i.e. the punch card, was the delimiter. The FORTRAN character set did not include a semicolon, but it did include a period. An extremely limited character set was common in this period and the syntax used to communicate an algorithm might not use the same characters as required for physical input (e.g. a < used instead of .LT.). For instance, see the table of ALGOL60 characters in section IV.4.3 which include logical and set operators.
As a practical matter, I have read semi-colons were preferred to periods because periods were easy to miss since they were very small. However, I think that’s anecdotal. It may also have been easier to parse code when periods and semicolons did not share the same role, but that’s also speculation.
IAL and ALGOL58 are different names for the same thing.
I had a look through some of the early issues of CACM (open access at last, yay!) to see how algorithms were written, and they had not yet settled on a style that was algorithmic rather than mathematical.
In prose, when writing a complicated list where the elements might contain commas, it’s normal to separate them with semicolons. So it makes sense to use ; to separate a sequence of algorithmic steps.
For systems that I haven’t used myself, such as SourceSafe and ClearCase, I would love to hear from you about your experience with them.
It is sad how little public information there is about the proprietary systems. My company uses RTC (part of IBM ALM) a lot, but I’m only involved in transitioning away from it. Developers rant a lot about it, so the proprietary ones always have a bad image (slow! over-engineered! etc)
However, I have the feeling that it isn’t the fault of these systems. One point is that these systems are often abused, like storing lots of big binary blobs in there. Git doesn’t handle that well either. So when introducing git, we are forced to also introduce additional infrastructure like Artifactory, build scripts and pipelines and conventions to make them work together. In the end, the whole system is even more complex.
These proprietary VCSs do have some valuable lessons in there. However, they are not needed for OSS projects, so nobody cares (in public).
I’d love to hear more about RTC and other proprietary version control systems! How do you use them? What are they good at? What do you enjoy? What is annoying?
I am continuing the blog post series into the 90s, where ClearCase, Visual SourceSafe, Perforce, and other proprietary systems pop up and become popular. However, I haven’t used some of these (with the exception of Perforce), so I am really looking for more input.
I think proprietary ones can be really good. Perforce, for example, is still widely used and well regarded (of course … it depends) in games companies, but with the dominance of a few systems and the availability of very good open source alternatives, making a case for spending real USD every year on a source control system can be tricky.
Note that the blog post is part 1 of a (hopefully) multi-part series, and thus only covers the very early days of source control systems for now.
First, vrthra’s writeup agrees with my memory of how CC worked.
I was an employee of an aerospace company from roughly 2005 to 2013. CC was the sorta-official configuration management tool. (Company rules mandated some form of config management to be used on projects, but didn’t mandate a single tool.) Requirement documents, specifications, drawings, source code, and builds were all (typically) stored in CC.
CC was labor intensive, as each company site had a CC administrator, and at least at one site this was a full-time job. We also had CM specialists, and some software engineers received advanced training and partial administrative rights to help handle issues.
Users would have a drive or a mounted filesystem linked to a CC VOB. The file locking mechanism of CC worked fairly well for people working on documents (Word files) or other binary files as the multi-user features of those programs were typically primitive at this time. With extensions, the user might not even be aware of the locking feature, as the file would be locked and unlocked for them if they made changes. (Unintended changes were fairly common, so using these extensions was discouraged.) These drives also acted as a distribution mechanism, as an update to a software build could be invisibly propagated to test machines.
The ConfigSpec/View feature of CC was very flexible, but also very confusing. Most users were told what spec to use and didn’t deviate. Developers used the spec to grab individual developer branches and create branches of their own, although it was common to forget to update the spec before making an edit. Fixing problems almost always required an expert to debug.
Many software developers were unhappy with CC (and adjacent tools, like Rational Rose). In particular, the administrative overhead was onerous if you had a project that lasted less than a year. Since my role was typically ‘project adjacent,’ I mostly used Subversion during my time there, sometimes syncing the SVN repository to CC for an official release. Towards the end, I noticed software developers finding excuses to use git.
For quite a bit (I’d say 2009ish to 2013ish) when git was new to most people, one of our main selling points was “It doesn’t matter if you mess up more than with svn, everything is easier to fix”.
Often when I hear about the commercial systems, people have a similar vibe: “I can’t read the code, and we’re only users; we never have people who kinda went through everything from A to Z and understood it completely.” Not sure if that is also a main point, but I have a feeling it very often is with commercial systems, unless it’s such a big beast that you have consultants and dedicated support engineers by the org selling it…
Page 8 has diagrams of the keyboard layout. It looks like the caret symbol ^ would require a shift-N for the 1967 version. The 1963 keyboard would print an up-arrow, although that appears to be protocol equivalent.
Not entirely relevant, but interesting perhaps, to note that at the same time (ASCII 1967) up-arrow became caret, back-arrow became underscore. The model 33 used ASCII 1963 and had up and back arrows.
The back-arrow was used for variable assignment in some programming languages. I would guess the only easily accessible example of this today is Smalltalk.
And on a bit-paired keyboard, shift + _ is ASCII '\x20' | '\x5F', which is '\x7F', the delete character. For example, the ADM 3A has RUB (out) on the _ key. I haven’t found a keyboard that clearly shares <- (delete) and <- (ASCII ’63 back-arrow) on a keycap.
My theory is that this flipped when lowercase came out. The actual ASR 33 is uppercase-only, and has shift-N as up-arrow and shift-O as back-arrow, but a separate RUBOUT key. So the shift key is turning bit 5 on. The 7F character is out by itself on the chart, so it just gets its own key. The Datapoint 3300, designed as a Model 33 emulator, has this setup.
If you have lowercase, then shift still toggles bit 5 for numbers, but it turns bit 6 off for letters. Putting DEL in that bit 6 zone means shift-DEL generates underscore (not back-arrow, because lowercase implies ASCII 1967). The Datapoint 2200 has a very bit-paired keyboard (shift-zero even generates a space!) and has an underscore on the RUB key like the ADM 3A.
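The code-point arithmetic in these comments is easy to check against the 1967 ASCII values:

# On the uppercase-only Model 33, shift flips the 0x10 bit: N -> ^, O -> _
# (the 1963 chart printed those two positions as up-arrow and back-arrow).
print(hex(ord("N")), chr(ord("N") | 0x10))   # 0x4e ^
print(hex(ord("O")), chr(ord("O") | 0x10))   # 0x4f _

# With lowercase present, shift clears the 0x20 bit for letters: a -> A.
print(hex(ord("a")), chr(ord("a") & ~0x20))  # 0x61 A

# And OR-ing underscore with 0x20 lands on DEL, which is why shift could rub out.
print(hex(ord("_") | 0x20))                  # 0x7f, the delete character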
On the practical side, ‘pooling variability’ can have some great benefits for control and throughput. Similar to some of the author’s examples at the end, with one system at work, we had a steady stream of requests of about 1 to 20 “units” in size. However, sometimes a request would come in with a size of thousands or millions of units. We created two parallel systems, one to handle the routine work and one to handle the outsize work. Routine was predictable and well-controlled and rarely caused any trouble. With the outsize work system, we could focus on scalability and monitoring.
I was a bit disappointed at first when I realized that this is a guided structure or an outline of what one should study rather than the full material. But I was told this is a translation/language problem. Maybe I shouldn’t have expected that?
In the OED, a curriculum is a “regular course of study or training”, referencing the use at the University of Glasgow. Whether a course of study refers to the structure of learning or includes specific learning materials, seems to be rather debatable. Wikipedia has a long section on the various meanings.
I don’t think MDN is breaking any new ground; ACM uses similar terminology in their documents on the definition of computer science education. The term “standards” seems to be the trend in public education, though.
The MDN page does include links to some tutorials for certain subjects and it sounds like they plan to add more references, including to courses that people can take.
I’m surprised they implemented division and square roots as hardware instructions, given the extra complexity and costs, rather than leaving them for a software implementation like the EDSAC. Presumably the real-time constraints of the application necessitated it.
The Arma Micro Computer is just one of the dozens of compact aerospace computers of the 1960s, a category that is mostly forgotten and ignored.
I think this is partly due to the question of whether this falls into aerospace history or computing history. As an embedded device, it would fall naturally into a history of improved guidance controls and their transition from analog to digital.
A single end-to-end test, or a small set of them, can help focus on valuable functionality. If the functions they exercise also deliver value (i.e., the greater organization or some customers can start using them), they serve the important purpose of demonstrating concrete progress and securing management favor.
In my last job, our coding systems underwent three major versions. The first version was a little more than a prototype and used to validate that our clients wanted that kind of functionality. It was also useful in teasing out the interfaces with other team’s services and some performance numbers. The second version was, unfortunately, completed in a rush, as clients really wanted the functionality. We had been designing it, but ended up having to cut back on several features in order to meet schedule. This version taught us a lot about the reliability issues in our problem set and performance requirements. Fortunately, we were able to limp along with the second version for several years, which allowed a number of design iterations and alignment of design vision with other teams.
So, prototypes can be useful in learning a domain and how to build a system, although I recommend you go in with an idea of what you want to learn. I’ve often seen prototype just mean “build the first version, but without tests”, which is not the point.
Organizationally, can you have a program manager assigned to your project? A good PM will know how to elicit risks from the development team, guide prioritization of new functionality versus tech debt, and work with management on resourcing and schedule. (If you have a bad PM, they’ll either be a tax on productivity or potentially drive the project to failure. That’s the risk.)
I’m lucky in that there already is a previous version (albeit with a smaller remit) that a lot of the current spec is based off of. That system’s been in production for about 10 years now, so I’m well aware of the dark corners.
Unfortunately I don’t have access to resources like program managers or analysts for this. It would be helpful, but I also have no time pressure so I can take the time to make sure things are correct and valid and really think things through.
In North American middle schools, kids often get these annoying “math questions” along the lines of “continue this sequence of numbers…” along with 5 or so numbers. Disregarding that a mathematical sequence is defined by its values at every input (so any continuation is technically valid), every other kid rediscovers (or gets taught) the method of finite differences (or finite x for some operation x): take the differences between consecutive terms and iterate level by level until you reach a constant sequence.
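For anyone who has not seen it, the method is tiny to write down (a quick illustration, using the sequence n^2 + 1):

def difference_table(seq):
    rows = [list(seq)]
    while len(set(rows[-1])) > 1:            # stop when a row is constant
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    return rows


def extend(seq):
    rows = difference_table(seq)
    rows[-1].append(rows[-1][-1])            # the constant row stays constant
    for i in range(len(rows) - 2, -1, -1):   # propagate one new value back up
        rows[i].append(rows[i][-1] + rows[i + 1][-1])
    return rows[0][-1]


print(difference_table([2, 5, 10, 17, 26]))  # [[2, 5, 10, 17, 26], [3, 5, 7, 9], [2, 2, 2]]
print(extend([2, 5, 10, 17, 26]))            # 37, i.e. 6**2 + 1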
In the ‘why should you care’ bit, it would be nice to mention that the underlying mathematics behind this is also used in compiler optimisations based on scalar evolution. If you have a loop nest where each level is doing some addition over an induction variable (which is most loop nests, and even more after some canonicalisation) then you have a sequence in this form where you already know the differences. Scalar evolution works by modelling the polynomial and performing transforms of the loop based on this model.
This is one of the more complex bits of mathematics (if you aren’t doing polyhedral optimisation) in a modern compiler and is one of the things that gets some really big wins and is not present in simple compilers.
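To make the connection concrete, here is the kind of transform this reasoning enables, done by hand (a toy illustration, not compiler output):

# i*i is a degree-2 polynomial in the induction variable, so its finite
# differences (2i+1, then the constant 2) let the multiply become two adds.
def squares_naive(n):
    return [i * i for i in range(n)]


def squares_strength_reduced(n):
    out = []
    sq, first_diff = 0, 1        # i*i at i=0, and its first difference 2*0 + 1
    for _ in range(n):
        out.append(sq)
        sq += first_diff         # advance the polynomial by one step
        first_diff += 2          # the second difference is the constant 2
    return out


assert squares_naive(10) == squares_strength_reduced(10)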
When I first found out about Scalar evolutions I liked them a lot, although it is unclear whether they provide any benefits over a more general approach like http://apron.cri.ensmp.fr/library/ (except perhaps simpler code?). At some point GCC used to depend on PPL, but doesn’t anymore (got replaced by ISL, and I don’t know whether that’d be a substitute for SCEV).
I expected the author to obtain the savings by eliminating some of the easy dumb stuff you can do building a docker image (e.g. creating multiple layers through multiple update and install steps), but they get to the trickier stuff right away.
I think there is a maintainability question as the script expects packages to store specific files in specific locations and changes to packages could easily break those assumptions. However, a useful demonstration.
I thought more that it might make it easier and less error prone to copy out the libraries from the first layer, if the packages were just installed into an alt-root-dir in the first layer (as above), as they would then be in a known location with less “other stuff” to worry about.
I think maintenance will be hit-and-miss. The locations of these shared libraries are pretty stable (for now).
The only thing I am remotely worried about is the change in version numbers of these libs. But I think that can be fixed to some extent by copying them via a pattern, like I did for musl.
it’s not immediately clear to me what the advantage of using a <meter> would be from an accessibility viewpoint, if you’ve got the written description right there.
Very understated. Having the written fallback for visual users then doubles up what the accessibility services see. Frustrating to no end!
Aside: progressbar is a type of meter right? Is a donation goal a progressbar or a meter?
A donation goal is a meter because donations can be exceeded. In fact, exceeding a donation goal is something to be celebrated, so the client would want to call attention to that fact. In contrast, a progress bar restricts the value to the max (or 1 if no max is given) because, conceptually, tasks can only be completed.
Sorry in advance for the long-winded context-setting, but it’s the only way I know how to answer this question.
There are a few important things to understand. First, even though HealthCare.gov was owned by CMS, there was no single product owner within CMS. Multiple offices at CMS were responsible for different aspects of the site. The Office of Communications was responsible for the graphic design and management of the static content of the site, things like the homepage and the FAQ. The Center for Consumer Information and Insurance Oversight was responsible for most of the business logic and rules around eligibility, and management of health plans. The Office of Information Technology owned the hosting and things like release management. And so on. Each of these offices has its own ability to issue contracts and its own set of vendors it prefers to work with.
Second, HealthCare.gov was part of a federal health marketplace ecosystem. States integrated with their own marketplaces and their Medicaid populations. Something called the Data Services Hub (DSH) connected HealthCare.gov to the states and to other federal agencies like IRS, DHS, and Social Security, for various database checks during signup. The DSH was its own separately procured and contracted project. An inspector general report I saw said there were over 60 contracts comprising HealthCare.gov. Lead contractors typically themselves subcontract out much of the work, increasing the number of companies involved considerably.
Then you have the request for proposals (RFP) process. RFPs have lists of requirements that bidding contractors will fulfill. Requirements come from the program offices who want something built. They try to anticipate everything needed in advance. This is the classic waterfall-style approach. I won’t belabor the reasons this tends not to work for software development. This kind of RFP rewards responses that state how they will go about dutifully completing the requirements they’ve been given. Responding to an RFP is usually a written assertion of your past performance and what you claim to be able to do. Something like a design challenge, where bidding vendors are given a task to prove their technical bona fides in a simulated context, while somewhat common now, was unheard of when HealthCare.gov was being built.
Now you have all these contractors and components, but they’re ostensibly for the same single thing. The government will then procure a kind of meta-contract, the systems integrator role, to manage the project of tying them all together. (CGI Federal, in our case.) They are not a “product owner”, a recognizable party accountable for performance or end-user experience. They are more like a general contractor managing subs.
In addition, CMS required that all software developed in-house conform to a specified systems design, called the “CMS reference architecture”. The reference architecture mandated things like: the specific number of tiers or layers, including down to the level of reverse proxies; having a firewall between each layer; communication between layers had to use a message queue (typically a Java program) instead of say a TCP socket; and so forth. They had used it extensively for most of the enterprise software that ran CMS, and had many vendors and internal stakeholders that were used to it.
Finally, government IT contracts tend to attract government IT contractors. The companies that bid on big RFPs like this are typically well adapted to the contracting ecosystem. Even though it is ostensibly an open marketplace that anyone can bid on, the reality is that the federal government imposes a lot of constraints on businesses to be able to bid in the first place. Compliance, accreditation, clearance, and accounting are all part of it, as well as strict requirements on rates and what you can pay your people. There’s also having access to certain “contracting vehicles”, or pre-chosen groups of companies that are the only ones allowed to bid on certain contracts. You tend to see the same businesses over and over again as a result. So when the time comes to do something different – e.g., build a modern, retail-like web app that has a user experience more like consumer tech than traditional enterprise government services – the companies and talent with that relevant experience probably aren’t in the procurement mix. And even if they were, if they walked into a situation where the reference architecture was imposed on them and responsibility was fragmented across many teams, how likely would they be to do what they are good at?
tl;dr:
No single product owner within CMS; responsibilities spread across multiple offices
Over 60 contracts just for HealthCare.gov, not including subcontractors
Classic waterfall-style RFP process; a lot of it designed in advance and overly complicated
Systems integrator role for coordinating contractors, but not owning end-user experience
CMS mandated a rigid reference architecture
Government IT contracts favor specialized government contractors, not necessarily best-suited for the job
At any point did these people realize they kinda sucked, technically? Did they kinda suck?
I think they saw quickly from the kind of experience, useful advice, and sense of calm confidence we brought – based on knowing what a big transactional public-facing web app is supposed to look like – that they had basically designed the wrong thing (the enterprise software style vs a modern digital service) and had been proceeding on the fairly impossible task of executing from a flawed conception. We were able to help them get some quick early wins with relatively simple operational fixes, because we had a mental model of what an app handling high throughput with low latency should be, and, once monitoring was in place, we went after the things that deviated most from that model. For example, a simple early thing we did that had a big impact was configuring db connections to more quickly recycle themselves back into the pool; they had gone with a default timeout that kept the connection open long after the request had been served, and since the app wasn’t designed to stuff more queries into an already-open connection, it simply starved itself of available threads. Predictably, clearing this bottleneck let more demand flow more quickly through the system, revealing further pathologies. Rinse and repeat.
Were there poor performers there? Of course. No different than most organizations. But we met and worked with plenty of folks who knew their stuff. Most of them just didn’t have the specific experience needed to succeed. If you dropped me into a mission-critical firmware project with a tight timeline, I’d probably flail, too. For the most part, folks there were relieved to work with people like us and get a little wind at their backs. It’s hard having to hear in the news every day that you are failing at your job.
Did anything end up happening to the folks that bungled the launch? Were there any actual monetary consequences?
You can easily research this, but I’ll just say that it’s actually very hard to meaningfully penalize a contractor for poor performance. It’s a pretty tightly regulated aspect of contracting, and if you do it can be vigorously protested. Most of the companies are still winning contracts today.
What sort of load were y’all looking at?
I would often joke on the rescue that for all HealthCare.gov’s notoriety, it probably wasn’t even a top-1k, or even top-10k, site in terms of traffic. (It shouldn’t be this hard, IOW.) Rough order of magnitude, peak traffic was around 1,000 rps, IIRC.
What was the distribution of devices, if you remember?
You could navigate the site on your phone if you absolutely had to, but it was pretty early mobile-responsive layout days for government, and given the complexity of the application and the UI for picking a plan, most people had to use a desktop to enroll.
What were some of the most arcane pieces of tech involved in the project?
Great question. I remember being on a conference bridge late in the rescue, talking with someone managing an external dependency that we were seeing increasing latency from. You have to realize, this was the first time many government agencies were doing things we’d recognize as 3rd party API calls, exposing their internal dbs and systems for outside requests. This guy was a bit of a grizzled old mainframer, and he was game to try to figure it out with us. At one point he said something to the effect of, “I think we just need more MIPS!” As in, millions of instructions per second. And I think he literally went to a closet somewhere, grabbed a board with additional compute and hot-swapped it into the rig. It did the trick.
In general I would say that the experience, having come from what was more typical in the consumer web dev world at the time - Python and Ruby frameworks, AWS EC2, 3-tier architecture, RDBMSes and memcached, etc. - was like a bizarro world of tech. There were vendors, products, and services that I had either never heard of, or were being put to odd uses. Mostly this was the enterprise legacy, but for example, code-generating large portions of the site from UML diagrams: I was aware that had been a thing in some contexts in, say, 2002, but, yeah, wow.
To me the biggest sin, and I mean this with no disrespect to the people there who I got to know and were lovely and good at what they did, was the choice of MarkLogic as the core database. Nobody I knew had heard of it, which is not necessarily what you want from what you hope is the most boring and easily-serviced part of your stack. This was a choice with huge ramifications up and down, from app design to hardware allocation. An under-engineered data model and a MySQL cluster could easily have served HealthCare.gov’s needs. (I know this, because we built that (s/MySQL/PostgreSQL) and it’s been powering HealthCare.gov since 2016.)
A few years after your story I was on an airplane sitting next to a manager at MarkLogic. He said the big selling point for their product is the fact it’s a NoSQL database “Like Mongo” that has gone through the rigorous testing and access controls necessary to allow it to be used on [government and military jobs or something to that effect] “like only Oracle before us”.
He’s probably referring to getting FedRAMP certified or similar, which can be a limiting factor on what technologies agencies can use. Agencies are still free to make other choices (making an “acceptable risk” decision).
In HealthCare.gov’s case, the question wasn’t what database can they use, but what database makes the most sense for the problem at hand and whether a document db was the right type of database for the application. I think there’s lots of evidence that, for data modeling, data integrity, and operational reasons, it wasn’t. But the procurement process led the technology choices, rather than the other way round.
I’m not Paul, but I recall hearing Paul talk about it one time when I was his employee, and one of the problems was MarkLogic, the XML database mentioned in the post. It just wasn’t set up to scale and became a huge bottleneck.
Paul also has a blog post complaining about rules engines.
Have you found something else as meaningful for you since then?
Yes, my work at Ad Hoc - we started the company soon after the rescue and we’ve been working on HealthCare.gov pretty much ever since. We’ve also expanded to work on other things at CMS like the Medicare plan finder, and to the Department of Veterans Affairs where we rebuilt va.gov and launched their flagship mobile app. And we have other customers like NASA and the Library of Congress. Nothing will be like the rescue because of how unique the circumstances, but starting and growing a company can be equally as intense. And now the meaning is found in not just rescuing something but building something the right way and being a good steward to it over time so that it can be boring and dependable.
Also what’s your favorite food?
A chicken shawarma I had at Max’s Kosher Café (now closed 😔) in Silver Spring, MD.
This is only slightly sarcastic. I’m hopeful that with your successful streak you can target the EDI process itself, because it’s awful. I only worked with hospice claims, but CMS certainly did a lot to make things complicated over the years. I think we only kept up because I had a well-built system in Ruby.
It’s literally the worst thing in the world to debug and when you pair that with a magic black box VM that randomly slurps files from an NFS share, it gets dicey to make any changes at all.
For those not familiar, X12 is a standard for exchanging data that old-school industries like insurance and transportation use that well-predates modern niceties like JSON or XML.
X12 is a … to call it a serialization is generous; I would describe it more as a context-sensitive stream of transactions where parsing is not a simple matter of syntax but depends on what kind of X12 message you are handling. They can be deeply nested, with variable delimiters, require lengthy specifications to understand, and it’s all complicated further by versioning.
On top of that, there is a whole set of X12-formatted document types that are used for specific applications. Relevant for our discussion, the 834, the Benefit Enrollment and Maintenance document, is used by the insurance industry to enroll, update, or terminate coverage.
To give you a little flavor of what these are like, here is the beginning of a fake 834 X12 doc:
ISA*00* *00* *ZZ*CMSFFM *ZZ*54631 *131001*1800*^*00501*000000844*1*P*:~
GS*BE*NC0*11512NC0060024*20131001*1800*4975*X*005010X220A1~
ST*834*6869*005010X220A1~
BGN*00*6869*20131001*160730*ET***2~
QTY*TO*1~
QTY*DT*0~
N1*P5*Paul Smith*FI*123456789~
N1*IN*Blue Cross and Blue Shield of NC*FI*560894904~
INS*Y*18*021*EC*A***AC~
REF*0F*0000001001~
REF*17*0000001001~
REF*6O*NC00000003515~
DTP*356*D8*20140101~
NM1*IL*1*Smith*Paul****34*123456789~
PER*IP**TE*3125551212*AP*7735551212~
N3*123 FAKE ST~
N4*ANYPLACE*NC*272885636**CY*37157~
HealthCare.gov needed to generate 834 documents and send them to insurance carriers when people enrolled or changed coverage. Needless to say, this did not always go perfectly.
I personally managed to avoid having to deal with 834s until the final weeks of 2013. An issue came up and we needed to do some analysis across a batch of 834s. Time was of the essence, and there was basically no in-house tooling I could use to help - it only generated 834s, it didn’t consume them. So naturally I whipped up a quick parser. It didn’t have to be a comprehensive, fully X12-compliant parser; it only needed enough functionality to extract the fields needed to do the analysis.
So I start asking around for anyone with 834 experience for help implementing my parser. They’re incredulous: it can’t be done, the standard is hundreds of pages long, besides you can’t see it (the standard), it costs hundreds of dollars for a license. (I emailed myself a copy of the PDF I found on someone’s computer.)
I wrote my parser in Python and had the whole project done in a few days. You can see a copy of it here. The main trick I used was to maintain a stack of segments (delimited portions of an X12 doc); the main loop would branch to methods corresponding to the ID of the current segment, allowing context-sensitive decisions to be made about what additional segments to pull off the stack.
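Not the parser linked above, just a sketch of the same trick applied to the sample 834 earlier in the thread (delimiters are hard-coded here; a real document declares them in the ISA segment):

from collections import deque


def parse_834(doc, segment_terminator="~", element_separator="*"):
    # Split the doc into segments, keep them in a deque, and dispatch on the
    # segment ID so each handler can pull the follow-on segments it expects.
    segments = deque(
        seg.strip().split(element_separator)
        for seg in doc.split(segment_terminator)
        if seg.strip()
    )
    members = []
    while segments:
        seg = segments.popleft()
        if seg[0] == "INS":                          # start of a member loop
            member = {"maintenance_type": seg[3]}
            # Context-sensitive part: consume segments until the next INS.
            while segments and segments[0][0] != "INS":
                sub = segments.popleft()
                if sub[0] == "NM1" and sub[1] == "IL":
                    member["last_name"], member["first_name"] = sub[3], sub[4]
                elif sub[0] == "DTP" and sub[1] == "356":
                    member["coverage_start"] = sub[3]
            members.append(member)
    return members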
The lesson here is that knowing how to write a parser is a super power that will come in handy in a pinch, even if you’re not making a compiler.
just wanted to give you props. I’m a contractor myself and I’ve been on both sides of this (I have been in the gaggle of contractors and been one of the people they send in to try to fix things) and neither side is easy, and I’ve never come close to working on anything as high profile as this and even those smaller things were stressful. IME contracting is interesting in some ways (I have worked literally everywhere in the stack and with tons of different languages and environments which I wouldn’t have had the opportunity to do otherwise) and really obnoxious in other ways (legacy systems are always going to exist but getting all the contractors and civilian employees pulling in the same direction is a pain, especially without strong leadership from the government side)
Thanks, and back at ya. My post didn’t get into it, but it really was a problem of a lack of a single accountable product lead who was empowered to own the whole thing end-to-end. Combine that with vendors who were in a communications silo (we constantly heard from people who had never spoken to their counterparts on other teams because the culture was not to talk directly except through the chain of command) and the result was everyone off in their corner doing their own little thing, but no one with a view to the comprehensive whole.
Extremely. Aside from the performance-related problems that New Relic gave us a handle on, there were many logic bugs (lots of reasons for this: hurried development, lack of automated tests, missing functionality, poorly-understood and complicated business rules that were difficult to implement). This typically manifested in users “getting stuck” partway through somewhere, and they could neither proceed to enrollment nor go back and try again.
For example, we heard early on about a large number of “lost souls” - people who had completed one level of proving their identity (by answering questions about their past, a la a credit check), but due to a runtime error in a notification system, they never got the link to go to the next level (like uploading a proof of ID document). Fixing this would involve not just fixing the notification system, but figuring out which users were stuck (arbitrary db queries) and coming up with a process to either notify them (one-off out-of-band batch email) or wipe their state clean and hope they come back and try again. Or if they completed their application for the tax credit, they were supposed to get an official “determination of eligibility” email and a generated PDF (this was one of those “requirements”). But sometimes the email didn’t get sent or the PDF generator tipped over (common). Stuck.
The frontend was very chatty over Ajax, and we got a report that it was taking a long time for some people to get a response after they clicked on something. And we could correlate those users with huge latency spikes in the app on New Relic. Turns out for certain households, the app had a bug where it would continually append members of a household to a list over and over again without bound after every action. The state of the household would get marshalled back and forth over the wire and persisted to the db. Due to a lack of a database schema, nothing prevented this pathological behavior. We only noticed it because it was starting to impact user experience. We went looking in the data, and found households with hundreds of people in them. Any kind of bug report was just the tip of the iceberg of some larger system problem and eventual resolution.
I have a good story about user feedback from a member of Congress live on TV that led to an epic bugfix. I’ve been meaning to write it up for a long while. This anniversary is giving me the spur to do it. I’ll blog about it soon.
I agree with df — I think there’s more space, and practical benefit, to apply OR to the SaaS/distributed system space. Also, apg is correct that Google’s SRE book will give you a good overview of the existing applications of OR and control theory to large-scale software systems.
As an example of applying OR to distributed systems, my former employer experienced a number of operational and cost issues scheduling containers within our Nomad fleet. (Kubernetes will be similar.) The containers were very heterogeneous in memory and CPU requirements because our company had many microservices, but not that many instances of any single microservice. As I recall, Nomad only had two bin packing policies, and neither worked all that well. Either we had tightly packed hosts plus idle hosts (and if a tightly packed host died, there was a lot of pain as containers were re-scheduled), or we had a large number of mostly occupied hosts, leading to blocked deployments as no host could take very large jobs.
I used queuing theory on multiple projects, but one challenge is that the systems in use don’t match the theoretical queues all that cleanly. Kafka, which isn’t a true queuing or messaging system but is often used as one, can have very poor performance if the variability of work items is high (so certain partitions lag others). I haven’t seen Amazon SQS modeled theoretically to any precision. A practical theoretical model would be useful in both cases to aid in operational planning and understanding.
I think colleges would jump at adding an operations course if there were an equivalent to Factory Physics for software systems.
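One piece of theory that does transfer directly is Kingman’s approximation, the “VUT” relation at the heart of Factory Physics: queueing delay grows with utilization and with arrival/service variability, which is exactly the partition-lag effect mentioned above. A back-of-the-envelope calculator (the example numbers are made up):

# Wq ~= (rho / (1 - rho)) * ((ca**2 + cs**2) / 2) * mean_service_time
def kingman_wait(utilization, ca, cs, mean_service_time):
    return (utilization / (1 - utilization)) * ((ca**2 + cs**2) / 2) * mean_service_time


# Same 80% utilization and 50 ms mean service time; only the variability differs.
print(kingman_wait(0.8, ca=1.0, cs=1.0, mean_service_time=0.050))  # ~0.20 s
print(kingman_wait(0.8, ca=1.0, cs=4.0, mean_service_time=0.050))  # ~1.70 s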
What is the reason not to write def send_emails(list_of_recipients, bccs=[])?
From my Clojure background I wonder what is sending the emails (and how you can possibly test that unless you are passing in something that will send the email), but perhaps there is something I’m missing from the Python perspective?
The answer is mutable default arguments; if you’re coming from Clojure, I can see why this wouldn’t be an issue 😛
When Python is defining this function, it’s making the default argument a single instance of an empty list, not “a fresh empty list created when invoking the function.” So if that function mutates the parameter in some way, that mutation will persist across calls, which leads to buggy behavior.
An example might be (with a whole module):
from sendgrid import actually_send_email
from database_functions import get_alternates_for
def send_emails(list_of_recipients, bccs=[]):
    # suppose this app has some settings where users can specify alternate
    # addresses where they always want a copy sent. It's
    # contrived but demonstrates the bug:
    alternate_recipients = get_alternates_for(list_of_recipients)
    bccs.extend(alternate_recipients)  # this mutates bccs
    actually_send_email(
        to=list_of_recipients,
        subject="Welcome to My app!",
        bccs=bccs,
        message="Thanks for joining!")
What happens is: every time you call it without specifying a second argument, the emails added from previous invocations will still be in the default bccs list.
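The classic minimal reproduction, stripped of the email details:

def append_to(item, bucket=[]):
    # The default list is created once, at function definition time,
    # and shared by every call that does not pass its own list.
    bucket.append(item)
    return bucket


print(append_to(1))            # [1]
print(append_to(2))            # [1, 2]  <- the previous call's item is still there
print(append_to.__defaults__)  # ([1, 2],)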
The way to actually do a “default to empty list” in Python is:
def send_emails(list_of_recipients, bccs=None):
    if bccs is None:
        bccs = []
    # rest of the function
As for how you test it, it’s probably with mocking. Python has introspection/monkey-patching abilities, so if you rely on another library to handle actually sending the email (in the example above I pretended sendgrid had an SDK with a function called actually_send_email), in Python you would usually do something like:
# a test module
from unittest.mock import patch

from my_module import send_emails


def test_send_emails():
    # Patch the name where my_module looks it up (my_module.actually_send_email),
    # since my_module imported the function directly with "from sendgrid import ...".
    with patch('my_module.actually_send_email') as mocked_email_send:
        to = ['pablo@example.com']
        bccs = ['pablo_alternate@example.com']
        send_emails(to, bccs)
        mocked_email_send.assert_called_once_with(
            to=to,
            subject="Welcome to My app!",
            bccs=bccs,
            message="Thanks for joining!")
This patches my_module at runtime so that its reference to actually_send_email records its calls instead of doing what the function did originally.
I also take this approach when defining healthchecks. I’ve not seen it explicitly stated though, and I wonder how commonly it’s followed.
Is it a good rule of thumb to only fail a healthcheck if restarting the container would lead to the problem being resolved?
I honestly don’t think Forth is obscure as opposed to old. I’d pick something like ATS (http://www.ats-lang.org/) over Forth.
I’ve been learning this one. Quirky keyword usage, but amazing ideas & performance opportunity.
I have a very soft spot for M(UMPS), not sure if I’d change any of the proposed languages for it…
I know of Racter. What was the other one?
It was a program to help write poetry. I unfortunately don’t recall the name.
INRAC sounds interesting, I can’t find anything about it online, does anything about it even exist anymore?
Only in a few places. I have several entries about INRAC (an isolated entry from 2008 and several over a two-month span in 2015), and I have found two repositories on GitHub (but haven’t gone into depth with those). And that’s about it.
SNOBOL is so primitive that it is virtually unusable from a modern perspective. Interesting from the point of view of understanding how low the standards could be for a programming language back in the 1960’s, even though good languages were also being used at that time.
Instead of SNOBOL I would pick Icon, a modern and usable SNOBOL successor, which is mind expanding, and gives you a new perspective on programming, but in a GOOD way.
https://www2.cs.arizona.edu/icon/
I’ve used occam exactly once and would pick erlang over it because occam is very limiting. It deliberately cannot express things like programs that can use an amount of memory or stack space that is determined at runtime. The entire communication structure and everything is all fixed up front. Channels aren’t first class values.
Occam was co-designed with the transputer. The 1980s transputers had communications links whose topology was fixed by the way the hardware was wired together. Much of the runtime support was implemented in silicon. So an occam program would be compiled to match the hardware it would run on. It wasn’t until the 1990s with the T9 project that transputers became more dynamic. There were some useful spinouts: the physical layer of the T9 fast links was reused by FireWire, the fast link crossbar switch was reused for ATM — but the T9 itself failed. I don’t know if there was a new more dynamic version of occam to go with the more capable hardware.
Yeah I know. I took a class taught by David May.
I still think Occam sucks.
The conceptual model and formalism underlying Occam is a process calculus, specifically Hoare’s CSP. Milner developed a related family of calculi: CCS, the Pi-calculus, and his final version, the bigraph model. All of these have been influential in different ways, but the earlier formalisms map better to hardware.
Are you sure? I’ve looked for a Starset implementation in the past, and tried again just now. I can’t find it. The author is a professor at Suffolk University, and supervised a student implementing a Starset compiler, but it appears to not be publicly available online.
There’s very little information available online about Starset, but it appears to be a database query language for a database model based on sets (rather than the more familiar models based on relations, graphs, etc). I can’t even find a language specification online, just vague sketches about what the language does.
I would instead choose the language “Setl”, which is a general purpose programming language based on set theory. Sort of like APL, except for sets (and tuples, maps, relations, and powersets) instead of arrays. It may be obscure, but it was also influential. It borrowed the idea of set comprehensions from mathematics and made them into a programming language feature. The list comprehensions in Python, Haskell, etc, come from Setl.
https://en.wikipedia.org/wiki/SETL
https://setl.org/setl/
On Starset availability: I missed this caveat the first time: “Finally, a Starset interpreter, christened “Suffolk Starset” or s3, is developed and maintained by the team at my own Suffolk University. Its GitHub repository will be made public as soon as we release the first fully functional version.”
The github repository referenced is https://github.com/dzinoviev/starset (and is not available at the time of this writing).
I was curious about the list of names in the protocol. The protocol includes an “APID 2047 Idle packet” which seems to be used either as padding or for pinging/keep-alive purposes. The payload is meaningless semantically. Based on other posts on destevez.net, it seems that each mission tailors the content to some custom message.
I wonder what’s the use case where you would need such a large timeout value. I’d expect programmers to reach for a scheduler in this case, which would also survive javascript engine restarts.
We ran into this at my old job, where we wrote a scheduler in node.js. It took a CSV input of desired future events and set up timeouts to run them. If the service got restarted, it reloaded the CSV and set up the timeouts again.
The guy who generated the CSVs pinged us and said “this latest CSV I’ve uploaded seems to be sending out all the commands as soon as I uploaded it, what’s up with that?” and so we spent a few hours chasing down this weird behaviour.
It’s funny, the scheduler I use is affected by this bug! It does survive restarts, because on restart the timers are all reloaded, but any job that runs less frequently than every 25 days gets kicked off early. One example is the one that handles Let’s Encrypt certificate renewal.
https://github.com/jhuckaby/Cronicle/issues/108
The process of:
is so familiar.
Given the same domain, Erlang is a much more innovative solution.
The tutorial/overview linked by fanf includes a comparison against Java (this would be around J2SE 1.2). CHILL features some structured concurrency (more structured than threads), tuples, and there may have been Ada-like program verification, but the extent is unclear. Amusingly, under ‘Additional Elements’ they note that Java includes remote procedure calls and internet access but there is no corresponding element on the CHILL side. Since CHILL was meant for the telecom domain, you would expect them to advertise library support for telecom standards. Since they note CHILL’s recent support for Unicode, they do recognize interoperability as a requirement.
If you have a lot of these kinds of convos, I would highly recommend setting up explicit “on call” for team members to be in charge of handling external requests. Otherwise you can have motivated people be a bit ambitious with handling everything, get overloaded with firefighting, and end up just kinda exhausted. All while not sharing the institutional knowledge enough so they become a SPOF in an odd way.
Always good to make sure team members are taking steps forwards, of course. I just think that when people are doing this on rotation then it removes a bit of variability. Not a hard and fast rule of course.
In addition to an explicit ‘on call,’ there were two other practices at my former employer that I think advance this philosophy.
One, we had a policy that a customer could not be redirected more than three times. If you were last in line, you had to hold the ticket to completion rather than redirect to another team. The one time I was the one holding the ticket, the client was seeing random errors throughout the product. As it turned out, they had an HTTP proxy on their side that was randomly failing requests (but only on certain domains), but the policy forced someone to fully investigate rather than keep on passing the buck once symptoms could be ascribed to a different team.
Secondly, as the company grew, we added an ‘engineer support’ role that could support the on-calls. They could handle long-term investigations and support jobs that were longer than a week, but not big or long enough to warrant an actual project.
Totally agree with your advice for an explicit “on call” during business-hours.
Crucially, moving support out of DMs and into public channels means others can search logs for advice on similar issues (and sometimes even answer their own questions!)
I wrote an internal bot a few years ago that syncs Slack user groups with $work’s on call scheduler. Folks can say @myteam-oncall in a public channel and instantly reach the right person without overambitious members needing to be involved in triage. It’s also easy enough to say @friendlyteam-oncall and redirect folk in-place to another team without switching channels or losing context.
My thing was to create a Slack action, where when it was invoked on a message it would:
Was excellent stuff IMO
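The user-group sync described a couple of comments up can be pretty small. A rough sketch, assuming slack_sdk; current_oncall_slack_ids is a stand-in for whatever your on-call scheduler’s API returns, and the IDs and group names are placeholders.

```python
import os
from slack_sdk import WebClient

def current_oncall_slack_ids(team):
    # stand-in: query PagerDuty/Opsgenie/your scheduler here and map
    # whoever is on call to Slack user IDs
    return ["U0123ABCD"]

def sync_oncall_group(team, usergroup_id):
    client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
    users = current_oncall_slack_ids(team)
    # make @team-oncall resolve to whoever is on call right now
    client.usergroups_users_update(usergroup=usergroup_id, users=",".join(users))

if __name__ == "__main__":
    sync_oncall_group("myteam", usergroup_id="S0123456")  # run on a short schedule
```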
In Jean Sammet’s Programming Languages: History and Fundamentals (1969), she notes that periods and semicolons are commonly used to delimit the smallest executable units (III.3.2.1). The blog author’s hypothesis that the semicolon was chosen because of its similarity to English is supported by how the ‘smallest executable units’ are treated.
One of the predecessors to ALGOL58, IAL, used semicolons as the separator between single statements. Although the “publication language” allowed semicolons to be dropped and to just have statements on separate lines. Periods only seem to be used to denote real numbers. So, since ALGOL58 was heavily influenced by IAL, it perhaps simply inherited the syntax.
There were alternatives: MATH-MATIC, or AT-3, which started development in 1955, uses periods at the end of statements (IV.2.2.1). FORTRAN did not include a statement delimiter, since the physical form of the code, i.e. the punch card, was the delimiter. The FORTRAN character set did not include a semicolon, but it did include a period. An extremely limited character set was common in this period and the syntax used to communicate an algorithm might not use the same characters as required for physical input (e.g. a < used instead of .LT.). For instance, see the table of ALGOL60 characters in section IV.4.3 which include logical and set operators.
As a practical matter, I have read that semicolons were preferred to periods because periods were easy to miss, being so small; however, I think that’s anecdotal. It may also have been easier to parse code when periods and semicolons did not share the same role, but that’s also speculation.
Programming Languages: History and Fundamentals: https://archive.org/details/SammetProgrammingLanguages1969/
Syntax and Semantics of IAL: https://www.softwarepreservation.org/projects/ALGOL/paper/Backus-Syntax_and_Semantics_of_Proposed_IAL.pdf
Preliminary Report: International Algebraic Language: https://dl.acm.org/doi/10.1145/377924.594925
IAL and ALGOL58 are different names for the same thing.
I had a look through some of the early issues of CACM (open access at last, yay!) to see how algorithms were written, and they had not yet settled on a style that was algorithmic rather than mathematical.
In prose, when writing a complicated list where the elements might contain commas, it’s normal to separate them with semicolons. So it makes sense to use ; to separate a sequence of algorithmic steps.
I’m going to go and read the HOPL papers on ALGOL60 to see if they have any notes on punctuation…
It is sad how little public information there is about the proprietary systems. My company uses RTC (part of IBM ALM) a lot, but I’m only involved in transitioning away from it. Developers rant a lot about it, so the proprietary ones always have a bad image (slow! over-engineered! etc)
However, I have the feeling that it isn’t the fault of these systems. One point is that these systems are often abused, like storing lots of big binary blobs in there. Git doesn’t handle that well either. So when introducing git, we are forced to also introduce additional infrastructure like Artifactory, build scripts and pipelines and conventions to make them work together. In the end, the whole system is even more complex.
These proprietary VCSs do have some valuable lessons in there. However, they are not needed for OSS projects, so nobody cares (in public).
I’d love to hear more about RTC and other proprietary version control systems! How to use them? What are they good at? What do you enjoy? What is annoying?
I am continuing with the blog post series into the 90s where ClearCase, Visual Source Safe, Perforce and other proprietary systems pop up and become popular. However I haven’t used some of these (with the exception of Perforce) so I am really looking for more input.
I think proprietary ones can be really good. Perforce for example is still widely used and well regarded (of course … it depends) in games companies, but of course with the dominance of a few systems and the availability of very good open source systems, making a case for spending real money every year on source control systems can be tricky.
Note that the blog post is part 1 of an (hopefully) multi-blog series, and thus only covers the very early days of source control systems for now.
Some anecdotes about ClearCase (CC):
First, vrthra’s writeup agrees with my memory of how CC worked.
I was an employee of an aerospace company from roughly 2005 to 2013. CC was the sorta-official configuration management tool. (Company rules mandated some form of config management to be used on projects, but didn’t mandate a single tool.) Requirement documents, specifications, drawings, source code, and builds were all (typically) stored in CC.
CC was labor intensive, as each company site had a CC administrator, and this was a full-time job at at least one site. We also had CM specialists, and some software engineers received advanced training and partial administrative rights to help handle issues.
Users would have a drive or a mounted filesystem linked to a CC VOB. The file locking mechanism of CC worked fairly well for people working on documents (Word files) or other binary files as the multi-user features of those programs were typically primitive at this time. With extensions, the user might not even be aware of the locking feature, as the file would be locked and unlocked for them if they made changes. (Unintended changes were fairly common, so using these extensions was discouraged.) These drives also acted as a distribution mechanism, as an update to a software build could be invisibly propagated to test machines.
The ConfigSpec/View feature of CC was very flexible, but also very confusing. Most users were told what spec to use and didn’t deviate. Developers used the spec to grab individual developer branches and create branches of their own, although it was common to forget to update the spec before making an edit. Fixing problems almost always required an expert to debug.
Many software developers were unhappy with CC (and adjacent tools, like Rational Rose). In particular, the administrative overhead was onerous if you had a project that lasted less than a year. Since my role was typically ‘project adjacent,’ I mostly used Subversion during my time there, sometimes syncing the SVN repository to CC for an official release. Towards the end, I noticed software developers finding excuses to use git.
For quite a while (I’d say 2009ish to 2013ish), when git was new to most people, one of our main selling points was “it doesn’t matter if you mess up more than with svn, everything is easier to fix”.
Often when I hear about the commercial systems, people have a similar vibe: “I can’t read the code, we’re only users, and we never have people who went through everything from A to Z and understood it completely.” Not sure if that is also a main point here, but I have a feeling it very often is with commercial systems, unless it’s such a big beast that you have consultants and dedicated support engineers from the org selling it…
This is a manual for the Teletype Model 35.
Page 8 has diagrams of the keyboard layout. It looks like the caret symbol ^ would require a shift-N for the 1967 version. The 1963 keyboard would print an up-arrow, although that appears to be protocol equivalent.
Not entirely relevant, but interesting perhaps, to note that at the same time (ASCII 1967) up-arrow became caret, back-arrow became underscore. The model 33 used ASCII 1963 and had up and back arrows.
The back-arrow was used for variable assignment in some programming languages. I would guess the only easily accessible example of this today is Smalltalk.
And on a bit-paired keyboard, shift + _ is ascii '\x20' | '\x5F', which is '\x7F', which is the delete character. For example, the ADM 3A has RUB (out) on the _ key. I haven’t found a keyboard that clearly shares <- (delete) and <- (ascii ’63 backarrow) on a keycap.
My theory is that flipped when lowercase came out. The actual ASR 33 is uppercase-only, and has shift-N as uparrow and shift-O as backarrow, but a separate RUBOUT key. So the shift key is turning bit 5 on. The 7F character is out by itself on the chart so it just gets its own key. The Datapoint 3300, designed as a Model 33 emulator, has this setup.
If you have lowercase, then shift still turns bit 5 on for numbers, but it turns bit 6 off for letters. Putting DEL in that bit 6 zone means shift-DEL generates underscore (not backarrow, because lowercase implies ASCII 1967). The Datapoint 2200 is a very bit-paired keyboard (shift-zero even generates a space!) and has an underscore on the RUB key like the ADM 3A.
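The code-point arithmetic in the two comments above is easy to sanity-check; a quick sketch using 1967 ASCII values:

```python
# shift + _ : space OR underscore gives DEL
assert 0x20 | 0x5F == 0x7F
# ASR 33 style: shift sets the 0x10 bit ("bit 5") on letters
assert ord("N") | 0x10 == ord("^")   # up-arrow became caret in 1967 ASCII
assert ord("O") | 0x10 == ord("_")   # back-arrow became underscore
# with lowercase, shift instead clears the 0x20 bit for letters
assert ord("a") & ~0x20 == ord("A")
```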
On the practical side, ‘pooling variability’ can have some great practical benefits for control and throughput. Similar to some of the author’s examples at the end, with one system at work, we had a steady stream of requests of about 1 to 20 “units” in size. However, sometimes a request would come in with a size of thousands or millions of units. We created two parallel systems, one to handle the routine work and one to handle the outsize work. Routine was predictable and well-controlled and rarely caused any troubles. With the outsize work system, we could focus on scalability and monitoring.
I was a bit disappointed at first when I realized that this is a guided structure or an outline of what one should study rather than the full material. But I was told this is a translation/language problem. Maybe I shouldn’t have expected that?
In the OED, a curriculum is a “regular course of study or training”, referencing the use at the University of Glasgow. Whether a course of study refers to the structure of learning or includes specific learning materials, seems to be rather debatable. Wikipedia has a long section on the various meanings.
I don’t think MDN is breaking any new ground; ACM uses similar terminology in their documents on the definition of computer science education. The term “standards” seems to be the trend in public education, though.
The MDN page does include links to some tutorials for certain subjects and it sounds like they plan to add more references, including to courses that people can take.
I’m surprised they implemented division and square roots as hardware instructions, given the extra complexity and costs, rather than leaving them for a software implementation like the EDSAC. Presumably the real-time constraints of the application necessitated it.
I think this is partly due to the question of whether this falls into aerospace history or computing history. As an embedded device, it would fall naturally into a history of improved guidance controls and their transition from analog to digital.
A single end-to-end test, or a small set of them, can help focus on valuable functionality. If these functions also deliver value (i.e. the greater organization or some customers can start using them), they serve the important purpose of demonstrating concrete value and securing management favor.
In my last job, our coding systems underwent three major versions. The first version was a little more than a prototype and used to validate that our clients wanted that kind of functionality. It was also useful in teasing out the interfaces with other team’s services and some performance numbers. The second version was, unfortunately, completed in a rush, as clients really wanted the functionality. We had been designing it, but ended up having to cut back on several features in order to meet schedule. This version taught us a lot about the reliability issues in our problem set and performance requirements. Fortunately, we were able to limp along with the second version for several years, which allowed a number of design iterations and alignment of design vision with other teams.
So, prototypes can be useful in learning a domain and how to build a system, although I recommend you go in with an idea of what you want to learn. I’ve often seen prototype just mean “build the first version, but without tests”, which is not the point.
Organizationally, can you have a program manager assigned to your project? A good PM will know how to elicit risks from the development team, guide prioritization of new functionality versus tech debt, and work with management on resourcing and schedule. (If you have a bad PM, they’ll either be a tax on productivity or potentially drive the project to failure. That’s the risk.)
I’m lucky in that there already is a previous version (albeit with a smaller remit) that a lot of the current spec is based on. That system’s been in production for about 10 years now, so I’m well aware of the dark corners.
Unfortunately I don’t have access to resources like program managers or analysts for this. It would be helpful, but I also have no time pressure so I can take the time to make sure things are correct and valid and really think things through.
In North American middle schools, kids often get these annoying “math questions” along the line of “continue this sequence of numbers…” along with 5 or so numbers. Disregarding that a mathematical sequence defines its value at each input, every other kid rediscovers (or gets taught) the method of finite differences (or finite x for some operation x) where they find the difference between terms and iterate between levels until it goes to a constant sequence.
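The trick is tiny to write down; a quick sketch of the difference-table method (my own illustration): difference until a row is constant, then run the table backwards to extrapolate the next term.

```python
def next_term(seq):
    rows = [list(seq)]
    while any(x != rows[-1][0] for x in rows[-1]):   # difference until constant
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    nxt = rows[-1][0]                 # the constant row just repeats itself
    for row in reversed(rows[:-1]):   # each row above gains its last element + the new one below
        nxt = row[-1] + nxt
    return nxt

assert next_term([1, 4, 9, 16, 25]) == 36   # squares
assert next_term([2, 5, 8, 11]) == 14       # arithmetic progression
```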
In the ‘why should you care’ bit, it would be nice to mention that the underlying mathematics behind this is also used in compiler optimisations based on scalar evolution. If you have a loop nest where each level is doing some addition over an induction variable (which is most loop nests, and even more after some canonicalisation) then you have a sequence in this form where you already know the differences. Scalar evolution works by modelling the polynomial and performing transforms of the loop based on this model.
This is one of the more complex bits of mathematics (if you aren’t doing polyhedral optimisation) in a modern compiler, and it is one of the things that gets some really big wins and is not present in simple compilers.
A good description of scalar evolutions can be found in http://cri.ensmp.fr/~pop/gcc/mar04/slides.pdf Also https://kristerw.blogspot.com/2019/04/how-llvm-optimizes-geometric-sums.html?m=1 has some details and links to more papers.
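To give a flavour of what a scalar-evolution style analysis buys you (a toy illustration, not GCC’s or LLVM’s actual implementation): the loop below walks an induction variable in steps of d, and the analysis can replace the whole loop with a closed form.

```python
def loop_version(n, a, d):
    total = 0
    x = a                  # induction variable: x_i = a + d*i, i.e. {a, +, d}
    for _ in range(n):
        total += x
        x += d
    return total

def closed_form(n, a, d):
    # what a SCEV-style rewrite can emit instead of the loop
    return n * a + d * n * (n - 1) // 2

assert loop_version(10, 3, 4) == closed_form(10, 3, 4) == 210
```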
When I first found out about Scalar evolutions I liked them a lot, although it is unclear whether they provide any benefits over a more general approach like http://apron.cri.ensmp.fr/library/ (except perhaps simpler code?). At some point GCC used to depend on PPL, but doesn’t anymore (got replaced by ISL, and I don’t know whether that’d be a substitute for SCEV).
I’ll look into it.
I expected the author to obtain the savings by eliminating some of the easy dumb stuff you can do building a docker image (e.g. creating multiple layers through multiple update and install steps), but they get to the trickier stuff right away.
I think there is a maintainability question as the script expects packages to store specific files in specific locations and changes to packages could easily break those assumptions. However, a useful demonstration.
Using apk --root … and installing into an alternate location in the base image would likely ease tracking lib deps (turn it into a tree copy).
EDIT: A more complete example:
Interesting, thanks for sharing, so ultimately building the 2nd layer would become easier with this approach?
I thought more that it might make it easier and less error prone to copy out the libraries from the first layer, if the packages were just installed into an alt-root-dir in that first layer (as above), as they would then be in a known location with less “other stuff” to worry about.
Author here
I think maintenance will be hit-and-miss. The locations where these shared libraries live are pretty stable (for now).
The only thing I am remotely worried about is the change in version numbers of these libs. But I think that can be fixed to some extent by copying them via a pattern, like I did for musl.
Very understated. Having the written fallback for visual users, which then doubles up what the accessibility services see. Frustrating to no end!
Aside: progressbar is a type of meter, right? Is a donation goal a progressbar or a meter?
A donation goal is a meter because donations can be exceeded. In fact, exceeding a donation goal is something to be celebrated, so the client would want to call attention to that fact. In contrast, a progress bar restricts the value to the max value (or 1 if no max is given) because, conceptually, tasks can only be completed.
From meter’s documentation:
Their handling of maximums is functionally the same.
Sure, their handling of maximum is the same. But the donation target shouldn’t be modeled as a maximum, but either by the ‘high’ or ‘optimum’ value.
AMA!
I’ve got a bunch of questions, but I guess a few things (appreciating that from a career standpoint you may not be able to answer):
(I can guess as to most of these, given the usual way of things, but figured I might as well ask.)
More on the tech side:
Thanks for posting here, and thanks for helping right the ship!
Sorry in advance for the long-winded context-setting, but it’s the only way I know how to answer this question.
There are a few important things to understand. First, even though HealthCare.gov was owned by CMS, there was no one single product owner within CMS. Multiple offices at CMS were responsible for different aspects of the site. The Office of Communications was responsible for the graphic design and management of the static content of the site, things like the homepage and the FAQ. The Center for Consumer Information and Insurance Oversight was responsible for most of the business logic and rules around eligibility, and management of health plans. The Office of Information Technology owned the hosting and things like release management. And so on. Each of these offices has its own ability to issue contracts and its own set of vendors it prefers to work with.
Second, HealthCare.gov was part of a federal health marketplace ecosystem. States integrated with their own marketplaces and their Medicaid populations. Something called the Digital Services Hub (DSH) connected HealthCare.gov to the states and to other federal agencies like IRS, DHS, and Social Security, for various database checks during signup. The DSH was its own separately procured and contracted project. An inspector general report I saw said there were over 60 contracts comprising HealthCare.gov. Lead contractors typically themselves subcontract out much of the work, increasing the number of companies involved considerably.
Then you have the request for proposals (RFP) process. RFPs have lists of requirements that bidding contractors will fulfill. Requirements come from the program offices who want something built. They try to anticipate everything needed in advance. This is the classic waterfall-style approach. I won’t belabor the reasons this tends not to work for software development. This kind of RFP rewards responses that state how they will go about dutifully completing the requirements they’ve been given. Responding to an RFP is usually a written assertion of your past performance and what you claim to be able to do. Something like a design challenge, where bidding vendors are given a task to prove their technical bona fides in a simulated context, while somewhat common now, was unheard of when HealthCare.gov was being built.
Now you have all these contractors and components, but they’re ostensibly for the same single thing. The government will then procure a kind of meta-contract, the systems integrator role, to manage the project of tying them all together. (CGI Federal, in our case.) They are not a “product owner”, a recognizable party accountable for performance or end-user experience. They are more like a general contractor managing subs.
In addition, CMS required that all software developed in-house conform to a specified systems design, called the “CMS reference architecture”. The reference architecture mandated things like: the specific number of tiers or layers, including down to the level of reverse proxies; having a firewall between each layer; communication between layers had to use a message queue (typically a Java program) instead of say a TCP socket; and so forth. They had used it extensively for most of the enterprise software that ran CMS, and had many vendors and internal stakeholders that were used to it.
Finally, government IT contracts tend to attract government IT contractors. The companies that bid on big RFPs like this are typically well-evolved to the contracting ecosystem. Even though it is ostensibly an open marketplace that anyone can bid on, the reality is that the federal government imposes a lot of constraints on businesses to be able to bid in the first place. Compliance, accreditation, clearance, accounting, are all part of it, as well as having strict requirements on rates and what you can pay your people. There’s also having access to certain “contracting vehicles”, or pre-chosen groups of companies that are the only ones allowed to bid on certain contracts. You tend to see the same businesses over and over again as a result. So when the time comes to do something different – e.g., build a modern, retail-like web app that has a user experience more like consumer tech than traditional enterprise government services – the companies and talent you need that has that relevant experience probably aren’t in the procurement mix. And even if they were, if they walked in to a situation where the reference architecture was imposed on them and responsibility was fragmented across many teams, how likely would they be to do what they are good at?
tl;dr:
I think they saw quickly from the kind of experience, useful advice, and sense of calm confidence we brought – based on knowing what a big transactional public-facing web app is supposed to look like – that they had basically designed the wrong thing (the enterprise software style vs a modern digital service) and had been proceeding on the fairly impossible task of executing from a flawed conception. We were able to help them get some quick early wins with relatively simple operational fixes, because we had a mental model of what an app that’s handling high throughput with low latency should be, and once monitoring was in place, going after the things that were the biggest variance from that model. For example, a simple early thing we did that had a big impact was configuring db connections to more quickly recycle themselves back into the pool; they had gone with a default timeout that kept the connection open long after the request had been served, and since the app wasn’t designed to stuff more queries into an already-open connection, it simply starved itself of available threads. Predictably, clearing this bottleneck let more demand flow more quickly through the system, revealing further pathologies. Rinse and repeat.
Were there poor performers there? Of course. No different than most organizations. But we met and worked with plenty of folks who knew their stuff. Most of them just didn’t have the specific experience needed to succeed. If you dropped me into a mission-critical firmware project with a tight timeline, I’d probably flail, too. For the most part, folks there were relieved to work with folks like us and get a little wind at their backs. Hard having to hear that you are failing at your job in the news every day.
You can easily research this, but I’ll just say that it’s actually very hard to meaningfully penalize a contractor for poor performance. It’s a pretty tightly regulated aspect of contracting, and if you do it can be vigorously protested. Most of the companies are still winning contracts today.
I would often joke on the rescue that for all HealthCare.gov’s notoriety, it probably wasn’t even a top-1k or even 10k site in terms of traffic. (It shouldn’t be this hard, IOW.) Rough order of mag, peak traffic was around 1,000 rps, IIRC.
You could navigate the site on your phone if you absolutely had to, but it was pretty early mobile-responsive layout days for government, and given the complexity of the application and the UI for picking a plan, most people had to use a desktop to enroll.
Great question. I remember being on a conference bridge late in the rescue, talking with someone managing an external dependency that we were seeing increasing latency from. You have to realize, this was the first time many government agencies were doing things we’d recognize as 3rd party API calls, exposing their internal dbs and systems for outside requests. This guy was a bit of a grizzled old mainframer, and he was game to try to figure it out with us. At one point he said something to the effect of, “I think we just need more MIPS!” As in million instructions per second. And I think he literally went to a closet somewhere, grabbed a board with additional compute and hot-swapped it into the rig. It did the trick.
In general I would say that the experience, having come from what was more typical in the consumer web dev world at the time - Python and Ruby frameworks, AWS EC2, 3-tier architecture, RDBMSes and memcached, etc. - was like a bizarro world of tech. There were vendors, products, and services that I had either never heard of, or were being put to odd uses. Mostly this was the enterprise legacy, but for example, code-generating large portions of the site from UML diagrams: I was aware that had been a thing in some contexts in, say, 2002, but, yeah, wow.
To me the biggest sin, and I mean this with no disrespect to the people there who I got to know and were lovely and good at what they did, was the choice of MarkLogic as the core database. Nobody I knew had heard of it, which is not necessarily what you want from what you hope is the most boring and easily-serviced part of your stack. This was a choice with huge ramifications up and down, from app design to hardware allocation. An under-engineered data model and a MySQL cluster could easily have served HealthCare.gov’s needs. (I know this, because we built that (s/MySQL/PostgreSQL) and it’s been powering HealthCare.gov since 2016.)
A few years after your story I was on an airplane sitting next to a manager at MarkLogic. He said the big selling point for their product is the fact it’s a NoSQL database “Like Mongo” that has gone through the rigorous testing and access controls necessary to allow it to be used on [government and military jobs or something to that effect] “like only Oracle before us”.
He’s probably referring to getting FedRAMP certified or similar, which can be a limiting factor on what technologies agencies can use. Agencies are still free to make other choices (making an “acceptable risk” decision).
In HealthCare.gov’s case, the question wasn’t what database can they use, but what database makes the most sense for the problem at hand and whether a document db was the right type of database for the application. I think there’s lots of evidence that, for data modeling, data integrity, and operational reasons, it wasn’t. But the procurement process led the technology choices, rather than the other way round.
You laugh, but they advertise MUMPS like a modern NoSQL database suitable for greenfield.
I’m not Paul, but I recall hearing Paul talk about it one time when I was his employee, and one of the problems was MarkLogic, the XML database mentioned in the post. It just wasn’t set up to scale and became a huge bottleneck.
Paul also has a blog post complaining about rules engines.
See also Paul’s appearance on the Go Time podcast: https://changelog.com/gotime/262
I’m happy to see that you feel proud of your work. It’s an incredibly rare opportunity to be able to help so many people at such a large scale.
Have you found something else as meaningful for you since then?
Also what’s your favorite food?
Thank you!
Yes, my work at Ad Hoc - we started the company soon after the rescue and we’ve been working on HealthCare.gov pretty much ever since. We’ve also expanded to work on other things at CMS like the Medicare plan finder, and to the Department of Veterans Affairs where we rebuilt va.gov and launched their flagship mobile app. And we have other customers like NASA and the Library of Congress. Nothing will be like the rescue because of how unique the circumstances, but starting and growing a company can be equally as intense. And now the meaning is found in not just rescuing something but building something the right way and being a good steward to it over time so that it can be boring and dependable.
A chicken shawarma I had at Max’s Kosher Café (now closed 😔) in Silver Spring, MD.
Can you please kill X12. That is all.
This is only slightly sarcastic. I’m hopeful with your successful streak you can target the EDI process itself, because it’s awful. I only worked with hospice claims, but CMS certainly did a lot to make things complicated over the years. I only think we kept up because I had a well-built system in Ruby.
It’s literally the worst thing in the world to debug and when you pair that with a magic black box VM that randomly slurps files from an NFS share, it gets dicey to make any changes at all.
Seriously. ☠️
For those not familiar, X12 is a standard for exchanging data that old-school industries like insurance and transportation use that well-predates modern niceties like JSON or XML.
X12 is a … well, to call it a serialization format is generous; I would describe it more as a context-sensitive stream of transactions, where parsing is not a simple matter of syntax but depends on what kind of X12 message you are handling. They can be deeply nested, with variable delimiters, require lengthy specifications to understand, and it’s all complicated further by versioning.
On top of that, there is a whole set of X12-formatted document types that are used for specific applications. Relevant for our discussion, the 834, the Benefit Enrollment and Maintenance document, is used by the insurance industry to enroll, update, or terminate the coverage.
To give you a little flavor of what these are like, here is the beginning of a fake 834 X12 doc:
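Roughly, something like this (an abbreviated mock-up with made-up identifiers, standing in for the original sample; segments end with ~ and fields are separated by *):

```
ISA*00*          *00*          *ZZ*SENDERID       *ZZ*RECEIVERID     *231215*1200*^*00501*000000101*0*T*:~
GS*BE*SENDERID*RECEIVERID*20231215*1200*101*X*005010X220A1~
ST*834*0001*005010X220A1~
BGN*00*ENROLLREF123*20231215*1200****2~
INS*Y*18*021*28*A*E**FT~
NM1*IL*1*DOE*JANE****34*123456789~
```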
HealthCare.gov needed to generate 834 documents and send them to insurance carriers when people enrolled or changed coverage. Needless to say, this did not always go perfectly.
I personally managed to avoid having to deal with 834s until the final weeks of 2013. An issue came up and we needed to do some analysis across a batch of 834s. Time was of the essence, and there was basically no in-house tooling I could use to help: the system only generated 834s, it didn’t consume them. So naturally I whipped up a quick parser. It didn’t have to be a comprehensive, fully X12-compliant parser; it only needed enough functionality to extract the fields needed to do the analysis.
So I start asking around for anyone with 834 experience for help implementing my parser. They’re incredulous: it can’t be done, the standard is hundreds of pages long, besides you can’t see it (the standard), it costs hundreds of dollars for a license. (I emailed myself a copy of the PDF I found on someone’s computer.)
I wrote my parser in Python and had the whole project done in a few days. You can see a copy of it here. The main trick I used was to maintain a stack of segments (delimited portions of an X12 doc), and then in the main loop branch to methods corresponding to the ID of the current segment, allowing context-sensitive decisions to be made about what additional segments to pull off the stack.
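A minimal sketch of that stack-of-segments approach (not the actual parser, and the 834 field positions are only indicative):

```python
class X12Parser:
    def __init__(self, doc, segment_sep="~", field_sep="*"):
        self.field_sep = field_sep
        segments = [s.strip() for s in doc.split(segment_sep) if s.strip()]
        self.stack = list(reversed(segments))   # first segment ends up on top
        self.members = []

    def parse(self):
        while self.stack:
            fields = self.stack.pop().split(self.field_sep)
            handler = getattr(self, "seg_" + fields[0], None)
            if handler:
                handler(fields)                 # handler may pop more segments
        return self.members

    def seg_INS(self, fields):
        # INS opens a member loop; keep pulling segments until the NM1*IL
        # that names the member
        member = {"subscriber": fields[1] == "Y"}
        while self.stack:
            seg = self.stack.pop().split(self.field_sep)
            if seg[0] == "NM1" and seg[1:2] == ["IL"]:
                member["last_name"] = seg[3] if len(seg) > 3 else ""
                member["first_name"] = seg[4] if len(seg) > 4 else ""
                break
        self.members.append(member)

# usage: members = X12Parser(open("sample.834").read()).parse()
```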
The lesson here is that knowing how to write a parser is a super power that will come in handy in a pinch, even if you’re not making a compiler.
just wanted to give you props. I’m a contractor myself and I’ve been on both sides of this (I have been in the gaggle of contractors and been one of the people they send in to try to fix things) and neither side is easy, and I’ve never come close to working on anything as high profile as this and even those smaller things were stressful. IME contracting is interesting in some ways (I have worked literally everywhere in the stack and with tons of different languages and environments which I wouldn’t have had the opportunity to do otherwise) and really obnoxious in other ways (legacy systems are always going to exist but getting all the contractors and civilian employees pulling in the same direction is a pain, especially without strong leadership from the government side)
Thanks, and back at ya. My post didn’t get into it, but it really was a problem of a lack of a single accountable product lead who was empowered to own the whole thing end-to-end. Combine that with vendors who were in a communications silo (we constantly heard from people who had never spoken to their counterparts on other teams because the culture was not to talk directly except through the chain of command) and the result was, everyone off in their corner doing their own little thing, but no one with a view to the comprehensive whole.
How helpful were non-technical metrics such as user feedback and bug reports to the process?
Extremely. Aside from the performance-related problems that New Relic gave us a handle on, there were many logic bugs (lots of reasons for this: hurried development, lack of automated tests, missing functionality, poorly-understood and complicated business rules that were difficult to implement). This typically manifested in users “getting stuck” partway through somewhere, and they could neither proceed to enrollment nor go back and try again.
For example, we heard early on about a large number of “lost souls” - people who had completed one level of proving your identity (by answering questions about your past ala a credit check), but due to a runtime error in a notification system, they never got the link to go to the next level (like uploading a proof of ID document). Fixing this would involve not just fixing the notification system, but figuring out which users were stuck (arbitrary db queries) and coming up with a process to either notify them (one-off out-of-band batch email) or wipe their state clean and hope they come back and try again. Or if they completed their application for the tax credit, they were supposed to get an official “determination of eligibility” email and a generated PDF (this was one of those “requirements”). But sometimes the email didn’t get sent or the PDF generator tipped over (common). Stuck.
The frontend was very chatty over Ajax, and we got a report that it was taking a long time for some people to get a response after they clicked on something. And we could correlate those users with huge latency spikes in the app on New Relic. Turns out for certain households, the app had a bug where it would continually append members of a household to a list over and over again without bound after every action. The state of the household would get marshalled back and forth over the wire and persisted to the db. Due to a lack of a database schema, nothing prevented this pathological behavior. We only noticed it because it was starting to impact user experience. We went looking in the data, and found households with hundreds of people in them. Any kind of bug report was just the tip of the iceberg of some larger system problem and eventual resolution.
I have a good story about user feedback from a member of Congress live on TV that led to an epic bugfix. I’ve been meaning to write it up for a long while. This anniversary is giving me the spur to do it. I’ll blog about it soon.
I agree with df — I think there’s more space, and practical benefit, to apply OR to the SaaS/distributed system space. Also, apg is correct that Google’s SRE book will give you a good overview of the existing applications of OR and control theory to large-scale software systems.
As an example of OR applied to distributed systems, my former employer experienced a number of operational and cost issues scheduling containers within our Nomad fleet. (Kubernetes will be similar.) The containers were very heterogeneous in memory and CPU requirements because our company had many microservices, but not that many instances of any single microservice. As I recall, Nomad only had two bin packing policies, and neither worked all that well. Either we had tightly packed hosts plus idle hosts (and if a tightly packed host died, there was a lot of pain as containers were re-scheduled), or a large number of mostly occupied hosts, leading to blocked deployments because no host could take the very large jobs.
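A toy illustration of the two placement extremes (not Nomad’s actual algorithms): “binpack” squeezes work onto as few hosts as possible, “spread” always picks the emptiest host that fits.

```python
def place(jobs, host_capacity, n_hosts, strategy):
    hosts = [0] * n_hosts  # memory already allocated on each host
    for mem in sorted(jobs, reverse=True):
        candidates = [i for i, used in enumerate(hosts) if used + mem <= host_capacity]
        if not candidates:
            raise RuntimeError("no host can take this job")
        if strategy == "binpack":
            i = max(candidates, key=lambda i: hosts[i])   # fullest host that still fits
        else:  # "spread"
            i = min(candidates, key=lambda i: hosts[i])   # emptiest host
        hosts[i] += mem
    return hosts

jobs = [12, 3, 7, 1, 9, 2, 2, 5]
print(place(jobs, host_capacity=16, n_hosts=4, strategy="binpack"))  # a few full hosts, some idle
print(place(jobs, host_capacity=16, n_hosts=4, strategy="spread"))   # every host partly used
```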
I used queuing theory on multiple projects, but one challenge is that the systems-in-use don’t match the theoretical queues all that cleanly. Kafka, which isn’t a true queuing or messaging system but is often used as one, can have very poor performance if the variability of work items is high (so certain partitions lag others). I haven’t seen Amazon SQS modeled theoretically to any precision. A practical, theoretical model would be useful in both cases to aid in operational planning and understanding.
I think colleges would jump at adding an operations course if there were an equivalent to Factory Physics for software systems.
What is the reason to not def send_emails(list_of_recipients, bccs=[])?
From my Clojure background I wonder what is sending the emails (and how you can possibly test that unless you are passing in something that will send the email), but perhaps there is something I’m missing from the Python perspective?
Hi! Thanks for reading! 🙂
The answer is mutable default arguments; if you’re coming from Clojure, I can see why this wouldn’t be an issue 😛
When Python is defining this function, it’s making the default argument a single instance of an empty list, not “a fresh empty list created when invoking the function.” So if that function mutates the parameter in some way, that mutation will persist across calls, which leads to buggy behavior.
An example might be (with a whole module):
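Something along these lines (a minimal sketch; the print is a stand-in for the pretend sendgrid.actually_send_email call mentioned below):

```python
def send_emails(list_of_recipients, bccs=[]):
    bccs.append("audit@example.com")        # mutates the one shared default list
    for recipient in list_of_recipients:
        print("to:", recipient, "bcc:", bccs)

send_emails(["a@example.com"])   # bcc: ['audit@example.com']
send_emails(["b@example.com"])   # bcc: ['audit@example.com', 'audit@example.com']
```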
What happens is: every time you call it without specifying a second argument, the emails added from previous invocations will still be in the default bccs list.
The way to actually do a “default to empty list” in Python is:
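That is, use a None sentinel and create the list inside the function (sketch, mirroring the signature above):

```python
def send_emails(list_of_recipients, bccs=None):
    if bccs is None:
        bccs = []        # a fresh list on every call
    ...
```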
This is horrifying. TIL, but I almost wish I hadn’t.
It makes sense outside of the pass-by-reference/pass-by-value dichotomy but it’s still a massive footgun.
As for how you test it, it’s probably with mocking. Python has introspection/monkey-patching abilities, so if you relied on another library to handle actually sending the email (in the example above I pretended sendgrid had an SDK with a function called actually_send_email), in Python you would usually do something like the sketch below. It mucks with the module at runtime so that sendgrid.actually_send_email records its calls instead of doing what the function did originally (docs here).
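A minimal sketch of such a test, using unittest.mock and the pretend sendgrid.actually_send_email from above (the module path for send_emails is hypothetical):

```python
from unittest import mock

from myapp.emails import send_emails  # wherever the function above lives

def test_send_emails_calls_sendgrid():
    # replace the attribute on the sendgrid module for the duration of the block
    with mock.patch("sendgrid.actually_send_email") as fake_send:
        send_emails(["a@example.com"], bccs=["b@example.com"])
        fake_send.assert_called_once()
```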
It’s not about sending emails, but the mutable default argument: https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments
Python evaluates default argument values once, when the def statement is executed (for a top-level function, that happens at module import), rather than every time the function is called.