1. 104
    1. 18

      Further to this point. Strive to design your data structures so that ideally there is only one way to represent each value. That means for example NOT storing your datetimes as strings. This will imply that your parsing step also has a normalization step. In fact storing anything important as a string is a code smell.

      1. 10

        A person’s name should be stored as a string. City names. Stock symbols. Lots of things are best stored as strings.

        1. 4

          Should stock symbols contain emojis or newlines, or be 100 characters long? Probably not. I assume there are standards for what a stock symbol can be. If you have a StockSymbol type constructed by a parser that disallows these things, you can catch errors earlier. Of course then the question is what do you do when the validation fails, but it does force a decision about what to do when you get garbage, and once you have a StockSymbol you can render it in the UI with confidence that it will fit.

        2. 3

          Names of things are best stored as strings, yes.

          1. 1

            What I recommend though is to encode them in their own named string types, to prevent using the strings in ways that they are not meant to be used. We often use that to encode ID references to things which should be mostly “opaque BLOBs”.

      2. 3

        Interesting. If you don’t mind, I’d like to poke at that a bit.

        Why should you care what anything is stored as? In fact, why should you expect the rest of the universe, including the persistence mechanism, to maintain anything at all about your particular type system for your application?

        1. 8

          They can maintain it in their own type system. The issue is information loss. A string can contain nearly anything and thus I know nearly nothing about it and must bear a heavy burden learning (parsing or validating). A datetime object can contain many fewer things and thus I know quite a lot about it lowering the relearning burden.

          You can also build and maintain “tight” connections between systems where information is not lost. This requires owning and controlling both ends. But generally these tight connections are hard to maintain because you need some system which validates the logic of the connection and lives “above” each system being connected.

          Some people use a typed language with code generation, for instance.

        2. 3

          @zxtx told the reader trying to leverage type systems to design their own program a certain way that derives more benefit from type systems. The rest of the universe can still do their own thing.

          I’m not sure the string recommendation is correct. There’s several languages that are heavily based on strings powering all kinds of things out there successfully. I’ve also seen formally-verified implementations of string functionality. zxtx’s advice does sound like a good default.

          We probably should have, in addition to it, verified libraries for strings and common conversions. Then, contracts and/or types to ensure calling code uses them correctly. Then, developers can use either option safely.

          1. 3

            Sure some things inevitably have to be strings: personal names, addresses, song titles. But if you are doing part-of-speech tagging or word tokenization, an enumerative type is a way better choice than string. As a fairly active awk user I definitely sympathize with the power of string-y languages, but I think people new to typed languages overuse rather than underuse strings.

            1. 2

              Unfortunately, even folks who have used typed languages for years (or decades) still overuse strings. I’m guilty of this.

        3. 3

          I admit to going back and forth on this subject….

          As soon as you store a person name as a PersonName object…… it’s no longer a POD and you’re constricted to a tiny tiny subset of operations on it…. (With the usual backdoor of providing a toString method)

          On the other hand Bjarne Stoustrup’s assertion that if you have a class invariant to enforce… that’s the job of an object / type.

          Rich Hickey the clojure guy has an interesting talk exactly on this subject with an interesting different take….

          Instead of hiding the data in a type with an utter poverty of operators, leave everything as a pod of complex structure which can be validated and specified checked and asserted on using a clojure spec.

          ie. If you want something with a specific shape, you have the spec to rely on, if you want to treat it as ye olde list or array of string….. go ahead.

          1. 12

            I stuck to simple examples of the technique in my blog post to be as accessible as possible and to communicate the ideas in the purest possible way, but there are many slightly more advanced techniques that allow you to do the kind of thing you’re describing, but with static (rather than dynamic) guarantees. For some examples, I’d highly recommend taking a look at the Ghosts of Departed Proofs paper cited in the conclusion, since it addresses many of your concerns.

            1. 1

              Ok. That took me awhile to digest…. but was worth it. Thanks.

              For C++/D speakers it’s worth looking at this first to get the idea of phantom types…

              https://blog.demofox.org/2015/02/05/getting-strongly-typed-typedefs-using-phantom-types/

          2. 5

            As someone who worked professionally with both Clojure (before spec but with Prismatic Schema) and OCaml and I have to say I utterly prefer to encode invariants in a custom type with only a few operations instead of the Clojure way of having everything in a hashmap with some kind of structure (hopefully) and lots of operations which operate on them.

            My main issue writing Clojure was that I did apply some of these (really useful and versatile) functions on my data, but the data didn’t really match what I had expected so the results were somewhat surprising in edge cases and I had to spend a lot of brain time to figure out what was wrong and how and where that wrong data came to be.

            In OCaml I rarely have the problem and if I want to use common functions, I can base my data structures on existing data structures that provide the functions I want to over the types I need, so in practice not being able to use e.g. merge-with on any two pieces of data is not that painful. For some boilerplate, deriving provides an acceptable compromise between verbosity and safety.

            I can in theory do a similar thing in Clojure as well, but then I would need to add validation basically everywhere which makes everything rather verbose.

            1. 3

              I’ve used Clojure for 8 years or so, and have recently been very happy with Kotlin, which supports sealed types that you can case-match on, and with very little boilerplate—but also embraces immutability, like Clojure.

              With Clojure, I really miss static analysis, and it’s a tough tradeoff with the lovely parts (such as the extremely short development cycle time.)

          3. 3

            The ability to “taint” existing types is the answer we need for this. Not a decorator / facade sort of thing, just a taint/blessing that exists only within the type system, with a specific gatekeeper being where the validation is done and the taint removed/blessing applied.

          4. 3

            In Go, wrapping a string in a new type is zero-overhead, and you can cast it back easily. So it’s mostly just a speedbump to make sure that if you do something unsafe, you’re doing it on purpose and it will be seen in code review. If the type doesn’t have very many operators, you might have more type casts that need to be checked for safety, but it’s usually pretty easy to add a method.

            On the other hand, the Go designers decided not to validate the string type, instead accepting arbitrary binary data with it only being convention that it’s usually UTF8. This bothers some people. But where it’s important, you could still do Unicode validation and create a new type if you want, and at that point there’s probably other validation you should be doing too.

          5. 1

            The last one is the best.

            Instead of scaling out code, we should be scaling out tests. We’re doing it backwards.

            I’ve been meaning to put together a conference proposal on this but haven’t gotten around to it. It’s the kind of thing that blows people’s minds.

            1. 1

              Can you expand a little on this? Sounds interesting.

              1. 4

                People don’t understand what tests do. If you ask them, they might say they help your code be less buggy, or they show your business customers that your program does what they’re paying for.

                That’s all true, but horribly incomplete. Tests resolve language.

                That is, whether it’s science, programming, running a business, or any of hundreds of other areas where human language intersects science, tests are the only tools for determining what’s true or not in unambiguous terms. Come up with some super cool new way of making a superconductor? Great! Let’s have somebody go out and make it on their own, perform a test. If the test passes, you’re on to something. Yay! If the tests fails? Either you’re mistaken or the language and terms you’re using to describe your new process has holes the reproducer was unable to resolve. Either way, that’s important information. It’s also information you wouldn’t have gained otherwise without a test.

                In coding, as I mentioned above, we have two levels of tests. The unit level, which asks “Is this code working the way I expected it to?” and the acceptance level, which asks “Is the program overall performing as it should?” (I understand the testing pyramid, I am simplifying for purposes of making a terse point). But there are all sorts of other activities we do in which the tests are not visible. Once the app is deployed, does it make a profit? Is your team working the best way it can? Are you building this app the best way you should? Are you wasting time on non-critical activities? Will this work with other, unknown apps in the future? And so on.

                We’ve quantitized some of this with things like integration testing (which only works with existing apps). Frankly, we’ve made up other stuff out of whole cloth, just so we can have a test, something to measure. In most all cases, when we make stuff up we end up actually increasing friction and decreasing productivity, just the opposite of what we want.

                So how do we know if we’re doing the best job we can? Only through tests, whether hidden or visible. How are we doing at creating tests? I’d argue pretty sucky. How can we do tests better? More to the point, if we do tests correctly, doesn’t it make whatever language, platform, or technology we use a seconary-effect as opposed to a primary one? We spend so much time and effort talking about tools in this biz when nobody can agree on whether we’re doing the work right. I submit that this happens because we’re focusing far, far too much on our reactions to the problem than the problems themselves. If we can create and deploy tests in a comprehensive and tech-independent manner, we can then truly begin discussing how to take this work to the next level. Either that or we’re going to spend the next 50 years talking about various versions of hammers instead of how to build safe, affordable, and desirable houses, which is what we should be doing.

                There’s a lot missing in my reply, but once we accept that our test game sucks? Then a larger and better conversation can happen.

                1. 1

                  It will take me some time to digest this properly… it’s a completely different angle to which I usually approach the matter. (I’m not saying you’re wrong, I’m just saying you coming at it from such a different angle I’m going to have to step back and contemplate.)

                  To understand where I’m coming from let me add…

                  I regard tests as a lazy pragmatic “good enough” alternative to program proving.

                  If we were excellent mathematicians, we would prove our programs were correct exactly the way mathematicians prove theorems.

                  Except we have a massive shortage of that grade of mathematicians, so what can we do?

                  Design by Contract and testing.

                  DbC takes the raw concepts of program proving (pre-conditions and post conditions and invariants) and then we use the tests to setup the preconditions.

                  Writing complete accurate postconditions is hard, about as hard as writing the software, so we have a “useful subset” of postconditions for particular instance of the inputs.

                  Crude, very crude, but fairly effective in practice.

                  My other view of unit tests is closer to yours…

                  They are our executable documentation (proven correct and current) of how to use our software and what it does. So a design principle for tests is they should be Good, Readable understandable documentation.

                  Now I will shutup and contemplate for a day or two.

                  1. 1

                    We are saying the same thing. Heck we might even be agreeing. We’re just starting from completely opposite sides of the problem. Formally validating a program proves that it matches the specification. In this case the formal specification is the test.

                    I think when I mention tests you may be thinking of testing as it was done in the IT industry, either manual or automated. But I mean the term in the generic sense. We all test, all the time.

                    What I realized was that you can’t write a line of code without a test. The vast majority of times that test is in your head. Works for me. You say to yourself “How am I going to do X?” then you write some code in. You look at the code. It appears to do X. Life is good.

                    So you never get away from tests. The only real questions are what kinds of tests, where do they live, who creates them, and so forth. I’m not providing any answers to these questions. My point is that once you realize you don’t create structure without some kind of tests somewhere, even if only in your head, you start wondering exactly which tests are being used to create which things.

                    My thesis is that if we were as good at creating tests as we were at creating code, the coding wouldn’t matter. Once again, just like I don’t care whether you’re an OCAML person or a Javascript person, for purposes of this comment I don’t care if your tests are based on a conversation at a local bar or written in stone. That’s not the important part. The thing is that in various situations, all of these things we talk about doing with code, we should be doing with tests. If the tests are going to run anyway, and the tests have to pass for the project to be complete or problem solved, then it’s far more important to talk about the meaning of a successful completion to the project or a solution to the problem than it is to talk about how to get there.

                    Let’s picture two programmers. Both of them have to create the world’s first accounting program. Programmer A sits down with his tool of choice and begins slinging out code. Surely enough, in a short time voila! People are happy. Programmer B spends the same amount of time creating tests that describe a successful solution to the problem. He has nothing to show for it.

                    But now let’s move to the next day. Programmer A is just now beginning to learn about all of the things he missed when he was solving the problem. He’s learning that for a variety of reasons, many of which involve the fact that we don’t understand something until we attempt to codify it. He begins fixing stuff. Programmer B, on the other hand, does nothing. He can code or he can hire a thousand programmers. The tech details do not matter.

                    Programmer B, of course, will learn too, but he will learn by changing his tests. Programmer A will learn inside his own head. From there he has a mental test. He writes code. It is fixed. Hopefully. Programmer A keeps adjusting his internal mental model, then making his code fit the model, until the tests pass, ie nobody complains. Programmer B keeps adjusting an external model, doing the same thing.

                    Which of these scale when we hire more coders? Which are these are programs the programmer can walk away from? Formal verification shows that the model meets the spec. What I’m talking about is how the spec is created, the human process. That involves managing tests, in your head, on paper, in code, wherever. The point here is that if you do a better, quicker job of firming the language up into a spec, the tech stuff downstream from that becomes less of an issue. In fact, now we can start asking and answering questions about which coding technologies might or might not be good for various chores.

                    I probably did a poor job of that. Sorry. There’s a reason various programming technologies are better or worse at various tasks. Without the clarification tests provide, discussions on their relative merits lack a common system of understanding.

              2. 1

                ADD: I’ll add that most all of the conversations we’re having around tech tools are actually conversations we should be having about tests: can they scale, can they run anywhere, can they be deployed in modules, can we easily create and consume stand-alone units, are they easy-to-use, does it do only what it’s supposed to do and nothing else, is it really needed, is it difficult to make mistakes, and so on. Testing strikes me as being in the same place today as coding was in the early-to-mid 80s when OO first started becoming popular. We’re just not beginning to think about the right questions, but nowhere near coming up with answers.

                1. 1

                  Hmm… In some ways we hit “Peak Testing” a few years back when we had superb team of manual testers, well trained, excellent processes, excellent documentation.

                  If you got a bug report it had all the details you needed to reproduce it, configs, what the behaviour that was expected, what was the behaviour found, everything. You just sat down and started fixing.

                  Then test automation became The Big Thing and we hit something of a Nadir in test evolution which we are slowly climbing out of…

                  This is how it was in the darkest of days…

                  “There’s a bug in your software.”

                  Ok, fine, I’ll fix how do I reproduce…..

                  “It killed everything on the racks, you’ll have to visit each device and manually rollback.”

                  (Shit) Ok, so what is the bug?

                  “A test on Jenkins failed.”

                  Ok, can I have a link please?

                  “Follow from the dashboard”

                  What is this test trying to test exactly?

                  “Don’t know, somebody sometime ago thought it a good idea”.

                  Umm, how do I reproduce this?

                  “You need a rack room full of equipment, a couple of cloud servers and several gigabytes of python modules mostly unrelated to anything”.

                  I see. Can I have a debug connector to the failing device.

                  “No.”

                  Oh dear. Anyway, I can’t seem to reproduce it… how often does it occur?

                  “Oh we run a random button pusher all weekend and it fails once.”

                  Umm, what was it doing when it failed?

                  “Here is a several gigabyte log file.”

                  Hmm. Wait a bit, if I my close reading of these logs are correct, the previous test case killed it, and the only the next test case noticed…. I’ve been looking at the wrong test case and logs for days.

        4. 1

          Because throughout your program you will need to do comparisons or equality checks and if you aren’t normalizing, that normalization needs to happen at every point you do some comparison or equality check. Inevitably, you will forget to do this normalization and hard to debug errors will get introduced into the codebase.

          1. 1

            Ok. Thank you. I figured out what my hang up was. You first say “Strive to design your data structures so that ideally there is only one way to represent each value.” which I was completely agreeing with. Then you said “In fact storing anything important as a string is a code smell” which made me do a WTF. The assumption here is that you have one and only one persistent data structure for any type of data. In a pure functional environment, what I do with a customer list in one situation might be completely different from what I would do with it in another, and I associate any constraints I would put on the type to be much more related to what I want to do with the data than to my internal model of how the data would be used everywhere. I really don’t have a model of how the universe operates with “customer”. Seen too many different customer classes in the same problem domain written in all kinds of ways. What I want is a parsed, strongly-typed customer class right now to do this one thing.

            See JohnCarter’s comment above. It’s a thorny problem and there are many ways of looking at it.

            1. 1

              I think ideally you still do want a single source of truth. If you have multiple data structures storing customer data you have to keep them synced up somehow. But these single sources of data are cumbersome to work with. I think in practice the way this manifests in my code is that I will have multiple data structures for the same data, but total functions between them.

              1. 2

                Worked with a guy once where we were going to make a domain model for an agency. “No problem!” he said, “They’ve made a master domain model for everything!”

                This was an unmitigated disaster. The reason was that it was confusing a people process (determining what was valid for various concepts in various contexts) with two technical processes (programming and data storage) All three of these evolved dramatically over time, and even if you could freeze the ideas, any three people probably wouldn’t agree on the answers.

                I’m not saying there shouldn’t be a single source of data. There should be. There should even be a single source of truth. My point is that this single point of truth is the code that evaluates the data to perform some certain action. This is because when you’re coding that action, you’ll have the right people there to answer the questions. Should some of that percolate up into relational models and database constraints? Sure, if you want them to. But then what do you do if you get bad incoming data? Suppose I only get a customer with first name, last name, and email? Most everybody in the org will tell you that it’s invalid. Except for the marketing people. To them all they need is email.

                Now you may say but that’s not really a customer, that’s a marketing lead, and you’d be correct. But once again, you’re making the assumption that you can somehow look over the entire problem space and know everything there is to know. Do the mail marketing guys think of that as a lead? No. How would you know that? It turns out that for anything but a suite of apps you entirely control and a business you own, you’re always wrong. There’s always an impedance mismatch.

                So it is fine however people want to model and code their stuff. Make a single place for data. But the only way to validate any bit of data is when you’re trying to use it for something, so the sole source of truth has to be in the type code that parses the data going into the function that you’re writing – and that type, that parsing, and that function are forever joined. (by the people and business that get value from that function)

                1. 2

                  I suspect we might be talking a bit past each other. To use your example, I might ask what it means to be a customer. It might require purchasing something or having a payment method associated with them.

                  I would in this case have a data type for Lead that is only email address, a unique uuid and optionally a name. Elsewhere there is code that turns a Lead into a Customer. The idea being to not keep running validation logic beyond when it is necessary. This might mean having data types Status = Active | Inactive | Suspended which needs to be pulled from external data regularly. I can imagine hundreds of different data types used for all the different ways you might interact with a customer, many instances of these data types created likely right before they are used.

      3. 1

        Mostly agree, but I’d like to add that the ability to pass along information from one part of the system to another should not necessarily require understanding that information from the middle-man perspective. Often this takes the form of implicit ambient information such as a threadlocal or “context” system that’s implemented in library code, but as a language feature it could be made first-class.

    2. 8

      I really liked this article. Having gotten in functional programming with Elm for the last 2 years, the nonEmpty type is brilliant, and I’m going to re-implement it in Elm. The tie in of language theoretic security was a nice touch. I’ve been promoting that at work for a while.

      it’s extremely easy to forget.

      Anything that can be forgotten by a developer, will be forgotten.

      A better solution is to choose a data structure that disallows duplicate keys by construction

      “Making Impossible States Impossible” by Richard Feldman is a talk on effectively the same concept.

      1. 9

        However, sometimes it is quite annoying to conflate the type and structure of the list with the fact that it is nonempty. For example the functions from Data.List don’t work with the NonEmpty type. I think the paper on “departed proofs” linked near the bottom points to a different approach where various claims about a value are represented as separate proof objects.

        1. 4

          In this instance, the fact that both [] (lists) and NonEmpty are instances of Foldable will help: http://hackage.haskell.org/package/base-4.12.0.0/docs/Data-Foldable.html

      2. 2

        I ran across a case of the nonempty approach in File.Select.files the other day:

        Notice that the function that turns the resulting files into a message takes two arguments: the first file selected and then a list of the other selected files. This guarantees that one file (or more) is available. This way you do not have to handle “no files loaded” in your code. That can never happen!

        1. 1

          Nice find! I’ve actually used that before but hadn’t made the connection, haha.

    3. 7

      “The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information.” - Alan J. Perlis

Stories with similar links:

  1. Parse, don’t validate (2019) via bshanks 2 years ago | 49 points | 8 comments