
  2. 2

    I like the overall idea, but I’m unclear on something.

    Part 1 says:

    Even logging invalid data could potentially lead to a breach.

    I can’t think how that would be the case.

    Also, the example of that is:

    log_error("invalid_address, #{inspect address}")

    In the reworked example, you show

     {:error, validation_error} ->
        log_error(validation_error)
        return_error_to_user(validation_error)
    

    But validation_error contains (a couple levels deep) an input: field with the original input. So wouldn’t it have the same problem?

    1. 1

      Yeah, I totally agree that we’re cheating here! This is a design tension that we’re not sure how to resolve. On one hand, we don’t want to expose unsanitized inputs to the caller, while on the other we’d love to log examples of payloads that cause the parser to fail, for auditability.

      Do you have any pointers (or links to resources) on ways to resolve this tension? There’s always the option of “defanging” the original input by base64-encoding it, etc., but perhaps there’s a more elegant way out?
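
      For illustration, here’s roughly what we mean by defanging. This is only a sketch: it pretends the raw payload sits at a top-level :input key (in reality it’s nested a couple of levels deep, as you point out), and none of these names come from the article.

      defmodule Defang do
        # Base64-encoding keeps the offending payload auditable in the logs
        # while making sure nothing downstream renders or interprets it.
        def defang(%{input: raw} = validation_error) do
          %{validation_error | input: raw |> inspect() |> Base.encode64()}
        end
      end

      # At the call site from the reworked example:
      # {:error, validation_error} ->
      #   validation_error |> Defang.defang() |> log_error()
      #   return_error_to_user(validation_error)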

    2. 2

      For stateless validations (must be a number between 0 and 100), this is a nice approach. For stateful validations (this e-mail address has already been taken), it should probably be a two-stage process, unless we want to put filesystem/database/etc. calls inside our parsers, which seems like a terrible idea.

      1. 3

        Yes, putting some kind of IO (service-call/db/etc) inside a parser would be terrible. I try to tackle stateful validation problems like this:

        1. Model the syntactically-valid data type, and use a parser to “smart-construct” it. So in this case we’d have an %EmailAddress{}. This data type doesn’t tell us anything about whether the email has been claimed or not.

        2. Down the line, when (or if) we actually need to work with email addresses that are unclaimed, we have the service responsible for instantiating them expose a function typed:

        @spec to_unclaimed_email_address(%EmailAddress{}) ::
                Result.t(%UnclaimedEmailAddress{}, some_error())
        

        This function does the necessary legwork to either create a truly unclaimed email address, or tell you that it’s not possible with the data you brought it. It still conforms to the ‘railway-oriented style’, but at another level of the architecture.

        Of course this opens up another can of worms in terms of concurrency, but that’s state for you.
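
        To make it concrete, here’s a rough sketch of step 2. Accounts.email_taken?/1 and the bare-bones structs are placeholders for whatever you already have, and I’m using plain ok/error tuples instead of Result.t to keep it self-contained:

        defmodule EmailAddress do
          # Stand-in for the parser-constructed, syntactically-valid type.
          defstruct [:value]
        end

        defmodule UnclaimedEmailAddress do
          defstruct [:value]
        end

        defmodule EmailService do
          # The stateful check lives here, one level up from the parser.
          @spec to_unclaimed_email_address(%EmailAddress{}) ::
                  {:ok, %UnclaimedEmailAddress{}} | {:error, :already_claimed}
          def to_unclaimed_email_address(%EmailAddress{value: value}) do
            if Accounts.email_taken?(value) do
              {:error, :already_claimed}
            else
              {:ok, %UnclaimedEmailAddress{value: value}}
            end
          end
        end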

      2. 1

        A couple of years ago, I slapped together a couple of modules in Python for taking nested JSON documents and pulling out slices of data for loading into a data warehouse. I’m happy with the general idea, but I have been wanting to refactor the implementation to separate out some of the concerns and improve flexibility. Everything is too bound up, too opinionated.

        If you had a JSON document for a film like so:

        {
          "id": 1,
          "title": "Titantic",
          "cast": [
            {"talent_id": 1, "name": "DeCaprio", "role": "Jack"},
            {"talent_id": 2, "name": "Winslet", "role": "Rose"}
          ],
          "release_dates": [
            {"location": "US", "date": "1997-12-19"},
            {"location": "CA", "date": "1997-12-20"}
          ]
        }
        

        Then you could write schemas like so:

        film = {
          "id": Field("titleId", int),
          "title": Field("title", String50),
          "country_of_origin": Field("originalCountry", NullableString50),
        }
        
        cast = {
          "id": Field("titleId", int),
          "cast": {
            "talent_id": Field("talentId", int),
            "name": Field("talentName", String100),
            "role": Field("role", String100),
          }
        }
        
        release_dates = {...}  # you get the picture
        

        Which would result in dictionaries like:

        films = [{"titleId": 1, "title": "Titanic", "originalCountry": null}]
        
        cast = [
          {"titleId": 1, "talentId": 1, "talentName": "DeCaprio", "role": "Jack"},
          {"titleId": 1, "talentId": 2, "talentName": "Winlet "role": "Rose"},
        ]
        

        I built some plumbing around deserializing the document, passing it to a series of schemas, pulling out the record instances, and serializing each instance to its own location. If any subcomponent fails, I fail out the whole set of records to ensure the database has a logical view of the entity. Overall, it’s worked pretty well for well-organized, consistently typed JSON data. Unfortunately, there is a lot of nasty JSON data out there, and it can get pretty complex.

        I suppose this is a long way of saying that this article gives me a couple of ideas for how I might decouple some of this logic. Are you going to be discussing building entire structs next time? Or are you looking at it from a per-field perspective?

        Looking forward to the next article!

        1. 1

          Hey, thanks for the great feedback! Yeah, we’re going to be building entire structs – if you take a look at the previous post, at the end (“Under the Hood”) there’s a snippet that uses Data.Constructor.struct/3 to specify parsers for the particular fields. The next installment is going to be about how to make your struct parsing more flexible: for example, if you have a big flat JSON object coming in, but want to use it to create a nested hierarchy.

          In general, we’re taking a fractal approach of composing smaller parsers to create larger ones. The ‘struct’ parser constructor is a complex combinator with some specific semantics, but it’s fundamentally similar to the list/1 combinator. So yeah, to answer your question, we will be BOTH constructing entire structs and looking at it from a per-field perspective. It all comes together in the end.
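
          To give a feel for the ‘fractal’ part without leaking our real API (the actual combinators look different), you can model a parser as a plain function and list/1 as a combinator over such functions. This is only a sketch of the idea:

          defmodule ParserSketch do
            # A parser: a function from untrusted input to {:ok, value} | {:error, reason}.
            @type parser(a) :: (term() -> {:ok, a} | {:error, term()})

            # list/1 turns a parser for one element into a parser for a list of
            # them, failing fast on the first element that doesn't parse.
            @spec list(parser(a)) :: parser([a]) when a: var
            def list(element_parser) do
              fn
                input when is_list(input) ->
                  result =
                    Enum.reduce_while(input, {:ok, []}, fn element, {:ok, acc} ->
                      case element_parser.(element) do
                        {:ok, value} -> {:cont, {:ok, [value | acc]}}
                        {:error, _} = error -> {:halt, error}
                      end
                    end)

                  with {:ok, acc} <- result, do: {:ok, Enum.reverse(acc)}

                _other ->
                  {:error, :not_a_list}
              end
            end
          end

          The struct constructor is the same shape of thing, just folding over named fields instead of list elements.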

          1. 1

            Awesome! Look forward to reading about it!