1. 9
  2. 1

    Interesting! I think I might use a similar set of ideas for extracting values out of large JSON objects. I’m a data engineer and spend a lot of time doing ETL jobs. It was helpful to build out a simple schema-based approach to flattening documents so I could load them into our warehouse. What I call a schema is really a series of information paths joined together into a single structure. The schema defines where to find the data and any localized transformations. From there, I can apply a common extraction function to get a stream of flat records.

    I talked a bit about it here, though the approach has evolved since then. I’m pretty happy with the changes I made in the last couple of months and have been considering doing a write-up on my experience.

    1. 2

      I like your approach. It reminds me of JSON Schema.

      What I’m not sure about is whether the schema should also contain the data transformation. Maybe it’s better to have two maps:

      1. One that describes the expected data shape
      2. One that tells how to transform data
      1. 1

        Yeah, that’s the conclusion I came to, except that instead of a second map, I use a Python library called Pydantic. The original way had too much coupling and made it hard to aggregate different data sources into the same schema, but with Pydantic I just create separate factory functions for different data sources.

        Additionally, I pared the schema down so that it uses only basic Python types: dictionaries, strings, tuples, and functions. So it might look something like this (simplified):

        from datetime import datetime

        from pydantic import BaseModel

        # `extract` and `parse_date` are the generic extraction function
        # and date parser mentioned above; their definitions live elsewhere.

        class Metric(BaseModel):
            metricId: int
            metricDate: datetime
            metric: float

            @staticmethod
            def from_api(record):
                # Factory for this particular data source: run the generic
                # extractor against the source's schema, then validate each
                # flat record by constructing a Metric.
                return [
                    Metric(**result)
                    for result in extract(Metric.api_schema(), record)
                ]

            @staticmethod
            def api_schema():
                # Maps source fields to target fields. A string renames a
                # field; a tuple pairs the target name with a transformation;
                # a nested dict descends into a sub-document or array.
                return {
                    "id": "metricId",
                    "date": ("metricDate", parse_date),
                    "observations": {
                        "value": "metric",
                    },
                }
        

        You might notice that the tuple packages a field name with a transformation. This is both an optimization and an extension point. If you had an observations array with 10,000 observations, you might need to convert a date string 10,000 times. However, if you perform the transformation when you first encounter the value in the document, you only need to convert it once, and the value can be shared across all 10,000 observations.
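
        Roughly, the extraction function works like this (a simplified sketch I’m writing for this comment, not the real implementation):

        def extract(schema, record, shared=None):
            # Walk the schema and the document in parallel, emitting one
            # flat dict per leaf record. `shared` carries values converted
            # higher up (like the parsed date) into every nested record.
            shared = dict(shared or {})
            nested = []
            for src_field, target in schema.items():
                value = record[src_field]
                if isinstance(target, dict):
                    # Nested schema: recurse into a list of sub-documents.
                    nested.append((target, value))
                elif isinstance(target, tuple):
                    # (name, transform): convert once, share the result.
                    name, transform = target
                    shared[name] = transform(value)
                else:
                    shared[target] = value
            if not nested:
                return [shared]
            rows = []
            for sub_schema, items in nested:
                for item in items:
                    rows.extend(extract(sub_schema, item, shared))
            return rows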

        Overall, I’ve found this design to be much more resilient and capable than the previous one.

    2. 1

      Is data-oriented programming the same as value-oriented design?

      (Data-oriented programming seems different from data-oriented design.)

      1. 1

        Could you share a link about value-oriented design?

        1. 1

          There was a reply to your HN post with links.

          It’s also been called value- and protocol-oriented programming; here’s a good write-up, and here’s another.

          Rich Hickey spoke about The Value of Values. What was especially interesting was this document about values and objects in programming languages from 1981, saying essentially the same thing. Major hat tip to this Stack Overflow response.

          It’s frustrating that there seems to be a mindshare battle happening right now. Does data-oriented mean:

          1. Immutable values composed in generic data structures
          2. Mechanically sympathetic layout of data (e.g. arrays); avoid using pointers, prefer handles or indexes
          1. 1

            Thank you for sharing all those links!

            1. Do you think that Rich Hickey read the “values and objects in programming languages” paper?
            2. What makes you think there is a mindshare battle?
            3. An important part of DOP is that access to information should be flexible, which is not the case in statically-typed FP languages.
            1. 1
              1. Do you think that Rich Hickey read the “values and objects in programming languages” paper?

              I wouldn’t speculate.

              2. What makes you think there is a mindshare battle?

              Because both usages of “data-oriented” have been hot, at least in the programming circles I run in.

              n.b. I don’t think it’s a hot war. More that “data-oriented” sounds good; it’s mental geography to be claimed.

              3. An important part of DOP is that access to information should be flexible, which is not the case in statically-typed FP languages.

              Thank you for raising this point!

              In TFA, you noted that the use of structs/classes/records over maps would break one or more of the DOP principles you laid out.

              I don’t see how it conflicts.

              My best guess is that “generic data structures” in the DOP context has a constrained meaning, probably limited to: lists, maps, (maybe) sets, and (very unlikely) trees. Essentially, what we get for free in popular languages and in popular serialisation formats.

              I looked at some solutions to your challenges; a few used static types to no detriment that I could observe.

              1. 1

                Regarding the term “data-oriented”, we have data-oriented design and data-oriented programming. I wrote an article that clarifies the distinction between the two.

                Your guess is correct: structs/classes/records break Principle #3 as it is currently formulated:

                Represent data with generic data structures

                But I think we could loosen it a bit and reformulate Principle #3 as:

                Provide flexible access to data

                That would open the door to representing data with structs/classes/records, but then you would need to use reflection.

                That’s why I am so curious to see solutions with static types. Please submit a PR!

                1. 1

                  What does “flexible access to data” mean? My guess: being able to access fields by computed names.

                  What is an example of what static types impede?

                  1. 1

                    Flexible access to data means:

                    1. Ability to access a field by its name
                    2. Ability to iterate over the fields
                    3. Ability to create an aggregation of data with no rigid shape

                    In a statically-typed language, we naturally get flexible access to data if we represent data as string maps. However, when we represent data with structs/classes/records, it’s more challenging:

                    1. Accessing data by field name: requires knowledge of the class definition, or reflection
                    2. Iterating over the fields: requires reflection
                    3. Creating an aggregation of data with no rigid shape: not sure whether it is possible

                    The challenges from the article require applying all three points. That’s why I am so interested in seeing solutions to them in a statically-typed language.
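
                    For illustration, here is roughly what those three points look like in Python, where getattr and the dataclasses module play the role that reflection plays in Java (the Author record is a made-up example):

                    from dataclasses import dataclass, fields

                    @dataclass(frozen=True)
                    class Author:
                        first_name: str
                        last_name: str

                    a = Author("Isaac", "Asimov")

                    # 1. Access a field by a name computed at run time:
                    # requires getattr, i.e. reflection.
                    field_name = "last_name"
                    print(getattr(a, field_name))  # Asimov

                    # 2. Iterate over the fields: requires the dataclasses
                    # reflection API.
                    print({f.name: getattr(a, f.name) for f in fields(a)})

                    # 3. Aggregate data with no rigid shape: no class
                    # describes this combination, so we fall back to a map.
                    summary = {"name": f"{a.first_name} {a.last_name}",
                               "source": "api"}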

      2. 1

        Is _ Lodash? How does #3 work, given that _.set mutates the object?

        1. 2

          In this article, I am using Lodash FP, configured so that it never mutates data in place; instead, functions like _.set() create a new version.
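
          Conceptually, a non-mutating set looks something like this (a minimal Python sketch of the idea with made-up data, not Lodash’s actual implementation):

          def set_(data, path, value):
              # Return a copy of `data` with `value` at `path`. The
              # original is untouched; unchanged branches are shared.
              if not path:
                  return value
              key, *rest = path
              new = dict(data)
              new[key] = set_(data.get(key, {}), rest, value)
              return new

          book = {"title": "Watchmen", "details": {"year": 1986}}
          updated = set_(book, ["details", "year"], 1987)
          assert book["details"]["year"] == 1986     # old version intact
          assert updated["details"]["year"] == 1987  # new version changed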

          I have just added a word about it in the article, as it was not clear. Thank you for your question.

        2. 1

          Neat article format. I love the integration of executable code.

          1. 2

            Thank you!

            The interactive code snippets are powered by a tool of mine named Klipse.

          2. 1

            How does “represent data with generic data structures” interact with static type checking? This would probably work in TypeScript, since it has very strong support for this, but it seems like if I wanted to follow this example in something with a Java-ish strong static type system, I’d have to make all my data objects Map<String, Object>, which isn’t very ergonomic and throws away a lot of compile-time validation.

            1. 1

              It’s an interesting question. One purpose of my article is to pique the curiosity of folks from statically-typed languages. Hopefully, they’ll come up with innovative solutions.

              Meanwhile, I gave it a try and imagined how we could support dynamic data access in Java. The result is here.

              1. 1

                Interesting! The ‘typed getter’ approach reminds me a lot of lenses in Haskell. A lens is both a getter and a setter for a field, and lenses have the nice property that they compose just like functions (with the . operator in Haskell).
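
                To illustrate the composition property, here is a toy Python sketch of the idea (real lens libraries are far more general):

                from collections import namedtuple

                # A lens pairs a getter with a non-mutating setter
                # for one location in a structure.
                Lens = namedtuple("Lens", ["get", "set"])

                def key(k):
                    # A lens focusing on one key of a dict; its setter
                    # returns a new dict instead of mutating.
                    return Lens(
                        get=lambda d: d[k],
                        set=lambda d, v: {**d, k: v},
                    )

                def compose(outer, inner):
                    # Composing lenses behaves like function composition.
                    return Lens(
                        get=lambda whole: inner.get(outer.get(whole)),
                        set=lambda whole, v: outer.set(
                            whole, inner.set(outer.get(whole), v)
                        ),
                    )

                details_year = compose(key("details"), key("year"))
                book = {"details": {"year": 1986}}
                print(details_year.get(book))        # 1986
                print(details_year.set(book, 1987))  # new dict; original unchanged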

            2. 1

              Nice article. Is there a document that describes why Lodash added _.get? It is possible to just write obj[path0][path1]…etc. I suspect they have an interesting insight.

              1. 3

                To me, the problem that _.get solves is that obj[path0][path1]... doesn’t compose well. By that, I mean you have to know the exact length of your path at “compile time”.

                _.get allows you to treat the path as a first-class citizen.
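
                For instance, a path-based getter along these lines (a Python sketch of the concept, not Lodash’s code) can take a path built at run time:

                def get_(data, path, default=None):
                    # The path is ordinary data, so it can be built, stored,
                    # and passed around, unlike a hard-coded lookup chain.
                    for step in path:
                        try:
                            data = data[step]
                        except (KeyError, IndexError, TypeError):
                            return default
                    return data

                doc = {"observations": [{"value": 1.5}, {"value": 2.5}]}
                path = ["observations", 1, "value"]  # first-class: just a list
                print(get_(doc, path))                         # 2.5
                print(get_(doc, ["missing", "x"], default=0))  # 0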