1. 2

    It might sound like a silly question, but if I already have the file on my local filesystem, what do I gain by adding a network layer on top of it?

    1. 3

      In my company we have a lot of slowly-changing data sets. This looks like a way to distribute that data via API without all the machinery of a database underneath it.

      Combine it with immutable infrastructure and we could have a version controlled file distributed by deployment without the need for custom source code for each dataset.

      1. 3

        This is one of the main inspirations behind roapi! We have lots of datasets that get updated once a day by ETL pipelines. Right now we export that data back to MySQL at the end of every pipeline run, but MySQL is huge overkill for this particular use-case. The export process can easily overload the MySQL instance and impact ongoing production traffic too. Serving that data from a stateless read-only API backed by an embedded analytical query engine is a much simpler and more scalable setup: newly produced data can be served by deploying a new version of the API without any impact on existing traffic.

      2. 2
        • The data is centralised by default.
        • The data is read-only.
        • You can edit the underlying file at any time and your users will see the new data without needing to re-download anything.
        • Serve via HTTPS so the reader can verify the source of the data, know it was transmitted privately and wasn’t tampered with during transit.
        • You only need to expose a network connection, not the file system. Keeps the attack surface smaller.
        • Whatever is consuming it can query it using SQL.
        • Get all the speedups that Parquet offers over CSV.
        • Single command on the CLI, very little to mess up.
        1. 2

          If you are OK with distributing the same dataset to all nodes that need access to the data, then you don’t need the extra network layer. In fact, roapi comes with a columnq library and CLI to help you perform the same type of analytical queries over local datasets; see https://github.com/roapi/roapi/tree/main/columnq-cli and https://github.com/roapi/roapi/tree/main/columnq.

          But if you have a 10GB dataset you want to serve and 1000 clients, you probably don’t want to copy that 10GB to every client. This is where ROAPI meets your needs. Or if your clients are all written in different languages and you want to provide a consistent query interface and capabilities for all of them, wrapping the data behind an API is a good idea even if the dataset is small.

        1. 2

          Cool project! It looks like a great little tool for improving quality of life for anything that’s still just some hosted file.

          It would be pretty fun to make a little static HTML file for executing the basic commands and stuff from a web browser (sort of like an OpenAPI spec browser).

          1. 2

            yep, that’s a pretty cool idea, we were discussing this a couple of weeks ago in https://github.com/roapi/roapi/issues/80#issuecomment-923160321.

          1. 1

            Is Delta Lake only for big data, or could it be useful sometimes for smaller projects?

            1. 2

              It used to be big-data only because you could only read/write it from Spark. With the new Rust implementation, that’s no longer the case: https://github.com/delta-io/delta-rs. You can use it for smaller projects as well.

            1. 3

              I don’t see any mention of delivery semantics in the linked repo. @houqp, perhaps you can expand on this? Right now, the linked repo seems like a Kafka connector, but there’s not much in there from what I can see.

              1. 2

                Yes, it’s a native Kafka to Delta Lake connector. In short, exactly-once delivery is accomplished by batching the messages and the Kafka offset into a single Delta Table commit so they are written to the table atomically. If a message has already been written to a Delta Table, trying to write it again results in a transaction conflict, because kafka-delta-ingest only allows the offset to move forward.

              1. 5

                I’m a huge fan of both Rust and Delta Lake, but my eyebrows shot off my face when I saw “exactly once delivery”

                EDIT

                I asked about this years ago and @aphyr and @mjb gave me memorable answers about why exactly once delivery isn’t possible. I highly recommend reading it. More recently, we’ve discovered that there are some cases where exactly once delivery is possible, but the semantics are very difficult to grok, to the point that it’s probably best to only claim “exactly once” when you’re in the presence of extremely knowledgeable people. Any “exactly once” guarantees require a strict protocol with the client, so the semantics don’t bubble up to larger systems.

                For example, someone might extrapolate from this title that when I stream the Delta Lake table I’ll receive each message exactly once. That’s not true: Delta Lake doesn’t give those guarantees, and neither does Kafka. Only the connector from Kafka into Delta Lake does.

                It’s still useful, to be sure. But be careful about the semantics.

                1. 4

                  Thanks @kellogh for the links. I fully agree with what you said and with the discussion you linked. Like you said, it all comes down to semantics. The kafka-delta-ingest project is a Kafka to Delta connector. What I meant in the title is that we deliver a message from Kafka to a Delta Table exactly once. Notice I used the phrase “from Kafka to Delta Lake” in the title, not “from Kafka to Delta Lake to your client” ;) It certainly doesn’t make sense to talk about exactly-once delivery to an opaque client in a physical sense. In real-world distributed systems, messages get redelivered all the time. The consumer of a Delta Table or Kafka topic will need its own atomic progress tracking in order to process each message exactly once logically.

                1. 1

                  What’s the use case for this? If you have the file, why not just read it directly?

                  1. 1

                    To present a unified query interface across different data sources. The core of the query logic is in fact factored out as an independent crate, so it can be used in embedded mode where downstream applications want to handle small files directly at runtime.

                    But for larger datasets that require a lot of memory to hold, you would want to load the data once in an API so downstream services don’t have to. The query results are usually a lot smaller than the dataset itself. There is also the case where the dataset owner wants to manage dataset updates in a single place instead of embedding the files (or URIs) into downstream services; serving those datasets as an API is a good way to do that.

                  1. 1

                    Once this starts up, will it reflect changes to the underlying data sources, e.g. if a separate process updated the backing csv/json/google doc? I’m wondering if this would be useful for an event-sourced read API.

                    1. 2

                      Currently it assumes the dataset is static, so a redeploy is required to pick up changes from the dataset.

                      I do want to add support for consuming updates to the dataset in a streaming fashion, effectively supporting the event-sourced read API use-case you proposed. It should be pretty straightforward to implement: we just need to subscribe to the stream and append new data as Arrow record batches to the in-memory table.

                    1. 1

                      Great read! Have you had a chance to check out the analysis package from the Go subrepositories?

                      1. 1

                        I haven’t. Thanks for the recommendation, will take a deeper look at it :)

                      1. 2

                        My team uses EFS to solve this same problem.

                        We used to utilize S3, but the eventual consistency got annoying, as did the fact that people would just upload to it from their local machines, etc.

                        1. 2

                          Our staging and production S3 buckets are not accessible from local machines, so all changes have to go through git commits and CI/CD.

                          What replication delay did you experience with S3? So far I have only seen a worst-case delay of a couple of seconds, which hasn’t caused any issues for us.

                          1. 1

                            Because of how our large organization worked, we had to refresh our AWS creds every hour, which meant we needed to run a sidecar just for that refresh. Now that we just mount EFS, no cred management is needed. (This was the true root of our problem.)

                            1. 1

                              Interesting, wouldn’t using an IAM role handle the cred rotation for you automatically?

                              1. 1

                                Our role sessions only last for an hour

                        1. 1

                          This is a really interesting way to set up your Airflow DAGs. We broke ours out into a couple of Airflow instances due to the size of one. They also had a logical separation based on what they were processing though.

                          Are you pulling directly into the current DAG directory, or are you pulling into a separate dir that you cut over to? IIRC, you have to signal Airflow to reload the DAGs.

                          1. 1

                            We pull directly into the current DAG directory. Airflow now automatically detects DAG changes and reloads them on the fly, so we didn’t have to do anything special.

                          1. 2

                            I’d never heard of Airflow before. Something with workflows… How does it compare to Node Red?

                            1. 2

                              It’s mostly used as an ETL system, in my experience, and would be more akin to systems like Luigi.

                              1. 1

                                That’s right, Airflow solves the same problem that Luigi solves.

                                1. 1

                                  Thanks, it looks a little enterprisey :P Now I’m off to look up what Luigi is ;)

                            1. 2

                              I don’t see anything wrong with the testing package? Looking at your example, this can be done with testing in the same or fewer lines of code.

                              1. 1

                                If the testing package works for your project, you should definitely stick with it. I am all for less magic unless it’s really necessary.

                                This is for larger test codebases that require non-trivial setup and teardown code at the beginning and end of each test. For example, running a goroutine leak check at the end of every test, or setting up a database transaction for each test.

                                1. 1

                                  You can do that with testing as well; here’s a real-world example: setup() does some magic to intercept stdout/stderr and os.Exit(), and the function that’s defer()’d restores the original state. You can add db setups, teardowns, etc. like this:

                                  func TestSearch(t *testing.T) {
                                  	tests := []struct {
                                  		in        []string
                                  		want      string
                                  		wantLines int
                                  		wantExit  int
                                  	}{
                                  		{[]string{"s", ""}, "need search term", 1, 1},
                                  		// [..]
                                  	}
                                  
                                  	for _, tt := range tests {
                                  		t.Run(fmt.Sprintf("%v", tt.in), func(t *testing.T) {
                                  			outbuf, c := setup(t, tt.in, tt.wantExit)
                                  			defer c()
                                  
                                  			out := outbuf.String()
                                  			if lines := strings.Count(out, "\n"); lines != tt.wantLines {
                                  				t.Errorf("wrong # of lines\nout:  %d\nwant: %d", lines, tt.wantLines)
                                  			}
                                  			if !strings.Contains(out, tt.want) {
                                  				t.Errorf("wrong output\nout:  %q\nwant: %q", out, tt.want)
                                  			}
                                  		})
                                  	}
                                  }
                                  
                                  1. 1

                                    If table-driven testing works for your project, then gtest is definitely overkill :)

                                    It is for doing setup and teardown on a group of tests that have completely different test logic, which you cannot abstract into table-driven tests. In your case, what if some of the tests don’t need to check the exit code? What if one of the tests wants to make an HTTP call and check the response code? If you use subtests from the testing package to achieve all that, you will end up with gtest.

                                    I had to do it manually for every new Go project, to the point that I decided to turn my copy-paste code into a library so I don’t have to copy-paste again.

                                    1. 1

                                      In your case, what if some of the tests don’t need to check the exit code?

                                      Passing -1 skips that check: https://github.com/arp242/uni/blob/master/uni_test.go#L268 (or more accurately: it ensures that os.Exit() isn’t called). In general, if tt.wantFoo != <zero value> && ... works reasonably well.
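
                                      For readers unfamiliar with the pattern: a zero-valued (or sentinel) field in a table entry simply disables that check. A hedged sketch, with a made-up tc struct and check helper rather than the actual uni test code:

```go
package main

import (
	"fmt"
	"strings"
)

// tc is a hypothetical table-test case where a zero-valued field
// means "skip this check entirely".
type tc struct {
	want      string // "" → don't check output contents
	wantLines int    // 0  → don't check the line count
}

// check applies only the checks whose expectations are non-zero,
// returning a description of each failed check.
func check(out string, tt tc) []string {
	var errs []string
	if tt.wantLines != 0 && strings.Count(out, "\n") != tt.wantLines {
		errs = append(errs, "wrong # of lines")
	}
	if tt.want != "" && !strings.Contains(out, tt.want) {
		errs = append(errs, "wrong output")
	}
	return errs
}

func main() {
	fmt.Println(check("hello\n", tc{want: "hello"})) // line-count check skipped
	fmt.Println(check("hello\n", tc{wantLines: 2}))  // only the line count fails
}
```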

                                      I’ve thus far not seen a scenario where table-driven tests didn’t work at all. Just because I’ve not seen it doesn’t mean it doesn’t exist, but every time I’ve seen someone use a “testing framework” in Go, they could have used table-driven tests too.

                                      YMMV, and if this works well for you then that’s great! But for me the #1 reason to not contribute to various open source projects is that the test framework is too painful to deal with: spend 5 minutes making a small change, and then 50 minutes figuring out how to fix the tests. I wrote a thing about this a while ago: https://arp242.net/testing.html (it talks about testing frameworks specifically about halfway through; search for “framework”).

                                      1. 2

                                        Passing -1 skips that check: https://github.com/arp242/uni/blob/master/uni_test.go#L268 (or more accurately: it ensures that os.Exit() isn’t called). In general, if tt.wantFoo != <zero value> && … works reasonably well.

                                        That’s exactly my point: you are moving logic that should reside in individual tests into shared global setup code. Now every test pays the penalty of having to be aware of this global setup logic whether it cares about the exit code or not. Again, as I mentioned in my previous comment, what if one of the tests needs to issue an HTTP call?

                                        A very common use-case where table-driven tests won’t work is testing API handlers that talk to databases. You need to provide a DB transaction to each test, while every API handler test has completely different test logic depending on the behavior of the API. Now imagine having to create a temporary DB with a predefined schema before injecting a new transaction into each test.

                                        Another example is adding a goroutine leak check at the end of all tests; you can’t possibly fit all your tests into one giant table, can you?

                                        Again, all of the above is achievable through the builtin testing package with custom code. But if you do that, you will end up with gtest.
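
                                        To make the per-test transaction point concrete, here is a hypothetical Go sketch. fakeTx stands in for a real *sql.Tx; an actual setup would also create the temporary database and load the schema before handing out transactions:

```go
package main

import "fmt"

// fakeTx stands in for a real *sql.Tx handed to each test.
type fakeTx struct{ rolledBack bool }

func (t *fakeTx) Rollback() { t.rolledBack = true }

// withTx gives each test its own transaction and always rolls it back
// afterwards, so tests never see each other's writes and need no
// per-test cleanup code.
func withTx(test func(tx *fakeTx)) *fakeTx {
	tx := &fakeTx{}
	defer tx.Rollback()
	test(tx)
	return tx
}

func main() {
	tx := withTx(func(tx *fakeTx) {
		// each handler test would use tx for its inserts and queries
		fmt.Println("test ran inside its own transaction")
	})
	fmt.Println(tx.rolledBack) // true
}
```

                                        Since the transaction is never committed, the database stays clean between tests regardless of what each test's logic does.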

                                        YMMV, and if this works well for you then that’s great! But for me the #1 reason to not contribute to various open source projects is that the test framework is too painful to deal with: spend 5 minutes making a small change, and then 50 minutes figuring out how to fix the tests.

                                        I have experienced this myself as well. For me personally, familiarity with the language and framework plays a major role here. If a project uses pytest, which should be the case for most Python projects, I can jump right in and be productive on day one.

                                        I have seen the extreme opposite as well, where in a complex project the author basically created their own test framework within the project by refusing to use a well-adopted one. Then I ended up having to read through the undocumented custom setup to figure out how to add a new test. That is a much worse experience IMO :)

                                        1. 1

                                          you are moving logic that should reside in individual tests into shared global setup code. Now every test pays the penalty of having to be aware of this global setup logic whether it cares about the exit code or not. Again, as I mentioned in my previous comment, what if one of the tests needs to issue an HTTP call?

                                          I find it’s usually a small price to pay, just an if.

                                          I don’t understand your HTTP call question? You mean the code being tested calls an HTTP API or some such?

                                          A very common use-case where table-driven tests won’t work is testing API handlers that talk to databases. You need to provide a DB transaction to each test, while every API handler test has completely different test logic depending on the behavior of the API. Now imagine having to create a temporary DB with a predefined schema before injecting a new transaction into each test.

                                          I’ve done this several times, and it works fairly well. Basically the same method as above, see for example this and this.

                                          Another example is adding a goroutine leak check at the end of all tests; you can’t possibly fit all your tests into one giant table, can you?

                                          You can just add a function call for this, right? If you want it for literally every test you can define TestMain().

                                          Again, all of the above is achievable through the builtin testing package with custom code. But if you do that, you will end up with gtest.

                                          I’m not so sure about that; the “plumbing” for these kinds of tests is a lot smaller, and typically doesn’t use reflection.

                                          1. 1

                                            I find it’s usually a small price to pay, just an if.

                                            I agree with you: if it’s just one or two ifs, it’s a very small price, and in that case I would just do it manually instead of adding an extra dependency. In fact, I would start my tests out that way and only introduce dependencies like gtest later, if the complexity gets out of control.

                                            What I found in real-world complex projects is that manual setup doesn’t scale in some cases. You may start out with one if, then eventually grow into a lot of ifs, which makes the code harder to read and maintain, as in the example you linked: https://github.com/zgoat/goatcounter/blob/master/handlers/http_test.go#L51:6. Notice that the runTest function only handles one specific type of HTTP handler test, which calls an endpoint and checks the response. Look at how many if statements it contains and how nested the code already is.

                                            If you find https://github.com/zgoat/goatcounter/blob/master/handlers/backend_test.go#L60 easy to read and extend, then that’s great and you should stick with it. I personally find the multi-level nested for and if statements harder to read than a clean struct method.

                                            You can just add a function call for this, right? If you want it for literally every test you can define TestMain().

                                            I have tried that in the past. Unfortunately, TestMain applies to the full package, not a selected group of tests.

                                            I’m not so sure about that; the “plumbing” for these kinds of tests is a lot smaller, and typically doesn’t use reflection.

                                            https://github.com/houqp/gtest/blob/master/gtest.go#L256 is only 50+ lines of code including error handling, and only a small portion of that is spent on supporting per-test and per-test-group setup/teardown. I would say the plumbing is at about the same size as the example you gave, and more readable. In case you’re wondering, that setup/teardown reflection pattern is taken from https://github.com/grpc/grpc-go/pull/2523/files, as documented in my code, which again is another example of a mini test framework within a project that one needs to learn and maintain. I try to avoid reflection in application code as much as possible; for tests, I think it’s a reasonable tradeoff to use reflection to make things more readable and maintainable.

                                            At the end of the day, if table-driven tests scale fine for your test requirements, then there is no point in introducing another small dependency.