1. 7

    My understanding of the GDPR is that it requires user data to have a retention period. This means that if a user leaves we eventually have to hard delete their data. Maybe my understanding is off, but does Dolt allow this? The comparable operation in git would be history rewriting, which is frowned upon in public histories.

    1. 3

      The blockchain GDPR problem

      1. 2

        This means that if a user leaves we eventually have to hard delete their data.

        That is not true. You need to make PII (non-business-critical, personally identifying information) unreadable; that doesn’t mean all data needs to be deleted, nor that account deletion is itself a request to delete data.

        As for allowing this, there are two ways to achieve it:

        • store PII in a separate DB and just keep references between the two
        • encrypt data with a per-user key, and delete the key on user request.
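        The second option is often called crypto-shredding. Below is a minimal sketch of the idea; the cipher here is a toy SHA-256 keystream for illustration only (a real system would use an authenticated cipher like AES-GCM from a vetted library), and all the function and user names are made up. The point it demonstrates: once the per-user key is deleted, the ciphertext left in the main database is unrecoverable.

        ```python
        import hashlib
        import secrets

        # Toy keystream cipher for illustration only -- use AES-GCM in production.
        def keystream_xor(key: bytes, data: bytes) -> bytes:
            out = bytearray()
            counter = 0
            while len(out) < len(data):
                out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
                counter += 1
            return bytes(b ^ k for b, k in zip(data, out))

        # Key store kept separate from the main data store.
        user_keys = {}

        def store_pii(user_id: str, pii: bytes) -> bytes:
            key = user_keys.setdefault(user_id, secrets.token_bytes(32))
            return keystream_xor(key, pii)  # ciphertext goes in the main DB

        def read_pii(user_id: str, ciphertext: bytes) -> bytes:
            return keystream_xor(user_keys[user_id], ciphertext)

        def forget_user(user_id: str) -> None:
            # "Erasing" the user: shred the key; the ciphertext stays but is garbage.
            del user_keys[user_id]

        ct = store_pii("alice", b"alice@example.com")
        assert read_pii("alice", ct) == b"alice@example.com"
        forget_user("alice")  # after this, ct can never be decrypted again
        ```

        Note that versioned history (as in Dolt or git) doesn’t break this scheme: old commits only ever contain ciphertext, so shredding the key renders every historical copy unreadable at once.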
        1.  

          Yeah, key shredding is a great approach. Saw this a bunch at Google.

        2.  

          You’re not wrong, this is a big problem for people who store customer data.

          The solution we have for now is basically rebase. You censor the offending fields in the entire history. This means everybody has to re-clone.

          As we mature we’ll look into less disruptive ways to achieve the same thing.

        1. 7

          What are some interesting use cases for this?

          1. 7

            The most interesting one currently is paying people to crowdsource large data sets. We just ran a couple large bounties that paid $25k and $10k:

            https://www.dolthub.com/blog/2021-02-15-election-bounty-review/

            https://www.dolthub.com/blog/2021-03-03-hpt-bounty-review/

            There’s another one going on for $10k right now.

            But most people who are paying us money for the product are using it as an application server, basically to replace MySQL or postgres. They want to be able to manage versioning of their production database like git.

            There’s a ton of use cases though, it’s a really cool product.

            https://www.dolthub.com/blog/2020-03-30-dolt-use-cases/

            1. 2

              Is there any real-world analysis being done on Dolt’s election precinct data?

              1. 2

                Not that I know of, but it’s only a month old.

                We recently announced we’re going to be partnering with OpenElections to fill out the rest of the data for 2020, so a more complete set of data might spur further interest.

                https://www.dolthub.com/blog/2021-02-24-open-elections-followup/

          1. 1

            Does go-mysql-server try to emulate mysql exactly, or is there a plan to improve SQL standard conformance?

            1. 2

              The goal is to be a drop-in replacement for MySQL, with local testing a significant use case. Emulation is not an explicit goal, especially when it comes to reproducing behavior that is widely agreed to be incorrect, but generally speaking the server should behave identically to MySQL. This view could evolve over time – the goal of the project is to be a useful SQL engine for people who are familiar with MySQL, not to be an exact bug-for-bug clone. The main area where behavior will be strictly identical is parsing and dialect specific grammar.

            1. 4

              for folks, like myself who did not know what ‘Dolt’ is

              “.. Dolt is a relational database, i.e. it has tables, and you can execute SQL queries against those tables. It also has version control primitives that operate at the level of table cell. Thus Dolt is a database that supports fine grained value-wise version control, where all changes to data and schema are stored in commit log. “ [1]

              [1] https://github.com/liquidata-inc/dolt

              1. 7

                Dolt is git for data. It has the same command line as Git, but it versions tables instead of files. And it adds a couple other commands for the database part, like sql for running a SQL shell against your tables, or sql-server for starting up a MySQL compatible server to connect to.

                1. 2

                  dolt:

                  n. A stupid person; a dunce.
                  To waste time foolishly; behave foolishly.
                  n. A dull, stupid fellow; a blockhead; a numskull.
                  

                  Unfortunate name.

                  1. 4

                    It’s an homage to how Linus named git:

                    https://en.wikipedia.org/wiki/Git#Naming

                1. 2

                  Most of my work is golang, so I spend most of my time in jetbrains GoLand. It’s pretty great, well worth the license.

                  1. 1

                    I use a Windows desktop with a last-gen Ryzen and 16GB of RAM. Here’s my full specs:

                    https://www.userbenchmark.com/UserRun/19325657

                    To answer the obvious question: I use Windows because it’s still the most common desktop environment in the world by a long shot, and we need to be able to support customers running it. Me and one other guy in our 10 person startup agreed to be the Windows guys. Everybody else uses MBPs.

                    I am using Windows Subsystem for Linux (WSL) for my command line experience, and it’s been surprisingly nice. For those not in the know, WSL is a Linux virtualization env that plays nicely with Windows. I can access my Windows files pretty seamlessly from my virtual Ubuntu instance (although at a performance penalty), and run Windows executables from the command line on my Ubuntu file system. For example, I have go aliased to go.exe, so that my command line go commands use the same Go env as my GoLand IDE. There have been some bumps, but it all works pretty great overall.

                    1. 2

                      For those not in the know, WSL is a Linux virtualization env that plays nicely with Windows

                      This is true only of WSL2, which runs a Linux kernel in a VM with 9P-over-VMBus for forwarding the filesystem in either direction. WSL1 is closer to the *BSD / Solaris Linux-compat layer: your Linux processes are NT processes (a special kind), system calls go to the NT kernel, and so on. This means that you can, for example, start a Linux X11 terminal, have it connect to a native Windows X server (I use vcxsrv, which is basically X.org built with Visual Studio and packaged), and then run cmd.exe or powershell.exe and get a Windows console.

                    1. 0

                      Results in a sentence, please? Not just for me but for others who don’t have the time to sift through your well-detailed post :) Great job on it.

                      1. 3

                        Mortality outcomes for Congress are very sensitive to assumptions about spread and mortality. At the low end for each parameter, we can expect to lose 2 members of the house on average, and no senators. At the high end, we lose 6 senators and 20 house members. The age and sex distribution of the two parties means that on average the Republicans are slightly more at risk in the Senate, and the Democrats are dramatically more at risk in the House. In the worst case, these losses are:

                        Senate Republicans: 3
                        Senate Democrats: 3
                        House Republicans: 6
                        House Democrats: 14

                        1. 3

                          They have a link to the results in the second sentence of the post.

                        1. 2

                          I’m getting a 404 on the first link.

                          1. 1

                            Thanks for the heads up, fixed

                          1. 4

                            One big oversight in reaching the conclusion is that the mortality rates are based on observed (confirmed) cases of infection. This is likely a much smaller number than the actual infection count, since the full population isn’t tested, only hospital patients and other people in direct contact with the health care system. That implies a much lower actual risk of death.

                            1. 3

                              CFR is hard to estimate. Assuming that there are many uncounted cases is just that: an assumption. There is by definition no evidence to support that claim. For places that have sampled entire populations (a few Italian towns, Iceland), they haven’t found evidence that the virus is substantially widespread in the population, or at least not in numbers that would be necessary to drive down CFR substantially. At this point in the epidemic curve, something like 85% of cases are less than 2 weeks old, which means that currently measured fatalities are probably an underestimate. Most people who will eventually die haven’t yet.

                              1. 2

                                Agreed. I do believe though that it’s important to at least mention this consideration when presenting a rather alarming visual as is done in this article.

                            1. 2

                              You can get pretty far with PostgreSQL (and probably many other traditional SQL databases). Something like Liquibase for versioning the schema (and static data if you wanted) and an “audit log”[0] that takes transactions against tables and records them somewhere. Then you get unlimited (for as long as you want to store the audit log) undo, auditing, etc.
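                              The linked post covers PostgreSQL specifics; as a minimal, database-agnostic sketch of the same trigger-based audit-log idea, here it is in SQLite (so it runs anywhere with just the Python stdlib). The table and column names are made up for illustration:

                              ```python
                              import sqlite3

                              conn = sqlite3.connect(":memory:")
                              conn.executescript("""
                              CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT);

                              -- Append-only audit log recording every change.
                              CREATE TABLE audit_log (
                                  id INTEGER PRIMARY KEY AUTOINCREMENT,
                                  table_name TEXT, op TEXT,
                                  old_row TEXT, new_row TEXT,
                                  at TEXT DEFAULT CURRENT_TIMESTAMP
                              );

                              CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
                                  INSERT INTO audit_log (table_name, op, old_row, new_row)
                                  VALUES ('customers', 'INSERT', NULL,
                                          NEW.name || ' <' || NEW.email || '>');
                              END;

                              CREATE TRIGGER customers_upd AFTER UPDATE ON customers BEGIN
                                  INSERT INTO audit_log (table_name, op, old_row, new_row)
                                  VALUES ('customers', 'UPDATE',
                                          OLD.name || ' <' || OLD.email || '>',
                                          NEW.name || ' <' || NEW.email || '>');
                              END;
                              """)

                              conn.execute("INSERT INTO customers (name, email) VALUES ('Ada', 'ada@example.com')")
                              conn.execute("UPDATE customers SET email = 'ada@new.example' WHERE name = 'Ada'")

                              for op, old, new in conn.execute("SELECT op, old_row, new_row FROM audit_log"):
                                  print(op, old, new)
                              ```

                              The same pattern works in PostgreSQL with PL/pgSQL trigger functions; the key property is that the log is written by the database itself, so application-layer bypass doesn’t evade it (though a DB admin still can, which is where the git-style distributed history argument comes in).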

                              It’s not the same certainly and merging across databases gets tricky, but for many use cases it’s probably good enough.

                              0: https://severalnines.com/database-blog/postgresql-audit-logging-best-practices

                              1. 2

                                Yeah, if what you’re looking for is “versioned database,” there are lots of mature and performant commercial solutions. What none of them really address is what Git is good at: collaboration and distribution. If you have multiple independent parties contributing to a data set, standard DBs don’t really give you any help to make it all work. All the features we’re used to in source control (forking, merging, PRs) just don’t exist. Even among data sharing sites like Kaggle, it’s impossible to contribute edits to someone else’s data without emailing CSVs back and forth. We built Dolt specifically to address the collaboration use case.

                                1. 2

                                  Agreed. Besides the obvious of just giving them an account in your PG instance, or access to your liquibase git repo.

                                  Dolt seems useful for public datasets, but I don’t see much value in private datasets, since if you are managing access anyway, the obvious solutions above seem easier, and you get all the benefits of a nice SQL database.

                                  Anyways, I really enjoyed your post and the new perspective, as before now I never saw any value or even thought about git for databases really. Now I’ve started to think about it, so thanks!

                                  1. 3

                                    Well, I hope you’re wrong about the value in private datasets. Kind of building a whole company around that model :p

                                    The way I think about this is, whenever the primary interaction mode for data is human-curated, offline edits, maybe with human review on merge, then the git model of collaboration has a ton of value. We are betting that there are actually very many private data sets like this. It’s not a good match for OLTP, or most application backends that get frequent writes. But compare it to the state of the art for sharing data, which is literally still emailing CSV files around, and the value is obvious.

                                    1. 1

                                      I can think of a few human-curated, human-merge datasets at $WORK, but we solved this problem through just having an approval system as part of our PG DB (i.e. person A requests change Z and person B approves change Z).

                                      Your approach seems to do the same thing(s) but using the git model of collaboration, so it’s much fuzzier who can approve change Z, whereas my model is hard-defined (via some other tables, groups, etc.) as to who is allowed to approve change Z. For us, the hard definitions are a win, but I’m in a heavily regulated organization that’s > 100 years old. I could see the softer approval definitions being useful in fuzzier orgs, or early in an org’s life.

                                      I agree about CSV. I def. could see some value around sharing of data across orgs, or maybe even inside of orgs with something like this, but a lot of the time this sharing revolves around reporting, which means you need strong data exploration tools (which is usually just $spreadsheet for CSV).

                                      But thinking about this more, I can see a use-case for tentative edits, where some $bossy type wants a bunch of changes made and the software isn’t really built for that change, so we end up sending CSV files back and forth between us and $bossy type(s). For us the value would be in being able to then integrate the “finished” data back into our PG/SQL DB. I don’t see $bossy type navigating dolthub though. They can barely handle $spreadsheet some of the time.

                                      I’m not sure I’m a target audience for dolthub. Anyways, good luck, and for your sake I hope I’m wrong too!

                              1. 8

                                Today I’m writing perl scripts to parse wikipedia entries on house and senate membership, with the ultimate goal of joining this data to coronavirus fatality rates for age and sex, so that I can simulate how the virus might shift legislative power.
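                                  The simulation step described above can be sketched as a simple Monte Carlo: for each member, draw infection and then death according to age/sex-specific rates, and average over many trials. Everything below is a placeholder I made up for illustration — the member list, the attack rate, and the fatality rates are not the real data used in the post:

                                  ```python
                                  import random

                                  # Hypothetical (age_bracket, sex) -> case fatality rate.
                                  # Placeholder numbers, NOT actual COVID-19 estimates.
                                  cfr = {("60-69", "M"): 0.04, ("60-69", "F"): 0.02,
                                         ("70-79", "M"): 0.10, ("70-79", "F"): 0.06}

                                  # Hypothetical members: (name, party, chamber, age_bracket, sex).
                                  members = [
                                      ("Member A", "R", "senate", "70-79", "M"),
                                      ("Member B", "D", "senate", "60-69", "F"),
                                      ("Member C", "D", "house", "70-79", "F"),
                                      ("Member D", "R", "house", "60-69", "M"),
                                  ]

                                  def simulate_deaths(attack_rate: float, trials: int = 10_000, seed: int = 0):
                                      """Return each member's estimated death probability over the epidemic."""
                                      rng = random.Random(seed)
                                      totals = {m[0]: 0 for m in members}
                                      for _ in range(trials):
                                          for name, _party, _chamber, age, sex in members:
                                              # Independent draws: first infection, then death given infection.
                                              if rng.random() < attack_rate and rng.random() < cfr[(age, sex)]:
                                                  totals[name] += 1
                                      return {name: count / trials for name, count in totals.items()}

                                  probs = simulate_deaths(attack_rate=0.2)
                                  expected_deaths = sum(probs.values())
                                  print(f"expected deaths across all members: {expected_deaths:.3f}")
                                  ```

                                  With the real member roster joined to published rate tables, aggregating `probs` by party and chamber gives the kind of expected-loss numbers reported downthread.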

                                1. 4

                                  Not sure if helpful, but you may be able to get that data from https://www.wikidata.org in a less brittle way

                                  1. 2

                                    the ultimate goal of joining this data to coronavirus fatality rates for age and sex, so that I can simulate how the virus might shift legislative power.

                                    I’m confused. Are you trying to see how the number of eligible voters might change if a congressional district’s constituency suddenly dies off from the disease?

                                    1. 1

                                      No, just the reps themselves. Doing a deep dive on voting pattern changes based on voters dying would also be interesting, but a little beyond my current scope.

                                      I expect the result will be that power shifts not very much to either side, maybe a mode of 2 dead congressmen? But I want to actually run the numbers.

                                    2. 1

                                      That’s definitely interesting! Will you be posting your results anywhere?

                                      1. 1

                                        Yup, that’s the plan! It will be a blog post here some time this week: https://www.dolthub.com/blog/

                                        1. 1

                                          That sounds fascinating. Do you use a lot of Perl at dolthub for data mangling like this?

                                          1. 1

                                            Yeah, perl is our main squeeze. A bunch of us are from Amazon back when Amazon was a perl shop. It’s a terrible language for programming in the large, but for text munging there’s really nothing better, even today.

                                    1. 2

                                      We’re building Git for Data: https://github.com/liquidata-inc/dolt

                                      Dolt is a SQL database that stores table data in a Merkle DAG of commits, identical to git. This means you can clone, fork, branch, and merge your databases in a distributed fashion. Merges happen on a row-by-row, cell-by-cell basis. The dolt command line is a clone of git, so there’s no learning curve for people already familiar with git. If you also know SQL, then dolt sql pops you into a SQL shell where you can select or modify the data, create new tables, etc. Or start a mysql compatible server with dolt sql-server. Then when you’re done making updates, you can dolt add .; dolt commit -m "updates". If you also are using DoltHub, then push your changes back to master with dolt push origin master.

                                      The mashup of SQL and git lets us do some interesting things. For example, I can run queries on a previous commit with SELECT * FROM table AS OF 'HEAD~', or on a branch with SELECT * FROM table AS OF 'branch-name'.

                                      Anyway, we’re pretty excited about contributing to this space. We have a blog that we’ve been updating several times a week if you’re interested in following along: https://www.dolthub.com/blog/

                                      1. 1

                                        Cool, I will check that out!

                                        1. 1

                                          Hope you find it useful. Feel free to message me with any questions! We’re also always interested in PRs if you have a contribution you want to make.

                                      1. 3

                                        I started using React about a month ago, and was expecting to hate it. I have an instinctive distrust for frameworks, but I have also heard a lot of whining and moaning about how complicated it is. Much to my surprise, everything just… worked? The declarative, functional style translated pretty much directly into a DOM with few surprises. Handling mutable state with useState() declarations mostly just worked, and was a far cry simpler than $watcher in Angular. I had to read a few doc pages for things like contexts and routes, but everything went surprisingly smoothly.

                                        At this point, React seems pretty great as far as I can tell. Am I just naive and the real pain is yet to come?

                                        1. 4

                                          I really like React as well and have been using it since 2013. It has essentially stayed the same since then, with the paradigm it championed influencing the UI libraries of big players like Apple and Google as well. It has a clear API surface, and does one thing very well. The biggest recent improvement was Hooks, and it changed the game for me and many other people too. The team is also continuing to come up with novel solutions to some pretty hard problems that have been historically unaddressed by JS libraries, such as Concurrent Mode. Not that everyone has these problems of course.

                                          It is one of the most mentioned keywords in job postings, and it’s here to stay. Your investment in it will certainly not be wasted.

                                          With that said, I’ll play the devil’s advocate and try to answer your question:

                                          It is not a framework. When viewed from the eyes of a novice, or someone who bootstraps a lot of projects or onboards a lot of new developers, the lack of conventions is quite tough. Every React codebase I’ve seen is wildly different, from state management to styling to code conventions to project structure to… you name it. In the end, for a frontend library not to even have a good solution for things like styling and routing is pretty out there.

                                          This didn’t stop it from getting to where it is of course, and more complete things like Angular, Vue etc have been suffering from backwards-compatibility issues in this fast moving platform, language and ecosystem. I read a quote by Vue’s author in which he said that (paraphrased) he likes Svelte’s approach, he would’ve liked to adopt some of it if not for backwards compatibility. And just a few moments ago Vue was the new hotness.

                                          When I bootstrap my own projects, I love it. I get to engineer my application just the way I want for the problem at hand. Usually choosing ReasonReact. But every time I join a new team using React, I have to learn everything from scratch, and a lot of your knowledge, war stories, preferences don’t transfer. Right now “we use React” tells me just as much as “we use Lodash” about a company’s engineering culture, programming paradigm etc.

                                          It’s also very easy to get overwhelmed with the sheer amount of choices you have to make, usually in a group of people with different opinions and experience, from the start and at every step of scaling your app and your team.

                                          1. 2

                                            So what you’re saying is: Hell is other engineers. Guess I have our lead web guy to thank for how easy it was to get started with our React app.

                                            1. 2

                                              Just that when it gets bad it can get very bad, and there’s nothing in JS and its ecosystem that stops you from building projects that drown in tech debt, have horrible performance, and completely go against React’s programming paradigm. You basically have to learn everything the hard way, or have awesome people on your team. For some the answer is more hand-holding and out-of-box featureful frameworks. For me it was using functional and immutable-by-default languages to use it at its best, and skip all the bad parts of JS.

                                              1. 2

                                                Just wanted to second @osener. I’ve been using React since 2015 across 3 companies and it felt quite different at each. Routing was different, state management was different, etc. Given all that, React is still my number 2 choice for web apps.

                                        1. 3

                                          This looks promising, as what I really want is the diff/blame versioning of git, but for data. Lots of people build versioned databases, but they are easy to bypass: you can change the underlying data by skipping the app layer and going straight to the db.

                                          I’d like to have a database that I can tell if even the admin has changed data and what they changed without having to invoke blockchains.

                                          1. 3

                                            This is exactly right. Blockchain is a major headache and terrible for efficiency / performance if what you really want is just an audit. The genius of git is that anybody can clone your repo and become an independent source of truth, including all past history. As long as multiple people have clones, it’s basically impossible to commit fraud without getting caught.

                                            1. 3

                                              Datomic will let you do this (and doesn’t use block chains).

                                            1. 2

                                              Is this project somehow related https://github.com/attic-labs/noms ?

                                              Noms is a decentralized database philosophically descendant from the Git version control system.

                                              1. 2

                                                We built a database product that uses noms as its backing store: https://github.com/liquidata-inc/dolt. It’s basically Git for data, with a command line that copies Git’s. See my comment elsewhere in this thread for details.

                                                1. 1

                                                  Interesting, thanks for the link!