Have been using FoundationDB a bunch; really a huge fan. It provides a simple API - key/value transactions and watches - but gives great performance and relatively easy operations that can scale to hundreds of CPUs.
I’m a notorious flakes opponent, so take my comment with a grain of salt, but I read a lot of things written by newcomers to Nix like this:
I see words like “flakes” and “derivations,” and I currently don’t know what they mean.
And it is really quite jarring. Flakes have such a PR machine behind them (through the consulting companies that push their adoption) that newcomers believe they’re some fundamental concept that needs to be understood, whereas in reality they are an abstraction layer on top of the fundamental concept and, in my opinion, should not be used by people who haven’t understood what’s happening below. In beginner chats (e.g. the @ru_nixos Telegram group) a huge portion of posts are problems people have with flakes, which they only have because of flakes, and which they would be able to solve if they had understood the fundamentals.
I’m just braindumping here, this isn’t intended to attack the newcomers that think like this, but the people that keep pushing out blog posts etc. that perpetuate it.
I don’t understand Nix’s language syntax, but it’s enough like JavaScript and Python that I can fake my way through at this point. But to use Nix effectively, I’m obviously going to need to learn the language.
Flakes have such a PR machine behind them (through the consulting companies that push their adoption)
This seems to imply that the consulting companies have something to gain by pushing flakes, but that doesn’t seem very plausible to me.
If anything, flakes solve problems that non-experts would have shot themselves in the foot with, which is something consulting companies could bill for.
For me personally, flakes were what unlocked Nix and NixOS for me. Inputs are explicit and locked, as opposed to living outside the declaration as channels require. (Technically inputs don’t have to be fully specified, because of the flake registry, but I wouldn’t use that.) With flakes, sharing config/code between NixOS system declarations has a structure I can follow and build on. They’re far from perfect, but without them I would probably not have gone as deep into the ecosystem.
Anecdotally I see just as many people in Discord struggling with flake-specific problems as I see struggling with channel problems.
Flakes vs. channels is a false dichotomy though; both are bolted-on features on top of the core concept. Note that almost no experienced non-flake users use channels, instead preferring to just pin nixpkgs commits directly.
I don’t follow this at all; despite considering myself more than proficient in Nix, I don’t even know how to use NixOS without channels or flakes. niv doesn’t even have an example of how to use it for NixOS configurations. Certainly as far as I know, nixos-rebuild is going to invoke the channel by default to build the config. I guess for non-NixOS scenarios, I see the point (builtins.fetchTarball to get a nixpkgs).
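A minimal sketch of that channel-free pinning for the non-NixOS case (the rev and sha256 here are placeholders, not real values):

let
  # Pin nixpkgs to an exact commit instead of relying on a channel.
  nixpkgs = builtins.fetchTarball {
    url = "https://github.com/NixOS/nixpkgs/archive/<rev>.tar.gz";
    sha256 = "<hash>";
  };
  pkgs = import nixpkgs { };
in
  pkgs.hello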
And I can’t count the … 3, 4, 5 dozen times that users have had issues because of channel management. And the complexity of supporting them because you have to interrogate the state of the world to determine if they’re on the right channel, oh no, are they really on the right channel since root/users have different channels, etc, etc.
That having been said, I do wish that there was a clearer, up-front understanding that flakes is “sugar” on top, and doesn’t fundamentally change the Nix underneath. (well, pure eval is a big thing too, but that feels like a different point).
Comments like this make me legit mad. I have spent actual time trying to learn this stuff and coming out frustrated, with all resources I’ve found pointing at one of these two solutions, and then this.
Just get the core concept y’all. Grok it. If you know you know.
You said flakes are a problem for beginners, and I pointed out channels are also a problem for beginners. Given that channels are the default for both Nix and NixOS installations, this is not a false dichotomy; it’s a valid concern. If we want to talk about experienced users, that’s a different topic and is moving the goalposts.
As you point out flakes are nothing revolutionary, they’re just a set of tooling on top of nix.
Also, while there is no actual relationship between foo and foo2, foo/v2 does tell you that it’s the same project, by that project’s own decision. For instance, in the Python ecosystem you have pypdf, pypdf2, 3, and 4.
PyPDF2 was a fork (with, I understand, the blessing of the original maintainer), which has since been renamed to pypdf and has replaced the original on PyPI. pypdf3 and pypdf4 were hostile forks of PyPDF2 which have apparently been abandoned.
PyPDF2 was a fork (with, I understand, the blessing of the original maintainer), which has since been renamed to pypdf and has replaced the original on PyPI. pypdf3 and pypdf4 were hostile forks of PyPDF2 which have apparently been abandoned.
Parts of me want to know more about this micro-drama, the rational parts tell me to stay the eff away.
Yeah, I’m pointing out it was a largely pointless complication in the ecosystem. I don’t see the benefit at all of having to update all tools to handle this edge case, not to mention the confusing way import paths map to the package name now.
Also, while there is no actual relationship between foo and foo2, foo/v2 does tell you that it’s the same project, by that project’s own decision. For instance, in the Python ecosystem you have pypdf, pypdf2, 3, and 4.
I don’t buy this argument; packages in Go are namespaced by domain. You can’t just hijack my github username and publish a package under it.
Yeah, I’m pointing out it was a largely pointless complication in the ecosystem. I don’t see the benefit at all of having to update all tools to handle this edge case, not to mention the confusing way import paths map to the package name now.
There is no complication? The article literally notes that module-unaware tools can ignore it and will work fine, awareness is useful in order to relate major versions of a program, which your scheme makes impossible.
As an example, the go tools now all must understand version branches - this includes godoc - and this is clearly a complication. Like I said, sqlite3 is an example of major software that is both a library and a program and that used semantic versioning without any support needing to be baked into the language.
As an example, the go tools now all must understand version branches - this includes godoc - and this is clearly a complication.
Literally the entire point of /v2 is to not be version branches.
Like I said, sqlite3 is an example of major software that is both a library and a program and that used semantic versioning without any support needing to be baked into the language.
An interesting thing about /v2 is that they just paved the cow path. There was already https://labix.org/gopkg.in which let you release gopkg.in/yaml.v1 and gopkg.in/yaml.v2 simultaneously. Go modules just made it so that this works for any source repo.
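Concretely, the major version just lives in the module path; a minimal go.mod sketch (the module path here is made up):

module example.com/yaml/v2

go 1.21

Module-unaware tools can treat the /v2 as an ordinary path segment, which is the point being made above.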
the confusing way import paths map to the package name now.
They don’t map at all. The import path tells you how to find a package, but the package name is independent of its import path. It never had to match the final path component. This is nothing new.
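A concrete example, using the pre-modules gopkg.in path mentioned above:

import "gopkg.in/yaml.v2" // referred to in code as yaml.Unmarshal etc.

The package clause in the source files, not the import path, decides the name.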
I don’t see the benefit at all of having to update all tools to handle this edge case.
If tools relied on the fact that the package name matches the final path component, the tools were buggy to begin with.
It never had to match the final path component. This is nothing new.
Was it commonly used before this version suffix?
If tools relied on the fact that the package name matches the final path component
Plenty of tools need to be version-aware now, including the go tools themselves and tools like godoc; none of it was needed in my opinion. I just don’t really see the benefit vs. having a new import path or repository.
As you can see, the ‘v2’ part is in a branch, not in the path. Maybe I am wrong, but I cannot see how the go tools can handle this without some special logic to disambiguate the version suffix, especially when you ask for v1. All of a sudden, if you develop Go tooling you need support for poking inside branches of VCS repositories just to work out which version you mean.
Because the goal is to replace math/rand for future use, not have a second, parallel API; declaring a v2 is quietly sunsetting the v1 API in Go. If people still need it, they can keep importing math/rand instead, but they live with the caveat that updates and fixes to that version of the code are unlikely. The biggest thing that rsc is suggesting is changing the way you initialize randomness. For most people, the changes will be simple enough to apply with automated tools.
I actually fundamentally disagree with Russ’s take here that people shouldn’t be using this for arbitrary random byte sequences. Noisy, random data is great for some things, but we’re not in crypto land here where real randomness is important, and it could be aliased to crypto/rand’s byte source.
It would retroactively fix a lot of bad software that does depend on it for cryptographic random bytes if the one in rand quietly called the crypto version under the hood.
It would however break software that depends on seeding the RNG in a particular way, though that is a different issue that Russ brings up, and I do agree with him on some level.
I actually fundamentally disagree with Russ’s take here that people shouldn’t be using this for arbitrary random byte sequences.
rand.Read is already deprecated because it has very low utility (getting non-crypto random bytes is a lot less common than getting cryptographically secure random bytes) and it’s easily mistaken for crypto’s (possibly after refactorings or because the editor’s completion is not noticed) which has been the source of security troubles.
Note that rand.Rand.Read is still valid. Plus it’s not like rand.Rand.Read does much that’s useful in the first place: it just fills the buffer by pulling 8 bytes at a time from the rng.
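A minimal sketch of the mix-up being described; the two packages have call-compatible Read functions, so only the import line distinguishes them:

package main

import (
	crand "crypto/rand" // cryptographically secure source
	mrand "math/rand"   // deterministic PRNG, not secure

	"fmt"
)

func main() {
	key := make([]byte, 16)
	crand.Read(key) // what you want for key material
	mrand.Read(key) // compiles just as happily; deprecated since Go 1.20
	fmt.Printf("%x\n", key)
}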
You are not born into Discord Nitro. That is the defining characteristic of a caste system.
How is Discord Nitro different from any other display of wealth? If the problem is with wealth distribution in general, why not generalize the argument?
I think it’s entirely compatible to be against wealth inequality in general, and also be against adding additional wealth stratification into new social contexts.
Caste can be decoupled from wealth. The literature of previous ages is full of accounts of the poor but honorable gentleman of birth being put upon by the wealthy parvenu.
Seems like a distribution/packaging solution for clang that happens to be written in zig. Not sure they need to be combined into the zig programming language at all.
The build system is implemented in Zig, requires you to write Zig, and is another important component. From there you have a long series of improvements in Zig when it comes to using C libraries.
More generally, we’re seeing more adoption of Zig as a build system than as a language simply because that’s both the most stable part of the toolchain and the expected starting point. While it’s obviously not guaranteed that the language will take off, the current state of affairs is what you would expect to see in the success scenario as well.
It doesn’t have to be. Zig just did the work needed to package the compiler and libc properly for all platforms, instead of continuing the tradition of saying “it’s not my problem” and pointing at somebody else.
GPT has an annoying habit of being almost right, to the point that it takes as much research to verify its results as it would to just do the work myself, and I’m still worried there’s something subtly wrong with what it tells me.
It’s not useless, and I use it, but I don’t trust it. If I did, it would have made a fool of me more than once.
I think the bigger problem is that it’s always confident. I’m working on a paper right now and I’m using pgfplots, which is a fantastic package for generating tables and graphs in LaTeX, with a thousand-page manual that I always spend ages reading when I use it, and then forget because I use it so rarely. This time, I used Bing Chat (which finds the manual and a load of stack-overflow-style answers to add into its pre-trained data). I tried to get it to explain to me how to get labels to use SI prefixes as contractions. It gave me a plausible answer along with an example and the output that I wanted. There was just one problem: copying its example verbatim into my TeX file gave totally different output. No variation on its example worked and I couldn’t prompt it to give me one that worked in half an hour of trying.
In contrast, when I wanted to create a bar chart with the same package, it basically told me what to do. Everything I wanted to change, I asked it for and it showed me a summary of a relevant part of the manual. For fairness, I had the manual open in another window and tried looking up the answer while it was processing and it was faster almost all of the time. For things that loads of people have done, but I can’t remember how to do (or never knew), it was a great tool.
Unfortunately, the responses in both cases were indistinguishable. Whether it’s giving me a verbatim quote from an authoritative source or spouting bullshit, it’s always supremely confident that its answer is correct. It definitely reminds me of some VPs that I’ve met.
If it produces an incorrect answer to a question, what stops it from “verifying” that initial incorrect answer? Or is this more like, another layer, another chance for it to maybe be accurate?
I couldn’t tell you, I don’t know how it works, just that on solutions I’ve known to be incorrect, asking that question has provided the expected corrections, so it’s doing something.
My understanding is that in GPT-4 data only flows in one direction; by asking it to reflect on its own answer, you give the network a chance to process its own thoughts in a way (an inner monologue?) at the cost of more compute time.
When asking it if it’s correct, the answer it gave previously will be part of the context, so it will be available from the first layer and better processed by all the layers of the Transformer. Whereas when you ask it for something, it will use some layers to figure out what you meant and in what direction to answer, meaning there is less compute available for producing the correct answer.
I think it’s super handy as an assistant where you give it limited trust. Some things I have used GPT4 for:
Generating type definitions from some example json documents.
Refactoring some go code to use generics where the old version used interface{}.
Generating some boilerplate code to process an .ics file in Python.
Generating benchmarks to quickly answer questions like: ‘generate me a benchmark to show how many nanoseconds it takes to create a goroutine?’ (a sketch of that one follows this list).
Asking how to check for a specific error condition and generate a unit test demonstrating that case.
Writing some boilerplate code - like a ./configure script that doesn’t depend on autoconf.
Generating unit tests for edge cases in some simple text processing functions.
Stubbing out a specification document based on a freeform text description.
Writing some simple boilerplate for processing CSV files.
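For the goroutine benchmark item above, roughly the kind of code I’d expect back (my sketch, not verbatim GPT output):

package main

import (
	"sync"
	"testing"
)

// BenchmarkGoroutineCreate reports ns/op for spawning (and waiting on)
// a goroutine; run with: go test -bench=GoroutineCreate
func BenchmarkGoroutineCreate(b *testing.B) {
	var wg sync.WaitGroup
	for i := 0; i < b.N; i++ {
		wg.Add(1)
		go func() { wg.Done() }()
	}
	wg.Wait()
}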
All these are easy to verify and tweak and don’t require much trust. I also didn’t feel like I was missing out on much learning by having the tool do these for me.
It definitely makes a lot of mistakes, but for me it crossed a threshold of usefulness and it will probably keep improving from here.
I have a paid subscription that gave me access to chatgpt4. All I really do is open a chat window in a browser tab on a second monitor, then ask it questions at times when I don’t want to waste time on something trivial that takes a while, or that I would normally be googling. While it is generating I can work on other things, then come back to what it has generated and integrate or modify it.
An example question might be:
Given the go function 'func NukeDirectory(foo string) error', could you generate me a set of go unit tests that cover cases like where the directory is read only or any other less common scenarios. Please use testing.T tempdir functions where possible.
Then in response chatgpt4 gave me this fairly reasonable output (which I included nearly all of as a demonstration):
...
func TestNukeDirectory(t *testing.T) {
t.Run("normalDirectory", func(t *testing.T) {
tempDir := t.TempDir()
err := NukeDirectory(tempDir)
if err != nil {
t.Errorf("Expected no error, got: %v", err)
}
if _, err := os.Stat(tempDir); !errors.Is(err, os.ErrNotExist) {
t.Errorf("Expected directory to be removed, got: %v", err)
}
})
t.Run("readOnlyDirectory", func(t *testing.T) {
tempDir := t.TempDir()
err := os.Chmod(tempDir, 0444)
if err != nil {
t.Fatalf("Failed to change directory permissions: %v", err)
}
err = NukeDirectory(tempDir)
if err == nil {
t.Error("Expected an error, got nil")
}
if !errors.Is(err, fs.ErrPermission) {
t.Errorf("Expected permission error, got: %v", err)
}
})
t.Run("nonExistentDirectory", func(t *testing.T) {
tempDir := filepath.Join(t.TempDir(), "nonexistent")
err := NukeDirectory(tempDir)
if err == nil {
t.Error("Expected an error, got nil")
}
if !errors.Is(err, fs.ErrNotExist) {
t.Errorf("Expected not exist error, got: %v", err)
}
})
t.Run("fileInsteadOfDirectory", func(t *testing.T) {
tempDir := t.TempDir()
tempFile := filepath.Join(tempDir, "file.txt")
err := os.WriteFile(tempFile, []byte("hello"), 0666)
if err != nil {
t.Fatalf("Failed to create file: %v", err)
}
err = NukeDirectory(tempFile)
if err == nil {
t.Error("Expected an error, got nil")
}
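		// Note: fs.ErrNotADirectory below is not defined in the io/fs package
		// (syscall.ENOTDIR is the closest real sentinel); the kind of subtly
		// wrong detail discussed upthread.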
		if !errors.Is(err, fs.ErrNotADirectory) {
			t.Errorf("Expected not a directory error, got: %v", err)
		}
	})

	t.Run("directoryWithSubdirectories", func(t *testing.T) {
		tempDir := t.TempDir()
		subDir := filepath.Join(tempDir, "subdir")
		err := os.Mkdir(subDir, 0755)
		if err != nil {
			t.Fatalf("Failed to create subdirectory: %v", err)
		}
		err = NukeDirectory(tempDir)
		if err != nil {
			t.Errorf("Expected no error, got: %v", err)
		}
		if _, err := os.Stat(tempDir); !errors.Is(err, os.ErrNotExist) {
			t.Errorf("Expected directory to be removed, got: %v", err)
		}
	})
}
If you signed up and tried GPT-4 then I accept your opinion - if you did not then your opinion is probably worth very little in this context - I think it’s hard to gauge something so new without actually trying it seriously.
I was always interested in Lua because it was nice and small, but I felt the language itself was a bit quirky, with some footguns … Also interested in Clojure, but not the JVM.
In my experience Fennel fixes about 90% of the footguns of Lua. About the only ones left are “1 based array indexing” and “referring to a nonexistent variable/table value returns nil instead of being an error”, which are pretty hard to change without fundamentally changing the runtime.
There’s quite a bit of history to how fennel came to be what it is today. It is correct that Calvin (creator of Janet) started it, but it would have just been an experiment in their github if it weren’t for technomancy’s interest in reviving/expanding on it. I don’t know if it is written down anywhere, but Phil did a talk at FennelConf 2021 about the history of fennel, which is the most detailed background for those interested. https://conf.fennel-lang.org/2021
I did a survey a while back about new lisps of the past 2 decades. IIRC the only one to evolve beyond a personal project and have multiple nontrivial contributors but not use Clojure-style brackets is LFE, but LFE was released only a few months after Clojure. It’s safe to say Clojure’s influence has been enormous.
However, Janet seems to take some characteristics of Clojure out of context where they don’t make sense. For instance, Janet has if-let even tho if-let only exists in Clojure because Rich hates pattern matching. Janet also uses Clojure’s style of docstring before arglist, even tho Clojure’s reason for doing this (functions can have multiple arglists) does not apply in Janet as far as I can tell.
Although there’s also the curse of Lisp where the ecosystem becomes fragmented
The other main influence of Clojure is not syntactic at all but rather the idea that a language specifically designed to be hosted on another runtime can be an enormous strength that neatly sidesteps the fragmentation curse.
Ahh very interesting, what were the others? (out of idle curiosity)
I think I remember Carp uses square brackets too.
There’s also femtolisp, used to bootstrap Julia, but that actually may have existed before Clojure as a personal project. It’s more like a Scheme and uses only parens.
I agree the runtime is usually the thing I care about, and interop within a runtime is crucial.
Here’s the ones I found in my survey; I omitted languages which (at the time) had only single-digit contributors or double-digit commit counts, but all of these were released (but possibly not started) after Clojure:
LFE
Joxa
Wisp
Hy
Pixie
Lux
Ferret
Carp
Fennel
Urn
Janet
Maru
MAL
All of these except Urn and LFE were created by someone who I could find documented evidence of them using Clojure, and all of them except Urn and LFE use square brackets for arglists. LFE is still going as far as I can tell but Urn has been abandoned since I made the list.
I was working on this as a talk proposal in early 2020 before Covid hit and the conference was canceled. I’d like to still give it some day at a different conference: https://p.hagelb.org/new-lisps.html
Implicit quoting is when lisps like CL or Scheme treat certain data structure literal notation as if the data structure were quoted despite there being no quote.
For example, in Racket you can have a vector #[(+ 2 3)]; without implicit quoting this is a vector containing 5, but with implicit quoting it contains the list (+ 2 3) instead, where + is a symbol, not a function. Hash tables also have this problem. It’s very frustrating. Newer lisps all avoid it as far as I know.
Not to take away from Clojure’s influence, just want to mention that Interlisp has square brackets, but with a different meaning. IIRC, a right square bracket in Interlisp closes all open round brackets.
Python has been trying to move toward a re-entrant VM for long time, with subinterpreters, etc. – I think all the global vars are viewed as a mistake. Aside from just being cleaner, it makes the GIL baked in rather than an application policy, which limits scalability.
This kind of API looks suboptimal to me. It would be nice to take something like a lua_State.
The interpreter is thread-local in Janet; you can actually swap interpreters on the thread too, so it doesn’t stop things like Rust async from working if you add extra machinery.
The main reason that I use Lua is Sol3. Lua itself is just Smalltalk with weird syntax, but the integration with C++ that you get from Sol3 is fantastic.
I can’t quite picture “spinning a signal around a circle” in my mind’s eye, can you? It is just stating in English what e^(i*2*pi*t) is, which traces a circle.
How about this:
“Probing your signal with a unit vector spinning at all angles between 0 and 2π to calculate how much of it lies along each angle”
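In symbols, that sentence is just the standard forward transform (a standard identity, nothing specific to this thread):

X(f) = ∫ x(t) e^(-i*2*pi*f*t) dt

where e^(-i*2*pi*f*t) is the unit vector spinning at frequency f, and the integral adds up how much of x(t) lies along it.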
I think that you’re half-right. Your approach is “simpler” in the sense that it’s more specific to signals, and that’s the context that we’ll usually care about. I wonder whether we could also be “simpler” in the sense that we would talk about conjugates instead of signals in particular. Maybe we could adapt the article’s sentence to something like:
To find the conjugate variable for a particular measurement, differentiate that variable with respect to its conjugate at that measurement, and average the values that the variable can take.
Let’s try it for position and (linear) momentum:
To find the momentum of a particle given our ability to measure its position, take the finite differences between each measured position, and average them.
Technically correct! The missing ingredient is time; we normally don’t want to talk about time because we are using time as one of our two domains for signal analysis, and this is a blind spot for the original one-sentence approach.
I’m not a lawyer, but personally I’m skeptical, I don’t think anything that github did is likely to have required a license in the first place. To the extent that copilot produces verbatim copies, it seems to do so only of tiny samples of code that have been replicated numerous times by humans before. I expect the court will find that to be fair use/de minimis copying and not actionable. Without the initial copyright infringement occurring, I don’t think many of the other claims survive this, they either require it to be copyright infringement as a precursor (e.g. the DMCA), or they require it to be unlawful.
I’m less sure what to think about the personal information claims.
Regardless, I’m pretty happy that this suit is happening. Getting clarity as to the law relating to AI models sooner rather than later will be good for everyone, and both sides here should have deep enough pockets to do a good job at arguing their side, so the decisions come out saying what they should say.
Getting clarity as to the law relating to AI models sooner rather than later will be good for everyone
It’s pretty clear already today; this litigation is rather a publicity stunt. The neural net itself is neither a copy nor a derived work of the training data; what the neural net spits out is computer generated in the first place and thus not protected by copyright law, unless the result is sufficiently similar to something human generated, which itself is sufficiently creative. A few code snippets will hardly suffice for this (and even if they do, it is very likely fair use according to current jurisprudence), but this must be judged on a case-by-case basis, not in a class action suit. I also can’t understand the outrage of many developers: on the one hand, people seem to take it for granted that others provide them code or services for free on a grand scale (e.g. Github hosting and additional features heavily used by the open source community); but at the slightest suspicion that they should give something away, all hell breaks loose.
the neural net itself is neither a copy nor a derived work of the training data; what the neural net spits out is computer generated in the first place and thus not protected by copyright law,
I don’t believe that this is settled precedent yet. In particular, it is clear that a neural network can memorise some of the input data. The fact that it’s a neural network doesn’t really matter - if it were a rule-based system with a database of code snippets that it could combine to produce output then it would be equally legal or illegal and that’s up to a court to decide.
unless the result is sufficiently similar to something human generated, which itself is sufficiently creative
That’s the crux of the matter. It is established that Copilot can generate exact copies of code. It is not yet established whether these snippets are sufficiently creative to merit copyright. This is a very tricky issue because it does not depend exactly on length. A two-line snippet might be copyrightable if it is doing something sufficiently original and a court agrees that it is a creative work. In that case, you may still be allowed to quote it but then you may have attribution requirements, depending on the jurisdiction. It is more likely that a long fragment is both considered a creative work and not covered by fair use, but some long things can be considered non-copyrightable (e.g. if they are mechanical implementations of a published specification).
Well, we will see what comes out (likely not much).
it is clear that a neural network can memorise some of the input data
That’s not correct; the DNN doesn’t just make or memorize a copy. It might be able to reproduce parts of the training set, though this is not the actual purpose, but rather a coincidental and unwanted side effect (which occurs in less than 1% of cases according to Github officials, as far as I remember). Also note that it is not comparable to a simple database, not even a compressed or encrypted one, since there is no technical means to restore the original works used to train a DNN; it’s rather like a hash sum. The abstraction and transformation done by the DNN training algorithm is substantial; the original works are unrecognizable and unrecoverable; the DNN is thus no derivative work. Any other outcome of the trial would be a big surprise.
then it would be equally legal or illegal and that’s up to a court to decide
Storing copyrighted work is generally no violation of copyright law (in some countries it might be illegal if the copyrighted works were not legally acquired). This is established legal practice; we don’t have to wait for a decision.
This is a very tricky issue because it does not depend exactly on length
Not that tricky; there is well-established legal practice in this regard, with various precedents. If the DNN repeatably produced code sufficiently similar to existing code, the matter would have to be clarified in the individual case anyway, whereby the burden of proof of authorship, similarity, and copyright infringement would lie with the individual plaintiff; and the defendant in this case would not be Github, but the developers using the code in question.
It could be like lossy compression. If you make a shitty JPEG copy of a copyrighted artwork, the new bytes look nothing like the original, and you can’t even restore the original, but it may still be an infringement when it’s a close-enough copy.
You could also look at this from a higher level:
code goes in -> black box -> same code comes out
The complex implementation details of the black box may be irrelevant legally. If you put a copyrighted work in a paper shredder and then glue the shreds back together in the original order, even by chance, the court may not care how you did it, only that you have ended up making a copy of the original.
That’s essentially the concept of all electronic media today. If you take a picture of Mona Lisa, the camera, the memory card and the JPEG format in use are a blackbox for the majority of users; even though they are able to view or even publish the picture displaying Mona Lisa with little effort.
This also nicely demonstrates the present situation. Neither the manufacturer of the camera, nor the inventor of the JPEG format, nor the photographer making and keeping the picture is liable of copyright infringement. But if the photographer wants to publish the picture, a permission of the copyright holder may be necessary; this depends on what you can see on the picture, not on how the picture was taken or stored, or the slight quality loss of the format.
In the present case, the DNN is conceptually comparable to the photographer and the storage format; but the DNN doesn’t store a copy nor a “picture” of the original, but a statistical abstraction of certain features of millions of originals. So the DNN doesn’t simply transport or “publish” original content, but it is able to synthesize new content based on the feature abstractions.
It’s a similar process as when you write code, remembering concepts you have learned over the years (I am not talking about the widespread method here where developers simply copy existing code from the internet). If by chance something comes out of the DNN that resembles an existing work, the user still has the responsibility that copyright imposes on him, and the copyright holder still has all the possibilities that copyright grants him; but this is not Github’s responsibility.
JPEG compression transforms pixels into a completely different domain which does not visually resemble the original image at all (what Grant Sanderson calls the “Fourier world”); the only reason why this works is because we have a series of master theorems which establish the bidirectional connection between our visual world and its Fourier-transformed counterpart. But this is analogous to the situation with neural networks; they operate in a completely different perceptual domain than us, and we rely on theorems which state that such networks are faithful approximants of the desired functions.
If I take a JPEG-compressed image of a copyrighted work and alter its coefficients in a bidirectional transformation, producing an image which visually resembles the original work to a human observer, then I may have infringed. Similarly, if I take a neural-network-encoded pile of copyrighted source code and approximate its coefficients in a bidirectional transformation, producing source code which syntactically resembles the original work to a human observer, then I may have infringed. It doesn’t matter whether I tuned the coefficients by hand or used an automated tool to compute their new values; what matters is that the prior values were computed by summarizing copyrighted works.
But this is analogous to the situation with neural networks; they operate in a completely different perceptual domain than us, and we rely on theorems which state that such networks are faithful approximants of the desired functions.
That’s not the way Copilot or e.g. GPT-3 work. You can indeed approximate compression functions with DNNs, but that’s not what is done here. Anyway, even if they only implemented an indexing and search algorithm on the repositories, without any DNNs, this was no copyright infringement, even when search results would show snippets the same way they do today. There are already precedents for this.
the problem is not so much whether github is infringing but whether as a user of copilot you may unwittingly infringe someone’s copyright without having any way to know whether it happened - google is not infringing by serving up search results, but the fact that you found something on google doesn’t grant you any rights to use or republish that content in your own work
if copilot were just a search engine, again, github would be in the clear, but you would still need to check the license to see if you can use it. all that changes by making it a language model is that you can’t easily check so you never know if its output is safe to use in your own projects.
a key part of the complaint is the stripping of license information which they are responsible for preserving, a problem they would not have if they’d simply built a search engine
They’re not stripping license information; they synthesize snippets, a tiny fraction of which might resemble existing code (which is very likely for any sufficiently small snippet and thus barely avoidable). But let’s see what comes out; the litigation is now filed.
What is really interesting to me about this whole Copilot situation is how much the zeitgeist has completely flipped.
I remember years and years ago people proposed all sorts of weird multi-part scrambling schemes that would take an input and produce a bunch of seemingly-random data blocks, none of which could reproduce the input or a subset of it, but if you had all of them you could recombine them in a way that got back the exact input. And people literally thought this was an end run around copyright, since you could have, say, a P2P system where each peer distributes only a subset of the blocks needed to reconstitute a popular song or movie, and thus none of them were distributing a “copy” of it because none of those individual blocks could reconstitute it on their own – the fact that at the end you had a perfect copy didn’t matter, it was claimed, because only the intermediate distribution format mattered for copyright law.
And things like this would get lots of attention and hype on tech forums and be cheered on as proof of the incoherency of even the concept of copyright.
Now GitHub has invented something not too far off conceptually, and the tech community is screaming for it to be destroyed and arguing about how it’s the output that matters, not the intermediate format.
Now GitHub has invented something that they seem to think is actually a magic copyright remover, and the tech community is screaming for it to be destroyed.
I think that nobody wants to achieve any “destruction” here. All I want is that my copyright remains intact.
The best possible outcome is that the defense wins, thus striking a serious blow against the legal fiction of intellectual property. Yeah, I have no love for copilot. It is essentially strip-mining the commons. And Microsoft Github is another tentacle of the surveillance capitalism octopus. In this specific case, I’m rooting for the megacorp.
Though I have to admit, a teensy cut of that class action suit would sure help me out right now.
The litigation includes no such examples, which is a pretty strong signal to me that no such examples exist because it would seem to be the exact sort of sample that gives the best (still small IMHO) chance of winning.
In this case I’d say Tensorflow (or whatever NN library they use) is the algorithm provider not responsible for its usage, but Microsoft is the user feeding copyrighted data into it.
a partially trained DNN is kind of like a zip file with a bunch of files already in it - adding another one is going to take up less of the capacity if it’s similar to what’s already there, the trick is that information is shared - generative models are kind of like a lossy compressor whose compression artifacts take the form of making the input more generic, more like the training set (“faceapp yourself but don’t actually apply any filters” type distortion), and the degree of distortion is simply a factor of the model capacity
training a high capacity model on a small dataset inevitably memorises things verbatim, because the training task for these models is reconstruction; that they appear to be doing something else is mostly a factor of capacity limits and sometimes intentionally distortion-inducing sampling methods
and you can observe different degrees of distortion even in text models like copilot - depending on how common the code snippet it is reproducing is, and on your settings, it may reproduce existing snippets nearly exactly but with different variable names or commenting style, which shows that it has an internal representation that doesn’t necessarily need to store the “stylistic” details, but is still obviously close enough to be license infringement
when given a context that isn’t especially close to any single training sample it appears to be more “creative” as it’s having to mix together information gleaned from multiple samples and rely more on the surrounding context, but the big problem with copilot is you can never really know when it’s being “creative” and when it’s just coughing up a training sample almost exactly, so it’s always going to be kind of legally awkward to use
the real annoying part is that language model based code completers are still really useful when trained on much less code, and a lot of the code that copilot was trained on isn’t just encumbered by licenses that don’t allow unattributed copying, but is also poor quality. There is conceptually a more useful tool you could build with the same methods by being more selective about its training data, but copilot feels like GitHub and OpenAI trying to retroactively recoup the cost of training a huge model more than an intentionally designed product.
a partially trained DNN is kind of like a zip file
No. ZIP is a lossless, deterministic compression algorithm, in no way comparable to what the present DNN or its training algorithms do.
Ultimately, the degree of similarity and the degree of creativity of the snippet will be decisive. Unfortunately, however, the value of such snippets is greatly exaggerated in the present discussion. It is undisputed that in copyright law source code (unfortunately) automatically counts as a work, and this (even more unfortunately) also applies to parts of it. However, this is a perversion of the concept of a work. Because the probability that any snippet is present in any number of other source codes in a very similar way is close to 100%. Industry and open source developers are already suffering from the perverted use of patent law; now they are to be bothered also by the perverted use of copyright law. Judging whether or not a snippet meets the creativity requirements is usually arbitrary. Fortunately, problems with this kind of misuse of copyright can be circumvented with its own means relatively easily by simply rewriting the snippet.
The point of the analogy was it containing multiple items and sharing information, read “mpeg file” if you’re hung up on the lossy vs lossless distinction
They have legal ownership of the copy that is in their possession, given that they acquired it lawfully (which they did). The same way you own a book.
You can do lots of things to code without a license. Read it, lend it to your friends, sell it to the used code store, stick it on a shelf so your guests can admire how big a library you have, execute it, etc. They don’t have copyright, but they absolutely have normal property rights.
I think ownership might be the thing that’s at least debatable here. I don’t think GitHub owns the code it hosts. Similar to a web hoster not owning all the photos and your ISP not owning everything it caches or goes through the network.
Or, for a more IT comparison: if I am a code reviewer, some consultant or something, and code is given to me to inspect, that doesn’t mean I own it simply because I legally have the data on my hard drive. If said code were some service and I just ran it, the actual owner would likely be very unhappy.
I agree this is about copyright and not license. The question is whether what they do is some kind of fair use or anything you are allowed to do under copyright law.
I’d argue it’s not, because it doesn’t create a benefit for society, like most fair use does for example.
If it turns out it is, what would happen to, let’s say, anything that re-compresses an image, maybe lossily, as part of a service? They (likely) do that in this case even with the explicit authorization of the copyright owner. They run it through some algorithm and get something new out of it that kind of resembles the original, but not really, and certainly not in terms of bytes. Does that make them the owners?
Or what if someone simply wrote some “AI” that, let’s say, mostly strips comments, reorganizes code, maybe even just works with some sort of AST? Would it make the output owned by whoever runs it?
Does that mean one could make an “AI” that disassembles binaries, maybe makes some redundant changes, and outputs new modified binaries? Would that work?
What if it was more involved and you actually trained an NN and just taught it the bytes of some software or even a movie? You have a prompt where you can enter “The bytes in C:\videos\plan-9.mp4 are video files of Plan 9 from Outer Space. Remember this!”. It does, but not just by copying: by adding it into its (language) model. Then, since it’s your language model, you share it on the web. Someone else may download it and say “Hey there. I need the bytes for Plan 9 from Outer Space in C:\warez\plan9.mp4, please store them there for me”. Who holds the copyright on what the AI creates through its language model? It might even have learned to skip redundant license statements in software, strip FBI warnings from videos, and who knows what.
What if the AI does more? What if it even can “watch” and “learn” the movie, potentially scale it up for 4k monitors, output to any format, and knows how to change it just enough so any AIs looking for copyright infringements can’t differentiate it anymore? What if it can learn to even change movies, just enough that copyright lawyers consider them a new work of art?
Where do you draw the line? Where does what’s allowed under copyright law end?
I really don’t have the answer, but I think with copyright law a huge mess was created in the first place, because laws work best when they are something you can agree on at large, and they change or come down when a large number of people change their opinion (homosexuality, slavery, women voting, witchcraft, etc.). I don’t think there ever was broad agreement about copyright, and if it were applied to the letter and copyright holders really sued everyone who crosses its lines, the majority of people would have voted to abandon at least large parts of it.
Besides that, the line between being inspired by something, learning from something, or even learning something outright (think reciting a poem) is simply some form of copying with some translation. There already are huge existing debates on fair use; see sampling, mixing, etc., and laws that nobody feels comfortable enforcing, like those against singing copyrighted songs at parties or in other private settings in some countries.
I’d say all of this is at least something that’s not so clear in law, and whichever route it takes, I am sure there is potential for far-reaching effects whatever the conclusion turns out to be.
It seems that in the above “right” is being used to mean “moral right”, while many here are using “right” to mean “legal right”. Confusing those two things might be the source of some of the misunderstanding that I’m seeing here.
Copyright law says that there’s a list of things you can’t do without getting permission from the copyright holder.
For open-source/Free Software, the copyright holder does grant permission to do some of those things, in some ways, via a license.
But if the thing you are doing is not one of the list of things that requires the copyright holder’s permission, then the license terms are irrelevant, because you are not dependent on the license for your permission to do those things.
There is also the other option that if you have access to a piece of software under multiple potential license grants, one of which is more permissive than the other, you can choose the more permissive one without having to observe the less permissive one. I’ve pointed out in past threads about Copilot that I would not be surprised at all if the license grant embedded in GitHub’s terms of service turns out to be more than sufficient to allow everything Copilot does, for example.
No, if they don’t need a license I don’t think the particular licenses matter at all. Licenses grant permission to do things that were otherwise illegal under law with some conditions attached. If you didn’t do anything that was otherwise illegal, licenses don’t do anything.
If they did need a license, they’re obviously in trouble with pretty much any license, because they didn’t comply with pretty much any license (other than the CC0 and wtfpl style ones).
Some licenses grant you permission to reproduce some or all of the work provided you meet the conditions; Microsoft did not meet the conditions in those cases, yet they reproduced the works anyway.
How is violating copyright law not illegal? Or are you insinuating that copyright law doesn’t cover source code?
How is this any different than taking text from books in the library and trying to pass it off as your own work, while not paying dues to the copyright holders (or whatever conditions they have, if any)?
How is violating copyright law not illegal? Or are you insinuating that copyright law doesn’t cover source code?
I’m saying I don’t think they violated copyright law. I’m not saying it doesn’t cover source code, but that I don’t think it covers this kind of use of copyrighted material of any kind.
How is this any different than taking text from books in the library and trying to pass it off as your own work, while not paying dues to the copyright holders (or whatever conditions they have, if any)?
I don’t think the distinction between text from textbooks and code is relevant, if that’s what you’re asking. If you trained the same kind of model on an equally large collection of equally diverse textbooks and released it as a book-writing aid, I think you would have exactly the same legal status. (Edit: I should say “and served requests to the model as a book-writing aid”; releasing the model vs. serving requests is potentially legally relevant, with the latter being slightly more likely to be problematic IMO.)
I don’t think it’s fair to describe what’s happening here as “taking text from books in the library and trying to pass it off as your own work” though. There are many more steps happening here, and they’re achieving a much broader goal than rote copying pieces of text. And sure, sometimes the process manages to create a copy of a small piece of someone else’s copyrighted work that has been copied many times already into other places, but that’s de minimis copying and not copyright infringement.
It might be worth noting for instance, that Google won Google v. Oracle despite copying 11,500 lines of code. In part because that wasn’t a substantial portion of the work. I’d expect a similar analysis here.
The samples that were duplicated that they use to justify the lawsuit are things like part of an exercise from a textbook, it’s not a substantial portion of the book.
And I think they are violating copyright law, but I’m not a lawyer and you probably aren’t either. I hope this goes to trial so we get to hear some random judge’s opinion.
Your whole argument rests on “they didn’t follow the license!”
If their whole argument is “we didn’t need that license to do what we did”, then your argument is not really relevant. That’s what people are trying to get you to understand – the license terms may literally have no relevance whatsoever.
A license grants you rights to do things that would otherwise be prohibited by copyright law. You do not require a license to do things that are covered by Fair Use / Fair Dealings (delete as appropriate for your jurisdiction) or by other explicit statute law. For example, in the USA you do not require an explicit license to record a cover of a song because compulsory licensing is enshrined in statute law and so you just have to pay a fixed amount for every copy that you sell.
The argument (disclaimer: I work for MS but have no connection to this project) is that this kind of use is covered by explicit law. I don’t know precisely what the laws in question are; there are a few things on building machine learning systems and databases that may apply, but it will be up to the court to decide.
Whether they win or not, I think they’ve achieved their goal. We (MS) have spent a huge amount of time and money building a reputation as a good open source citizen. I’m able to hire great people on the basis of that and the expectation that they will be paid to contribute to the F/OSS ecosystem. Being on the other side of a lawsuit from the SFLC does a lot of damage to that reputation and, in some ways, winning a lawsuit against the SFLC would do more damage than losing.
A license is just that: a license to do something with a copyrighted work. It can’t take away rights that were already granted by copyright law, such as fair use.
Ordinary users of GitHub receive code under the license chosen by the person who posted the code.
GitHub has a choice between receiving code under that license, or under the license granted in GitHub’s terms of service.
GitHub can simply choose the more permissive of the two, in which case the more restrictive of the two is in fact irrelevant.
Think of it like any other dual-licensing scheme. Suppose I write a piece of software, we’ll call it foolib. And I offer it under a choice of BSD or AGPL. If you choose to receive foolib from me under the BSD offer, you will be able to do things with foolib that the AGPL would not have allowed. And you will be able to do that because the AGPL is not the license under which you received foolib and so is not the license which governs your use of it. No amount of yelling “that’s an AGPL violation!” would be relevant there.
Similarly, even if I only offer it under AGPL, you could still do certain things with it – such as fair use – without having to follow the AGPL’s terms. And again no amount of yelling “but that’s an AGPL violation!” would matter, because there are things copyright law still lets you do without needing to obtain or follow the terms of a license.
The point being made here is simply that saying “But that’s a license violation!” over and over is not relevant, because the original argument is that GitHub either has access under an alternative, more permissive license, or is doing things that do not require a license in the first place. In the former case, the only license terms which matter are the more permissive ones; in the latter case, no license terms matter.
These types of comments really drive me up a wall. It feels like what you are saying is “this is a common feature in other languages, the people behind Go (vague bad thing) since they didn’t add the feature too” which is just not sound reasoning.
In order to form a judgement about how bad something is, we should consider the consequences of it. The “normalness” of a behavior is an OK pointer, but that’s all it is.
Maybe you can argue that the consequences have been grave and thus this is a grave failure, but that doesn’t seem true to me.
I can’t argue that any of the things about Go’s design that people have been unproductively grouchy about in comments sections for the past decade have had grave consequences for any given Go adopter or for the widespread adoption of Go. Having written one short program in Go for my own purposes, the lack of a proper idiomatic iteration construct (no map, no iterating for loop, no list comprehension, no yielding to a block, just the apparently-hacky for-range) was flummoxing. Go’s designers are entitled to their priorities but idk I feel like I’m entitled to make fun of those priorities a little bit, especially because they get paid more than I do.
IMO there is a solid iteration idiom, and they hit on it early in the stdlib, although the fact that they managed to do it several different ways afterwards is a disappointment. It’s the
for iter.Next() {
	item := iter.Val()
	// ...
}
one. You can do pretty much anything with it, it doesn’t impose any burden on the caller, it doesn’t make the caller ugly, and you can implement it on top of pretty much anything else. With generics you could even codify it as an interface (parameterized on the return type of Val).
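With generics, that interface might be written as follows (a sketch with made-up names, not a stdlib definition):

type Iterator[T any] interface {
	Next() bool // advance; reports whether another item is available
	Val() T     // current item; valid only after Next returns true
}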
None of which is to say that I oppose this proposal — it looks pretty nice to me. But in 7+ years of writing Go professionally, the lack of an iterator syntax hasn’t been a major thorn in my side — or even a substantial annoyance.
I saw in the latest blog that the Zig for loop is being changed, but AFAIK there will still be no iterators. You basically do what you do in Go – write your own set of methods on a struct.
So it seems like Zig will be a ~2025 language without iterators
I would say languages have different priorities, and it’s harder than you think. Or you could just make vague statements about people doing bad things for no reason
Zig deliberately has no interfaces or traits whatsoever. You put the methods in the struct and they get called and it will compile if and only if the types work out after comptime propagation. I might be wrong but as far as I understand “iterators” in the language will be a bit of syntax sugar relying on a documented (but informal) interface, and Zig will very much have iterators exactly like, say, Julia or JavaScript or Python have iterators (except those languages check if things work out at runtime instead of compile time).
On the other hand the major selling point of Go is having interfaces enforced by the compiler. But a fast iteration interface needs to be a generic interface so that wasn’t really possible until recently…
Eh I don’t see what you’re saying. My point is that, as of now, Go and Zig are the same as far as iterators.
As of ~2025, Go might have iterators, and Zig probably won’t. Go is thinking about adding some concept of iterators to the language to enforce consistency.
Python’s for loop and list comprehensions understand iterators; I don’t think the same is true of Zig.
If you know otherwise, please provide a link to the docs.
The dominating quality of Go’s development over the past decade has been the most extreme caution when it came to adding features. You can’t have performant, extensible iteration without some kind of generics and they were stuck in place on that issue out of fear of C++ compile times until finally a couple of years ago.
You can’t have performant, extensible iteration without some kind of generics
It’s even stronger than that: if you do want to map’n’filter, you need a boatload of machinery inside the compiler to make that fast, in addition to a significant amount of machinery to make it expressible at all.
Rust’s signature for map is roughly
trait Iterator {
    type Item;

    fn map<B, F>(self, f: F) -> Map<Self, F>
    where
        F: FnMut(Self::Item) -> B;
}
That is, .map returns a struct, called Map, which is parameterized by the type of the original iterator Self, as well as the type of unnameable closure F. Meditating on this single example for a long time explains half of why Rust looks the way it does.
Go’s developers focused on higher priority concerns, such as pretty great performance, a pretty great (though basic) type system, an awesome runtime and compilation model, and fantastic tooling. Go’s feature set (including the features it elided) made developers really, really productive compared with other languages.
While there are a few use cases that weren’t feasible without generics, the absence of generics made for some really interesting and compelling properties, like “everyone writes their code the same way (and thus any developer can jump into any other project and be immediately productive)” and “code is very concrete; people don’t usually try to make things overly abstract,” which aren’t present in other languages. It wasn’t actually as obvious that generics were the right choice as Go’s critics claim (whose analyses flatly pretended that there were no disadvantages to generics).
The net upside to generics (including iterators) was relatively small, so it makes sense that the decision was deferred.
Go is a Google language. If a proposal helps or is of benefit to Google, it’ll be added. If it’s bad for Google, it will be ignored. If it’s neutral, then the only concern Google has is how well does it externalize training costs for Google.
Google doesn’t really figure in at this level of discussion. The Plan 9 guys who made Go are the relevant actors for this. They were skeptical of generics, so it wasn’t a priority for Go 1.0. With no generics, a generic iterator protocol doesn’t make any sense, so that wasn’t in Go 1.0 either. Now Go has generics as of Feb. 2022, so there is a discussion about the best way to do an iterator protocol. This is the second proposal, which builds off of ideas from the first discussion and some ideas that had been in the issues tracker before that. It’s not really more complicated than that.
You’re obviously right that the decision-making process is entirely about Google’s desires, but I’d hesitate to assume that it’s necessarily utilitarian. Google does a lot of self-sabotage.
There is no standard way to iterate over a sequence of values in Standard ML, which dates from the 1970s/80s depending on who you ask, and is widely considered one of the most elegant of language designs. Something went badly wrong here or…?
After having to deal with Rust iteration for a bit and missing out on Python… I think the decent explanation here is that in more dynamic languages with stuff like coroutines it’s pretty easy to come up with a nice iterator protocol, but in more static languages it’s harder to come up with one that is flexible enough for the “right” use cases without being very hard to use.
Like C++ has iterators right? And they do the job but they’re kind of miserable to use (or at least were 5+ years back, I’m sure things are better now).
Combine that with the perennial generics debate meaning container types weren’t a thing, and “stuff things into arrays and use indices” feels like a pretty OK solution for a long time.
I think C++ iterators are uniquely awkward in their design, and it’s not an inherent design problem or any sort of static typing limitation.
C++ iterators are based around emulating pointer arithmetic with operator overloading, with a state awkwardly split between two objects. There’s no reason to do it this way other than homage to C and a former lack of for loop syntax sugar.
And C++ iterators aren’t merely tasked with iterating over a set once from start to finish, but double as a general-purpose description of a collection, which needlessly makes both roles harder.
There’s no reason to do it this way other than homage to C and a former lack of for loop syntax sugar.
I think this is a little unfair; the primary reason to do it this way is so that code, especially templates, works on pointers or iterators, e.g. being able to have a single implementation of something like std::find work for lists or pointers. It’s not a “homage” so much as a source-level interoperability consideration.
OK, “homage” is a poor way of phrasing it. But it’s still an “interoperability consideration” with pointer arithmetic and C’s way of doing things, rather than a ground-up iterator design. The messy end result is not because iteration is such a hard problem, but because preserving C legacy is messy.
Right, it’s not inherent to “designing iterators in a statically typed language.” Go doesn’t have a different language it’s trying to be incrementally adoptable from.
This would basically let you make Zig programs selectively memory-safe via conservative garbage collection - the linked article shows the overheads are quite low. Then you could just turn it on for the portions of programs and deployments that require that extra bit of safety.
I also think you could just expose it as a GC allocator rather than a quarantine allocator and let people take advantage of GC when they want it.
Disagree. Multiple binaries are a big enough problem that Apple created fat binaries. Linux also needs them, and there is FatELF, but it was not adopted. Typical Linux failure.
It’s really not clear to me why Apple did fat binaries, I suspect it was for compatibility with the exec family of system calls. NeXT had fat bundles with a different directory for every platform and architecture, allowing them to share resources and be thinned by just deleting some bits. Apple’s file format doesn’t give a space saving over this: they just concatenate the two binaries and slap a header on them. The Windows version allows sharing sections between the two (very useful for data, somewhat useful for code where the 32- and 64-bit versions of some x86 functions are the same). FatELF works the same way as the Apple versions and it’s really not clear to me what problem it solves: the number of situations where I want to provide a single download for multiple architectures on Linux/FreeBSD is very limited: people generally don’t provide a single binary for multiple Linux distributions, let alone architectures.
Cosmopolitan is interesting as a common platform that, in addition to providing a common binary, provides the same API independent of the underlying kernel. Once you have that, and (as they do) the ability to run the same functions on every platform with the same architecture, their form of fat binary becomes interesting because it’s only very slightly fatter than a single platform binary.
NeXTSTEP did have fat Mach-Os for a long while, identical to how modern macOS does it. The different directories were only for platform.
Regardless of how you do it (bundles or fat binaries), it makes it easier for things like Migration Assistant - snarf the entire contents of /Applications onto your new machine, and use the native slice on the new machine.
Regardless of how you do it (bundles or fat binaries), it makes it easier for things like Migration Assistant - snarf the entire contents of /Applications onto your new machine, and use the native slice on the new machine.
That’s true. I have done that for PowerPC to Intel, and was bitten by how well Rosetta worked: I thought the Core 2 was pretty slow because it wasn’t running VLC much faster than the G4. It turned out that VLC was not a fat binary and was just a PowerPC build. That’s not been an issue on other *NIX, because I do the migration by dumping the list of installed packages and installing it on the new machine. It’s also not an issue on iOS because the App Store does something equivalent.
AIUI, Cosmopolitan is a cross-platform bootloader for x86-64 binaries; it does not solve the same problem as universal binaries or FatELF, which is cross-architecture compatibility. They also take fundamentally different approaches, one embedding multiple compiled binaries, the other embedding an architecture-sensitive stub at the start.
You could maybe combine Cosmo with FatELF to get a cross-platform cross-architecture binary. But for true cross-architecture write-once-run-anywhere, what you want instead is some sort of low-level “portable assembly” that is compiled on the target, like JVM or CLR - at which point you don’t need Cosmo anymore.
Does anybody back up sqlite files with restic? I noticed that it’s not just as simple as pointing restic at the folder that contains the DB, as the content might get corrupted.
That’s in general a very bad idea. There are situations where you can back up a running database: when the database makes sure that a finished write won’t leave it in an inconsistent state (most serious databases do), and the filesystem is able to take snapshots at a point in time rather than in the middle of a write (ZFS can do that, for example).
And good software tends to have its backup procedures documented. I’d strongly recommend reading those for SQLite, but also for Postgres (it’s way too common to just go with an SQL dump, which has a great number of downsides).
Don’t blindly back up anything that might write. It could mean that your backup is worthless.
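For SQLite specifically, the safe route is to have SQLite itself produce the copy and back that up. A minimal sketch (file names are made up; assumes the sqlite3 CLI, SQLite >= 3.27 for VACUUM INTO, and restic on PATH):

package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	snap := "app.snapshot.db"
	// VACUUM INTO writes a complete, consistent database file, so the
	// backup never sees a half-written page. It refuses to overwrite an
	// existing file, hence the Remove first.
	os.Remove(snap)
	if out, err := exec.Command("sqlite3", "app.db",
		"VACUUM INTO 'app.snapshot.db'").CombinedOutput(); err != nil {
		log.Fatalf("snapshot failed: %v: %s", err, out)
	}
	// Point restic at the snapshot, not at the live database file.
	if out, err := exec.Command("restic", "backup", snap).CombinedOutput(); err != nil {
		log.Fatalf("restic failed: %v: %s", err, out)
	}
}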
Probably not, however I am investigating ways to support filesystems that don’t support file locks. One example is how sqlite3 does it, with an option to use a fallback instead of a filesystem lock (like attempting to create a directory).
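For reference, the directory trick works because mkdir is atomic: exactly one process can succeed in creating it. A minimal sketch (names and the path are hypothetical):

package main

import (
	"fmt"
	"os"
)

// acquireDirLock uses mkdir's atomicity as a portable stand-in for a
// filesystem lock on filesystems without real locking support. Note it
// leaves a stale lock behind if the process dies before releasing.
func acquireDirLock(path string) (release func() error, err error) {
	if err := os.Mkdir(path, 0o700); err != nil {
		return nil, err // already held by someone else, or a real error
	}
	return func() error { return os.Remove(path) }, nil
}

func main() {
	release, err := acquireDirLock("/tmp/repo.lock")
	if err != nil {
		fmt.Println("could not acquire lock:", err)
		return
	}
	defer release()
	fmt.Println("lock held; safe to touch the repository")
}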
Another alternative for asymmetric encryption is rdedup, but I haven’t found (or bothered writing myself) the tooling around it to make it a worthwhile backup solution for me.
I’m currently using restic on my file server, backing up to Wasabi s3-compatible storage. Works great.
I get the reason for this, but am a little bummed. Deno was a great chance to step away from the tire fire of the NPM/Node ecosystem and start fresh.
We want Deno to be accessible and solve people’s problems
…so they have decided to reintroduce all of the problems of the packages on NPM. T_T
I know it’s a little heavy-handed, but forcing people to bring over the stuff they actually care about to a better language (Typescript) is maybe the right answer.
I agree. There is simply so much crap on npm that nuking it and forging a new path appears the preferable option - I suppose a lack of meaningful uptake due to reliance on npm packages forces this, but still, it is a disappointment.
As of now Deno has no package manager and unsafe defaults for third party dependencies, including resolution (via DNS!) and linking at runtime. NPM at least locks dependency versions by default and offers protection against takeover attacks in addition to offering a way to automate dependency updates.
Whatever problems exist with NPM aren’t solved by going backwards in time instead.
Excel is a local maximum, which sucks because it’s not good enough. Ordinary people can use Excel, which is great, but then the date type is actively harmful, which is insane. It mangles zipcodes in spite of having been made by a US corporation for its whole existence! Like, I get it, sometimes you mangle foreign conventions due to unfamiliarity, but all of New England has its zipcodes mangled (their leading zeros get stripped because Excel insists everything is a number). That’s bad! And then because Excel is a local maximum, new products like Numbers and Sheets clone it instead of searching for a new maximum. It’s a pity because we can definitely do better.
I’ll take a stab at some things I think would make it much better for many of its uses:
Allow clearly separated display and editing views*: the current system involves hiding cells or referencing cells outside of the current view which I think could be handled better by having separate display and editing views. You can basically do this now but it’s really cumbersome.
A mechanism for running commands on sets of data and either replacing it in place or inserting the results somewhere: the current idiom is to write a bunch of formulas, copy, and then paste as values, which is horribly broken.
Better delineation of tables within Excel which would also provide better guard rails against things like sorting a single column when you meant to sort a whole table.
Better handling for non-numeric data, this is touched upon in your complaint about zipcodes - Excel presumes that everything is fundamentally a number which I think it inherits from primarily financial spreadsheet programs.
Better tools for creating custom data types - related to the above.
Fundamentally, I’d say that my complaint about it is that Excel really, really wants to be a financial spreadsheet program when many (perhaps most) users want it to be a tool for working with generic tabular data.
* More broadly, I’d say this and several of my other points boil down to “Excel doesn’t do a great job at allowing the user to control complexity,” and this becomes clearer as the volume of data manipulated in it increases. I think that’s one of the differences between it and, say, VisiData. I prefer VisiData for my own use but the reality is that many of the people I interact with simply won’t learn to use tools like it rather than Excel (for a variety of reasons, some of which are fair).
A mechanism for running commands on sets of data and either replacing it in place or inserting the results somewhere: the current idiom is to write a bunch of formulas, copy, and the paste as values, which is horribly broken.
The Power Query feature is really good for this, though its transformation DSL is weaker than Excel formulas.
Better tools for creating custom data types - related to the above.
I’m hoping the new “linked data types” will help with this! Right now you can make custom data types but it’s pretty cumbersome. It’s pretty new though and I think they’re still improving it.
Not disagreeing at all, but curious what you think some of the big changes would be if a great designer did a total rethink, but incorporating the good stuff that works?
Take a look at Quantrix Modeller. Lotus had two spreadsheet products:
123, which was a VisiCalc clone. It used a rectangular grid because they thought that it would appeal to accountants.
Improv, which had a clean separation of data and formulae and used pivot tables as its core data type. You’d define a new column as a single formula, rather than copying and pasting. This is the one that accountants actually liked.
Excel is a 123 clone, as are most other spreadsheets. Quantrix Modeller is, as far as I know, the only surviving Improv clone. They have some great videos about why this model is better. It’s less error-prone, easier to change, and so on.
When most people say that they want a spreadsheet, what they actually want is a database with a rich set of numerical library routines.
I guess there are niches where you can still get away with this, but it’s become (even in the enterprise space, where it used to rule) less common to the degree that I often see products without at least some general pricing guidance published get pushed to the bottom of procurement lists simply because no one can be faffed to talk to a bunch of sales people to get quotes just to do a comparison.
I’m not imagining a big redesign. Just basic stuff: add types for timestamps, civil times, durations, locations, currency. Have a difference between something’s type and its display. Fix the 1900 leap year bug FFS. Default to not clobbering the next cell when something doesn’t fit. You could still have freeform A1 cells but you should push users towards using it like a database with proper rows and columns as much as you can. Design the app as though it were the most commonly used tool in business and science analysis and not whatever Dan Bricklin happened to think of in 1979.
I worry mere mortals (normal programmers) won’t be able to write a library that is both sync and async just due to how confusing it all is. I hope it’s not too hard in the end; interesting to see what they come up with.
Check out https://code.tvl.fyi/about/nix/nix-1p
Could you elaborate on what “core concept” you’re referring to?
Side note, “almost no experienced, non-flake users use channels” is a heck of a statement, do you have any evidence to back that up?
Comments like this make me legit mad. I have spent actual time trying to learn this stuff and coming out frustrated, with all resources I’ve found pointing at one of these two solutions, and then this.
Just get the core concept y’all. Grok it. If you know you know.
You said flakes are a problem for beginners, and I pointed out channels are also a problem for beginners. Given that channels are the default for both nix and nixos installations, this is not a false dichotomy; it’s a valid concern. If we want to talk about experienced users, that’s a different topic and is moving the goalposts.
As you point out flakes are nothing revolutionary, they’re just a set of tooling on top of nix.
I personally never use channels or flakes - I just manually use nixpkgs as a git submodule and set NIX_PATH. I find that far better.
Literally every guide that’s coming out now takes flakes as a starting point.
Surely all of these people must be wrong… How could it be any different?
But seriously: I haven’t seen even a case made for coherent non-flakes usage of Nix. Let me know what I’ve missed.
Why couldn’t it just be math/rand2? Why did they need to special-case this ‘v2’ stuff? It sort of irks me, to be honest.
In general you could then import things as:
Sqlite3 did it right in C, which has no special support in the language for this.
Because the Go project has decided that things should work this way (https://go.dev/blog/v2-go-modules).
Also because there is no actual relationship between foo and foo2, while foo/v2 does tell you that it’s the same project’s decision. For instance, in the Python ecosystem you have pypdf, pypdf2, 3, and 4. pypdf2 was a fork (with, I understand, the blessing of the original maintainer), which has since been renamed to pypdf and has replaced the original on PyPI. pypdf3 and pypdf4 were hostile forks of pypdf2 which have apparently been abandoned.
Parts of me want to know more about this micro-drama; the rational parts tell me to stay the eff away.
Yeah, I’m pointing out it was a largely pointless complication in the ecosystem. I don’t see the benefit at all of having to update all tools to handle this edge case, not to mention the confusing way import paths map to the package name now.
I don’t buy this argument, packages in go are namespaced by domains. You can’t just hijack my github username and publish a package under it.
There is no complication? The article literally notes that module-unaware tools can ignore it and will work fine, awareness is useful in order to relate major versions of a program, which your scheme makes impossible.
As an example, the go tools now all must understand version branches - this includes godoc - which is clearly a complication. Like I said, sqlite3 is an example of major software that is both a library and a program and that used semantic versioning without any support needing to be baked into the language.
Literally the entire point of /v2 is to not be version branches.
Yes you’ve made that assertion a lot.
An interesting thing about /v2 is that they just paved the cow path. There was already https://labix.org/gopkg.in which let you release gopkg.in/yaml.v1 and gopkg.in/yaml.v2 simultaneously. Go modules just made it so that this works for any source repo.
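Concretely, the mechanics look like this; example.com/foo is a hypothetical module, but the rules are the real ones. The v2 tree declares module example.com/foo/v2 in its go.mod, and a consumer can then build against both majors at once:

package main

// Both major versions can coexist in one build, because the /v2 suffix
// is part of the module path itself, not a VCS branch name.
import (
	_ "example.com/foo"    // resolves to the latest v1.x.y
	_ "example.com/foo/v2" // resolves to the latest v2.x.y
)

func main() {}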
They don’t map at all. The import path tells you how to find a package, but the package name is independent of its import path. It never had to match the final path component. This is nothing new.
If tools relied on the fact that the package name matches the final path component, the tools were buggy to begin with.
Was it commonly used before this version suffix?
Plenty of tools need to be version aware now, including the go tools themselves and tools like godoc; none of it was needed, in my opinion. I just don’t really see the benefit vs. having a new import path or repository.
You not infrequently see a tool at github.com/user/go-thingie where the package name is thingie without the “go-”.
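A concrete real-world instance: github.com/mattn/go-sqlite3 declares package sqlite3, so code that imports it refers to sqlite3, not go-sqlite3. A minimal sketch:

package main

import (
	"fmt"

	"github.com/mattn/go-sqlite3" // path ends in "go-sqlite3"...
)

func main() {
	// ...but source code uses the declared package name.
	var d sqlite3.SQLiteDriver
	fmt.Printf("%T\n", d)
}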
/v2 is “a new import path”.
Why do I see repositories like this all the time?
https://github.com/hanwen/go-fuse
As you can see, the ‘v2’ part is in a branch, not in the path. Maybe I am wrong, but I cannot see how the go tools can handle this without some special logic to disambiguate the version suffix, especially when you ask for the v1. All of a sudden, if you develop Go tooling, you need support for poking inside branches of VCS repositories just to work out which version you mean.
Because the goal is to replace math/rand for future use, not have a second, parallel API; declaring a v2 is quietly sunsetting the v1 API in Go. If people still need it, they can keep using the v1 import (plain math/rand), but they live with the caveat that updates and fixes to that version of the code are plausibly unlikely. The biggest thing that rsc is suggesting is changing the way you initialize randomness. For most people, the changes will be simple enough to apply with automated tools.
I actually fundamentally disagree with Russ’s take here that people shouldn’t be using this for arbitrary random byte sequences. Noisy, random data is great for some things, but we’re not in crypto land here, where the Real randomness is important, and it could be aliased to crypto/rand’s byte source.
Is there an advantage to aliasing crypto/rand.Read in math/rand/v2 vs. just importing crypto/rand.Read directly in your app?
It would retroactively fix a lot of bad software that does depend on it for cryptographic random bytes if the one in rand quietly called the crypto version under the hood.
It would however break software that depends on seeding the RNG in a particular way, though that is a different issue that Russ brings up, and I do agree with him on some level.
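For comparison, the direct import being asked about is only this much (a trivial sketch):

package main

import (
	crand "crypto/rand"
	"fmt"
)

func main() {
	// crypto/rand.Read fills the buffer from the OS's CSPRNG.
	buf := make([]byte, 16)
	if _, err := crand.Read(buf); err != nil {
		panic(err)
	}
	fmt.Printf("%x\n", buf)
	// The aliasing discussed above would amount to a math/rand-style
	// Read drawing from this same source under the hood, so existing
	// callers of the wrong package would get safe bytes retroactively.
}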
rand.Read is already deprecated because it has very low utility (getting non-crypto random bytes is a lot less common than getting cryptographically secure random bytes) and it’s easily mistaken for crypto’s (possibly after refactorings, or because the editor’s completion is not noticed), which has been the source of security troubles.
Note that rand.Rand.Read is still valid. Plus it’s not like rand.Rand.Read does much that’s useful in the first place: it just fills the buffer by pulling 8 bytes at a time from the rng.
You are not born into Discord Nitro. That is the defining characteristic of a caste system.
How is Discord Nitro different from any other display of wealth? If the problem is with wealth distribution in general, why not generalize the argument?
I think it’s entirely compatible to be against wealth inequality in general, and also be against adding additional wealth stratification into new social contexts.
Caste can be decoupled from wealth. The literature of previous ages is full of accounts of the poor but honorable gentleman of birth being put upon by the wealthy parvenu.
I was intending to respond to the second part of Janus’ comment.
Thanks for clarifying.
I imagine it is much harder to change your caste in some cultures than it is to save enough money to buy a Discord perk; seems pretty different to me.
zig cc is a great tool and even if I didn’t ever use zig for anything else I would probably continue to use it
I have cross-compiled c/c++/rust and go all pretty painlessly using zig cc
Seems like a distribution/packaging solution for clang that happens to be written in zig. Not sure they need to be combined into the zig programming language at all.
The build system is implemented in Zig, requires you to write Zig, and it’s another important component. From there you have a long series of improvements in Zig when it comes to using C libraries.
This video has a good recap of the various ways in which Zig is better than C at using C libraries: https://www.youtube.com/watch?v=Gv2I7qTux7g
More generally, we’re seeing more adoption of Zig as a build system than as a language, simply because that’s both the most stable part of the toolchain and the expected starting point. While it’s obviously not guaranteed that the language will take off, the current state of affairs is what you would expect to see in the success scenario too.
It doesn’t have to be. Zig just did the work needed to package the compiler and libc it properly for all platforms, instead of continuing the tradition of saying “it’s not my problem” and pointing at somebody else.
GPT has an annoying habit of being almost right, to the point that it takes enough research to verify its results that I might as well just do it myself, and I’m still worried there’s something subtly wrong with what it tells me.
It’s not useless, and I use it, but I don’t trust it. If I did, it would have made a fool of me more than once.
I think the bigger problem is that it’s always confident. I’m working on a paper right now and I’m using pgfplots, which is a fantastic package for generating tables and graphs in LaTeX, with a thousand-page manual that I always spend ages reading when I use it, and then forget because I use it so rarely. This time, I used Bing Chat (which finds the manual and a load of stack-overflow-style answers to add into its pre-trained data). I tried to get it to explain to me how to get labels to use SI prefixes as contractions. It gave me a plausible answer along with an example and the output that I wanted. There was just one problem: copying its example verbatim into my TeX file gave totally different output. No variation on its example worked and I couldn’t prompt it to give me one that worked in half an hour of trying.
In contrast, when I wanted to create a bar chart with the same package, it basically told me what to do. Everything I wanted to change, I asked it for and it showed me a summary of a relevant part of the manual. For fairness, I had the manual open in another window and tried looking up the answer while it was processing and it was faster almost all of the time. For things that loads of people have done, but I can’t remember how to do (or never knew), it was a great tool.
Unfortunately, the responses in both cases were indistinguishable. Whether it’s giving me a verbatim quote from an authoritative source or spouting bullshit, it’s always supremely confident that its answer is correct. It definitely reminds me of some VPs that I’ve met.
This is what I am finding too.
After it provides me with a solution, I always ask “is that correct?” so that it verifies its own output.
If it produces an incorrect answer to a question, what stops it from “verifying” that initial incorrect answer? Or is this more like, another layer, another chance for it to maybe be accurate?
I couldn’t tell you, I don’t know how it works, just that on solutions I’ve known to be incorrect, asking that question has provided the expected corrections, so it’s doing something.
My understanding is that in ChatGPT-4 data only flows in one direction; by asking it to reflect on its own answer, you give the network a chance to process its own thoughts in a way (an inner monologue?) at the cost of more compute time.
When asking it if it’s correct, the answer it gave previously will be part of the context, so it will be available from the first layer and better processed by all the layers of the Transformer. Whereas when you ask it for something, it will use some layers to figure out what you meant and in what direction to answer, meaning there is less compute available for producing the correct answer.
I’ll start taking this stuff more seriously when the parties lauding it have skin in the game: use a LLM-generated contract, for example.
I think it’s super handy as an assistant where you give it limited trust. Some things I have used GPT4 for:
All these are easy to verify and tweak and don’t require much trust. I also didn’t feel like I was missing out on much learning by having the tool do these for me.
It definitely makes a lot of mistakes, but for me it crossed a threshold of usefulness and it will probably keep improving from here.
Can you describe your workflow with GPT with regard to specifications?
I have a paid subscription that gave me access to ChatGPT-4. All I really do is open a chat window in a browser tab on a second monitor, then ask it questions when I don’t want to waste time on something trivial that takes a while, or that I would normally be googling. While it is generating, I can work on other things, then come back to what it has generated and integrate or modify it.
An example question might be:
Then in response ChatGPT-4 gave me this fairly reasonable output (which I included nearly all of, as a demonstration):
Unconvincing
Could you elaborate?
If you signed up and tried GPT-4 then I accept your opinion - if you did not, then your opinion is probably worth very little in this context - I think it’s hard to gauge something so new without actually trying it seriously.
This is very well written and motivated :)
I was always interested in Lua because it was nice and small, but I felt the language itself was quirky, with some footguns… Also interested in Clojure, but not the JVM.
Janet sounds interesting
In my experience Fennel fixes about 90% of the footguns of Lua. About the only ones left are “1 based array indexing” and “referring to a nonexistent variable/table value returns nil instead of being an error”, which are pretty hard to change without fundamentally changing the runtime.
Hm, I’ve seen both Fennel and Janet, but didn’t realize until now that both use the square brackets and braces from Clojure.
That’s cool, and a sign Clojure is influential. Although there’s also the curse of Lisp, where the ecosystem becomes fragmented.
Both languages were written by the same person if you weren’t aware
Which two?
Fennel and Janet are both from Calvin Rose.
Huh, I actually didn’t know that! @technomancy seems to be the de facto maintainer since 2020 or so.
There’s quite a bit of history to how fennel came to be what it is today. It is correct that Calvin (creator of Janet) started it, but it would have just been an experiment in their github if it weren’t for technomancy’s interest in reviving/expanding on it. I don’t know if it is written down anywhere, but Phil did a talk at FennelConf 2021 about the history of fennel, which is the most detailed background for those interested. https://conf.fennel-lang.org/2021
I did a survey a while back about new lisps of the past 2 decades. IIRC the only one to evolve beyond a personal project and have multiple nontrivial contributors but not use Clojure-style brackets is LFE, but LFE was released only a few months after Clojure. It’s safe to say Clojure’s influence has been enormous.
However, Janet seems to take some characteristics of Clojure out of context where they don’t make sense. For instance, Janet has if-let even tho if-let only exists in Clojure because Rich hates pattern matching. Janet also uses Clojure’s style of docstring before arglist, even tho Clojure’s reason for doing this (functions can have multiple arglists) does not apply in Janet as far as I can tell.
The other main influence of Clojure is not syntactic at all but rather the idea that a language specifically designed to be hosted on another runtime can be an enormous strength that neatly sidesteps the fragmentation curse.
Ahh very interesting, what were the others? (out of idle curiosity)
I think I remember Carp uses square brackets too.
There’s also femtolisp, used to bootstrap Julia, but that actually may have existed before Clojure as a personal project. It’s more like a Scheme and uses only parens.
I agree the runtime is usually the thing I care about, and interop within a runtime is crucial.
Here’s the ones I found in my survey; I omitted languages which (at the time) had only single-digit contributors or double-digit commit counts, but all of these were released (but possibly not started) after Clojure:
All of these except Urn and LFE were created by someone who I could find documented evidence of them using Clojure, and all of them except Urn and LFE use square brackets for arglists. LFE is still going as far as I can tell but Urn has been abandoned since I made the list.
I was working on this as a talk proposal in early 2020 before Covid hit and the conference was canceled. I’d like to still give it some day at a different conference: https://p.hagelb.org/new-lisps.html
That link is super cool. What do you mean by “implicit quoting”?
Thanks!
Implicit quoting is when lisps like CL or Scheme treat certain data-structure literal notation as if it were quoted, despite there being no quote.
For example, in Racket you can have a vector #[(+ 2 3)]; without implicit quoting this is a vector containing 5, but with implicit quoting it contains the list (+ 2 3) instead, where + is a symbol, not a function. Hash tables also have this problem. It’s very frustrating. Newer lisps all avoid it as far as I know.
Not to take away from Clojure’s influence, just want to mention that Interlisp has square brackets, but with a different meaning. IIRC, a right square bracket in Interlisp closes all open round brackets.
Hm although now that I look, the VM doesn’t appear to be re-entrant like Lua
https://janet.guide/embedding-janet/
Python has been trying to move toward a re-entrant VM for long time, with subinterpreters, etc. – I think all the global vars are viewed as a mistake. Aside from just being cleaner, it makes the GIL baked in rather than an application policy, which limits scalability.
This kind of API looks suboptimal to me. It would be nice to take something like a lua_State.
The interpreter is thread-local in Janet; you can actually swap interpreters on the thread too, so it doesn’t stop things like Rust async from working if you add extra machinery.
The main reason that I use Lua is Sol3. Lua itself is just Smalltalk with weird syntax, but the integration with C++ that you get from Sol3 is fantastic.
I just migrated the backend of managed bupstash backup repositories* to hetzner, now I need to make and push a new release of bupstash itself.
*https://bupstash.io/ is the encrypted deduplicated backup tool I have written.
Wouldn’t it be simpler not to even talk about energy and spinning etc? Here is a simpler one sentence:
FT is an operation for calculating which frequencies your signal contains and how much.
You are explaining what it is but not how it works.
I can’t quite picture “spinning a signal around a circle” in my mind’s eye, can you? It is just stating in English what e^(i*2*pi*t) is, which traces a circle.
How about this:
“Probing your signal with a unit vector spinning at all angles between 0 and 2π to calculate how much of it lies along each angle”
This I can picture..
Maybe just “wrap” might be better than “spin”.
Sure I can. It’s like an oscilloscope where the horizontal sweep is a polar sweep instead.
I think that you’re half-right. Your approach is “simpler” in the sense that it’s more specific to signals, and that’s the context that we’ll usually care about. I wonder whether we could also be “simpler” in the sense that we would talk about conjugates instead of signals in particular. Maybe we could adapt the article’s sentence to something like:
Let’s try it for position and (linear) momentum:
Technically correct! The missing ingredient is time; we normally don’t want to talk about time because we are using time as one of our two domains for signal analysis, and this is a blind spot for the original one-sentence approach.
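For reference, the spinning-unit-vector picture is just the kernel of the standard transform written out (standard definition, nothing bespoke):

X(f) = ∫ x(t) * e^(-i*2*pi*f*t) dt

where e^(-i*2*pi*f*t) is the unit vector spinning at frequency f, and the integral measures how much of x lies along it. Swapping in other conjugate pairs (like position and momentum) changes the variables, not the machinery.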
Court listener docket for those who want to follow along.
I’m not a lawyer, but personally I’m skeptical, I don’t think anything that github did is likely to have required a license in the first place. To the extent that copilot produces verbatim copies, it seems to do so only of tiny samples of code that have been replicated numerous times by humans before. I expect the court will find that to be fair use/de minimis copying and not actionable. Without the initial copyright infringement occurring, I don’t think many of the other claims survive this, they either require it to be copyright infringement as a precursor (e.g. the DMCA), or they require it to be unlawful.
I’m less sure what to think about the personal information claims.
Regardless, I’m pretty happy that this suit is happening. Getting clarity as to the law relating to AI models sooner rather than later will be good for everyone, and both sides here should have deep enough pockets to do a good job at arguing their side, so the decisions come out saying what they should say.
It’s pretty clear already today; this litigation is rather a publicity stunt. The neural net itself is neither a copy nor a derived work of the training data. What the neural net spits out is computer-generated in the first place and thus not protected by copyright law, unless the result is sufficiently similar to something human-generated which is itself sufficiently creative; a few code snippets will hardly suffice for this (and even if they do, it is very likely fair use according to current jurisprudence), but this must be judged on a case-by-case basis, not in a class action suit. I also can’t understand the outrage of many developers: on the one hand, people seem to take it for granted that others provide them code or services for free on a grand scale (e.g. Github hosting and additional features heavily used by the open source community); but at the slightest suspicion that they should give something away, all hell breaks loose.
I don’t believe that this is settled precedent yet. In particular, it is clear that a neural network can memorise some of the input data. The fact that it’s a neural network doesn’t really matter - if it were a rule-based system with a database of code snippets that it could combine to produce output, then it would be equally legal or illegal, and that’s up to a court to decide.
That’s the crux of the matter. It is established that Copilot can generate exact copies of code. It is not yet established whether these snippets are sufficiently creative to merit copyright. This is a very tricky issue because it does not depend exactly on length. A two-line snippet might be copyrightable if it is doing something sufficiently original and a court agrees that it is a creative work. In that case, you may still be allowed to quote it but then you may have attribution requirements, depending on the jurisdiction. It is more likely that a long fragment is both considered a creative work and not covered by fair use, but some long things can be considered non-copyrightable (e.g. if they are mechanical implementations of a published specification).
Well, we will see what comes out (likely not much).
That’s not correct; the DNN doesn’t just make or memorize a copy; it might be able to reproduce parts of the training set, though this is not the actual purpose but a rather coincidental and unwanted side effect (which occurs in less than 1% of cases according to Github officials, as far as I remember). Also note that it is not comparable to a simple database, not even a compressed or encrypted one, since there is no technical means to restore the original works used to train a DNN; it’s rather like a hash sum; the abstraction and transformation done by the DNN training algorithm is substantial; the original works are unrecognizable and unrecoverable; the DNN is thus no derivative work; any other outcome of the trial would be a big surprise.
Storing copyrighted work is generally not a violation of copyright law (in some countries it might be illegal if the copyrighted works were not legally acquired). This is established legal practice; we don’t have to wait for a decision.
Not that tricky; there is well-established legal practice in this regard, with various precedents. If the DNN were to repeatedly produce code sufficiently similar to existing code, the matter would have to be clarified in the individual case anyway, whereby the burden of proof of authorship as well as similarity and copyright infringement would lie with the individual plaintiff; and the defendant in this case would not be Github, but the developers using the code in question.
It could be like lossy compression. If you make a shitty JPEG copy of a copyrighted artwork, the new bytes look nothing like the original, and you can’t even restore the original, but it may still be an infringement when it’s a close-enough copy.
You could also look at this from a higher level:
The complex implementation details of the black box may be irrelevant legally. If you put a copyrighted work in a paper shredder and then glue the shreds back together in the original order, even by chance, the court may not care how you did it, only that you have ended up making a copy of the original.
That’s essentially the concept of all electronic media today. If you take a picture of Mona Lisa, the camera, the memory card and the JPEG format in use are a blackbox for the majority of users; even though they are able to view or even publish the picture displaying Mona Lisa with little effort.
This also nicely demonstrates the present situation. Neither the manufacturer of the camera, nor the inventor of the JPEG format, nor the photographer making and keeping the picture is liable of copyright infringement. But if the photographer wants to publish the picture, a permission of the copyright holder may be necessary; this depends on what you can see on the picture, not on how the picture was taken or stored, or the slight quality loss of the format.
In the present case, the DNN is conceptually comparable to the photographer and the storage format; but the DNN doesn’t store a copy nor a “picture” of the original, but a statistical abstraction of certain features of millions of originals. So the DNN doesn’t simply transport or “publish” original content, but it is able to synthesize new content based on the feature abstractions.
It’s a similar process as when you write code, remembering concepts you have learned over the years (I am not talking about the widespread method here where developers simply copy existing code from the internet). If by chance something comes out of the DNN that resembles an existing work, the user still has the responsibility that copyright imposes on him, and the copyright holder still has all the possibilities that copyright grants him; but this is not Github’s responsibility.
JPEG compression transforms pixels into a completely different domain which does not visually resemble the original image at all (what Grant Sanderson calls the “Fourier world”); the only reason why this works is because we have a series of master theorems which establish the bidirectional connection between our visual world and its Fourier-transformed counterpart. But this is analogous to the situation with neural networks; they operate in a completely different perceptual domain than us, and we rely on theorems which state that such networks are faithful approximants of the desired functions.
If I take a JPEG-compressed image of a copyrighted work and alter its coefficients in a bidirectional transformation, producing an image which visually resembles the original work to a human observer, then I may have infringed. Similarly, if I take a neural-network-encoded pile of copyrighted source code and approximate its coefficients in a bidirectional transformation, producing source code which syntactically resembles the original work to a human observer, then I may have infringed. It doesn’t matter whether I tuned the coefficients by hand or used an automated tool to compute their new values; what matters is that the prior values were computed by summarizing copyrighted works.
That’s not the way Copilot or e.g. GPT-3 work. You can indeed approximate compression functions with DNNs, but that’s not what is done here. Anyway, even if they only implemented an indexing and search algorithm on the repositories, without any DNNs, this was no copyright infringement, even when search results would show snippets the same way they do today. There are already precedents for this.
the problem is not so much whether github is infringing but whether as a user of copilot you may unwittingly infringe someone’s copyright without having any way to know whether it happened - google is not infringing by serving up search results, but the fact that you found something on google doesn’t grant you any rights to use or republish that content in your own work
if copilot were just a search engine, again, github would be in the clear, but you would still need to check the license to see if you can use it. all that changes by making it a language model is that you can’t easily check so you never know if its output is safe to use in your own projects.
I recommend reading the filed complaint. And it is not Github’s duty to enforce law.
a key part of the complaint is the stripping of license information which they are responsible for preserving, a problem they would not have if they’d simply built a search engine
They’re not stripping license information; they synthesize snippets, a tiny fraction of which might resemble existing code (which is very likely for any sufficiently small snippet and thus barely avoidable). But let’s see what comes out; the litigation is now filed.
If the whole file is “this snippet and a copyright header”, the term “snippet” is misleading.
What is really interesting to me about this whole Copilot situation is how much the zeitgeist has completely flipped.
I remember years and years ago people proposed all sorts of weird multi-part scrambling schemes that would take an input and produce a bunch of seemingly-random data blocks, none of which could reproduce the input or a subset of it, but it you had all of them you could recombine in a way that got back the exact input. And people literally thought this was an end run around copyright, since you could have, say, a P2P system where each peer distributes only a subset of the blocks needed to reconstitute a popular song or movie, and thus none of them were distributing a “copy” of it because none of those individual blocks could reconstitute it on their own – the fact that at the end you had a perfect copy didn’t matter, it was claimed, because only the intermediate distribution format mattered for copyright law.
And things like this would get lots of attention and hype on tech forums and be cheered on as proof of the incoherency of even the concept of copyright.
Now GitHub has invented something not too far off conceptually, and the tech community is screaming for it to be destroyed and arguing about how it’s the output that matters, not the intermediate format.
I think that nobody wants to achieve any “destruction” here. All I want is that my copyright remains intact.
The best possible outcome is that the defense wins, thus striking a serious blow against the legal fiction of intellectual property. Yeah, I have no love for copilot. It is essentially strip-mining the commons. And Microsoft Github is another tentacle of the surveillance capitalism octopus. In this specific case, I’m rooting for the megacorp. Though I have to admit, a teensy cut of that class action suit would sure help me out right now.
The litigation includes no such examples, which is a pretty strong signal to me that no such examples exist because it would seem to be the exact sort of sample that gives the best (still small IMHO) chance of winning.
In this case I’d say Tensorflow (or whatever NN library they use) is the algorithm provider not responsible for its usage, but Microsoft is the user feeding copyrighted data into it.
a partially trained DNN is kind of like a zip file with a bunch of files already in it - adding another one is going to take up less of the capacity if its similar to what’s already there, the trick is that information is shared - generative models are kind of like a lossy compressor whose compression artifacts take the form of making the input more generic, more like the training set (“faceapp yourself but don’t actually apply any filters” type distortion), and the degree of distortion is simply a factor of the model capacity
training a high capacity model on a small dataset inevitably memorises things verbatim, because the training task for these models is reconstruction, that they appear to be doing something else is mostly a factor of capacity limits and sometimes intentionally distortion-inducing sampling methods
and you can observe different degrees of distortion even in text models like copilot - depending on how common a code snippet it is reproducing is and your settings, it may reproduce existing snippets nearly exactly but with different variable names or commenting style, which shows that it has an internal representation that doesn’t necessarily need to store the “stylistic” details, but is still obviously close enough to be license infringement
when given a context that isn’t especially close to any single training sample it appears to be more “creative” as its having to mix together information gleaned from multiple samples and rely more on the surrounding context, but the big problem with copilot is you can never really know when its being “creative” and when its just coughing up a training sample almost exactly so its always going to be kind of legally awkward to use
the real annoying part is that language model based code completers are still really useful when trained on much less code, and a lot of the code that copilot was trained on isn’t just encumbered by licenses that don’t allow unattributed copying, but is also poor quality. There is conceptually a more useful tool you could build with the same methods by being more selective about its training data, but copilot feels like GitHub and OpenAI trying to retroactively recoup the cost of training a huge model more than an intentionally designed product.
No. ZIP is a lossless, deterministic compression algorithm, in no way comparable to what the present DNN or its training algorithms do.
Ultimately, the degree of similarity and the degree of creativity of the snippet will be decisive. Unfortunately, however, the value of such snippets is greatly exaggerated in the present discussion. It is undisputed that in copyright law source code (unfortunately) automatically counts as a work, and this (even more unfortunately) also applies to parts of it. However, this is a perversion of the concept of a work. Because the probability that any snippet is present in any number of other source codes in a very similar way is close to 100%. Industry and open source developers are already suffering from the perverted use of patent law; now they are to be bothered also by the perverted use of copyright law. Judging whether or not a snippet meets the creativity requirements is usually arbitrary. Fortunately, problems with this kind of misuse of copyright can be circumvented with its own means relatively easily by simply rewriting the snippet.
The point of the analogy was it containing multiple items and sharing information, read “mpeg file” if you’re hung up on the lossy vs lossless distinction
Thanks. I have a formal education in both information technology and law.
Microsoft and you.
I don’t think GitHub has any right to other people’s work, unless granted by a license.
They have legal ownership of the copy that is in their possession, given that they acquired it lawfully (which they did). The same way you own a book.
You can do lots of things to code without a license. Read it, lend it to your friends, sell it to the used code store, stick it on a shelf so your guests can admire how big a library you have, execute it, etc. They don’t have copyright, but they absolutely have normal property rights.
I think ownership might be the thing that’s at least debatable here. I don’t think GitHub owns the code it hosts. Similar to a web hoster not owning all the photos and your ISP not owning everything it caches or goes through the network.
Or, a more IT comparison: if I am a code reviewer, some consultant or something, and code is given to me to inspect, that doesn’t mean I own it simply because I legally have the data on my hard drive. If said code were some service and I just ran it, the actual owner would likely be very unhappy.
I agree this is about copyright and not license. The question is whether what they do is some kind of fair use or anything you are allowed to do under copyright law.
I’d argue it’s not, because it doesn’t create a benefit for society, like most fair use does for example.
If it turns out it is, what would happen to, let’s say, anything that re-compresses an image, maybe lossily, as part of a service? They (likely) do that in this case even with the explicit authorization of the copyright owner. They run it through some algorithm and get something new out of it that kind of resembles the original, but not really, and certainly not in terms of bytes. Does that make them the owners?
Or what if someone simply wrote some “AI” that let’s say mostly strips comments, reorganizes code, maybe even just works with some sort of AST. Would it make the output owned by whoever runs it?
Does that mean one could make an “AI” that disassembles binaries, maybe makes some redundant changes, and outputs new modified binaries? Would that work?
What if it were more involved and you actually trained an NN, and just taught it the bytes of some software or even a movie. You have a prompt where you can enter “The bytes in C:\videos\plan-9.mp4 are video files of Plan 9 from Outer Space. Remember this!”. It does, but not just by copying, rather by adding it into its (language) model. Then, since it’s your language model, you share it on the web. Someone else may download it and say “Hey there. I need the bytes for Plan 9 from Outer Space in C:\warez\plan9.mp4, please store them there for me”. Who holds the copyright on what the AI creates through its language model? It might even have learned to skip redundant license statements of software, strip FBI warnings from videos, and who knows what.
What if the AI does more? What if it even can “watch” and “learn” the movie, potentially scale it up to 4k monitors, output to any format, knows how to change it just enough so any AIs looking for copyright infringements can’t differentiate it anymore? What if it can lean to even change movies, just enough so that copyright lawyers consider it a new work of art.
Where do you draw the line? Where does what’s allowed under copyright law end?
I really don’t have the answer, but I think copyright law created a huge mess in the first place, because laws work best when they are something people can broadly agree on, and they change or come down when a large number of people change their opinion (homosexuality, slavery, women voting, witchcraft, etc.). I don’t think there was ever broad agreement about copyright, and if it were applied to the letter and copyright holders really sued everyone who crossed its lines, the majority of people would have voted to abandon at least large parts of it.
Besides that, the line between being inspired by something, learning from it, or even learning it outright (think reciting a poem) is simply some form of copying with some translation. There are already huge existing debates on fair use, see sampling, mixing, etc., and laws that nobody feels comfortable enforcing, like those against singing copyrighted songs at parties or in other private settings in some countries.
I’d say all of this is at least not so clear in the law, so whichever route it takes, I’m sure there’s potential for far-reaching effects whatever the conclusion turns out to be.
Their ToS do grant them some rights, but their Copilot actively violates their own ToS:
https://docs.github.com/en/site-policy/github-terms/github-terms-of-service#4-license-grant-to-us
I don’t see at all how Copilot violates their own terms. Could you give an actual explanation, with detailed specific claims?
It seems that in the above “right” is being used to mean “moral right”, while many here are using “right” to mean “legal right”. Confusing those two things might be the source of some of the misunderstanding that I’m seeing here.
It depends on the particular licenses that are violated, I guess.
The argument is something like:
There is also the other option that if you have access to a piece of software under multiple potential license grants, one of which is more permissive than the other, you can choose the more permissive one without having to observe the less permissive one. I’ve pointed out in past threads about Copilot that I would not be surprised at all if the license grant embedded in GitHub’s terms of service turns out to be more than sufficient to allow everything Copilot does, for example.
No, if they don’t need a license I don’t think the particular licenses matter at all. Licenses grant permission to do things that were otherwise illegal under law with some conditions attached. If you didn’t do anything that was otherwise illegal, licenses don’t do anything.
If they did need a license, they’re obviously in trouble with pretty much any license, because they didn’t comply with pretty much any license (other than the CC0 and wtfpl style ones).
Some licenses grant you permission to reproduce some or all of the work provided you meet the conditions; Microsoft did not meet the conditions in those cases, yet they reproduced the works anyway.
How is violating copyright law not illegal? Or are you insinuating that copyright law doesn’t cover source code?
How is this any different than taking text from books in the library and trying to pass it off as your own work, while not paying dues to the copyright holders (or whatever conditions they have, if any)?
I’m saying I don’t think they violated copyright law. I’m not saying it doesn’t cover source code, but that I don’t think it covers this kind of use of copyrighted material of any kind.
I don’t think the distinction between text from text books and code is relevant if that’s what you’re asking. If you trained the same kind of model on an equally large collection of equally diverse textbooks and released it as a book writing aid I think you would have exactly the same legal status. (Edit: I should say “and served requests to the model as a book writing aid”, releasing the model vs serving requests is potentially legally relevant, with the latter being slightly more likely to be problematic IMO).
I don’t think it’s fair to describe what’s happening here as “taking text from books in the library and trying to pass it off as your own work” though. There are many more steps happening here, and they’re achieving a much broader goal than rote copying pieces of text. And sure, sometimes the process manages to create a copy of a small piece of someone else’s copyrighted work that has been copied many times already into other places, but that’s de minimis copying and not copyright infringement.
It might be worth noting for instance, that Google won Google v. Oracle despite copying 11,500 lines of code. In part because that wasn’t a substantial portion of the work. I’d expect a similar analysis here.
The samples that were duplicated that they use to justify the lawsuit are things like part of an exercise from a textbook, it’s not a substantial portion of the book.
And I think they are violating copyright law, but I’m not a lawyer and you probably aren’t either. I hope this goes to trial so we get to hear some random judge’s opinion.
Your whole argument rests on “they didn’t follow the license!”
If their whole argument is “we didn’t need that license to do what we did”, then your argument is not really relevant. That’s what people are trying to get you to understand – the license terms may literally have no relevance whatsoever.
What’s the point in any license if it’s not relevant here? My point is that argument, that the license is not relevant, is, uhh, not relevant.
A license grants you rights to do things that would not otherwise be covered by copyright law. You do not require a license to do things that are covered by Fair Use / Fair Dealings (delete as appropriate for your jurisdiction) or by other explicit statute law. For example, in the USA you do not require an explicit license to record a cover of a song because compulsory licensing is enshrined in statute law and so you just have to pay a fixed amount for every copy that you sell.
The argument (disclaimer: I work for MS but have no connection to this project) is that this kind of use is covered by explicit law. I don’t know precisely what the laws in question are; there are a few things on building machine learning systems and databases that may apply, but it will be up to the court to decide.
Whether they win or not, I think they’ve achieved their goal. We (MS) have spent a huge amount of time and money building a reputation as a good open source citizen. I’m able to hire great people on the basis of that and the expectation that they will be paid to contribute to the F/OSS ecosystem. Being on the other side of a lawsuit from the SFLC does a lot of damage to that reputation and, in some ways, winning a lawsuit against the SFLC would do more damage than losing.
Still, I’m glad it’s going to court and I even hope MS wins. Copyright has limits and that’s important.
We’ve had this before. Publishers wanting to block used-book sales, for example.
A license is just that: a license to do something with a copyrighted work. It can’t take away rights that were already granted by copyright law, such as fair use.
Ordinary users of GitHub receive code under the license chosen by the person who posted the code.
GitHub has a choice between receiving code under that license, or under the license granted in GitHub’s terms of service.
GitHub can simply choose the more permissive of the two, in which case the more restrictive of the two is in fact irrelevant.
Think of it like any other dual-licensing scheme. Suppose I write a piece of software; we’ll call it foolib. And I offer it under a choice of BSD or AGPL. If you choose to receive foolib from me under the BSD offer, you will be able to do things with foolib that the AGPL would not have allowed. And you will be able to do that because the AGPL is not the license under which you received foolib, and so is not the license which governs your use of it. No amount of yelling “that’s an AGPL violation!” would be relevant there.
Similarly, even if I only offer it under AGPL, you could still do certain things with it – such as fair use – without having to follow the AGPL’s terms. And again no amount of yelling “but that’s an AGPL violation!” would matter, because there are things copyright law still lets you do without needing to obtain or follow the terms of a license.
The point being made here is simply that saying “But that’s a license violation!” over and over is not relevant, because the original argument is that GitHub either has access under an alternative, more permissive license, or is doing things that do not require a license in the first place. In the former case, the only license terms which matter are the more permissive ones; in the latter case, no license terms matter.
I can’t help but feel like something went badly wrong here but what do I know.
These types of comments really drive me up a wall. It feels like what you are saying is “this is a common feature in other languages, the people behind Go (vague bad thing) since they didn’t add the feature too” which is just not sound reasoning.
In order to form a judgement about how bad something is, we should consider the consequences of it. The “normalness” of a behavior is an OK pointer, but that’s all it is.
Maybe you can argue that the consequences have been grave and thus this is a grave failure, but that doesn’t seem true to me.
I can’t argue that any of the things about Go’s design that people have been unproductively grouchy about in comments sections for the past decade have had grave consequences for any given Go adopter or for the widespread adoption of Go. Having written one short program in Go for my own purposes, the lack of a proper idiomatic iteration construct (no map, no iterating for loop, no list comprehension, no yielding to a block, just the apparently-hacky for-range) was flummoxing. Go’s designers are entitled to their priorities but idk I feel like I’m entitled to make fun of those priorities a little bit, especially because they get paid more than I do.
IMO there is a solid iteration idiom, and they hit on it early in the stdlib, although the fact that they managed to do it several different ways afterwards is a disappointment. It’s the “for it.Next() { … it.Val() … }” one. You can do pretty much anything with it, it doesn’t impose any burden on the caller, it doesn’t make the caller ugly, and you can implement it on top of pretty much anything else. With generics you could even codify it as an interface (parameterized on the return type of Val).
None of which is to say that I oppose this proposal — it looks pretty nice to me. But in 7+ years of writing Go professionally, the lack of an iterator syntax hasn’t been a major thorn in my side — or even a substantial annoyance.
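For illustration, here’s a minimal sketch of that idiom codified as a generic interface (the type and method names are my own, picked to match the discussion; they’re not from the stdlib or any proposal):

    package main

    import "fmt"

    // Iter is a hypothetical generic codification of the Next/Val idiom.
    type Iter[T any] interface {
        Next() bool // advance; report whether a value is available
        Val() T     // current value; valid only after Next returned true
    }

    // sliceIter adapts a slice to Iter, showing the idiom can sit on top
    // of pretty much anything.
    type sliceIter[T any] struct {
        s []T
        i int
    }

    func (it *sliceIter[T]) Next() bool {
        if it.i >= len(it.s) {
            return false
        }
        it.i++
        return true
    }

    func (it *sliceIter[T]) Val() T { return it.s[it.i-1] }

    func main() {
        var it Iter[int] = &sliceIter[int]{s: []int{1, 2, 3}}
        // The caller-side shape is the same flat loop regardless of what
        // backs the iterator.
        for it.Next() {
            fmt.Println(it.Val())
        }
    }

Whatever the backing data structure, the caller writes the same flat loop, which is exactly the “no burden on the caller” property being described.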
I’m pretty sure Zig has no concept of iterators either
https://ziglang.org/documentation/master/
I saw in the latest blog that the Zig for loop is being changed, but AFAIK there will still be no iterators. You basically do what you do in Go – write your own set of methods on a struct.
So it seems like Zig will be a ~2025 language without iterators
I would say languages have different priorities, and it’s harder than you think. Or you could just make vague statements about people doing bad things for no reason
(edit: this bug seems to confirm what I thought https://github.com/ziglang/zig/issues/6185)
Zig deliberately has no interfaces or traits whatsoever. You put the methods in the struct and they get called and it will compile if and only if the types work out after comptime propagation. I might be wrong but as far as I understand “iterators” in the language will be a bit of syntax sugar relying on a documented (but informal) interface, and Zig will very much have iterators exactly like, say, Julia or JavaScript or Python have iterators (except those languages check if things work out at runtime instead of compile time).
On the other hand the major selling point of Go is having interfaces enforced by the compiler. But a fast iteration interface needs to be a generic interface so that wasn’t really possible until recently…
Hopefully it all works out on both fronts.
Eh I don’t see what you’re saying. My point is that, as of now, Go and Zig are the same as far as iterators.
As of ~2025, Go might have iterators, and Zig probably won’t. Go is thinking about adding some concept of iterators to the language to enforce consistency.
Python’s for loop and list comprehensions understand iterators; I don’t think the same is true of Zig.
If you know otherwise, please provide a link to the docs.
The dominating quality of Go’s development over the past decade has been the most extreme caution when it came to adding features. You can’t have performant, extensible iteration without some kind of generics and they were stuck in place on that issue out of fear of C++ compile times until finally a couple of years ago.
It’s even stronger than that: if you do want to map’n’filter, you need a boatload of machinery inside the compiler to make that fast, in addition to a significant amount of machinery to make it expressible at all.
Rust’s signature for map is roughly fn map<B, F>(self, f: F) -> Map<Self, F> where F: FnMut(Self::Item) -> B. That is, .map returns a struct, called Map, which is parameterized by the type of the original iterator Self, as well as the type of the unnameable closure F. Meditating on this single example for a long time explains half of why Rust looks the way it does.
Go’s developers focused on higher priority concerns, such as pretty great performance, a pretty great (though basic) type system, an awesome runtime and compilation model, and fantastic tooling. Go’s feature set (including the features it elided) made developers really, really productive compared with other languages.
While there are a few use cases that weren’t feasible without generics, the absence of generics made for some really interesting and compelling properties–like “everyone writes their code the same way (and thus any developer can jump into any other project and be immediately productive)” and “code is very concrete; people don’t usually try to make things overly abstract” which aren’t present in other languages. It wasn’t actually as obvious that generics were the right choice as Go’s critics claim (whose analyses flatly pretended as though there were no disadvantages to generics).
The net upside to generics (including iterators) was relatively small, so it makes sense that the decision was deferred.
Go is a Google language. If a proposal helps or is of benefit to Google, it’ll be added. If it’s bad for Google, it will be ignored. If it’s neutral, then the only concern Google has is how well does it externalize training costs for Google.
Google doesn’t really figure in at this level of discussion. The Plan 9 guys who made Go are the relevant actors for this. They were skeptical of generics, so it wasn’t a priority for Go 1.0. With no generics, a generic iterator protocol doesn’t make any sense, so that wasn’t in Go 1.0 either. Now Go has generics as of Feb. 2022, so there is a discussion about the best way to do an iterator protocol. This is the second proposal, which builds off of ideas from the first discussion and some ideas that had been in the issues tracker before that. It’s not really more complicated than that.
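For the curious, the shape at the center of those proposal discussions is a push-style function iterator. Here’s a rough sketch of the idea (illustrative code of my own, not the proposal text):

    package main

    import "fmt"

    // Seq is the push-style iterator shape from the Go range-over-func
    // discussions: the sequence calls yield once per value and stops
    // early if yield returns false.
    type Seq[V any] func(yield func(V) bool)

    // fromSlice builds a Seq over a slice.
    func fromSlice[V any](s []V) Seq[V] {
        return func(yield func(V) bool) {
            for _, v := range s {
                if !yield(v) {
                    return // consumer asked to stop early
                }
            }
        }
    }

    func main() {
        seq := fromSlice([]string{"a", "b", "c"})
        // Without language support the consumer passes a callback
        // explicitly; the proposal's sugar would let you write
        // "for v := range seq" instead.
        seq(func(v string) bool {
            fmt.Println(v)
            return v != "b" // stop after printing "b"
        })
    }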
You’re obviously right that the decision-making process is entirely about Google’s desires, but I’d hesitate to assume that it’s necessarily utilitarian. Google does a lot of self-sabotage.
There is no standard way to iterate over a sequence of values in Standard ML, which dates from circa the 1970s/80s depending on who you ask, and is widely considered one of the most elegant language designs. Something went badly wrong here or…?
After having to deal with Rust iteration for a bit and missing out on Python… I think the decent explanation here is that in more dynamic languages with stuff like coroutines it’s pretty easy to come up with a nice iterator protocol, but in more serious things it’s harder to come up with one that is flexible enough for the “right” use cases without being very hard to use.
Like C++ has iterators right? And they do the job but they’re kind of miserable to use (or at least were 5+ years back, I’m sure things are better now).
Combine that with the perennial generics thing meaning container classes aren’t available, and “stuff things into arrays and use indices” feels like a pretty OK solution for a long time.
I think C++ iterators are uniquely awkward in their design, and it’s not an inherent design problem or any sort of static typing limitation.
C++ iterators are based around emulating pointer arithmetic with operator overloading, with state awkwardly split between two objects. There’s no reason to do it this way other than homage to C and a former lack of for-loop syntax sugar.
And C++ iterators aren’t merely tasked with iterating over a set once from start to finish, but double as a general-purpose description of a collection, which needlessly makes both roles harder.
I think this is a little unfair; the primary reason to do it this way is so that code, especially templates, works on pointers or iterators, e.g. being able to have a single implementation of something like std::find work for lists or pointers. It’s not a “homage” so much as a source-level interoperability consideration.
OK, “homage” is a poor way of phrasing it. But it’s still an “interoperability consideration” with pointer arithmetic and C’s way of doing things, rather than a ground-up iterator design. The messy end result is not because iteration is such a hard problem, but because preserving C legacy is messy.
Right it’s not inherent to “designing iterators in statically typed language.” Go doesn’t have a different language it’s trying to be incrementally adoptable from.
Not much.
I feel like Zig should add an allocator that uses conservative heap and stack scanning to the stdlib selection of allocators - see https://security.googleblog.com/2022/05/retrofitting-temporal-memory-safety-on-c.html . I did a Hare version for fun and it wasn’t so hard to do.
This would basically let you make Zig programs selectively memory safe via conservative garbage collection - the linked article shows the overheads are quite low. Then you could just turn it on for the portions of programs and deployments that require that extra bit of safety.
I also think you could just expose it as a GC allocator rather than a quarantine allocator and let people take advantage of GC when they want it.
Of all the myriad problems of multi-platform software, having multiple binaries is the absolute least interesting or encumbering.
This also solves cross compilation/multiple toolchains - which are definitely annoying things.
Disagree. Multiple binaries are a big enough problem that Apple created the fat binary. Linux also needs it, and there is FatELF, but it was not adopted. Typical Linux failure.
It’s really not clear to me why Apple did fat binaries, I suspect it was for compatibility with the exec family of system calls. NeXT had fat bundles with a different directory for every platform and architecture, allowing them to share resources and be thinned by just deleting some bits. Apple’s file format doesn’t give a space saving over this: they just concatenate the two binaries and slap a header on them. The Windows version allows sharing sections between the two (very useful for data, somewhat useful for code where the 32- and 64-bit versions of some x86 functions are the same). FatELF works the same way as the Apple versions and it’s really not clear to me what problem it solves: the number of situations where I want to provide a single download for multiple architectures on Linux/FreeBSD is very limited: people generally don’t provide a single binary for multiple Linux distributions, let alone architectures.
Cosmopolitan is interesting as a common platform that, in addition to providing a common binary, provides the same API independent of the underlying kernel. Once you have that, and (as they do) the ability to run the same functions on every platform with the same architecture, their form of fat binary becomes interesting because it’s only very slightly fatter than a single platform binary.
NeXTSTEP did have fat Mach-Os for a long while, identical to how modern macOS does it. The different directories were only for platform.
Regardless of how you do it (bundles or fat binaries), it makes it easier for things like Migration Assistant – snarf the entire contents of /Applications onto your new machine, and use the native slice on the new machine.
That’s true. I have done that for PowerPC to Intel, and was bitten by how well Rosetta worked: I thought the Core 2 was pretty slow because it wasn’t running VLC much faster than the G4. It turned out that VLC was not a fat binary and was just a PowerPC build. That’s not been an issue on other *NIX, because I do the migration by dumping the list of installed packages and installing it on the new machine. It’s also not an issue on iOS because the App Store does something equivalent.
AIUI, Cosmopolitan is a cross-platform bootloader for x86-64 binaries; it does not solve the same problem as universal binaries or FatELF, which is cross-architecture compatibility. They also take fundamentally different approaches, one embedding multiple compiled binaries, the other embedding an architecture-sensitive stub at the start.
You could maybe combine Cosmo with FatELF to get a cross-platform cross-architecture binary. But for true cross-architecture write-once-run-anywhere, what you want instead is some sort of low-level “portable assembly” that is compiled on the target, like JVM or CLR - at which point you don’t need Cosmo anymore.
Does anybody back up sqlite files with restic? I noticed that it’s not as simple as pointing restic at the folder that contains the DB, as the content might get corrupted.
Definitely don’t do that.
https://www.sqlite.org/backup.html
You should be backing up file system snapshots if you want to avoid backup smearing; the backup tool can’t coordinate with modifications in progress.
That’s in general a very bad idea. There are situations when you can back up a running database: when the database makes sure that a finished write won’t leave it in an inconsistent state (most serious databases do), and the file system can take a snapshot at a point in time rather than in the middle of a write (ZFS can do that, for example).
And good software tends to have its backup procedures documented. I’d strongly recommend reading that documentation for SQLite, but also for Postgres (it’s way too common to just go with an SQL dump, which has a great number of downsides).
Don’t blindly back up anything that might write. It could mean that your backup is worthless.
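One way to get a consistent copy before restic ever sees the file is to ask SQLite itself to produce the snapshot. A minimal Go sketch using VACUUM INTO (available since SQLite 3.27; the driver import and file names here are my assumptions for illustration):

    package main

    import (
        "database/sql"
        "log"

        _ "modernc.org/sqlite" // assumed driver; mattn/go-sqlite3 works similarly under the name "sqlite3"
    )

    func main() {
        db, err := sql.Open("sqlite", "app.db")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // VACUUM INTO writes a transactionally consistent copy of the live
        // database to a new file; that copy is then safe to hand to restic.
        // Note: it fails if the target file already exists.
        if _, err := db.Exec(`VACUUM INTO 'app-backup.db'`); err != nil {
            log.Fatal(err)
        }
    }

The backup API linked above is the more general mechanism; VACUUM INTO is just the easiest one to reach from a plain SQL driver.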
If you want a similar tool that also supports asymmetric encryption (air gapped decryption keys), then please also give my one a try:
https://github.com/andrewchambers/bupstash
Nice! A state of the art feature, as I understand it, is to permit multiple concurrent writers to the same archive. Does bupstash support this?
Duplicacy supports this via lock free techniques: https://github.com/gilbertchen/duplicacy/wiki/Lock-Free-Deduplication
It supports multiple concurrent writers with no problem; the lock-free removal is interesting though - maybe something I can add.
Out of curiosity, would such a change sidestep the issue in this issue from a while back?
Probably not, however I am investigating ways to support filesystems that don’t support file locks. One example is how sqlite3 does it, with an option to use a fallback instead of a filesystem lock (like attempting to create a directory).
Ok thanks for the detail. I am really excited to see what you come up with. Bupstash looks absolutely amazing to me.
Another alternative for asymmetric encryption is rdedup, but I haven’t found (or bothered writing myself) the tooling around it to make it a worthwhile backup solution for me.
I’m currently using restic on my file server, backing up to Wasabi S3-compatible storage. Works great.
I get the reason for this, but am a little bummed. Deno was a great chance to step away from the tire fire of the NPM/Node ecosystem and start fresh.
…so they have decided to reintroduce all of the problems of the packages on NPM. T_T
I know it’s a little heavy-handed, but forcing people to bring over the stuff they actually care about to a better language (Typescript) is maybe the right answer.
Those systems never get used at dayjobs, and never provide an upgrade path for people trying to get off the legacy platform.
You need an upgrade path from the old thing to become dominant.
I agree. There is simply so much crap on npm that nuking it and forging a new path appears the preferable option - I suppose a lack of meaningful uptake due to reliance on npm packages forces this, but still, it is a disappointment.
I suppose nothing forces you to use it when you don’t need it. But it might be handy when it’s required to get a job done fast.
I dunno, not having a package management story at all is its own tire fire waiting to happen.
What do you mean by package management story?
As of now Deno has no package manager and unsafe defaults for third party dependencies, including resolution (via DNS!) and linking at runtime. NPM at least locks dependency versions by default and offers protection against takeover attacks in addition to offering a way to automate dependency updates.
Whatever problems exist with NPM aren’t solved by going backwards in time instead.
Excel is a local maxima, which sucks because it’s not good enough. Ordinary people can use Excel, which is great, but then the date type is actively harmful, which is insane. It mangles zipcodes in spite of it having been made by a US corporation for its whole existence! Like, I get it, sometimes you mangle foreign conventions due to unfamiliarity, but all of New England has its zipcodes mangled. That’s bad! And then because Excel is a local maxima, new products like Numbers and Sheets clone it instead of searching for a new maxima. It’s a pity because we can definitely do better.
I’ll take a stab at some things I think would make it much better for many of its uses:
Fundamentally, I’d say that my complaint about it is that Excel really, really wants to be a financial spreadsheet program when many (perhaps most) users want it to be a tool for working with generic tabular data.
* More broadly, I’d say this and several of my other points boil down to “Excel doesn’t do a great job at allowing the user to control complexity”, and this becomes more clear as the volume of data manipulated in it increases. I think that’s one of the differences between it and, say, VisiData. I prefer VisiData for my own use, but the reality is that many of the people I interact with simply won’t learn to use tools like it rather than Excel (for a variety of reasons, some of which are fair).
The Power Query feature is really good for this, though its transformation DSL is weaker than Excel formulas.
I’m hoping the new “linked data types” will help with this! Right now you can make custom data types but it’s pretty cumbersome. It’s pretty new though and I think they’re still improving it.
Maximum. ‘Maxima’ is the plural🙂
Not disagreeing at all, but curious what you think some of the big changes would be if a great designer did a total rethink, but incorporating the good stuff that works?
Take a look at Quantrix Modeller. Lotus had two spreadsheet products: 1-2-3 and Improv.
Excel is a 1-2-3 clone, as are most other spreadsheets. Quantrix Modeller is, as far as I know, the only surviving Improv clone. They have some great videos about why this model is better. It’s less error-prone, easier to change, and so on.
When most people say that they want a spreadsheet, what they actually want is a database with a rich set of numerical library routines.
No thanks.
I guess there are niches where you can still get away with this, but it’s become less common (even in the enterprise space, where it used to rule), to the degree that I often see products without at least some general pricing guidance published get pushed to the bottom of procurement lists simply because no one can be faffed to talk to a bunch of salespeople to get quotes just to do a comparison.
Quantrix Modeller sales demo video
Hear, hear!
I’m not imagining a big redesign. Just basic stuff: add types for timestamps, civil times, durations, locations, currency. Have a difference between something’s type and its display. Fix the 1900 leap year bug FFS. Default to not clobbering the next cell when something doesn’t fit. You could still have freeform A1 cells but you should push users towards using it like a database with proper rows and columns as much as you can. Design the app as though it were the most commonly used tool in business and science analysis and not whatever Dan Bricklin happened to think of in 1979.
I am confused when you say it is a local maxima yet think small improvements would fix it - that is a contradiction.
I worry that mere mortals (normal programmers) won’t be able to write a library that is both sync and async, just due to how confusing it all is. I hope it’s not too hard in the end; interesting to see what they come up with.