Threads for benji

  1. 8

    Every step here looks and smells like an ‘oopsie’ backdoor. Plausibly deniable, high contextual worth for LEO, low risk for abuse or discovery outside of it.

    1. 1

      That’s a great point. That leads me to wonder if the “fix” similarly contains a new, even more obscure path to bypass the lock screen.

    1. 17

      CourtListener docket for those who want to follow along.

      I’m not a lawyer, but personally I’m skeptical: I don’t think anything that github did is likely to have required a license in the first place. To the extent that copilot produces verbatim copies, it seems to do so only of tiny samples of code that have been replicated numerous times by humans before. I expect the court will find that to be fair use/de minimis copying and not actionable. Without the initial copyright infringement occurring, I don’t think many of the other claims survive, since they either require copyright infringement as a precursor (e.g. the DMCA) or require the conduct to be unlawful.

      I’m less sure what to think about the personal information claims.

      Regardless, I’m pretty happy that this suit is happening. Getting clarity as to the law relating to AI models sooner rather than later will be good for everyone, and both sides here should have deep enough pockets to do a good job at arguing their side, so the decisions come out saying what they should say.

      1. 4

        Getting clarity as to the law relating to AI models sooner rather than later will be good for everyone

        It’s pretty clear already today; this litigation is rather a publicity stunt; the neural net itself is neither a copy nor a derived work of the training data; what the neural net spits out is computer generated in the first place and thus not protected by copyright law, unless the result is sufficiently similar to something human generated, which itself is sufficiently creative; a few code snippets will hardly suffice for this (and even if they do it is very likely fair use according to current jurisprudence); but this must be judged on a case-by-case basis, not in a class action suit. I also can’t understand the outrage of many developers; on the one hand people seem to take it for granted that others provide them code or services for free on a grand scale (e.g. Github hosting and additional features heavily used by the open source community); but at the slightest suspicion that they should give something away, all hell breaks loose.

        1. 13

          the neural net itself is neither a copy nor a derived work of the training data; what the neural net spits out is computer generated in the first place and thus not protected by copyright law,

          I don’t believe that this is settled precedent yet. In particular, it is clear that a neural network can memorise some of the input data. The fact that it’s a neural network doesn’t really matter - if it were a rule-based system with a database of code snippets that it could combine to produce output then it would be equally legal or illegal and that’s up to a court to decide.

          unless the result is sufficiently similar to something human generated, which itself is sufficiently creative

          That’s the crux of the matter. It is established that Copilot can generate exact copies of code. It is not yet established whether these snippets are sufficiently creative to merit copyright. This is a very tricky issue because it does not depend exactly on length. A two-line snippet might be copyrightable if it is doing something sufficiently original and a court agrees that it is a creative work. In that case, you may still be allowed to quote it but then you may have attribution requirements, depending on the jurisdiction. It is more likely that a long fragment is both considered a creative work and not covered by fair use, but some long things can be considered non-copyrightable (e.g. if they are mechanical implementations of a published specification).

          1. 1

            Well, we will see what comes out (likely not much).

            it is clear that a neural network can memorise some of the input data

            That’s not correct; the DNN doesn’t just make or memorize a copy; it might be able to reproduce parts of the training set, though this is not the actual purpose, but a rather coincidental and unwanted side effect (which occurs less than 1% of the time according to Github officials, as far as I remember). Also note that it is not comparable to a simple database, not even a compressed or encrypted one, since there is no technical means to restore the original works used to train a DNN; it’s rather like a hash sum; the abstraction and transformation done by the DNN training algorithm is substantial; the original works are unrecognizable and unrecoverable; the DNN is thus no derivative work; any other outcome of the trial would be a big surprise.

            then it would be equally legal or illegal and that’s up to a court to decide

            Storing copyrighted work is generally no violation of copyright law (in some countries it might be illegal if the copyrighted works were not legally acquired). This is established legal practice; we don’t have to wait for a decision.

            This is a very tricky issue because it does not depend exactly on length

            Not that tricky; there is a well established legal practice in this regard with various precedents; if the DNN repeatedly produced code sufficiently similar to existing code, the matter would have to be clarified in the individual case anyway, whereby the burden of proof of authorship as well as similarity and copyright infringement would lie with the individual plaintiff; and the defendant in this case would not be Github, but the developers using the code in question.

            1. 12

              It could be like lossy compression. If you make a shitty JPEG copy of a copyrighted artwork, the new bytes look nothing like the original, and you can’t even restore the original, but it may still be an infringement when it’s a close-enough copy.

              You could also look at this from a higher level:

              code goes in -> black box -> same code comes out
              

              The complex implementation details of the black box may be irrelevant legally. If you put a copyrighted work in a paper shredder and then glue the shreds back together in the original order, even by chance, the court may not care how you did it, only that you have ended up making a copy of the original.
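A toy sketch of the lossy-copy point in the comment above. It is not a model of JPEG or of Copilot: the pixel values and the 4-bit quantizer are made up for illustration. The stored codes look nothing like the input and the exact input cannot be recovered from them, yet the reconstruction stays close to the original.

```python
# Hypothetical toy "codec": keep only the top 4 bits of each 8-bit value.
def compress(pixels):
    return [p >> 4 for p in pixels]

def decompress(codes):
    # Reconstruct the midpoint of each 16-value bucket.
    return [(c << 4) + 8 for c in codes]

original = [12, 200, 77, 145, 33, 254]  # made-up sample values
stored = compress(original)
restored = decompress(stored)

# Largest per-value difference between the original and the reconstruction.
max_error = max(abs(a - b) for a, b in zip(original, restored))

print("stored codes:  ", stored)
print("reconstruction:", restored)
print("max error:     ", max_error)
```

Whether a reconstruction that close counts as a copy is the legal question the thread is arguing about; the code only shows that "different bytes" and "not exactly restorable" do not by themselves settle it.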

              1. 2

                code goes in -> black box -> same code comes out

                That’s essentially the concept of all electronic media today. If you take a picture of Mona Lisa, the camera, the memory card and the JPEG format in use are a blackbox for the majority of users; even though they are able to view or even publish the picture displaying Mona Lisa with little effort.

                This also nicely demonstrates the present situation. Neither the manufacturer of the camera, nor the inventor of the JPEG format, nor the photographer making and keeping the picture is liable for copyright infringement. But if the photographer wants to publish the picture, permission from the copyright holder may be necessary; this depends on what you can see in the picture, not on how the picture was taken or stored, or the slight quality loss of the format.

                In the present case, the DNN is conceptually comparable to the photographer and the storage format; but the DNN doesn’t store a copy nor a “picture” of the original, but a statistical abstraction of certain features of millions of originals. So the DNN doesn’t simply transport or “publish” original content, but it is able to synthesize new content based on the feature abstractions.

                It’s a similar process as when you write code, remembering concepts you have learned over the years (I am not talking about the widespread method here where developers simply copy existing code from the internet). If by chance something comes out of the DNN that resembles an existing work, the user still has the responsibility that copyright imposes on him, and the copyright holder still has all the possibilities that copyright grants him; but this is not Github’s responsibility.

                1. 4

                  JPEG compression transforms pixels into a completely different domain which does not visually resemble the original image at all (what Grant Sanderson calls the “Fourier world”); the only reason why this works is because we have a series of master theorems which establish the bidirectional connection between our visual world and its Fourier-transformed counterpart. But this is analogous to the situation with neural networks; they operate in a completely different perceptual domain than us, and we rely on theorems which state that such networks are faithful approximants of the desired functions.

                  If I take a JPEG-compressed image of a copyrighted work and alter its coefficients in a bidirectional transformation, producing an image which visually resembles the original work to a human observer, then I may have infringed. Similarly, if I take a neural-network-encoded pile of copyrighted source code and approximate its coefficients in a bidirectional transformation, producing source code which syntactically resembles the original work to a human observer, then I may have infringed. It doesn’t matter whether I tuned the coefficients by hand or used an automated tool to compute their new values; what matters is that the prior values were computed by summarizing copyrighted works.

                  1. 2

                    But this is analogous to the situation with neural networks; they operate in a completely different perceptual domain than us, and we rely on theorems which state that such networks are faithful approximants of the desired functions.

                    That’s not the way Copilot or e.g. GPT-3 work. You can indeed approximate compression functions with DNNs, but that’s not what is done here. Anyway, even if they only implemented an indexing and search algorithm on the repositories, without any DNNs, this would be no copyright infringement, even if search results showed snippets the same way they do today. There are already precedents for this.

                    1. 2

                      the problem is not so much whether github is infringing but whether as a user of copilot you may unwittingly infringe someone’s copyright without having any way to know whether it happened - google is not infringing by serving up search results, but the fact that you found something on google doesn’t grant you any rights to use or republish that content in your own work

                      if copilot were just a search engine, again, github would be in the clear, but you would still need to check the license to see if you can use it. all that changes by making it a language model is that you can’t easily check so you never know if its output is safe to use in your own projects.

                      1. 1

                        I recommend reading the filed complaint. And it is not Github’s duty to enforce law.

                        1. 3

                          a key part of the complaint is the stripping of license information which they are responsible for preserving, a problem they would not have if they’d simply built a search engine

                          1. 1

                            They’re not stripping license information; they synthesize snippets, a tiny fraction of which might resemble existing code (which is very likely for any sufficiently small snippet and thus barely avoidable). But let’s see what comes out; the litigation is now filed.

                            1. 3

                              If the whole file is “this snippet and a copyright header”, the term “snippet” is misleading.

                              1. 3

                                What is really interesting to me about this whole Copilot situation is how much the zeitgeist has completely flipped.

                                I remember years and years ago people proposed all sorts of weird multi-part scrambling schemes that would take an input and produce a bunch of seemingly-random data blocks, none of which could reproduce the input or a subset of it, but if you had all of them you could recombine in a way that got back the exact input. And people literally thought this was an end run around copyright, since you could have, say, a P2P system where each peer distributes only a subset of the blocks needed to reconstitute a popular song or movie, and thus none of them were distributing a “copy” of it because none of those individual blocks could reconstitute it on their own – the fact that at the end you had a perfect copy didn’t matter, it was claimed, because only the intermediate distribution format mattered for copyright law.

                                And things like this would get lots of attention and hype on tech forums and be cheered on as proof of the incoherency of even the concept of copyright.

                                Now GitHub has invented something not too far off conceptually, and the tech community is screaming for it to be destroyed and arguing about how it’s the output that matters, not the intermediate format.
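The comment above does not name a specific scrambling scheme, so the following is only a sketch of the simplest variant that fits the description, an XOR split. The function names `split` and `combine` and the sample message are hypothetical. Every block except the last is random and the last is the input XORed with all of them, so no proper subset of the blocks reveals the input, while all of them together reproduce it exactly.

```python
import secrets

def xor_bytes(a, b):
    # Byte-wise XOR of two equal-length byte strings.
    return bytes(x ^ y for x, y in zip(a, b))

def split(data, n):
    # n - 1 random blocks, plus one block that is data XOR all of them.
    blocks = [secrets.token_bytes(len(data)) for _ in range(n - 1)]
    last = data
    for block in blocks:
        last = xor_bytes(last, block)
    return blocks + [last]

def combine(blocks):
    # XOR all blocks together; the random parts cancel out.
    out = bytes(len(blocks[0]))
    for block in blocks:
        out = xor_bytes(out, block)
    return out

message = b"a popular song, as bytes"  # made-up stand-in for the input
blocks = split(message, 3)
recovered = combine(blocks)
print(recovered == message)
```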

                                1. 3

                                  Now GitHub has invented something that they seem to think is actually a magic copyright remover, and the tech community is screaming for it to be destroyed.

                                  I think that nobody wants to achieve any “destruction” here. All I want is that my copyright remains intact.

                                  1. 2

                                    The best possible outcome is that the defense wins, thus striking a serious blow against the legal fiction of intellectual property. Yeah, I have no love for copilot. It is essentially strip-mining the commons. And Microsoft Github is another tentacle of the surveillance capitalism octopus. In this specific case, I’m rooting for the megacorp. Though I have to admit, a teensy cut of that class action suit would sure help me out right now.

                                  2. 2

                                    The litigation includes no such examples, which is a pretty strong signal to me that no such examples exist because it would seem to be the exact sort of sample that gives the best (still small IMHO) chance of winning.

                      2. 3

                        In this case I’d say Tensorflow (or whatever NN library they use) is the algorithm provider not responsible for its usage, but Microsoft is the user feeding copyrighted data into it.

                        1. 1

                          a partially trained DNN is kind of like a zip file with a bunch of files already in it - adding another one is going to take up less of the capacity if it’s similar to what’s already there, the trick is that information is shared - generative models are kind of like a lossy compressor whose compression artifacts take the form of making the input more generic, more like the training set (“faceapp yourself but don’t actually apply any filters” type distortion), and the degree of distortion is simply a factor of the model capacity

                          training a high capacity model on a small dataset inevitably memorises things verbatim, because the training task for these models is reconstruction, that they appear to be doing something else is mostly a factor of capacity limits and sometimes intentionally distortion-inducing sampling methods

                          and you can observe different degrees of distortion even in text models like copilot - depending on how common a code snippet it is reproducing is and your settings, it may reproduce existing snippets nearly exactly but with different variable names or commenting style, which shows that it has an internal representation that doesn’t necessarily need to store the “stylistic” details, but is still obviously close enough to be license infringement

                          when given a context that isn’t especially close to any single training sample it appears to be more “creative” as it’s having to mix together information gleaned from multiple samples and rely more on the surrounding context, but the big problem with copilot is you can never really know when it’s being “creative” and when it’s just coughing up a training sample almost exactly so it’s always going to be kind of legally awkward to use

                          the real annoying part is that language model based code completers are still really useful when trained on much less code, and a lot of the code that copilot was trained on isn’t just encumbered by licenses that don’t allow unattributed copying, but is also poor quality. There is conceptually a more useful tool you could build with the same methods by being more selective about its training data, but copilot feels like GitHub and OpenAI trying to retroactively recoup the cost of training a huge model more than an intentionally designed product.
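A small sketch of the "information is shared" part of the zip-file analogy in the comment above, using an ordinary lossless compressor. It says nothing about how Copilot or any neural network stores data; the token vocabulary, the seed and the file contents are made up. The point is only that adding a near-duplicate to an archive costs far fewer bytes than compressing it on its own.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
vocab = [f"name_{i}" for i in range(200)]  # hypothetical identifiers

# A made-up "file", and a near-duplicate with one small insertion.
file_a = " ".join(rng.choice(vocab) for _ in range(500)).encode()
file_b = file_a[:1000] + b" CHANGED " + file_a[1000:]

alone_a = len(zlib.compress(file_a))
alone_b = len(zlib.compress(file_b))
together = len(zlib.compress(file_a + file_b))

# Extra bytes needed to add file_b once file_a is already "in the archive".
marginal = together - alone_a

print("file_b alone:       ", alone_b)
print("file_b after file_a:", marginal)
```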

                          1. 3

                            a partially trained DNN is kind of like a zip file

                            No. ZIP is a lossless, deterministic compression algorithm, in no way comparable to what the present DNN or its training algorithms do.

                            Ultimately, the degree of similarity and the degree of creativity of the snippet will be decisive. Unfortunately, however, the value of such snippets is greatly exaggerated in the present discussion. It is undisputed that in copyright law source code (unfortunately) automatically counts as a work, and this (even more unfortunately) also applies to parts of it. However, this is a perversion of the concept of a work. Because the probability that any snippet is present in any number of other source codes in a very similar way is close to 100%. Industry and open source developers are already suffering from the perverted use of patent law; now they are to be bothered also by the perverted use of copyright law. Judging whether or not a snippet meets the creativity requirements is usually arbitrary. Fortunately, problems with this kind of misuse of copyright can be circumvented with its own means relatively easily by simply rewriting the snippet.

                            1. 1

                              The point of the analogy was it containing multiple items and sharing information, read “mpeg file” if you’re hung up on the lossy vs lossless distinction

                              1. 1

                                if you’re hung up on the lossy vs lossless distinction

                                Thanks. I have a formal education in both information technology and law.

                      3. 2

                        copilot only violates copyright 1 percent (or slightly less) of the time - trust us

                        Microsoft and you.

                  2. 2

                    I don’t think anything that github did is likely to have required a license in the first place.

                    I don’t think GitHub has any right to other people’s work, unless granted by a license.

                    1. 8

                      They have legal ownership of the copy that is in their possession, given that they acquired it lawfully (which they did). The same way you own a book.

                      You can do lots of things to code without a license. Read it, lend it to your friends, sell it to the used code store, stick it on a shelf so your guests can admire how big a library you have, execute it, etc. They don’t have copyright, but they absolutely have normal property rights.

                      1. 1

                        I think ownership might be the thing that’s at least debatable here. I don’t think GitHub owns the code it hosts. Similar to a web hoster not owning all the photos and your ISP not owning everything it caches or goes through the network.

                        Or a more IT-flavored comparison: if I am a code reviewer, a consultant or something, and code is given to me to inspect, that doesn’t mean I own it simply because I legally have the data on my hard drive. If said code was some service and I’d just run it, the actual owner would likely be very unhappy.

                        I agree this is about copyright and not license. The question is whether what they do is some kind of fair use or anything you are allowed to do under copyright law.

                        I’d argue it’s not, because it doesn’t create a benefit for society, like most fair use does for example.

                        If it turns out it is, what would happen to, let’s say, anything that re-compresses an image, maybe lossily, as part of a service? They (likely) do that in this case even with the explicit authorization of the copyright owner. They run it through some algorithm and get something new out of it that kind of resembles the original, but not really and certainly not in terms of bytes. Does that make them the owners?

                        Or what if someone simply wrote some “AI” that let’s say mostly strips comments, reorganizes code, maybe even just works with some sort of AST. Would it make the output owned by whoever runs it?

                        Does that mean one could make an “AI” that disassembles binaries, maybe makes some redundant changes and outputs new modified binaries? Would that work?

                        What if it was more involved and you actually train a NN, and just teach it the bytes of some software or even a movie. You have a prompt where you can enter “The bytes on C:\videos\plan-9.mp4 are video files of Plan 9 from Outer Space. Remember this!”. It does, but not just by copying, but by adding it into its (language) model. Then since it’s your language model you share it on the web. Someone else may download it and say “Hey there. I need the bytes for Plan 9 from Outer Space in C:\warez\plan9.mp4, please store them there for me”. Who holds the copyright on what the AI creates through its language model? It might even have learned to skip redundant license statements of software, strip FBI warnings from videos and who knows what.

                        What if the AI does more? What if it even can “watch” and “learn” the movie, potentially scale it up to 4k monitors, output to any format, knows how to change it just enough so any AIs looking for copyright infringements can’t differentiate it anymore? What if it can learn to even change movies, just enough so that copyright lawyers consider it a new work of art?

                        Where do you draw the line? Where does what’s allowed under copyright law end?

                        I really don’t have the answer, but I think with copyright law a huge mess was created in the first place, because laws work best when they are something you can agree on at large and they change or come down when a large number of people change their opinion (homosexuality, slavery, women voting, witchcraft, etc.). I don’t think there ever was a huge amount of agreement on copyright, and if it were applied to the letter and copyright holders really sued everyone who crossed its lines, the majority of people would have voted to abandon at least large parts of it.

                        Besides that, the line between being inspired by something, learning from it, or even learning it by heart (think reciting a poem) is thin: all of these are simply some form of copying with some translation. There already are huge existing topics on fair use, see sampling, mixing, etc., and laws that nobody feels comfortable enforcing, like those against singing copyrighted songs at parties or in other private settings in some countries.

                        I’d say all of this is at least something that’s not so clear in law, so whichever route it takes, I am sure the conclusion has the potential for far-reaching effects.

                      2. 5

                        Their ToS do grant them some rights, but their Copilot actively violates their own ToS:

                        https://docs.github.com/en/site-policy/github-terms/github-terms-of-service#4-license-grant-to-us

                        1. 4

                          I don’t see at all how Copilot violates their own terms. Could you give an actual explanation, with detailed specific claims?

                        2. 2

                          It seems that in the above “right” is being used to mean “moral right”, while many here are using “right” to mean “legal right”. Confusing those two things might be the source of some of the misunderstanding that I’m seeing here.

                        3. 1

                          I don’t think anything that github did is likely to have required a license in the first place.

                          It depends on the particular licenses that are violated, I guess.

                          1. 9

                            The argument is something like:

                            1. Copyright law says that there’s a list of things you can’t do without getting permission from the copyright holder.
                            2. For open-source/Free Software, the copyright holder does grant permission to do some of those things, in some ways, via a license.
                            3. But if the thing you are doing is not one of the list of things that requires the copyright holder’s permission, then the license terms are irrelevant, because you are not dependent on the license for your permission to do those things.

                            There is also the other option that if you have access to a piece of software under multiple potential license grants, one of which is more permissive than the other, you can choose the more permissive one without having to observe the less permissive one. I’ve pointed out in past threads about Copilot that I would not be surprised at all if the license grant embedded in GitHub’s terms of service turns out to be more than sufficient to allow everything Copilot does, for example.

                            1. 7

                              No, if they don’t need a license I don’t think the particular licenses matter at all. Licenses grant permission to do things that were otherwise illegal under law with some conditions attached. If you didn’t do anything that was otherwise illegal, licenses don’t do anything.

                              If they did need a license, they’re obviously in trouble with pretty much any license, because they didn’t comply with pretty much any license (other than the CC0 and wtfpl style ones).

                              1. 7

                                Some licenses grant you permission to reproduce some/all of the work provided you meet the conditions. Microsoft did not meet the conditions in those cases, yet they reproduced the works anyways.

                                How is violating copyright law not illegal? Or are you insinuating that copyright law doesn’t cover source code?

                                How is this any different than taking text from books in the library and trying to pass it off as your own work, while not paying dues to the copyright holders (or whatever conditions they have, if any)?

                                1. 8

                                  How is violating copyright law not illegal? Or are you insinuating that copyright law doesn’t cover source code?

                                  I’m saying I don’t think they violated copyright law. I’m not saying it doesn’t cover source code, but that I don’t think it covers this kind of use of copyrighted material of any kind.

                                  How is this any different than taking text from books in the library and trying to pass it off as your own work, while not paying dues to the copyright holders (or whatever conditions they have, if any)?

                                  I don’t think the distinction between text from text books and code is relevant if that’s what you’re asking. If you trained the same kind of model on an equally large collection of equally diverse textbooks and released it as a book writing aid I think you would have exactly the same legal status. (Edit: I should say “and served requests to the model as a book writing aid”, releasing the model vs serving requests is potentially legally relevant, with the latter being slightly more likely to be problematic IMO).

                                  I don’t think it’s fair to describe what’s happening here as “taking text from books in the library and trying to pass it off as your own work” though. There are many more steps happening here, and they’re achieving a much broader goal than rote copying pieces of text. And sure, sometimes the process manages to create a copy of a small piece of someone else’s copyrighted work that has been copied many times already into other places, but that’s de minimis copying and not copyright infringement.

                                  1. 7

                                    It might be worth noting for instance, that Google won Google v. Oracle despite copying 11,500 lines of code. In part because that wasn’t a substantial portion of the work. I’d expect a similar analysis here.

                                    The samples that were duplicated that they use to justify the lawsuit are things like part of an exercise from a textbook, it’s not a substantial portion of the book.

                                    1. 0

                                      And I think they are violating copyright law, but I’m not a lawyer and you probably aren’t either. I hope this goes to trial so we get to hear some random judge’s opinion.

                                      1. 4

                                        Your whole argument rests on “they didn’t follow the license!”

                                        If their whole argument is “we didn’t need that license to do what we did”, then your argument is not really relevant. That’s what people are trying to get you to understand – the license terms may literally have no relevance whatsoever.

                                        1. 1

                                          What’s the point in any license if it’s not relevant here? My point is that argument, that the license is not relevant, is, uhh, not relevant.

                                          1. 6

                                            A license grants you rights to do things that would not otherwise be permitted under copyright law. You do not require a license to do things that are covered by Fair Use / Fair Dealings (delete as appropriate for your jurisdiction) or by other explicit statute law. For example, in the USA you do not require an explicit license to record a cover of a song because compulsory licensing is enshrined in statute law and so you just have to pay a fixed amount for every copy that you sell.

                                            The argument (disclaimer: I work for MS but have no connection to this project) is that this kind of use is covered by explicit law. I don’t know precisely what the laws in question are; there are a few things on building machine learning systems and databases that may apply, but it will be up to the court to decide.

                                            Whether they win or not, I think they’ve achieved their goal. We (MS) have spent a huge amount of time and money building a reputation as a good open source citizen. I’m able to hire great people on the basis of that and the expectation that they will be paid to contribute to the F/OSS ecosystem. Being on the other side of a lawsuit from the SFLC does a lot of damage to that reputation and, in some ways, winning a lawsuit against the SFLC would do more damage than losing.

                                            1. 2

                                              Still, I’m glad it goes to court and I even hope MS wins. Copyright has limits and that’s important.

                                              We’ve had this before. Publishers wanting to block used-book sales, for example.

                                            2. 4

                                              A license is just that: a license to do something with a copyrighted work. It can’t take away rights that were already granted by copyright law, such as fair use.

                                              1. 4

                                                Ordinary users of GitHub receive code under the license chosen by the person who posted the code.

                                                GitHub has a choice between receiving code under that license, or under the license granted in GitHub’s terms of service.

                                                GitHub can simply choose the more permissive of the two, in which case the more restrictive of the two is in fact irrelevant.

                                                Think of it like any other dual-licensing scheme. Suppose I write a piece of software, we’ll call it foolib. And I offer it under a choice of BSD or AGPL. If you choose to receive foolib from me under the BSD offer, you will be able to do things with foolib that the AGPL would not have allowed. And you will be able to do that because the AGPL is not the license under which you received foolib and so is not the license which governs your use of it. No amount of yelling “that’s an AGPL violation!” would be relevant there.

                                                Similarly, even if I only offer it under AGPL, you could still do certain things with it – such as fair use – without having to follow the AGPL’s terms. And again no amount of yelling “but that’s an AGPL violation!” would matter, because there are things copyright law still lets you do without needing to obtain or follow the terms of a license.

                                                The point being made here is simply that saying “But that’s a license violation!” over and over is not relevant, because the original argument is that GitHub either has access under an alternative, more permissive license, or is doing things that do not require a license in the first place. In the former case, the only license terms which matter are the more permissive ones; in the latter case, no license terms matter.

                                1. 14

                                  I have one big problem with this talk, and that’s… its central thesis actually: one thing the edit-compile-run cycle has going for it is that it’s principled. Namely, you’re starting from a known state (the source code) every time you compile. With live/image based programming where you fix everything on the go, the actual source code isn’t limited to the source text, but the state of the program itself! This has to be unmanageable at some point.

                                  I will also note that some aspects of live programming are available even to C/C++ programmers: just recompile the dll and reload it. This requires significant initial setup, but it allowed more than one game dev to change mechanics on the fly without restarting the whole game, without the help of a scripting language. The Yi editor written in Haskell demonstrated something similar. XMonad can also maintain its state across compilations (which happen whenever we change its configuration).

                                  1. 4

                                    Well said! I do love REPLs and think they are very valuable, I also prefer “principled” approaches (as you very well put it) and think there is a great deal of value there.

                                    Combining both seems like fertile ground.

                                    1. 1

                                      Exactly, it is about combining both. Principled use of excessive power.

                                    2. 3

                                      Yup. There have been some threads here about “image-based” development, which I experienced with Smalltalk-80. It’s kind of magical in some ways, but also leads to real problems because your program never starts from scratch.

                                      1. 2

                                        With live/image based programming where you fix everything on the go, the actual source code isn’t limited to the source text, but the state of the program itself! This has to be unmanageable at some point.

                                        I am a super huge fan of Common lisp and this is something I see every single newcomer struggle with. They usually don’t struggle with it for long though. But it’s definitely something that I would like to see discussed explicitly more often.

                                        On the bright side, Common Lisp was designed with that (interactive development) in mind (mostly). What comes immediately to my mind, as an example of this, is the defparameter vs. defvar forms, which differ only in what they do when they are re-evaluated: defparameter assigns the value again, while defvar leaves an already-bound variable alone. That makes it easy to reload source code without losing the current state.
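A rough Python analogy of that distinction, as a sketch rather than real Lisp semantics. The `SOURCE` string, the `load` helper and the variable names are all made up for illustration: `counter` behaves like a defvar (kept when the source is reloaded) and `step` behaves like a defparameter (reset on every reload).

```python
# Hypothetical "source file", re-evaluated into a long-lived namespace.
SOURCE = '''
# defvar-like: only initialise when the name is not bound yet.
if "counter" not in globals():
    counter = 0

# defparameter-like: rebound on every re-evaluation.
step = 1
'''


def load(image):
    """Evaluate SOURCE inside the given namespace (a stand-in for an image)."""
    exec(SOURCE, image)
    return image


image = load({})        # first load: counter = 0, step = 1
image["counter"] += 5   # state accumulated while the program runs
image["step"] = 10      # a tweak made interactively

load(image)             # "reload the source file"
print(image["counter"], image["step"])  # prints: 5 1
```

The accumulated `counter` survives the reload, while the interactive tweak to `step` is overwritten by whatever the source says.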

                                        Yet another thing: I’ve rarely seen anything that deals with “cleaning up the state”. For example, if you rename a function f1 to f2, the original function f1 still exists in the image unless you explicitly remove it (with the so well named function fmakunbound). The only time I’ve seen something that deals explicitly with renames is terraform’s refactor, but that’s hardly for “image based development”.
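The same leftover-definition effect can be reproduced in any long-lived namespace. Here is a small Python sketch (not Lisp; the `image` dict is just a stand-in for the running image, and the explicit `del` plays the role that fmakunbound plays in Common Lisp):

```python
image = {}  # stand-in for the running image's global namespace

# Version 1 of the source defines f1.
exec("def f1():\n    return 'hello'", image)

# Version 2 renames f1 to f2; reloading only adds the new name.
exec("def f2():\n    return 'hello'", image)

# Both names are still bound after the "rename".
stale = [name for name in ("f1", "f2") if name in image]
print(stale)  # prints: ['f1', 'f2']

# Explicit cleanup of the old binding.
del image["f1"]
```

Nothing in the reload step knows that f1 was renamed rather than kept, which is why the cleanup has to be explicit.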

                                        One last thing that I won’t develop because this is already way longer than I wanted: the problem of updating the code within an image is the same problem as updating a database.

                                        1. 2

                                          the problem of updating the code within an image is the same problem as updating a database.

                                          I think you meant to analogise the database with the state, not the code…

                                          There’s a difference though: with a database, the data is concentrated in one place, and that data is mostly statically typed: you have a fairly rigid set of tables, each column has a definite type… and if it’s all normalised like it should be 99.9…% of the time the interface between it and the code will be relatively thin, and very stable.

                                          Now replace the database with the live state of a long lived program. A dynamically typed program at that. You need to make sure your program state is just as principled as the database’s, the interface between it and the code just as thin and stable… except here all you have is programmer discipline. There’s no clear fence, and depending how your program is accessing your data, good luck determining whether a state change will break your program or not. At least with a database you know that as long as you don’t remove columns or entire tables you’re mostly safe.

                                          So at a first glance, image based programming at scale (let’s say more than 10 devs working on the same code base & associated state) looks utterly impossible. How does that work in practice?

                                          • How do you keep your program from irrecoverably mangling its state?
                                          • How do you deal with irrecoverably mangled state?
                                          • How do you version the code?
                                          • How do you version the state?
                                          • How do you share state among several collaborators?
                                          • How do you update the code of a production instance while keeping existing state?
                                          • How do you update the state of a production instance?
                                          • How do you update the state architecture of a production instance?

                                          And perhaps most importantly, do you even need all of the above? Why not?

                                          1. 3

                                            In CL: if the state becomes a mess, restart your image! No big deal, you lose 2 minutes. For many days, I can develop my program with the same running image, it’s fast, it’s interactive, I don’t lose state, I am happy. It happens indeed that I restart it. I also always run unit tests or build the project from the terminal from time to time, so from a blank slate.

                                            We use source files with git. Maybe they didn’t have this in Smalltalk land decades ago, we do now.

                                            How do you share state among several collaborators?

                                            To share state among developers, we can build and share a custom image. It’s nearly like an executable, but it’s an image that offers a REPL. Now instead of running a new SBCL instance from our editor, we start sbcl --core my-image, and it starts instantly.

                                            How do you update the code of a production instance while keeping existing state?

                                            Easy: git pull, connect to the running image (you have a tmux or you use Swank, the server side of Slime, through an SSH tunnel) and you load your source files again (today, it would quickload). The image is not restarted, so you keep state.

                                            Also if you define variables with defvar, you don’t lose their state (versus defparameter).

                                            But, disclaimer: one can very well update production instances without worrying about keeping any state: rsync a new binary, stop the service, start the new service. Usual deployment knowledge would apply. Just as with a Go application for instance.

                                            How do you update the state of a production instance?

                                            wow, you mean, from the state of a dev instance to the production one? Who would care to do that? I don’t see the point. I develop on my machine, I deploy new code to the production machine. I don’t have to mix states. Maybe for very specific type of applications? If I need to send data, I send the data files.

                                            1. 5

                                              In CL: if the state becomes a mess, restart your image! No big deal, you lose 2 minutes.

                                              Oh, I see. So the people raging about image based programming actually don’t really care about their image. And I guess then state is not versioned either. Except of course whatever is needed to establish an initial state from scratch. I also bet the need for several people to work on the same image ranges from limited to nil.

                                              What I thought was a completely alien realm seems actually pretty mundane all of a sudden. What I thought impossible probably is impossible… I just had the wrong idea. Kind of anticlimactic to be honest, but I think I have a better idea of what this really is now.

                                              Thanks.

                                              1. 2

                                                don’t really care about their image

                                                That’s a way of putting it :p For me, a young-ish lisper, image-based development is awesome for: development, and ease of live reload (if I want). Bonus, I use custom images from time to time (start a Lisp image with a ton of libraries pre-installed and with other tweaks). That’s it, and that’s a lot already.

                                                1. 1

                                                  Oh, I see. So the people raging about image based programming actually don’t really care about their image.

                                                  I think it’s different between CL and Smalltalk, maybe? In CL, your image is pretty important during a given development session, but not so much between developers, or between yourself on different days, I think. The main thing is that your changes are applied to a running image without a restart, and you can build up a running program iteratively like that. If you need to get back to a particular state after resetting your image (restarting your Lisp process), you’d set it up the same way you set up fixtures for a unit test in a stop-compile-run language.

                                                  1. 1

                                                    Well if some programmers do care about their image to the point they will want to version it in some way, or work on it collaboratively over a long period of time… how they do it remains a total mystery to me.

                                                    1. 1

                                                      I think some Smalltalk implementations have had some kind of tooling for versioning images, but I have no experience of them.

                                              2. 3

                                                the problem of updating the code within an image is the same problem as updating a database. I think you meant to analogise the database with the state, not the code…

                                                Yes and no, but I definitely wasn’t clear xD . I had in mind that updating functions (or methods) in an image is somewhat analogous to updating views in a database. But anyway, you seem to understand the problem of updating the state of a running image well enough :)

                                                So at a first glance, image based programming at scale (let’s say more than 10 devs working on the same code base & associated state) looks utterly impossible. How does that work in practice?

                                                As @vindarel said, people don’t do that, but!

                                                1. It’s a problem that I find interesting and think about from time to time. The solution, I think, would be something like “gating” the changes made to the running images. So, one dev could redefine a function (in a sandbox, which I have no idea how that would work exactly), if it passes the tests, then the change is applied to the “real” image. It would practically be a CI-within-a-process.

                                                2. People most definitely don’t care about their dev images, but some do care about the real, always running, production image. In which case they go to great lengths to make sure the code that is going to be evaluated during a release doesn’t mess up the state, doesn’t cause any (foreseeable) bugs and doesn’t pollute the image. I think this is where my analogy with database migrations came from: applying successive versions of the codebase to a running image is analogous to running successive migrations on a database, which brings me to:

                                                3. Checks: in Common Lisp, you can compile the code before actually evaluating it. SBCL (the most commonly (ha!) used implementation) will catch a lot of “type errors” (which it often reports as warnings, but you can count those warnings as errors). Also, nothing keeps you from running tests for anything you can think of.
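The “gating” idea from point 1 can be sketched in a few lines of Python. This is only a toy under loud assumptions: `gated_redefine` is a hypothetical helper, the “sandbox” is a shallow copy of the namespace (so mutable state is still shared with the real image), and tests are plain callables that receive the namespace.

```python
def gated_redefine(image, new_source, tests):
    """Apply new_source to image only if every test passes in a sandbox copy."""
    sandbox = dict(image)  # shallow copy: a cheap, imperfect sandbox
    try:
        exec(new_source, sandbox)
        ok = all(test(sandbox) for test in tests)
    except Exception:
        ok = False
    if ok:
        exec(new_source, image)  # promote the change to the "real" image
    return ok


image = {}
exec("def double(x):\n    return x * 2", image)

# One hypothetical test: double(3) must be 6.
tests = [lambda ns: ns["double"](3) == 6]

bad = gated_redefine(image, "def double(x):\n    return x + 2", tests)
good = gated_redefine(image, "def double(x):\n    return 2 * x", tests)
print(bad, good)  # prints: False True
```

The rejected redefinition never touches the real namespace; only the one that passes the tests is evaluated there.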

                                                How do you update the state of a production instance?

                                                Imagine you have an ever running python or js repl, how would you go about updating a global variable that contains all the state (to keep it simple)?

                                                The first thing that comes to my mind is: only access the state via some specific functions (often called accessors). Update them to support the new and the old state “shape” (I think it’s what you meant by “architecture”) before updating the state.
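A minimal Python sketch of that accessor approach, with everything in it hypothetical: `STATE` is the single global, the old shape stores one `name` string per user, and the new shape splits it into `first` and `last`. The accessor is updated first so that it reads both shapes, and only then is the state migrated.

```python
# Hypothetical global state; the old shape stores a single "name" string.
STATE = {"users": {1: {"name": "Ada Lovelace"}}}


def user_display_name(user_id):
    """Accessor that understands both the old and the new shape."""
    user = STATE["users"][user_id]
    if "name" in user:  # old shape
        return user["name"]
    return f'{user["first"]} {user["last"]}'  # new shape


def migrate_users():
    """Rewrite every user record from the old shape to the new one."""
    for user in STATE["users"].values():
        if "name" in user:
            first, _, last = user.pop("name").partition(" ")
            user["first"], user["last"] = first, last


before = user_display_name(1)  # reads the old shape
migrate_users()
after = user_display_name(1)   # reads the new shape
```

Callers that go through the accessor see the same answer before and after the migration, so the state can be changed while the program keeps running.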

                                                How do you keep your program from irrecoverably mangling its state?

                                                I covered it a bit, but you kinda just have to be mindful about the changes to the state; type-checking at compile time, assertions at run-time and automated tests all help. Also keep in mind that you can (and are very often strongly encouraged to) use functional data structures, which helps a lot when you are wary of mangling the state. BUT, bugs can happen, same as with database migrations: it can go wrong even with a lot of precaution.

                                                How do you deal with irrecoverably mangled state?

                                                As vindarel said, just restart the process :P

                                                How do you version the code?

                                                Same as with all “mainstream” programming languages. You use git.

                                                But, @gcupc is right about Smalltalk. Smalltalk’s source code was traditionally part of the image. Smalltalk was much more “image-based” than Common Lisp in that respect. I don’t know much more about it though… Except perhaps that I know that Pharo, a recent Smalltalk implementation, tries hard to make Smalltalk work with source files, so that people can use “traditional” command line tools on those files.

                                                How do you version the state?

                                                I never heard anybody doing that exactly. But the “state” would be just a file, so you could use git (or better, git-lfs) to version that file. But I don’t think that’s what you had in mind. (And sbcl is super picky about running an image from a different version, so you would be stuck with a specific version of the lisp implementation.)

                                                I think that what you had in mind is “I have a running process, I apply some code change to it, how does the code know what the state’s schema is right now?”. I see 2 solutions: you can do it implicitly, by using predicates that check the types at runtime, or you can keep a “global” variable that explicitly keeps a version number.
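The explicit version number variant could look roughly like this in Python. It is a sketch with made-up names (`STATE`, `schema_version`, `MIGRATIONS`, `upgrade`): each migration step upgrades the state by exactly one version, and newly loaded code runs `upgrade` before touching the state.

```python
# Hypothetical long-lived state with an explicit schema version.
STATE = {"schema_version": 1, "hits": 3}


def _v1_to_v2(state):
    # v2 turns the plain counter into a dict of counters.
    state["hits"] = {"total": state["hits"]}


def _v2_to_v3(state):
    # v3 adds an error counter.
    state["hits"]["errors"] = 0


# Maps "current version" to the step that upgrades it by one.
MIGRATIONS = {1: _v1_to_v2, 2: _v2_to_v3}


def upgrade(state):
    """Apply migration steps until no step exists for the current version."""
    while state["schema_version"] in MIGRATIONS:
        MIGRATIONS[state["schema_version"]](state)
        state["schema_version"] += 1
    return state


upgrade(STATE)
print(STATE)  # prints: {'schema_version': 3, 'hits': {'total': 3, 'errors': 0}}
```

Calling `upgrade` again is a no-op, so it is safe to run on every reload, which is essentially how database migration tools track what has already been applied.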

                                                How do you share state among several collaborators?

                                                Depends what you mean. But in theory, if many collaborators have a running image that they don’t want to close for some reason, they can just git pull the most recent version of the codebase and apply that code to their running image. See point 3, about releasing updates without down-time.

                                                How do you update the code of a production instance while keeping existing state? / How do you update the state of a production instance? / How do you update the state architecture of a production instance?

                                                See point 3.

                                                And perhaps most importantly, do you even need all of the above?

                                                No, you don’t need it.

                                                Why not?

                                                Because it’s freaking hard xD, but I think you already guessed it.

                                                / disclaimer: this message is way too long, I’m not gonna proof-read it just now ;)

                                          1. 5

                                            The takeaway here seems very wrong. The solution to this problem is to use one of the many barcode formats with built-in error tolerance. Missing a line of pixels should not destroy your data, when you’re probably gluing that barcode somewhere that people can scratch it. This is a design problem, not an implementation problem.

                                            1. 8

                                              OP here; I agree that error detection/correction would be better.

                                              Some context:

                                              • these events happened 20 years ago
                                              • the (hard to sway) customer specified the barcode format
                                              1. 4

                                                ∀ problems ∃ a solution that’s both perfect and unavailable ;)

                                                Seriously, if you’re going to print, it makes sense to print in such a way as to minimise the impact of the hardware’s weaknesses on the output. It’s like woodworking, where skilled people look at the wood before deciding where to saw, and decide in order to avoid weak spots in the result.

                                                You could make something else instead of the thing you’re making. Make a stool instead of a chair, or print something else. That something else would also be best done if you use your tools in a way that suits the materials and requirements.

                                                1. 1

                                                  Aha, implicit project scope requirements defined by the customer. Sounds about right. ;-)

                                                2. 2

                                                  Why not both?

                                                  1. 1

                                                    Because if you use a format with error correction, you never have to worry about your printer.

                                                    1. 3

                                                      “Never”

                                                      1. 1

                                                        No more than anyone does. If your printer fails, you’re hosed, but that’s easy to check. You don’t need a comment in your code that says “// Rotate barcode 90 deg for printing, to avoid dead lines”

                                                1. 8

                                                  If you hate reading long Twitter threads as much as I do, esp. on mobile: https://threadreaderapp.com/thread/1566421109255315458.html

                                                  1. 3

                                                    Thanks!

                                                  1. 2

                                                    For Python tests based on unittest you can use a test runner that supports parallelization. Two that I know of that can do so are zope.testrunner, which is a very mature test runner (it comes from the Zope project but doesn’t actually have anything to do with Zope proper) and Green which is somewhat newer and “thinner” but is quite nice.

                                                    For aggregating across test runners, subunit is one I’ve used and liked. I’m not sure it’s still under active development.

                                                    1. 1

                                                      And if you’re using pytest, there’s pytest-xdist.

                                                      1. 1

                                                        OK but that doesn’t do anything about C++ tests right? Or you would have to wrap each of your C++ tests with a Python test that looks at the exit code?

                                                        I would want to run them in parallel with each other, not just Python tests in parallel, then C++ tests in parallel :)

                                                        1. 1

                                                          There are cross-language test aggregators that do various things in this space. They may or may not solve the parallelization part, but they assist in aggregating runs.

                                                          At a minimum you could do something like spawn 2 C++ test runs, each with half the tests and aggregate the results.

                                                          Wrapping C++ tests in Python might not be too bad if there aren’t any C++ test runners that help.
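The “spawn several runs and aggregate” idea needs very little code if exit codes are all you care about. A minimal sketch using only the Python standard library follows; the commands are placeholders (each just exits with a fixed code), where a real setup might run a C++ test binary with a filter for half the tests next to a Python test runner. `run_suite` and `run_all` are hypothetical helper names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_suite(command):
    """Run one test command and return (command, exit code)."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return command, completed.returncode


def run_all(commands, max_workers=4):
    """Run every command concurrently; succeed only if all exit with 0."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_suite, commands))
    return all(code == 0 for _, code in results), results


# Placeholder commands standing in for real test runs.
commands = [
    [sys.executable, "-c", "raise SystemExit(0)"],
    [sys.executable, "-c", "raise SystemExit(0)"],
    [sys.executable, "-c", "raise SystemExit(3)"],
]
ok, results = run_all(commands)
print(ok, [code for _, code in results])  # prints: False [0, 0, 3]
```

Threads are enough here because the work happens in the child processes. This only aggregates pass/fail per command; per-test reporting would need a shared result format such as the one subunit provides.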

                                                      1. 4

                                                        I’m not sure I’d consider this “illegal”, but I also am not a lawyer.

                                                        Seems like licensing is just broken in the age of dozens of libraries statically compiled into a single binary. There seem to be no good solutions.

                                                        1. 18

                                                          I’m not sure I’d consider this “illegal”, but I also am not a lawyer.

                                                          Distributing a binary that incorporates object code derived from source code that isn’t licensed to permit that distribution is clearly illegal. It violates copyright law in the same way as hosting (for example) a pirated copy of Photoshop. It might also be a crime, depending on the jurisdiction and the facts of the situation.

                                                          I’m also not a lawyer, but none of this stuff is magic. If someone can read an API reference manual they can read a statute. Lawyers are mostly useful for when (1) you’re going to be doing something that might be illegal and you want to be very careful about what is or isn’t allowed, or (2) someone is accusing you of having done something illegal and you’d prefer that the judge/jury disagree with them.

                                                          1. 8

                                                            Sadly that bit about statutes is only partly true. Precedent plays an important role in common law, which in practice results in a very tangled web of implicit dependencies. As in: important considerations are not specified in the statute anywhere; you need to be familiar with that entire body of law and its history. :(

                                                            1. 5

                                                              That’s generally only if you want to get close to the edge of what the law allows, or get a definitive(-ish) answer to an ambiguous case.

                                                              Let’s say it’s currently 2015, and you’ve recorded a cover of a song whose author passed away in 1946. Is it legal for you to distribute that recording? The statute says it’s not currently legal, because that song won’t enter the public domain until January 1st 2017 (author’s year of death + 70 years).

                                                              But if I ask about the specific song “Happy Birthday” and the date is December 3rd 2015, then the answer is “yes it’s legal”, due to the ruling in Marya v. Warner/Chappell.

                                                              The law provides the defaults, and precedents provide special-cased overrides, but those special cases only end up being argued in court because they’re weird. You don’t need to study the case law and keep up-to-date with rulings to stay within the bounds of the statute, you only need that if you want to go beyond what the statute says, and still be OK.
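The statutory default in the 1946 example above is plain arithmetic, which a few lines of Python make explicit. This is a sketch of the “year of death + 70” rule exactly as described in this comment, with the assumption that protection runs through the end of that calendar year; the function names are made up, and it deliberately knows nothing about case-law overrides.

```python
def public_domain_year(author_death_year, term_years=70):
    """First year in which the work is in the public domain.

    Assumes protection lasts through the end of the calendar year
    author_death_year + term_years, as in the example above.
    """
    return author_death_year + term_years + 1


def is_public_domain(author_death_year, current_year, term_years=70):
    """True once the current year has reached the public-domain year."""
    return current_year >= public_domain_year(author_death_year, term_years)


print(public_domain_year(1946))      # prints: 2017
print(is_public_domain(1946, 2015))  # prints: False
print(is_public_domain(1946, 2017))  # prints: True
```

The “Happy Birthday” ruling is exactly the kind of special case this default cannot express, which is the point about precedents acting as overrides.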

                                                            2. 4

                                                              I’m also not a lawyer, but none of this stuff is magic.

                                                              Yes, which is why saying this is “illegal” is silly. No one will ever be convicted of accidentally mislicensing a free product. There’s no actual legal liability here. It’s just people getting mad about stuff that doesn’t matter.

                                                              1. 2

                                                                Thanks, this was the point I was getting at. Strong words are being used but this is far from settled law, and likely never will be.

                                                              2. 1

                                                                The author’s use of the term illegal seems to be solely reserved for the distribution of the unlicensed library. That code is surely not copyrighted as well. That doesn’t meet what I consider to be a clear violation of the law. This seems to be a grey area to me.

                                                                I agree we don’t need lawyers to read the statutes. Where lawyers help a lot is that they know more of the statutes and the established rulings around them. There is a lot of precedent in the decision of what is illegal, as another comment points out.

                                                                1. 8

                                                                  The author’s use of the term illegal seems to be solely reserved for the distribution of the unlicensed library. That code is surely not copyrighted as well.

                                                                  Creative works written by individuals are copyrighted by default, and distribution requires permission from the copyright holder. It’s a “default deny” model.

                                                                  1. 1

                                                                    I still think this is a grey area. If someone copies it, the author may have a hard time proving they are the original creator. If they even are.

                                                                    In the US, just having the copyright from creation doesn’t grant you the right to sue over it. It must be registered.

                                                                    https://www.copyright.gov/help/faq/faq-general.html

                                                                    You will have to register, however, if you wish to bring a lawsuit for infringement of a U.S. work.

                                                                    1. 6

                                                                      I think you’re interpreting “illegal” as “likely to be punished by the legal system”, which is not what it means.

                                                                      If I were to write a big post about some widget and post it on my blog for free, then someone re-hosting it on their own site would be committing copyright infringement. The same would be true even if I posted it on Github instead of my blog. The act of distributing my copyrighted work without permission would be illegal, regardless of whether I was charging money for it, or had registered it with the US copyright office.

                                                                      Consider an author in Europe, who has just published their first book. If someone in the USA bought it, scanned it, and posted it online, that would be illegal even if the author had never stepped foot in the USA and had never even heard of copyright.gov. Again, the model is “default deny”. If someone doesn’t have permission (such as via an open-source license, or Creative Commons, etc) then distribution is illegal.

                                                                      1. 4

                                                                        No, I’m interpreting illegal exactly as you do. I’m also claiming that the author’s use of it in this case is a bit hyperbolic and over the top.

                                                                        Illegal doesn’t mean much if it’s not enforceable nor enforced. Worrying about something being illegal that can’t be enforced is not worth the effort in my opinion.

                                                                  2. 3

                                                                    That code is surely not copyrighted as well.

                                                                    What do you mean by that?

                                                                    1. 1

                                                                      In the US, if you have not explicitly registered your copyright with the Copyright Office then you have no standing to sue.

                                                                      1. 6

                                                                        That’s incorrect. All it does is protect against an innocent infringement defence (e.g. “I didn’t know it was copyrighted”). https://www.law.cornell.edu/uscode/text/17/401

                                                                        1. 2

                                                                          https://www.copyright.gov/help/faq/faq-general.html

                                                                          You will have to register, however, if you wish to bring a lawsuit for infringement of a U.S. work.

                                                                          1. 2

                                                                            The registration can occur after the infringement. (You will be restricted to actual damages if you do this, though.)

                                                                        2. 4

                                                                          While true as stated (https://www.bfvlaw.com/copyright-registration-required-to-sue-the-supreme-court-clarifies/), the registration need not be immediate or even pre-infringement.

                                                                          1. 1

                                                                            Yeah this doesn’t surprise me. It still requires the author to explicitly act.

                                                                            1. 2

                                                                              Yes, but this has nothing to do with whether the work is copyrighted or not. It only affects how the court case would turn out.

                                                                          2. 3

                                                                            that’s not the same as the code not having copyright

                                                                        3. 2

                                                                          My non-lawtalker understanding is that copyright violation is a civil issue not a criminal one, so while possibly actionable, I would hesitate to say infringement is illegal per se. (IANAL but I’ve read techdirt daily for a couple of decades now.)

                                                                          1. 9

                                                                            copyright violation is a civil issue not a criminal one, so while possibly actionable, I would hesitate to say infringement is illegal per se

                                                                            “Illegal” means it violates the law, civil or criminal. If something violates criminal law, it’s called a “crime”.

                                                                      2. 3

                                                                        If someone deliberately posts their code unlicensed to Github, I find it pretty hard to believe that any court in the US is going to enforce a copyright claim later because the normal assumption is that by posting it in public, you’re waiving your rights. It’s like leaving stuff on the curb in front of your house: yeah technically you should sign some document saying that you’re giving up possession, and there are cases of people ganking something from a curb that wasn’t intended to be ganked, but it’s hard to imagine making a successful lawsuit about an honest mistake.

                                                                        The bit about “static linking” is also very dumb. If someone changed Go to build as a series of DLLs instead of a single binary, suddenly the licensing situation would be totally different, lol. Who cares? Static vs dynamic linking is just a proxy for code-you-happen-to-use vs code-core-to-the-product, and in Go, it’s a poor proxy because everything is static.

                                                                        1. 12

                                                                          because the normal assumption is that by posting it in public, you’re waiving your rights

                                                                          That’s really not how copyright works. And lost legal cases of “The image was posted online, I thought I could use it” aplenty.

                                                                          suddenly the licensing situation would be totally different, lol.

                                                                          Not really? What would change is that the executable without the dependency would be differently licensed, but for the complete package nothing changes (and thus the first part doesn’t matter all that much, what good is an executable without a required dependency).

                                                                          1. 2

                                                                            And lost legal cases of “The image was posted online, I thought I could use it” aplenty.

                                                                            That’s totally different from code. If you post a picture, the assumption is that you retain your copyright because that’s how all of photography as a business works. People have to look at pictures for them to have value.

                                                                            If you post your code to a site that exists to share code and accidentally a license, then the good faith interpretation is that you meant it to be public domain but forgot or didn’t bother to explicitly license it. You have to put a photo online to use the photo. You only put code online so you can give it away. There’s no other reason to upload it publicly.

                                                                            It’s like someone driving away with an unlocked car on your curb. You can’t drive away with a car even if it’s unlocked because it’s a car, and this is even though it would be totally legitimate to haul off a TV or whatever.

                                                                            1. 11

                                                                              I believe that your interpretation is wrong. Unless you know what license the author intends, you have no way to use it safely. There are plenty of reasonable reasons to put a piece of code public without “giving it away” or “putting it in the public domain” (which isn’t actually possible in many places of the world). They may, for example, have in mind a “shared source” sort of thing where the source is available for study but not anything else.

                                                                              1. 1

                                                                                In this case, the go-diff author added an MIT license after being bugged about it. Because why else would they have uploaded it in the first place?

                                                                                1. 6

                                                                                  Because why else would they have uploaded it in the first place?

                                                                                  Maybe they’re too cheap to get private git hosting. Maybe they wanted to publish it for educational reasons. Maybe so they can link to it from their resume.

                                                                              2. 9

                                                                                You only put code online so you can give it away.

                                                                                You could argue the same that people only put photos on Flickr to give them away, that doesn’t make it any more true. If Github wanted to be a purely “give code away for others to use” site, they’d require you to license it.

                                                                                1. 1

                                                                                  It really depends on your usage. You could likely easily get away with that for some hobby project. But once you start a company that distributes software relying on other libraries and one of them is “i guess public, i hope”, the risk of getting it wrong just doesn’t make sense. Especially once you start earning serious amounts and may be sued for damages.

                                                                                  1. 1

                                                                                    Sure. This is why Google kicked all packages without licenses off of the Go online documentation viewer. But people are getting really carried away in this case. It’s obvious that the point of uploading go-diff to Github was to give it away. It turned out that when contacted, the author added an MIT license. Did we really all need to freak out about something that had essentially $0 in liability implications?

                                                                                    1. 3

                                                                                      Did we really all need to freak out about something that had essentially $0 in liability implications?

                                                                                      It’s not just about this particular package, but about the fact that so many distros “illegally” packaged it without thinking. This points to a systemic issue.

                                                                              3. 2

                                                                                I think the static linking comes down to the conflict with the policies of the distributions. They’re in a bind for sure, if they want to follow the letter of the law.

                                                                                In the Debian ecosystem the Go model also conflicts with their packaging policies themselves. Debian goes to great lengths creating packages for every individual Go library, before a binary can be built using Go. They effectively ignore and break the modules system in order to shoehorn Go applications into a system built for dynamically linked applications.

                                                                            1. 2

                                                                              I’m doing a little side project that involves linting Makefiles and one thing I could use is a Makefile parser. The linked project has a simple parser, but I’ve been unable to find anything more comprehensive.

                                                                              Suggestions welcome.

                                                                              1. 2

                                                                                Have you looked here?

                                                                                1. 2

                                                                                  That one is new to me; thanks!

                                                                                  I also found this small parser in JS: https://github.com/kba/makefile-parser

                                                                              1. 3
                                                                                • Seeing the meteor shower tonight
                                                                                • Continue playing RE4 on my PS2 on the old retro wood panel CRT in my now cleaned retro room (All hail FreeMcBoot)
                                                                                • Researching about more ways to create AST parsers (I’ve recently been using linting more aggressively to avoid code styling discussions and it’s been excellent… I want to employ this for anything.)
                                                                                1. 3

                                                                                  I’ve recently been using linting more aggressively to avoid code styling discussions and it’s been excellent… I want to employ this for anything.

                                                                                  Oh! I’ve just started working on what I’ve been thinking of as a “project linter”. There’s currently more lines of Make than Python in the project, so there’s not much to see, but the README might be worth reading: https://github.com/benji-york/fend

                                                                                  1. 2

                                                                                    Re. parsers: I don’t know if this is what you’re looking for, but Lark looks like a really nice parser framework.

                                                                                1. 2

                                                                                  Good stuff! I have come to many of the same conclusions and even many of the same implementations—especially Make as the primary developer interface to a project.

                                                                                  My variations include:

                                                                                  • avoidance of external virtual environment and dependency management (poetry, pipenv, etc.) because they do not yet provide enough value for their costs (given my particular value weightings)
                                                                                  • slightly more standard Make target names (e.g., “dist” instead of “build”; I use “build” as “do whatever is needed to make this a working dev environment”); the ones in this video are very close and I don’t see a reason for them to change; for details see https://www.gnu.org/software/make/manual/html_node/Standard-Targets.html#Standard-Targets

                                                                                  Here’s a representative example of the style of Makefile I use: https://github.com/flowoilwell/otable/blob/master/Makefile

                                                                                  Here’s a slightly more complex example; I especially like how coverage is done in this project: https://github.com/benji-york/manuel/blob/master/Makefile

                                                                                  1. 3

                                                                                    Neat. Thank you for the compliment and for sharing. I really like the conditional definition of release in the second example. I’m going to meditate on that a bit and see if there’s a way I can safeguard some tasks similarly. Our releases are done on CI; no one should push from local builds, but I think I’ve got some other tasks worth guarding.

                                                                                    1. 2

                                                                                      I’m glad you saw something interesting there.

                                                                                      For a slightly wilder Make experiment, check out this alternative way of specifying PHONY targets: https://twitter.com/benji_york/status/1550138533494628357

                                                                                  1. 14

                                                                                    It’s fun to see how different languages handle the same patterns. My fav Clojure relies on plain maps as its core data structure and explicitly rejects paragraphs like this:

                                                                                    Not only do dicts allow you to change their data, but they also allow you to change the very structure of objects. You can add or delete fields or change their types at will. Resorting to this is the worst felony you can commit to your data.

                                                                                    1. 7

                                                                                      I haven’t coded that much Clojure (although I keep tabs on it), but as someone who jumps back and forth between Python and JavaScript I’ve noticed the same distinction.

                                                                                      I think the reason dicts are not used as much in Python as objects in JavaScript and maps in Clojure is that dicts feel like second-class citizens; first of all, dict access is awkward. Compare foo[“bar”] with eg foo.bar or (:bar foo)—both JS and Clojure have affordances for literal key access which are missing in Python.

                                                                                      Secondly, tooling support (autocompletion etc) has historically been very poor for dicts. This has sort of changed with the introduction of typed dictionaries, but then you have to wrestle with the awkward dict syntax again.

                                                                                      Ultimately I think this often is to Python’s detriment—for example, translating Python classes to a wire protocol often involves quite a lot of boilerplate. Luckily there are libraries like Pydantic that provide good solutions to this problem, but it’s still not as seamless as eg serializing Clojure maps.
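
                                                                                      A small sketch of both points, using only the standard library. The `User` class and its fields are made up for illustration, and `dataclasses.asdict` stands in here for what Pydantic and similar libraries do more thoroughly:

```python
import json
from dataclasses import asdict, dataclass

# Plain dict: key access needs brackets and quotes, but it serialises directly.
user_dict = {"name": "Ada", "age": 36}
dict_wire = json.dumps(user_dict, sort_keys=True)


@dataclass
class User:  # hypothetical example type, not from any real codebase
    name: str
    age: int


user_obj = User(name="Ada", age=36)

# Attribute access is lighter to read, but json.dumps(user_obj) would raise
# TypeError, so the object has to be converted to a dict first.
obj_wire = json.dumps(asdict(user_obj), sort_keys=True)

print(user_dict["name"], user_obj.name)
print(dict_wire == obj_wire)
```

                                                                                      For a flat record like this the extra step is a single `asdict` call; the boilerplate tends to grow with nested types, renamed fields, and validation.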

                                                                                      1. 6

                                                                                        It’s funny you mention JS because if I needed an arbitrary key-value map I would not just use literal keys, but rather a Map. Consider what happens if you try to use a key named ‘toString’, or ‘hasOwnProperty’.

                                                                                        I also don’t think you should just be shoving your in-memory representation over the wire unless it’s extremely simple, especially in situations where you might need to change it and your client and your server might get out of sync in terms of versions.

                                                                                        1. 4

                                                                                          for example, translating Python classes to a wire protocol often involves quite a lot of boilerplate. Luckily there are libraries like Pydantic that provide good solutions to this problem, but it’s still not as seamless as eg serializing Clojure maps.

                                                                                          I guess I don’t really get this one.

                                                                                          Pydantic is the hot new kid on the block, sure, but if you’re building networked services this stuff is table stakes and has been for years and years. If you use Django there’s DRF serializers. If you don’t use Django there’s Marshmallow. In both cases the tooling can auto-derive serialization and deserialization and at least basic type-related validation from whatever single class is the source of truth about your data’s shape, whether it’s an ORM (Django or SQLAlchemy) model, or a dataclass or whatever.

                                                                                          So I literally cannot remember the last time I had to write “quite a bit of boilerplate” for this. Maybe if I were one of the people I see occasionally who insist they’ll never ever use a third-party framework or library? But that seems like a problem with the “never use third-party”, not with the language or the ecosystem.

                                                                                        2. 5

                                                                                          At the same time, I understand Clojure has this inclination toward keys that aren’t just any old string, but are namespaced and meaningful. I wonder if Clojure programs at a certain complexity would still translate from wire format maps to domain model maps?

                                                                                          1. 3

                                                                                            At the same time, I understand Clojure has this inclination toward keys that aren’t just any old string, but are namespaced and meaningful.

                                                                                            And even with that, sometimes it can be very hard to get your bearings when you’re jumping in at a random piece of code. Figuring out what the map might look like that “should” go into a certain function can be very difficult.

                                                                                            I wonder if Clojure programs at a certain complexity would still translate from wire format maps to domain model maps?

                                                                                            I work on a complex codebase and we use a Malli derivative to keep the wire format stable and version the API. The internal Malli model is translated to JSON automatically through this spec, and it also ensures that incoming data is well-formed. It’s all rather messy and I’m not sure if I wouldn’t prefer manual code for this because Malli is quite heavy-handed and its metaprogramming facilities are hard to use and badly documented.

                                                                                          2. 5

                                                                                            The advice in the article is wrong for Python as well. Dicts are not opaque, it’s wrapping them in bespoke custom classes that makes data opaque. I should probably blog about it, because there’s much more I want to say than fits in a comment.

                                                                                            1. 10

                                                                                              Dicts are not opaque, it’s wrapping them in bespoke custom classes that makes data opaque.

                                                                                              Dicts aren’t opaque in the sense of encapsulation, but they’re opaque in the sense of making it harder on the developer trying to figure out what’s going on.

                                                                                              If I’m working with a statically-typed codebase (via mypy), I can search for all instances of a given type. I can also look for all accesses of a given field on that type. It’s not possible to usefully do this with a dict, since you’re using dicts everywhere. You also can’t say “field X has type Y, field Z has type Q” unless you use TypedDict and then at that point you don’t gain anything from not using a real class.

                                                                                              Similarly, I can look at the definition for the class and see its fields, methods, and docstrings. You can’t do that with a dict.

                                                                                              I’ve been working with a codebase at $WORK that used dicts everywhere and it was a huge pain in the ass. I’ve been converting them to dataclasses as I go and it’s a lot more convenient.
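
                                                                                              To make the comparison concrete, here is a rough sketch with a made-up two-field record (`UserDict` and `User` are hypothetical names). Both forms give a checker like mypy the field types, but only the dataclass exists as its own type at runtime:

```python
from dataclasses import dataclass, fields
from typing import TypedDict


class UserDict(TypedDict):  # hypothetical record, typed-dict flavour
    name: str
    age: int


@dataclass
class User:  # the same hypothetical record as a real class
    name: str
    age: int


as_dict = UserDict(name="Ada", age=36)
as_obj = User(name="Ada", age=36)

# A TypedDict "instance" is an ordinary dict once the program runs.
print(type(as_dict) is dict)

# The dataclass keeps its own type and can be asked for its fields.
print(type(as_obj).__name__)
print([f.name for f in fields(User)])
```

                                                                                              Searching a codebase for `User` finds every use of the dataclass, while the TypedDict values are indistinguishable from any other dict at runtime.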

                                                                                              1. 1

                                                                                                You might be interested in TypedDict (also described in PEP-589) and the additions to TypedDict in PEP-655.

                                                                                                1. 1

                                                                                                  I’ve used TypedDict as a transitional measure while doing the dicts-to-dataclasses thing; it was definitely super helpful there.

                                                                                                  1. 1

                                                                                                    I’m not sure why TypedDict exists. You may as well opt for a dataclass or pydantic. Maybe it’s useful for typing **kwargs?

                                                                                                    1. 2

                                                                                                      The primary idea being that if a dictionary is useful in a given circumstance, then a dictionary with type assertions is often even more useful. The motivation section of the PEP expands on that a little.

                                                                                                      1. 1

                                                                                                        it exists to help add type checking support to code that needs to pass dicts around for whatever reason (e.g. interop with legacy libraries)

                                                                                                  2. 2

                                                                                                    Please post the article here if you do write it, cuz I don’t know much about Python best practices.

                                                                                                    1. 1

                                                                                                      I will. Although I don’t have any official claim at those practices being officially “best”. I only know they work best for me :-)

                                                                                                1. 3

                                                                                                  Caveat: I don’t have a lot of experience with C (or C++).

                                                                                                  I had the impression from reading various internet discussions that null-terminated strings were considered a mistake. After some searching, I found multiple impassioned defenses of them on Stack Overflow. This gives me more context and understanding for why null-terminated strings were chosen for C, but doesn’t provide any reason why almost no languages since use them.

                                                                                                  What is Zig’s rationale for using null terminated strings?

                                                                                                  1. 4

                                                                                                    Zig has support for both null-terminated and non-null-terminated strings. The []const u8 type, which is the convention for strings, is non-null-terminated. The default type for a string literal is *const [N:0]u8. This can then coerce into a []const u8, which is a slice. Null-terminated strings are useful for C interop, but slices are very useful also.

                                                                                                    1. 3

                                                                                                      As someone who only knows a little of Zig, my guess is that the decision is a consequence of Zig’s origin. Zig is meant to be a better C. C uses null-terminated strings and (nearly) every C library does. Therefore, supporting them in an essential way seems hard to get away from.

                                                                                                      1. 3

                                                                                                        EDIT: looks like g-w1 actually knows: Zig has both kinds of strings, and the null-terminated ones are for C interop.

                                                                                                      2. 2

                                                                                                        Relying on the null terminator causes problems because calculating lengths (and doing bounds checks on random access) is O(n). C used null terminators because space was very constrained. A length field the same size as a null byte (as Pascal used) limited strings to 255 characters, which caused a lot of problems. If you have a 32-bit or 64-bit size field, you’re typically not losing much (especially if you do short-string optimisation and reserve a bit to indicate whether the short string is embedded in the space used by the size and pointer).
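
                                                                                                        A rough sketch of the two layouts, modelled with Python bytes rather than real C or Pascal strings. The helper names (`c_strlen`, `pascal_pack`, `pascal_len`) are made up for illustration:

```python
import struct


def c_strlen(buf: bytes) -> int:
    """Length of a NUL-terminated buffer: scans for the terminator, O(n)."""
    return buf.index(b"\0")


def pascal_pack(text: bytes) -> bytes:
    """One-byte length prefix followed by the bytes (hypothetical helper)."""
    # struct's "B" format is an unsigned byte, so lengths above 255 are rejected.
    return struct.pack("B", len(text)) + text


def pascal_len(buf: bytes) -> int:
    """Length is read straight from the first byte, O(1)."""
    return buf[0]


c_style = b"hello\0"
p_style = pascal_pack(b"hello")
print(c_strlen(c_style), pascal_len(p_style))

try:
    pascal_pack(b"x" * 256)
except struct.error:
    print("256 bytes do not fit behind a one-byte length prefix")
```

                                                                                                        Both layouts cost one extra byte for short strings; the difference is whether that byte buys a constant-time length or an unbounded one.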

                                                                                                        In contrast, having the null terminator can make C interop easier because you don’t need to copy strings to convert them to C strings. How much this matters depends a lot on your use case. Having the null terminator can cause a lot of problems if you have one inconsistently. For example:

                                                                                                        $ cat str.cc
                                                                                                        #include <string>
                                                                                                        #include <cstring>
                                                                                                        #include <iostream>
                                                                                                        
                                                                                                        int main()
                                                                                                        {
                                                                                                                std::string hello = "hello";
                                                                                                                auto hello_null = hello;
                                                                                                                hello_null += '\0';
                                                                                                                std::cout << hello << " == " << hello_null << " = " << (hello == hello_null) << std::endl;
                                                                                                                std::cout << "strlen(" << hello << ".c_str()) == " << strlen(hello.c_str()) << std::endl;
                                                                                                                std::cout << "strlen(" << hello_null << ".c_str()) == " << strlen(hello_null.c_str()) << std::endl;
                                                                                                        }
                                                                                                        $ c++ str.cc && ./a.out
                                                                                                        hello == hello = 0
                                                                                                        strlen(hello.c_str()) == 5
                                                                                                        strlen(hello.c_str()) == 5
                                                                                                        

                                                                                                        Converting a C++ standard string to a C string implicitly strips the null terminator (it’s there, you just can’t see it), which means that strlen(x.c_str()) and x.size() will be inconsistent.

                                                                                                        The biggest mistake that a string library can make is coupling the string interface to a string representation. A contiguous array of bytes containing a UTF-8 encoding is fine for a lot of uses of immutable strings, but what happens if you want to iterate over grapheme clusters (or even Unicode code points)? If you do this multiple times for the same string, you can do it much more efficiently if you cache the boundaries with the string. For mutable strings, there are a lot more problems. Consider adding a character to the middle of a string with the contiguous-array representation. It’s an O(n) operation in the length of the string, because you have to reallocate and copy everything. With a model that over-allocates the buffer, it’s O(n) in the length of the tail of the string, with periodic O(n) copies when the buffer is exhausted (amortised to something better depending on the policy). With a twine-like representation, insertion can be cheap but indexing may be more expensive. The optimal string representation depends hugely on the set of operations that you want to perform. If your string operations aren’t abstracted over the representation, there’s pressure to use a non-optimal representation.
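To make the trade-off above concrete, here is a minimal Python sketch, not anyone's real library. The names `Text`, `Contiguous`, `Twine`, `char_at` and `count_char` are all invented for illustration. The interface exposes a few primitives; one representation copies the whole buffer on insertion, the other splits a single segment but has to walk segments to index.

```python
from abc import ABC, abstractmethod
from typing import List


class Text(ABC):
    """Hypothetical interface: a few primitives, everything else derived."""

    @abstractmethod
    def __len__(self) -> int: ...

    @abstractmethod
    def char_at(self, index: int) -> str: ...

    @abstractmethod
    def insert(self, index: int, piece: str) -> "Text": ...

    def to_str(self) -> str:
        # Generic fallback built only on the primitives; a representation
        # may override it with a faster specialised version.
        return "".join(self.char_at(i) for i in range(len(self)))


class Contiguous(Text):
    """One flat buffer: cheap indexing, insertion copies everything."""

    def __init__(self, data: str) -> None:
        self.data = data

    def __len__(self) -> int:
        return len(self.data)

    def char_at(self, index: int) -> str:
        return self.data[index]

    def insert(self, index: int, piece: str) -> "Contiguous":
        if not 0 <= index <= len(self.data):
            raise IndexError(index)
        # Builds a brand-new string: O(n) in the total length.
        return Contiguous(self.data[:index] + piece + self.data[index:])

    def to_str(self) -> str:
        return self.data  # specialised: no per-character calls


class Twine(Text):
    """A list of segments: cheap insertion, indexing walks the segments."""

    def __init__(self, segments: List[str]) -> None:
        self.segments = [s for s in segments if s]  # drop empty pieces

    def __len__(self) -> int:
        return sum(len(s) for s in self.segments)

    def char_at(self, index: int) -> str:
        if index < 0:
            raise IndexError(index)
        for s in self.segments:
            if index < len(s):
                return s[index]
            index -= len(s)
        raise IndexError(index)

    def insert(self, index: int, piece: str) -> "Twine":
        if not 0 <= index <= len(self):
            raise IndexError(index)
        out: List[str] = []
        placed = False
        for s in self.segments:
            if not placed and index <= len(s):
                # Only the segment containing the insertion point is split;
                # the other segments are reused as they are.
                out += [s[:index], piece, s[index:]]
                placed = True
            else:
                out.append(s)
                if not placed:
                    index -= len(s)
        if not placed:  # only reachable for an empty twine
            out.append(piece)
        return Twine(out)


def count_char(text: Text, ch: str) -> int:
    """Written against the interface, so it works with either representation."""
    return sum(1 for i in range(len(text)) if text.char_at(i) == ch)
```

Code such as `count_char` never learns which representation it was given, so a caller can pick `Twine` for edit-heavy work and `Contiguous` for read-heavy work without rewriting it.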

                                                                                                        Objective-C did this reasonably well. Strings implement a small set of primitive methods and can implement more efficient specialised versions. The UText interface in ICU is very similar to the Objective-C model, with one important performance improvement. When iterating over characters (actually, UTF-16 code units), implementations of UText have a choice of providing direct access to an internal buffer or to a temporary one. With a twine-like implementation, you can just update the pointer and length in the UText to point to the current segment, whereas with NSString you need to copy the characters to a caller-provided buffer.

                                                                                                      1. 1

                                                                                                        What is a layer?

                                                                                                        1. 3

                                                                                                          On many keyboards there is a mechanism such that some or all of the keys can change behavior based on which “layer” is active. The layer can be changed by holding down a key or pressing a key. In this way, “shift” is somewhat like a layer change key—it swaps from the “lower-case” layer to the “upper-case” layer.

                                                                                                          On very small keyboards, layers are required to generate numbers and F-keys because there aren’t enough physical buttons to represent them.
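As a rough illustration of the layer idea described above, here is a small Python sketch. The key names (`K1`, `FN`) and the two keymaps are invented for the example and do not come from any real firmware.

```python
from typing import Dict, Set

# Hypothetical keymaps for a tiny three-key board.
BASE_LAYER: Dict[str, str] = {"K1": "q", "K2": "w", "K3": "e"}
NUMBER_LAYER: Dict[str, str] = {"K1": "1", "K2": "2", "K3": "3"}

# Hypothetical name of the key that switches layers while it is held.
LAYER_KEY = "FN"


def resolve(key: str, held: Set[str]) -> str:
    """Return what a physical key produces, given the keys currently held."""
    layer = NUMBER_LAYER if LAYER_KEY in held else BASE_LAYER
    return layer[key]
```

Here `resolve("K1", set())` gives `"q"`, while `resolve("K1", {"FN"})` gives `"1"`: the same physical key, read through a different layer.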

                                                                                                          1. 12

                                                                                                            Make is definitely my favourite of the usual unix tools, and one I’d recommend all beginners learn. I’m betting most people think it’s only a build system for C, and even when it’s used, sometimes it’s only as a task runner. It works for any command that takes files as input and produces a file as output, and you get incremental and (if you define the tasks right) parallelized builds for free! And if you don’t like make itself, there’s a ton of language-specific clones like Rake or Jake that support most of its features. I just wish it could deal better with tasks that produce multiple files; not even the clones usually support that.

                                                                                                            I’m using Jake on a project where I have to parse HTML and JS files and do some compilin’ and interpretin’; it’s important that I have very clear data provenance from the source files because it’s copyrighted material and I want to be able to distribute the project without including any of the incriminating data, so having a tool like make to keep track of all the inputs and outputs is very useful. JS also has some great libraries for parsing both HTML and JS, and it would be really awkward to wrap the functions that do these transformations as command line programs, and with Jake I can just call the functions in the tasks.

                                                                                                            1. 2

                                                                                                              As a fellow lover of Make, I’ve been getting some good use out of remake lately and thought you might like it if you hadn’t seen it.

                                                                                                              1. 2

                                                                                                                I just wish it could deal better with tasks that produce multiple files, not even the clones usually support that.

                                                                                                                In make 4.3:

                                                                                                                • New feature: Grouped explicit targets

                                                                                                                Pattern rules have always had the ability to generate multiple targets with a single invocation of the recipe. It’s now possible to declare that an explicit rule generates multiple targets with a single invocation. To use this, replace the “:” token with “&:” in the rule. To detect this feature search for ‘grouped-target’ in the .FEATURES special variable. Implementation contributed by Kaz Kylheku kaz@kylheku.com

                                                                                                                1. 1

                                                                                                                  I agree make is great! I use it to run my backup. (There are different portions of my hard disk that need to be backed up in a certain order.)

                                                                                                                1. 2

                                                                                                                  Very nice writeup, especially the trick with NoReturn!

                                                                                                                  I also find type narrowing extremely useful for error handling: you can use Union[Exception, T] as the result type, and ‘pattern match’ with isinstance, which works at runtime and is checkable by mypy. In addition, the covariant Union type lets you use it with generators and consume the results without extra boilerplate. I write more about it here
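A minimal sketch of the pattern described above, under stated assumptions: `parse_int`, `parse_all` and `total` are names invented for this example and are not taken from the linked article.

```python
from typing import Iterator, List, Union


def parse_int(text: str) -> Union[Exception, int]:
    """Return the parsed integer, or return the exception instead of raising it."""
    try:
        return int(text)
    except ValueError as e:
        return e


def parse_all(items: List[str]) -> Iterator[Union[Exception, int]]:
    """Yield one result per input; a bad item does not stop the stream."""
    for item in items:
        yield parse_int(item)


def total(items: List[str]) -> int:
    """Sum the values that parsed, skipping the errors."""
    result = 0
    for r in parse_all(items):
        if isinstance(r, Exception):
            # In this branch a type checker narrows r to Exception.
            continue
        # Here r is narrowed to int, so the addition type-checks.
        result += r
    return result
```

Because the error is an ordinary value, the generator keeps going after a bad item and the consumer decides what to do with each result.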

                                                                                                                  1. 2

                                                                                                                    So go(lang) in Python? 😉

                                                                                                                    1. 4

                                                                                                                      go returns a Tuple[Exception, T]

                                                                                                                      1. 2

                                                                                                                        Yep. I’m describing the Go approach here

                                                                                                                        1. 1

                                                                                                                          Thanks for the article karlicoss. We were just discussing this pattern the other day and we ended up reaching similar conclusions. The only difference in our case was that we wanted to return T along with the exception. This makes the “pattern” of the return type a bit more messy. We need to keep exploring.

                                                                                                                    2. 1

                                                                                                                      You mean returning an Exception instead of raising it?

                                                                                                                      1. 1

                                                                                                                        ah yes, sorry! Too late to edit now :(

                                                                                                                    1. 21

                                                                                                                      My 2 cents. The other day, I wanted to test a new feature of Hugo that has not been released:

                                                                                                                      https://github.com/gohugoio/hugo/pull/6771

                                                                                                                      Hugo doesn’t have a CI, so the only option was building it myself. This might not sound like a big deal, but I am on Windows. On Windows, you typically have problems building projects, as the developers often don’t even test on Windows. This leads to compile and dependency errors.

                                                                                                                      Hugo is a big project, the Zip file is 12 MB, so I was pretty sure I would have some trouble with this. But I didn’t. I just followed the instructions:

                                                                                                                      git clone git://github.com/gohugoio/hugo
                                                                                                                      cd hugo
                                                                                                                      go install
                                                                                                                      

                                                                                                                      and it took a while, but not a single error. I have built many projects with C, C++, C#, D, Nim and others over the years, and this is the first time I have had this experience with this large of a project. The closest to this I think is FFmpeg, but even with that you needed to install dependencies else something would be missing from the output or simply fail. I had a similar experience with Rust, where I needed a new build:

                                                                                                                      https://github.com/getzola/zola/issues/893

                                                                                                                      except Rust just failed spectacularly, because it seems one of the dependency crates uses C on the backend, and has poor or no support for Windows:

                                                                                                                      https://github.com/compass-rs/sass-rs/issues/63

                                                                                                                      Go lets me get work done. I can focus on the code, where with other languages, I find myself getting distracted with the tooling or build process.

                                                                                                                      1. 15

                                                                                                                        Every time I write code in a language other than Go, I’m startled by how weak the tooling is. After writing all Go for about two years at a job, I switched to a place that has almost exclusively Python projects. To say that I miss go fmt is the king of all understatements. Also, after Go, I don’t see why every programming language doesn’t just come up with a native way of running tests à la go test. It’s weird to have a series of codebases where the command to execute their test suites is only the same because it’s placed behind a standardized make target.

                                                                                                                        1. 16

                                                                                                                          Maybe it’s just the ecosystems I play in, but it appears that language tooling is converging on this pattern though.

                                                                                                                          Rust:

                                                                                                                          • cargo fmt
                                                                                                                          • cargo test

                                                                                                                          DotNet:

                                                                                                                          • dotnet format
                                                                                                                          • dotnet test

                                                                                                                          Elixir:

                                                                                                                          • mix format
                                                                                                                          • mix test
                                                                                                                          1. 11

                                                                                                                            Zig, too:

                                                                                                                            • zig fmt
                                                                                                                            • zig [build] test
                                                                                                                            1. 2

                                                                                                                              That’s awesome, thank you for your response! I know about prettier for Javascript (and to their credit, it really makes absolutely no sense to have a formalized code formatter for a language with no official interpreter). I also knew about cargo (slipped my mind), but am firmly outside the .NET/Elixir ecosystems, so glad to see that happening.

                                                                                                                              1. 7

                                                                                                                                There’s also pyfmt for python, and pytest. I think pytest came before go, though.

                                                                                                                            2. 5

                                                                                                                              Black (https://github.com/psf/black) is the Python equivalent of go fmt.

                                                                                                                          1. 18

                                                                                                                            It’s nice to see someone else abusing Python for the sake of fun. I’ve blogged in the past about many hacks, including: let, attempting to make call/cc, worlds, pattern matching with with, and dispatching with with. Basically, yes, yes, yes, more of this kind of stuff! It makes languages really fun.

                                                                                                                            1. 6

                                                                                                                              I’ll join this party :) I figured out how to make Rust-like macros in Python by stuffing things in type-annotations, which you can read here.

                                                                                                                              1. 1

                                                                                                                                I don’t write much Python these days, but as a Schemer interested in macros, I used to try out all sorts of stuff. There was one really good attempt: MetaPython which hasn’t had a release since 2009. I think that was the one I felt worked best, so if you’re still interested, you might play with it.

                                                                                                                                Also, this is awesome! And, I don’t know much rust, but I did not realize that what I would call “pragmas” are powered by macros (in hindsight this makes total sense!), making them accessible for all sorts of hackery and wizardry. Thanks for sharing!

                                                                                                                                1. 2

                                                                                                                                  … but as a Schemer interested in macros

                                                                                                                                  Do you know about Hy?

                                                                                                                                  1. 1

                                                                                                                                    I do! A long time ago I had a similar project called Ruse, which aimed to be a compliant Scheme on top of Python 2, which fizzled before I prepped it for release. I’m happy that someone else, independently, thought the idea of a Lisp targeting Python was good. :)

                                                                                                                              2. 3

                                                                                                                                It sounds like you’d enjoy my (now quite old) blog post about abusing encodings in Python: http://benjiyork.com/blog/2008/02/programmable-python-syntax-via-source.html

                                                                                                                                1. 3

                                                                                                                                  Thanks, that is amazing.

                                                                                                                                  Very relatedly, I blogged about source files that use the built-in rot13 codec via a starting comment #encoding: rot13 and then have all of their source code encoded: https://frederik-braun.com/rot13-encoding-in-python.html

                                                                                                                                2. 1

                                                                                                                                  I think you might enjoy the complete works of Oleg Kiselyov, full of mind-bending trickery in Scheme, ML and others. I sometimes wish I could just set aside a year and thoroughly study and understand what Oleg is publishing.

                                                                                                                                  1. 2

                                                                                                                                    I sometimes wish I could just set aside a year and thoroughly study and understand what Oleg is publishing.

                                                                                                                                    I sometimes ask the question, legitimately, “What would Oleg do?” – Yes, aware. But similar to you, completely understudied due to time.

                                                                                                                                1. 2

                                                                                                                                  Can someone confirm that except Exception: is bad practice for the use cases given in the article?

                                                                                                                                  1. 2

                                                                                                                                    It’s fine if you really need what that construct does.

                                                                                                                                    On the other hand, if you’re just being lazy and you could find out the exact exception(s) that you want to handle, you should do that instead.
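To illustrate the distinction drawn in this reply, here is a small sketch. `load_config` and `run_plugin` are hypothetical helpers, not code from the article under discussion.

```python
import json
from typing import Any, Callable, Optional


def load_config(text: str) -> Optional[Any]:
    """Catch only the failure we expect; anything else still propagates."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def run_plugin(callback: Callable[[], object]) -> Optional[Exception]:
    """A boundary where any failure must be contained, so the broad catch is deliberate."""
    try:
        callback()
    except Exception as e:
        return e
    return None
```

In `load_config` the exact exception is known, so naming it keeps unrelated bugs visible. In `run_plugin` the point is to survive whatever the callback does, which is the case where `except Exception:` does what is actually needed.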

                                                                                                                                  1. 1

                                                                                                                                    s/invreased/increased/

                                                                                                                                    1. 2

                                                                                                                                      Thank you!