They suggest making your README call out your contempt for the GitHub status quo—and mine already has for a long time now. Things this list misses:
cgit and GNU Savannah as web GUI and Git forge alternatives
explaining that the education system should stop making students sign up for and use proprietary services like this unless it’s the last option. This has long been an issue with institutions providing training for Adobe products or Microsoft Office, but a GitHub account is ultimately owned by Microsoft, and GitHub is still a closed-source, proprietary platform; future generations would be better off learning the fundamentals on a generic or FOSS option.
not having your programming community’s package management or identity tied exclusively to GitHub. For instance, in Elm you cannot upload a package to any other platform and you also can’t download packages from anywhere else.
making sure GitHub doesn’t get special treatment as a plugin default. For instance, npm’s package.json has short codes for GitLab and Bitbucket, but if you just pass in user/repo it assumes GitHub—contrast with Nix + nixpkgs, which only allows a short code like sourcehut:~user/repo for several common forges or the full URL (no special, shorter treatment for GitHub: just github:user/repo). Similarly, basically all Vim plugin managers have a short option for GitHub only, plus a full URL, which is an implicit endorsement of GitHub. As a result you get downstream tools like https://vimcolorschemes.com/ that are built with only GitHub in mind, because that’s where the plugins are. It seems subtle/convenient, but it matters in shaping ecosystems.
Heck, maybe it’s time to look outside of Git too: if you wanted to try an alternative VCS, GitHub won’t host it–it’s in the name.
While not technically Git, I agree that for most people and companies out there it would likely be the better solution. Especially for people who barely understand the difference between GitHub and Git, or what merge, pull request, cherry-pick, etc. mean, and who are scared of pulling because they’ll spend the rest of the day on Stack Overflow just because someone else committed code to the project.
This is an interesting issue, and I think there’s way more to think about than what’s mentioned in the article.
If we’re purely talking about Copilot, I’m not really a fan, but that’s a whole separate issue. It’s simply hard to abandon GitHub because viable alternatives are few and far between and every single one comes with its own set of trade-offs.
GitHub has a critical mass of open source projects. Being able to use a single account to contribute to a large majority of open source projects is a nice thing. Encouraging people to contribute is a good thing. If I want to contribute to something hosted on someone’s custom Gitea instance (as an example), I’ll think very hard if it’s worth it before making an account.
Tooling. So much tooling is only really possible because they primarily focus on a single platform (GitHub).
Features. Issues, pull requests, project management, releases, etc. Some of these are done by other platforms as well, but few support the wide variety of features that GitHub does.
I’m not a fan of the alternatives.
Gitea is my general recommendation for self-hosting - it’s fairly minimal, has great community support, and has enough integrations for most people/companies. Unfortunately, it requires a new account for every developer who wants to contribute to a project on that specific gitea instance.
Gitlab is probably the closest in terms of features, but it’s a very large piece of software requiring a comparatively large machine and some very specific setup (at least if you self host).
Bitbucket is decent, but very Atlassian-focused and definitely more interested in the business market than open source.
SourceHut is a solid piece of software, but it uses a workflow many developers are not familiar with and I’m uncomfortable using it because of how vitriolic the owner has been in the past (both on his blog and when I’ve tried to contribute to his software).
Gitea is actively working on ActivityPub federation which should make it much easier for people to contribute to random projects they see without having to specifically make an account on a new server.
And email works just fine for decentralization on projects not looking for other social features like stars—though the workflow is alien to those who have only experienced the merge request flow.
Any command line mail client can be used to pipe those to your difftool of choice; building it as a specific feature doesn’t really make sense because command line tools expect you to use pipes to compose features.
Probably not directly in the email client? But as an Emacs user, I feel like it would be pretty trivial to go from mu4e or Gnus to ediff-merge-buffers. If you’re using some graphical diff tool, I can’t imagine that it would be hard to get a nice 3-way diff using a patch file.
Sourcehut is working on being one. I bet someone (me, eventually, at least) will make a bridge to the ActivityPub stuff too, so that the forges that speak ActivityPub but not email can still be used over email.
Going from “you can do this decentralized” to “well, to get the features you really want there, you’ll need this centralized service’s implementation” in the space of like two comments is impressive.
The fact that people can self-host sourcehut does not make real-world typical use of sourcehut be decentralized. If — and this is a big if — sourcehut were to take off in popularity, for example, it would not be sourcehut the self-host-able piece of software taking off, it would be a particular instance of sourcehut taking off, and thus becoming the centralized bottleneck all over again.
And it will gain maybe a few extra users as a result, but never achieve critical mass the way centralized services have, because there are no technological solutions to the social forces that drive centralization.
While it would be great to make dev work easier, it unfortunately doesn’t solve search/discovery. GitHub is still a great place to search for open source projects. I don’t know if ActivityPub can address that part, but I bet someone will tackle this next.
Honestly, I don’t really have many problems with GitHub. It works decently, and if it goes to hell, I can just push somewhere else and deal with the fallout later. Actually finding projects/code is useful with code search (ignoring ML sludge), and I really don’t see how people can get addicted to the whole stars thing. Besides, if it’s public, something like Copilot will snarf it anyways.
I was a long-time holdout from GitHub. I pushed every project I was contributing to and every company that I worked for to avoid it because I don’t like centralised systems that put control in a single location. I eventually gave up for two reasons:
It’s fairly easy to migrate from GitHub if you ever actually want to. Git is intrinsically decentralised. GitHub Pages and even GitHub wikis are stored in git and so can just be cloned and taken elsewhere (if you’re sensible, you’ll have a cron job to do this to another machine for contingency planning). Even GitHub Issues are exposed via an API in machine-readable format, so you can take all of this away as well. I’d love to see folks that are concerned about GitHub provide tooling that lets me keep a backup of everything associated with GitHub in a format that’s easy to import into other systems (a rough sketch of the idea follows after these two points). A lot of my concerns about GitHub are hypothetical: in general, centralised power structures and systems with strong network effects end up being abused. Making it easy to move mitigates a lot of this, without requiring you to actually move.
The projects I put on GitHub got a lot more contributions than the ones hosted elsewhere. These ranged from useless bug reports, through engaged bug reports with useful test cases, up to folks actively contributing significant new features. I think the Free Software movement often shoots itself in the foot by refusing to compromise. If your goal is to increase the amount of Free Software in the world, then the highest impact way of doing that is to make it easy for anyone to contribute to Free Software. In the short term, that may mean meeting them where they are, on proprietary operating systems or other platforms. The FSF used to understand this: the entire GNU project began providing a userland that ran on proprietary kernels and gradually replaced everything. No one wants to throw everything away and move to an unfinished Free Software platform, but if you can gradually increase the proportion of Free Software that they use then there comes a point where it’s easy for them to discard the last few proprietary bits. If you insist on ideological purity then they just give up and stay in a mostly or fully proprietary ecosystem.
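On the backup-tooling point above, a minimal sketch of the idea in Python, using GitHub’s public REST issues endpoint (the owner/repo names and output file are placeholders; a real tool would also pull comments, labels, and releases, and follow the Link header for pagination):

    #!/usr/bin/env python3
    # Rough sketch: dump a repo's issues (the endpoint returns PRs too)
    # to local JSON for contingency planning. GITHUB_TOKEN is optional
    # for public repos but avoids aggressive rate limiting.
    import json
    import os
    import urllib.request

    OWNER, REPO = "someuser", "somerepo"  # placeholder names

    def fetch(url):
        req = urllib.request.Request(url)
        token = os.environ.get("GITHUB_TOKEN")
        if token:
            req.add_header("Authorization", f"token {token}")
        req.add_header("Accept", "application/vnd.github+json")
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    issues, page = [], 1
    while True:
        batch = fetch(f"https://api.github.com/repos/{OWNER}/{REPO}/issues"
                      f"?state=all&per_page=100&page={page}")
        if not batch:
            break
        issues.extend(batch)
        page += 1

    with open(f"{REPO}-issues-backup.json", "w") as f:
        json.dump(issues, f, indent=2)
    print(f"saved {len(issues)} issues/PRs")

Run from a cron job, that JSON is enough to rebuild an issue archive on another system later.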
Even if it’s possible, even easy, to copy your content from Github when they cross some threshold you’re no longer ok with, there will be very little to copy to unless we somehow sustain development of alternatives during the time it takes to reach that threshold.
IMHO it would be better if the default was at least “one of the three most popular” rather than “Github, because that’s what everyone uses”.
If you use their issue tracker, pull requests and so on, those will be lost too. They aren’t easily pushable to another git host. Such things can tell a lot about a project and the process of it getting there, so it would be sad if that was lost.
Agreed on everything but Copilot. The freedom to study how the software works is a fundamental attribute of free software. Learning is not covered by the GPL’s requirements. Copilot sometimes copypastes code (honestly - who doesn’t) but broadly it learns. This is entirely in keeping with open source.
If we’re gonna set a standard that you can’t use things you learnt from software under an open-source license when writing commercial software, we might as well shutter either the entire software industry or the entire open-source movement, because literally everybody does that. It’s how brains work!
And of course, it’s not like being off Github is gonna prevent MS from feeding your project into Copilot 2.
Like all of these neural network “AIs”, it’s just a pattern recognition system that launders the work of many humans into a new form, which the corporation can profit from but the humans cannot. It’s piracy for entities rich enough to train and operate such an AI, and unethical enough to acquire the training data, but you or I would still be punished for pirating from the corporation. Whether or not it is legal is irrelevant to me (I’m in favor of abolishing copyright), but we must recognize the increasing power imbalance between individuals and corporations such “AI” represents.
Copilot understands nothing of what it writes. It learns nothing and knows nothing. It is not sentient or alive, no matter how tempting it is to anthropomorphize it.
I think “pattern recognition system that launders the work of many humans into a new form” is just a rude way to phrase “learning.”
Define “understands.” Define “knows.” I think transformers derive tiered abstract patterns from input that they can generalize and apply to new situations. That’s what learning is to me.
The standard philosophical definition of knowledge is a justified true belief. Copilot and other AIs make the belief part problematic, so bracket that. But they don’t justify things well at all. Justification is a social process of showing why something is true. The AIs sound like total bullshit artists when asked to justify anything. I don’t think Copilot “knows” things anymore than a dictionary does yet.
Putting aside Gettier cases, that’s not what I understand “justified” to mean. You just need to have a reason for holding the knowledge. With AI, reinforcement learning is the justification.
The point of “justified belief” is just that it’s not knowledge if you just guess that it’s raining outside, even if it is in fact raining.
The definition that @carlmjohnson is quoting is Plato’s and ever since Plato put it forth, knowledge theorists have been bickering about what “justified” means. The history of ideas after the age of Boethius or so isn’t quite my strong point so I’ll leave that part to someone else but FWIW most classical definitions of justification either don’t readily apply to reinforced learning, or if they do, it fails them quite badly.
That being said, if you want to go forth with that definition, it’s very hard to frame a statistical model’s output as belief in the first place, whether justified or not. Even for the simplest kinds of statistical models (classification problems with binary output – yes/no) it’s not at all clear how to formulate what belief the model possesses. For example, it’s trivial to train a model to recognize if a given text is an Ancient Greek play or not. But when you feed it a piece of text, the question that the model is “pondering” isn’t “Is this an Ancient Greek play?”, but “Should I say yes?”, just like any other classification model. If subjected to the right laws and statements, a model that predicts whether a statement would cause someone to be held in contempt of the court might also end up telling you if a given text is an Ancient Greek play with reasonable accuracy, too. “Is this an Ancient Greek play?” and “Is this statement in contempt of the court?” are not equivalent statements, but the model will happily appear to make both with considerable accuracy.
The model is making an inference about the content (“This content is of the kind I say yes to/the kind I say no to”), but because the two kinds cannot be associated with a distinct piece of information about the subject being fed to the model, I don’t think it can be said to constitute a belief. It’s not a statement that something is the case, because it’s not clear what it asserts to be the case: there are infinitely many different classification problems that a model might turn out to solve satisfactorily.
In Greek, “justified” was some variation on “logos”: an account. Obviously everyone and their Buridan’s ass has a pet theory of justification, but I think it’s fair to interpret Plato’s mooted definition (it’s rejected in the dialogue IIRC!) as being “the ability to give an account of why the belief is true”. This is the ability which Socrates finds that everyone lacks, and why he says he knows that he knows nothing.
Ugh, it’s really tricky. This comes up in two dialogs: Theaetetus, where knowledge gets defined as “true judgement with an account” (which IIRC is the logos part) and it’s plainly rejected in the end. The other one is Meno, where it’s discussed in the terms of the difference between true belief and knowledge, but the matter is not definitively resolved.
I was definitely wrong to say it was Plato’s – I think I edited my comment which initially said “is effectively Plato’s” because I thought it was too wordy but I was 100% wrong to do it, as Plato doesn’t actually use this formulation anywhere (although his position, or rather a position that can be inferred from the dialogues, is frequently summarized in these terms). (Edit: FWIW this is a super frequent problem with modern people talking about ancient sources and one of the ways you can probably tell I’m an amateur :P)
I think it’s fair to interpret Plato’s mooted definition (it’s rejected in the dialogue IIRC!) as being “the ability to give an account of why the belief is true”.
You may know of this already but just in case your familiarity with modern philosophy is as spotty as mine, only with holes in different places: if you’re super curious and patient, you’re going to find Gettier’s “Is Justified True Belief Knowledge?” truly fascinating. It’s a landmark paper that formalizes a whole lot of objections to this, some of them formulated as early as the 15th century or so.
The counter-examples Gettier comes up with are better from a formal standpoint but Russell famously formulated one that’s really straightforward.
Suppose I’m looking at a clock which shows it’s two o’clock, so I believe it’s two o’clock. It really is two o’clock – it appears that I possess a belief that is both justified (I just looked at the clock!) and true (it really is two o’clock). I can make a bunch of deductions that are going to be true, too: for example, if I were to think that thirty minutes from now it’s going to be half past two, I’d be right. But – though I haven’t realized it – that clock has in fact stopped working since yesterday at two. (Bear with me, we’re talking about clocks from Russell’s age). My belief is justified, and it’s true, but only by accident: what I have is not knowledge, but sheer luck – I could’ve looked at the clock at half past two and held the same justified belief, but it would’ve been false. This suggests that, besides the inherent truth and justification of a statement, an external factor may also be involved in whether a belief is true or not, justified or not, and, thus, knowledge or not.
The counter-examples Gettier comes up with are better from a formal standpoint but Russell famously formulated one that’s really straightforward.
I love collecting “realistic” Gettier problems:
You’re asked a question and presented with multiple-choice answers. You can rule out 3 of the answers by metagaming (one is two orders of magnitude different from the others, etc.)
I give you 100 reasons why I believe X. You examine the first 30 of them and they’re all completely nonsensical. In fact, every argument but #41 is garbage. Argument #41 is irrefutable.
I believe “Poets commit suicide more often than the general population”, because several places say they commit suicide at 30x the rate. This claim turns out to be bunk, and a later investigation finds it’s more like 1.1x.
I encounter a bug and know, from all my past experience with bugs like it, that it’s probably reason X. I have not actually looked at the code, or even know what language it’s programmed in, and it’s one notable for not having X-type bugs. The developers were doing something extremely weird that subverted that guarantee, though, and it is in fact X.
I find an empirical study convincingly showing X. The data turns out to have been completely faked. This is discovered by an unrelated team of experts who then publish an empirical study convincingly showing X’, which is an even stronger claim than X.
My favourite ones come from debugging, that’s actually what got me into this in the first place (along with my Microwave Systems prof stubbornly insisting that you should know these things, even if engineers frown upon it, but that’s a whole other story):
While debugging an Ethernet adapter’s driver, I am pinging another machine and watching the RX packet count of an interface go up, so I believe packets are being received on that interface, and the number of packets received on my machine matches the number of packets that the other machine is sending to it. Packets are indeed being received on the interface. I made a stupid copy-paste error in the code: I’m reading from the TX count register and reporting that as the RX count. It only shows the correct value because sending a ping packet generates a single response packet, so the two counts happen to match.
An RTOS’ task overflows its stack (this was proprietary, it’s complicated) and bumps into another task’s stack, corrupting it. I infer the system crashes because of the stack corruption. Indeed, I can see task A bumping into task B’s stack, then A yields to B, and B eventually jumps at whatever garbage is on the stack, thus indeed crashing the system. There’s actually a bug in the process manager which causes the task table to become corrupted: A does overflow its stack, but B’s stack is not located where A is overflowing. When A yields to B, the context is incorrectly restored, and B looks for its stack someplace else than where it actually is, loading the stack pointer with an incorrect value. It just so happens that, because B is usually started before A, the bug is usually triggered by B yielding to A, but A just sits in a loop and toggles a couple of flags, so it’s never doing anything with the stack and never crashes, even though its stack does eventually get corrupted, too.
I got a few other ones but it’s really late here and I’m not sure I’m quite coherent by now :-D.
I’m familiar with Gettier cases. I never dove very deep into the literature. It always struck me that a justification is not just a verbal formulation but needs some causal connection to the fact of the matter: a working clock causes my reasoning to be correct but a stopped clock has no causal power etc. I’m sure someone has already worked out something like this and brought out the objections etc etc but it seems like a prima facie fix to me.
Yes, IMO the belief is “how may this text continue?” However, efficiently answering this question requires implicit background knowledge. In a similar sense, our brains may be said to only have information about “what perpetuates our existence” or “what makes us feel good.” At most we can be said to have knowledge of the electric potentials applied to our nerves, as Plato also made hay of. However, as with language models, a model of the unseen world arises as a side effect of the compression of sensory data.
Actually, language models are fascinating to me because they’re a second-order learner. Their model is entirely based on hearsay; GPT-3 is a pure gossip. My hope for the singularity is that language models will be feasible to make safe because they’ll unavoidably pick up the human ontology by imitation.
Yes, IMO the belief is “how may this text continue?”
That’s a question, not a belief – I assume you meant “This text may continue in such-and-such a way”. This has the same problem: that’s a belief that you are projecting onto the model, not necessarily one that the model formulates. Reasoning by analogy is an attractive shortcut but it’s an uneasy one – we got gravity wrong because of it for almost two thousand years. Lots of things “may be said” about our brains, but not all of them are true, and not all of them apply to language models.
Sure, but by that metric everything that anyone has ever said is a belief that person is projecting. I think that language models match the pattern of having a belief, as I understand it.
a belief that you are projecting onto the model, not necessarily one that the model formulates
You’re mixing up meta-levels here: I believe that the model believes things. I’m not saying that we should believe that the model believes things because the model believes that; rather, (from my perspective) we should believe it because it’s true.
In other words, if I model the learning process of a language model, the model in my head of the process fits the categories of “belief” and “learning”.
I think that language models match the pattern of having a belief, as I understand it.
Observing that a model follows the pattern of a behaviour is not the same as observing that behaviour though. For example, Jupiter’s motion matches the pattern of orbiting a fixed Earth on an epicycle, but both are in fact orbiting the Sun.
FWIW, this is an even weaker assumption than I am making above – it’s not that no statements are made and that we only observe something akin to statements being made. I’m specifically arguing that the statements that the model appears to make (whether “it” makes them or not) are not particular enough to discriminate any information that the model holds about the world outside of itself and, thus, do not qualify as beliefs.
If the world had a different state, the model would have different beliefs - because the dataset would contain different content.
Also, Jupiter is in fact orbiting a fixed Earth on an epicycle. There is nothing that inherently makes that view less true than the orbiting-the-sun view. But I don’t see how that relates at all.
The problem is that reinforcement learning pushes the model toward reproducing the data distribution it was trained on. It’s completely orthogonal to truth about reality, in exactly the same way as guessing the state of the weather without evidence.
The data for language models in general is sampled from strings collected from websites, which includes true statements but also fiction, conspiracy theories, poetry, and just language in general. “Do you really think people would get on the Internet and tell lies” is one of the oldest jokes around for a reason.
You can ask GPT-3 what the weather is outside, and it’ll give you an answer that is structured like a real answer would be, but has no relation to the actual weather outside your location or whatever data centers collectively host the darned thing. It looks like a valid answer, but there’s no reason to believe it is one, and it’s dangerous to infer that anything like training on photo/weather pairs is happening when nobody built that into the actual model at hand.
Copilot in particular is no better - it’s more focused on code specifically, but the fact that someone wrote code does not mean that code is a correct or good solution. All Copilot can say is that it’s structured in a way that resembles other structures it’s seen before. That’s not knowledge of the underlying semantics. It’s useful and it’s an impressive technical achievement - but it’s not knowledge. Any knowledge involved is something the reader brings to the table, not the machine.
Oh I’ll readily agree that Copilot probably generates “typical code” rather than “correct code.” Though if it’s like GPT-3, you might be able to prompt it to write correct code. That might be another interesting avenue for study.
“However, this code has a bug! If you look at line”…
I’ve experimented with this a bit and found it quite pronounced - if you feed copilot code written in an awkward style (comments like “set x to 1”, badly named variables) you will get code that reflects that style.
IMHO it’s perilous and not quite fair to think what a machine should be allowed to do and not to do by semantic convention. “Machine learning” was one uninspired grant writer away from going down into history as, say, “statistically-driven autonomous process inference and replication”, and we likely wouldn’t have had this discussion because anything that replicates code is radioactive for legal teams.
Copilot is basically Uber for copy-pasting from Stack Overflow. It’s in a legally gray area because the legal status of deriving works via statistical models is unclear, not because Microsoft managed to finally settle the question of what constitutes learning after all. And it’s probably on the more favourable side of gray shades because it’s a hot tech topic so it generates a lot of lobbying money for companies that can afford lawyers who can make sure it stays legally defensible until the next hot tech topic comes up.
Also, frankly, I think the question of whether what Copilot does constitutes learning or not is largely irrelevant, and that the question of whether Copilot-ing one’s code should be allowed is primarily rooted in entitlement. Github is Microsoft’s platform so, yes, obviously, they’re going to do whatever they can get away with on it, including things that may turn out to be illegal, or things that are illegal but will be deemed legal by a corrupt judge, or whatever. If you don’t want $evil_megacorp to do things with your code, why on Earth was your code anywhere near $evil_megacorp’s machines in the first place?
This cannot be a surprise to anyone who’s been in this field for more than a couple of years. Until a court rules otherwise, “fair” is whatever the people running a proprietary platform decide is fair. If anyone actually thought Github was about building a community and helping people do great things together or whatever their mission statement is these days, you guys, I have a bridge in Manhattan, I’m selling it super cheap, the view is amazing, it’s just what you need to take your mind off this Copilot kerfuffle, drop me a line if you wanna buy it.
(Much later edit: I know Microsoft is a hot topic in FOSS circles so just to be clear, lemme just say that I use Github and have zero problem with Copilot introducing the bugs that I wrote in other people’s programs :-D).
If machine learning was called “data replication”, it would be misnamed. And if it was called “pattern inference”, it would just be a synonym for learning… I wouldn’t care about Codex if I thought it was just a copypaste engine. I don’t think it is, though. Does it occasionally copypaste? Sure, but sometimes it doesn’t, and those are the interesting cases for me.
I don’t think this at all comes down to Github being Microsoft’s platform so much as Github being the biggest repo in one place.
I’m not at all defending Microsoft for the sake of Microsoft here, mind. I hate Microsoft and hope they die. I just think this attack does not hold water.
If machine learning was called “data replication”, it would be misnamed.
I beg to differ! Machine learning is a misnomer for statistically-driven autonomous process inference and replication, not the other way ’round!
I’m obviously kidding but what I want to illustrate is that you shouldn’t apply classical meaning to an extrapolated term. A firewall is neither a wall nor is it made of fire, and fire protection norms don’t apply to it. Similarly, just because it’s called machine learning doesn’t mean you should treat it as human learning and apply the same norms.
I realize that. I want to underline that, while machine learning may be superficially analogous to human learning, just like a firewall is superficially analogous to a wall made of fire, it does not mean that it should be treated the same as human learning in all regards.
I don’t think it should be treated the same as human learning in all regards either. I think it’s similar to human learning in some ways and dissimilar in others, and the similarities are enough to call it “learning”.
Do you think Microsoft would be okay with someone training an AI on the leaked Windows source code and using it to develop an operating system or a Windows emulator?
Oh by no means will I argue that Microsoft are not hypocritical. I think it’s morally valid though, and whether Microsoft reciprocates shouldn’t enter into it.
Bit of a niggle, but it depends on the jurisdiction, really. Believe it or not, there exist jurisdictions where the Berne Convention is not recognized and as such it is perfectly legal to read it.
Exactly this. I think anthropomorphising abstract math executed in silicon is a trap for our emotional and ethical “senses”. We cannot fall for it. Machines and algorithms aren’t humans, aren’t even alive in any sense of the word, and this must inform our attitudes.
Machines aren’t humans. That’s fine, but irrelevant.
Machines aren’t alive. Correct, but irrelevant.
If the rule doesn’t continue to make sense when we finally have general AI or meet sapient aliens, it’s not a good rule.
That said, we certainly don’t have any human-equivalent or gorilla-equivalent machine intelligences now. We only have fuzzy ideas about how meat brains think, and we only have fuzzy ideas about how transformers match input to output, but there’s no particular reason to consider them equivalent. Maybe in 5 or 10 or 50 years.
If the rule doesn’t continue to make sense when we finally have general AI or meet sapient aliens, it’s not a good rule.
Won’t happen. If it does happen, we all die very soon afterwards.
I think the rule is good. We could come up with a different rule: oxygen in the atmosphere is a good thing. If we reach general AI or meet sapient aliens, they might disagree. Does that mean the rule was bad all along? I feel similar about anthropomorphising machines. It’s not in our ecological interest to do so.
I’m on board; however, I would, at least personally, make an exception if the machine-learned tool and its data/neural net were free, libre, and open source too. Of course, the derivative work also needs to not violate the licenses.
Learning is not covered by the GPL’s requirements.
For most intents and purposes, licences legally cover it as “creation of derived works”, otherwise why would “clean room design” ever exist. Just take a peek at the decompiled sources, you’re only learning after all.
I think this depends on the level of abstraction. There’s a difference in abstraction between learning and copying - otherwise, clean room design would itself be a derivative work.
I don’t understand what you mean. Clean-room implementation requires not having looked at the source of the thing you’re re-implementing. If you read the source code of a piece of software to learn, then come up with an independent implementation yourself, you haven’t done a clean-room implementation.
Cleanroom requires having read documentation of the thing you are reimplementing. So some part of the sequence read -> document -> reimplement has to break the chain of derivation. At any rate, my contention is that training a neural network to learn a concept is not fundamentally different from getting a human to document a leaked source code. You’re going from literal code to abstract knowledge back down to literal code.
Would it really change your mind if OpenAI trained a second AI on the first AI in-between?
At any rate, my contention is that training a neural network to learn a concept is not fundamentally different from getting a human to document a leaked source code.
I think it’s quite different in the sense that someone reading a description of the code’s purpose may come up with an entirely different algorithm to do the same thing. This AI won’t be capable of that - it is only capable of producing derivations. Sure, it may mix and match from different sources, but that’s not exactly the same as coming up with a novel approach. For example, unless there’s something like it in the source you feed it, I doubt the “AI” would be able to come up with Carmack’s fast inverse square root.
In copyright law, we have usually distinguished between an interface and an implementation. The difference there is always gonna be fuzzy, because law usually is. But with AI approaches, there’s no step which distinguishes the interface from the implementation.
One problem here is the same sort of thing that came up in the Oracle/Google case — what do you do with things that have one “obvious” way to do them? If I’m the first person to write an implementation of one of those “obvious” functions in a given language, does my choice of license on that code then restrict everyone else who ever writes in that language?
And a lot (though of course not all) of the verbatim-copying examples that people have pointed out from Copilot have been on standard/obvious/boilerplate type code. It’s not clear to me that licensing ought to be able to restrict those sorts of things, though the law is murky enough that I could see well-paid attorneys arguing it either way.
I recently deleted all my GitHub repositories - I’d mostly migrated to GitLab, but the Copilot release really spurred me to complete the migration.
On my radar is a subsequent migration from GitLab to sourcehut. Not because I have any particular issue with GitLab - all of my interactions with them have been positive, and I like their product. But I like sourcehut more; their approach is more in line with my personal preferences.
Most complaints towards GitLab are around the open core model and how the VC funding could make a bad influence on the platform. So far it seems they are doing okay financially.
I’m only on github to make PRs. An important step for me was to consciously avoid the dopamine triggers by removing all my stars and follows, and making my profile page as boring and inconsequential as possible. I find it’s easier to ignore the social scoring by committing to not reciprocate.
An important step for me was to consciously avoid the dopamine triggers by removing all my stars and follows
I’m genuinely curious, what’s your reason for doing that? To me, those things are the most direct indicators possible that people give a shit about what you’re doing and about you personally. That’s kind of what it’s all about for me; having people use and care about things I create is my primary motivation for programming (outside of work, which I do primarily for money).
To me, those things are the most direct indicators possible that people give a shit about what you’re doing and about you personally.
Clicking a button doesn’t exactly signal “giving a shit” to me… it requires no effort. What signals (to me) that people give a shit is seeing patches come in, bug reports, random folks on IRC or whatever saying “hi” and thanking me, seeing distros picking up my stuff, and so on. Giving fake internet points doesn’t necessarily mean that anyone gives a shit, at best they’re probably bored or mildly interested enough to click a button and move on to the next shiny thing in their feed.
Exactly - a “star” can simply mean “hey this looks cool”. I’m sure a majority of people who star a project never even tried to use the project. It’s just ego inflation. More important is that people actually use your stuff, in places where it matters. If some project is technically cool but unusable, it could still acquire many many stars.
I’m genuinely curious, what’s your reason for doing that? To me, those things are the most direct indicators possible that people give a shit about what you’re doing and about you personally
My bar for that rests at the point that someone gives me their personal feedback on my work in a way that lets me know they have actually read, studied, or used it. That is giving a shit. Competing with the whole world and collecting a few imaginary stars, stickers, or points does not say anything about your work, unless you happen to be a marketeer.
I’d love for open source development to be decentralized and I host a semi-private Gitea instance on my server (most of the projects are public but registrations are disabled and only I have an account), but to be completely honest:
I don’t want to make an account on every single Gitea/GitLab instance out there whenever I want to contribute some code or open an issue on a project
I find GitHub’s code search extremely useful when I’m searching for examples on how to use a certain API or library
The accounts problem could be solved by ActivityPub federation, which as Hail_Spacecake has mentioned in another comment Gitea is working on, or maybe by just teaching people to use email and having a nice email/git UI as SourceHut is doing, but code search is a deal breaker for me. Are any of these projects working towards some sort of federated code search, or is there some other service that indexes code from forges other than GitHub?
Created a utility to export GitHub issues to mbox (https://github.com/abbbi/issue2mbox). Some people used it to import their projects’ issues from GitHub to sourcehut.
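The heart of such a conversion is small. A rough sketch of the idea in Python (not the linked tool), assuming issue JSON in the shape GitHub’s REST API returns and made-up file names:

    # Sketch: convert GitHub issue JSON into an mbox file that
    # mail-based trackers can import. Not the linked tool.
    import json
    import mailbox
    from email.message import EmailMessage

    with open("somerepo-issues-backup.json") as f:
        issues = json.load(f)

    box = mailbox.mbox("issues.mbox")
    for issue in issues:
        login = issue["user"]["login"]
        msg = EmailMessage()
        msg["From"] = f"{login} <{login}@users.noreply.github.com>"
        msg["Subject"] = f"[#{issue['number']}] {issue['title']}"
        # created_at is ISO 8601; a real tool would reformat to RFC 2822
        msg["Date"] = issue["created_at"]
        msg.set_content(issue.get("body") or "(no description)")
        box.add(msg)
    box.close()
    print(f"wrote {len(issues)} messages")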
On the topic of GitHub Copilot, I can generally recommend this article by Felix Reda (who was very involved in the last copyright reform in the EU, as a Member of the European Parliament for the Pirate Party).
The argument seems to be that the size of the things that are reproduced is too small to be covered by copyright. In general, copyright has a threshold for originality that must be met. To take an extreme example, a lot of folks use i as an induction variable in loops. I suspect that the first code to do this is still in copyright (copyright extensions mean that basically anything since the invention of the stored-program computer is still in copyright, unless someone forgot to file back when that was a requirement). I think it’s pretty clear that a line such as for (int i = 0; i < size; i++) { or even for (auto &i : collection) { would not be subject to copyright. The second one is more interesting because C++11 range-based for loop syntax was introduced after the rules around copyright were changed so that the act of creation, rather than an explicit filing with the copyright office, conferred copyright. It’s impossible to prove whether someone independently created that line or copied it from another project. That level of copying is fine.
I think it’s likely to require a court case to determine whether this is actually the case for Copilot. For example, one of the things that I saw it able to reproduce from one of the id game engines was of a complexity that would meet the requirements for patenting. This generally requires a higher bar than copyright. If it’s creating a complete algorithm then that may well meet the bar for originality.
Patents are also interesting in other ways. Some open-source projects come with separate patent licenses that grant you the rights to a patent for derived works of the code in the project. This puts you in a deeply uncomfortable position if Copilot injects the code into your project: if you argue that your code is a derived work, you must comply with the copyright license. If you argue that your code is not a derived work, you do not have a patent license.
I found the earlier arguments somewhat disingenuous. Yes, data mining is explicitly permitted, but a degenerate example of a data mining system would be one that indexes entire source files and reproduces them in their entirety based on a keyword. If that were seen as not propagating copyright then any indexing system with learned indexes provides a bulletproof end run around copyright. It would take only a small tweak to BitTorrent’s distributed hash table implementation to provide a mechanism for sharing copyrighted works that is completely legal. I doubt that this was the intent of the law and I doubt that any court would rule in that direction.
Some years ago, I proposed that BitTorrent could be defended in court because Fair Use / Fair Dealings permitted quoting extracts up to some defined limits. If each seed shared only a section of the original within those limits, then they’d all be defendable as quoting. If you collect a set of quotes and assemble a complete work, that’s fine. I discussed this with a lawyer and the counter argument was basically: this is why we have judges. Judges (current US Supreme Court Justices aside) are not idiots. They can spot when someone is trying to do an end-run around the law and they are deeply unsympathetic. Judges exist, in part, because statutes are not 100% unambiguous and it’s important to employ judgement.
Thanks for the link. A (somewhat) reasonable voice in all the unfounded mass hysteria. It remains to be mentioned, however, that Reda is not a lawyer but a politician, and did not study law. Even if Reda has participated in the European copyright reform, the statement is just an opinion. There are comments by IP lawyers available, e.g. here: https://fossa.com/blog/analyzing-legal-implications-github-copilot/ or https://www.youtube.com/watch?v=7HWIxLKrZ_w
While Felix is not a lawyer, he was a lawmaker behind those laws and helped create the mechanisms at hand, which often requires an understanding as thorough as a lawyer’s. In particular, it’s a good introduction to what the mechanisms at play there are.
Still, as with any legal debate, it lives through dissenting opinions, so huge thanks for adding yours :).
In 2020, the community discovered that GitHub has a for-profit software services contract with the USA Immigration and Customs Enforcement (ICE). Activists, including some GitHub employees, have been calling on GitHub for two years to cancel that contract. GitHub’s primary reply has been that their parent company, Microsoft, has sold Microsoft Word for years to ICE without any public complaints. They claim that this somehow justifies even more business with an agency whose policies are problematic. Regardless of your views on ICE and its behavior, GitHub’s ongoing dismissive and disingenuous responses to the activists who raised this important issue show that GitHub puts its profits above concerns from the community.
The fact that there is a reasonable chance that GitHub will eventually bow to pressure from internal and external activists to cease doing business with a client they dislike on partisan American political grounds (even if GitHub hasn’t yet done this) is also a good reason to avoid relying on their services. If activists, including people who work at GitHub, can apply political pressure to deplatform ICE, they can decide that you or your company is as problematic as ICE and apply political pressure against you. GitHub is in large part a centralized social media company, and if you become dependent on any centralized social media network, you become vulnerable to pressure from any activists who have the ear of the controller of that network to cut you off from that network.
Fortunately, contributing code via git is one of the forms of online social interaction least vulnerable to centralization pressure, and using a GitHub alternative or hosting your own is very easy to do and very close to as good of an experience as using GitHub - as long as you make sure to not rely on any strictly-non-git infrastructure GitHub provides. If you run your own wiki and bug tracker and CI system, you’re not vulnerable if GitHub suddenly decides to stop offering you those services.
Github have consistently refused to bow to activist pressure - but what if they did, and then activists targeted you, and github decided to bow to that pressure too!
This has got to be the strangest take I’ve read today - inventing a conjunction of multiple events, none of which are particularly likely, as a justification for self hosting (which hardly lacks for good arguments)
What’s the evidence that they won’t? GitHub’s existing leadership could have a genuine change of political persuasion, they might get replaced with people who are more friendly to the activist demands, perhaps something in the broader American political landscape will change that changes the calculus GitHub’s leadership is currently making. The important point is that GitHub could cut any of their users off at any time, and if any of their users are politically unpopular among the sorts of people who work at or influence GitHub, they’ll have no recourse. It would be better to not depend on GitHub to begin with, and reduce dependency on them if it exists.
It won’t be just American political pressure either. The Chinese have ways of bending many US concerns to their will, including Microsoft. Try designing software referencing Falun Gong, Tiananmen, or Xinjiang and see what happens.
It’s so weird that people develop closed-source software on github, despite being fully aware that MS owns it and can see everything they do in their “private” repos.
Or that FOSS projects develop on it despite knowing the platform is closed source and requires its users to have accounts and interact on a closed platform–and then, tangentially, only support community communications on Discord.
To me that’s more rational, because it’s easy to move to a different platform if github gets in the way of your OSS development. It’s a lot harder to take your secret developments back if MS snoops on them.
It’s not just about the code. You’re missing all communications and freedom/privacy aspects for your users. I read a quote that stuck with me.
Choosing proprietary tools and services for your free software project ultimately sends a message to downstream developers and users of your project that freedom of all users—developers included—is not a priority.
— Matt Lee
Back when I hosted projects on github I would always accept patches sent to me directly, but unfortunately this accommodating attitude is rare among github users.
A separate model trained on all the bugs in my repositories which can spit out a thousand new repositories (or one LARGE repository) with similar bugs running through it? If it is being automatically ingested, might as well automatically create it…
I have a different approach: I use GitHub as the social network it is.
But as with every system, after a while GitHub will probably not be a good platform for me anymore.
So I also self-host on Gitea (which I greatly recommend over Gitlab for personal use).
And when I push, I push to both github and gitea.
I don’t entirely trust Microsoft as a steward of FOSS projects. Gitlab was large enough that self-hosting it was a bit of a process. It is inconvenient having to create a new account for every project that you participate in.
Git itself is wonderful, and flexible, but the builtin hosting tools are slimmer than people have come to expect.
It would be nice if there was a decentralized option. I’m in love with Go’s packaging, how you can use HTTP meta tags to transparently point to source from various providers (self-hosted, github, gitlab) without any extra work by the package consumer. I’d love to have a distributed code social network built on top of something like that.
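For reference, the Go mechanism described here is the go-import meta tag: go get requests the import path over HTTPS with ?go-get=1 appended and follows the meta tag to the real repository. A minimal sketch of a “vanity import” server in Python (the domain and repo URL are placeholders):

    # Serves the go-import meta tag so that
    #   go get example.com/mylib
    # transparently fetches from wherever the source actually lives.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    META = (b"<html><head>"
            b'<meta name="go-import" '
            b'content="example.com/mylib git https://codeberg.org/someuser/mylib">'
            b"</head></html>")

    class VanityHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # go get appends ?go-get=1; serving the tag unconditionally is fine
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(META)

    HTTPServer(("", 8080), VanityHandler).serve_forever()

Switching hosts then only means changing the repo root in the content attribute; import paths stay stable.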
It’s stupid, and I feel like as a software developer I was supposed to like it, but I absolutely hate that github ditched support for passwords. I don’t create new repos often and I curse github every time I have to. The error message is terrible: it asks you for a username/password and then tells you to use token-based authentication; why ask for a username/password in the first place? (Typically you would use SSH instead, but you just have to know this is a possibility.)
I use a password manager and feel like password-based authentication is secure for me. I feel like github just made it so much harder for newbies to enter programming, and caused a big pain for existing users.
As much of an advocate of patch mails I am, for most people that just moves the problem - now the question becomes where do you store all those patch emails and make them easily searchable? Also, most mail archives don’t group long-running threads together in a way that spans multiple months or years. There’s also not an easy way to figure out which “issues” are still open, which are closed and who’s working on what, nor is there an easy way for people to subscribe only to updates to the issues they’re interested in.
Sourcehut has fantastic mailing lists with patch archival/search as a central feature. It integrates into the issue tracker, you can close tickets with special email headers.
I agree with this effort, pay for sourcehut, mirror some repos to codeberg, and run an internal gitea instance.
That said, I have a hard time replacing the discovery of projects that GitHub provides. Many times when looking for a solution I can search GitHub and find an existing project. Search engines aren’t specialized enough for this type of searching, and most of them are so filled with SEO junk as to be useless.
Also do people really consider stars to be some sort of social thing? I star repos because it’s an easy way to bookmark projects I may be interested in later. Browser bookmarks don’t give me the same specialized filtering. I also consider stars useful when comparing similar projects, but only as one of multiple data points. I’ve never considered them to be some sort of status metric.
I’m really looking forward to federated forges, but I’m worried finding projects will be harder. I hope discovery is part of the federation implementation.
This is just another organization that fears for its influence and therefore wants to distinguish itself with lurid political statements. Copyright law, with all its international conventions and local variations, is far too complex for any developer (or even non-specialist lawyer) to really know. The SFC is exploiting the ignorance of developers here for its political purposes. If there was actually any legal truth to their claims, they would be taking legal action, not just spreading bad vibes.
To quote Linus: “I personally think this arguing for lawyering has become a nasty festering disease, and the SFC and Bradley Kuhn has been the Typhoid Mary spreading the disease.”
I’ll admit there was a time when I rather discounted thoughts like this as pointless noise, because Github offered centralization and discoverability benefits which outweighed the risk in my mind.
I still think it does but I can’t deny that Github Copilot seriously alters the variables of the equation.
I wish some of the more decentralized solutions were further along. I worry about open source being lost forever when Chloe the chain-smoking, twinkie-eating hacker finally has eir coronary and stops paying the hosting bills.
Moved my stuff to Codeberg, was really painless with Giteas migration tools. Gotta say I enjoy the noise free experience. Github was turning into yet another social media IMO.
Also have my own Gitea instance running for other stuff.
Weird to say, but having social interactions on GitHub is what attracted me to use it in the first place.
Nothing wrong with that. I’m just one of those that get really easily addicted to social media. I found myself refreshing github frontpage many times a day to see if anything interesting has popped up… That’s not healthy at all, which is another reason why I moved to Codeberg. It has similar features but they dont jump on your eyes battling for your attention like Github does.
It’s a plug, but this week I made a filter list (for uBlock Origin and the like) to hide some of the overtly social features, specifically from the feeds. As most of my employers and so many projects are on GitHub, I don’t feel like I have a choice, but this list helps keep some of the distraction and attempts to increase engagement at bay.
Thanks for this, it helps.
Wow, I wasn’t actually expecting anyone to try it. It means a lot to hear that!
Gitea really is a self hosting wonder. Huge fan, and I wish it got as much press as Gitlab does.
Cannot second this hard enough. If you went to go look for github alternatives and ended up unimpressed with the buggy, clunky, slow gitlab UI, please give codeberg/gitea a look; they are dramatically higher quality. I feel like every time I go to use gitlab I find a new bug, and I have yet to see a single bug in codeberg.
I suspect this is because Gitlab is trying to do a LOT - just like Github (git web UI, bug tracking, wiki, discussions, locks, socks, lingerie) whereas Gitea does one thing and does it well.
The team behind Gitea also put a ton of effort into things like documentation and ease of installation which matter a lot more than many people give them credit for.
…how is GitHub anything like social media? You can’t make posts or anything…
GitHub is Facebook for programmers. Your posts are your repos, commits, issues & comments, pull requests, discussion posts, wiki pages. Many of these posts can be “liked” using stars and upvotes. There are several kinds of “feeds” where you can see a stream of other people’s posts.
Although I use GitHub for pragmatic reasons, I’m not comfortable with how Facebooky it is. I just found out about Codeberg from this article, and TBH it looks good to me. They don’t have their own CI server yet (in planning since about 2020), so I’d have to think about how I want to do CI, if I switch.
More like Facebook + LinkedIn, considering that potential employers and tech recruiters treat it as sort of a CV. Odd coincidence that LinkedIn is another MS appendage.
Centralization also greases the wheels of surveillance.
So people are refreshing their feed to see if anyone they know has… Opened any issues today?
If I have a PR open, I do check Github for notifications regularly.
doesn’t it email you?
Yes…
Yes, you can do that. I get my feed via email. I see new issues and PRs on my own repos, and I also see comments from issues in other people’s repos that I have commented on. I used to monitor new issues from some repos I don’t own but was active in, but I don’t do that right now.
The stuff in my feed is only for participating in projects, but the facebookiness goes well beyond that. You can follow people and see their activity, you get “achievement badges” whether you want them or not, you can put a ton of information in your personal profile, etc. It’s not as creepy as Linked In or Facebook yet. Nobody has ever hassled me to follow them or star their project. But it’s owned by Microsoft, and their track record suggests that things will get creepier.
Update: yes you can remove those “achievement badges” from your public profile. Just did that.
github.com##.Layout-sidebar a[href$="tab=achievements"]:upward(div)
is the uBlock rule to hide everyone else’s achievements too.
To be honest I check the feed at least once a day to see what the people I followed liked. If you follow the right people you can find the right repos.
What I really want to know is if checking @crazyloglad’s commits once in a while counts as stalking. Asking for a friend.
A few months ago I saw a new lobste.rs post about a GitHub repo I’d just discovered a few hours before. I thought “odd coincidence!”, then I read the first comment (by the person who posted it) saying they’d seen the repo in my GitHub feed when I starred it.
I’m fine with the social aspects of GitHub. It’s useful when you work in a team, or to know how much support there is for fixing a particular issue, or whatever.
They suggest to make your README call out your contempt for the GitHub status quo—and mine already has for a long time now. Things this list misses: package.json, while it has short codes for GitLab and BitBucket, assumes GitHub if you just pass in user/repo—contrast to Nix + nixpkgs, which only allows a short code like sourcehut:~user/repo for several common forges, or the full URL, with no specialer, shorter treatment for GitHub (just github:user/repo). Similarly, basically all Vim plugin managers have short options for GitHub only, or a full URL, which is an implicit endorsement of GitHub. As a result you get downstream tools like https://vimcolorschemes.com/ that get built with only GitHub in mind, because that’s where the plugins are. It seems subtle/convenient, but it matters in shaping ecosystems (see the npm sketch below). Heck, maybe it’s time to look outside of git too: if you wanted to try an alternative VCS, GitHub won’t host it–it’s in the name.
This has been a worry of mine for a while, so I hardcode all of my vim-plug plugin links. Makes it easier to go to the page for a given plugin too.
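To make that npm shorthand concrete, here’s a minimal sketch (user and package names are hypothetical): a bare owner/repo spec silently expands to GitHub, while every other forge needs an explicit prefix or a full git URL:

npm install someuser/somelib                               # implicitly github:someuser/somelib
npm install gitlab:someuser/somelib                        # GitLab needs an explicit prefix
npm install bitbucket:someuser/somelib                     # so does Bitbucket
npm install git+https://codeberg.org/someuser/somelib.git  # anything else: full git URL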
There is github in a single executable: fossil.
While not technically Git, I agree that for most people and companies out there, it would likely be the better solution. Especially for people who barely understand what the difference between GitHub and Git is, or what merge, pull-request, cherry-pick, etc. mean, and are scared of pulling because they’ll spend the rest of the day on Stack Overflow because someone else committed code to the project.
This is an interesting issue, and I think there’s way more to think about than what’s mentioned in the article.
If we’re purely talking about Copilot, I’m not really a fan, but that’s a whole other issue. It’s simply hard to abandon GitHub because viable alternatives are few and far between, and every single one comes with its own set of trade-offs.
Gitea is actively working on ActivityPub federation which should make it much easier for people to contribute to random projects they see without having to specifically make an account on a new server.
And email works just fine for decentralization on projects not looking for other social features like stars—though the workflow is alien to those who have only experienced the merge request flow.
Do you get nice 3-way diffs when you review PRs via email?
Email is a protocol. Your client can produce 3 way diffs if it likes, same as it could if the patch came any other way.
Fair enough. Are there any email clients that do this?
The git book has a section on their email commands. git format-patch generates a patch suitable for sending in plaintext email. Any command line mail client can be used to pipe those to your difftool of choice; building it in as a specific feature doesn’t really make sense, because command line tools expect you to use pipes to compose features.
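A rough sketch of that pipe-based workflow (commit counts and file names hypothetical):

git format-patch -3 -o outgoing/      # one plaintext patch file per commit, ready to mail
git send-email outgoing/*.patch       # or hand them to any mail client
git apply --stat 0001-some-fix.patch  # inspect a received patch's diffstat first
git am 0001-some-fix.patch            # then apply it, preserving author and commit message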
Probably not directly in the email client? But as an Emacs user, I feel like it would be pretty trivial to go from mu4e or Gnus to ediff-merge-buffers. If you’re using some graphical diff tool, I can’t imagine that it would be hard to get a nice 3-way diff using a patch file.
Sourcehut is working on being one. I bet someone (me, eventually, at least) will make a bridge to the AP stuff too, so that the ones that speak AP but not email can be used via email.
Going from “you can do this decentralized” to “well, to get the features you really want there, you’ll need this centralized service’s implementation” in the space of like two comments is impressive.
Neither of the things I mentioned (sourcehut, a self hosted free software, and AP, a decentralized social network protocol) are centralized services
The fact that people can self-host sourcehut does not make real-world typical use of sourcehut be decentralized. If — and this is a big if — sourcehut were to take off in popularity, for example, it would not be sourcehut the self-host-able piece of software taking off, it would be a particular instance of sourcehut taking off, and thus becoming the centralized bottleneck all over again.
You don’t really need/want federation for that, just SSO.
And it will gain maybe a few extra users as a result, but never achieve critical mass the way centralized services have, because there are no technological solutions to the social forces that drive centralization.
This is extremely appealing, I had no idea.
While it would be great to make dev work easier, it doesn’t solve search/discovery. Unfortunately GH is still a great place to search for open source projects. I don’t know if AP can address that part, but I bet someone will tackle this next.
But not without flaws. I dislike that I can’t search for an exact string
Pretty sure Google indexes GitHub and also all other places, so searches there work even better IME :)
Well that’s the point. The call is to give up some of those things, a sacrifice in protest of GitHub’s abusive behaviours.
Note I said ‘some of’ because honestly, GitHub’s PR interface and some of its other features are…less than great.
This really kills gitea, for now. The federation will revive it ♥
Honestly, I don’t really have many problems with GitHub. It works decently, and if it goes to hell, I can just push somewhere else and deal with the fallout later. Actually finding projects/code is useful with code search (ignoring ML sludge), and I really don’t see how people can get addicted to the whole stars thing. Besides, if it’s public, something like Copilot will snarf it anyways.
I was a long-time holdout from GitHub. I pushed every project I was contributing to and every company that I worked for to avoid it because I don’t like centralised systems that put control in a single location. I eventually gave up for two reasons:
It’s fairly easy to migrate from GitHub if you ever actually want to. Git is intrinsically decentralised. GitHub Pages and even GitHub wikis are stored in git and so can just be cloned and taken elsewhere (if you’re sensible, you’ll have a cron job to do this to another machine for contingency planning; see the sketch below). Even GitHub Issues are exposed via an API in machine-readable format, so you can take all of this away as well. I’d love to see folks that are concerned about GitHub provide tooling that lets me keep a backup of everything associated with GitHub in a format that’s easy to import into other systems. A lot of my concerns about GitHub are hypothetical: in general, centralised power structures and systems with strong network effects end up being abused. Making it easy to move mitigates a lot of this, without requiring you to actually move.
The projects I put on GitHub got a lot more contributions than the ones hosted elsewhere. These ranged from useless bug reports, through engaged bug reports with useful test cases, up to folks actively contributing significant new features. I think the Free Software movement often shoots itself in the foot by refusing to compromise. If your goal is to increase the amount of Free Software in the world, then the highest impact way of doing that is to make it easy for anyone to contribute to Free Software. In the short term, that may mean meeting them where they are, on proprietary operating systems or other platforms. The FSF used to understand this: the entire GNU project began providing a userland that ran on proprietary kernels and gradually replaced everything. No one wants to throw everything away and move to an unfinished Free Software platform, but if you can gradually increase the proportion of Free Software that they use then there becomes a point where it’s easy for them to discard the last few proprietary bits. If you insist on ideological purity then they just give up and stay in a mostly or fully proprietary ecosystem.
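On the first point above, a minimal sketch of the contingency setup I mean, using git’s own mirroring plus GitHub’s official gh CLI (repo names and paths are hypothetical):

git clone --mirror https://github.com/someuser/somerepo /backup/somerepo.git  # one-time setup
git --git-dir=/backup/somerepo.git fetch --prune origin                       # re-run this from cron
gh api 'repos/someuser/somerepo/issues?state=all' --paginate > /backup/somerepo-issues.json  # issues, machine-readable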
Even if it’s possible, even easy, to copy your content from Github when they cross some threshold you’re no longer ok with, there will be very little to copy to unless we somehow sustain development of alternatives during the time it takes to reach that threshold.
IMHO it would be better if the default was at least “one of the three most popular” rather than “Github, because that’s what everyone uses”.
If you use their issue tracker, pull requests and so on, those will be lost too; they aren’t easily pushable to another git host. Such things can tell a lot about a project and the process of it getting there, so it would be sad if that was lost.
Agreed on everything but Copilot. The freedom to study how the software works is a fundamental attribute of free software. Learning is not covered by the GPL’s requirements. Copilot sometimes copypastes code (honestly - who doesn’t) but broadly it learns. This is entirely in keeping with open source.
If we’re gonna set a standard that you can’t use things you learnt from software under an open-source license when writing commercial software, we might as well shutter either the entire software industry or the entire open-source movement, because literally everybody does that. It’s how brains work!
And of course, it’s not like being off Github is gonna prevent MS from feeding your project into Copilot 2.
Copilot does not learn.
Like all of these neural network “AIs”, it’s just a pattern recognition system that launders the work of many humans into a new form, which the corporation can profit from but the humans cannot. It’s piracy for entities rich enough to train and operate such an AI, and unethical enough to acquire the training data, but you or I would still be punished for pirating from the corporation. Whether or not it is legal is irrelevant to me (I’m in favor of abolishing copyright), but we must recognize the increasing power imbalance between individuals and corporations such “AI” represents.
Copilot understands nothing of what it writes. It learns nothing and knows nothing. It is not sentient or alive, no matter how tempting it is to anthropomorphize it.
I think “pattern recognition system that launders the work of many humans into a new form” is just a rude way to phrase “learning.”
Define “understands.” Define “knows.” I think transformers derive tiered abstract patterns from input that they can generalize and apply to new situations. That’s what learning is to me.
The standard philosophical definition of knowledge is a justified true belief. Copilot and other AIs make the belief part problematic, so bracket that. But they don’t justify things well at all. Justification is a social process of showing why something is true. The AIs sound like total bullshit artists when asked to justify anything. I don’t think Copilot “knows” things anymore than a dictionary does yet.
Putting aside Gettier cases, that’s not what I understand “justified” to mean. You just need to have a reason for holding the knowledge. With AI, reinforcement learning is the justification.
The point of “justified belief” is just that it’s not knowledge if you just guess that it’s raining outside, even if it is in fact raining.
The definition that @carlmjohnson is quoting is Plato’s and ever since Plato put it forth, knowledge theorists have been bickering about what “justified” means. The history of ideas after the age of Boethius or so isn’t quite my strong point so I’ll leave that part to someone else but FWIW most classical definitions of justification either don’t readily apply to reinforced learning, or if they do, it fails them quite badly.
That being said, if you want to go forth with that definition, it’s very hard to frame a statistical model’s output as belief in the first place, whether justified or not. Even for the simplest kinds of statistical models (classification problems with binary output – yes/no) it’s not at all clear how to formulate what belief the model possesses. For example, it’s trivial to train a model to recognize if a given text is an Ancient Greek play or not. But when you feed it a piece of text, the question that the model is “pondering” isn’t “Is this an Ancient Greek play”, but “Should I say yes?”, just like any other classification model. If subjected to the right laws and statements, a model that predicts whether a statement would cause someone to be held in contempt of the court might also end up telling you if a given text is an Ancient Greek play with reasonable accuracy, too. “Is this an Ancient Greek play?” and “Is this statement in contempt of the court?” are not equivalent statements, but the model will happily appear to make both with considerable accuracy.
The model is making an inference about the content (“This content is of the kind I say yes to/the kind I say no to”), but because the two kinds cannot be associated to a distinct piece of information about the subject being fed to the model, I don’t think it can be said to constitute a belief. It’s not a statement that something is the case because it’s not clear what it asserts to be the case or not: there are infinitely many different classification problems that a model might turn out to solve satisfactorily.
In Greek, “justified” was some variation on “logos”: an account. Obviously everyone and their Buridan’s ass has a pet theory of justification, but I think it’s fair to interpret Plato’s mooted definition (it’s rejected in the dialogue IIRC!) as being “the ability to give an account of why the belief is true”. This is the ability which Socrates finds that everyone lacks, and why he says he knows that he knows nothing.
Ugh, it’s really tricky. This comes up in two dialogs: Theaetetus, where knowledge gets defined as “true judgement with an account” (which IIRC is the logos part) and it’s plainly rejected in the end. The other one is Meno, where it’s discussed in the terms of the difference between true belief and knowledge, but the matter is not definitively resolved.
I was definitely wrong to say it was Plato’s – I think I edited my comment which initially said “is effectively Plato’s” because I thought it was too wordy but I was 100% wrong to do it, as Plato doesn’t actually use this formulation anywhere (although his position, or rather a position that can be inferred from the dialogues, is frequently summarized in these terms). (Edit: FWIW this is a super frequent problem with modern people talking about ancient sources and one of the ways you can probably tell I’m an amateur :P)
You may know of this already but just in case your familiarity with modern philosophy is as spotty as mine, only it’s got holes in different places, and if you’re super curious and patient, you’re going to find Gettier’s “Is Justified True Belief Knowledge?” truly fascinating. It’s a landmark paper that formalizes a whole lot of objections to this, some of them formulated as early as the 15th century or so.
The counter-examples Gettier comes up with are better from a formal standpoint but Russell famously formulated one that’s really straightforward.
Suppose I’m looking at a clock which shows it’s two o’clock, so I believe it’s two o’clock. It really is two o’clock – it appears that I possess a belief that is both justified (I just looked at the clock!) and true (it really is two o’clock). I can make a bunch of deductions that are going to be true, too: for example, if I were to think that thirty minutes from now it’s going to be half past two, I’d be right. But – though I haven’t realized it – that clock has in fact been stopped since yesterday at two. (Bear with me, we’re talking about clocks from Russell’s age). My belief is justified, and it’s true, but only by accident: what I have is not knowledge, but sheer luck – I could’ve looked at the clock at half past two and held the same justified belief, but it would’ve been false, suggesting that an external factor may also be involved in whether a belief is true or not, justified or not, and, thus, knowledge or not, besides the inherent truth and justification of a statement.
I love collecting “realistic” Gettier problems:
My favourite ones come from debugging, that’s actually what got me into this in the first place (along with my Microwave Systems prof stubbornly insisting that you should know these things, even if engineers frown upon it, but that’s a whole other story):
I’ve got a few other ones but it’s really late here and I’m not sure I’m quite coherent by now :-D.
I’m familiar with Gettier cases. I never dove very deep into the literature. It always struck me that a justification is not just a verbal formulation but needs some causal connection to the fact of the matter: a working clock causes my reasoning to be correct but a stopped clock has no causal power etc. I’m sure someone has already worked out something like this and brought out the objections etc etc but it seems like a prima facie fix to me.
Yes, IMO the belief is “how may this text continue?” However, efficiently answering this question requires implicit background knowledge. In a similar sense, our brains may be said to only have information about “what perpetuates our existence” or “what makes us feel good.” At most we can be said to have knowledge of the electric potentials applied to our nerves, as Plato also made hay of. However, as with language models, a model of the unseen world arises as a side effect of the compression of sensory data.
Actually, language models are fascinating to me because they’re a second-order learner. Their model is entirely based on hearsay; GPT-3 is a pure gossip. My hope for the singularity is that language models will be feasible to make safe because they’ll unavoidably pick up the human ontology by imitation.
That’s a question, not a belief – I assume you meant “This text may continue ”. This has the same problem: that’s a belief that you are projecting onto the model, not necessarily one that the model formulates. Reasoning by analogy is an attractive shortcut but it’s an uneasy one – we got gravity wrong because of it for almost two thousand years. Lots of things “may be said” about our brains, but not all of them are true, and not all of them apply to language models.
Sure, but by that metric everything that anyone has ever said is a belief that person is projecting. I think that language models match the pattern of having a belief, as I understand it.
You’re mixing up meta-levels here: I believe that the model believes things. I’m not saying that we should believe that the model believes things because the model believes that; rather, (from my perspective) we should believe it because it’s true.
In other words, if I model the learning process of a language model, the model in my head of the process fits the categories of “belief” and “learning”.
Observing that a model follows the pattern of a behaviour is not the same as observing that behaviour though. For example, Jupiter’s motion matches the pattern of orbiting a fixed Earth on an epicycle, but both are in fact orbiting the Sun.
FWIW, this is an even weaker assumption than I am making above – it’s not that no statements are made and that we only observe something akin to statements being made. I’m specifically arguing that the statements that the model appears to make (whether “it” makes them or not) are not particular enough to discriminate any information that the model holds about the world outside of itself and, thus, do not qualify as beliefs.
If the world had a different state, the model would have different beliefs - because the dataset would contain different content.
Also, Jupiter is in fact orbiting a fixed Earth on an epicycle. There is nothing that inherently makes that view less true than the orbiting-the-sun view. But I don’t see how that relates at all.
The problem is that reinforcement learning pushes the model toward reproducing the data distribution it was trained on. It’s completely orthogonal to truth about reality, in exactly the same way as guessing the state of the weather without evidence.
The data is sampled from reality… I’m not sure what you think evidence is, that training data does not satisfy.
It’s exactly the same as guessing the weather from a photo of the outside, after having been trained on photo/weather pairs.
The data for language models in general is sampled from strings collected from websites, which includes true statements but also fiction, conspiracy theories, poetry, and just language in general. “Do you really think people would get on the Internet and tell lies” is one of the oldest jokes around for a reason.
You can ask GPT-3 what the weather is outside, and it’ll give you an answer that is structured like a real answer would be, but has no relation to the actual weather outside your location or whatever data centers collectively host the darned thing. It looks like a valid answer, but there’s no reason to believe it is one, and it’s dangerous to infer that anything like training on photo/weather pairs is happening when nobody built that into the actual model at hand.
Copilot in particular is no better - it’s more focused on code specifically, but the fact that someone wrote code does not mean that code is a correct or good solution. All Copilot can say is that it’s structured in a way that resembles other structures it’s seen before. That’s not knowledge of the underlying semantics. It’s useful and it’s an impressive technical achievement - but it’s not knowledge. Any knowledge involved is something the reader brings to the table, not the machine.
Oh I’ll readily agree that Copilot probably doesn’t generate “correct code” rather than “typical code.” Though if it’s like GPT-3, you might be able to prompt it to write correct code. That might be another interesting avenue for study.
“However, this code has a bug! If you look at line”…
I’ve experimented with this a bit and found it quite pronounced - if you feed copilot code written in an awkward style (comments like “set x to 1”, badly named variables) you will get code that reflects that style.
IMHO it’s perilous and not quite fair to decide what a machine should be allowed to do and not to do by semantic convention. “Machine learning” was one uninspired grant writer away from going down in history as, say, “statistically-driven autonomous process inference and replication”, and we likely wouldn’t have had this discussion, because anything that replicates code is radioactive for legal teams.
Copilot is basically Uber for copy-pasting from Stack Overflow. It’s in a legally gray area because the legal status of deriving works via statistical models is unclear, not because Microsoft managed to finally settle the question of what constitutes learning after all. And it’s probably on the more favourable side of gray shades because it’s a hot tech topic so it generates a lot of lobbying money for companies that can afford lawyers who can make sure it stays legally defensible until the next hot tech topic comes up.
Also, frankly, I think the question of whether what Copilot does constitutes learning or not is largely irrelevant, and that the question of whether Copilot-ing one’s code should be allowed is primarily rooted in entitlement. Github is Microsoft’s platform so, yes, obviously, they’re going to do whatever they can get away with on it, including things that may turn out to be illegal, or things that are illegal but will be deemed legal by a corrupt judge, or whatever. If someone doesn’t want $evil_megacorp to do things with their code, why on Earth was that code anywhere near $evil_megacorp’s machines in the first place?
This cannot be a surprise to anyone who’s been in this field for more than a couple of years. Until a court rules otherwise, “fair” is whatever the people running a proprietary platform decide is fair. If anyone actually thought Github was about building a community and helping people do great things together or whatever their mission statement is these days, you guys, I have a bridge in Manhattan, I’m selling it super cheap, the view is amazing, it’s just what you need to take your mind off this Copilot kerfuffle, drop me a line if you wanna buy it.
(Much later edit: I know Microsoft is a hot topic in FOSS circles so just to be clear, lemme just say that I use Github and have zero problem with Copilot introducing the bugs that I wrote in other people’s programs :-D).
If machine learning was called “data replication”, it would be misnamed. And if it was called “pattern inference”, it would just be a synonym for learning… I wouldn’t care about Codex if I thought it was just a copypaste engine. I don’t think it is, though. Does it occasionally copypaste? Sure, but sometimes it doesn’t, and those are the interesting cases for me.
I don’t think this at all comes down to Github being Microsoft’s platform so much as Github being the biggest repo in one place.
I’m not at all defending Microsoft for the sake of Microsoft here, mind. I hate Microsoft and hope they die. I just think this attack does not hold water.
I beg to differ! Machine learning is a misnomer for statistically-driven autonomous process inference and replication, not the other way ’round!
I’m obviously kidding but what I want to illustrate is that you shouldn’t apply classical meaning to an extrapolated term. A firewall is neither a wall nor is it made of fire, and fire protection norms doesn’t apply to it. Similarly, just because it’s called machine learning, doesn’t mean you should treat it as human learning and apply the same norms.
I don’t think machine learning learns because it’s called machine learning, I think it learns because pattern extraction is what I think learning is.
I realize that. I want to underline that, while machine learning may be superficially analogous to human learning, just like a firewall is superficially analogous to a wall made of fire, it does not mean that it should be treated the same as human learning in all regards.
I don’t think it should be treated the same as human learning in all regards either. I think it’s similar to human learning in some ways and dissimilar in others, and the similarities are enough to call it “learning”.
Do you think Microsoft would be okay with someone training an AI on the leaked Windows source code and using it to develop an operating system or a Windows emulator?
You don’t even have the right to read that. That said, I think it should be legal.
I’m not asking whether it should be legal, but whether Microsoft would be happy about it. If not, it’s hypocritical of them to make Copilot.
Oh by no means will I argue that Microsoft are not hypocritical. I think it’s morally valid though, and whether Microsoft reciprocates shouldn’t enter into it.
Bit of a niggle, but it depends on the jurisdiction, really. Believe it or not, there exist jurisdictions where the Berne Convention is not recognized and as such it is perfectly legal to read it.
I’d personally relicense all my code to a license that specifically prohibits it from being used as input for a machine-learning system.
This is specifically regarding text and images, but the principle applies.
https://gerikson.com/m/2022/06/index.html#2022-06-25_saturday_01
“It would violate Freedom Zero!” I don’t care. Machines aren’t humans.
Exactly this. I think anthropomorphising abstract math executed in silicon is a trap for our emotional and ethical “senses”. We cannot fall for it. Machines and algorithms aren’t humans, aren’t even alive in any sense of the word, and this must inform our attitudes.
Machines aren’t humans. That’s fine, but irrelevant.
Machines aren’t alive. Correct, but irrelevant.
If the rule doesn’t continue to make sense when we finally have general AI or meet sapient aliens, it’s not a good rule.
That said, we certainly don’t have any human-equivalent or gorilla-equivalent machine intelligences now. We only have fuzzy ideas about how meat brains think, and we only have fuzzy ideas about how transformers match input to output, but there’s no particular reason to consider them equivalent. Maybe in 5 or 10 or 50 years.
Won’t happen. If it does happen, we all die very soon afterwards.
I think the rule is good. We could come up with a different rule: oxygen in the atmosphere is a good thing. If we reach general AI or meet sapient aliens, they might disagree. Does that mean the rule was bad all along? I feel similar about anthropomorphising machines. It’s not in our ecological interest to do so.
Source distribution is like the only thing that’s not covered by Freedom Zero so you’re good there 🤷🏻♀️
Arguably the GPL and the AGPL implicitly prohibit feeding it to copilot.
(I personally don’t mind my stuff being used in copilot so don’t shoot the messenger on that.
(I don’t mind opposition to copilot either, it sucks. Just, uh, don’t tag me.))
Do we have a lawyer’s take here, because I’d be very interested.
It’s the position of the Software Freedom Conservancy according to their web page. 🤷🏻♀️ It hasn’t been tried in court.
I’m on board; however, I would, at least personally, make an exception if the machine-learned tool and its data/neural net were free, libre, and open source too. Of course the derivative work also needs to not violate the licenses.
For most intents and purposes, licences legally cover it as “creation of derived works”, otherwise why would “clean room design” ever exist. Just take a peek at the decompiled sources, you’re only learning after all.
I think this depends on the level of abstraction. There’s a difference in abstraction between learning and copying - otherwise, clean room design would itself be a derivative work.
I don’t understand what you mean. Clean-room implementation requires not having looked at the source of the thing you’re re-implementing. If you read the source code of a piece of software to learn, then come up with an independent implementation yourself, you haven’t done a clean-room implementation.
Cleanroom requires having read documentation of the thing you are reimplementing. So some part of the sequence read -> document -> reimplement has to break the chain of derivation. At any rate, my contention is that training a neural network to learn a concept is not fundamentally different from getting a human to document leaked source code. You’re going from literal code to abstract knowledge back down to literal code.
Would it really change your mind if OpenAI trained a second AI on the first AI in-between?
I think it’s quite different in the sense that someone reading the code’s purpose may come up with an entirely different algorithm to do the same thing. This AI won’t be capable of that - it is only capable of producing derivations. Sure, it may mix and match from different sources, but that’s not exactly the same as coming up with a novel approach. For example, unless there’s something like it in the source you feed it, I doubt the “AI” would be able to come up with Carmack’s fast inverse square root.
You can in theory get Codex to generate a comment from code, and then code from the comment. So this sort of process is entirely possible with it.
It might be an interesting study to see how often it picks the same algorithm given the same comment.
In copyright law, we have usually distinguished between an interface and an implementation. The difference there is always gonna be fuzzy, because law usually is. But with an AI approaches, there’s no step which distinguishes the interface and the implementation.
One problem here is the same sort of thing that came up in the Oracle/Google case — what do you do with things that have one “obvious” way to do them? If I’m the first person to write an implementation of one of those “obvious” functions in a given language, does my choice of license on that code then restrict everyone else who ever writes in that language?
And a lot (though of course not all) of the verbatim-copying examples that people have pointed out from Copilot have been on standard/obvious/boilerplate type code. It’s not clear to me that licensing ought to be able to restrict those sorts of things, though the law is murky enough that I could see well-paid attorneys arguing it either way.
The genie is out of the bottle with copilot-like software. If Github don’t do it, someone else will.
+1 to more Git forge diversity, however. I use Sourcehut.
These last few days I’ve seen announcements from Amazon and Salesforce regarding “ML powered code generation”. It’s definitely not exclusive to GH.
I recently deleted all my GitHub repositories - I’d mostly migrated to GitLab, but the Copilot release really spurred me to complete the migration.
On my radar is a subsequent migration from GitLab to sourcehut. Not because I have any particular issue with GitLab - all of my interactions with them have been positive, and I like their product. But I like sourcehut more; their approach is more in line with my personal preferences.
Most complaints towards GitLab are around the open core model and how the VC funding could be a bad influence on the platform. So far it seems they are doing okay financially.
I’m only on github to make PRs. An important step for me was to consciously avoid the dopamine triggers by removing all my stars and follows, and making my profile page as boring and inconsequential as possible. I find it’s easier to ignore the social scoring by committing not to reciprocate.
I’m genuinely curious, what’s your reason for doing that? To me, those things are the most direct indicators possible that people give a shit about what you’re doing and about you personally. That’s kind of what it’s all about for me; having people use and care about things I create is my primary motivation for programming (outside of work, which I do primarily for money).
Clicking a button doesn’t exactly signal “giving a shit” to me… it requires no effort. What signals (to me) that people give a shit is seeing patches come in, bug reports, random folks on IRC or whatever saying “hi” and thanking me, seeing distros picking up my stuff, and so on. Giving fake internet points doesn’t necessarily mean that anyone gives a shit, at best they’re probably bored or mildly interested enough to click a button and move on to the next shiny thing in their feed.
Every pull request or email patch I’ve received is a thousand times more meaningful than any star or follow on Github. Those are just pointless.
Originally stars were for bookmarking, but it’s degenerated into meaningless 👍🏻/+1 noise.
Exactly - a “star” can simply mean “hey this looks cool”. I’m sure a majority of people who star a project never even tried to use the project. It’s just ego inflation. More important is that people actually use your stuff, in places where it matters. If some project is technically cool but unusable, it could still acquire many many stars.
I usually star projects so I can find them later.
My bar for that rests at the point that someone gives me their personal feedback on my work in a way that lets me know they have actually read, studied, or used it. That is giving a shit. Competing with the whole world and collecting a few imaginary stars, stickers, or points does not say anything about your work, unless you happen to be a marketeer.
I set mine to private; everyone can. Eventually all my code will be self-hosted and accessible via an RSS feed and https.
I’d love for open source development to be decentralized and I host a semi-private Gitea instance on my server (most of the projects are public but registrations are disabled and only I have an account), but to be completely honest:
The accounts problem could be solved by ActivityPub federation, which as Hail_Spacecake has mentioned in another comment Gitea is working on, or maybe by just teaching people to use email and having a nice email/git UI as SourceHut is doing, but code search is a deal breaker for me. Are any of these projects working towards some sort of federated code search, or is there some other service that indexes code from forges other than GitHub?
Created a utility to export github issues to mbox (https://github.com/abbbi/issue2mbox). Some people used it to import their projects’ issues from github to sourcehut.
On the topic of GitHub co-pilot, I can generally recommend this article by Felix Reda (who was very involved in the last copyright reform in the EU, as a member of parliament for the pirate party).
https://felixreda.eu/2021/07/github-copilot-is-not-infringing-your-copyright/
So all code written using Copilot is automatically public domain? I don’t think Copilot users will be happy about that…
The argument seems to be that the things that are reproduced are too small to be covered by copyright. In general, copyright has a threshold for originality that must be met. To take an extreme example, a lot of folks use i as an induction variable in loops. I suspect that the first code to do this is still in copyright (copyright extensions mean that basically anything since the invention of the stored-program computer is still in copyright, unless someone forgot to file back when that was a requirement). I think it’s pretty clear that a line such as for (int i=0; i<size; i++) { or even for (auto &i : collection) { would not be subject to copyright. The second one is more interesting because C++11 range-based for loop syntax was introduced after the rules around copyright were changed so that the act of creation, rather than an explicit filing with the copyright office, conferred copyright. It’s impossible to prove whether someone independently created that line or copied it from another project. That level of copying is fine.
I think it’s likely to require a court case to determine whether this is actually the case for Copilot. For example, one of the things that I saw it able to reproduce from one of the iD game engines was of a complexity that would meet the requirements for patenting. This generally requires a higher bar than copyright. If it’s creating a complete algorithm then that may well meet the bar for originality.
Patents are also interesting in other ways. Some open-source projects come with separate patent licenses that grant you the rights to a patent for derived works of the code in the project. This puts you in a deeply uncomfortable position if Copilot injects the code into your project: if you argue that your code is a derived work, you must comply with the copyright license. If you argue that your code is not a derived work, you do not have a patent license.
I found the earlier arguments somewhat disingenuous. Yes, data mining is explicitly permitted, but a degenerate example of a data mining system would be one that indexes entire source files and reproduces them in their entirety based on a keyword. If that were seen as not propagating copyright then any indexing system with learned indexes provides a bulletproof end run around copyright. It would take only a small tweak to BitTorrent’s distributed hash table implementation to provide a mechanism for sharing copyrighted works that is completely legal. I doubt that this was the intent of the law and I doubt that any court would rule in that direction.
Some years ago, I proposed that BitTorrent could be defended in court because Fair Use / Fair Dealings permitted quoting extracts up to some defined limits. If each seed shared only a section of the original within those limits, then they’d all be defensible as quoting. If you collect a set of quotes and assemble a complete work, that’s fine. I discussed this with a lawyer and the counter argument was basically: this is why we have judges. Judges (current US Supreme Court Justices aside) are not idiots. They can spot when someone is trying to do an end-run around the law and they are deeply unsympathetic. Judges exist, in part, because statutes are not 100% unambiguous and it’s important to employ judgement.
or does it mean that copilot can “learn” (i.e. copypasta) my GPL code into someone else’s project, then it’s somehow magically in the public domain?
Thanks for the link. A (somewhat) reasonable voice in all the unfounded mass hysteria. It remains to be mentioned, however, that Reda is not a lawyer but a politician, and did not study law. Even if Reda has participated in the European copyright reform, the statement is just an opinion. There are comments by IP lawyers available, e.g. here: https://fossa.com/blog/analyzing-legal-implications-github-copilot/ or https://www.youtube.com/watch?v=7HWIxLKrZ_w
While Felix is not a lawyer, he was a lawmaker behind those laws and helped create the mechanisms at hand - which often requires an understanding as thorough as a lawyer’s. In particular, it’s a good introduction to the mechanisms at play there.
Still, as with any legal debate, it lives through dissenting opinions, so huge thanks for adding yours :).
Always like to help.
The fact that there is a reasonable chance that GitHub will eventually bow to pressure from internal and external activists to cease doing business with a client they dislike on partisan American political grounds (even if GitHub hasn’t yet done this) is also a good reason to avoid relying on their services. If activists, including people who work at GitHub, can apply political pressure to deplatform ICE, they can decide that you or your company is as problematic as ICE and apply political pressure against you. GitHub is in large part a centralized social media company, and if you become dependent on any centralized social media network, you become vulnerable to pressure from any activists who have the ear of the controller of that network to cut you off from that network.
Fortunately, contributing code via git is one of the forms of online social interaction least vulnerable to centralization pressure, and using a GitHub alternative or hosting your own is very easy to do and very close to as good of an experience as using GitHub - as long as you make sure to not rely on any strictly-non-git infrastructure GitHub provides. If you run your own wiki and bug tracker and CI system, you’re not vulnerable if GitHub suddenly decides to stop offering you those services.
This has got to be the strangest take I’ve read today - inventing a conjunction of multiple events, none of which are particularly likely, as a justification for self hosting (which hardly lacks for good arguments).
You say this, but they have not done it, and never have, so where is the evidence they will? This all comes off as FUD.
What’s the evidence that they won’t? GitHub’s existing leadership could have a genuine change of political persuasion, they might get replaced with people who are more friendly to the activist demands, perhaps something in the broader American political landscape will change that changes the calculus GitHub’s leadership is currently making. The important point is that GitHub could cut any of their users off at any time, and if any of their users are politically unpopular among the sorts of people who work at or influence GitHub, they’ll have no recourse. It would be better to not depend on GitHub to begin with, and reduce dependency on them if it exists.
It won’t be just American political pressure either. The Chinese have ways of bending many US concerns to their will, including Microsoft. Try designing software referencing Falun Gong, Tiananmen, or Xinjiang and see what happens.
It’s so weird that people develop close-source software on github, despite being fully aware that MS owns it and can see everything they do in their “private” repos.
Or that FOSS projects develop on it despite knowing the platform is closed source and requires its users to have accounts and interact on a closed platform–and then tangentially only supporting community communications on Discord.
To me that’s more rational, because it’s easy to move to a different platform if github gets in the way of your OSS development. It’s a lot harder to take your secret developments back if MS snoops on them.
It’s not just about the code. You’re missing all communications and freedom/privacy aspects for your users. I read a quote that stuck with me.
I bet this “post” will not trigger the same big waves as the first github/microsoft merger.
GitHub is really great. I love using it. But it’s good other people use other software. More competition.
Unfortunately, it’s difficult to compete with GitHub because of the “you have to be on GitHub to contribute to GitHub projects”-effect.
back when I hosted projects on github I would always accept patches sent to me directly, but unfortunately this accommodating attitude is rare among github users.
I imagine one way to fight Copilot is to throw up a lot of wrong code on github. But who has time for that?
A separate model trained on all the bugs in my repositories which can spit out a thousand new repositories (or one LARGE repository) with similar bugs running through it? If it is being automatically ingested, might as well automatically create it…
I have a different approach: I use GitHub for what it is, a social network. But as with every system, after a while GitHub will probably not be a good platform for me anymore. So I also self-host on Gitea (which I greatly recommend over Gitlab for personal use).
And when I push, I push to both github and gitea.
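For what it’s worth, a single git push can update both; a sketch with hypothetical URLs (note that the first --add replaces the default push URL, so both remotes must be listed):

git remote set-url --add --push origin git@github.com:someuser/project.git
git remote set-url --add --push origin git@gitea.example.org:someuser/project.git
git push  # now updates both forges in one go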
I have mixed feelings about this.
I don’t entirely trust microsoft as a steward of FOSS projects. Gitlab was large enough that self hosting it was a bit of a process. It is inconvenient having to create a new account for every project that you participate in.
Git itself is wonderful, and flexible, but the builtin hosting tools are slimmer than people have come to expect.
It would be nice if there was a decentralized option. I’m in love with Go’s packaging, how you can use http meta tags to transparently point to source from various providers (self hosted, github, gitlab) without any extra work by the package consumer. I’d love to have a distributed code social network built on top of something like that.
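For anyone who hasn’t seen the mechanism: go get asks the import path’s domain where the code actually lives, so the forge is just an implementation detail. A sketch with hypothetical domains and paths:

curl -s 'https://example.org/mypkg?go-get=1' | grep go-import
# the page answers with a tag like the following, pointing the toolchain at a self-hosted Gitea:
# <meta name="go-import" content="example.org/mypkg git https://gitea.example.org/someuser/mypkg">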
I think that ship sailed a decade ago but as the kids are saying nowadays, they’re right and they should say it.
It’s stupid, and I feel like as a software developer I was supposed to like it, but I absolutely hate that github ditched support for passwords. I don’t create new repos often and I curse github every time I have to. The error message is terrible (it asks you for username/password and then tells you to use token-based authentication; then why ask for username/password in the first place? Typically you would use SSH instead, and you just have to know this is a possibility).
I use a password manager and feel like password-based authentication is secure for me. I feel like github just made it so much harder for newbies to enter programming, and caused a big pain for existing users.
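For anyone hitting the same wall, switching an existing HTTPS clone to SSH sidesteps tokens entirely (repo name hypothetical):

git remote -v  # check which protocol the clone uses
git remote set-url origin git@github.com:someuser/somerepo.git
git push       # now authenticates with your SSH key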
I use it and quite frankly, none of these are issues for me.
The central part I use is the git itself - and I can move if I wanted to.
It would be pretty cool if Git had a standard way to store and share pull requests (and issues?) though.
As much of an advocate of patch mails I am, for most people that just moves the problem - now the question becomes where do you store all those patch emails and make them easily searchable? Also, most mail archives don’t group long-running threads together in a way that spans multiple months or years. There’s also not an easy way to figure out which “issues” are still open, which are closed and who’s working on what, nor is there an easy way for people to subscribe only to updates to the issues they’re interested in.
Sourcehut has fantastic mailing lists with patch archival/search as a central feature. It integrates into the issue tracker, you can close tickets with special email headers.
It’s nice, but it leaves out all the interesting parts. (Interchange format, storage, replies…)
I want something that is to request-pull as LFS is to HTTP.
That’s what mailing lists are for.
I agree with this effort, pay for sourcehut, mirror some repos to codeberg, and run an internal gitea instance.
That said I have a hard time replacing the discovery of projects that GitHub provides. Many times when looking for a solution I can search GitHub and find an existing project. Search engines aren’t specialized enough for this type of searching, and most of them are so filled with SEO junk to be considered useless.
Also do people really consider stars to be some sort of social thing? I star repos because it’s an easy way to bookmark projects I may be interested in later. Browser bookmarks don’t give me the same specialized filtering. I also consider stars useful when comparing similar projects, but only as one of multiple data points. I’ve never considered them to be some sort of status metric.
I’m really looking forward to federated forges, but I’m worried finding projects will be harder. I hope discovery is part of the federation implementation.
+1.
I’m going to start querying across multiple sites, as a poor workaround:
“rust (site:github.com | site:gitlab.com | site:sr.ht | site:codeberg.org | site:gitlab.gnome.org)”
Is there a better alternative?
We could dump a scrape of all those services to typesense.
Or better yet, they could have a standard search endpoint and other parties could fan the queries out and aggregate results.
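No such standard endpoint exists today, so this is purely hypothetical, but the fan-out itself would be nearly a one-liner if every forge exposed, say, /api/v1/search returning a JSON array:

for host in codeberg.org gitlab.com git.sr.ht; do
  curl -s "https://$host/api/v1/search?q=rust"  # hypothetical shared endpoint
done | jq -s 'add'                              # merge the per-forge result arrays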
Start a new forge? Register it for inclusion in the lobsters search engine.
This is just another organization that fears for its influence and therefore wants to distinguish itself with lurid political statements. Copyright law, with all its international conventions and local variations, is far too complex for any developer (or even non-specialist lawyer) to really know. The SFC is exploiting the ignorance of developers here for its political purposes. If there was actually any legal truth to their claims, they would be taking legal action, not just spreading bad vibes.
To quote Linus: “I personally think this arguing for lawyering has become a nasty festering disease, and the SFC and Bradley Kuhn has been the Typhoid Mary spreading the disease.”
I’ll admit there was a time when I rather discounted thoughts like this as pointless noise, because Github offered centralization and discoverability benefits which outweighed the risk in my mind.
I still think it does but I can’t deny that Github Copilot seriously alters the variables of the equation.
I wish some of the more decentralized solutions were further along. I worry about open source being lost forever when Chloe the chain smoking twinkie eating hacker finally has eir coronary and stops paying the hosting bills.