But the point seems valid to me — what good is the freedom to modify the software on your computer if doing so is brain-fryingly difficult?
The underlying problem is that developing for a very, very deep and complex hierarchy of components is hard. It usually requires big teams with well-organized development and testing processes. (I’ve worked on such a team.) When an individual opens up some component in the middle, like a UI framework, and starts making changes, it’s pretty likely stuff will break, or at least the resulting build & integration work will be challenging.
I still recognize the value of such freedom, but I think many descriptions are being disingenuous in how they describe it as being the same as tinkering with a bike or a guitar, which are approachable to nearly anyone.
The legal freedom is the required element, but it’s not sufficient alone. Whether any modification is easy is somewhat orthogonal to the software license (I bet Microsoft couldn’t easily modify keybindings either across eleven different bug-compatible UI toolkits and the Office Multiverse).
I get that difficulty of actually modifying software is a problem, but Free Software is a specific term, so reframing it to encompass usability of patching libraries across a whole desktop app ecosystem is stretching it beyond its meaning.
BTW: this situation is an analog of the “Non-programmers can’t modify Free Software” problem. The FSF has addressed this: even if you don’t have the skills, you still have the rights, so you can get help from the community or hire someone to make modifications for you.
Yeah, I see it as less “Free software that you can’t customize is not truly Free Software” and more “Free software that you can’t customize is not good software”.
Whether the software is free or not only matters if it’s worth using in the first place.
If what you’re looking for is good or user-friendly or user-customizable software, I’m afraid you’re just going to have to exercise one of the freedoms of open source and fork it. I too want good software, but the idea of Free Software and “truly” (whatever that means) free software has nothing to do with the software being good.
You’re absolutely right. More often than not, that “complex hierarchy” is quite deliberate and serves a business interest rather than the user’s interest. Very relevant piece on this topic by the Rails inventor: They’re rebuilding the Death Star of complexity
I think that would be a different article. This one seems more to be concerned with using “actually-existing free software” than with formal philosophical concerns.
every compiled language uses the same set of concepts when it is compiled…
I don’t believe this is true of Go, which has its own ABI and tool chain that derive from Plan9. Its calling conventions and use of the stack are nonstandard. (This is why Go / C interop is awkward and expensive.)
That said, if you use cgo in a Go package, the C functions/structs it creates do behave normally and can be used with other languages.
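For example, here’s a minimal cgo sketch (the C function is defined inline just for illustration): the C side gets an ordinary C ABI, while each call from Go pays the usual cgo crossing cost.

```go
// Minimal cgo sketch: the C function in the preamble gets a normal C ABI,
// so other languages could link against code like this too, while each
// call from Go pays the cgo transition cost.
package main

/*
int add(int a, int b) { return a + b; }
*/
import "C"

import "fmt"

func main() {
	fmt.Println(C.add(2, 3)) // prints 5
}
```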
No one in the comments yet has mentioned UnifiedPush, which allows users the option to self-host their push notifications if applications support it (hint: your service should consider adding support).
I use my existing, self-hosted XMPP setup (Prosody on the server, Cheogram as the Android client, though I’m reconsidering ejabberd) to deliver push notifications for Molly (Signal), Element (Matrix), and Goguma (IRC), while bugging my employer to add the UnifiedPush plugin to its self-hosted Mattermost.
Sounds like that relies on a background process that keeps an open socket to a notification server. Which is a recipe for major battery drain. There’s a very good reason iOS doesn’t support that.
How do you think Google’s Firebase works? It exists because developers would otherwise all be maintaining their own sockets (or worse, polling), so Google provides a service that unifies the notifications into a single socket. What UnifiedPush does is tell a server where to send notifications instead of assuming Google, and on the client side it lets that socket your phone is always connected to point at a server you own/trust. This matters because the article states that Google, as the assumed only game in town, has access to and can read all of those notifications, with Apple doing the exact same shenanigans while not allowing you to point to another server if you desire.
I mentioned already having an XMPP application on my phone for chat: Cheogram, a fork of Conversations. Since it already has that push-style socket connection to my server anyhow, it’s a perfect candidate for a UnifiedPush-compliant setup, and it was just a configuration flag to enable. If you look at Conversations’ battery/data drain, it’s the best of any chat application you could be running, so sending notifications over it isn’t bad and doesn’t require running an additional server or app I didn’t already have running.
I doubt Firebase notifications work by leaving a TCP socket open. As I said in another comment here, that’s bad for power consumption since it keeps the radio fully powered. If it works like Apple’s push notifications, it uses GSM signaling instead, the same channel used for incoming phone calls and SMS, which demands much less power.
If it works like Apple’s push notifications, it uses GSM signaling instead,
Citation very much needed? As far as I know, Apple push notifications are also just a long-running TLS connection back to Apple servers. Keeping a connection logically open also does not require keeping “the radio fully powered”; mobile networks have various optimizations around that.
The power saving of the centralized push notification services is a) requiring only a single connection and thus minimizing traffic and b) ensuring that it’s implemented properly on both sides and an app or server does not cause unneeded traffic (by accident or intentionally), which might go deeper than the usual networking APIs allow.
Citation is unfortunately out of date — it’s what iOS engineers told me while I was still at Apple in 2007, when I complained that the Messages app only supported SMS and not AOL Instant Messenger (the most popular messaging service, and the one the desktop iChat app used.) And later on I remember hearing from someone at Apple that APNS used the same signaling channel as SMS, but I don’t remember details.
I looked around but I haven’t been able to find a low level description of the APNS architecture. There are statements that APNS uses XMPP, which would make sense, but that doesn’t necessarily mean the device always has an open XMPP socket — it could wait for a cell signaling message to alert it.
I’m sure cell protocols and the way IP is routed over them have changed a lot since 2007. But I really doubt cell modems have become less power-hungry, since they’re now handling much more complex standards at much higher bandwidth. It would still make sense to use a lighter-weight lower-power mode when possible.
But it’s possible that this is nowadays baked into IP-over-cell — the radio might go low-power even while any sockets are open, and get a ping from the cell tower when a socket has new incoming data.
Ah, I think I see what was being asked now about the protocols. Does an open TCP socket automatically mean poor battery? I haven’t noticed any difference in battery, as I thought these applications were designed for the efficiency requirements of mobile devices.
I couldn’t find any info about GSM for FCM either.
IIRC, Probably Dance (and most treatments of division-by-prime hashing) neglects to point out that they are trying to disambiguate short linear sequences of keys { i.e. realize h(a + b·k) != h(a) for k in 1,2,3,.. } which is to say they are trying to exploit a common non-randomness in input key sets - i.e. to have fewer collisions than a random assignment would suggest. Knuth covered this 50 years ago, though. { EDIT: Note this strategy / idea pre-dates concerns about adversarial key sets. This actually dates back to an era when associative tables were called “symbol table algorithms” and key sets were commonly program identifiers like a1, a2, a3 and people would even do things that today seem weird/at least uncommon like limit IDs to <=6 chars and pack 6, 6-bit letters into one 36-bit word and then hash the whole word as a number. }
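A tiny demo of the kind of non-randomness being exploited (my own toy numbers, not Knuth’s): keys in an arithmetic progression pile into a couple of buckets under a power-of-two modulus that shares a factor with the stride, but spread across every bucket under a prime modulus.

```go
// Count distinct buckets hit by keys a, a+b, a+2b, ... under modulus m.
package main

import "fmt"

func buckets(m int) int {
	seen := map[int]bool{}
	for k := 0; k < 64; k++ {
		key := 100 + 8*k // a = 100, b = 8
		seen[key%m] = true
	}
	return len(seen)
}

func main() {
	fmt.Println("mod 16 hits", buckets(16), "of 16 buckets") // 2 of 16
	fmt.Println("mod 17 hits", buckets(17), "of 17 buckets") // 17 of 17
}
```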
A small caveat: fibonacci hashing is great for post-conditioning a possibly-crappy hash function. But if you are using a collision-resistant hash (keyed siphash is a common one) then there’s no need for the extra fibonacci multiplication.
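A minimal sketch of that post-conditioning step, using the usual 2^64/φ constant and keeping the top bits:

```go
// Fibonacci hashing as a post-conditioner: multiply by ~2^64/phi and keep
// the top tableBits bits, so weak low bits in the input hash still end up
// spread across the table.
package main

import "fmt"

const fib64 = 0x9E3779B97F4A7C15 // ~2^64 / golden ratio

func fibIndex(hash uint64, tableBits uint) uint64 {
	return (hash * fib64) >> (64 - tableBits)
}

func main() {
	// Sequential keys land in well-separated buckets of a 1024-entry table.
	for k := uint64(1); k <= 4; k++ {
		fmt.Println(fibIndex(k, 10))
	}
}
```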
No problemo. The Knuth presentation is really the classic. He covers progressions in various character slots not just the last and other subtopics. I don’t know if Addison-Wesley would even let Knuth himself set up a linkable website to TAOCP-Volume3-Section6.4, though. Maybe. Maybe they already have and I just missed it. :-) The hash function development is the first half or so of that section (which was the same section number in my original first editions, IIRC, and I suspect cited by so many academic papers it is unlikely to ever change even if Knuth keeps fiddling with TAOCP until he’s 110).
These days, most will tell you that you should target hash outputs that are “hard” (for some value of that) to distinguish from pseudo-random for any untrusted key sets (which may be “most” key sets), not try to be closer to perfect hashing with data-semi-awareness. If you’re on the more modern, random regimen, as you said neither modulo prime nor multiply-shift is needed – rather your power-of-two sizes with mask/shifting is best. I think this aspect (matters most if trying to do better than random, not if you have a good integer hash) is lost in most multiply-shift vs. divide-by-prime reduction-to-tableSize discussions. Really, of course, what is most optimal all depends on how much you know about your data, as with so very many things.
If you’re on the more modern, random regimen, as you said neither modulo prime nor multiply-shift is needed – rather your power-of-two sizes with mask/shifting is best
NPOT tables can have allocated size much closer to their occupancy, reducing memory/cache pressure.
By “is best” I only meant “is the best recommendation” and that obviously depends upon whom one is recommending to! Few need to look up / explain how to mask..but, there is no absolute answer. I agree it’d have been better for me to say “is a fine recommendation” where audience variability was more implied than “is best”. Sorry.
That said, your comment to me oversells NPOT. VM page rounding can add space / TLB pressure (of the order of per-table page-size rounding error), but over-allocated-yet-never-accessed memory should not add CPU L[123] data cache pressure. Page rounding mostly matters for “many tables less than a few VM pages” which I view as a special, not common case. { You do only say “can” & “reducing” and I doubt any of this is news to you, and I also like Dan Lemire’s work. }
over-allocated-yet-never-accessed memory should not add CPU L[123] data cache pressure
But occupied hash buckets tend to be fairly uniformly distributed, so you should indeed get a lot of sparsely occupied cache lines if occupancy is poor.
FWIW, in interpreting my comments, I had taken your “allocated size” to mean “as distinguished from addressed size” in the sense of “request 32768 bytes and allocator actually uses some 65536 byte area”.
Now that you mention density which makes your meaning more clear, well, there is no perfect density for all workloads - e.g. what is best to scan over elements is not best for unsuccessful lookup queries (at least with open addressing). But sure, NPOT gives you more flexibility in the (also special) case of knowing what load factor you want to target just before when you also know many queries are about to happen against a somewhat temporally stable occupancy. That can be nice, but in terms of importance, many people I know skip even pre-sizing POT tables based on AOT cardinality estimates, just growing tables up from tiny or skip rehashing after many deletes to preserve density. Speaking of, a related special case is if growing by 2X/4X is too fast for a “right sized” table (“right” for whatever operation mix, if even known) you could grow by 1.5X or something with NPOT.
Anyway, NPOT can help in a few special cases and to be clear I’m not advocating against it, but I still think it’s probably quite a bit less important than much else (e.g. being able to efficiently save/load tables/run off of live memory files) which seem neglected in prog.lang. runtime libs.
In other words, Apple and Google confirm that they enable government spying through their voluntary excessive data sharing. Framing it that it’s all government’s doing is dishonest. On another note how is it possible that the American government is always the one that’s mentioned in news like this? It seems like it’s one of the most oppressive governments on the planet.
On another note how is it possible that the American government is always the one that’s mentioned in news like this?
Off the top of my head, two obvious explanations are (1) other governments do similar things but aren’t reported on, making this the preferable option, or (2) you’re not reading news sources that report on other governments because you’re on the English Internet where most news is about America.
I’m pretty sure, if not here then on an oranger website, there’s been recent news about the UK and EU trying to backdoor your chats, and going further back you might find news about the more extreme control measures taken by the CCP or Middle Eastern countries.
The difference is that in EU they try to obtain the legal power to do so before any eavesdropping is organized, and any such proposed law is subject to a lot of scrutiny, and doesn’t pass. In USA, you hear about these incidents after the privacy violation has already happened, and the agencies that do it ignore the law without consequences.
On another note how is it possible that the American government is always the one that’s mentioned in news like this?
This is a bit like that famous study in WW2 on where to put heavier armor on planes. The study was made on the planes that made it back, not the ones that were shot down and lost.
The reason there’s a lot of noise about surveillance in the US is that it’s an almost uniquely open society, with robust protections for journalists and whistleblowers. Also there’s a strong cultural strand of individualism that is leery of arguments about the good of society to infringe perceived rights.
You can be assured that there are plenty of countries that have far more intrusive surveillance regimes. The problem there is, if you try to tell the world about them, you get thrown in jail.
What evidence is there that this is the case, and not simply that this one government is more likely to engage in abuses? Seems like Occam’s razor favors the second explanation. This could be a very long discussion but most of the world does not consider USA’s government to be particularly open, transparent, or to feature protections for whistleblowers or high quality journalism. Example: https://brilliantmaps.com/threat-to-peace/
But the Pentagon and the Justice Department hold a tremendous amount of power. So I’m not trying to suggest this isn’t a complex picture. But there are some tremendous tools and oversight that US citizens enjoy.
Apple and Google confirm that they enable government spying through their voluntary excessive data sharing
I don’t think it’s entirely accurate to describe this as voluntary. It seems to be the case that they were required to disclose this information to the government by law, and required not to disclose their participation in the scheme.
how is it possible that the American government is always the one that’s mentioned in news like this?
The US is unusual in a number of ways that explain this:
the US is very rich and powerful, so it is more likely to attract scrutiny
more English-language content on the internet is US-centric than centered around any other country, so there’s more reporting on US activities
the US is seen as a leader in global governance, and so malfeasance is more surprising/newsworthy
the US has a large population, and is the largest predominantly-English-speaking country by population, so more human stuff happens in the US than many other countries
US governmental institutions are subject to significant oversight requiring public disclosure of information (including FOIA, GAO audits, and congressional oversight)
the US has perhaps the world’s largest national security apparatus, making them most likely to engage in novel surveillance activities
most of the largest technology companies in the world are based in the US, so the companies, many of their employees (including technical staff and leadership), and their servers are subject to US jurisdiction more often than other jurisdictions
maybe for some of the above reasons, you’re personally more interested in the US than other countries
I don’t think it’s entirely accurate to describe this as voluntary
I agree, but only because of the word ‘entirely’. If you build a central funnel that information goes through, then you should expect that someone will use it to spy on your users. The US government is one of the more benign entities that may end up doing this, since they’re at least mostly bound by the rule of law. If the government can compel you to disclose data, organised crime or foreign intelligence agencies can almost certainly exfiltrate it as well.
By “voluntary excessive data sharing” you mean push notifications? That’s a feature mobile apps need, not something “voluntary”. Having apps keep long-lived sockets open, accept incoming connections, or poll a server are all non-starters for mobile devices.
It’s the US government in a lot of these announcements because there is some degree of transparency allowed. When repressive governments do this, and they do, it’s harder to confirm what’s going on and you just hear stuff about “state-sponsored actors”.
How is it possible that mobile apps have existed for years without them, and desktop programs still don’t use them? Don’t you think that most of these could be encrypted end to end, made robust against eavesdropping, or be made completely local?
Weren’t you around in 2014 when it took a dedicated whistleblower whose life is ruined to this day to share just a fraction of these domestic spying programs with the population? Doesn’t look like a lot of transparency is involved to me. The US government is one of those repressive governments in every meaning of the word.
Dude, push notifications have been around on iOS about as long as 3rd party apps have been supported, i.e. around 2008. I’m sure it’s about the same on Android. They are absolutely necessary for any app that needs “server push”, including email apps, social apps, etc.
Desktop apps don’t need this because they have very different power requirements and networking — it’s fine for them to leave sockets open and run in the background.
I’m more than willing to explain the details, if you can indicate you’re not just interested in uninformed flaming…?
It would be feasible to E2E encrypt push notifications, I’m pretty sure. It probably just hasn’t been a priority, I would guess, because it wasn’t well-known they’d be abused this way. “Completely local” makes no sense in this context; the whole point is for a server to notify its client app.
The US government is one of those repressive governments in every meaning of the word.
Spying ≠ repression. Everyone spies. The question is how much the spying results in repressive activities. The US and EU are a lot better in this regard than genuinely repressive countries like Russia, China, Saudi Arabia, Iran, etc.
How is it possible that mobile apps have existed for years without them, and desktop programs still don’t use them?
Desktops are a different domain. They are plugged into the mains. They care about power at a granularity of around 5W. Each app keeping a TCP/IP connection open (or polling periodically) is fine.
Even laptops are different because they’re in one of two states. Either they’re on, in which case the screen is using a lot more power than an extra TCP/IP connection, or they’re in standby and not expected to receive notifications.
Phones (and, to a lesser degree, tablets) spend a lot of their life in a warm standby state, where they’re expected to be able to wake within a second or two in response to an event in the network so that the user can shift their attention to the device and do something. If every app needed a separate process waiting for events, that would flatten the battery a lot. A lot of early Linux phones suffered from this. Instead, a platform service establishes one connection to a remote server and the application core goes to sleep. The baseband processor handles any keep-alive traffic necessary for this and wakes the application cores only when a notification is received. This reduces the power consumption by about an order of magnitude in standby.
Also, phones want to turn off their radios when possible to save power. An app with an open socket forces the radio to remain on. Cellphones have always been able to receive notifications while their cell radios are not at full power; that’s how they get notified of incoming calls and texts. Push notifications are delivered by the same mechanism.
On another note how is it possible that the American government is always the one that’s mentioned in news like this? It seems like it’s one of the most oppressive governments on the planet.
Any realistic, honest answer to this question would be bombed with “off-topic” flags on this website.
It’s partly because the US govt is just really big in absolute terms (~330M people and the world’s biggest GDP), and partly because it’s massively influential overseas due to its geopolitical strategies. We don’t hear about India’s govt much, because their main geopolitical concerns are their land borders with Pakistan and China.
It is not simply because the US is big, nor any of the other nonsense answers given above that were worded to attract the minimum amount of controversy. I’ve already taken a handful of random flags above as it stands, so if you want to talk to me about it any more you should choose some place else to do it!
As an American, I like to keep tabs on my government. We do our best to hire honest hard working employees, but the power seems to go to their head and we have to fire them periodically. Sadly most of them, unlike regular employees, hire these big PR and law firms and desperately try to keep their old jobs. I wonder if we should take away some perks.
You may think that all this negative reporting makes us look weak and flabby on the world stage. I bet Putin and Xi think so. But here’s what’s happening: somewhere in basements in Russia and China, people like you and me are thinking “Now why can’t we dump on our government like that.” Then they call their friend Charlie next door. And that is how democracy is spread.
You can read up on the Five Eyes treaty if it makes you feel better? I seem to recall something about a Dutch or German equivalent to it as well, but can’t find references.
In the actually oppressive countries, this happens all the time but doesn’t get reported on in the news. America is just the largest English-speaking news market.
Doesn’t that sound like every conspiracy theory, that absence of evidence is just further evidence for the theory? Besides, there’s still tons of classified programs in US whose details emerge almost every week, involving torture, domestic spying, illegal eavesdropping, and still more.
In other words, Apple and Google confirm that they enable government spying through their voluntary excessive data sharing.
How do you get this? Apple and Google got given “lawful” orders that required that they comply, and were prohibited from telling anyone that they were subject to these orders. What part of that is voluntary?
Similarly where do you get “excessive data sharing” - how do you think push notifications can possibly work without going through a service that can be subject to government orders?
No company, not even FB, has directly defied an order. What they do is come up with exciting and novel interpretation of general laws, and if it turns out their imagination was wrong they are subject to the penalties in the law (fines, etc). But all you can do in response to a direct order from law enforcement is appeal to a court - which is all secret if the order demands it - and if your appeal fails you functionally have no choice but to comply.
Here you have at least Google and Apple being forced to comply and required to be quiet. I get that people think Apple and Google want to cooperate with this stuff, but we definitionally do not know who else is subject to the same orders. It sounds like Google and Apple could only talk about this because a senator asked them, so could Signal, for example, report that they’re subject to such an order if a senator hasn’t asked them? The implication is that they couldn’t.
Some notes here: with conservative stack scanning, we can hope for better mutator throughput and fewer bugs. The throughput comes from not having to put all live pointers in memory; the compiler can keep them in registers, and avoid managing the HandleScope.
This doesn’t make sense to me. I thought one of the requirements of conservative stack scanning was that you had to dump all the registers during marking. Otherwise you could miss a pointer that is not on the stack but only in a register. However this statement makes it sound like it is the other way around: with conservative stack scanning you don’t have to dump the registers.
They are using a variant of terminology I’m familiar with, but historically conservative GC has meant “we don’t have precise knowledge of what data is a pointer, so if we see 32/64 bits that could be a pointer to a live object, we won’t GC that object.” This can cause leaks because your 32-bit “pointer” might just be an int32 whose value collides with a relevant memory address. This is much less of an issue in 64-bit. This is in contrast with “precise” GC, which knows which sequences of bytes are pointers and which aren’t.
I assume I’m missing something because I don’t think v8 has moved to a conservative GC but maybe they adopted that strategy just for the stack.
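My mental model of the stack part, as a rough sketch (just the concept, not what V8 actually does): every word in a stack snapshot is treated as a possible pointer, and any object whose address range contains such a word is kept, and also can’t be moved, because the word might really be an integer that only looks like an address.

```go
// Conceptual conservative scan: keep any object whose address range
// contains a word found in the stack snapshot, whether or not that word
// is really a pointer.
package main

import "fmt"

type object struct{ start, size uintptr }

func conservativeMark(stackWords []uintptr, heap []object) map[int]bool {
	live := map[int]bool{}
	for _, w := range stackWords {
		for i, o := range heap {
			if w >= o.start && w < o.start+o.size {
				live[i] = true
			}
		}
	}
	return live
}

func main() {
	heap := []object{{start: 0x1000, size: 64}, {start: 0x2000, size: 64}}
	stack := []uintptr{0x1010, 0xdeadbeef, 42} // only the first word hits an object
	fmt.Println(conservativeMark(stack, heap)) // map[0:true]
}
```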
Sounds like the article agrees:
conservative stack scanning means that because you can’t be sure that a word on the stack refers to an object and isn’t just a spicy integer, you can’t move objects that might be the target of a conservative root.
I’m not sure exactly how registers come into play. It sounds like in this case conservative stack scanning means “look for pointerish things on the stack without bothering to check if there’s a register actually pointing to them”?
Well, when the GC is called, all callers’ registers have either been saved to the stack, or they’re callee-preserved registers. Either way the GC can find them.
I’m less sure about other threads. An inactive thread’s registers are all saved to a buffer, but I’m not sure how or if you can find the registers of another active thread (plus, they change constantly…)
This is a clever tradeoff between speed/compactness and human readability.
It would be significantly faster to parse than JSON because the leading byte-counts remove the need to scan strings and numbers byte-by-byte, as well as the need to unquote strings. Plus, if you are scanning it to extract a specific nested value, via a JSONPointer or something, the lengths make it super fast to skip ahead to that point.
(It does complicate encoding though, because the length comes first but when writing an array/object you don’t know its length until you’ve written the following data. I guess I would first try to guess the number of digits, skip that many bytes, write the value, then memmove the value up or down if the guess was wrong before filling in the length.)
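Something like this is what I have in mind, a rough sketch of the guess-then-shift trick (the names here are mine, not the format’s actual API):

```go
// Reserve a guessed number of digits for a decimal length prefix, write
// the body, then shift the body up or down if the guess was wrong.
package main

import (
	"fmt"
	"strconv"
)

func appendWithLengthPrefix(dst []byte, writeBody func([]byte) []byte) []byte {
	const guess = 2 // guess the prefix will be 2 digits
	start := len(dst)
	dst = append(dst, make([]byte, guess)...) // reserve prefix bytes
	dst = writeBody(dst)
	bodyLen := len(dst) - start - guess
	prefix := strconv.Itoa(bodyLen)
	if len(prefix) != guess {
		// Guess was wrong: move the body to fit the real prefix length.
		body := append([]byte(nil), dst[start+guess:]...)
		dst = append(dst[:start+len(prefix)], body...)
	}
	copy(dst[start:], prefix)
	return dst
}

func main() {
	out := appendWithLengthPrefix(nil, func(b []byte) []byte {
		return append(b, "hello, length-prefixed world"...)
	})
	fmt.Printf("%s\n", out) // 28hello, length-prefixed world
}
```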
The only obvious change I’d have made is to encode the byte-counts in hex, which would shave off a few bytes and speed up parsing a little.
Somewhat confusingly, it would seem that conservative stack scanning has garnered the acronym “CSS” inside V8. What does CSS have to do with GC?, I ask. I know the answer but my brain keeps asking the question.
What a stupid acronym, inside a browser component of all things! They should have called it Heuristic Tracing of Memory Links, or Heuristic Tracing of Transient Pointers.
Reminds me of how compilers often talk about a “scheduler” that has nothing to do with the runtime notion of scheduling threads/tasks/etc. Pretty sure I’ve run into other examples of GC terminology overloading acronyms or terms as well, but I can’t recall specifics off hand.
Yes, you’re right. Brain not working first thing in the morning. There is some interesting work unifying instruction scheduling, register allocation, and instruction selection, but it’s not computationally feasible yet.
https://unison-code.github.io/ is the register allocation and instruction scheduling. Instruction selection I’m not sure is feasible beyond single-digit instruction sequences because I think equality saturation is a more promising overall approach to instruction selection, and extraction is already hard.
I wouldn’t be surprised if instruction scheduling would predate the other usage. But I think this is not the same as the CSS example. It’s a common word and both usages are correct in their respective niche. CSS as an abbreviation is somewhat arbitrary, they could have chosen slightly different words to avoid it, but I guess it was deliberate.
Property 1 is tautologically useful. Property 2 is useful for things that want amortised growth costs. Property 3 looks really interesting. I’ve not heard this before. I can’t think of a reason why it would be true in the general case, so I presume it’s some property of common hash functions. Anyone know?
I always make hash table sizes powers of two, to avoid expensive modulo operations. I know that’s problematic if the hash function produces less randomness in its lower bits, and I’ve run into that, but I just switched to a better hash.
I know I’m not the only one … it seems like there are two schools of thought on this.
For NPOT tables, use Lemire’s fast alternative to the modulo reduction, which is slightly faster than mod-with-multiplication and much faster than runtime division. It also requires no additional state. The latency is 3 cycles on high-power hardware, compared with 1 for masking (although you then have to either cache the mask or else add another cycle to your critical path to subtract 1 from the table length).
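For reference, the reduction is just a widening multiply and a shift (a sketch; the slight bias it introduces doesn’t matter for bucket selection):

```go
// Lemire-style multiply-shift reduction: map a 32-bit hash onto [0, n)
// without a division.
package main

import "fmt"

func reduce(hash, n uint32) uint32 {
	return uint32((uint64(hash) * uint64(n)) >> 32)
}

func main() {
	fmt.Println(reduce(0xdeadbeef, 1000)) // a bucket in [0, 1000)
}
```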
I wondered about what that tradeoff looks like now. 32-bit integer division (and modulo) is very cheap on modern hardware (typically <10 cycle latency, at least two in parallel). It’s more expensive than a bitfield extract (single cycle). It’s more painful on embedded systems (on our Ibex it’s 15 cycles and stalls the whole pipeline until it’s done). Fewer hash collisions will probably offset the costs quite quickly, but I’d want to see benchmarks to decide either way.
The compiler will do that, but it’s not that helpful for this use case. The modulus is the size, which will be one of the primes in the list, not a fixed one.
You’d think you want the opposite of Property 3 for performance reasons?
E.g. using a (generalized) Mersenne prime would make modulo operations inexpensive.
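Right, e.g. reducing mod 2^61 - 1 only needs a shift, a mask, an add, and a conditional subtract; a quick sketch:

```go
// x mod (2^61 - 1) via folding: since 2^61 ≡ 1 (mod 2^61 - 1), the high
// bits can simply be added back onto the low bits.
package main

import "fmt"

const mersenne61 = (1 << 61) - 1

func mod61(x uint64) uint64 {
	x = (x >> 61) + (x & mersenne61)
	if x >= mersenne61 {
		x -= mersenne61
	}
	return x
}

func main() {
	x := uint64(0x123456789abcdef0)
	fmt.Println(mod61(x), x%mersenne61) // both print the same residue
}
```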
Another one that seems to totally fail to acknowledge the lack of Memory Safe tooling adapted to the industry.
Yes Rust and its ecosystem solved part of this problem, and we should all be thankful for the work of everyone in there.
But maybe instead of spouting nonsense, the US government (and others) should start looking at why the tooling is not adapted and what they could do to help fix it.
But nah. Better yell at the industry to rewrite everything. That is definitely why they have chosen unsafe tools. Because we did not yell at them enough. I mean this strategy worked so well for the past 30 years. Better do more of it.
Pages 4-8 talk about mitigations - training, coverage, secure coding guidelines, fuzzing, static and dynamic analysis, safer C++ subsets, non-executable memory, control flow integrity, ASLR, … CHERI, and more
Then page 9 starts talking about Memory safe languages. They don’t mention Rust specifically, as far as I see. They mention that many memory safe languages have garbage collectors, which can have an adverse impact on performance.
It seems pretty reasonable to me
I think you may have read something that’s not there. Maybe you have been reading too much from Rust advocates :) I don’t see any signs that this is “RIIR” advocacy
Appendix of memory safe languages: C#, Go, Java, Python, Rust, Swift
My career has largely been in Python, with Ruby and ECMAScript as minor highlights. The tooling exists.
The underlying issue is machismo. How do we interrupt machismo? This is a question for psychiatrists, I suppose. Yelling doesn’t work; embarrassing them with security vulnerabilities doesn’t work; providing easy paths toward memory-safety doesn’t work. The inherent difficulty of correctly writing code in a memory-unsafe environment attracts people who enjoy the delusion of being more meritorious than their peers, and perhaps we need to break their worldviews entirely.
I mean… Rust worked. Maybe we should look at why it did when others failed? Personal hypothesis that has some support: they are the only ones that actually cared about UX
And all these memory safe languages you mentioned are already considered industry standard. These are not the targets of this kind of reports.
Another one that seems to totally fail to acknowledge the lack of Memory Safe tooling adapted to the industry.
Is the industry incapable of creating such tooling? And can you clarify what “the industry” is, because a lot of people use memory safe languages today already, in many different contexts.
I mean. In theory we are not incapable. We finally got Rust.
But it took us… 30 years, a lot of luck and a lot of people’s sanity. There is probably a better way possible.
Our funding model for the work necessary to bring research results into industry-adoptable tools (aka engineering them) is mostly… “some people’s hobby time”. That is… slightly problematic.
Which tools do you think are lacking? Static analyzers, address sanitizers and UB sanitizers are widely available and have been a huge help in improving the safety of C/C++.
C++ of course has safe constructs like unique_ptr and shared_ptr, and there’s ongoing work in Clang to create a “safe mode” where pointer arithmetic is illegal and subscripting operators of vector/array/span are bounds-checked. (Sorry I don’t have references; I just ran across this the other day and didn’t bookmark it.)
They are a huge help, but the data shows that this help is not enough to massively move the needle. I am not saying they are useless. They are really useful.
But when we get this kind of impact as in Android, which was a codebase already shipping and using these tools, there is a real discussion we should have on how we, as a whole industry, only got there when a browser company self-selected through internal politics to do it, and then fired everyone once it reached an industry-adoptable level. https://security.googleblog.com/2022/12/memory-safe-languages-in-android-13.html
I know you specifically aren’t interested in it per the README… but I would be interested in a JS implementation of this, preferably via wasm. A Python implementation for pandas too. JSON generation is very slow from pandas. At this point I’m probably going to move to parquet-wasm, but a drop-in replacement for JSON would be very attractive.
Hi! Main author of Fleece here. ( I think I may have posted it to lobste.rs long ago when I first joined.)
BTW, Cap’n Proto serialization and FlexBuffers both have similarities to Fleece. I didn’t know about either of them when I first designed Fleece in 2015. And now of course there’s the newly-announced JSON-b in SQLite.
I think it’s fairly interesting. It looks simpler (or just better specified?) than the wire format of capnproto, flexbuffers, etc. which have a strong vibe of “just use the C++ code and trust us”. In contrast this seems like it’s realistic to reimplement it. Is there any sort of community of users around it?
Mostly just Couchbase Lite and developers using its C API (the other language bindings don’t expose Fleece directly.)
There are a few FFI bindings that I wrote as part of bindings to Couchbase Lite — Python, Rust and Nim. IIRC the repos they’re in are in the GitHub couchbaselabs domain.
Since you put so much effort into Foundation compat, it would be nice to see a comparison against Apple’s binary plist format. Looks like it should be faster, especially for use cases that don’t consume everything in the input immediately.
I’d expect the results to be similar, though slightly better, than the Foundation JSON parsing results. Most of the overhead tends to be in allocating the object tree, not in parsing the input.
Can you give an example what the append feature is good for? I understand the concept and how it’s applied for example in SQLite’s WAL but I’m interested in the details of your use case here.
It enables a transparent form of delta compression:
Read data into memory and get pointer to root value.
Make changes using the mutable API
Write the changes by appending to the original data
Save the appended data
The appended data is effectively a delta. You can save it without having to rewrite the original data. You preserve the previous version, making it easy to revert to. (This is similar to what a VCS does.) You can also send the appended data to someone who already has the previous version, saving bandwidth.
It’s not as efficient as dedicated binary-delta formats like xdelta/zdelta, but it’s simple and has effectively zero decoding time.
This is very much like the way append-only data stores like CouchDB (and LMDB, kind of) work; only they work at the level of pages instead of objects.
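If it helps, here’s the idea boiled down to plain byte slices (a conceptual sketch, not Fleece’s actual API): the appended bytes are the delta, and reverting is just truncating back to the old length.

```go
// Conceptual append-as-delta: the base encoding stays immutable, edits are
// appended after it, and everything past the original length is the delta
// you can ship separately or drop to revert.
package main

import "fmt"

func main() {
	base := []byte(`{"name":"fleece","version":1}`) // stand-in for the encoded base document
	doc := append([]byte(nil), base...)

	delta := []byte(`{"version":2}`) // stand-in for appended, updated values
	doc = append(doc, delta...)

	fmt.Println("delta to ship:", string(doc[len(base):])) // just the appended bytes
	doc = doc[:len(base)]                                  // revert = truncate to the old length
	fmt.Println("reverted:", string(doc))
}
```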
I often do something like this by structuring a PR as multiple commits that each do one thing. Then reviewers can look at each commit separately in GitHub if they want. This often takes some tricky interactive rebasing to set up, but I think it’s worth it.
I dunno about doing this as multiple tiny PRs. Presumably each intermediate “stacked” PR still has to pass CI tests. That could add a lot of extra overhead.
GitHub’s UI hates you. Review comments that people put on individual commits don’t show up in the conversation tab and go away if you edit one of those commits and rebase.
There’s value in running CI on every commit though. You get an automatic signal on where in your changes things went wrong. “Oh, I broke this test when I had this 10-line change to the button text” vs “my 900-line PR broke 4 tests for some reason”
Obviously if you can’t afford to run CI in parallel for commits or if your CI tests take 9 hours then this could be a big problem for you. But I don’t think that’s most people.
There’s definitely value in that at least for bisecting but you don’t need 9-hour CI runs for per-commit tests to become a problem. Just one hour of build & tests, not unheard of in the C++ world at least, and 100 developers spamming PRs makes it already quite expensive.
I often do something like this by structuring a PR as multiple commits that each do one thing. Then reviewers can look at each commit separately in GitHub if they want.
This has multiple issues though, notably:
GitHub’s support for commit reviews is woeful: it’s hard to comment on a PR’s commit, and comments get lost even more easily
you can’t easily merge the head commits of such a PR if they’re accepted but the tail is debated; you need to fork them off to a separate PR, integrate that, then rebase the old PR, dropping the now-redundant commits
The draw of stacked PRs, from what I understand, is that it integrates all that as first-class concepts, if you have a stack of PRs (~commits) you can comment and review on each piece individually, and merge them separately (as long as you respect the dependency chain obviously).
My work project, Couchbase Lite, has been doing something like this since 2018. I came up with a binary JSON-equivalent data format called Fleece that doesn’t require parsing, and all the JSON data is stored in SQLite as Fleece blobs. (This is invisible to the API, just an internal representation.)
By the way, JSONB says it’s from Postgres, and from a quick check the earliest version that has it available is 9.4, which was released on December 18, 2014. Looks like they never bothered to promote or standardise it much?
Automatic ref-counting is a type of garbage collection, as also implemented in eg Python, no?
(I agree there’s a bit more to the picture in some of your examples - copy-on-write (or mutable value semantics) interacting with the refcount automatically, static refcount inference or ellision, etc. Some of these approaches are really interesting!)
Technically, yes. But it’s so much simpler than other GC algorithms that it’s often considered a separate thing. You can implement it in a page of code without requiring a special heap, it can be applied to one type of object without affecting others, etc.
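Something like this is the whole mechanism (a toy sketch; Go itself is traced-GC’d, so this is only to show how little code plain refcounting needs):

```go
// Manual reference counting in miniature: one counter per object,
// retain/release, free when the count hits zero.
package main

import "fmt"

type refCounted struct {
	count int
	data  string
}

func retain(o *refCounted) { o.count++ }

func release(o *refCounted) {
	o.count--
	if o.count == 0 {
		fmt.Println("freeing", o.data) // in C this is where free(o) would go
	}
}

func main() {
	o := &refCounted{count: 1, data: "hello"}
	retain(o)  // a second owner appears
	release(o) // first owner done, count back to 1
	release(o) // last owner done -> "freeing hello"
}
```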
Yeah, “technically,” memory management has two top-level taxons: automatic and manual. Garbage Collection is technically a synonym for automatic memory management. In that sense, Rust is garbage collected.
However, the industry parlance is to use “garbage collected” to mean tracing garbage collection, which is what we think of when we talk about Java, JS, Go, and the related concepts of mark-and-sweep, stop the world, and generational gc.
If you go back to the origin of C, stack variables were called “auto” variables because they were automatically laid out and destroyed with the stack frames, as opposed to the static variables that the programmer was responsible for managing throughout the program lifetime.
Automatic ref counting is considered one of the simplest garbage collection mechanisms and is used at least as a component of many more complicated implementations. I fail to see how that wouldn’t fall into the first bullet there.
If someone calls something “garbage collection”, I’ll assume it collects cycles but has at least some stop-the-world style runtime overhead; if they call it “reference counting”, it probably has the opposite trade-off. I’d give someone a very funny look if they told me Rust has a garbage collector, and I don’t think I’m alone in that. If a garbage-collected language uses reference counting as part of its GC implementation, that’s an implementation detail that (generally) could be replaced without changing the semantics.
That may not be strictly correct under academic definitions of the term “garbage collection”, but that’s the way I almost always see it used in practice.
(It would be possible for a garbage-collected language to guarantee reference-counting-style deterministic collection in the absence of cycles. I think Python might do this? But that’s neither here nor there - the point is they’re distinct enough features that it’s not necessarily correct to assume that one term encompasses the other.)
Admittedly this is an old paper, but it was highly influential on me. It was the first time I understood object capability security: it’s just lexical scope and argument passing! Powerful stuff.
I was a bit surprised that the article doesn’t highlight the term “capability”, since what it describes is, to me, a capability system. The author does say:
The kernel is similar to those of the capability-based operating systems of the 1970’s [15, 34]. The main differences are that this kernel is simpler, more abstract, and more clearly connected with programming language concepts than are classical capability systems.
I’m guessing that the definition of a capability system has broadened since this was published, maybe in response to it?
Somewhat, yes. It’d be more accurate to describe the current capability-theory ecosystem as pluralist, compatibilist, or disconnected; there are multiple definitions which come from different foundations, and no single unifying picture has yet emerged.
For example, I might say that a capability is an element of a Cartesian closed category. While daunting, this really just means that a capability is a value in a programming environment which behaves like a simply-typed lambda calculus; it can be forgotten or dropped, duplicated or copied, and passed to subroutines like any other parameter-filling argument. From this perspective, the surprising part of capability theory is that actors have Cartesian-closed/lambda-calculus-like behavior and certain references to actors can be used as values; category theory doesn’t come pre-equipped with a notion of actors.
For example, I might say that a capability is an element of a Cartesian closed category.
Doesn’t that abstract away all of the interesting properties that make something a capability? Which is more to do with the environment than the values themselves, i.e. you aren’t given ambient authority.
Only if your programming environment isn’t also built out of similar categorical constructions. If your programming environment is simply-typed lambda calculus, then it’s not possible to violate the boundary of a capability. Really, the urge of programmers to add new impurities is the problem.
For the systems that I or @dustyweb have built, we are fortunate that stretching out lambda-applications in time and space does not break this abstraction. As a result, we can have message-passing semantics, and we also can have coarse-grained actors (individual machines) communicating across a network. That’s all quite free. What’s not free is the environment running on an individual machine and providing fine-grained actors (private pages of memory); this only occurs when programmers uphold the purity of lambda calculus.
See also previously, on Lobsters, although the discussion did not really go in this direction.
If your programming environment is simply-typed lambda calculus, then it’s not possible to violate the boundary of a capability. Really, the urge of programmers to add new impurities is the problem.
OK, but purity isn’t enough. It doesn’t capture what makes something a capability rather than just a closure.
For instance, where did the capability come from? If your system has some ambient authority that allows any component to get another equivalent capability, then you don’t have capability security. This isn’t a property of the capability itself.
And capabilities have to be able to cause some kind of side-effect, otherwise there isn’t any security boundary to protect. (Well, apart from cryptographic secrets, but they aren’t much use in the absence of IO.)
So you need to add a lot more to your mathematical abstractions to get something that has the flavour of a capability, in particular you need something that describes how side effects work, and something explaining how ambient authority is prevented.
Purity prevents ambient authority. This is one of the main ideas behind the E lineage. From Mark Miller’s thesis, it is well-established that a lambda calculus only has three ways to introduce a capability, and E (and its derivatives) all use purity to forbid other introductions.
There are several ways to conceive of side effects. We don’t really need them, though; it’s enough to have coarse-grained actors evolving in time as they pass messages to each other over a network. I’m going to handwave this by appeal to folklore, as it’s Turing-complete whether a network of actors will halt, and so there isn’t any additional computational power that we could physically add; side effects don’t do anything extra.
That isn’t true, though. Ambient authority is a property of the APIs that are provided to an object / actor / capability when it is created. For example, trad POSIX filesystem APIs give you ambient authority, but if you restrict it to the *at() family and remove .. parent links, then it can be capability-secure.
Even with just the lambda calculus, you need to restrict the free variables in a closure so that it does not capture more capabilities than it should. Yes, you can do so by inspection, but you can make inspection trivial if you extend the lambda calculus with a form of functional abstraction that syntactically prevents free variables (i.e. a combinator). That’s the kind of nuance you lose by saying a capability is just a closure.
coarse-grained actors evolving in time as they pass messages to each other over a network
These are side effects! Actors are stateful, they cannot be pure.
And bah, whenever I see an argument that relies on reducing a system to Turing-equivalence, I know it isn’t saying anything interesting. What makes capabilities worthwhile is how they restrict what a program can do, so saying it can do anything is not helpful.
Main effects aren’t side effects. A monad for interpreting E expands to the traditional reader+writer+state monad along with a mailbox for outgoing messages; these are the intended effects of E expressions. Each actor maintains a pure view of computation, and the only way to see it as effectful is to stand next to the network and observe multiple computers working together.
The pattern of restricting free variables is called “attenuation of capabilities”, and it can be performed automatically by compilers. For example, my Monte compiler automatically computes the closures for objects and only closes over scoped names when they are actually used by the object. I did explore implementing this with combinators, but it was not efficient or ergonomic. (Not to say that Monte is ergonomic, of course!)
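A toy closure-based attenuation, just to illustrate the idea (my sketch, not Monte’s actual mechanism): the attenuated capability simply doesn’t close over the write operation, so nothing downstream can reach it.

```go
// A read/write "file" capability is narrowed to a read-only one by handing
// out a closure that only closes over the read operation.
package main

import "fmt"

type file struct{ contents string }

func (f *file) read() string   { return f.contents }
func (f *file) write(s string) { f.contents = s }

// attenuate returns the attenuated capability: write is not reachable
// through it, so no extra authority leaks to whoever holds it.
func attenuate(f *file) func() string {
	return func() string { return f.read() }
}

func main() {
	full := &file{contents: "secret plans"}
	readOnly := attenuate(full)
	fmt.Println(readOnly()) // a holder of readOnly can read...
	// ...but has no reference through which to call full.write.
}
```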
Ok you are starting to talk about the stuff that is actually relevant to capabilities.
I don’t get how you can have a stable reference to a stateful actor and maintain purity; ie, an actor can’t be simply a closure. When I say “side effect” I mean something that changes state without changing references to that state, eg sending a message to an actor, or launching the missiles. It might be an intended effect, but when talking about pure computations the result — the value computed — is the point, and anything else is a side effect.
For monadic effects, the reason monads work is that they thread state through the computation in a linear (or affine) manner, so that the implementation can update in place instead of creating an altered copy. But still, there isn’t a stable reference to (say) the IO monad’s state of the world: each IO call consumes the old state and produces a replacement (or not, if the missiles were really destructive). So monads do not model capabilities.
When I started my career I wanted to help the most people. Instead of going to grad school and tinkering on something a few hundred people might use, I went to Apple. (I could have gone to Microsoft, but I have standards.)
Focusing on rough-edged homemade tools for other hackers, and calling that better than the stuff the other 8 billion people can use, seems a bit myopic IMO.
Claiming 8 billion people can use any product is a stretch, especially ones that cost thousands of dollars and need a credit card, and don’t work without internet access and iTunes, that delete accessibility tools every other update, and that give root on your device to governments if they ask nastily enough.
I love my Apple products! Thank you for helping make them.
I didn’t mean to belittle you or your employer. I don’t think wigwams are “better” than battleships. I just think wigwams are fun and I want people to work together.
I think you misunderstand how language works. It isn’t invented in labs by committees; someone makes a word and other people start using it. Sometimes they’re not from the same culture. Most languages are full of loan-words. There is almost no language on Earth English hasn’t “appropriated” from, including a number of colonizer languages like Norse and French (and Germanic if you go far back enough.) Linguistic diversity and creativity is awesome.
If you want people to stop using indigenous North American words, you’re going to have to rename thousands of cities and rivers, mountains, etc. in the northeastern US, BTW.
someone makes a word and other people start using it. Sometimes they’re not from the same culture. Most languages are full of loan-words.
The context is important. English speakers (among others) have spent centuries brutalizing Indigenous people and suppressing their cultures. For an English speaker to help themselves to an Indigenous word is interpreted by many as piling injustice on top of injustice; adding insult to injury. English borrowing a word from e.g. French carries a completely different valence. The situations are not the same.
(I don’t know what culture the OP’s author is part of. I’m trying to explain why this kind of word borrowing might be questionable in general; I’m not necessarily asserting that it is in this case.)
For an English speaker to help themselves to an Indigenous word is interpreted by many as piling injustice on top of injustice
I understand. And I think they’re wrong. Using another language’s word doesn’t take anything away, it only adds to that language’s reach. The real danger languages face is that of dying out, not of spreading.
Using a French word, or naming a place after a word, is somewhat different from a use that almost completely erases/overrides the meaning of a word from a set of cultures that are already having issues propagating knowledge of their languages.
The word wigwam has been in common usage in American English since the 1600s. There are many, many, many other English words that come from Native languages, a good number of which have diverged from their origins.
I agree that the usage of “wigwam” here in contrast with “battleship” is weird and possibly a misunderstanding of the word, but I don’t see how it’s offensive, or how it “erases” anything.
“Erases” in the context of their usage. None of their descriptions or usage relates to what a wigwam is; the prior meaning of wigwam is practically unused.
To dramatize, he has taken the meaning, gone over it with correction fluid, written in a new meaning, and through this project, is trying to have others do the same.
“What is a wigwam?” “It’s a kind of Native American hut”, a meaning that has potential to lead to more discussion and learning and awareness of a set of cultures.
“What is a wigwam?” “A wigwam is a reasonably-established project that provides alternatives to bloat/cruft.” 🥹❓ It’s so devoid of its original meaning that someone exposed to only this form might think it’s an original word.
take (something) for one’s own use, typically without the owner’s permission.
For cultural appropriation specifically:
the unacknowledged or inappropriate adoption of the customs, practices, ideas, etc. of one people or society by members of another and typically more dominant people or society.
Yes, and how exactly does using a word in a simile take it away from the owner?
Appropriation would e.g. be proposing to, from now on, use the word permanently to denote something different from what it originally was. To cause a change such that the original owner could no longer effectively use the word with its original meaning.
This looks a lot like a reinvention of PouchDB, using SQLite in a Rust wrapper compiled to WASM instead of the browser’s IndexedDB. (I’m not sure where its SQLite persists its data to?)
Well… there are a few problems with adding BLAKE3, or SHAKE, or SHA-3, or any other hash algorithm, regardless of its merits:
Adding it without removing BLAKE2 would just leave me with a redundant hash, and I try my best to have one primitive to rule them all.
Adding it and removing BLAKE2 would forgo the ability to implement Argon2 — I’d end up with a variant instead.
Now even though it’s in the optional folder, I do have SHA-512 as well. I even went as far as implementing HMAC-SHA-512. But I have reasons:
SHA-512 is necessary to implement Ed25519, which I can use to compare Monocypher with (i) other libraries, and (ii) specially crafted test vectors (Wycheproof). This is a very important contributor to Monocypher’s correctness, so I kind of need it.
SHA-512 is necessary to implement Ed25519, which is the EdDSA variant everyone uses. I’m under no illusion that more Monocypher users use Ed25519 than EdDSA-BLAKE2b. Because even though it wasn’t my primary goal, compatibility is still valuable for many people, and failing to provide that compatibility must come with a good reason (such as “AES is redundant with ChaCha20”).
If users use SHA-512 and would like to avoid using several hashes (to reduce code size or attack surface), it makes sense that they’d avoid BLAKE2b as well. But BLAKE2b comes with a very useful keyed mode, and the canonical SHA-2 version of keyed mode is HMAC.
If I do add implement another hash, it will likely be in a separate project. Even BLAKE2s, despite it being a better fit than BLAKE2b on embedded targets (where Monocypher is most popular).
This is just so fundamentally confused about what rights and obligations come with the freedoms, and why they’re important.
But the point seems valid to me — what good is the freedom to modify the software on your computer if doing so is brain-fryingly difficult?
The underlying problem is that developing for a very, very deep and complex hierarchy of components is hard. It usually requires big teams with well-organized development and testing processes. (I’ve worked on such a team.) When an individual opens up some component in the middle, like a UI framework, and starts making changes, it’s pretty likely stuff will break, or at least the resulting build & integration work will be challenging.
I still recognize the value of such freedom, but I think many descriptions are being disingenuous in how they describe it as being the same as tinkering with a bike or a guitar, which are approachable to nearly anyone.
The legal freedom is the required element, but it’s not sufficient alone. Whether any modification is easy is somewhat orthogonal to the software license (I bet Microsoft couldn’t easily modify keybindings either across eleven different bug-compatible UI toolkits and the Office Multiverse).
I get that difficulty of actually modifying software is a problem, but Free Software is a specific term, so reframing it to encompass usability of patching libraries across a whole desktop app ecosystem is stretching it beyond its meaning.
BTW: this situation is an analog of the “Non-programmers can’t modify Free Software” problem. FSF has addressed this, that even if you don’t have the skills, you still have the rights, so you can get help from the community or hire someone to make modifications for you.
Yeah, I see it as less “Free software that you can’t customize is not truly Free Software” and more “Free software that you can’t customize is not good software”.
Whether the software is free or not only matters if it’s worth using in the first place.
I think AutoHotkey can do it to some extent. Example: https://github.com/rcmdnk/vim_ahk.
But this is not built into Windows.
I agree with the rest of the comment.
If what you’re looking for is good or user-friendly or user-customizable software, I’m afraid you’re just going to have to exercise one of the freedoms of open source and fork it. I too want good software, but the idea of Free Software and “truly” (whatever that means) free software has nothing to do with the software being good.
You’re absolutely right. More often than not, that “complex hierarchy” is quite deliberate and serves a business interest rather than user’s interest. Very relevant piece on this topic by the Rails inventor: They’re rebuilding the Death Star of complexity
I think that would be a different article. This one seems more to be concerned with using “actually-existing free software” than with formal philosophical concerns.
I don’t believe this is true of Go, which has its own ABI and tool chain that derive from Plan9. Its calling conventions and use of the stack are nonstandard. (This is why Go / C interop is awkward and expensive.)
That said, if you use cgo in a Go package, the C functions/structs it creates do behave normally and can be used with other languages.
No one in the comments yet has mentioned UnifiedPush, which allows users the option to self-host their push notifications if applications support it (hint: your service should consider adding support).
I use my existing, self-hosted XMPP setup, Prosody+Cheogram (server+client (tho reconsidering ejabberd)), for Android for legacy chat apps Molly (Signal) & Element (Matrix) & Goguma (IRC) while bugging my employer to add the UnifiedPush plugin to its self-hosted Mattermost.
Sounds like that relies on a background process that keeps an open socket to a notification server. Which is a recipe for major battery drain. There’s a very good reason iOS doesn’t support that.
How do you think Google’s Firebase works? It exists because developers would otherwise all be maintaining their own sockets, or worse, polling; instead Google runs a service that unifies the notifications into a single socket. What UnifiedPush does is tell the application server where to send notifications instead of assuming Google, and on the client side it lets that one socket your phone is always connected to point at a server you own/trust. This matters because the article is stating that Google, as the assumed only game in town, has access to and can read all of those notifications, with Apple doing the exact same shenanigans while not allowing you to point to another server if you desire.
I mentioned already having an XMPP application on my phone for chat: Cheogram, a fork of Conversations. Since it already keeps that push-style socket connection to my server anyhow, it’s a perfect candidate for UnifiedPush, and it was just a configuration flag to enable. If we look at Conversations’ battery/data drain, it’s the best of any chat application you could be running, so sending notifications over it isn’t bad and doesn’t require running an additional server or app I didn’t already have running.
I doubt Firebase notifications work by leaving a TCP socket open. As I said in another comment here, that’s bad for power consumption since it keeps the radio fully powered. If it works like Apple’s push notifications, it uses GSM signaling instead, the same channel used for incoming phone calls and SMS, which demands much less power.
Citation very much needed? As far as I know, Apple push notifications also are just a long-running TLS connection back to Apple servers. Keeping a connection logically open also does not require to keep “the radio fully powered”, mobile networks have various optimizations around that.
The power saving of the centralized push notification services is a) requiring only a single connection and thus minimizing traffic and b) ensuring that it’s implemented properly on both sides and an app or server does not cause unneeded traffic (by accident or intentionally), which might go deeper than the usual networking APIs allow.
Citation is unfortunately out of date — it’s what iOS engineers told me while I was still at Apple in 2007, when I complained that the Messages app only supported SMS and not AOL Instant Messenger (the most popular messaging service, and the one the desktop iChat app used.) And later on I remember hearing from someone at Apple that APNS used the same signaling channel as SMS, but I don’t remember details.
I looked around but I haven’t been able to find a low level description of the APNS architecture. There are statements that APNS uses XMPP, which would make sense, but that doesn’t necessarily mean the device always has an open XMPP socket — it could wait for a cell signaling message to alert it.
I’m sure cell protocols and the way IP is routed over them have changed a lot since 2007. But I really doubt cell modems have become less power-hungry, since they’re now handling much more complex standards at much higher bandwidth. It would still make sense to use a lighter-weight, lower-power mode when possible.
But it’s possible that this is nowadays baked into IP-over-cell — the radio might go low-power even while any sockets are open, and get a ping from the cell tower when a socket has new incoming data.
Ah, I think I see what was being asked now about the protocols. Does an open TCP socket automatically mean poor battery? I haven’t noticed any difference in battery, as I thought these applications were designed for the efficiency requirements of mobile devices.
I couldn’t find any info about GSM for FCM either.
While @snej is 90% of the way there already in a comment on just using masking, Knuth v3 has always had a very nice treatment on this since the 1970s.
https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/ has a more web-friendly analysis.
IIRC, Probably Dance (and most treatments of divide-by-prime hashing) neglects to point out that they are trying to disambiguate short linear sequences of keys { i.e. ensure h(a + b·k) for k in 1, 2, 3, … doesn’t collide with h(a) }, which is to say they are trying to exploit a common non-randomness in input key sets, i.e. to have fewer collisions than a random assignment would suggest. Knuth covered this 50 years ago, though. { EDIT: Note this strategy/idea pre-dates concerns about adversarial key sets. It dates back to an era when associative tables were called “symbol table algorithms” and key sets were commonly program identifiers like a1, a2, a3, and people would even do things that today seem weird, or at least uncommon, like limiting identifiers to <= 6 chars and packing six 6-bit letters into one 36-bit word, then hashing the whole word as a number. }
I remember the Fibonacci hashing article now, but somehow I’d forgotten about it. Thanks for bringing it to mind.
A small caveat: fibonacci hashing is great for post-conditioning a possibly-crappy hash function. But if you are using a collision-resistant hash (keyed siphash is a common one) then there’s no need for the extra fibonacci multiplication.
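For anyone skimming the thread, here is roughly what the two regimes look like side by side (an untested sketch; the multiplier is the usual 2^64/phi constant, and the power-of-two assumptions are spelled out in the comments):

```c
#include <stddef.h>
#include <stdint.h>

/* Plain masking: fine when the hash output is already well mixed
   (e.g. keyed SipHash) and the table size is a power of two. */
static inline size_t index_masked(uint64_t h, size_t table_size_pow2) {
    return (size_t)(h & (table_size_pow2 - 1));
}

/* Fibonacci hashing: post-condition a possibly-crappy hash by
   multiplying with 2^64/phi and taking the top bits, so short linear
   key sequences (a+1, a+2, ...) get spread apart.
   Assumes 1 <= log2_table_size <= 63. */
static inline size_t index_fibonacci(uint64_t h, unsigned log2_table_size) {
    return (size_t)((h * 0x9E3779B97F4A7C15ull) >> (64 - log2_table_size));
}
```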
No problemo. The Knuth presentation is really the classic. He covers progressions in various character slots not just the last and other subtopics. I don’t know if Addison-Wesley would even let Knuth himself set up a linkable website to TAOCP-Volume3-Section6.4, though. Maybe. Maybe they already have and I just missed it. :-) The hash function development is the first half or so of that section (which was the same section number in my original first editions, IIRC, and I suspect cited by so many academic papers it is unlikely to ever change even if Knuth keeps fiddling with TAOCP until he’s 110).
These days, most will tell you that you should target “hard” (for some value of that) to distinguish from pseudo-random hash outputs for any untrusted key sets (which may be “most” key sets), not try to be closer to perfect hashing with data-semi-awareness. If you’re on the more modern, random regimen, as you said neither modulo prime nor multiply-shift is needed – rather your power-of-two sizes with mask/shifting is best. I think this aspect (matters most if trying to do better than random, not if you have a good integer hash) is lost in most multiply-shift vs. divide-by-prime reduction-to-tableSize discussions. Really, of course, what is most optimal all depends how much you know about your data, as with so very many things.
NPOT tables can have allocated size much closer to their occupancy, reducing memory/cache pressure.
By “is best” I only meant “is the best recommendation” and that obviously depends upon whom one is recommending to! Few need to look up / explain how to mask..but, there is no absolute answer. I agree it’d have been better for me to say “is a fine recommendation” where audience variability was more implied than “is best”. Sorry.
That said, your comment to me oversells NPOT. VM page rounding can add space / TLB pressure (of the order of per-table page-size rounding error), but over-allocated-yet-never-accessed memory should not add CPU L[123] data cache pressure. Page rounding mostly matters for “many tables less than a few VM pages” which I view as a special, not common case. { You do only say “can” & “reducing” and I doubt any of this is news to you, and I also like Dan Lemire’s work. }
But occupied hash buckets tend to be fairly uniformly distributed, so you should indeed get a lot of sparsely occupied cache lines if occupancy is poor.
FWIW, in interpreting my comments, I had taken your “allocated size” to mean “as distinguished from addressed size” in the sense of “request 32768 bytes and allocator actually uses some 65536 byte area”.
Now that you mention density which makes your meaning more clear, well, there is no perfect density for all workloads - e.g. what is best to scan over elements is not best for unsuccessful lookup queries (at least with open addressing). But sure, NPOT gives you more flexibility in the (also special) case of knowing what load factor you want to target just before when you also know many queries are about to happen against a somewhat temporally stable occupancy. That can be nice, but in terms of importance, many people I know skip even pre-sizing POT tables based on AOT cardinality estimates, just growing tables up from tiny or skip rehashing after many deletes to preserve density. Speaking of, a related special case is if growing by 2X/4X is too fast for a “right sized” table (“right” for whatever operation mix, if even known) you could grow by 1.5X or something with NPOT.
Anyway, NPOT can help in a few special cases and to be clear I’m not advocating against it, but I still think it’s probably quite a bit less important than much else (e.g. being able to efficiently save/load tables/run off of live memory files) which seem neglected in prog.lang. runtime libs.
In other words, Apple and Google confirm that they enable government spying through their voluntary, excessive data sharing. Framing it as entirely the government’s doing is dishonest. On another note, how is it possible that the American government is always the one mentioned in news like this? It seems like one of the most oppressive governments on the planet.
Off the top of my head, two obvious explanations are (1) other governments do similar things but aren’t reported on, making this the preferable option, or (2) you’re not reading news sources that report on other governments because you’re on the English Internet where most news is about America.
I’m pretty sure, if not here then on an oranger website, there’s been recent news about the UK and EU trying to backdoor your chats, and going further back you might find news about the more extreme control measures taken by the CCP or Middle Eastern countries.
The difference is that in the EU they try to obtain the legal power to do so before any eavesdropping is organized, and any such proposed law is subject to a lot of scrutiny, and doesn’t pass. In the USA, you hear about these incidents after the privacy violation has already happened, and the agencies that do it ignore the law without consequences.
This is a bit like that famous study in WW2 on where to put heavier armor on planes. The study was made on the planes that made it back, not the ones that were shot down and lost.
The reason there’s a lot of noise about surveillance in the US is that it’s an almost uniquely open society, with robust protections for journalists and whistleblowers. Also, there’s a strong cultural strand of individualism that is leery of arguments that invoke the good of society to infringe perceived rights.
You can be assured that there are plenty of countries that have far more intrusive surveillance regimes. The problem there is, if you try to tell the world about them, you get thrown in jail.
What evidence is there that this is the case, and not simply that this one government is more likely to engage in abuses? Seems like Occam’s razor favors the second explanation. This could be a very long discussion but most of the world does not consider USA’s government to be particularly open, transparent, or to feature protections for whistleblowers or high quality journalism. Example: https://brilliantmaps.com/threat-to-peace/
Here’s one glaring bit of evidence: https://www.foia.gov/
It’s older and more battle-tested than most similar acts in other countries.
There are lots of individual examples, such as CIA crackdowns https://en.wikipedia.org/wiki/Church_Committee and NSA crackdowns https://en.wikipedia.org/wiki/Thomas_A._Drake.
But the Pentagon and the Justice Department hold a tremendous amount of power. So I’m not trying to suggest this isn’t a complex picture. But there are some tremendous tools and oversight that US citizens enjoy.
I don’t think it’s entirely accurate to describe this as voluntary. It seems to be the case that they were required to disclose this information to the government by law, and required not to disclose their participation in the scheme.
The US is unusual in a number of ways that explain this:
For example, Australia’s e-safety commissioner wants to eliminate end-to-end encryption altogether: https://www.esafety.gov.au/newsroom/media-releases/esafety-welcomes-feedback-on-draft-industry-standards-to-tackle-online-child-sexual-abuse-and-pro-terror-material
I agree, but only because of the word ‘entirely’. If you build a central funnel that information goes through, then you should expect that someone will use it to spy on your users. The US government is one of the more benign entities that may end up doing this, since they’re at least mostly bound by the rule of law. If the government can compel you to disclose data, organised crime or foreign intelligence agencies can almost certainly exfiltrate it as well.
Because that’s where the tech companies are based and where their execs live, so it’s the government most able to exert pressure.
By “voluntary excessive data sharing” you mean push notifications? That’s a feature mobile apps need, not something “voluntary”. Having apps keep long-lived sockets open, accept incoming connections, or poll a server are all non-starters for mobile devices.
It’s the US government in a lot of these announcements because there is some degree of transparency allowed. When repressive governments do this, and they do, it’s harder to confirm what’s going on and you just hear stuff about “state-sponsored actors”.
How is it possible that mobile apps have existed for years without them, and desktop programs still don’t use them? Don’t you think that most of these could be encrypted end to end, made robust against eavesdropping, or be made completely local?
Weren’t you around in 2014 when it took a dedicated whistleblower whose life is ruined to this day to share just a fraction of these domestic spying programs with the population? Doesn’t look like a lot of transparency is involved to me. The US government is one of those repressive governments in every meaning of the word.
Dude, push notifications have been around on iOS about as long as 3rd party apps have been supported, I.e. around 2008. I’m sure it’s about the same on Android. They are absolutely necessary for any app that needs “server push”, including email apps, social apps, etc.
Desktop apps don’t need this because they have very different power requirements and networking — it’s fine for them to leave sockets open and run in the background.
I’m more than willing to explain the details, if you can indicate you’re not just interested in uninformed flaming…?
It would be feasible to E2E encrypt push notifications, I’m pretty sure. It probably just hasn’t been a priority, I would guess, because it wasn’t well-known they’d be abused this way. “Completely local” makes no sense in this context; the whole point is for a server to notify its client app.
Spying ≠ repression. Everyone spies. The question is how much the spying results in repressive activities. The US and EU are a lot better in this regard than genuinely repressive countries like Russia, China, Saudi Arabia, Iran, etc.
Desktops are a different domain. They are plugged into the mains. They care about power at a granularity of around 5W. Each app keeping a TCP/IP connection open (or polling periodically) is fine.
Even laptops are different because they’re in one of two states. Either they’re on, in which case the screen is using a lot more power than an extra TCP/IP connection, or they’re in standby and not expected to receive notifications.
Phones (and, to a lesser degree, tablets) spend a lot of their life in a warm standby state, where they’re expected to be able to wake within a second or two in response to an event in the network so that the user can shift their attention to the device and do something. If every app needed a separate process waiting for events, that would flatten the battery a lot. A lot of early Linux phones suffered from this. Instead, a platform service establishes one connection to a remote server and the application core goes to sleep. The baseband processor handles any keep-alive traffic necessary for this and wakes the application cores only when a notification is received. This reduces the power consumption by about an order of magnitude in standby.
Also, phones want to turn off their radios when possible to save power. An app with an open socket forces the radio to remain on. Cellphones have always been able to receive notifications while their cell radios are not at full power; that’s how they get notified of incoming calls and texts. Push notifications are delivered by the same mechanism.
Any realistic, honest answer to this question would be bombed with “off-topic” flags on this website.
It’s partly because the US govt is just really big in absolute terms (~330M people and the world’s biggest GDP), and partly because it’s massively influential overseas due to its geopolitical strategies. We don’t hear about India’s govt much, because their main geopolitical concerns are their land borders with Pakistan and China.
It is not simply because the US is just big, nor any of the other nonsense answers given above that were phrased to attract the minimum amount of controversy. I’ve already taken a handful of random flags above as it stands, so if you want to talk to me about it any more you should choose some place else to do it!
As an American, I like to keep tabs on my government. We do our best to hire honest hard working employees, but the power seems to go to their head and we have to fire them periodically. Sadly most of them, unlike regular employees, hire these big PR and law firms and desperately try to keep their old jobs. I wonder if we should take away some perks.
You may think that all this negative reporting makes us look weak and flabby on the world stage. I bet Putin and Xi think so. But here’s what’s happening: somewhere in basements in Russia and China, people like you and me are thinking “Now why can’t we dump on our government like that.” Then they call their friend Charlie next door. And that is how democracy is spread.
You can read up on the Five Eyes treaty if it makes you feel better? I seem to recall something about a Dutch or German equivalent to it as well, but can’t find references.
In the actually oppressive countries, this happens all the time but doesn’t get reported on in the news. America is just the largest English-speaking news market.
Doesn’t that sound like every conspiracy theory, where absence of evidence is just further evidence for the theory? Besides, there are still tons of classified programs in the US whose details emerge almost every week, involving torture, domestic spying, illegal eavesdropping, and more.
If your bar is hard evidence, I’ll see you in twenty five years, when it exists.
How do you get this? Apple and Google got given “lawful” orders that required that they comply, and were prohibited from telling anyone that they were subject to these orders. What part of that is voluntary?
Similarly where do you get “excessive data sharing” - how do you think push notifications can possibly work without going through a service that can be subject to government orders?
They could have refused and chalked up any fines as a cost of doing business, like they do in the EU.
No company, not even FB, has directly defied an order. What they do is come up with exciting and novel interpretations of general laws, and if it turns out their imagination was wrong they are subject to the penalties in the law (fines, etc.). But all you can do in response to a direct order from law enforcement is appeal to a court, which is all secret if the order demands it, and if your appeal fails you functionally have no choice but to comply.
Here you have at least Google and Apple being forced to comply and being required to keep quiet. I get that people think Apple and Google want to cooperate with this stuff, but we definitionally do not know who else is subject to the same orders. It sounds like Google and Apple could only talk about this because a senator asked them, so could Signal, for example, report that they’re subject to such an order if a senator hasn’t asked them? The implication is that they couldn’t.
This doesn’t make sense to me. I thought one of the requirements of conservative stack scanning was that you had to dump all the registers during marking. Otherwise you could miss a pointer that is not on the stack but only in a register. However this statement makes it sound like it is the other way around: with conservative stack scanning you don’t have to dump the registers.
They are using a variant of terminology I’m familiar with, but historically “conservative GC” has meant: we don’t have precise knowledge of which data is a pointer, so if we see 32/64 bits that could be a pointer to a live object, we won’t GC that object. This can cause leaks, because your 32-bit “pointer” might just be an int32 whose value collides with a relevant memory address; that is much less of an issue in 64-bit. This is in contrast with “precise” GC, which knows which sequences of bytes are pointers and which aren’t.
I assume I’m missing something because I don’t think v8 has moved to a conservative GC but maybe they adopted that strategy just for the stack.
Sounds like the article agrees:
I’m not sure exactly how registers come into play. It sounds like in this case conservative stack scanning means “look for pointerish things on the stack without bothering to check if there’s a register actually pointing to them”?
As I understand it, the entire v8 gc isn’t conservative, only the bit that scans the C++ heap.
Well, when the GC is called, all callers’ registers have either been saved to the stack, or they’re callee-preserved registers. Either way the GC can find them.
I’m less sure about other threads. An inactive thread’s registers are all saved to a buffer, but I’m not sure how or if you can find the registers of another active thread (plus, they change constantly…).
I can’t speak to V8 but I assume they do similar to JSC which is abusing setjmp to drop the full register set into memory.
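For the curious, the setjmp trick is roughly the following (a minimal, untested sketch, not the actual JSC/V8 code; scan_range and stack_top are hypothetical helpers, and it assumes a downward-growing stack):

```c
#include <setjmp.h>

/* Hypothetical helpers: conservatively scan a word range, and report the
   highest stack address of the current thread. */
extern void scan_range(void **from, void **to);
extern void *stack_top(void);

void scan_current_thread(void) {
    jmp_buf regs;
    setjmp(regs);  /* spills the callee-saved registers into regs */
    /* regs lives in this frame, so scanning from its address up to the
       top of the stack sees both the spilled registers and every
       caller's frame (callers' other registers were already saved to
       the stack by the normal calling convention). */
    scan_range((void **)&regs, (void **)stack_top());
}
```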
This is a clever tradeoff between speed/compactness and human readability.
It would be significantly faster to parse than JSON because the leading byte-counts remove the need to scan strings and numbers byte-by-byte, as well as the need to unquote strings. Plus, if you are scanning it to extract a specific nested value, via a JSONPointer or something, the lengths make it super fast to skip ahead to that point.
(It does complicate encoding though, because the length comes first but when writing an array/object you don’t know its value until you’ve written the following data. I guess I would first try to guess the number of digits, skip that many bytes, write the value, then memmove the value up or down if the guess was wrong before filling in the length.)
The only obvious change I’d have made is to encode the byte-counts in hex, which would shave off a few bytes and speed up parsing a little.
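For what it’s worth, the guess-and-memmove idea would look something like this (an untested sketch, not the actual format: tags and separators are omitted, buf is assumed big enough, and write_body is a hypothetical callback that emits the nested items):

```c
#include <stdio.h>
#include <string.h>

size_t write_prefixed(char *buf, size_t (*write_body)(char *out)) {
    size_t guess = 2;                     /* guess a 2-digit length */
    size_t body_len = write_body(buf + guess);
    char digits[32];
    size_t ndigits = (size_t)snprintf(digits, sizeof digits, "%zu", body_len);
    if (ndigits != guess)                 /* guessed wrong: slide the body */
        memmove(buf + ndigits, buf + guess, body_len);
    memcpy(buf, digits, ndigits);         /* fill in the length prefix */
    return ndigits + body_len;
}
```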
If you did padding zeros for the lengths, you could just always reserve 20 bytes or whatever. If it were hex, you could just say it’s always 16 bytes.
[head explodes at such waste of space]
You could make it smol, and say lich messages are only 4GB max, then encode that as hex, 8 bytes, same as using raw int 64s.
The git protocol is one common example of padded hex length prefixes.
Only tiny messages would be readable though, because there’s no whitespace
You could try something clever with newlines as a closing delimiter, though people generally don’t like that for textual formats
What a stupid acronym, inside a browser component of all things! They should have called it Heuristic Tracing of Memory Links, or Heuristic Tracing of Transient Pointers.
Reminds me of how compilers often talk about a “scheduler” that has nothing to do with the runtime notion of scheduling threads/tasks/etc. Pretty sure I’ve run into other examples of GC terminology overloading acronyms or terms as well, but I can’t recall specifics off hand.
I’ve usually heard that qualified as “instruction scheduler”…
…unless you’re referring to some entirely different scheduler, of course.
No, it’s the instruction scheduler, but I’ve not often heard it qualified as such, especially not in the code base for the Go compiler IIRC.
In LLVM, it’s usually referred to as ISel.
Instruction selection and instruction scheduling are generally taken to be different tasks.
Yes, you’re right. Brain not working first thing in the morning. There is some interesting work unifying instruction scheduling, register allocation, and instruction selection, but it’s not computationally feasible yet.
https://unison-code.github.io/ is the register allocation and instruction scheduling part. Instruction selection I’m not sure is feasible beyond single-digit instruction sequences, because I think equality saturation is a more promising overall approach to instruction selection, and extraction is already hard.
I wouldn’t be surprised if instruction scheduling predates the other usage. But I think this is not the same as the CSS example. It’s a common word, and both usages are correct in their respective niches. CSS as an abbreviation is somewhat arbitrary; they could have chosen slightly different words to avoid it, but I guess it was deliberate.
Naming is hard but only if you’re bad at it.
Property 1 is tautologically useful. Property 2 is useful for things that want amortised growth costs. Property 3 looks really interesting. I’ve not heard this before. I can’t think of a reason why it would be true in the general case, so I presume it’s some property of common hash functions. Anyone know?
I always make hash table sizes powers of two, to avoid expensive modulo operations. I know that’s problematic if the hash function produces less randomness in its lower bits, and I’ve run into that, but I just switched to a better hash.
I know I’m not the only one … it seems like there are two schools of thought on this.
cc @david_chisnall @fanf
For NPOT tables, use Lemire’s fast alternative to the modulo reduction, which is slightly faster than mod-with-multiplication and much faster than runtime division. It also requires no additional state. The latency is 3 cycles on high-power hardware, compared with 1 for masking (although you then have to either cache the mask or else add another cycle to your critical path to subtract 1 from the table length).
It’s also order-preserving, which turns out to be important for ordered hash tables like (one flavor of) Robin Hood and bidirectional linear probing.
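For reference, the reduction in question is essentially a one-liner (sketch for 32-bit hashes):

```c
#include <stdint.h>

/* Lemire's "fast alternative to the modulo reduction": scale the hash
   into [0, n) with one multiply and a shift. Unlike h % n, it also
   preserves the order of the hash values. */
static inline uint32_t reduce(uint32_t h, uint32_t n) {
    return (uint32_t)(((uint64_t)h * (uint64_t)n) >> 32);
}
```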
You can construct an order over integers which is preserved under mod.
By comparing remainder bits first? Sure, but not very convenient.
It’s not much less convenient than the division you have to do anyway when you probe.
For the pot case, you can just rotate.
I wondered about what that tradeoff looks like now. 32-bit integer division (and modulo) is very cheap on modern hardware (typically <10 cycle latency, at least two in parallel). It’s more expensive than a bitfield extract (single cycle). It’s more painful on embedded systems (on our Ibex it’s 15 cycles and stalls the whole pipeline until it’s done). Fewer hash collisions will probably offset the costs quite quickly, but I’d want to see benchmarks to decide either way.
If you are taking the modulus wrt a fixed value, multiply by the inverse.
The compiler will do that, but it’s not that helpful for this use case. The modulo is the size, which will be one of the primes in the list, not not a fixed one.
Yeah, so you can precompute the inverse either when creating the list of primes, or when the table is resized.
Or even simpler, multiply (32 x 32 -> 64) the hash by the table size and use the top half of the result.
Please, 64 x 64 -> 128; we’re not savages.
You’d think you want the opposite of Property 3 for performance reasons? E.g. using a (generalized) Mersenne prime would make modulo operations inexpensive.
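For example, with p = 2^61 - 1 the modulo folds down to a shift, an add, and a conditional subtract (rough sketch):

```c
#include <stdint.h>

/* x mod (2^61 - 1), using 2^61 ≡ 1 (mod p): fold the high bits onto the
   low bits instead of dividing. One conditional subtract is enough
   because (x >> 61) <= 7, so the sum stays below 2p. */
static inline uint64_t mod_mersenne61(uint64_t x) {
    const uint64_t p = ((uint64_t)1 << 61) - 1;
    uint64_t r = (x & p) + (x >> 61);
    return r >= p ? r - p : r;
}
```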
Another one that totally fails to acknowledge the lack of memory-safe tooling adapted to the industry.
Yes Rust and its ecosystem solved part of this problem, and we should all be thankful for the work of everyone in there.
But maybe instead of spouting nonsense, the US government (and others) should start looking at why the tooling is ill-adapted and what they could do to help fix it.
But nah. Better yell at the industry to rewrite everything. That is definitely why they have chosen unsafe tools. Because we did not yell at them enough. I mean this strategy worked so well for the past 30 years. Better do more of it.
:sighs:
What are you referring to?
Pages 4-8 talk about mitigations - training, coverage, secure coding guidelines, fuzzing, static and dynamic analysis, safer C++ subsets, non-executable memory, control flow integrity, ASLR, … CHERI, and more
Then page 9 starts talking about Memory safe languages. They don’t mention Rust specifically, as far as I see. They mention that many memory safe languages have garbage collectors, which can have an adverse impact on performance.
It seems pretty reasonable to me
I think you may have read something that’s not there. Maybe you have been reading too much from Rust advocates :) I don’t see any signs that this is “RIIR” advocacy.
Appendix of memory safe languages: C#, Go, Java, Python, Rust, Swift
My career has largely been in Python, with Ruby and ECMAScript as minor highlights. The tooling exists.
The underlying issue is machismo. How do we interrupt machismo? This is a question for psychiatrists, I suppose. Yelling doesn’t work; embarrassing them with security vulnerabilities doesn’t work; providing easy paths toward memory-safety doesn’t work. The inherent difficulty of correctly writing code in a memory-unsafe environment attracts people who enjoy the delusion of being more meritorious than their peers, and perhaps we need to break their worldviews entirely.
I mean… Rust worked. Maybe we should look at why it did when others failed? A personal hypothesis that has some support: they are the only ones that actually cared about UX.
And all these memory-safe languages you mentioned are already considered industry standard. They are not the targets of this kind of report.
Python is literally in the report. Switching to a “high level” language is definitely a valid way of making your stuff safer.
Is the industry incapable of creating such tooling? And can you clarify what “the industry” is, because a lot of people use memory safe languages today already, in many different contexts.
I mean. In theory we are not incapable. We finally got Rust.
But it took us… 30 years, a lot of luck, and a lot of people’s sanity. There is probably a better way possible.
Our funding models for the work necessary to bring research results into industry-adoptable tools (aka engineering them) are mostly… “some people’s hobby time”. That is… slightly problematic.
Which tools do you think are lacking? Static analyzers, address sanitizers and UB sanitizers are widely available and have been a huge help in improving the safety of C/C++.
C++ of course has safe constructs like unique_ptr and shared_ptr, and there’s ongoing work in Clang to create a “safe mode” where pointer arithmetic is illegal and subscripting operators of vector/array/span are bounds-checked. (Sorry I don’t have references; I just ran across this the other day and didn’t bookmark it.)
They are a huge help, but the data shows that this help is not enough to massively move the needle. I am not saying they are useless. They are really useful.
But when we get this kind of impact, as in Android, which was a codebase already shipping and using these tools, there is a real discussion to have about how we, as a whole industry, only got there when a browser company self-selected through internal politics to do it, and then fired everyone once the work reached an industry-adoptable level. https://security.googleblog.com/2022/12/memory-safe-languages-in-android-13.html
I know you specifically aren’t interested in it per the README… but I would be interested in a JS implementation of this, preferably via wasm. A Python implementation for pandas too; JSON generation is very slow from pandas. At this point I’m probably going to move to parquet-wasm, but a drop-in replacement for JSON would be very attractive.
Should be feasible. In lieu of inner pointers, the references would be objects containing the byte array and an offset.
Hi! Main author of Fleece here. ( I think I may have posted it to lobste.rs long ago when I first joined.)
BTW, Cap’n Proto serialization and FlexBuffers both have similarities to Fleece. I didn’t know about either of them when I first designed Fleece in 2015. And now of course there’s the newly-announced JSONB in SQLite.
I think it’s fairly interesting. It looks simpler (or just better specified?) than the wire format of capnproto, flexbuffers, etc. which have a strong vibe of “just use the C++ code and trust us”. In contrast this seems like it’s realistic to reimplement it. Is there any sort of community of users around it?
Mostly just Couchbase Lite and developers using its C API (the other language bindings don’t expose Fleece directly.)
There are a few FFI bindings that I wrote as part of bindings to Couchbase Lite — Python, Rust and Nim. IIRC the repos they’re in are in the GitHub couchbaselabs domain.
Since you put so much effort into Foundation compat, it would be nice to see a comparison against Apple’s binary plist format. Looks like it should be faster, especially for use cases that don’t consume everything in the input immediately.
I’d expect the results to be similar, though slightly better, than the Foundation JSON parsing results. Most of the overhead tends to be in allocating the object tree, not in parsing the input.
Can you give an example what the append feature is good for? I understand the concept and how it’s applied for example in SQLite’s WAL but I’m interested in the details of your use case here.
It enables a transparent form of delta compression:
The appended data is effectively a delta. You can save it without having to rewrite the original data. You preserve the previous version, making it easy to revert to. (This is similar to what a VCS does.) You can also send the appended data to someone who already has the previous version, saving bandwidth.
It’s not as efficient as dedicated binary-delta formats like xdelta/zdelta, but it’s simple and has effectively zero decoding time.
This is very much like the way append-only data stores like CouchDB (and LMDB, kind of) work; only they work at the level of pages instead of objects.
I often do something like this by structuring a PR as multiple commits that each do one thing. Then reviewers can look at each commit separately in GitHub if they want. This often takes some tricky interactive rebasing to set up, but I think it’s worth it.
I dunno about doing this as multiple tiny PRs. Presumably each intermediate “stacked” PR still has to pass CI tests. That could add a lot of extra overhead.
GitHub’s UI hates you. Review comments that people put on individual commits don’t show up in the conversation tab, and they go away if you edit one of those commits and rebase.
There’s value in running ci on every commit though. You get an automatic signal on where in your changes things went wrong. “Oh, I broke this test when I had this 10-line change to the button text” vs “my 900-line PR broke 4 tests for some reason”
Obviously if you can’t afford to run ci in parallel for commits or if your ci tests takes 9 hours then this could be a big problem for you. But I don’t think that’s most people.
There’s definitely value in that at least for bisecting but you don’t need 9-hour CI runs for per-commit tests to become a problem. Just one hour of build & tests, not unheard of in the C++ world at least, and 100 developers spamming PRs makes it already quite expensive.
This has multiple issues though, notably:
The draw of stacked PRs, from what I understand, is that it integrates all that as first-class concepts, if you have a stack of PRs (~commits) you can comment and review on each piece individually, and merge them separately (as long as you respect the dependency chain obviously).
My work project, Couchbase Lite, has been doing something like this since 2018. I came up with a binary JSON-equivalent data format called Fleece that doesn’t require parsing, and all the JSON data is stored in SQLite as Fleece blobs. (This is invisible to the API, just an internal representation.)
That’s pretty cool work! :)
By the way, JSONB says it’s from Postgres, and from a quick check the earliest version that has it available is 9.4, which was released on December 18, 2014. Looks like they never bothered to promote or standardise it much?
For future reference, someone else submitted this recently: https://lobste.rs/s/iq0uwa/fleece_super_fast_compact_json
Another new language that drops the ball on memory management. Sigh.
I guess the Onyx people haven’t looked at Objective-C, Swift, Nim and even C++, all of which use the third option, automatic ref-counting.
Automatic ref-counting is a type of garbage collection, as also implemented in eg Python, no?
(I agree there’s a bit more to the picture in some of your examples - copy-on-write (or mutable value semantics) interacting with the refcount automatically, static refcount inference or ellision, etc. Some of these approaches are really interesting!)
Technically, yes. But it’s so much simpler than other GC algorithms that it’s often considered a separate thing. You can implement it in a page of code without requiring a special heap, it can be applied to one type of object without affecting others, etc.
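To illustrate the “page of code” point, the manual core is something like this (a bare-bones sketch; real implementations add atomics, weak references, cycle handling, and so on):

```c
#include <stdlib.h>

typedef struct Obj {
    int refcount;
    void (*destroy)(struct Obj *);  /* type-specific cleanup */
} Obj;

static void obj_retain(Obj *o) {
    if (o) o->refcount++;
}

static void obj_release(Obj *o) {
    if (o && --o->refcount == 0) {
        o->destroy(o);  /* expected to release anything this object retained */
        free(o);
    }
}
```

What ARC-style compilers (Objective-C, Swift) add is inserting the retain/release calls for you.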
Yeah, “technically,” memory management has two top-level taxons: automatic and manual. Garbage Collection is technically a synonym for automatic memory management. In that sense, Rust is garbage collected.
However, the industry parlance is to use “garbage collected” to mean tracing garbage collection, which is what we think of when we talk about Java, JS, Go, and the related concepts of mark-and-sweep, stop the world, and generational gc.
If you go back to the origin of C, stack variables were called “auto” variables because they were automatically laid out and destroyed with the stack frames, as opposed to the static variables that the programmer was responsible for managing through out the program lifetime.
Automatic ref counting is considered one of the simplest garbage collection mechanisms and is used at least as a component of many more complicated implementations. I fail to see how that wouldn’t fall into the first bullet there.
If someone calls something “garbage collection”, I’ll assume it collects cycles but has at least some stop-the-world style runtime overhead; if they call it “reference counting”, it probably has the opposite trade-off. I’d give someone a very funny look if they told me Rust has a garbage collector, and I don’t think I’m alone in that. If a garbage-collected language uses reference counting as part of its GC implementation, that’s an implementation detail that (generally) could be replaced without changing the semantics.
That may not be strictly correct under academic definitions of the term “garbage collection”, but that’s the way I almost always see it used in practice.
(It would be possible for a garbage-collected language to guarantee reference-counting-style deterministic collection in the absence of cycles. I think Python might do this? But that’s neither here nor there - the point is they’re distinct enough features that it’s not necessarily correct to assume that one term encompasses the other.)
Fair enough
Admittedly this is an old paper, but it was highly influential on me. It was the first time I understood object capability security: it’s just lexical scope and argument passing! Powerful stuff.
I was a bit surprised that the article doesn’t highlight the term “capability”, since what it describes is, to me, a capability system. The author does say:
I’m guessing that the definition of a capability system has broadened since this was published, maybe in response to it?
Somewhat, yes. It’d be more accurate to describe the current capability-theory ecosystem as pluralist, compatibilist, or disconnected; there are multiple definitions which come from different foundations, and no single unifying picture has yet emerged.
For example, I might say that a capability is an element of a Cartesian closed category. While daunting, this really just means that a capability is a value in a programming environment which behaves like a simply-typed lambda calculus; it can be forgotten or dropped, duplicated or copied, and passed to subroutines like any other parameter-filling argument. From this perspective, the surprising part of capability theory is that actors have Cartesian-closed/lambda-calculus-like behavior and certain references to actors can be used as values; category theory doesn’t come pre-equipped with a notion of actors.
Doesn’t that abstract away all of the interesting properties that make something a capability? Which is more to do with the environment than the values themselves, i.e. you aren’t given ambient authority.
Only if your programming environment isn’t also built out of similar categorical constructions. If your programming environment is simply-typed lambda calculus, then it’s not possible to violate the boundary of a capability. Really, the urge of programmers to add new impurities is the problem.
For the systems that I or @dustyweb have built, we are fortunate that stretching out lambda-applications in time and space does not break this abstraction. As a result, we can have message-passing semantics, and we also can have coarse-grained actors (individual machines) communicating across a network. That’s all quite free. What’s not free is the environment running on an individual machine and providing fine-grained actors (private pages of memory); this only occurs when programmers uphold the purity of lambda calculus.
See also previously, on Lobsters, although the discussion did not really go in this direction.
OK, but purity isn’t enough. It doesn’t capture what makes something a capability rather than just a closure.
For instance, where did the capability come from? If your system has some ambient authority that allows any component to get another equivalent capability, then you don’t have capability security. This isn’t a property of the capability itself.
And capabilities have to be able to cause some kind of side-effect, otherwise there isn’t any security boundary to protect. (Well, apart from cryptographic secrets, but they aren’t much use in the absence of IO.)
So you need to add a lot more to your mathematical abstractions to get something that has the flavour of a capability, in particular you need something that describes how side effects work, and something explaining how ambient authority is prevented.
Purity prevents ambient authority. This is one of the main ideas behind the E lineage. From Mark Miller’s thesis, it is well-established that a lambda calculus only has three ways to introduce a capability, and E (and its derivatives) all use purity to forbid other introductions.
There are several ways to conceive of side effects. We don’t really need them, though; it’s enough to have coarse-grained actors evolving in time as they pass messages to each other over a network. I’m going to handwave this by appeal to folklore, as it’s Turing-complete whether a network of actors will halt, and so there isn’t any additional computational power that we could physically add; side effects don’t do anything extra.
That isn’t true, though. Ambient authority is a property of the APIs that are provided to an object / actor / capability when it is created. For example, trad POSIX filesystem APIs give you ambient authority, but if you restrict it to the *at() family and remove .. parent links, then it can be capability-secure.
Even with just the lambda calculus, you need to restrict the free variables in a closure so that it does not capture more capabilities than it should. Yes, you can do so by inspection, but you can make inspection trivial if you extend the lambda calculus with a form of functional abstraction that syntactically prevents free variables (i.e. a combinator). That’s the kind of nuance you lose by saying a capability is just a closure.
These are side effects! Actors are stateful, they cannot be pure.
And bah, whenever I see an argument that relies on reducing a system to Turing-equivalence, I know it isn’t saying anything interesting. What makes capabilities worthwhile is how they restrict what a program can do, so saying it can do anything is not helpful.
Main effects aren’t side effects. A monad for interpreting E expands to the traditional reader+writer+state monad along with a mailbox for outgoing messages; these are the intended effects of E expressions. Each actor maintains a pure view of computation, and the only way to see it as effectful is to stand next to the network and observe multiple computers working together.
The pattern of restricting free variables is called “attenuation of capabilities”, and it can be performed automatically by compilers. For example, my Monte compiler automatically computes the closures for objects and only closes over scoped names when they are actually used by the object. I did explore implementing this with combinators, but it was not efficient or ergonomic. (Not to say that Monte is ergonomic, of course!)
Ok you are starting to talk about the stuff that is actually relevant to capabilities.
I don’t get how you can have a stable reference to a stateful actor and maintain purity; ie, an actor can’t be simply a closure. When I say “side effect” I mean something that changes state without changing references to that state, eg sending a message to an actor, or launching the missiles. It might be an intended effect, but when talking about pure computations the result — the value computed — is the point, and anything else is a side effect.
For monadic effects, the reason monads work is that they thread state through the computation in a linear (or affine) manner, so that the implementation can update in place instead of creating an altered copy. But still, there isn’t a stable reference to (say) the IO monad’s state of the world: each IO call consumes the old state and produces a replacement (or not, if the missiles were really destructive). So monads do not model capabilities.
When I started my career I wanted to help the most people. Instead of going to grad school and tinkering on something a few hundred people might use, I went to Apple. (I could have gone to Microsoft, but I have standards.)
Focusing on rough-edged homemade tools for other hackers, and calling that better than the stuff the other 8 billion people can use, seems a bit myopic IMO.
Claiming 8 billion people can use any product is a stretch, especially ones that cost thousands of dollars and need a credit card, and don’t work without internet access and iTunes, that delete accessibility tools every other update, and that give root on your device to governments if they ask nastily enough.
I love my Apple products! Thank you for helping make them.
I didn’t mean to belittle you or your employer. I don’t think wigwams are “better” than battleships. I just think wigwams are fun and I want people to work together
Hey, I apologize for the tone of that comment. It was late and I was kind of grumpy. Sorry!
step one: learn what a wigwam is (hint, not a boat)
step two: stop appropriating Native words
I think you misunderstand how language works. It isn’t invented in labs by committees; someone makes a word and other people start using it. Sometimes they’re not from the same culture. Most languages are full of loan-words. There is almost no language on Earth English hasn’t “appropriated” from, including a number of colonizer languages like Norse and French (and Germanic if you go far back enough.) Linguistic diversity and creativity is awesome.
If you want people to stop using indigenous North American words, you’re going to have to rename thousands of cities and rivers, mountains, etc. in the northeastern US, BTW.
The context is important. English speakers (among others) have spent centuries brutalizing Indigenous people and suppressing their cultures. For an English speaker to help themselves to an Indigenous word is interpreted by many as piling injustice on top of injustice; adding insult to injury. English borrowing a word from e.g. French carries a completely different valence. The situations are not the same.
(I don’t know what culture the OP’s author is part of. I’m trying to explain why this kind of word borrowing might be questionable in general; I’m not necessarily asserting that it is in this case.)
I understand. And I think they’re wrong. Using another language’s word doesn’t take anything away, it only adds to that language’s reach. The real danger languages face is that of dying out, not of spreading.
Using a French word, or naming a place after a word, is somewhat different from a use that almost completely erases/overrides the meaning of a word from a set of cultures that are already having trouble propagating knowledge of their languages.
The word wigwam has been in common usage in American English since the 1600s. There are many, many, many other English words that come from Native languages) more, a good number of which have diverged from their origins.
I agree that the usage of “wigwam” here in contrast with “battleship” is weird and possibly a misunderstanding of the word, but I don’t see how it’s offensive, or how it “erases” anything.
“Erases” in the context of their usage. None of their descriptions or usage relates to what a wigwam is; the prior meaning of wigwam goes practically unused.
To dramatize, he has taken the meaning, gone over it with correction fluid, written in a new meaning, and through this project, is trying to have others do the same.
“What is a wigwam?” “It’s a kind of Native American hut”: a meaning that has the potential to lead to more discussion, learning, and awareness of a set of cultures.
“What is a wigwam?” “A wigwam is a reasonably-established project that provides alternatives to bloat/cruft.” 🥹❓ It’s so devoid of its original meaning that someone exposed to only this form might think it’s an original word.
Step zero: learn what ‘appropriation’ means (hint: it doesn’t mean merely using a foreign word):
take (something) for one’s own use, typically without the owner’s permission.
For cultural appropriation specifically:
the unacknowledged or inappropriate adoption of the customs, practices, ideas, etc. of one people or society by members of another and typically more dominant people or society.
Yes, and how exactly does using a word in a simile take it away from the owner?
Appropriation would, e.g., be proposing to use the word, from now on, to permanently denote something different from what it originally meant; to cause a change such that the original owner could no longer effectively use the word with its original meaning.
This is a long way from that.
This looks a lot like a reinvention of PouchDB, using SQLite in a Rust wrapper compiled to WASM instead of the browser’s IndexedDB. (I’m not sure where its SQLite persists its data to?)
Thanks again for Monocypher! The only other algorithm I’m wanting is Blake3, but I’m sure adding it would overflow your self-imposed line limit.
Well… there are a few problems with adding BLAKE3, or SHAKE, or SHA-3, or any other hash algorithm, regardless of its merits:
Adding it without removing BLAKE2 would just leave me with a redundant hash, and I try my best to have one primitive to rule them all.
Adding it and removing BLAKE2 would forgo the ability to implement Argon2 — I’d end up with a variant instead.
Now even though it’s in the optional folder, I do have SHA-512 as well. I even went as far as implementing HMAC-SHA-512. But I have reasons:
SHA-512 is necessary to implement Ed25519, which I can use to compare Monocypher with (i) other libraries, and (ii) specially crafted test vectors (Wycheproof). This is a very important contributor to Monocypher’s correctness, so I kind of need it.
SHA-512 is necessary to implement Ed25519, which is the EdDSA variant everyone uses. I’m under no illusion there: more Monocypher users use Ed25519 than use EdDSA-BLAKE2b. Even though compatibility wasn’t my primary goal, it is still valuable for many people, and failing to provide it must come with a good reason (such as “AES is redundant with ChaCha20”).
If users use SHA-512 and would like to avoid using several hashes (to reduce code size or attack surface), it makes sense that they’d avoid BLAKE2b as well. But BLAKE2b comes with a very useful keyed mode, and the canonical SHA-2 version of keyed mode is HMAC.
If I do add another hash, it will likely be in a separate project. Even BLAKE2s, despite it being a better fit than BLAKE2b on embedded targets (where Monocypher is most popular).