I think the points around making things serializable are interesting and valid. But it's important to acknowledge that the Rust type system does in fact involve trade-offs. Self-referential structures aren't just "object soup". They are a natural and fundamental way of modeling information, and when a language makes it awkward to express those relationships it has given up something. Rust gave it up in return for memory safety without GC, which is a valid tradeoff.
They are a natural and fundamental way of modeling information, and when a language makes it awkward to express those relationships it has given up something.
One could say the same thing about goto. Just s/information/code/.
The progress in programming comes from increasing global composability by eliminating local expressiveness that interferes with it.
In my experience, forcing developers to express information relationships within a bigger framework of a tree of ownership leads to substantially better software.
I would argue that the comparison to goto is not completely apt — the two tradeoffs offer very different benefits. In the goto case you barely lose anything (almost every usage can be converted to structured control-flow), while your reasoning abilities are greatly increased. It is no accident Dijkstra’s paper became so widespread.
Not having a GC is a valid tradeoff in some (niche) use cases, as mentioned, but it is very far from a ubiquitous one. There is nothing fundamentally bad about self-referential data, in fact I think the vast majority of existing software would be vastly more complicated without them.
almost every usage can be converted to structured control-flow
Almost every use of self-referencing can be expressed as indexes, with increased safety, ability to reason about the code, and composability.
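As a minimal sketch of what that can look like in Rust (`Node`, `Tree`, and the field names are illustrative, not from this thread):

```rust
// A tree whose parent links are indices into one owning Vec,
// instead of self-referential pointers.
#[derive(Debug)]
struct Node {
    name: &'static str,
    parent: Option<usize>, // index into Tree::nodes
}

struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    fn new() -> Self {
        Tree { nodes: Vec::new() }
    }

    // Adding a node returns its index, which acts as a handle.
    fn add(&mut self, name: &'static str, parent: Option<usize>) -> usize {
        self.nodes.push(Node { name, parent });
        self.nodes.len() - 1
    }

    fn parent_name(&self, idx: usize) -> Option<&str> {
        self.nodes[idx].parent.map(|p| self.nodes[p].name)
    }
}

fn main() {
    let mut tree = Tree::new();
    let root = tree.add("root", None);
    let child = tree.add("child", Some(root));
    assert_eq!(tree.parent_name(child), Some("root"));
    assert_eq!(tree.parent_name(root), None);
}
```

The borrow checker never objects, because the only real reference is the one `Vec` owning all nodes.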
Not having a GC is a valid tradeoff in some (niche) use cases, as mentioned, but it is very far from a ubiquitous one.
Not having a GC makes the code have no runtime, which increases its composability by making it potentially usable in everything. No one will use Java code from Python, JS, etc., while almost every language can use Rust because it doesn't have a runtime.
It also allows a practical aliasing model which enables efficient memory safety and concurrency.
Small concession, big wins.
There is nothing fundamentally bad about self-referential data
“There’s nothing fundamentally bad about goto.” Quite a few elegant goto patterns in C, that I don’t miss, but used to employ.
Dropping it, too, was a relatively minor concession for the ability to reason about and compose larger software.
Almost every use of self-referencing can be expressed as indexes, with increased safety
Moving references out of the type system does not increase safety. If you have a pointer to an object of type T, and the type system knows that it is a T*, then static and dynamic checkers (and run-time enforcement techniques) can ensure that your T* does not outlive the underlying T, or that it is explicitly invalidated if it does. In contrast, if you refer to objects via an index then you are hiding this relationship from the type system. You are doing manual memory management again. It’s not quite as bad as C, because a dangling index will point to a bounds-checked object and so will either be detectably invalid or point to some T, but there is nothing that guarantees that it points to the T that you think it does.
If you delete an object from a graph, you now have a free slot in your index table. Do you reuse that? Now you have a mechanism for introducing use-after-free bugs. A dangling index to that slot will refer to the wrong object. Alternatively, do you avoid reusing the slot? Now you have a memory leak and have increasingly sparse data structures that become progressively more inefficient to traverse.
This kind of model works fairly well for things like the CFG in the Rust compiler, because those are short-lived structures and so it’s fine to simply burn slots for deallocated basic blocks and free the whole thing at the end (though it’s still more effort than it would be to use explicit traced references or weak references in a reference-counted environment). They work very poorly for data structures that change shape over time.
Not having a GC makes the code have no runtime, which increases its composability by making it potentially usable in everything. No one will use Java code from Python, JS, etc., while almost every language can use Rust because it doesn’t have a runtime.
People happily used Objective-C from Ruby, Python, and Java (and even Haskell and Ocaml) via mechanically generated FFI, and from C/C++ via hand-written FFI. Having a runtime didn’t prevent this. Having reference counting in the runtime made it easy to build bridges that automatically surfaced Objective-C objects in these other languages. Objective-C’s perception as an Apple language was the largest barrier to this.
It also allows a practical aliasing model which enables efficient memory safety and concurrency.
Except that now you’re advocating for design patterns that throw away temporal safety. Type-pooled allocators in C/C++ have the same level of temporal safety as your proposal. They’re much better than nothing (which is why Apple now uses them in the XNU kernel) but they’re a long way away from ideal.
“There’s nothing fundamentally bad about goto.” Quite a few elegant goto patterns in C, that I don’t miss, but used to employ.
C’s goto is absolutely nothing like the goto that existed prior to structured programming. It has well-defined behaviour with respect to lexically scoped objects (you can’t branch over a local variable’s declaration, for example) and it is not permitted to branch to labels in a different function.
C’s goto is absolutely nothing like the goto that existed prior to structured programming. It has well-defined behaviour with respect to lexically scoped objects (you can’t branch over a local variable’s declaration, for example) and it is not permitted to branch to labels in a different function.
You can branch over a local variable’s declaration. The following is legal:
    goto L1; // legal: jumps over x's declaration
    int x = 42;
L1:
    goto L2; // legal: jumps into a nested block
    {
        double y = 3.1419;
L2:     ; // prior to C23, a label needs a statement to attach to
    }
The only restrictions that the C standard places on goto are that (1) the jump must be within the same function, and (2) you cannot jump into the scope of a variable-length array (C23 6.8.6.1 “The goto statement”, draft N3096). So, for example, a goto that jumps forward past a VLA’s declaration into its scope would be invalid.
s/increased/decreased. Even naive C has better tools to actually debug “dangling pointer” problems — valgrind will cry about it. But there is nothing you can do if you store an index and the associated value was replaced in the meantime, let alone in a managed-memory language.
And come on, increased ability to reason about? What is clearer: that my Node has a parent: Node, or if it has a parent: u32? In a managed language the former has only a single invalid value (or, if the language is null-aware, none), while the latter has potentially 2^32, let alone the case of dangling indices.
Not having a GC makes the code have no runtime, which increases it’s composability
Yes, which is a niche use case, which might be worth it depending on context, say, writing a specialized library. Most people want to write applications, though, where this concern is negligible, and can be circumvented by IPC.
It also allows practical aliasing model which enables efficient memory safety and concurrency
So you think a checked array access will have better optimizations by the compiler than pointers themselves? Especially since one of Rust’s strong points is that it can actually “use the restrict keyword” properly, thanks to the borrow checker? This is an unfounded statement on your part.
There is definitely value in restricting the “degrees of freedom” of code (see types), but as with most things, it is always a tradeoff. We don’t generally write regular (non-Turing-complete) programs, even though they are much easier to reason about than Turing-complete ones.
And come on, increased ability to reason about? What is clearer: that my Node has a parent: Node, or if it has a parent: u32?
I think you misunderstood this. The idea is not to just replace the pointer with an index in the object, but to keep another vector of data in the main struct:
parents: Vec<(PersonId, PersonId)>
This works if your data is immutable, so you can sort it and use ranges and binary search. If you mutate your data during runtime, using a map works better:
parents: HashMap<PersonId, PersonId>
Although if the data is kept in memory for a long time and mutates during its lifetime, I’d think a bit about using indices like this before jumping into writing the code.
Again, you have all your data in flat structures and you do not have any nesting of data anywhere. It makes a lot of sense e.g. when modeling databases to an ORM, or when writing a compiler.
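A toy version of both variants (`PersonId` and `parent_of_sorted` are made-up names for illustration):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct PersonId(u32);

// Immutable data: keep (child, parent) pairs sorted by child
// and look parents up with a binary search.
fn parent_of_sorted(parents: &[(PersonId, PersonId)], child: PersonId) -> Option<PersonId> {
    parents
        .binary_search_by_key(&child, |&(c, _)| c)
        .ok()
        .map(|i| parents[i].1)
}

fn main() {
    let mut parents = vec![
        (PersonId(3), PersonId(1)),
        (PersonId(2), PersonId(1)),
    ];
    parents.sort_by_key(|&(c, _)| c);
    assert_eq!(parent_of_sorted(&parents, PersonId(3)), Some(PersonId(1)));
    assert_eq!(parent_of_sorted(&parents, PersonId(9)), None);

    // Mutable data: a map is simpler to keep consistent,
    // since reparenting is a single insert.
    let mut map: HashMap<PersonId, PersonId> = HashMap::new();
    map.insert(PersonId(3), PersonId(1));
    map.insert(PersonId(3), PersonId(2));
    assert_eq!(map.get(&PersonId(3)), Some(&PersonId(2)));
}
```

Either way, the relationship lives in one flat side table rather than inside the objects themselves.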
Again, you have all your data in flat structures and you do not have any nesting of data anywhere. It makes a lot of sense e.g. when modeling databases to an ORM, or when writing a compiler.
That’s an important point. If one is working with graph data, chances are high that it will need to be persisted anyway, in which case there’s barely any technology that will allow persisting the “object soup” natively (there were attempts at “object-oriented databases”, but it’s obviously a terrible idea, so it didn’t work out).
So now the “ease” of “object soup” comes with the additional difficulty of converting back and forth to the form I’d advocate in the first place as clearly better.
tools to actually debug “dangling pointer” problems
It’s not a pointer. It can’t be accidentally dereferenced while forgetting to check if it’s still valid. Sure, the logical coherence of the larger data structure needs to be taken care of, but that’s no different in GCed languages: if you point a child at a new parent but forget to update the old parent, are you going to get great results? If anything, all the “invalid values” (often including ‘generations’ in a slotmap) can actually crash the code in an orderly way and help spot the issue, instead of letting it “mostly work”.
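A toy version of the ‘generations’ idea (crates like slotmap and generational-arena implement this far more efficiently; all names here are invented):

```rust
// A slot stores a generation counter; a handle is only valid while its
// generation still matches, so a stale handle fails loudly instead of
// silently aliasing a reused slot.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Handle {
    index: usize,
    generation: u32,
}

struct Arena<T> {
    slots: Vec<(u32, Option<T>)>, // (generation, value)
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { slots: Vec::new() }
    }

    fn insert(&mut self, value: T) -> Handle {
        // Reuse a free slot if one exists, bumping its generation.
        for (i, slot) in self.slots.iter_mut().enumerate() {
            if slot.1.is_none() {
                slot.0 += 1;
                slot.1 = Some(value);
                return Handle { index: i, generation: slot.0 };
            }
        }
        self.slots.push((0, Some(value)));
        Handle { index: self.slots.len() - 1, generation: 0 }
    }

    fn remove(&mut self, h: Handle) -> Option<T> {
        let slot = self.slots.get_mut(h.index)?;
        if slot.0 == h.generation { slot.1.take() } else { None }
    }

    fn get(&self, h: Handle) -> Option<&T> {
        let slot = self.slots.get(h.index)?;
        if slot.0 == h.generation { slot.1.as_ref() } else { None }
    }
}

fn main() {
    let mut arena = Arena::new();
    let a = arena.insert("first");
    let _ = arena.remove(a);
    let b = arena.insert("second"); // reuses the slot with a new generation
    assert_eq!(arena.get(a), None); // the stale handle is detectably invalid
    assert_eq!(arena.get(b), Some(&"second"));
}
```

This is the answer to the slot-reuse objection raised earlier in the thread: reuse is allowed, and dangling handles become `None` rather than the wrong object.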
or if it has a parent: u32
It’s parent: NodeId.
Yes, which is a niche use case, which might be worth it depending on context, say, writing a specialized library.
That’s not a niche use case. The ubiquity of C and C++, despite their huge risks, is a result of being able to reuse them.
So you think a checked array access will have better optimizations by the compiler than pointers themselves?
Who cares? Rust does have references, including shared ones. You only really need to use indexes when you’re already working with a graph, where children need to point at parents, etc., and the whole thing is non-trivial already; the performance overhead of “non-native references” is akin to the cost of a JOIN in a database.
Normal relationships that fit a DAG (“tree of ownership”) are all one needs 99% of the time anyway.
Worry about all the memory barriers, extra memory usage, and cache invalidation caused by a GCed runtime, which slow down normal GCed code at all times to make the GC work, not about a 1% case in Rust that is often fast enough.
There is nothing fundamentally bad about self-referential data, in fact I think the vast majority of existing software would be vastly more complicated without them.
It wasn’t obvious to me, but the reasons for Rust not supporting self-referential structs can be said to be historical, not strictly technical (see 1 and 2).
I find that restriction quite limiting in the kinds of APIs I’m able to design. I often want to encapsulate two objects in a single struct, where one field references the other (e.g. rusqlite::Connection and rusqlite::Transaction). You can’t do that in current Rust because:
Fields of structs are dropped in declaration order.
Transferring ownership (i.e. moving) of a struct needs to be equivalent to memcpy().
In [2], withoutboats lays out a proposal for a Move trait that would invert the status quo: structs would be immovable by default and movable as an opt-in. Immovable types could be naturally self-referential and even structs that implement Move could manually instruct the compiler how to copy themselves, thereby solving (2). As for (1), it’s a simple convention that would need to be relaxed.
The biggest reason in my understanding why this is not happening is backward compatibility.
You’re falling into the trap of believing that there is a single path of progress towards some kind of ideal, which every language is moving towards or away from. Ideas in languages evolve the same way organisms do: In response to localised pressures, towards whatever gives them an advantage in that domain.
So, to your point: goto has indeed mostly “evolved” out of most programmers’ workflows and languages. But it is still alive and well in certain contexts, for example kernel development.
Totally agreed, there are major tradeoffs here. One that’s gotten a lot of discussion recently: The designers of async/await had to go through an enormous amount of trouble to fit futures into the language, since they’re fundamentally self-referential types. They settled on the Pin API, which is a leading contender for the trickiest module in the whole standard library.
Another interesting one: The small-string-optimization layout in GCC’s std::string is self-referential. Rust’s String doesn’t do SSO in general, but if it did, it would have to do something like what Clang does rather than what GCC does.
I don’t think SSO is typically self-referential in the sense that is not allowed in Rust. Typically SSO is a tagged choice between an inline buffer and a heap pointer (though real implementations are more optimized, using every available byte, and not hardcoded to 64-bit platforms).
There is no explicit reference to the inline buffer (as that would waste a word of SSO space). Rust can handle this perfectly fine, as a method like as_u8 can borrow a reference to the buffer.
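A rough, non-self-referential SSO layout might look like this in Rust (`SsoString` and the 22-byte inline capacity are illustrative choices, not what any real implementation uses):

```rust
// Small-string optimization without a self-reference: the enum tag,
// not a stored pointer, says where the bytes live.
enum SsoString {
    Inline { len: u8, buf: [u8; 22] },
    Heap(Vec<u8>),
}

impl SsoString {
    fn new(s: &str) -> Self {
        if s.len() <= 22 {
            let mut buf = [0u8; 22];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SsoString::Inline { len: s.len() as u8, buf }
        } else {
            SsoString::Heap(s.as_bytes().to_vec())
        }
    }

    // Borrowing the bytes needs a branch, but no pointer into
    // self is ever stored, so the value stays freely movable.
    fn as_bytes(&self) -> &[u8] {
        match self {
            SsoString::Inline { len, buf } => &buf[..*len as usize],
            SsoString::Heap(v) => v,
        }
    }
}

fn main() {
    let small = SsoString::new("hi");
    let big = SsoString::new("a string long enough to spill to the heap");
    assert_eq!(small.as_bytes(), &b"hi"[..]);
    assert!(big.as_bytes().len() > 22);
}
```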
Oops, I skipped over what’s probably the most important implementation detail: Clang’s string::data() has a branch to handle the small string case, but GCC’s string::data() does not.
Well said. I’m as big a member of the Rust Evangelism Strike Force as anyone, but it’s very frustrating to see so many people try to spin things that are clearly flaws/shortcomings/whatever as somehow actually good. It’s not just Rust fans, of course–I see it in all of the various programming language forums/subreddits/etc. It’s okay to admit that something sucks about the thing you like!
Using arenas in Rust is sometimes the best approach because of performance reasons, but frequently arenas are just awkward, bug prone (must keep your indexes in sync with arena updates), workarounds for Rust’s borrow rules.
In the seven years I’ve been working with Rust, the approach with indices described in the article has given me an absolutely awesome API twice, in quite large projects. It’s a very nice abstraction when you need to parse a file of code once and then refer to it later. Eventually you’ll have one allocated string sitting at the bottom, tokenized in the next layer, and then a database of typed token IDs built either in vectors or maps on top of the AST. The public API is built on top of the flat vectors of IDs, which can then take a reference directly from the string that represents the code.
It is very efficient and an extremely nice API to use. It has caused almost zero crashes and has accelerated all development in the corresponding code bases. It’s not that I’m trying to evangelize Rust; this way of modeling the data just works very nicely for a few hard problems I’ve needed to solve in my career.
Sure. And I guess I shouldn’t have specified “because of performance reasons”. Rather, my point was that sometimes the arena approach is actually the best approach to the problem, but sometimes it’s used because it’s the only option.
Phrased a different way: if you were writing your code in a GC’d ref-happy language and you’d still choose the arena approach, great. But if you wouldn’t choose the arena approach in the hypothetical language, then it seems hard to argue that it’s better than “object soup” or whatever. It’s like the infamous tree/graph data structures conundrum. Using trees and graphs to model data relationships is very often exactly what we want. The fact that I see so many (frankly, newbie) Rust fans try to tell people that they really don’t want/need trees is crazy.
Adding to what others have already said here, I like how a data-oriented module with flat vectors and references between them tends to read: one big struct that contains vectors of everything relevant, everything fitting densely in one or a few screenfuls. If you know all your representations are flat and only reference others through indexes into that big struct, you know you have all the data there. It’s not hidden in a vector inside a field nested deep down in the hierarchy.
Rust has raw pointers. You just have to write unsafe to dereference them. And unfortunately raw pointer field projection is currently painful, although improvements are on the way.
But a graph of raw pointers is rarely the best approach. And when it is, you can typically create a safe interface around the raw pointers derefs that enforces whatever invariants you’re using to ensure sound memory management.
‘Raw pointers’ in rust are not strong pointers—they can be forged. And the rules for when it is legal to do so are quite literally not specified (see e.g. tratt).
a graph of raw pointers is rarely the best approach
Why?
you can typically create a safe interface around the raw pointers derefs that enforces whatever invariants you’re using to ensure sound memory management
You advocate greenspunning. Green-spun subinterfaces tend to be difficult to use (since they are by definition and as a consequence of essential limitations incomplete), slow, and not modularly composable. Better that the language should grow to include such obviously and near-universally useful functionality.
From experience: With an adjacency list it is easier to manipulate the entire graph as a single structure. You have a single object that contains all the nodes in the graph, even if they’re disconnected or there are multiple “root” nodes. Adding edges is simpler because you don’t have to modify a particular node, and adding pairs of directed edges is simpler because you don’t have to modify two different nodes. You can easily print out and introspect the graph, since each node and edge automatically has a unique index number that doesn’t need to be tracked anywhere. It’s easy to iterate over all nodes and edges in arbitrary order, and also easy to find all the edges leading into a node. Removing nodes is far less error-prone because it’s so much easier to find all the edges that point to a particular node. It’s easy to modify to add fast lookups, by sorting your lists and doing binary search or by adding extra structures similar to database indices. I’m not going to even consider differences in performance, just algorithmic convenience… but the list goes on.
It’s counter-intuitive, but a mesh of separate objects referring to each other with pointers is kinda the worst representation for a graph.
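The adjacency-list approach described above can be sketched minimally like this (names are illustrative):

```rust
// A graph as two flat lists: node payloads and (from, to) index pairs.
struct Graph {
    nodes: Vec<&'static str>,
    edges: Vec<(usize, usize)>,
}

impl Graph {
    fn new() -> Self {
        Graph { nodes: Vec::new(), edges: Vec::new() }
    }

    fn add_node(&mut self, label: &'static str) -> usize {
        self.nodes.push(label);
        self.nodes.len() - 1
    }

    // Adding an edge never touches a node, only the edge list.
    fn add_edge(&mut self, from: usize, to: usize) {
        self.edges.push((from, to));
    }

    // Finding everything that points at a node is a simple scan,
    // which is awkward to do at all with a mesh of owning pointers.
    fn incoming(&self, node: usize) -> Vec<usize> {
        self.edges
            .iter()
            .filter(|&&(_, to)| to == node)
            .map(|&(from, _)| from)
            .collect()
    }
}

fn main() {
    let mut g = Graph::new();
    let a = g.add_node("a");
    let b = g.add_node("b");
    let c = g.add_node("c");
    g.add_edge(a, c);
    g.add_edge(b, c);
    assert_eq!(g.incoming(c), vec![a, b]);
    assert!(g.incoming(a).is_empty());
}
```

Disconnected nodes, multiple roots, and reverse traversal all fall out of the representation for free.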
True. It’s more of a spectrum than one might first expect though. The extreme case would be a game ECS system, but even in the Garnet compiler I occasionally look at the objects-with-pointers AST and IR trees and say “you know, this would have been easier if it was an array of nodes and indices”. It really seems like extra complexity but it works really well in practice.
I’m confused. By strong pointers, do you mean smart pointers? If so, Rust has Box, Rc, rc::Weak, etc.
You advocate greenspunning.
Do I? If that’s what you want to call it, okay.
Green-spun subinterfaces tend to be difficult to use (since they are by definition and as a consequence of essential limitations incomplete), slow, and not modularly composable.
Really? So Rust std lib collections are slow and not modularly composable? Compared to what?
Better that the language should grow to include such obviously and near-universally useful functionality.
How would you propose a language to grow to include functionality without someone having to implement it in the first place?
I think there are two options for what you could mean by strong pointers here, but correct me if I’m missing something:
There’s strong pointers where cycles don’t get collected automatically, like std::shared_ptr in C++ or ARC in Swift. I agree that these are simple in that it’s easy to define their semantics, but I’m not sure I’d agree that using them ubiquitously is simple. I haven’t written any Swift myself, but it seems easy to lose track of who’s keeping what alive, especially when there are lots of lambdas capturing things.
Or there’s strong pointers where cycles do get collected, like in Python (or maybe any other GC’d language, if we want to call those strong pointers). This is genuinely simple when all you have to deal with is freeing memory and there’s no FFI. But when other resources get involved or native code gets involved, and you have to start relying on finalizers and pinning, it gets extremely hairy. E.g. https://github.com/python/cpython/issues/60198.
One point of simplicity that Rust and C++ share (and I know how it sounds to praise C++ for its simplicity), is that it’s really easy to understand what destructors do. It’s a function that runs when things go out of scope, the end. (Caveats for move semantics in Rust and return-value-optimization in C++, but these aren’t footguns.) If you manage to touch an object after its destructor has already run, you did something illegal with pointers, don’t do whatever that was. There’s not much more to it than that. There’s no “object resurrection”, no special tricks to deal with teardown, no need to think about whether some object in some edge case might already have been finalized.
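That destructor model fits in a few lines of Rust (a toy example):

```rust
use std::cell::RefCell;

// A guard that records when its destructor runs.
struct Logged<'a> {
    name: &'static str,
    log: &'a RefCell<Vec<&'static str>>,
}

impl Drop for Logged<'_> {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

fn main() {
    let log = RefCell::new(Vec::new());
    {
        let _a = Logged { name: "a", log: &log };
        let _b = Logged { name: "b", log: &log };
        // Destructors run here, at end of scope, in reverse
        // declaration order. No finalizers, no resurrection.
    }
    assert_eq!(*log.borrow(), vec!["b", "a"]);
}
```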
What do you mean by “strong pointers”? To me that means “the opposite of weak pointers”, and weak pointers are ones that don’t imply ownership in a system with GC or refcounting, which is not particularly simple… so I’m wondering what I’m missing.
Partly also responding to your other comment. And cc @oconnor663
A pointer is something that denotes an object. A strong pointer denotes an object directly; a weak pointer denotes an object indirectly. For example, in java, if C is the name of a class or interface, then a variable of type C is a strong pointer. A memory address might be a strong pointer to a byte in memory, but it is not generally a strong pointer to an object conceived as a higher-level language-level abstraction. (In fact, a memory address is not even a strong pointer to a byte in any architecture with virtual memory.)
An array index, or a string to be looked up in a hash table—these are weak pointers. Weak pointers enable some forms of expressiveness and abstraction (for example, the index might not be into a single array, but into many of them; this allows to modularly and contextually add attributes to an object, like the transpose of stealth mixins), but are harder to reason about and, because they are essentially based on puns, amenable to misuse; hence, the most expressive languages, like lisp and self, encourage the use of strong pointers, in order to maintain local reasoning and to allow the encapsulated abstraction of representation (that object could still be represented using a single index and a collection of context arrays, apropos your performance considerations, but that detail will be hidden from the user of the object; similarly you could have, for example, an object encapsulating an entire graph, if you want one—but, if you have access to its nodes, you surely should be prevented from confusing the nodes of one graph with the nodes of another, should you not?). (See also, for example, explicit foreign key relationships in sql, though I am given to understand that these are not generally checked, unfortunately.)
There is one language which excels at the manipulation of indices: apl. I say this having written the better part of a j compiler in j (j is a dialect of apl), representing the program structure using an edge-list. I feel pretty good about that compiler and its ir. But other languages are not so oriented. They are oriented towards individual objects, encapsulation, and abstraction—scale and explicit interfaces. They lack facilities for expressively and concisely manipulating large arrays (no, the flavour-of-the-month pound of streamprocessing/numpy/whatnot being peddled does not count—rather than subordinate detail, it bows down before it, and the only thing it suggests is verbosity). I think the former represents an interesting ethos about what programming is. I think it would do most people who program good to write a program in apl (not learn to use it as a calculator—that is rather different, and does not properly impart the values). But it is important to understand that this ethos is divorced from and in direct opposition to the values which underlie the design of practically every other program and programming language today. There is a reason that, when apl programmers of the past few decades write c, they often write it like this; if you do not write your code like that, but still advocate the free use of weak pointers, you are committing a half measure and must think about why.
Weak pointers still make a great deal of sense, sometimes. But generally speaking they should not be the default.
The real world is built on weak pointers. The universe can’t tell one calcium atom from another. We go to a great deal of effort to create the illusion that computers underlain by strong pointers exist. Only to throw all that away and revert to weak pointers, stacked on top of the strong ones; why?
I like this distinction and I think it’s a useful concept to think about, but I found the use of the terms ‘strong pointer’ and ‘weak pointer’ confusing, since here they mean something totally unrelated to ‘strong reference’ and ‘weak reference’. I’d encourage you to think of a different name for this concept.
I’ve never seen it used in that way, but that doesn’t mean it isn’t. I’ve never seen the concept this cleanly articulated either, though it’s come up in a load of discussions that I’ve been involved in.
I think this concept is better described by objects having identity (strong pointer) vs not.
Let me illustrate this with Java, which as of yet only has identity semantics, though value classes are planned. The old Date class is mutable with an identity. You can have an instance of it, and even if it equals another instance, they are not replaceable with each other — one might be locked by one thread, or another thread might have a reference and could decide to mutate it.
In contrast, the newer LocalDate class is actually designed with identity-less value classes in mind, e.g. it is immutable. There are still ways a program can semantically differentiate between two instances of the LocalDate class (the aforementioned synchronization, also ==), but it is possible to migrate it to an identity-less class, and then any two instances with the same value can be replaced with each other. (This allows better optimizations, e.g. a list can store them in-place, instead of storing pointers to them.)
Let me quote Guy Steele here for a, to me, eye-opening statement: the concept of side-effect is inseparable from the notion of identity. The only way one can observationally determine that a side effect has occurred is when the same object behaves in two different ways at different times.
Object identity is orthogonal to the concept of strong references. In particular, there are languages of total functions which don’t include pointers; every reference in those languages is strong, but object identity is trivial and comparisons are always done by value.
That said, you are right from a certain perspective: languages with mutability must necessarily mix their identity/comparison semantics with their reference semantics, because a mutable reference can change its identity as execution progresses. This is what Steele is getting at. Without mutability, though, the “same object” will never change behavior “at different times.”
Indeed, and to take this a step further, the use of weak pointers in a total language enables a form of mutability via state-passing style, whereas for data pointed to strongly, there is no danger of their changing.
I’m not really a fan of this style, since compared to GC languages, it’s a compromise on both:
Type safety – a T* is type safe, but an index into an array of T is not. In this example you only have Person, but if you had Person and Job that would be more clear.
Memory safety – indexes can dangle; they can refer to invalid items. The unsafety can be logical as well as physical
BTW I think the genius of C is that it’s 2 languages in one – it has value semantics (tight-packing, zero alloc), AND reference semantics (arbitrary graphs of objects – “object soup” is a bit pejorative).
It seems like Rust is too heavily skewed towards the former. If it wants to “replace” C, it seems like it should be stronger at the latter. I like the examples in this post, since they show that.
I always point to these articles on representing DFAs/NFAs with graphs of pointers to tiny structs as extremely elegant C code:
Type safety – a T* is type safe, but an index into an array of T is not. In this example you only have Person, but if you had Person and Job that would be more clear.
Memory safety – indexes can dangle; they can refer to invalid items. The unsafety can be logical as well as physical
In Rust, one prominent library and pattern for addressing the problem of dangling indexes (“ABA problem”) is generational-arena.
To mitigate the risk of confusing Person indices with Job indices, one could store them and pass them as separate struct PersonId(u64) and struct JobId(u64) types.
SQL foreign keys (at least if they’re integers) are a similar use of indices in place of pointers, and SQL has (potentially depending on implementation) different ways of addressing these problems, or at least the “memory safety” problem.
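A sketch of the newtype-ID suggestion above (`World` and the field names are made up):

```rust
// Newtype wrappers make Person and Job indices distinct types,
// so the compiler rejects mixing them up.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PersonId(u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct JobId(u64);

struct World {
    people: Vec<&'static str>,
    jobs: Vec<&'static str>,
}

impl World {
    fn person(&self, id: PersonId) -> &str {
        self.people[id.0 as usize]
    }

    fn job(&self, id: JobId) -> &str {
        self.jobs[id.0 as usize]
    }
}

fn main() {
    let w = World {
        people: vec!["alice"],
        jobs: vec!["plumber"],
    };
    assert_eq!(w.person(PersonId(0)), "alice");
    assert_eq!(w.job(JobId(0)), "plumber");
    // w.person(JobId(0)) would be a compile error:
    // expected PersonId, found JobId.
}
```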
I think Rust’s story here has several sides, and different people emphasize different parts: Safe Rust tightly restricts the kinds of pointer graphs you can build. Unsafe Rust lets you go nuts and do whatever you want with raw pointers. (It’s even more permissive than C. There’s no “strict aliasing” rule.) But safe Rust also exerts a lot of gravity on unsafe Rust. In part that’s because everyone agrees culturally that you’re “not supposed to do that” when you have a choice, and the ecosystem is reluctant to accept unsafe APIs that could’ve been safe. In another part, it’s because the rules you have to follow at the boundary between safe and unsafe code are more complicated than the normal rules for pointers in C. (All shared and mutable references in safe code are “restrict” pointers in C terms, and it’s easy to violate that requirement when you’re doing unsafe casts.)
I’d love to know the answer to your regex question, but the best I can do is point to some sources.
If there’s a holy gospel of regular expressions in Rust, it’s Andrew Gallant’s 2016 release announcement / article / nearly-a-book for ripgrep: https://blog.burntsushi.net/ripgrep. In there he links to a different entry in the Russ Cox series that you mentioned. (Ctrl-f for “Russ Cox”.)
The dfa and hybrid modules use unsafe to explicitly elide bounds checks in the core search loops. This makes the codegen tighter and typically leads to consistent 5-10% performance improvements on some workloads.
OK yeah I actually agree – I was actually thinking last night that this wasn’t the best example, since all the indices could be into the same data structure.
It makes the syntax uglier to use indices, but now I do think there’s a very straightforward translation of that code (although I would still be interested in seeing it! I wish I had time to do that experiment).
A better example would have more heterogeneous data types and lifetimes. But C doesn’t handle the lifetime issue either …
(Though it seems like getting the generational indices errors at runtime isn’t too different than getting ASAN errors at runtime, but I haven’t tried the former. Also, in both cases you’re kinda using an “extra” tool on top of the language.)
A big difference between ASan and these other approaches is that (last I heard, as far as I know) you’re not supposed to run ASan in production. It has a large performance cost, but maybe more importantly it opens up new attack surfaces.
This is one of the most brilliant ways of structuring trees that I’ve learned in the past few years. It serializes to JSON as-is, even if the structure has cycles, and it is very easy to iterate using ranges and easy to index using any sort of map structure.
I usually make a newtype for each kind of ID (here we could have a `FriendId(pub usize)`) and a special walker structure:
As another step, I wonder if some of the concerns raised here about safety w/r/t indexes vs pointers would be addressed by a “flyweight” API that encapsulates & hides the underlying arrays. I’m out of my depth with Rust, but maybe something like
I think this might be missing another ergonomic(-ish) approach, which is to make a Person type which is wrapping some Rc-d data (PersonData?), can be copied, etc, doing the “right thing”, with some helper methods or Deref implementations or whatever.
You can stick to indexes as well (and just have a ‘static Person arena), but it feels a bit silly to throw around numbers when you’re one wrapper class away from having “normal” ergonomics IMO.
Ultimately if you have an ECS system though, with objects popping into and out of existence, you will always need some sort of liveness check on objects. There’s no way around it! Things are given references to people, and then use those references later on. If your people can fall out of existence, you need liveness checks. See every “game development in Rust” talk out there.
The Rc is very problematic when you need cycles between entities. It leaks memory if the other side is not Weak, and the ergonomics of a Weak require an extra unwrap before the data can be used. Also, referencing this data is suddenly very hard, and there is no real ownership for the data anymore. And if you need a Send boundary and threads, all your Rcs need to be Arcs, which might have serious implications for the performance of the software: an atomic can be poison to the caches. Finding these performance regressions in a big application is very hard.
The ids work very nicely if you wrap them in a structure, which I explained in another comment:
What is really awesome with this approach is how you can extend the setup to different use-cases. Need to look up the city where a person lives? Add a HashMap<PersonId, CityId> to the schema and off you go. This is not a Rust-only solution; it works very well in any statically typed language.
I think it’s hard to go too much into this in a context-free way, since the use case is what matters, but with PersonIds you still have the issue of “is this ID still referencing a ‘live’ entity” (so you will need to unwrap just like with Weak). And if you are saying “no, actually everything is always live and we don’t GC” then memory leaks are a non-issue (or you stick everything in a ’static and be done with it).
Making Ids explicit makes the Rust borrow checker happy, but I really think it doesn’t resolve the fundamental issue in entity systems, which is items no longer existing despite references to them still being somewhere else in the system. And you can handle that fine, if you have some unwrapping step. And if stuff can never be removed, then just “leak” memory (after all, if your stuff is immortal, it’s not really a leak, is it?). And Rc might not be the right solution for that, but some arena works fine.
I’m saying this, but I do think it’s interesting to explore stuff like your Walker structure anyways.
Keep in mind that in all the problems we solved with this structure, the data was immutable after creation. It’s a basis for an ORM or a GraphQL server where you want to load a schema file or database definition into memory, with a need for fast serialization and deserialization.
If there would be a need for deletion of items and mutation during runtime, I’d probably go with slotmap instead.
The next step after thinking of these as indexes is to think of them as identifiers of entities, and that our data structures are holding aspects we assign to identifiers. Which is what relational databases and entity component systems do. It has the nice property that you can take various aspects a la carte instead of trying to have some kind of object that represents the entity.
I downloaded the same PDF a couple weeks ago, and I’ve been meaning to read through it :) Someone who knows this stuff should really write an article like “here are some programs you can write in Rust but not Hylo and vice versa”.
The literal rule for what you can express in Hylo is: any Rust function that, if you made its lifetimes explicit, has at most one lifetime parameter. If that lifetime parameter is used in the return type, you need a “subscript”; otherwise a regular function will do. Oh, and references can only appear at the top level; that is very important. Vice versa is simpler: Rust is strictly more expressive.
But that doesn’t quite answer “what programs you can write”, because there’s the question of writing “the same thing” a different way.
A Rust function without references can obviously be written in Hylo. If it has references only on the inputs, that can be expressed in Hylo:
fn foo(arg1: &T, arg2: &mut T) -> T // works in Hylo
Every Rust function with references only in its inputs can be written with a single lifetime parameter. Put differently, lifetime parameters are only ever needed when there’s an output reference. Thus the above function can be written with only one lifetime parameter:
fn foo<'a>(arg1: &'a T, arg2: &'a mut T) -> T
If a Rust function has references on both its inputs and output, then it’s expressible in Hylo only if there’s at most one lifetime parameter when you write out the lifetimes. So:
fn foo<'a>(&'a mut self, &'a T) -> &'a T // works in Hylo, needs to be implemented with a "subscript"
fn foo<'a, 'b>(&'a mut self, &'b T) -> &'a T // does not work in Hylo
The other, bigger, restriction is that references have to be top-level. You can’t nest them. So no Option<&str>, or HashMap<i32, &User>. And no iterators over references! No such thing as split_words() -> impl Iterator<Item = &str>!
Instead of a function returning an Option<&str>, the function would take a callback and invoke it only if the option is populated. And iterators need to be internal iterators: to iterate over a collection, you pass a closure to the collection and the collection invokes that closure once per element.
Please ping me
I’ve added a file to my blog post ideas folder, and it says to ping you, so I’ll definitely ping you if I write it. Though no promises, as that ideas folder grows more than it shrinks…
This reminds me of discussions years ago about whether Java or C++ (or Python) is „better“… People argued that it is better to have such options (like pointers or duck-typing) than not to have them, and that it is up to the programmer to use these options wisely. They also said that „limited“ languages like Java are bad, because they make some advanced things impossible or difficult/annoying (it is not as smooth as pure Java if you have to use JNI/JNA to be native like C/C++, or if you have to use the Reflection API to be dynamic like Python). I argued that the „limited“ set of features that Java offered was better (for business applications) and that it was well-balanced between „powerful“ and „safe“.
A: I miss this feature.
B: You do not need it!
A: But I really need it!
B: So, here you are...
A: But it is too uncomfortable/cumbersome/difficult to use.
B: We made it this ugly to discourage you from using it. :-)
We mean well. We do not want you to shoot yourself in the foot.
It is still a question, what does „well-balanced“ mean and how powerful, foolproof or complex the language should be.
I feel like all this debate over the goodness or badness of arbitrary pointers in imperative languages, from the original post to the >35 comments in the last ~15 hours, could be improved with sufficient appreciation of SQL or other relational languages. The potential benefit is not merely taking on some restrictions in data structure design, but taking on such restrictions in return for access to an additional paradigm of programming. The entity–component–system design seems like a step towards this. All that said, though, I don’t feel sufficiently learned in relational programming myself to say clearly what one would gain.
I think that the points around making things serializable are interesting and valid. But I think that it’s important to acknowledge that the Rust type system does in fact involve trade-offs. Self-referential structures aren’t just “object soup”. They are a natural and fundamental way of modeling information, and when a language makes it awkward to express those relationships it has given up something. Rust gave it up in return for memory safety without GC, which is a valid tradeoff.
One could say the same thing about `goto`. Just s/information/code/.

The progress in programming comes from increasing global composability by eliminating local expressiveness that interferes with it.
In my experience forcing developers to express information relationships in a bigger framework of tree of ownership leads to substantially better software.
I would argue that the comparison to goto is not completely apt — the two tradeoffs offer very different benefits. In the goto case you barely lose anything (almost every usage can be converted to structured control-flow), while your reasoning abilities are greatly increased. It is no accident Dijkstra’s paper became so widespread.
Not having a GC is a valid tradeoff in some (niche) use cases, as mentioned, but it is very far from a ubiquitous one. There is nothing fundamentally bad about self-referential data, in fact I think the vast majority of existing software would be vastly more complicated without them.
Almost every use of self-referencing can be expressed as indexes, with increased safety, ability to reason about and composability.
Not having a GC means the code has no runtime, which increases its composability by making it potentially usable in everything. No one will use Java code from Python, or JS, etc., while almost every language can use Rust because it doesn’t have a runtime.
It also allows a practical aliasing model, which enables efficient memory safety and concurrency.
Small concession, big wins.
“There’s nothing fundamentally bad about `goto`.” Quite a few elegant `goto` patterns in C that I don’t miss, but used to employ. Dropping it too was a relatively minor concession for the ability to reason about and compose larger software.
Moving references out of the type system does not increase safety. If you have a pointer to an object of type T, and the type system knows that it is a `T*`, then static and dynamic checkers (and run-time enforcement techniques) can ensure that your `T*` does not outlive the underlying T, or that it is explicitly invalidated if it does. In contrast, if you refer to objects via an index then you are hiding this relationship from the type system. You are doing manual memory management again. It’s not quite as bad as C, because a dangling index will point to a bounds-checked object and so will either be detectably invalid or point to some T, but there is nothing that guarantees that it points to the T that you think it does.

If you delete an object from a graph, you now have a free slot in your index table. Do you reuse that? Now you have a mechanism for introducing use-after-free bugs. A dangling index to that slot will refer to the wrong object. Alternatively, do you avoid reusing the slot? Now you have a memory leak and have increasingly sparse data structures that become progressively more inefficient to traverse.
This kind of model works fairly well for things like the CFG in the Rust compiler, because those are short-lived structures and so it’s fine to simply burn slots for deallocated basic blocks and free the whole thing at the end (though it’s still more effort than it would be to use explicit traced references or weak references in a reference-counted environment). They work very poorly for data structures that change shape over time.
People happily used Objective-C from Ruby, Python, and Java (and even Haskell and Ocaml) via mechanically generated FFI, and from C/C++ via hand-written FFI. Having a runtime didn’t prevent this. Having reference counting in the runtime made it easy to build bridges that automatically surfaced Objective-C objects in these other languages. Objective-C’s perception as an Apple language was the largest barrier to this.
Except that now you’re advocating for design patterns that throw away temporal safety. Type-pooled allocators in C/C++ have the same level of temporal safety as your proposal. They’re much better than nothing (which is why Apple now uses them in the XNU kernel) but they’re a long way away from ideal.
C’s goto is absolutely nothing like the goto that existed prior to structured programming. It has well-defined behaviour with respect to lexically scoped objects (you can’t branch over a local variable’s declaration, for example) and it is not permitted to branch to labels in a different function.
You can branch over a local variable’s declaration. The following is legal:
The only restriction that the C standard places on goto’s are that (1) the jump must be within the same function, and that (2) you cannot jump into a scope containing a variable-length array. (C23 6.8.6.1 “The goto statement”, draft N3096), so the following would be invalid:
EDIT: fixed formatting.
s/increased/decreased. Even naive C has better tools to actually debug “dangling pointer” problems — valgrind will cry about it, but there is nothing to do if you store an index and the associated value was replaced in the meanwhile. Let alone a managed memory language.
And come on, increased ability to reason about? What is clearer: that my Node has a `parent: Node`, or that it has a `parent: u32`? In a managed language the former has only a single invalid value (or, if the language is null-aware, none), while the latter has potentially 2^32, let alone the case of dangling pointers.

Yes, which is a niche use case, which might be worth it depending on context, say, writing a specialized library. Most people want to write applications, though, where this concern is negligible and can be circumvented by IPC.
So you think a checked array access will have better optimizations by the compiler than pointers themselves? Especially since one of Rust’s strong points is that it can actually “use the restrict keyword” properly, thanks to the borrow checker? This is an unfounded statement on your part.
There is definitely value in restricting the “degree of freedom” of code, see types, but as most things, it is always a tradeoff. We don’t generally use regular programs, even though they are much easier to reason about than Turing-complete ones.
I think you misunderstood this. The idea is not to just replace the pointer with an index in the object, but to implement another vector of data in the main struct:
This works if your data is immutable, so you can sort it and use ranges and binary search. If you mutate your data during runtime, using a map works better:
Although if the data is kept in memory for a long time and mutates during its lifetime, I’d think twice about using indices like this before jumping into writing the code.
Again, you have all your data in flat structures and you do not have any nesting of data anywhere. It makes a lot of sense e.g. when modeling databases to an ORM, or when writing a compiler.
That’s an important point. If one is working with graph data, chances are high that it will need to be persisted anyway, in which case there’s barely any technology that will allow persisting the “object soup” natively (there were attempts at “object-oriented databases”, but obviously it’s a terrible idea so it didn’t work out).
So now the “ease” of “object soup” has an additional difficulty of dealing with the conversion back and forth to the form I’d advocate for in the first place as clearly better.
It’s not a pointer. It can’t be accidentally dereferenced while forgetting to check if it’s still valid. Sure, the logical coherence of the larger data structure needs to be taken care of, but that’s not unlike GCed languages. In a managed language, if you point a child at a new parent but forget to update the old parent, are you going to get great results? If anything, all the “invalid values” (often including ‘generations’ in a slotmap) can actually crash the code in an orderly way and help spot the issue, instead of letting it “mostly work”.
It’s `parent: NodeId`.

That’s not a niche use-case. The ubiquity of C and C++, despite their huge risks, is a result of being able to reuse them.
Who cares. Rust does have references, including shared ones. You only really need to use indexes when you’re already working with a graph, and children need to point at parents, etc., and the whole thing is non-trivial already, and the performance overhead of “non-native references” is akin to the cost of a `JOIN` in a database.

Normal relationships that fit a DAG (“tree of ownership”) are all one needs 99% of the time anyway.

Worry about all the memory barriers, extra memory usage, and cache invalidation caused by a GCed runtime, which slows down the normal GCed code at all times to make the GC work, not about a 1% case in Rust that is often fast enough.
It wasn’t obvious to me, but the reasons for Rust not supporting self-referential structs can be said to be historical, not strictly technical (see 1 and 2).
I find that restriction quite limiting in the kind of APIs I’m able to design. I often want to encapsulate two objects in a single struct, where one field references the other (e.g. `rusqlite::Connection` and `rusqlite::Transaction`). You can’t do that in current Rust because: (1) every type is assumed to be movable, and (2) a move is a plain `memcpy()`.

In [2], withoutboats lays out a proposal for a `Move` trait that would invert the status quo: structs would be immovable by default and movable as an opt-in. Immovable types could be naturally self-referential, and even structs that implement `Move` could manually instruct the compiler how to copy themselves, thereby solving (2). As for (1), it’s a simple convention that would need to be relaxed.

The biggest reason in my understanding why this is not happening is backward compatibility.
For reference, these were discussed here as https://lobste.rs/s/6fjkeh/why_async_rust and https://lobste.rs/s/tfsr57/changing_rules_rust.
In fact, every program with goto can be converted into one with only loops and conditionals: https://en.wikipedia.org/wiki/Structured_program_theorem
Yes, though arguably there are a few usage patterns where `goto`, or its tamer little brother `break`, is much cleaner.

You’re falling into the trap of believing that there is a single path of progress towards some kind of ideal, which every language is moving towards or away from. Ideas in languages evolve the same way organisms do: in response to localised pressures, towards whatever gives them an advantage in that domain.
So to your point: goto has indeed mostly “evolved” out of most programmers’ workflows and languages. But it is still alive and well in certain contexts, for example kernel development.
Totally agreed, there are major tradeoffs here. One that’s gotten a lot of discussion recently: the designers of async/await had to go through an enormous amount of trouble to fit futures into the language, since they’re fundamentally self-referential types. They settled on the `Pin` API, which is a leading contender for the trickiest module in the whole standard library.

Another interesting one: the small-string-optimization layout in GCC’s `std::string` is self-referential. Rust’s `String` doesn’t do SSO in general, but if it did, it would have to do something like what Clang does rather than what GCC does.

I don’t think SSO is typically self-referential in the sense that is not allowed in Rust. Typically SSO looks something like this (but more optimized, to use every available byte, and not hardcoded to 64-bit platforms).
There is no explicit reference to the inline buffer (as that would waste a word of SSO space). Rust can handle this perfectly fine, as a method like `as_u8` can borrow a reference to the buffer.

You’re right about what’s typical, and for example that’s what Clang’s implementation does, but GCC’s `std::string` is different. Reading C++ stdlib headers is the worst, but the most direct evidence of this is that Clang’s data pointer is only present in the long representation, while GCC’s data pointer is always present. Also, Clang has a fast path for bitwise copying short strings, but GCC does not, because it can’t copy the self-referential data pointer. Since this steals 8 bytes of capacity from the small representation, GCC pads `std::string` to 32 bytes.

Oops, I skipped over what’s probably the most important implementation detail: Clang’s `string::data()` has a branch to handle the small string case, but GCC’s `string::data()` does not.

Well said. I’m as big a member of the Rust Evangelism Strike Force as anyone, but it’s very frustrating to see so many people try to spin things that are clearly flaws/shortcomings/whatever as somehow actually good. It’s not just Rust fans, of course; I see it in all of the various programming language forums/subreddits/etc. It’s okay to admit that something sucks about the thing you like!
Using arenas in Rust is sometimes the best approach because of performance reasons, but frequently arenas are just awkward, bug prone (must keep your indexes in sync with arena updates), workarounds for Rust’s borrow rules.
In the past seven years I’ve been working with Rust, the approach with indices described in the article has been an absolutely awesome API in two quite large projects. It’s a very nice abstraction when you need to parse a file of code once and then refer to it later. Eventually you’ll have one allocated string sitting at the bottom, tokenized in the next layer, and then a database of typed token IDs built either in vectors or maps on top of the AST. The public API is built on top of the flat vectors of IDs, which can then take a reference directly from the string that represents the code.
It is very efficient and an extremely nice API to use. It has caused almost zero crashes and has accelerated all development in the corresponding code bases. It’s not that I’m trying to evangelize Rust, but this way of modeling the data works very nicely for a few hard problems I’ve needed to solve in my career.
Sure. And I guess I shouldn’t have specified “because of performance reasons”. Rather, my point was that sometimes the arena approach is actually the best approach to the problem, but sometimes it’s used because it’s the only option.
Phrased a different way: if you were writing your code in a GC’d ref-happy language and you’d still choose the arena approach, great. But if you wouldn’t choose the arena approach in the hypothetical language, then it seems hard to argue that it’s better than “object soup” or whatever. It’s like the infamous tree/graph data structures conundrum. Using trees and graphs to model data relationships is very often exactly what we want. The fact that I see so many (frankly, newbie) Rust fans try to tell people that they really don’t want/need trees is crazy.
The simplest tool is a strong pointer in a language which hasn’t been hamstrung.
Adding to what others have already said here, I like how a data-oriented module with flat vectors and references between them tends to read: one big struct that contains vectors of everything relevant, everything fitting densely in one or a few screenfuls. If you know all your representations are flat, and only reference others through indexes into that big struct, you know you have all the data there. It’s not hidden in a vector inside a field nested deep down in the hierarchy.
Rust has raw pointers. You just have to write `unsafe` to dereference them. And unfortunately raw pointer field projection is currently painful, although improvements are on the way.

But a graph of raw pointers is rarely the best approach. And when it is, you can typically create a safe interface around the raw pointer derefs that enforces whatever invariants you’re using to ensure sound memory management.
‘Raw pointers’ in Rust are not strong pointers—they can be forged. And the rules for when it is legal to do so are quite literally not specified (see e.g. tratt).
Why?
You advocate green-spunning. Green-spun subinterfaces tend to be difficult to use (since they are by definition, and as a consequence of essential limitations, incomplete), slow, and not modularly composable. Better that the language should grow to include such obviously and near-universally useful functionality.
From experience, with an adjacency list it is easier to manipulate the entire graph as a single structure:

- You have a single object that contains all the nodes in the graph, even if they’re disconnected or there are multiple “root” nodes.
- Adding edges is simpler because you don’t have to modify a particular node, and adding pairs of directed edges is simpler because you don’t have to modify two different nodes.
- You can easily print out and introspect the graph, since each node and edge automatically has a unique index number that doesn’t need to be tracked anywhere.
- It’s easy to iterate over all nodes and edges in arbitrary order, and also easy to find all the edges leading into a node.
- Removing nodes is far less error-prone because it’s so much easier to find all the edges that point to a particular node.
- It’s easy to modify to add fast lookups, by sorting your lists and doing binary search or by adding extra structures similar to database indices.

I’m not going to even consider differences in performance, just algorithmic convenience… but the list goes on.
It’s counter-intuitive, but a mesh of separate objects referring to each other with pointers is kinda the worst representation for a graph.
That’s true for mathematical graphs.
Less true when you have object graphs with many different fields connecting vastly different kinds of objects in different ways.
True. It’s more of a spectrum than one might first expect though. The extreme case would be a game ECS system, but even in the Garnet compiler I occasionally look at the objects-with-pointers AST and IR trees and say “you know, this would have been easier if it was an array of nodes and indices”. It really seems like extra complexity but it works really well in practice.
I’m confused. By strong pointers, do you mean smart pointers? If so, Rust has Box, Rc, rc::Weak, etc.
Do I? If that’s what you want to call it, okay.
Really? So Rust std lib collections are slow and not modularly composable? Compared to what?
How would you propose a language to grow to include functionality without someone having to implement it in the first place?
I think there are two options for what you could mean by strong pointers here, but correct me if I’m missing something:
There’s strong pointers where cycles don’t get collected automatically, like `std::shared_ptr` in C++ or ARC in Swift. I agree that these are simple in that it’s easy to define their semantics, but I’m not sure I’d agree that using them ubiquitously is simple. I haven’t written any Swift myself, but it seems easy to lose track of who’s keeping what alive, especially when there are lots of lambdas capturing things.

Or there’s strong pointers where cycles do get collected, like in Python (or maybe any other GC’d language, if we want to call those strong pointers). This is genuinely simple when all you have to deal with is freeing memory and there’s no FFI. But when other resources get involved or native code gets involved, and you have to start relying on finalizers and pinning, it gets extremely hairy. E.g. https://github.com/python/cpython/issues/60198.
One point of simplicity that Rust and C++ share (and I know how it sounds to praise C++ for its simplicity), is that it’s really easy to understand what destructors do. It’s a function that runs when things go out of scope, the end. (Caveats for move semantics in Rust and return-value-optimization in C++, but these aren’t footguns.) If you manage to touch an object after its destructor has already run, you did something illegal with pointers, don’t do whatever that was. There’s not much more to it than that. There’s no “object resurrection”, no special tricks to deal with teardown, no need to think about whether some object in some edge case might already have been finalized.
What do you mean by “strong pointers”? To me that means “the opposite of weak pointers”, and weak pointers are ones that don’t imply ownership in a system with GC or refcounting, which is not particularly simple… so I’m wondering what I’m missing.
Partly also responding to your other comment. And cc @oconnor663
A pointer is something that denotes an object. A strong pointer denotes an object directly; a weak pointer denotes an object indirectly. For example, in java, if C is the name of a class or interface, then a variable of type C is a strong pointer. A memory address might be a strong pointer to a byte in memory, but it is not generally a strong pointer to an object conceived as a higher-level language-level abstraction. (In fact, a memory address is not even a strong pointer to a byte in any architecture with virtual memory.)
An array index, or a string to be looked up in a hash table—these are weak pointers. Weak pointers enable some forms of expressiveness and abstraction (for example, the index might not be into a single array, but into many of them; this allows one to modularly and contextually add attributes to an object, like the transpose of stealth mixins), but are harder to reason about and, because they are essentially based on puns, amenable to misuse; hence, the most expressive languages, like lisp and self, encourage the use of strong pointers, in order to maintain local reasoning and to allow the encapsulated abstraction of representation (that object could still be represented using a single index and a collection of context arrays, apropos your performance considerations, but that detail will be hidden from the user of the object; similarly you could have, for example, an object encapsulating an entire graph, if you want one—but, if you have access to its nodes, you surely should be prevented from confusing the nodes of one graph with the nodes of another, should you not?). (See also, for example, explicit foreign key relationships in sql, though I am given to understand that these are not generally checked, unfortunately.)
There is one language which excels at the manipulation of indices: apl. I say this having written the better part of a j compiler in j (j is a dialect of apl), representing the program structure using an edge-list. I feel pretty good about that compiler and its ir. But other languages are not so oriented. They are oriented towards individual objects, encapsulation, and abstraction—scale and explicit interfaces. They lack facilities for expressively and concisely manipulating large arrays (no, the flavour-of-the-month pound of streamprocessing/numpy/whatnot being peddled does not count—rather than subordinate detail, it bows down before it, and the only thing it suggests is verbosity). I think the former represents an interesting ethos about what programming represents. I think it would do most people who program good to write a program in apl (not learn to use it as a calculator—that is rather different, and does not properly impart the values). But it is important to understand that this ethos is divorced from and in direct opposition to the values which underlie the design of practically every other program and programming language today. There is a reason that, when apl programmers of the past few decades write c, they often write it like this; if you do not write your code like that, but still advocate the free use of weak pointers, you are committing a half measure and must think about why.
Weak pointers still make a great deal of sense, sometimes. But generally speaking they should not be the default.
The real world is built on weak pointers. The universe can’t tell one calcium atom from another. We go to a great deal of effort to create the illusion that computers underlain by strong pointers exist, only to throw all that away and revert to weak pointers, stacked on top of the strong ones; why?
I like this distinction and I think it’s a useful concept to think about, but I found the use of the terms ‘strong pointer’ and ‘weak pointer’ to mean something totally unrelated to ‘strong reference’ and ‘weak reference’ confusing. I’d encourage you to think of a different name for this concept.
I thought it was a standard usage; apparently not. I was a bit surprised to see the confusion expressed in this thread.
I’ve never seen it used in that way, but that doesn’t mean it isn’t. I’ve never seen the concept this cleanly articulated either, though it’s come up in a load of discussions that I’ve been involved in.
I think this concept is better described by objects having identity (strong pointer) vs not.
Let me illustrate this with Java, which as of yet only has identity semantics, but value classes are in the plans. The old `Date` class is mutable with an identity. You can have an instance of it, and even if it equals another instance, they are not replaceable with each other — one might be locked by one thread, or another thread might have a reference and could decide to mutate it.

In contrast, the newer `LocalDate` class is actually designed with identity-less value classes in mind; e.g. it is immutable. There are still ways a program can semantically differentiate between two instances of a `LocalDate` class (the aforementioned synchronization, also `==`), but it is possible to migrate it to an identity-less class, and then any two instances of the same value can be replaced. (This allows better optimizations; e.g. a list can store them in-place, instead of storing pointers to them.)
Let me quote Guy Steele here, for a (to me) eye-opening statement: “the concept of side effect is inseparable from the notion of identity. The only way one can observationally determine that a side effect has occurred is when the same object behaves in two different ways at different times.”
Object identity is orthogonal to the concept of strong references. In particular, there are languages of total functions which don’t include pointers; every reference in those languages is strong, but object identity is trivial and comparisons are always done by value.
That said, you are right from a certain perspective: languages with mutability must necessarily mix their identity/comparison semantics with their reference semantics, because a mutable reference can change its identity as execution progresses. This is what Steele is getting at. Without mutability, though, the “same object” will never change behavior “at different times.”
Indeed, and to take this a step further, the use of weak pointers in a total language enables a form of mutability via state-passing style, whereas for data pointed to strongly, there is no danger of its changing.
I’m not really a fan of this style, since compared to GC languages, it’s a compromise on both: `T*` is type safe, but an index into an array of `T` is not. In this example you only have `Person`, but if you had `Person` and `Job` that would be more clear.

BTW I think the genius of C is that it’s 2 languages in one – it has value semantics (tight-packing, zero alloc), AND reference semantics (arbitrary graphs of objects – “object soup” is a bit pejorative).
It seems like Rust is too heavily skewed towards the former. If it wants to “replace” C, it seems like it should be stronger at the latter. I like the examples in this post, since they show that.
I always point to these articles on representing DFAs/NFAs with graphs of pointers to tiny structs as extremely elegant C code:
https://swtch.com/~rsc/regexp/regexp1.html
https://swtch.com/~rsc/regexp/nfa.c.txt
It took me quite a while to understand how to use pointers well: if you’re scratching your head at the code, it’s worth studying!
Honest question: can you even write this code with Rust’s borrow checker, or do you have to use indices? It’s just 400 lines of C.
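Not an answer for the full 400 lines, but as a sketch of the flavor: the core of a Thompson-style NFA simulation can be written in safe Rust with `usize` indices into a `Vec` standing in for the `State*` pointers. This is my own illustrative code, not a translation of the linked `nfa.c`:

```rust
// Illustrative sketch (my own code, not a translation of Russ Cox's
// nfa.c): NFA states live in a Vec, and usize indices stand in for
// the `State*` pointers.
#[derive(Clone, Copy, Debug)]
enum Trans {
    Char(u8, usize),     // consume this byte, then go to state
    Split(usize, usize), // epsilon-split into two states
    Match,               // accepting state
}

struct Nfa {
    states: Vec<Trans>, // state 0 is the start state
}

impl Nfa {
    // Thompson-style simulation: track the set of live states.
    fn matches(&self, input: &[u8]) -> bool {
        let mut current = vec![0usize];
        self.add_eps(&mut current);
        for &b in input {
            let mut next = Vec::new();
            for &s in &current {
                if let Trans::Char(c, to) = self.states[s] {
                    if c == b {
                        next.push(to);
                    }
                }
            }
            self.add_eps(&mut next);
            current = next;
        }
        current.iter().any(|&s| matches!(self.states[s], Trans::Match))
    }

    // Expand Split states until the set is epsilon-closed.
    fn add_eps(&self, set: &mut Vec<usize>) {
        let mut i = 0;
        while i < set.len() {
            if let Trans::Split(a, b) = self.states[set[i]] {
                for t in [a, b] {
                    if !set.contains(&t) {
                        set.push(t);
                    }
                }
            }
            i += 1;
        }
    }
}

fn main() {
    // The NFA for "ab": 0 --a--> 1 --b--> 2 (Match).
    let ab = Nfa {
        states: vec![Trans::Char(b'a', 1), Trans::Char(b'b', 2), Trans::Match],
    };
    assert!(ab.matches(b"ab"));
    assert!(!ab.matches(b"ba"));

    // "x|y": 0 splits into 1 and 2, both leading to Match at 3.
    let x_or_y = Nfa {
        states: vec![
            Trans::Split(1, 2),
            Trans::Char(b'x', 3),
            Trans::Char(b'y', 3),
            Trans::Match,
        ],
    };
    assert!(x_or_y.matches(b"y"));
    assert!(!x_or_y.matches(b"xy"));
}
```

The borrow checker never complains here because no state holds a reference to another state, only an index; whether that still counts as “the same code” as the pointer-graph version is exactly the question.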
I would be very interested in an exploration of that, and what the code looks like
In Rust, one prominent library and pattern for addressing the problem of dangling indexes (the “ABA problem”) is `generational-arena`.

To mitigate the risk of confusing Person indices with Job indices, one could store them and pass them as separate `struct PersonId(u64)` and `struct JobId(u64)` types.

SQL foreign keys (at least if they’re integers) are a similar use of indices in place of pointers, and SQL has (potentially depending on implementation) different ways of addressing these problems, or at least the “memory safety” problem.
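As a sketch of that newtype idea (the struct layout and names here are my own, just for illustration), the compiler then rejects passing a `PersonId` where a `JobId` is expected:

```rust
// Sketch of the newtype pattern: distinct index types so a PersonId
// can't be used where a JobId is expected. Names are illustrative.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct PersonId(u64);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct JobId(u64);

struct Person { name: String, job: JobId }
struct Job { title: String }

struct World {
    people: Vec<Person>,
    jobs: Vec<Job>,
}

impl World {
    fn person(&self, id: PersonId) -> &Person {
        &self.people[id.0 as usize]
    }
    fn job(&self, id: JobId) -> &Job {
        &self.jobs[id.0 as usize]
    }
}

fn main() {
    let world = World {
        people: vec![Person { name: "Ada".into(), job: JobId(0) }],
        jobs: vec![Job { title: "Engineer".into() }],
    };
    let ada = world.person(PersonId(0));
    // world.job(PersonId(0)) would be a compile error: mismatched types.
    assert_eq!(world.job(ada.job).title, "Engineer");
}
```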
Adding generation numbers to indices reminds me of “Handles are the Better Pointers” which is roughly the same idea in C.
Previous discussion: https://lobste.rs/s/4xcla1/handles_are_better_pointers
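A minimal sketch of the generational idea (not `generational-arena`’s actual API, just the shape of it): each slot carries a generation counter, and a handle only resolves if its generation still matches.

```rust
// Minimal generational-index sketch: a handle carries the slot's
// generation, so a stale handle to a freed-and-reused slot is detected
// instead of silently aliasing the new occupant.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Handle { index: usize, generation: u32 }

struct Slot<T> { generation: u32, value: Option<T> }

struct Arena<T> { slots: Vec<Slot<T>> }

impl<T> Arena<T> {
    fn new() -> Self { Arena { slots: Vec::new() } }

    fn insert(&mut self, value: T) -> Handle {
        // Reuse a free slot if one exists, bumping its generation.
        for (i, slot) in self.slots.iter_mut().enumerate() {
            if slot.value.is_none() {
                slot.generation += 1;
                slot.value = Some(value);
                return Handle { index: i, generation: slot.generation };
            }
        }
        self.slots.push(Slot { generation: 0, value: Some(value) });
        Handle { index: self.slots.len() - 1, generation: 0 }
    }

    fn remove(&mut self, h: Handle) -> Option<T> {
        let slot = self.slots.get_mut(h.index)?;
        if slot.generation != h.generation { return None; }
        slot.value.take()
    }

    // Returns None for stale handles rather than the slot's new occupant.
    fn get(&self, h: Handle) -> Option<&T> {
        let slot = self.slots.get(h.index)?;
        if slot.generation != h.generation { return None; }
        slot.value.as_ref()
    }
}

fn main() {
    let mut arena = Arena::new();
    let a = arena.insert("alice");
    let _ = arena.remove(a);
    let b = arena.insert("bob"); // reuses slot 0 with a new generation
    assert_eq!(arena.get(a), None);         // stale handle detected
    assert_eq!(arena.get(b), Some(&"bob"));
}
```

This turns the use-after-free into a recoverable `None` at runtime, which is the trade the “handles” article is making.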
I think Rust’s story here has several sides, and different people emphasize different parts: Safe Rust tightly restricts the kinds of pointer graphs you can build. Unsafe Rust lets you go nuts and do whatever you want with raw pointers. (It’s even more permissive than C. There’s no “strict aliasing” rule.) But safe Rust also exerts a lot of gravity on unsafe Rust. In part that’s because everyone agrees culturally that you’re “not supposed to do that” when you have a choice, and the ecosystem is reluctant to accept unsafe APIs that could’ve been safe. In another part, it’s because the rules you have to follow at the boundary between safe and unsafe code are more complicated than the normal rules for pointers in C. (All shared and mutable references in safe code are “restrict” pointers in C terms, and it’s easy to violate that requirement when you’re doing unsafe casts.)
I’d love to know the answer to your regex question, but the best I can do is point to some sources.
If there’s a holy gospel of regular expressions in Rust, it’s Andrew Gallant’s 2016 release announcement / article / nearly-a-book for `ripgrep`: https://blog.burntsushi.net/ripgrep. In there he links to a different entry in the Russ Cox series that you mentioned. (Ctrl-F for “Russ Cox”.)

There’s also some documentation about how unsafe code is used in the internals of the `regex` crate that Andrew maintains: https://github.com/rust-lang/regex/tree/master/regex-automata#safety. This part seems relevant:

Oh nice, Andrew commented about this on r/rust.
OK yeah I actually agree – I was actually thinking last night that this wasn’t the best example, since all the indices could be into the same data structure.
It makes the syntax uglier to use indices, but now I do think there’s a very straightforward translation of that code (although I would still be interested in seeing it! I wish I had time to do that experiment).
A better example would have more heterogeneous data types and lifetimes. But C doesn’t handle the lifetime issue either …
(Though it seems like getting the generational-index errors at runtime isn’t too different from getting ASAN errors at runtime, but I haven’t tried the former. Also, in both cases you’re kinda using an “extra” tool on top of the language.)
A big difference between ASan and these other approaches is that (last I heard, as far as I know) you’re not supposed to run ASan in production. It has a large performance cost, but maybe more importantly it opens up new attack surfaces.
Not per se asan, but google has been running sanitisers on android in production for a while now - https://android-developers.googleblog.com/2018/06/compiler-based-security-mitigations-in.html https://security.googleblog.com/2022/12/memory-safe-languages-in-android-13.html
This is one of the most brilliant ways of structuring trees I’ve learned in the past few years. It serializes to JSON as-is, even with cycles in the structure, and it is very easy to iterate using ranges and easy to index using any sort of map structure.
I usually make a type out of different IDs; here we could have a `FriendId(pub usize)`, and a special walker structure, which can be `Copy`. A superb way for e.g. modeling a database schema in memory.

`slotmap` has a fancy macro for defining new key types too: https://docs.rs/slotmap/latest/slotmap/macro.new_key_type.html

Unrelated to the content of the article itself, but I love the way the footnotes are presented and the use of them at all on the website
https://edwardtufte.github.io/tufte-css/
Your last design is just a step away from an “Entity Component System” architecture:
That’s the subject of the final paragraph :)
As another step, I wonder if some of the concerns raised here about safety w/r/t indexes vs pointers would be addressed by a “flyweight” API that encapsulates & hides the underlying arrays. I’m out of my depth with Rust, but maybe something like
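Purely as a hypothetical sketch (the names are made up), one way such a flyweight might look: the arrays stay private, and callers only ever see cheap `Copy` handles that borrow the underlying storage.

```rust
// Hypothetical flyweight sketch: the backing array is private, and
// callers only see Copy-able handles that borrow the storage.
struct PersonData { name: String, friends: Vec<usize> }

pub struct People { data: Vec<PersonData> }

#[derive(Clone, Copy)]
pub struct PersonRef<'a> { people: &'a People, index: usize }

impl People {
    pub fn get(&self, index: usize) -> Option<PersonRef<'_>> {
        // Bounds are checked once here; the handle is valid thereafter.
        if index < self.data.len() {
            Some(PersonRef { people: self, index })
        } else {
            None
        }
    }
}

impl<'a> PersonRef<'a> {
    pub fn name(self) -> &'a str {
        &self.people.data[self.index].name
    }
    // Traversal hands back more flyweights, never raw indices.
    pub fn friends(self) -> impl Iterator<Item = PersonRef<'a>> + 'a {
        let people = self.people;
        people.data[self.index]
            .friends
            .iter()
            .map(move |&i| PersonRef { people, index: i })
    }
}

fn main() {
    let people = People {
        data: vec![
            PersonData { name: "rin".into(), friends: vec![1] },
            PersonData { name: "amos".into(), friends: vec![0] }, // a cycle, no problem
        ],
    };
    let rin = people.get(0).unwrap();
    let names: Vec<&str> = rin.friends().map(|f| f.name()).collect();
    assert_eq!(names, vec!["amos"]);
}
```

Callers never see a raw index after the initial lookup, so the “wrong array” class of mistakes is fenced off behind the wrapper.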
The simplest tool is a struct of arrays regardless of the implementation language ;)
I think this might be missing another ergonomic(-ish) approach, which is to make a `Person` type which wraps some `Rc`’d data (`PersonData`?), can be copied, etc., doing the “right thing”, with some helper methods or `Deref` implementations or whatever.

You can stick to indexes as well (and just have a `'static` Person arena), but it feels a bit silly to throw around numbers when you’re one wrapper class away from having “normal” ergonomics IMO.
Ultimately if you have an ECS system though, with objects popping into and out of existence, you will always need some sort of liveness check on objects. There’s no way around it! Things are given references to people, and then use those references later on. If your people can fall out of existence, you need liveness checks. See every “game development in Rust” talk out there.
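For illustration, a minimal sketch of that wrapper idea, assuming a hypothetical `PersonData` payload behind an `Rc`:

```rust
// Sketch of the suggested wrapper: a cheap-to-clone Person handle
// around Rc'd data, with Deref giving "normal" field access.
// All names here are illustrative.
use std::ops::Deref;
use std::rc::Rc;

struct PersonData { name: String }

#[derive(Clone)]
struct Person(Rc<PersonData>);

impl Deref for Person {
    type Target = PersonData;
    fn deref(&self) -> &PersonData { &self.0 }
}

fn main() {
    let p = Person(Rc::new(PersonData { name: "Ada".into() }));
    let q = p.clone(); // clones the handle, not the data
    assert!(Rc::ptr_eq(&p.0, &q.0));
    assert_eq!(q.name, "Ada"); // Deref makes fields read naturally
}
```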
The `Rc` is very problematic when you need cycles between entities. It leaks memory if the other side is not `Weak`, and the ergonomics of a `Weak` require an extra `unwrap` before the data can be used. Also, referencing this data is suddenly very hard, and there is no real ownership for the data anymore. And if you need a `Send` boundary and threads, all your `Rc`s need to be `Arc`s, which might have serious implications for the performance of the software: an atomic can be poison to the caches. Finding these performance regressions is very hard in a big application.

The ids work very nicely if you wrap them in a structure which I explained in another comment:
Now you can specialize:
And you can iterate from the flat vectors:
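A hypothetical sketch of such a setup (my own names, not the original commenter’s code), with newtype ids, specialized accessors, and iteration straight over the flat vectors:

```rust
// Illustrative sketch: newtype ids, entities in flat vectors, and
// relations added a la carte as maps keyed by id.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct PersonId(usize);

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct CityId(usize);

struct Person { name: String }
struct City { name: String }

struct Schema {
    people: Vec<Person>,
    cities: Vec<City>,
    // Extending the setup is just another map on the side.
    lives_in: HashMap<PersonId, CityId>,
}

impl Schema {
    // "Specialize": typed accessors per entity kind.
    fn person(&self, id: PersonId) -> &Person { &self.people[id.0] }
    fn city(&self, id: CityId) -> &City { &self.cities[id.0] }

    // Iterate straight from the flat vector, yielding (id, entity).
    fn people(&self) -> impl Iterator<Item = (PersonId, &Person)> {
        self.people.iter().enumerate().map(|(i, p)| (PersonId(i), p))
    }
}

fn main() {
    let mut lives_in = HashMap::new();
    lives_in.insert(PersonId(0), CityId(0));
    let schema = Schema {
        people: vec![Person { name: "Ada".into() }],
        cities: vec![City { name: "London".into() }],
        lives_in,
    };
    for (id, person) in schema.people() {
        let city = schema.lives_in.get(&id).map(|&c| schema.city(c));
        assert_eq!(city.unwrap().name, "London");
        println!("{} lives in {}", person.name, city.unwrap().name);
    }
}
```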
What is really awesome with this approach is how you can extend the setup to different use-cases. You need to look up the city a person lives in? Add a `HashMap<PersonId, CityId>` to the schema and off you go. This is not a Rust-only solution; it works very well in any statically typed language.

I think it’s hard to go too much into this in a context-free way, since the use case is what matters, but with `PersonId`s you still have the issue of “is this ID still referencing a ‘live’ entity?” (so you will need to unwrap just like with `Weak`). And if you are saying “no, actually everything is always live and we don’t GC”, then memory leaks are a non-issue (or you stick everything in a `'static` and be done with it).

Making Ids explicit makes the Rust borrow checker happy, but I really think it doesn’t resolve the fundamental issue in entity systems, which is items no longer existing despite references to them still being somewhere else in the system. And you can handle that fine, if you have some unwrapping step. And if stuff can never be removed, then just “leak” memory (after all, if your stuff is immortal, it’s not really a leak, is it?). And `Rc` might not be the right solution for that, but some arena works fine.

I’m saying this, but I do think it’s interesting to explore stuff like your `Walker` structure anyways.

Keep in mind that all the problems we solved with this structure involved data that was immutable after creation. It’s a basis for an ORM or a GraphQL server where you want to load a schema file or database definition into memory, with the need for fast serialization and deserialization.
If there were a need for deletion of items and mutation at runtime, I’d probably go with `slotmap` instead.
The next step after thinking of these as indexes is to think of them as identifiers of entities, and that our data structures are holding aspects we assign to identifiers. Which is what relational databases and entity component systems do. It has the nice property that you can take various aspects a la carte instead of trying to have some kind of object that represents the entity.
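A tiny sketch of that a-la-carte idea: the entity is only an id, and each aspect lives in its own table, so an entity takes on exactly the aspects you assign it.

```rust
// Sketch of "aspects a la carte": an entity is just an id, and each
// aspect is a separate table keyed by that id.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Entity(u64);

fn main() {
    let mut names: HashMap<Entity, String> = HashMap::new();
    let mut positions: HashMap<Entity, (f32, f32)> = HashMap::new();

    let e = Entity(1);
    names.insert(e, "crate".to_string());
    positions.insert(e, (3.0, 4.0));

    // Another entity takes only the aspects it needs.
    let ghost = Entity(2);
    positions.insert(ghost, (0.0, 0.0));

    assert_eq!(names.get(&e).map(String::as_str), Some("crate"));
    assert!(names.get(&ghost).is_none()); // no name aspect assigned
}
```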
any thoughts on mutable value semantics?
https://www.jot.fm/issues/issue_2022_02/article2.pdf
https://www.youtube.com/watch?v=QthAU-t3PQ4
I downloaded the same PDF a couple weeks ago, and I’ve been meaning to read through it :) Someone who knows this stuff should really write an article like “here are some programs you can write in Rust but not Hylo and vice versa”.
The literal rule for what you can’t express in Hylo is: any Rust function that, if you made its lifetimes explicit, has at most one lifetime parameter. If that lifetime parameter is used in the return type, you need a “subscript”, otherwise a regular function will do. Oh, and references can only appear at the top level, that is very important. Vice-versa is simpler: Rust is strictly more expressive.
But that doesn’t quite answer “what programs you can write”, because there’s the question of writing “the same thing” a different way.
I’ll consider writing this blog post.
I want to make sure I understand the literal rule: Is “at most” a typo?
Yes I’d love to read that post. Please ping me on Twitter/GMail/whatever (same handle everywhere) if you write it!
Gah, sorry, yes, I wrote that backwards.
A Rust function without references can obviously be written in Hylo. If it has references only on the inputs, that can be expressed in Hylo:
Every Rust function with references only in its inputs can be written with a single lifetime parameter. Put differently, lifetime parameters are only ever needed when there’s an output reference. Thus the above function can be written with only one lifetime parameter:
If a Rust function has references on both its inputs and output, then it’s expressible in Hylo only if there’s at most one lifetime parameter when you write out the lifetimes. So:
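To make the rule concrete, here are some illustrative Rust signatures (my own examples, with the lifetimes spelled out):

```rust
// Illustrative examples for the rule above; these are my own, not
// the original commenter's.

// References only in the inputs: one lifetime parameter suffices,
// so this kind of function is expressible in Hylo.
fn longest_len<'a>(x: &'a str, y: &'a str) -> usize {
    x.len().max(y.len())
}

// One lifetime parameter, used in the return type: per the comment,
// this needs a Hylo "subscript" rather than a regular function.
fn longer<'a>(x: &'a str, y: &'a str) -> &'a str {
    if x.len() >= y.len() { x } else { y }
}

// Two distinct lifetime parameters, with the output tied only to 'a
// (collapsing them would needlessly tie the result to prefix's
// borrow): under the stated rule, not expressible in Hylo.
fn strip<'a, 'b>(s: &'a str, prefix: &'b str) -> &'a str {
    s.strip_prefix(prefix).unwrap_or(s)
}

fn main() {
    assert_eq!(longest_len("ab", "c"), 2);
    assert_eq!(longer("ab", "c"), "ab");
    assert_eq!(strip("foobar", "foo"), "bar");
    assert_eq!(strip("bar", "zzz"), "bar");
}
```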
The other, bigger, restriction is that references have to be top-level. You can’t nest them. So no `Option<&str>`, or `HashMap<i32, &User>`. And no iterators over references! No such thing as `split_words() -> impl Iterator<Item = &str>`!

Instead of a function returning an `Option<&str>`, the function would take a callback and invoke it only if the option is populated. And iterators need to be internal iterators: to iterate over a collection, you pass a closure to the collection and the collection invokes that closure once per element.

I’ve added a file to my blog post ideas folder, and it says to ping you, so I’ll definitely ping you if I write it. Though no promises, as that ideas folder grows more than it shrinks…
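Sketched in Rust, the callback and internal-iterator styles described above might look like this (illustrative names only):

```rust
// Callback style: instead of returning an Option<&str>, the function
// drives a closure, invoked only if a value exists.
fn with_first_word(s: &str, f: impl FnOnce(&str)) {
    if let Some(word) = s.split_whitespace().next() {
        f(word);
    }
}

// Internal iteration: the collection calls the closure per element,
// so no reference-yielding iterator object ever escapes.
fn for_each_word(s: &str, mut f: impl FnMut(&str)) {
    for word in s.split_whitespace() {
        f(word);
    }
}

fn main() {
    let mut first = String::new();
    with_first_word("hello world", |w| first.push_str(w));
    assert_eq!(first, "hello");

    let mut count = 0;
    for_each_word("a b c", |_| count += 1);
    assert_eq!(count, 3);
}
```

No reference outlives the call that produced it, which is exactly what keeps every reference “top-level”.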
This reminds me of discussions years ago about whether Java or C++ (or Python) is „better“… People argued that it is better to have such options (like pointers or duck-typing) than not to have them, and that it is up to the programmer to use these options wisely. They also said that „limited“ languages like Java are bad, because they make some advanced things impossible or difficult/annoying (it is not as smooth as pure Java if you have to use JNI/JNA to be native like C/C++, or if you have to use the Reflection API to be dynamic like Python). I argued that the „limited“ set of features that Java offered was better (for business applications) and that it was well-balanced between „powerful“ and „safe“.
It is still a question what „well-balanced“ means, and how powerful, foolproof, or complex the language should be.
I feel like all this debate over the goodness or badness of arbitrary pointers in imperative languages, from the original post to the >35 comments in the last ~15 hours, could be improved with sufficient appreciation of SQL or other relational languages: of the potential benefits of taking on restrictions in data structure design not merely for their own sake, but in return for access to an additional paradigm of programming. The entity–component–system design seems like a step towards this. All that said, though, I don’t feel sufficiently learned in relational programming myself to say clearly what one would gain.
Here’s an old (by Rust standards) RustConf talk that has a lot of these ideas too: https://m.youtube.com/watch?v=aKLntZcp27M
That talk is linked and credited in the final paragraph! :)