Tbh this sounds a lot like Unison.
Isomorf was another similar thing.
Except that scrapscript is a scripting language for a decentralized social network, and has authentication and a payment system. (See link in my other post.)
When I read some of the Unison docs, I got the strong impression that names don’t matter, entities (types, values) are identified solely by their hash, and their name is not part of their identity.
In scrapscript, names do matter; it doesn’t have Unison’s naming semantics. A scrap can have a name that is consistent across multiple versions of the scrap, which allows you to upgrade code to the latest version. I don’t know how this works in Unison.
The social network, authentication, and payment systems are the “killer apps” that I want to build with the language, but aren’t part of the core architecture :)
I’m not super familiar with unison either, but your description of scrapscript was spot on :) thanks!
IIUC, in Unison, names “matter” but are aliases to those content hashes.
Unless you were talking about structural / unique types. For structural types, the names don’t really matter, but they do for unique types.
Author here! Let me know if you have any questions :)
Why have variables? It seems like you’ve given yourself an alpha-equivalence problem.
Edit: Another question: Are scrapyards monoidal? Is it possible to add two scrapyards together?
In a previous implementation, I canonicalized variable names before hashing. Turns out it wasn’t a very useful feature. So now, for scraps to have the same hash, they have to be exactly the same.
In practice, I think community preprocessors (linters, formatters, etc) will push many programs to fall into the same buckets.
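A quick illustration of what exact-match hashing implies, assuming (hypothetically) that a scrap is hashed as its raw source text with SHA-256; the snippets are made-up syntax, not real scrapscript:

```python
import hashlib

def scrap_hash(source: str) -> str:
    """Hypothetical: a scrap's identity is the SHA-256 of its exact source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()

# Two alpha-equivalent definitions: same behavior, different variable names,
# therefore different hashes.
a = scrap_hash("add1 = x -> x + 1")
b = scrap_hash("add1 = y -> y + 1")

print(a == b)   # variable names matter, so these differ
print(scrap_hash("add1 = x -> x + 1") == a)  # identical text, identical hash
```

This is why community formatters matter here: normalizing whitespace and style before hashing pushes equivalent programs into the same bucket.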
If you read this comment, this is something I’m still working on.
The naming of scraps is unique to a scrapyard, but the repository of scraps themselves are interchangeable if you don’t consider collisions.
Oh, I mean, why have any variables at all? In my Cammy experimental language, there are no variables or lambda-binders; instead, functions are composed directly from smaller functions. This lets me skip over alpha- and eta-equivalence entirely! Unlike in Unison, Dark, and apparently Scrapscript; Cammy code does not allow naming types or creating custom encapsulation.
Additionally, Cammy’s hives are monoidal. They’re JSON, and there is an algorithm (not yet implemented) for merging two hives into one larger hive. The upshot is that namespaces are always local to each hive; there is no global namespace. (I totally understand your approach of making your namespace into the global namespace; I did something similar with Tahoe-LAFS once upon a time.)
Super interesting idea! It would solve a certain set of problems, but I think it would make the language overall less accessible.
I could definitely see a Cammy-like embedded DSL being a fun project :)
Btw I think you’d really enjoy Smullyan’s To Mock A Mocking Bird, if you haven’t read it already. It’s an entertaining puzzle approach to combinators.
Please keep variables, they’re very useful for many programmers. Not everyone wants to write in a concatenative or pointfree style all the time!
Unrelatedly, do you expect scrapscript to be defined by a spec and possibly have several implementations (including some that are to be embedded into other languages), or to be defined by your canonical implementation?
Great great question.
I’m hoping to publish the full spec this month.
I am not sure yet, but I think there will be a canonical implementation hosted in the scrapyard itself. But in general, I think the ecosystem will look more like JSON, with many non-canonical implementations in many languages.
Is scrapscript statically typed? If so, what type system are you using?
Yes, statically typed. I’m pretty sure it’s basic Hindley-Milner, but I’m not an expert.
I think the type system is 1:1 with Elm right now, which is to say pretty simple.
Do you have any concrete documentation? The pitch sounds great but there’s only so much I can glean from trying to read between the lines of cryptic little snippets of a language I don’t know.
Concrete documentation is my next goal :) I wanted to see if people were interested in the idea before committing more time to it, and it looks like I’m getting way more support than I could’ve ever imagined!
I second that, the pitch worked pretty well, now you have curious people who want to know more :)
Yes, it seems many of us would love to see gory details! At least just an explanation of the syntax.
If two values have the same hash, are they equal? Can you compare functions for equality? Is 1 == 1.0? How many equality operators do you have?
The little $0 on the end is an extra ID for the scrapyard to uniquely identify scraps with colliding hashes, if needed. So yes, equal hashes always mean equal scraps within the same scrapyard.
Yes, but variable names matter. They might also have to have the same arity, but I haven’t explored it that deeply.
This feature may have to be removed if it results in unpredictable or inconsistent behaviors.
No. Integers and floats are separate types.
I might also include complex numbers in the final spec, but I think I’m going to get community feedback first.
Just one equality operator, which compares any two values of the same type.
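Going back to the $0 suffix: if it stays, resolvers would need to split it off before lookup. A minimal sketch, assuming a hash$n format (my reading of the syntax, not a spec):

```python
def parse_scrap_id(scrap_id: str) -> tuple[str, int]:
    """Split 'hash$n' into (hash, collision_index); '$0' is implied if absent."""
    if "$" in scrap_id:
        digest, _, index = scrap_id.rpartition("$")
        return digest, int(index)
    return scrap_id, 0

print(parse_scrap_id("abc123$1"))  # ('abc123', 1)
print(parse_scrap_id("abc123"))    # ('abc123', 0)
```

Treating a bare hash as $0 keeps the common case clean while leaving room for explicit collision indices.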
That means these things aren’t hashes; they are IDs interpreted relative to a particular scrapyard. So your system is not content-addressable and doesn’t use IPFS. The blog post gives a different impression.
Sorry if I misrepresented myself.
$0/$1/$2/… seems like a simple way to solve collision problems. If you have other suggestions, I’d love to hear them.
I do hope to decentralize scrapyards at some point, but distributed systems are too hard to tackle from the start. One strategy is to take the biggest scrapyard and decentralize it by using IPFS or a custom backend. Another strategy would be to federate all scrapyards together by creating an exchange protocol.
I may be misunderstanding, but: if you’re going to use digests as proxies for scripts, then you’re committed to the belief that there is a 1:1 mapping from digest to script. This is of course a false belief, but with a good digest function it’s true enough for the real world, and lots of software believes it.
If you accept the possibility of hash collisions, then you can’t use digests to represent scripts anymore. You can tack on integers to make the digests locally unique, but how do you make that globally unique without a central registry? And with a registry, you don’t need the digests anymore.
Thanks! It sounds like you fully understand the problem :)
We’re early enough in scrapscript to change some of these things, so your thoughts are helpful.
Here are the tradeoffs I see for having collision integers:

Trusting the digest: By trusting the digest, the only sensible way to handle collisions would be to randomly sample all the matching scraps until one matches by type. This is honestly not a bad option, because it makes a lot of the system simpler and makes everything more distributable.

Expecting collisions: Explicit is often better than implicit. By specifying a version, it prevents a certain class of (very very unlikely) attacks from ever occurring. This is probably the worse choice, but it’s the more reversible choice. It’s easy to ignore the $0 in existing identifiers, but it’s difficult to add them.
I very much expect to adopt the $0/$1/… syntax in the beginning, and then drop it when we have more confidence in the system. I’m open to ideas though, and will be fielding more feedback from the community.
In general, every content-addressable store I’ve ever heard of assumes collisions don’t happen, as do blockchains and git. The keyspace is too massive (assuming the hash function isn’t broken; if somebody figured out how to make SHA-1 collisions on demand, a lot of systems would suddenly have a very big problem). SHA-1 has a uniformly distributed 160-bit digest, and because of the birthday paradox, collisions only become likely around 2^80 items, which is far more scraps than will ever exist, short of truly absurd assumptions like “there will be 3 scraps for every atom in the observable universe” (at which point you’d want a longer digest anyway). Work out the real probability for the number of scraps you expect, and if you’re still worried (i.e. it’s above 5%, or 1%, or 0.00001%, whatever), switch to SHA-256. If that’s not satisfactory, there’s SHA-512; and if that’s still not satisfactory, make sure you haven’t dropped a few hundred zeros somewhere in your calculations 😉 (or use a sponge function, which can make the digest arbitrarily long).
(Actually, probably switch to SHA-256 anyway: it’s thought to be more secure, and I think it’s what IPFS and the Docker Registry use, which are both currently extant, large content stores.)
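The birthday-paradox formula in question: for n items hashed into a uniform b-bit space, the chance of any collision is approximately n^2 / 2^(b+1) while that value is small. A quick sketch (the trillion-scrap figure is just an illustrative assumption):

```python
def birthday_collision_prob(n: int, bits: int) -> float:
    """Approximate probability of any collision among n items hashed into a
    uniform b-bit space (the n^2 / 2^(b+1) birthday bound, valid while small)."""
    return n * n / 2 ** (bits + 1)

n = 10 ** 12  # assume a trillion scraps
print(birthday_collision_prob(n, 160))  # SHA-1:   ~3.4e-25
print(birthday_collision_prob(n, 256))  # SHA-256: ~4.3e-54
```

Even at a trillion scraps, the probabilities are so far below any threshold that collision handling is a non-issue unless the hash function itself is broken.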
Explicit may be better than implicit*, but sticking a $0 at the end of everything and building contingencies for hash collisions is just extra work and cognitive load, and kind of undermines the concept of a content-addressable store.
*IMO in many cases they’re dual, and regardless, tacit is better than both. Often when I hear that rule repeated, it feels like there’s some conspiracy of keyboard manufacturers who profit when developers need to press more buttons to say the same thing.
Wow, I think you pretty solidly convinced me.
Thank you so much for the help!
No problem! It’s not every day someone walks by with a question I’ve spent time researching in the last week :)
This project looks really cool and I’m looking forward to trying it out, godspeed.
Could you add a content length field to the ID to prevent (many) collisions? My understanding is that most known collision attacks require adding arbitrary data. If you have the constraint that both the hash and the content length must be equal, I think deliberate collision attacks become a lot harder. But you should check with a cryptographer.
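As a sketch of that idea (the hash:length ID format here is my own invention, not scrapscript’s):

```python
import hashlib

def make_id(content: bytes) -> str:
    """ID = SHA-256 digest + ':' + content length in bytes."""
    return f"{hashlib.sha256(content).hexdigest()}:{len(content)}"

def verify(content: bytes, scrap_id: str) -> bool:
    """Content must match both the digest and the recorded length."""
    digest, _, length = scrap_id.partition(":")
    return (hashlib.sha256(content).hexdigest() == digest
            and len(content) == int(length))

scrap = b"example scrap source"
sid = make_id(scrap)
print(verify(scrap, sid))              # True
print(verify(scrap + b" extra", sid))  # False: length and digest both differ
```

The length check is nearly free and rules out the length-extension-style forgeries that pad extra data onto a colliding input, though as noted above, a cryptographer should vet the actual guarantee.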
The first example defines a type person. What happens when you distribute this code across the internet, and the type person conflicts with another type with the same name in the code that dereferences the hash? Is there a module system to manage namespaces and prevent name conflicts?

In an expression like spaceq/is-planet, it looks like spaceq is a module and is-planet is a member of this module. Is spaceq a globally unique module name? If someone, somewhere else on the internet, makes a module called spaceq, is there now a name conflict that will confuse any code that tries to use spaceq? How do you manage namespaces on an internet scale, guarantee uniqueness, and prevent conflicts?

Great questions!
The scrapyard is the namespace. I suspect that everybody will want to use the same scrapyard, but it’s not required.
I’m not 100% sure, but I think we have to replace all the names with their hashes before sending them over the wire. We don’t want cases where they arrive at the destination and use an updated version, or something like that.
I’m also not 100% sure about this, but I think we need to put the expected scrapyard in the message header, so that the recipient knows where to find all the referenced scraps.
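Both ideas together could look something like this (the message shape, field names, and scrapyard address are purely hypothetical):

```python
import json

def encode_message(expr_tokens: list[str], name_to_hash: dict[str, str],
                   scrapyard: str) -> str:
    """Pin every named scrap to its current hash, and record which scrapyard
    the recipient should fetch those hashes from."""
    pinned = [name_to_hash.get(tok, tok) for tok in expr_tokens]
    return json.dumps({"scrapyard": scrapyard, "expr": pinned})

msg = encode_message(
    ["spaceq/is-planet", "mars"],
    {"spaceq/is-planet": "abc123"},
    scrapyard="yard.example",
)
print(msg)  # {"scrapyard": "yard.example", "expr": ["abc123", "mars"]}
```

Pinning hashes at send time freezes the version the sender actually ran, and the header tells the recipient where those hashes resolve.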
In that example, spaceq is actually the user and is-planet is their function within the scrapyard.
It’s just a simple key-value store, so users can use arbitrary paths after their username. e.g. spaceq/planets/revised/is-planet haha
Whoever manages the scrapyard can manage their namespace however they want. The scrapyard I’ll be managing will probably be the “worldwide” one, and I’m still deciding how to fairly distribute user accounts. I want to avoid domain squatting and other legacy problems of the existing internet.
The public scrapyard will essentially be two key-value stores: (1) name -> list (hash,timestamp), (2) hash -> scrap. An authentication layer will ensure that only you can edit your keyspace.
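Those two stores could be modeled like so (a toy sketch, not the real implementation):

```python
import hashlib
import time

class Scrapyard:
    """Toy model of the two stores: name -> [(hash, timestamp)], hash -> scrap."""

    def __init__(self) -> None:
        self.names: dict[str, list[tuple[str, float]]] = {}
        self.scraps: dict[str, str] = {}

    def publish(self, name: str, source: str) -> str:
        """Store the scrap by hash and append a new version under the name."""
        digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
        self.scraps[digest] = source
        self.names.setdefault(name, []).append((digest, time.time()))
        return digest

    def resolve(self, name: str) -> str:
        """Versions are appended in publish order, so the last entry is newest."""
        digest, _timestamp = self.names[name][-1]
        return self.scraps[digest]

yard = Scrapyard()
yard.publish("spaceq/is-planet", "version 1 of the source")
yard.publish("spaceq/is-planet", "version 2 of the source")
print(yard.resolve("spaceq/is-planet"))  # version 2 of the source
```

Keeping the full (hash, timestamp) history per name is what lets a name stay stable across versions while every version remains addressable by hash.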
Have you ever looked at SDSI/SPKI? It was a system for certificates, identification, authentication and authorization based on S-expressions. Your language seems more than powerful enough to do something similar.
I urge you not to design for a centralized system. It makes some stuff easier, but in the long run it creates all sorts of problems, esp if it has to handle user accounts, and esp^2 if it becomes a social network. Just look at what’s happening to other centralized social networks.
I’ll definitely look into that! Sounds right up my alley. I’ve been having trouble finding an elegant way to do the auth side of things.
There are different levels of centralization. Facebook, email, and bittorrent offer different tradeoffs.
I think I’m going to start with an email-like model where anybody can host their own interoperable server. But I suspect that most people will pick a popular default, much like GitHub or Gmail. If power gets too consolidated, I’ll figure out how to decentralize for robustness.
In case it matters, I just want to reiterate that scrapscript is not a money-making endeavor for me. I’d like to build profitable side-businesses on top of scrapscript, but the core language and ecosystem must remain open and equitable :) The centralization question is more a matter of technical tradeoffs right now.
The original author posted this on the orange site. Replies from the author give more context to this post: https://news.ycombinator.com/item?id=35712163