In the ActivityPub community I encountered a dev who treats URLs as opaque strings for anything internal to his application.
I can’t fathom trying to decouple the logic that paths, query params and everything else allows in favour of something so bland as a string identifier.
That’s not far off the right thing to do. A URL is three parts:
The scheme
://
The stuff whose interpretation depends on the scheme
Often, the best thing to do is treat the whole thing as an opaque token and pass it on to a different layer that understands the scheme and can do the right thing.
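To make the "opaque token plus a scheme-aware layer" idea concrete, here is a minimal sketch. The handler functions and the `HANDLERS` table are hypothetical names invented for illustration; the only thing the dispatcher ever inspects is the text before the first colon.

```python
from typing import Callable, Dict


# Hypothetical handlers: each one owns the rules for its own scheme.
def handle_http(url: str) -> str:
    return f"fetch over HTTP: {url}"


def handle_mailto(url: str) -> str:
    return f"compose mail: {url}"


HANDLERS: Dict[str, Callable[[str], str]] = {
    "http": handle_http,
    "https": handle_http,
    "mailto": handle_mailto,
}


def dispatch(url: str) -> str:
    """Pick a handler by scheme; the rest of the URL stays an opaque string."""
    scheme, sep, _rest = url.partition(":")
    if not sep:
        raise ValueError(f"no scheme in {url!r}")
    handler = HANDLERS.get(scheme.lower())  # schemes are case-insensitive
    if handler is None:
        raise ValueError(f"no handler for scheme {scheme!r}")
    return handler(url)
```

The calling code never needs to know whether a `//` follows the colon; that knowledge lives entirely in whichever handler gets picked.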
Huge “it depends”: data:, jar:, blob:, and about: also do not require a double slash. But chrome: and resource: mostly do. I guess it’s fair to say that everything after the colon is specific to the URL scheme.
If anything, I wonder why the double slash was introduced at all by some schemes. Even in file: URLs it seems like a single slash should suffice and in http: it doesn’t add any value at all AFAICT. Does anyone happen to know the rationale behind it?
“The formats and protocols were designed to look as much like the existing ones as possible,” he explains, saying that HTTP was designed to look like NNTP, or Network News Transfer Protocol, which was used for Internet newsgroups. “The aim was for people who worked with the protocols to look at them and say: ‘Oh, yeah, I see what’s going on here.”
…the double slash at the front each web address came from a file system for a computer workstation called the Apollo/Domain. “The double slashes were there because, on some computer systems, that was already used to mean: ‘We’re going outside the computer now.’ The single slash was for the local file system. The double slash was for the outside.”
I also remember reading a quote that Tim just liked the look of it, but I can’t find any source for that, so maybe it’s a false memory or I heard it in person at a lecture he gave.
Yes, in Domain, local paths were Unix style /like/this, remote systems paths were like //host/usr/bin/whatever, and you could specify the protocol like protocol://whateverhost/file/path.
If HTTP(S) URLs didn’t have a double slash, and were of the form http:server/path or https:server/path, then a colon would also suffice for that; i.e. :server/path could be treated as protocol relative.
Double slash means it’s a “hierarchical URI” which implies certain things about the structure (host, path) vs “non hierarchical” which is just schemeSpecificPart
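A quick way to see that split in practice, as a sketch using Python's standard `urllib.parse` (the example URLs are made up): the authority shows up in `netloc` only when the `//` is present, and otherwise everything after the colon lands in `path`.

```python
from urllib.parse import urlsplit


def describe(url: str) -> tuple:
    """Return (scheme, authority, path) as urllib.parse sees them."""
    parts = urlsplit(url)
    # netloc is the authority; it is empty when there is no "//host" part
    return parts.scheme, parts.netloc, parts.path


for example in (
    "https://example.com/a/b",      # hierarchical: host plus path
    "mailto:someone@example.com",   # no "//": only a scheme-specific part
    "data:text/plain,hello",        # likewise
):
    print(describe(example))
```

For the `mailto:` and `data:` examples the middle element is an empty string, which is about as far as a generic parser can go without knowing the scheme.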
That’s even more counterexamples, thanks! I’d previously used :// to try to spot things that are probably URLs in text, and few of those are likely to appear in human-authored text, but it’s quite depressing that there are so many exceptions to a fairly simple rule.
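For what it's worth, one way to patch up the `://` heuristic is to keep it and add an explicit allow-list of slash-less schemes. This is only a sketch: the `NO_SLASH_SCHEMES` list is an illustrative assumption, not a complete registry, and the pattern will happily swallow trailing punctuation.

```python
import re

# Schemes written without "//"; illustrative only, extend as needed.
NO_SLASH_SCHEMES = ("mailto", "tel", "data", "magnet")

URL_PATTERN = re.compile(
    r'\b(?:[a-z][a-z0-9+.-]*://|(?:%s):)[^\s<>"]+' % "|".join(NO_SLASH_SCHEMES),
    re.IGNORECASE,
)


def find_urls(text: str) -> list:
    """Return substrings that look like URLs under the heuristic above."""
    return URL_PATTERN.findall(text)
```

Anything of the form `word:` that is not on the list (for example "note: nothing here") is ignored, which keeps false positives in ordinary prose down at the cost of missing unlisted schemes.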
Actually, the // is part of the “stuff” as well. See mailto scheme URIs.
Now, if we use the concept of URLs as being web addresses, all the stuff has a known interpretation in the http scheme, just different naming as highlighted (but also maybe different interpretation in an application, like using fragment to define an anchor or a path in an SPA website).
Ugh, that makes them even more horrible than I thought. I’d never quite thought of mailto as a URL scheme but it does seem to think it’s one. And it’s decades too late to fix it.
mailto isn’t the only one that does this, it’s common in URI schemes! a few off the top of my head: magnet:, turn:, javascript:, view-source: so it’s not really something to “fix” but just to learn i think
Strictly speaking the // denotes the start of what the URL specification calls an “authority” (i.e., a domain name).
If the thing that comes after the : in a specific URI scheme isn’t a domain name, // probably shouldn’t be there.
There are some badly designed URI schemes whose designers did not appreciate this subtlety and erroneously require a // followed by something that isn’t an authority/domain name.
While the // may seem unnecessary today it actually does relate to one feature of relative URLs, namely protocol-relative addressing (e.g. <script src="//example.com/foo.js"/>). Though I suppose you could still have that syntax work even without having a // in absolute URLs.
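As a small illustration of protocol-relative addressing, Python's `urllib.parse.urljoin` resolves a `//host/path` reference by borrowing the scheme of the base URL. The page and script URLs below are made up for the example.

```python
from urllib.parse import urljoin


def resolve(page_url: str, reference: str) -> str:
    """Resolve a reference found in a page against that page's own URL."""
    return urljoin(page_url, reference)


# The same reference picks up whichever scheme the page was loaded with.
print(resolve("https://example.org/page.html", "//example.com/foo.js"))
print(resolve("http://example.org/page.html", "//example.com/foo.js"))
```

The first call yields an `https:` URL and the second an `http:` one, with the host and path taken from the reference in both cases.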
There’s a little bit more to the authority part than just a domain name. The full syntax is,
user:pass@host:port
The login part was never used much and is mostly deprecated now, for reasons that should be clear(text). The host part can be a network address or hostname; the network address does not have to be an IP address (it can be like [weirdnet;…]) and the hostname does not have to be a domain name (it might come from hosts.txt or netbios, etc.).
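Here is a sketch of pulling those authority pieces apart with Python's `urllib.parse`; the URL and the `authority_parts` helper are invented for the example, and a real application should think twice before accepting credentials in a URL at all.

```python
from urllib.parse import urlsplit


def authority_parts(url: str) -> dict:
    """Split the authority into its user, password, host and port pieces."""
    parts = urlsplit(url)
    return {
        "username": parts.username,
        "password": parts.password,
        "hostname": parts.hostname,
        "port": parts.port,
    }


print(authority_parts("https://user:pass@example.com:8443/path"))
```

Pieces that are absent come back as `None`, so a plain `https://example.com/` gives a hostname and nothing else.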
To expand on the above for the benefit of others: after the use of colons in IPv6 addresses led to the need to wrap IPv6 addresses in brackets for use in URLs, a syntax was designed for other network addresses “just in case”. IPv7 does not seem like a very likely possibility, but if it were to happen, the syntax would be https://[v7.FOO]//.
I wonder if non-integer ports are also feasible? I could also imagine some implementations supporting /etc/services service name lookup in the port field. Then there’s UNIX domain sockets, something I’ve idly contemplated extending URL schemes to support. Maybe https://[UNIX./var/lib/foo]/.
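On the port question: as far as I know RFC 3986 only allows digits there, and Python's `urllib.parse` behaves accordingly. The sketch below (with a hypothetical `host_and_port` helper and made-up addresses) shows both the bracketed IPv6 form and what happens with a service name in the port field.

```python
from urllib.parse import urlsplit


def host_and_port(url: str) -> tuple:
    """Return (hostname, port); port is None when absent or not numeric."""
    parts = urlsplit(url)
    try:
        port = parts.port
    except ValueError:
        # urllib.parse only accepts decimal ports, so a name is rejected
        port = None
    return parts.hostname, port


print(host_and_port("https://[2001:db8::1]:8443/"))  # brackets are stripped
print(host_and_port("https://example.com:https/"))   # service name as port
```

So a lookup in /etc/services would have to be an extension layered on top; the stock parser treats a non-numeric port as an error rather than a name.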
As regards the user:pass@ thing, it is interesting to note that URI schemes divide into having two completely different uses for this kind of syntax. In HTTP it is for specifying the credential used to access a resource. But in other schemes like mailto: or xmpp: it is for specifying a user identity being named as a resource.
Technically these are different syntaxes of course as the mailto: and xmpp: schemes do not use the // authority syntax. But it is a distinction worth contemplating if we’re going to think deeply about URI design.
You can still have arguments as path elements or query parameters too, just using template URLs like in for example OpenSearch. This way you don’t end up locked into a URL routing scheme that’s now been hardcoded into your API documentation and every client.
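A rough sketch of what template expansion can look like, loosely modelled on OpenSearch-style `{name}` and `{name?}` placeholders. The template URL and the `expand` helper are hypothetical, and this deliberately ignores parts of the real syntax such as namespaced parameters.

```python
import re
from urllib.parse import quote

PLACEHOLDER = re.compile(r"\{(\w+)(\?)?\}")


def expand(template: str, values: dict) -> str:
    """Fill {name} placeholders; {name?} is optional and may be left empty."""

    def substitute(match: "re.Match") -> str:
        name, optional = match.group(1), match.group(2)
        if name in values:
            # percent-encode so the value cannot change the URL's structure
            return quote(str(values[name]), safe="")
        if optional:
            return ""
        raise KeyError(name)

    return PLACEHOLDER.sub(substitute, template)


# The server publishes the template; the client only knows parameter names.
template = "https://example.com/search?q={searchTerms}&page={startPage?}"
print(expand(template, {"searchTerms": "url parsing"}))
```

The client never hardcodes where `q` or `page` go, so the server can reshuffle its routing by publishing a new template.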
Where scheme needs to follow certain rules, and schemepart needs to follow other rules governed by scheme, and the details of those rules (and their combinatorial requirements) are non-trivial! If you don’t need to interrogate a URL for details, treating it as an opaque string is a galaxy-brain maneuver, IMO.
Yeah, it’s a total mess. In CHICKEN, for this reason we split the eggs for parsing URLs into two: one “generic syntax” egg which simply parses the BNF from RFC 3986 (which is the successor to the one you linked) and one “common syntax” egg which deals with the specifics of HTTP and “suchlike” schemes. The “common syntax” egg also parses out the query string into key/value pairs, further percent-decodes path components after splitting them apart (because the insanity of the URI syntax means that for some schemes, %2F is identical to /, but for others you first need to split the path components and then decode %2F into slashes “inside” each path component), etc.
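The `%2F` ordering problem is easy to show in a few lines. This is a Python sketch of the general issue, not of the CHICKEN eggs themselves, and the path is made up.

```python
from urllib.parse import unquote


def decode_then_split(raw_path: str) -> list:
    """Decode first: an encoded slash turns into a real separator."""
    return unquote(raw_path).split("/")


def split_then_decode(raw_path: str) -> list:
    """Split first: an encoded slash stays inside its own component."""
    return [unquote(segment) for segment in raw_path.split("/")]


raw = "/files/a%2Fb/report.txt"
print(decode_then_split(raw))  # ['', 'files', 'a', 'b', 'report.txt']
print(split_then_decode(raw))  # ['', 'files', 'a/b', 'report.txt']
```

Which of the two results is "right" depends on the scheme, which is exactly why a generic parser has to leave the path undecoded and let a scheme-aware layer finish the job.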
The solution isn’t very intuitively appealing, but it seems to work well enough.
In the ActivityPub community I encountered a dev who treats URLs as opaque strings for anything internal to his application.
I can’t fathom trying to decouple the logic that paths, query params and everything else allows in favour of something so bland as a string identifier.
That’s not far off the right thing to do. A URL is three parts:
The scheme
://
The stuff whose interpretation depends on the scheme
Often, the best thing to do is treat the whole thing as an opaque token and pass it on to a different layer that understands the scheme and can do the right thing.
Huge “it depends”: data:, jar:, blob:, and about: also do not require a double slash. But chrome: and resource: mostly do. I guess it’s fair to say that everything after the colon is specific to the URL scheme.
Another common one is tel:
If anything, I wonder why the double slash was introduced at all by some schemes. Even in file: URLs it seems like a single slash should suffice and in http: it doesn’t add any value at all AFAICT. Does anyone happen to know the rationale behind it?
Tim Berners Lee said “It seemed like a good idea at the time”.
Less pithily:
https://www.internethalloffame.org/2012/06/06/berners-lee-world-finally-realizes-web-belongs-no-one/
I also remember reading a quote that Tim just liked the look of it, but I can’t find any source for that, so maybe it’s a false memory or I heard it in person at a lecture he gave.
Yes, in Domain, local paths were Unix style /like/this, remote systems paths were like //host/usr/bin/whatever, and you could specify the protocol like protocol://whateverhost/file/path.
Windows UNC paths also came from Domain.
Hah, thanks, now it all makes sense!
I don’t know the original rationale, but the double slash makes possible protocol-relative URLs.
I’m sure there’s a better link to describe them, but here’s what I found with a moment’s googling: https://www.paulirish.com/2010/the-protocol-relative-url/
If HTTP(S) URLs didn’t have a double slash, and were of the form http:server/path or https:server/path, then a colon would also suffice for that; i.e. :server/path could be treated as protocol relative.
Double slash means it’s a “hierarchical URI” which implies certain things about the structure (host, path) vs “non hierarchical” which is just schemeSpecificPart
That’s even more counterexamples, thanks! I’d previously used :// to try to spot things that are probably URLs in text, and few of those are likely to appear in human-authored text, but it’s quite depressing that there are so many exceptions to a fairly simple rule.
Actually, the // is part of the “stuff” as well. See mailto scheme URIs.
Now, if we use the concept of URLs as being web addresses, all the stuff has a known interpretation in the http scheme, just different naming as highlighted (but also maybe different interpretation in an application, like using fragment to define an anchor or a path in an SPA website).
Ugh, that makes them even more horrible than I thought. I’d never quite thought of mailto as a URL scheme but it does seem to think it’s one. And it’s decades too late to fix it.
mailto isn’t the only one that does this, it’s common in URI schemes! a few off the top of my head: magnet:, turn:, javascript:, view-source: so it’s not really something to “fix” but just to learn i think
Strictly speaking the // denotes the start of what the URL specification calls an “authority” (i.e., a domain name).
If the thing that comes after the : in a specific URI scheme isn’t a domain name, // probably shouldn’t be there.
There are some badly designed URI schemes whose designers did not appreciate this subtlety and erroneously require a // followed by something that isn’t an authority/domain name.
While the // may seem unnecessary today it actually does relate to one feature of relative URLs, namely protocol-relative addressing (e.g. <script src="//example.com/foo.js"/>). Though I suppose you could still have that syntax work even without having a // in absolute URLs.
There’s a little bit more to the authority part than just a domain name. The full syntax is,
user:pass@host:port
The login part was never used much and is mostly deprecated now, for reasons that should be clear(text). The host part can be a network address or hostname; the network address does not have to be an IP address (it can be like [weirdnet;…]) and the hostname does not have to be a domain name (it might come from hosts.txt or netbios, etc.).
Indeed, my mistake.
To expand on the above for the benefit of others: after the use of colons in IPv6 addresses led to the need to wrap IPv6 addresses in brackets for use in URLs, a syntax was designed for other network addresses “just in case”. IPv7 does not seem like a very likely possibility, but if it were to happen, the syntax would be https://[v7.FOO]//.
I wonder if non-integer ports are also feasible? I could also imagine some implementations supporting /etc/services service name lookup in the port field. Then there’s UNIX domain sockets, something I’ve idly contemplated extending URL schemes to support. Maybe https://[UNIX./var/lib/foo]/.
As regards the user:pass@ thing, it is interesting to note that URI schemes divide into having two completely different uses for this kind of syntax. In HTTP it is for specifying the credential used to access a resource. But in other schemes like mailto: or xmpp: it is for specifying a user identity being named as a resource.
Technically these are different syntaxes of course as the mailto: and xmpp: schemes do not use the // authority syntax. But it is a distinction worth contemplating if we’re going to think deeply about URI design.
Noooo!!!! /covers ears
You can still have arguments as path elements or query parameters too, just using template URLs like in for example OpenSearch. This way you don’t end up locked into a URL routing scheme that’s now been hardcoded into your API documentation and every client.
This is basically the right thing to do!
In general a URL is defined as nothing more than
Where scheme needs to follow certain rules, and schemepart needs to follow other rules governed by scheme, and the details of those rules (and their combinatorial requirements) are non-trivial! If you don’t need to interrogate a URL for details, treating it as an opaque string is a galaxy-brain maneuver, IMO.
Yeah, it’s a total mess. In CHICKEN, for this reason we split the eggs for parsing URLs into two: one “generic syntax” egg which simply parses the BNF from RFC 3986 (which is the successor to the one you linked) and one “common syntax” egg which deals with the specifics of HTTP and “suchlike” schemes. The “common syntax” egg also parses out the query string into key/value pairs, further percent-decodes path components after splitting them apart (because the insanity of the URI syntax means that for some schemes, %2F is identical to /, but for others you first need to split the path components and then decode %2F into slashes “inside” each path component), etc.
The solution isn’t very intuitively appealing, but it seems to work well enough.