81% seems high to me, so wondered what qualified as an identifier in this context:
The “Identifier” category comprises process IDs, user IDs, column or row IDs, etc. that usually must be unique
And it was 81% of strings they identified as candidates for TSLs, only 11% of all strings in their corpus.
11% of identifiers as strings is still substantial. I definitely agree identifiers should have their own type that guarantees validity and use in correct contexts.
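A sketch of what a dedicated identifier type could look like in Python (the `u_` format and all names here are invented for illustration):

```python
from typing import NewType

# Distinct identifier types with the same runtime representation as str.
# NewType has zero runtime cost; a type checker (e.g. mypy) rejects mix-ups.
UserId = NewType("UserId", str)
OrderId = NewType("OrderId", str)

def parse_user_id(raw: str) -> UserId:
    """Validate once at the boundary, then carry the proof in the type."""
    if not raw.startswith("u_") or len(raw) != 10:
        raise ValueError(f"not a valid user id: {raw!r}")
    return UserId(raw)

def cancel_orders(user: UserId) -> None:
    ...  # can rely on `user` being a well-formed user id

user = parse_user_id("u_00000042")
cancel_orders(user)            # OK
# cancel_orders("u_00000042")  # flagged by mypy: str is not UserId
```

The guarantee is static only; at runtime a `UserId` is still a plain string, which keeps the cost of adopting such types low.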
Basically when you validate, you’re throwing away the information that the input is a valid “thing”. With parsing, you keep that information using the type system.
For the SQL example, you could parse a String to get an SQLQuery type, and your queryDB() functions should take SQLQuery instead of String. I think this is usually implemented in the way described in the article for historical reasons, related to the fact that OOP doesn’t (easily) allow the definition of type aliases and so on.
Very interesting article, though!
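A sketch of that suggestion in Python, reusing the comment’s `SQLQuery`/`queryDB` names (the keyword check is only a stand-in for a real parser):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SQLQuery:
    """Holding one of these is proof the text was checked at the boundary."""
    text: str

def parse_sql(raw: str) -> SQLQuery:
    # Stand-in check only; a real implementation would parse the grammar.
    if not raw.lstrip().upper().startswith(("SELECT", "INSERT", "UPDATE", "DELETE")):
        raise ValueError(f"not a recognized SQL statement: {raw[:40]!r}")
    return SQLQuery(raw)

def queryDB(query: SQLQuery) -> None:
    ...  # takes SQLQuery, not str: unchecked text can't reach this point

queryDB(parse_sql("SELECT name FROM products"))
# queryDB("SELECT name FROM products")  # rejected by a type checker
```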
This is an oversimplification of a whole set of details that vary case by case. JSON supports only maps, lists, and basic types, all of which map trivially onto the built-in data structures of virtually any high-level general-purpose programming language. That’s fine and straightforward. But in the case of SQL, I’ll argue that your program doesn’t, and practically won’t effectively, have awareness of an SQL ‘thing’ beyond that string being a string. Perhaps with typed placeholders, but that is all.
There are attempts to parse the syntax, but they never mirror the parser of the RDBMS 100%. And even if they did, you would still have the semantics left, and even if you theoretically reached semantic-validation parity with your RDBMS, you would still need to validate the queries against your schema. At this point you have built an RDBMS minus the storage engine and a couple of other pieces, and should ask yourself why you are using an RDBMS.
The post is well intentioned and has good theoretical advice but lacks engineering insight. Creating types for everything and nothing, and expecting to have bulletproof validation for things like SQL, is a fast way to build a horribly overcomplicated API no one likes to use. Even things like URLs and CSV files are deceptively complex, with tons of corner cases and pitfalls.
You are better off treating them as strings and writing your own validation, on the stricter side, for your specific case.
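For what it’s worth, the “typed placeholders” mentioned above are what parameterized queries already give you; a standard-library Python sketch (table and data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, color TEXT, list_price REAL)")
conn.execute("INSERT INTO products VALUES (?, ?, ?)", ("Road Frame", "Red", 337.22))

# The SQL text stays an opaque string; only the placeholders are typed,
# and the driver handles quoting/escaping of the bound values.
color = "Red"
rows = conn.execute(
    "SELECT name, list_price FROM products WHERE color = ?", (color,)
).fetchall()
print(rows)  # [('Road Frame', 337.22)]
```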
Let’s take SQL for example. Every API in every programming language that I’ve seen considers an SQL statement a string.
LINQ to Entities is an example of a SQL interface that doesn’t require treating SQL as a string.
LINQ (Language Integrated Query) is the .NET platform’s answer to queries. You can query data structures, SQL databases, CSV, XML, JSON, and other data sources with the same SQL-like syntax. Queries can be either evaluated directly as method calls or converted to an expression tree for further analysis and optimization.
Here is an example C# snippet that uses the LINQ to Entities ORM.
String color = "Red";

using (AdventureWorksEntities context = new AdventureWorksEntities())
{
    var query =
        from product in context.Products
        where product.Color == color
        select new
        {
            Name = product.Name,
            ProductNumber = product.ProductNumber,
            ListPrice = product.ListPrice
        };

    foreach (var product in query)
    {
        Console.WriteLine("Name: {0}", product.Name);
        Console.WriteLine("Product number: {0}", product.ProductNumber);
        Console.WriteLine("List price: ${0}", product.ListPrice);
        Console.WriteLine("");
    }
}
Another very important thing that is not a string: code.
Code is not text. Text editors are not code editors.
We’re decades behind in terms of tooling because some of the pioneers decided to represent code as text in the wrong places, and everyone then followed their lead instead of the people who were doing it right and representing code as the structure that it is.
Word has very good format backwards compatibility. 97 is trivial because that version of the format was used until 2007, but I bet 95 wouldn’t be hard either.
Just because Unix makes everything other than plain text excruciating doesn’t make it a bad idea.
You’re confusing quality of idea with quality of implementation.
The mess of textual semi-formats that we have now is far worse than any Microsoft Word format, which is a particularly perverse case that is not representative of binary formats in general. See: image files, dynamic libraries, and things like the Blender file format, which do their jobs very well and rarely have issues. Moreover, I bet that if you open a 1995 Microsoft Word file, it’ll usually work fine…in Microsoft Word.
The Lisp (and maybe Smalltalk?) people were onto something by making their languages homoiconic and representing code as simple primitives (with lists, symbols, numeric and string literals being able to represent almost all code), but they didn’t go all the way and represent syntax nodes with objects, and write editors that directly operated on those nodes.
I might very well be confusing quality of idea with quality of implementation, but I find it odd that in the 65 years of programming languages, no one has come up with a representation of code that isn’t text based. Or perhaps more accurately, one that has taken off like image files.
I would also note that there isn’t an image file format, but several competing ones. There isn’t a dynamic library format, but again, there are several competing ones. But code? It’s been text for decades. Why is that? Why didn’t the Lisp and Smalltalk communities represent syntax nodes with objects? I can see it being difficult to do in the late ’60s/early ’70s, when such work was initially being done, but since then? I personally think it’s harder than it looks.
Or perhaps more accurately, one that has taken off like image files
That’s because of a number of reasons: bandwagoning around text, perceived simplicity of text (deceptive, because text’s complexity scales poorly - text makes simple things easy but complex things hard), momentum of tools, education, and mindshare, chicken-and-egg problem after textual tools became dominant, text being more “flexible” (e.g. you can draw ASCII diagrams in it - which is pitiful compared to being able to embed animations in structural code, but people don’t think about that), etc.
Early on in computing, the pioneers faced a choice (kind of) to evolve programming down a Lisp-like path or a C/Unix path. The C/Unix path won, everything became text-based, and then people became so attached to their tools and code that they didn’t consider that there was a better way.
Here I was going to say that, even if more complex than they look, structural code editors are easy, except that just a few hours ago I discovered a new problem (one that threw a wrench into my formerly-small-and-elegant structural editor design) that I hadn’t thought of before: redefinition of code forms. At the level of the editor I’m designing, function redefinitions are easy, because the editor treats all function calls the same. However, macros and special forms have special logic for navigation, editing, and display of their contents, so if you redefine one of them, you have to update existing code somehow. So, perhaps it isn’t as easy as I thought… (although I’ve been looking through the internet for reasons why structure editors haven’t taken off, and I haven’t found any showstoppers yet)
It’s a lot harder than it looks. There are some quite compelling disadvantages to a non-text-based format.
I’ll use Bintran, an imaginary language with a non-text based format, to illustrate these issues.
Here are the unrealistic assumptions I am making:
Bintran was created in the 1970s and has somehow survived to the present day.
The developers didn’t make any of the design compromises that are a necessary part of shipping a system.
They never made technical choices with a short-term advantage but long term costs.
All of their choices worked out in a way that never occurs in reality.
The userbase is easy to support and has a cultlike loyalty to the product.
Issues:
OS Changes - OSes come and go and most OSes have disappeared. Many applications don’t survive the death of the platform they were created on.
Bintran was originally written in assembly language (MACRO-11) for RSX-11, a popular minicomputer of the time. Bintran was rewritten in VAX Fortran when VMS was introduced. Bintran was subsequently rewritten in C and ported to Unix and Windows.
Encoding issues - Encodings wax and wane in popularity; UTF-16 and EBCDIC were both popular when they were introduced.
Bintran was written on an ASCII system and sold only in the US. The customer base didn’t require any sort of i18n support or extended ASCII until after UTF-8 was established. The developers predicted that UCS-2 would have to be extended into a variable length encoding and chose UTF-8 instead.
Format Changes - Binary formats make interoperability much harder.
The Bintran format was designed to last forever. It made pervasive use of type-length-value and the length field was a variable-length integer. Most changes were backward compatible and the few that weren’t didn’t cause any significant issues. The customer base recognized the merits of a binary format and always upgraded all Bintran installs within a short period of time. The usual compatibility issues found in most products with v3, v4, v7, and v9 all interoperating were never a problem.
Editors - Most programmers have a favorite text editor and want to use it with your language.
Bintran included a full-screen editor as soon as it was technically possible. The in-house editor was regularly updated with features, so users were always willing to learn the Bintran editor instead of using vi, Brief, Emacs, Visual SlickEdit, Jed, Epsilon, etc.
Tooling - People want to write their own tools for a new language.
Bintran included an API for third-party tooling after the port to VMS. The API was remarkably well-designed and required few breaking changes. All of the third party tools were updated soon after a new version was released; Hyrum’s Law was never an issue.
Version Control - Most companies have a preferred version control system. Products that don’t work with it will not be deployed.
Bintran included a market-leading version control system for no additional fee. Most customers that used another VCS were also willing to deploy the Bintran VCS. The product also supported an ASCII encoding of the binary format for the companies that weren’t willing to deploy the Bintran VCS.
The Lisp (and maybe Smalltalk?) people were onto something by making their languages homoiconic and representing code as simple primitives (with lists, symbols, numeric and string literals being able to represent almost all code), but they didn’t go all the way and represent syntax nodes with objects, and write editors that directly operated on those nodes.
They did! Interlisp-D on Xerox Lisp machines had SEdit, which operated on the s-expression cells in memory, not as text.
I should have clarified - “syntax nodes” here means representing something like a LET form as a single object (in the OO sense) and not as a conglomeration of smaller ones (lists and symbols). I tried to make a structure editor that operated on source on the s-expression level, and it had many of the same problems as text.
They weren’t the only ones either.
Interactive Programming Environments, a book published in 1984, describes a number of such systems.
Most structured editors either required substantial resources, weren’t portable to newer platforms, or both.
I don’t quite agree that they’re not strings. They are strings, at least from the user’s perspective. However, they would benefit from a type that isn’t a generic string:
hashing uses bytes, and bytes depend on encoding, so you should be consistent with that (e.g. always hash NFC-normalized UTF-8 bytes)
you don’t want passwords to get printed in logs or data dumps. A non-printable container could help with that.
for an extra level of paranoia, you may want to zero out memory when the password object is freed.
in other words, it’s simply a different type with an identical representation, String.
Why does that matter? In my opinion, you should treat passwords as mostly opaque identifiers. One possible design thought experiment is “Should Password support length operations?”
Pro: yes, it should, because we must validate the length of passwords
Con: no, it shouldn’t, because we should validate the length of a string representation of the password prior to legally converting it into a password which is now opaque
Both feel reasonable, just slightly different styles. There are other possible paths here too: “No, Password should only support entropy evaluations.” But in any case, we can discuss how String and Password differ.
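A rough Python sketch combining the points above (opaque container, normalized hashing, constant-time comparison). This is illustrative only; a real system would use a salted KDF such as bcrypt or argon2 rather than a bare SHA-256:

```python
import hashlib
import hmac
import unicodedata

class Password:
    """Sketch of an opaque password container (illustrative, not vetted crypto)."""

    def __init__(self, raw: str) -> None:
        # Hash NFC-normalized UTF-8 bytes so visually identical inputs agree.
        data = unicodedata.normalize("NFC", raw).encode("utf-8")
        self._digest = hashlib.sha256(data).digest()

    def __repr__(self) -> str:
        return "Password(<redacted>)"  # keep the secret out of logs and dumps

    def matches(self, attempt: str) -> bool:
        other = hashlib.sha256(
            unicodedata.normalize("NFC", attempt).encode("utf-8")
        ).digest()
        return hmac.compare_digest(self._digest, other)  # constant-time compare
```

Note that zeroing memory on free, the third bullet, is not reliably possible in Python (immutable strings, garbage collection); that point really calls for a lower-level language.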
Yeah, I was slightly confused by this one too. My best guess is that passwords are subject to restrictions (length, requiring non-alphanumeric characters, etc.) that a plain string isn’t.
You can save yourself a lot of headache if you stop treating everything that can be represented as a string, as a string.
Yeah, no. This is the way I coded for many, many years, and while it’s right, it’s also misleading. I’d state it more like: things are only things to the degree you use them to solve problems.
Using a URL to do web navigation? Use that URL object. Same for the other examples. If you’re using them to do the kinds of things you need from that type, use that type, of course. Not everything is a string.
But if you’re not using them to do those types of things, don’t worry about it. In fact, it’s none of your business. If I’m writing code where somebody is passing me a list of emails and I’m sorting them (a trivial example, admittedly), then they’re just strings. Who cares if they validate as email addresses or not? Don’t go borrowing trouble where you don’t have to. You start validating every piece of data against some standard you really have no use for, you’ll end up with a bunch of pain and suffering where you didn’t need any.
Types exist for a reason. If the reason matches what you’re doing, use the type. If not, don’t. (You also need to be careful that the reason the type was created exactly matches your reason, but that’s a rant for another day)
I liked the article. You can do too little typing and too much typing. The trick is to use just as much as you need, no more and no less.
Somewhat related: as long as you are converting strings to another typed representation, you could do the same for all input coming from outside the system. If you put outside data into a DirtyString class/type and only use regular strings (or whatever typed representations you create) for processing/serialization, then you have to convert the DirtyString to a normal string to use it. All of your validations against outside data can live in the conversion function instead of being spread out to every place you take in data. Plus, it’s a type-safe way to segment internal and external data.
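A minimal Python version of that pattern; the specific checks are placeholders for whatever your system actually needs to enforce:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DirtyString:
    """Wrapper for any text that arrived from outside the system."""
    raw: str

    def to_clean(self) -> str:
        # All boundary validation lives here instead of at every call site.
        s = self.raw.strip()
        if "\x00" in s:
            raise ValueError("embedded NUL in external input")
        if len(s) > 1024:
            raise ValueError("external input too long")
        return s

name = DirtyString("  Ada Lovelace  ").to_clean()  # -> "Ada Lovelace"
```

Internal code that only accepts plain `str` then can’t be handed raw external input by accident; the type checker forces the conversion.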
Addendum to SQL: SQLAlchemy is a fantastic SQL query builder in Python. I wouldn’t recommend the ORM (because I don’t like ORMs), but SQLAlchemy Core is great to use.
”…we have a collective delusion that anything that can be represented as a string, is a string…”
This just isn’t true. Or am I delusional? That is not my impression of developers at all.
Edit: maybe it’s me who is delusional? I’m from an OOP background and only ever work in strongly typed languages; I guess the people I’m around are from that same background.
The (vast) majority of strings are identifiers of some sort, at least according to the Wyvern folks: Safely Composable Type-Specific Languages (pdf).
They looked at strings in Java constructors:
plug: that’s why I created Polymorphic Identifiers.
A related and interesting article on this topic: Parse, don’t validate.
Text may be a horrible format, but the rest are worse. Just try loading a Microsoft Word document from 1995. I’d be willing to bet it’s not that simple.
Also, who were doing it right, in your opinion?
This is the least obvious one to me, and I notice it’s the only one for which you didn’t give examples of typed representations. Do you know of any?
Passwords cannot be safely compared for equality using string functions; you can run into timing attacks if you do.
Not that you should ever have to compare the plaintext of a password.
a classic blog post on the topic
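Python’s standard library, for instance, ships a constant-time comparison for exactly this reason; note it is meant for comparing digests and tokens, not plaintext passwords:

```python
import hashlib
import hmac

stored = hashlib.sha256(b"correct horse").hexdigest()

# '==' can short-circuit at the first differing character, leaking timing
# information; hmac.compare_digest takes time independent of where they differ.
assert hmac.compare_digest(stored, hashlib.sha256(b"correct horse").hexdigest())
assert not hmac.compare_digest(stored, hashlib.sha256(b"guess").hexdigest())
```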
If all you’re doing is sorting something, then you truly only have a string. Or maybe even merely an ordered object of some kind.
Ask for as little as you can afford to.