Oh no. Unicode.
Unicode is fraught with difficulties, but it’s still a godsend. Try working on a system that’s non-ASCII based and you realize how simple Unicode is in comparison. Code pages are horrific to manage. (If you are using the Shift In and Shift Out codes, then you’ve reached a special place in programming hell.)
Unicode is a good thing (with some questionable historical decisions, but still better than the rest). I think the biggest problem is the preconception many programmers have about strings where they nicely fit the “one character per array index” model that just does not work well with real world text data. Of course the problem tends to show up while using Unicode, so blaming Unicode is a lot easier!
Indeed. The notion of char * with only ASCII has taught the wrong abstraction to a lot of people.
And a generation of Java programmers grew up with the idea that a 16-bit char represents a character (it did in UCS-2, of course). I think it is often also a lack of proper education on character sets and encodings in textbooks and undergraduate programs.
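To make the 16-bit point concrete, here is a minimal Python sketch that counts UTF-16 code units (what a Java char holds) against code points. The helper name utf16_units is made up for illustration.

```python
def utf16_units(text: str) -> int:
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids emitting a BOM.
    return len(text.encode("utf-16-le")) // 2

bmp = "\u00E9"         # é, inside the Basic Multilingual Plane
astral = "\U0001F600"  # an emoji outside the BMP

# One code point, one 16-bit unit: the UCS-2 assumption holds here.
print(len(bmp), utf16_units(bmp))
# One code point, two 16-bit units (a surrogate pair): the assumption breaks.
print(len(astral), utf16_units(astral))
```

Anything outside the BMP needs a surrogate pair, so indexing by 16-bit unit can land in the middle of a code point.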
There are still a handful of languages that use an array-of-32-bit-chars model, which retains the traditional equation of strings and character arrays, while also being able to represent all of Unicode. The String in the Haskell prelude comes to mind, although there are plenty of alternatives.
UTF-32 and other 32-bit encoding schemes are fixed-width, but only at the level of code points. Because of combining marks, a grapheme cluster (“a character”) may require multiple code points to be represented, which breaks the alignment.
Many textual representations can be done either way (e.g. ‘ç’ can either be ‘c + ¸’, or the pre-combined ç), but some representations just don’t fit that scheme, such as emoji, where skin color is done through combining tone modifiers. Another one is the zero-width joiner, which can be used to join arbitrary characters or emoji when supported (the family emoji can be made up of two adult emoji and one or two child emoji).
You really have to use your Unicode libraries’ functions and algorithms to work, and think in terms of grapheme clusters in multiple cases. Strings as arrays just don’t really make sense anymore.
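A small Python sketch of the cases above, using only the standard unicodedata module. It assumes the combining cedilla U+0327 for the decomposed form, and the emoji sequences are just examples. The standard library has no grapheme cluster segmentation, so this only shows the code point counts that a string-as-array model would give you.

```python
import unicodedata

precomposed = "\u00E7"  # ç as a single code point
decomposed = "c\u0327"  # c followed by COMBINING CEDILLA

# Same text on screen, different code point sequences.
print(precomposed == decomposed)
print(len(precomposed), len(decomposed))

# Normalizing both sides to NFC makes the comparison meaningful.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed)

# Waving hand plus a skin tone modifier: one visible symbol, two code points.
wave = "\U0001F44B\U0001F3FD"
# Man, ZWJ, woman, ZWJ, girl: one family emoji where supported, five code points.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(wave), len(family))
```

Counting what a user would call characters needs the segmentation rules from a Unicode library, not len().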
Racket does UTF-8 in unsigned int32. So, just because you use a 32-bit int, that doesn’t have to imply UTF-32… that is, unless my understanding is insufficient, and UTF-8 implemented this way is equivalent!
There is Unicode directly, using code points as larger integers. This, by definition (with joiners and combining marks), cannot have fixed-width characters.
UTF-8 as an encoding could certainly be stored in larger integers, but if you do it that way, you still have to implement the encoding using the same code units, which means it ends up longer than other variants. If not, the encoding is just not UTF-8.
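As a rough illustration of that difference in Python: the same two-character string stored as one code point per 32-bit slot, versus one UTF-8 code unit per 32-bit slot. The little-endian struct layout is an assumption made purely for the demo.

```python
import struct

text = "\u00E7\U0001F600"  # ç followed by an emoji

code_points = [ord(ch) for ch in text]   # 2 code points
utf8_units = list(text.encode("utf-8"))  # 2 + 4 = 6 UTF-8 code units

# One code point per unsigned 32-bit integer: this is just UTF-32.
as_code_points = struct.pack(f"<{len(code_points)}I", *code_points)
# One UTF-8 code unit per unsigned 32-bit integer: same units, wider slots.
as_wide_utf8 = struct.pack(f"<{len(utf8_units)}I", *utf8_units)

print(as_code_points == text.encode("utf-32-le"))
print(len(text.encode("utf-8")), len(as_code_points), len(as_wide_utf8))
```

Widening the slots does not change the number of UTF-8 code units, so that layout is the longest of the three.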
Unicode is terrible because the set of all human languages put together is terrible, and different languages interact strangely, with odd rules for specific characters that conflict with rules in other languages. If you want to support all human languages, Unicode is about as good as you can do.
It’s still awful to handle correctly, and I would be willing to bet that no software system gets it fully right.
Some encodings are as easy as ascii, others are really silly, but not necessarily much worse than unicode. Why do people always have to excuse unicode by comparing it to the worst of them all? Is the bar really that low?
It’s not the encoding that’s the godsend, it’s the fact that you don’t have to convert between encodings when crossing system boundaries.
Some encodings are as easy as ascii, others are really silly, but not necessarily much worse than unicode.
Could you give an example? Most other encodings require codepage switching (for 8-bit encodings), which is terrible, or have relatively high storage costs per character.
Also, what are you referring to with unicode, since unicode specifies multiple character encodings? I would argue that UTF-8 is one of the most sensible encodings:
Of course, because it is variable length, it does not provide constant-time indexing or length, but fixed-width encodings would require > 16-bits per character for a reasonable character inventory.
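A hedged Python sketch of why indexing is not constant-time in UTF-8: characters take one to four bytes, so finding the n-th code point means scanning from the start. The function nth_code_point_offset is a hypothetical helper and assumes its input is valid UTF-8.

```python
def nth_code_point_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point (0-based), found by a linear scan."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts a code point.
        if byte & 0xC0 != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError(n)

sample = "a\u00E9\u20AC\U0001F600"  # a, é, €, and an emoji
widths = [len(ch.encode("utf-8")) for ch in sample]
print(widths)  # one, two, three, and four bytes respectively

data = sample.encode("utf-8")
print([nth_code_point_offset(data, i) for i in range(len(sample))])
```

The lead byte pattern makes the scan cheap and self-synchronizing, but it is still linear in the length of the text.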
And, as ferd points out above, the entire concept of constant-time indexing into a sequence of Unicode codepoints is silly for most applications. (You’re almost certainly interested in grapheme clusters instead.) So the fact that UTF-8 doesn’t facilitate it isn’t much of a loss.
The talk was given in Quebec, that’s pretty cool. Anyone know of good conferences or meetups in Montreal / Quebec / possibly Ottawa? I’m thinking of getting more involved locally.
There are a lot of good meetups in Montreal; what scope were you looking for?
I am obviously biased, but I think Papers We Love Montreal (which I organize) is a solid meetup, if you’re into that sort of thing. Lambda Montreal is also interesting and formed out of a bunch of interesting language-specific groups. Montreal Hack’n’Tell, which is new, is slowly taking form. And I am told that All-Girl Hack Night Montreal is excellent.
As for conferences, netdevconf just happened, it was great. Recon is here every year, and is usually excellent. There are plenty of others but I can’t comment on them. (Ottawa has at least BSDCan and PGCon.)
Debconf is located in Montreal this year.
I checked out lambda and papers we love and they look very interesting! I’ll check out bsdcan as well.
Might be able to come to papers we love on the 27th
Looking forward to seeing you there!
Personally, I go to Northsec and Recon every year. The first one is a netsec conference with lots of content and one of the biggest on-site CTFs on the weekend; it’s definitely one of my favorite conferences in Quebec. This year it also features workshops, which are a kind of hands-on talk (I’m giving Script Engine Hacking For Fun And Profit!). Also, free beer throughout the event ensures you never get thirsty.
Recon is much more geared toward reverse engineering and hacking. Also fun, but a bit more expensive and doesn’t really have entry-level content.
There’s also Hackfest in Quebec City, with a small CTF and many netsec talks.
The talk in Saguenay for the SagLacIO was just awesome. As a programmer, you get to live with all these problems and it’s rarely getting better. Thanks for sharing this Fred, you are so right, everything is terrible ᴥ
I guess I missed the iPhone 1970 bug the first time around. Was it really only 1970? Or 1971 too? Because some of the explanations involving time underflow don’t really make sense. But if it were a code signature failure, that could happen for any day that’s too old. Of course it’s hard to find a more authoritative source for what happened than a gizmodo article “I googled 1970 and here’s what I learned about unix”.
There were quite a few sources, but none that showed up as well on a slide. Here’s a few:
The common guess going around is that it has to do with code signatures, either too old or too new. There’s been no official explanation of the exact details, and everything is a guess so far. I personally like the underflow theory (and I do mention that it is unconfirmed and that I only wish it to be true). I also used the 32-bit overflow (year 2106 for an unsigned counter) as an example, even though the bug affected 64-bit devices only. It didn’t feel particularly dishonest given the wording around it.
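For reference, the 32-bit rollover dates can be worked out directly from the Unix epoch. This Python sketch only does the arithmetic for signed and unsigned second counters; it says nothing about what actually went wrong on the phones. The helper name last_representable is made up.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def last_representable(max_seconds: int) -> datetime:
    # Latest instant a counter of seconds since the epoch can hold.
    return EPOCH + timedelta(seconds=max_seconds)

signed_32 = last_representable(2**31 - 1)    # signed 32-bit counter
unsigned_32 = last_representable(2**32 - 1)  # unsigned 32-bit counter

print(signed_32.isoformat())    # falls in January 2038
print(unsigned_32.isoformat())  # falls in February 2106
```

One second past either limit, the counter wraps around, which is where the overflow and underflow theories come from.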
Oh, thanks. Sorry if I was unclear, I didn’t mean you needed well sourced links for the talk. I’m just curious about the exact cause and was hoping somebody had figured it out.
I’m not certain I understand the point of this presentation. Yes, this is a list of difficult problems in security, distributed systems, and data handling, but they are all pretty basic skills that a technologist must learn. Does a chef who wants to cook complain that knife skills, food cost percentages, and pastry theory are terrible?
Firstly, we have the option of making several of those things less terrible.
Secondly, many technologists clearly have not learned these skills, and that hasn’t stopped them from harming users by building widely used software that loses data or has serious bugs (e.g. the OWASP top 10).