There is no such thing as “number of characters in the string”. https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#length()
This does say "the number of Unicode code units in the string"
and
"the length of the sequence of characters represented by this object".
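A minimal sketch (class name is mine; standard library only) of how far apart those counts can be for a single visible symbol:

    import java.nio.charset.StandardCharsets;

    public class LengthDemo {
        public static void main(String[] args) {
            String clef = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF, one visible symbol
            System.out.println(clef.length());                          // 2 UTF-16 code units
            System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
            System.out.println(clef.getBytes(StandardCharsets.UTF_8).length); // 4 bytes in UTF-8
        }
    }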
Conventionally, many have used and continue to use the word “character” to refer to a single grapheme in a piece of text. The point I take @singpolyma and Manish (the person who wrote the linked blog post) to be making is that they shouldn’t.
In short: the word “character” makes sense under ASCII, but doesn’t make sense under Unicode (whatever encoding you’re using). Unicode code points are definitely not “characters,” and grapheme clusters are only sort-of “characters.”
More broadly, adhering to out-of-date ASCII-centric wording helps to perpetuate the exclusion of people who use languages whose scripts are non-Latin (most of the world), and can lead to incorrect thinking around the representation and manipulation of text exactly in the manner the top blog post in this discussion describes.
Can you clarify the three terms for us, then: character, code point, grapheme?
What’s the difference between a character, a code point, a glyph and a grapheme?
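And a small Java sketch of the code point vs. grapheme distinction (class name is mine; BreakIterator only approximates user-perceived characters):

    import java.text.BreakIterator;

    public class GraphemeDemo {
        public static void main(String[] args) {
            String s = "e\u0301"; // "é" built from 'e' + U+0301 COMBINING ACUTE ACCENT
            System.out.println(s.length()); // 2: two code points, two UTF-16 code units

            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            int graphemes = 0;
            while (it.next() != BreakIterator.DONE) graphemes++;
            System.out.println(graphemes); // 1: one user-perceived character
        }
    }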
I respect that you’re finding your way, but this is one tutorial-level guide article too many for me on Lobste.rs. I firmly believe that this is not the place for this variety of content.
That being said, contrary to the flags available on Lobste.rs, I wouldn’t classify this as “spam” or “off-topic”. Some more helpful descriptions would be that the piece is redundant with other, easily-found works; is not comprehensive enough for this site’s readership; or needs more thought and substance put into it before it is ready for Lobste.rs. (If any mods see this, I am interested in your thoughts.)
Good luck with everything you do; onward and upward!
An important point that you should consider: Lobsters treats comments as upvotes (with a cap, but still).
If you don’t like something, you should not comment on it.
Unless I think there’s something worth discussing; then the system works as intended. I appreciate the reference, and I did not know about that, but I will still speak.
I am sorry, but if you still do not know about multi-byte characters in 2020, you should really not be writing software. The 1990s, when you could assume 1 byte == 1 char, are long past.
https://jvns.ca/blog/2017/04/27/no-feigning-surprise/
Nobody was born knowing about multi-byte characters. There’s always new people just learning about it, and probably lots of programmers that never got the memo.
The famous Joel article on Unicode is almost old enough to vote in most countries (17 years). There is really no excuse to be oblivious to this: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
This is especially problematic if you read the last paragraph, where the author gives encryption/decryption as an example. If somebody really is messing with low-level crypto APIs, they have to know this. There is no excuse. Really.
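To make the crypto point concrete: digest and cipher APIs consume bytes, not strings, so the encoding chosen when converting has to match on both ends. A minimal sketch (class name is mine; not a real password scheme):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Arrays;

    public class DigestDemo {
        public static void main(String[] args) throws Exception {
            String secret = "Zürich";
            byte[] utf8   = secret.getBytes(StandardCharsets.UTF_8);      // 7 bytes
            byte[] latin1 = secret.getBytes(StandardCharsets.ISO_8859_1); // 6 bytes

            MessageDigest md = MessageDigest.getInstance("SHA-256");
            // Same string, two encodings, two different digests:
            System.out.println(Arrays.equals(md.digest(utf8), md.digest(latin1))); // false
        }
    }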
Junior Programmers aren’t born having read a Joel Spolsky blog post. There are, forever, people just being exposed to the idea of multibyte characters for the first time. Particularly if, like a lot of juniors are, they’re steered towards something like the K&R C book as a “good” learning resource.
Whether or not this blog post in particular is a good introduction to the topic is kind of a side-point. What was being pointed out to you was that everyone in 2020 is supposed to have learned this topic at some point in the past is beyond silly. There are always new learners.
You are aware that there are new programmers born every day, right?
Right, but this author is purporting to be able to guide others through this stuff. If they haven’t worked with it enough to see the low-hanging edge cases, they should qualify their article with “I just started working on this stuff, I’m not an expert and you shouldn’t take this as a definitive guide.” That’s a standard I apply to myself as well.
We should perhaps not expect newcomers to know about encoding issues, but we should expect the tools they (and the rest of us) use to handle it with a minimum of bad surprises.
That’s a little harsh, everyone has to learn sometime. I didn’t learn about character encoding on my first day of writing code, it took getting bitten in the ass by multibyte encoding a few times before I got the hang of it.
Here is another good intro to multibyte encoding for anyone who wants to learn more: https://betterexplained.com/articles/unicode/
Right, but you’re not the one writing and publishing an article that you intend for people to use as a reference for this type of stuff. People are responsible for what they publish, and I hold this author responsible to supply their writing with the caveat that their advice is incomplete, out-of-date, or hobby-level—based, I presume, on limited reading & experience with this stuff in the field.
I’m sure that if I knew you well enough, I could find three things you didn’t know that respected developers would say means “you should really not be writing software”.
Yes it’s 2020, but also, yes, people still get this wrong. 90% of packages addressed to me mangle my city (Zürich) visibly on the delivery slip, as do many local(!) food delivery services.
Every time I make a payment with Apple Pay, the local(!) app messes up my city name in a notification (the wallet app gets it right).
Every week I’m handling support issues with vendors delivering data to our platform with encoding issues.
Every week somebody in my team comes to me with questions about encoding issues (even though by now they should know better).
This is a hard problem. This is also a surprising problem (after all „it’s just strings“).
It’s good when people learn about this. It’s good when they communicate about this. The more people write about this, the more will get it right in the future.
We are SO far removed from these issues being consistently solved all throughout the industry.
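For what it’s worth, the Zürich mangling almost always has the same shape: UTF-8 bytes decoded with a legacy single-byte charset. A minimal reproduction in Java (class name is mine):

    import java.nio.charset.StandardCharsets;

    public class MojibakeDemo {
        public static void main(String[] args) {
            byte[] utf8 = "Zürich".getBytes(StandardCharsets.UTF_8);
            String mangled = new String(utf8, StandardCharsets.ISO_8859_1);
            System.out.println(mangled); // prints ZÃ¼rich, the classic delivery-slip bug
        }
    }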
I know all that. My first name has an accented character in it. I get broken emails all the time. That still does NOT make it okay. People who write software have to know some fundamental things, and character encodings are one of them. I consider it as fundamental as understanding how floats work in a computer, that they are not precise, and what problems that causes.
The article being discussed is not good and is factually wrong in a few places. It is also not written in a style that sounds like somebody documenting what they have learned; it is written as a statement of facts. The tone makes a big difference.
There’s a difference between knowing there’s a difference, which I suspect is reasonably common knowledge, and knowing what the difference is.
There are very few things that every software developer needs to know–fewer than most lists suggest, at least. Multi-byte encodings and Unicode are about as good a candidate as exists for inclusion on that list.
However, people come to software through all different paths. There’s no credential or exam you have to pass. Some people just start scripting, or writing mathematical/statistical code, and wander into doing things. Many of them will be capable of doing useful and interesting things, but are missing this very important piece of knowledge.
What does getting cranky in the comments do to improve that situation? Absolutely nothing.
I think that this is one of the problems with our industry. People with zero proof of knowledge are futzing around with things they do not understand. I am not talking about hobbyists here, but about people writing software that is used to run critical infrastructure. There is no other technical profession where that is okay.
I think our field is big and fast enough that dealing with things we don’t yet understand has just become part of the job description.