The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

[Figure: How UTF-8 works]

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops.
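The byte counts described above can be checked directly. The following is a minimal sketch in Python; the sample characters and the helper name `utf8_length` are arbitrary choices for illustration, not part of the original text. Note that the current definition of UTF-8 (RFC 3629) stops at 4 bytes per code point; the 5- and 6-byte forms belonged to the original design.

```python
# Sketch: how many bytes UTF-8 uses for a few sample code points.
# utf8_length is a hypothetical helper name used only for this illustration.
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

# U+0041, U+00E9, U+20AC, U+1F600
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(ch) for ch in samples]
print(lengths)

# Pure ASCII text produces identical bytes in ASCII and in UTF-8.
ascii_bytes = "Hello".encode("ascii")
utf8_bytes = "Hello".encode("utf-8")
same = ascii_bytes == utf8_bytes
print(same)
```

Code points below 128 take one byte each, which is why the ASCII and UTF-8 byte strings compare equal.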

.. UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you, it can still squeeze through unscathed.
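The high-bit guarantee can be observed with Python's built-in `utf-7` codec. This is only a sketch; the sample string is an arbitrary choice.

```python
# Sketch: every byte of UTF-7 output stays below 0x80 (high bit zero).
text = "na\u00efve caf\u00e9"  # contains two non-ASCII letters

encoded = text.encode("utf-7")
high_bit_clear = all(b < 0x80 for b in encoded)
round_trip = encoded.decode("utf-7") == text

print(encoded)
print(high_bit_clear, round_trip)
```

The non-ASCII letters are carried as base64 runs between `+` and `-`, so a 7-bit channel never sees a byte with the top bit set.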

.. If there’s no equivalent for the Unicode code point you’re trying to represent in the encoding you’re trying to represent it in, you usually get a little question mark: ? or, if you’re really good, a box. Which did you get? -> �

.. There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252

.. ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks.
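The question-mark behaviour is easy to reproduce. The sketch below uses Python's standard codecs; the Russian sample word is an arbitrary choice, and `errors="replace"` is one of several error handlers Python offers, not the only possible behaviour.

```python
# Sketch: Russian text has no representation in Latin-1.
text = "\u041f\u0440\u0438\u0432\u0435\u0442"  # the Russian word for "hello", 6 letters

# With the "replace" handler, each unencodable character becomes "?".
replaced = text.encode("latin-1", errors="replace")
print(replaced)

# With the default strict handler, the same call raises instead.
try:
    text.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)

# In the other direction, undecodable bytes become U+FFFD, the "box".
box = b"\xff".decode("utf-8", errors="replace")
print(box == "\ufffd")
```

The replaced output has lost the original text for good: nothing in the six question marks says which letters they used to be.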

.. It does not make sense to have a string without knowing what encoding it uses.

.. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
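To make this concrete, here is a small sketch showing one byte decoded under several encodings. The byte value `0xE9` and the encodings chosen are arbitrary illustrations.

```python
# Sketch: the same byte means different things under different encodings.
raw = b"\xe9"

as_latin1 = raw.decode("latin-1")      # Latin small letter e with acute
as_cp1251 = raw.decode("cp1251")       # Cyrillic small letter short i
as_hebrew = raw.decode("iso-8859-8")   # Hebrew letter yod

# As UTF-8 the byte is not valid on its own: it announces a multi-byte
# sequence that never arrives.
try:
    raw.decode("utf-8")
    utf8_valid = True
except UnicodeDecodeError:
    utf8_valid = False

print(as_latin1, as_cp1251, as_hebrew, utf8_valid)
```

Nothing in the byte itself says which reading is right; that information has to travel alongside the data.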

.. they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle.


The Harmful Consequences of Postel’s Maxim

No implementation can hope to avoid having to trade correctness for
interoperability indefinitely.

An implementation that reacts to variations in the manner advised by
Postel sets up a feedback cycle:

  • Over time, implementations progressively add new code to constrain
    how data is transmitted, or to permit variations in what is
    received.
  • Errors in implementations, or confusion about semantics can thereby be masked.
  • These errors can become entrenched, forcing other implementations to be tolerant of those errors.

.. the original JSON specification [RFC4627] omitted critical details on a range of points including Unicode handling, ordering and duplication of object members, and number encoding. Consequently, a range of interpretations were used by implementations. An update [RFC7159] was unable to correct these errors, instead concentrating on defining the interoperable subset of JSON.

Thomson                 Expires December 14, 2017
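The duplicate-member ambiguity can be seen with Python's standard `json` module. This sketch shows only what that one parser does; other parsers may keep the first value, keep all values, or reject the document.

```python
# Sketch: one JSON text, more than one plausible interpretation.
import json

doc = '{"a": 1, "a": 2}'

# Python's json module silently keeps the last value for a repeated name.
last_wins = json.loads(doc)
print(last_wins)

# object_pairs_hook exposes every member in document order, showing that
# the input really carried two values for "a".
all_pairs = json.loads(doc, object_pairs_hook=list)
print(all_pairs)
```

Two implementations that disagree on this input will both claim to be parsing JSON, which is the kind of divergence the excerpt describes.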

.. An entrenched flaw can become a de facto standard.

.. This is colloquially referred to as being “bug for bug compatible”.

4. A New Design Principle

   The following principle applies not just to the implementation of a
   protocol, but to the design and specification of the protocol.

      Protocol designs and implementations should fail noisily in
      response to bad or undefined inputs.

In contrast, generating warnings provides no incentive to fix a problem as the system remains operational. Users can become inured to frequent use of warnings and thus systematically ignore them, whereas a fatal error can only happen once and will demand attention.
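As an illustration of failing noisily, the sketch below wraps Python's `json.loads` so that a repeated object member is a fatal error rather than something silently tolerated. The names `reject_duplicates` and `strict_loads` are hypothetical helpers invented for this example; treating duplicates as the "bad input" is likewise an assumption chosen for illustration, not something the draft prescribes.

```python
# Sketch: reject ambiguous input outright instead of guessing or warning.
import json

def reject_duplicates(pairs):
    """Build a dict from (name, value) pairs, raising on any repeated name."""
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ValueError(f"duplicate object member: {key!r}")
        seen[key] = value
    return seen

def strict_loads(text):
    """Parse JSON, failing on duplicate object members."""
    return json.loads(text, object_pairs_hook=reject_duplicates)

ok = strict_loads('{"a": 1, "b": 2}')
print(ok)

try:
    strict_loads('{"a": 1, "a": 2}')
    failed_noisily = False
except ValueError as exc:
    failed_noisily = True
    print("rejected:", exc)
```

Because the exception stops processing, the sender of the malformed document finds out immediately, instead of the flaw being absorbed and entrenched.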

The Benedict Option

Maybe if I shared Rod’s views on L.G.B.T. issues, I would see the level of threat and darkness he does. But I don’t see it. Over the course of history, American culture has tolerated slavery, sexual brutalism and the genocide of the Native Americans, and now we’re supposed to see 2017 as the year the Dark Ages descended?

.. It should be possible to find a workable accommodation between L.G.B.T. rights and religious liberty, especially since Orthodox Jews and Christians aren’t trying to impose their views on others

.. My big problem with Rod is that he answers secular purism with religious purism.

.. The right response to the moment is not the Benedict Option, it is Orthodox Pluralism.

.. To me that means the real enemy is not the sexual revolution. It is a form of purism that can’t tolerate difference because it can’t humbly accept the mystery of truth.