
Monday, March 2, 2015

Unicode: Universal or Whimsical?

Unicode Classification

In my last post, I wrote about two sides to unicode — a universal side and a babel side. Some readers, while agreeing with this classification, were jarred by a passing reference to ‘gibberish’ in unicode⁵.

Since I learnt some things from those comments, this post expands that classification into these six¹:
  1. Babel
  2. Universal
  3. Legacy
  4. Unavoidable mess
  5. Political mess
  6. Whimsical
Babel
Arabic, Cyrillic, Devanagari, Ethiopic, Hangul, Tamil etc. Babel is why Unicode exists in the first place.
Universal
I wrote about this here and I believe it represents the biggest hope for unicode. Terry Reedy’s term – planetary – in its own way captures that perhaps better than ‘universal’.
Unfortunately, Unicode also has a not-so-savory side:
Legacy
ASCII-compatibility increases unicode's acceptance, dissemination and proliferation. See Gall's law below. It is also somewhat politically suspect.
Unavoidable mess
There are certain requirements that the consortium has set itself, e.g. round-tripping, that perpetuate some unfortunate messes. [I don't know much about this; see Steven d'Aprano's explanation of round-tripping here]
Political mess
UTF-8 with BOM?? Instead of simply saying this is nonsense, the consortium hums and haws around it. Why? Evidently Microsoft is an important corporation.
Whimsical
I personally regard having blocks for Egyptian hieroglyphs, Cuneiform, Shavian, Deseret and Mahjong tiles as the extremity of this whimsicality.
Babel is why unicode exists — fine. Others may like to put the 'whimsical' tag on other blocks/characters — that's not central to my point. Some things in unicode are whimsical, some are messy — take your pick! The point is to ignore these and focus on the universal side.

On Laws of standards

Our field is subject to two opposite vectors — complexifiers and simplifiers.

Moore’s law

is the most obvious example of a complexifier – a typical machine today, with 8GB RAM and a 2TB disk, is bigger than the machines on which I grew up — 640KB RAM (1MB God's limit), no disk, 360KB floppies.²

Moore’s law has some counterparts such as Gall's law and Sowa's Law of Standards. [Also Rock's law³]

Gall's law

A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system.

We may say that unicode, by being at least somewhat compatible with (7-bit) ASCII, is in accordance with Gall's law. [Yeah, all those control-chars wasting space in prime position… That's called compatibility. Just like Intel continuing to ship the archaic instructions like DAA/AAA that it shipped 40 years ago, which no one uses today]
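A one-line check makes that compatibility concrete (a minimal Python illustration of my own): ASCII text produces identical bytes whether treated as ASCII or as UTF-8, so legacy ASCII data is already valid UTF-8.

    # ASCII text encodes to the same bytes under ASCII and UTF-8
    s = "A plain ASCII string"
    assert s.encode("ascii") == s.encode("utf-8")

    # ...and bytes that are pure ASCII decode unchanged as UTF-8
    print(b"hello".decode("utf-8"))  # -> hello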

And then there is John Sowa's

Law of Standards

Whenever a major organization develops a new system as an official standard for X, the primary result is the widespread adoption of some simpler system as a de facto standard for X.
John Sowa then goes on to give these examples:
  1. The PL/I project by IBM and SHARE resulted in Fortran and COBOL becoming the de facto standards for scientific and business computing.
  2. The Algol 68 project by IFIPS resulted in Pascal becoming the de facto standard for academic computing.
  3. The Ada project by the US DoD resulted in C becoming the de facto standard for system programming.
  4. The OS/2 project by IBM and Microsoft resulted in Windows becoming the de facto standard for desktop computing.
Some more recent examples:
  • OSI networking leads to TCP/IP
  • XHTML-strict leads to widescale adoption of HTML5
  • Not 'taken over' yet, but Microsoft spends billions crafting MS-Office and many people instead use a copy — the not-quite-as-good but free LibreOffice
A university professor who keeps adding to his pet language/OS/DBMS is one thing. An international standards organization doing the same is quite another.
Let us suppose that every member of the ISO committee for C agrees that garbage collection is a good idea: should/can they just add it to C?

It is at least conceivable that the unicode consortium, in going overboard with newer versions, is making the same error that the W3C did in trying to strictify HTML.

As opposed to these two, Sowa's and Gall's laws, we also have

Dave Angel’s list

On the python list, Dave Angel gave the following historical list. Most of the entries show the folly of disregarding Moore's law, or, if you prefer, cost gravity.
His examples, which I have summarized as a 'law' below:
CDC display code is enough. Who needs lowercase?

Baudot code is enough.

EBCDIC is good enough. Who cares about other companies.

The “golf-ball” only holds this many characters. If we need more, we can just get the operator to switch balls in the middle of printing.

2 digit years is enough. This world won’t last till the millennium anyway.

2k is all the EPROM you can have. Your code HAS to fit in it, and only 1.5k RAM.

640k is more than anyone could need.

You cannot use a punch card made on a model 26 keypunch in the same deck as one made on a model 29. Too bad, many of the codes are different. (This one cost me travel back and forth between two different locations with different model keypunches)

8 bits is as much as we could ever use for characters. Who could possibly need names or locations outside of this region? Or from multiple places within it?
[The most famous that Dave missed: The world only needs 5 computers]

all of which can be summarized as

The KISS-Moore Law

Between the KISS principle and Moore’s law,
Moore usually wins.
Clearly Gall/Sowa pull one way, Dave/Moore the other.

Who will win?

Not being much of an astrologer, I don't really know the future. However, the present at least has hints that Gall/Sowa may win.

Wide is too narrow

When Unicode first emerged it was a 16-bit code, and C came up with the corresponding wchar_t – wide char – to support unicode. However, with the coming of characters beyond FFFF, wchar_t was not wide enough. This is usually noted under the rubric of “UTF-16 is screwed up” and can lead to vulnerabilities.
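To see the problem concretely, here is a minimal Python sketch (my own illustration, assuming CPython 3.3+) showing that one SMP character occupies two UTF-16 code units, a surrogate pair:

    import struct

    s = "\U0001F600"  # GRINNING FACE, an SMP character beyond U+FFFF
    print(len(s))  # 1: one codepoint, as Python 3.3+ counts it

    utf16 = s.encode("utf-16-le")
    print(len(utf16) // 2)  # 2: two 16-bit code units, a surrogate pair
    print([hex(u) for u in struct.unpack("<2H", utf16)])  # ['0xd83d', '0xde00']

Any code that assumes one 16-bit unit per character (Java's char, Windows' WCHAR, JavaScript string indexing) sees two 'characters' here where the user typed one.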
And yet, unfortunately, UTF-16 is widely used; see for example the following notable instances of

½-assed unicode support

Below I've listed some notable software that breaks when going from BMP-only unicode to the SMP.
  1. Java
  2. Javascript
  3. Mysql
  4. Blogger – see my unicoded python post, section 12. I've tried to put some SMP chars there
  5. Windows – New Windows applications should use UTF-16!!
  6. Emacs – Try entering some SMP chars using C-x 8 RET
    [Well ok this is probably a font problem more than an emacs one… Anyway something-SMP is borked somewhere]
  7. Python’s Idle
And then when the above combine… more fun: Perl-MySql-MovableType

The usual response to all these is that they are teething troubles and that in due course all the issues will be settled.

Maybe it is good to consider a counter-view…

When everybody is below average

Twenty-five years ago I wrote about how teaching C to beginning students can be quite a travail. This came about as follows.

I was teaching C. For some time I took it simply: if a student forgot to match a malloc with a free, the student was wrong. But then, after a year or so of having to beat up almost every student, I started thinking: Hell! If everyone is getting something wrong, maybe the error is elsewhere? If one plane model has 10 times more crashes than another, do we blame the pilots or look at whether something is mis-designed in the cockpit?

Note: This conclusion may seem natural and obvious in 2015. It was not so obvious in 1990 when only 'research' languages like Lisp sported garbage-collection.  At that time, the juxtaposition mainstream language and garbage-collection seemed like an oxymoron to many people.

We need to at least consider that if so many notable systems are getting unicode wrong, it may be that all of them are wrong. It may also be that something is wrong with unicode itself.

Formalizing a de facto standard

So there is this ‘standard’:
  • It is like UCS-2, except that instead of being silent about what happens for codepoints beyond FFFF, it explicitly errors out
  • It is like UTF-16, except that instead of breaking the fixed-width invariant beyond FFFF, it breaks the user's code in the most graceful way available, e.g. by raising an exception: Can't decode SMP character
  • It is already a de facto standard, being implemented by the examples listed earlier (and many more)
It only needs to be formalized.
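What might such a formalization look like? Here is a minimal Python sketch; the name bmp_encode and its exact behaviour are my own hypothetical illustration, not an existing codec:

    def bmp_encode(text, encoding="utf-16-le"):
        """Encode text as fixed-width 16-bit units, erroring out on SMP.

        Hypothetical sketch: behaves like UCS-2/UTF-16 on the BMP but
        fails early and loudly instead of emitting surrogate pairs.
        """
        for i, ch in enumerate(text):
            if ord(ch) > 0xFFFF:
                raise UnicodeEncodeError(
                    "bmp", text, i, i + 1,
                    "Can't encode SMP character U+%06X" % ord(ch))
        return text.encode(encoding)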

The Chinese Question

I hesitate on this one since I like to stay away from political(izable) questions. Suffice it to say that if Chinese matters to you, you may come to different conclusions than this post. But then you should also consider whether you should be looking at unicode or at BIG5/GB.

Feature or Bug?

½-assed unicode support – BMP-only – is better than 1/100-assed⁴ support – ASCII. BMP-only unicode is universal enough within practical limits, whereas full (7.0) unicode is 'really' universal at a cost in performance and whimsicality.

If, like python's FSR, you choose to provide full support for Unicode 7 in addition to doing due diligence on optimizing storage — all power to you.
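You can watch the FSR (PEP 393) at work in CPython 3.3+ with a couple of lines; exact byte counts vary by version and platform, but the per-character width grows from 1 to 2 to 4 bytes:

    import sys

    # The FSR picks 1, 2 or 4 bytes per character depending on the
    # widest character present, so equal-length strings differ in size.
    for s in ["spam", "sp\u0905m", "sp\U0001F600m"]:  # ASCII, BMP, SMP
        print(repr(s), sys.getsizeof(s))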

Just remember:
  • Most languages (think C) don't have the advantages that have made the FSR possible in python, viz. dynamism and immutable strings. This means that programmers need to make early decisions on data structure representations
  • Even in languages like python3 with good unicode 7.0 support, there can be interaction with BMP-only components, e.g. Idle can't handle SMP characters not because python is broken but because Tk is
  • This kind of interaction, leading to strength-of-the-weakest-link, is the norm not the exception
If, however, you don't think unicode has any serious bearing on your project:
  • There are more points on the spectrum than just ASCII and Unicode 7
  • Consider BMP as a midpoint before falling back to ASCII
  • If – more likely – you are basing your work on software like the examples mentioned earlier which only support the BMP, do try to error out early, gracefully and vigorously on encountering an SMP character (see the sketch below)
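Concretely, the hypothetical bmp_encode sketched earlier gives exactly this kind of early, vigorous failure:

    # Fail fast on SMP input instead of letting it mangle things downstream
    try:
        bmp_encode("x\U0001F600")  # hypothetical helper sketched above
    except UnicodeEncodeError as e:
        print("rejected early:", e.reason)  # Can't encode SMP character U+01F600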

Conclusion?

Does it matter to me whether a language/system uses UTF-16/FSR/UTF-8 etc? Not too much. Yeah, I am a programmer so I can't say no; it's just that these representation fights are not my baby.

Would I like unicode to prosper and spread? Yes! The world will be a better place for the spread of the universal (or planetary) side.

I just want to suggest that the Unicode consortium, by going overboard in adding zillions of codepoints of nearly zero usefulness, is in fact undermining unicode's popularity and spread.

And so I would urge programmers to bite the Gall/Sowa bullet and support the BMP in preference to ASCII.

And what about the 1,000,000+ codepoints of unicode? The next generation of programmers can have some work!

Java, Javascript, MySql, Blogger, Windows, Emacs, Idle (and probably two dozen other notable names) are ok… no need to grumble


¹ And I've removed the references to 'gibberish'
² OK, 'more complex' and 'larger' are not quite the same. Not completely different either. Consider 'obese'.
³ Rock's law — the cost of a semiconductor fab doubles every 4 years — is an almost diametrical antipode to Moore's law: the decreasing cost of electronics for the user cannot be separated from the increasing cost for the manufacturer.
⁴ My own whimsical wish: if ½ is a unicode character, why not 1/100?
  Later, thanks to Chris Angelico (and some hacking), I've figured out how to write ⅟₁₀₀. See the thread below.
⁵ Thread here
