
Tuesday, January 6, 2015

Unicode and the Universe

If you're trilingual you speak three languages, if you're bilingual you speak two languages, if you're monolingual you're American.

Mark Harris on the python list
Well if one reads that thread above, one would find that people were rather uptight with Mark Harris for that statement. And yet they have the same insular attitude towards ASCII-in-programming that Mark describes in Americans towards English (or more correctly Americanese); to wit they consider that programming with ASCII (alone) is natural, easy, convenient, obvious, universal, inevitable etc.

Is it mere coincidence that the 'A' of ASCII is short for American?

Not so long ago the world lay from a few kilometers east of The Garden of Eden to a few hundred kilometers west.  And then it stretched to a spherical globe of 40,000 km circumference.  At that time the gods used to light lamps at night called 'stars'.

And then things changed a wee little bit, the stars and our world – suddenly grown quite small – became more 'similar' and the wider world stretches now to a few billion light-years across.

In many respects the story of ASCII to Unicode is similar. Pragmatically both represent a 0 → ∞ jump, in the sense that it was natural to use the whole of the (printable) part of ASCII.  [Many of us even used to know the code-points of ASCII quite well!] With Unicode, not only is it unrealistic for any one person to know all the characters in a space of 1,114,112 code points, even knowing which blocks exist is infeasible.
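To put rough numbers on that jump, here is a minimal Python sketch. It only counts the printable ASCII range and the size of the Unicode code space; it says nothing about how many code points are actually assigned, which depends on the Unicode version your Python ships with.

```python
import sys

# Printable ASCII runs from U+0020 (space) to U+007E (tilde).
printable_ascii = [chr(cp) for cp in range(0x20, 0x7F)]

# Unicode code points run from U+0000 to U+10FFFF (sys.maxunicode).
code_space = sys.maxunicode + 1

print(len(printable_ascii))                # 95
print(code_space)                          # 1114112
print(code_space // len(printable_ascii))  # how many "ASCIIs" fit in the code space
```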

At base this is

The problem of meaning

The smaller world is naturally more meaningful than the larger one.  Just as one can have a more warm fuzzy feeling about Momma than woman-kind, one can at least imagine a God who selects a chosen people and is solicitous and possessive about them as long as the world is comprehensible on my scale. When it becomes too large that life itself looks like a freak-accident, such beliefs are harder to maintain.

As an example, consider Amerigo Vespucci:
We saw more wild animals—such as wild hogs, kids, deer, hares, and rabbits—than could ever have entered the ark of Noah; but we saw no domestic animals whatever… I fancied myself near the terrestrial paradise…
Vespucci was an adventurer, not a religious man.  By contrast today even a committed religious person would not ask whether a specific animal of the mundane world is found in the scripture of his choice. And I dare say Vespucci talks of paradise with a literalness that is not possible for a modern.

In effect our world has become so large it is difficult to give it meaning.

Likewise, even considering only extant languages…

Unicode is too large

People want to stick to ASCII because of the unending, terrifying swathes of undecipherable characters.  An argument I often hear is
Given that I have only ten fingers and a hundred or so keys in front of me, how am I to invoke a specific symbol from the hundred thousand or so that are available in Unicode?
Well… Dunno what to say… If I can go from 100 characters to 200 I am twice as rich. Why worry about the million I have no use for?

But it is really much worse

Unicode has plain gibberish

You don't play with Mahjong characters? How crude!
You don't know about cuneiform? How illiterate!
You don't compose poetry with Egyptian hieroglyphs? How rude!
Shavian has not reformed you? How backward!
In short, to make effective use of unicode, it may be worthwhile to distinguish the international blocks (also called the tower of babel) from the universal parts of unicode, viz. math.

That is,

Unicode is like the universe

in the sense that in the pre-unicode era, the universe was so small that parochialism was unavoidable. Today it is so big, meaninglessness is inevitable.

In the medieval ASCII world one could choose between being one of:
  1. Dummy: To sell one's computer and work (and soul?) to a proprietary format and word-processing software
  2. Wizard: To master something intricate and complicated such as latex (or mathml, lilypond, troff…)
  3. Programmer: Everything that is worth expressing can be expressed in ASCII.


God made ASCII. All the rest is the work of man.
And so we had before us a delicious à la carte offering:
  1. idiocy of ignorance
  2. slavery to savantery
  3. prison of penury
Now while we are not completely free from these 'blessings' yet, we are better off than before, thanks to Unicode.

To see why 1 and 2 need not be the case any more, see some suggestions made in the context of python.  Now while the suggestions are not quite serious and are unlikely to be taken seriously, as we go from established/old languages towards the bleeding edge they become more realistic.  Here's Julia and Agda.

As for not having to choose between 2 and 3, here's something I recently asked on the (la)tex list:

Here is the wikipedia page on ε-δ definition of limit where we see the well-known definition

Editing it produces this excerpt [note this is input text]
(\forall \varepsilon > 0)(\exists \ \delta > 0) (\forall x \in D)
(0 < |x − c | < \delta \ \Rightarrow \ |f(x) - L| < \varepsilon)

Now compare it with the following – also input text:

(∀ ε > 0) (∃ δ > 0) (∀ x ∈ D) (0 < |x − c| < δ  ⇒  |f(x) - L| < ε)

[Note particularly the real minus between x and c and the ASCII hyphen minus between f(x) and L]
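If the two dashes look identical in your font, a small Python check makes the difference visible. This is only a sketch using the standard `unicodedata` module; the characters are written as escapes so that no editor can silently swap them.

```python
import unicodedata

hyphen_minus = "\u002d"  # the ASCII character on every keyboard
minus_sign = "\u2212"    # the "real" mathematical minus

for ch in (hyphen_minus, minus_sign):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+002D HYPHEN-MINUS
# U+2212 MINUS SIGN
```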

In this age of unicode when we have xetex/luatex why do we use the first when the second is so much closer to the desired result?
Hopefully most people would agree the latter is more readable than the former.
The questions that remain are
  1. Typing it in.
  2. Is it close to what luatex/xetex accept?
For 2. I'd welcome help/suggestions ;-)

For 1., I've just recently discovered pointless-xcompose which goes a good way towards solving this at least on linux¹

And I suggest we distinguish these

Levels of Input Methods

  1. Cut and paste a character after searching with google
  2. Select a character from a local app like gucharmap (emacs: C-x 8 Ret)
  3. Use an editor abbrev(iation)
  4. Use an editor input method, e.g. emacs' tex input-method will convert \forall into ∀ etc
  5. Use the compose-key (Windows users may try this – dunno…)
  6. Switch keyboard layouts in software with something like ibus
  7. Use a special purpose hardware keyboard
As we go from 1 to 7 the expertise and efficiency increase, but so does the expense of setup, hardware etc. and, most importantly, learning. The cost of assuming that only the extreme choices – 1 and 7 – are available and not all the other interim possibilities, is the binary choice between meaninglessness and parochialism.
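To give a feel for what level 4 does, here is a toy Python sketch of a tex-style input method. It is not how emacs implements it; the table `TEX_TO_UNICODE` and the function `tex_to_unicode` are hypothetical names made up for illustration, and the table covers only the macros from the ε-δ example above.

```python
import re

# Hypothetical, deliberately tiny table; a real input method carries far more entries.
TEX_TO_UNICODE = {
    "forall": "∀",
    "exists": "∃",
    "in": "∈",
    "varepsilon": "ε",
    "delta": "δ",
    "Rightarrow": "⇒",
}

def tex_to_unicode(text):
    """Replace known backslash macros with a Unicode character; leave unknown ones alone."""
    def swap(match):
        return TEX_TO_UNICODE.get(match.group(1), match.group(0))
    return re.sub(r"\\([A-Za-z]+)", swap, text)

source = r"(\forall \varepsilon > 0)(\exists \delta > 0)(\forall x \in D)"
print(tex_to_unicode(source))
```

A real input method does this as you type rather than over a finished string, but the lookup idea is the same.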

IOW placing the slider effectively along this spectrum represents an efficient…

Huffman coding

applied to keystrokes and mouse gestures (in analogy to bits)
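For anyone who has not met Huffman coding: symbols that occur often get short codes and rare ones get long codes. The Python sketch below computes only the code lengths. The frequencies are invented for illustration, and reading "bits" as "keystrokes" is the analogy made here, not a measurement of any real input method.

```python
import heapq

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code over freqs."""
    # Heap entries: (total weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A lone symbol still needs one bit.
        return {sym: 1 for sym in heap[0][2]}
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them sinks one level.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

# Invented frequencies: how often one might type each character.
usage = {"e": 50, "a": 30, "∀": 15, "∮": 5}
lengths = huffman_code_lengths(usage)
for sym in usage:
    print(sym, lengths[sym])
# e 1
# a 2
# ∀ 3
# ∮ 3
```

In input-method terms: the characters you type all day deserve a key of their own, the ones you need weekly deserve a short compose sequence, and the once-a-year ones can stay at level 1.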

For a while now I've used 1 and 3.

Combining 3 and 4 thanks to pointless-xcompose is, I expect, going to be more convenient and effective, especially when it is tailored to the subset of characters one needs frequently.

The one thing not clear is how to set up the compose key. Complete noob myself but on a recent linux¹ this may work:

$ setxkbmap -option compose:menu

to make the menu key behave like compose.  Replace the 'menu' by 'rwin' or 'ralt' to get the same behavior out of the right-windows or right-alt keys.
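Once a compose key exists, the sequences themselves live in a file such as ~/.XCompose, one rule per line. The Python sketch below only formats such rules. The key sequences in `my_rules` are my own hypothetical choices, not the ones pointless-xcompose ships, and `xcompose_line` is a made-up helper name. The `include "%L"` line asks X to keep the locale's default sequences as well.

```python
def xcompose_line(keys, char):
    """Format one compose rule: the compose key, then `keys`, produces `char`."""
    sequence = " ".join(f"<{k}>" for k in ["Multi_key", *keys])
    return f'{sequence} : "{char}" U{ord(char):04X}'

# Hypothetical sequences, chosen as mnemonics; keys are X keysym names.
my_rules = {
    ("f", "a"): "∀",  # "for all"
    ("e", "x"): "∃",  # "exists"
    ("e", "l"): "∈",  # "element of"
}

print('include "%L"')
for keys, char in my_rules.items():
    print(xcompose_line(keys, char))
# include "%L"
# <Multi_key> <f> <a> : "∀" U2200
# <Multi_key> <e> <x> : "∃" U2203
# <Multi_key> <e> <l> : "∈" U2208
```

Applications usually read the file when they start, so a restart of the program (or the session) may be needed before new sequences work.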


Thanks to

  1. Thomas Reuben for writing pointless-xcompose
  2. David de la Harpe Golden for introducing me to xkb (setxkbmap)

¹ Thomas Reuben, author of pointless-xcompose, points out to me that saying linux is inappropriate where X-windows would be more correct. He is right.
Left the linux there as more people are likely to know they are using linux than that they are using X-windows  ☺
