
Tuesday, April 29, 2014

Unicode and the Unix Assumption

Once upon a time, file was a rich, profound, daunting and wondrously messy concept. It involved ideas like
  • record orientation
  • blocking factor
  • partitioned data sets
and other wonders of computer (rocket) science.

Then along came two upstarts, playing around in their spare time with a machine that their Lab had junked. They were having a lot of fun…

They decided that for them File was just List of Bytes.
type File = [Byte]
Oh the fun of it!

Further they mused: Isn't the type [Byte] too cute to just store away into files?
In particular why not also
  • pass it in/out from terminals
  • pass it between processes
  • send it on networks
  • und so weiter
??

Like children playing 'Doctor-Doctor' or 'Teacher' they decided to call their toy an 'OS'. And when other OSes were called by proper and respectable names like Tops-10, MVS, DOS/VSE, VMS, VME they – What a riot! – named their joke Unix and even pretended it was a real OS!!

To make their catchy humour stick as well, they also invented something – first called Cute and later shortened to C – which delighted in a variety of puns. Many of these are remembered here.

The one which is most central to this story is
type byte = char
combined with
type byte = (tiny) int

a piece of superb humour that lasted 30 years before beginning to wear thin.

O such fun: if one squinted with the left eye, char was unsigned; if with the right, it was signed. The advantage of the first was that one could store 256 distinct (positive) characters in a char, unlike 128 in the other case. The disadvantage was that one couldn't store a -1. Now why on earth would one want to store an int in a char??
Anyways… isn't it quite obvious that int is the most natural type to store a char?
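(The straight answer, for the record: C's getchar() returns an int exactly so that it can hand back EOF, typically -1, alongside all 256 byte values.) Here is a minimal Python sketch of the two squints, reading the very same byte once as unsigned and once as signed:

  # One byte, 0xFF, squinted at with each eye in turn.
  b = bytes([0xFF])

  print(int.from_bytes(b, "big", signed=False))  # 255 -- the "unsigned char" squint
  print(int.from_bytes(b, "big", signed=True))   # -1  -- the "signed char" squint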

Here is what the Oracle docs say for the C locale:
The C locale, also known as the POSIX locale, is the POSIX system default locale for all POSIX-compliant systems.

Or in other words:

ASCII, also known as the Unix locale, is the default for all *nix-compliant systems
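A rough way to see this bias from Python (a sketch, not gospel: the exact string reported varies by libc and Python version, and recent Pythons may coerce the C locale to UTF-8):

  # Run under the POSIX default locale, eg:  LC_ALL=C python3 locale_peek.py
  import locale

  locale.setlocale(locale.LC_ALL, "")   # adopt whatever the environment says
  print(locale.getpreferredencoding())  # typically 'ANSI_X3.4-1968', ie plain ASCII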

50 years on…

The rioters are… now old, some dead (RIP)…
Our machines are billion-fold bigger…
Connected together in a(nother) billion-fold network

But the joke continues…
And it's called…

The Unix Assumption

It runs like this:
  • human communication…
  • (is not very different from)
  • machine communication…
  • (can be done by)
  • text…
  • (for which)
  • ASCII is fine…
  • (which is just)
  • bytes…
  • (inside/between byte-memory-organized)
  • von Neumann computers
Now a wise support for the above profoundness is the

Lemma: 7 = 8

This follows from some powerful axioms:

Axiom 1: Clean = Dirty

In the distant past when ASCII was 7 bits, one could transmit 7 or 8 bits.
[O! For those magical days: One could even transmit ½ a bit]

Now when bytes were 8 bits and chars (plain ASCII) were 7, arbitrary binary could not be transmitted. So an important concept was invented, called 8-bit cleanness, which basically guaranteed that all 8 bits of every byte got through.

However this led people to (mis)use the extra 8th bit.
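A small Python sketch of what an un-clean channel does (the 7-bit channel here is simulated by hand, by masking off the top bit of every byte):

  # "café" in an 8-bit charset (Latin-1): the é lives in the top half, at byte 0xE9.
  data = "café".encode("latin-1")             # b'caf\xe9'

  # A channel that is not 8-bit clean keeps only the low 7 bits of each byte.
  seven_bit = bytes(b & 0x7F for b in data)   # 0xE9 & 0x7F == 0x69 == 'i'

  print(seven_bit.decode("ascii"))            # cafi -- the accent silently becomes an 'i'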

And so we went from ASCII, a 7-bit code, to 'ASCII', its hydra-headed 8-bit extensions, of which there were hundreds. IOW the...

Axiom 2: ASCII = ASCII

It of course helped that the set of things called ASCII was rather larger than a singleton. And so people said ASCII and meant anything and everything:

This had the salutary effect that Pascal code that looked like this:

  {return the net }
  ret := gross[unit] * grossRate 

on one machine, looked like this on another

  ä return the net å
  ret := grossÄunitÅ * grossRate 
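The culprit was the national variants of ISO 646: the very code points ASCII spends on { } [ ] were reassigned to accented letters elsewhere, famously in the Swedish/Finnish variant. A small sketch (Python ships no ISO 646-SE codec, so the little mapping table below is written out by hand):

  # ISO 646-SE reassigns a handful of ASCII punctuation code points to letters.
  SWEDISH_646 = {0x5B: "Ä", 0x5C: "Ö", 0x5D: "Å", 0x7B: "ä", 0x7C: "ö", 0x7D: "å"}

  line = b"ret := gross[unit] * grossRate"

  print(line.decode("ascii"))                               # what Bell Labs saw
  print("".join(SWEDISH_646.get(b, chr(b)) for b in line))  # what a Swedish terminal showed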

And while a specially funny joke can be enduring, sometimes, when the funniness is long over, it's time to

Deconstruct the humour

All the world's alphabets

     fit into 128 chars.
Or
   World = Bell Labs

All the world's machines

     agree on byte-order
Or
    'Endianness' is a joke from Gulliver's Travels
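They do not, of course. A minimal sketch with Python's standard struct module, laying out the same 32-bit integer both ways:

  import struct

  print(struct.pack("<I", 1))   # b'\x01\x00\x00\x00' -- little-endian (x86 and friends)
  print(struct.pack(">I", 1))   # b'\x00\x00\x00\x01' -- big-endian ("network order")

  # Read little-endian bytes with big-endian eyes and 1 turns into 16777216.
  print(struct.unpack(">I", struct.pack("<I", 1))[0])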


Other implications of the Unix-assumption such as

Human communication

    is all about text


[Also called Being a Nerd]

And therefore GUIs are of, for and by weenies

will be left alone for now.

Yes

Unicode is a Headache

With ASCII, data is ASCII whether it's in a file, in core, on a terminal, or on the network; ie "ABC" is 65,66,67.

Ok, there are a few wrinkles to this, eg the null-terminator in C strings. These are exceptions to the rule that in classic Unix, ASCII is completely inter-operable and therefore a universal data-structure for inter-process or inter-machine communication.

It is this universal data structure that makes classic Unix pipes and filters possible and easy. IOW composability, so dear to programmers, is almost free in the Unix world. Eg the separation of mail presentation (clients) from transportation (servers like sendmail and postfix) is facilitated because the universal data structure – 'ASCII' text – can be passed around conveniently.

With unicode, that universal data structure is gone: there are multiple encodings of the same text (UTF-8, UTF-16, UTF-32…), and every producer and consumer in a pipeline has to agree on, or guess, which one is in use. Go up from the ASCII ↔ Unicode level to the plain-text ↔ hypertext (aka html) level and these composability problems hit with redoubled force.
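A minimal Python illustration of both halves of this: for pure ASCII the question of representation never arises, while one step outside ASCII the bytes depend entirely on which encoding both ends agreed on.

  print(list("ABC".encode("ascii")))    # [65, 66, 67] -- and the same under latin-1, utf-8, ...

  word = "café"
  print(word.encode("utf-8"))           # b'caf\xc3\xa9'
  print(word.encode("latin-1"))         # b'caf\xe9'
  print(word.encode("utf-16-le"))       # b'c\x00a\x00f\x00\xe9\x00'

  # Guess wrong and you get mojibake rather than an error:
  print(word.encode("utf-8").decode("latin-1"))   # cafÃ©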

So, yes if unicode were avoidable, our headaches would be significantly reduced.

But it's not! In 2014…

ASCII is not ONE charset but an amorphous bunch

If one is under the delusion that ASCII is an alternative to Unicode it means one is (one of):
  • living in 1974
  • dreaming
  • a child playing 'Doctor-doctor'
For the awake adults in 2014 ASCII is called:

CodePage Hell

  • If you are on a *nix and say you are using ASCII, you are probably using ISO8859-1, more commonly known as Latin-1
  • Of course then there's Europe, the land of more elaborate char-soup – also known as the Latins.
  • Wherein is found this piece of 'higher math': there are 15 parts of 8859, numbered 1 to 16 (part 12 never materialized), in which Latin-9 is 8859-15, while 8859-9 is Latin-5.
  • If you are on Windows and say you are using ASCII, you are probably using Windows-1252 (often mislabelled ANSI, to add to the confusion)
  • All these seemingly similar but actually different charsets {Windows, Unix} × Latin[1-15] constitute codepage hell (see the sketch after this list)
  • As it happens the world is a bit larger than Europe and the set of OSes is a superset of {Windows, Unix}. And so codepage-hell is more hellish than the above description. eg there's Russian, Vietnamese, Indian, Japanese 
  • Here is a random sample I saw yesterday of 8859 becoming wrong simply with the passage of time.
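A small Python demonstration of the bullets above (the codecs named here all ship with Python): the very same byte means different things depending on which 'ASCII' you were brought up on.

  # 0xA4: currency sign in Latin-1, but the euro sign in its successor 8859-15.
  print(b"\xa4".decode("latin-1"))        # ¤
  print(b"\xa4".decode("iso8859-15"))     # €

  # 0x80: the euro sign in Windows-1252, an invisible C1 control in Latin-1.
  print(b"\x80".decode("cp1252"))         # €
  print(repr(b"\x80".decode("latin-1")))  # '\x80'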
In short

It's 2014.

No one uses machines that are pure, ie 7-bit, ASCII.

And because ASCII only means codepage hell, ie it means all kinds of different things to all kinds of people, it really does not mean anything.

IOW in 2014 ASCII is one of the following (take your pick):

An archaeologically interesting artifact

Evidently used by some computers which were used by a certain Mr. Noah to design and build a large boat that carried our ancestors. Like most archaeological artifacts the details remain to be verified.


or

Another name for a certain large flightless Bird

In short sticking with ASCII is the analog of sticking with a keyboard having a broken 'k' and then moaning O why do people use strange characters like 'k'?

Whether it's k or «, α or ω, π or ≠, it is as Richard Bach said:

    «Argue for your limitations, and sure enough they're yours.»

Unicode Status

  • All modern OSes are Unicode compliant
  • Editors and IDEs are increasingly becoming Unicode compliant
  • Modern languages are increasingly becoming Unicode compliant; and at two levels
    1. They are growing and improving in their support for unicode
    2. They are widening the contents of their lexical elements to allow arbitrary unicode lexemes
  • It would be good if we distinguished these two levels of support. Eg. Python is good at 1 whereas Agda is making forays into 2 (a small illustration follows this list)
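A rough Python sketch of the two levels (level 2 shown only at its edge, since Python stops well short of Agda there):

  # Level 1: strings are sequences of code points; bytes are a separate, explicit type.
  s = "αβγ"
  print(len(s))                  # 3 characters...
  print(len(s.encode("utf-8")))  # ...6 bytes on the wire

  # Level 2, partially: identifiers may be unicode letters,
  π = 3.14159
  print(2 * π)

  # but arbitrary unicode lexemes are not on offer:
  #   x ≠ y   is a SyntaxError in Python; perfectly ordinary Agda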

Acknowledgements

  1. Joel Spolsky is required reading for people wanting to understand this
  2. From hanging about on the python list with fellows like Steven D'Aprano I've learnt much of this subject; also moved on from O-No! Unicode! to O-Yes! Unicode!  
  3. Unicode is a headache; the mistake is to assume it's an avoidable headache.
    Perhaps toothache would be a better analogy, in the sense that we are currently facing teething troubles after the falling off of the ASCII milk-teeth. I've been helped in understanding the resistance to acceptance by an insight from Roel van Dijk on the Haskell list:
    My anecdotal observation is that it seems to be used more by people who speak a native language that is already poorly served by ASCII. Perhaps because they are already used to not being able to simply type every character they need.
  4. One of the things that needs elaboration is the distinction that I briefly allude to between language supporting unicode and unicode lexemes. I'll write about that separately.
  5. Added later: Following this python-list thread, Lemma 7=8 was added thanks to Steven, Marko and others; Jussi reminded me of the mess regarding Latin numbering; and Matthew Barnett tells me that ostriches are not Australian birds -- How about that!
