
Thursday, February 26, 2015

Universal Unicode

What is the 'uni-' in unicode? According to the official records it comes from Unique, Uniform and Universal.

Unicode starts out with the realization that ASCII is ridiculously restrictive, or the world is larger than the two sides of the Atlantic¹. This gives rise to all the blocks from Arabic to Zhuang.

However, the greatest promise of unicode lies not in catering to this Tower of Babel but rather in those areas that are more universal. Yeah I know, technically this distinction between universal and international will not stand up to scrutiny.
Different people will classify any given character as babel or universal differently. Nevertheless the distinction is important, and the world can be a better place for the making of it.

Below is a first stab at my choice of the universal side of unicode.

It was prompted by Dave Angel's neat list of historical examples, which shows that accepting poverty (in this case, of 7 bits) as a natural way of life is not necessarily a positive attitude:
(Although) I'm a native English speaker, 7 bits is not nearly enough. Even if I didn't currently care, I have some history:

No.  CDC display code is enough. Who needs lowercase?

No.  Baudot code is enough.

No, EBCDIC is good enough.  Who cares about other companies.

No, the "golf-ball" only holds this many characters.  If we need more,
we can just get the operator to switch balls in the middle of printing.

No. 2 digit years is enough.  This world won't last till the millennium

No.  2k is all the EPROM you can have.  Your code HAS to fit in it, and
only 1.5k RAM.

No.  640k is more than anyone could need.

No, you cannot use a punch card made on a model 26 keypunch in the same
deck as one made on a model 29.  Too bad, many of the codes are
different.  (This one cost me travel back and forth between two
different locations with different model keypunches)

No. 8 bits is as much as we could ever use for characters.  Who could
possibly need names or locations outside of this region?  Or from
multiple places within it?

2 Math

2.1 Basic

2.1.1 set theory

x ∈ A, A ⊆ B, A ∩ B, A ∪ B, ∅
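For comparison, here is a rough sketch of how the same set-theory vocabulary has to be spelled in an ASCII-bound language, using Python's built-in sets. The names `A`, `B` and `x` are just illustrative values, not anything from the standard.

```python
# ASCII spellings Python uses for the set-theory symbols above.
A = {1, 2}
B = {1, 2, 3}
x = 1

membership = x in A      # x ∈ A
subset = A <= B          # A ⊆ B  (the comparison operator, reused)
intersection = A & B     # A ∩ B  (the bitwise-and operator, reused)
union = A | B            # A ∪ B  (the bitwise-or operator, reused)
empty = set()            # ∅      ({} is already taken by the empty dict)

print(membership, subset, intersection, union, empty)
```

Every symbol on the math side maps to an operator or keyword that already means something else, which is the kind of overloading a richer charset could avoid.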

2.1.2 Logic

∧ ∨ ¬ ∃ ∀

2.1.3 Standard Sets

ℕ ℤ ℚ ℝ ℂ

2.1.4 n-arys

∑ ⋂ ⋃

2.1.5 Various

∞ ±

2.1.6 APL, Z

APL and Z Notation are two notable languages (APL a programming language, Z a specification language) that did not tie themselves down to a restricted charset even in the days when ASCII ruled.

Yeah I know many people think that APL 'failed' because it was 'too mathematical.' Maybe they should reflect on which of

      23 + 45 = 68

and

     twenty-three plus forty-five is sixty-eight

is more perspicuous.

Or, more simply, on why Cobol is not more popular.

2.2 Arrows

← → ↑ ↓ ⇒ ⇄ and zillions more

2.3 Brackets

Those who started with Fortran may remember how much trouble was caused both to programmers and language implementers because arrays and functions were indistinguishable – all because at that time there was only '()', no [] or {}.

However the acute worldwide shortage of brackets has not ended with Fortran. ASCII provides nothing more than '[{(' and their right-hand counterparts, which means that every language has to invent its own ad hoc notation for collection data structures.

Now we have ⟦ ⟧ ⟨ ⟩ ⟪ ⟫ ⟮ ⟯ ⟬ ⟭ ⌈ ⌉ ⌊ ⌋ ⦇ ⦈ ⦉ ⦊ ⟅ ⟆ and many more

2.4 Sub/superscripts

x¹ y² z³ a₁ b₂ c₃
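These are separate code points in their own right, not plain digits with formatting applied. A small sketch using Python's standard `unicodedata` module shows that compatibility normalization (NFKC) folds them back to ordinary digits; the variable names are only illustrative.

```python
import unicodedata

# The sample above, written with escapes so the code points are unambiguous.
text = "x\u00b9 y\u00b2 z\u00b3 a\u2081 b\u2082 c\u2083"

# NFKC replaces each super/subscript digit with its plain ASCII digit.
flat = unicodedata.normalize("NFKC", text)

print(text)
print(flat)
```

So the raised or lowered position is carried by the character itself and disappears under NFKC, which matters if text is normalized before it is stored or compared.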

2.5 Greek and Math-Greek

It's hard to imagine doing math without the familiar
α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ σ τ υ φ χ ψ ω
See also math-greek blocks

3 Typography

3.1 IPA phonetics

such as ɐ ə ɘ, see


3.2 Quotes

Some European languages use « » ‹ › for quote marks, and they are not consistent about it: e.g. ‹ may sometimes be an opening quote and sometimes a closing one. In other words, technically quotes should be in the babel part of unicode and not the universal part.

However, experience with programming language design shows that two quote marks are way too few: e.g. Python needs single, double, triple-single, triple-double, raw, unicode and all sorts of combinations of these.
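To make the Python case concrete, here is a short sketch of the literal forms named above, all squeezed out of just ' and ". The variable names and sample strings are made up for illustration.

```python
# Several spellings of string literals, all built from just ' and ".
single = 'it said "hi"'
double = "it's here"
triple_single = '''spans
lines'''
triple_double = """also spans
lines"""
raw = r"C:\new\table"            # backslashes are kept literally
unicode_prefixed = u"caf\u00e9"  # the u prefix is legal but redundant in Python 3
raw_triple = r"""a\nb"""         # prefixes and quote styles combine

print(single, double, triple_single, triple_double, sep=" | ")
print(raw, unicode_prefixed, raw_triple, sep=" | ")
```

With a few more bracket-like quote pairs available, some of these prefixes might have been distinct delimiters instead.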

More examples here

3.3 Typography

  • Space ␢
  • Para ¶
  • Section §
  • Return ⏎
  • Distinguish hyphen ‐, dashes (‒, –, —, ―) and minus −
    Rather than overloading all onto the one ASCII -
  • And finally the 'replacement-char' (unicode-goofup) �
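The hyphen, dash and minus characters in the list above look nearly identical on screen, so here is a small sketch that asks Python's standard `unicodedata` module what each one is called. The code points are written as escapes so they cannot be confused; `lookalikes` is just an illustrative name.

```python
import unicodedata

# ASCII hyphen-minus, then the hyphen, the four dashes, and the minus sign.
lookalikes = "\u002d\u2010\u2012\u2013\u2014\u2015\u2212"

for ch in lookalikes:
    # Print the code point, the glyph, and its official Unicode name.
    print(f"U+{ord(ch):04X}  {ch}  {unicodedata.name(ch)}")
```

Seven different characters, each with its own name, where ASCII offers exactly one.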

5 Iconic

The below is much more iffy. Some will think them nonsense. Some will base their life on them! Personally, my feeling is that things that are random icons but are not language-ish in their own right should not be in a standard like unicode.
However they are! So let's use them!

5.1 Whimsical

When going from the original 2-byte unicode (around version 3?) to the one having supplemental planes, the unicode consortium added blocks such as
To me, a unicode layman, it looks unprofessional… Billions of computing devices the world over, each with billions of storage words, and that storage gets wasted on blocks such as these?? Seems whimsical (if you ask me).

5.2 Astrology

Planets ☿ ♀ ♁ ♂ ♃ ♄ ♅ ♆ ♇
Zodiac ♈ ♉ ♊ ♋ ♌ ♍ ♎ ♏ ♐ ♑ ♒ ♓

5.3 Cards and Chess

eg ♠ ♥ ♦ ♣ ♔ ♕

5.4 Traffic and maps

eg ⚠ ☡ ✈ ✆

5.5 Emoji

Blogger is currently barfing on these 😁 😞 😠 – as with all things SMP. But we know what these are: Emoji
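A likely reason software trips over these: characters in the SMP (Supplementary Multilingual Plane, U+10000 to U+1FFFF) do not fit in a single 16-bit unit. A minimal sketch in Python, taking the first emoji above as the example; the variable names are illustrative only, and this is a guess at the kind of problem, not a diagnosis of Blogger.

```python
emoji = "\U0001F601"   # the first emoji above, GRINNING FACE WITH SMILING EYES

code_point = ord(emoji)
in_smp = 0x10000 <= code_point <= 0x1FFFF

# UTF-16 needs a surrogate pair (two 16-bit units); UTF-8 needs four bytes.
utf16_units = len(emoji.encode("utf-16-be")) // 2
utf8_bytes = len(emoji.encode("utf-8"))

print(hex(code_point), in_smp, utf16_units, utf8_bytes)
```

Code that assumes one character is one 16-bit unit will cut such a character in half.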

5.6 Music

As far as I can see, putting music ♩ ♪ ♫ ♬ into unicode is one of
  • iconic
  • crazy
  • I don't get it
That is, I don't see (hear?) how one can write, still less perform, music using these 'unicode-music' signs.  Anyway they are there, with many more here.

5.7 Cultural, Religious, Ecological

✡ ☪

✟ ॐ
☭ 卐
♲ ☢


Drawn largely from Xah Lee's excellent Unicode pages.

¹ Neglecting the years between the creation of ASCII and Unicode; also called CodePage hell
