
Showing posts with label Unicode. Show all posts

Monday, March 2, 2015

Unicode: Universal or Whimsical?

Unicode Classification

In my last post, I wrote about two sides to unicode: a universal side and a babel side. Some readers, while agreeing with this classification, were jarred by a passing reference to 'gibberish' in unicode⁵.
Since I learnt some things from those comments, this post expands that classification into these¹:
  1. Babel
  2. Universal
  3. Legacy
  4. Unavoidable mess
  5. Political mess
  6. Whimsical

Thursday, February 26, 2015

Universal Unicode

What is the 'uni-' in unicode? According to the official records, it comes from Unique, Uniform, and Universal.

Unicode starts out with the realization that ASCII is ridiculously restrictive: the world is larger than the two sides of the Atlantic¹. This gives rise to all the blocks from Arabic to Zhuang.

However, the greatest promise of unicode lies not in catering to this tower of Babel but rather in those areas that are more universal. Yes, I know that technically this distinction between universal and international will not stand up to scrutiny.

Tuesday, January 6, 2015

Unicode and the Universe

If you're trilingual you speak three languages, if you're bilingual you speak two languages, if you're monolingual you're American.

Mark Harris on the python list
Well, if one reads the thread above, one finds that people were rather uptight with Mark Harris for that statement. And yet they have the same insular attitude towards ASCII-in-programming that Mark describes in Americans towards English (or, more correctly, Americanese); to wit, they consider that programming with ASCII (alone) is natural, easy, convenient, obvious, universal, inevitable, etc.

Is it mere coincidence that the 'A' of ASCII is short for American?

Not so long ago the world extended from a few kilometers east of The Garden of Eden to a few hundred kilometers west.  And then it stretched to a spherical globe of 40,000 km circumference.  At that time the gods used to light lamps at night called 'stars'.

And then things changed a wee little bit: the stars and our world – suddenly grown quite small – became more 'similar', and the wider world now stretches to a few billion light-years across.

In many respects the story of ASCII to Unicode is similar. Pragmatically, both represent a 0 → ∞ jump: it was natural to use the whole of the (printable) part of ASCII [many of us even used to know the code-points of ASCII quite well!], whereas with unicode, not only is any one person knowing all 1,114,112 code points unrealistic, even knowing which blocks exist is infeasible.
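The scale jump is easy to make concrete. Here is a small sketch using Python's standard unicodedata module; it counts the printable ASCII range against the named characters of just the Basic Multilingual Plane (the exact second count varies with the Unicode version your Python ships):

```python
import unicodedata

# Printable ASCII: code points 0x20..0x7E -- 95 characters in all,
# few enough that one could (and many did) memorise them.
printable_ascii = [chr(cp) for cp in range(0x20, 0x7F)]
print(len(printable_ascii))  # 95

# Named characters in the Basic Multilingual Plane alone.
# unicodedata.name() raises ValueError for unassigned/unnamed code points.
assigned_bmp = 0
for cp in range(0x10000):
    try:
        unicodedata.name(chr(cp))
        assigned_bmp += 1
    except ValueError:
        pass
print(assigned_bmp)  # tens of thousands -- and this is one plane of seventeen
```

And the BMP is itself a small corner of the full 1,114,112 code-point space.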

At base this is

The problem of meaning

The smaller world is naturally more meaningful than the larger one.  Just as one can have a warmer, fuzzier feeling about Momma than about woman-kind, one can at least imagine a God who selects a chosen people and is solicitous and possessive about them, as long as the world is comprehensible on my scale. When it becomes so large that life itself looks like a freak accident, such beliefs are harder to maintain.

As an example, consider Amerigo Vespucci:
We saw more wild animals—such as wild hogs, kids, deer, hares, and rabbits—than could ever have entered the ark of Noah; but we saw no domestic animals whatever… I fancied myself near the terrestrial paradise…
Vespucci was an adventurer, not a religious man.  By contrast today even a committed religious person would not ask whether a specific animal of the mundane world is found in the scripture of his choice. And I dare say Vespucci talks of paradise with a literalness that is not possible for a modern.

In effect our world has become so large it is difficult to give it meaning.

Likewise, even considering only extant languages…

Unicode is too large

People want to stick to ASCII because of the unending, terrifying swathes of undecipherable characters.  An argument I often hear is
Given that I have only ten fingers and a hundred or so keys in front of me, how am I to invoke a specific symbol from the hundred thousand or so that are available in Unicode?
Well… Dunno what to say… If I can go from 100 characters to 200, I am twice as rich. Why worry about the million I have no use for?

But it is really much worse

Unicode has plain gibberish

You don't play with Mahjong characters? How crude!
You don't know about cuneiform? How illiterate!
You don't compose poetry in Egyptian hieroglyphs? How rude!
Shavian has not reformed you? How backward!
In short, to make effective use of unicode, it may be worthwhile to distinguish the international blocks (also called the tower of babel) from the universal parts of unicode, viz. math.
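Unicode itself gives one rough, programmatic handle on this distinction. A sketch (using Unicode's general categories rather than its blocks, which is admittedly a cruder cut): the universal mathematical characters mostly carry the category 'Sm', while the babel blocks are letters of particular scripts.

```python
import unicodedata

def is_math_symbol(ch):
    # 'Sm' is the Unicode general category "Symbol, math".
    return unicodedata.category(ch) == 'Sm'

print(is_math_symbol('∀'))   # True: FOR ALL -- universal
print(is_math_symbol('अ'))   # False: DEVANAGARI LETTER A -- babel, not universal
```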

That is,

Unicode is like the universe

in the sense that in the pre-unicode era, the universe was so small that parochialism was unavoidable. Today it is so big, meaninglessness is inevitable.

In the medieval ASCII world one could choose between being one of:
  1. Dummy: To sell one's computer and work (and soul?) to a proprietary format and word-processing software
  2. Wizard: To master something intricate and complicated such as latex (or mathml, lilypond, troff…)
  3. Programmer: To believe that everything worth expressing can be expressed in ASCII

IOW…

God made ASCII. All the rest is the work of man.
And so we had before us a delicious à la carte offering:
  1. idiocy of ignorance
  2. slavery to savantery
  3. prison of penury
Now, while we are not completely free of these 'blessings' yet, we are better off than before, thanks to Unicode.

To see why 1 and 2 need not be the case any more, see some suggestions made in the context of python.  Now, while the suggestions are not quite serious and are unlikely to be taken seriously, as we go from established/old languages towards the bleeding edge they become more realistic.  Here are Julia and Agda.

As for not having to choose between 2 and 3, here's something I recently asked on the (la)tex list:

Here is the wikipedia page on the ε-δ definition of limit, where we see the well-known definition:


Editing it produces this excerpt [note this is input text]:
(\forall \varepsilon > 0)(\exists \ \delta > 0) (\forall x \in D)
(0 < |x - c| < \delta \ \Rightarrow \ |f(x) - L| < \varepsilon)


Now compare it with the following – also input text:

(∀ ε > 0) (∃ δ > 0) (∀ x ∈ D) (0 < |x − c| < δ  ⇒  |f(x) - L| < ε)

[Note particularly the real minus between x and c and the ASCII hyphen minus between f(x) and L]
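The distinction is machine-checkable, not just visual: the two dashes are different code points with different official names, as a quick Python look-up shows.

```python
import unicodedata

real_minus = '−'    # U+2212
hyphen_minus = '-'  # U+002D

print(unicodedata.name(real_minus))    # MINUS SIGN
print(unicodedata.name(hyphen_minus))  # HYPHEN-MINUS
print(real_minus == hyphen_minus)      # False
```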

In this age of unicode when we have xetex/luatex why do we use the first when the second is so much closer to the desired result?
Hopefully most people would agree the latter is more readable than the former.
The questions that remain are
  1. Typing it in.
  2. Is it close to luatex/xetex? 
For 2. I'd welcome help/suggestions ;-)

For 1., I've just recently discovered pointless-xcompose, which goes a good way towards solving this, at least on linux¹.

And I suggest we distinguish these

Levels of Input Methods

  1. Cut paste a character after searching with google
  2. Select a character from a local app like gucharmap (emacs: C-x 8 Ret)
  3. Use an editor abbrev(iation)
  4. Use an editor input method, e.g. emacs' tex input-method will convert \forall into ∀, etc.
  5. Use the compose-key (Windows users may try this – dunno…) 
  6. Switch keyboard layouts in software with something like ibus
  7. Use a special purpose hardware keyboard
As we go from 1 to 7, expertise and efficiency increase, but so do the expense of setup, hardware, etc. and, most important, of learning. The cost of assuming that only the extreme choices – 1 and 7 – are available, and not all the other interim possibilities, is the binary choice between meaninglessness and parochialism.
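Levels 3 and 4 are easy to prototype. Here is a toy sketch in Python (the function and the tiny table are mine for illustration, not any editor's actual API) that rewrites TeX-style abbreviations into their unicode characters, in the spirit of emacs' tex input-method:

```python
# A toy TeX-style input method: textual replacement of \name abbreviations.
ABBREVS = {
    r'\forall': '∀',
    r'\exists': '∃',
    r'\in': '∈',
    r'\Rightarrow': '⇒',
    r'\varepsilon': 'ε',
    r'\delta': 'δ',
}

def expand(text):
    # Replace longer abbreviations first so \varepsilon is not clobbered
    # by the shorter \in lurking inside other names.
    for abbrev in sorted(ABBREVS, key=len, reverse=True):
        text = text.replace(abbrev, ABBREVS[abbrev])
    return text

print(expand(r'(\forall x \in D)'))  # (∀ x ∈ D)
```

A real editor input method is of course incremental (expanding as you type) rather than batch, but the table-driven core is the same.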

IOW placing the slider effectively along this spectrum represents an efficient…

Huffman coding

applied to keystrokes and mouse gestures (in analogy to bits)

For a while now I've used 1 and 3.

Combining 3 and 4 thanks to pointless-xcompose is, I expect, going to be more convenient and effective, especially when it is tailored to the subset of characters one needs frequently.

The one thing not clear is how to set up the compose key. I'm a complete noob myself, but on a recent linux¹ this may work:

$ setxkbmap -option compose:menu

to make the menu key behave like compose.  Replace 'menu' with 'rwin' or 'ralt' to get the same behavior from the right-windows or right-alt keys.
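Once a compose key exists, per-user sequences can go in ~/.XCompose. The particular sequences below are illustrative choices of mine, not defaults:

```
# ~/.XCompose -- custom sequences (keysyms on the left, resulting string on the right)
include "%L"                            # keep the system-wide compose table
<Multi_key> <f> <a>           : "∀"
<Multi_key> <minus> <greater> : "→"
```

This is exactly the file format that pointless-xcompose generates entries for.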

Acknowledgements

  1. Thomas Reuben for writing pointless-xcompose
  2. David de la Harpe Golden for introducing me to xkb (setxkbmap)


¹ Thomas Reuben, author of pointless-xcompose, points out to me that saying linux is inappropriate where X-windows would be more correct. He is right.
Left the linux there as more people are likely to know they are using linux than that they are using X-windows  ☺

Tuesday, May 13, 2014

Unicode in Haskell Source

After writing Unicoded Python, I discovered that Haskell can do some of this already.  No, it's not even halfway there, but I am still mighty pleased!

Tuesday, April 29, 2014

Unicode and the Unix Assumption

Once upon a time, a file was a rich, profound, daunting and wondrously messy concept. It involved ideas like
  • record orientation
  • blocking factor
  • partitioned data sets
and other wonders of computer (rocket) science.

Then there came along two upstarts, playing around in their spare time with a machine that their lab had junked. They were having a lot of fun…

They decided that for them File was just List of Bytes.
type File = [Byte]
Oh the fun of it!
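In Python terms, the Unix view and the unicode question it raises look like this (a sketch; the temp-file path and contents are incidental):

```python
import os
import tempfile

# Unix: a file is just a sequence of bytes -- no records, no blocking factor.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'wb') as f:
    f.write('héllo\n'.encode('utf-8'))

with open(path, 'rb') as f:
    raw = f.read()          # bytes, nothing more
print(raw)                  # b'h\xc3\xa9llo\n'

# Text appears only once *we* impose an encoding on those bytes --
# which is precisely where unicode meets the Unix assumption.
print(raw.decode('utf-8'))  # héllo
```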

Saturday, April 19, 2014

Unicode in Python

1 Introduction

Python has been making long strides in embracing unicode. With python 3 we are at a stage where python programs can support unicode well; however, python program source is still completely drawn from the ASCII subset of unicode.
Well… Actually, with python 3 (not 2) this is already possible:
from math import sqrt

def solvequadratic(a, b, c):
    Δ = b*b - 4*a*c
    α = (-b + sqrt(Δ))/(2*a)
    β = (-b - sqrt(Δ))/(2*a)
    return (α, β)

>>> solvequadratic(1,-5,6)
(3.0, 2.0)
>>>
Now to move ahead!

Tuesday, September 17, 2013

Haskell: From unicode friendly to unicode embracing

Doesn't λ x ⦁ x  :  α → α look better and communicate more clearly than \ a -> a :: a -> a  ?

What are the problems with the second (current Haskell) form?
  1. The a in the value world is the same as the a in the type world -- a minor nuisance and avoidable -- one can use different names
  2. λ looks like \
  3. The purely syntactic -> that separates a lambda-variable and its body is the same token that denotes a deep semantic concept -- the function space constructor
APL was one of the oldest programming languages and is still one of the most visually striking.  It did not succeed, for various reasons, the most notable of which is that its heyday came too long before unicode.

While APL was the first to use mathematical notation in programming, Squiggol, Bananas, and Agda are more recent precedents in this direction.

In short, it's time for programming languages to move from unicode-friendly to unicode-embracing.

Some stray thoughts on incorporating these ideas into Haskell.