Saturday, April 19, 2014

Unicode in Python

1 Introduction

Python has been making long strides in embracing unicode. With python 3 we are at a stage where python programs can support unicode well; however, python program-source is still completely drawn from the ASCII subset of unicode.
Well… actually, with python 3 (not 2) unicode in the source is already possible:
from math import sqrt

def solvequadratic(a,b,c):
    Δ = b*b - 4*a*c
    α = (-b + sqrt(Δ))/(2*a)
    β = (-b - sqrt(Δ))/(2*a)
    return (α, β)

>>> solvequadratic(1,-5,6)
(3.0, 2.0)
Now to move ahead!

Why do we have to write x!=y then argue about the status of x<>y when we can simply write x≠y?
Or take a random example from the tutor list:
import math
print math.pi
print math.floor( 31.58889 )
print math.ceil( 31.58889 )
as compared to
print π
print ⌊31.58889
print ⌈31.58889

So we could say python is half-way towards becoming a full unicode language. To move in this direction can mean at least two things:
  1. Make python 'native' to other natural human languages
  2. Embrace the universal (ie mathematical) side of unicode more fully
Item 1 is all about internationalization and localization. This writeup addresses only item 2. It is given in the form of tables showing how current Ascii syntax could transform into a Unicode-embracing one.

The ideas came from a number of people on the python list – see references below.

2 Legend

Since most of the following is in the form of tables with current (Ascii) syntax juxtaposed with the more unicode-d one, it turns out that many of the comments on these pairs are similar and repetitious. To keep these tables neat, the repeating comments are spelt out first as under:

2.1 Math Space Advantage – MSA

One of the less noticed benefits of math(-like) operators is that a math-op like + in program text is lexically unambiguous, ie '+x' is two tokens, + and x, and not a single token composed of + and x. This is unlike alphanumerics, where the following three lines are all lexically different, which makes the spaces mandatory:
for x in line:
for x inline:
forx in line:

We will see that moving to a more pervasively unicode form makes many spaces that are currently inevitable unnecessary.
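The lexical claim can be checked with the standard tokenize module. This is only a minimal sketch; the helper name tokens is my own and not part of any library.

```python
import io
import tokenize

def tokens(source):
    """Return the NAME, NUMBER and OP token strings found in source."""
    wanted = (tokenize.NAME, tokenize.NUMBER, tokenize.OP)
    readline = io.StringIO(source + "\n").readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type in wanted]

print(tokens("x+y"))      # the operator separates the names without any spaces
print(tokens("x in y"))   # the spaces separate the keyword from the names
print(tokens("xiny"))     # without the spaces there is just one name
```

The first call gives three tokens although the text has no spaces; the last gives a single name, which is why the spaces around in cannot be dropped.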
Below I will point such cases out with a 'MSA'. In some cases it's technically required to have spaces, in others it's just more aesthetic to have them. eg.
x in lst
cannot be written as
xinlst
whereas
1 in [1,2,3]
can be written
1in[1,2,3]
However that's completely unreadable.
There's no such problem with
1∈[1,2,3]
So replacing in by ∈ has a math-space-advantage (MSA). It also has the advantage of disambiguation (next section).

2.2 Disambiguation – Dis

The in in for loops and the in in predicates have very different semantics conceptually; the latter is purely declarative, the former creates a binding. So having two unmixupable ins is good for reducing confusion, eg.
for x ⬅ [1,2,3]:
and
if x ∈ [1,2,3]:
IOW due to the extreme scarcity of characters in Ascii, many characters have for generations been overloaded willy-nilly. As that scarcity becomes a thing of the past maybe we should avoid useless overloading? These cases are marked by Dis.

2.3 Name Space burden reduction – NS

A (perfectly normal English) word like floor or ceiling cannot be put into the global (builtin) namespace because a programmer may want to use that name with its usual or a related connotation. For symbols like ⌊,⌈ no such issue arises. Such cases are marked NS.

2.4 Unicode Choice – UC

In many (all?) cases unicode offers so much new variety that it's not clear which choice to make. Such choices are indicated with UC.

2.5 Font Issue – FI

When things are not looking exactly proper/pretty on my end and it seems to be a font issue, I'll mark a FI

3 Basic math

Ascii              Unicode      Notes
2*pi*r             2×π×r        FI
x!=y               x≠y
x<=y               x≤y
x>=y               x≥y
q,r=divmod(a,b)    q,r=a÷b      1
float('inf')       ∞            NS
pow(2,4)           2⇑4          2
2**4               2⇑4          2
math.floor(3.5)    ⌊3.5         NS
math.ceil(3.5)     ⌈3.5         NS
1. Python already has a large bunch of division related operators and functions: /, //, %, divmod. Given that quotient together with remainder is a common integer arithmetic pattern, and that structured return values are much easier in python than in classic imperative languages like C, my preference is for ÷ to stand in for divmod. Other choices with their justifications are of course possible.
2. Are pow and ** the same?
Do x and × look the same? If yes, this is a problem and maybe * is just preferable?
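For reference, here is how the Ascii side of this table behaves in today's python 3. A minimal sketch; the operands 17 and 5 are arbitrary values of my own. It also goes some way towards the pow question: with two arguments pow and ** give the same result, but pow additionally takes an optional third (modulus) argument.

```python
import math

q, r = divmod(17, 5)              # quotient and remainder in one call
inf = float("inf")                # needs the string 'inf'; a bare inf is a NameError
same_power = pow(2, 4) == 2 ** 4  # two-argument pow agrees with **
modular = pow(2, 4, 5)            # three-argument form: (2 ** 4) % 5
lo, hi = math.floor(3.5), math.ceil(3.5)

print(q, r, inf > 10 ** 100, same_power, modular, lo, hi)
```

Note that the function is math.ceil, not math.ceiling, and that floor and ceil must be reached through the math module, which is the namespace burden that NS refers to.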

4 Other basic Syntax

4.1 Assignment ←

Ascii        Unicode
x = 1        x ← 1
x,y = y,x    x,y ← y,x
x += y       x +← y
If only one could count the grief caused by thinking that = is math-equality: not just noobs but also experienced C programmers mistakenly put a = when they meant ==
The ← is not looking very nice out here (in different fonts): either too scrawny or too stubby. So...
While in an earlier version of this post I had used ← in the examples, I am (for now) reverting to good ol' =

4.2 Attribute access →

Ascii                       Unicode                   Notes
sys.argv[1]                 sys→argv[1]
(5).to_bytes(4,"little")    5→to_bytes(4,"little")    Dis, MSA
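The reason the parentheses are needed in the Ascii form is that the dot is overloaded between attribute access and the decimal point. A small sketch of that in python 3 (the file name "<example>" passed to compile is just a placeholder of mine):

```python
# The parenthesized form from the table works as written.
packed = (5).to_bytes(4, "little")

# Without parentheses the tokenizer reads "5." as the start of a float literal.
try:
    compile('5.to_bytes(4, "little")', "<example>", "eval")
    bare_form_parses = True
except SyntaxError:
    bare_form_parses = False

print(packed, bare_form_parses)
```

With a separate character for attribute access this clash would not arise, which is the Dis in the table.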

4.3 in (predicate)

Ascii           Unicode         Notes
1 in [1,2,3]    1 ∈ [1,2,3]     MSA,FI
Most of the fonts I've checked make the ∈ a little too large
I guess this should be treated as a transient problem – a fixable bug

4.4 in (for)

Ascii                   Unicode               Notes
for x in [1,2,3,4]:     for x⬅[1,2,3,4]:      MSA,UC
The sign could be any one of ⬅ ⇐ ⇦ ?
The two ins, now disambiguated to ⬅ and ∈, should be a help to noobs

4.5 lambda λ

Ascii            Unicode      Notes
lambda x: x+3    λx: x+3      MSA

5 Logic

Ascii      Unicode    Notes
not x      ¬x         MSA
x and y    x∧y        MSA
x or y     x∨y        MSA

6 Collections

Sets, Bags and Lists (numpy arrays??) form a series. Having literals for all makes some succinct expressions possible

6.1 Lists

Ascii          Unicode        Notes
[1,2]+[3,4]    [1,2]⤚[3,4]    Dis
List append is not symmetric (commutative).
The operator should reflect that fact.

6.2 Set theory

The most natural character for set literals is '{}'. However, given that
  • it is already taken by dicts
  • and dicts are more fundamental to programming than sets
⦃ ⦄ should be a good enough approximation to conventional usage
Common set theory operators that mathematicians use: ∈ ∉ ⊂ ⊃ ⊆ ⊇ ⊈ ⊉ ∪ ∩ ∅
Now unicode makes these available without any markup
Ascii – OO forms      Ascii – functional forms    Unicode          Notes
set([])               set([])
s = set([1,2,3])                                  s=⦃1,2,3⦄        MSA
t = set([2,3,4,5])                                t=⦃2,3,4,5⦄
                      x in s                      x∈s              MSA
                      x not in s                  x∉s              MSA
s.issubset(t)         s<=t                        s⊆t
???                   s<t                         s⊂t
not s.issubset(t)     not (s <= t)                s⊈t              1,2,3
                      set([1]) <= set([2,1])      ⦃1⦄ ⊆ ⦃2,1⦄      3
s.issuperset(t)       s>=t                        s⊇t
s.union(t)            s|t                         s∪t
s.intersection(t)     s&t                         s∩t
s.difference(t)       s-t                         s∖t              FI,UC
                      s^t                         s∆t
s.update(t)           s|=t                        s∪=t
                      s&=t                        s∩=t
1. For numbers, not (x <= y) ≡ x > y. This is not the case for sets. In somewhat incorrect(!) math jargon, <= is a total order whereas ⊆ is a partial order. Therefore ⊈ is needed more than a negated <= is.
2. The low precedence of not makes the parentheses unnecessary but I find that confusing
3. While in general the OO form (column 1) is the most verbose, in these cases it is more readable than column 2
Are s\t and s∖t distinguishable? They don't look so to me…
Unicode gives one of the names of ∆ as "symmetric difference".
Don't know of any natural/standard sign for difference (other than '-' '\' '/'). There are zillions of other symbols of course.
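The partial-order point is easy to see with the two sets from the table. A minimal sketch in current python, where the operator forms of column 2 already exist:

```python
s = {1, 2, 3}          # set literal; same value as set([1, 2, 3])
t = {2, 3, 4, 5}

# Neither set contains the other, so <= and > are both False:
# for sets, not (s <= t) does not imply s > t.
print(s <= t, s > t)

print(s | t, s & t, s - t, s ^ t)   # union, intersection, difference, symmetric difference

u = set(s)
u |= t                 # in-place union, the operator spelling of u.update(t)
print(u == s.union(t))
```

Both comparisons print False, which is exactly the situation that cannot occur with numbers.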

6.3 Counter (bag/multiset)

Ascii                        Unicode               Notes
c = Counter(a=3, b=4)        c = ⟅'a':3, 'b':4⟆    NS
d = Counter(a=1, b=2)        d = ⟅'a':1, 'b':2⟆
c + d                        c ⊕ d
Counter({'b': 6, 'a': 4})    ⟅'a':4, 'b':6⟆
c & d                        c ∩ d
Counter({'a': 1, 'b': 2})    ⟅'a':1, 'b':2⟆
c | d                        c ∪ d
Counter({'a': 3, 'b': 4})    ⟅'a':3, 'b':4⟆
Counter can only be used after from collections import Counter
Having to do this is an avoidable headache.
Not having to do this (in the current dispensation) would entail polluting the global namespace
It's another matter that Counter is an unfortunate name choice, given that
  • Bag/Multiset already exist and are well known
  • Counter already has many other established meanings in CS
Note that list 'addition' (append) is not symmetric, hence the asymmetric ⤚
Bag 'addition' is symmetric. The operator should reflect that.
Which symbol to use? ⊕ or ⋄ ?
The ⊕ (in code) looks worse than the plain ⋄ out here
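The Ascii column can be checked directly in current python. A short sketch, which also shows the symmetry difference between bag addition and list append:

```python
from collections import Counter

c = Counter(a=3, b=4)
d = Counter(a=1, b=2)

print(c + d)    # counts are added
print(c & d)    # minimum of each count
print(c | d)    # maximum of each count

# Bag addition is symmetric, list concatenation is not.
print(c + d == d + c)
print([1, 2] + [3, 4] == [3, 4] + [1, 2])
```

Note that & takes the smaller count per key, so c & d keeps 'b' at 2.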

6.4 Casting

Python already has 'natural' casting (at the type level). Given
l = [1,2,3]
we can do
s = set(l)
c = Counter(l)
Literals even allow for use of the most 'natural' operators
Type       Operation
Set        ∪, ∩
Counter    ⊕
List       ⤚
with the general rule that the upper-row operators pull lower data upwards, eg
[2,1,2]∪[2,3,4,5] = ⦃1,2,3,4,5⦄
ie order and repetition vanish
[2,1,2]⊕[2,3,4,5] = ⟅1:1, 2:3, 3:1, 4:1, 5:1⟆
ie order vanishes, repetition is maintained
Disambiguated literals make natural casting possible:
x∪y expects x, y to be sets. What if they are not?? Simple: they are cast to sets
Likewise x⊕y expects x, y to be Counters. Else they are cast to Counters
Presence of literals makes other things possible and natural, eg…
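In today's python the same casts have to be spelled out by hand. A minimal sketch using the two lists of the ⊕ example above (the names x, y, as_set and as_bag are mine):

```python
from collections import Counter

x = [2, 1, 2]
y = [2, 3, 4, 5]

# Cast to sets, then union: order and repetition vanish.
as_set = set(x) | set(y)

# Cast to Counters, then add: order vanishes, repetition is kept.
as_bag = Counter(x) + Counter(y)

print(as_set)
print(as_bag)
```

The proposal amounts to letting the operator choose the cast, so that the explicit set(...) and Counter(...) calls disappear.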

6.5 Comprehensions

Once we have literals for sets and bags we can have comprehensions for them:
Natural Comprehensions
Natural because the input and output collections are of the same type
We can also have
Casting Comprehensions
ie the intention of the list-to-set cast is that order and repetition are discarded
Note: Many noob misunderstandings re comprehensions come from the clever pun between the for in loops and the for in comprehensions. This removes that problem
UC: The │ (∣) is not the usual | (codepoint 9474 vs 124). It could be some other character: in addition to the ascii | there are │ ∣ ┃ ¦ (and probably more!!)
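For comparison, here is roughly what current python offers along these lines. A sketch only; the sample list words is made up. Sets have their own comprehension syntax, bags do not, so a generator has to be handed to Counter.

```python
from collections import Counter

words = ["b", "a", "b", "c", "a", "b"]

# List in, list out: order and repetition both survive.
as_list = [w.upper() for w in words]

# List in, set out: order and repetition are discarded.
as_set = {w.upper() for w in words}

# List in, bag out: order is discarded, repetition is counted.
# There is no Counter comprehension syntax, so a generator is passed in.
as_bag = Counter(w.upper() for w in words)

print(as_list, as_set, as_bag)
```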

6.6 N-ary Operators

6.6.1 Examples

In mathematics there are a number of constructs like ∑, ∀ etc. They can be subsumed under the general concept of n-ary operators – aka generalized products.

6.6.2 Types

N-ary operators are complementary to comprehensions. If t is some type and C is one of set, Counter or list, then
comprehensions can be thought to have type t → C(t)
N-ary operators
can be thought to have type C(t) → t

6.6.3 Correlations

N-ary operators are like reduce in that they generalize a binary operator to a collection.
N-ary operators are like lambda/comprehensions in that they imply a local binding
However there are issues. Consider for some arbitrary term t(x)
(∑ x∈⦃1,2,3⦄ : t(x))
= t(1) + t(2) + t(3)
However there is a catch: ⦃1,2,3⦄ == ⦃1,2,3,1,2⦄ [In standard python syntax set([1,2,3,1,2]) == set([1,2,3]) ]
That is, since sets contain elements whose repetition count is unspecified, the sum above is also t(1)+t(2)+t(3)+t(1)+t(2) or anything else!!
So clearly the appropriate collection for a ∑ is Counter, not set or list
In general, we see that for the n-ary operators we also have a natural collection over which they operate
Operator    n-ary    Natural Collection
+           ∑        Counter
×           ∏        Counter
In general the principle is that for operators that are commutative and associative we use Counter. For operators that are idempotent as well we use Set.
Note that if an operator is not commutative and associative it has no meaningful n-ary. If it is, then list is over-specific; which is why we only find set and counter above.
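The catch with summing over a set can be made concrete in current python. A minimal sketch: the term t is a stand-in of my own (squaring), and Counter.elements() is used to walk the bag with its repetitions.

```python
from collections import Counter
from functools import reduce
import operator

def t(x):
    """Hypothetical term t(x); any function of x would do."""
    return x * x

as_list = [1, 2, 3, 1, 2]
as_set = set(as_list)       # repetition lost: {1, 2, 3}
as_bag = Counter(as_list)   # repetition kept: 1 twice, 2 twice, 3 once

sum_over_set = sum(t(x) for x in as_set)
sum_over_bag = sum(t(x) for x in as_bag.elements())

# reduce generalizes a binary operator to a whole collection.
product_over_bag = reduce(operator.mul, as_bag.elements(), 1)

print(sum_over_set, sum_over_bag, product_over_bag)
```

The two sums differ, so which collection the ∑ ranges over is part of its meaning.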

7 Strings/Quoteds

Python has a menagerie of quoteds and unicode has a corresponding one of quote-like characters. How to match them I'm not really sure... Here's a start
Ascii                                   Unicode
"Tom said \"Mary said \"Yoohoo!\"\""    «Tom said «Mary said «Yoohoo!»»»
r"a\nb"                                 ‹a\nb›
u"हरि ॐ"                                ⟪हरि ॐ⟫
Note that whether » is one character or two is similar to the problem we have with quotes. Is '' a single double-quote or a double(d) single quote? Depending on the font this may be obvious or not
The above – so-called 'French-quotes' – seem to be widely used in languages other than French. German quotes however have some inconsistency problems.

Maybe code literals (compile, parser etc) ⟦ ⟧ following denotational semantics?
Ascii                           Unicode
code = compile('a + 5',...)     code = compile(⟦a + 5⟧, ...)

[Seems neat in the context of Lisp or denotational semantics, not sure of python]

8 Long·Identifiers

There is also some evidence (?) suggesting that a-long-identifier is more readable than a_long_identifier, which in turn is more readable than aLongIdentifier
The hyphenated option suffers from a severe ambiguity because hyphen and minus are the same character…
… in Ascii only!
(Well, Lisp and Cobol are exceptions but they incur their own heavy cost: math expressions can't be written naturally)
No more! Now we can write a·long·identifier

9 is

Ascii     Unicode
a is b    a ≣ b
Or ≡ ?
The difficulties/noob-confusions of python's is should significantly reduce with this!
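The usual confusion is between identity (is) and equality (==). A tiny sketch of the difference in current python:

```python
a = [1, 2, 3]
b = [1, 2, 3]   # equal contents, but a separate list object
c = a           # a second name for the same object

print(a == b, a is b)   # equal, not identical
print(a == c, a is c)   # equal and identical
```

A visibly different sign such as ≣ would at least signal that is asks a different question from ==.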

10 APL/Numpy integration

Ideas in numpy are largely lifted from APL. Unicode makes it possible to carry over (some of!) APL's lexemes as well, while taking care not to go overboard and repeat APL's mistakes!
Ascii              Unicode
array([2,3,4])     ⟨2,3,4⟩
range(10)          ⍳10
a.shape            ⍴a
a.reshape(2,3)     a⍴(2,3)
take(a,2)          a↑2
drop(a,2)          a↓2
Numpy-array comprehensions
Advanced stuff – probably with inspiration from Alpha-Polyhedra

11 Questionable below

12 Keywords and Special Constants

Following Antoon's wish for def we could have 𝗮𝗯𝗰𝗱𝗲𝗳𝗴𝗵𝗶𝗷𝗸𝗹𝗺𝗻𝗼𝗽𝗾𝗿𝘀𝘁𝘂𝘃𝘄𝘅𝘆𝘇 versions of the following keywords
and del for is raise
assert elif from lambda return
break else global not try
class except if or while
continue exec import pass yield
def finally in print
I personally consider it more important to have 𝐍𝐨𝐧𝐞 (𝗡𝗼𝗻𝗲 ?) and 𝕋, 𝔽 (T, F) for True and False
I wonder about this
  1. Really mixing up fonts with characters seems like a bad idea (for programming). Why not colors? Sizes?…
  2. More generally most of the SMP seems like nonsense (to me) 
  3. Finally this does not seem to be working! So even if SMP is a good idea it's probably not ready for general use (Trying numeric 핋 120139 dec or 핋 ie hex 1D54B )
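One more data point on mixing fonts with characters: python 3 normalizes identifiers with NFKC while parsing, and NFKC folds these 'font variant' letters onto the plain ones. A small sketch using the code point 1D54B mentioned above (the variable names are mine):

```python
import unicodedata

double_struck_t = chr(0x1D54B)      # MATHEMATICAL DOUBLE-STRUCK CAPITAL T
print(ord(double_struck_t))         # decimal code point

# NFKC folds the 'font variant' letters onto their plain counterparts.
folded = unicodedata.normalize("NFKC", double_struck_t)
print(folded)

# Python 3 applies NFKC to identifiers while parsing, so the name that
# actually gets bound below is the plain letter.
namespace = {}
exec(double_struck_t + " = True", namespace)
print("T" in namespace)
```

So as things stand, a double-struck T used as a name would be indistinguishable from an ordinary T to the language, whatever it looks like on screen.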

13 Root

Ascii      Unicode
sqrt(x)    √x
Looks like poor over-specific syntax (to me) (But what do I know?!)

14 Operators

Large swathes of unicode's math-space could be made available in the operator module, which users (aka programmers) could choose to bind at will.
Given the experience with the readability of APL this may be ill-advised… Maybe not: C++ devotees like the possibilities of overloading basic arithmetic operators.

15 References

15.1 Steven D'Aprano

  1. π (some other math symbols?) [Steven ?]
  2. (Problems with) ∑ for sum Steven 1
  3. Steven 2 example: was towards showing that something like this is undesirable:
    import ⌺
    ⌚ = ⌺.╩░
    ⑥ = 5*⌺.⋨⋩
    ❹ = ⑥ - 1
    ♅⚕⚛ = [⌺.✱✳**⌺.❇*❹{⠪|⌚.∣} forin ⌺.⣚]
    Somebody else pointed out that this is actually valid. Can't remember who, and I certainly can't make this (as is) work.
  4. That mathematicians use sets does not make sets as fundamental in programming as dicts – [Steven ?] (so {} for dicts and something else for sets is ok)

15.2 Antoon Pardon

  1. · for ident separator (instead of '_') [Antoon ?]
  2. × for multiplication Antoon 2
  3. ⇑ for exponentiation [Antoon ?]
  4. → for attribute access Antoon 3
  5. ⤚ for list append Antoon 3
  6. bold (SMP) letters in identifiers [Antoon ?]

15.3 Mark Harris

  1. ∈ ∉ ∀ Δ Mark 1 Mark 2
  2. √ for sqrt Mark ?
