A Binary Branching Theory

Tue, April 16, 1:30 to 3:00pm


Binary branching is a theory of how the basic building blocks of Vietnamese syllable phonology and word formation are related (see Nhan 1984). This theory is applied here to discover the basic building blocks of the East Asian ideographic scripts, which appeared long before the birth of China and which have been used in China, Japan, Korea, Vietnam and Singapore. The East Asian scripts, like all Asian scripts, are built on syllables, not on alphabets.

An ideogram is written with strokes arranged neatly within an imaginary box. This is probably inconceivable to outsiders, especially the Europeans who named it a “character.” The misconception has been perpetuated by Europeans from their first contacts until today, with the International Organization for Standardization's ISO/IEC 10646 and the UniHan 11.0.0 (July 2018) multilingual character codes used in all computers. It is believed that each ideogram is a character, equivalent to a letter in a European alphabet. This position ends up with UniHan containing 88,889 “characters.” A glaring error! Which alphabet has 88,889 different “letters”? And worse: the ideogram repertoire of UniHan is not even complete; more ideograms have yet to be included.

In reality, to any reader or writer in these countries, each of the 88,889 ideograms is meaningful and represents a syllable. Each is composed of other ideograms: for example, 國 “nation” is composed of 囗 and 或, and 語 “language” is composed of 訁 and 吾.

Vietnam used a 國語 national script called 𡨸喃 chữ Nôm to represent its language for about 1,000 years, until the 1920s. Our collection of known Nôm ideograms currently stands at 31,931, covering 19,564 unique ideograms. It is far from exhaustive.

This paper uses, as an alternative, a recursive binary deconstruction of known ideograms using internal regularities of their graphic representation, to arrive at a set of the smallest meaningful units.  We call these orthographic units.

We are motivated by the fact that people who are fluent in Nôm intuitively identify ideograms by their graphic parts, especially when they spell them out loud. Spelling an ideogram reveals how native readers conceive of their script. This approach successively breaks down a sample of 17,823 ideograms into their smallest meaningful parts, i.e. orthographic units, together with their graphic operators (UniHan calls them “ideographic description characters”). Thus, for example, the ideograms 漴 sòng, 𧐿 sùng, 𣙩 song, and 𠼾 song are each composed of a strict sequence of the operator ⿰, a classifier (currently called a “radical”), and 崇 sùng:
漴 → ⿰ 氵+崇; 𠼾 → ⿰ 口+崇; 𣙩 → ⿰ 木+崇; 𧐿 → ⿰ 虫+崇;
in turn, 崇 is composed of ⿱ 山+宗;
in turn, 宗 is composed of ⿱ 宀+示;
in turn, 示 is composed of ⿱ 二+小; and
in turn, 二 is composed of ⿱ 一+一.
We say that 一 nhất, 二 nhị, 小 tiểu, 宀 miên, 山 sơn, 木 mộc, 口 khẩu, 氵 thuỷ, 虫 trùng, and ⺀ nháy are orthographic units that successively form 示 kỳ, 宗 tông, 崇 sùng, 漴 sòng, 𧐿 sùng, 𣙩 song, and 𠼾 song in specific configurations. They also form 呩, 沶, 柰, 标, 祘, 淙, 棕, 崈, and 𠮛, 亖, 亗, 吅, 吕, 𠮿, 品, 㝉, 宫, 屾, 岀, 未, 末, 本, 杏, 束, 杣, 林, 棕, 森, 虽, 蚞, 䖵, … exponentially.
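The recursive deconstruction above can be sketched in a few lines of Python. This is only an illustration, not the procedure used on the full sample: the table RULES holds just the eight binary rules quoted above, and the names RULES, OPERATORS, expand and units are hypothetical, chosen for this sketch. Any ideogram without a rule in the table is treated as an undecomposed unit.

```python
# Binary rules copied from the examples above:
# ideogram -> (operator, first part, second part).
# Only these eight rules are known to this sketch.
RULES = {
    "漴": ("⿰", "氵", "崇"),
    "𠼾": ("⿰", "口", "崇"),
    "𣙩": ("⿰", "木", "崇"),
    "𧐿": ("⿰", "虫", "崇"),
    "崇": ("⿱", "山", "宗"),
    "宗": ("⿱", "宀", "示"),
    "示": ("⿱", "二", "小"),
    "二": ("⿱", "一", "一"),
}

# The two graphic operators that occur in the rules above.
OPERATORS = {"⿰", "⿱"}


def expand(ideogram):
    """Return the fully expanded, linearly ordered sequence of
    operators and parts, with each operator preceding its two parts."""
    if ideogram not in RULES:
        # No rule in the table: treat as an undecomposed unit.
        return [ideogram]
    operator, first, second = RULES[ideogram]
    return [operator] + expand(first) + expand(second)


def units(ideogram):
    """Return the distinct undecomposed units, in order of first appearance."""
    seen = []
    for symbol in expand(ideogram):
        if symbol not in OPERATORS and symbol not in seen:
            seen.append(symbol)
    return seen


if __name__ == "__main__":
    for ideogram in ["漴", "𠼾", "𣙩", "𧐿"]:
        print(ideogram, "".join(expand(ideogram)), units(ideogram))
```

Each step splits an ideogram into exactly two parts under one operator, so the expansion is a binary tree; expand reads that tree off as a linear sequence, and units collects its leaves.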

An ideogram is thus composed of a linearly ordered set of other ideograms, arranged binarily within an imaginary square.


