Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, and Gujarati — FIXED ! #87

Open
@r12a

Description

This issue applies to most languages whose scripts form conjuncts with a virama.

Many scripts descended from Brahmi indicate clusters of consonant sounds by merging or stacking the glyphs of the consonants involved in one way or another. These scripts are abugidas, and each consonant character represents a consonant sound and an inherent vowel sound. The merging of glyphs indicates that the inherent vowel sound is dropped between the consonants. In Unicode text, this merging is usually accomplished using a special character between the consonants, which is typically called a virama or 'vowel-killer'.

When operations such as line breaking, cursor movement, vertical text rendering, deletion, and hyphenation are applied to the text, these conjuncts must not be split apart. (Line-break opportunities in these scripts usually occur at inter-word spaces, but conjuncts should still be kept together when a very long word doesn't entirely fit on a line, when the CSS word-break property is set to break-all, or when the CSS line-break property is set to anywhere.)

A grapheme is a user-perceived unit of text. Text operations that use graphemes as a unit of text include line-breaking, forwards deletion, cursor movement & selection, character counts, text spacing, text insertion, justification, case conversions, and sorting. The Unicode Standard uses generalised rules to define 'grapheme clusters', which approximate the likely grapheme boundaries in a writing system.

The GAP

Up to Unicode 15.0, the Unicode concept of grapheme cluster fails to represent syllabic conjuncts (together with any attached vowel signs and other marks) in scripts such as Bengali, Devanagari, and Gujarati. As a result, editing operations, line-breaking algorithms, vertical text layout, and similar operations are likely to break text at the wrong point.

Conjuncts are not kept together because those segmentation rules start a new grapheme cluster after the virama, which separates the following consonant from the one before it.
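To make the break after the virama concrete, here is a minimal Python sketch of the older behaviour for simple Devanagari input. It is not the full UAX #29 algorithm: the function names are hypothetical, and General_Category values (Mn, Me, Mc) are assumed as a rough stand-in for the Grapheme_Cluster_Break values Extend and SpacingMark, which Python's standard library does not expose.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER


def extends_cluster(ch):
    """Rough stand-in (an assumption) for GCB = Extend, ZWJ or SpacingMark."""
    return ch == ZWJ or unicodedata.category(ch) in ("Mn", "Me", "Mc")


def legacy_clusters(text):
    """Approximate pre-15.1 segmentation: only marks attach to what precedes them."""
    clusters = []
    for ch in text:
        if clusters and extends_cluster(ch):
            clusters[-1] += ch   # a mark (including the virama) joins the current cluster
        else:
            clusters.append(ch)  # any other character, e.g. a consonant, starts a new one
    return clusters


# KA + VIRAMA + SSA + VOWEL SIGN I, rendered as a single conjunct syllable
word = "\u0915\u094D\u0937\u093F"
print(len(legacy_clusters(word)))  # 2: KA+VIRAMA, then SSA+VOWEL SIGN I
```

The virama is itself a combining mark, so it attaches to the first consonant, but nothing in these rules lets the second consonant join the cluster.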

CSS uses the concept of 'typographic character unit', rather than grapheme cluster, in its specs, with the explanation that these cases are beyond the scope of the grapheme cluster concept and that implementations should provide appropriate support.

Priority

The impact of incorrectly segmenting text containing conjuncts is significant: it affects editing operations, line-breaking algorithms, vertical text layout, and similar operations. This issue therefore has the priority Basic.

Tests

Action taken

Discussions took place in the Unicode Script Ad Hoc committee, and an initial proposal was made by Norbert Lindenberg that would form the basis for gradual deployment of changes for a number of scripts.

Unicode 15.1 introduced an initial set of changes to Unicode® Standard Annex #29, Unicode Text Segmentation, that recognise consonants after a virama as a continuation of the grapheme cluster for certain scripts. The scripts affected by this change are those that have a character with the property value Indic_Conjunct_Break (InCB) = Linker. Those scripts are currently Bengali, Devanagari, Gujarati, Oriya, Telugu, and Malayalam. (The problem remains for several other scripts, and more will be addressed for Unicode 17.)
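The effect of the new conjunct rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the UAX #29 algorithm itself: the function names are hypothetical, the basic Devanagari consonants KA to HA stand in for InCB = Consonant, the Devanagari virama stands in for InCB = Linker, and General_Category values approximate Extend and SpacingMark.

```python
import unicodedata

ZWJ = "\u200D"
LINKERS = {"\u094D"}  # DEVANAGARI SIGN VIRAMA, assumed stand-in for InCB = Linker


def is_consonant(ch):
    # Assumption: only KA..HA, a subset of the real InCB = Consonant set.
    return "\u0915" <= ch <= "\u0939"


def is_extend(ch):
    # Assumption: nonspacing and enclosing marks plus ZWJ approximate Extend.
    return ch == ZWJ or unicodedata.category(ch) in ("Mn", "Me")


def attach(clusters, ch):
    """Append ch to the last cluster, or start one if there is none yet."""
    if clusters:
        clusters[-1] += ch
    else:
        clusters.append(ch)


def segment(text, conjunct_rule=True):
    clusters = []
    after_consonant = False  # a consonant was seen, followed only by Extend/Linker
    linked = False           # ...and at least one of those followers was a Linker
    for ch in text:
        if is_consonant(ch):
            if conjunct_rule and linked:
                attach(clusters, ch)  # no break: Consonant ... Linker x Consonant
            else:
                clusters.append(ch)
            after_consonant, linked = True, False
        elif ch in LINKERS:           # checked before is_extend: the virama is also Mn
            attach(clusters, ch)
            linked = after_consonant
        elif is_extend(ch):
            attach(clusters, ch)      # leaves the conjunct state unchanged
        elif unicodedata.category(ch) == "Mc":
            attach(clusters, ch)      # spacing vowel sign ends any pending conjunct
            after_consonant = linked = False
        else:
            clusters.append(ch)
            after_consonant = linked = False
    return clusters


word = "\u0915\u094D\u0937\u093F"               # KA + VIRAMA + SSA + VOWEL SIGN I
print(len(segment(word)))                       # 1: the conjunct stays together
print(len(segment(word, conjunct_rule=False)))  # 2: split after the virama
```

Passing `conjunct_rule=False` reproduces the older split, which makes it easy to compare the two behaviours on the same string.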

As long as applications support the latest rules for grapheme clusters, those scripts should keep conjuncts together.

Outcomes

The latest versions of the Gecko, Blink, and WebKit engines support the new rules for grapheme clusters for Bengali, Devanagari, and Gujarati.

Metadata

    Labels

    doc:beng
    doc:deva
    doc:gujr
    gap
    i:segmentation (Grapheme/word segmentation & selection)
    l:bn (Bengali language & script)
    l:gu (Gujarati language & script)
    l:hi (Hindi, Devanagari script)
    p:ok
    s:beng (Bengali script)
    s:deva (Devanagari script)
    s:gujr (Gujarati script)

    Projects

    Status

    Fixed
