Grapheme clusters fail to represent syllabic conjuncts in Bengali, Devanagari, and Gujarati — FIXED !

This issue applies to most languages whose conjunct forms involve a virama.

Many scripts descended from Brahmi indicate clusters of consonant sounds by merging or stacking the glyphs of the consonants involved in one way or another. These scripts are abugidas, and each consonant character represents a consonant sound and an inherent vowel sound. The merging of glyphs indicates that the inherent vowel sound is dropped between the consonants. In Unicode text, this merging is usually accomplished using a special character between the consonants, which is typically called a virama or 'vowel-killer'.
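As a small illustration of the encoding described above, the following sketch lists the code points behind one Devanagari conjunct. The sample conjunct (KA + VIRAMA + SSA) and the helper name `describe` are chosen here for illustration only; they are not taken from the issue.

```python
import unicodedata

# The Devanagari conjunct "ksha" is stored as three code points:
# a consonant, the virama, and a second consonant.
conjunct = "\u0915\u094D\u0937"

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(conjunct):
    print(code_point, name)
```

The middle entry is the virama; a font normally renders the three-character sequence as a single merged shape.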

When operations such as line breaking, cursor movement, vertical text rendering, deletion, and hyphenation are applied to the text, these conjuncts must not be split apart. (Line-break opportunities in these scripts usually occur at inter-word spaces, but when a very long word doesn't fit on a line, or the CSS `word-break` property is set to `break-all`, or the CSS `line-break` property is set to `anywhere`, conjuncts should still be kept together.)

A [grapheme](https://r12a.github.io/scripts/glossary/index.html#grapheme) is a user-perceived unit of text. Text operations that use graphemes as a unit of text include line-breaking, forwards deletion, cursor movement & selection, character counts, text spacing, text insertion, justification, case conversions, and sorting. The Unicode Standard uses generalised rules to define '<a href="https://www.w3.org/TR/i18n-glossary/#dfn-grapheme-cluster">grapheme clusters</a>', which approximate the likely grapheme boundaries in a writing system.

More:
- [Bengali graphemes](https://r12a.github.io/scripts/beng/bn.html#graphemes)
- [Devanagari graphemes](https://r12a.github.io/scripts/deva/hi.html#graphemes)




### The GAP

Up to and including Unicode 15.0, the Unicode definition of grapheme cluster fails to represent syllabic conjuncts (plus vowels, etc.) in scripts such as Bengali, Devanagari, and Gujarati. This means that editing operations, line-breaking algorithms, vertical text layout, and similar operations are likely to break text at the wrong point.

The reason conjuncts are not kept together is that segmentation rules in Unicode start a new grapheme cluster after the virama.
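The effect can be sketched with a heavily simplified segmenter. This is not the UAX #29 algorithm: it assumes that any combining mark (general category Mn, Mc, or Me) or ZWJ/ZWNJ attaches to the preceding character, and that every other character starts a new cluster. The function name `legacy_clusters` and the sample word are illustrative choices.

```python
import unicodedata

def legacy_clusters(text):
    """Simplified pre-15.1 behaviour: marks and joiners attach to the
    preceding character; any other character starts a new cluster."""
    clusters = []
    for ch in text:
        attaches = (
            unicodedata.category(ch) in ("Mn", "Mc", "Me")
            or ch in "\u200C\u200D"  # ZWNJ, ZWJ
        )
        if clusters and attaches:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Sample word: KA VIRAMA SSA TA VIRAMA RA VOWEL-SIGN-I YA
word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"
parts = legacy_clusters(word)
print(len(parts), parts)
```

Under these assumptions the word yields five clusters. The virama stays with the consonant before it, and the consonant after it begins a new cluster, so both conjuncts in the word are split.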

CSS uses the concept of <a href="https://drafts.csswg.org/css-text-3/#typographic-character-unit">'typographic character unit'</a>, rather than grapheme cluster, in its specs, with the explanation that these cases are beyond the scope of the grapheme cluster concept and that implementations should provide appropriate support.

More:
- [Typographic character units in complex scripts](https://www.w3.org/International/questions/qa-indic-graphemes)





### Priority
The impact of incorrectly segmenting text containing conjuncts is significant, affecting the correct handling of editing operations, line breaking algorithms, vertical text, etc. This is an issue with the priority of Basic.





### Tests

- <a href="https://w3c.github.io/i18n-tests/results/first-letter-pseudo#deva_tailoring">Selectors 3, first-letter &gt; Conjuncts & orthographic syllables</a>
- [line-break:anywhere should not break conjuncts](https://github.com/w3c/line_paragraph_tests/issues/1)
- [Extending highlighted text should not split conjuncts](https://github.com/w3c/glyph_character_tests/issues/39)




### Action taken
Discussions took place in the Unicode Script Ad Hoc committee, and an initial proposal was made by Norbert Lindenberg that would form the basis for gradual deployment of changes for a number of scripts.

Unicode 15.1 introduced an initial set of changes to <a href="https://www.unicode.org/reports/tr29/">Unicode® Standard Annex #29, Unicode Text Segmentation</a> that recognised consonants after a virama as a continuation of the grapheme cluster for certain scripts. The scripts affected by this change are those with <a href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AIndic_Conjunct_Break%3DLinker%3A%5D&g=&i=">Indic_Conjunct_Break (InCB)=Linker</a>. Those scripts are currently Bengali, Devanagari, Gujarati, Oriya, Telugu, and Malayalam. (The problem remains for several other scripts, and more will be addressed for Unicode 17.)
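The idea behind the new rule (GB9c in UAX #29) can be sketched as follows. This is a rough approximation under stated assumptions, not the real property data: only the Devanagari virama is treated as a linker, only KA..HA are treated as consonants, and a non-zero canonical combining class or ZWJ stands in for InCB=Extend. All function names are hypothetical.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA, the only linker in this sketch
ZWJ = "\u200D"

def is_consonant(ch):
    # Assumption: Devanagari KA..HA only, standing in for InCB=Consonant.
    return "\u0915" <= ch <= "\u0939"

def attaches(ch):
    # Simplification: marks and joiners attach to the preceding character.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me") or ch in "\u200C\u200D"

def follows_linker(cluster):
    """True if the cluster ends with consonant + (extend | linker)*
    containing at least one linker."""
    seen_linker = False
    for ch in reversed(cluster):
        if ch == VIRAMA:
            seen_linker = True
        elif unicodedata.combining(ch) != 0 or ch == ZWJ:
            continue  # approximates InCB=Extend (e.g. nukta)
        else:
            return seen_linker and is_consonant(ch)
    return False

def conjunct_clusters(text):
    """Like a mark-attaching segmenter, but a consonant that follows a
    consonant + virama sequence continues the current cluster."""
    clusters = []
    for ch in text:
        if clusters and attaches(ch):
            clusters[-1] += ch
        elif clusters and is_consonant(ch) and follows_linker(clusters[-1]):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# Sample word: KA VIRAMA SSA TA VIRAMA RA VOWEL-SIGN-I YA
word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"
syllables = conjunct_clusters(word)
print(len(syllables), syllables)
```

With these assumptions the sample word yields three clusters, each conjunct staying together with its vowel sign. A production implementation should rely on a library that tracks the current Unicode property data rather than hard-coded ranges like these.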

As long as applications support the latest rules for grapheme clusters, those scripts should keep conjuncts together.


### Outcomes
The latest versions of the Gecko, Blink, and WebKit engines support the new rules for grapheme clusters for Bengali, Devanagari, and Gujarati.
