Western calligraphy has a long tradition of embellishing the first letter, first word, or first line of a paragraph to enhance the visual appeal of a written document. Extraordinary examples of this tradition may be found in the Book of Kells, the splendid illuminated Gospel manuscript of the early Middle Ages from Kells, County Meath, Ireland. Although it is not known precisely when or where the Book of Kells was written, documentary evidence strongly suggests that it was present in the monastery at Kells in the year 1007 CE (1). Below I have reproduced an example of each type of embellishment from folios in the Book of Kells: first letter (Fig. 1), first word (Fig. 2), and entire first line (Fig. 3).
Fig. 1. Example embellishment of initial majuscule “P” in the Book of Kells. From the last folio of the Breves causae (Gospel summary) of John covering XIV, 26 - XXI, 17.
Fig. 2. Example embellishment of the initial word “Tunc” (Thereupon, at that time ...) in the Book of Kells. From the Gospel of Luke XXIV (folio 288v).
Fig. 3. Example embellishment of an entire initial line from the Book of Kells. The line reads, "FUIT IN DIEBUS HERO" (There was in the days of Herod ...) from Luke I, 1-5 (folio 188v).
After the invention of the printing press, printers and typographers borrowed the textual embellishment motifs of the medieval calligraphers and created numerous variations of their own on these themes, guided by what worked well within the constraints of the technologies available to them over time, up to the present day.
The World Wide Web Consortium's(2) Cascading Style Sheet standards for first-letter, first-word, and first-line pseudo-elements can be viewed as a part of the digital evolution of these visual motifs that go back centuries.
While Western-style typography has a tradition that goes back centuries, in countries such as Thailand the arts of printing and typography have flourished only during the most recent century (3). Typography in Thailand has clearly been influenced by Western typographical traditions. For example, it is now quite common to see decorative fonts with Western-style serifs in magazines and advertising media. Although serifs are undoubtedly a completely foreign borrowing, the effect is often quite pleasing.
In higher-quality typography in Thailand one now also finds enlarged initial drop letters, just as one does in high-quality Western typography (we cannot call them "drop caps" because there is no case distinction in Thai). Note, however, that the Thai script shares with other Indic and Indic-derived scripts the characteristic of having vowels which surround consonants or consonant clusters. The vowels may precede, follow, hover above, or hover below consonants:
| Some Thai vowels ... | |
| ... which precede or follow: | ... which hover above or below: |
| ไ◌ เ◌ ◌ะ | ◌ิ ◌ี ◌ุ ◌ู |
In addition to vowels, Thai and related scripts like Lao have tone marks that also hover above letters or vowels:
| Thai Tone Marks |
| ◌่ ◌้ ◌๊ ◌๋ |
The vowels which precede or follow consonants or consonant clusters present no problem when creating the embellishment of an initial letter in Thai. If the word starts with one of the vowels that precedes a consonant, one just enlarges that vowel, as shown in Fig. 4. While true for Thai, this simplicity may not hold for Indic or other Indic-derived scripts.
Fig. 4. In this example from a printed book (4), the vowel “เ” THAI CHARACTER SARA E precedes the consonant “ป” THAI CHARACTER PO PLA in the word “เป็น” and is enlarged to create a “drop cap” just as one would do with an initial consonant.
In any case, vowels and tone marks that hover above and vowels that hover below need to be treated in a fashion analogous to the Unicode combining diacritical marks and must be combined with the logically preceding base character when rendering the glyphs, as shown in Fig. 5:
Fig. 5. Here's an example from the same book about butterflies (ผีเสื้อ) where the vowel “◌ี” THAI CHARACTER SARA II hovers over the consonant “ผ” THAI CHARACTER PHO PHUNG to form an embellished initial letter.
Proper rendering of the CSS first-letter pseudo-element on a web browser or other device (such as a printer) needs to take the positioning of vowels and tone marks in Thai and other Indic-derived scripts into account in exactly the same manner as combining diacritical marks need to be taken into account when rendering the first-letter pseudo-element in Latin, Greek, or Cyrillic CSS-styled HTML texts in NFD normalization.
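The rule described above can be sketched in a few lines of Python. This is only an illustration, not how any browser actually implements the pseudo-element: the function name `first_letter_cluster` is hypothetical, and the sketch assumes that "combining" can be approximated by the Unicode general categories Mn, Mc, and Me as reported by the standard `unicodedata` module.

```python
import unicodedata


def first_letter_cluster(text: str) -> str:
    """Return the first character of text plus the combining marks that follow it.

    Assumption: a character counts as combining when its Unicode general
    category starts with "M" (Mn, Mc, or Me). A production implementation
    would more likely rely on full grapheme-cluster segmentation.
    """
    if not text:
        return ""
    end = 1
    # Extend the span while the next character is a combining mark.
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return text[:end]
```

Under these assumptions, the word ผีเสื้อ yields the consonant ผ together with its hovering vowel, the word เป็น yields the preceding vowel เ alone (it is a spacing letter, not a combining mark), and a Latin "e" followed by a combining acute accent in NFD stays together as one styled unit.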
Many other scripts in the world besides the Indic and Indic-derived scripts have characters, vowels, or other special marks which combine with a base character in one way or another. These characters need to be considered when implementing CSS first-letter styling support in modern, Unicode-capable browsers.
Thai and related scripts like Lao do not use spaces between words. Implementing CSS first-word support for such scripts therefore requires that a browser know where to break words in these languages. Of course this problem is not limited to CSS styling: the browser needs to know where to break words at line ends as well.
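To make the word-breaking requirement concrete, here is a minimal sketch of one common approach, dictionary-based longest matching. This technique is my own choice for illustration and is not prescribed by any CSS specification; the function name `first_word` and the tiny `SAMPLE_LEXICON` are hypothetical, and a real implementation would need a full word list and better handling of unknown words.

```python
from typing import AbstractSet

# Hypothetical toy lexicon: "ghost", "butterfly", "to be", "insect".
SAMPLE_LEXICON = {"ผี", "ผีเสื้อ", "เป็น", "แมลง"}


def first_word(text: str, lexicon: AbstractSet[str]) -> str:
    """Return the longest lexicon entry that text starts with.

    Falls back to the first whitespace-delimited token when no entry
    matches, which covers scripts that separate words with spaces.
    """
    text = text.lstrip()
    longest = max((len(word) for word in lexicon), default=0)
    # Try the longest candidate prefix first, then progressively shorter ones.
    for end in range(min(len(text), longest), 0, -1):
        if text[:end] in lexicon:
            return text[:end]
    tokens = text.split(maxsplit=1)
    return tokens[0] if tokens else ""
```

Preferring the longest match matters here: given the unspaced text ผีเสื้อเป็นแมลง, the sketch selects ผีเสื้อ ("butterfly") rather than the shorter entry ผี ("ghost"), which is also a prefix of the text.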
While CSS styling of the first-line pseudo-element should not present a problem, styling the first-letter and first-word pseudo-elements is non-trivial for a subset of modern language scripts. Developers interested in implementing these features in browsers or other software should review the section on Combining Characters in The Unicode Standard (section 2.6 in version 3.0: a more recent version is now available) (5).
The following chart is my attempt to organize by script/Unicode block the set of combining characters that will affect the rendering of the CSS first-letter element by a browser or other client. The majority of these combining characters hover over or below the logically previous base character. The set includes diacritical marks, tone marks, vowel marks, and other symbols.
The chart also contains my suggestions for how first-letter and first-word styling should work, or not work as the case may be, for each script listed. Note that my perspective is that one should implement the important cases first (i.e., those likely to appear frequently in texts) and forget about the numerous "edge cases", such as the double diacritics or the combining half marks, on which one could waste a lot of time.
| Unicode Value | Glyph | Description | Treatment |
|---|---|---|---|
| Combining Diacritical Marks u0300-u036f | |||
| u0300 - u036f | All of the characters in this block combine with the logically preceding base character and therefore need to be rendered as part of the first-letter element if the base character is the first letter. INCLUDE FOR TREATMENT. However see the note below regarding the double diacritics, u0360 - u0362. | ||
| u0360 - u0362 | ͠ ͡ ͢ | DOUBLE DIACRITICS | IGNORE these: they are rare edge cases. |
| Greek and Coptic | |||
| u037a | ͺ | GREEK YPOGEGRAMMENI | This is a SPACING character: IGNORE it. |
| u0384 | ΄ | GREEK TONOS | This is a SPACING character: IGNORE it. |
| u0385 | ΅ | GREEK DIALYTIKA TONOS | This is a SPACING character: IGNORE it. |
| Cyrillic | |||
| u0483 - u0486 | ҃ ҄ ҅ ҆ | These are historic miscellaneous combining characters. | There is no harm in including for treatment. INCLUDE FOR TREATMENT. |
| u0488 & u0489 | ҈ ҉ | COMBINING CYRILLIC HUNDRED THOUSANDS and MILLIONS signs | IGNORE them: weird edge cases. |
| Armenian | |||
| u0559 | ՙ | ARMENIAN MODIFIER LETTER LEFT HALF RING | I suspect that this does not combine: needs verification. Probably IGNORE. |
| u055a | ՚ | ARMENIAN MODIFIER LETTER RIGHT HALF RING APOSTROPHE | I suspect that this does not combine: needs verification. Probably IGNORE. |
| Hebrew | |||
| u0591 - u05bd, u05bf, u05c1, u05c2, u05c4 | Hebrew Cantillation Marks and Vowels | One or more Hebrew cantillation marks and vowels combine with a base consonant and therefore should be treated just as Unicode combining diacritical marks are to be treated. INCLUDE FOR TREATMENT. Note that “־” u05be MAQAF, “׀” u05c0 PASEQ, and “׃” u05c3 SOF PASUQ have been excluded from this range as they are non-combining. | |
| Arabic u0600 - u06ff | |||
| u0600 - u06ff | Arabic | Because this script is cursive and the letters join, it seems to me that it would be difficult to achieve smooth and aesthetically pleasing transitions between the styled first letter and succeeding letters within the limitations of what CSS provides for embellishment. In any first-round implementation, I would therefore recommend explicitly EXCLUDING the Arabic blocks from first-letter styling support. The current-generation browsers which I have tested ignore the fact that they are dealing with Arabic, a right-to-left (RTL) script, with rather dismal results (such as trying to style the initial letter on the left while rendering the succeeding letters RTL starting from the right). Therefore browsers will need to actively EXCLUDE Arabic rather than simply ignore it as is currently done. | |
| Syriac u0700 - u074f | |||
| u0700 - u074f | Syriac | This is another cursive script where the letters join as in Arabic. For the same reasons as I gave for Arabic, I would recommend explicitly EXCLUDING Syriac text from first-letter styling support in any first-round implementation. | |
| Thaana | |||
| u07a6 - u07b0 | Thaana Vowels | Thaana vowels appear above or below consonants in this right-to-left, non-cursive script used to write Dhivehi in the Republic of Maldives. INCLUDE FOR TREATMENT. | |
| Devanagari | |||
| u0900 - u097f | Devanagari | While not a cursive script in the same sense as Arabic, Devanagari letters do join together along a horizontal top line. In modern printed Hindi typography, an embellished first letter is sometimes used, for example in children's books. However, in my view, it hardly seems possible to achieve aesthetically pleasing transitions between an initial embellished letter and the following unembellished letters within the limitations of current font and browser rendering technologies. In addition, as is true for other Indic scripts, the rules governing the ligation of letters and the reordering of the typed text stream to achieve the correct visual appearance are quite involved. Therefore, I would recommend explicitly EXCLUDING the Devanagari block from first-letter treatment in a first-pass implementation. | |
| Bengali | |||
| Gurmukhi | |||
| Gujarati | |||
| Oriya | |||
| Tamil | |||
| Telugu | |||
| Kannada | |||
| Malayalam | |||
| Sinhala | |||
| Thai | |||
| u0e31 | ◌ั | THAI CHARACTER MAI HAN-AKAT | This is a tricky one. In proper typography, this vowel hovers in the middle above the previous base consonant and the succeeding consonant. Probably have to IGNORE in a first-pass implementation. |
| u0e33 | ◌ำ | THAI CHARACTER SARA AM | This vowel hovers partially over the previous character with the remaining part to the right. As with u0e31, probably have to IGNORE in a first-pass implementation. |
| u0e34 - u0e39 | Subset of Thai vowels that hover directly over or under a base character | Treat exactly like the Unicode combining diacritical marks. INCLUDE FOR TREATMENT. | |
| u0e3a | ◌ฺ | THAI CHARACTER PHINTHU (PALI VIRAMA) | AFAIK, this is only used in religious documents. Hovers under: no harm in including. INCLUDE FOR TREATMENT. |
| u0e47 - u0e4b | ◌็ ◌่ ◌้ ◌๊ ◌๋ | Thai MAITAIKHU and four tone marks | All these hover directly over the base character. The four tone marks hover above any hovering vowels. A lot of TrueType fonts and rendering engines don't do this correctly. OpenType tables or similar technology would probably help. In any case, it is clear that these need to be included. INCLUDE FOR TREATMENT. |
| u0e4c - u0e4e | THAI KARAN (THANTHAKHAT), THAI NIKHAHIT, THAI YAMAKKAN | These three hover directly over the base character, but AFAIK never occur over the very first letter in a word. Of these, only KARAN is common: it is used to silence consonants at the ends of words derived from Sanskrit/Pali or foreign loan words, so it should never occur above the first letter. Nevertheless, there is no harm in including, and there may be rare edge cases where they occur. INCLUDE FOR TREATMENT. | |
| Lao | |||
| u0eb1 | ◌ັ | LAO VOWEL SIGN MAI KAN | Hovers over the base consonant. INCLUDE FOR TREATMENT. |
| u0eb3 | ◌ຳ | LAO VOWEL SIGN AM | Like THAI CHARACTER SARA AM, which has a similar appearance, this vowel seems to hover partially over the base character. If that is true, it is not clear from a typographic perspective what one would want to do with this case. Therefore IGNORE it in a first-pass implementation. |
| u0eb4 - u0ebb | Lao vowels which hover over or under a base character | Treat exactly like the Unicode combining diacritical marks. INCLUDE FOR TREATMENT. | |
| u0ebc | ◌ຼ | LAO SEMIVOWEL SIGN LO | Hovers below. Treat like a Unicode combining diacritical mark. INCLUDE FOR TREATMENT. |
| u0ec8 - u0ecb | ◌່ ◌້ ◌໊ ◌໋ | Lao tone marks | As in Thai, these tone marks hover above any hovering vowel marks which may be present over the consonant letters and a lot of fonts and rendering engines don't do it correctly. In any case, INCLUDE FOR TREATMENT. |
| u0ecc and u0ecd | ◌໌ ◌ໍ | LAO CANCELLATION MARK, LAO NIGGAHITA | As in Thai, I doubt these ever occur over the first letter of a word, but there is no harm in including them. INCLUDE FOR TREATMENT. |
| Tibetan | |||
| Myanmar | |||
| Khmer | |||
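
The recommendations in the chart can be encoded as simple code-point ranges. The following sketch is one possible reading of the chart, not a definitive implementation: the names `INCLUDE_RANGES`, `EXCLUDE_BLOCKS`, and `styled_first_letter` are hypothetical, and the scripts whose rows are still empty in the chart (Bengali through Sinhala, Tibetan, Myanmar, Khmer) are simply not covered.

```python
# Inclusive code-point ranges marked INCLUDE FOR TREATMENT in the chart.
INCLUDE_RANGES = [
    (0x0300, 0x035F), (0x0363, 0x036F),  # combining diacritics, minus double diacritics
    (0x0483, 0x0486),                    # historic Cyrillic combining marks
    (0x0591, 0x05BD), (0x05BF, 0x05BF),  # Hebrew cantillation marks and vowels
    (0x05C1, 0x05C2), (0x05C4, 0x05C4),
    (0x07A6, 0x07B0),                    # Thaana vowels
    (0x0E34, 0x0E3A), (0x0E47, 0x0E4E),  # Thai hovering vowels, PHINTHU, tone marks
    (0x0EB1, 0x0EB1), (0x0EB4, 0x0EBC),  # Lao hovering vowels and semivowel sign LO
    (0x0EC8, 0x0ECD),                    # Lao tone marks, cancellation mark, NIGGAHITA
]

# Blocks the chart recommends EXCLUDING from first-letter styling.
EXCLUDE_BLOCKS = [
    (0x0600, 0x06FF),  # Arabic
    (0x0700, 0x074F),  # Syriac
    (0x0900, 0x097F),  # Devanagari
]


def _in_ranges(ch, ranges):
    """Return True when the code point of ch falls inside any inclusive range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def styled_first_letter(text):
    """Return the span to style, or None when styling should be skipped."""
    if not text or _in_ranges(text[0], EXCLUDE_BLOCKS):
        return None
    end = 1
    # Attach every following mark that the chart says to include.
    while end < len(text) and _in_ranges(text[end], INCLUDE_RANGES):
        end += 1
    return text[:end]
```

With this reading, a Thai consonant followed by u0e35 is styled together with its vowel, a consonant followed by u0e31 is styled alone because the chart marks that vowel as IGNORE, and any text beginning with an Arabic, Syriac, or Devanagari character returns `None` so that the caller can skip first-letter styling altogether.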
1. Henry, Françoise. The Book of Kells. Alfred A. Knopf, New York 1977. See p. 150.
2. The World Wide Web Consortium, http://www.w3c.org.
4. สุวรรณภักดิ์, เกรียงไกร. ผีเสื้อ. สำนักพิมพ์สารคดี Sarakadee Press, Bangkok 1997.
5. The Unicode Consortium. The Unicode Standard Version 3.0. Addison-Wesley, 2000. http://www.unicode.org.