A Quick Primer On Unicode and Software Internationalization Under Linux and UNIX

The Unicode (R) Consortium is a registered trademark, and Unicode (TM) is a trademark of Unicode, Inc. Linux is a registered trademark of Linus Torvalds. UNIX is a registered trademark of The Open Group. Solaris is a trademark of Sun Microsystems. Windows is a registered trademark of Microsoft Corporation.

by Ed Trager. Last updated: 2013.03.08.ET

List of translations of this document
srpskohrvatski / hrvatskosrpski (Serbo-Croatian) translation by Anja Skrba from Webhostinggeeks.com.

General Introduction

This page provides a quick summary, with links to other resources, of how to use Unicode for multilingual internationalization projects on Linux and other UNIX-based operating systems. If you would like to be able to use more than one language on your Linux/UNIX computer but haven't completely figured out how to do that yet, then you should read this page. I have tested the software and setup configurations mentioned in this document primarily on Linux (SuSE 7.2, 7.3, 8.1, 8.2, 9.0beta) and, to a lesser extent, on OpenBSD (3.2, 3.3) and Solaris 8.

The goals of this document are 1) to introduce you to some indispensable Open Source software for using Unicode in a Linux or other UNIX environment, and 2) to highlight key aspects of setting up that software. Other Unicode web resources cover some of the topics below in much greater depth than here. Instead of being comprehensive, I have tried to focus on a few key pieces of software and key configuration issues that will allow you to quickly become productive on your multilingual or internationalization projects on Linux or other UNIX-based operating systems today. Pointers to more comprehensive treatments of various topics are provided throughout the document.

Note: This document assumes that you are comfortable working from a command shell and have knowledge of some basic Linux/UNIX system administration tasks (such as how to compile and install software from source using the common ./configure --> make --> su -c "make install" command sequence).

Introduction to Unicode

Computers assign numbers (code points) to represent letters. There are hundreds of national and ISO standards in existence for computer encoding of modern language scripts. Many of these legacy encodings are limited to 256 (i.e., 2^8) code points. This results in numerous problems. A major problem is that 256 code points are often not enough even for a single language, much less multiple languages. A second rather obvious problem is that code points set aside to represent letters in one national or ISO encoding will unavoidably be re-used to represent completely different letters in some other national or ISO encoding. For example, LATIN SMALL LETTER U WITH GRAVE "ù" in the Western European ISO-8859-1 encoding becomes LATIN SMALL LETTER U WITH RING ABOVE "ů" in the Central and Eastern European ISO-8859-2 encoding, GREEK SMALL LETTER OMEGA "ω" in ISO-8859-7, HEBREW LETTER SHIN "ש" in ISO-8859-8 ... and so on! (For all the gory details, read this.) This can easily result in garbled emails, web pages, or databases, among other things.
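The clash described above is easy to reproduce. The following is a minimal Python sketch (Python is my choice for illustration here, and the helper name decode_under is hypothetical): it takes the single byte value 0xF9 and decodes it under the four ISO-8859 encodings named in the example.

```python
# The same byte value, 0xF9, read under four different legacy encodings.
RAW = bytes([0xF9])

ENCODINGS = ["iso-8859-1", "iso-8859-2", "iso-8859-7", "iso-8859-8"]


def decode_under(encodings, raw=RAW):
    """Return {encoding name: decoded text} for one raw byte string."""
    return {name: raw.decode(name) for name in encodings}


for name, char in decode_under(ENCODINGS).items():
    # Each encoding maps the identical byte to a different letter.
    print(f"{name}: U+{ord(char):04X} {char}")
```

Nothing in the byte itself says which interpretation was intended, which is exactly why data in legacy encodings is so easily garbled when it moves between systems.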

For an illustration of the same problem in a slightly different domain, consider a familiar language like English. The language can be written with just 26 letters, but publishers of English-language scientific and mathematical documents require many additional symbols, and 256 code points are simply not enough! Imagine how much more problematic electronic information exchange can be for languages like Chinese where multiple, incompatible encodings exist.

Unicode solves the problems of multiple encodings by assigning unique code points to the letters and ideographs of all of the world's modern language scripts and commonly used symbols. The Unicode Consortium's page is at www.unicode.org.
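As a small illustration of "unique code points", the Python sketch below looks up the Unicode code point and official character name for the four letters used in the ISO-8859 example earlier. The function name describe is hypothetical; unicodedata is part of the Python standard library.

```python
import unicodedata


def describe(char):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


# Four letters that all share byte 0xF9 in different legacy encodings
# each get their own, distinct code point in Unicode.
for ch in ["\u00f9", "\u016f", "\u03c9", "\u05e9"]:
    print(describe(ch))
```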

UTF-8

UTF-8 is a serialization method for Unicode that is the de facto standard for encoding Unicode on UNIX-based operating systems, notably Linux. UTF-8 is also the preferred encoding for multilingual web pages. In this method, ASCII code points occupy one byte. That is, the ASCII subset of Unicode serialized in UTF-8 is identical to ASCII. Unicode code points in the Basic Multilingual Plane above the ASCII range are serialized to two or three bytes. Code points in the additional planes beyond the Basic Multilingual Plane are serialized to four bytes (the original UTF-8 design allowed sequences of up to six bytes, but the current definition in RFC 3629 stops at four).
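To make the byte counts concrete, here is a short Python sketch (the helper utf8_length and the sample characters are my own choices for illustration). It encodes one character from the ASCII range, two from the Basic Multilingual Plane, and one from a supplementary plane.

```python
def utf8_length(char):
    """Number of bytes needed to serialize one character in UTF-8."""
    return len(char.encode("utf-8"))


# ASCII letter, Latin letter with accent, CJK ideograph, supplementary-plane symbol
SAMPLES = ["A", "\u00f9", "\u6c34", "\U0001F600"]

for ch in SAMPLES:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```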

When a character is serialized to multiple bytes, the most significant bit of every byte in the sequence is set, and thus these bytes never fall in the ASCII range. Also, the first byte of a multibyte sequence representing a non-ASCII character always reserves some bits that indicate how many bytes are used for the serialization of this character (Fig. 1).

UTF-8 Serialization Table
Fig. 1. UTF-8. When Unicode characters are serialized to multiple bytes in UTF-8, the high bits of the first serialized byte indicate how many bytes are used for the serialization of that character. The bits represented by "n"s hold the Unicode character code value.

This results in a stateless encoding in which missing bytes will be evident. UTF-8 provides a simple and elegant solution for internationalizing UNIX-based, byte-oriented operating systems and software. For all of the details, read Markus Kuhn's excellent FAQ, UTF-8 and Unicode FAQ for Unix/Linux. All of the software mentioned below supports UTF-8 well.
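The lead-byte scheme can be sketched in a few lines of Python. This is an illustrative sketch only, not a full validator: the function names sequence_length and is_complete are hypothetical, and the bit patterns checked are the one- to four-byte forms of current UTF-8.

```python
def sequence_length(lead_byte):
    """Number of bytes in the UTF-8 sequence that starts with lead_byte.

    Only the high-bit pattern is examined; this sketch does not reject
    every value that a strict UTF-8 validator would reject.
    """
    if lead_byte & 0b1000_0000 == 0b0000_0000:   # 0nnnnnnn
        return 1
    if lead_byte & 0b1110_0000 == 0b1100_0000:   # 110nnnnn
        return 2
    if lead_byte & 0b1111_0000 == 0b1110_0000:   # 1110nnnn
        return 3
    if lead_byte & 0b1111_1000 == 0b1111_0000:   # 11110nnn
        return 4
    # 10nnnnnn is a continuation byte and can never start a sequence.
    raise ValueError(f"not a lead byte: 0x{lead_byte:02X}")


def is_complete(data):
    """True if data decodes cleanly as UTF-8, False if bytes are missing or invalid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


encoded = "\u6c34".encode("utf-8")        # a three-byte CJK character
print(sequence_length(encoded[0]))        # length announced by the lead byte
print(is_complete(encoded))               # whole sequence present
print(is_complete(encoded[:2]))           # last byte dropped
```

Because the lead byte announces the length and continuation bytes have their own distinct pattern, a decoder can detect a truncated sequence without keeping any state from earlier text.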

Advice: UTF-8 is simple to use, store, and view in documents, databases, and source code. Use the UTF-8 encoding for all of your multilingual, international, or non-English data and documents. Avoid using legacy national character encodings (e.g., ISO-8859-1, ISO-8859-2, ISO-8859-15, TIS-620, Shift_JIS, GB18030, KOI8, etc.). There are also good reasons to avoid using other Unicode encodings, such as UTF-16. Information on how to convert legacy data to UTF-8 is provided below (see Utilities).
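As a rough idea of what such a conversion involves, the Python sketch below re-encodes bytes from a known legacy encoding as UTF-8. The function name to_utf8 is hypothetical, and the sketch assumes you already know the source encoding, since it generally cannot be guessed reliably from the bytes alone.

```python
def to_utf8(raw, source_encoding):
    """Re-encode bytes from a known legacy encoding as UTF-8."""
    # Step 1: interpret the bytes as text using the legacy encoding.
    text = raw.decode(source_encoding)
    # Step 2: serialize that text again, this time as UTF-8.
    return text.encode("utf-8")


legacy = b"\xf9"                       # "u with ring above" in ISO-8859-2
converted = to_utf8(legacy, "iso-8859-2")
print(converted)                       # two bytes in UTF-8
print(converted.decode("utf-8"))
```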

Setting Your Locale to UTF-8

In order to take full advantage of Unicode on your Linux or other UNIX system, you will need to set your locale to a UTF-8 locale. Some recent distributions of Linux now use a UTF-8 locale by default. However, unless you are using a very recent Linux distribution, you are still very likely using a legacy locale based on ISO-8859 or another national encoding. If you are using some UNIX-based OS other than Linux, it is even less likely that you are already using a UTF-8 locale. To determine your current locale settings, type locale. Here are some results from Linux and Solaris:

"locale" example from Linux:
user_a@some_linux_box:~> locale
LANG=en_US
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE=POSIX
LC_MONETARY="en_US"
LC_MESSAGES="en_US"
LC_PAPER="en_US"
LC_NAME="en_US"
LC_ADDRESS="en_US"
LC_TELEPHONE="en_US"
LC_MEASUREMENT="en_US"
LC_IDENTIFICATION="en_US"
LC_ALL=
"locale" example from Solaris:
user_b@some_sun_box:~> locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

All UTF-8-based locale settings end in "UTF-8", so it is evident that neither user_a nor user_b in the examples above is using a UTF-8 locale. To determine what other locale settings are available to you, type locale -a:

"locale -a" example from Linux:
user_a@some_linux_box:~> locale -a
C
POSIX
af_ZA
ar_AE
ar_BH
ar_DZ
ar_EG
ar_EG.utf8
ar_IN

. . .

uz_UZ
vi_VN.utf8
yi_US
zh_CN
zh_CN.gb18030
zh_CN.gbk
zh_CN.utf8
zh_HK
zh_TW
zh_TW.euctw
zh_TW.utf8
"locale -a" example from Solaris:
user_b@some_sun_box:~> locale -a
POSIX
C
iso_8859_1

It's evident that the Linux distribution (SuSE 7.3 was used for the example) has many UTF-8 locales installed by default (not all are shown), while the Solaris box has none. Solaris does provide UTF-8 locales, but they must be installed as optional packages: see the Solaris Internationalization Guide.
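Scanning a long locale -a listing by eye is tedious. The following Python sketch filters a list of locale names down to the UTF-8 ones; the function names codeset and utf8_locales are hypothetical, and the sample list is taken from the listings above plus one name spelled with "UTF-8" for comparison.

```python
import re


def codeset(locale_name):
    """Return the normalized codeset of a name like 'zh_CN.utf8', or '' if none."""
    match = re.match(r"^[^.@]+\.([^@]+)", locale_name)
    if not match:
        return ""
    # Lower-case and drop punctuation so 'UTF-8' and 'utf8' compare equal.
    return re.sub(r"[^a-z0-9]", "", match.group(1).lower())


def utf8_locales(names):
    """Keep only the locale names whose codeset is UTF-8."""
    return [name for name in names if codeset(name) == "utf8"]


sample = ["C", "POSIX", "ar_EG", "ar_EG.utf8", "zh_CN.gb18030", "zh_TW.UTF-8"]
print(utf8_locales(sample))
```

To run it against a real system, the list could be obtained by splitting the output of the locale -a command into lines.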

To change your locale setting in Linux, just set the LANG environment variable in your .profile file. Note that the output from locale -a on the Linux box shown above shows "utf8" in lower case without a hyphen. This appears to be an artifact of how the C library normalizes codeset names in its listing rather than the canonical spelling. When you set the LANG variable, be sure to type UTF-8 in UPPER CASE and with a hyphen:

Setting the LANG variable in a .profile file for the BASH shell under Linux:
...
export LANG=en_US.UTF-8

When you log back in using the new LANG setting, you should now see that many of the other "LC_" locale environment variables have been updated automatically:

After setting LANG to a UTF-8 locale in Linux:
user_a@some_linux_box:~> locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=POSIX
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Under a UTF-8 locale, you can now take full advantage of Unicode on your machine. Note, however, that some Unicode software can be used quite effectively even if you cannot change, or are not yet ready to change, to a UTF-8 locale. For example, Yudit, described below, will work just fine even on systems such as OpenBSD which currently do not support different locales.
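The locale variables interact: under POSIX rules, LC_ALL overrides the individual LC_ variables, which in turn override LANG. The Python sketch below applies that precedence to decide whether character handling (LC_CTYPE) is in UTF-8. The function names effective_ctype and uses_utf8 are hypothetical, and the environment is passed in as a plain dictionary so that the logic can be tried without changing your real settings.

```python
def effective_ctype(env):
    """Return the locale name that governs LC_CTYPE.

    Precedence: LC_ALL, then LC_CTYPE, then LANG. Empty or missing
    values are skipped; with nothing set, the "C" locale applies.
    """
    for variable in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(variable, "")
        if value:
            return value
    return "C"


def uses_utf8(env):
    """True if the effective LC_CTYPE locale names a UTF-8 codeset."""
    name = effective_ctype(env)
    # Take the part between '.' and an optional '@modifier'.
    charset = name.partition(".")[2].partition("@")[0]
    return charset.lower().replace("-", "") == "utf8"


print(uses_utf8({"LANG": "en_US"}))                             # legacy locale
print(uses_utf8({"LANG": "en_US.UTF-8"}))                       # UTF-8 locale
print(uses_utf8({"LANG": "en_US", "LC_CTYPE": "en_US.UTF-8"}))  # LC_CTYPE wins over LANG
```

Passing os.environ instead of a hand-written dictionary checks the current session.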

Terminal Emulators

Even in these days of KDE and Gnome, no Linux or UNIX aficionado would want to live without a good terminal emulator. Several Unicode-enabled terminal emulators are described below.

Mlterm

Mlterm is arguably the best terminal emulator for multilingual work and is certainly my favorite (Fig. 2). When compiled with fribidi and libind, mlterm supports complex Indic scripts like Devanagari, Indic-derived scripts like Thai, and right-to-left scripts like Arabic and Hebrew. Mlterm also sports a GTK+-based GUI configurator which is activated using the rather unusual CTRL-<RIGHT MOUSE CLICK> combination (Fig. 2).
Fig. 2. Mlterm. A GUI configurator makes it easy to set up mlterm. An HTML file encoded in UTF-8 is being viewed in vim under mlterm.

By default, Mlterm uses a bitmap font, usually GNU Unifont which is pre-installed on most Linux distributions and other free Unices. GNU Unifont is the bitmap font shown in Fig. 2 above.

If you want, you can also have Mlterm use an anti-aliased TrueType font. In this case, a monospaced font like Bitstream Vera Sans Mono or Everson Mono Unicode is best. You will need (probably as root) to modify Mlterm's $PREFIX/etc/mlterm/aafont configuration file to indicate which fonts you want to use to display normal and double-width CJK characters. ($PREFIX depends on where Mlterm is installed. If you installed it yourself, it is probably /usr/local/. If Mlterm came preinstalled, the configuration directory is probably just under /etc.) As an example, here is what my aafont file looks like:

ISO10646_UCS4_1=Everson Mono Unicode-iso10646-1;
ISO10646_UCS4_1_BIWIDTH=Bitstream Cyberbit-iso10646-1;

This specifies that Everson's sans-serif Everson Mono Unicode font is used for normal-width characters while the serif Bitstream Cyberbit font is used for double-width CJK characters. To enable anti-aliased fonts, I need to start Mlterm with the -A flag, like this:

mlterm -A &

Here is what the result looks like on a Mandrake Linux box:

If you want to use a variable-width font, then after modifying aafont appropriately, start Mlterm like this instead:
mlterm -A -V &

Note: You might need to become root in order to compile libind. The problem appears to be in the supplied Makefile. You can either fix the Makefile, or you can remain lazy and just become root.

Xterm

A second alternative is to use xterm (Fig. 3) which is supplied with XFree86. Xterm does not support right-to-left languages like Arabic or Hebrew. I don't think it supports most Indic scripts. It does support Thai though, as shown in Fig. 3 below.

Fig. 3. Xterm in UTF-8 mode. Xterm supports UTF-8, including Thai, but not right-to-left languages like Arabic.

For both mlterm and xterm, you will need to set your locale to a UTF-8 locale. When I use an account or a machine where the locale is not a UTF-8 locale, I use the following "mini scripts" for starting mlterm and xterm for multilingual work:

"uterm" script for starting mlterm with UTF-8 support when the locale has not yet been set to UTF-8:
#!/bin/sh
LC_CTYPE=en_US.UTF-8 mlterm --sbmod=right &

For xterm you must specify the font on the command line, which is very inconvenient unless you use a script or alias to start it.

"uxterm" script for starting xterm with UTF-8 support when the locale has not yet been set to UTF-8:
#!/bin/sh
LC_CTYPE=en_US.UTF-8 xterm -u8 -fn \
'-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1' &

Note: OpenBSD 3.2 does not appear to have locale support, so these scripts produce a "locale settings failed" message. Despite that, simple utilities like UNIX cat do work correctly and produce readable displays of UTF-8 files on OpenBSD. However, other software such as vim fails to work correctly in the absence of locale support from the operating system, despite the capabilities of the terminal emulator.

Although I like the features of KDE's Konsole (KDE 3.x), I have noticed annoying bugs when rendering Thai and Arabic scripts under a UTF-8 locale, so I refrain from recommending it for multilingual work at this time.

Unicode Editors

A number of good Unicode editors are now available for Linux/UNIX, but here I am only going to describe three:

Yudit

Yudit (Fig. 4) is an indispensable Unicode text editor for the X Window System. Yudit can be used under any locale setting. It can even be used on OpenBSD which lacks locales. The program is extremely easy to use and comes with a large number of keyboard maps -- and even handwriting recognition for Kanji and Hanzi. Handwriting recognition is an excellent idea, but in practice it only works well for fairly simple characters with few strokes, like "人" or "水", as drawing more complex characters with a mouse is too tedious. For serious Chinese, Japanese, or Korean (CJK) typing tasks, an input method engine such as SCIM (discussed below) is required.

Besides the editor itself, the program distribution includes two absolutely indispensable utilities, uniprint and uniconv.

Information on using uniprint and uniconv is provided below under Utilities.

Program menus are available in many languages. The program has few external dependencies (other than X Windows itself) and there is no need for any pre-installed multi-lingual locale environment. For example, Yudit works perfectly on OpenBSD 3.2 which lacks locale support, where vim fails.

Fig. 4. Yudit comes with pre-installed keymaps for numerous languages. The program and its accompanying utilities, uniprint and uniconv, are must-have tools in your Unicode toolkit.

Vim

Many professional developers are already addicted to using vi as their editor of choice, so it is nice to know that the popular implementation vim fully supports UTF-8 (Fig. 5).
Fig. 5. Vim running in mlterm with color syntax highlighting for C/C++. In the code shown here, the static C-style strings directly contain UTF-8 encoded locale data.

There are two keys to making console-mode vim useful for multilingual work. First, you must run vim in a UTF-8-capable terminal emulator like mlterm. Second, you are going to need keyboard maps for inputting the languages of your choice. Unlike Yudit, vim does not appear to be distributed with a large set of standard keymaps.

To determine what keymaps are available, enter the following vim command:

:echo globpath(&rtp, "keymap/*.vim")

This tells you the location of the globally available keymaps, as well as the path where you'll want to place any keymaps that you create if you want to make those maps available to all users.

Setting up and using a keyboard map for vim is not difficult. An excerpt from a Thai keyboard map is shown below. The conventions for naming a keyboard map file are:

<language>_<encoding>.vim

So, in this case, the file will be called:

thai_utf-8.vim

Here's an excerpt from the file:

Example vim keyboard map: An excerpt from thai_utf-8.vim is shown below.
" Vim Keymap file for UTF-8 Thai
" Maintainer: Edward H. Trager <ehtrager@umich.edu>
" Last Updated: 2003-04-08.ET
"
" This mapping adheres to the Thai standard TIS820-2538 keyboard
" layout.

let b:keymap_name = "thai"

loadkeymap

~ ๛
! ๅ
@ ๑
# ๒
$ ๓
% ๔
^ <char-0x0E39> " THAI CHARACTER SARA UU
& <char-0x0E4E> " THAI CHARACTER YAMAKKAN
. .
. .
. .

Comment lines in a keyboard map begin with a quotation mark ("). The line let b:keymap_name = "thai" provides a short name for the map so that we can issue a command in vim to use this map like this:

:set keymap=thai

Everything on the lines following the word "loadkeymap" represents the keyboard mapping. One or multiple keys can be specified in the first column as the keys to type. One or more bytes can be specified as the result in the second column.

For example, as shown in the excerpt above, the first six key mappings from the beginning of the top row of a QWERTY keyboard are mapped directly to the Thai characters which have been serialized as UTF-8. Each of these characters actually requires three bytes, but they appear as the Thai characters in your web browser. The fastest way to create a keymap like this is to use Yudit, which is exactly what I did.

The next two entries show an alternate approach: here the Unicode code points are entered directly in hexadecimal, which can be typed in simple ASCII using any editor you like. The Unicode code points for any script can be obtained online in portable document format (PDF) from www.unicode.org/charts/.
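If you prefer to generate entries in the hexadecimal notation rather than type them by hand, a few lines of Python will do it. This is only a sketch under my own assumptions: the helper names keymap_entry and build_keymap are hypothetical, and the output simply follows the layout of the thai_utf-8.vim excerpt shown earlier.

```python
import unicodedata


def keymap_entry(key, char):
    """Format one keymap line: key, <char-0x...> code point, and a name comment."""
    return f'{key} <char-0x{ord(char):04X}> " {unicodedata.name(char)}'


def build_keymap(name, pairs):
    """Assemble a minimal keymap file body from (key, character) pairs."""
    lines = [f'let b:keymap_name = "{name}"', "", "loadkeymap", ""]
    lines.extend(keymap_entry(key, char) for key, char in pairs)
    return "\n".join(lines) + "\n"


# The two hexadecimal entries from the Thai excerpt, regenerated.
print(build_keymap("thai", [("^", "\u0e39"), ("&", "\u0e4e")]))
```

The generated text can be saved under the file name convention described above, for example thai_utf-8.vim.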

After you have created a keyboard map and placed it in the appropriate location (for example, /usr/share/vim/current/keymap), from within vim simply type:

:set keymap=thai

...to enable this alternate keymap. When you are in Insert mode, you can toggle between the standard and alternate key maps using CTRL-^.

Finally, just to give you an idea of what else you can do, here is a short excerpt from a custom keymap file which uses pinyin romanization to specify some Chinese characters. This example simply demonstrates how a series of multiple keystrokes can be mapped to Unicode characters:

Another example vim keyboard map: An excerpt from a special map that uses pinyin spellings for entering some Chinese characters:
" Custom pinyin keymap
" Maintainer: Edward H. Trager <ehtrager@umich.edu>
" Last Updated: 2003-04-08.ET
"
let b:keymap_name = "special"

loadkeymap

ri 日
shui 水
ni 你
ren 人
xin 心
zhu 竹

. .
. .
. .

Note that you can specify a keymap to use in your .vimrc file, as the following example shows:

Example ~/.vimrc file:
This .vimrc file specifies an alternate keymap which can be toggled using CTRL-^. The other lines set up vim for C/C++ color syntax highlighting and automatic indentation.
set nocp incsearch
set cinwords=if,else,while,do,for,switch,case
set cindent
set nowrap
set keymap=thai
syntax on

For a complete treatment of using Unicode and keymaps, from within vim type:

:help mbyte.txt
:help mbyte-keymap

Mined

Mined is a console-mode Unicode editor with an intuitive user interface, pull-down menus, and extensive Unicode support, including double-width and combining characters, Arabic ligature joining, keyboard mapping, syntax highlighting, and many other features. Mined can be used on UNIX and DOS/Windows platforms.

Fig. 6. Mined is another Unicode editor.

I have not personally used mined, but it appears to have a nice feature set.

For a more extensive review of Unicode editors, see Alan Wood's summary, Unicode and Multilingual Editors and Word Processors for Unix and Linux.

Input Methods for Chinese, Japanese, Korean, and Other Languages

Keyboard maps are insufficient for typing Chinese, Japanese, and Korean (commonly referred to as "CJK"), as well as other languages such as Tibetan. These languages require sophisticated software input methods (IMs) that operate through XIM. Mike Fabian of SuSE Linux provides an excellent set of pages describing how to set up a CJK computing environment on your Linux box in which he includes descriptions and setup details for a number of IM engines. One of the best among the set of Open Source IM engines is Smart Common Input Method (SCIM), which I describe below.

SCIM (智能通用输入法平台)

James Su's Smart Common Input Method (SCIM) is a Unicode-based IM platform written in C++. For users, SCIM is an excellent choice because it is simple to set up and use in a UTF-8 or legacy locale. For software developers, it is also nice because it abstracts input method interfaces into a set of simple, independent classes so developers can write their own input methods easily in a few lines of code.
Fig. 7. SCIM is an excellent IM application with support for a number of CJK input methods, including 自然码 zìránmǎ, which is shown being used to enter Chinese for a Google search in Mozilla.

SCIM currently provides input tables for at least the following methods:

Of the numerous Chinese input methods available, intelligent pinyin and ziranma are the easiest to use. The keyboard layout and a description of how to use the 自然碼 zìránmǎ, or 自然双拼 zìrán shuāngpīn, method can be found here. Note that the intelligent pinyin method is closed-source software, but you can install a binary RPM version for use with SCIM free of charge. If you are compiling from source, I think you will find the supplied 自然碼 zìránmǎ method quite satisfactory.

SCIM requires atk-1.0+, glib 2.0+, pango-1.0+, and gtk+2.0+. These libraries will be present in newer Linux distributions, or you can download them from the GTK+ site here.

After compiling SCIM, you will need to add the following lines to your .xinitrc file in order to have SCIM start whenever you start X windows:

Example lines to add to ~/.xinitrc file for starting SCIM:
The first line starts scim as a daemon. The second line tells X to use SCIM as the input method server.
scim -d
export XMODIFIERS=@im=SCIM

If you are using an older version of SCIM (prior to version 0.8.0) and are not already running in a Chinese, Japanese, or Korean locale, then you will need to set the LC_CTYPE environment variable to refer to a Chinese, Japanese, or Korean locale in your ~/.profile file. Note that you can do this even if your LANG environment variable is set to another (UTF-8) locale, such as English, as shown in the example below. Versions of SCIM from v. 0.8.0 onward will work fine with LANG set to any UTF-8 locale.

Example lines to add to ~/.profile file for starting SCIM:
Versions of SCIM prior to v. 0.8.0 expect to operate in a CJK environment. However, you can have SCIM operate under another primary language environment, such as UTF-8 English, by setting the LANG and LC_CTYPE environment variables in the manner shown here. This example assumes you are using the BASH shell. With versions of SCIM from v. 0.8.0 onward, any UTF-8 locale will do and you won't need to set LC_CTYPE separately.
export LANG=en_US.UTF-8
export LC_CTYPE=zh_TW.UTF-8

Email Agents

Mutt

Mutt is an excellent email agent with good UTF-8 Unicode support. Mutt can also be extensively customized to meet your individual needs. For example, a very simple customization is to have emails from certain people, domains, or mailing lists highlighted in special colors in the message index. An example of this is shown in the image on the left below (Fig. 8). Another feature I like about Mutt is that you can use any editor you want to compose your emails. I have Mutt set to use Yudit for composing email in UTF-8. A more common choice is to use Vim as the editor. Once you have used Mutt a few times and started to play around with its various configuration options, you'll never go back to using another email agent again!

Fig. 8. Mutt is an excellent email agent. Once you have used Mutt a few times and customized it to your liking, you'll never want to use any other email agent! Left side: Message index displayed in Mutt with color customizations running in a KDE konsole terminal. Right side: Mutt displaying a UTF-8 encoded message in mlterm.

Utilities

This section lists conversion and printing utilities.

Conversion Utilities

For converting a file from one encoding to another, three utilities are worth mentioning here: