Issues in iconv

iconv is one of the most commonly used tools for character encoding conversion. However, its handling of Japanese character sets is incomplete. This paper describes the problems and how to deal with them.

1. Introduction

It is becoming more and more common for OSS and FSS software to use UCS (Unicode) as the internal string representation in order to support multibyte character sets. The most significant reason for using UCS is that software developers are no longer forced to consider the various encoding formats from all over the world: internally, the software handles strings only in UCS, and it converts to other encodings only when exchanging input and output with the outside.

A prime example of a function implemented for string conversion is iconv(), which is provided by both libiconv and glibc, two packages widely used in the OSS/FSS world. They are maintained by different projects and should be considered two independent iconv() implementations.

Let me emphasise that neither iconv() is complete with respect to Japanese encoding conversions. As a result, it is becoming common practice for OSS/FSS software communities in Japan to deal with the conversion issues independently, since no comprehensive solution is available.

The most significant source of the incompleteness is that the conversion mappings provided by Microsoft differ from the ones provided by glibc/libiconv. This poses problems for applications that work closely with Windows, such as Samba and WebDAV. However, changing the glibc/libiconv implementations could break existing software that relies on the current iconv() conversions. The best compromise is to provide two different converters for every encoding. In addition, the glibc/libiconv conversions refer to obsolete tables for some data, which need to be corrected.
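The UCS-as-pivot design described in this introduction can be illustrated with a minimal sketch. It is written in Python and exercises Python's built-in codecs, not libiconv or glibc; the function name `transcode` is hypothetical and chosen only for this example.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from encoding `src` to encoding `dst` through Unicode."""
    text = data.decode(src)   # input boundary: external encoding -> Unicode
    # All internal processing would operate on `text` (Unicode) here.
    return text.encode(dst)   # output boundary: Unicode -> external encoding


if __name__ == "__main__":
    euc = "日本語".encode("euc_jp")
    print(transcode(euc, "euc_jp", "shift_jis").hex())
```

Only the two boundary calls know about external encodings; everything in between sees a single representation, which is the property that makes the choice of mapping tables so important.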
The goal of this document is to clarify the problems in the existing Japanese encoding conversions of the glibc/libiconv iconv(), and to propose solutions to them.

2. Problems in libiconv/glibc

2-1. Problems in cp932 encoding

libiconv/glibc contain a converter called CP932, which supports conversion between Windows Code Page 932 and UCS (Unicode), but it differs from the Microsoft conversion implementation in the following respects.
Because of reasons (1) and (2), data converted to Unicode by Windows cannot be converted back to the original data by libiconv, and vice versa. Reason (3) may pose problems when performing string comparison in software that does not support the multiply defined Shift-JIS (CP932) characters. "Multiple definition of characters" means that two mapping locations exist for a single character, so string matching for the same character may fail because it is stored at different code points.

2-2. Problems in JIS encodings

The functionality provided by the libiconv/glibc converters for the Shift_JIS/EUC-JP/ISO-2022-JP encodings is limited for the following reasons.
2-3. cp932, euc-jp and iso-2022-jp

By employing the UCS code points in Table1, libiconv/glibc can perform bi-directional code conversions among cp932, euc-jp and iso-2022-jp. However, as mentioned in the previous sections, this conversion is not compatible with Microsoft's UCS conversions. If cp932 is modified to behave like the Microsoft conversion, this breaks the bi-directional conversions to euc-jp and iso-2022-jp. When performing conversions via an intermediary UCS, the intermediary UCS code point must have the same value in every encoding involved. Hence, if cp932 is modified to act like the Microsoft conversion, euc-jp and iso-2022-jp also require Microsoft-compatible converters alongside the existing ones, just as Shift-JIS encoded characters have two converters, sjis and cp932.
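The constraint described in this section can be sketched with a toy pivot converter. The tables are tiny hand-written excerpts covering only one kanji and the wave dash, whose JIS-based mapping is U+301C and whose Microsoft-based mapping is U+FF5E. The converter names mirror those used in this section, while `TABLES` and `convert` are hypothetical names; this is an illustration under those assumptions, not how libiconv or glibc are implemented.

```python
# Toy byte-sequence -> UCS tables (excerpts only, for illustration).
# JIS-based converters map the wave dash to U+301C;
# Microsoft-based converters map it to U+FF5E.
TABLES = {
    "sjis":     {b"\x81\x60": 0x301C, b"\x93\xfa": 0x65E5},
    "euc-jp":   {b"\xa1\xc1": 0x301C, b"\xc6\xfc": 0x65E5},
    "cp932":    {b"\x81\x60": 0xFF5E, b"\x93\xfa": 0x65E5},
    "eucjp-ms": {b"\xa1\xc1": 0xFF5E, b"\xc6\xfc": 0x65E5},
}


def convert(char: bytes, src: str, dst: str):
    """Convert one encoded character via its UCS code point.

    Returns None when the target table has no entry for the pivot value.
    """
    ucs = TABLES[src][char]
    reverse = {cp: seq for seq, cp in TABLES[dst].items()}
    return reverse.get(ucs)


if __name__ == "__main__":
    for src, dst in [("sjis", "euc-jp"), ("cp932", "eucjp-ms"), ("cp932", "euc-jp")]:
        print(src, "->", dst, ":", convert(b"\x81\x60", src, dst))
```

Pairs that share a pivot value (sjis with euc-jp, cp932 with eucjp-ms) convert cleanly, while mixing a Microsoft-based source with a JIS-based target leaves the wave dash unmapped. Adding Microsoft-compatible tables next to the JIS-based ones, rather than replacing them, keeps both pairs usable.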
Table6 Differences between JIS UCS mappings and MS UCS mappings
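As a hedged illustration of the kind of difference this table covers, the sketch below compares Python's built-in `shift_jis` codec (a JIS-based mapping) with its `cp932` codec (a Microsoft-based mapping). These codecs are independent of libiconv and glibc, and the candidate byte sequences are an assumption chosen for illustration, not a reproduction of the table; `mapping_differences` is a hypothetical helper.

```python
def mapping_differences(candidates):
    """Return (bytes, jis_code_point, ms_code_point) for every single
    character that the two codecs decode to different code points."""
    diffs = []
    for seq in candidates:
        try:
            jis = seq.decode("shift_jis")
            ms = seq.decode("cp932")
        except UnicodeDecodeError:
            continue  # skip sequences that either codec cannot decode
        if len(jis) == 1 and len(ms) == 1 and jis != ms:
            diffs.append((seq, ord(jis), ord(ms)))
    return diffs


if __name__ == "__main__":
    # Shift_JIS byte sequences for the wave dash, double vertical line
    # and minus sign, plus one kanji as a control.
    samples = [b"\x81\x60", b"\x81\x61", b"\x81\x7c", b"\x93\xfa"]
    for seq, jis, ms in mapping_differences(samples):
        print(f"{seq.hex()}: JIS U+{jis:04X}, MS U+{ms:04X}")
```

The same byte sequence yields two different UCS code points depending on which mapping is used, which is why data decoded with one table cannot always be re-encoded with the other.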
Table7 lists the UCS mapping converter names for each encoding.
(1) and (2) in Table7 correspond to the following code pages under Windows.
OpenGroup/ODVA (Open DeviceNet Vendor Association, Inc.) define the following codeset names.
Technical Report (TR) TR X 0015:1999 XML Japanese Profile defines the following charset names.
3. Descriptions of libiconv/glibc patches

3-1. Proposals for glibc/libiconv refinement
3-2. libiconv patch in details
3-3. Descriptions of glibc patch
3-4. Patches for download

Patches are downloadable from the following sites.

Download:
patch for libiconv 1.9.1 without fixes in JIS converter
patch for libiconv 1.9.1 with fixes in JIS converter
patch for libiconv 1.8
patch for glibc 2.3.2
patch for glibc 2.3.1
patch for glibc 2.2.5

The table below shows the changes included in each patch.
* 'Y' means the change is included in the patch; 'N' means it is not included but still needs to be fixed.

The modification of CP932 and the addition of eucJP-ms have already been merged into the glibc CVS tree and will be available from the next official glibc release. These changes have not yet been included in libiconv.

The fix for the iso-2022-jp table is included only in the patch for libiconv 1.8, as it is only an ad hoc solution. A more comprehensive solution may be to prepare a completely new table for iso-2022-jp, such as iso-2022-jp-ms, but no such standard has been defined yet. It is not desirable to modify iso-2022-jp directly, as it is a widely used standard; doing so could affect existing working environments, and for this reason the change has been omitted from the patches for libiconv 1.9.1.

Fixes to the JIS converters are required for libiconv 1.9.1 if conversion between CP932 and EUC-JP has to be considered. CP932 is intentionally modified to work only with eucJP-ms, so such conversions would otherwise cause problems. A good example of an application that uses such conversion routines is vim6, in which some problems have been observed. However, fixing this issue requires modifying the standardised JIS converters. For the same reason that the iso-2022-jp modification was left out of the patch, this change needs more consideration. For the time being, two patches are provided for libiconv 1.9.1: one with the fixes for the JIS converters and one without them.

4. TODO
5. References
Conversion tables differ between vendors
http://www.debian.or.jp/~kubota/unicode-symbols-map2.html.ja
Issues in Java Character Encodings (Japanese)
http://www.ingrid.org/java/i18n/encoding/
Qt/KDE Japanese Localisation (Japanese)
http://www.asahi-net.or.jp/~hc3j-tkg/
Issues in existing codes and Unicode (Japanese)
http://euc.jp/i18n/ucsnote.ja.html
eucJP-ms as a solution for resolving conversions for Unicode and User-defined/Vendor-defined characters (Japanese)
Recommended mapping table for JIS-Unicode conversion
http://hp.vector.co.jp/authors/VA010341/unicode/
Information for Windows-31J (Japanese)
http://www2d.biglobe.ne.jp/~msyk/charcode/cp932/index.html
libiconv-1.9.1 patch (Japanese)
http://www2d.biglobe.ne.jp/~msyk/software/libiconv-1.9.1-patch.html
libiconv-1.8 patch (Japanese)
http://www2d.biglobe.ne.jp/~msyk/software/libiconv-patch.html
glibc cp932/eucjp-ms patches
http://www2d.biglobe.ne.jp/~msyk/software/glibc/