MIRACLE LINUX CORPORATION


Issues in iconv

iconv is one of the most common tools for character encoding conversion. However, its handling of Japanese character sets is incomplete. This paper describes the problems and how to deal with them.


1. Introduction

It is becoming more and more common for OSS and FSS software to use UCS (Unicode) as the internal string representation in order to support multibyte character sets.

The most significant reason for using UCS is that software developers are no longer forced to consider the various encoding formats used around the world. Internally, the software handles strings only in UCS, and it converts to or from other encodings only at its input and output boundaries.
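The pattern described above can be sketched in a few lines of Python. This is an illustration only: the function name `transcode` is hypothetical, and Python's built-in codecs stand in for iconv() here; they are a separate implementation from libiconv and glibc.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert bytes from encoding `src` to encoding `dst` via Unicode."""
    text = data.decode(src)   # input boundary: external bytes -> internal Unicode
    return text.encode(dst)   # output boundary: internal Unicode -> external bytes


if __name__ == "__main__":
    euc = "日本語".encode("euc_jp")
    # The program itself never needs to know EUC-JP or Shift-JIS details.
    print(transcode(euc, "euc_jp", "shift_jis").hex())
```

Only the two boundary calls know about external encodings; everything between them works on Unicode strings.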

A prime example of a string conversion function is iconv(), which is provided by both libiconv and glibc, two packages widely used in the OSS/FSS world. They are maintained by different projects and should be considered two independent iconv() implementations. Neither implementation is complete with respect to Japanese encoding conversion.

As a result, it is becoming common practice for OSS/FSS communities in Japan to deal with the conversion issues independently, since no comprehensive solution is available.

The most significant source of the incompleteness is that the conversion mappings provided by Microsoft differ from those provided by glibc/libiconv. This poses problems for applications that work closely with Windows, such as Samba and WebDAV. However, changing the glibc/libiconv implementations could break existing software that relies on the current iconv() conversions. The best compromise is to provide two different converters for each encoding. In addition, glibc/libiconv conversions refer to obsolete tables for some data, and these need to be corrected.

The goals of this document are to clarify the problems in the existing glibc/libiconv iconv() Japanese encoding conversions and to propose solutions to them.

2. Problems in libiconv/glibc


2-1. Problems in cp932 encoding

libiconv/glibc contain a converter called CP932, which supports conversion between Windows Code Page 932 and UCS (Unicode), but it differs from the Microsoft conversion implementation in the following ways.

  1. The JIS X 0208 characters in Table1 are mapped to different UCS code points by glibc/libiconv and by Microsoft.

    Table 1
    Name                    Shift-JIS (Row-Cell)  UCS: libiconv/glibc  UCS: MS
    FULLWIDTH TILDE         0x8160 (01-33)        U+301C               U+FF5E
    PARALLEL TO             0x8161 (01-34)        U+2016               U+2225
    FULLWIDTH HYPHEN-MINUS  0x817C (01-61)        U+2212               U+FF0D
    FULLWIDTH CENT SIGN     0x8191 (01-81)        U+00A2               U+FFE0
    FULLWIDTH POUND SIGN    0x8192 (01-82)        U+00A3               U+FFE1
    FULLWIDTH NOT SIGN      0x81CA (02-44)        U+00AC               U+FFE2

  2. The JIS X 0201 characters in Table2 are mapped to different UCS code points by glibc and by Microsoft. (Only applicable to glibc)

    Table2
    Name             Shift-JIS    UCS: glibc  UCS: MS
    REVERSE SOLIDUS  0x5C (5/12)  U+00A5      U+005C
    TILDE            0x7E (7/14)  U+203E      U+007E

  3. Table3 lists characters whose conversion mappings from UCS to Shift-JIS differ between libiconv and Microsoft. (Only applicable to libiconv, as glibc does not support vendor-defined characters)

    Table3
    Name                           UCS (Unicode)  Shift-JIS: libiconv  Shift-JIS: MS
    ROMAN NUMERAL ONE              U+2160         0xFA4A               0x8754
    ROMAN NUMERAL TWO              U+2161         0xFA4B               0x8755
    ROMAN NUMERAL THREE            U+2162         0xFA4C               0x8756
    ROMAN NUMERAL FOUR             U+2163         0xFA4D               0x8757
    ROMAN NUMERAL FIVE             U+2164         0xFA4E               0x8758
    ROMAN NUMERAL SIX              U+2165         0xFA4F               0x8759
    ROMAN NUMERAL SEVEN            U+2166         0xFA50               0x875A
    ROMAN NUMERAL EIGHT            U+2167         0xFA51               0x875B
    ROMAN NUMERAL NINE             U+2168         0xFA52               0x875C
    ROMAN NUMERAL TEN              U+2169         0xFA53               0x875D
    NUMERO SIGN                    U+2116         0xFA59               0x8782
    TELEPHONE SIGN                 U+2121         0xFA5A               0x8784
    PARENTHESIZED IDEOGRAPH STOCK  U+3231         0xFA58               0x878A

Because of (1) and (2), data converted to Unicode by Windows cannot be converted back to the original data by glibc/libiconv, and vice versa.

Issue (3) may cause problems when software that does not account for multiply defined Shift-JIS (CP932) characters performs string comparison. "Multiple definition" means that two code points exist for a single character, so matching the same character can fail when the two strings use different code points.
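One way to make such comparisons robust is to fold the duplicate code points onto a single one before comparing. The following is a minimal sketch that covers only the thirteen characters of Table 3 and folds towards the code points in the MS column; the names `IBM_TO_NEC`, `normalize` and `same_text` are hypothetical, and a real implementation would need the complete list of duplicates.

```python
# Duplicate code points from Table 3: libiconv column -> MS column.
IBM_TO_NEC = {0xFA4A + i: 0x8754 + i for i in range(10)}  # ROMAN NUMERAL ONE..TEN
IBM_TO_NEC.update({0xFA59: 0x8782, 0xFA5A: 0x8784, 0xFA58: 0x878A})


def is_lead_byte(b: int) -> bool:
    """True for the first byte of a 2-byte Shift-JIS (CP932) character."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def normalize(data: bytes) -> bytes:
    """Replace each duplicate code point in `data` with its preferred twin."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if is_lead_byte(b) and i + 1 < len(data):
            code = (b << 8) | data[i + 1]
            code = IBM_TO_NEC.get(code, code)
            out += code.to_bytes(2, "big")
            i += 2
        else:
            out.append(b)  # single-byte character (or truncated input)
            i += 1
    return bytes(out)


def same_text(a: bytes, b: bytes) -> bool:
    """Compare two CP932 byte strings after folding duplicates."""
    return normalize(a) == normalize(b)
```

With this, `same_text(b"\xfa\x4a", b"\x87\x54")` holds even though the raw byte strings differ.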


2-2. Problems in JIS encodings

The functionality of the libiconv/glibc converters for the Shift_JIS/EUC-JP/ISO-2022-JP encodings is limited for the following reasons.

  1. Conversion between JIS X 0201 Latin characters and UCS differs from conversion between US-ASCII and UCS as follows.

    Table4
    Name             Code (JIS X 0201 / US-ASCII)  UCS: JISX0201  UCS: US-ASCII
    REVERSE SOLIDUS  0x5C (5/12)                   U+00A5         U+005C
    TILDE            0x7E (7/14)                   U+203E         U+007E

    Because of these differences, problems arise when converting between Shift-JIS, which incorporates JIS X 0201 Latin characters, and EUC-JP/ISO-2022-JP, which use US-ASCII characters. Specifically, \(0x5C) and ~(0x7E) in EUC-JP/ISO-2022-JP cannot be converted correctly to Shift-JIS.

  2. The mapping of JIS X 0208 Row 1 Cell 29 to UCS differs.

    Table5
    Name     JIS X 0208 (Row-Cell)  UCS: libiconv/glibc  UCS: JIS standards
    EM DASH  01-29                  U+2015               U+2014
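The single-byte problem in item 1 can be reproduced with a toy model built only from the two rows of Table 4. This is a sketch under stated assumptions: the function names are hypothetical, every other 7-bit byte is assumed to map to the UCS code point of the same value, and the converter is assumed to be strict (no fallback when a code point is missing).

```python
from typing import Optional

# Table 4: UCS code points assigned to the two disputed single-byte codes.
JISX0201_TO_UCS = {0x5C: 0x00A5, 0x7E: 0x203E}  # JIS X 0201 Latin reading
US_ASCII_TO_UCS = {0x5C: 0x005C, 0x7E: 0x007E}  # US-ASCII reading


def byte_to_ucs(byte: int, overrides: dict) -> int:
    """Map a 7-bit byte to UCS; bytes not in `overrides` map to themselves."""
    return overrides.get(byte, byte)


def ucs_to_byte(ucs: int, overrides: dict) -> Optional[int]:
    """Strict inverse of byte_to_ucs; None when no byte maps to `ucs`."""
    inverse = {u: b for b, u in overrides.items()}
    if ucs in inverse:
        return inverse[ucs]
    if ucs < 0x80 and ucs not in overrides:
        return ucs
    return None


def euc_to_sjis_byte(byte: int) -> Optional[int]:
    """EUC-JP single byte (US-ASCII) -> UCS -> Shift-JIS single byte (JIS X 0201)."""
    return ucs_to_byte(byte_to_ucs(byte, US_ASCII_TO_UCS), JISX0201_TO_UCS)
```

In this model `euc_to_sjis_byte(0x41)` returns 0x41, while 0x5C and 0x7E return None because U+005C and U+007E have no counterpart on the JIS X 0201 side.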


2-3. cp932, euc-jp and iso-2022-jp

By using the libiconv/glibc UCS code points in Table1, libiconv/glibc can perform bi-directional conversions among cp932, euc-jp and iso-2022-jp. However, as mentioned in the previous sections, this conversion is not compatible with Microsoft's UCS conversions.

If cp932 is modified to behave like the Microsoft conversion, the bi-directional conversions with euc-jp and iso-2022-jp break.

When converting via an intermediary UCS, the intermediary UCS code point must be the same across encodings. Hence, if cp932 is modified to match the Microsoft conversions, euc-jp and iso-2022-jp also require Microsoft-compatible converters alongside the existing ones, just as Shift-JIS has the two converters sjis and cp932.

Table6 Differences between JIS UCS mappings and MS UCS mappings
Name                    Row-Cell  UCS: JIS Standards  UCS: MS
EM DASH                 01-29     U+2014              U+2015
FULLWIDTH TILDE         01-33     U+301C              U+FF5E
PARALLEL TO             01-34     U+2016              U+2225
FULLWIDTH HYPHEN-MINUS  01-61     U+2212              U+FF0D
FULLWIDTH CENT SIGN     01-81     U+00A2              U+FFE0
FULLWIDTH POUND SIGN    01-82     U+00A3              U+FFE1
FULLWIDTH NOT SIGN      01-44     U+00AC              U+FFE2
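The need for a shared pivot can be illustrated with the rows of Table 6. In the sketch below, the `FAMILY` registry and the `convert` function are hypothetical; the model knows only the seven characters of the table and assigns each codeset name to one of the two mapping families.

```python
from typing import Optional

# Table 6: JIS X 0208 row-cell -> UCS under each mapping family.
JIS_UCS = {"01-29": 0x2014, "01-33": 0x301C, "01-34": 0x2016, "01-61": 0x2212,
           "01-81": 0x00A2, "01-82": 0x00A3, "02-44": 0x00AC}
MS_UCS = {"01-29": 0x2015, "01-33": 0xFF5E, "01-34": 0x2225, "01-61": 0xFF0D,
          "01-81": 0xFFE0, "01-82": 0xFFE1, "02-44": 0xFFE2}

# Hypothetical registry of which family each converter uses.
FAMILY = {"sjis": JIS_UCS, "euc-jp": JIS_UCS, "iso-2022-jp": JIS_UCS,
          "cp932": MS_UCS, "eucJP-ms": MS_UCS}


def convert(row_cell: str, src: str, dst: str) -> Optional[str]:
    """Convert one character from `src` to `dst` via its UCS pivot.

    Returns the row-cell in `dst`, or None when `dst` has no entry for
    the pivot code point produced by `src`.
    """
    ucs = FAMILY[src][row_cell]
    inverse = {u: rc for rc, u in FAMILY[dst].items()}
    return inverse.get(ucs)
```

Conversions inside one family succeed, for example `convert("01-33", "cp932", "eucJP-ms")`, while crossing families, for example `convert("01-33", "cp932", "euc-jp")`, yields None for every row of the table.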

Table7 lists the converter names for each encoding under each UCS mapping.

Table7 iconv() converters (codeset name)
Encoding               JIS UCS mapping  MS UCS mapping
Shift-JIS encoding     sjis             cp932
Japanese EUC encoding  euc-jp           (1)
ISO-2022-JP encoding   iso-2022-jp      (2)

(1) and (2) in Table7 correspond to the following code pages under Windows.

  1. Code Page 51932 (No User-defined characters)
  2. Code Page 50220 (No User-defined characters)
* Windows does not provide conversion via the JIS UCS mapping.

The Open Group Japan Vendor Council (TOG/JVC) defines the following codeset names.

  1. eucJP-ms (Able to use User-defined characters)
  2. N/A
* eucJP-ms and Windows Code Page 51932 are not compatible.

The Technical Report (TR) TR X 0015:1999, XML Japanese Profile, defines the following charset names.

  1. x-eucjp-open-19970715-ms (Able to use User-defined characters in eucJP-ms)
  2. x-iso2022jp-cp932 (No User-defined characters)

3. Descriptions of libiconv/glibc patches


3-1. Proposals for glibc/libiconv refinement

  1. Employ Java sjis/euc-jp/iso-2022-jp conversions.

    References

    * Applied to libiconv 1.9.1 patch.
    * Not included in glibc patch. Not merged into the mainstream.

  2. Because the tables provided by the Unicode Consortium are not complete, the cp932 conversion needs to be modified so that it works not as an alias for sjis but as an independent converter almost identical to Microsoft Code Page 932.


  3. Add new codeset names to enable conversions from cp932 to Japanese EUC and ISO-2022-JP via UCS.

    • Japanese EUC

      Add eucJP-ms, defined by The Open Group Japan Vendor Council (TOG/JVC).

      * Already merged into the mainstream of glibc.
      * Included in libiconv 1.8/1.9.1 patches.

    • ISO-2022-JP

      No appropriate codeset name or conversion table rules are available at the moment.


3-2. libiconv patch in detail

  1. Modified the cp932 conversion to match the conversion between Code Page 932 and UCS under Microsoft Windows.

    libiconv 1.8 before patch

    Name (JIS)              cp932   ->  Unicode  ->  cp932
    FULLWIDTH TILDE         0x8160      U+301C       0x8160
    PARALLEL TO             0x8161      U+2016       0x8161
    FULLWIDTH HYPHEN-MINUS  0x817C      U+2212       0x817C
    FULLWIDTH CENT SIGN     0x8191      U+00A2       0x8191
    FULLWIDTH POUND SIGN    0x8192      U+00A3       0x8192
    FULLWIDTH NOT SIGN      0x81CA      U+00AC       0x81CA

    Name (JIS)                     cp932   ->  Unicode  ->  cp932
    ROMAN NUMERAL ONE              0x8754      U+2160       0xFA4A
    ROMAN NUMERAL TWO              0x8755      U+2161       0xFA4B
    ROMAN NUMERAL THREE            0x8756      U+2162       0xFA4C
    ROMAN NUMERAL FOUR             0x8757      U+2163       0xFA4D
    ROMAN NUMERAL FIVE             0x8758      U+2164       0xFA4E
    ROMAN NUMERAL SIX              0x8759      U+2165       0xFA4F
    ROMAN NUMERAL SEVEN            0x875A      U+2166       0xFA50
    ROMAN NUMERAL EIGHT            0x875B      U+2167       0xFA51
    ROMAN NUMERAL NINE             0x875C      U+2168       0xFA52
    ROMAN NUMERAL TEN              0x875D      U+2169       0xFA53
    NUMERO SIGN                    0x8782      U+2116       0xFA59
    TELEPHONE SIGN                 0x8784      U+2121       0xFA5A
    PARENTHESIZED IDEOGRAPH STOCK  0x878A      U+3231       0xFA58

    Name (JIS)                     cp932   ->  Unicode  ->  cp932
    ROMAN NUMERAL ONE              0xFA4A      U+2160       0xFA4A
    ROMAN NUMERAL TWO              0xFA4B      U+2161       0xFA4B
    ROMAN NUMERAL THREE            0xFA4C      U+2162       0xFA4C
    ROMAN NUMERAL FOUR             0xFA4D      U+2163       0xFA4D
    ROMAN NUMERAL FIVE             0xFA4E      U+2164       0xFA4E
    ROMAN NUMERAL SIX              0xFA4F      U+2165       0xFA4F
    ROMAN NUMERAL SEVEN            0xFA50      U+2166       0xFA50
    ROMAN NUMERAL EIGHT            0xFA51      U+2167       0xFA51
    ROMAN NUMERAL NINE             0xFA52      U+2168       0xFA52
    ROMAN NUMERAL TEN              0xFA53      U+2169       0xFA53
    NUMERO SIGN                    0xFA59      U+2116       0xFA59
    TELEPHONE SIGN                 0xFA5A      U+2121       0xFA5A
    PARENTHESIZED IDEOGRAPH STOCK  0xFA58      U+3231       0xFA58

    libiconv 1.8 patch applied

    Name (JIS)              cp932   ->  Unicode  ->  cp932
    FULLWIDTH TILDE         0x8160      U+FF5E       0x8160
    PARALLEL TO             0x8161      U+2225       0x8161
    FULLWIDTH HYPHEN-MINUS  0x817C      U+FF0D       0x817C
    FULLWIDTH CENT SIGN     0x8191      U+FFE0       0x8191
    FULLWIDTH POUND SIGN    0x8192      U+FFE1       0x8192
    FULLWIDTH NOT SIGN      0x81CA      U+FFE2       0x81CA

    Name (JIS)                     cp932   ->  Unicode  ->  cp932
    ROMAN NUMERAL ONE              0x8754      U+2160       0x8754
    ROMAN NUMERAL TWO              0x8755      U+2161       0x8755
    ROMAN NUMERAL THREE            0x8756      U+2162       0x8756
    ROMAN NUMERAL FOUR             0x8757      U+2163       0x8757
    ROMAN NUMERAL FIVE             0x8758      U+2164       0x8758
    ROMAN NUMERAL SIX              0x8759      U+2165       0x8759
    ROMAN NUMERAL SEVEN            0x875A      U+2166       0x875A
    ROMAN NUMERAL EIGHT            0x875B      U+2167       0x875B
    ROMAN NUMERAL NINE             0x875C      U+2168       0x875C
    ROMAN NUMERAL TEN              0x875D      U+2169       0x875D
    NUMERO SIGN                    0x8782      U+2116       0x8782
    TELEPHONE SIGN                 0x8784      U+2121       0x8784
    PARENTHESIZED IDEOGRAPH STOCK  0x878A      U+3231       0x878A

    Name (JIS)                     cp932   ->  Unicode  ->  cp932
    ROMAN NUMERAL ONE              0xFA4A      U+2160       0x8754
    ROMAN NUMERAL TWO              0xFA4B      U+2161       0x8755
    ROMAN NUMERAL THREE            0xFA4C      U+2162       0x8756
    ROMAN NUMERAL FOUR             0xFA4D      U+2163       0x8757
    ROMAN NUMERAL FIVE             0xFA4E      U+2164       0x8758
    ROMAN NUMERAL SIX              0xFA4F      U+2165       0x8759
    ROMAN NUMERAL SEVEN            0xFA50      U+2166       0x875A
    ROMAN NUMERAL EIGHT            0xFA51      U+2167       0x875B
    ROMAN NUMERAL NINE             0xFA52      U+2168       0x875C
    ROMAN NUMERAL TEN              0xFA53      U+2169       0x875D
    NUMERO SIGN                    0xFA59      U+2116       0x8782
    TELEPHONE SIGN                 0xFA5A      U+2121       0x8784
    PARENTHESIZED IDEOGRAPH STOCK  0xFA58      U+3231       0x878A

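    For the characters that were marked with a coloured cell background in the original tables (the highlighting is not preserved here), the conversions described in the following article are employed.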

      PRB: Conversion Problem Between Shift-JIS and Unicode
      http://support.microsoft.com/default.aspx?scid=kb;en-us;Q170559

  2. Addition of eucJP-ms.

  3. To support conversion from cp932/eucJP-ms to iso-2022-jp, mappings for the cp932/eucJP-ms code points are added to the UCS to iso-2022-jp conversion table.

    * Only contained within the patch for libiconv 1.8.
    * Not included in the patch for libiconv 1.9.1.


3-3. Descriptions of glibc patch

  1. Make cp932 independent of sjis so that it follows the same conversion routine as Microsoft's.
  2. Addition of eucJP-ms.

3-4. Patches for download

Patches are downloadable from the following sites.

Download:

patch for libiconv 1.9.1 without fixes in JIS converter
  libiconv-1.9.1-cp932.patch.gz

patch for libiconv 1.9.1 with fixes in JIS converter
  libiconv-1.9.1-cp932-jis.patch.gz

patch for libiconv 1.8
  libiconv-1.8-cp932-patch.diff.gz

patch for glibc 2.3.2
  glibc-2.3.2-cp932-2.diff.gz

patch for glibc 2.3.1
  glibc-2.3.1-cp932-2.diff.gz

patch for glibc 2.2.5
  glibc-2.2.5-cp932-2.diff.gz

The table below shows the changes included in each patch.

                                   glibc    libiconv  libiconv 1.9.1     libiconv 1.9.1
                                   patches  1.8       without JIS fixes  with JIS fixes
CP932 modification                 Y        Y         Y                  Y
addition of eucJP-ms               Y        Y         Y                  Y
modification of iso-2022-jp table  N        Y         N                  N
fixes in JIS converters            N        N         N                  Y

* 'Y' means the change is included in the patch; 'N' means it is not included but still needs to be fixed.

The modification of CP932 and the addition of eucJP-ms have already been merged into the glibc CVS tree and will be available in the next official glibc release. These changes have not yet been included in libiconv.

The fix for the iso-2022-jp table is included only in the patch for libiconv 1.8, as it is only an ad-hoc solution. A more comprehensive solution may be to prepare a completely new table for iso-2022-jp, such as iso-2022-jp-ms, but no such standard has been defined yet. Modifying iso-2022-jp directly is not desirable because it is a widely used standard, and doing so could affect existing working environments. For this reason the change has been omitted from the patches for libiconv 1.9.1.

Fixes in the JIS converters are required for libiconv 1.9.1 if conversion between CP932 and EUC-JP has to be supported. CP932 is intentionally modified to work only with eucJP-ms, so such conversions would otherwise cause problems. One application that uses such conversions is vim6, and some problems have been observed there. However, fixing this issue requires modifying the standardised JIS converters. For the same reason that the iso-2022-jp modification is not included in the patch, this change needs more consideration. For the time being, two patches are provided for libiconv 1.9.1: one with the fixes for the JIS converters and one without them.

4. TODO

  1. Support for bi-directional conversions for sjis/euc-jp/iso-2022-jp

    Fix the bi-directional conversion for the sjis/euc-jp/iso-2022-jp encodings described in section 2-2, issue (1).

  2. Implement bi-directional converter for iso-2022-jp and cp932/eucJP-ms.

    Add new converters such as iso-2022-jp-ms or iso-2022-jp-cp932.

  3. NLS conversion under Linux

    The normalisation of Shift-JIS characters (unifying characters that have multiple code points) differs between Linux NLS (National Language Support) and glibc/libiconv.

  4. Interoperability

    Some software implements its own logic to support eucJP-open (eucJP-ms) encoding conversions. Examples are eucJP-win in PHP and EUC_JP in PostgreSQL.

    Because non-standardised names are used across software, and because no cohesive support exists for vendor-defined characters, some form of interoperability verification is required.

    For cp932/eucJP-ms, it is essential to prepare correct conversion tables first, and then to implement converters according to them.

  5. Creating new locale for eucJP-ms

      JF Documents > Policies for Japanese Locale under Linux > Chapter 3. Character Codes (Japanese)
      http://www.linux.or.jp/JF/JFdocs/Japanese-Locale-Policy/character-code.html

    Its "Note" section states that "instead of modifying the existing ja_JP.eucJP locale, it is recommended to create a new locale in order to add User defined characters for achieving compatibility to the working environment."

    In other words, a new locale for eucJP-ms is needed.
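The interoperability verification called for in item 4 could start from a simple table comparison: take a reference conversion table and report every code on which another converter's table disagrees. The sketch below is one possible shape for such a check; the function name `diff_tables` is hypothetical, and the sample data is limited to two rows of Table 1.

```python
def diff_tables(reference: dict, candidate: dict) -> dict:
    """Return {code: (reference_ucs, candidate_ucs)} for every disagreement.

    A code that is missing from one table is reported with None on that side.
    """
    diffs = {}
    for code in sorted(set(reference) | set(candidate)):
        ref, cand = reference.get(code), candidate.get(code)
        if ref != cand:
            diffs[code] = (ref, cand)
    return diffs


if __name__ == "__main__":
    # Two rows of Table 1: Shift-JIS code -> UCS under each implementation.
    ms = {0x8160: 0xFF5E, 0x8161: 0x2225}
    libiconv_glibc = {0x8160: 0x301C, 0x8161: 0x2016}
    for code, (ref, cand) in diff_tables(ms, libiconv_glibc).items():
        print(f"0x{code:04X}: reference U+{ref:04X}, candidate U+{cand:04X}")
```

Running the same comparison against the tables used by each piece of software would list the characters that need attention before a common converter is implemented.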

5. References


   JIS Standards
  JIS X 0201
  JIS X 0208
  JIS X 0221

   WWW

Conversion tables differ between vendors
http://www.debian.or.jp/~kubota/unicode-symbols-map2.html.ja

Issues in Java Character Encodings (Japanese)
http://www.ingrid.org/java/i18n/encoding/

Qt/KDE Japanese Localisation (Japanese)
http://www.asahi-net.or.jp/~hc3j-tkg/

Issues in existing codes and Unicode (Japanese)
http://euc.jp/i18n/ucsnote.ja.html

eucJP-ms as a solution for resolving conversions for Unicode and User-defined/Vendor-defined characters (Japanese)


Recommended mapping table for JIS-Unicode conversion
http://hp.vector.co.jp/authors/VA010341/unicode/

   Pages created by MORIYAMA Masayuki

Contacts

For more information, please send emails to samba-dev@miraclelinux.com.

Copyright(c)2000-2008 MIRACLE LINUX CORPORATION. All Rights Reserved.