Diffstat (limited to 'newlib/libc/iconv/iconv.tex')
-rw-r--r-- | newlib/libc/iconv/iconv.tex | 1846 |
1 files changed, 284 insertions, 1562 deletions
diff --git a/newlib/libc/iconv/iconv.tex b/newlib/libc/iconv/iconv.tex index 305c3ce16..1a36ceb2b 100644 --- a/newlib/libc/iconv/iconv.tex +++ b/newlib/libc/iconv/iconv.tex @@ -1,1710 +1,432 @@ @node Iconv -@chapter Encoding conversions (@file{iconv.h}) +@chapter Character-set conversions (@file{iconv.h}) This chapter describes the Newlib iconv library. The iconv functions declarations are in @file{iconv.h}. @menu -* iconv:: Encoding conversion routines -* Introduction:: Introduction to iconv and encodings -* Supported encodings:: The list of currently supported encodings -* iconv design decisions:: General iconv library design issues -* iconv configuration:: iconv-related configure script options -* Encoding names:: How encodings are named. -* CCS tables:: CCS tables format and 'mktbl.pl' Perl script -* CES converters:: CES converters description -* The encodings description file:: The 'encoding.deps' file and 'mkdeps.pl' -* How to add new encoding:: The steps to add new encoding support -* The locale support interfaces:: Locale-related iconv interfaces -* Contact:: The author contact +* iconv:: Character set conversion routines +* iconv architecture:: Architecture of Newlib iconv library +* iconv configuration:: Newlib iconv-specific configure options +* Generating CCS tables:: How to generate CCS tables +* Adding new converter:: Steps on adding a new converter @end menu @page @include iconv/iconv.def @page -@node Introduction -@section Introduction +@node iconv architecture +@section iconv architecture +@findex iconv architecture @findex encoding -@findex character set -@findex charset -@findex CES @findex CCS +@findex CES +@findex iconv converter @* -The iconv library is intended to convert characters from one encoding to -another. It implements iconv(), iconv_open() and iconv_close() -calls, which are defined by the Single Unix Specification. 
- -@* -In addition to these user-level interfaces, the iconv library also has -several useful interfaces which are needed to support coding -capabilities of the Newlib Locale infrastructure. Since Locale -support also needs to -convert various character sets to and from the @emph{wide characters -set}, the iconv library shares it's capabilities with the Newlib Locale -subsystem. Moreover, the iconv library supports several features which are -only needed for the Locale infrastructure (for example, the MB_CUR_MAX value). - -@* -The Newlib iconv library was created using concepts from another iconv -library implemented by Konstantin Chuguev (ver 2.0). The Newlib iconv library -was rewritten from scratch and contains a lot of improvements with respect to -the original iconv library. - -@* -Terms like @dfn{encoding} or @dfn{character set} aren't well defined and -are often used with various meanings. The following are the definitions of terms -which are used in this documentation as well as in the iconv library -implementation: - @itemize @bullet @item -@dfn{encoding} - a machine representation of characters by means of bits; - +Encoding - a rule to represent computer text by means of bits and bytes. @item -@dfn{Character Set} or @dfn{Charset} - just a collection of -characters, i.e. the encoding is the machine representation of the character set; - +CCS (Coded Character Set) - a mapping from an abstract character set +to a set of non-negative integers (character codes). @item -@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from an character set to a -set of integers @dfn{character codes}; - -@item -@dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character -codes to a sequence of bytes; +CES (Character Encoding Scheme) - a mapping from a set of character codes +units to a sequence of bytes. @end itemize @* -Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8, -ASCII, etc. 
Encodings are formed by the following chain of steps: - -@enumerate -@item -User has a set of characters which are specific to his or her language (character set). - -@item -Each character from this set is uniquely numbered, resulting in an CCS. - -@item -Each number from the CCS is converted to a sequence of bits or bytes by means -of a CES and form some encoding. Thus, CES may be considered as a -function of CCS which produces some encoding. Note, that CES may be -applied to more than one CCS. -@end enumerate +Examples of CCS: ASCII, ISO-8859-x, KOI8-R, KSX-1001, GB-2312.@* +Examples of CES: UTF-8, UTF-16, EUC-JP, ISO-2022-JP. @* -Thus, an encoding may be considered as one or more CCS + CES. +The iconv library is used to convert an array of characters in one encoding +to array in another encoding. @* -Sometimes, there is no CES and in such cases encoding is equivalent -to CCS, e.g. KOI8-R or ASCII. +From a user's point of view, the iconv library is a set of converters. Each converter +corresponds to one encoding (e.g., KOI8-R converter, UTF-8 converter). +Internally the meaning of converter is different. @* -An example of a more complicated encoding is UTF-8 which is the UCS -(or Unicode) CCS plus the UTF-8 CES. +The iconv library always performs conversions through UCS-32: i.e., to convert +from A to B, iconv library first converts A to UCS-32, and then USC-32 to B. @* -The following is a brief list of iconv library features: -@itemize -@item -Generic architecture; -@item -Locale infrastructure support; -@item -Automatic generation of the program code which handles -CES/CCS/Encoding/Names/Aliases dependencies; -@item -The ability to choose size- or speed-optimazed -configuration; -@item -The ability to exclude a lot of unneeded code and data from the linking step. -@end itemize - - - +Each encoding consists of CES and CCS. CCS may be represented as data tables +but CES always implies some code (algorithm). 
Iconv uses CCS tables +to map from some encoding to UCS-32. CCS tables are placed into +the iconv/ccs subdirectory of newlib. The iconv code also uses CES +modules which can convert some CCS to and from UCS-32. CES modules are placed +in the iconv/ces subdirectory. -@page -@node Supported encodings -@section Supported encodings -@findex big5 -@findex cp775 -@findex cp850 -@findex cp852 -@findex cp855 -@findex cp866 -@findex euc_jp -@findex euc_kr -@findex euc_tw -@findex iso_8859_1 -@findex iso_8859_10 -@findex iso_8859_11 -@findex iso_8859_13 -@findex iso_8859_14 -@findex iso_8859_15 -@findex iso_8859_2 -@findex iso_8859_3 -@findex iso_8859_4 -@findex iso_8859_5 -@findex iso_8859_6 -@findex iso_8859_7 -@findex iso_8859_8 -@findex iso_8859_9 -@findex iso_ir_111 -@findex koi8_r -@findex koi8_ru -@findex koi8_u -@findex koi8_uni -@findex ucs_2 -@findex ucs_2_internal -@findex ucs_2be -@findex ucs_2le -@findex ucs_4 -@findex ucs_4_internal -@findex ucs_4be -@findex ucs_4le -@findex us_ascii -@findex utf_16 -@findex utf_16be -@findex utf_16le -@findex utf_8 -@findex win_1250 -@findex win_1251 -@findex win_1252 -@findex win_1253 -@findex win_1254 -@findex win_1255 -@findex win_1256 -@findex win_1257 -@findex win_1258 @* -The following is the list of currently supported encodings. The first column -corresponds to the encoding name, the second column is the list of aliases, -the third column is its CES and CCS components names, and the fourth column -is a short description. - -@multitable @columnfractions .20 .26 .24 .30 -@item -Name -@tab -Aliases -@tab -CES/CCS -@tab -Short description -@item -@tab -@tab -@tab - - -@item -big5 -@tab -csbig5, big_five, bigfive, cn_big5, cp950 -@tab -table_pcs / big5, us_ascii -@tab -The encoding for the Traditional Chinese. - - -@item -cp775 -@tab -ibm775, cspc775baltic -@tab -table / cp775 -@tab -The updated version of CP 437 that supports the balitic languages. 
- - -@item -cp850 -@tab -ibm850, 850, cspc850multilingual -@tab -table / cp850 -@tab -IBM 850 - the updated version of CP 437 where several Latin 1 characters have been -added instead of some less-often used characters like the line-drawing -and the greek ones. - - -@item -cp852 -@tab -ibm852, 852, cspcp852 -@tab -@tab -IBM 852 - the updated version of CP 437 where several Latin 2 characters have been added -instead of some less-often used characters like the line-drawing and the greek ones. - - -@item -cp855 -@tab -ibm855, 855, csibm855 -@tab -table / cp855 -@tab -IBM 855 - the updated version of CP 437 that supports Cyrillic. - - -@item -cp866 -@tab -866, IBM866, CSIBM866 -@tab -table / cp866 -@tab -IBM 866 - the updated version of CP 855 which follows more the logical Russian alphabet -ordering of the alternative variant that is preferred by many Russian users. - - -@item -euc_jp -@tab -eucjp -@tab -euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990 -@tab -EUC-JP - The EUC for Japanese. - - -@item -euc_kr -@tab -euckr -@tab -euc / ksx1001 -@tab -EUC-KR - The EUC for Korean. - - -@item -euc_tw -@tab -euctw -@tab -euc / cns11643_plane1, cns11643_plane2, cns11643_plane14 -@tab -EUC-TW - The EUC for Traditional Chinese. - - -@item -iso_8859_1 -@tab -iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1 -@tab -table / iso_8859_1 -@tab -ISO 8859-1:1987 - Latin 1, West European. - - -@item -iso_8859_10 -@tab -iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10 -@tab -table / iso_8859_10 -@tab -ISO 8859-10:1992 - Latin 6, Nordic. - - -@item -iso_8859_11 -@tab -iso8859_11, iso885911 -@tab -table / iso_8859_11 -@tab -ISO 8859-11 - Thai. - - -@item -iso_8859_13 -@tab -iso_8859_13:1998, iso8859_13, iso885913 -@tab -table / iso_8859_13 -@tab -ISO 8859-13:1998 - Latin 7, Baltic Rim. 
- - -@item -iso_8859_14 -@tab -iso_8859_14:1998, iso885914, iso8859_14 -@tab -table / iso_8859_14 -@tab -ISO 8859-14:1998 - Latin 8, Celtic. - - -@item -iso_8859_15 -@tab -iso885915, iso_8859_15:1998, iso8859_15, -@tab -table / iso_8859_15 -@tab -ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1. - - -@item -iso_8859_2 -@tab -iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2 -@tab -table / iso_8859_2 -@tab -ISO 8859-2:1987 - Latin 2, East European. - - -@item -iso_8859_3 -@tab -iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593 -@tab -table / iso_8859_3 -@tab -ISO 8859-3:1988 - Latin 3, South European. - - -@item -iso_8859_4 -@tab -iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4 -@tab -table / iso_8859_4 -@tab -ISO 8859-4:1988 - Latin 4, North European. - - -@item -iso_8859_5 -@tab -iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic -@tab -table / iso_8859_5 -@tab -ISO 8859-5:1988 - Cyrillic. - - -@item -iso_8859_6 -@tab -iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596 -@tab -table / iso_8859_6 -@tab -ISO i8859-6:1987 - Arabic. - - -@item -iso_8859_7 -@tab -iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597 -@tab -table / iso_8859_7 -@tab -ISO 8859-7:1987 - Greek. - - -@item -iso_8859_8 -@tab -iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598 -@tab -table / iso_8859_8 -@tab -ISO 8859-8:1988 - Hebrew. - - -@item -iso_8859_9 -@tab -iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599 -@tab -table / iso_8859_9 -@tab -ISO 8859-9:1989 - Latin 5, Turkish. - - -@item -iso_ir_111 -@tab -ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic -@tab -table / iso_ir_111 -@tab -ISO IR 111/ECMA Cyrillic. - - -@item -koi8_r -@tab -cskoi8r, koi8r, koi8 -@tab -table / koi8_r -@tab -RFC 1489 Cyrillic. 
- - -@item -koi8_ru -@tab -koi8ru -@tab -table / koi8_ru -@tab -The obsolete Ukrainian. - - -@item -koi8_u -@tab -koi8u -@tab -table / koi8_u -@tab -RFC 2319 Ukrainian. - - -@item -koi8_uni -@tab -koi8uni -@tab -table / koi8_uni -@tab -KOI8 Unified. - - -@item -ucs_2 -@tab -ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode -@tab -ucs_2 / (UCS) -@tab -ISO-10646-UCS-2. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -ucs_2_internal -@tab -ucs2_internal, ucs_2internal, ucs2internal -@tab -ucs_2_internal / (UCS) -@tab -ISO-10646-UCS-2 in system byte order. -NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -ucs_2be -@tab -ucs2be -@tab -ucs_2 / (UCS) -@tab -Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2). -Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -ucs_2le -@tab -ucs2le -@tab -ucs_2 / (UCS) -@tab -Little Endian version of ISO-10646-UCS-2. -Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -ucs_4 -@tab -ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4 -@tab -ucs_4 / (UCS) -@tab -ISO-10646-UCS-4. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -ucs_4_internal -@tab -ucs4_internal, ucs_4internal, ucs4internal -@tab -ucs_4_internal / (UCS) -@tab -ISO-10646-UCS-4 in system byte order. -NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -ucs_4be -@tab -ucs4be -@tab -ucs_4 / (UCS) -@tab -Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4). -Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -ucs_4le -@tab -ucs4le -@tab -ucs_4 / (UCS) -@tab -Little Endian version of ISO-10646-UCS-4. -Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported). 
- - -@item -us_ascii -@tab -ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii -@tab -us_ascii / (ASCII) -@tab -7-bit ASCII. - - -@item -utf_16 -@tab -utf16 -@tab -utf_16 / (UCS) -@tab -RFC 2781 UTF-16. The very first NBSP code in stream is interpreted as BOM. - - -@item -utf_16be -@tab -utf16be -@tab -utf_16 / (UCS) -@tab -Big Endian version of RFC 2781 UTF-16. -NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -utf_16le -@tab -utf16le -@tab -utf_16 / (UCS) -@tab -Little Endian version of RFC 2781 UTF-16. -NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -utf_8 -@tab -utf8 -@tab -utf_8 / (UCS) -@tab -RFC 3629 UTF-8. - - -@item -win_1250 -@tab -cp1250 -@tab -@tab -Win-1250 Croatian. - - -@item -win_1251 -@tab -cp1251 -@tab -table / win_1251 -@tab -Win-1251 - Cyrillic. - - -@item -win_1252 -@tab -cp1252 -@tab -table / win_1252 -@tab -Win-1252 - Latin 1. - - -@item -win_1253 -@tab -cp1253 -@tab -table / win_1253 -@tab -Win-1253 - Greek. - - -@item -win_1254 -@tab -cp1254 -@tab -table / win_1254 -@tab -Win-1254 - Turkish. - - -@item -win_1255 -@tab -cp1255 -@tab -table / win_1255 -@tab -Win-1255 - Hebrew. - - -@item -win_1256 -@tab -cp1256 -@tab -table / win_1256 -@tab -Win-1256 - Arabic. - - -@item -win_1257 -@tab -cp1257 -@tab -table / win_1257 -@tab -Win-1257 - Baltic. - - -@item -win_1258 -@tab -cp1258 -@tab -table / win_1258 -@tab -Win-1258 - Vietnamese7 that supports Cyrillic. -@end multitable - - - - - -@page -@node iconv design decisions -@section iconv design decisions -@findex CCS table -@findex CES converter -@findex Speed-optimized tables -@findex Size-optimized tables -@* -The first iconv library design issue arises when considering the -following two design approaches: - -@enumerate -@item -Have modules which implement conversion from the encoding A to the encoding B -and vice versa i.e., one conversion module relates to any two encodings. 
-@item -Have modules which implement conversion from the encoding A to the fixed -encoding C and vice versa i.e., one conversion module relates to any -one encoding A and one fixed encoding C. In this case, to convert from -the encoding A to the encoding B, two modules are needed (in order to convert -from A to C and then from C to B). -@end enumerate - -@* -It's obvious, that we have tradeoff between commonality/flexibility and -efficiency: the first method is more efficient since it converts -directly; however, it isn't so flexible since for each -encoding pair a distinct module is needed. - -@* -The Newlib iconv model uses the second method and always converts through the 32-bit -UCS but its design also allows one to write specialized conversion -modules if the conversion speed is critical. - -@* -The second design issue is how to break down (decompose) encodings. -The Newlib iconv library uses the fact that any encoding may be -considered as one or more CCS plus a CES. It also decomposes its -conversion modules on @dfn{CES converter} plus one or more @dfn{CCS -tables}. CCS tables map CCS to UCS and vice versa; the CES converters -map CCS to the encoding and vice versa. - -@* -As the example, let's consider the conversion from the big5 encoding to -the EUC-TW encoding. The big5 encoding may be decomposed to the ASCII and BIG5 -CCS-es plus the BIG5 CES. EUC-TW may be decomposed on the CNS11643_PLANE1, CNS11643_PLANE2, -and CNS11643_PLANE14 CCS-es plus the EUC CES. +Some encodings have CES = CCS (e.g., KOI8-R). For such encodings iconv uses +special subroutines which perform simple table conversions (ccs_table.c). 
@* -The euc_jp -> big5 conversion is performed as follows: - -@enumerate -@item -The EUC converter performs the EUC-TW encoding to the corresponding CCS-es -transformation (CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14 -CCS-es); -@item -The obtained CCS codes are transformed to the UCS codes using the CNS11643_PLANE1, -CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables; -@item -The resulting UCS codes are transformed to the ASCII and BIG5 codes using -the corresponding CCS tables; -@item -The obtained CCS codes are transformed to the big5 encoding using the corresponding -CES converter. -@end enumerate - -@* -Analogously, the backward conversion is performed as follows: - -@enumerate -@item -The BIG5 converter performs the big5 encoding to the corresponding CCS-es transformation -(the ASCII and BIG5 CCS-es); -@item -The obtained CCS codes are transformed to the UCS codes using the ASCII and BIG5 CCS tables; -@item -The resulting UCS codes are transformed to the ASCII and BIG5 codes using -the corresponding CCS tables; -@item -The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding -CES converter. -@end enumerate - -@* -Note, the above is just an example and real names (which are implemented -in the Newlib iconv) of the CES converters and the CCS tables are slightly different. - -@* -The third design issue also relates to flexibility. Obviously, it isn't -desirable to always link all the CES converters and the CCS tables to the library -but instead, we want to be able to load the needed converters and tables -dynamically on demand. This isn't a problem on "big" machines such as -a PC, but it may be very problematical within "small" embedded systems. - -@* -Since the CCS tables are just data, it is possible to load them -dynamically from external files. The CES converters, on the other hand -are algorithms with some code so a dynamic library loading -capability is required. 
- -@* -Apart from possible restrictions applied by embedded systems (small -RAM for example), Newlib itself has no dynamic library support and -therefore, all the CES converters which will ever be used must be linked into -the library. However, loading of the dynamic CCS tables is possible and is -implemented in the Newlib iconv library. It may be enabled via the Newlib -configure script options. - -@* -The next design issue is fine-tuning the iconv library -configuration. One important ability is for iconv to not link all it's -converters and tables (if dynamic loading is not enabled) but instead, -enable only those encodings which are specified at configuration -time (see the section about the configure script options). +Among specialized CES modules, the iconv library has +generic support for EUC and ISO-2022-family encodings (ces_euc.c and +ces_iso2022.c). @* -In addition, the Newlib iconv library configure options distinguish between -conversion directions. This means that not only are supported encodings -selectable, the conversion direction is as well. For example, if user wants -the configuration which allows conversions from UTF-8 to UTF-16 and -doesn't plan using the "UTF-16 to UTF-8" conversions, he or she can -enable only -this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will -be included) thus, saving some memory (note, that such technique allows to -exclude one half of a CCS table from linking which may be big enough). +To enable iconv to work with CCS or CES-based encodings, the correspondent +CES table or CCS module should be linked with Newlib. The iconv support +can also load CCS tables dynamically from external files (.cct files from +iconv/ccs/binary subdirectory). CES modules, on the other-hand, can't +be dynamically loaded. @* -One more design aspect are the speed- and size- optimized tables. Users can -select between them using configure script options. 
The -speed-optimized CCS tables are the same as the size-optimized ones in -case of 8-bit CCS (e.g.m KOI8-R), but for 16-bit CCS-es the size-optimized -CCS tables may be 1.5 to 2 times less then the speed-optimized ones. On the -other hand, conversion with speed tables is several times faster. - -@* -Its worth to stress that the new encoding support can't be -dynamically added into an already compiled Newlib library, even if it -needs only an additional CCS table and iconv is configured to use -the external files with CCS tables (this isn't the fundamental restriction -and the possibility to add new Table-based encoding support dynamically, by -means of just adding new .cct file, may be easily added). - -@* -Theoretically, the compiled-in CCS tables should be more appropriate for -embedded systems than dynamically loaded CCS tables. This is because the compiled-in tables are read-only and can be placed in ROM -whereas dynamic loading requires RAM. Moreover, in the current iconv -implementation, a distinct copy of the dynamic CCS file is loaded for each opened iconv descriptor even in case of the same encoding. -This means, for example, that if two iconv descriptors for -"KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of -koi8-r .cct file will be loaded (actually, iconv loads only the needed part -of these files). On the other hand, in the case of compiled-in CCS tables, there will always be only one copy. +Each iconv converter has one name and a set of aliases. The list of +aliases for each converter's name is in the iconv/charset.aliases file. +Note: iconv always normalizes converter names and aliases before using. 
@page @node iconv configuration @section iconv configuration @findex iconv configuration -@findex --enable-newlib-iconv-encodings -@findex --enable-newlib-iconv-from-encodings -@findex --enable-newlib-iconv-to-encodings -@findex --enable-newlib-iconv-external-ccs -@findex NLSPATH -@* -To enable an encoding, the @emph{--enable-newlib-iconv-encodings} configure -script option should be used. This option accepts a comma-separated list -of @emph{encodings} that should be enabled. The option enables each encoding in both -("to" and "from") directions. - -@* -The @option{--enable-newlib-iconv-from-encodings} configure script option enables -"from" support for each encoding that was passed to it. - +@findex iconv converter @* -The @option{--enable-newlib-iconv-to-encodings} configure script option enables -"to" support for each encoding that was passed to it. +To enable iconv, the --enable-newlib-iconv configuration option should be +used when configuring newlib. @* -Example: if user plans only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and -"KOI8-R -> UCS-2" conversions, the most optimal way (minimal iconv -code and data will be linked) is to configure Newlib with the following -options: -@* -@code{--enable-newlib-iconv-encodings=UTF-8 ---enable-newlib-iconv-from-encodings=KOI8-R ---enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5} -@* -which is the same as -@* -@code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8 ---enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8} -@* -User may also just use the -@* -@code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2} -@* -configure script option, but it isn't so optimal since there will be -some unneeded data and code. - -@* -The @option{--enable-newlib-iconv-external-ccs} option enables iconv's -capabilities to work with the external CCS files. - -@* -The @option{--enable-target-optspace} Newlib configure script option also affects -the iconv library. 
If this option is present, the library uses the size -optimized CCS tables. This means, that only the size-optimized CCS -tables will be linked or, if the -@option{--enable-newlib-iconv-external-ccs} configure script option was used, -the iconv library will load the size-optimized tables. If the -@option{--enable-target-optspace}configure script option is disabled, -the speed-optimized CCS tables are used. - -@* -Note: .cct files are searched by iconv_open in the $NLSPATH/iconv_data/ directory. -Thus, the NLSPATH environment variable should be set. - - - - - -@page -@node Encoding names -@section Encoding names -@findex encoding name -@findex encoding alias -@findex normalized name -@* -Each encoding has one @dfn{name} and a number of @dfn{aliases}. When -user works with the iconv library (i.e., when the @code{iconv_open} call -is used) both name or aliases may be used. The same is when encoding -names are used in configure script options. +To link a specific converter (CCS table or CES module) into Newlib, the +---enable-newlib-builtin-converters option should be used. A +comma-separated list of converters can be passed with this option +(e.g., ---enable-newlib-builtin-converters=koi8-r,euc-jp to link KOI8-R +and EUC-JP converters). Either converter names or aliases may be used. @* -Names and aliases may be specified in any case (small or capital -letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol. -Also, when working with the iconv library, +If the target system has a file system accessible by Newlib, table-based +converters may be loaded dynamically from external files. The iconv +code tries to load files from the iconv_data subdirectory of the directory +specified by the NLSPATH environment variable. @* -Internally the Newlib iconv library always converts aliases to names. 
It -also converts names and aliases in the @dfn{normalized} form which means -that all capital letters are converted to small letters and the @kbd{-} -symbols are converted to @kbd{_} symbols. - - - +Since Newlib has no generic dynamic module load support, CES-based converters +can't be dynamically loaded and should be linked-in. @page -@node CCS tables -@section CCS tables -@findex Size-optimized CCS table -@findex Speed-optimized CCS table -@findex mktbl.pl Perl script -@findex .cct files -@findex The CCT tables source files -@findex CCS source files -@* -The iconv library stores files with CCS tables in the the @emph{ccs/} -subdirectory. The CCS tables for any CCS may be kept in two forms - in the binary form -(@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in form -of compilable .c source files. The .cct files are only used when the -@option{--enable-newlib-iconv-external-ccs} configure script option is enabled. -The .c files are linked to the Newlib library if the corresponding -encoding is enabled. - -@* -As stated earlier, the Newlib iconv library performs all -conversions through the 32-bit UCS, but the codes which are used -in most CCS-es, fit into the first 16-bit subset of the 32-bit UCS set. -Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is -used instead of the 32-bit UCS-4. - -@* -CCS tables may be 8- or 16-bit wide. 8-bit CCS tables map 8-bit CCS to -16-bit UCS-2 and vice versa while 16-bit CCS tables map -16-bit CCS to 16-bit UCS-2 and vice versa. -8-bit tables are small (in size) while 16-bit tables may be big enough. -Because of this, 16-bit CCS tables may be -either speed- or size-optimized. Size-optimized CCS tables are -smaller then speed-optimized ones, but the conversion process is -slower if the size-optimized CCS tables are used. 8-bit CCS tables have only -size-optimized variant. - -Each CCS table (both speed- and size-optimized) consists of -@dfn{from_ucs} and @dfn{to_ucs} subtables. 
"from_ucs" subtable maps -UCS-2 codes to CCS codes, while "to_ucs" subtable maps CCS codes to -UCS-2 codes. - -@* -Almost all 16-bit CCS tables contain less then 0xFFFF codes and -a lot of gaps exist. - -@subsection Speed-optimized tables format -@* -In case of 8-bit speed-optimized CCS tables the "to_ucs" subtables format is -trivial - it is just the array of 256 16-bit UCS codes. Therefore, an -UCS-2 code @emph{Y} corresponding to a @emph{X} CCS code is calculates -as @emph{Y = to_ucs[X]}. - -@* -Obviously, the simplest way to create the "from_ucs" table or the -16-bit "to_ucs" table is to use the huge 16-bit array like in case -of the 8-bit "to_ucs" table. But almost all the 16-bit CCS tables contain -less then 0xFFFF code maps and this fact may be exploited to reduce -the size of the CCS tables. - -@* -In this chapter the "UCS-2 -> CCS" 8-bit CCS table format is described. The -16-bit "CCS -> UCS-2" CCS table format is the same, except the mapping -direction and the CCS bits number. - -@* -In case of the 8-bit speed-optimized table the "from_ucs" subtable -corresponds the "from_ucs" array and has the following layout: - -@* -from_ucs array: -@* -------------------------------------- -@* -0xFF mapping (2 bytes) (only for -8-bit table). -@* -------------------------------------- -@* -Heading block -@* -------------------------------------- -@* -Block 1 -@* -------------------------------------- -@* -Block 2 -@* -------------------------------------- -@* - ... -@* -------------------------------------- -@* -Block N +@node Generating CCS tables +@section Generating CCS tables @* -------------------------------------- +CCS tables are placed in the ccs subdirectory of the iconv directory. +This subdirectory contains .cct and .c files. The .cct files are for +dynamic loading whereas the .c files are for static linking with Newlib. +Both .c and .cct files are generated by the 'iconv_mktbl' perl script +from special source files (call them +.txt files). 
The 'iconv_mktbl' script can be found in the iconv/ccs +subdirectory. Input .txt files can be found at the Unicode.org site or +other locations found on the web. @* -The 0x0000-0xFFFF 16-bit code range is divided to 256 code subranges. Each -subrange is represented by an 256-element @dfn{block} (256 1-byte -elements or 256 2-byte element in case of 16-bit CCS table) with -elements which are equivalent to the CCS codes of this subrange. -If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be -absent and there will be less then 256 blocks. +The .c files are linked with Newlib if the correspondent 'configure' script +option was given. This is needed to use iconv on targets without file system +support. If a CCS table isn't configured to be linked, the iconv library +tries to load it dynamically from a corresponding .cct file. @* -Any element number @emph{m} of @dfn{the heading block} (which contains -256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange. -If the subrange contains some codes, the value of the @emph{m}-th element of -the heading block contains the offset of the corresponding block in the -"from_ucs" array. If there is no codes in the subrange, the heading -block element contains 0xFFFF. +The following are commands to build .c and .cct CCS table files from .txt +files for several supported encodings. @* -If there are some gaps in a block, the corresponding block elements have -the 0xFF value. If there is an 0xFF code present in the CCS, it's mapping -is defined in the first 2-byte element of the "from_ucs" array. - -@* -Having such a table format, the algorithm of searching the CCS code -@emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows. - -@* -@enumerate -@item If @emph{Y} is equivalent to the value of the first 2-byte element -of the "from_ucs" array, @emph{X} is 0xFF. Else, continue to search. - -@item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}. 
-
-@item If the heading block element with number @emph{BlkN} is 0xFFFF, there
-is no corresponding CCS code (error, wrong input data). Else, fetch the
-"flom_ucs" array index of the @emph{BlkN}-th block.
-
-@item Calculate the offset of the @emph{X} code in its block:
-@emph{Xindex = Y & 0xFF}
-
-@item If the @emph{Xintex}-th element of the block (which is equivalent to
-@emph{from_ucs[BlkN+Xindex]}) value is 0xFF, there is no corresponding
-CCS code (error, wrong input data). Else, @emph{X = from_ucs[BlkN+Xindex]}.
-@end enumerate
-
-@subsection Size-optimized tables format
-@*
-As it is stated above, size-optimized tables exist only for 16-bit CCS-es.
-This is because there is too small difference between the speed-optimized
-and the size-optimized table sizes in case of 8-bit CCS-es.
-
-@*
-Formats of the "to_ucs" and "from_ucs" subtables are equivalent in case of
-size-optimized tables.
-
-This sections describes the format of the "UCS-2 -> CCS" size-optimized
-CCS table. The format of "CCS -> UCS-2" table is the same.
-
-The idea of the size-optimized tables is to split the UCS-2 codes
-("from" codes) on @dfn{ranges} (@dfn{range} is a number of consecutive UCS-2 codes).
-Then CCS codes ("to" codes) are stored only for the codes from these
-ranges. Distinct "from" codes, which have no range (@dfn{unranged codes}, are stored
-together with the corresponding "to" codes.
-
-@*
-The following is the layout of the size-optimized table array:
-
-@*
-size_arr array:
-@*
--------------------------------------
-@*
-Ranges number (2 bytes)
-@*
--------------------------------------
-@*
-Unranged codes number (2 bytes)
-@*
--------------------------------------
-@*
-Unranged codes array index (2 bytes)
-@*
--------------------------------------
-@*
-Ranges indexes (triads)
-@*
--------------------------------------
-@*
-Ranges
-@*
--------------------------------------
-@*
-Unranged codes array
-@*
--------------------------------------
-
-@*
-The @dfn{Unranged codes array index} @emph{size_arr} section helps to find
-the offset of the needed range in the @emph{size_arr} and has
-the following format (triads):
-@*
-the first code in range, the last code in range, range offset.
-
-@*
-The array of these triads is sorted by the firs element, therefore it is
-possible to quickly find the needed range index.
-
-@*
-Each range has the corresponding sub-array containing the "to" codes. These
-sub-arrays are stored in the place marked as "Ranges" in the layout
-diagram.
-
-@*
-The "Unranged codes array" contains pairs ("from" code, "to" code") for
-each unranged code. The array of these pairs is sorted by "from" code
-values, therefore it is possible to find the needed pair quickly.
-
-@*
-Note, that each range requires 6 bytes to form its index. If, for
-example, there are two ranges (1 - 5 and 9 - 10), and one unranged code
-(7), 12 bytes are needed for two range indexes and 4 bytes for the unranged
-code (total 16). But it is better to join both ranges as 1 - 10 and
-mark codes 6 and 8 as absent. In this case, only 6 additional bytes for the
-range index and 4 bytes to mark codes 6 and 8 as absent are needed
-(total 10 bytes). This optimization is done in the size-optimized tables.
-Thus, ranges may contain small gaps. The absent codes in ranges are marked
-as 0xFFFF.
-
-@*
-Note, a pair of "from" codes is stored by means of unranged codes since
-the number of bytes which are needed to form the range is greater than
-the number of bytes to store two unranged codes (5 against 4).
-
-@*
-The algorithm of searching of the CCS code
-@emph{X} which corresponds to the UCS-2 code @emph{Y} (input) in the "UCS-2 ->
-CCS" size-optimized table is as follows.
-
-@*
-@enumerate
-@item Try to find the corresponding triad in the "Unranged codes array
-index". Since we are searching in the sorted array, we can do it quickly
-(divide by 2, compare, etc).
-
-@item If the triad is found, fetch the @emph{X} code from the corresponding
-range array. If it is 0xFFFF, return an error.
-
-@item If there is no corresponding triad, search the @emph{X} code among the
-sorted unranged codes. Return error, if noting was found.
-@end enumerate
-
-@subsection .cct ant .c CCS Table files
-@*
-The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs"
-speed-optimized tables. The .c source files for 16-bit CCS tables have
-"to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size"
-tables.
-
-@*
-When .c files are compiled and used, all the 16-bit and 32-bit values
-have the native endian format (Big Endian for the BE systems and Little
-Endian for the LE systems) since they are compile for the system before
-they are used.
-
-@*
-In case of .cct files, which are intended for dynamic CCS tables
-loading, the CCS tables are stored either in LE or BE format. Since the
-.cct files are generated by the 'mktbl.pl' Perl script, it is possible
-to choose the endianess of the tables. It is also possible to store two
-copies (both LE and BE) of the CCS tables in one .cct file. The default
-.cct files (which come with the Newlib sources) have both LE and BE CCS
-tables. The Newlib iconv library automatically chooses the needed CCS tables
-(with appropriate endianess).
-
-@*
-Note, the .cct files are only used when the
-@option{--enable-newlib-iconv-external-ccs} is used.
-
-@subsection The 'mktbl.pl' Perl script
-@*
-The 'mktbl.pl' script is intended to generate .cct and .c CCS table
-files from the @dfn{CCS source files}.
-
-@*
-The CCS source files are just text files which has one or more colons
-with CCS <-> UCS-2 codes mapping. To see an example of the CCS table
-source files see one of them using URL-s which will be given bellow.
-
-@*
-The following table describes where the source files for CCS table files
-provided by the Newlib distribution are located.
-
-@multitable @columnfractions .25 .75
-@item
-Name
-@tab
-URL
-
-@item
-@tab
-
-@item
-big5
-@tab
-http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
-
-@item
-cns11643_plane1
-cns11643_plane14
-cns11643_plane2
-@tab
-http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
-
-@item
-cp775
-cp850
-cp852
-cp855
-cp866
-@tab
-http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
-
+@itemize
 @item
-iso_8859_1
-iso_8859_2
-iso_8859_3
-iso_8859_4
-iso_8859_5
-iso_8859_6
-iso_8859_7
-iso_8859_8
-iso_8859_9
-iso_8859_10
-iso_8859_11
-iso_8859_13
-iso_8859_14
-iso_8859_15
-@tab
-http://www.unicode.org/Public/MAPPINGS/ISO8859/
+cp775:@*
+iconv_mktbl -Co cp775.c cp775.txt@*
+iconv_mktbl -o cp775.cct cp775.txt
+@end itemize
+@itemize
 @item
-iso_ir_111
-@tab
-http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT
+cp850:@*
+iconv_mktbl -Co cp850.c cp850.txt@*
+iconv_mktbl -o cp850.cct cp850.txt
+@end itemize
+@itemize
 @item
-jis_x0201_1976
-jis_x0208_1990
-jis_x0212_1990
-@tab
-http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
+cp852:@*
+iconv_mktbl -Co cp852.c cp852.txt@*
+iconv_mktbl -o cp852.cct cp852.txt
+@end itemize
+@itemize
 @item
-koi8_r
-@tab
-http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
+cp855:@*
+iconv_mktbl -Co cp855.c cp855.txt@*
+iconv_mktbl -o cp855.cct cp855.txt
+@end itemize
+@itemize
 @item
-koi8_ru
-@tab
-http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT
+cp866@*
+iconv_mktbl -Co cp866.c cp866.txt@*
+iconv_mktbl -o cp866.cct cp866.txt
+@end itemize
+@itemize
 @item
-koi8_u
-@tab
-http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT
+iso-8859-1@*
+iconv_mktbl -Co iso-8859-1.c iso-8859-1.txt@*
+iconv_mktbl -o iso-8859-1.cct iso-8859-1.txt
+@end itemize
+@itemize
 @item
-koi8_uni
-@tab
-http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT
+iso-8859-4@*
+iconv_mktbl -Co iso-8859-4.c iso-8859-4.txt@*
+iconv_mktbl -o iso-8859-4.cct iso-8859-4.txt
+@end itemize
+@itemize
 @item
-ksx1001
-@tab
-http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
+iso-8859-5@*
+iconv_mktbl -Co iso-8859-5.c iso-8859-5.txt@*
+iconv_mktbl -o iso-8859-5.cct iso-8859-5.txt
+@end itemize
+@itemize
 @item
-win_1250
-win_1251
-win_1252
-win_1253
-win_1254
-win_1255
-win_1256
-win_1257
-win_1258
-@tab
-http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
-@end multitable
-
-The CCS source files aren't distributed with Newlib because of License
-restrictions in most Unicode.org's files.
-
-The following are 'mktbl.pl' options which were used to generate .cct
-files. Note, to generate CCS tables source files @option{-s} option
-should be added.
-
-@enumerate
-@item For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct,
-iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct,
-iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct,
-iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct
-win_1256.cct, win_1258.cct, win_1251.cct,
-win_1253.cct, win_1255.cct, win_1257.cct,
-koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct,
-big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct
-files, only the @option{-i <SRC_FILE_NAME>} option were used.
-
-@item To generate the jis_x0208_1990.cct file, the
-@option{-i jis_x0208_1990.txt -x 2 -y 3} options were used.
-
-@item To generate the cns11643_plane1.cct file, the
-@option{-i cns11643.txt -p1 -N cns11643_plane1 -o cns11643_plane1.cct}
-options were used.
-
-@item To generate the cns11643_plane2.cct file, the
-@option{-i cns11643.txt -p2 -N cns11643_plane2 -o cns11643_plane2.cct}
-options were used.
-
-@item To generate the cns11643_plane14.cct file, the
-@option{-i cns11643.txt -p0xE -N cns11643_plane14 -o cns11643_plane14.cct}
-options were used.
-@end enumerate
-
-@*
-For more info about the 'mktbl.pl' options, see the 'mktbl.pl -h' output.
-
-@*
-It is assumed that CCS codes are 16 or less bits wide. If there are wider CCS codes
-in the CCS source file, the bits which are higher then 16 defines plane (see the
-cns11643.txt CCS source file).
-
-@*
-Sometimes, it is impossible to map some CCS codes to the 16-bit UCS if, for example,
-several different CCS codes are mapped to one UCS-2 code or one CCS code is mapped to
-the pair of UCS-2 codes. In these cases, such CCS codes (@dfn{lost
-codes}) aren't just rejected but instead, they are mapped to the default
-UCS-2 code (which is currently the @kbd{?} character's code).
-
-
-
-
-
-@page
-@node CES converters
-@section CES converters
-@findex PCS
-@*
-Similar to the CCS tables, CES converters are also split into "from UCS"
-and "to UCS" parts. Depending on the iconv library configuration, these
-parts are enabled or disabled.
-
-@*
-The following it the list of CES converters which are currently present
-in the Newlib iconv library.
+iso-8859-2@*
+iconv_mktbl -Co iso-8859-2.c iso-8859-2.txt@*
+iconv_mktbl -o iso-8859-2.cct iso-8859-2.txt
+@end itemize
-@itemize @bullet
+@itemize
 @item
-@emph{euc} - supports the @emph{euc_jp}, @emph{euc_kr} and @emph{euc_tw}
-encodings. The @emph{euc} CES converter uses the @emph{table} and the
-@emph{us_ascii} CES converters.
+iso-8859-15@*
+iconv_mktbl -Co iso-8859-15.c iso-8859-15.txt@*
+iconv_mktbl -o iso-8859-15.cct iso-8859-15.txt
+@end itemize
+@itemize
 @item
-@emph{table} - this CES converter corresponds to "null" and just performs
-tables-based conversion using 8- and 16-bit CCS tables. This converter
-is also used by any other CES converter which needs the CCS table-based
-conversions. The @emph{table} converter is also responsible for .cct files
-loading.
+big5@*
+iconv_mktbl -Co big5.c big5.txt@*
+iconv_mktbl -o big5.cct big5.txt
+@end itemize
+@itemize
 @item
-@emph{table_pcs} - this is the wrapper over the @emph{table} converter
-which is intended for 16-bit encodings which also use the @dfn{Portable
-Character Set} (@dfn{PCS}) which is the same as the @emph{US-ASCII}.
-This means, that if the first byte the CCS code is in range of [0x00-0x7f],
-this is the 7-bit PCS code. Else, this is the 16-bit CCS code. Of course,
-the 16-bit codes must not contain bytes in the range of [0x00-0x7f].
-The @emph{big5} encoding uses the @emph{table_pcs} CES converter and the
-@emph{table_pcs} CES converter depends on the @emph{table} CES converter.
+ksx1001@*
+iconv_mktbl -Co ksx1001.c ksx1001.txt@*
+iconv_mktbl -o ksx1001.cct ksx1001.txt
+@end itemize
+@itemize
 @item
-@emph{ucs_2} - intended for the @emph{ucs_2}, @emph{ucs_2be} and
-@emph{ucs_2le} encodings support.
+gb_2312@*
+iconv_mktbl -Co gb_2312-80.c gb_2312-80.txt@*
+iconv_mktbl -o gb_2312-80.cct gb_2312-80.txt
+@end itemize
+@itemize
 @item
-@emph{ucs_4} - intended for the @emph{ucs_4}, @emph{ucs_4be} and
-@emph{ucs_4le} encodings support.
+jis_x0201@*
+iconv_mktbl -Co jis_x0201.c jis_x0201.txt@*
+iconv_mktbl -o jis_x0201.cct jis_x0201.txt
+@end itemize
+@itemize
 @item
-@emph{ucs_2_internal} - intended for the @emph{ucs_2_internal} encoding support.
+iconv_mktbl -Co shift_jis.c shift_jis.txt@*
+iconv_mktbl -o shift_jis.cct shift_jis.txt
+@end itemize
+@itemize
 @item
-@emph{ucs_4_internal} - intended for the @emph{ucs_4_internal} encoding support.
+jis_x0208@*
+iconv_mktbl -C -c 1 -u 2 -o jis_x0208-1983.c jis_x0208-1983.txt@*
+iconv_mktbl -c 1 -u 2 -o jis_x0208-1983.cct jis_x0208-1983.txt
+@end itemize
+@itemize
 @item
-@emph{us_ascii} - intended for the @emph{us_ascii} encoding support. In
-principle, the most natural way to support the @emph{us_ascii} encoding
-is to define the @emph{us_ascii} CCS and use the @emph{table} CES
-converter. But for the optimization purposes, the specialized
-@emph{us_ascii} CES converter was created.
+jis_x0212@*
+iconv_mktbl -Co jis_x0212-1990.c jis_x0212-1990.txt@*
+iconv_mktbl -o jis_x0212-1990.cct jis_x0212-1990.txt
+@end itemize
+@itemize
 @item
-@emph{utf_16} - intended for the @emph{utf_16}, @emph{utf_16be} and
-@emph{utf_16le} encodings support.
+cns11643-plane1@*
+iconv_mktbl -C -p 0x1 -o cns11643-plane1.c cns11643.txt@*
+iconv_mktbl -p 0x1 -o cns11643-plane1.cct cns11643.txt
+@end itemize
+@itemize
 @item
-@emph{utf_8} - intended for the @emph{utf_8} encoding support.
+cns11643-plane2@*
+iconv_mktbl -C -p 0x2 -o cns11643-plane2.c cns11643.txt@*
+iconv_mktbl -p 0x2 -o cns11643-plane2.cct cns11643.txt
 @end itemize
-
-
-
-
-@page
-@node The encodings description file
-@section The encodings description file
-@findex encoding.deps description file
-@findex mkdeps.pl Perl script
-@*
-To simplify the process of adding new encodings support allowing to
-automatically generate a lot of "glue" files.
-
-@*
-There is the 'encoding.deps' file in the @emph{lib/} subdirectory which
-is used to describe encoding's properties. The 'mkdeps.pl' Perl script
-uses 'encoding.deps' to generates the "glue" files.
-
-@*
-The 'encoding.deps' file is composed of sections, each section consists
-of entries, each entry contains some encoding/CES/CCS description.
-
-@*
-The 'encoding.deps' file's syntax is very simple. Currently only two
-sections are defined: @emph{ENCODINGS} and @emph{CES_DEPENDENCIES}.
-
-@*
-Each @emph{ENCODINGS} section's entry describes one encoding and
-contains the following information.
-
-@itemize @bullet
+@itemize
 @item
-Encoding name (the @emph{ENCODING} field). The name should
-be unique and only one name is possible.
+cns11643-plane14@*
+iconv_mktbl -C -p 0xE -o cns11643-plane14.c cns11643.txt@*
+iconv_mktbl -p 0xE -o cns11643-plane14.cct cns11643.txt
+@end itemize
+@itemize
 @item
-The encoding's CES converter name (the @emph{CES} field). Only one CES
-converter is allowed.
+koi8-r@*
+iconv_mktbl -Co koi8-r.c koi8-r.txt@*
+iconv_mktbl -o koi8-r.cct koi8-r.txt
+@end itemize
+@itemize
 @item
-The whitespace-separated list of CCS table names which are used by the
-encoding (the @emph{CCS} field).
+koi8-u@*
+iconv_mktbl -Co koi8-u.c koi8-u.txt@*
+iconv_mktbl -o koi8-u.cct koi8-u.txt
+@end itemize
+@itemize
 @item
-The whitespace-separated list of aliases names (the @emph{ENCODING}
-field).
+us-ascii@*
+iconv_mktbl -Cao us-ascii.c iso-8859-1.txt@*
+iconv_mktbl -ao us-ascii.cct iso-8859-1.txt
 @end itemize
 @*
-Note all names in the 'encoding.deps' file have to have the normalized
-form.
+Source files for CCS tables can be taken from at least two places:
 @*
-Each @emph{CES_DEPENDENCIES} section's entry describes dependencies of
-one CES converted. For example, the @emph{euc} CES converter depends on
-the @emph{table} and the @emph{us_ascii} CES converter since the
-@emph{euc} CES converter uses them. This means, that both @emph{table}
-and @emph{us_ascii} CES converters should be linked if the @emph{euc}
-CES converter is enabled.
-
-@*
-The @emph{CES_DEPENDENCIES} section defines the following:
-
-@itemize @bullet
+@enumerate
 @item
-the CES converter name for which the dependencies are defined in this
-entry (the @emph{CES} field);
-
+http://www.unicode.org/Public/MAPPINGS/ contains a lot of encoding
+map files.
 @item
-the whitespace-separated list of CES converters which are needed for
-this CES converter (the @emph{USED_CES} field).
-@end itemize
+http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains original
+iconv sources and encoding map files.
+@end enumerate
 @*
-The 'mktbl.pl' Perl script automatically solves the following tasks.
-
-@itemize @bullet
-@item
-User works with the iconv library in terms of encodings and doesn't know
-anything about CES converters and CCS tables. The script automatically
-generates code which enables all needed CES converters and CCS tables
-for all encodings, which were enabled by the user.
-
-@item
-The CES converters may have dependencies and the script automatically
-generates the code which handles these dependencies.
-
-@item
-The list of encoding's aliases is also automatically generated.
+The following are URLs where source files for some of the CCS tables
+are found:
+@itemize
 @item
-The script uses a lot of macros in order to enable only the minimum set
-of code/data which is needed to support the requested encodings in the
-requested directions.
+big5:@*
+http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
 @end itemize
-@*
-The 'mktbl.pl' Perl script is intended to interpret the 'encoding.deps'
-file and generates the following files.
-
-@itemize @bullet
+@itemize
 @item
-@emph{lib/encnames.h} - this header files contains macro definitions for all
-encoding names
+cns11643_plane14, cns11643_plane1 and cns11643_plane2:@*
+http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
+@end itemize
+@itemize
 @item
-@emph{lib/aliasesbi.c} - the array of encoding names and aliases.
The array
-is used to find the name of requested encoding by it's alias.
+cp775, cp850, cp852, cp855, cp866:@*
+http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
+@end itemize
+@itemize
 @item
-@emph{ces/cesbi.c} - this file defines two arrays
-(@code{_iconv_from_ucs_ces} and @code{_iconv_to_ucs_ces}) which contain
-description of enabled "to UCS" and "from UCS" CES converters and the
-names of encodings which are supported by these CES converters.
+gb_2312_80:@*
+http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT
+@end itemize
+@itemize
 @item
-@emph{ces/cesbi.h} - this file contains the set of macros which defines
-the set of CES converters which should be enabled if only the set of
-enabled encodings is given (through macros defined in the
-@emph{newlib.h} file). Note, that one CES converter may handle several
-encodings.
+iso_8859_15, iso_8859_1, iso_8859_2, iso_8859_4, iso_8859_5:@*
+http://www.unicode.org/Public/MAPPINGS/ISO8859/
+@end itemize
+@itemize
 @item
-@emph{ces/cesdeps.h} - the CES converters dependencies are handled in
-this file.
+jis_x0201, jis_x0208_1983, jis_x0212_1990, shift_jis@*
+http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
+@end itemize
+@itemize
 @item
-@emph{ccs/ccsdeps.h} - the array of linked-in CCS tables is defined
-here.
+koi8_r@*
+http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
+@end itemize
+@itemize
 @item
-@emph{ccs/ccsnames.h} - this header files contains macro definitions for all
-CCS names.
+ksx1001@*
+http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
+@end itemize
+@itemize
 @item
-@emph{encoding.aliases} - the list of supported encodings and their
-aliases which is intended for the Newlib configure scripts in order to
-handle the iconv-related configure script options.
+koi8-u can be given from original FreeBSD iconv library distribution
+http://www.dante.net/staff/konstantin/FreeBSD/iconv/
 @end itemize
-
-
-
+@*
+Moreover, http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains a
+lot of additional CCS tables that you can use with Newlib (iso-2022 and
+RFC1345 encodings).
 @page
-@node How to add new encoding
-@section How to add new encoding
+@node Adding new converter
+@section Adding a new iconv converter
+@*
+The following steps should be taken to add a new iconv converter:
+
 @*
-At first, the new encoding should be broken down to CCS and CES. Then,
-the process of adding new encoding is split to the following activities.
-
 @enumerate
-@item Generate the .cct CCS file and the .c source file for the new
-encoding's CCS (if it isn't already present). To do this, the CCS source
-file should be had and the 'mktbl.pl' script should be used.
-
-@item Write the corresponding CES converter (if it isn't already
-present). Use the existing CES converters as an example.
-
 @item
-Add the corresponding entries to the 'encoding.deps' file and regenerate
-the autogenerated "glue" files using the 'mkdeps.pl' script.
-
-@item
-Don't forget to add entries to the newlib/newlib.hin file.
-
-@item
-Of course, the 'Makefile.am'-s should also be updated (if new files were
-added) and the 'Makefile.in'-s should be regenerated using the correct
-version of 'automake'.
-
-@item
-Don't forget to update the documentation (the list of
-supported encodings and CES converters).
+Converter's name and aliases list should be added to
+the iconv/charset.aliases file
+@item
+All iconv converters are protected by a _ICONV_CONVERTER_XXX
+macro, where XXX is converter name. This protection macro should be added to
+newlib/newlib.hin file.
+@item
+Converter's name and aliases should be also registered in _iconv_builtin_aliases
+table in iconv/lib/bialiasesi.c. The list should be protected by
+the corresponding macro mentioned above.
+@item
+If a new converter is just a CCS table, the corresponding .cct and .c files
+should be added to the iconv/ccs/ subdirectory. The name of the files
+should be equivalent to the normalized encoding name. The 'iconv_mktbl'
+Perl script (found in iconv/ccs) may
+be used to generate such files. The file's name should be added to
+iconv/ccs/Makefile.am and iconv/ccs/binary/Makefile.am files and then
+automake should be used to regenerate the Makefile.in files.
+@item
+If a new converter has a CES algorithm, the appropriate file should be
+added to the
+iconv/ces/ subdirectory. The name of the file again should be equivalent
+to the normalized
+encoding name.
+@item
+If a converter is EUC or ISO-2022-family CES, then the converter
+is just an array with a list of used CCS (See ccs/euc-jp.c for example). This
+is because iconv already has EUC and ISO-2022 support. Used CCS tables should
+be provided in iconv/ccs/.
+@item
+If a converter isn't EUC or ISO-2022-based CCS, the following two functions
+should be provided (see utf-8.c for example):
+@enumerate
-
+@item A function to convert from new CES to UCS-32;
+@item A function to convert from UCS-32 to new CES;
+@item An 'init' function;
+@item A 'close' function;
+@item A 'reset' function to reset shift state for stateful CES.
 @end enumerate
-In case a new encoding doesn't fit to the CES/CCS decomposition model or
-it is desired to add the specialized (non UCS-based) conversion support,
-the Newlib iconv library code should be upgraded.
-
-
-
-
-
-@page
-@node The locale support interfaces
-@section The locale support interfaces
-@*
-The newlib iconv library also has some interface functions (besides the
-@code{iconv}, @code{iconv_open} and @code{iconv_close} interfaces) which
-are intended for the Locale subsystem. All the locale-related code is
-placed in the @emph{lib/iconvnls.c} file.
-
-@*
-The following is the description of the locale-related interfaces:
-
-@itemize @bullet
-@item
-@code{_iconv_nls_open} - opens two iconv descriptors for "CCS ->
-wchar_t" and "wchar_t -> CCS" conversions. The normalized CCS name is
-passed in the function parameters. The @emph{wchar_t} characters encoding is
-either ucs_2_internal or ucs_4_internal depending on size of
-@emph{wchar_t}.
-
-@item
-@code{_iconv_nls_conv} - the function is similar to the @code{iconv}
-functions, but if there is no character in the output encoding which
-corresponds to the character in the input encoding, the default
-conversion isn't performed (the @code{iconv} function sets such output
-characters to the @kbd{?} symbol and this is the behavior, which is
-specified in SUSv3).
-
-@item
-@code{_iconv_nls_get_state} - returns the current encoding's shift state
-(the @code{mbstate_t} object).
-
+@*
+All these functions are registered into a 'struct iconv_ces_desc' object.
+The name of the object should be _iconv_ces_module_XXX, where XXX is the
+name of the converter.
 @item
-@code{_iconv_nls_set_state} sets the current encoding's shift state (the
-@code{mbstate_t} object).
-
-@item
-@code{_iconv_nls_is_stateful} - checks whether the encoding is stateful
-or stateless.
-
-@item
-@code{_iconv_nls_get_mb_cur_max} - returns the maximum length (the
-maximum bytes number) of the encoding's characters.
-@end itemize
-
-
-
-
-@page
-@node Contact
-@section Contact
-@*
-The author of the original BSD iconv library (Alexander Chuguev) no longer
-supports that code.
+For CES converters the correspondent 'struct iconv_ces_desc' reference should
+be added into iconv/lib/bices.c file.
 @*
-Any questions regarding the iconv library may be forwarded to
-Artem B. Bityuckiy (dedekind@@oktetlabs.ru or dedekind@@mail.ru) as
-well as to the public Newlib mailing list.
+For CCS converters, the corresponding table reference should be added into
+the iconv/lib/biccs.c file.
+@end enumerate