Diffstat (limited to 'newlib/libc/iconv/iconv.tex')
-rw-r--r-- | newlib/libc/iconv/iconv.tex | 1710 |
1 files changed, 0 insertions, 1710 deletions
diff --git a/newlib/libc/iconv/iconv.tex b/newlib/libc/iconv/iconv.tex deleted file mode 100644 index 305c3ce16..000000000 --- a/newlib/libc/iconv/iconv.tex +++ /dev/null @@ -1,1710 +0,0 @@ -@node Iconv -@chapter Encoding conversions (@file{iconv.h}) - -This chapter describes the Newlib iconv library. -The iconv function declarations are in -@file{iconv.h}. - -@menu -* iconv:: Encoding conversion routines -* Introduction:: Introduction to iconv and encodings -* Supported encodings:: The list of currently supported encodings -* iconv design decisions:: General iconv library design issues -* iconv configuration:: iconv-related configure script options -* Encoding names:: How encodings are named. -* CCS tables:: CCS tables format and 'mktbl.pl' Perl script -* CES converters:: CES converters description -* The encodings description file:: The 'encoding.deps' file and 'mkdeps.pl' -* How to add new encoding:: The steps to add new encoding support -* The locale support interfaces:: Locale-related iconv interfaces -* Contact:: The author contact -@end menu - -@page -@include iconv/iconv.def - -@page -@node Introduction -@section Introduction -@findex encoding -@findex character set -@findex charset -@findex CES -@findex CCS -@* -The iconv library is intended to convert characters from one encoding to -another. It implements iconv(), iconv_open() and iconv_close() -calls, which are defined by the Single Unix Specification. - -@* -In addition to these user-level interfaces, the iconv library also has -several useful interfaces which are needed to support coding -capabilities of the Newlib Locale infrastructure. Since Locale -support also needs to -convert various character sets to and from the @emph{wide characters -set}, the iconv library shares its capabilities with the Newlib Locale -subsystem. Moreover, the iconv library supports several features which are -only needed for the Locale infrastructure (for example, the MB_CUR_MAX value).
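The three user-level calls named above are normally used together: open a descriptor, convert, close. The following is a minimal sketch, not code from Newlib. The helper name convert_buffer and its parameters are hypothetical. It assumes both requested encodings were enabled at configure time and that the whole input fits in the output buffer.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes at `in` from encoding `from`
   to encoding `to`, writing at most outcap bytes to `out`.
   Returns the number of bytes written, or (size_t)-1 on any failure. */
size_t convert_buffer(const char *to, const char *from,
                      char *in, size_t inlen, char *out, size_t outcap)
{
    iconv_t cd;
    char *inp = in;
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;
    size_t rc;

    assert(to != NULL && from != NULL);

    cd = iconv_open(to, from);          /* note the order: "to" comes first */
    if (cd == (iconv_t)-1)
        return (size_t)-1;              /* encoding unknown or not enabled */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;              /* invalid input or output too small */
    return outcap - outleft;
}
```

A caller that must handle large inputs would instead loop on iconv() and drain the output buffer whenever it fills up; that is left out here to keep the sketch short.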
- -@* -The Newlib iconv library was created using concepts from another iconv -library implemented by Konstantin Chuguev (ver 2.0). The Newlib iconv library -was rewritten from scratch and contains a lot of improvements with respect to -the original iconv library. - -@* -Terms like @dfn{encoding} or @dfn{character set} aren't well defined and -are often used with various meanings. The following are the definitions of terms -which are used in this documentation as well as in the iconv library -implementation: - -@itemize @bullet -@item -@dfn{encoding} - a machine representation of characters by means of bits; - -@item -@dfn{Character Set} or @dfn{Charset} - just a collection of -characters, i.e. the encoding is the machine representation of the character set; - -@item -@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a character set to a -set of integers (@dfn{character codes}); - -@item -@dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character -codes to a sequence of bytes; -@end itemize - -@* -Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8, -ASCII, etc. Encodings are formed by the following chain of steps: - -@enumerate -@item -A user has a set of characters which are specific to his or her language (character set). - -@item -Each character from this set is uniquely numbered, resulting in a CCS. - -@item -Each number from the CCS is converted to a sequence of bits or bytes by means -of a CES, forming an encoding. Thus, a CES may be considered as a -function of a CCS which produces an encoding. Note that a CES may be -applied to more than one CCS. -@end enumerate - -@* -Thus, an encoding may be considered as one or more CCS + CES. - -@* -Sometimes, there is no CES and in such cases the encoding is equivalent -to the CCS, e.g. KOI8-R or ASCII. - -@* -An example of a more complicated encoding is UTF-8 which is the UCS -(or Unicode) CCS plus the UTF-8 CES.
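To make the "CCS plus CES" split concrete for UTF-8: the CCS step assigns a UCS code to a character, and the CES step turns that code into bytes. The sketch below shows only the CES step, following the byte layout of RFC 3629. It is an illustration and not the Newlib utf_8 converter; the function name utf8_ces_encode is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration of the UTF-8 CES: map one UCS code to bytes.
   Returns the number of bytes written to out (1..4), or 0 if the code
   cannot be encoded (surrogate range or above 0x10FFFF). */
size_t utf8_ces_encode(uint32_t ucs, unsigned char *out)
{
    assert(out != NULL);

    if (ucs < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)ucs;
        return 1;
    }
    if (ucs < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (ucs >> 6));
        out[1] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 2;
    }
    if (ucs >= 0xD800 && ucs <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (ucs < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (ucs >> 12));
        out[1] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 3;
    }
    if (ucs <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (ucs >> 18));
        out[1] = (unsigned char)(0x80 | ((ucs >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, the UCS code 0x0416 (CYRILLIC CAPITAL LETTER ZHE) becomes the two bytes 0xD0 0x96. An encoding such as KOI8-R needs no comparable step, because its single-byte code is already the byte that is stored.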
- -@* -The following is a brief list of iconv library features: -@itemize -@item -Generic architecture; -@item -Locale infrastructure support; -@item -Automatic generation of the program code which handles -CES/CCS/Encoding/Names/Aliases dependencies; -@item -The ability to choose size- or speed-optimized -configuration; -@item -The ability to exclude a lot of unneeded code and data from the linking step. -@end itemize - - - - -@page -@node Supported encodings -@section Supported encodings -@findex big5 -@findex cp775 -@findex cp850 -@findex cp852 -@findex cp855 -@findex cp866 -@findex euc_jp -@findex euc_kr -@findex euc_tw -@findex iso_8859_1 -@findex iso_8859_10 -@findex iso_8859_11 -@findex iso_8859_13 -@findex iso_8859_14 -@findex iso_8859_15 -@findex iso_8859_2 -@findex iso_8859_3 -@findex iso_8859_4 -@findex iso_8859_5 -@findex iso_8859_6 -@findex iso_8859_7 -@findex iso_8859_8 -@findex iso_8859_9 -@findex iso_ir_111 -@findex koi8_r -@findex koi8_ru -@findex koi8_u -@findex koi8_uni -@findex ucs_2 -@findex ucs_2_internal -@findex ucs_2be -@findex ucs_2le -@findex ucs_4 -@findex ucs_4_internal -@findex ucs_4be -@findex ucs_4le -@findex us_ascii -@findex utf_16 -@findex utf_16be -@findex utf_16le -@findex utf_8 -@findex win_1250 -@findex win_1251 -@findex win_1252 -@findex win_1253 -@findex win_1254 -@findex win_1255 -@findex win_1256 -@findex win_1257 -@findex win_1258 -@* -The following is the list of currently supported encodings. The first column -corresponds to the encoding name, the second column is the list of aliases, -the third column is its CES and CCS components names, and the fourth column -is a short description. - -@multitable @columnfractions .20 .26 .24 .30 -@item -Name -@tab -Aliases -@tab -CES/CCS -@tab -Short description -@item -@tab -@tab -@tab - - -@item -big5 -@tab -csbig5, big_five, bigfive, cn_big5, cp950 -@tab -table_pcs / big5, us_ascii -@tab -The encoding for the Traditional Chinese.
- - -@item -cp775 -@tab -ibm775, cspc775baltic -@tab -table / cp775 -@tab -The updated version of CP 437 that supports the Baltic languages. - - -@item -cp850 -@tab -ibm850, 850, cspc850multilingual -@tab -table / cp850 -@tab -IBM 850 - the updated version of CP 437 where several Latin 1 characters have been -added instead of some less-often used characters like the line-drawing -and the Greek ones. - - -@item -cp852 -@tab -ibm852, 852, cspcp852 -@tab -@tab -IBM 852 - the updated version of CP 437 where several Latin 2 characters have been added -instead of some less-often used characters like the line-drawing and the Greek ones. - - -@item -cp855 -@tab -ibm855, 855, csibm855 -@tab -table / cp855 -@tab -IBM 855 - the updated version of CP 437 that supports Cyrillic. - - -@item -cp866 -@tab -866, IBM866, CSIBM866 -@tab -table / cp866 -@tab -IBM 866 - the updated version of CP 855 which more closely follows the logical Russian alphabet -ordering of the alternative variant that is preferred by many Russian users. - - -@item -euc_jp -@tab -eucjp -@tab -euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990 -@tab -EUC-JP - The EUC for Japanese. - - -@item -euc_kr -@tab -euckr -@tab -euc / ksx1001 -@tab -EUC-KR - The EUC for Korean. - - -@item -euc_tw -@tab -euctw -@tab -euc / cns11643_plane1, cns11643_plane2, cns11643_plane14 -@tab -EUC-TW - The EUC for Traditional Chinese. - - -@item -iso_8859_1 -@tab -iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1 -@tab -table / iso_8859_1 -@tab -ISO 8859-1:1987 - Latin 1, West European. - - -@item -iso_8859_10 -@tab -iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10 -@tab -table / iso_8859_10 -@tab -ISO 8859-10:1992 - Latin 6, Nordic. - - -@item -iso_8859_11 -@tab -iso8859_11, iso885911 -@tab -table / iso_8859_11 -@tab -ISO 8859-11 - Thai.
- - -@item -iso_8859_13 -@tab -iso_8859_13:1998, iso8859_13, iso885913 -@tab -table / iso_8859_13 -@tab -ISO 8859-13:1998 - Latin 7, Baltic Rim. - - -@item -iso_8859_14 -@tab -iso_8859_14:1998, iso885914, iso8859_14 -@tab -table / iso_8859_14 -@tab -ISO 8859-14:1998 - Latin 8, Celtic. - - -@item -iso_8859_15 -@tab -iso885915, iso_8859_15:1998, iso8859_15 -@tab -table / iso_8859_15 -@tab -ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1. - - -@item -iso_8859_2 -@tab -iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2 -@tab -table / iso_8859_2 -@tab -ISO 8859-2:1987 - Latin 2, East European. - - -@item -iso_8859_3 -@tab -iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593 -@tab -table / iso_8859_3 -@tab -ISO 8859-3:1988 - Latin 3, South European. - - -@item -iso_8859_4 -@tab -iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4 -@tab -table / iso_8859_4 -@tab -ISO 8859-4:1988 - Latin 4, North European. - - -@item -iso_8859_5 -@tab -iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic -@tab -table / iso_8859_5 -@tab -ISO 8859-5:1988 - Cyrillic. - - -@item -iso_8859_6 -@tab -iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596 -@tab -table / iso_8859_6 -@tab -ISO 8859-6:1987 - Arabic. - - -@item -iso_8859_7 -@tab -iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597 -@tab -table / iso_8859_7 -@tab -ISO 8859-7:1987 - Greek. - - -@item -iso_8859_8 -@tab -iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598 -@tab -table / iso_8859_8 -@tab -ISO 8859-8:1988 - Hebrew. - - -@item -iso_8859_9 -@tab -iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599 -@tab -table / iso_8859_9 -@tab -ISO 8859-9:1989 - Latin 5, Turkish.
- - -@item -iso_ir_111 -@tab -ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic -@tab -table / iso_ir_111 -@tab -ISO IR 111/ECMA Cyrillic. - - -@item -koi8_r -@tab -cskoi8r, koi8r, koi8 -@tab -table / koi8_r -@tab -RFC 1489 Cyrillic. - - -@item -koi8_ru -@tab -koi8ru -@tab -table / koi8_ru -@tab -The obsolete Ukrainian. - - -@item -koi8_u -@tab -koi8u -@tab -table / koi8_u -@tab -RFC 2319 Ukrainian. - - -@item -koi8_uni -@tab -koi8uni -@tab -table / koi8_uni -@tab -KOI8 Unified. - - -@item -ucs_2 -@tab -ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode -@tab -ucs_2 / (UCS) -@tab -ISO-10646-UCS-2. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -ucs_2_internal -@tab -ucs2_internal, ucs_2internal, ucs2internal -@tab -ucs_2_internal / (UCS) -@tab -ISO-10646-UCS-2 in system byte order. -NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -ucs_2be -@tab -ucs2be -@tab -ucs_2 / (UCS) -@tab -Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2). -Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -ucs_2le -@tab -ucs2le -@tab -ucs_2 / (UCS) -@tab -Little Endian version of ISO-10646-UCS-2. -Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -ucs_4 -@tab -ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4 -@tab -ucs_4 / (UCS) -@tab -ISO-10646-UCS-4. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -ucs_4_internal -@tab -ucs4_internal, ucs_4internal, ucs4internal -@tab -ucs_4_internal / (UCS) -@tab -ISO-10646-UCS-4 in system byte order. -NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -ucs_4be -@tab -ucs4be -@tab -ucs_4 / (UCS) -@tab -Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4). -Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported). 
- - -@item -ucs_4le -@tab -ucs4le -@tab -ucs_4 / (UCS) -@tab -Little Endian version of ISO-10646-UCS-4. -Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -us_ascii -@tab -ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii -@tab -us_ascii / (ASCII) -@tab -7-bit ASCII. - - -@item -utf_16 -@tab -utf16 -@tab -utf_16 / (UCS) -@tab -RFC 2781 UTF-16. The very first NBSP code in stream is interpreted as BOM. - - -@item -utf_16be -@tab -utf16be -@tab -utf_16 / (UCS) -@tab -Big Endian version of RFC 2781 UTF-16. -NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -utf_16le -@tab -utf16le -@tab -utf_16 / (UCS) -@tab -Little Endian version of RFC 2781 UTF-16. -NBSP is always interpreted as NBSP (BOM isn't supported). - - -@item -utf_8 -@tab -utf8 -@tab -utf_8 / (UCS) -@tab -RFC 3629 UTF-8. - - -@item -win_1250 -@tab -cp1250 -@tab -@tab -Win-1250 - Croatian. - - -@item -win_1251 -@tab -cp1251 -@tab -table / win_1251 -@tab -Win-1251 - Cyrillic. - - -@item -win_1252 -@tab -cp1252 -@tab -table / win_1252 -@tab -Win-1252 - Latin 1. - - -@item -win_1253 -@tab -cp1253 -@tab -table / win_1253 -@tab -Win-1253 - Greek. - - -@item -win_1254 -@tab -cp1254 -@tab -table / win_1254 -@tab -Win-1254 - Turkish. - - -@item -win_1255 -@tab -cp1255 -@tab -table / win_1255 -@tab -Win-1255 - Hebrew. - - -@item -win_1256 -@tab -cp1256 -@tab -table / win_1256 -@tab -Win-1256 - Arabic. - - -@item -win_1257 -@tab -cp1257 -@tab -table / win_1257 -@tab -Win-1257 - Baltic. - - -@item -win_1258 -@tab -cp1258 -@tab -table / win_1258 -@tab -Win-1258 - Vietnamese.
-@end multitable - - - - - -@page -@node iconv design decisions -@section iconv design decisions -@findex CCS table -@findex CES converter -@findex Speed-optimized tables -@findex Size-optimized tables -@* -The first iconv library design issue arises when considering the -following two design approaches: - -@enumerate -@item -Have modules which implement conversion from the encoding A to the encoding B -and vice versa, i.e., one conversion module relates to any two encodings. -@item -Have modules which implement conversion from the encoding A to the fixed -encoding C and vice versa, i.e., one conversion module relates to any -one encoding A and one fixed encoding C. In this case, to convert from -the encoding A to the encoding B, two modules are needed (in order to convert -from A to C and then from C to B). -@end enumerate - -@* -Obviously, there is a tradeoff between commonality/flexibility and -efficiency: the first method is more efficient since it converts -directly; however, it is less flexible since a distinct module is needed -for each encoding pair. - -@* -The Newlib iconv model uses the second method and always converts through the 32-bit -UCS but its design also allows one to write specialized conversion -modules if the conversion speed is critical. - -@* -The second design issue is how to break down (decompose) encodings. -The Newlib iconv library uses the fact that any encoding may be -considered as one or more CCS plus a CES. It also decomposes its -conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS -tables}. CCS tables map CCS to UCS and vice versa; the CES converters -map CCS to the encoding and vice versa. - -@* -As an example, let's consider the conversion from the big5 encoding to -the EUC-TW encoding. The big5 encoding may be decomposed into the ASCII and BIG5 -CCS-es plus the BIG5 CES. EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2, -and CNS11643_PLANE14 CCS-es plus the EUC CES.
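The second approach, with UCS as the fixed intermediate encoding C, can be sketched in a few lines of C. This is an illustration of the idea only: the two tables are tiny hand-written excerpts (two Cyrillic letters from KOI8-R and ISO 8859-5), not Newlib's generated CCS tables, and every identifier below is hypothetical. Both encodings have no separate CES, so the CCS code is the stored byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_MAPPING 0xFFFFu   /* illustrative "no such code" marker */

/* One CCS <-> UCS correspondence. */
struct pair { uint8_t ccs; uint16_t ucs; };

/* Tiny illustrative excerpts, NOT Newlib's real tables:
   Cyrillic capital A (U+0410) and small a (U+0430). */
static const struct pair koi8_r_excerpt[]     = { {0xE1, 0x0410}, {0xC1, 0x0430} };
static const struct pair iso_8859_5_excerpt[] = { {0xB0, 0x0410}, {0xD0, 0x0430} };

/* Stage 1 (A -> C): source CCS code to UCS. */
static uint16_t to_ucs(const struct pair *t, size_t n, uint8_t code)
{
    size_t i;

    assert(t != NULL);
    if (code < 0x80)
        return code;               /* both encodings share the ASCII half */
    for (i = 0; i < n; i++)
        if (t[i].ccs == code)
            return t[i].ucs;
    return NO_MAPPING;
}

/* Stage 2 (C -> B): UCS to target CCS code, or -1 if unmapped. */
static int from_ucs(const struct pair *t, size_t n, uint16_t ucs)
{
    size_t i;

    assert(t != NULL);
    if (ucs < 0x80)
        return (int)ucs;
    for (i = 0; i < n; i++)
        if (t[i].ucs == ucs)
            return t[i].ccs;
    return -1;
}

/* A -> C -> B: neither table knows anything about the other encoding. */
int koi8_r_to_iso_8859_5(uint8_t code)
{
    uint16_t ucs = to_ucs(koi8_r_excerpt, 2, code);

    if (ucs == NO_MAPPING)
        return -1;
    return from_ucs(iso_8859_5_excerpt, 2, ucs);
}
```

Adding a third encoding in this scheme means adding one more table, not one more module per existing encoding, which is the flexibility the second approach buys at the cost of the extra hop through UCS.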
- -@* -The euc_tw -> big5 conversion is performed as follows: - -@enumerate -@item -The EUC converter performs the EUC-TW encoding to the corresponding CCS-es -transformation (CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14 -CCS-es); -@item -The obtained CCS codes are transformed to the UCS codes using the CNS11643_PLANE1, -CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables; -@item -The resulting UCS codes are transformed to the ASCII and BIG5 codes using -the corresponding CCS tables; -@item -The obtained CCS codes are transformed to the big5 encoding using the corresponding -CES converter. -@end enumerate - -@* -Analogously, the backward conversion is performed as follows: - -@enumerate -@item -The BIG5 converter performs the big5 encoding to the corresponding CCS-es transformation -(the ASCII and BIG5 CCS-es); -@item -The obtained CCS codes are transformed to the UCS codes using the ASCII and BIG5 CCS tables; -@item -The resulting UCS codes are transformed to the CNS11643_PLANE1, CNS11643_PLANE2 and -CNS11643_PLANE14 codes using the corresponding CCS tables; -@item -The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding -CES converter. -@end enumerate - -@* -Note, the above is just an example and real names (which are implemented -in the Newlib iconv) of the CES converters and the CCS tables are slightly different. - -@* -The third design issue also relates to flexibility. Obviously, it isn't -desirable to always link all the CES converters and the CCS tables to the library -but instead, we want to be able to load the needed converters and tables -dynamically on demand. This isn't a problem on "big" machines such as -a PC, but it may be very problematic within "small" embedded systems. - -@* -Since the CCS tables are just data, it is possible to load them -dynamically from external files. The CES converters, on the other hand, -are algorithms with some code so a dynamic library loading -capability is required.
- -@* -Apart from possible restrictions applied by embedded systems (small -RAM for example), Newlib itself has no dynamic library support and -therefore, all the CES converters which will ever be used must be linked into -the library. However, loading of the dynamic CCS tables is possible and is -implemented in the Newlib iconv library. It may be enabled via the Newlib -configure script options. - -@* -The next design issue is fine-tuning the iconv library -configuration. One important ability is for iconv to not link all its -converters and tables (if dynamic loading is not enabled) but instead, -enable only those encodings which are specified at configuration -time (see the section about the configure script options). - -@* -In addition, the Newlib iconv library configure options distinguish between -conversion directions. This means that not only are supported encodings -selectable, the conversion direction is as well. For example, if a user wants -the configuration which allows conversions from UTF-8 to UTF-16 and -doesn't plan using the "UTF-16 to UTF-8" conversions, he or she can -enable only -this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will -be included), thus saving some memory (note that this technique allows -excluding one half of a CCS table, which may be quite large, from linking). - -@* -One more design aspect is the speed- and size-optimized tables. Users can -select between them using configure script options. The -speed-optimized CCS tables are the same as the size-optimized ones in -case of 8-bit CCS (e.g., KOI8-R), but for 16-bit CCS-es the size-optimized -CCS tables may be 1.5 to 2 times smaller than the speed-optimized ones. On the -other hand, conversion with speed tables is several times faster.
- -@* -It's worth stressing that new encoding support can't be -dynamically added into an already compiled Newlib library, even if it -needs only an additional CCS table and iconv is configured to use -the external files with CCS tables (this isn't a fundamental restriction -and the possibility to add new Table-based encoding support dynamically, by -means of just adding a new .cct file, may be easily added). - -@* -Theoretically, the compiled-in CCS tables should be more appropriate for -embedded systems than dynamically loaded CCS tables. This is because the compiled-in tables are read-only and can be placed in ROM -whereas dynamic loading requires RAM. Moreover, in the current iconv -implementation, a distinct copy of the dynamic CCS file is loaded for each opened iconv descriptor even in case of the same encoding. -This means, for example, that if two iconv descriptors for -"KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of -the koi8-r .cct file will be loaded (actually, iconv loads only the needed part -of these files). On the other hand, in the case of compiled-in CCS tables, there will always be only one copy. - -@page -@node iconv configuration -@section iconv configuration -@findex iconv configuration -@findex --enable-newlib-iconv-encodings -@findex --enable-newlib-iconv-from-encodings -@findex --enable-newlib-iconv-to-encodings -@findex --enable-newlib-iconv-external-ccs -@findex NLSPATH -@* -To enable an encoding, the @emph{--enable-newlib-iconv-encodings} configure -script option should be used. This option accepts a comma-separated list -of @emph{encodings} that should be enabled. The option enables each encoding in both -("to" and "from") directions. - -@* -The @option{--enable-newlib-iconv-from-encodings} configure script option enables -"from" support for each encoding that was passed to it.
- -@* -The @option{--enable-newlib-iconv-to-encodings} configure script option enables -"to" support for each encoding that was passed to it. - -@* -Example: if a user plans only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and -"KOI8-R -> UCS-2" conversions, the optimal way (minimal iconv -code and data will be linked) is to configure Newlib with the following -options: -@* -@code{--enable-newlib-iconv-encodings=UTF-8 ---enable-newlib-iconv-from-encodings=KOI8-R ---enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5} -@* -which is the same as -@* -@code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8 ---enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8} -@* -A user may also just use the -@* -@code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2} -@* -configure script option, but it is less optimal since there will be -some unneeded data and code. - -@* -The @option{--enable-newlib-iconv-external-ccs} option enables iconv's -capabilities to work with the external CCS files. - -@* -The @option{--enable-target-optspace} Newlib configure script option also affects -the iconv library. If this option is present, the library uses the size -optimized CCS tables. This means that only the size-optimized CCS -tables will be linked or, if the -@option{--enable-newlib-iconv-external-ccs} configure script option was used, -the iconv library will load the size-optimized tables. If the -@option{--enable-target-optspace} configure script option is disabled, -the speed-optimized CCS tables are used. - -@* -Note: .cct files are searched by iconv_open in the $NLSPATH/iconv_data/ directory. -Thus, the NLSPATH environment variable should be set. - - - - - -@page -@node Encoding names -@section Encoding names -@findex encoding name -@findex encoding alias -@findex normalized name -@* -Each encoding has one @dfn{name} and a number of @dfn{aliases}.
When -a user works with the iconv library (i.e., when the @code{iconv_open} call -is used) either the name or an alias may be used. The same applies when encoding -names are used in configure script options. - -@* -Names and aliases may be specified in any case (small or capital -letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol. -Also, when working with the iconv library, - -@* -Internally the Newlib iconv library always converts aliases to names. It -also converts names and aliases into the @dfn{normalized} form which means -that all capital letters are converted to small letters and the @kbd{-} -symbols are converted to @kbd{_} symbols. - - - - -@page -@node CCS tables -@section CCS tables -@findex Size-optimized CCS table -@findex Speed-optimized CCS table -@findex mktbl.pl Perl script -@findex .cct files -@findex The CCT tables source files -@findex CCS source files -@* -The iconv library stores files with CCS tables in the @emph{ccs/} -subdirectory. The CCS tables for any CCS may be kept in two forms - in the binary form -(@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in form -of compilable .c source files. The .cct files are only used when the -@option{--enable-newlib-iconv-external-ccs} configure script option is enabled. -The .c files are linked to the Newlib library if the corresponding -encoding is enabled. - -@* -As stated earlier, the Newlib iconv library performs all -conversions through the 32-bit UCS, but the codes which are used -in most CCS-es fit into the first 16-bit subset of the 32-bit UCS set. -Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is -used instead of the 32-bit UCS-4. - -@* -CCS tables may be 8- or 16-bit wide. 8-bit CCS tables map 8-bit CCS to -16-bit UCS-2 and vice versa while 16-bit CCS tables map -16-bit CCS to 16-bit UCS-2 and vice versa. -8-bit tables are small (in size) while 16-bit tables may be fairly large.
-Because of this, 16-bit CCS tables may be -either speed- or size-optimized. Size-optimized CCS tables are -smaller than speed-optimized ones, but the conversion process is -slower if the size-optimized CCS tables are used. 8-bit CCS tables have only -a speed-optimized variant. - -Each CCS table (both speed- and size-optimized) consists of -@dfn{from_ucs} and @dfn{to_ucs} subtables. The "from_ucs" subtable maps -UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to -UCS-2 codes. - -@* -Almost all 16-bit CCS tables contain fewer than 0xFFFF codes and -a lot of gaps exist. - -@subsection Speed-optimized tables format -@* -In case of 8-bit speed-optimized CCS tables the "to_ucs" subtables format is -trivial - it is just the array of 256 16-bit UCS codes. Therefore, a -UCS-2 code @emph{Y} corresponding to a CCS code @emph{X} is calculated -as @emph{Y = to_ucs[X]}. - -@* -Obviously, the simplest way to create the "from_ucs" table or the -16-bit "to_ucs" table is to use the huge 16-bit array like in case -of the 8-bit "to_ucs" table. But almost all the 16-bit CCS tables contain -fewer than 0xFFFF code maps and this fact may be exploited to reduce -the size of the CCS tables. - -@* -In this chapter the "UCS-2 -> CCS" 8-bit CCS table format is described. The -16-bit "CCS -> UCS-2" CCS table format is the same, except for the mapping -direction and the number of CCS bits. - -@* -In case of the 8-bit speed-optimized table the "from_ucs" subtable -corresponds to the "from_ucs" array and has the following layout: - -@* -from_ucs array: -@* -------------------------------------- -@* -0xFF mapping (2 bytes) (only for -8-bit table). -@* -------------------------------------- -@* -Heading block -@* -------------------------------------- -@* -Block 1 -@* -------------------------------------- -@* -Block 2 -@* -------------------------------------- -@* - ...
-@* -------------------------------------- -@* -Block N -@* -------------------------------------- - -@* -The 0x0000-0xFFFF 16-bit code range is divided into 256 code subranges. Each -subrange is represented by a 256-element @dfn{block} (256 1-byte -elements or 256 2-byte elements in case of 16-bit CCS table) with -elements which are equivalent to the CCS codes of this subrange. -If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be -absent and there will be fewer than 256 blocks. - -@* -Any element number @emph{m} of @dfn{the heading block} (which contains -256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange. -If the subrange contains some codes, the value of the @emph{m}-th element of -the heading block contains the offset of the corresponding block in the -"from_ucs" array. If there are no codes in the subrange, the heading -block element contains 0xFFFF. - -@* -If there are some gaps in a block, the corresponding block elements have -the 0xFF value. If there is a 0xFF code present in the CCS, its mapping -is defined in the first 2-byte element of the "from_ucs" array. - -@* -Having such a table format, the algorithm of searching the CCS code -@emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows. - -@* -@enumerate -@item If @emph{Y} is equivalent to the value of the first 2-byte element -of the "from_ucs" array, @emph{X} is 0xFF. Else, continue to search. - -@item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}. - -@item If the heading block element with number @emph{BlkN} is 0xFFFF, there -is no corresponding CCS code (error, wrong input data). Else, fetch the -"from_ucs" array index of the @emph{BlkN}-th block (call it @emph{BlkIdx}). - -@item Calculate the offset of the @emph{X} code in its block: -@emph{Xindex = Y & 0xFF} - -@item If the value of the @emph{Xindex}-th element of the block (which is equivalent to -@emph{from_ucs[BlkIdx+Xindex]}) is 0xFF, there is no corresponding -CCS code (error, wrong input data). Else, @emph{X = from_ucs[BlkIdx+Xindex]}. -@end enumerate - -@subsection Size-optimized tables format -@* -As stated above, size-optimized tables exist only for 16-bit CCS-es. -This is because the difference between the speed-optimized -and the size-optimized table sizes is too small in case of 8-bit CCS-es. - -@* -Formats of the "to_ucs" and "from_ucs" subtables are equivalent in case of -size-optimized tables. - -This section describes the format of the "UCS-2 -> CCS" size-optimized -CCS table. The format of the "CCS -> UCS-2" table is the same. - -The idea of the size-optimized tables is to split the UCS-2 codes -("from" codes) into @dfn{ranges} (a @dfn{range} is a number of consecutive UCS-2 codes). -Then CCS codes ("to" codes) are stored only for the codes from these -ranges. Distinct "from" codes which have no range (@dfn{unranged codes}) are stored -together with the corresponding "to" codes. - -@* -The following is the layout of the size-optimized table array: - -@* -size_arr array: -@* -------------------------------------- -@* -Ranges number (2 bytes) -@* -------------------------------------- -@* -Unranged codes number (2 bytes) -@* -------------------------------------- -@* -Unranged codes array index (2 bytes) -@* -------------------------------------- -@* -Ranges indexes (triads) -@* -------------------------------------- -@* -Ranges -@* -------------------------------------- -@* -Unranged codes array -@* -------------------------------------- - -@* -The @dfn{Ranges indexes} section of @emph{size_arr} helps to find -the offset of the needed range in the @emph{size_arr} and has -the following format (triads): -@* -the first code in range, the last code in range, range offset. - -@* -The array of these triads is sorted by the first element, therefore it is -possible to quickly find the needed range index. - -@* -Each range has the corresponding sub-array containing the "to" codes.
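Before going further into the size-optimized format, the five-step speed-optimized "from_ucs" lookup from the previous subsection can be sketched in C. The sketch rests on assumptions that the text does not settle: the whole array is treated as 16-bit elements (including block entries), the heading block occupies elements 1 to 256, and heading entries hold the element index of their block. The names below are hypothetical and do not come from the Newlib sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    FF_SLOT   = 0,        /* element 0: UCS-2 code that maps to CCS code 0xFF */
    HEAD_BASE = 1,        /* elements 1..256: the heading block (assumed)     */
    NO_BLOCK  = 0xFFFF,   /* heading entry for a subrange with no codes       */
    GAP       = 0xFF      /* block entry for an unmapped code                 */
};

/* Look up the 8-bit CCS code for UCS-2 code y.
   Returns the CCS code (0..0xFF), or -1 if y has no mapping. */
int from_ucs_lookup(const uint16_t *from_ucs, uint16_t y)
{
    unsigned blk_n, x_index;
    uint16_t blk_idx, x;

    assert(from_ucs != NULL);

    if (y == from_ucs[FF_SLOT])                 /* step 1: the 0xFF special case */
        return 0xFF;

    blk_n = (y & 0xFF00u) >> 8;                 /* step 2: block number          */

    blk_idx = from_ucs[HEAD_BASE + blk_n];      /* step 3: where the block lives */
    if (blk_idx == NO_BLOCK)
        return -1;

    x_index = y & 0xFFu;                        /* step 4: offset inside block   */

    x = from_ucs[blk_idx + x_index];            /* step 5: fetch, reject gaps    */
    return x == GAP ? -1 : (int)x;
}
```

The cost is two array reads and a few bit operations per character, which is why this format is the fast one; the price is that every populated 256-code subrange occupies a full block even when it holds a single code.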
The per-range -sub-arrays are stored in the place marked as "Ranges" in the layout -diagram. - -@* -The "Unranged codes array" contains pairs ("from" code, "to" code) for -each unranged code. The array of these pairs is sorted by "from" code -values, therefore it is possible to find the needed pair quickly. - -@* -Note that each range requires 6 bytes to form its index. If, for -example, there are two ranges (1 - 5 and 9 - 10), and one unranged code -(7), 12 bytes are needed for two range indexes and 4 bytes for the unranged -code (total 16). But it is better to join both ranges as 1 - 10 and -mark codes 6 and 8 as absent. In this case, only 6 additional bytes for the -range index and 4 bytes to mark codes 6 and 8 as absent are needed -(total 10 bytes). This optimization is done in the size-optimized tables. -Thus, ranges may contain small gaps. The absent codes in ranges are marked -as 0xFFFF. - -@* -Note, a pair of "from" codes is stored by means of unranged codes since -the number of bytes which are needed to form the range is greater than -the number of bytes to store two unranged codes (5 against 4). - -@* -The algorithm of searching for the CCS code -@emph{X} which corresponds to the UCS-2 code @emph{Y} (input) in the "UCS-2 -> -CCS" size-optimized table is as follows. - -@* -@enumerate -@item Try to find the corresponding triad in the "Ranges indexes" -section. Since we are searching in the sorted array, we can do it quickly -(divide by 2, compare, etc). - -@item If the triad is found, fetch the @emph{X} code from the corresponding -range array. If it is 0xFFFF, return an error. - -@item If there is no corresponding triad, search the @emph{X} code among the -sorted unranged codes. Return an error if nothing was found. -@end enumerate - -@subsection .cct and .c CCS Table files -@* -The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs" -speed-optimized tables.
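Returning briefly to the size-optimized search steps listed in the previous subsection, they can be sketched in C as two binary searches. The sketch assumes that the array is made of 16-bit elements laid out as in the size_arr diagram, that the triads of "Ranges indexes" start at element 3, and that range offsets and the unranged array index are element indexes into the same array. These are assumptions about details the text leaves open, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    RANGES_N     = 0,       /* number of ranges                         */
    UNRANGED_N   = 1,       /* number of unranged codes                 */
    UNRANGED_IDX = 2,       /* element index of the unranged pairs      */
    TRIADS       = 3,       /* first triad: {first, last, offset}       */
    ABSENT       = 0xFFFF   /* marks a small gap inside a range         */
};

/* Look up the "to" code for "from" code y.
   Returns the code, or -1 if y has no mapping. */
long size_tbl_lookup(const uint16_t *a, uint16_t y)
{
    size_t lo, hi;

    assert(a != NULL);

    /* Step 1: binary search over triads, sorted by first code. */
    lo = 0;
    hi = a[RANGES_N];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint16_t *t = &a[TRIADS + 3 * mid];

        if (y < t[0]) {
            hi = mid;
        } else if (y > t[1]) {
            lo = mid + 1;
        } else {
            /* Step 2: y is inside this range; fetch its "to" code. */
            uint16_t x = a[t[2] + (y - t[0])];
            return x == ABSENT ? -1 : (long)x;
        }
    }

    /* Step 3: binary search over ("from", "to") pairs, sorted by "from". */
    lo = 0;
    hi = a[UNRANGED_N];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint16_t *p = &a[a[UNRANGED_IDX] + 2 * mid];

        if (y < p[0])
            hi = mid;
        else if (y > p[0])
            lo = mid + 1;
        else
            return p[1];
    }
    return -1;
}
```

With the joined range 1 - 10 from the example above (codes 6 and 8 marked absent) plus one unranged code, a lookup of 7 succeeds inside the range, a lookup of 6 fails on the 0xFFFF marker, and anything outside 1 - 10 falls through to the unranged pairs. Each lookup costs a logarithmic number of comparisons instead of two direct reads, which is the speed given up for the smaller table.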
The .c source files for 16-bit CCS tables have -"to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size" -tables. - -@* -When .c files are compiled and used, all the 16-bit and 32-bit values -have the native endian format (Big Endian for the BE systems and Little -Endian for the LE systems) since they are compiled for the system before -they are used. - -@* -In case of .cct files, which are intended for dynamic CCS tables -loading, the CCS tables are stored either in LE or BE format. Since the -.cct files are generated by the 'mktbl.pl' Perl script, it is possible -to choose the endianness of the tables. It is also possible to store two -copies (both LE and BE) of the CCS tables in one .cct file. The default -.cct files (which come with the Newlib sources) have both LE and BE CCS -tables. The Newlib iconv library automatically chooses the needed CCS tables -(with appropriate endianness). - -@* -Note that the .cct files are only used when the -@option{--enable-newlib-iconv-external-ccs} configure option is used. - -@subsection The 'mktbl.pl' Perl script -@* -The 'mktbl.pl' script is intended to generate .cct and .c CCS table -files from the @dfn{CCS source files}. - -@* -The CCS source files are just text files which have one or more columns -with the CCS <-> UCS-2 codes mapping. For an example of a CCS -source file, see one of the files at the URLs given below. - -@* -The following table describes where the source files for CCS table files -provided by the Newlib distribution are located.
- -@multitable @columnfractions .25 .75 -@item -Name -@tab -URL - -@item -@tab - -@item -big5 -@tab -http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT - -@item -cns11643_plane1 -cns11643_plane14 -cns11643_plane2 -@tab -http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT - -@item -cp775 -cp850 -cp852 -cp855 -cp866 -@tab -http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/ - -@item -iso_8859_1 -iso_8859_2 -iso_8859_3 -iso_8859_4 -iso_8859_5 -iso_8859_6 -iso_8859_7 -iso_8859_8 -iso_8859_9 -iso_8859_10 -iso_8859_11 -iso_8859_13 -iso_8859_14 -iso_8859_15 -@tab -http://www.unicode.org/Public/MAPPINGS/ISO8859/ - -@item -iso_ir_111 -@tab -http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT - -@item -jis_x0201_1976 -jis_x0208_1990 -jis_x0212_1990 -@tab -http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT - -@item -koi8_r -@tab -http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT - -@item -koi8_ru -@tab -http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT - -@item -koi8_u -@tab -http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT - -@item -koi8_uni -@tab -http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT - -@item -ksx1001 -@tab -http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT - -@item -win_1250 -win_1251 -win_1252 -win_1253 -win_1254 -win_1255 -win_1256 -win_1257 -win_1258 -@tab -http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/ -@end multitable - -The CCS source files aren't distributed with Newlib because of license -restrictions on most of the Unicode.org files. - -The following are the 'mktbl.pl' options which were used to generate the .cct -files. Note that to generate the CCS table source files, the @option{-s} option -should be added.
- -@enumerate -@item For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct, -iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct, -iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct, -iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct, -win_1256.cct, win_1258.cct, win_1251.cct, -win_1253.cct, win_1255.cct, win_1257.cct, -koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct, -big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct -files, only the @option{-i <SRC_FILE_NAME>} option was used. - -@item To generate the jis_x0208_1990.cct file, the -@option{-i jis_x0208_1990.txt -x 2 -y 3} options were used. - -@item To generate the cns11643_plane1.cct file, the -@option{-i cns11643.txt -p1 -N cns11643_plane1 -o cns11643_plane1.cct} -options were used. - -@item To generate the cns11643_plane2.cct file, the -@option{-i cns11643.txt -p2 -N cns11643_plane2 -o cns11643_plane2.cct} -options were used. - -@item To generate the cns11643_plane14.cct file, the -@option{-i cns11643.txt -p0xE -N cns11643_plane14 -o cns11643_plane14.cct} -options were used. -@end enumerate - -@* -For more info about the 'mktbl.pl' options, see the 'mktbl.pl -h' output. - -@* -It is assumed that CCS codes are 16 or fewer bits wide. If there are wider CCS codes -in the CCS source file, the bits above the low 16 define the plane (see the -cns11643.txt CCS source file). - -@* -Sometimes it is impossible to map some CCS codes to the 16-bit UCS if, for example, -several different CCS codes are mapped to one UCS-2 code or one CCS code is mapped to -a pair of UCS-2 codes. In these cases, such CCS codes (@dfn{lost -codes}) aren't simply rejected; instead, they are mapped to the default -UCS-2 code (which is currently the @kbd{?} character's code).
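The size-optimized table lookup described earlier in this section (triad search, range fetch, unranged fallback) can be sketched in C roughly as follows. This is only an illustrative sketch, not the Newlib implementation. The names @code{size_table_lookup} and @code{demo_table} are hypothetical, and the sketch assumes that the "range offset" and the "Unranged codes array index" are element indexes counted from the start of @emph{size_arr}. The demo table holds the joined range 1 - 10 from the example above (codes 6 and 8 absent) plus one made-up unranged code.

```c
#include <stddef.h>
#include <stdint.h>

#define ABSENT_CODE 0xFFFF /* marks gaps inside ranges, as stated in the text */

/* Header slots, in the order given by the layout diagram. */
enum { RANGES_NUM = 0, UNRANGED_NUM = 1, UNRANGED_IDX = 2, TRIADS_START = 3 };

/* Hypothetical helper: returns the "to" code for the "from" code y,
   or ABSENT_CODE if the table has no mapping for y.
   Assumption: all offsets are element indexes into size_arr. */
static uint16_t size_table_lookup(const uint16_t *size_arr, uint16_t y)
{
    size_t lo = 0, hi = size_arr[RANGES_NUM];

    /* Step 1: binary search over the triads (first, last, offset). */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint16_t *triad = &size_arr[TRIADS_START + 3 * mid];

        if (y < triad[0])
            hi = mid;
        else if (y > triad[1])
            lo = mid + 1;
        else /* Step 2: fetch from the range sub-array (may be a gap). */
            return size_arr[triad[2] + (y - triad[0])];
    }

    /* Step 3: binary search over the sorted ("from", "to") pairs. */
    const uint16_t *pairs = &size_arr[size_arr[UNRANGED_IDX]];
    lo = 0;
    hi = size_arr[UNRANGED_NUM];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (y < pairs[2 * mid])
            hi = mid;
        else if (y > pairs[2 * mid])
            lo = mid + 1;
        else
            return pairs[2 * mid + 1];
    }
    return ABSENT_CODE;
}

/* Made-up demo data: one range (1 - 10, codes 6 and 8 absent)
   and one unranged pair (0x20 -> 0x5A). */
static const uint16_t demo_table[] = {
    1,                  /* ranges number */
    1,                  /* unranged codes number */
    16,                 /* unranged codes array index */
    1, 10, 6,           /* triad: first code, last code, range offset */
    0x41, 0x42, 0x43, 0x44, 0x45,       /* "to" codes for 1 - 5 */
    ABSENT_CODE, 0x47, ABSENT_CODE,     /* 6 absent, 7, 8 absent */
    0x49, 0x4A,                         /* "to" codes for 9 - 10 */
    0x20, 0x5A          /* unranged pair */
};
```

With this demo data, code 7 is found through the range, code 6 hits a gap and is reported as absent, and code 0x20 is found only in the unranged pairs. A caller would treat @code{ABSENT_CODE} as the error case from the algorithm.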
- - - - - -@page -@node CES converters -@section CES converters -@findex PCS -@* -Similar to the CCS tables, CES converters are also split into "from UCS" -and "to UCS" parts. Depending on the iconv library configuration, these -parts are enabled or disabled. - -@* -The following is the list of CES converters which are currently present -in the Newlib iconv library. - -@itemize @bullet -@item -@emph{euc} - supports the @emph{euc_jp}, @emph{euc_kr} and @emph{euc_tw} -encodings. The @emph{euc} CES converter uses the @emph{table} and the -@emph{us_ascii} CES converters. - -@item -@emph{table} - this CES converter corresponds to "null" and just performs -table-based conversion using 8- and 16-bit CCS tables. This converter -is also used by any other CES converter which needs the CCS table-based -conversions. The @emph{table} converter is also responsible for loading .cct -files. - -@item -@emph{table_pcs} - this is a wrapper over the @emph{table} converter -which is intended for 16-bit encodings which also use the @dfn{Portable -Character Set} (@dfn{PCS}), which is the same as @emph{US-ASCII}. -This means that if the first byte of the CCS code is in the range [0x00-0x7f], -this is a 7-bit PCS code. Otherwise, this is a 16-bit CCS code. Of course, -the 16-bit codes must not contain bytes in the range [0x00-0x7f]. -The @emph{big5} encoding uses the @emph{table_pcs} CES converter and the -@emph{table_pcs} CES converter depends on the @emph{table} CES converter. - -@item -@emph{ucs_2} - intended for the @emph{ucs_2}, @emph{ucs_2be} and -@emph{ucs_2le} encodings support. - -@item -@emph{ucs_4} - intended for the @emph{ucs_4}, @emph{ucs_4be} and -@emph{ucs_4le} encodings support. - -@item -@emph{ucs_2_internal} - intended for the @emph{ucs_2_internal} encoding support. - -@item -@emph{ucs_4_internal} - intended for the @emph{ucs_4_internal} encoding support. - -@item -@emph{us_ascii} - intended for the @emph{us_ascii} encoding support.
In -principle, the most natural way to support the @emph{us_ascii} encoding -is to define the @emph{us_ascii} CCS and use the @emph{table} CES -converter. But for optimization purposes, the specialized -@emph{us_ascii} CES converter was created. - -@item -@emph{utf_16} - intended for the @emph{utf_16}, @emph{utf_16be} and -@emph{utf_16le} encodings support. - -@item -@emph{utf_8} - intended for the @emph{utf_8} encoding support. -@end itemize - - - - - -@page -@node The encodings description file -@section The encodings description file -@findex encoding.deps description file -@findex mkdeps.pl Perl script -@* -To simplify the process of adding support for new encodings, a lot of -"glue" files are generated automatically. - -@* -There is the 'encoding.deps' file in the @emph{lib/} subdirectory which -is used to describe the encodings' properties. The 'mkdeps.pl' Perl script -uses 'encoding.deps' to generate the "glue" files. - -@* -The 'encoding.deps' file is composed of sections; each section consists -of entries, and each entry contains some encoding/CES/CCS description. - -@* -The 'encoding.deps' file's syntax is very simple. Currently only two -sections are defined: @emph{ENCODINGS} and @emph{CES_DEPENDENCIES}. - -@* -Each @emph{ENCODINGS} section's entry describes one encoding and -contains the following information. - -@itemize @bullet -@item -Encoding name (the @emph{ENCODING} field). The name should -be unique and only one name is possible. - -@item -The encoding's CES converter name (the @emph{CES} field). Only one CES -converter is allowed. - -@item -The whitespace-separated list of CCS table names which are used by the -encoding (the @emph{CCS} field). - -@item -The whitespace-separated list of alias names (the @emph{ALIASES} -field). -@end itemize - -@* -Note that all names in the 'encoding.deps' file must be in the normalized -form. - -@* -Each @emph{CES_DEPENDENCIES} section's entry describes dependencies of -one CES converter.
For example, the @emph{euc} CES converter depends on -the @emph{table} and the @emph{us_ascii} CES converters since the -@emph{euc} CES converter uses them. This means that both the @emph{table} -and @emph{us_ascii} CES converters should be linked if the @emph{euc} -CES converter is enabled. - -@* -The @emph{CES_DEPENDENCIES} section defines the following: - -@itemize @bullet -@item -the CES converter name for which the dependencies are defined in this -entry (the @emph{CES} field); - -@item -the whitespace-separated list of CES converters which are needed for -this CES converter (the @emph{USED_CES} field). -@end itemize - -@* -The 'mkdeps.pl' Perl script automatically solves the following tasks. - -@itemize @bullet -@item -The user works with the iconv library in terms of encodings and doesn't know -anything about CES converters and CCS tables. The script automatically -generates code which enables all needed CES converters and CCS tables -for all encodings which were enabled by the user. - -@item -The CES converters may have dependencies and the script automatically -generates the code which handles these dependencies. - -@item -The list of encoding aliases is also automatically generated. - -@item -The script uses a lot of macros in order to enable only the minimum set -of code/data which is needed to support the requested encodings in the -requested directions. -@end itemize - -@* -The 'mkdeps.pl' Perl script is intended to interpret the 'encoding.deps' -file and generate the following files. - -@itemize @bullet -@item -@emph{lib/encnames.h} - this header file contains macro definitions for all -encoding names - -@item -@emph{lib/aliasesbi.c} - the array of encoding names and aliases. The array -is used to find the name of the requested encoding by its alias.
- -@item -@emph{ces/cesbi.c} - this file defines two arrays -(@code{_iconv_from_ucs_ces} and @code{_iconv_to_ucs_ces}) which contain -descriptions of the enabled "from UCS" and "to UCS" CES converters and the -names of encodings which are supported by these CES converters. - -@item -@emph{ces/cesbi.h} - this file contains a set of macros which determine -which CES converters should be enabled given only the set of -enabled encodings (through macros defined in the -@emph{newlib.h} file). Note that one CES converter may handle several -encodings. - -@item -@emph{ces/cesdeps.h} - the CES converters dependencies are handled in -this file. - -@item -@emph{ccs/ccsdeps.h} - the array of linked-in CCS tables is defined -here. - -@item -@emph{ccs/ccsnames.h} - this header file contains macro definitions for all -CCS names. - -@item -@emph{encoding.aliases} - the list of supported encodings and their -aliases which is intended for the Newlib configure scripts in order to -handle the iconv-related configure script options. -@end itemize - - - - - -@page -@node How to add new encoding -@section How to add new encoding -@* -First, the new encoding should be broken down into CCS and CES. Then, -the process of adding the new encoding is split into the following activities. - -@enumerate -@item Generate the .cct CCS file and the .c source file for the new -encoding's CCS (if it isn't already present). To do this, the CCS source -file must be available and the 'mktbl.pl' script should be used. - -@item Write the corresponding CES converter (if it isn't already -present). Use the existing CES converters as an example. - -@item -Add the corresponding entries to the 'encoding.deps' file and regenerate -the autogenerated "glue" files using the 'mkdeps.pl' script. - -@item -Don't forget to add entries to the newlib/newlib.hin file.
- -@item -Of course, the 'Makefile.am'-s should also be updated (if new files were -added) and the 'Makefile.in'-s should be regenerated using the correct -version of 'automake'. - -@item -Don't forget to update the documentation (the list of -supported encodings and CES converters). -@end enumerate - -In case a new encoding doesn't fit the CES/CCS decomposition model or -it is desired to add specialized (non UCS-based) conversion support, -the Newlib iconv library code must be extended. - - - - - -@page -@node The locale support interfaces -@section The locale support interfaces -@* -The Newlib iconv library also has some interface functions (besides the -@code{iconv}, @code{iconv_open} and @code{iconv_close} interfaces) which -are intended for the Locale subsystem. All the locale-related code is -placed in the @emph{lib/iconvnls.c} file. - -@* -The following is the description of the locale-related interfaces: - -@itemize @bullet -@item -@code{_iconv_nls_open} - opens two iconv descriptors for "CCS -> -wchar_t" and "wchar_t -> CCS" conversions. The normalized CCS name is -passed in the function parameters. The @emph{wchar_t} character encoding is -either ucs_2_internal or ucs_4_internal depending on the size of -@emph{wchar_t}. - -@item -@code{_iconv_nls_conv} - the function is similar to the @code{iconv} -function, but if there is no character in the output encoding which -corresponds to the character in the input encoding, the default -conversion isn't performed (the @code{iconv} function sets such output -characters to the @kbd{?} symbol, which is the behavior -specified in SUSv3). - -@item -@code{_iconv_nls_get_state} - returns the current encoding's shift state -(the @code{mbstate_t} object). - -@item -@code{_iconv_nls_set_state} - sets the current encoding's shift state (the -@code{mbstate_t} object). - -@item -@code{_iconv_nls_is_stateful} - checks whether the encoding is stateful -or stateless.
- -@item -@code{_iconv_nls_get_mb_cur_max} - returns the maximum length (the -maximum number of bytes) of the encoding's characters. -@end itemize - - - - -@page -@node Contact -@section Contact -@* -The author of the original BSD iconv library (Alexander Chuguev) no longer -supports that code. - -@* -Any questions regarding the iconv library may be forwarded to -Artem B. Bityuckiy (dedekind@@oktetlabs.ru or dedekind@@mail.ru) as -well as to the public Newlib mailing list. -