newlib/libc/iconv/iconv.tex



@node Iconv
@chapter Encoding conversions (@file{iconv.h})

This chapter describes the Newlib iconv library.
The iconv function declarations are in
@file{iconv.h}.

@menu
* iconv::                           Encoding conversion routines
* Introduction::                    Introduction to iconv and encodings
* Supported encodings::             The list of currently supported encodings
* iconv design decisions::          General iconv library design issues
* iconv configuration::             iconv-related configure script options
* Encoding names::                  How encodings are named.
* CCS tables::                      CCS tables format and 'mktbl.pl' Perl script
* CES converters::                  CES converters description
* The encodings description file::  The 'encoding.deps' file and 'mkdeps.pl'
* How to add new encoding::         The steps to add new encoding support
* The locale support interfaces::   Locale-related iconv interfaces
* Contact::                         The author contact
@end menu

@page
@include iconv/iconv.def

@page
@node Introduction
@section Introduction
@findex encoding
@findex character set
@findex charset
@findex CES
@findex CCS
@*
The iconv library is intended to convert characters from one encoding to
another. It implements iconv(), iconv_open() and iconv_close()
calls, which are defined by the Single Unix Specification.
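
@*
As an illustration of how the three calls fit together, here is a minimal
sketch of a one-shot buffer conversion. The helper name
convert_buffer and its parameters are hypothetical and are not part of
the library; the sketch also assumes that both encodings are stateless and
enabled in the iconv build being used.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert `inlen` bytes of `in` from the encoding
   `from` to the encoding `to`.  Returns the number of bytes written to
   `out`, or (size_t)-1 on any error. */
size_t convert_buffer(const char *to, const char *from,
                      const char *in, size_t inlen,
                      char *out, size_t outcap)
{
    /* Note the argument order: target encoding first, source second. */
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *inp = (char *)in;      /* iconv() takes a non-const char ** */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;

    /* iconv() advances both pointers and decrements both counters. */
    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    return outcap - outleft;
}
```

For stateful encodings a real caller would additionally flush the shift
state with a final iconv() call whose input arguments are NULL, and would
inspect errno (E2BIG, EILSEQ, EINVAL) to decide how to continue.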

@*
In addition to these user-level interfaces, the iconv library also has
several interfaces which are needed to support the coding
capabilities of the Newlib Locale infrastructure.  Since Locale
support also needs to
convert various character sets to and from the @emph{wide character
set}, the iconv library shares its capabilities with the Newlib Locale
subsystem. Moreover, the iconv library supports several features which are
only needed for the Locale infrastructure (for example, the MB_CUR_MAX value).

@*
The Newlib iconv library was created using concepts from another iconv
library implemented by Konstantin Chuguev (ver 2.0). The Newlib iconv library
was rewritten from scratch and contains a lot of improvements with respect to
the original iconv library. 

@*
Terms like @dfn{encoding} or @dfn{character set} aren't well defined and
are often used with various meanings. The following are the definitions of terms
which are used in this documentation as well as in the iconv library
implementation:

@itemize @bullet
@item
@dfn{encoding} - a machine representation of characters by means of bits;

@item
@dfn{Character Set} or @dfn{Charset} - just a collection of
characters, i.e. the encoding is the machine representation of the character set; 

@item
@dfn{CCS} (@dfn{Coded Character Set}) - a mapping from a character set to a
set of integers (@dfn{character codes});

@item
@dfn{CES} (@dfn{Character Encoding Scheme}) - a mapping from a set of character
codes to a sequence of bytes;
@end itemize

@*
Users usually deal with encodings, for example, KOI8-R, Unicode, UTF-8,
ASCII, etc. Encodings are formed by the following chain of steps:

@enumerate
@item
The user has a set of characters which are specific to his or her language (character set).

@item
Each character from this set is uniquely numbered, resulting in a CCS.

@item
Each number from the CCS is converted to a sequence of bits or bytes by means
of a CES, forming an encoding. Thus, a CES may be considered as a
function of a CCS which produces an encoding. Note that a CES may be
applied to more than one CCS.
@end enumerate

@*
Thus, an encoding may be considered as one or more CCS + CES.

@*
Sometimes there is no CES, and in such cases the encoding is equivalent
to the CCS, e.g. KOI8-R or ASCII.

@*
An example of a more complicated encoding is UTF-8 which is the UCS
(or Unicode) CCS plus the UTF-8 CES.
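
@*
To make the CCS/CES split concrete, the following sketch shows the UTF-8
CES as a plain function from one UCS character code to a byte sequence,
following RFC 3629. The function name ucs_to_utf8 is hypothetical and is
not the name used by the Newlib converter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the UTF-8 CES: encode one UCS code as UTF-8.
   Writes up to 4 bytes to `out` and returns the byte count, or 0 if the
   code cannot be represented (surrogates, values above 0x10FFFF). */
size_t ucs_to_utf8(uint32_t ucs, unsigned char out[4])
{
    if (ucs < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)ucs;
        return 1;
    }
    if (ucs < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (ucs >> 6));
        out[1] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 2;
    }
    if (ucs < 0x10000) {                    /* 3 bytes */
        if (ucs >= 0xD800 && ucs <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (ucs >> 12));
        out[1] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 3;
    }
    if (ucs <= 0x10FFFF) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (ucs >> 18));
        out[1] = (unsigned char)(0x80 | ((ucs >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((ucs >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 4;
    }
    return 0;
}
```

The character numbering (the CCS) is untouched by this function; only the
byte representation changes, which is why the same UCS CCS can also be
paired with the UCS-2, UCS-4 or UTF-16 CES.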

@*
The following is a brief list of iconv library features:
@itemize
@item
Generic architecture;
@item
Locale infrastructure support;
@item
Automatic generation of the program code which handles
CES/CCS/Encoding/Names/Aliases dependencies;
@item
The ability to choose a size- or speed-optimized
configuration;
@item
The ability to exclude a lot of unneeded code and data from the linking step.
@end itemize


@page
@node Supported encodings
@section Supported encodings
@findex big5
@findex cp775
@findex cp850
@findex cp852
@findex cp855
@findex cp866
@findex euc_jp
@findex euc_kr
@findex euc_tw
@findex iso_8859_1
@findex iso_8859_10
@findex iso_8859_11
@findex iso_8859_13
@findex iso_8859_14
@findex iso_8859_15
@findex iso_8859_2
@findex iso_8859_3
@findex iso_8859_4
@findex iso_8859_5
@findex iso_8859_6
@findex iso_8859_7
@findex iso_8859_8
@findex iso_8859_9
@findex iso_ir_111
@findex koi8_r
@findex koi8_ru
@findex koi8_u
@findex koi8_uni
@findex ucs_2
@findex ucs_2_internal
@findex ucs_2be
@findex ucs_2le
@findex ucs_4
@findex ucs_4_internal
@findex ucs_4be
@findex ucs_4le
@findex us_ascii
@findex utf_16
@findex utf_16be
@findex utf_16le
@findex utf_8
@findex win_1250
@findex win_1251
@findex win_1252
@findex win_1253
@findex win_1254
@findex win_1255
@findex win_1256
@findex win_1257
@findex win_1258
@*
The following is the list of currently supported encodings. The first column
gives the encoding name, the second column is the list of aliases,
the third column gives the names of its CES and CCS components, and the fourth column
is a short description.

@multitable @columnfractions .20 .26 .24 .30
@item
Name
@tab
Aliases
@tab
CES/CCS
@tab
Short description
@item
@tab
@tab
@tab


@item
big5
@tab
csbig5, big_five, bigfive, cn_big5, cp950
@tab
table_pcs / big5, us_ascii 
@tab
The encoding for Traditional Chinese.


@item
cp775
@tab
ibm775, cspc775baltic
@tab
table / cp775
@tab
The updated version of CP 437 that supports the Baltic languages.


@item
cp850
@tab
ibm850, 850, cspc850multilingual
@tab
table / cp850
@tab
IBM 850 - the updated version of CP 437 where several Latin 1 characters have been
added instead of some less-often used characters like the line-drawing
and the Greek ones.


@item
cp852
@tab
ibm852, 852, cspcp852
@tab
table / cp852
@tab
IBM 852 - the updated version of CP 437 where several Latin 2 characters have been added
instead of some less-often used characters like the line-drawing and the Greek ones.


@item
cp855
@tab
ibm855, 855, csibm855
@tab
table / cp855
@tab
IBM 855 - the updated version of CP 437 that supports Cyrillic.


@item
cp866
@tab
866, IBM866, CSIBM866
@tab
table / cp866
@tab
IBM 866 - the updated version of CP 855 which more closely follows the logical Russian
alphabet ordering of the alternative variant that is preferred by many Russian users.


@item
euc_jp
@tab
eucjp
@tab
euc / jis_x0208_1990, jis_x0201_1976, jis_x0212_1990
@tab
EUC-JP - The EUC for Japanese.


@item
euc_kr
@tab
euckr
@tab
euc / ksx1001
@tab
EUC-KR - The EUC for Korean.


@item
euc_tw
@tab
euctw
@tab
euc / cns11643_plane1, cns11643_plane2, cns11643_plane14
@tab
EUC-TW - The EUC for Traditional Chinese.


@item
iso_8859_1
@tab
iso8859_1, iso88591, iso_8859_1:1987, iso_ir_100, latin1, l1, ibm819, cp819, csisolatin1
@tab
table / iso_8859_1
@tab
ISO 8859-1:1987 - Latin 1, West European.


@item
iso_8859_10
@tab
iso_8859_10:1992, iso_ir_157, iso885910, latin6, l6, csisolatin6, iso8859_10
@tab
table / iso_8859_10
@tab
ISO 8859-10:1992 - Latin 6, Nordic.


@item
iso_8859_11
@tab
iso8859_11, iso885911
@tab
table / iso_8859_11
@tab
ISO 8859-11 - Thai.


@item
iso_8859_13
@tab
iso_8859_13:1998, iso8859_13, iso885913
@tab
table / iso_8859_13
@tab
ISO 8859-13:1998 - Latin 7, Baltic Rim.


@item
iso_8859_14
@tab
iso_8859_14:1998, iso885914, iso8859_14
@tab
table / iso_8859_14
@tab
ISO 8859-14:1998 - Latin 8, Celtic.


@item
iso_8859_15
@tab
iso885915, iso_8859_15:1998, iso8859_15
@tab
table / iso_8859_15
@tab
ISO 8859-15:1998 - Latin 9, West Europe, successor of Latin 1.


@item
iso_8859_2
@tab
iso8859_2, iso88592, iso_8859_2:1987, iso_ir_101, latin2, l2, csisolatin2
@tab
table / iso_8859_2
@tab
ISO 8859-2:1987 - Latin 2, East European.


@item
iso_8859_3
@tab
iso_8859_3:1988, iso_ir_109, iso8859_3, latin3, l3, csisolatin3, iso88593
@tab
table / iso_8859_3
@tab
ISO 8859-3:1988 - Latin 3, South European.


@item
iso_8859_4
@tab
iso8859_4, iso88594, iso_8859_4:1988, iso_ir_110, latin4, l4, csisolatin4
@tab
table / iso_8859_4
@tab
ISO 8859-4:1988 - Latin 4, North European.


@item
iso_8859_5
@tab
iso8859_5, iso88595, iso_8859_5:1988, iso_ir_144, cyrillic, csisolatincyrillic
@tab
table / iso_8859_5
@tab
ISO 8859-5:1988 - Cyrillic.


@item
iso_8859_6
@tab
iso_8859_6:1987, iso_ir_127, iso8859_6, ecma_114, asmo_708, arabic, csisolatinarabic, iso88596
@tab
table / iso_8859_6
@tab
ISO 8859-6:1987 - Arabic.


@item
iso_8859_7
@tab
iso_8859_7:1987, iso_ir_126, iso8859_7, elot_928, ecma_118, greek, greek8, csisolatingreek, iso88597
@tab
table / iso_8859_7
@tab
ISO 8859-7:1987 - Greek.


@item
iso_8859_8
@tab
iso_8859_8:1988, iso_ir_138, iso8859_8, hebrew, csisolatinhebrew, iso88598
@tab
table / iso_8859_8
@tab
ISO 8859-8:1988 - Hebrew.


@item
iso_8859_9
@tab
iso_8859_9:1989, iso_ir_148, iso8859_9, latin5, l5, csisolatin5, iso88599
@tab
table / iso_8859_9
@tab
ISO 8859-9:1989 - Latin 5, Turkish.


@item
iso_ir_111
@tab
ecma_cyrillic, koi8_e, koi8e, csiso111ecmacyrillic
@tab
table / iso_ir_111
@tab
ISO IR 111/ECMA Cyrillic.


@item
koi8_r
@tab
cskoi8r, koi8r, koi8
@tab
table / koi8_r
@tab
RFC 1489 Cyrillic.


@item
koi8_ru
@tab
koi8ru
@tab
table / koi8_ru
@tab
The obsolete Ukrainian.


@item
koi8_u
@tab
koi8u
@tab
table / koi8_u
@tab
RFC 2319 Ukrainian.


@item
koi8_uni
@tab
koi8uni
@tab
table / koi8_uni
@tab
KOI8 Unified.


@item
ucs_2
@tab
ucs2, iso_10646_ucs_2, iso10646_ucs_2, iso_10646_ucs2, iso10646_ucs2, iso10646ucs2, csUnicode
@tab
ucs_2 / (UCS)
@tab
ISO-10646-UCS-2. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).


@item
ucs_2_internal
@tab
ucs2_internal, ucs_2internal, ucs2internal
@tab
ucs_2_internal / (UCS)
@tab
ISO-10646-UCS-2 in system byte order.
NBSP is always interpreted as NBSP (BOM isn't supported).


@item
ucs_2be
@tab
ucs2be
@tab
ucs_2 / (UCS)
@tab
Big Endian version of ISO-10646-UCS-2 (in fact, equivalent to ucs_2).
Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).


@item
ucs_2le
@tab
ucs2le
@tab
ucs_2 / (UCS)
@tab
Little Endian version of ISO-10646-UCS-2.
Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported).


@item
ucs_4
@tab
ucs4, iso_10646_ucs_4, iso10646_ucs_4, iso_10646_ucs4, iso10646_ucs4, iso10646ucs4
@tab
ucs_4 / (UCS)
@tab
ISO-10646-UCS-4. Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).


@item
ucs_4_internal
@tab
ucs4_internal, ucs_4internal, ucs4internal
@tab
ucs_4_internal / (UCS)
@tab
ISO-10646-UCS-4 in system byte order.
NBSP is always interpreted as NBSP (BOM isn't supported).


@item
ucs_4be
@tab
ucs4be
@tab
ucs_4 / (UCS)
@tab
Big Endian version of ISO-10646-UCS-4 (in fact, equivalent to ucs_4).
Big Endian, NBSP is always interpreted as NBSP (BOM isn't supported).


@item
ucs_4le
@tab
ucs4le
@tab
ucs_4 / (UCS)
@tab
Little Endian version of ISO-10646-UCS-4.
Little Endian, NBSP is always interpreted as NBSP (BOM isn't supported).


@item
us_ascii
@tab
ansi_x3.4_1968, ansi_x3.4_1986, iso_646.irv:1991, ascii, iso646_us, us, ibm367, cp367, csascii
@tab
us_ascii / (ASCII)
@tab
7-bit ASCII.


@item
utf_16
@tab
utf16
@tab
utf_16 / (UCS)
@tab
RFC 2781 UTF-16. The very first NBSP code in stream is interpreted as BOM.


@item
utf_16be
@tab
utf16be
@tab
utf_16 / (UCS)
@tab
Big Endian version of RFC 2781 UTF-16.
NBSP is always interpreted as NBSP (BOM isn't supported).


@item
utf_16le
@tab
utf16le
@tab
utf_16 / (UCS)
@tab
Little Endian version of RFC 2781 UTF-16.
NBSP is always interpreted as NBSP (BOM isn't supported).


@item
utf_8
@tab
utf8
@tab
utf_8 / (UCS)
@tab
RFC 3629 UTF-8.


@item
win_1250
@tab
cp1250
@tab
table / win_1250
@tab
Win-1250 - Croatian.


@item
win_1251
@tab
cp1251
@tab
table / win_1251
@tab
Win-1251 - Cyrillic.


@item
win_1252
@tab
cp1252
@tab
table / win_1252
@tab
Win-1252 - Latin 1.


@item
win_1253
@tab
cp1253
@tab
table / win_1253
@tab
Win-1253 - Greek.


@item
win_1254
@tab
cp1254
@tab
table / win_1254
@tab
Win-1254 - Turkish.


@item
win_1255
@tab
cp1255
@tab
table / win_1255
@tab
Win-1255 - Hebrew.


@item
win_1256
@tab
cp1256
@tab
table / win_1256
@tab
Win-1256 - Arabic.


@item
win_1257
@tab
cp1257
@tab
table / win_1257
@tab
Win-1257 - Baltic.


@item
win_1258
@tab
cp1258
@tab
table / win_1258
@tab
Win-1258 - Vietnamese.
@end multitable


@page
@node iconv design decisions
@section iconv design decisions
@findex CCS table
@findex CES converter
@findex Speed-optimized tables
@findex Size-optimized tables
@*
The first iconv library design issue arises when considering the
following two design approaches:

@enumerate
@item
Have modules which implement conversion from the encoding A to the encoding B
and vice versa i.e., one conversion module relates to any two encodings.
@item
Have modules which implement conversion from the encoding A to the fixed
encoding C and vice versa i.e., one conversion module relates to any
one encoding A and one fixed encoding C. In this case, to convert from
the encoding A to the encoding B, two modules are needed (in order to convert
from A to C and then from C to B).
@end enumerate

@*
Obviously, there is a tradeoff between commonality/flexibility and
efficiency: the first method is more efficient since it converts
directly; however, it is less flexible since a distinct module is
needed for each encoding pair.

@*
The Newlib iconv model uses the second method and always converts through the 32-bit
UCS but its design also allows one to write specialized conversion
modules if the conversion speed is critical.
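
@*
The second approach can be sketched as follows. Each encoding supplies
only a "to UCS" and a "from UCS" routine, and any pair of encodings is
converted by chaining the two through a 32-bit UCS value. All names here
(struct codec, latin1_codec, ucs2be_codec, convert_via_ucs) are
hypothetical and chosen for illustration; they are not the Newlib
internal interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-encoding module: one decoder and one encoder. */
struct codec {
    /* Decode one character; return bytes consumed, or 0 on error. */
    size_t (*to_ucs)(const unsigned char *in, size_t inlen, uint32_t *ucs);
    /* Encode one character; return bytes produced, or 0 on error. */
    size_t (*from_ucs)(uint32_t ucs, unsigned char *out, size_t outcap);
};

static size_t latin1_to_ucs(const unsigned char *in, size_t inlen, uint32_t *ucs)
{
    if (inlen < 1) return 0;
    *ucs = in[0];                 /* ISO 8859-1 codes equal UCS codes */
    return 1;
}

static size_t latin1_from_ucs(uint32_t ucs, unsigned char *out, size_t outcap)
{
    if (ucs > 0xFF || outcap < 1) return 0;
    out[0] = (unsigned char)ucs;
    return 1;
}

static size_t ucs2be_to_ucs(const unsigned char *in, size_t inlen, uint32_t *ucs)
{
    if (inlen < 2) return 0;
    *ucs = ((uint32_t)in[0] << 8) | in[1];
    return 2;
}

static size_t ucs2be_from_ucs(uint32_t ucs, unsigned char *out, size_t outcap)
{
    if (ucs > 0xFFFF || outcap < 2) return 0;
    out[0] = (unsigned char)(ucs >> 8);
    out[1] = (unsigned char)(ucs & 0xFF);
    return 2;
}

const struct codec latin1_codec = { latin1_to_ucs, latin1_from_ucs };
const struct codec ucs2be_codec = { ucs2be_to_ucs, ucs2be_from_ucs };

/* Convert A -> UCS -> B.  Returns bytes written, or (size_t)-1 on error. */
size_t convert_via_ucs(const struct codec *src, const struct codec *dst,
                       const unsigned char *in, size_t inlen,
                       unsigned char *out, size_t outcap)
{
    size_t written = 0;
    while (inlen > 0) {
        uint32_t ucs;
        size_t used = src->to_ucs(in, inlen, &ucs);
        if (used == 0) return (size_t)-1;
        size_t made = dst->from_ucs(ucs, out + written, outcap - written);
        if (made == 0) return (size_t)-1;
        in += used;
        inlen -= used;
        written += made;
    }
    return written;
}
```

With this layout, supporting N encodings takes N modules rather than one
module per ordered pair, at the cost of an extra step per character.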

@*
The second design issue is how to break down (decompose) encodings.
The Newlib iconv library uses the fact that any encoding may be
considered as one or more CCS plus a CES. It also decomposes its
conversion modules into a @dfn{CES converter} plus one or more @dfn{CCS
tables}. CCS tables map CCS to UCS and vice versa; the CES converters
map CCS to the encoding and vice versa.

@*
As an example, let's consider the conversion between the EUC-TW and
big5 encodings. The big5 encoding may be decomposed into the ASCII and BIG5
CCS-es plus the BIG5 CES. EUC-TW may be decomposed into the CNS11643_PLANE1, CNS11643_PLANE2,
and CNS11643_PLANE14 CCS-es plus the EUC CES.

@*
The euc_tw -> big5 conversion is performed as follows:

@enumerate
@item
The EUC converter performs the EUC-TW encoding to the corresponding CCS-es
transformation (CNS11643_PLANE1, CNS11643_PLANE2 and CNS11643_PLANE14
CCS-es);
@item
The obtained CCS codes are transformed to the UCS codes using the CNS11643_PLANE1,
CNS11643_PLANE2 and CNS11643_PLANE14 CCS tables;
@item
The resulting UCS codes are transformed to the ASCII and BIG5 codes using
the corresponding CCS tables;
@item
The obtained CCS codes are transformed to the big5 encoding using the corresponding
CES converter.
@end enumerate

@*
Analogously, the backward conversion is performed as follows:

@enumerate
@item
The BIG5 converter performs the big5 encoding to the corresponding CCS-es transformation
(the ASCII and BIG5 CCS-es);
@item
The obtained CCS codes are transformed to the UCS codes using the ASCII and BIG5 CCS tables;
@item
The resulting UCS codes are transformed to the CNS11643_PLANE1, CNS11643_PLANE2
and CNS11643_PLANE14 codes using the corresponding CCS tables;
@item
The obtained CCS codes are transformed to the EUC-TW encoding using the corresponding
CES converter.
@end enumerate

@*
Note that the above is just an example; the real names of the CES converters
and the CCS tables implemented in the Newlib iconv are slightly different.

@*
The third design issue also relates to flexibility. Obviously, it isn't
desirable to always link all the CES converters and the CCS tables into the library;
instead, we want to be able to load the needed converters and tables
dynamically on demand. This isn't a problem on "big" machines such as
a PC, but it may be very problematic on "small" embedded systems.

@*
Since the CCS tables are just data, it is possible to load them
dynamically from external files.  The CES converters, on the other hand,
are algorithms with some code, so a dynamic library loading
capability is required to load them on demand.

@*
Apart from possible restrictions applied by embedded systems (small
RAM for example), Newlib itself has no dynamic library support and
therefore, all the CES converters which will ever be used must be linked into
the library.   However, loading of the dynamic CCS tables is possible and is
implemented in the Newlib iconv library.  It may be enabled via the Newlib
configure script options.

@*
The next design issue is fine-tuning the iconv library
configuration.  One important ability is for iconv to not link all its
converters and tables (if dynamic loading is not enabled) but instead
enable only those encodings which are specified at configuration
time (see the section about the configure script options).

@*
In addition, the Newlib iconv library configure options distinguish between
conversion directions. This means that not only are supported encodings
selectable, the conversion direction is as well. For example, if the user wants
a configuration which allows conversions from UTF-8 to UTF-16 and
doesn't plan to use the "UTF-16 to UTF-8" conversions, he or she can
enable only
this conversion direction (i.e., no "UTF-16 -> UTF-8"-related code will
be included), thus saving some memory (note that this technique allows
one half of a CCS table, which may be quite large, to be excluded from linking).

@*
One more design aspect is the speed- and size-optimized tables. Users can
select between them using configure script options. The
speed-optimized CCS tables are the same as the size-optimized ones in
case of 8-bit CCS (e.g., KOI8-R), but for 16-bit CCS-es the size-optimized
CCS tables may be 1.5 to 2 times smaller than the speed-optimized ones. On the
other hand, conversion with speed-optimized tables is several times faster.

@*
It is worth stressing that support for a new encoding can't be
dynamically added to an already compiled Newlib library, even if it
needs only an additional CCS table and iconv is configured to use
external files with CCS tables (this isn't a fundamental restriction,
and the ability to add support for a new table-based encoding dynamically, by
just adding a new .cct file, could easily be added).

@*
Theoretically, the compiled-in CCS tables should be more appropriate for
embedded systems than dynamically loaded CCS tables.  This is because the compiled-in tables are read-only and can be placed in ROM
whereas dynamic loading requires RAM.  Moreover, in the current iconv
implementation, a distinct copy of the dynamic CCS file is loaded for each opened iconv descriptor even in case of the same encoding.
This means, for example, that if two iconv descriptors for
"KOI8-R -> UCS-4BE" and "KOI8-R -> UTF-16BE" are opened, two copies of
koi8-r .cct file will be loaded (actually, iconv loads only the needed part
of these files).  On the other hand, in the case of compiled-in CCS tables, there will always be only one copy.

@page
@node iconv configuration
@section iconv configuration
@findex iconv configuration
@findex --enable-newlib-iconv-encodings
@findex --enable-newlib-iconv-from-encodings
@findex --enable-newlib-iconv-to-encodings
@findex --enable-newlib-iconv-external-ccs
@findex NLSPATH
@*
To enable an encoding, the @option{--enable-newlib-iconv-encodings} configure
script option should be used. This option accepts a comma-separated list
of @emph{encodings} that should be enabled. The option enables each encoding in both
("to" and "from") directions.

@*
The @option{--enable-newlib-iconv-from-encodings} configure script option enables
"from" support for each encoding that was passed to it.

@*
The @option{--enable-newlib-iconv-to-encodings} configure script option enables
"to" support for each encoding that was passed to it.

@*
Example: if the user needs only the "KOI8-R -> UTF-8", "UTF-8 -> ISO-8859-5" and
"KOI8-R -> UCS-2" conversions, the optimal way (the minimum of iconv
code and data will be linked) is to configure Newlib with the following
options:
@*
@code{--enable-newlib-iconv-encodings=UTF-8
--enable-newlib-iconv-from-encodings=KOI8-R
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5}
@*
which is the same as
@*
@code{--enable-newlib-iconv-from-encodings=KOI8-R,UTF-8
--enable-newlib-iconv-to-encodings=UCS-2,ISO-8859-5,UTF-8}
@*
The user may also just use the
@*
@code{--enable-newlib-iconv-encodings=KOI8-R,ISO-8859-5,UTF-8,UCS-2}
@*
configure script option, but this is less optimal since some
unneeded data and code will be linked.

@*
The @option{--enable-newlib-iconv-external-ccs} option enables iconv's
capabilities to work with the external CCS files.

@*
The @option{--enable-target-optspace} Newlib configure script option also affects
the iconv library. If this option is present, the library uses the
size-optimized CCS tables. This means that only the size-optimized CCS
tables will be linked or, if the
@option{--enable-newlib-iconv-external-ccs} configure script option was used,
the iconv library will load the size-optimized tables. If the
@option{--enable-target-optspace} configure script option is disabled,
the speed-optimized CCS tables are used.

@*
Note: .cct files are searched by iconv_open in the $NLSPATH/iconv_data/ directory.
Thus, the NLSPATH environment variable should be set.


@page
@node Encoding names
@section Encoding names
@findex encoding name
@findex encoding alias
@findex normalized name
@*
Each encoding has one @dfn{name} and a number of @dfn{aliases}. When
the user works with the iconv library (i.e., when the @code{iconv_open} call
is used), either the name or any alias may be used. The same applies when encoding
names are used in configure script options.

@*
Names and aliases may be specified in any case (small or capital
letters) and the @kbd{-} symbol is equivalent to the @kbd{_} symbol.

@*
Internally the Newlib iconv library always converts aliases to names. It
also converts names and aliases to the @dfn{normalized} form, which means
that all capital letters are converted to small letters and the @kbd{-}
symbols are converted to @kbd{_} symbols.
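
@*
The normalization rule stated above (lower-case everything, map
@kbd{-} to @kbd{_}) can be sketched in a few lines of C. The function
name normalize_encoding_name is hypothetical; the library's own routine
may differ in name and interface, and the alias-to-name lookup is not
shown.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sketch: write the normalized form of `name` into `out`
   (capacity `outcap`, including the terminating NUL).
   Returns 0 on success, -1 if `out` is too small. */
int normalize_encoding_name(const char *name, char *out, size_t outcap)
{
    size_t i;

    for (i = 0; name[i] != '\0'; i++) {
        if (i + 1 >= outcap)
            return -1;                 /* no room for this char plus NUL */
        unsigned char c = (unsigned char)name[i];
        out[i] = (c == '-') ? '_' : (char)tolower(c);
    }
    if (outcap == 0)
        return -1;
    out[i] = '\0';
    return 0;
}
```

Under this rule "KOI8-R", "koi8-r" and "Koi8_R" all normalize to
"koi8_r", so a single comparison against the table of known names and
aliases is enough.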


@page
@node CCS tables
@section CCS tables
@findex Size-optimized CCS table
@findex Speed-optimized CCS table
@findex mktbl.pl Perl script
@findex .cct files
@findex The CCT tables source files
@findex CCS source files
@*
The iconv library stores files with CCS tables in the @emph{ccs/}
subdirectory. The CCS tables for any CCS may be kept in two forms - in the binary form
(@dfn{.cct files}, see the @emph{ccs/binary/} subdirectory) and in the form
of compilable .c source files. The .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure script option is enabled.
The .c files are linked to the Newlib library if the corresponding
encoding is enabled.

@*
As stated earlier, the Newlib iconv library performs all
conversions through the 32-bit UCS, but the codes which are used
in most CCS-es fit into the first 16-bit subset of the 32-bit UCS set.
Thus, in order to make the CCS tables more compact, the 16-bit UCS-2 is
used instead of the 32-bit UCS-4.

@*
CCS tables may be 8- or 16-bit wide. 8-bit CCS tables map 8-bit CCS to
16-bit UCS-2 and vice versa while 16-bit CCS tables map
16-bit CCS to 16-bit UCS-2 and vice versa.
8-bit tables are small while 16-bit tables may be quite large.
Because of this, 16-bit CCS tables may be
either speed- or size-optimized. Size-optimized CCS tables are
smaller than speed-optimized ones, but the conversion process is
slower if the size-optimized CCS tables are used. 8-bit CCS tables have only
a size-optimized variant.

@*
Each CCS table (both speed- and size-optimized) consists of
@dfn{from_ucs} and @dfn{to_ucs} subtables. The "from_ucs" subtable maps
UCS-2 codes to CCS codes, while the "to_ucs" subtable maps CCS codes to
UCS-2 codes.

@*
Almost all 16-bit CCS tables contain fewer than 0xFFFF codes and
a lot of gaps exist.

@subsection Speed-optimized tables format
@*
In case of 8-bit speed-optimized CCS tables the "to_ucs" subtable format is
trivial - it is just an array of 256 16-bit UCS codes. Therefore, the
UCS-2 code @emph{Y} corresponding to a CCS code @emph{X} is calculated
as @emph{Y = to_ucs[X]}.
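
@*
As a toy illustration of this direct lookup, the sketch below fills a
256-entry table with the ASCII range plus two KOI8-R entries and reads it
back. The names toy_to_ucs, toy_to_ucs_init and ccs8_to_ucs2, and the use
of 0xFFFF as a "no mapping" marker, are assumptions made for the example
and are not taken from the generated Newlib tables.

```c
#include <stdint.h>

#define UCS_UNDEFINED 0xFFFFu   /* hypothetical "no mapping" marker */

/* Toy 8-bit "to_ucs" subtable: 256 UCS-2 codes indexed by CCS code. */
static uint16_t toy_to_ucs[256];

/* Fill the toy table: ASCII maps to itself, everything else is
   undefined except two sample KOI8-R codes. */
void toy_to_ucs_init(void)
{
    for (int x = 0; x < 256; x++)
        toy_to_ucs[x] = (x < 0x80) ? (uint16_t)x : UCS_UNDEFINED;
    toy_to_ucs[0xC1] = 0x0430;  /* KOI8-R 0xC1: CYRILLIC SMALL LETTER A  */
    toy_to_ucs[0xC2] = 0x0431;  /* KOI8-R 0xC2: CYRILLIC SMALL LETTER BE */
}

/* Y = to_ucs[X]: one array access, no searching. */
uint16_t ccs8_to_ucs2(uint8_t x)
{
    return toy_to_ucs[x];
}
```

Because the index type is 8 bits wide, every possible input is a valid
index and no bounds check is needed.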

@*
Obviously, the simplest way to create the "from_ucs" table or the
16-bit "to_ucs" table is to use a huge 16-bit array like in case
of the 8-bit "to_ucs" table. But almost all the 16-bit CCS tables contain
fewer than 0xFFFF code maps and this fact may be exploited to reduce
the size of the CCS tables.

@*
This section describes the "UCS-2 -> CCS" 8-bit CCS table format. The
16-bit "CCS -> UCS-2" CCS table format is the same, except for the mapping
direction and the number of CCS bits.

@*
In the case of the 8-bit speed-optimized table, the "from_ucs" subtable
corresponds to the "from_ucs" array and has the following layout:

@*
from_ucs array:
@*
-------------------------------------
@*
0xFF mapping (2 bytes) (only for
8-bit table).
@*
-------------------------------------
@*
Heading block
@*
-------------------------------------
@*
Block 1
@*
-------------------------------------
@*
Block 2
@*
-------------------------------------
@*
  ...
@*
-------------------------------------
@*
Block N
@*
-------------------------------------

@*
The 0x0000-0xFFFF 16-bit code range is divided into 256 code subranges. Each
subrange is represented by a 256-element @dfn{block} (256 1-byte
elements, or 256 2-byte elements in the case of a 16-bit CCS table) with
elements which are equivalent to the CCS codes of this subrange.
If the "UCS-2 -> CCS" mapping has big enough gaps, some blocks will be
absent and there will be fewer than 256 blocks.

@*
Element number @emph{m} of @dfn{the heading block} (which contains
256 2-byte elements) corresponds to the @emph{m}-th 256-element subrange.
If the subrange contains some codes, the @emph{m}-th element of
the heading block contains the offset of the corresponding block in the
"from_ucs" array. If there are no codes in the subrange, the heading
block element contains 0xFFFF.

@*
If there are some gaps in a block, the corresponding block elements have
the 0xFF value. If there is a 0xFF code present in the CCS, its mapping
is defined in the first 2-byte element of the "from_ucs" array.

@*
Given such a table format, the algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} is as follows.

@*
@enumerate
@item If @emph{Y} is equal to the value of the first 2-byte element
of the "from_ucs" array, @emph{X} is 0xFF. Else, continue to search.

@item Calculate the block number: @emph{BlkN = (Y & 0xFF00) >> 8}.

@item If the heading block element with number @emph{BlkN} is 0xFFFF, there
is no corresponding CCS code (error, wrong input data). Else, fetch the
"from_ucs" array index of the @emph{BlkN}-th block, @emph{BlkIdx}, from
that element.

@item Calculate the offset of the @emph{X} code in its block:
@emph{Xindex = Y & 0xFF}

@item If the value of the @emph{Xindex}-th element of the block (which is
@emph{from_ucs[BlkIdx+Xindex]}) is 0xFF, there is no corresponding
CCS code (error, wrong input data). Else, @emph{X = from_ucs[BlkIdx+Xindex]}.
@end enumerate
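
@*
The lookup steps above can be sketched in C as follows. This is only an
illustration under stated assumptions, not the Newlib implementation: the
function name from_ucs_8bit, the INVALID_CODE return value, and the choice
to treat heading block entries as byte offsets into an array whose 2-byte
values are stored in native byte order are all assumptions made for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INVALID_BLOCK 0xFFFFu /* heading block value for an absent subrange */
#define INVALID_CODE  0xFFFFu /* hypothetical "no mapping" return value */

/* Read a 2-byte value in native byte order at byte offset off. */
static uint16_t read_u16(const unsigned char *tbl, size_t off)
{
    uint16_t v;
    memcpy(&v, tbl + off, sizeof v);
    return v;
}

/* Assumed layout: bytes 0-1 hold the UCS-2 code that maps to CCS 0xFF,
   bytes 2-513 hold the 256-entry heading block, blocks follow. */
static uint16_t from_ucs_8bit(const unsigned char *from_ucs, uint16_t y)
{
    assert(from_ucs != NULL);

    /* Step 1: the special 0xFF mapping. */
    if (y == read_u16(from_ucs, 0))
        return 0xFF;

    /* Step 2: block number from the high byte. */
    unsigned blkn = (y & 0xFF00u) >> 8;

    /* Step 3: fetch the block offset; 0xFFFF means "no such block". */
    uint16_t blk_idx = read_u16(from_ucs, 2 + 2 * (size_t)blkn);
    if (blk_idx == INVALID_BLOCK)
        return INVALID_CODE;

    /* Step 4: position inside the block from the low byte. */
    unsigned xindex = y & 0xFFu;

    /* Step 5: 0xFF inside a block marks a gap. */
    unsigned char x = from_ucs[(size_t)blk_idx + xindex];
    return x == 0xFF ? INVALID_CODE : x;
}
```

A caller would compare the result against INVALID_CODE before using it as a
CCS code. Since 8-bit CCS codes never exceed 0xFF, the 16-bit return type
leaves room for the error value.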

@subsection Size-optimized tables format
@*
As stated above, size-optimized tables exist only for 16-bit CCS-es.
This is because the difference between the speed-optimized
and the size-optimized table sizes is too small in the case of 8-bit CCS-es.

@*
The formats of the "to_ucs" and "from_ucs" subtables are equivalent in the case of
size-optimized tables.

This section describes the format of the "UCS-2 -> CCS" size-optimized
CCS table. The format of the "CCS -> UCS-2" table is the same.

The idea of the size-optimized tables is to split the UCS-2 codes
("from" codes) into @dfn{ranges} (a @dfn{range} is a run of consecutive UCS-2 codes).
Then CCS codes ("to" codes) are stored only for the codes from these
ranges. Distinct "from" codes which belong to no range (@dfn{unranged codes}) are stored
together with the corresponding "to" codes.

@*
The following is the layout of the size-optimized table array:

@*
size_arr array:
@*
-------------------------------------
@*
Ranges number (2 bytes)
@*
-------------------------------------
@*
Unranged codes number (2 bytes)
@*
-------------------------------------
@*
Unranged codes array index (2 bytes)
@*
-------------------------------------
@*
Ranges indexes (triads)
@*
-------------------------------------
@*
Ranges
@*
-------------------------------------
@*
Unranged codes array
@*
-------------------------------------

@*
The @dfn{Ranges indexes} section of @emph{size_arr} helps to find
the offset of the needed range in @emph{size_arr} and has
the following format (triads):
@*
the first code in range, the last code in range, range offset.

@*
The array of these triads is sorted by the first element, therefore it is
possible to quickly find the needed range index.

@*
Each range has the corresponding sub-array containing the "to" codes. These
sub-arrays are stored in the place marked as "Ranges" in the layout
diagram. 

@*
The "Unranged codes array" contains pairs ("from" code, "to" code) for
each unranged code. The array of these pairs is sorted by "from" code
values, therefore it is possible to find the needed pair quickly.

@*
Note that each range requires 6 bytes to form its index. If, for
example, there are two ranges (1 - 5 and 9 - 10), and one unranged code
(7), 12 bytes are needed for two range indexes and 4 bytes for the unranged
code (total 16). But it is better to join both ranges as 1 - 10 and
mark codes 6 and 8 as absent. In this case, only 6 additional bytes for the
range index and 4 bytes to mark codes 6 and 8 as absent are needed
(total 10 bytes). This optimization is done in the size-optimized tables.
Thus, ranges may contain small gaps. The absent codes in ranges are marked
as 0xFFFF.

@*
Note that a pair of adjacent "from" codes is stored as two unranged codes
rather than as a range, since forming the range takes more space than
storing two unranged codes (five 2-byte elements against four).

@*
The algorithm for finding the CCS code
@emph{X} which corresponds to the UCS-2 code @emph{Y} (input) in the "UCS-2 ->
CCS" size-optimized table is as follows.

@*
@enumerate
@item Try to find the corresponding triad in the "Ranges indexes"
section. Since the triads are sorted, a binary search can be used
(divide by 2, compare, etc).

@item If the triad is found, fetch the @emph{X} code from the corresponding
range array. If it is 0xFFFF, return an error.

@item If there is no corresponding triad, search for the @emph{X} code among the
sorted unranged codes. Return an error if nothing was found.
@end enumerate
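
@*
The search above can be sketched in C as two binary searches. This is a
minimal illustration, not the Newlib code: the function name size_lookup and
the NO_CODE constant are hypothetical, and the sketch assumes that
size_arr is an array of 16-bit elements in which the range offsets and the
unranged codes array index count 16-bit elements from the start of the array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_CODE 0xFFFFu /* absent code inside a range; also the error value */

/* Assumed layout (indexes count 16-bit elements):
   [0] number of ranges, [1] number of unranged codes,
   [2] index of the unranged codes array,
   [3...] triads (first code, last code, index of the range sub-array),
   then the range sub-arrays, then ("from", "to") pairs. */
static uint16_t size_lookup(const uint16_t *size_arr, uint16_t y)
{
    assert(size_arr != NULL);

    /* Step 1: binary search over the sorted triads. */
    size_t lo = 0, hi = size_arr[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const uint16_t *triad = size_arr + 3 + 3 * mid;
        if (y < triad[0])
            hi = mid;
        else if (y > triad[1])
            lo = mid + 1;
        else
            /* Step 2: the range holds y; the slot may still be NO_CODE. */
            return size_arr[triad[2] + (y - triad[0])];
    }

    /* Step 3: binary search over the sorted unranged pairs. */
    const uint16_t *pairs = size_arr + size_arr[2];
    lo = 0;
    hi = size_arr[1];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint16_t from = pairs[2 * mid];
        if (y < from)
            hi = mid;
        else if (y > from)
            lo = mid + 1;
        else
            return pairs[2 * mid + 1];
    }
    return NO_CODE;
}
```

With a table holding the single range 1 - 10 (codes 6 and 8 marked absent)
and one unranged code, a lookup of 7 is answered from the range sub-array,
while a lookup of 6 yields the error value.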

@subsection .cct and .c CCS Table files
@*
The .c source files for 8-bit CCS tables have "to_ucs" and "from_ucs"
speed-optimized tables. The .c source files for 16-bit CCS tables have
"to_ucs_speed", "to_ucs_size", "from_ucs_speed" and "from_ucs_size"
tables.

@*
When .c files are compiled and used, all the 16-bit and 32-bit values
have the native endian format (Big Endian for the BE systems and Little
Endian for the LE systems) since they are compiled for the system before
they are used.

@*
In the case of .cct files, which are intended for dynamic loading of CCS
tables, the tables are stored in either LE or BE format. Since the
.cct files are generated by the 'mktbl.pl' Perl script, it is possible
to choose the endianness of the tables. It is also possible to store two
copies (both LE and BE) of the CCS tables in one .cct file. The default
.cct files (which come with the Newlib sources) have both LE and BE CCS
tables. The Newlib iconv library automatically chooses the needed CCS tables
(with the appropriate endianness).

@*
Note that the .cct files are only used when the
@option{--enable-newlib-iconv-external-ccs} configure script option is enabled.

@subsection The 'mktbl.pl' Perl script
@*
The 'mktbl.pl' script is intended to generate .cct and .c CCS table
files from the @dfn{CCS source files}.

@*
The CCS source files are just text files which have one or more columns
with the CCS <-> UCS-2 code mapping. For an example of a CCS source
file, see any of the files at the URLs given below.

@*
The following table describes where the source files for CCS table files
provided by the Newlib distribution are located.

@multitable @columnfractions .25 .75
@item
Name
@tab
URL

@item
@tab

@item
big5
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT

@item
cns11643_plane1
cns11643_plane14
cns11643_plane2
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT

@item
cp775
cp850
cp852
cp855
cp866
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/

@item
iso_8859_1
iso_8859_2
iso_8859_3
iso_8859_4
iso_8859_5
iso_8859_6
iso_8859_7
iso_8859_8
iso_8859_9
iso_8859_10
iso_8859_11
iso_8859_13
iso_8859_14
iso_8859_15
@tab
http://www.unicode.org/Public/MAPPINGS/ISO8859/

@item
iso_ir_111
@tab
http://crl.nmsu.edu/~mleisher/csets/ISOIR111.TXT

@item
jis_x0201_1976
jis_x0208_1990
jis_x0212_1990
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT

@item
koi8_r
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT

@item
koi8_ru
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8RU.TXT

@item
koi8_u
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8U.TXT

@item
koi8_uni
@tab
http://crl.nmsu.edu/~mleisher/csets/KOI8UNI.TXT

@item
ksx1001
@tab
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT

@item
win_1250
win_1251
win_1252
win_1253
win_1254
win_1255
win_1256
win_1257
win_1258
@tab
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
@end multitable

The CCS source files aren't distributed with Newlib because of license
restrictions in most of the Unicode.org files.

The following are the 'mktbl.pl' options which were used to generate the .cct
files. Note that to generate the .c source files of the CCS tables, the
@option{-s} option should be added.

@enumerate
@item For the iso_8859_10.cct, iso_8859_13.cct, iso_8859_14.cct, iso_8859_15.cct,
iso_8859_1.cct, iso_8859_2.cct, iso_8859_3.cct, iso_8859_4.cct,
iso_8859_5.cct, iso_8859_6.cct, iso_8859_7.cct, iso_8859_8.cct,
iso_8859_9.cct, iso_8859_11.cct, win_1250.cct, win_1252.cct, win_1254.cct,
win_1256.cct, win_1258.cct, win_1251.cct,
win_1253.cct, win_1255.cct, win_1257.cct,
koi8_r.cct, koi8_ru.cct, koi8_u.cct, koi8_uni.cct, iso_ir_111.cct,
big5.cct, cp775.cct, cp850.cct, cp852.cct, cp855.cct, cp866.cct, cns11643.cct
files, only the @option{-i <SRC_FILE_NAME>} option was used.

@item To generate the jis_x0208_1990.cct file, the
@option{-i jis_x0208_1990.txt -x 2 -y 3} options were used.

@item To generate the cns11643_plane1.cct file, the
@option{-i cns11643.txt -p1 -N cns11643_plane1  -o cns11643_plane1.cct}
options were used.

@item To generate the cns11643_plane2.cct file, the
@option{-i cns11643.txt -p2 -N cns11643_plane2  -o cns11643_plane2.cct}
options were used.

@item To generate the cns11643_plane14.cct file, the
@option{-i cns11643.txt -p0xE -N cns11643_plane14  -o cns11643_plane14.cct}
options were used.
@end enumerate

@*
For more info about the 'mktbl.pl' options, see the 'mktbl.pl -h' output.

@*
It is assumed that CCS codes are 16 bits wide or less. If there are wider CCS codes
in the CCS source file, the bits above the 16th define the plane (see the
cns11643.txt CCS source file).

@*
Sometimes it is impossible to map some CCS codes to the 16-bit UCS, for example if
several different CCS codes are mapped to one UCS-2 code or one CCS code is mapped to
a pair of UCS-2 codes. In these cases, such CCS codes (@dfn{lost
codes}) aren't simply rejected; instead, they are mapped to the default
UCS-2 code (which is currently the @kbd{?} character's code).


@page
@node CES converters
@section CES converters
@findex PCS
@*
Similar to the CCS tables, CES converters are also split into "from UCS"
and "to UCS" parts. Depending on the iconv library configuration, these
parts are enabled or disabled. 

@*
The following is the list of CES converters which are currently present
in the Newlib iconv library.

@itemize @bullet
@item
@emph{euc} - supports the @emph{euc_jp}, @emph{euc_kr} and @emph{euc_tw}
encodings. The @emph{euc} CES converter uses the @emph{table} and the
@emph{us_ascii} CES converters.

@item
@emph{table} - this CES converter corresponds to "null" and just performs
table-based conversion using 8- and 16-bit CCS tables. This converter
is also used by any other CES converter which needs CCS table-based
conversions. The @emph{table} converter is also responsible for loading
.cct files.

@item
@emph{table_pcs} - this is a wrapper over the @emph{table} converter
which is intended for 16-bit encodings which also use the @dfn{Portable
Character Set} (@dfn{PCS}), which is the same as @emph{US-ASCII}.
This means that if the first byte of the CCS code is in the range [0x00-0x7f],
it is a 7-bit PCS code. Otherwise, it is a 16-bit CCS code. Of course,
the first byte of a 16-bit code must not be in the range [0x00-0x7f].
The @emph{big5} encoding uses the @emph{table_pcs} CES converter and the
@emph{table_pcs} CES converter depends on the @emph{table} CES converter.

@item
@emph{ucs_2} - intended for the @emph{ucs_2}, @emph{ucs_2be} and
@emph{ucs_2le} encodings support.

@item
@emph{ucs_4} - intended for the @emph{ucs_4}, @emph{ucs_4be} and
@emph{ucs_4le} encodings support.

@item
@emph{ucs_2_internal} - intended for the @emph{ucs_2_internal} encoding support.

@item
@emph{ucs_4_internal} - intended for the @emph{ucs_4_internal} encoding support.

@item
@emph{us_ascii} - intended for the @emph{us_ascii} encoding support. In
principle, the most natural way to support the @emph{us_ascii} encoding
is to define the @emph{us_ascii} CCS and use the @emph{table} CES
converter. But for the optimization purposes, the specialized
@emph{us_ascii} CES converter was created.

@item
@emph{utf_16} - intended for the @emph{utf_16}, @emph{utf_16be} and
@emph{utf_16le} encodings support.

@item
@emph{utf_8} - intended for the @emph{utf_8} encoding support.
@end itemize
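
@*
The first-byte rule of the @emph{table_pcs} converter described in the list
above can be sketched in C as follows. The function name pcs_next_code and
its interface are hypothetical and are not taken from the Newlib sources. The
sketch also assumes that the first byte of a 16-bit code is its high-order
byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Splits the next code off a byte stream.
   Returns the number of bytes consumed: 1 for a 7-bit PCS code,
   2 for a 16-bit CCS code, 0 if the input is empty or truncated. */
static size_t pcs_next_code(const unsigned char *in, size_t len,
                            uint16_t *code, int *is_pcs)
{
    assert(in != NULL && code != NULL && is_pcs != NULL);

    if (len == 0)
        return 0;

    if (in[0] <= 0x7F) {
        /* First byte in [0x00-0x7f]: a one-byte PCS (US-ASCII) code. */
        *code = in[0];
        *is_pcs = 1;
        return 1;
    }

    if (len < 2)
        return 0; /* a 16-bit code needs a second byte */

    /* Otherwise: a 16-bit CCS code to be looked up in the CCS table. */
    *code = (uint16_t)((in[0] << 8) | in[1]);
    *is_pcs = 0;
    return 2;
}
```

A converter built on this rule would pass PCS codes through unchanged and
hand the 16-bit codes to the table-based lookup.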


@page
@node The encodings description file
@section The encodings description file
@findex encoding.deps description file
@findex mkdeps.pl Perl script
@*
To simplify the process of adding support for new encodings, a lot of
"glue" files are generated automatically.

@*
There is the 'encoding.deps' file in the @emph{lib/} subdirectory which
is used to describe the properties of encodings. The 'mkdeps.pl' Perl script
uses 'encoding.deps' to generate the "glue" files.

@*
The 'encoding.deps' file is composed of sections, each section consists
of entries, each entry contains some encoding/CES/CCS description. 

@*
The 'encoding.deps' file's syntax is very simple. Currently only two
sections are defined: @emph{ENCODINGS} and @emph{CES_DEPENDENCIES}.

@*
Each @emph{ENCODINGS} section's entry describes one encoding and
contains the following information.

@itemize @bullet
@item
Encoding name (the @emph{ENCODING} field). The name should
be unique and only one name is possible.

@item
The encoding's CES converter name (the @emph{CES} field). Only one CES
converter is allowed.

@item
The whitespace-separated list of CCS table names which are used by the
encoding (the @emph{CCS} field).

@item
The whitespace-separated list of alias names (the @emph{ALIASES}
field).
@end itemize

@*
Note that all names in the 'encoding.deps' file must be in normalized
form.

@*
Each @emph{CES_DEPENDENCIES} section's entry describes the dependencies of
one CES converter. For example, the @emph{euc} CES converter depends on
the @emph{table} and the @emph{us_ascii} CES converters since the
@emph{euc} CES converter uses them. This means that both the @emph{table}
and @emph{us_ascii} CES converters should be linked if the @emph{euc}
CES converter is enabled.

@*
The @emph{CES_DEPENDENCIES} section defines the following:

@itemize @bullet
@item
the CES converter name for which the dependencies are defined in this
entry (the @emph{CES} field);

@item
the whitespace-separated list of CES converters which are needed for
this CES converter (the @emph{USED_CES} field).
@end itemize
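
@*
One way such dependencies could be resolved at compile time is with
preprocessor macros, as in the C sketch below. All macro and function names
here (DEMO_CES_EUC, demo_ces_is_linked and so on) are hypothetical and chosen
only for the example. The generated Newlib headers may use different names
and a different structure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical configuration: only the euc CES converter was requested. */
#define DEMO_CES_EUC

/* Dependency closure: euc uses table and us_ascii, so enable them too. */
#ifdef DEMO_CES_EUC
#  ifndef DEMO_CES_TABLE
#    define DEMO_CES_TABLE
#  endif
#  ifndef DEMO_CES_US_ASCII
#    define DEMO_CES_US_ASCII
#  endif
#endif

/* Names of the converters that end up linked in, NULL-terminated. */
static const char *const demo_linked_ces[] = {
#ifdef DEMO_CES_EUC
    "euc",
#endif
#ifdef DEMO_CES_TABLE
    "table",
#endif
#ifdef DEMO_CES_US_ASCII
    "us_ascii",
#endif
#ifdef DEMO_CES_UTF_8
    "utf_8",
#endif
    NULL
};

/* Returns 1 if the named converter is part of the build, 0 otherwise. */
static int demo_ces_is_linked(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; demo_linked_ces[i] != NULL; i++)
        if (strcmp(demo_linked_ces[i], name) == 0)
            return 1;
    return 0;
}
```

In this configuration the table and us_ascii converters are pulled in by
euc, while utf_8, which nothing requested, stays out of the build.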

@*
The 'mkdeps.pl' Perl script automatically solves the following tasks.

@itemize @bullet
@item
The user works with the iconv library in terms of encodings and doesn't need
to know anything about CES converters and CCS tables. The script automatically
generates code which enables all needed CES converters and CCS tables
for all encodings which were enabled by the user.

@item
The CES converters may have dependencies and the script automatically
generates the code which handles these dependencies.

@item
The list of encoding's aliases is also automatically generated.

@item
The script uses a lot of macros in order to enable only the minimum set
of code/data which is needed to support the requested encodings in the
requested directions.
@end itemize

@*
The 'mkdeps.pl' Perl script interprets the 'encoding.deps'
file and generates the following files.

@itemize @bullet
@item
@emph{lib/encnames.h} - this header file contains macro definitions for all
encoding names.

@item
@emph{lib/aliasesbi.c} - the array of encoding names and aliases. The array
is used to find the name of the requested encoding by its alias.

@item
@emph{ces/cesbi.c} - this file defines two arrays
(@code{_iconv_from_ucs_ces} and @code{_iconv_to_ucs_ces}) which contain
descriptions of the enabled "from UCS" and "to UCS" CES converters and the
names of the encodings which are supported by these CES converters.

@item
@emph{ces/cesbi.h} - this file contains a set of macros which define
the set of CES converters that should be enabled, given only the set of
enabled encodings (through macros defined in the
@emph{newlib.h} file). Note that one CES converter may handle several
encodings.

@item
@emph{ces/cesdeps.h} - the CES converters dependencies are handled in
this file.

@item
@emph{ccs/ccsdeps.h} - the array of linked-in CCS tables is defined
here.

@item
@emph{ccs/ccsnames.h} - this header file contains macro definitions for all
CCS names.

@item
@emph{encoding.aliases} - the list of supported encodings and their
aliases which is intended for the Newlib configure scripts in order to
handle the iconv-related configure script options.
@end itemize


@page
@node How to add new encoding
@section How to add new encoding
@*
First, the new encoding should be broken down into CCS and CES. Then,
the process of adding the new encoding is split into the following activities.

@enumerate
@item Generate the .cct CCS file and the .c source file for the new
encoding's CCS (if it isn't already present). To do this, the CCS source
file is needed and the 'mktbl.pl' script should be used.

@item Write the corresponding CES converter (if it isn't already
present). Use the existing CES converters as an example.

@item
Add the corresponding entries to the 'encoding.deps' file and regenerate
the autogenerated "glue" files using the 'mkdeps.pl' script.

@item
Don't forget to add entries to the newlib/newlib.hin file.

@item
Of course, the 'Makefile.am'-s should also be updated (if new files were
added) and the 'Makefile.in'-s should be regenerated using the correct
version of 'automake'.

@item
Don't forget to update the documentation (the list of
supported encodings and CES converters).
@end enumerate

In case a new encoding doesn't fit to the CES/CCS decomposition model or
it is desired to add the specialized (non UCS-based) conversion support,
the Newlib iconv library code should be upgraded.


@page
@node The locale support interfaces
@section The locale support interfaces
@*
The newlib iconv library also has some interface functions (besides the
@code{iconv}, @code{iconv_open} and @code{iconv_close} interfaces) which
are intended for the Locale subsystem. All the locale-related code is
placed in the @emph{lib/iconvnls.c} file.

@*
The following is the description of the locale-related interfaces:

@itemize @bullet
@item
@code{_iconv_nls_open} - opens two iconv descriptors for "CCS ->
wchar_t" and "wchar_t -> CCS" conversions. The normalized CCS name is
passed in the function parameters. The @emph{wchar_t} characters encoding is
either ucs_2_internal or ucs_4_internal depending on size of
@emph{wchar_t}.

@item
@code{_iconv_nls_conv} - the function is similar to the @code{iconv}
function, but if there is no character in the output encoding which
corresponds to the character in the input encoding, the default
conversion isn't performed (the @code{iconv} function sets such output
characters to the @kbd{?} symbol, which is the behavior
specified in SUSv3).

@item
@code{_iconv_nls_get_state} - returns the current encoding's shift state
(the @code{mbstate_t} object).

@item
@code{_iconv_nls_set_state} - sets the current encoding's shift state (the
@code{mbstate_t} object).

@item
@code{_iconv_nls_is_stateful} - checks whether the encoding is stateful
or stateless.

@item
@code{_iconv_nls_get_mb_cur_max} - returns the maximum length (the
maximum bytes number) of the encoding's characters.
@end itemize
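
@*
The choice of the @emph{wchar_t} encoding mentioned in the list above can be
illustrated with a short C sketch. The helper names wchar_encoding_name and
native_wchar_encoding are hypothetical and are not part of the Newlib
interfaces. The sketch only shows how the encoding name might follow from
the size of @emph{wchar_t}.

```c
#include <assert.h>
#include <stddef.h>

/* Maps a wchar_t size in bytes to the internal encoding name,
   or returns NULL for a size this sketch does not handle. */
static const char *wchar_encoding_name(size_t wchar_size)
{
    switch (wchar_size) {
    case 2:
        return "ucs_2_internal";
    case 4:
        return "ucs_4_internal";
    default:
        return NULL;
    }
}

/* Encoding name for the wchar_t of the system being compiled for. */
static const char *native_wchar_encoding(void)
{
    const char *name = wchar_encoding_name(sizeof(wchar_t));
    assert(name != NULL); /* only 2- and 4-byte wchar_t are covered here */
    return name;
}
```

Keeping the size as a parameter makes the mapping easy to check for both
cases regardless of the host's own @emph{wchar_t} size.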


@page
@node Contact
@section Contact
@*
The author of the original BSD iconv library (Alexander Chuguev) no longer
supports that code.

@*
Any questions regarding the iconv library may be forwarded to
Artem B. Bityuckiy (dedekind@@oktetlabs.ru or dedekind@@mail.ru) as
well as to the public Newlib mailing list.