skf

Version: 370609 (fedora - 01/12/10)

Section: 1 (User commands)

NAME

skf - simple Kanji Filter (v1.97)

SYNOPSIS

skf [-AEIJKNQRSXZabehjknqrsuvxz] [ long_format_options ] [infiles..]

DESCRIPTION

skf is yet another i18n-capable kanji filter, designed for reading various CJK-coded files on the Net. skf converts input kanji texts or streams into a character stream in a designated codeset and writes the result to standard output. Specifically, skf is designed to be a versatile filter for reading documents in various codesets, and does not provide features unrelated to code conversion.

Like nkf, skf automatically recognizes the input file code when it is an ISO-2022 compliant code, and also detects EUC-variant codes if the input file is Japanese text without X 0201 kana. skf 1.9x can read various iso-2022 compliant character sets, including JIS Kanji codes (X 0208, X 0212 and X 0213), EUC encodings (euc-jp (with X 0213 support), euc-cn, euc-kr and euc-tw), ISO European latins (ISO-8859-1 to 11, 13/14/15/16) and many regional character sets. skf can also read some non-iso2022 compliant sets, including Microsoft Shift-JIS code, KOI-8-R/U, GB2312 (HZ), big5, VISCII (rfc1456, including VIQR), the Unicode standard (UCS2/UTF-16, UTF7 and UTF8), some MS codesets (cp1250 etc.) and some other vendor-specific codes (KEIS83, JEF etc.).
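One reason ISO-2022 input can be recognized automatically is that such streams announce their character sets with escape sequences. The sketch below uses Python's standard iso2022_jp and euc_jp codecs, not skf itself, to show this. The helper name looks_like_iso2022 is hypothetical and far cruder than a real detector.

```python
ESC = b"\x1b"

def looks_like_iso2022(data):
    """Crude check: does the byte stream contain an ESC designation sequence?"""
    return ESC + b"$" in data or ESC + b"(" in data

text = "Kanji: 漢字"
as_jis = text.encode("iso2022_jp")   # escape sequences surround the kanji bytes
as_euc = text.encode("euc_jp")       # 8-bit bytes, no escape sequences

print(as_jis)
print(looks_like_iso2022(as_jis))    # True
print(looks_like_iso2022(as_euc))    # False
```

EUC and Shift-JIS input carries no such markers, which is why detection there has to rely on byte-pattern heuristics and can fail.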

Supported output character sets of skf are more limited, but still include X 0208/X 0212/X 0213 JIS, X 0201 JIS, ASCII, Microsoft Shift-JIS, EUC-jp/-kr/-cn, HZ, iso-2022-jp/kr, big5, VISCII and Unicode.

skf also provides basic decoding features for some common encodings, including MIME, Punycode and URI codepoint. A Unicode decomposition feature has also been supported since 1.96.

As noted above, skf is designed to convert input text into some human-readable form under a local environment (i.e. codeset), and has several extra conversion features such as GNU recode type folding. Such conversions include Windows/Macintosh specific code swaps and old-new jis glyph changes, html-format/TeX format conversion and variant unifications.

skf can also be compiled as an extension for some lightweight languages. See README.txt for details.

If one or more file names are given, skf reads the files and writes the converted stream to stdout. If no file names are given, input is taken from stdin and output still goes to stdout. OPTIONS are taken from the environment variables SKFENV and skfenv and from the command line, in this order. Environment variables are not used when skf is running as a privileged user. skf does not use LOCALE-related environment variables for conversions, but error message output is controlled by the given LOCALES.
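The option order described above can be sketched as follows. This is an illustration only: collect_options is a hypothetical helper, and splitting the environment variables with shell-style quoting is an assumption, not documented skf behavior.

```python
import shlex

def collect_options(environ, argv, privileged=False):
    """Return option words in the documented order:
    SKFENV first, then skfenv, then the command line."""
    words = []
    if not privileged:  # environment variables are skipped for a privileged user
        for name in ("SKFENV", "skfenv"):
            words.extend(shlex.split(environ.get(name, "")))
    words.extend(argv)
    return words

print(collect_options({"SKFENV": "--oc=utf8"}, ["--ic=sjis", "in.txt"]))
# ['--oc=utf8', '--ic=sjis', 'in.txt']
```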

OPTIONS

skf-1.9 is written from scratch, and inherits no code from nkf. However, skf is intended to be a drop-in replacement for nkf (v1.4), and provides a similar set of the commonly used nkf options.
skf 1.96 recognizes the following options. All defaults are off unless explicitly specified.

buffering control

-b
use buffered output. This is default.
-u
use unbuffered output. Code detection feature is disabled when this option is on.

Input/Output codeset options

--ic=input_code_set
Specify that the input codeset is input_code_set. Possible candidates are shown below.
--oc=output_code_set
Specify that the output codeset is output_code_set. Possible candidates are shown below. The default codeset in the distribution package is euc-jp, but this depends on compile options. Default codeset is shown by

Supported codeset

skf recognizes the following codesets as input/output codesets. Codeset names are case insensitive, and minus ('-') and underscore ('_') are ignored. Note that an iso-2022 escape-based input codeset (registered with IANA) is recognized automatically, even when a non-iso2022 codeset (except Unicode and B-Right/V) is specified. An o in the in column means the named codeset can be specified as input, and an x means it cannot. The out column is the same, but for output.
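The name-matching rule above (case insensitive, '-' and '_' ignored) amounts to comparing names after a simple normalization. A minimal sketch, with normalize_codeset_name as a hypothetical helper name:

```python
def normalize_codeset_name(name):
    """Fold case and drop '-' and '_' so equivalent spellings compare equal."""
    return name.lower().replace("-", "").replace("_", "")

for spelling in ("euc-jp", "EUC_JP", "EucJp"):
    print(spelling, "->", normalize_codeset_name(spelling))
```

Under this rule, --oc=euc-jp, --oc=EUC_JP and --oc=eucjp would all name the same codeset.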

in out name description
o o iso8859-1 ascii + iso-8859-1 (latin-1)
o o iso8859-2 ascii + iso-8859-2 (latin-2)
o o iso8859-3 ascii + iso-8859-3 (latin-3)
o o iso8859-4 ascii + iso-8859-4 (latin-4)
o o iso8859-5 ascii + iso-8859-5 (Cyrillic)
o o iso8859-6 ascii + iso-8859-6 (Arabic)
o o iso8859-7 ascii + iso-8859-7 (Greek)
o o iso8859-8 ascii + iso-8859-8 (Hebrew)
o o iso8859-9 ascii + iso-8859-9 (latin-5)
o o iso8859-10 ascii + iso-8859-10 (latin-6)
o o iso8859-11 ascii + iso-8859-11 (Thai)
o o iso8859-13 ascii + iso-8859-13 (Baltic Rim)
o o iso8859-14 ascii + iso-8859-14 (Celtic)
o o iso8859-15 ascii + iso-8859-15 (Latin-9)
o o iso8859-16 ascii + iso-8859-16
o o koi-8r koi-8r (Russian)
o o cp1251 Cyrillic latin MS cp1251
o o jis iso-2022-jp (rfc1468 7bit JIS)
o o iso-2022-jp-x0213 iso-2022-jp-3 (JIS X 0213:2000), a.k.a. jis-x0213
o o jis-x0213-strict iso-2022-jp-3-strict
o o iso-2022-jp-2004 iso-2022-jp-2004 (JIS X 0213:2004), a.k.a. jis-x0213-2004
o o oldjis iso-2022-jp-1978 (JIS X 0208:1978)
o o cp50220 Microsoft codepage 50220
o o cp50221 Microsoft codepage 50221
o o cp50222 Microsoft codepage 50222
o o euc-jp EUC-encoded JIS X 0208:1997
o o euc-x0213 EUC-encoded JIS X 0213:2000
o o euc-jis-2004 EUC-encoded JIS X 0213:2004
o o cp51932 EUC-encoded Microsoft codepage 932
o o euc-kr EUC-encoded KS X 1001 Korean
o o euc7-kr 7bit EUC-encoded KS X 1001 Korean
o o uhc Unified Hangul Code (Windows cp949)
o o johab KS X 1001-johab Korean
o o euc-cn EUC-encoded GB2312 Chinese
o o euc7-cn 7bit EUC-encoded GB2312 Chinese
o o hz HZ-encoded GB2312 Chinese
o o euc-tw EUC-encoded CNS 11643 Chinese
o o gb12345 EUC-encoded GB12345 Chinese
o o gbk GB2312 Extension(cp936) Chinese
o o gb18030 GB18030 Chinese
o o big5 BIG5 (with Eten extension + EURO)
o o cp950 BIG5 (Microsoft cp950 + EURO)
o o big5-hkscs BIG5 with HKSCS
o o big5-2003 BIG5-2003
o o big5-uao BIG5 Unicode-At-On (UAO)
o o sjis Shift-jis (Microsoft cp943)
o o shiftjis-x0213 Shiftjis-encoded JIS X 0213:2000
o o shiftjis-2004 Shiftjis-encoded JIS X 0213:2004
o x sjis-cellular Shiftjis-encoded JIS X 0208:1997 with NTT Docomo, Vodafone(SoftBank) phone glyph
o o oldsjis Shift-jis (JIS X 0208:1978)
o o cp932 Shift-jis-encoded MS cp932
o o cp932w Shift-jis-encoded MS cp932 with MS compatibility
o o viscii VISCII (rfc1456) Vietnamese
o o viqr VISCII (rfc1456-VIQR) Vietnamese
o o keis Hitachi KEIS83/90
o x jef Fujitsu JEF (basic support only)
o x ibm930 IBM EBCDIC DBCS Japanese
o x ibm931 IBM EBCDIC DBCS Japanese w.latin
o x ibm933 IBM EBCDIC DBCS Korean
o x ibm935 IBM EBCDIC DBCS Simpl. Chinese
o x ibm937 IBM EBCDIC DBCS Trad. Chinese
o o unicode Unicode(TM) UCS-2/UTF-16LE
o o unicodefffe Unicode(TM) UTF-16BE
o o utf7 Unicode(TM) UTF-7
o o utf8 Unicode(TM) UTF-8
o x transparent Transparent mode (see below)

Codeset explanations

iso-8859-*

When specified as output, G0 = GL is ascii and G1 = GR is iso-8859-*. 8bit encoding is used.
iso-2022-jp, jis

Encoding is iso-2022-jp-2 (RFC1554). G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is JIS X 0212:1990 Supplementary Kanji.
jis-x0213, iso-2022-jp-3

Encoding is iso-2022-jp-3 (JIS X 0213:2000 based). For output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is JIS X 0213 plane 2 Kanji.
jis-x0213-strict

Encoding is a subset of iso-2022-jp-3-strict (uses Plane 1 only). For output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is not set. Output uses JIS X 0208 codes whenever possible. JIS X 0213 input is automatically recognized.
jis-x0213-2004, iso-2022-jp-2004

Encoding is iso-2022-jp-2003:2004. For output, G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is JIS X 0213 plane2 Kanji.
oldjis

Encoding is iso-2022-jp using the old JIS X 0208:1978. G0 = GL is JIS X 0201 roman, G1 = GR is JIS X 0201 kana, G2 is iso-8859-1 and G3 is JIS X 0212 Supplementary Kanji.
euc-jp, euc

Encoding is 8-bit EUC using JIS X 0208:1997 character set. G0 = GL is ascii, G1 = GR is JIS X 0208, G2 is JIS X 0201 kana and G3 is JIS X 0212 Supplementary Kanji.
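The G1/G2 split described above is visible in the bytes. The sketch below uses Python's standard euc_jp codec, not skf, so the mapping tables may differ from skf's in corner cases: a JIS X 0208 kanji is two bytes with the high bit set (G1), while a JIS X 0201 kana is prefixed by the single-shift byte SS2 (0x8E) that selects G2.

```python
SS2 = 0x8E  # single shift 2: the next character comes from G2

kanji = "漢".encode("euc_jp")  # JIS X 0208 character, taken from G1
kana = "ｱ".encode("euc_jp")   # halfwidth katakana (JIS X 0201), taken from G2

print(kanji.hex())       # b4c1
print(kana.hex())        # 8eb1
print(kana[0] == SS2)    # True
```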
euc-x0213, euc-jis-2003

Encoding is 8-bit EUC-based JIS X 0213:2000. G0 = GL is ascii, G1 = GR is X 0213:2000 plane 1, G2 is iso-8859-1 and G3 is JIS X 0213:2000 plane2 Kanji.
euc-jis-2004

Encoding is 8-bit EUC-based JIS X0213:2004. G0 = GL is ascii, G1 = GR is X0213:2004 plane 1, G2 is iso-8859-1 and G3 is JIS x0213:2004 plane2 Kanji.
euc-kr

Encoding is 8-bit EUC using the KS X 1001 Wansung character set. G0 = GL is KS X1003, G1 = GR is KS X1001, G2 and G3 are not set.
euc7-kr, iso-2022-kr

Encoding is iso-2022-kr (rfc1557): 7-bit EUC using the KS X 1001 Wansung character set. G0 = GL is KS X1003, G1 is KS X1001, G2 and G3 are not set.
euc-cn

Encoding is 8-bit EUC using the GB 2312 simplified Chinese character set. G0 = GL is ASCII, G1 = GR is GB2312, G2 and G3 are not set.
euc7-cn

Encoding is 7-bit EUC using the GB 2312 simplified Chinese character set. G0 = GL is ASCII, G1 is GB2312, G2 and G3 are not set.
hz

Encoding is the HZ-encoded (rfc1842) GB 2312 simplified Chinese character set. G0 = GL is ASCII, G1 = GR is GB2312, G2 and G3 are not set.
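For illustration, HZ wraps 7-bit GB 2312 byte pairs between the markers ~{ and ~} inside otherwise plain ASCII text. The sketch below uses Python's standard hz codec rather than skf:

```python
encoded = "中文".encode("hz")
print(encoded)                 # b'~{VPND~}'
print(encoded.decode("hz"))    # 中文

# Text outside the markers stays ordinary ASCII.
print("abc".encode("hz"))      # b'abc'
```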
euc-tw

Encoding is the EUC-encoded CNS11643 Plane 1/2 traditional Chinese character set. Subset of iso-2022-cn. G0 = GL is ASCII, G1 = GR is CNS11643 plane 1, G2 is CNS11643 plane 2 and G3 is not set.
gb12345

Encoding is 8-bit EUC using the GB 12345 (GBF) traditional Chinese character set. G0 = GL is ASCII, G1 = GR is GB12345, G2 and G3 are not set.
gbk, cp936

Encoding is the GBK simplified Chinese character set. G0 = GL is ASCII and G1 = GR is GBK. G2 and G3 are not set.
gb18030 (experimental)

Encoding is the GB18030 (ibm-1392, Windows cp54936) Chinese character set. Uses ASCII as the latin part.
big5

Encoding is the Big5 traditional Chinese character set with the ETen extension. Includes Euro mapping. Uses ASCII as the latin part.
cp950

Encoding is the Microsoft cp950-Big5 traditional Chinese character set. Uses ASCII as the latin part.
big5-hkscs (experimental)

Encoding is the cp950-Big5 traditional Chinese character set with the HKSCS extension. Uses ASCII as the latin part.
big5-2003 (experimental)

Encoding is the Big5-2003 Taiwanese standard traditional Chinese character set. Uses ASCII as the latin part.
big5-uao (experimental)

Encoding is the Big5-UAO (http://uao.cpatch.org) traditional Chinese character set. Uses ASCII as the latin part.
VISCII (experimental)

Vietnamese VISCII (rfc1456) character set. Not TCVN-5712.
VIQR (experimental)

Vietnamese VISCII character set with VIQR encoding (rfc1456).
sjis

Encoding is Shift-encoded JIS X 0208:1997 character set. Note that this is not cp932. Uses JIS X 0201 latin as latin(GL) part.
sjis-x0213, shift_jis-2000

Encoding is Shift-encoded JIS using JIS X 0213:2000 character set.
sjis-x0213-2004, shift_jis-2004

Encoding is Shift-encoded JIS using the JIS X 0213:2004 character set. Ten newly defined characters are added, but the Unicode mapping is the same as for JIS X 0213:2000. Uses JIS X 0201 latin as the latin (GL) part.
sjis-cellular (experimental)

Encoding is Shift-encoded JIS X 0208:1997 character set with NTT Docomo/Vodafone(SoftBank) cellular phone glyph mapping.
cp932, cp932w

Encoding is Microsoft SJIS cp932 with the NEC/IBM gaiji area, based on the Windows XP mapping. Uses ASCII as the latin (GL) part. --use-compat and --use-ms-compat are automatically enabled. cp932w provides further WideCharToMultiByte compatibility.
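The distinction between sjis and cp932 matters in practice. The sketch below uses Python's standard shift_jis and cp932 codecs as stand-ins, so skf's own tables may differ in detail. It shows one byte pair that maps to different Unicode characters, and one NEC extension character that exists only in cp932.

```python
raw = b"\x81\x60"
print(hex(ord(raw.decode("shift_jis"))))  # 0x301c (WAVE DASH)
print(hex(ord(raw.decode("cp932"))))      # 0xff5e (FULLWIDTH TILDE)

circled_one = "\u2460"                    # NEC extension character
print(circled_one.encode("cp932").hex())  # 8740
try:
    circled_one.encode("shift_jis")
    in_plain_sjis = True
except UnicodeEncodeError:
    in_plain_sjis = False
print(in_plain_sjis)                      # False
```

Differences of this kind are why choosing the wrong one of the two codesets can silently change symbols in a round trip.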
cp51932

Encoding is Microsoft EUC-based cp51932 with the NEC/IBM gaiji area, based on the Windows XP mapping. Uses ASCII as G0 and JIS X 0201 kana as the EUC G2 part. G3 is not used for output, and is JIS X 0212:2000 for input. --use-compat and --use-ms-compat are automatically enabled.
cp50220, cp50221, cp50222

Encoding is Microsoft JIS-based cp50220, cp50221 or cp50222 with the NEC/IBM gaiji area, based on the Windows XP mapping. For input, skf accepts cp50220, 50221 and 50222. Note that this codeset is NOT compatible with iso-2022. Uses ASCII as the default character set. --use-compat and --use-ms-compat are automatically enabled.
oldsjis

Encoding is Microsoft SJIS (JIS X 0208:1978 a.k.a. old JIS). Uses JIS X 0201 latin as latin(GL) part.
johab

Encoding is KS X1001(Johab) character set. Uses KS X1003 latin as latin(GL) part.
uhc

Encoding is UHC (cp949) character set. Uses ASCII as latin(GL) part.
unicode, unicodefffe

Encoding is Unicode UTF-16 (v5.0). The default input/output byte order is little endian for unicode and big endian for unicodefffe, and an input byte order mark is recognized. Output includes an endian mark by default unless --disable-endian-mark is specified. The output range is within UTF-32, using surrogate pairs, unless --limit-to-ucs2 is specified.
Note that ucs2 is not supported within the perl/ruby extension for either input or output, because of a data structure limitation. Specifying ucs2 generates an error.
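The interplay between the default byte order and the byte order mark can be sketched as below. decode_utf16 is a hypothetical helper, not skf code: a BOM, when present, overrides the default. It also shows how a character outside the BMP becomes a surrogate pair.

```python
def decode_utf16(data, default="little"):
    """Decode UTF-16, letting a leading byte order mark override the default."""
    if data[:2] == b"\xff\xfe":
        return data[2:].decode("utf-16-le")
    if data[:2] == b"\xfe\xff":
        return data[2:].decode("utf-16-be")
    codec = "utf-16-le" if default == "little" else "utf-16-be"
    return data.decode(codec)

print(decode_utf16(b"\xfe\xff\x00A"))     # BOM says big endian -> 'A'
print(decode_utf16(b"A\x00"))             # no BOM, little endian default -> 'A'

# U+20000 (CJK extension B) needs a surrogate pair in UTF-16.
print("\U00020000".encode("utf-16-be").hex())  # d840dc00
```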
utf8

Encoding is UTF-8 encoded Unicode (v5.0). Output doesn't include byte order mark unless --enable-endian-mark is specified. Output range is within UTF-32 unless --limit-to-ucs2 is specified. By default, CESU-8 is not accepted as input. Option --enable-cesu8 enables CESU-8 input for utf-8 converter. CESU-8 output is not supported. For UTF-8, endian mark (BOM) is always ignored.
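To make the UTF-8/CESU-8 distinction concrete: CESU-8 encodes a character outside the BMP as two 3-byte sequences, one per UTF-16 surrogate, instead of a single 4-byte UTF-8 sequence. The function below is an illustrative sketch; cesu8_encode is a hypothetical name, and Python has no built-in CESU-8 codec.

```python
def cesu8_encode(text):
    """Encode text as CESU-8: supplementary characters become two
    UTF-8-style 3-byte sequences, one for each UTF-16 surrogate."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out += ch.encode("utf-8")
        else:
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
    return bytes(out)

ch = "\U00020000"
print(ch.encode("utf-8").hex())   # f0a08080
print(cesu8_encode(ch).hex())     # eda180edb080
```

A strict UTF-8 decoder rejects the second form, which is consistent with CESU-8 input needing an explicit option.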
utf7

Encoding is UTF-7 encoded Unicode (v5.0). Input/output range is limited to UTF-16, and value above U+10000 is regarded as undefined. BOM is always ignored for input, and never used for output.
keis (experimental)

Encoding is Hitachi KEIS83/90. Output range is limited to EBCDIK and JIS X 0208 area.
jef (experimental)

Encoding is Fujitsu JEF. Input only. Only basic part is supported.
ibm930 (experimental)

Encoding is IBM DBCS Japanese with EBCDIC Kana
ibm931 (experimental)

Encoding is IBM DBCS Japanese with EBCDIC latin (ibm037)
ibm933 (experimental)

Encoding is IBM DBCS Korean with EBCDIC Wansung character set
ibm935 (experimental)

Encoding is IBM DBCS Simplified Chinese with EBCDIC Chinese
ibm937 (experimental)

Encoding is IBM DBCS Traditional Chinese with EBCDIC Chinese
koi8r

Russian KOI-8R code.
cp1250

Central European latin Microsoft cp1250 code
cp1251

Eastern European cyrillic Microsoft cp1251 code
transparent

Transparent mode. Various code control features, including folding and line end code conversion, are also ignored.

Shortcuts

-n -j
same as --oc=jis
-s -x
same as --oc=sjis
-a -e
same as --oc=euc-jp
-q
same as --oc=ucs2
-z
same as --oc=utf8
-y
same as --oc=utf7
-k
same as --oc=keis
-A, -E
same as --ic=euc-jp. Assume input codeset is EUC-JP.
-N
same as --ic=jis. Assume input codeset is iso-2022-jp.
-S, -X
same as --ic=sjis. Assume input codeset is shift JIS
-Q
same as --ic=ucs2.
-Y
same as --ic=utf7.
-Z
same as --ic=utf8.
-K
same as --ic=keis.

ISO-2022 Specific controls

After G0-G3 have been set up according to the specified input codeset, these options replace them with the assigned character sets. Note that this does not change other properties of the original codeset, such as language and encoding.
--set-g0=`charset name'
Predefines specified code set to plane 0 (G0). Also set to GL at initial state.
--set-g1=`charset name'
Predefines specified code set to right plane (G1). Also set to GR at initial state.
--set-g2=`charset name'
Predefines specified code set to right plane (G2).
--set-g3=`charset name'
Predefines specified code set to right plane (G3).

The supported `charset name' values are as follows. 'o' means the codeset can be set to that plane, and 'x' means it cannot. For unicode family codesets, this option is ignored. For other non-iso2022 categories, this option is not supported, and the result is unpredictable.

g0 g1 g2 g3  codeset name  description
o  o  o  o   ascii         ANSI X3.4 ASCII
o  o  o  o   x0201         JIS X 0201 (latin part)
x  o  o  o   iso8859-1     ISO 8859-1 latin
x  o  o  o   iso8859-2     ISO 8859-2 latin
x  o  o  o   iso8859-3     ISO 8859-3 latin
x  o  o  o   iso8859-4     ISO 8859-4 latin
x  o  o  o   iso8859-5     ISO 8859-5 Cyrillic
x  o  o  o   iso8859-6     ISO 8859-6 Arabic
x  o  o  o   iso8859-7     ISO 8859-7 Greek-latin
x  o  o  o   iso8859-8     ISO 8859-8 Hebrew
x  o  o  o   iso8859-9     ISO 8859-9 latin
x  o  o  o   iso8859-10    ISO 8859-10 latin
x  o  o  o   iso8859-11    ISO 8859-11 Thai
x  o  o  o   iso8859-13    ISO 8859-13 latin
x  o  o  o   iso8859-14    ISO 8859-14 latin
x  o  o  o   iso8859-15    ISO 8859-15 latin
x  o  o  o   iso8859-16    ISO 8859-16 latin
x  o  o  o   tcvn5712      TCVN 5712 (Vietnamese)
x  o  o  o   ecma94        ECMA 94 Cyrillic (KOI-8e)
o  o  o  o   x0212         JIS X 0212:1990
o  o  o  o   x0208         JIS X 0208:1997
o  o  o  o   x0213         JIS X 0213 Plane 1:2000
o  o  o  o   x0213-2       JIS X 0213 Plane 2:2000
o  o  o  o   x0213n        JIS X 0213 Plane 1:2004
o  o  o  o   gb2312        Simplified Chinese GB2312
o  o  o  o   gb1988        Chinese GB1988 (latin)
o  o  o  o   gb12345       Traditional Chinese GB12345
o  o  o  o   ksx1003       Korean KS X 1003 (latin)
o  o  o  o   ksx1001       Korean KS X 1001
x  o  o  o   koi8-r        Cyrillic KOI-8R
x  o  o  o   koi8-u        Ukrainian Cyrillic KOI-8U
o  o  o  o   cns11643-1    Traditional Chinese CNS11643-1
x  o  o  o   viscii-r      RFC1456 VISCII (right plane)
o  o  o  o   viscii-l      RFC1456 VISCII (left plane)
x  o  o  o   cp437         Microsoft cp437 (US latin)
x  o  o  o   cp737         Microsoft cp737
x  o  o  o   cp775         Microsoft cp775
x  o  o  o   cp850         Microsoft cp850
x  o  o  o   cp852         Microsoft cp852
x  o  o  o   cp855         Microsoft cp855
x  o  o  o   cp857         Microsoft cp857
x  o  o  o   cp860         Microsoft cp860
x  o  o  o   cp861         Microsoft cp861
x  o  o  o   cp862         Microsoft cp862
x  o  o  o   cp863         Microsoft cp863
x  o  o  o   cp864         Microsoft cp864
x  o  o  o   cp865         Microsoft cp865
x  o  o  o   cp866         Microsoft cp866
x  o  o  o   cp869         Microsoft cp869
x  o  o  o   cp874         Microsoft cp874
x  o  o  o   cp932         Microsoft cp932 (Japanese)
x  o  o  o   cp1250        Microsoft cp1250 (Central Europe)
x  o  o  o   cp1251        Microsoft cp1251 (Cyrillic)
x  o  o  o   cp1252        Microsoft cp1252 (Latin-1)
x  o  o  o   cp1253        Microsoft cp1253 (Greek)
x  o  o  o   cp1254        Microsoft cp1254 (Turkish)
x  o  o  o   cp1255        Microsoft cp1255
x  o  o  o   cp1256        Microsoft cp1256
x  o  o  o   cp1257        Microsoft cp1257
x  o  o  o   cp1258        Microsoft cp1258


--euc-protect-g1
In EUC input mode, suppress sequences to set a charset to G1. Such sequences are discarded.
--add-annon
Add announcer for JIS X 0208:1997 to X 0208 designate sequence. This option works only with iso-2022-based output.
--input-detect-jis78
Distinguish the JIS X 0208:1978 codeset from the JIS X 0208:1997 codeset. By default, both charsets are regarded as X 0208:1997. This option is valid only when the input encoding is JIS (iso-2022-jp).

Unicode coding specific control options

--use-compat --suppress-compat
skf substitutes characters in the Unicode compatibility area (U+F900 - U+FFFD) with appropriate characters outside that area. When substitution is enabled, these characters are converted to variants or treated as undefined. --use-compat disables this substitution, and --suppress-compat enables it. Substitution is enabled by default, but several codesets disable it as a codeset feature (i.e. they use the compatibility area). See the codeset section.
--use-ms-compat
When output is Unicode, make the Unicode mapping Microsoft Windows compatible. This only changes the conversion of some symbols in JIS Kanji, and adding the --use-compat option is recommended for roundtrip conversion. If you need stricter compatibility, try cp932w as the input codeset.
--use-cde-compat
When output is Unicode, make translation CDE standard codeset compatible.
--little-endian
When output is UTF-16, use little endian byte-order. This is default.
--big-endian
When output is UTF-16, use big endian byte-order.
--disable-endian-mark --enable-endian-mark
When output is UTF-16 or UTF-8, do not use/use byte order marking. To make UTF-16N, use this option with --little-endian. By default, BOM is enabled for UTF-16 and disabled for UTF-8.
--input-little-endian
When input is UTF-16, assume input is little endian byte-ordered. This is default, but skf respects byte-order mark.
--input-big-endian
When input is UTF-16, assume input is big endian byte-ordered. Note that skf respects byte-order mark.
--endian-protect
Do not use endian mark in input stream. Endian mark is just discarded. This is off by default.
--limit-to-ucs2
Do not output Unicode code points of U+10000 and above (i.e. limit output to the BMP area). This option does not limit the internal code range in skf. This is off by default.
--disable-cjk-extension
Treat CJK extension A/B areas as undefined. This is off (i.e. these areas are enabled) by default.
--enable-cesu8
Enable CESU-8 input in utf-8 codeset. Ignored for any other codesets.
--non-strict-utf8
Enable broken (decodable but not conforming to the specification) utf-8 input. If you need this option, proceed with extra care.
--enable-nfd-decomposition --disable-nfd-decomposition
Enable/Disable Unicode Normalized decomposition. Default is disabled.
--enable-nfda-decomposition --disable-nfda-decomposition
Enable/Disable Apple-compatible Unicode Normalized decomposition. Default is disabled.
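As an illustration of what normalized decomposition means, the sketch below uses Python's standard unicodedata module. It shows standard Unicode NFD only; the Apple-compatible variant selected by --enable-nfda-decomposition is not reproduced here.

```python
import unicodedata

composed = "が"                                  # U+304C
decomposed = unicodedata.normalize("NFD", composed)

print([hex(ord(c)) for c in composed])           # ['0x304c']
print([hex(ord(c)) for c in decomposed])         # ['0x304b', '0x3099']
```

The decomposed form is the base kana followed by a combining voiced sound mark, and the two strings render identically.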

Codeset/Vendor Specific codeset handling flags

skf by default assumes that machine-specific parts of kanji code are Microsoft Windows compatible. The following options control this behavior. Options in this category are valid only when the output codeset is a Japanese codeset, except --disable-charts.
--use-apple-gaiji
Assume machine specific part in input file is Macintosh Classic OS (System 7,8,9) compatible.
--disable-ibm-gaiji --disable-nec-gaiji
Disable IBM/NEC defined machine specific part in input file.
--disable-chart
Do not use Moji-keisen characters. This is for old Macintosh system (System 6.x or older) compatibility.
--old-nec-compat
Enable old NEC kanji sequence (ESC-K,H). Needs compile option --enable-oldnec at configuration.
--no-utf7
Assume input codeset is *NOT* UTF-7 encoded Unicode. This option disables input utf7 testing.
--no-kana
Assume input codeset does *NOT* include JIS X 0201 kana.

OUTPUT Conversions options

skf is intended to write its output stream to stdout, but an nkf-compatible option for changing file encoding in place is also provided.

--overwrite --in-place
Convert the encoding of the file(s) specified as input. --overwrite preserves the file change date.

skf has various features for adapting output files to the local environment. Most of these are controlled by the extended control switches described in this section.

--use-g0-ascii
set G0(=GL) for output encoding to ASCII, ignoring codeset designation.

X-0201 Kana/latin conversions

skf by default converts X-0201 kana to X-0208 kana. To output X-0201 kana as is, use one of the following options. When output is designated as EUC or SJIS, these options enable X-0201 kana output in the way each encoding provides. When Unicode output is specified, output of the (equivalent) kana part is controlled by --use-compat, not by the following switches, which are valid only when the output codeset is NOT in the Unicode family.
--kana-jis7
Use the SI/SO locking shift sequence to designate X-0201 kana. This switch is valid for jis, jis-x0213 and cp50220 (i.e. cp50221) encoding. For other codesets, this option is ignored.
--kana-jis8
Output X-0201 kana using the 8-bit code right plane. This switch is valid for jis and jis-x0213 encoding. For other codesets, this option is ignored.
--kana-esci --kana-call
Use ESC-(-I to designate X-0201 kana. This switch is valid for jis, jis-x0213 and cp50220 (i.e. cp50222) encoding. For other codesets, this option is ignored.
--kana-enable
If output is EUC-JP or cp51932, use X-0201 kana with G2. For SJIS output, it is the same as --kana-jis8. For JIS output, it is the same as --kana-call.
--use-iso8859-1
Enable iso-8859-1 output. Iso-8859-1 is invoked to G1 and set to GR plane.
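The default X-0201 to X-0208 kana conversion described at the start of this section can be approximated in Python with NFKC normalization. This is a stand-in under stated assumptions: NFKC also changes other compatibility characters, and skf uses its own conversion tables.

```python
import unicodedata

halfwidth = "ｶﾞｲｼﾞ"   # X 0201 (halfwidth) kana with separate voicing marks
fullwidth = unicodedata.normalize("NFKC", halfwidth)

print(fullwidth)       # ガイジ
print(len(halfwidth), len(fullwidth))  # 5 3
```

Note that the voiced sound marks are folded into the preceding kana, so the character count changes along with the width.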

JIS X 0212(Supplement Kanji code) Support

--x0212-enable
skf by default does not output JIS X 0212 code. This option enables use of the JIS X 0212 part. The output codeset must be neither a Microsoft code nor KEIS. For Unicode variant encodings, this option is ignored. Note that this option is supported for backward compatibility and may not be supported in future versions.

URI/TeX format conversion feature options

With Unicode(tm) family output codings, skf outputs the non-ascii latin character part as it is. With other output codings, skf converts these characters using the following rules:

(1) If a code is defined in the specified output codeset, the specified code point is used for output.
(2) If one of the html convert modes is enabled (i.e. --convert-html --convert-sgml) and the code is defined in the html/sgml codeset, it is converted to an entity reference or a codepoint reference.
(3) If tex convert mode is enabled and the code is defined as a tex expression, it is converted to tex format.
(4) If the code is a kind of combined ligature, it is shown as a set of characters.
(5) Otherwise, a replacement character is shown, with a warning.
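The fallback order above can be sketched as a chain of attempts. This is a simplified illustration, not skf's implementation: render_char is a hypothetical helper, rule (3) (TeX) is omitted, NFKD stands in for the ligature handling of rule (4), and "?" stands in for the replacement character of rule (5).

```python
import unicodedata
from html.entities import codepoint2name

def render_char(ch, codec, html_mode=False):
    """Apply rules (1), (2), (4) and (5) to a single character."""
    try:
        return ch.encode(codec)                        # rule (1)
    except UnicodeEncodeError:
        pass
    if html_mode:                                      # rule (2)
        name = codepoint2name.get(ord(ch))
        ref = "&%s;" % name if name else "&#%d;" % ord(ch)
        return ref.encode("ascii")
    parts = unicodedata.normalize("NFKD", ch)          # rule (4)
    if parts != ch:
        try:
            return parts.encode(codec)
        except UnicodeEncodeError:
            pass
    return b"?"                                        # rule (5)

print(render_char("é", "ascii", html_mode=True))  # b'&eacute;'
print(render_char("ﬁ", "ascii"))                  # b'fi'
print(render_char("日", "ascii"))                 # b'?'
```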

--convert-html --convert-sgml
Enable html convert mode. This mode is cleared by --reset. These two options are synonyms, and are treated as same option.
--convert-html-decimal
Enable html code-point decimal convert mode. This mode is cleared by --reset.
--convert-html-hexadecimal
Enable html code-point hexadecimal convert mode. This mode is cleared by --reset.
--convert-tex
Enable TeX convert mode. This mode is cleared by --reset.
--use-replace-char
In Unicode, use the unicode replacement character (U+fffc) for undefined characters.

Encoding/Decoding control options

--decode=`encoding scheme'
--encode=`encoding scheme'
Specify a decoding/encoding scheme for the input stream. Supported encoding schemes for decoding are `hex', 'mime', 'mime_q', 'mime_b', 'uri', 'ace', 'hex_perc_encode', CAP hex-code, mime, mime Q-encoding, mime B-encoding, uri character reference, ACE punycode, uri percent notation, base64, Q-encoding, rfc2231 and rot13/47 respectively.
For encoding, 'hex', 'mime_b', 'mime_q', 'uri', 'ace' and 'cap' are available. Encoding the output of a codeset that is already ascii-encoded (e.g. UTF-7) is not supported.
Only one decode/encode option is valid, and if more than one option is specified, the last one is used. When one of the mime decodings is specified, the base text is assumed to be EUC-encoded unless specified otherwise. Except for rot, which assumes the input stream is Shift_JIS, EUC or iso-2022-jp, these encodings assume the input stream is ascii (as defined in RFC2045). Some encodings may co-exist with encoding, but this is not guaranteed. In particular, if the input is UTF-16/UCS2 code, these encodings are ignored by skf.
--mime-ms-compat
Treat Japanese generic codesets as Microsoft cp932 compatible. More specifically, with this option skf treats iso-2022-jp as cp50220, euc-jp as cp51932 and Shift_JIS as cp932w.

End of line control options

--lineend-thru
Output end-of-line code as it is. Also output ^Z code as it is. This is default.
--lineend-cr --lineend-mac
Use CR as end-of-line code. Also delete ^Z code from input stream.
--lineend-lf --lineend-unix
Use LF as end-of-line code. Also delete ^Z code from input stream.
--lineend-crlf --lineend-windows
Use CR+LF as end-of-line code. Also delete ^Z code from input stream. This option doesn't preserve original order of cr and lf.
--input-cr
Assume input stream uses CR as end-of-line code.
--input-lf
Assume input stream uses LF as end-of-line code.
--input-crlf
Assume input stream uses CR+LF as end-of-line code.
-F[line_length[-kinsoku]]
-f[line_length[-kinsoku]] -f[line_length[+kinsoku]]
Wrap input lines at line_length columns. The f option deletes CR/LFs in the input, and the F option does not. Following Japanese convention, both gyoutou-kinsoku (by burasage-gumi) and gyoumatsu-kinsoku (by oidasi-gumi) are supported. The burasage length is controlled by the kinsoku option. The default value for line_length is 66, and it must be < 1000. The default value for kinsoku is 5, and it must be <= 10. With the 'f' option, skf autodetects paragraphs and retains some CR/LFs. The second 'f' option format (with '+') disables this behaviour. In nkf compatible mode, some fold behaviors change as follows.
(1) Default line_length is set to 60, and kinsoku value is 10.
(2) alpha numeric characters become gyoutou-kinsoku characters.
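The --lineend-* options earlier in this section boil down to rewriting every end-of-line sequence to one form and dropping ^Z (0x1A). A minimal sketch, assuming CR+LF, lone CR and lone LF are each treated as one line end; convert_line_ends is a hypothetical helper, not skf code.

```python
import re

EOL = re.compile(rb"\r\n|\r|\n")

def convert_line_ends(data, eol=b"\n"):
    """Rewrite all line ends to `eol` and delete ^Z (0x1A) bytes."""
    data = data.replace(b"\x1a", b"")
    return EOL.sub(lambda match: eol, data)

mixed = b"one\r\ntwo\rthree\n\x1a"
print(convert_line_ends(mixed, b"\n"))     # b'one\ntwo\nthree\n'
print(convert_line_ends(mixed, b"\r\n"))   # b'one\r\ntwo\r\nthree\r\n'
```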

File control options

--filewise-detect --force-reset
Reset and re-detect input code set at the start of each file.
--linewise-detect
Reset and re-detect input code set at the start of each line. This option needs -DKUNIMOTO at compile time.

Compatibility options

--nkf-compat
Interpret the options that follow in an nkf-compatible manner. -l, -d, -c, -x, -m, -w and -W work as in nkf2.0. -f and -F behavior changes as shown above, and --disable-space-convert is also enabled. Note that mime decoding is NOT enabled by this option.
--skf-compat
Interpret the options that follow in the skf-native manner.

Misc. Control options

--disable-space-convert --enable-space-convert
skf converts an ideographic space into two ascii spaces. Disable option disables, and enable option enables this behavior. Default is enabled.
--html-sanitize
Convert several characters in HTML document to entity reference expression. Specifically, "!#$&%()/<>:;?' are escaped by entity-references.
--filewise-detect --force-reset
If multiple input files are given, detect input codeset for each file.
--linewise-detect
Detect input code line-wise. Note this option weakens code detect correctness.
--reset
Reset all flags specified by extended controls and given input code.
--inquiry --guess
skf detects the code and writes the detection result to stdout. No filtering output is performed. If multiple input files are given, --show-filename is automatically enabled.
--hard-inquiry 
Similar to --inquiry, but reports both the code and the end-of-line character.
--suppress-filename
When inquiry(--inquiry) is on, this option disables file name output. This option overrides --show-filename.
--show-filename
When inquiry(--inquiry) is on, this option adds each file name to output.
--invis-strip
Delete all escape sequences not belonging to ISO-2022 code extension. This is intended to replace invisstrip command bundled in inews package.
-I
Warn if input has unassigned code points.
-v
print version information and exit.
-h --help
print brief help and exit.
--show-supported-codeset
Display supported codesets (input) and exit. Both canonical names (left side) and detailed names are shown. This canonical name can be used as MIME charset and also as ic-option code specification.
--show-supported-charset
Display supported character sets (output) and exit. Both canonical names and detailed names are shown. Some charsets with special treatment (i.e. meaningless as set-g* parameters) intentionally lack addressable cnames.
-%[debug_level]
Enable skf debugging. Debug level is one digit. 0 is the least verbose, and with -%9 you'll get whole traces within skf. This option needs configure option --enable-debug.
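Two of the options above, the ideographic space conversion and --html-sanitize, are simple character rewrites. The sketch below illustrates both. The function names are hypothetical, and the use of decimal character references is an assumption, since the text does not say which reference form skf emits.

```python
# Characters listed for --html-sanitize in the text above.
SANITIZE = "\"!#$&%()/<>:;?'"

def convert_ideographic_space(text):
    """Replace each ideographic space (U+3000) with two ASCII spaces."""
    return text.replace("\u3000", "  ")

def sanitize_html(text):
    """Escape the listed characters as decimal character references."""
    return "".join("&#%d;" % ord(c) if c in SANITIZE else c for c in text)

print(repr(convert_ideographic_space("漢字\u3000かな")))
print(sanitize_html("<b>"))   # &#60;b&#62;
```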

FILES

/usr/(local/)share/skf/lib/    (Unices)
/Program Files/skf/share/lib (MS Windows)
These directories are where external codeset conversion tables go.
The location that the current skf assumes is shown by the -h option.


AUTHOR

skf is written by Seiji Kaneko (efialtes@sourceforge.jp), based on an idea from nkf written by Itaru Ichikawa (ichikawa@flab.fujitsu.co.jp). The X 0213 code table is derived from the work of earthian@tama.or.jp. Some codeset mappings are derived from various sources. Their detailed origins are given in the copyright document included in this distribution.

ACKNOWLEDGEMENT

skf is inspired by works or requests by shinoda@cs.titech, kato@cs.titech, uematsu@cs.titech, void@global, ohta@ricoh, Hinata(HKE), Ashizawa(CRL), Kunimoto(SDL), Oohara(Univ of Kyoto), Jokagi(elf2000) and Naruse (at sourceforge.jp). Thanks.

BUGS AND LIMITATIONS

1. skf can handle mixed coding with some limitations. However, code detection tends to fail for mixed code, and giving an explicit input codeset is strongly encouraged if the codeset is known beforehand.
If needed, the --linewise-detect option may help, but code detection will be more likely to fail.

2. When using UCS2, UTF-16, UTF-8 and UTF-7, skf tries to detect the input code, but giving an explicit codeset is encouraged. skf doesn't support UCS4, but does support the UTF-32 area through UTF-16 (i.e. surrogate pairs) and UTF-8. skf just passes composite characters to output. No further normalization is performed.

3. skf implements ISO-2022 with following exceptions.

 i) GL 0x20 is always space, even when a 96-character codeset is invoked to GL.

 ii) Sequences for setting codes to C1 and C2 are always ignored.

 iii) If an unknown sequence is given to G0, G0 is set to ASCII, and locking/single shift is cleared. An unknown sequence that designates a set to G1-G3 is just ignored.
 Private charsets are also not supported and are ignored.

 iv) Sequences for 96-character multibyte coding are ignored (currently, no such codeset is registered).

 v) Calling the UTF-8 and UTF-16 coding systems from ISO-2022 is supported, and the standard return sequence returns to the previous coding system.
 Calls and returns to/from other coding schemes are ignored.

 vi) To support some cellular phone glyphs, several private (not registered) codesets are defined in skf, and can be called by appropriate sequences.

4. Since skf by default tests the input stream to detect UTF-7 coding, skf sometimes misdetects pure ASCII text as UTF-7. If this occurs, use the --no-utf7 option.

5. Error output coding is controlled by the LOCALE environment variables on UN*X systems. skf does not handle the situation where stdout and stderr are redirected into the same stream. Such cases should be handled on the user side.

6. skf-1.9x converts KEIS/JIS X 0213 code using the CJK extension B and CJK compatibility areas. For this reason, X 0213 and KEIS conversion results vary depending on the --use-compat and --limit-to-ucs2 switches.

7. JIS X 0207:1979 is not supported. JIS X 0211:1987 is designed to be supported (i.e. common terminal control sequence will be transparently passed to output).

8. Even if the unbuffer option (-u) is specified, some buffering related to code translation is still performed (in MIME, kana, VIQR etc.).

9. skf-1.9x recognizes and handles languages in iso639-1(alpha 2). iso639-2 is not supported as a valid language set.

10. UCS-2 (UTF-16) is not supported within the perl/ruby extensions for either input or output, because of a data structure limitation. Specifying ucs2 will generate an error. This is a limitation of SWIG and of the languages themselves, rather than a limitation of skf. Use UTF-8 with these lightweight languages (LWL).

11. skf-1.9x does not retain Macintosh RLO-ordered character property. Codesets with this kind of codes are not supported.
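
Item 2 above says that code points beyond UCS2 are handled through UTF-16 surrogate pairs. As a rough illustration of what that means, the following Python sketch applies the standard UTF-16 arithmetic to a CJK extension B code point. It is not skf's own code, and the function name is hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low


# U+20000 is the first code point of CJK Unified Ideographs Extension B.
high, low = to_surrogate_pair(0x20000)

# Cross-check against Python's own UTF-16 encoder.
expected = chr(0x20000).encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(f"U+20000 -> {high:04X} {low:04X}")
```

A UCS2-only consumer sees two separate 16-bit units here, which is one reason conversion results can depend on switches such as --limit-to-ucs2.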
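
The sequences that item 3 refers to can be seen in an ordinary ISO-2022-JP byte stream. The sketch below uses Python's iso2022_jp codec (not skf) to show the escape sequence that designates JIS X 0208 to G0 and the one that designates ASCII to G0 again. The variable names are illustrative only.

```python
ESC = b"\x1b"

text = "日本語"
encoded = text.encode("iso2022_jp")

to_jisx0208 = ESC + b"$B"   # designate JIS X 0208 (a multibyte set) to G0
to_ascii = ESC + b"(B"      # designate ASCII to G0

# The kanji run is bracketed by the two designation sequences.
assert encoded.startswith(to_jisx0208)
assert encoded.endswith(to_ascii)
assert encoded.decode("iso2022_jp") == text
print(encoded)
```

Everything between the two sequences is plain 7-bit data, which is why a filter has to track the current G0 designation to interpret it.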
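
The misdetection described in item 4 follows from the fact that a UTF-7 stream consists only of 7-bit ASCII bytes. A small Python illustration, using a sample string from RFC 2152 and Python's built-in codecs rather than skf's detector:

```python
# Every byte here is 7-bit ASCII (sample string from RFC 2152).
data = b"Hi Mom -+Jjo--!"

as_ascii = data.decode("ascii")   # bytes read literally
as_utf7 = data.decode("utf-7")    # "+Jjo-" is read as an encoded U+263A

print(repr(as_ascii))
print(repr(as_utf7))
```

Both decodings succeed, so byte inspection alone cannot tell which one the author meant. Text that contains '+' followed by letters or digits is the typical case where --no-utf7 helps.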

Notes

1. Extended options have changed extensively since skf-1.9. Some archaic options (e.g. -B, -@ and -r) have been deleted from this version.

2. skf was originally a project forked from nkf, but it does not contain nkf code. The copyright notice is retained as a courtesy.

3. From version 1.9, the default Japanese character set assumed by skf has changed to JIS X 0208:1990 with Microsoft Japanese Windows gaiji (i.e. CP932).

4. Code autodetection is not perfect by design. If skf fails to detect the input code properly, please give the input code information explicitly.

5. Some ligatures in Unicode, cp932 gaiji and KEIS83 are converted using JIS X 0124 and other conventions. The byte length is not preserved during this conversion.

6. skf is intended to pass ANSI compatible terminal control codes transparently, but this is not guaranteed.

7. nkf's -i and -o options work only in nkf-compat mode. They are obsolete options in 1.97, and are valid only for iso-2022-jp, without considering output codeset specifications.

8. For unconverted characters, skf uses geta and the undefined character as with the --use-replace-char option. If the output codeset doesn't contain a geta code, skf prefers the 'black square' character, and otherwise uses '.'.

9. There are some undocumented options. These options should be considered as highly experimental.

10. In lineend_thru mode with folding, skf remembers the order in which cr and lf appear in the stream, and uses that order. Because of this design, if skf needs to output a line-end character before any line-end character has appeared in the input stream, the input order may not be preserved.

11. NKF-compatibility
1) -B*, --prefix, some --fb's and --no-cp932ext/best-fit-chars are not supported.
2) rot encoding is not supported. rot decoding cannot be used together with other decoding.
3) MSDOS (and -T) are not supported.
4) MIME decoding/encoding error handling behavior differs in various ways.
5) LF/CR behaves differently. Results may not be the same for some messy text.
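
The limitation in note 10 can be pictured with a small sketch. The Python below is only an assumed model of "remember the first line end seen", not skf's implementation; the function name and the default value are hypothetical.

```python
CR, LF = 0x0D, 0x0A


def first_line_end(data, default=b"\n"):
    """Return the first line-end sequence in data, keeping the cr/lf order."""
    for i, byte in enumerate(data):
        if byte in (CR, LF):
            following = data[i + 1:i + 2]
            # cr+lf or lf+cr: the next byte is the other control character.
            if following and following[0] in (CR, LF) and following[0] != byte:
                return data[i:i + 2]
            return data[i:i + 1]
    # Nothing seen yet: a folding filter has to guess at this point.
    return default


print(first_line_end(b"line one\r\nline two\r\n"))
print(first_line_end(b"one very long line with no line end yet"))
```

The second call shows the problem case: when a fold is needed before any line end has been read, only a default is available, and it may not match what appears later in the stream.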

Notice

Unicode(TM) is a trademark of Unicode, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. Macintosh is a registered trademark of Apple Computer Inc. Vodafone is a trademark of Vodafone K.K. Other names and terms may be trademarks or registered trademarks of their respective owners. The trademark symbol (TM) may be omitted in this manual page.