A few good blog posts
古腾龙's blog: Encoding rules (UTF-8, GBK)
GBK (千千秀字)
shell set
'man ascii' shows the ASCII table, and 'man utf-8' shows the help page for UTF-8.
Unicode is a design: a character set that aims to include every character in the world's writing systems. It only defines which characters are included and assigns each one a number (its code point). It does not define how these characters are represented as bytes in a computer.
UTF-8 is one implementation (encoding) of Unicode. It was designed in 1992 by Ken Thompson together with Rob Pike (Thompson and Dennis Ritchie created UNIX, and Ritchie designed the C language).
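In Java, Unicode text is stored in 'char' values of 2 bytes each, ranging from 0 to 0xFFFF. A 'char' is a UTF-16 code unit, so a character above U+FFFF takes two of them (a surrogate pair).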
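The 2-byte limit of 'char' can be checked directly. The following is a minimal sketch; the class name CharWidthDemo and the helper unitsFor are made up for illustration, while Character.charCount, Character.toChars and String.codePointCount are standard JDK methods.

```java
class CharWidthDemo {
    // Number of Java chars (UTF-16 code units) needed to hold one code point.
    static int unitsFor(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        String bmp = "\u4e2d";                                   // U+4E2D, fits in one char
        String astral = new String(Character.toChars(0x1F600));  // U+1F600, above 0xFFFF

        System.out.println(bmp.length());                               // 1
        System.out.println(astral.length());                            // 2 (surrogate pair)
        System.out.println(astral.codePointCount(0, astral.length()));  // 1
    }
}
```

So String.length() counts chars, not characters; code that must handle text above U+FFFF should work with code points.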
In UTF-8, by contrast, different characters take different numbers of bytes.
Unicode   | UTF-8                         | explanation    |
0000-007F | 0xxx xxxx                     | 1 byte (ASCII) |
0080-07FF | 110x xxxx 10xx xxxx           | 2 bytes        |
0800-FFFF | 1110 xxxx 10xx xxxx 10xx xxxx | 3 bytes        |
   | Unicode range         | UTF-8 encoding
 n | (hexadecimal)         | (binary)
---+-----------------------+------------------------------------------------------
 1 | 0000 0000 - 0000 007F | 0xxxxxxx
 2 | 0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx
 3 | 0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
 4 | 0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 5 | 0020 0000 - 03FF FFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 6 | 0400 0000 - 7FFF FFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Here n is the number of bytes. The 5- and 6-byte forms come from the original design. RFC 3629 limits UTF-8 to code points up to U+10FFFF, so at most 4 bytes are used today.
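The 1- to 4-byte rows of the table can be turned into code almost mechanically. The following is a minimal sketch, not a production encoder: the class name Utf8Encode and the method encode are hypothetical, and the result is compared with the JDK's own String.getBytes(StandardCharsets.UTF_8) as a sanity check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Encode {
    // Encode one Unicode code point following the 1- to 4-byte rows of the table.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode scalar value: " + cp);
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp <= 0xFFFF) {              // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                                // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        int[] samples = { 0x41, 0xE9, 0x4E2D, 0x1F600 };  // one sample per table row
        for (int cp : samples) {
            byte[] mine = encode(cp);
            byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %d byte(s), matches JDK: %b%n",
                    cp, mine.length, Arrays.equals(mine, jdk));
        }
    }
}
```

For example, U+4E2D falls in the third row and encodes to the three bytes E4 B8 AD.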
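In Java, 'xxxxReader' and 'xxxxWriter' classes handle text (characters), while 'xxxxStream' classes handle binary data (bytes). First, we write a file with the text output class 'PrintWriter'. Then we read the file back with 'FileInputStream'. Our task is to convert the raw bytes into Unicode code points ourselves. If what we read matches what we wrote, we can be confident that we understand the UTF-8 format. Note that 'PrintWriter' writes in the platform default charset unless one is given, so the charset "UTF-8" should be passed explicitly for this experiment.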
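The experiment described above can be sketched as follows. This is one possible version under stated assumptions: the class name Utf8RoundTrip, the method decode and the temporary file are made up for illustration, the input is assumed to be well-formed UTF-8 of at most 4 bytes per character, and InputStream.readAllBytes requires Java 9 or later.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.PrintWriter;

class Utf8RoundTrip {
    // Decode well-formed UTF-8 bytes (1- to 4-byte sequences) into a String.
    // Sketch only: continuation bytes are not validated and overlong forms are not rejected.
    static String decode(byte[] b) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < b.length) {
            int first = b[i] & 0xFF;   // treat the byte as unsigned
            int extra;                 // number of continuation bytes that follow
            int cp;                    // payload bits collected so far
            if (first < 0x80)       { extra = 0; cp = first; }         // 0xxxxxxx
            else if (first >= 0xF0) { extra = 3; cp = first & 0x07; }  // 11110xxx
            else if (first >= 0xE0) { extra = 2; cp = first & 0x0F; }  // 1110xxxx
            else                    { extra = 1; cp = first & 0x1F; }  // 110xxxxx
            for (int k = 1; k <= extra; k++) {
                cp = (cp << 6) | (b[i + k] & 0x3F);                    // 10xxxxxx
            }
            sb.appendCodePoint(cp);
            i += extra + 1;
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String text = "A\u00e9\u4e2d" + new String(Character.toChars(0x1F600));
        File f = File.createTempFile("utf8-demo", ".txt");
        f.deleteOnExit();

        // Text output: PrintWriter encodes chars to bytes, here explicitly as UTF-8.
        try (PrintWriter out = new PrintWriter(f, "UTF-8")) {
            out.print(text);
        }
        // Binary input: FileInputStream returns the raw bytes, with no decoding.
        byte[] raw;
        try (FileInputStream in = new FileInputStream(f)) {
            raw = in.readAllBytes();
        }
        System.out.println(decode(raw).equals(text));
    }
}
```

If the hand-written decoder reproduces the original string, the byte layout in the table has been understood correctly.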
We can also use Java's library to convert a GBK file to Unicode.
import java.io.*;

class uni {
    public static void main(String[] args) throws Exception {
        // Strip the extension ("a.txt" -> "a"); assumes the name contains a dot
        String name = args[0].substring(0, args[0].indexOf("."));
        // Decode the input as GBK and write the output as UTF-8
        InputStreamReader cin = new InputStreamReader(new FileInputStream(args[0]), "GBK");
        PrintWriter cout = new PrintWriter(new File(name + "-unicode.txt"), "UTF-8");
        char[] buf = new char[100];
        int n;
        while ((n = cin.read(buf)) != -1) {
            cout.write(buf, 0, n); // write only the n chars just read
        }
        cin.close();
        cout.close();
    }
}
Reposted from: https://www.cnblogs.com/weiyinfu/p/4959048.html