A few good blog posts

古腾龙's blog: Encoding rules (UTF-8, GBK)

GBK 千千秀字

shell set

man ascii shows the ASCII code table; man utf-8 shows the UTF-8 manual page.

Unicode is a design that aims to include all the characters on earth. It defines only the character set, that is, which characters are included and which number (code point) each one gets. It does not define how these characters are represented as bytes in a computer.

UTF-8 is one implementation (encoding) of Unicode. It was designed in 1992 by Ken Thompson together with Rob Pike. Thompson also created UNIX with Dennis Ritchie, the designer of the C language.

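In Java, Unicode text is held in 'char' values. A 'char' is 2 bytes (one UTF-16 code unit), ranging from 0 to 0xFFFF; a character above U+FFFF takes two chars (a surrogate pair).

In UTF-8, by contrast, different characters take different numbers of bytes.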
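The difference between the fixed-size Java 'char' and variable-length UTF-8 can be checked with a small sketch. The class name Utf8Lengths and the helper utf8Length are made up for this illustration; only String.getBytes and StandardCharsets.UTF_8 come from the JDK.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the given string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // \u escapes keep this source file ASCII-only, so the result does not
        // depend on the encoding the compiler assumes for the source.
        // Samples: U+0041 (A), U+00E9, U+4E2D, and U+1F600 as a surrogate pair.
        String[] samples = { "A", "\u00E9", "\u4E2D", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.println(s.length() + " char(s), "
                    + utf8Length(s) + " UTF-8 byte(s)");
        }
    }
}
```

The four samples take 1, 2, 3 and 4 bytes in UTF-8, while the first three are a single 'char' each and the last one is two.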

Unicode range (hex)   UTF-8 encoding (binary)
0000-007F             0xxx xxxx
0080-07FF             110x xxxx   10xx xxxx
0800-FFFF             1110 xxxx   10xx xxxx   10xx xxxx
   | Unicode code point range | UTF-8 encoding
 n | (hexadecimal)            | (binary)
---+--------------------------+------------------------------------------------------
 1 | 0000 0000 - 0000 007F    |                                              0xxxxxxx
 2 | 0000 0080 - 0000 07FF    |                                     110xxxxx 10xxxxxx
 3 | 0000 0800 - 0000 FFFF    |                            1110xxxx 10xxxxxx 10xxxxxx
 4 | 0001 0000 - 0010 FFFF    |                   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 5 | 0020 0000 - 03FF FFFF    |          111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 6 | 0400 0000 - 7FFF FFFF    | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Here n is the number of bytes in the sequence. Rows 5 and 6 come from the original UTF-8 design; RFC 3629 limits UTF-8 to U+10FFFF, so only rows 1 to 4 are valid today.
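The one- to four-byte patterns in the table above can be turned into code fairly directly. The following is only a sketch: the class Utf8Encoder and its encode method are names invented here, and the result is compared against the JDK's own encoder as a sanity check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Encoder {
    // Encode one Unicode code point using the 1- to 4-byte bit patterns.
    // Surrogates (U+D800..U+DFFF) are rejected because they are not characters.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not encodable: " + cp);
        }
        if (cp <= 0x7F) {                       // 0xxxxxxx
            return new byte[] { (byte) cp };
        } else if (cp <= 0x7FF) {               // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp <= 0xFFFF) {              // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {                                // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        int cp = 0x4E2D; // a CJK character from the three-byte range
        byte[] mine = encode(cp);
        byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        for (byte b : mine) {
            System.out.printf("%02X ", b & 0xFF); // prints E4 B8 AD
        }
        System.out.println(Arrays.equals(mine, jdk)); // prints true
    }
}
```

For U+4E2D the bits 0100 111000 101101 are spread over the three x-fields of row 3, giving E4 B8 AD.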

 

In Java, by convention, an 'xxxxReader' is text input and an 'xxxxStream' is binary input. First, we use the text output class 'PrintWriter' to write a file in UTF-8. Then we use 'FileInputStream' to read the file back as raw bytes. Our task is to convert those bytes back into Unicode characters by hand. If what we read matches what we wrote, we can be confident that we understand the UTF-8 format.
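The experiment described above might look like the following sketch. The class Utf8RoundTrip, its methods, and the temporary file name are all assumptions made for illustration, and the decoder handles only well-formed 1- to 4-byte sequences with no validation of malformed input.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;

class Utf8RoundTrip {
    // Decode well-formed UTF-8 bytes into a String by reading the bit patterns.
    static String decode(byte[] bytes, int length) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < length) {
            int b0 = bytes[i] & 0xFF;
            int cp;     // code point being assembled
            int extra;  // number of continuation bytes (10xxxxxx) that follow
            if (b0 < 0x80)      { cp = b0;        extra = 0; }
            else if (b0 < 0xE0) { cp = b0 & 0x1F; extra = 1; }
            else if (b0 < 0xF0) { cp = b0 & 0x0F; extra = 2; }
            else                { cp = b0 & 0x07; extra = 3; }
            for (int k = 1; k <= extra; k++) {
                cp = (cp << 6) | (bytes[i + k] & 0x3F);
            }
            sb.appendCodePoint(cp);
            i += extra + 1;
        }
        return sb.toString();
    }

    // Write text with PrintWriter (as UTF-8), read the raw bytes back with
    // FileInputStream, and decode them by hand.
    static String roundTrip(String text) throws IOException {
        File file = File.createTempFile("utf8-demo", ".txt");
        try {
            try (PrintWriter out = new PrintWriter(file, "UTF-8")) {
                out.print(text);
            }
            byte[] raw = new byte[(int) file.length()];
            int total = 0;
            try (FileInputStream in = new FileInputStream(file)) {
                int n;
                while (total < raw.length
                        && (n = in.read(raw, total, raw.length - total)) != -1) {
                    total += n;
                }
            }
            return decode(raw, total);
        } finally {
            file.delete();
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "UTF-8: \u4E2D\u6587 \u00E9";
        System.out.println(text.equals(roundTrip(text))); // prints true
    }
}
```

The charset is passed to PrintWriter explicitly because the platform default is not UTF-8 everywhere.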



 

 

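We can use Java's standard library to convert a GBK-encoded file to Unicode text, written out here as UTF-8.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;

class uni {
    public static void main(String[] args) throws Exception {
        // Strip the extension: "a.txt" becomes "a" (assumes the name contains a dot).
        String name = args[0].substring(0, args[0].lastIndexOf("."));
        // Write the output explicitly as UTF-8 instead of the platform default.
        PrintWriter cout = new PrintWriter(new File(name + "-unicode.txt"), "UTF-8");
        // Decode the input bytes as GBK.
        InputStreamReader cin = new InputStreamReader(new FileInputStream(new File(args[0])), "GBK");
        char[] buf = new char[100];
        int n = cin.read(buf);
        while (n != -1) {
            // Write only the n chars actually read, not the whole buffer.
            cout.write(buf, 0, n);
            n = cin.read(buf);
        }
        cin.close();
        cout.close();
    }
}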

 

Reposted from: https://www.cnblogs.com/weiyinfu/p/4959048.html
