遇君阁

I think every software developer has heard of Unicode, but in my experience some of them misunderstand it. Why do I
say this? Because some people think Unicode is just a character set, while others think it is an encoding method.
Both views are right, but only partly. Wikipedia defines the Unicode standard as follows:

Unicode consists of a repertoire of about 100,000 characters, a set of code charts for visual reference, an encoding
methodology and set of standard character encodings, an enumeration of character properties such as upper and lower
case, a set of reference data computer files, and a number of related items, such as character properties, rules for
text normalization, decomposition, collation, rendering and bidirectional display order (for the correct display of
text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).

After reading this you may say it is very complex, but for everyday work it is enough to know two parts of the
Unicode standard:

1. Unicode is a character set: it assigns each character a unique number, its code point (for example, U+0041 for "A").
2. Unicode defines two mapping methods: the Unicode Transformation Format (UTF) encodings, and the Universal Character
Set (UCS) encodings.

In my experience, some people also confuse UTF with Unicode. A UTF is an encoding method for Unicode, and the three
in common use are UTF-8, UTF-16, and UTF-32. The numbers 8, 16, and 32 give the size in bits of the code unit each
format uses; all three are lossless encodings of Unicode characters. For example, UTF-8 works in octets (8-bit
units) and encodes each Unicode character as a variable number of 1 to 4 octets, where the number of octets depends
on the integer value assigned to the Unicode character. It is an efficient encoding of Unicode documents that use
mostly US-ASCII characters because it represents each character in the range U+0000 through U+007F as a single octet.
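
The 1-to-4 octet scheme can be sketched in a few lines of C++. This is only an illustration, not code from any library: the function name encode_utf8 is made up here, and the sketch assumes the input is a single code point rather than a whole string.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 (1 to 4 octets).
std::string encode_utf8(std::uint32_t cp) {
    // Reject values outside the Unicode range and the surrogate block,
    // which is reserved for UTF-16 and is not a valid character on its own.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("not a Unicode scalar value");
    }
    std::string out;
    if (cp <= 0x7F) {                 // U+0000..U+007F: one octet, same as US-ASCII
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {         // U+0080..U+07FF: two octets
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {        // U+0800..U+FFFF: three octets
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                          // U+10000..U+10FFFF: four octets
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encode_utf8(0x41) yields the single octet 0x41, while encode_utf8(0x20AC) (the euro sign) yields the three octets E2 82 AC.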

Finally, C and C++ support wide characters through wchar_t, but there are some caveats. What is the size of a
wchar_t: 2 or 4 bytes? The answer is "yes"; that is, the standards don't specify an exact length. The Unicode 4.0
standard says that "ANSI/ISO C leaves the semantics of the wide character set to the specific implementation but
requires that the characters from the portable C execution set correspond to their wide character equivalents by
zero extension." Furthermore, the standard specifies: "The width of wchar_t is compiler-specific and can be as small
as 8 bits. Consequently, programs that need to be portable across any C or C++ compiler should not use wchar_t for
storing Unicode text. The wchar_t type is intended for storing compiler-defined wide characters, which may be
Unicode characters in some compilers."

In practice, a UNIX-like operating system will usually use 4 bytes per wchar_t (it's best to verify this with
sizeof). If you use the Microsoft Windows API, you end up with 2 bytes per wchar_t.
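
A minimal way to check this on your own compiler is sketched below. The helper names wchar_bits and wchar_holds_any_code_point are made up for this example; sizeof, CHAR_BIT, and WCHAR_MAX are standard.

```cpp
#include <climits>
#include <cstddef>
#include <cwchar>

// Hypothetical helper: width of wchar_t in bits on the compiler building this code.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: true when a single wchar_t can hold every Unicode
// code point, that is, any value up to U+10FFFF.
constexpr bool wchar_holds_any_code_point() {
    return WCHAR_MAX >= 0x10FFFF;
}
```

Printing wchar_bits() should give 32 on the typical UNIX-like system and 16 with the Windows API, but the only reliable answer is the one your own toolchain reports.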

Note: whenever you handle a string, you must know which encoding it uses.
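
One way to see why: the same character takes a different number of code units in each encoding form, so raw data cannot be interpreted without knowing the form. The sketch below assumes a C++11 or later compiler; the struct and function names are made up for illustration, and the UTF-8 octets are written out by hand.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: code units needed for one character in each form.
struct CodeUnitCounts {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

// Count the code units that U+1F600 occupies in UTF-8, UTF-16, and UTF-32.
CodeUnitCounts count_code_units() {
    const std::string    utf8  = "\xF0\x9F\x98\x80"; // UTF-8 octets, spelled out explicitly
    const std::u16string utf16 = u"\U0001F600";      // stored as a surrogate pair
    const std::u32string utf32 = U"\U0001F600";      // one 32-bit unit
    return CodeUnitCounts{utf8.size(), utf16.size(), utf32.size()};
}
```

Here the single character needs four 8-bit units in UTF-8, two 16-bit units in UTF-16, and one 32-bit unit in UTF-32.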


 

posted on 2008-03-21 11:18 遇君阁  Views (1275)  Comments (3)

Comments

# re: All must know unicode 2008-03-26 16:58 h
UCS-2 == UTF16
UCS-4 == UTF32

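UTF16LE, UTF16BE, ...
LE == Little Endian: the CPU stores the low-order byte of a number first
BE == Big Endian: the high-order byte comes first
CPUs with architectures such as ARM, MIPS, and x86 are LE. Some MIPS CPUs can be set to either LE or BE mode, though; usually it is LE.
The terms come from eating an egg: whether to start from the small end or the big end.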

# re: All must know unicode 2008-03-27 10:10 遇君阁
Thanks, but I think the relation between UCS-2 and UTF16 is:
UTF16 ⊇ UCS-2

# re: All must know unicode 2008-04-07 15:01 h
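UCS-2 is merely an encoding scheme; UTF16 is the concrete application.
UCS-2 and UCS-4 only define the correspondence between code points and characters; they do not define how code points are stored in a computer (note this: it means UCS merely defines a table). To transmit text you need a UTF.
UTF16 corresponds exactly to UCS-2: it stores the code points defined by UCS-2 directly, in BE or LE order.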
