C & WIN32 WINDOWS SYSTEM (OS) PROGRAMMING: README FIRST

http://www.tenouk.com/cnwin32tutorials.html

The Win32 programming tutorial is specific to the Windows operating system, that is, to the 32-bit Windows operating system family. We now also have 64-bit Windows operating systems, such as Windows XP Professional x64 Edition; however, the fundamentals of system programming have not changed much, and the general principles and concepts are retained. Based on the MSDN documentation but re-arranged into a readable and understandable sequence that avoids a lot of cross references, this tutorial tries to investigate the Windows 2000 (NT5) family of systems through Win32 C programming.


Please tell me: what are the differences between a code page and Unicode, and how are they related?

http://zhidao.baidu.com/question/261569922.html

We often speak of the "internal code" and the "external code" of Chinese characters.

The internal code is the encoding used to store, process, and transmit Chinese characters inside the computer. It must coexist with ASCII without conflicting with it.

So the highest bit of each of the two bytes of the national standard (GB) code is set to 1 to distinguish Chinese text from Western text; that is the internal code. The input code for Chinese characters is called the "external code", i.e. the encoding we use when typing Chinese characters. Common external codes are numeric encodings (such as the region-position code, 区位码), pinyin encodings, and shape-based encodings (such as Wubi, 五笔).

Take the region-position code. The region-position code of "啊" is 1601, which written in hexadecimal is 0x10, 0x01. This collides with the ASCII encoding that computers use everywhere. To stay compatible with ASCII in the range 00-7F, 0xA0 is added to both the high and the low byte of the region-position code, so the encoding of "啊" becomes B0A1. The encoding obtained by adding the two 0xA0 offsets is also commonly called the GB2312 encoding, even though the original GB2312 standard never mentions this.

The internal code refers to the character encoding used inside the operating system. In early operating systems the internal code was language-dependent. Modern Windows uses Unicode uniformly inside and then uses code pages to accommodate the various languages, so the notion of an "internal code" has become rather blurry. We usually call the encoding designated by the default code page the internal code. The term "internal code" has no official definition, and "code page" is merely Microsoft's customary name. As programmers we only need to know what these things are; there is no need to dwell on the terminology.

A code page is a character encoding for one written language. For example, the code page of GBK is CP936, the code page of BIG5 is CP950, and the code page of GB2312 is CP20936.

Windows has the notion of a default code page, that is, the encoding used by default to interpret characters. Suppose Windows Notepad opens a text file whose content is the byte stream BA, BA, D7, D6. How should Windows interpret it? As Unicode, as GBK, as BIG5, or as ISO8859-1? If it is interpreted as GBK, you get the two characters "汉字". Interpreted under other encodings, the bytes may map to no character at all, or to the wrong characters. "Wrong" means different from what the author of the text intended; that is when mojibake (garbled text) appears.

The answer is that Windows interprets the byte stream in a text file according to the current default code page, which can be set through the regional options in the Control Panel. The "ANSI" option in Notepad's Save As dialog simply means saving with the encoding of the default code page.

The internal code of Windows is Unicode, and technically it can support several code pages at the same time. As long as a file can say which encoding it uses, and the user has installed the corresponding code page, Windows can display it correctly; for example, an HTML file can specify a charset.

Some HTML authors, especially English-speaking ones, assume that everyone in the world uses English and do not specify a charset in the file. If such an author uses characters in the range 0x80-0xFF and Chinese Windows interprets them with the default GBK code page, garbled text appears. In that case it is enough to add a charset declaration to the HTML file, for example:

  <meta http-equiv="Content-Type" content="text/html; charset=ISO8859-1">

If the code page the original author actually used is compatible with ISO8859-1, the garbled text disappears.
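To make the code-page discussion concrete, here is a minimal C sketch using the Win32 MultiByteToWideChar API (the byte values and the 16-element buffer are just for illustration). It converts the byte stream BA BA D7 D6 mentioned above into UTF-16 twice: once with the explicit GBK code page 936, and once with CP_ACP, the default code page, whose result depends on the machine's regional settings.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const char gbk[] = "\xBA\xBA\xD7\xD6";   /* "汉字" when read as GBK */
        wchar_t wide[16];

        /* Explicit code page 936 (GBK): the interpretation is unambiguous. */
        int n = MultiByteToWideChar(936, 0, gbk, -1, wide, 16);
        if (n > 0)
            wprintf(L"CP936  -> U+%04X U+%04X\n", (unsigned)wide[0], (unsigned)wide[1]);

        /* CP_ACP means "whatever the default code page happens to be",
           so the result depends on the regional settings of this machine. */
        n = MultiByteToWideChar(CP_ACP, 0, gbk, -1, wide, 16);
        if (n > 0)
            wprintf(L"CP_ACP -> U+%04X U+%04X\n", (unsigned)wide[0], (unsigned)wide[1]);
        return 0;
    }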
Further references
"Short overview of ISO-IEC 10646 and Unicode" ()

All About Python and Unicode

http://boodebr.org/main/python/all-about-python-and-unicode


… and even more about Unicode

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

http://www.joelonsoftware.com/articles/Unicode.html

by Joel Spolsky
Wednesday, October 08, 2003

Ever wonder about that mysterious Content-Type tag? You know, the one you’re supposed to put in HTML and you never quite know what it should be?

Did you ever get an email from your friends in Bulgaria with the subject line “???? ?????? ??? ????”?

I’ve been dismayed to discover just how many software developers aren’t really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they “couldn’t do anything about it." Like many programmers, he just wished it would all blow over somehow.

But it won’t. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

And one more thing:

IT’S NOT THAT HARD.

In this article I’ll fill you in on exactly what every working programmer should know. All that stuff about “plain text = ascii = characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.

Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I’m really just trying to set a minimum bar here so that everyone can understand what’s going on and can write code that has a hope of working with text in any language other than the subset of English that doesn’t include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time so today it’s character sets.

A Historical Perspective

The easiest way to understand this stuff is to go chronologically.

You probably think I’m going to talk about very old character sets like EBCDIC here. Well, I won’t. EBCDIC is not relevant to your life. We don’t have to go that far back in time.

Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter “A” was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called unprintable and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.

And all was good, assuming you were an English speaker.

Because bytes have room for up to eight bits, lots of people got to thinking, “gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters… horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners’. In fact  as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn’t even reliably interchange Russian documents.

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few “multilingual" code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the “double byte character set” in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows’ AnsiNext and AnsiPrev which knew how to deal with the whole mess.
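As a small illustration of the AnsiNext/AnsiPrev point, here is a hedged C sketch (it assumes a Windows build whose ANSI code page is a double-byte code page such as 936; the sample bytes are only an example). It walks a string with CharNextA, the function behind the old AnsiNext macro, so that a lead byte and its trail byte count as one character.

    #include <windows.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* 'A', one double-byte character (BA BA), then 'B' */
        const char *s = "A\xBA\xBA" "B";
        const char *p;
        int chars = 0;

        for (p = s; *p != '\0'; p = CharNextA(p))   /* not p++ */
            ++chars;

        printf("%d logical characters in %d bytes\n", chars, (int)strlen(s));
        return 0;
    }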

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.

Unicode

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we’ve assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

In Unicode, the letter A is a platonic ideal. It’s just floating in heaven:

A

This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from “a” in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy. Is the German letter ß a real letter or just a fancy way of writing ss? If a letter’s shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don’t have to worry about it. They’ve figured it all out already.

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means “Unicode” and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or visiting the Unicode web site.

There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.

OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven’t yet said anything about how to store this in memory or represent it in an email message.

Encodings

That’s where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let’s just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn’t it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
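A minimal sketch of the byte order mark check just described (the function and buffer names are made up for illustration): look at the first two bytes of what is supposed to be UTF-16 text and decide whether every following pair of bytes needs swapping.

    #include <stdio.h>

    enum byte_order { ORDER_BIG_ENDIAN, ORDER_LITTLE_ENDIAN, ORDER_NO_BOM };

    static enum byte_order check_bom(const unsigned char *buf, size_t len)
    {
        if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) return ORDER_BIG_ENDIAN;
        if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) return ORDER_LITTLE_ENDIAN;
        return ORDER_NO_BOM;            /* many strings in the wild have no BOM */
    }

    int main(void)
    {
        /* FF FE, then "H" stored low byte first */
        const unsigned char sample[] = { 0xFF, 0xFE, 0x48, 0x00 };
        printf("%d\n", (int)check_bom(sample, sizeof(sample)));   /* prints 1 */
        return 0;
    }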

For a while it seemed like that might be good enough, but programmers were complaining. “Look at all those zeros!" they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets and who’s going to convert them all? Moi? For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse.

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

How UTF-8 works

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you’ll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).
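Here is a minimal C sketch of the UTF-8 rule just described (the helper name utf8_encode is made up; the article mentions the original design allowing up to 6 bytes, while this sketch covers the modern maximum of 4 bytes, which is enough for every code point up to U+10FFFF):

    #include <stdio.h>

    static int utf8_encode(unsigned long cp, unsigned char out[4])
    {
        if (cp < 0x80) {                      /* ASCII range: one byte, unchanged */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {              /* two bytes */
            out[0] = 0xC0 | (unsigned char)(cp >> 6);
            out[1] = 0x80 | (unsigned char)(cp & 0x3F);
            return 2;
        } else if (cp < 0x10000) {            /* three bytes */
            out[0] = 0xE0 | (unsigned char)(cp >> 12);
            out[1] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
            out[2] = 0x80 | (unsigned char)(cp & 0x3F);
            return 3;
        } else {                              /* four bytes, up to U+10FFFF */
            out[0] = 0xF0 | (unsigned char)(cp >> 18);
            out[1] = 0x80 | (unsigned char)((cp >> 12) & 0x3F);
            out[2] = 0x80 | (unsigned char)((cp >> 6) & 0x3F);
            out[3] = 0x80 | (unsigned char)(cp & 0x3F);
            return 4;
        }
    }

    int main(void)
    {
        unsigned char buf[4];
        int i, n = utf8_encode(0x0639, buf);  /* U+0639, Arabic letter Ain */
        for (i = 0; i < n; i++)
            printf("%02X ", buf[i]);          /* prints D8 B9 */
        printf("\n");
        return 0;
    }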

So far I’ve told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it’s high-endian UCS-2 or low-endian UCS-2. And there’s the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.

There are actually a bunch of other ways of encoding Unicode. There’s something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There’s UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn’t be so bold as to waste that much memory.

And in fact now that you’re thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there’s no equivalent for the Unicode code point you’re trying to represent in the encoding you’re trying to represent it in, you usually get a little question mark: ? or, if you’re really good, a box. Which did you get? -> �

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
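As a concrete example of the question-mark effect, here is a hedged C sketch using the Win32 WideCharToMultiByte API (it assumes a Windows build; the choice of Windows-1252 and of the Hebrew letter Gimel is just for illustration). Because 1252 has no Gimel, the conversion substitutes the default character and reports that it had to.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        const wchar_t gimel[] = L"\x05D2";    /* U+05D2, Hebrew letter Gimel */
        char out[8];
        BOOL used_default = FALSE;

        int n = WideCharToMultiByte(1252, 0, gimel, -1, out, sizeof(out),
                                    "?", &used_default);
        if (n > 0)
            printf("stored as \"%s\", lossy: %s\n",
                   out, used_default ? "yes" : "no");   /* stored as "?", lossy: yes */
        return 0;
    }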

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.

There Ain’t No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

Almost every stupid “my website looks like gibberish" or “she can’t read my emails when I use accents" problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.

How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself — not in the HTML itself, but as one of the response headers that are sent before the HTML page.

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn’t really know what encoding each file was written in, so it couldn’t send the Content-Type header.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it’s going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept" is quite frankly not a good engineering principle. Anyway, what does the poor reader of this website, which was written in Bulgarian but appears to be Korean (and not even cohesive Korean), do? He uses the View | Encoding menu and tries a bunch of different encodings (there are at least a dozen for Eastern European languages) until the picture comes in clearer. If he knew to do that, which most people don’t.

For the latest version of CityDesk, the web site management software published by my company, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t (“wide char”) instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it as so: L"Hello".
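A minimal sketch of that style (assuming a Windows compiler where wchar_t is a 16-bit UTF-16/UCS-2 unit): wide literals with the L prefix, and the wcs functions in place of the str functions.

    #include <wchar.h>
    #include <stdio.h>

    int main(void)
    {
        wchar_t greeting[32] = L"Hello, ";          /* L"..." makes a wide literal */
        wcscat(greeting, L"world");                 /* wcscat instead of strcat */
        wprintf(L"%ls has %u characters\n",
                greeting, (unsigned)wcslen(greeting));  /* wcslen instead of strlen */
        return 0;
    }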

When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That’s the way all 29 language versions of Joel on Software are encoded and I have not yet heard a single person who has had any trouble viewing them.

This article is getting rather long, and I can’t possibly cover everything there is to know about character encodings and Unicode, but I hope that if you’ve read this far, you know enough to go back to programming, using antibiotics instead of leeches and spells, a task to which I will leave you now.



Java compared to C++

http://www.csie.nctu.edu.tw/~tsaiwn/oop/java/02_handouts/p001.txt
History
                                                     tsaiwn@csie.nctu.edu.tw
 1967  BCPL, by Martin Richards
 1970  B, modified from BCPL, by Ken Thompson, at Bell Lab.
*1972  C, modified from B, by Dennis Ritchie, at Bell Lab.
 1978  "The C Programming Language, 1st ed.," by Kernighan and Ritchie (K&R)
*1980  C with classes (the class concept came from SIMULA), by Bjarne Stroustrup, at Bell Lab.
 1983  ANSI X3J11 committee working for C Standard
*1984  C with classes renamed to C++
 1985  AT&T begins releasing C++ 1.0 (Translator) to the public
 1988  K&R second edition published
*1989/December ANSI C via ANSI X3J11
 1990  ANSI X3J16 committee working for C++ Standard

*1991  OAK, based on C++, by James Gosling, at Sun Microsystems
       (the goal was a language for home appliances, the Green project, but it failed)
 1993  OAK comes back from the dead (thanks to the rise of the WWW), later renamed Java
*1995/May Java formally announced by Sun Microsystems, with JDK 1.0
 1996  IDL(Interface Definition Language), RMI (Remote Method Invocation),
       Java Beans, JDBC (Java DataBase Connectivity)
 1997  JDK 1.1, Servlet

*1998  ANSI C++ via ANSI X3J16

*1998/November  Java 2 (JDK 1.2)
*1999/June JDK 1.2.2    *1999/Oct  JDK 1.3 

 1999  ANSI C99   (the latest version of C Standard, no compiler yet)

*2001  JDK 1.3.1   *2002 JDK 1.4
*2004  JDK 1.5 (JDK 5.0)    *2006  JDK 6.0  (JDK1.6)  *2009 JDK 7.0
  ===  JAVA JAVA JAVA JAVA JAVA === === JAVA JAVA JAVA JAVA JAVA JAVA ===

  * Removed Features from C++
     - pointers (too dangerous! no more tricks) (in fact, Java's reference is C++'s pointer)
     - unions, struct (not needed; a class is really a struct anyway)
     - multiple inheritance, operator overloading (easy to get wrong, hard to understand)
     - preprocessing (it makes programs hard to read)
     - templates (but people kept asking for them, so JDK 1.5 added generic types)
     - coercions (respect the author: use an explicit cast) (promotion inside expressions is still allowed)
     - destructors (not really needed, since there is garbage collection)
     - goto statements (burning the boats! kills any lingering wish to use goto :-)

 ** Added Features:
     - data types: eight primitive types: boolean (1 bit), char (16 bit, Unicode),
                   byte (8 bit), short (16 bit), int (32 bit), long (64 bit);
                   float and double are the same as in C/C++ (32 bit, 64 bit, IEEE 754)
            * arrays as objects (note: declaring an array only creates a reference)
            * String class (similar to the string class in the C++ class library)
     - multi-level break (break can carry a label, which is what really makes goto unnecessary)
     - Packages (group related classes together for easier management)
     - Interfaces (make up for the removal of multiple inheritance)
     - automatic Garbage collection (no more worrying about memory leaks)
     - multithread & Synchronization
     - Runtime type checking

 *** How do Java programs work?
     - Programs written in Java fall into Applications, Applets, and Servlets; file names end in .java
     - Java code embedded inside a web page is JSP (similar to PHP and Microsoft's ASP)
     - .java programs are compiled with the javac command into ByteCode, in files ending in .class
     - Bytecode can be executed by a real Java CPU, or by a Java virtual machine on any CPU
     - The Java virtual machine is Java's interpreter, usually written in C
     - javac is the compiler for Java programs;
       java is the Java virtual machine (interpreter)
     - Runtime Environment
        -- Java Interpreter (java) with lots of Java class Library
           PC users: download it from java.sun.com; getting familiar with the Standard Edition first is recommended
           Sun users: run it on Solaris, not on SunOS 4.x
           Look for javac and java under /usr/local/jdk/.. or ask your system administrator
        -- Load, Verify, Execute, JIT (Just In Time compiler)
        -- native code Executable image compiler/linker, like gcj
  * Can a Java program be compiled/linked into a native executable program?
    A: YES. Please use google.com to search for the GCJ project.

Binary representation of fractions

Fractions

Conceptually, binary fractions work exactly like decimal fractions: there must be a radix point, the digits to its left form the integer part, the digits to its right form the fractional part, and every digit position has its own weight, for example as in the figure below:

There are two ways to represent fractions in a computer: fixed-point representation and floating-point representation.

Fixed-point representation (fixed point):

In a computer, data is stored in bits, and the radix point does not actually need to be stored in any bit; we only need to know between which two bits the point lies, or, put differently, we only need to know the weight of every bit, as in the figure above.

  1. Generally, when fixed-point representation is used, the radix point is fixed in one place for the whole system rather than moved around; otherwise extra bits would be needed to record its position. That, of course, is why it is called fixed-point.
  2. In fixed-point representation, addition and subtraction are just standard binary addition and subtraction, while multiplication requires shifting the radix point back into place (think of decimal fraction multiplication).
  3. High-level languages generally do not use fixed-point representation, because the range of numbers it can express is very limited. For example, with the two-byte format above and the point between bit 8 and bit 9, only numbers between 0 and 255 in steps of 1/256 can be expressed, and numbers in such a narrow range are rarely what real applications need. Why use it at all, then? Simply because the arithmetic is fast: fixed-point arithmetic is just standard binary arithmetic, so it is simple and cheap to build in hardware, whereas floating point is more complex, slower to emulate in software, and more expensive in hardware.
  4. An arbitrary real-world number, say decimal 0.1, may suffer an error when expressed in binary fixed point, because decimal 0.1 in binary is the repeating fraction 0.000110011001100110011... With the format above, only eight fractional digits can be kept, i.e. 0.00011001, which differs slightly from the intended number. Converting it back to decimal gives 1/256 + 1/32 + 1/16 = 25/256 = 0.09765625, which is fairly close to 0.1, isn't it? The error is always less than 1/256, and one property holds: the absolute value of the number actually stored is always less than or equal to the absolute value of the number one wanted to store. (A small C sketch follows this list.)
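A minimal C sketch of point 4 (assuming the 8.8 format described above, with 8 integer bits and 8 fraction bits): store 0.1 by truncation and print the value that actually ends up in the bits.

    #include <stdio.h>

    int main(void)
    {
        /* 0.1 * 256 = 25.6, truncated to 25, i.e. the bit pattern 0.00011001 */
        unsigned int fixed = (unsigned int)(0.1 * 256.0);

        printf("stored as %u/256 = %.8f\n", fixed, fixed / 256.0);  /* 25/256 = 0.09765625 */
        return 0;
    }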

Floating-point representation:

The binary floating-point representation (floating point number) is, simply put, similar to the scientific notation we commonly use in decimal: a power of ten and a fraction between 0 and 1 together express an arbitrary real number, for example 12345.6789 is written as 0.123456789 * 10^5.

In binary we can likewise write the binary number 1101110.11011 as 0.110111011011 * 2^7. This looks no different from the fixed-point representation! True, the two values are exactly the same, so why use floating point at all? Note that the weakness of fixed-point representation is that numbers with too large an absolute value get cut off (the correct term is overflow): there are not enough digits to express numbers beyond the range. Numbers with too small an absolute value (for example 0.000000001) also get cut off (the correct term is truncation), so only an approximate value can be expressed.

If that is the case, have you ever thought about the following?

  1. For a very large number (say 123456789012), fixed-point representation guarantees a fixed number of digits after the point (say 123456789012.00000011), but that kind of precision is meaningless for most applications. Does it really matter whether the distance from the Sun to the Earth is one kilometre more or less? Who cares if the speed of light is one metre per second faster?
  2. For a very small number (say 0.0000000012), fixed-point representation may drop it entirely because there are not enough digits after the point, so the number above simply becomes 0. If you say the water contains 0.0000000012 moles of sodium cyanide and it gets treated as 0 for lack of fractional digits, that is not so good, is it?

This is where floating-point representation shows its value. For the large number in the first example, a floating-point number keeps only a fixed number of digits starting from the most significant one; 123456789012.00000011 can be written as .1234567890 * 10^12, which is good enough. The recorded number differs from the true number, but the percentage error is small. For the second example, 0.0000000012 can be written as 0.12 * 10^(-8); there is no need to waste many bits recording the zeros, and recording just 12 and -8 expresses this very small number exactly.

For example, a simple binary floating-point format could use one bit for the sign, seven bits of two's complement for the power of 2, and 24 bits for the fraction, 32 bits in total, as in the figure below:

If we have a binary value in this representation:

    0 0001110 110000000000000000000000

it stands for the decimal number 0.75 * 2^14.
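A minimal C sketch that decodes the toy 32-bit format just described (1 sign bit, a 7-bit two's complement exponent, a 24-bit fraction read as 0.f * 2^e); the function name decode_toy_float is made up for illustration.

    #include <stdio.h>
    #include <math.h>

    static double decode_toy_float(unsigned long word)
    {
        int sign           = (int)((word >> 31) & 0x1);
        int exponent       = (int)((word >> 24) & 0x7F);
        unsigned long frac = word & 0xFFFFFFUL;

        if (exponent & 0x40)                    /* sign-extend the 7-bit exponent */
            exponent -= 0x80;

        double value = ldexp((double)frac / 16777216.0, exponent);  /* 0.f * 2^e */
        return sign ? -value : value;
    }

    int main(void)
    {
        /* 0 0001110 110000000000000000000000 from the example above */
        unsigned long word = 0x0EC00000UL;
        printf("%g\n", decode_toy_float(word));   /* prints 12288, i.e. 0.75 * 2^14 */
        return 0;
    }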

Note

    Just like the fixed-point representation above, the floating-point representation also often incurs some error when expressing an arbitrary fraction, and the absolute value of the number actually represented is less than or equal to the absolute value of the number one wished to express.


Looking for a program that can use more than 2 GB under 32-bit Win7

http://zhidao.baidu.com/question/125628293.html

"The asker's comment on the answer:
None of them understood what was being asked. Your method is good, but without the patch you can only use 1 GB; a 64-bit system can get up to 2.2 GB and then goes no higher."

Vinson also agrees that most people simply did not understand the question. This answer is constructive, though not clear enough. PAE/AWE is really a special channel provided by the operating system: it is true that 32-bit addressing inside a process gives a 4 GB virtual space, but the system can of course provide a lookup-table-like mapping and hand you extra physical memory addresses, as long as the system can locate or reserve usable portions of the remaining physical memory for your 32-bit pointers...
16-bit DOS had the same problem; the well-known XMS works on exactly the same principle...

It is true that the address width usually matches the CPU's word size, but the address lines can exceed 32 bits; 36-bit addressing, for example, can reach 64 GB of physical memory. By changing the paging scheme (it is just a mapping problem, after all) and giving the process extra handles, without touching the original 4 GB pointer mapping, the usable memory can be extended...
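A hedged C sketch of the AWE channel mentioned above (it assumes a Windows build and that the account already holds the "Lock pages in memory" right, SeLockMemoryPrivilege; the 64 MB figure is arbitrary): physical pages are allocated outside the process's normal virtual mapping and then mapped, a window at a time, into a reserved virtual range.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);

        /* Ask for 64 MB worth of physical pages (page frame numbers). */
        ULONG_PTR pages = (64UL * 1024 * 1024) / si.dwPageSize;
        ULONG_PTR *pfns = (ULONG_PTR *)HeapAlloc(GetProcessHeap(), 0,
                                                 pages * sizeof(ULONG_PTR));

        if (!AllocateUserPhysicalPages(GetCurrentProcess(), &pages, pfns)) {
            printf("AllocateUserPhysicalPages failed: %lu\n", GetLastError());
            return 1;                 /* usually means the privilege is missing */
        }

        /* Reserve a virtual window and map the physical pages into it. */
        void *window = VirtualAlloc(NULL, pages * si.dwPageSize,
                                    MEM_RESERVE | MEM_PHYSICAL, PAGE_READWRITE);
        if (window != NULL && MapUserPhysicalPages(window, pages, pfns)) {
            ((char *)window)[0] = 42;                     /* memory is usable now */
            MapUserPhysicalPages(window, pages, NULL);    /* unmap the window */
        }

        FreeUserPhysicalPages(GetCurrentProcess(), &pages, pfns);
        return 0;
    }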


Multiplication and division

For example 6 x 5, in binary 0110 x 0101.

So 0110 is shifted left once for each bit of the multiplier (four partial products in all) and the results are added:

First (multiplier bit is 1), unshifted         = 0110

Second (multiplier bit is 0)                   = 0000

Third (multiplier bit is 1), shifted left two  = 011000

Fourth (multiplier bit is 0)                   = 000000

Adding them gives 011110 = 30.

A division circuit is built from shift-left, compare, and subtract circuits; the number of steps it needs equals the number of bits in the dividend, so it takes longer to run than the add, subtract, and multiply circuits.
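A minimal C sketch of the shift-and-add multiplier described above, together with a simple shift-compare-subtract divider (both on unsigned 16-bit values; the function names are made up for illustration):

    #include <stdio.h>

    static unsigned int mul_shift_add(unsigned int a, unsigned int b)
    {
        unsigned int result = 0;
        int i;
        for (i = 0; i < 16; i++)              /* one step per multiplier bit */
            if (b & (1u << i))
                result += a << i;             /* add the shifted multiplicand */
        return result;
    }

    static unsigned int div_shift_sub(unsigned int dividend, unsigned int divisor)
    {
        unsigned int quotient = 0, remainder = 0;
        int i;
        for (i = 15; i >= 0; i--) {           /* one step per dividend bit */
            remainder = (remainder << 1) | ((dividend >> i) & 1u);
            if (remainder >= divisor) {       /* compare, then subtract */
                remainder -= divisor;
                quotient |= 1u << i;
            }
        }
        return quotient;
    }

    int main(void)
    {
        printf("6 * 5 = %u\n", mul_shift_add(6, 5));    /* 30, as in the example */
        printf("30 / 6 = %u\n", div_shift_sub(30, 6));  /* 5 */
        return 0;
    }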


 


How signed numbers use complements

First, understand the binary representation when an int is declared with the value -5.

-5 = the one's complement of 5 plus 1, which is exactly the two's complement. So when an int is declared as -5 it is already stored in two's complement form, and it can therefore be added to or subtracted from any int value directly.

0000 0000 0000 0101 => one's complement => 1111 1111 1111 1010 => +1 => 1111 1111 1111 1011 (-5)

Suppose we want to compute 7 - 5, that is 7 + (-5); the two's complement does the work for us. Notice that we never perform a subtraction at all.

So

0000 0000 0000 0111

1111 1111 1111 1011

-----------------------------------

1 0000 0000 0000 0010 (the two's complement overflow bit is discarded, so the answer is 2)

Does 7 x (-5) also come out right? Try the multiplication; the result is correct:

1111 1111 1101 1101 (-35). So you can use two's complement to represent positive and negative numbers and carry out the four arithmetic operations.

Suppose we compute (-5) x 2: multiplying by 2 is a shift left by one bit, which gives

1111 1111 1111 0110 = (-10), correct, so shift operations work as well...
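A minimal C sketch of the examples above on 16-bit values: negation by "invert and add one", then 7 + (-5), 7 x (-5), and (-5) shifted left by one.

    #include <stdio.h>

    int main(void)
    {
        unsigned short five  = 0x0005;
        unsigned short neg5  = (unsigned short)(~five + 1);  /* one's complement + 1 */
        unsigned short seven = 0x0007;

        printf("-5      = %04X\n", (unsigned)neg5);                           /* FFFB */
        printf("7 + -5  = %04X\n", (unsigned)(unsigned short)(seven + neg5)); /* 0002, overflow dropped */
        printf("7 * -5  = %04X\n", (unsigned)(unsigned short)(seven * neg5)); /* FFDD = -35 */
        printf("-5 << 1 = %04X\n", (unsigned)(unsigned short)(neg5 << 1));    /* FFF6 = -10 */
        return 0;
    }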


Complements: replacing subtraction with binary addition to keep the logic circuits simple

http://programming.im.ncnu.edu.tw/C_index.html

How to represent integer data

Signed integers

  One method is called "sign and magnitude": the first bit is the sign and the remaining bits are the magnitude. Its drawback is that arithmetic is awkward, because how addition is carried out depends on the signs of the operands. When the two operands have the same sign, you simply add the magnitude parts and keep that sign; but when the signs differ, the magnitudes must be handled with a subtraction, and the sign of the result depends on the outcome of that subtraction. This representation also complicates the design of the computer's logic circuits.

  Subtraction is harder to implement than addition because subtraction keeps borrowing, whereas addition only needs to remember whether the previous pair of digits produced a carry. So a number system was devised in which binary addition replaces subtraction, keeping the logic circuits simple and fast: the complement system.

Complements

Let us start with a decimal system of at most two digits.

60 - 30 = 60 + 70 - 100.

And the -100 can be achieved by "ignoring the overflow". "Overflow" means that the result of a computation exceeds the range the hardware can express; in a two-digit decimal system only 0-99 can be expressed, so any result greater than 99 overflows. "Ignoring the overflow" here simply means dropping, i.e. discarding, the third digit. Thus -30 can be achieved by +70 plus ignoring the overflow. The next question is how to obtain 70 from 30: 70 = 99 - 30 + 1. You may object that a subtraction has crept back in, so is this really faster? Note that subtracting any number from 99 never requires a borrow, so the circuit is easy to design. The idea pays off even more in binary: take the all-ones binary number 111111 and subtract a binary number X to get Y; every corresponding bit of X and Y is simply inverted, e.g. 1111 - 0101 = 1010. In other words, the binary complement can be computed with a bit inverse instead of a subtraction, and a bit inverter is the easiest circuit of all to build. 99 - 30 is called the nine's complement of 30, and 99 - 30 + 1 is called the ten's complement of 30.

Let Y be the largest number the decimal system can express; in nine's complement, the negative of X is represented by (Y - X). For example, in a system of at most three decimal digits, the nine's complement of 300 is represented by (999 - 300) = 699. The nine's complement mapping for a three-digit decimal system is therefore:

representation:  500...999  | 0...499

value:          -499...-000 | 0...499

The table above says that 999 stands for -0, 998 for -1, 997 for -2, ..., and 500 for -499. Nine's complement has two zeros, +0 and -0, and its range is -499 to +499.

In ten's complement the negative of X is defined as (Y - X + 1), so the ten's complement mapping for a three-digit decimal system is:

representation:  500...999  | 0...499

value:          -500...-001 | 0...499

The table above says that 999 stands for -1, 998 for -2, 997 for -3, ..., and 500 for -500. So ten's complement has a range of -500 to +499.

Nine's complement subtraction: turn the -X operation into + (-X); if there is an overflow, add 1 back to the result (end-around carry).

420 - 170

=> 420 + 829

= 1249 => 249 + 1 = 250

(Vinson adds: 420 - 170 = 420 + 829 - 999 = 1249 - 999 = 1000 - 999 + 249 = 1 + 249... now you can see why the overflow turns into a carry of 1.)

100 - 350

=> 100 + 649

= 749 (which represents -250)

-100 - 200

=> 899 + 799

= 1698 => 698 + 1 = 699 (which represents -300)

Ten's complement subtraction does not need this overflow correction.

What makes the binary complement (2's complement) special

Because binary has only the two digits 0 and 1, you will notice that:
1 - 0 = 1
1 - 1 = 0
The largest number Y that a binary word can express is a number whose bits are all 1. In one's complement, -X is represented by (Y - X), so every bit of X and of (Y - X) is the opposite. For example:

Y:   11111111
X:   01110011
Y-X: 10001100

So in the binary one's complement system no subtraction is needed: just invert every bit of X and you get -X.
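A minimal C sketch of that last point (the variable names are just illustrative): for an all-ones Y, Y - X is simply the bitwise inverse of X, so no subtractor is needed.

    #include <stdio.h>

    int main(void)
    {
        unsigned char x = 0x73;                              /* 01110011, the X above */
        unsigned char ones_complement = (unsigned char)~x;   /* bit inverse */

        printf("X   = %02X\n", (unsigned)x);                 /* 73 */
        printf("Y-X = %02X\n", (unsigned)ones_complement);   /* 8C = 10001100 */
        return 0;
    }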