The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

http://www.joelonsoftware.com/articles/Unicode.html

by Joel Spolsky
Wednesday, October 08, 2003

Ever wonder about that mysterious Content-Type tag? You know, the one you’re supposed to put in HTML and you never quite know what it should be?

Did you ever get an email from your friends in Bulgaria with the subject line “???? ?????? ??? ????”?

I’ve been dismayed to discover just how many software developers aren’t really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they “couldn’t do anything about it.” Like many programmers, he just wished it would all blow over somehow.

But it won’t. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

So I have an announcement to make: if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine. I swear I will.

And one more thing:

IT’S NOT THAT HARD.

In this article I’ll fill you in on exactly what every working programmer should know. All that stuff about “plain text = ascii = characters are 8 bits” is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs. Please do not write another line of code until you finish reading this article.

Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I’m really just trying to set a minimum bar here so that everyone can understand what’s going on and can write code that has a hope of working with text in any language other than the subset of English that doesn’t include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time so today it’s character sets.

A Historical Perspective

The easiest way to understand this stuff is to go chronologically.

You probably think I’m going to talk about very old character sets like EBCDIC here. Well, I won’t. EBCDIC is not relevant to your life. We don’t have to go that far back in time.

Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter “A” was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called unprintable and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.

And all was good, assuming you were an English speaker.

Because bytes have room for up to eight bits, lots of people got to thinking, “gosh, we can use the codes 128-255 for our own purposes.” The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters… horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners’. In fact, as soon as people started buying PCs outside of America, all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn’t even reliably interchange Russian documents.

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic and they even had a few “multilingual” code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the “double byte character set” in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows’ AnsiNext and AnsiPrev which knew how to deal with the whole mess.

But still, most people just pretended that a byte was a character and a character was 8 bits and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.

Unicode

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don’t feel bad.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we’ve assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

In Unicode, the letter A is a platonic ideal. It’s just floating in heaven:

A

This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from “a” in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy. Is the German letter ß a real letter or just a fancy way of writing ss? If a letter’s shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don’t have to worry about it. They’ve figured it all out already.

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means “Unicode” and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or visiting the Unicode web site.

There is no real limit on the number of letters that Unicode can define and in fact they have gone beyond 65,536 so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.

OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven’t yet said anything about how to store this in memory or represent it in an email message.
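
Here is a quick sketch, in Python, that prints the code points behind a string (a modern convenience; any Unicode-aware language can do the same):

# Print the Unicode code point of each character in a string.
print(" ".join("U+%04X" % ord(ch) for ch in "Hello"))
# -> U+0048 U+0065 U+006C U+006C U+006F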

Encodings

That’s where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let’s just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn’t it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
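
You can watch the byte order mark appear with a small Python sketch (assuming Python 3.8+ for bytes.hex with a separator; the "utf-16" codec writes a BOM, while the endian-specific codecs do not):

text = "Hello"
print(text.encode("utf-16").hex(" "))     # ff fe 48 00 65 00 ... (BOM first, on a little-endian machine)
print(text.encode("utf-16-be").hex(" "))  # 00 48 00 65 ... (big-endian, no BOM)
print(text.encode("utf-16-le").hex(" "))  # 48 00 65 00 ... (little-endian, no BOM)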

For a while it seemed like that might be good enough, but programmers were complaining. “Look at all those zeros!” they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn’t have minded guzzling twice the number of bytes. But those Californian wimps couldn’t bear the idea of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets and who’s going to convert them all? Moi? For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse.

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

How UTF-8 works

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you’ll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).
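
Here is a minimal sketch of those packing rules in Python, covering only the one- to three-byte cases (the built-in codec handles the full range; this just shows where the bits go):

def to_utf8(cp):
    # Hand-rolled UTF-8 for code points up to U+FFFF.
    if cp < 0x80:                        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    return bytes([0xE0 | cp >> 12,       # 1110xxxx 10xxxxxx 10xxxxxx
                  0x80 | (cp >> 6) & 0x3F,
                  0x80 | cp & 0x3F])

assert to_utf8(0x48) == "H".encode("utf-8")      # 48: one byte, same as ASCII
assert to_utf8(0x0639) == "ع".encode("utf-8")    # D8 B9: two bytes (Arabic Ain)
assert to_utf8(0x20AC) == "€".encode("utf-8")    # E2 82 AC: three bytes (euro sign)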

So far I’ve told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it’s high-endian UCS-2 or low-endian UCS-2. And there’s the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.

There are actually a bunch of other ways of encoding Unicode. There’s something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you, it can still squeeze through unscathed. There’s UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn’t be so bold as to waste that much memory.

And in fact now that you’re thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there’s no equivalent for the Unicode code point you’re trying to represent in the encoding you’re trying to represent it in, you usually get a little question mark: ? or, if you’re really good, a box. Which did you get? -> �

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
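
Python demonstrates the catch nicely; a sketch (errors="replace" is what mimics the little-question-mark behavior):

s = "Hello, שלום"                                  # English plus Hebrew
print(s.encode("windows-1252", errors="replace"))  # b'Hello, ????'  the Hebrew is gone
print(s.encode("iso-8859-8", errors="replace"))    # the Hebrew code page keeps all of it
print(s.encode("utf-8"))                           # UTF-8 can always represent everything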

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.

There Ain’t No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

Almost every stupid “my website looks like gibberish” or “she can’t read my emails when I use accents” problem comes down to one naive programmer who didn’t understand the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.

How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself — not in the HTML itself, but as one of the response headers that are sent before the HTML page.

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages and all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn’t really know what encoding each file was written in, so it couldn’t send the Content-Type header.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it’s going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency with which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle. Anyway, what does the poor reader of this website, which was written in Bulgarian but appears to be Korean (and not even cohesive Korean), do? He uses the View | Encoding menu and tries a bunch of different encodings (there are at least a dozen for Eastern European languages) until the picture comes in clearer. If he knew to do that, which most people don’t.
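
You can play the same guessing game yourself with the third-party chardet package (an assumption on my part that you have it installed; the standard library has no equivalent), which does exactly this kind of byte-frequency analysis:

import chardet  # third-party: pip install chardet

data = "Здравей, свят".encode("windows-1251")  # Bulgarian text in a Cyrillic code page
print(chardet.detect(data))
# e.g. {'encoding': 'windows-1251', 'confidence': 0.84, ...}  a guess, not a guarantee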

For the latest version of CityDesk, the web site management software published by my company, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t (“wide char”) instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it as so: L"Hello".

When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That’s the way all 29 language versions of Joel on Software are encoded and I have not yet heard a single person who has had any trouble viewing them.

This article is getting rather long, and I can’t possibly cover everything there is to know about character encodings and Unicode, but I hope that if you’ve read this far, you know enough to go back to programming, using antibiotics instead of leeches and spells, a task to which I will leave you now.




Python Patterns – An Optimization Anecdote

http://www.python.org/doc/essays/list2str.html


The other day, a friend asked me a seemingly simple question: what’s the best way to convert a list of integers into a string, presuming that the integers are ASCII values. For instance, the list [97, 98, 99] should be converted to the string ‘abc’. Let’s assume we want to write a function to do this.

The first version I came up with was totally straightforward:

    def f1(list):
        string = ""
        for item in list:
            string = string + chr(item)
        return string

That can’t be the fastest way to do it, said my friend. How about this one:

    def f2(list):
        return reduce(lambda string, item: string + chr(item), list, "")

This version performs exactly the same set of string operations as the first one, but gets rid of the for loop overhead in favor of the faster, implied loop of the reduce() function.

Sure, I replied, but it does so at the cost of a function call (the lambda function) per list item. I betcha it’s slower, since function call overhead in Python is bigger than for loop overhead.

(OK, so I had already done the comparisons. f2() took 60% more time than f1(). So there 🙂

Hmm, said my friend. I need this to be faster. OK, I said, how about this version:

    def f3(list):
        string = ""
        for character in map(chr, list):
            string = string + character
        return string

To both our surprise, f3() clocked twice as fast as f1()! The reason that this surprised us was twofold: first, it uses more storage (the result of map(chr, list) is another list of the same length); second, it contains two loops instead of one (the one implied by the map() function, and the for loop).

Of course, space versus time is a well-known trade-off, so the first one shouldn’t have surprised us. However, how come two loops are faster than one? Two reasons.

First, in f1(), the built-in function chr() is looked up on every iteration, while in f3() it is only looked up once (as the argument to map()). This look-up is relatively expensive, I told my friend, since Python’s dynamic scope rules mean that it is first looked up (unsuccessfully) in the current module’s global dictionary, and then in the dictionary of built-in functions (where it is found). Worse, unsuccessful dictionary lookups are (on average) a bit slower than successful ones, because of the way the hash chaining works.

The second reason why f3() is faster than f1() is that the call to chr(item), as executed by the bytecode interpreter, is probably a bit slower than when executed by the map() function – the bytecode interpreter must execute three bytecode instructions for each call (load ‘chr’, load ‘item’, call), while the map() function does it all in C.

This led us to consider a compromise, which wouldn’t waste extra space, but which would speed up the lookup for the chr() function:

    def f4(list):
        string = ""
        lchr = chr
        for item in list:
            string = string + lchr(item)
        return string

As expected, f4() was slower than f3(), but only by 25%; it was still about 40% faster than f1(). This is because local variable lookups are much faster than global or built-in variable lookups: the Python “compiler” optimizes most function bodies so that for local variables, no dictionary lookup is necessary, but a simple array indexing operation is sufficient. The relative speed of f4() compared to f1() and f3() suggests that both reasons why f3() is faster contribute, but that the first reason (fewer lookups) is a bit more important. (To get more precise data on this, we would have to instrument the interpreter.)

Still, our best version, f3(), was only twice as fast as the most straightforward version, f1(). Could we do better?

I was worried that the quadratic behavior of the algorithm was killing us. So far, we had been using a list of 256 integers as test data, since that was what my friend needed the function for. But what if it were applied to a list of two thousand characters? We’d be concatenating longer and longer strings, one character at a time. It is easy to see that, apart from overhead, to create a list of length N in this way, there are 1 + 2 + 3 + … + (N-1) characters to be copied in total, or N*(N-1)/2, or 0.5*N**2 – 0.5*N. In addition to this, there are N string allocation operations, but for sufficiently large N, the term containing N**2 will take over. Indeed, for a list that’s 8 times as long (2048 items), these functions all take much more than 8 times as long; close to 16 times as long, in fact. I didn’t dare try a list of 64 times as long.

There’s a general technique to avoid quadratic behavior in algorithms like this. I coded it as follows for strings of exactly 256 items:

    def f5(list):
        string = ""
        for i in range(0, 256, 16): # 0, 16, 32, 48, 64, ...
            s = ""
            for character in map(chr, list[i:i+16]):
                s = s + character
            string = string + s
        return string

Unfortunately, for a list of 256 items, this version ran a bit slower than f3() (though within 20% of it). Since writing a general version would only slow it down more, we didn’t bother to pursue this path any further (except that we also compared it with a variant that didn’t use map(), which of course was slower again).

Finally, I tried a radically different approach: use only implied loops. Notice that the whole operation can be described as follows: apply chr() to each list item; then concatenate the resulting characters. We were already using an implied loop for the first part: map(). Fortunately, there are some string concatenation functions in the string module that are implemented in C. In particular, string.joinfields(list_of_strings, delimiter) concatenates a list of strings, placing a delimiter of choice between each two strings. Nothing stops us from concatenating a list of characters (which are just strings of length one in Python), using the empty string as delimiter. Lo and behold:

    import string
    def f6(list):
        return string.joinfields(map(chr, list), "")

This function ran four to five times as fast as our fastest contender, f3(). Moreover, it doesn’t have the quadratic behavior of the other versions.

 

And The Winner Is…

The next day, I remembered an odd corner of Python: the array module. This happens to have an operation to create an array of 1-byte wide integers from a list of Python integers, and every array can be written to a file or converted to a string as a binary data structure. Here’s our function implemented using these operations:

    import array
    def f7(list):
        return array.array('B', list).tostring()

This is about three times as fast as f6(), or 12 to 15 times as fast as f3()! It also uses less intermediate storage – it only allocates 2 objects of N bytes (plus fixed overhead), while f6() begins by allocating a list of N items, which usually costs 4N bytes (8N bytes on a 64-bit machine) – assuming the character objects are shared with similar objects elsewhere in the program (like small integers, Python caches strings of length one in most cases).

Stop, said my friend, before you get into negative times – this is fast enough for my program. I agreed, though I had wanted to try one more approach: write the whole function in C. This could have minimal storage requirements (it would allocate a string of length N right away) and save a few instructions in the C code that I knew were there in the array module, because of its genericity (it supports integer widths of 1, 2, and 4 bytes). However, it wouldn’t be able to avoid having to extract the items from the list one at a time, and to extract the C integer from them, both of which are fairly costly operations in the Python-C API, so I expected at most a modest speed-up compared to f7(). Given the effort of writing and testing an extension (compared to whipping up those Python one-liners), as well as the dependency on a non-standard Python extension, I decided not to pursue this option…

 

Conclusion

If you feel the need for speed, go for built-in functions – you can’t beat a loop written in C. Check the library manual for a built-in function that does what you want. If there isn’t one, here are some guidelines for loop optimization:

  • Rule number one: only optimize when there is a proven speed bottleneck. Only optimize the innermost loop. (This rule is independent of Python, but it doesn’t hurt repeating it, since it can save a lot of work. 🙂
  • Small is beautiful. Given Python’s hefty charges for bytecode instructions and variable look-up, it rarely pays off to add extra tests to save a little bit of work.
  • Use intrinsic operations. An implied loop in map() is faster than an explicit for loop; a while loop with an explicit loop counter is even slower.
  • Avoid calling functions written in Python in your inner loop. This includes lambdas. In-lining the inner loop can save a lot of time.
  • Local variables are faster than globals; if you use a global constant in a loop, copy it to a local variable before the loop. And in Python, function names (global or built-in) are also global constants!
  • Try to use map(), filter() or reduce() to replace an explicit for loop, but only if you can use a built-in function: map with a built-in function beats for loop, but a for loop with in-line code beats map with a lambda function!
  • Check your algorithms for quadratic behavior. But notice that a more complex algorithm only pays off for large N – for small N, the complexity doesn’t pay off. In our case, 256 turned out to be small enough that the simpler version was still a tad faster. Your mileage may vary – this is worth investigating.
  • And last but not least: collect data. Python’s excellent profile module can quickly show the bottleneck in your code. If you’re considering different versions of an algorithm, test it in a tight loop using the time.clock() function.

By the way, here’s the timing function that I used. It calls a function f n*10 times with argument a, and prints the function name followed by the time it took, rounded to milliseconds. The 10 repeated calls are done to minimize the loop overhead of the timing function itself. You could go even further and make 100 calls… Also note that the expression range(n) is calculated outside the timing brackets – another trick to minimize the overhead caused by the timing function. If you are worried about this overhead, you can calibrate it by calling the timing function with a do-nothing function.

    import time
    def timing(f, n, a):
        print f.__name__,
        r = range(n)
        t1 = time.clock()
        for i in r:
            f(a); f(a); f(a); f(a); f(a); f(a); f(a); f(a); f(a); f(a)
        t2 = time.clock()
        print round(t2-t1, 3)
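
A hypothetical driver for these experiments, assuming the functions above have all been defined, might look like this:

    testdata = range(256)        # the list of 256 integers used as test data above
    for f in (f1, f2, f3, f4, f5, f6, f7):
        timing(f, 10, testdata)  # prints each name and the time for 10*10 calls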

Epilogue

A few days later, my friend was back with the question: how do you do the reverse operation? I.e. create a list of integer ASCII values from a string. Oh no, here we go again, it flashed through my mind…

But this time, it was relatively painless. There are two candidates, the obvious:

    def g1(string):
        return map(ord, string)

and the somewhat less obvious:

    import array
    def g2(string):
        return array.array('b', string).tolist()

Timing these reveals that g2() is about five times as fast as g1(). There’s a catch though: g2() returns integers in the range -128..127, while g1() returns integers in the range 0..255. If you need the positive integers, g1() is going to be faster than any postprocessing you could do on the result from g2(). (Note: since this essay was written, the ‘B’ typecode was added to the array module, which stores unsigned bytes, so there’s no reason to prefer g1() any more.)



3:2 pulldown

http://www.doom9.org/index.html?/synch.htm

1. 24 FRAMES becomes 48 FIELDS

Frames: A B C D
Fields: Atop Abottom  Btop Bbottom  Ctop Cbottom  Dtop Dbottom

Frame A becomes 2 fields: Atop field + Abottom field. Thus, 4 Frames become 8 Fields, and 24 Frames become 48 Fields. This “field-based” material is then TELECINED into an NTSC Video signal. As TELECINE is a STANDARDIZED conversion, we have to follow the rules of engagement ;). The rule is to do a REPEAT_FIRST_FIELD in a 2:3 sequence.

2. 4 FRAMES (8 FIELDS) becomes 5 FRAMES (10 FIELDS)

Source frames: A B C D
Output fields: Atop Abottom  Atop Bbottom  Btop Cbottom  Ctop Cbottom  Dtop Dbottom

If we look closely, we can see a sequence of At Ab At, followed by Bb Bt, then Cb Ct Cb, then Dt Db. But, since 1 FRAME consists of 2 FIELDS, the sequence becomes AA AB BC CC DD. What we have now is a conversion from 4 SOLID frames into 5 FRAMES consisting of 3 SOLID FRAMES and 2 INTERLACED FRAMES. By INTERLACED I am referring to a FRAME that’s made up from 2 FIELDS of DIFFERENT FRAME sources. The AB frame is the example.
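
A short Python sketch (a hypothetical helper, using "t"/"b" suffixes for top and bottom fields) makes the cadence concrete:

def telecine_23(frames):
    # Repeat the first field of every other frame: 3 fields, 2 fields, 3, 2, ...
    fields, top = [], True             # 'top' tracks which field parity comes next
    for i, f in enumerate(frames):
        a, b = ("t", "b") if top else ("b", "t")
        fields += [f + a, f + b]
        if i % 2 == 0:                 # frames A, C, ... repeat their first field
            fields.append(f + a)
            top = not top              # an odd field count flips the parity
    # pair consecutive fields back up into video frames
    return [fields[j] + fields[j + 1] for j in range(0, len(fields), 2)]

print(telecine_23(["A", "B", "C", "D"]))
# ['AtAb', 'AtBb', 'BtCb', 'CtCb', 'DtDb']  ->  AA AB BC CC DD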

So, 4 FRAMES become 5 FRAMES, thus 24 becomes….. 30, DONE! Done? Nope, not by a longshot. NTSC Video is 29.97fps, so PLAYBACK of 30fps must be slowed down to 29.97fps, which brings us to the term DROP_FRAME.

Don’t get the wrong idea that DROP_FRAME means “FRAMES being REMOVED or DROPPED”. In a 30fps Video sequence, a DROP_FRAME time code counts video frames accurately in relationship to real time. DROP_FRAME time code counts each video frame, but, when that .03 finally adds up to a video frame, it skips (or drops) a number. It does not drop a film or video frame, it merely skips a number and continues counting. This allows it to keep accurate time. So if you’re cutting a scene using drop frame time code, and the duration reads as, say, 30 minutes and 0 frames, then you can be assured the duration is really 30 minutes. Confusing? Well, to put it in simple terms, DROP_FRAME here is in essence EQUAL to a SLOWED_DOWN playback from pure 30fps to the correct NTSC 29.97fps SPEED. In an MPEG-2 domain, this means that the 00 and 01 frames are dropped or SKIPPED from the time code at the start of each minute, except minutes which are even multiples of 10.
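
The arithmetic behind that skipping rule, as a quick sketch:

# Drop-frame timecode skips 2 frame NUMBERS in 54 of the 60 minutes per hour
# (every minute except 00, 10, 20, 30, 40, 50); nothing is ever deleted.
numbers_counted = 60 * 60 * 30 - 54 * 2   # 107892 frame numbers per hour
real_frames = 60 * 60 * 30000 / 1001.0    # ~107892.1 actual NTSC frames per hour
print(numbers_counted, real_frames)       # the two stay within ~0.1 frame per hour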

NOW, it is DONE.


One Interpretation of the Dot Product

#include <math.h>   /* cos(), sin(), M_PI */
/* assumed declarations, not shown in the original snippet: */
double BulletX[3], BulletY[3], BulletVx[3], BulletVy[3];

int speed = 5;
int Theta = 30;
BulletX[0] = BulletX[1] = BulletX[2] = 100;
BulletY[0] = BulletY[1] = BulletY[2] = 500;
BulletVx[0] = speed * cos(M_PI/180*Theta);
BulletVy[0] = speed * sin(M_PI/180*Theta);

[Bullet[0] rotated 30 degrees clockwise]

BulletVx[1] = BulletVx[0]*cos(M_PI/180*30) - BulletVy[0]*sin(M_PI/180*30);
BulletVy[1] = BulletVx[0]*sin(M_PI/180*30) + BulletVy[0]*cos(M_PI/180*30);

[Bullet[0] rotated 30 degrees counterclockwise]

BulletVx[2] = BulletVx[0]*cos(M_PI/180*30) + BulletVy[0]*sin(M_PI/180*30);
BulletVy[2] = -BulletVx[0]*sin(M_PI/180*30) + BulletVy[0]*cos(M_PI/180*30);

[The vector from the original bullet's starting point to the enemy]

int PX, PY;

PX = e.left - BulletX[0];   /* e is the enemy's bounding rect (assumed context) */
PY = e.top - BulletY[0];

int V1P = BulletVx[1]*PX + BulletVy[1]*PY; // a large negative value
int V2P = BulletVx[2]*PX + BulletVy[2]*PY; // a small positive value

Suppose we want to fire one bullet at the enemy, and the only options are Bullet[0]'s original direction rotated 30 degrees to the left or to the right. Which way should the new bullet turn?

V1P being negative means the angle between Bullet[1] and the enemy is between 90 and 180 degrees.

V2P being positive means the angle between Bullet[2] and the enemy is between 0 and 90 degrees.

So the answer is clearly to fire Bullet[2].

But what if both values are positive; which one do we choose then?

That is simple too: pick whichever value is larger. A larger dot product means a larger projection, which means that direction points closer to the enemy's position.
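
The whole decision rule condenses into a short sketch (Python here for brevity; the names are hypothetical, not from the original code):

import math

def rotate(vx, vy, degrees):
    # Rotate a 2-D vector. With screen coordinates (y pointing down),
    # a positive angle turns clockwise on screen.
    r = math.radians(degrees)
    return (vx * math.cos(r) - vy * math.sin(r),
            vx * math.sin(r) + vy * math.cos(r))

def pick_bullet(vx, vy, px, py):
    # Dot each candidate direction with P, the vector to the enemy.
    # The larger dot product has the larger projection along P,
    # i.e. it points closer to the enemy.
    v1, v2 = rotate(vx, vy, 30), rotate(vx, vy, -30)
    d1 = v1[0] * px + v1[1] * py
    d2 = v2[0] * px + v2[1] * py
    return 1 if d1 > d2 else 2   # which bullet to fire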


產業重塑 (Reshaping an Industry) – The species that survives is not the strongest but the smartest: the one best able to adapt to change.

Although managers can never resist the urge to release only good news to the media, shareholders, customers, politicians, and employees, they must be completely candid. As the great oracle Pogo once proclaimed: “Certain misery is better than uncertain misery – because uncertainty is itself a kind of misery.”
No one shows up at the office in the morning thinking, “I am definitely going to make a mess of things today.” But an unenlightened management can put employees into exactly that frame of mind by nine o’clock in the morning.
Management’s attitude must not be like the bus in rural England: the queue of waiting passengers smiles and waves at the driver, and the driver sails right past the stop. The company’s explanation: if drivers had to stop to pick up passengers, they could not keep to the timetable. The logic is airtight, of course, but something seems to have been overlooked.
Because the product was a space launch vehicle that could not tolerate even the smallest lapse, we planned the factory relocation in great detail, down to a painful level. Even so, we ran into the occasional “anomaly”. For example, when some workers from the old plant arrived at the new one, they found that the numbers they had always used to set up their machines were gone, so they could not work. The reason: the workers had pencilled those numbers on a nearby roof-support column, and we had forgotten to move the column!
– 產業重塑

When Multi-threading Helps

Not everything gets faster just because you split it into multiple threads. If all the work is CPU-bound, extra threads do not help at all. But if you analyze the code carefully and can move the GPU- or IO-related parts out to another thread, it gets much faster… The problem is that many people never check which APIs are actually backed by GPU or IO work, and keep insisting that threading is useless….
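
A minimal sketch of the point (Python threads; the two-second sleep stands in for a real network or disk call):

import threading
import time

def fetch(results):
    time.sleep(2)                 # pretend this is a blocking network/disk call
    results.append("payload")

results = []
worker = threading.Thread(target=fetch, args=(results,))

start = time.time()
worker.start()                             # the IO wait happens in the background...
total = sum(i * i for i in range(10**6))   # ...while the CPU keeps working
worker.join()
print("both done in %.1fs" % (time.time() - start))  # ~2s, not 2s plus the CPU time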


Topic: The difference between #define and inline

Reply: The difference between #define and inline
« Reply #3 on: 2009-06-03 15:11 »

I’ve recently been tracing some linux code and ran into code that puzzles me…
Some of it uses
#define function(x) do{...}while(0)
while in other places I see inline functions.
At first I assumed the #define do{...}while(0) style was equivalent to inline,
but then I noticed some programs mix #define and inline.
Apart from the differences at compile time,
may I ask whether there is any difference or impact on performance, and why it is written this way?

There is a special reason for writing it that way; I’ll explain below.

First of all, inline comes from C++ (standard C only added it in C99), so traditional C code could not count on inline syntax and had to use define-style macros, the approach that both C and C++ support.

I think everyone knows the many problems with define. A common example:

#define sq(a) (a*a)

This is meant to square its argument, and in ordinary use it works fine:

int answer = sq(5); // the answer is 25

But use it like this and you’re dead…

int answer = sq(5+1); // the answer is NOT 36.....

because it actually expands to 5+1 * 5+1 ==> 5+5+1 = 11

So it has to be written as:

#define sq(a) ((a)*(a))

Inside an inline function none of these problems can happen…

Coming back to the question you asked….

#define function(x) do{...}while(0)

This pattern has a special purpose too; I recall the linux kernel FAQ covers it. Take a macro like this:


#define swap(x,y) { int tmp_n ; tmp_n = x ; x = y ; y = tmp_n; }
But using the macro this way runs into trouble:

if ( x > y )
swap(x,y);
else
otherthing();

because when the code is expanded you get….

if ( x > y )
{ int tmp_n ; tmp_n = x ; x = y ; y = tmp_n; };
else
otherthing();

Can you see the problem?

if ( x > y ) {
int tmp_n ;
tmp_n = x ;
x = y ;
y = tmp_n ;
}
; <---- an extra semicolon here, and it all falls apart ~~~~
else
otherthing();

This causes a compile error, because the else is now dangling; it no longer pairs with the if…

That is exactly why the function(x) do{...}while(0) form was introduced to solve this:

#define swap(x,y) do { int tmp_n ; tmp_n = x ; x = y ; y = tmp_n; } while(0)

Expanded at the call site, that becomes:

if ( x > y )
do {
int tmp_n ;
tmp_n = x;
x = y;
y = tmp_n;
} while(0);
else
otherthing();

This way nothing blows up….

« Last edited: 2009-06-03 15:21 by kenduest »
I am kenduest – 小州. My website: http://kenduest.sayya.org/


Static function

static PyObject *MyFunctionWithNoArgs( PyObject *self );

What comes to mind when you see a static func? Right: it can only be seen within its own source file.
In other words its name may not matter much, because nobody outside can see it...
(Here that is the case because a separate mapping table exposes a more readable name to the outside.)
Conversely, if a func is not meant to be visible to the outside world, make it static...

the names of your C functions can be whatever you like 
as they will never be seen outside of the extension module. 
So they would be defined as static functions.