
WikklyText 1.5.0 released

WikklyText 1.5.0 is released. This is somewhat of a cleanup release with a reorganization of the caching architecture, parsing cleanups and XHTML compliance fixes.

Major changes
  • Caching is now done at the wikitext->XML layer, instead of wikitext->HTML. This has several benefits including:
    • Macros are now evaluated every time, instead of just the first time the wikitext is parsed (no more need for the -cache tag; you can safely remove it).
    • Auto-links to unknown CamelWords or links like [[My new word]] now work with caching turned on.
    • Bottom line: You can safely turn caching back on for all cases now.
  • API Change: Keyword parameter plugin_dir changed to plugin_dirs in several places to reflect the lower level semantics. Old keyword still accepted for now.
  • Can specify extra paths to look for plugins on wiki admin page. This is convenient when you want to have several wikis yet keep your plugins in a single location.
  • Fix macro parsing for nested calls like <<aaa<<bbb
  • Cleanups for XHTML conformance — now passes tests @ http://validator.w3.org/ in strict mode
  • Links (both [[..]] and [img[..]]) can no longer have inner markup. Allowing inner markup was occasionally useful but caused too many problems with TiddlyWiki compatibility. Although this is a markup change, it is compatible with TiddlyWiki so hopefully won't cause too many problems. Please double-check your links after updating.
  • Fixed CamelWord parsing so that abcDefGhi is not seen as abc[[DefGhi]]
  • rendercache entries now expire to keep cache from growing indefinitely (parameters set through wiki admin page)
  • Turn off browser caching of command responses
  • [[/my/path]] will now make a file:/// link if /my/path is a valid path
  • Rendering time limit now configurable on admin page. Fixed bug in time limit handling.
  • Speedups in rendering for all store types; text stores benefit the most from these changes.

Downloads and more information at the WikklyText home page
Written in WikklyText.

boodebr library 1.3.0 released

This is a major feature release:
  • Added multithread/multiprocess locking semantics, boodebr.util.locking
  • Updated boodebr.config to use file locking.
  • makeGUID() can be called with no args
  • New module boodebr.util.modules
  • Fixes in boodebr.sql to enhance stability under high load
  • New module boodebr.sql2 (considered beta quality for now; will eventually replace boodebr.sql)
  • New module boodebr.util.threadQ provides serialization of method calls across threads.

More information and downloads at the boodebr library wiki


WikklyText 1.3.0 released

The major new feature in this release is the addition of a GUI "control center" for managing local wikis. Not only does this make WikklyText easier for new users, but it removes the tedium of starting/stopping wikis by hand. Especially useful when you run multiple local wikis. This should make WikklyText easier to integrate into environments such as PortableApps and U3.

This release also adds some new options for more "USB-friendly" usage. These options (on the admin page) allow you to turn off most disk writes, to help preserve USB life.

Changes:
  • New GUI "control center" for managing wikis. It is particularly useful if you are running several local wikis and don't want to keep starting/stopping them by hand. This is now the default command if you simply run wik.exe or wikgui.exe.
  • SiteTitle and SiteSubtitle are now rendered correctly.
  • Fixed handling of embedded Python code (<?py) so that globals are visible.
  • Added CherryPy to auto dependencies.
  • Allow turning on/off metadb usage in wiki admin page (will enable/disable logging.)
  • Allow turning on/off caching in wiki admin page.
  • Added --no-respawn command-line option to prevent wiki from doing an auto-restart on exit. (Primarily for use by wik and wikgui.)
  • Create rss.xml when running "wik render"
  • RSS feed now validates with no warnings at http://feedvalidator.org/

More information and downloads at wikklytext.com

boodebr library 1.2.0 released

This is a minor feature release with a few bugfixes as well.
  • boodebr.gui
    • Redid stock images as SVG.
    • Embed stock images via img2py for easier use with py2exe.
    • Synced fixed_colsorter.py with wxPython 2.8.7.1
  • boodebr.ion: Added support for pickling of complex numbers.
  • boodebr.config: Added file_exits() to fileconfig objects.
  • Moved test suite down one level in tree.

More information and downloads at the boodebr library wiki

WikklyText 0.99.50

A new version of WikklyText is available for download. This is primarily a developers release, with many internal changes. You can view the list of major changes here.

Note this release is not suitable for use with Drupal. If you are using the Drupal plugin, please continue to use 0.99.22 until the next stable release of the Drupal plugin.

Downloads & instructions can be found at the WikklyText Home Page.

WikklyText - Recent Changes

File has moved!


This is now hosted at wikklytext.com

WikklyText 0.99.22 Released

This is a major feature enhancement release for WikklyText, adding these features:

  • New script twextract converts a TiddlyWiki into a set of XML and HTML files. This is useful for ...
    • Serving your TiddlyWiki content as a lightweight set of pages instead of each user having to download the entire wiki.
    • Serving your TiddlyWiki content in a Javascript-restricted environment. The generated files are plain HTML.

You can view a demonstration here: TiddlyWiki Home Page, converted to HTML.

Other enhancements/fixes:
  • Wikitexts can use <<set $LINKS_NEW_WINDOW 0|1>> to determine if links open in a new window.
  • Improved CSS styling & document structure for standalone documents.
  • Bugfixes in tables, allowing the PeriodicTable sample to work again.
  • Lots of internal fixes and reorganization.

Downloads & instructions can be found at the WikklyText Home Page.


Updated: All About Python and Unicode

I've completed another revision to All About Python and Unicode. Please check it out and let me know if you think I've made it worse. :-)

Changes in this version:
  • Rewrote introduction, to walk through the evolution from ASCII to Unicode
  • Simplified section "A Wrinkle in \U". I hope this is clearer now; it seems to have caused some confusion before.
  • Added a section "Python as universal recoder"
  • Incorporated user comments into the Mac OS/X section.

Check it out: All About Python and Unicode

Updated: "All About Python and Unicode"

I've updated my tutorial All About Python and Unicode again. It is now fully integrated into Drupal instead of being a standalone HTML document (in a different style and without the ability to add comments ... how 1995!). With this revision I've tried to clarify the section about splitting up Unicode strings, explaining a little more about why it is needed. Enjoy!

Update (2007-03-06): Oops, forgot the table of contents ... added.

All About Python and Unicode

... and even more about Unicode


A Starting Point


Two weeks before I started writing this document, my knowledge of using Python and Unicode was about like this:

All there is to using Unicode in Python is just passing your strings to unicode()


Now where would I get such a strange idea? Oh, that's right, from the Python tutorial on Unicode, which states:

"Creating Unicode strings in Python is just as simple as creating normal strings":

>>> u'Hello World !'
u'Hello World !'


While this example is technically correct, it can be misleading to the Unicode newbie, since it glosses over several details needed for real-life usage. This overly-simplified explanation gave me a completely wrong understanding of how Unicode works in Python.

If you have been led down the overly-simplistic path as well, then this tutorial will hopefully help you out. This tutorial contains a set of examples, tests, and demos that document my "relearning" of the correct way to work with Unicode in Python. It includes cross-platform issues, as well as issues that arise when dealing with HTML, XML, and filesystems.

By the way, Unicode is fairly simple; I just wish I had learned it correctly the first time.

Where to begin?


At a top level, computers use three types of text representations:
  1. ASCII
  2. Multibyte character sets
  3. Unicode

I think Unicode is easier to understand if you understand how it evolved from ASCII. The following is a brief synopsis of this evolution.

From ASCII to Multibyte


In the beginning, there was ASCII. (OK, there was also EBCDIC, but that never caught on outside of mainframes, so I'm omitting it here.) Strictly speaking, ASCII defines only 128 characters (codes 0-127), although charts such as this ASCII Chart often show a full 256-entry table, with the upper 128 codes filled in by one of the many 8-bit "extended ASCII" sets. Either way, the lower 128 (codes 0-127) are the codes you can rely on. Early email systems would only allow you to transmit characters 0-127 (i.e. "7-bit text"), and this is still true of many systems today. As you can see from the chart, ASCII is sufficient for English-language documents.

Problems arose as computer use grew in countries where ASCII was not sufficient. ASCII lacks the ability to handle Greek, Cyrillic, or Japanese texts, to name a few. Furthermore, Japanese texts alone need thousands of characters, so there is no way to fit them into an 8-bit scheme. To overcome this, Multibyte Character Sets were invented. Most (if not all?) Multibyte Character Sets take advantage of the fact that only the first 128 characters of the ASCII set are commonly used (codes 0-127 in decimal, or 0x00-0x7f in hex). The upper codes (128..255 in decimal, or 0x80-0xff in hex) are used to define the non-English extended sets.

Let's look at an example: Shift-JIS is one encoding for Japanese text. You can see its character table here. Notice that the first byte of each two-byte character is a hex value in the range 0x80 - 0xfc. This is an interesting property, because it means that English and Japanese text can be freely mixed! The string "Hello World!" is a perfectly valid Shift-JIS encoding of English text. When parsing Shift-JIS, if you get a byte in the range 0x80-0xff, you know it is the first byte of a two-byte sequence. Otherwise, it is a single byte of regular ASCII.

This works just fine as long as you are working only in Japanese, but what happens if you switch to a Greek character set? As you can see from the table, ISO-8859-7 has redefined the codes from 0x80-0xff in a completely different way than Shift-JIS defines them. So, although you can mix English and Japanese, you cannot mix Greek and Japanese since they would step on each other. This is a common problem with mixing any multibyte character sets.
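The overlap is easy to see from Python itself. Here is a minimal sketch, written in Python 3 syntax (an assumption on my part; the rest of this tutorial uses Python 2), that relies only on the standard 'shift_jis' and 'iso-8859-7' codecs. Plain English text comes out byte-for-byte identical under Shift-JIS, while the single byte 0xD9 means something different in each character set.

```python
# ASCII-only text is unchanged by the Shift-JIS codec ...
english = u"Hello World!"
assert english.encode("shift_jis") == english.encode("ascii")

# ... but an upper-range byte depends entirely on which table you use.
raw = b"\xd9"
as_greek = raw.decode("iso-8859-7")      # Greek Capital Letter Omega
as_japanese = raw.decode("shift_jis")    # a half-width katakana character

print(repr(as_greek))
print(repr(as_japanese))
print(as_greek == as_japanese)           # False: same byte, different symbols
```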

From Multibyte to Unicode


To overcome the problem of mixing different languages, Unicode proposes to combine all of the world's character sets into a single huge table. Take a look at the Unicode character set.

At first glance, there appear to be separate tables for each language, so you may not see the improvement over ASCII. In reality though these are all in the same table, and are just indexed here for easy (human) reference. The key thing to notice is that since these are all part of the same table, they don't overlap like in the ASCII/multibyte world. This allows Unicode documents to freely mix languages with no coding conflicts.

Unicode terminology


Let's look at the Greek chart and grab a few characters:
Sample Unicode Symbols
Code   Symbol   Name
03A0   Π        Greek Capital Letter Pi
03A3   Σ        Greek Capital Letter Sigma
03A9   Ω        Greek Capital Letter Omega


It is common to refer to these symbols using the notation U+NNNN, for example U+03A0. So we could define a string that contains these characters, using the following notation (I added brackets for clarity):

uni = {U+03A0} + {U+03A3} + {U+03A9} 


Now, even though we know exactly what 'uni' represents (ΠΣΩ) note that there is no way to:
  • Print uni to the screen.
  • Save uni to a file.
  • Add uni to another piece of text.
  • Tell me how many bytes it takes to store uni.

Why? Because uni is an idealized Unicode string - nothing more than a concept at this point. Shortly we'll see how to print it, save it, and manipulate it, but for now, take note of the last statement: There is no way to tell me how many bytes it takes to store uni. In fact, you should forget all about bytes and think of Unicode strings as sets of symbols.

Why should you forget about bytes in the Unicode world? Take the Greek symbol Omega: Ω. There are at least 4 ways to encode this as binary:

Encoding name                          Binary representation
ISO-8859-7 ("native" Greek encoding)   \xD9
UTF-8                                  \xCE\xA9
UTF-16                                 \xFF\xFE\xA9\x03
UTF-32                                 \xFF\xFE\x00\x00\xA9\x03\x00\x00

Each of these is a perfectly valid coding of Ω, but trying to work with bytes like this is no better than dealing with the ASCII/Multibyte world. This is why I say you should think of Unicode as symbols (Ω), not as bytes.
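You can reproduce the byte sequences above with a few lines of Python. This is only a sketch in Python 3 syntax (an assumption; the tutorial's own listings are Python 2), using the .encode() method that is covered in detail later on. Note that the 'utf-16' and 'utf-32' codecs write a byte-order mark in the machine's native byte order, so the exact bytes match the table only on a little-endian machine.

```python
omega = u"\u03a9"   # Greek Capital Letter Omega

# One symbol, four different byte sequences.
for codec in ("iso-8859-7", "utf-8", "utf-16", "utf-32"):
    encoded = omega.encode(codec)
    print(codec, len(encoded), repr(encoded))
```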

Unicode Text in Python


To convert our idealized Unicode string uni (ΠΣΩ) to a useful form, we need to look at a few things:
  1. Representing Unicode literals
  2. Converting Unicode to binary
  3. Converting binary to Unicode
  4. Using string operations

Converting Unicode symbols to Python literals


Creating a Unicode string from symbols is very easy. Recall our Greek symbols from above:

Sample Unicode Symbols
Code   Symbol   Name
03A0   Π        Greek Capital Letter Pi
03A3   Σ        Greek Capital Letter Sigma
03A9   Ω        Greek Capital Letter Omega


Let's say we want to make a Unicode string with those characters, plus some good old-fashioned ASCII characters.

Pseudocode:
uni = 'abc_' + {U+03A0} + {U+03A3} + {U+03A9} + '.txt'

Here is how you make that string in Python:
uni = u"abc_\u03a0\u03a3\u03a9.txt"


A few things to notice:
  • Plain-ASCII characters can be written as themselves. You can just say "a", and not have to use the Unicode symbol "\u0061". (But remember, "a" really is {U+0061}; there is no such thing as a Unicode symbol "a".)
  • The \u escape sequence is used to denote Unicode codes.
    • This is somewhat like the traditional C-style \xNN to insert binary values. However, a glance at the Unicode table shows values with up to 6 digits. These cannot be represented conveniently by \xNN, so \u was invented.
    • For Unicode values up to (and including) 4 digits, use the 4-digit version:
      \uNNNN
      Note that you must include all 4 digits, using leading 0's as needed.
    • For Unicode values longer than 4 digits, use the 8-digit version:
      \UNNNNNNNN
      Note that you must include all 8 digits, using leading 0's as needed.

Here is another example:

Pseudocode:
uni = {U+1A} + {U+BC3} + {U+1451} + {U+1D10C} 

Python:
uni = u'\u001a\u0bc3\u1451\U0001d10c'

Note how I padded each of the values to 4/8 digits as appropriate. Python will give you an error if you don't do this. Also note that you can use either capital or lowercase letters in the codes. The following would give you exactly the same thing:

Python:
uni = u'\u001A\u0BC3\u1451\U0001D10C'
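If you want to convince yourself of these rules, the following quick check may help. It is a sketch in Python 3 syntax (an assumption; there the u prefix is accepted but optional), and it uses nothing beyond the escapes described above plus the built-in ord().

```python
# "a" is just a convenient spelling of U+0061.
assert u"a" == u"\u0061"

# Hex digits in \u and \U escapes are case-insensitive.
lower = u"\u001a\u0bc3\u1451\U0001d10c"
upper = u"\u001A\u0BC3\u1451\U0001D10C"
assert lower == upper

# ord() recovers the code of a single symbol.
print(hex(ord(u"\u0bc3")))    # 0xbc3
```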


Why doesn't "print" work?


Remember how I said earlier that uni has no fixed computer representation? So what happens if we try to print uni?
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni


You would see:
Traceback (most recent call last):
  File "t6.py", line 2, in ?
    print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4:
ordinal not in range(128)


What happened? Well, you told Python to print uni, but since uni has no fixed computer representation, Python first had to convert uni to some printable form. Since you didn't tell Python how to do the conversion, it assumed you wanted ASCII. Unfortunately, ASCII can only handle values from 0 to 127, and uni contains values out of that range, hence you see an error.
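The implicit conversion can be made explicit, which shows exactly where the error comes from. The sketch below is in Python 3 syntax (an assumption; under Python 3, print uses the terminal's encoding rather than ASCII, so the traceback above may not appear there), but asking for ASCII by hand fails the same way.

```python
uni = u"\u001A\u0BC3\u1451\U0001D10C"

try:
    uni.encode("ascii")          # the conversion that print attempted above
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("cannot encode:", exc.reason)   # ordinal not in range(128)
```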

A quick method to print uni is to use Python's repr() function:
uni = u"\u001A\u0BC3\u1451\U0001D10C"
print repr(uni)


This prints:
u'\x1a\u0bc3\u1451\U0001d10c'


This of course makes sense, since that's exactly how we just defined uni. But repr(uni) is just as useless in the real world as uni itself. What we really need to do is learn about codecs.

Codecs

Codecs
In general, Python's codecs allow arbitrary object-to-object transformations. However, in the context of this article, it is enough to think of codecs as functions that transform Unicode objects into binary Python strings, and vice versa.
Why do we need them?
Unicode objects have no fixed computer representation. Before a Unicode object can be printed, stored to disk, or sent across a network, it must be encoded into a fixed computer representation. This is done using a codec. Some popular codecs you may have heard about in your day to day experiences: ascii, iso-8859-7, UTF-8, UTF-16.

From Unicode to binary


To turn a Unicode value into a binary representation, you call its .encode method with the name of the codec. For example, to convert a Unicode value to UTF-8:
binary = uni.encode("utf-8")


How about we make uni more interesting and add some plain text characters:
uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"


Now let's have a look at how different codecs represent uni. Here is a little test program:

test_codec01.py
if __name__ == '__main__':

    # define our Unicode string
    uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"

    # UTF-8 and UTF-16 can fully encode *any* Unicode string
    print "UTF-8", repr(uni.encode('utf-8'))
    print "UTF-16", repr(uni.encode('utf-16'))

    # ASCII can only code values 0-127. Below, we tell Python
    # to replace non-codable characters with '?'
    print "ASCII",uni.encode('ascii','replace')

    # ISO-8859-1 is similar to ASCII
    print "ISO-8859-1",uni.encode('iso-8859-1','replace')


This results in the output:
UTF-8 'Hello\x1a\xe0\xaf\x83\xe1\x91\x91\xf0\x9d\x84\x8cUnicode'
UTF-16 '\xff\xfeH\x00e\x00l\x00l\x00o\x00\x1a\x00\xc3\x0bQ\x144
        \xd8\x0c\xddU\x00n\x00i\x00c\x00o\x00d\x00e\x00'
ASCII Hello????Unicode
ISO-8859-1 Hello????Unicode


Note that I still used repr() to print the UTF-8 and UTF-16 strings. Why? Well, otherwise, it would have printed raw binary values to the screen which would have been hard to capture in this document.
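The 'replace' argument used above is one of several error handlers built into Python's codecs. As a small sketch in Python 3 syntax (an assumption; the test program above is Python 2), here is how the standard handlers treat a character that ASCII cannot represent:

```python
uni = u"abc_\u03a9"

print(uni.encode("ascii", "replace"))            # b'abc_?'
print(uni.encode("ascii", "ignore"))             # b'abc_'
print(uni.encode("ascii", "xmlcharrefreplace"))  # b'abc_&#937;'
print(uni.encode("ascii", "backslashreplace"))   # b'abc_\\u03a9'
```

Only the last two keep enough information to recover the original character later.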

From binary to Unicode


Say someone gives you a UTF-8 encoded version of a Unicode object. How do you convert it back into Unicode? You might naively try this:

The Naive (and Wrong) Way
uni = unicode( utf8_string )


Why is this wrong? Here is a sample program doing exactly that:
uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"
utf8_string = uni.encode('utf-8')

# naively convert back to Unicode
uni = unicode(utf8_string)


Here is what happens:
Traceback (most recent call last):
    File "t6.py", line 5, in ?
    uni = unicode(utf8_string)

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0
    in position 6: ordinal not in range(128)


You see, the function unicode() really takes two parameters:
def unicode(string, encoding):
     ....


In the above example, we omitted the encoding so Python, in faithful style, assumed once again that we wanted ASCII (footnote 1), and gave us the wrong thing.

Here is the correct way to do it:
uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"
utf8_string = uni.encode('utf-8')

# have to decode with the same codec the encoder used!
uni = unicode(utf8_string,'utf-8')
print "Back from UTF-8: ",repr(uni)


Which gives the output:
    Back from UTF-8:  u'Hello\x1a\u0bc3\u1451\U0001d10cUnicode'
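Binary strings also have a .decode() method that mirrors .encode(), which some people find easier to read than the two-argument unicode() call. The following is a sketch in Python 3 syntax (an assumption; unicode() itself does not exist there), showing the round trip and the failure you get with the wrong codec:

```python
uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"
utf8_string = uni.encode("utf-8")

# Decoding with the matching codec restores the original text.
round_trip = utf8_string.decode("utf-8")
assert round_trip == uni

# Decoding with the wrong codec fails just like the naive unicode() call.
try:
    utf8_string.decode("ascii")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
print(wrong_codec_failed)   # True
```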


String Operations


The above examples hopefully give you a good idea of why you want to avoid dealing with Unicode values as binary strings as much as possible! The UTF-8 version was 23 bytes long, the UTF-16 version was 36 bytes, and the ASCII version was only 17 bytes, counting the unprintable \x1a (but it completely discarded 4 Unicode values). The same goes for ISO-8859-1.

This is why, at the very start of this document I suggested that you forget all about bytes!
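To see how much the byte count depends on the codec, you can simply measure it. This is a sketch in Python 3 syntax (an assumption; the listings in this tutorial are Python 2). The 'utf-16' and 'utf-32' figures include the byte-order mark that those codecs prepend.

```python
uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"

# Same symbols, three different storage sizes.
sizes = {}
for codec in ("utf-8", "utf-16", "utf-32"):
    sizes[codec] = len(uni.encode(codec))
    print(codec, sizes[codec])
```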

The good news is that once you have a Unicode object, it behaves exactly like a regular string object, so there is no new syntax to learn (other than the \u and \U escapes). Here is a short sample that shows Unicode objects behaving the way you would expect:

test_stringops01.py
if __name__ == '__main__':

    uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"

    print "uni = ",repr(uni)

    print "len(uni) = ",len(uni)

    # print the "Hello" part
    print "uni[:5] = ",uni[:5]

    # print the Unicode characters one at a time
    print "uni[5] = ",repr(uni[5])
    print "uni[6] = ",repr(uni[6])
    print "uni[7] = ",repr(uni[7])

    # Depending on how Python was compiled, \U characters
    # may be stored as two Unicode characters -- see the
    # section "A wrinkle in \U" below for more details ...
    print "uni[8] = ",repr(uni[8])
    print "uni[9] = ",repr(uni[9])

    # print the "Unicode" text at the end
    print "uni[10:] = ",repr(uni[10:])


Running this sample gives the output:
uni =  u'Hello\x1a\u0bc3\u1451\U0001d10cUnicode'
len(uni) =  17
uni[:5] =  Hello
uni[5] =  u'\x1a'
uni[6] =  u'\u0bc3'
uni[7] =  u'\u1451'
uni[8] =  u'\ud834'
uni[9] =  u'\udd0c'
uni[10:] =  u'Unicode'


A wrinkle in \U


Depending on how your version of Python was compiled, it will store Unicode objects internally in either UTF-16 (2 bytes/character) or UTF-32 (4 bytes/character) format. Unfortunately this low-level detail is exposed through the normal string interface.

For 4-digit (16-bit) characters like \u03a0, there is no difference.
a = u'\u03a0'
print len(a)
This will show a length of 1, regardless of how your Python was built, and a[0] will always be \u03a0. However, for 8-digit (32-bit) characters, like \U0001FF00, you will see a difference. Obviously, 32-bit values cannot be directly represented in a 16-bit code, so a pair of two 16-bit values is used. (Codes 0xD800 - 0xDFFF, called "surrogate pairs", are reserved for these two-character sequences. These values are invalid when used by themselves, per the Unicode specification.)

A sample program that shows what happens:
What happens with \U ...
a = u'\U0001ff00'
print "Length:",len(a)

print "Chars:"
for c in a:
    print repr(c)
If you run this under a "UTF-16" Python, you will see:
Output, 'UTF-16' Python
Length: 2
Chars:
u'\ud83f'
u'\udf00'
Under a 'UTF-32' Python, you will see:
Output, 'UTF-32' Python
Length: 1
Chars:
u'\U0001ff00'
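The two 16-bit values are not arbitrary; they come from simple arithmetic defined by UTF-16. Here is a sketch of that calculation in Python 3 syntax (an assumption). The function name to_surrogate_pair is my own invention for illustration and is not part of Python.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1FF00)
print(hex(high), hex(low))    # 0xd83f 0xdf00
```

These are the same two values that the 'UTF-16' Python printed for \U0001ff00.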
This is an annoying detail to have to worry about. I wrote a module that lets you step character-by-character through a Unicode string, regardless of whether you are running on a 'UTF-16' or 'UTF-32' flavor of Python. It is called xmlmap and is part of Gnosis Utils. Here are two examples, one using xmlmap, one not.
Without xmlmap
a = u'A\U0001ff00C\U0001fafbD'
print "Length:",len(a)

print "Chars:"
for c in a:
    print repr(c)
Results without xmlmap, on a UTF-16 Python
Length: 7
Chars:
u'A'
u'\ud83f'
u'\udf00'
u'C'
u'\ud83e'
u'\udefb'
u'D'
Now, using the usplit() function, to get the characters one-at-a-time, combining split values where needed:
With xmlmap
from gnosis.xml.xmlmap import usplit

a = u'A\U0001ff00C\U0001fafbD'
print "Length:",len(a)

print "Chars:"
for c in usplit(a):
    print repr(c)
Results with xmlmap, on a UTF-16 Python
Length: 7
Chars:
u'A'
u'\U0001ff00'
u'C'
u'\U0001fafb'
u'D'
Now you will get identical results regardless of how your Python was compiled. (Note that the length is still the same, but usplit() has combined the surrogate pairs so you don't see them.)
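To give an idea of what such a function has to do, here is a rough sketch of the recombining step. It is not the code from xmlmap: combine_surrogates is a hypothetical name, and the sketch is written in Python 3 syntax (an assumption), where I spell the surrogates out by hand to imitate what a 'UTF-16' Python would see.

```python
def combine_surrogates(text):
    """Return a list of characters, merging each surrogate pair into one."""
    result = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else u""
        if u"\ud800" <= ch <= u"\udbff" and u"\udc00" <= nxt <= u"\udfff":
            # Reassemble the 20-bit offset from the two 10-bit halves.
            offset = ((ord(ch) - 0xD800) << 10) + (ord(nxt) - 0xDC00)
            result.append(chr(0x10000 + offset))
            i += 2
        else:
            result.append(ch)
            i += 1
    return result

# The same string as above, written as a 'UTF-16' Python stores it.
units = u"A\ud83f\udf00C\ud83e\udefbD"
print([hex(ord(c)) for c in combine_surrogates(units)])
```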

Bugs in Python 2.0 & 2.1


Yes, you may wonder "who cares" when it comes to Python 2.0 and 2.1, but when writing code that's supposed to be completely portable, it does matter!


Python 2.0.x and 2.1.x have a fatal bug when trying to handle single-character codes in the range \uD800-\uDFFF.

The sample code below demonstrates the problem:
 u = unichr(0xd800)
 print "Orig: ",repr(u)

 # create utf-8 from '\ud800'
 ue = u.encode('utf-8')
 print "UTF-8: ",repr(ue)

 # decode back to unicode
 uu = unicode(ue,'utf-8')
 print "Back: ",repr(uu)


Running this under Python 2.2 and up gives the expected result:
 Orig:  u'\ud800'
 UTF-8:  '\xed\xa0\x80'
 Back:  u'\ud800'


Python 2.0.x gives:
 Orig:  u'\uD800'
 UTF-8:  '\240\200'
 Traceback (most recent call last):
   File "test_utf8_bug.py", line 9, in ?
     uu = unicode(ue,'utf-8')
 UnicodeError: UTF-8 decoding error: unexpected code byte


Python 2.1.x gives:
 Orig:  u'\ud800'
 UTF-8:  '\xa0\x80'
 Traceback (most recent call last):
   File "test_utf8_bug.py", line 9, in ?
     uu = unicode(ue,'utf-8')
 UnicodeError: UTF-8 decoding error: unexpected code byte


As you can see, both fail to encode u'\ud800' when used as a single character. While it is true that the characters from 0xD800 .. 0xDFFF are not valid when used by themselves, the fact is that Python will let you use them alone.

But if they're invalid, why should Python bother?


I came up with a good example, completely by accident while working on the code for this tutorial. Create two Python files:

aaa.py
x = u'\ud800'


bbb.py
import sys
sys.path.insert(0,'.')
import aaa


Now, use Python 2.0.x/2.1.x to run bbb.py twice (it needs to run twice so it will load aaa.pyc the second time). On the second run, you'll get:
  Traceback (most recent call last):
    File "bbb.py", line 3, in ?
      import aaa
  UnicodeError: UTF-8 decoding error: unexpected code byte


That's right, Python 2.0.x/2.1.x are unable to reload their own bytecode from a .pyc file if the source contains a string like u'\ud800'. A portable workaround in that case would be to use unichr(0xd800) instead of u'\ud800' (this is what gnosis.xml.pickle does).

Python as a "universal recoder"


Up to this point, I've been translating Unicode to/from UTF for purposes of demonstration. However, Python lets you do much more than that. It allows you to translate nearly any multibyte character string into Unicode (and vice versa). Implementing all of these translations is a lot of work. Fortunately, it has been done for us, so all we have to do is know how to use it.

Let's revisit our Greek table, except this time I'm going to list the characters both in Unicode as well as ISO-8859-7 ("native Greek").

Character   Name                         As Unicode   As ISO-8859-7
Π           Greek Capital Letter Pi      03A0         0xD0
Σ           Greek Capital Letter Sigma   03A3         0xD3
Ω           Greek Capital Letter Omega   03A9         0xD9
With Python, using unicode() and .encode() makes it trivial to translate between these.
# {Pi}{Sigma}{Omega} as ISO-8859-7 encoded string 
b = '\xd0\xd3\xd9'

# Convert to Unicode ('universal format')
u = unicode(b, 'iso-8859-7')
print repr(u)

# ... and back to ISO-8859-7
c = u.encode('iso-8859-7')
print repr(c)


Shows:
u'\u03a0\u03a3\u03a9'
'\xd0\xd3\xd9'


You can also use Python as a "universal recoder". Say you received a file in the Japanese encoding ShiftJIS and wanted to convert to the EUC-JP encoding:
txt = ... the ShiftJIS-encoded text ...

# convert to Unicode ("universal format")
u = unicode(txt, 'shiftjis')

# convert to EUC-JP
out = u.encode('eucjp')


Of course, this only works when translating between compatible character sets. Trying to translate between Japanese and Greek character sets this way would not work.
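Here is a runnable sketch of the recoding idea, in Python 3 syntax (an assumption; the snippets above are Python 2). The helper name recode and the sample text are made up for illustration; 'shift_jis', 'euc_jp' and 'iso-8859-7' are standard Python codec names. The second half shows what "would not work" looks like in practice.

```python
def recode(data, source_codec, target_codec):
    """Re-encode bytes by passing through Unicode (hypothetical helper)."""
    return data.decode(source_codec).encode(target_codec)

# Made-up sample input: three Japanese characters, encoded as Shift-JIS.
text = u"\u65e5\u672c\u8a9e"
sjis_bytes = text.encode("shift_jis")

eucjp_bytes = recode(sjis_bytes, "shift_jis", "euc_jp")
assert eucjp_bytes.decode("euc_jp") == text

# ISO-8859-7 has no slot for these characters, so the conversion fails.
try:
    recode(sjis_bytes, "shift_jis", "iso-8859-7")
    greek_failed = False
except UnicodeEncodeError:
    greek_failed = True
print(greek_failed)   # True
```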

Now the Fun Begins ... Unicode and the Real World


Now you know about everything you need to know to work with Unicode objects within Python. Isn't that nice? However, the rest of the world isn't quite as nice and neat as Python, so you need to understand how the non-Python portion of the world handles Unicode. It isn't terribly hard, but there are a lot of special cases to consider.

From here on out, we'll be looking at Unicode issues that arise when dealing with:
  1. Filenames (Operating System specific issues)
  2. XML
  3. HTML
  4. Network files (Samba)

Unicode Filenames


Sounds simple enough, right? If I want to name a file with my Greek letters, I just say:
   open(unicode_name, 'w')


In theory, yes, that's supposed to be all there is to it. However, there are many ways for this to not work, and they depend on the platform your program is running on.

Microsoft Windows


There are at least two ways of running Python under Windows. The first is to use the Win32 binaries from www.python.org. I will refer to this method as "Windows-native Python".

The other method is by using the version of Python that comes with Cygwin. This version of Python looks (to user code) more like POSIX, instead of like a Windows-native environment.

For many things, the two versions are interchangeable. As long as you write portable Python code, you shouldn't have to care which interpreter you are running under. However, one important exception is when handling Unicode. That is why I'll be specific here about which version I am running.

Using Windows-native Python


Let's keep using our familiar Greek symbols:

Sample Unicode Symbols
Code   Symbol   Name
03A0   Π        Greek Capital Letter Pi
03A3   Σ        Greek Capital Letter Sigma
03A9   Ω        Greek Capital Letter Omega


Our sample Unicode filename will be:
# this is: abc_{PI}{Sigma}{Omega}.txt
uname = u"abc_\u03A0\u03A3\u03A9.txt"


Let's create a file with that name, containing a single line of text:
open(uname,'w').write('Hello world!\n')


Opening up an Explorer window shows the results:

[Screenshot: win32_01.jpg]


There the filename is in all its unicode glory.

Now, let's see how os.listdir() works with this name. The first thing to know is that os.listdir() has two modes of operation:
  • Non-unicode, achieved by passing a non-Unicode string to os.listdir(), i.e. os.listdir('.')
  • Unicode, achieved by passing a Unicode string to os.listdir(), i.e. os.listdir(u'.')
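The same "you get back what you pass in" design survives in Python 3, except that the two types there are str (text) and bytes rather than unicode and str. The sketch below assumes Python 3 and uses a throwaway directory with an ASCII-only filename (sample.txt is a made-up name) so that it behaves the same on any platform:

```python
import os
import tempfile

# Work in a throwaway directory containing one file.
directory = tempfile.mkdtemp()
with open(os.path.join(directory, "sample.txt"), "w") as handle:
    handle.write("Hello world!\n")

text_names = os.listdir(directory)                 # text in, text out
byte_names = os.listdir(os.fsencode(directory))    # bytes in, bytes out

print(repr(text_names[0]))
print(repr(byte_names[0]))
```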

First, let's try as Unicode:
os.chdir('ttt')
# there is only one file in directory 'ttt'
name = os.listdir(u'.')[0]
print "Got name: ",repr(name)
print "Line: ",open(name,'r').read()


Running this program gives the following output:
Got name:  u'abc_\u03a0\u03a3\u03a9.txt'
Line:  Hello world!


Comparing with above, that looks correct. Note that print repr(name) was required, since an error would have occurred if I had tried to print name directly to the screen. Why? Yep, once again Python would have assumed you wanted an ASCII coding, and would have failed with an error.

Now let's try the above sample again, but using the non-Unicode version of os.listdir():
os.chdir('ttt')
# there is only one file in directory 'ttt'
name = os.listdir('.')[0]
print "Got name: ",repr(name)
print "Line: ",open(name,'r').read()


Gives this output:
Got name:  'abc_?SO.txt'
Line: 
Traceback (most recent call last):
  File "c:\frank\src\unicode\t2.py", line 8, in ?
    print "Line: ",open(name,'r').read()
IOError: [Errno 2] No such file or directory: 'abc_?SO.txt'


Yikes! What happened? Welcome to the wonderful world of the win32 "dual-API".

A little background:
Windows NT/2000/XP always write filenames to the underlying filesystem as Unicode (footnote 2). So in theory, Unicode filenames should work flawlessly with Python.

Unfortunately, win32 actually provides two sets of APIs for interfacing with the filesystem. And in true Microsoft style, they are incompatible. The two APIs are:
  1. A set of APIs for Unicode-aware applications, that return the true Unicode names.
  2. A set of APIs for non-Unicode aware applications that return a locale-dependent coding of the true Unicode filenames.

Python (for better or worse) follows this convention on win32 platforms, so you end up with two incompatible ways of calling os.listdir() and open():
  1. When you call os.listdir(), open(), etc. with a Unicode string, Python calls the Unicode version of the APIs, and you get the true Unicode filenames. (This corresponds to the first set of APIs above).
  2. When you call os.listdir(), open(), etc. with a non-Unicode string, Python calls the non-Unicode version of the APIs, and here is where the trouble creeps in. The non-Unicode APIs handle Unicode with a particular codec called MBCS. MBCS is a lossy codec: Every MBCS name can be represented as Unicode, but not vice versa. MBCS coding also changes depending on the current locale. In other words, if I write a CD with a multibyte-character filename as MBCS on my English locale machine, then send the CD to Japan, the filename there may appear to contain completely different characters.


Now that we know the background facts, we can see what happened above. By using os.listdir('.'), you are getting the MBCS-version of the true Unicode name that is stored on the filesystem. And, on my English-locale computer, there is no accurate mapping for the Greek characters, so you end up with "?", "S", and "O". This leads to the weird result that there is no way to open our Greek-lettered file using the MBCS APIs in an English locale (!!).
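The lossy round trip can be reproduced on any platform with an ordinary single-byte codec. The sketch below is an illustration only, written for Python 3. It uses cp1252 as a stand-in for an English-locale MBCS code page (an assumption; the real mbcs codec exists only on Windows) and plain 'replace' error handling, which turns every unmappable character into "?" rather than producing the "S" and "O" substitutions seen above.

```python
# Illustration only: cp1252 stands in for an English-locale "MBCS" code page.
uname = u'abc_\u03a0\u03a3\u03a9.txt'

# Encoding to a code page with no Greek capital letters loses information ...
lossy = uname.encode('cp1252', 'replace')

# ... so decoding the bytes again does not give back the original name.
roundtrip = lossy.decode('cp1252')

print(repr(lossy))
print(roundtrip == uname)
```

Once the Greek letters have been replaced, nothing in the bytestring says which characters were there originally, which is why the lossy name cannot be used to find the file again.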

Bottom line
I recommend always using Unicode strings in os.listdir(), open(), etc. Remember that Windows NT/2000/XP always stores filenames as Unicode, and so this is the native behavior. And, as shown above, it can sometimes be the only way to open a Unicode filename.
Danger! Cygwin
Cygwin has a huge problem here. It (currently, at least) has no support for Unicode. That is, it will never call the Unicode versions of the win32 APIs. Hence, it is impossible to open certain files (like our Greek-lettered filename) from Cygwin. It doesn't matter if you use os.listdir(u'.') or os.listdir('.'); you always get the MBCS-coded versions.

Please note that this isn't a Python-specific problem; it is a systemic problem with Cygwin. All Cygwin utilities, such as zsh, ls, zip, unzip, mkisofs, will be unable to recognize our Greek-lettered name, and will report various errors.


Unix/POSIX/Linux


Unlike Windows NT/2000/XP, which always store filenames in Unicode format, POSIX systems (including Linux) always store filenames as binary strings. This is somewhat more flexible, since the operating system itself doesn't have to know (or care) what encoding is used for filenames. The downside is that the user is responsible for setting up their environment ("locale") for the proper coding.

Setting a locale


The specifics of setting up your POSIX box to handle Unicode filenames are beyond the scope of this document, but it generally comes down to setting a few environment variables. In my case, I wanted to use the UTF-8 codec in a U.S. English locale, so my setup involved adding a few lines to these startup files (I've tried this under Gentoo Linux and Ubuntu, though all Linux systems should be similar):

Additions to .bashrc:
LANG="en_US.utf8"
LANGUAGE="en_US.utf8"
LC_ALL="en_US.utf8"

export LANG
export LANGUAGE
export LC_ALL


For good measure, I added the same lines to my .zshrc file.

Additionally, I added the first three lines to /etc/env.d/02locale.
CAUTION
Please do not blindly make changes like the above to your system if you aren't sure what you're doing. You could make your files unreadable by switching locales. The above is meant only as an example of a simple case of switching from an ASCII locale to a UTF-8 locale.
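A quick way to check what Python picked up after a change like this is to ask it directly. This is a minimal sketch using only the standard sys and locale modules. On a UTF-8 locale both values are typically some spelling of UTF-8, but the exact strings vary by platform and Python version, so treat the output as informational.

```python
import locale
import sys

# Encoding Python uses to convert between Unicode and bytestring filenames
fs_encoding = sys.getfilesystemencoding()

# Encoding the current locale prefers for text data
preferred = locale.getpreferredencoding()

print("filesystem encoding: %s" % fs_encoding)
print("preferred encoding:  %s" % preferred)
```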


Python under POSIX


A big advantage under POSIX, as far as Python is concerned, is that you can use either:
      os.listdir('.')


Or:
      os.listdir(u'.')


Both methods will give you strings that you can pass to open() to open the files. This is much better than Windows, where os.listdir('.') returns mangled versions of the Unicode names and, as seen above, can fail to give you a valid name to open the file. You will always get a valid name under POSIX/Linux.

Here is a sample function to demonstrate that:

test_posix01.test()
def test():
    # Demonstrate that listdir(u'.') and listdir('.') both
    # work fine under POSIX (unlike win32)

    import os

    uname = u'abc_\u03a0\u03a3\u03a9.txt'

    # make a tempdir so I'll only have a single file in it
    os.mkdir('ttt')
    os.chdir('ttt')

    open(uname,'w').write("Hello unicode!\n")

    # use listdir() to get name as Unicode
    name = os.listdir(u'.')[0]
    print "As unicode: ",repr(name)
    print "   Read line: ",open(name,'r').read()

    # now get name as a bytestring
    name = os.listdir('.')[0]
    print "As bytestring: ",repr(name)
    print "   Read line: ",open(name,'r').read()


If you run this you'll get:
As unicode:  u'abc_\u03a0\u03a3\u03a9.txt'
   Read line:  Hello unicode!

As bytestring:  'abc_\xce\xa0\xce\xa3\xce\xa9.txt'
   Read line:  Hello unicode!


As you can see, we were able to successfully read the file, no matter if we used the Unicode or bytestring version of the filename.
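For readers on Python 3, where the bytestring/Unicode split is spelled bytes/str, a rough equivalent of the test above might look like the following. It is a sketch under assumptions, not part of the original test: the function name test_py3 is made up, it works in a temporary directory instead of 'ttt', and it names the file with UTF-8 bytes so that creating it does not depend on the locale setup.

```python
import os
import shutil
import tempfile

def test_py3():
    # UTF-8 bytes for u'abc_\u03a0\u03a3\u03a9.txt'
    bname = u'abc_\u03a0\u03a3\u03a9.txt'.encode('utf-8')

    # work in a fresh temporary directory so it holds a single file
    tmpdir = tempfile.mkdtemp()
    try:
        with open(os.path.join(os.fsencode(tmpdir), bname), 'w') as f:
            f.write("Hello unicode!\n")

        results = []
        # list the directory once with a str path, once with a bytes path
        for path in (tmpdir, os.fsencode(tmpdir)):
            name = os.listdir(path)[0]
            with open(os.path.join(path, name), 'r') as f:
                results.append((name, f.read()))
        return results
    finally:
        # remove the temporary directory again
        shutil.rmtree(tmpdir)

for name, line in test_py3():
    # ascii() keeps the output printable on any terminal encoding
    print("%-6s %s" % (type(name).__name__, ascii(name)))
    print("   Read line: %s" % line)
```

Either kind of name returned by os.listdir() can be handed straight back to open(), which is the same property the Python 2 test demonstrates.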

Application Demos


Unlike the Microsoft Windows world where you basically have a "DOS box" and Windows Explorer, under Linux you have many choices about what terminal and file manager you want to run. This is both a blessing and a curse: A blessing in that you can pick an application that suits your preferences, but also a curse in that not all applications support Unicode to the same extent.

The following is a survey of several popular applications to see what they support.

Applications that support Unicode filenames


My personal current favorite is mlterm, a multi-lingual terminal:

mlterm_01.jpg


The GNOME terminal (gnome-terminal):

gnome_terminal_01.jpg


The KDE terminal (konsole):

konsole_01.jpg


A modified version of rxvt (rxvt-unicode) handles Unicode, although it has some issues with underscore characters in the font I've chosen ...

urxvt_01.jpg


Here is our Greek-lettered file in a KDE file manager window (konqueror):

konq_01.jpg


And here it is in the GNOME file manager (Nautilus):

naut_01.jpg


The XFCE 4 file manager:

xfce_01.jpg


The standard KDE file selector supports Unicode filenames:

kfilesel_01.jpg


As does the GNOME file selector:

gfilesel_01.jpg


Applications that do not support Unicode filenames


The standard rxvt does not handle Unicode correctly:

rxvt_01.jpg


The Xfm file manager does not handle Unicode filenames:

xfm_01.jpg


Mac OS X


I don't have an OSX machine to test this on, but helpful readers have contributed some information on Unicode support in OSX.

One reader pointed out that os.listdir('.') and os.listdir(u'.') both return objects that can be passed directly to open(), as you can do under POSIX.

Reader Hraban noted:
You should mention that MacOS X uses a special kind of decomposed UTF-8 to store filenames. If you need to e.g. read in filenames and write them to a "normal" UTF-8 file, you must normalize them (at least if your editor, or my TeX system, doesn't understand decomposed UTF-8):
filename = unicodedata.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')


For others reading this who aren't familiar with this issue (like I wasn't) here are a few references: My understanding of this is that when you pass a name with an accented character like é, it will decompose this into e plus a combining acute accent (U+0301) before saving it to the filesystem (this decomposition is defined by the Unicode standard).
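The decomposition itself is easy to see with the standard unicodedata module. This is a small sketch, not a description of what the Mac filesystem does internally: it uses plain NFD, which as far as I know is close to, but not necessarily identical with, the form used for filenames there.

```python
import unicodedata

composed = u'\u00e9'    # e-acute as a single code point

# NFD splits it into 'e' followed by U+0301 COMBINING ACUTE ACCENT
decomposed = unicodedata.normalize('NFD', composed)
print([hex(ord(c)) for c in decomposed])

# NFC puts the pieces back together
recomposed = unicodedata.normalize('NFC', decomposed)
print(recomposed == composed)
```

The two forms look identical on screen but compare unequal as strings, which is why normalizing to one form before comparing or writing names is worthwhile.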

If you can add anything else to this section, please leave a comment below!

Unicode and HTML


You may find yourself generating HTML with Python (e.g. when using mod_python, CherryPy, or such). So how do you use Unicode characters in an HTML document?

The answer involves these easy steps:
  1. Use a <meta> tag to let the user's browser know the encoding you used. (footnote 3)
  2. Generate your HTML as a Unicode object.
  3. Write your HTML bytestream using whichever codec you prefer.

Here is an example, writing the same Greek-lettered string I've been using all along:
    code = 'utf-8' # make it easy to switch the codec later
 
    html = u'<html>'

    # use a <meta> tag (placed inside <head>) to specify the document encoding used
    html += u'<head>'
    html += u'<meta http-equiv="content-type" content="text/html; charset=%s">' % code
    html += u'</head><body>'

    # my actual Unicode content ...
    html += u'abc_\u03A0\u03A3\u03A9.txt'

    html += u'</body></html>'

    # Now, you cannot write Unicode directly to a file. 
    # First have to either convert it to a bytestring using a codec, or
    # open the file with the 'codecs' module.

    # Method #1, doing the conversion yourself:
    open('t.html','w').write( html.encode( code ) )

    # Or, by using the codecs module:
    import codecs
    codecs.open('t.html','w',code).write( html )

    # .. the method you use depends on personal preference and/or
    # convenience in the code you are writing.


Now let's open the page (t.html) in Firefox:

win32_02.jpg


Just as expected!

Now, if you go back into the sample code and replace the line:
    code = 'utf-8'


With ...
    code = 'utf-16'


... the HTML file will now be written in UTF-16 format, but the result displayed in the browser window will be exactly the same.
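To make the point concrete, here is a small sketch that skips the file and the HTML wrapper and just compares the two encodings of the same Unicode content in memory. The variable names are made up for the example.

```python
body = u'abc_\u03A0\u03A3\u03A9.txt'

# the same Unicode content, written out with two different codecs
as_utf8 = body.encode('utf-8')
as_utf16 = body.encode('utf-16')   # this codec also prepends a byte-order mark

# the bytestreams differ in both content and length ...
print(as_utf8 == as_utf16)
print("%d bytes vs. %d bytes" % (len(as_utf8), len(as_utf16)))

# ... but decoding each with its own codec yields identical text
print(as_utf8.decode('utf-8') == as_utf16.decode('utf-16'))
```

This is the same thing the browser does: as long as it is told (or can detect) which codec was used, it recovers the same characters from either file.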

Unicode and XML


The XML 1.0 standard requires all parsers to support UTF-8 and UTF-16 encoding. So, it would seem obvious that an XML parser would allow any legal UTF-8 or UTF-16 encoded document as input, right?

Nope!

Have a look at this sample program:
   xml = u'<?xml version="1.0" encoding="utf-8" ?>'
   xml += u'<H> \u0019 </H>'

   # encode as UTF-8
   utf8_string = xml.encode( 'utf-8' )


At this point, utf8_string is a perfectly valid UTF-8 string representing the XML. So we should be able to parse it, right?
from xml.dom.minidom import parseString
parseString( utf8_string )


Here is what happens when we run the above code:
Traceback (most recent call last):
   File "t9.py", line 9, in ?
     parseString( utf8_string )
   File "c:\py23\lib\xml\dom\minidom.py", line 1929, in parseString
     return expatbuilder.parseString(string)
   File "c:\py23\lib\xml\dom\expatbuilder.py", line 940, in parseString
     return builder.parseString(string)
   File "c:\py23\lib\xml\dom\expatbuilder.py", line 223, in parseString
     parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 43


Whoa - what happened there? It gave us an error at column 43. Let's see what column 43 is:
>>> print repr(utf8_string[43])

'\x19'
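Rather than reading the column out of the error message by hand, the position can also be taken from the exception object. The sketch below relies on the lineno and offset attributes that the standard library documents for ExpatError. It is written so that it should run on Python 2.6+ as well as Python 3, which is an assumption beyond the Python versions used elsewhere in this article.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError

xml = u'<?xml version="1.0" encoding="utf-8" ?>'
xml += u'<H> \u0019 </H>'
utf8_string = xml.encode('utf-8')

err = None
try:
    parseString(utf8_string)
except ExpatError as e:
    err = e

if err is not None:
    # lineno is the line where the error was detected, offset the column in it
    print("line %d, column %d" % (err.lineno, err.offset))
    # look at the raw byte at the reported column
    print(repr(utf8_string[err.offset:err.offset + 1]))
```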


You can see that it doesn't like the Unicode character U+0019. Why is this? Section 2.2 of the XML 1.0 standard defines the set of legal characters that may appear in a document. From the standard:
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
         [#xE000-#xFFFD] | [#x10000-#x10FFFF]


Clearly, there are some major gaps in the characters that are legal to include in an XML document. Let's turn the above into a Python function that can be used to test whether a given Unicode value is legal to write to an XML stream:

gnosis.xml.xmlmap.raw_illegal_xml_regex()
def raw_illegal_xml_regex():
    """    
    I want to define a regexp to match *illegal* characters.
    That way, I can do "re.search()" to find a single character,
    instead of "re.match()" to match the entire string. [Based on
    my assumption that .search() would be faster in this case.]

    Here is a verbose map of the XML character space (as defined
    in section 2.2 of the XML specification):
    
         u0000 - u0008           = Illegal
         u0009 - u000A           = Legal
         u000B - u000C           = Illegal
         u000D                   = Legal
         u000E - u001F           = Illegal
         u0020 - uD7FF           = Legal
         uD800 - uDFFF           = Illegal (See note!)
         uE000 - uFFFD           = Legal
         uFFFE - uFFFF           = Illegal
         U00010000 - U0010FFFF = Legal (See note!)
    
    Note:
    
       The range U00010000 - U0010FFFF is coded as 2-character sequences
       using the codes (D800-DBFF),(DC00-DFFF), which are both illegal
       when used as single chars, from above.
    
       Python won't let you define \U character ranges, so you can't
       just say '\U00010000-\U0010FFFF'. However, you can take advantage
       of the fact that (D800-DBFF) and (DC00-DFFF) are illegal, unless
       part of a 2-character sequence, to match for the \U characters.
    """

    # First, add a group for all the basic illegal areas above
    re_xml_illegal = u'([\u0000-\u0008\u000b-\u000c\u000e-\u001f\ufffe-\uffff])'

    re_xml_illegal += u"|"

    # Next, we know that (uD800-uDBFF) must ALWAYS be followed by (uDC00-uDFFF),
    # and (uDC00-uDFFF) must ALWAYS be preceded by (uD800-uDBFF), so this
    # is how we check for the U00010000-U0010FFFF range. There are also special
    # case checks for start & end of string cases.

    # I've defined this oddly due to the bug mentioned at the top of this file
    re_xml_illegal += u'([%s-%s][^%s-%s])|([^%s-%s][%s-%s])|([%s-%s]$)|(^[%s-%s])' % \
                      (unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff),
                       unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff),
                       unichr(0xd800),unichr(0xdbff),unichr(0xdc00),unichr(0xdfff))

    return re_xml_illegal

Using the code ...
import re

def make_illegal_xml_regex():
    return re.compile( raw_illegal_xml_regex() )

c_re_xml_illegal = make_illegal_xml_regex()


Finally:
gnosis.xml.xmlmap.is_legal_xml()
def is_legal_xml( uval ):
    """
    Given a Unicode object, figure out if it is legal
    to place it in an XML file.
    """
    return (c_re_xml_illegal.search( uval ) is None)


The above function is good for when you have a Unicode string, but could be a little slow when searching a character at a time. So here is an alternate function for doing that (note this makes use of the usplit() function defined earlier):

gnosis.xml.xmlmap.is_legal_xml_char()
def is_legal_xml_char( uchar ):
    """
    Check if a single unicode char is XML-legal.
    (This is faster than running the full 'is_legal_xml()' regexp
    when you need to go character-at-a-time. For string-at-a-time
    of course you want to use is_legal_xml().)

    USAGE NOTE:
       If you want to use this in a 'for' loop,
       make sure to use usplit(), e.g.:
          
       for c in usplit( uval ):
          if is_legal_xml_char(c):
                 ... 

       Otherwise, the first char of a legal 2-character
       sequence will be incorrectly tagged as illegal, on
       Pythons where \U is stored as 2-chars.
    """

    # due to inconsistencies in how \U is handled (based on
    # how Python was compiled) it is shorter to test for
    # illegal chars than legal ones, and invert the result.
    #
    # (As one example: (u'\ud900' > u'\U00100000') can be True,
    # depending on how Python was compiled.) Testing for illegal chars
    # lets us stick with the single char sequences (all 2-char
    # sequences are legal for XML).

    if len(uchar) == 1:
        return not \
               (
               (uchar >= u'\u0000' and uchar <= u'\u0008') or \
               (uchar >= u'\u000b' and uchar <= u'\u000c') or \
               (uchar >= u'\u000e' and uchar <= u'\u001f') or \
               # always illegal as single chars
               (uchar >= unichr(0xd800) and uchar <= unichr(0xdfff)) or \
               (uchar >= u'\ufffe' and uchar <= u'\uffff')
               )
    elif len(uchar) == 2:
        # all 2-char codings are legal in XML
        # (this looks weird, but remember that even after calling
        # usplit(), \U00010000 is STILL len() of 2, usplit() just
        # made it a single listitem
        return True
    
    else:
        raise Exception("Must pass a single character to is_legal_xml_char")
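For comparison, the Char production from section 2.2 can also be written directly against integer code points, which sidesteps the question of whether \U characters are stored as one or two chars. This is only a sketch: is_xml_char is a made-up name, not part of gnosis.xml.xmlmap, and it assumes the caller has already combined any 2-character sequence into a single code point.

```python
def is_xml_char(cp):
    """Return True if integer code point cp matches the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD) or
            0x20 <= cp <= 0xD7FF or
            0xE000 <= cp <= 0xFFFD or
            0x10000 <= cp <= 0x10FFFF)

# a few code points from each side of the boundaries
for cp in (0x9, 0x19, 0x20, 0xD800, 0xFFFE, 0x10000):
    print("U+%04X %s" % (cp, is_xml_char(cp)))
```

Because it tests for legal ranges rather than illegal ones, a lone surrogate such as 0xD800 falls through every range and is rejected without any special casing.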


Here is a fairly extensive test case to demonstrate the above functions:

test_xml_legality
from xml.dom.minidom import parseString
import re
import sys

# define True/False if this Python doesn't have them
try:
    a = True
except:
    True = 1
    False = 0

from gnosis.xml.xmlmap import *

# sanity check for testing purposes
def try_in_xml( uval ):
    "Try putting the Unicode string uval into an XML doc & parsing."
    
    xml = u'<?xml version="1.0" encoding="utf-8" ?>'
    xml += u'<H>' + uval + '</H>'

    #print [u for u in usplit(xml) if u >= u'\U00010000']

    try:
        parseString(xml.encode('utf-8'))
        return True # succeeded
    except:
        return False # failed

# --- test cases ---

bad_unicode = [
    # 0000-0008 is illegal
    u'abc\u0001def',
    # 000B-000C is illegal
    u'abc\u000cdef',
    # 000E-001F is illegal
    u'abc\u0015def',
    # D800-DBFF is illegal, unless it starts a 2-char sequence
    u'abc\ud900def',
    # DC00-DFFF is illegal, unless it ends a 2-char sequence
    u'abc\uDDDDdef',
    # FFFE-FFFF is illegal
    u'abc\ufffedef',
    # case of D800-DBFF at end of string (next to last segment of regex)
    u'abc\ud800',
    # case of DC00-DFFF at start of string (last segment of regex)
    u'\udc00'
    ]

good_unicode = [
    # 0009-000A is legal
    u'abc\u0009def\u000aghi',
    # 000D is legal
    u'abc\u000ddef',
    # 0020-D7FF is legal
    u'abc\u0020def\u8112ghi\ud7ffjkl',
    # E000-FFFD is legal
    u'abc\ue000def\uF123ghi\ufffdjkl',
    # U00010000 - U0010FFFF is legal
    u'abc\U00010000def\U00023456ghi\U00101234jkl'
    ]

if __name__ == '__main__':

    print "** BAD VALUES **"
    for u in bad_unicode:
        # print the unicode value, test legality, and sanity check by
        # putting it in an XML document & parsing it
        print "%-50s %8s %1d" % (repr(u), is_legal_xml(u), try_in_xml(u))

    print "\n** GOOD VALUES **"
    for u in good_unicode:
        # print the unicode value, test legality, and sanity check by
        # putting it in an XML document & parsing it
        print "%-50s %8s %1d" % (repr(u), is_legal_xml(u), try_in_xml(u))

    # an all-illegal string
    u = u'\u0000\u0005\u0008\u000b\u000c\u000e\u0010\u0019' + \
        u'\ud800\ud900\u0000\udc00\udd00\udfff\ufffe\uffff'

    print "\nTesting one char at a time ..."
    print repr(u)
    for c in usplit(u):
        # test as a char
        if is_legal_xml_char(c):
            raise Exception("ERROR(1)")

        # test as a string
        if is_legal_xml(c):
            raise Exception("ERROR(2)")

        # stick in an XML document to double-check the above
        if try_in_xml(c) != False:
            raise Exception("ERROR(3)")

    print "OK\n"

    # an all-legal string
    u = u'\u0009\u000a\u000d\u0020\u2345\ud7ff' + \
        u'\ue000\ue876\ufffd' + \
        u'\U00010000\U00012345\U00100000\U0010ffff'
    # subtle -- make sure it allows a handcoded 2-char sequence (this
    # is the case that forces usplit() to do a full pass even if \U is
    # stored as single chars)
    u += u'\ud800\udc00' 

    print repr(u)
    for c in usplit(u):
        # test as a char
        if not is_legal_xml_char(c):
            raise Exception("ERROR(1)")

        # test as a string
        if not is_legal_xml(c):
            raise Exception("ERROR(2)")

        # stick in an XML document to double-check the above
        if try_in_xml(c) != True:
            raise Exception("ERROR(3)")

    print "OK"


I'm going to run this under two different versions of Python to show the differences you can see in \U coding.

First, under Python 2.0 (which uses 2-char \U encoding, on my machine):
 ** BAD VALUES **
 u'abc\001def'                                             0 0
 u'abc\014def'                                             0 0
 u'abc\025def'                                             0 0
 u'abc\uD900def'                                           0 0
 u'abc\uDDDDdef'                                           0 0
 u'abc\uFFFEdef'                                           0 0
 u'abc\uD800'                                              0 0
 u'\uDC00'                                                 0 0

 ** GOOD VALUES **
 u'abc\011def\012ghi'                                      1 1
 u'abc\015def'                                             1 1
 u'abc def\u8112ghi\uD7FFjkl'                              1 1
 u'abc\uE000def\uF123ghi\uFFFDjkl'                         1 1
 u'abc\uD800\uDC00def\uD84D\uDC56ghi\uDBC4\uDE34jkl'        1 1

 Testing one char at a time ...
 u'\000\005\010\013\014\016\020\031\uD800\uD900\000\uDC00\uDD00
 \uDFFF\uFFFE\uFFFF'
 OK

 u'\011\012\015 \u2345\uD7FF\uE000\uE876\uFFFD\uD800\uDC00\uD808
 \uDF45\uDBC0\uDC00\uDBFF\uDFFF\uD800\uDC00'
 OK
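The 2-char sequences in the Python 2.0 output above follow from the standard UTF-16 surrogate arithmetic. Here is a minimal sketch of that calculation; to_surrogates is a made-up helper name for this example.

```python
def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+10000..U+10FFFF")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low

# the three \U values used in the last GOOD VALUES test string
for cp in (0x10000, 0x23456, 0x101234):
    print("U+%06X -> \\u%04X \\u%04X" % ((cp,) + to_surrogates(cp)))
```

For example, \U00023456 works out to the pair D84D, DC56, which is what the Python 2.0 repr() shows for that character.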


And now under Python 2.3, which on my machine stores \U as a single character:
 ** BAD VALUES **
 u'abc\x01def'                                         False 0
 u'abc\x0cdef'                                         False 0
 u'abc\x15def'                                         False 0
 u'abc\ud900def'                                       False 0
 u'abc\udddddef'                                       False 0
 u'abc\ufffedef'                                       False 0
 u'abc\ud800'                                          False 0
 u'\udc00'                                             False 0

 ** GOOD VALUES **
 u'abc\tdef\nghi'                                       True 1
 u'abc\rdef'                                            True 1
 u'abc def\u8112ghi\ud7ffjkl'                           True 1
 u'abc\ue000def\uf123ghi\ufffdjkl'                      True 1
 u'abc\U00010000def\U00023456ghi\U00101234jkl'          True 1

 Testing one char at a time ...
 u'\x00\x05\x08\x0b\x0c\x0e\x10\x19\ud800\ud900\x00\udc00\udd00
   \udfff\ufffe\uffff'
 OK

 u'\t\n\r \u2345\ud7ff\ue000\ue876\ufffd\U00010000\U00012345
   \U00100000\U0010ffff\U00010000'
 OK


You can see that both versions of Python give the same answers (except Python 2.0 uses 1/0 instead of True/False). But you can see in the repr()