httpie - Python-powered HTTP CLI for humans
Wynn Netherland posted this 2 days ago

Although cURL is great, we’re always looking for great console tools for working with HTTP. Jakub Roztocil has released HTTPie. Built on Requests, HTTPie provides a clean command line interface for HTTP requests:

http PATCH api.example.com/person/1 X-API-Token:123 name=John email=john@example.org

PATCH /person/1 HTTP/1.1
User-Agent: HTTPie/0.1
X-API-Token: 123
Content-Type: application/json; charset=utf-8

{"name": "John", "email": "john@example.org"}

I appreciate the colored terminal output:

HTTPie output

Source on GitHub.

Beautiful Soup 4 Now in Beta

Computing Thoughts
by Bruce Eckel
February 23, 2012

Many improvements have been made in this new version -- for one thing, it's compatible with both Python 2 and Python 3. One of the biggest changes is that Beautiful Soup no longer uses its own parser; instead it chooses from what's available on your system, preferring the blazingly-fast lxml but falling back to other parsers and using Python's batteries-included parser if nothing else is available.
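For instance, a rough sketch of the new parser-selection behavior (the sample markup and names here are made up, and the second call assumes lxml is installed):

from bs4 import BeautifulSoup

html = "<html><body><p class='greeting'>Hello</p></body></html>"

# With no parser named, Beautiful Soup 4 picks the best parser installed
# (lxml if present, otherwise another parser, falling back to the
# batteries-included one).
soup = BeautifulSoup(html)

# Or ask for a specific parser explicitly:
soup = BeautifulSoup(html, "lxml")

# Method names now follow Python conventions: findAll -> find_all, etc.
print(soup.find_all("p", class_="greeting")[0].get_text())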

A useful tip if you're on Windows: you can find a pre-compiled Windows version of lxml here. That site has lots of pre-compiled Python extensions which is extremely helpful, as some of these packages (like lxml) otherwise require some serious gyrations in order to install them on your Windows box. (I work regularly on Mac, Windows 7 and Ubuntu Linux, in order to ensure that whatever I'm working on is cross-platform.)

Beautiful Soup has been refactored in many places; sometimes these changes constitute a significant improvement to the programming model, other times the changes are just to conform to the Python naming syntax or to ensure Python 2/3 compatibility.

The author Leonard Richardson is open to suggestions for improvements, so if you've had a feature request sitting on your back burner, now's the time.

Here's the introductory link to the Beautiful Soup 4 Beta.

I've been using Beautiful Soup to process a book that I'm coauthoring via Google Docs. We can work on the book remotely at the same time, which is something I've tried to do with other technologies via screen sharing. It works best with Google Docs because there's no setup necessary if we want to have a phone conversation about the document while working on it. Then I download the book in HTML format and apply the Beautiful Soup tools I've written to process the HTML. Although I've spent a fair amount of time on these, the investment is worth it because HTML isn't going away anytime soon so my Beautiful Soup skills should come in handy again and again.

Welcome to Scrapy


Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
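As a rough sketch of what "writing the rules" looks like with the classic 0.x API (the site, item fields, and XPath below are placeholders, not part of the Scrapy docs):

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class PageItem(Item):
    title = Field()

class ExampleSpider(BaseSpider):
    name = "example"
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # extract structured data from the page with an XPath rule
        hxs = HtmlXPathSelector(response)
        item = PageItem()
        item["title"] = hxs.select("//title/text()").extract()
        return [item]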

News

2011-09-16 Scrapy (source code, wiki & issues) has moved to Github!
2011-01-02 Scrapy 0.12 released!
2010-09-29 Scrapy 0.10.3 released!
2010-09-15 Scrapy 0.10.1 and 0.10.2 released!
2010-09-10 Scrapy 0.10 released!
2010-09-10 Scrapy Blog launched!
2010-08-24 Scrapy Snippets site launched!
2010-06-28 Scrapy 0.9 released!

See older news

Features

Simple
Scrapy was designed with simplicity in mind, by providing the features you need without getting in your way
Productive
Just write the rules to extract the data from web pages and let Scrapy crawl the entire web site for you
Fast
Scrapy is used in production crawlers to completely scrape more than 500 retailer sites daily, all on a single server
Extensible
Scrapy was designed with extensibility in mind and so it provides several mechanisms to plug new code without having to touch the framework core
Portable
Scrapy runs on Linux, Windows, Mac and BSD
Open Source and 100% Python
Scrapy is completely written in Python, which makes it very easy to hack
Well-tested
Scrapy has an extensive test suite with very good code coverage

Still not sure if Scrapy is what you're looking for? Check out Scrapy at a glance.

Companies using Scrapy

Scrapy is being used in large production environments, to crawl thousands of sites daily. Here is a list of Companies using Scrapy.

Where to start?

Start by reading Scrapy at a glance, then download Scrapy and follow the Tutorial.

Python’s super() considered super! « Deep Thoughts by Raymond Hettinger


If you aren’t wowed by Python’s super() builtin, chances are you don’t really know what it is capable of doing or how to use it effectively.

Much has been written about super() and much of that writing has been a failure. This article seeks to improve on the situation by:

  • providing practical use cases
  • giving a clear mental model of how it works
  • showing the tradecraft for getting it to work every time
  • giving concrete advice for building classes that use super()
  • favoring real examples over abstract ABCD diamond diagrams.

The examples for this post are available in both Python 2 syntax and Python 3 syntax.

Using Python 3 syntax, let’s start with a basic use case, a subclass for extending a method from one of the builtin classes:

class LoggingDict(dict):
    def __setitem__(self, key, value):
        logging.info('Setting %r to %r' % (key, value))
        super().__setitem__(key, value)

This class has all the same capabilities as its parent, dict, but it extends the __setitem__ method to make log entries whenever a key is updated. After making a log entry, the method uses super() to delegate the work for actually updating the dictionary with the key/value pair.

Before super() was introduced, we would have hardwired the call with dict.__setitem__(self, key, value). However, super() is better because it is a computed indirect reference.

One benefit of indirection is that we don’t have to specify the delegate class by name. If you edit the source code to switch the base class to some other mapping, the super() reference will automatically follow. You have a single source of truth:

class LoggingDict(SomeOtherMapping):            # new base class
    def __setitem__(self, key, value):
        logging.info('Setting %r to %r' % (key, value))
        super().__setitem__(key, value)         # no change needed

In addition to isolating changes, there is another major benefit to computed indirection, one that may not be familiar to people coming from static languages. Since the indirection is computed at runtime, we have the freedom to influence the calculation so that the indirection will point to some other class.

The calculation depends on both the class where super is called and on the instance’s tree of ancestors. The first component, the class where super is called, is determined by the source code for that class. In our example, super() is called in the LoggingDict.__setitem__ method. That component is fixed. The second and more interesting component is variable (we can create new subclasses with a rich tree of ancestors).

Let’s use this to our advantage to construct a logging ordered dictionary without modifying our existing classes:

class LoggingOD(LoggingDict, collections.OrderedDict):
    pass

The ancestor tree for our new class is: LoggingOD, LoggingDict, OrderedDict, dict, object. For our purposes, the important result is that OrderedDict was inserted after LoggingDict and before dict! This means that the super() call in LoggingDict.__setitem__ now dispatches the key/value update to OrderedDict instead of dict.

Think about that for a moment. We did not alter the source code for LoggingDict. Instead we built a subclass whose only logic is to compose two existing classes and control their search order.

__________________________________________________________________________________________________________________

Search Order

What I’ve been calling the search order or ancestor tree is officially known as the Method Resolution Order or MRO. It’s easy to view the MRO by printing the __mro__ attribute:

>>> pprint(LoggingOD.__mro__)
(<class '__main__.LoggingOD'>,
 <class '__main__.LoggingDict'>,
 <class 'collections.OrderedDict'>,
 <class 'dict'>,
 <class 'object'>)

If our goal is to create a subclass with an MRO to our liking, we need to know how it is calculated. The basics are simple. The sequence includes the class, its base classes, and the base classes of those bases and so on until reaching object which is the root class of all classes. The sequence is ordered so that a class always appears before its parents, and if there are multiple parents, they keep the same order as the tuple of base classes.

The MRO shown above is the one order that follows from those constraints:

  • LoggingOD precedes its parents, LoggingDict and OrderedDict
  • LoggingDict precedes OrderedDict because LoggingOD.__bases__ is (LoggingDict, OrderedDict)
  • LoggingDict precedes its parent which is dict
  • OrderedDict precedes its parent which is dict
  • dict precedes its parent which is object

The process of solving those constraints is known as linearization. There are a number of good papers on the subject, but to create subclasses with an MRO to our liking, we only need to know the two constraints: children precede their parents and the order of appearance in __bases__ is respected.
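A quick way to see those two constraints in action is to ask for an ordering that cannot satisfy them (an illustrative snippet, not from the original article):

class A: pass
class B(A): pass

try:
    # The base order (A, B) wants A before B, but "children precede their
    # parents" wants B before A, so no linearization exists.
    class C(A, B): pass
except TypeError as e:
    print(e)    # Cannot create a consistent method resolution order (MRO) ...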

__________________________________________________________________________________________________________________

Practical Advice

super() is in the business of delegating method calls to some class in the instance’s ancestor tree. For reorderable method calls to work, the classes need to be designed cooperatively. This presents three easily solved practical issues:

  • the method being called by super() needs to exist
  • the caller and callee need to have a matching argument signature
  • and every occurrence of the method needs to use super()

1) Let’s first look at strategies for getting the caller’s arguments to match the signature of the called method. This is a little more challenging than traditional method calls where the callee is known in advance. With super(), the callee is not known at the time a class is written (because a subclass written later may introduce new classes into the MRO).

One approach is to stick with a fixed signature using positional arguments. This works well with methods like __setitem__ which have a fixed signature of two arguments, a key and a value. This technique is shown in the LoggingDict example where __setitem__ has the same signature in both LoggingDict and dict.

A more flexible approach is to have every method in the ancestor tree cooperatively designed to accept keyword arguments and a keyword-arguments dictionary, to remove any arguments that it needs, and to forward the remaining arguments using **kwds, eventually leaving the dictionary empty for the final call in the chain.

Each level strips-off the keyword arguments that it needs so that the final empty dict can be sent to a method that expects no arguments at all (for example, object.__init__ expects zero arguments):

class Shape:
    def __init__(self, shapename, **kwds):
        self.shapename = shapename
        super().__init__(**kwds)        

class ColoredShape(Shape):
    def __init__(self, color, **kwds):
        self.color = color
        super().__init__(**kwds)

cs = ColoredShape(color='red', shapename='circle')

2) Having looked at strategies for getting the caller/callee argument patterns to match, let’s now look at how to make sure the target method exists.

The above example shows the simplest case. We know that object has an __init__ method and that object is always the last class in the MRO chain, so any sequence of calls to super().__init__ is guaranteed to end with a call to object.__init__. In other words, the target of the super() call is guaranteed to exist and won't fail with an AttributeError.

For cases where object doesn’t have the method of interest (a draw() method for example), we need to write a root class that is guaranteed to be called before object. The responsibility of the root class is simply to eat the method call without making a forwarding call using super().

Root.draw can also employ defensive programming using an assertion to ensure it isn't masking some other draw() method later in the chain.  This could happen if a subclass erroneously incorporates a class that has a draw() method but doesn't inherit from Root:

class Root:
    def draw(self):
        # the delegation chain stops here
        assert not hasattr(super(), 'draw')

class Shape(Root):
    def __init__(self, shapename, **kwds):
        self.shapename = shapename
        super().__init__(**kwds)
    def draw(self):
        print('Drawing.  Setting shape to:', self.shapename)
        super().draw()

class ColoredShape(Shape):
    def __init__(self, color, **kwds):
        self.color = color
        super().__init__(**kwds)
    def draw(self):
        print('Drawing.  Setting color to:', self.color)
        super().draw()

cs = ColoredShape(color='blue', shapename='square')
cs.draw()

If subclasses want to inject other classes into the MRO, those other classes also need to inherit from Root so that no path for calling draw() can reach object without having been stopped by Root.draw. This should be clearly documented so that someone writing new cooperating classes will know to subclass from Root. This restriction is not much different than Python’s own requirement that all new exceptions must inherit from BaseException.

3) The techniques shown above assure that super() calls a method that is known to exist and that the signature will be correct; however, we’re still relying on super() being called at each step so that the chain of delegation continues unbroken. This is easy to achieve if we’re designing the classes cooperatively – just add a super() call to every method in the chain.

The three techniques listed above provide the means to design cooperative classes that can be composed or reordered by subclasses.

__________________________________________________________________________________________________________________

How to Incorporate a Non-cooperative Class

Occasionally, a subclass may want to use cooperative multiple inheritance techniques with a third-party class that wasn’t designed for it (perhaps its method of interest doesn’t use super() or perhaps the class doesn’t inherit from the root class). This situation is easily remedied by creating an adapter class that plays by the rules.

For example, the following Moveable class does not make super() calls, and it has an __init__() signature that is incompatible with object.__init__, and it does not inherit from Root:

class Moveable:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def draw(self):
        print('Drawing at position:', self.x, self.y)

If we want to use this class with our cooperatively designed ColoredShape hierarchy, we need to make an adapter with the requisite super() calls:

class MoveableAdapter(Root):
    def __init__(self, x, y, **kwds):
        self.movable = Moveable(x, y)
        super().__init__(**kwds)
    def draw(self):
        self.movable.draw()
        super().draw()

class MovableColoredShape(ColoredShape, MoveableAdapter):
    pass

MovableColoredShape(color='red', shapename='triangle',
                    x=10, y=20).draw()

__________________________________________________________________________________________________________________

Complete Example – Just for Fun

In Python 2.7 and 3.2, the collections module has both a Counter class and an OrderedDict class. Those classes are easily composed to make an OrderedCounter:

from collections import Counter, OrderedDict

class OrderedCounter(Counter, OrderedDict):
    'Counter that remembers the order elements are first seen'
    def __repr__(self):
        return '%s(%r)' % (self.__class__.__name__,
                           OrderedDict(self))
    def __reduce__(self):
        return self.__class__, (OrderedDict(self),)

oc = OrderedCounter('abracadabra')
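For a quick check of the composed behaviour, printing oc shows the counts kept in first-seen order (the output line is shown as a comment for illustration; the repr comes from the __repr__ defined above):

print(oc)
# OrderedCounter(OrderedDict([('a', 5), ('b', 2), ('r', 2), ('c', 1), ('d', 1)]))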

__________________________________________________________________________________________________________________

Notes and References

* When subclassing a builtin such as dict(), it is often necessary to override or extend multiple methods at a time. In the above examples, the __setitem__ extension isn’t used by other methods such as dict.update, so it may be necessary to extend those also. This requirement isn’t unique to super(); rather, it arises whenever builtins are subclassed.

* If a class relies on one parent class preceding another (for example, LoggingOD depends on LoggingDict coming before OrderedDict which comes before dict), it is easy to add assertions to validate and document the intended method resolution order:

position = LoggingOD.__mro__.index
assert position(LoggingDict) < position(OrderedDict)
assert position(OrderedDict) < position(dict)

* Good write-ups for linearization algorithms can be found at Python MRO documentation and at Wikipedia entry for C3 Linearization.

* The Dylan programming language has a next-method construct that works like Python’s super(). See Dylan’s class docs for a brief write-up of how it behaves.

* The Python 3 version of super() is used in this post. The full working source code can be found at:  Recipe 577720. The Python 2 syntax differs in that the type and object arguments to super() are explicit rather than implicit. Also, the Python 2 version of super() only works with new-style classes (those that explicitly inherit from object or other builtin type). The full working source code using Python 2 syntax is at Recipe 577721.
__________________________________________________________________________________________________________________

Acknowledgements

Several Pythonistas did a pre-publication review of this article.  Their comments helped improve it quite a bit.

They are:  Laura Creighton, Alex Gaynor, Philip Jenvey, Brian Curtin, David Beazley, Chris Angelico, Jim Baker, Ethan Furman, and Michael Foord.  Thanks one and all.


Python Unicode

... and even more about Unicode


Two weeks before I started writing this document, my knowledge of using Python and Unicode was about like this:

All there is to using Unicode in Python is just passing your strings to unicode()



Now where would I get such a strange idea? Oh, that's right, from the Python tutorial on Unicode, which states:

"Creating Unicode strings in Python is just as simple as creating normal strings":

>>> u'Hello World !'
u'Hello World !'



While this example is technically correct, it can be misleading to the Unicode newbie, since it glosses over several details needed for real-life usage. This overly-simplified explanation gave me a completely wrong understanding of how Unicode works in Python.

If you have been led down the overly-simplistic path as well, then this tutorial will hopefully help you out. This tutorial contains a set of examples, tests, and demos that document my "relearning" of the correct way to work with Unicode in Python. It includes cross-platform issues, as well as issues that arise when dealing with HTML, XML, and filesystems.

By the way, Unicode is fairly simple; I just wish I had learned it correctly the first time.


At a top level, computers use three types of text representations:
  1. ASCII
  2. Multibyte character sets
  3. Unicode
     
I think Unicode is easier to understand if you understand how it evolved from ASCII. The following is a brief synopsis of this evolution.


In the beginning, there was ASCII. (OK, there was also EBCDIC, but that never caught on outside of mainframes, so I'm omitting it here.) ASCII proper defines 128 characters, though the full 8-bit range of 256 codes is shown on this ASCII Chart. Even where 256 codes are available, the lower 128 (codes 0-127) are the most often used. Early email systems would only allow you to transmit characters 0-127 (i.e. "7-bit text"), and this is still true of many systems today. As you can see from the chart, ASCII is sufficient for English language documents.

Problems arose as computer use grew in countries where ASCII was not sufficient. ASCII lacks the ability to handle Greek, Cyrillic, or Japanese texts, to name a few. Furthermore, Japanese texts alone need thousands of characters, so there is no way to fit them into an 8-bit scheme. To overcome this, Multibyte Character Sets were invented. Most (if not all?) Multibyte Character Sets take advantage of the fact that only the first 128 characters of the ASCII set are commonly used (codes 0-127 in decimal, or 0x00-0x7f in hex). The upper codes (128..255 in decimal, or 0x80-0xff in hex) are used to define the non-English extended sets.

Let's look at an example: Shift-JIS is one encoding for Japanese text. You can see its character table here. Notice that the first byte of each two-byte character is in the range 0x80 - 0xfc. This is an interesting property, because it means that English and Japanese text can be freely mixed! The string "Hello World!" is a perfectly valid Shift-JIS encoding of English text. When parsing Shift-JIS, if you get a byte in that lead-byte range, you know it is the first byte of a two-byte sequence; otherwise, it is a single byte of regular ASCII.

This works just fine as long as you are working only in Japanese, but what happens if you switch to a Greek character set? As you can see from the table, ISO-8859-7 has redefined the codes from 0x80-0xff in a completely different way than Shift-JIS defines them. So, although you can mix English and Japanese, you cannot mix Greek and Japanese since they would step on each other. This is a common problem with mixing any multibyte character sets.


To overcome the problem of mixing different languages, Unicode proposes to combine all of the world's character sets into a single huge table. Take a look at the Unicode character set.

At first glance, there appear to be separate tables for each language, so you may not see the improvement over ASCII. In reality, though, these are all in the same table, and are just indexed here for easy (human) reference. The key thing to notice is that since these are all part of the same table, they don't overlap like in the ASCII/multibyte world. This allows Unicode documents to freely mix languages with no coding conflicts.


Let's look at the Greek chart and grab a few characters:
Sample Unicode Symbols
03A0 Π Greek Capital Letter Pi
03A3 Σ Greek Capital Letter Sigma
03A9 Ω Greek Capital Letter Omega


It is common to refer to these symbols using the notation U+NNNN, for example U+03A0. So we could define a string that contains these characters, using the following notation (I added brackets for clarity):

uni = {U+03A0} + {U+03A3} + {U+03A9} 



Now, even though we know exactly what 'uni' represents (ΠΣΩ), note that there is no way to:
  • Print uni to the screen.
  • Save uni to a file.
  • Add uni to another piece of text.
  • Tell me how many bytes it takes to store uni.
     
Why? Because uni is an idealized Unicode string - nothing more than a concept at this point. Shortly we'll see how to print it, save it, and manipulate it, but for now, take note of the last statement: There is no way to tell me how many bytes it takes to store uni. In fact, you should forget all about bytes and think of Unicode strings as sets of symbols.

Why should you forget about bytes in the Unicode world? Take the Greek symbol Omega: Ω. There are at least 4 ways to encode this as binary:

Encoding name   Binary representation
ISO-8859-7      \xD9  ("native" Greek encoding)
UTF-8           \xCE\xA9
UTF-16          \xFF\xFE\xA9\x03
UTF-32          \xFF\xFE\x00\x00\xA9\x03\x00\x00
Each of these is a perfectly valid coding of Ω, but trying to work with bytes like this is no better than dealing with the ASCII/Multibyte world. This is why I say you should think of Unicode as symbols (Ω), not as bytes.
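Jumping ahead slightly to the .encode() method covered below, you can reproduce most of that table yourself (note that the UTF-16 result includes a byte-order mark, and a UTF-32 codec is only available on newer Pythons, so it is left out here):

omega = u'\u03a9'   # GREEK CAPITAL LETTER OMEGA
for codec in ('iso-8859-7', 'utf-8', 'utf-16'):
    print codec, repr(omega.encode(codec))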


To convert our idealized Unicode string uni (ΠΣΩ) to a useful form, we need to look at a few things:
  1. Representing Unicode literals
  2. Converting Unicode to binary
  3. Converting binary to Unicode
  4. Using string operations
     

Creating a Unicode string from symbols is very easy. Recall our Greek symbols from above:

Sample Unicode Symbols
03A0 Π Greek Capital Letter Pi
03A3 Σ Greek Capital Letter Sigma
03A9 Ω Greek Capital Letter Omega


Let's say we want to make a Unicode string with those characters, plus some good old-fashioned ASCII characters.

Pseudocode:
uni = 'abc_' + {U+03A0} + {U+03A3} + {U+03A9} + '.txt'
 
Here is how you make that string in Python:
uni = u"abc_\u03a0\u03a3\u03a9.txt"


A few things to notice:
  • Plain-ASCII characters can be written as themselves. You can just say "a", and not have to use the Unicode symbol "\u0061". (But remember, "a" really is {U+0061}; there is no such thing as a Unicode symbol "a".)
  • The \u escape sequence is used to denote Unicode codes.
    • This is somewhat like the traditional C-style \xNN to insert binary values. However, a glance at the Unicode table shows values with up to 6 digits. These cannot be represented conveniently by \xNN, so \u was invented.
    • For Unicode values up to (and including) 4 digits, use the 4-digit version:
      \uNNNN
      Note that you must include all 4 digits, using leading 0's as needed.
    • For Unicode values longer than 4 digits, use the 8-digit version:
      \UNNNNNNNN
      Note that you must include all 8 digits, using leading 0's as needed.
       
Here is another example:

Pseudocode:
uni = {U+1A} + {U+BC3} + {U+1451} + {U+1D10C}
 
Python:
uni = u'\u001a\u0bc3\u1451\U0001d10c'
 
Note how I padded each of the values to 4/8 digits as appropriate. Python will give you an error if you don't do this. Also note that you can use either capital or lowercase letters in the codes. The following would give you exactly the same thing:

Python:
uni = u'\u001A\u0BC3\u1451\U0001D10C'



Remember how I said earlier that uni has no fixed computer representation. So what happens if we try to print uni?

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni



You would see:

Traceback (most recent call last):
  File "t6.py", line 2, in ?
    print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-4:
ordinal not in range(128)



What happened? Well, you told Python to print uni, but since uni has no fixed computer representation, Python first had to convert uni to some printable form. Since you didn't tell Python how to do the conversion, it assumed you wanted ASCII. Unfortunately, ASCII can only handle values from 0 to 127, and uni contains values out of that range, hence you see an error.

A quick way to print uni is to use Python's built-in repr() function:

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print repr(uni)



This prints:

u'\x1a\u0bc3\u1451\U0001d10c'



This of course makes sense, since that's exactly how we just defined uni. But repr(uni) is just as useless in the real world as uni itself. What we really need to do is learn about codecs.

Codecs
In general, Python's codecs allow arbitrary object-to-object transformations. However, in the context of this article, it is enough to think of codecs as functions that transform Unicode objects into binary Python strings, and vice versa.
Why do we need them?
Unicode objects have no fixed computer representation. Before a Unicode object can be printed, stored to disk, or sent across a network, it must be encoded into a fixed computer representation. This is done using a codec. Some popular codecs you may have heard about in your day to day experiences: ascii, iso-8859-7, UTF-8, UTF-16.
 

To turn a Unicode value into a binary representation, you call its .encode method with the name of the codec. For example, to convert a Unicode value to UTF-8:

binary = uni.encode("utf-8")



How about we make uni more interesting and add some plain text characters:

uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"



Now let's have a look at how different codecs represent uni. Here is a little test program:

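The original test_codec01.py listing isn't included in this copy; the sketch below is a reconstruction that matches the output shown next (the exact codec list and the use of errors='replace' for the two lossy codecs are assumptions):

uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"

# two lossless codecs, shown with repr()
for codec in ('utf-8', 'utf-16'):
    print codec.upper(), repr(uni.encode(codec))

# two lossy codecs, with the unmappable characters replaced by '?'
for codec in ('ascii', 'iso-8859-1'):
    print codec.upper(), uni.encode(codec, 'replace')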

This results in the output:

UTF-8 'Hello\x1a\xe0\xaf\x83\xe1\x91\x91\xf0\x9d\x84\x8cUnicode'
UTF-16 '\xff\xfeH\x00e\x00l\x00l\x00o\x00\x1a\x00\xc3\x0bQ\x144
        \xd8\x0c\xddU\x00n\x00i\x00c\x00o\x00d\x00e\x00'
ASCII Hello????Unicode
ISO-8859-1 Hello????Unicode



Note that I still used repr() to print the UTF-8 and UTF-16 strings. Why? Well, otherwise, it would have printed raw binary values to the screen which would have been hard to capture in this document.


Say someone gives you a UTF-8 encoded version of a Unicode object. How do you convert it back into Unicode? You might naively try this:

The Naive (and Wrong) Way

uni = unicode( utf8_string )



Why is this wrong? Here is a sample program doing exactly that:

uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"
utf8_string = uni.encode('utf-8')

# naively convert back to Unicode
uni = unicode(utf8_string)



Here is what happens:

Traceback (most recent call last):
    File "t6.py", line 5, in ?
    uni = unicode(utf8_string)

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0
    in position 6: ordinal not in range(128)



You see, the function unicode() really takes two parameters:

def unicode(string, encoding):
     ....



In the above example, we omitted the encoding so Python, in faithful style, assumed once again that we wanted ASCII (footnote 1), and gave us the wrong thing.

Here is the correct way to do it:

uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"
utf8_string = uni.encode('utf-8')

# have to decode with the same codec the encoder used!
uni = unicode(utf8_string,'utf-8')
print "Back from UTF-8: ",repr(uni)



Which gives the output:

    Back from UTF-8:  u'Hello\x1a\u0bc3\u1451\U0001d10cUnicode'




The above examples hopefully give you a good idea of why you want to avoid dealing with Unicode values as binary strings as much as possible! The UTF-8 version was 23 bytes long, the UTF-16 version was 36 bytes, the ASCII version was only 16 bytes (but it completely discarded 4 Unicode values) and similarly with ISO-8859-1.

This is why, at the very start of this document I suggested that you forget all about bytes!

The good news is that once you have a Unicode object, it behaves exactly like a regular string object, so there is no new syntax to learn (other than the \u and \U escapes). Here is a short sample that shows Unicode objects behaving the way you would expect:

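The original test_stringops01.py listing isn't included in this copy; a minimal reconstruction that produces the output below (on a 'UTF-16' build, where the \U character occupies indexes 8 and 9) might look like this:

uni = u"Hello\u001A\u0BC3\u1451\U0001D10CUnicode"

print "uni = ", repr(uni)
print "len(uni) = ", len(uni)
print "uni[:5] = ", uni[:5]
for i in range(5, 10):
    print "uni[%d] = " % i, repr(uni[i])
print "uni[10:] = ", repr(uni[10:])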

Running this sample gives the output:

uni =  u'Hello\x1a\u0bc3\u1451\U0001d10cUnicode'
len(uni) =  17
uni[:5] =  Hello
uni[5] =  u'\x1a'
uni[6] =  u'\u0bc3'
uni[7] =  u'\u1451'
uni[8] =  u'\ud834'
uni[9] =  u'\udd0c'
uni[10:] =  u'Unicode'




Depending on how your version of Python was compiled, it will store Unicode objects internally in either UTF-16 (2 bytes/character) or UTF-32 (4 bytes/character) format. Unfortunately this low-level detail is exposed through the normal string interface.

For 4-digit (16-bit) characters like \u03a0, there is no difference.

a = u'\u03a0'
print len(a)

Will show a length of 1, regardless of how your Python was built, and a[0] will always be \u03a0. However, for 8-digit (32-bit) characters, like \U0001FF00, you will see a difference. Obviously, 32-bit values cannot be represented directly by a single 16-bit code, so a pair of 16-bit values is used. (Codes 0xD800 - 0xDFFF, the "surrogates", are reserved for these two-code sequences, known as surrogate pairs. These values are invalid when used by themselves, per the Unicode specification.)

A sample program that shows what happens:

What happens with \U ...

a = u'\U0001ff00'
print "Length:",len(a)

print "Chars:"
for c in a:
    print repr(c)

If you run this under a "UTF-16" Python, you will see:

Output, 'UTF-16' Python

Length: 2
Chars:
u'\ud83f'
u'\udf00'

Under a 'UTF-32' Python, you will see:

Output, 'UTF-32' Python

Length: 1
Chars:
u'\U0001ff00'

This is an annoying detail to have to worry about. I wrote a module that lets you step character-by-character through a Unicode string, regardless of whether you are running on a 'UTF-16' or 'UTF-32' flavor of Python. It is called xmlmap and is part of Gnosis Utils. Here are two examples, one using xmlmap, one not.

Without xmlmap

a = u'A\U0001ff00C\U0001fafbD'
print "Length:",len(a)

print "Chars:"
for c in a:
    print repr(c)

Results without xmlmap, on a UTF-16 Python

Length: 7
Chars:
u'A'
u'\ud83f'
u'\udf00'
u'C'
u'\ud83e'
u'\udefb'
u'D'

Now, using the usplit() function to get the characters one at a time, combining split values where needed:

With xmlmap

from gnosis.xml.xmlmap import usplit

a = u'A\U0001ff00C\U0001fafbD'
print "Length:",len(a)

print "Chars:"
for c in usplit(a):
    print repr(c)

Results with xmlmap, on a UTF-16 Python

Length: 7
Chars:
u'A'
u'\U0001ff00'
u'C'
u'\U0001fafb'
u'D'

Now you will get identical results regardless of how your Python was compiled. (Note that the length is still the same, but usplit() has combined the surrogate pairs so you don't see them.)


Yes, you may wonder "who cares" when it comes to Python 2.0 and 2.1, but when writing code that's supposed to be completely portable, it does matter!



Python 2.0.x and 2.1.x have a fatal bug when trying to handle single-character codes in the range \uD800-\uDFFF.

The sample code below demonstrates the problem:

 u = unichr(0xd800)
 print "Orig: ",repr(u)

 # create utf-8 from '\ud800'
 ue = u.encode('utf-8')
 print "UTF-8: ",repr(ue)

 # decode back to unicode
 uu = unicode(ue,'utf-8')
 print "Back: ",repr(uu)



Running this under Python 2.2 and up gives the expected result:

 Orig:  u'\ud800'
 UTF-8:  '\xed\xa0\x80'
 Back:  u'\ud800'



Python 2.0.x gives:

 Orig:  u'\uD800'
 UTF-8:  '\240\200'
 Traceback (most recent call last):
   File "test_utf8_bug.py", line 9, in ?
     uu = unicode(ue,'utf-8')
 UnicodeError: UTF-8 decoding error: unexpected code byte



Python 2.1.x gives:

 Orig:  u'\ud800'
 UTF-8:  '\xa0\x80'
 Traceback (most recent call last):
   File "test_utf8_bug.py", line 9, in ?
     uu = unicode(ue,'utf-8')
 UnicodeError: UTF-8 decoding error: unexpected code byte



As you can see, both fail to encode u'\ud800' when used as a single character. While it is true that the characters from 0xD800 .. 0xDFFF are not valid when used by themselves, the fact is that Python will let you use them alone.


I came up with a good example, completely by accident while working on the code for this tutorial. Create two Python files:

aaa.py

x = u'\ud800'



bbb.py

import sys
sys.path.insert(0,'.')
import aaa



Now, use Python 2.0.x/2.1.x to run bbb.py twice (it needs to run twice so it will load aaa.pyc the second time). On the second run, you'll get:

  Traceback (most recent call last):
    File "bbb.py", line 3, in ?
      import aaa
  UnicodeError: UTF-8 decoding error: unexpected code byte



That's right, Python 2.0.x/2.1.x are unable to reload their own bytecode from a .pyc file if the source contains a string like u'\ud800'. A portable workaround in that case would be to use unichr(0xd800) instead of u'\ud800' (this is what gnosis.xml.pickle does).


Up to this point, I've been translating Unicode to/from UTF for purposes of demonstration. However, Python lets you do much more than that. It allows you to translate nearly any multibyte character string into Unicode (and vice versa). Implementing all of these translations is a lot of work. Fortunately, it has been done for us, so all we have to do is know how to use it.

Let's revisit our Greek table, except this time I'm going to list the characters both in Unicode as well as ISO-8859-7 ("native Greek").

Character   Name                         As Unicode   As ISO-8859-7
Π           Greek Capital Letter Pi      03A0         0xD0
Σ           Greek Capital Letter Sigma   03A3         0xD3
Ω           Greek Capital Letter Omega   03A9         0xD9

With Python, using unicode() and .encode() makes it trivial to translate between these.

# {Pi}{Sigma}{Omega} as ISO-8859-7 encoded string 
b = '\xd0\xd3\xd9'

# Convert to Unicode ('universal format')
u = unicode(b, 'iso-8859-7')
print repr(u)

# ... and back to ISO-8859-7
c = u.encode('iso-8859-7')
print repr(c)



Shows:

u'\u03a0\u03a3\u03a9'
\xd0\xd3\xd9



You can also use Python as a "universal recoder". Say you received a file in the Japanese encoding ShiftJIS and wanted to convert to the EUC-JP encoding:

txt = ... the ShiftJIS-encoded text ...

# convert to Unicode ("universal format")
u = unicode(txt, 'shiftjis')

# convert to EUC-JP
out = u.encode('eucjp')



Of course, the second step only works when the target encoding can represent the characters involved. Recoding Japanese text into a Greek character set this way would not work, since ISO-8859-7 has no codes for the Japanese characters.
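As an illustrative failure case (not from the original), trying to encode a Japanese character into the Greek ISO-8859-7 set raises an error:

u = u'\u30a2'              # KATAKANA LETTER A
try:
    u.encode('iso-8859-7')
except UnicodeEncodeError, e:
    print "cannot recode:", e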


Now you know just about everything you need to know to work with Unicode objects within Python. Isn't that nice? However, the rest of the world isn't quite as nice and neat as Python, so you need to understand how the non-Python portion of the world handles Unicode. It isn't terribly hard, but there are a lot of special cases to consider.

From here on out, we'll be looking at Unicode issues that arise when dealing with:
  1. Filenames (Operating System specific issues)
  2. XML
  3. HTML
  4. Network files (Samba)
     

Sounds simple enough, right? If I want to name a file with my Greek letters, I just say:

   open(unicode_name, 'w')



In theory, yes, that's supposed to be all there is to it. However, there are many ways for this to not work, and they depend on the platform your program is running on.


There are at least two ways of running Python under Windows. The first is to use the Win32 binaries from www.python.org. I will refer to this method as "Windows-native Python".

The other method is to use the version of Python that comes with Cygwin. This version of Python looks (to user code) more like POSIX, instead of like a Windows-native environment.

For many things, the two versions are interchangeable. As long as you write portable Python code, you shouldn't have to care which interpreter you are running under. However, one important exception is when handling Unicode. That is why I'll be specific here about which version I am running.


Let's keep using our familiar Greek symbols:

Sample Unicode Symbols
03A0 Π Greek Capital Letter Pi
03A3 Σ Greek Capital Letter Sigma
03A9 Ω Greek Capital Letter Omega


Our sample Unicode filename will be:

# this is: abc_{PI}{Sigma}{Omega}.txt
uname = u"abc_\u03A0\u03A3\u03A9.txt"



Let's create a file with that name, containing a single line of text:

open(uname,'w').write('Hello world!\n')



Opening up an Explorer window shows the results (click for a larger version):

win32_01.jpg


There's the filename in all its Unicode glory.

Now, let's see how os.listdir() works with this name. The first thing to know is that os.listdir() has two modes of operation:
  • Non-unicode, achieved by passing a non-Unicode string to os.listdir(), i.e. os.listdir('.')
  • Unicode, achieved by passing a Unicode string to os.listdir(), i.e. os.listdir(u'.')
     
First, let's try it as Unicode:

os.chdir('ttt')
# there is only one file in directory 'ttt'
name = os.listdir(u'.')[0]
print "Got name: ",repr(name)
print "Line: ",open(name,'r').read()



Running this program gives the following output:

Got name:  u'abc_\u03a0\u03a3\u03a9.txt'
Line:  Hello world!



Comparing with above, that looks correct. Note that print repr(name) was required, since an error would have occurred if I had tried to print name directly to the screen. Why? Yep, once again Python would have assumed you wanted an ASCII coding, and would have failed with an error.

Now let's try the above sample again, but using the non-Unicode version of os.listdir():

os.chdir('ttt')
# there is only one file in directory 'ttt'
name = os.listdir('.')[0]
print "Got name: ",repr(name)
print "Line: ",open(name,'r').read()



Gives this output:

Got name:  'abc_?SO.txt'
Line: 
Traceback (most recent call last):
  File "c:\frank\src\unicode\t2.py", line 8, in ?
    print "Line: ",open(name,'r').read()
IOError: [Errno 2] No such file or directory: 'abc_?SO.txt'



Yikes! What happened? Welcome to the wonderful world of the win32 "dual-API".

A little background:
Windows NT/2000/XP always write filenames to the underlying filesystem as Unicode (footnote 2). So in theory, Unicode filenames should work flawlessly with Python.

Unfortunately, win32 actually provides two sets of APIs for interfacing with the filesystem. And in true Microsoft style, they are incompatible. The two APIs are:
  1. A set of APIs for Unicode-aware applications, that return the true Unicode names.
  2. A set of APIs for non-Unicode aware applications that return a locale-dependent coding of the true Unicode filenames.
     
Python (for better or worse) follows this convention on win32 platforms, so you end up with two incompatible ways of calling os.listdir() and open():
  1. When you call os.listdir(), open(), etc. with a Unicode string, Python calls the Unicode version of the APIs, and you get the true Unicode filenames. (This corresponds to the first set of APIs above).
  2. When you call os.listdir(), open(), etc. with a non-Unicode string, Python calls the non-Unicode version of the APIs, and here is where the trouble creeps in. The non-Unicode APIs handle Unicode with a particular codec called MBCS. MBCS is a lossy codec: every MBCS name can be represented as Unicode, but not vice versa. MBCS coding also changes depending on the current locale. In other words, if I write a CD with a multibyte-character filename as MBCS on my English-locale machine, then send the CD to Japan, the filename there may appear to contain completely different characters.


Now that we know the background facts, we can see what happened above. By using os.listdir('.'), you are getting the MBCS-version of the true Unicode name that is stored on the filesystem. And, on my English-locale computer, there is no accurate mapping for the Greek characters, so you end up with "?", "S", and "O". This leads to the weird result that there is no way to open our Greek-lettered file using the MBCS APIs in an English locale (!!).

Bottom line
I recommend always using Unicode strings in os.listdir(), open(), etc. Remember that Windows NT/2000/XP always stores filenames as Unicode, so this is the native behavior. And, as shown above, it can sometimes be the only way to open a Unicode filename.

Danger! Cygwin

Cygwin has a huge problem here. It (currently, at least) has no support for Unicode. That is, it will never call the Unicode versions of the win32 APIs. Hence, it is impossible to open certain files (like our Greek-lettered filename) from Cygwin. It doesn't matter if you use os.listdir(u'.') or os.listdir('.'); you always get the MBCS-coded versions.

Please note that this isn't a Python-specific problem; it is a systemic problem with Cygwin. All Cygwin utilities, such as zsh, ls, zip, unzip, mkisofs, will be unable to recognize our Greek-lettered name, and will report various errors.




Unlike Windows NT/2000/XP, which always store filenames in Unicode format, POSIX systems (including Linux) always store filenames as binary strings. This is somewhat more flexible, since the operating system itself doesn't have to know (or care) what encoding is used for filenames. The downside is that the user is responsible for setting up their environment ("locale") for the proper coding.


The specifics of setting up your POSIX box to handle Unicode filenames are beyond the scope of this document, but it generally comes down to setting a few environment variables. In my case, I wanted to use the UTF-8 codec in a U.S. English locale, so my setup involved adding a few lines to these startup files (I've tried this under Gentoo Linux and Ubuntu, though all Linux systems should be similar):

Additions to .bashrc:

LANG="en_US.utf8"
LANGUAGE="en_US.utf8"
LC_ALL="en_US.utf8"

export LANG
export LANGUAGE
export LC_ALL



For good measure, I added the same lines to my .zshrc file.

Additionally, I added the first three lines to /etc/env.d/02locale.

CAUTION

Please do not blindly make changes like the above to your system if you aren't sure what you're doing. You could make your files unreadable by switching locales. The above is meant only as an example of a simple case of switching from an ASCII locale to a UTF-8 locale.




A big advantage under POSIX, as far as Python is concerned, is that you can use either:

      os.listdir('.')



Or:

      os.listdir(u'.')



Both methods will give you strings that you can pass to open() to open the files. This is much better than Windows, which will return mangled versions of the Unicode names if you use os.listdir('.'), which as seen above can sometimes fail to give you a valid name to open the file. You will always get a valid name under POSIX/Linux.

Here is a sample function to demonstrate that:




If you run this you'll get:

As unicode:  u'abc_\u03a0\u03a3\u03a9.txt'
   Read line:  Hello unicode!

As bytestring:  'abc_\xce\xa0\xce\xa3\xce\xa9.txt'
   Read line:  Hello unicode!



As you can see, we were able to successfully read the file, no matter if we used the Unicode or bytestring version of the filename.


Unlike the Microsoft Windows world, where you basically have a "DOS box" and Windows Explorer, under Linux you have many choices about which terminal and file manager you want to run. This is both a blessing and a curse: a blessing in that you can pick an application that suits your preferences, but also a curse in that not all applications support Unicode to the same extent.

The following is a survey of several popular applications to see what they support.


My personal current favorite is mlterm, a multi-lingual terminal (click for a larger version):

mlterm_01.jpg


The GNOME terminal (gnome-terminal):

gnome_terminal_01.jpg


The KDE terminal (konsole):

konsole_01.jpg


A modified version of rxvt (rxvt-unicode) handles Unicode, although it has some issues with underscore characters in the font I've chosen ...

urxvt_01.jpg


Here is our Greek-lettered file in the KDE file manager (konqueror):

konq_01.jpg


And here it is in the GNOME file manager (Nautilus):

naut_01.jpg


The XFCE 4 file manager:

xfce_01.jpg


The standard KDE file selector supports Unicode filenames:

kfilesel_01.jpg


As does the GNOME file selector:

gfilesel_01.jpg



The standard rxvt does not handle Unicode correctly:

rxvt_01.jpg


The Xfm file manager does not handle Unicode filenames:

xfm_01.jpg


Mac OS X


I don't have an OSX machine to test this on, but helpful readers have contributed some information on Unicode support in OSX.

One reader pointed out that os.listdir('.') and os.listdir(u'.') both return objects that can be passed directly to open(), as you can do under POSIX.

Reader Hraban noted:
You should mention that MacOS X uses a special kind of decomposed UTF-8 to store filenames. If you need to e.g. read in filenames and write them to a "normal" UTF-8 file, you must normalize them (at least if your editor, or my TeX system, doesn't understand decomposed UTF-8):

filename = unicodedata.normalize('NFC', unicode(filename, 'utf-8')).encode('utf-8')



For others reading this who aren't familiar with the issue (as I wasn't): my understanding is that when you pass a name containing an accented character like é, OS X will decompose it into e plus a combining accent before saving it to the filesystem (this behavior is defined by the Unicode standard).
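A small illustration of what decomposition means, using the standard unicodedata module (the sample name is made up):

import unicodedata

name = u'caf\u00e9'                              # 'cafe' with a precomposed e-acute
nfd = unicodedata.normalize('NFD', name)         # decomposed form, as OS X stores it
print repr(nfd)                                  # u'cafe\u0301'  (e + combining accent)
print repr(unicodedata.normalize('NFC', nfd))    # u'caf\xe9'     (back to precomposed)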

If you can add anything else to this section, please leave a comment below!


You may find yourself generating HTML with Python (i.e. when using mod_python, CherryPy, or such). So how do you use Unicode characters in an HTML document?

The answer involves these easy steps:
  1. Use a meta tag to let the user's browser know the encoding you used. (footnote 3)
  2. Generate your HTML as a Unicode object.
  3. Write your HTML bytestream using whichever codec you prefer.
     
Here is an example, writing the same Greek-lettered string I've been using all along:

    code = 'utf-8' # make it easy to switch the codec later
 
    html = u'<html><head>'

    # use a meta tag to specify the document encoding used
    html += u'<meta http-equiv="content-type" content="text/html; charset=%s">' % code
    html += u'</head><body>'

    # my actual Unicode content ...
    html += u'abc_\u03A0\u03A3\u03A9.txt'

    html += u'</body></html>'

    # Now, you cannot write Unicode directly to a file. 
    # First have to either convert it to a bytestring using a codec, or
    # open the file with the 'codecs' module.

    # Method #1, doing the conversion yourself:
    open('t.html','w').write( html.encode( code ) )

    # Or, by using the codecs module:
    import codecs
    codecs.open('t.html','w',code).write( html )

    # .. the method you use depends on personal preference and/or
    # convenience in the code you are writing.



Now let's open the page (t.html) in Firefox:

win32_02.jpg


Just as expected!

Now, if you go back into the sample code and replace the line:

    code = 'utf-8'



With ...

    code = 'utf-16'



... the HTML file will now be written in UTF-16 format, but the result displayed in the browser window will be exactly the same.


The XML 1.0 standard requires all parsers to support UTF-8 and UTF-16 encoding. So, it would seem obvious that an XML parser would allow any legal UTF-8 or UTF-16 encoded document as input, right?

Nope!

Have a look at this sample program:

   xml = u'<?xml version="1.0" encoding="utf-8" ?>'
   xml += u'<H> \u0019 </H>'

   # encode as UTF-8
   utf8_string = xml.encode( 'utf-8' )



At this point, utf8_string is a perfectly valid UTF-8 string representing the XML. So we should be able to parse it, right?:

from xml.dom.minidom import parseString
parseString( utf8_string )



Here is what happens when we run the above code:

Traceback (most recent call last):
   File "t9.py", line 9, in ?
     parseString( utf8_string )
   File "c:\py23\lib\xml\dom\minidom.py", line 1929, in parseString
     return expatbuilder.parseString(string)
   File "c:\py23\lib\xml\dom\expatbuilder.py", line 940, in parseString
     return builder.parseString(string)
   File "c:\py23\lib\xml\dom\expatbuilder.py", line 223, in parseString
     parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 43



Whoa - what happened there? It gave us an error at column 43. Let's see what column 43 is:

>>> print repr(utf8_string[43])
'\x19'



You can see that it doesn't like the Unicode character U+0019. Why is this? Section 2.2 of the XML 1.0 standard defines the set of legal characters that may appear in a document. From the standard:

/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] |
         [#xE000-#xFFFD] | [#x10000-#x10FFFF]



Clearly, there are some major gaps in the characters that are legal to include in an XML document. Let's turn the above into a Python function that can be used to test whether a given Unicode value is legal to write to an XML stream:


Using the code ...

import re

def make_illegal_xml_regex():
    return re.compile( raw_illegal_xml_regex() )

c_re_xml_illegal = make_illegal_xml_regex()



Finally:




The above function is good for when you have a Unicode string, but could be a little slow when searching a character at a time. So here is an alternate function for doing that (note this makes use of the usplit() function defined earlier):




Here is a fairly extensive test case to demonstrate the above functions:

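The test_xml_legality module isn't included in this copy; a shortened sketch of what it does (the test strings are taken from the output below, and the formatting is approximate):

from gnosis.xml.xmlmap import usplit

bad_values = [u'abc\x01def', u'abc\x0cdef', u'abc\x15def',
              u'abc\ud900def', u'abc\udddddef', u'abc\ufffedef',
              u'abc\ud800', u'\udc00']
good_values = [u'abc\tdef\nghi', u'abc\rdef',
               u'abc def\u8112ghi\ud7ffjkl',
               u'abc\ue000def\uf123ghi\ufffdjkl',
               u'abc\ud800\udc00def\ud84d\udc56ghi\udbc4\ude34jkl']

def all_chars_legal(u):
    # per-character check, using usplit() so surrogate pairs stay together
    for c in usplit(u):
        if not is_legal_xml_char(c):
            return 0
    return 1

print " ** BAD VALUES **"
for u in bad_values:
    print " %-50s %s %s" % (repr(u), is_legal_xml(u), all_chars_legal(u))

print
print " ** GOOD VALUES **"
for u in good_values:
    print " %-50s %s %s" % (repr(u), is_legal_xml(u), all_chars_legal(u))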



I'm going to run this under two different versions of Python to show the differences you can see in \U coding.

First, under Python 2.0 (which uses 2-char \U encoding, on my machine):

 ** BAD VALUES **
 u'abc\001def'                                             0 0
 u'abc\014def'                                             0 0
 u'abc\025def'                                             0 0
 u'abc\uD900def'                                           0 0
 u'abc\uDDDDdef'                                           0 0
 u'abc\uFFFEdef'                                           0 0
 u'abc\uD800'                                              0 0
 u'\uDC00'                                                 0 0

 ** GOOD VALUES **
 u'abc\011def\012ghi'                                      1 1
 u'abc\015def'                                             1 1
 u'abc def\u8112ghi\uD7FFjkl'                              1 1
 u'abc\uE000def\uF123ghi\uFFFDjkl'                         1 1
 u'abc\uD800\uDC00def\uD84D\uDC56ghi\uDBC4\uDE34jkl'        1 1

 Testing one char at a time ...
 u'\000\005\010\013\014\016\020\031\uD800\uD900\000\uDC00\uDD00
 \uDFFF\uFFFE\uFFFF'
 OK

 u'\011\012\015 \u2345\uD7FF\uE000\uE876\uFFFD\uD800\uDC00\uD808
 \uDF45\uDBC0\uDC00\uDBFF\uDFFF\uD800\uDC00'
 OK



And now under Python 2.3, which on my machine stores \U as a single character:

 ** BAD VALUES **
 u'abc\x01def'                                         False 0
 u'abc\x0cdef'                                         False 0
 u'abc\x15def'                                         False 0
 u'abc\ud900def'                                       False 0
 u'abc\udddddef'                                       False 0
 u'abc\ufffedef'                                       False 0
 u'abc\ud800'                                          False 0
 u'\udc00'                                             False 0

 ** GOOD VALUES **
 u'abc\tdef\nghi'                                       True 1
 u'abc\rdef'                                            True 1
 u'abc def\u8112ghi\ud7ffjkl'                           True 1
 u'abc\ue000def\uf123ghi\ufffdjkl'                      True 1
 u'abc\U00010000def\U00023456ghi\U00101234jkl'          True 1

 Testing one char at a time ...
 u'\x00\x05\x08\x0b\x0c\x0e\x10\x19\ud800\ud900\x00\udc00\udd00
   \udfff\ufffe\uffff'
 OK

 u'\t\n\r \u2345\ud7ff\ue000\ue876\ufffd\U00010000\U00012345
   \U00100000\U0010ffff\U00010000'
 OK



You can see that both versions of Python give the same answers (except that Python 2.0 uses 1/0 instead of True/False). But you can see in the repr() coding at the end that the two versions represent \U in different ways. As long as you use the usplit() function defined earlier, you will see no differences in your code.

OK, so now we've established that you cannot put certain characters in an XML file. How do we get around this? Maybe we can encode the illegal values as XML entities?

 xml = u'<?xml version="1.0" encoding="utf-8" ?>'

 # try to cheat and put \u0019 as an entity ...
 xml += u'<H> &#x19; </H>'

 # encode as UTF-8
 utf8_string = xml.encode( 'utf-8' )

 # parse it
 from xml.dom.minidom import parseString
 parseString( utf8_string )



Running this gives the output:

 Traceback (most recent call last):
  File "t10.py", line 11, in ?
    parseString( utf8_string )
  File "c:\py23\lib\xml\dom\minidom.py", line 1929, in parseString
    return expatbuilder.parseString(string)
  File "c:\py23\lib\xml\dom\expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "c:\py23\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
 xml.parsers.expat.ExpatError: reference to invalid character number: line 1, column 43



Nope! According to the XML 1.0 standard, the illegal characters are not allowed, no matter how we try to cheat and stuff them in there. In fact, if a parser allows any of the illegal characters, then by definition it is not an XML parser. A key idea of XML is that parsers are not allowed to be "forgiving", to avoid the mess of incompatibility that exists in the HTML world.

So how do we handle the illegal characters?
Because the characters are illegal, there is no standard way to handle them. It is up to the XML author (or application) to find another way to represent the illegal characters. Perhaps a future version of the XML standard will help address this situation.
 

Samba 3.0 and up has the ability to share files with Unicode filenames. In fact, the test was very uneventful: I simply opened a Samba share (from my Linux machine) on a Windows client, opened the folder with the Greek-lettered filename in it, and the result is:

[screenshot: samba_01.jpg]


Perhaps there are more complicated setups out there where this wouldn't work so well, but it was completely painless for me. Samba defaults to UTF-8 coding, so I didn't even have to modify my smb.conf file.

Summary


There are a few topics I've omitted, but plan to add them later. Among them:
  1. Some examples of how to work around the "illegal XML character" issue by defining our own coding transforms.
  2. It is perfectly possible for os.listdir(u'.') to return non-Unicode strings (it means that the filename was not stored with a coding that is legal in the current locale). The problem is that if you have a mix of legal and illegal names, e.g. /a-legal/b-illegal/c-legal, you cannot use os.path.join() to concatenate the Unicode and non-Unicode parts, since the result would not be the correct filename (because b-illegal, in the above example, has no valid Unicode coding). The only solution I've found is to os.chdir() to each path component, one at a time, when opening files, traversing directories, etc.; a rough sketch of this follows the list. I need to write a section to expand on this issue.
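To make item 2 a bit more concrete, here is a rough, unpolished sketch of the chdir-per-component idea (error handling omitted; 'parts' is a list of path components that may mix Unicode and byte strings):

 import os

 def open_by_components( parts, mode='rb' ):
     # descend one component at a time so each name keeps its own coding
     saved = os.getcwd()
     try:
         for d in parts[:-1]:
             os.chdir( d )
         return open( parts[-1], mode )
     finally:
         os.chdir( saved )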
     
Several of the functions defined in this document (usplit(), is_legal_xml(), is_legal_xml_string()) are available as part of Gnosis Utils (of which I'm a coauthor). Version 1.2.0 is the first release with the functions. They are available in the package gnosis.xml.xmlmap. In upcoming versions, I plan to incorporate the Unicode-XML transforms mentioned above.


  1. In my opinion, if the creators of Python's Unicode support had merely omitted the "default ASCII" logic, it would have been much clearer, as that would force newbies to understand what was going on, instead of blindly using unicode(value), without an explicit coding.

    Now, to be fair, using ASCII as a default encoding is reasonable. Since Python's ASCII codec only accepts codes from 0-127, if unicode() works, ASCII is almost certainly the correct codec.
  2. I'm not sure what earlier versions (the Windows 95/98 era) do, but I'm guessing their Unicode support is not up to current standards.
  3. Actually, Firefox and Internet Explorer were able to correctly display the page without a correct meta tag, but in general you should always include it, since auto-guessing may not work on all platforms, or for all HTML documents.
     

Author: Frank McIngvale
Version: 1.3
Last Revised: Apr 22, 2007

I've been trying to get UTF-8 output working properly in Python (2.5.1) and I've encountered some strangely inconsistent behaviour. It seems like a bug, but I'm probably just getting something wrong.

Basically print works, but sys.stdout.write of the same thing doesn't.

Any ideas?

Python 2.5.1 (r251:54863, Apr 15 2008, 22:57:26) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
 >>> print unichr(0xe9)
é
 >>> sys.stdout.write(unichr(0xe9))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
 >>> class Test:
 ...     def write(self, x):
 ...         sys.stdout.write(x)
 ...
 >>> print >> Test(), unichr(0xe9)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in write
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
 >>> print >> sys.stdout, unichr(0xe9)
é
 >>> sys.stdout.encoding
'UTF-8'
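Whatever the underlying cause, the usual work-around is to make sure .write() receives byte strings rather than unicode objects, either by encoding explicitly or by wrapping stdout in a codecs writer. A minimal sketch, assuming UTF-8 output is what you want:

 import sys, codecs

 # wrap stdout so that .write() encodes unicode objects as UTF-8
 out = codecs.getwriter('utf-8')(sys.stdout)
 out.write(unichr(0xe9) + u'\n')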
 

As is common to most Unicode-related tutorials I have seen on the net (including documents such as GNU’s Libc manual), there are a number of very important gotchas that programmers are simply going to have to know about. The best place to read up on these issues is by browsing the Unicode Technical Reports and Unicode Standard Annexes; the first that I would suggest is the Unicode Standard Annex on Normalisation, as it has a major impact on writing correct code.

Briefly, there are four normal forms, but you only need to know about two of them: Normalisation Form C (NFC) and Normalisation Form KC (NFKC). Put very simply, Unicode text sent out (e.g. to the network) SHOULD be in NFC. Any string comparison — such as for filenames, usernames or any string you wish to sort on — needs to be normalised into NFKC prior to comparison. NFKC remaps the string so that any “compatibility characters” are canonicalised, meaning their in-memory representation is consistent and the strings can be compared. Here is an example:

 "Richard IV" == "Richard \u2163"
False



The strings "Richard IV" and "Richard Ⅳ" are considered identical for the purposes of human consideration. The first string is composed of 'I' + 'V', but the second string is composed of a single code-point U+2163 ROMAN NUMERAL FOUR. This is a ‘compatibility character’. It generally shouldn’t be used but some systems may automatically have mapped ‘I’ + ‘V’ into U+2163 (or it may have been transcoded from another character set). The NFKC normalisation process essentially changes occurrences of U+2163 to ‘I’ + ‘V’. Other examples in European scripts typically come from ligatures such as U+0133 LATIN SMALL LIGATURE IJ (ij), which under NFKC would be re-composed to ‘i’ + ‘j’; there are plenty more examples in the Normalisation document.
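In Python the comparison above can be made to succeed by normalising both sides first; a small sketch using the standard unicodedata module:

 import unicodedata

 def nfkc( s ):
     return unicodedata.normalize('NFKC', s)

 print nfkc(u"Richard IV") == nfkc(u"Richard \u2163")   # prints True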

Another major issue to do with normalisation is that normalisation is not closed under concatenation, which means the string formed by NFKC(string1) + NFKC(string2) is not guaranteed to be NFKC normalised, although an optimised function such as NFKC_concat(string1, string2) can be defined.

Another very important document to read is the Security Considerations document, which very broadly speaking covers two themes: visual security, as illustrated by spoofed names such as "paypaI.com" (a capital "I" standing in for the lowercase "l"), and technical security issues. This document also gives user-agents a number of security-related recommendations.

I think I shall eventually put together a reading list for people wanting to correctly get into Unicode — I’m just getting into it myself and have been exploring the various normative documents — but that may be a little time in coming.

People with an interest in network protocols should also arm themselves with a knowledge of standards such as stringprep (RFC 3454), which allows us to pick and choose rules (creating what is known as a stringprep ‘profile’) to limit how particular strings, such as usernames, may be represented.

In conclusion, the programming world is in for a nasty shake-up; there is a definite requirement for CORRECT training resources to be available to the programming world at large.

 

run 0.2 : Python Package Index

Save

Python Process Run Control


This module provides convenient ways to guarantee a certain script state.

Make sure only one instance of the program is running; this is done by acquiring an fcntl file lock on the current script:

 import run.alone

There are a few things to keep in mind:

  1. import run.alone once; it will exit your program if it gets imported twice
  2. symlinked scripts will be resolved to the inode they are linked to; this is a limitation of using fcntl file locks
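For illustration only, the single-instance idea boils down to something like the following sketch (not the module's actual code):

 import fcntl, sys

 # hold a lock on the script file itself for the lifetime of the process
 _lockfile = open(sys.argv[0])
 try:
     fcntl.flock(_lockfile, fcntl.LOCK_EX | fcntl.LOCK_NB)
 except IOError:
     sys.exit('another instance is already running')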

Run until a specified timespec; if the process runs longer than that, it will be killed:

 import run
 run.until('23m42s')
 ...

You can also choose to limit the amount of CPU time consumed (the default is wall-clock time):

 run.until('42s', 'cpu')
 ...

This module is derived from the Sys::RunAlone and Sys::RunUntil Perl modules written by Elizabeth Mattijsen.

 
File Type Py Version Uploaded on Size # downloads
run-0.2.tar.gz (md5) Source 2010-04-22 3KB 347

Doing HTTP Caching Right: Introducing httplib2

Save

Doing HTTP Caching Right: Introducing httplib2
By Joe Gregorio
February 01, 2006

You need to understand HTTP caching. No, really, you do. I have mentioned repeatedly that you need to choose your HTTP methods carefully when building a web service, in part because you can get the performance benefits of caching with GET. Well, if you want to get the real advantages of GET then you need to understand caching and how you can use it effectively to improve the performance of your service.

This article will not explain how to set up caching for your particular web server, nor will it cover the different kinds of caches. If you want that kind of information I recommend Mark Nottingham's excellent tutorial on HTTP caching.

First you need to understand the goals of the HTTP caching model. One objective is to let both the client and server have a say over when to return a cached entry. As you can imagine, allowing both client and server to have input on when a cached entry is to be considered stale is obviously going to introduce some complexity.

The HTTP caching model is based on validators, which are bits of data that a client can use to validate that a cached response is still valid. They are fundamental to the operation of caches since they allow a client or intermediary to query the status of a resource without having to transfer the entire response again: the server returns an entity body only if the validator indicates that the cache has a stale response.

One of the validators for HTTP is the ETag. An ETag is like a fingerprint for the bytes in the representation; if a single byte changes the ETag also changes.

Using validators requires that you already have done a GET once on a resource. The cache stores the value of the ETag header if present and then uses the value of that header in later requests to that same URI.

For example, if I send a request to example.org and get back this response:

HTTP/1.1 200 OK
Date: Fri, 30 Dec 2005 17:30:56 GMT
Server: Apache
ETag: "11c415a-8206-243aea40"
Accept-Ranges: bytes
Content-Length: 33286
Vary: Accept-Encoding,User-Agent
Cache-Control: max-age=7200
Expires: Fri, 30 Dec 2005 19:30:56 GMT
Content-Type: image/png 

-- binary data --

Then the next time I do a GET I can add the validator in. Note that the value of ETag is placed in the If-None-Match: header.

GET / HTTP/1.1
Host: example.org
If-None-Match: "11c415a-8206-243aea40"

If there was no change in the representation then the server returns a 304 Not Modified.

HTTP/1.1 304 Not Modified 
Date: Fri, 30 Dec 2005 17:32:47 GMT

If there was a change, the new representation is returned with a status code of 200 and a new ETag.

HTTP/1.1 200 OK
Date: Fri, 30 Dec 2005 17:32:47 GMT
Server: Apache
ETag: "0192384-9023-1a929893"
Accept-Ranges: bytes
Content-Length: 33286
Vary: Accept-Encoding,User-Agent
Cache-Control: max-age=7200
Expires: Fri, 30 Dec 2005 19:30:56 GMT
Content-Type: image/png 

-- binary data --
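To make the exchange concrete, here is a minimal sketch of issuing a conditional GET from Python's standard httplib (the host and ETag value are simply the ones from the example above):

 import httplib

 conn = httplib.HTTPConnection('example.org')
 conn.request('GET', '/', headers={'If-None-Match': '"11c415a-8206-243aea40"'})
 resp = conn.getresponse()
 if resp.status == 304:
     print 'cached copy is still valid'
 else:
     body = resp.read()   # 200 OK: new representation, new ETag in resp.getheader('etag')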
 

While validators are used to test whether a cached entry is still valid, the Cache-Control: header is used to signal how long a representation can be cached. The most fundamental of all the cache-control directives is max-age. This directive asserts that the cached response can be only max-age seconds old before being considered stale. Note that max-age can appear in both request and response headers, which gives both the client and the server a chance to assert how old a cached response they are willing to accept. If a cached response is fresh we can return it immediately; if it is stale then we need to validate the cached response before returning it.

Let's take another look at our example response from above. Note that the Cache-Control: header is set and that a max-age of 7200 means that the entry can be cached for up to two hours.

HTTP/1.1 200 OK
Date: Fri, 30 Dec 2005 17:32:47 GMT
Server: Apache
ETag: "0192384-9023-1a929893"
Accept-Ranges: bytes
Content-Length: 33286
Vary: Accept-Encoding,User-Agent
Cache-Control: max-age=7200
Expires: Fri, 30 Dec 2005 19:30:56 GMT
Content-Type: text/xml

There are lots of directives that can be put in the Cache-Control: header, and the header may appear in both requests and responses.

Cache-Control request directives:

Directive Description
no-cache The cached response must not be used to satisfy this request.
no-store Do not store this response in a cache.
max-age=delta-seconds The client is willing to accept a cached response that is delta-seconds old without validating.
max-stale=delta-seconds The client is willing to accept a cached response that is no more than delta-seconds stale.
min-fresh=delta-seconds The client is willing to accept only a cached response that will still be fresh delta-seconds from now.
no-transform The entity body must not be transformed.
only-if-cached Return a response only if there is one in the cache. Do not validate or GET a response if no cache entry exists.

Cache-Control response directives:

Directive Description
public This can be cached by any cache.
private This can be cached only by a private cache.
no-cache The cached response must not be used on subsequent requests without first validating it.
no-store Do not store this response in a cache.
no-transform The entity body must not be transformed.
must-revalidate If the cached response is stale it must be validated before it is returned in any response. Overrides max-stale.
max-age=delta-seconds The response may be cached and considered fresh for delta-seconds without revalidation.
s-maxage=delta-seconds Just like max-age, but it applies only to shared caches.
proxy-revalidate Like must-revalidate, but only for proxies.

Let's look at some Cache-Control: header examples.

Cache-Control: private, max-age=3600

If sent by a server, this Cache-Control: header states that the response can only be cached in a private cache for one hour.

Cache-Control: public, must-revalidate, max-age=7200

The included response can be cached by a public cache and can be cached for two hours; after that the cache must revalidate the entry before returning it to a subsequent request.

Cache-Control: must-revalidate, max-age=0

This forces the client to revalidate every request, since a max-age=0 forces the cached entry to be instantly stale. See Mark Nottingham's Leveraging the Web: Caching for a nice example of how this can be applied.

Cache-Control: no-cache

This is pretty close to must-revalidate, max-age=0, except that a client could use a max-stale header on a request and get a stale response. The must-revalidate will override the max-stale property. I told you that giving both client and server some control would make things a bit complicated.

So far all of the Cache-Control: header examples we have looked at have been on the response side, but they can be added to requests too.

Cache-Control: no-cache

This forces an "end-to-end reload," where the client forces the cache to reload its cache from the origin server.

Cache-Control: min-fresh=200

Here the client asserts that it wants a response that will be fresh for at least 200 seconds.

Vary

You may be wondering about situations where a cache might get confused. For example, what if a server does content negotiation, where different representations can be returned from the same URI? For cases like this HTTP supplies the Vary: header. The Vary: header informs the cache of the names of all the request headers that might cause a resource's representation to change.

For example, if a server did do content negotiation then the Content-Type: header would be different for the different types of responses, depending on the type of content negotiated. In that case the server can add a Vary: Accept header, which causes the cache to consider the Accept: header when caching responses from that URI.

Date: Mon, 23 Jan 2006 15:37:34 GMT
Server: Apache
Accept-Ranges: bytes
Vary: Accept-Encoding,User-Agent
Content-Encoding: gzip
Cache-Control: max-age=7200
Expires: Mon, 23 Jan 2006 17:37:34 GMT
Content-Length: 5073
Content-Type: text/html; charset=utf-8

In this example the server is stating that responses can be cached for two hours, but that responses may vary based on the Accept-Encoding and User-Agent headers.

When a server successfully validates a cached response, using for example the If-None-Match: header, then the server returns a status code of 304 Not Modified. So nothing much happens on a 304 Not Modified response, right? Well, not exactly. In fact, the server can send updated headers for the entity that have to be updated in the cache. The server can also send along a Connection: header that says which headers shouldn't be updated.

Some headers are by default excluded from the list of headers to update. These are called hop-by-hop headers and they are: Connection, Keep-Alive, Proxy-Authenticate, Proxy-Authorization, TE, Trailers, Transfer-Encoding, and Upgrade. All other headers are considered end-to-end headers.

HTTP/1.1 304 Not Modified
Content-Length: 647
Server: Apache
Connection: close
Date: Mon, 23 Jan 2006 16:10:52 GMT
Content-Type: text/html; charset=iso-8859-1

...

In the above example Date: is not a hop-by-hop header nor is it listed in the Connection: header, so the cache has to update the value of Date: in the cache.

While a little complex, the above is at least conceptually nice. Of course, one of the problems is that we also have to work with HTTP 1.0 servers and caches, which use a different, purely time-based set of headers to do caching; out of necessity those headers were brought forward into HTTP 1.1.

The older cache control model from HTTP 1.0 is based solely on time. The Last-Modified cache validator is just that, the last time that the resource was modified. The cache uses the Date:, Expires:, Last-Modified:, and If-Modified-Since: headers to detect changes in a resource.

If you are developing a client you should always use both validators if present; you never know when an HTTP 1.0 cache will pop up between you and a server. HTTP 1.1 was published seven years ago so you'd think that at this late date most things would be updated. This is the protocol equivalent of wearing a belt and suspenders.

Now that you understand caching you may be wondering if the client library in your favorite language even supports caching. I know the answer for Python, and sadly that answer is currently no. It pains me that my favorite language doesn't have one of the best HTTP client implementations around. That needs to change.

Introducing httplib2, a comprehensive Python HTTP client library that supports a local private cache that understands all the caching operations we just talked about. In addition it supports many features left out of other HTTP libraries.

HTTP and HTTPS
HTTPS support is available only if the socket module was compiled with SSL support.
Keep-Alive
Supports HTTP 1.1 Keep-Alive, keeping the socket open and performing multiple requests over the same connection if possible.
Authentication
The following three types of HTTP Authentication are supported: Basic, Digest, and WSSE. These can be used over both HTTP and HTTPS.
Caching
The module can optionally operate with a private cache that understands the Cache-Control: header and uses both the ETag and Last-Modified cache validators.
All Methods
The module can handle any HTTP request method, not just GET and POST.
Redirects
Automatically follows 3XX redirects on GETs.
Compression
Handles both compress and gzip types of compression.
Lost Update Support
Automatically adds back ETags into PUT requests to resources we have already cached. This implements Section 3.2 of Detecting the Lost Update Problem Using Unreserved Checkout.
Unit Tested
A large and growing set of unit tests.
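A minimal usage sketch (the cache directory name and URL are arbitrary):

 import httplib2

 h = httplib2.Http('.cache')               # directory used for the private cache
 response, content = h.request('http://example.org/')
 print response.status

 # a second request for the same URI may be served from the cache,
 # or revalidated with the stored ETag
 response, content = h.request('http://example.org/')
 print response.fromcache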

See the httplib2 project page for more details.

Next time I will cover HTTP authentication, redirects, keep-alive, and compression in HTTP and how httplib2 handles them. You might also be wondering how the "big guys" handle caching. That will take a whole other article to cover.

XML.com Copyright © 1998-2006 O'Reilly Media, Inc.

 

lxml.html

Save

lxml.html

Since version 2.0, lxml comes with a dedicated Python package for dealing with HTML: lxml.html. It is based on lxml's HTML parser, but provides a special Element API for HTML elements, as well as a number of utilities for common HTML processing tasks.

The main API is based on the lxml.etree API, and thus, on the ElementTree API.

There are several functions available to parse HTML:

parse(filename_url_or_file):

Parses the named file or url, or if the object has a .read() method, parses from that.

If you give a URL, or if the object has a .geturl() method (as file-like objects from urllib.urlopen() have), then that URL is used as the base URL. You can also provide an explicit base_url keyword argument.

document_fromstring(string):
Parses a document from the given string. This always creates a correct HTML document, which means the parent node is <html>, and there is a <body> and possibly a <head>.
fragment_fromstring(string, create_parent=False):
Returns an HTML fragment from a string. The fragment must contain just a single element, unless create_parent is given; e.g., fragment_fromstring(string, create_parent='div') will wrap the element in a <div>.
fragments_fromstring(string):
Returns a list of the elements found in the fragment.
fromstring(string):
Returns document_fromstring or fragment_fromstring, based on whether the string looks like a full document or just a fragment.
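A small usage sketch of these functions (the markup is invented):

 from lxml import html

 doc  = html.document_fromstring('<p>Hello <b>world</b></p>')
 frag = html.fragment_fromstring('<p>Hello <b>world</b></p>')
 print doc.tag, frag.tag    # prints: html p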

The normal HTML parser is capable of handling broken HTML, but for pages that are far enough from HTML to call them 'tag soup', it may still fail to parse the page in a useful way. A way to deal with this is ElementSoup, which deploys the well-known BeautifulSoup parser to build an lxml HTML tree.

However, note that the most common problem with web pages is the lack of (or the existence of incorrect) encoding declarations. It is therefore often sufficient to only use the encoding detection of BeautifulSoup, called UnicodeDammit, and to leave the rest to lxml's own HTML parser, which is several times faster.

HTML elements have all the methods that come with ElementTree, but also include some extra methods:

.drop_tree():
Drops the element and all its children. Unlike el.getparent().remove(el) this does not remove the tail text; with drop_tree the tail text is merged with the previous element.
.drop_tag():
Drops the tag, but keeps its children and text.
.find_class(class_name):
Returns a list of all the elements with the given CSS class name. Note that class names are space separated in HTML, so doc.find_class('highlight') will find an element like <div class="sidebar highlight">. Class names are case sensitive.
.find_rel_links(rel):
Returns a list of all the <a rel="{rel}"> elements. E.g., doc.find_rel_links('tag') returns all the links marked as tags.
.get_element_by_id(id, default=None):
Return the element with the given id, or the default if none is found. If there are multiple elements with the same id (which there shouldn't be, but there often is), this returns only the first.
.text_content():
Returns the text content of the element, including the text content of its children, with no markup.
.cssselect(expr):
Select elements from this element and its children, using a CSS selector expression. (Note that .xpath(expr) is also available as on all lxml elements.)
.label:
Returns the corresponding label element for this element, if any exists (None if there is none). Label elements have a label.for_element attribute that points back to the element.
.base_url:
The base URL for this element, if one was saved from the parsing. This attribute is not settable. Is None when no base URL was saved.
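For example, a small sketch exercising a few of these methods (the markup is invented):

 from lxml import html

 doc = html.fromstring(
     '<div><p class="intro">Hello <b>world</b></p>'
     '<p id="note">Bye</p></div>')
 print doc.find_class('intro')[0].text_content()      # Hello world
 print doc.get_element_by_id('note').text_content()   # Bye
 print [el.tag for el in doc.cssselect('p')]          # ['p', 'p']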

One of the interesting modules in the lxml.html package deals with doctests. It can be hard to compare two HTML pages for equality, as whitespace differences aren't meaningful and the structural formatting can differ. This is even more a problem in doctests, where output is tested for equality and small differences in whitespace or the order of attributes can let a test fail. And given the verbosity of tag-based languages, it may take more than a quick look to find the actual differences in the doctest output.

Luckily, lxml provides the lxml.doctestcompare module that supports relaxed comparison of XML and HTML pages and provides a readable diff in the output when a test fails. The HTML comparison is most easily used by importing the usedoctest module in a doctest:

 >>> import lxml.html.usedoctest

Now, if you have an HTML document and want to compare it to an expected result document in a doctest, you can do the following:

 >>> import lxml.html
 >>> html = lxml.html.fromstring('''\
 ...    <html><body onload="" color="white">
 ...      <p>Hi  !</p>
 ...    </body></html>
 ... ''')

 >>> print lxml.html.tostring(html)
<html><body onload="" color="white"><p>Hi !</p></body></html>

 >>> print lxml.html.tostring(html)
<html> <body color="white" onload=""> <p>Hi    !</p> </body> </html>

 >>> print lxml.html.tostring(html)
<html>
  <body color="white" onload="">
    <p>Hi !</p>
  </body>
</html>

In documentation, you would likely prefer the pretty printed HTML output, as it is the most readable. However, the three documents are equivalent from the point of view of an HTML tool, so the doctest will silently accept any of the above. This allows you to concentrate on readability in your doctests, even if the real output is a straight ugly HTML one-liner.

Note that there is also an lxml.usedoctest module which you can import for XML comparisons. The HTML parser notably ignores namespaces and some other XMLisms.

lxml.html comes with a predefined HTML vocabulary for the E-factory, originally written by Fredrik Lundh. This allows you to quickly generate HTML pages and fragments:

 >>> from lxml.html import builder as E
 >>> from lxml.html import usedoctest
 >>> html = E.HTML(
 ...   E.HEAD(
 ...     E.LINK(rel="stylesheet", href="great.css", type="text/css"),
 ...     E.TITLE("Best Page Ever")
 ...   ),
 ...   E.BODY(
 ...     E.H1(E.CLASS("heading"), "Top News"),
 ...     E.P("World News only on this page", style="font-size: 200%"),
 ...     "Ah, and here's some more text, by the way.",
 ...     lxml.html.fromstring("<p>... and this is a parsed fragment ...</p>")
 ...   )
 ... )

 >>> print lxml.html.tostring(html)
<html>
  <head>
    <link href="great.css" rel="stylesheet" type="text/css">
    <title>Best Page Ever</title>
  </head>
  <body>
    <h1 class="heading">Top News</h1>
    <p style="font-size: 200%">World News only on this page</p>
    Ah, and here's some more text, by the way.
    <p>... and this is a parsed fragment ...</p>
  </body>
</html>

Note that you should use lxml.html.tostring and not lxml.tostring. lxml.tostring(doc) will return the XML representation of the document, which is not valid HTML. In particular, things like <script src="..."></script> will be serialized as <script src="..." />, which completely confuses browsers.

A handy method for viewing your HTML: lxml.html.open_in_browser(lxml_doc) will write the document to disk and open it in a browser (with the webbrowser module).

Any form elements in a document are available through the list doc.forms (e.g., doc.forms[0]). Form, input, select, and textarea elements each have special methods.

Input elements (including select and textarea) have these attributes:

.name:
The name of the element.
.value:

The value of an input, the content of a textarea, the selected option(s) of a select. This attribute can be set.

In the case of a select that takes multiple options (<select multiple>) this will be a set of the selected options; you can add or remove items from it to select and unselect the options.

Select attributes:

.value_options:
For select elements, this is all the possible values (the values of all the options).
.multiple:
For select elements, true if this is a select multiple element.

Input attributes:

.type:
The type attribute in input elements.
.checkable:
True if this can be checked (i.e., true for type=radio and type=checkbox).
.checked:
If this element is checkable, the checked state. Raises AttributeError on non-checkable inputs.

The form itself has these attributes:

.inputs:
A dictionary-like object that can be used to access input elements by name. When there are multiple input elements with the same name, this returns list-like structures that can also be used to access the options and their values as a group.
.fields:

A dictionary-like object used to access values by their name. Unlike form.inputs, which returns elements, this returns only values. Setting values in this dictionary will affect the form inputs. Basically form.fields[x] is equivalent to form.inputs[x].value and form.fields[x] = y is equivalent to form.inputs[x].value = y. (Note that sometimes form.inputs[x] returns a compound object, but these objects also have .value attributes.)

If you set this attribute, it is equivalent to form.fields.clear(); form.fields.update(new_value)

.form_values():
Returns a list of [(name, value), ...], suitable to be passed to urllib.urlencode() for form submission.
.action:
The action attribute. This is resolved to an absolute URL if possible.
.method:
The method attribute, which defaults to GET.

Note that you can change any of these attributes (values, method, action, etc) and then serialize the form to see the updated values. You can, for instance, do:

 >>> from lxml.html import fromstring, tostring
 >>> form_page = fromstring('''<html><body><form>
 ...   Your name: <input type="text" name="name"> <br>
 ...   Your phone: <input type="text" name="phone"> <br>
 ...   Your favorite pets: <br>
 ...   Dogs: <input type="checkbox" name="interest" value="dogs"> <br>
 ...   Cats: <input type="checkbox" name="interest" value="cats"> <br>
 ...   Llamas: <input type="checkbox" name="interest" value="llamas"> <br>
 ...   <input type="submit"></form></body></html>''')
 >>> form = form_page.forms[0]
 >>> form.fields = dict(
 ...     name='John Smith',
 ...     phone='555-555-3949',
 ...     interest=set(['cats', 'llamas']))
 >>> print tostring(form)
<html>
  <body>
    <form>
    Your name:
      <input name="name" type="text" value="John Smith">
      <br>Your phone:
      <input name="phone" type="text" value="555-555-3949">
      <br>Your favorite pets:
      <br>Dogs:
      <input name="interest" type="checkbox" value="dogs">
      <br>Cats:
      <input checked name="interest" type="checkbox" value="cats">
      <br>Llamas:
      <input checked name="interest" type="checkbox" value="llamas">
      <br>
      <input type="submit">
    </form>
  </body>
</html>

You can submit a form with lxml.html.submit_form(form_element). This will return a file-like object (the result of urllib.urlopen()).

If you have extra input values you want to pass you can use the keyword argument extra_values, like extra_values={'submit': 'Yes!'}. This is the only way to get submit values into the form, as there is no state of "submitted" for these elements.

You can pass in an alternate opener with the open_http keyword argument, which is a function with the signature open_http(method, url, values).

Example:

 >>> from lxml.html import parse, submit_form
 >>> page = parse('http://tinyurl.com').getroot()
 >>> page.forms[1].fields['url'] = 'http://codespeak.net/lxml/'
 >>> result = parse(submit_form(page.forms[1])).getroot()
 >>> [a.attrib['href'] for a in result.xpath("//a[@target='_blank']")]
['http://tinyurl.com/2xae8s', 'http://preview.tinyurl.com/2xae8s']

The module lxml.html.clean provides a Cleaner class for cleaning up HTML pages. It supports removing embedded or script content, special tags, CSS style annotations and much more.

Say, you have an evil web page from an untrusted source that contains lots of content that upsets browsers and tries to run evil code on the client side:

 >>> html = '''\
 ... <html>
 ...  <head>
 ...    <script type="text/javascript" src="evil-site"></script>
 ...    <link rel="alternate" type="text/rss" src="evil-rss">
 ...    <style>
 ...      body {background-image: url(javascript:do_evil)};
 ...      div {color: expression(evil)};
 ...    </style>
 ...  </head>
 ...  <body onload="evil_function()">
 ...    <!-- I am interpreted for EVIL! -->
 ...    <a href="javascript:evil_function()">a link</a>
 ...    <a href="#" onclick="evil_function()">another link</a>
 ...    <p onclick="evil_function()">a paragraph</p>
 ...    <div style="display: none">secret EVIL!</div>
 ...    <object> of EVIL! </object>
 ...    <iframe src="evil-site"></iframe>
 ...    <form action="evil-site">
 ...      Password: <input type="password" name="password">
 ...    </form>
 ...    <blink>annoying EVIL!</blink>
 ...    <a href="evil-site">spam spam SPAM!</a>
 ...    <image src="evil!">
 ...  </body>
 ... </html>'''

To remove all suspicious content from this unparsed document, use the clean_html function:

 >>> from lxml.html.clean import clean_html

 >>> print clean_html(html)
<html>
  <body>
    <div>
      <style>/* deleted */</style>
      <a href="">a link</a>
      <a href="#">another link</a>
      <p>a paragraph</p>
      <div>secret EVIL!</div>
      of EVIL!
      Password:
      annoying EVIL!
      <a href="evil-site">spam spam SPAM!</a>
      <img src="evil!">
    </div>
  </body>
</html>

The Cleaner class supports several keyword arguments to control exactly which content is removed:

 >>> from lxml.html.clean import Cleaner

 >>> cleaner = Cleaner(page_structure=False, links=False)
 >>> print cleaner.clean_html(html)
<html>
  <head>
    <link rel="alternate" src="evil-rss" type="text/rss">
    <style>/* deleted */</style>
  </head>
  <body>
    <a href="">a link</a>
    <a href="#">another link</a>
    <p>a paragraph</p>
    <div>secret EVIL!</div>
    of EVIL!
    Password:
    annoying EVIL!
    <a href="evil-site">spam spam SPAM!</a>
    <img src="evil!">
  </body>
</html>

 >>> cleaner = Cleaner(style=True, links=True, add_nofollow=True,
 ...                   page_structure=False, safe_attrs_only=False)

 >>> print cleaner.clean_html(html)
<html>
  <head>
  </head>
  <body>
    <a href="">a link</a>
    <a href="#">another link</a>
    <p>a paragraph</p>
    <div>secret EVIL!</div>
    of EVIL!
    Password:
    annoying EVIL!
    <a href="evil-site" rel="nofollow">spam spam SPAM!</a>
    <img src="evil!">
  </body>
</html>

You can also whitelist some otherwise dangerous content with Cleaner(host_whitelist=['www.youtube.com']), which would allow embedded media from YouTube, while still filtering out embedded media from other sites.

See the docstring of Cleaner for the details of what can be cleaned.

You can also wrap long words in your html:

word_break(doc, max_width=40, ...)

word_break_html(html, ...)

This finds any long words in the text of the document and inserts &#8203; into the document (the Unicode zero-width space).

This avoids the elements pre, textarea, and code. You can control this with avoid_elements=['textarea', ...].

It also avoids elements with the CSS class nobreak. You can control this with avoid_classes=['code', ...].

Lastly you can control the character that is inserted with break_character=u'\u200b'. However, you cannot insert markup, only text.

word_break_html(html) parses the HTML document and returns a string.
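A small sketch (the long word is invented; word_break_html lives in lxml.html.clean):

 from lxml.html.clean import word_break_html

 print word_break_html(u'<p>Supercalifragilisticexpialidocious</p>', max_width=10)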

HTML Diff

The module lxml.html.diff offers some ways to visualize differences in HTML documents. These differences are content oriented. That is, changes in markup are largely ignored; only changes in the content itself are highlighted.

There are two ways to view differences: htmldiff and html_annotate. One shows differences with <ins> and <del> tags, while the other annotates a set of changes similar to svn blame. Both functions operate on text and work best with content fragments (only what goes inside <body>), not complete documents.

Example of htmldiff:

 >>> from lxml.html.diff import htmldiff, html_annotate
 >>> doc1 = '''<p>Here is some text.</p>'''
 >>> doc2 = '''<p>Here is <b>a lot</b> of <i>text</i>.</p>'''
 >>> doc3 = '''<p>Here is <b>a little</b> <i>text</i>.</p>'''
 >>> print htmldiff(doc1, doc2)
<p>Here is <ins><b>a lot</b> of <i>text</i>.</ins> <del>some text.</del> </p>
 >>> print html_annotate([(doc1, 'author1'), (doc2, 'author2'),
 ...                      (doc3, 'author3')])
<p><span title="author1">Here is</span>
   <b><span title="author2">a</span>
   <span title="author3">little</span></b>
   <i><span title="author2">text</span></i>
   <span title="author2">.</span></p>

As you can see, it is imperfect as such things tend to be. On larger tracts of text with larger edits it will generally do better.

The html_annotate function can also take an optional second argument, markup. This is a function like markup(text, version) that returns the given text marked up with the given version. The default version, the output of which you see in the example, looks like:

def default_markup(text, version):
    return '<span title="%s">%s</span>' % (
        cgi.escape(unicode(version), 1), text)

This example parses the hCard microformat.

First we get the page:

 >>> import urllib
 >>> from lxml.html import fromstring
 >>> url = 'http://microformats.org/'
 >>> content = urllib.urlopen(url).read()
 >>> doc = fromstring(content)
 >>> doc.make_links_absolute(url)

Then we create some objects to put the information in:

 >>> class Card(object):
 ...     def __init__(self, **kw):
 ...         for name, value in kw.items():
 ...             setattr(self, name, value)
 >>> class Phone(object):
 ...     def __init__(self, phone, types=()):
 ...         self.phone, self.types = phone, types

And some generally handy functions for microformats:

 >>> def get_text(el, class_name):
 ...     els = el.find_class(class_name)
 ...     if els:
 ...         return els[0].text_content()
 ...     else:
 ...         return ''
 >>> def get_value(el):
 ...     return get_text(el, 'value') or el.text_content()
 >>> def get_all_texts(el, class_name):
 ...     return [e.text_content() for e in el.find_class(class_name)]
 >>> def parse_addresses(el):
 ...     # Ideally this would parse street, etc.
 ...     return el.find_class('adr')

Then the parsing:

 >>> for el in doc.find_class('hcard'):
 ...     card = Card()
 ...     card.el = el
 ...     card.fn = get_text(el, 'fn')
 ...     card.tels = []
 ...     for tel_el in el.find_class('tel'):
 ...         card.tels.append(Phone(get_value(tel_el),
 ...                                get_all_texts(tel_el, 'type')))
 ...     card.addresses = parse_addresses(el)

Comments (1)

Flavio Pfaffhauser Jun 27, 2011

Accidentally I was just looking for this ;)

Welcome to the tox automation project — tox v1.0 documentation

Save

Note

Bug reports, feedback, contributions welcome: see support and contact channels.

tox aims to automate state-of-the-art packaging, testing and deployment of Python software right from your console or CI server, invoking your tools of choice.

tox is a generic virtualenv management and test command line tool you can use for:

  • checking your package installs correctly with different Python versions and interpreters
  • running your tests in each of the environments, configuring your test tool of choice
  • acting as a frontend to Continuous Integration servers, greatly reducing boilerplate and merging CI and shell-based testing.

First, install tox with pip install tox or easy_install tox. Then put basic information about your project and the test environments you want your project to run in into a tox.ini file residing right next to your setup.py file:

# content of: tox.ini , put in same dir as setup.py
[tox]
envlist = py26,py27
[testenv]
deps=pytest       # install pytest in the venvs
commands=py.test  # or 'nosetests' or ...

To sdist-package, install and test your project against Python2.6 and Python2.7, just type:

tox

and watch things happening (you must have python2.6 and python2.7 installed in your environment, otherwise you will see errors). When you run tox a second time you’ll note that it runs much faster because it keeps track of virtualenv details and will not recreate or re-install dependencies. You also might want to check out tox configuration and usage examples to get some more ideas.

  • automation of tedious Python related test activities

  • test your Python package against many interpreter and dependency configs

    • automatic customizable (re)creation of virtualenv test environments
    • installs your setup.py based project into each virtual environment
    • test-tool agnostic: runs py.test, nose or unittests in a uniform manner
  • supports using different / multiple PyPI index servers

  • uses pip (for Python2 environments) and distribute (for all environments) by default

  • cross-Python compatible: Python2.4 up to Python2.7, Jython and Python3 support as well as for pypy

  • cross-platform: Windows and Unix style environments

  • integrates with continuous integration servers like Jenkins (formerly known as Hudson) and helps you to avoid boilerplatish and platform-specific build-step hacks.

  • unified automatic artifact management between tox runs both in a local developer shell as well as in a CI/Jenkins context.

  • driven by a simple ini-style config file

  • documented examples and configuration

  • concise reporting about tool invocations and configuration errors

  • professionally supported

  • tox always operates in virtualenv environments; it cannot work with globally installed Python interpreters because there are no reliable means to install and recreate dependencies. Or does it still make sense to allow using global Python installations?
  • tox is fresh on the Python testing scene (first release July 2010) and needs some battle testing and feedback. It is likely to evolve in (possibly incompatible) increments as it provides more power to configure and customize the test process.
  • tox uses virtualenv and virtualenv5, the latter being a fork of virtualenv3 which roughly works with Python3 but has fewer features (no “pip” and other problems). This comes with limitations and you may run into them when trying to create python3 based virtual environments. IMO the proper solution is: virtualenv needs to merge and grow proper native Python3 support, preferably in a “single-source” way.
  • tox currently uses a setup.py sdist invocation to create an installable package and then invokes pip or easy_install to install into each test environment. There is no support for other installation methods.

Deform — deform v0.9 documentation

Save

deform is a Python HTML form generation library.

The design of deform is heavily influenced by the formish form generation library. Some might even say it’s a shameless rip-off; this would not be completely inaccurate. It differs from formish mostly in ways that make the implementation (arguably) simpler and smaller.

deform uses Colander as a schema library, Peppercorn as a form control deserialization library, and Chameleon to perform HTML templating.

deform depends only on Peppercorn, Colander, Chameleon and an internationalization library named translationstring, so it may be used in most web frameworks (or antiframeworks) as a result.

Alternate templating languages may be used, as long as all templates are translated from the native Chameleon templates to your templating system of choice and a suitable renderer is supplied to deform.

Visit deformdemo.repoze.org to view an application which demonstrates most of Deform’s features. The source code for this application is also available in the deform package on GitHub.

To report bugs, use the bug tracker.

If you’ve got questions that aren’t answered by this documentation, contact the Pylons-discuss maillist or join the #pylons IRC channel.

Browse and check out tagged and trunk versions of deform via the deform package on GitHub. To check out the trunk, use this command:

git clone git://github.com/Pylons/deform.git

To find out how to become a contributor to deform, please see the Pylons Project contributor documentation.

Without these people, this software would not exist:

Comments (1)

Aengus Walton Jun 27, 2011

Fear Factory and Midlake? Chap's got an interesting mix of musical inspiration...