
iPython with virtualenv

You’ll find a bunch of different approaches to this on Google: using an iPython boot script that extends sys.path with paths from the current virtualenv, or calling the iPython script with the virtualenv’s Python binary (python `which ipython`).

They all tend to be problematic; the latter, for example, doesn’t work if the virtualenv has been configured with --no-site-packages.

Why not simply install iPython inside the virtualenv proper? This is what I’ve been doing for a while, reluctantly, and I’m finally aware of what has bothered me about it: the iPython download clocks in at an insane 8.3 megabytes (the uncompressed size is 18 MB, about 15 of which are documentation). On my slow DSL connection the download takes a good minute.

Using virtualenvwrapper, I’ve now added this to my postmkvirtualenv script:

CACHE=$WORKON_HOME/.cache
mkdir -p "$CACHE"
$VIRTUAL_ENV/bin/pip install --download-cache="$CACHE" ipython

This gives me iPython in every new virtual environment, at the cost of 2 or 3 seconds of installation time.

Python argparse: Combine nargs=* with subparsers

Say you set up argparse with a nargs='*' or nargs='+' argument first, followed by a subparser to handle a command:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--items', nargs='+', default=[])
subparsers = parser.add_subparsers()
subparsers.add_parser('foo')
subparsers.add_parser('bar')

The usage would look like this:

usage: script.py [-h] [--items ITEMS [ITEMS ...]] {foo,bar} ...

This is actually somewhat problematic: if you were to parse the arguments “--items one two foo”, argparse will assume that foo is an item, and complain about the missing command (error: too few arguments).

A workaround is letting the user break out of the nargs-based argument by giving a single “-” character. This can easily be done with:

parser.add_argument('-', dest='__dummy', 
    action="store_true", help=argparse.SUPPRESS)

Now, the following will work: “--items one two - foo”.

I think that two dashes (“--”) are more common for this purpose (with a single dash usually referring to stdin/stdout), but unfortunately, argparse doesn’t seem to support using two.

Edit: I’m a dummy. argparse already has support for “--” to break nargs built in. It’s not 100% the same, as it will force everything that follows to be considered a positional argument (whereas argparse could in theory support multiple sets of positional arguments, I think), but for most cases, relying on the builtin “--” is the right choice.
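
Here is the builtin separator in action; a minimal sketch using a plain positional instead of a subparser, to keep it simple:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--items', nargs='+', default=[])
parser.add_argument('command')

# "--" stops the nargs consumption; "foo" becomes the positional
args = parser.parse_args(['--items', 'one', 'two', '--', 'foo'])
assert args.items == ['one', 'two']
assert args.command == 'foo'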

Twisted Twistd Autoreload

While working on the Twisted server for A World Of Photo, I quickly began missing the convenience of having it restart automatically during development when I had made changes to the code. It turns out that the autoreload module Django uses is actually pretty generic [1]. One thing Twisted doesn’t like is that the code which checks for file changes runs in the main thread, with the actual app in a separate thread. That’s easily reversed, though. You can find a patched version on Bitbucket.

Then, all you need is a simple twistd wrapper:

from twisted.scripts import twistd
from pyutils import autoreload

autoreload.main(twistd.run)
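
Save this as, say, run.py (the name is arbitrary); since twistd.run() parses sys.argv itself, you can pass the wrapper the same arguments you would normally give twistd.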

[1] http://twistedmatrix.com/trac/ticket/4072

Opening IRIs in Python

As I mentioned on Twitter a couple of days ago (yes, I’ve finally surrendered), I was surprised to find that Python’s urllib/urllib2 refused to open the unicode URL I gave it. Then I realized I didn’t actually understand how precisely the non-ASCII URL stuff even worked, so I decided to change that.

Apparently, a URI is by definition restricted to (a subset of) ASCII characters (or maybe it consists of just bytes and has no concept of characters altogether; I couldn’t quite make out the official stance). To enable a wider set of characters, IRIs were introduced in RFC 3987. IRIs by definition can contain unicode characters, and the RFC describes how an IRI has to be converted to an equivalent ASCII-only URI.

Therefore, to open an IRI (i.e. a unicode address) in urllib, we first have to go through this conversion process. Essentially, two things need to be done:

  • The domain name needs to be IDNA-encoded, also known as Punycode. Python has supported both an idna and a punycode codec since 2.3. The latter is the base algorithm; the former knows about domain syntax and makes sure each label (i.e. subdomain) is handled separately, as it should be.
  • The path and querystring components need to be UTF8-quoted, i.e. percent-encoded, with each octet considered UTF8-encoded. Firefox also encodes the username/password portion in the same way.
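
Both pieces are available in the standard library; here is a quick sketch (Python 2, with a hypothetical example domain):

# -*- coding: utf-8 -*-
import urllib

# IDNA-encode the domain; the idna codec handles each label separately.
print u'www.müller.de'.encode('idna')
# -> www.xn--mller-kva.de

# Percent-encode the path, treating each octet as UTF8-encoded.
print urllib.quote(u'/straße'.encode('utf8'), safe='/')
# -> /stra%C3%9Fe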

With that in mind, the whole thing already seemed a lot more straightforward. I had a couple of additional requirements, though:

  • The function doing the IRI => URI conversion should support being plugged generically into a urlopen() call; since urlopen() doesn’t actually require a URL, but also handles, for example, filesystem paths, the converter needed to be able to deal with those non-URLs as well, without corrupting them.
  • It needed to be able to handle URLs from “out in the wild”, some of which may already be quoted (and should therefore not be quoted again).

The result currently looks like this:

import urllib
import urlparse

def asciify_url(url, force_quote=False):
    r"""Attempts to make a unicode url usable with ``urllib/urllib2``.

    More specifically, it attempts to convert the unicode object ``url``,
    which is meant to represent an IRI, to a unicode object that,
    containing only ASCII characters, is a valid URI. This involves:

        * IDNA/Puny-encoding the domain name.
        * UTF8-quoting the path and querystring parts.

    See also RFC 3987.
    """
    assert type(url) == unicode

    parts = urlparse.urlsplit(url)
    if not parts.scheme or not parts.netloc:
        # apparently not a URL
        return url

    # idna-encode domain
    hostname = parts.hostname.encode('idna')

    # UTF8-quote the other parts. We check each part individually to see
    # if it needs to be quoted - that should catch some additional user
    # errors, say for example an umlaut in the username even though
    # the path *is* already quoted.
    def quote(s, safe):
        s = s or ''
        # Triggers on non-ascii characters - another option would be:
        #     urllib.quote(s.replace('%', '')) != s.replace('%', '')
        # which would trigger on all characters that require quoting, e.g. "&".
        if s.encode('ascii', 'replace') != s or force_quote:
            return urllib.quote(s.encode('utf8'), safe=safe)
        return s
    username = quote(parts.username, '')
    password = quote(parts.password, safe='')
    path = quote(parts.path, safe='/')
    query = quote(parts.query, safe='&=')

    # put everything back together
    netloc = hostname
    if username or password:
        netloc = '@' + netloc
        if password:
            netloc = ':' + password + netloc
        netloc = username + netloc
    if parts.port:
        netloc += ':' + str(parts.port)
    return urlparse.urlunsplit([
        parts.scheme, netloc, path, query, parts.fragment])
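
With a hypothetical example URL, the conversion looks like this:

>>> asciify_url(u'http://www.müller.de/straße?q=ä')
u'http://www.xn--mller-kva.de/stra%C3%9Fe?q=%C3%A4'

The result can then be passed straight to urlopen().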

A version with more extensive comments and doctests is part of the FeedPlatform code.

I subsequently found out that there is at least one other existing implementation of this, in httplib2. While that one doesn’t avoid double-quoting and won’t leave non-URLs alone (my own specific requirements), the latter fact enables it to support partial URLs. It also uses a custom quote function written after the spec, rather than relying on urllib.quote, which is interesting; I wonder what the practical differences are there. Finally, it has a bug where an auth-portion in the domain will lead to invalid IDNA-encoding, but that should be rare anyway.

It might further be noteworthy that the SVN version of FeedParser also applies IDNA-encoding, but does so on the full string given, which corrupts the URL if there are non-ASCII characters in any non-domain part.

FeedPlatform

Looking at my sort-of-todo list, I see at least four projects that either need or would greatly benefit from feed aggregator-like functionality, i.e. not just parsing feeds once, but updating a list of feeds and keeping track of their items.

So it seems clear that the right thing to do is to implement this only once, preferably as some sort of generic library, and then reuse it. Unfortunately, the requirements are quite different: one project needs to track enclosures (which, by the way, also changes how item guids can potentially be identified). Sometimes notifications need to be sent out. Content has to be analysed in different ways. Two of the apps potentially handle large lists of feeds that require prioritized parsing and sophisticated error handling – in the other cases the list is small, and it wouldn’t be worth the bother. Other issues revolve around how and whether to handle cover images, how to handle redirects, whether to ignore entries under certain conditions, and even what data to collect.

You get the point: it’s not that easy to fit all of this under one hood, which is also the reason I’ve been putting it off for a while now. I think I’ve finally come up with a solution that I find satisfying, though.

The whole thing centers around a Django settings.py-like config file. Per default, two tables would be used, one for feeds and one for items, and save for primary and foreign keys, each table would only need a single column: the feed table the feed URL, the item table the item guid.

Then, in said config file, you would specify a list of addins that each provide a particular, isolated piece of functionality. Addins can depend on each other, and it would be easy to write your own.

A configuration file might look like this:

# my-feedbot-config.py

USER_AGENT = 'MyFeedBot/%s (+url)' % get_version()

ADDINS = [
    # builtin
    collect('title', 'description', 'author'),
    enclosures(require=True),
    save_bandwidth(), # needs columns for storing etag etc. in db
    custom_item_filter(handler_func),

    # custom
    check_for_claimcode(),
]

Now, given the addins specified above, there’d probably be a new enclosure table, new columns in the item table for storing the metadata and HTTP header info, and the parsing process would call out to your code to handle filtering and claimcodes.

I’ve already checked a decent amount of code into a Bazaar branch, but it’s far from finished (or usable). I’ll hopefully have enough time to work on this during the next few weeks and plan to post some updates as I go along (by the way, this does not depend on Django, for once).

SmartInspect Python Logging Client

Back when I was still writing a lot of Win32 apps in Delphi, one of my favorite tools was SmartInspect, a very nice logging tool that also ships with Java and .NET libraries. Unfortunately, I never got to use it that much, since shortly after I bought it I started to drift more and more into web development.

So, three days ago, I had the sudden impulse to take a break from my regular projects and write a Python client library. Just for fun, mostly, but I can see some opportunities in the future where it might come in handy. I wasn’t expecting it to take three days either, but some of the final bugs took some time to iron out.

I should also note that:

  • It’s a pretty direct port of the Delphi implementation, and thus not as pythonic as it could be, though I took some liberties where it seemed to make a lot of sense. Most significantly, the identifier names were converted to Python style guidelines (using underscores in method names).
  • It isn’t complete, either. Apart from some minor stuff, the big things missing are the file and text protocols – I plan to add those at some point later on. Only the TCP and memory protocols are supported right now.

Here are some random examples – more in the readme and test files:

>>> from smartinspect.auto import *
>>> si.enabled = True
>>> si.log_debug("hello world!")

Manual initialization, without using the smartinspect.auto module:

>>> from smartinspect import *
>>> si = SmartInspect("myapp")
>>> si.connections = "tcp(port=4444, timeout=10)"
>>> si.enabled = True
>>> logger = si.add_session("main")
>>> logger.log_debug("hello world!")

Manually logging process flow:

>>> def append(self, obj):
...     logger.enter_method("append", self)
...     try:
...         pass   # do something
...     finally:
...         logger.leave_method("append", self)

Logging process flow using the decorator:

>>> @logger.track
... def append(self, obj):
...     pass   # do something

Download: smartinspect.zip

Sending binary data through stdout on Windows

It looks like I’m not the first one to stumble over this, but I just had a lot of fun trying to figure out why a moderately complex binary protocol sometimes randomly failed to work when sent through a subprocess stdout pipe. Because Python opens stdout in text mode on Windows, something like “\x0A” ended up as “\x0D\x0A” on the receiving end.

Even worse, when trying to debug it by dumping stuff to files, I forgot to open those in binary mode too, which made things even more confusing 😉

Here’s the solution:

import sys

if sys.platform == "win32":
    import os, msvcrt
    msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
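
To convince yourself the fix works, here is a quick round-trip sketch (assuming it is saved as a script and run directly):

import subprocess, sys

if '--child' in sys.argv:
    if sys.platform == "win32":
        import os, msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    sys.stdout.write('\x0A')
else:
    child = subprocess.Popen([sys.executable, __file__, '--child'],
                             stdout=subprocess.PIPE)
    output = child.communicate()[0]
    # without the setmode() call, this would be '\x0D\x0A' on Windows
    assert output == '\x0A', repr(output)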

‘module’ object has no attribute ‘__path__’

If you’re seeing the above error, possibly during a reverse() call, and possibly involving the django.contrib.auth.views.template_detail view, make sure none of your application directories has a file named templatetags.py – Django requires this to be a package/directory.
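
In other words, the layout Django expects looks like this (the mytags.py name is just an example):

myapp/
    models.py
    templatetags/
        __init__.py
        mytags.py    # loaded via {% load mytags %}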

Oh, and on a more general note: not forgetting the .pyc files when deleting modules will greatly reduce the time needed to debug cases like this 😉

Debugging Storm queries

Storm does not have a query log like the one you might know from Django – not to my knowledge, at least; the docs are still lacking.

If you need to know what queries are executed, you can do:

from storm import database
database.DEBUG = True

This will print all statements to stdout. If you need more, I suppose there’s nothing stopping you from hooking into storm.database.Connection.raw_execute.
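
For example, a minimal monkey-patching sketch (this assumes raw_execute takes (statement, params), which is Storm-internal and may differ between versions):

from storm.database import Connection

_orig_raw_execute = Connection.raw_execute

def raw_execute(self, statement, params=None):
    # log the statement, then defer to the original implementation
    print 'STORM:', statement, params
    return _orig_raw_execute(self, statement, params)

Connection.raw_execute = raw_execute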