As I mentioned on Twitter a couple of days ago (Yes, I’ve finally surrendered), I was surprised to find that Python’s urllib/urllib2 refused to open the unicode url I gave it. Then I realized I didn’t actually understand how precisely the non-ASCII url stuff even worked, so I decided to change that.
Apparently, a URI is by definition restricted to (a subset of) ASCII characters (or maybe it consists of just bytes and has no concept of characters at all; I couldn’t quite make out the official stance). To enable a wider set of characters, IRIs were introduced in RFC 3987. IRIs by definition can contain unicode characters, and the RFC describes how an IRI has to be converted to an equivalent ASCII-only URI.
Therefore, to open an IRI (i.e. a unicode address) in urllib, we first have to go through this conversion process. Essentially, two things need to be done:
- The domain name needs to be IDNA-encoded, also known as Punycode. Python has shipped both an idna and a punycode codec since 2.3. The latter implements the base algorithm; the former knows about domain syntax and makes sure each label (i.e. subdomain) is handled separately, as it should be.
- The path and querystring components need to be UTF8-quoted, i.e. percent-encoded, with each octet being considered UTF8-encoded. Firefox also encodes a username/password portion in the same way.
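Both steps can be tried directly in the interpreter. A minimal sketch, using the Python 3 spellings (in 2.x, `quote` lives at `urllib.quote` instead of `urllib.parse.quote`); the hostname and path are made-up examples:

```python
# -*- coding: utf-8 -*-
from urllib.parse import quote  # urllib.quote in Python 2

# Step 1: IDNA-encode the domain. The idna codec splits on dots and
# Punycode-encodes each label separately, as the spec requires.
host = 'bücher.example.com'.encode('idna')
print(host)  # b'xn--bcher-kva.example.com'

# Step 2: percent-quote the path, treating each octet as UTF-8.
path = quote('/süß/käse'.encode('utf8'), safe='/')
print(path)  # /s%C3%BC%C3%9F/k%C3%A4se
```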
With that in mind, the whole thing already seemed a lot more straightforward. I had a couple of additional requirements, though:
- The function doing the IRI => URI conversion should support being generically plugged into an urlopen() call: since urlopen() doesn’t strictly require a url, but also handles, for example, filesystem paths, the converter needed to be able to deal with those non-urls as well, without corrupting them.
- It needed to be able to handle URLs from “out in the wild”, some of which may already be quoted (and should therefore not be quoted again).
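The second requirement can be approximated with a cheap check: only quote a component if it still contains non-ASCII characters, since an already percent-quoted component is pure ASCII. A sketch of that heuristic (Python 3 spelling; `maybe_quote` is a hypothetical helper name, and note that in Python 3 the ASCII round-trip needs an explicit decode):

```python
from urllib.parse import quote  # urllib.quote in Python 2

def maybe_quote(s, safe=''):
    # Quote only if the string still contains non-ASCII characters;
    # an already %-quoted component is pure ASCII and passes through.
    if s.encode('ascii', 'replace').decode('ascii') != s:
        return quote(s.encode('utf8'), safe=safe)
    return s

print(maybe_quote('K%C3%A4se'))  # unchanged: K%C3%A4se
print(maybe_quote('Käse'))       # quoted:    K%C3%A4se
```

The trade-off: a url that mixes literal `%` characters with unquoted non-ASCII text will be double-quoted, but that input is ambiguous anyway.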
The result currently looks like this:
import urllib
import urlparse

def asciify_url(url, force_quote=False):
    r"""Attempts to make a unicode url usable with ``urllib/urllib2``.

    More specifically, it attempts to convert the unicode object ``url``,
    which is meant to represent an IRI, to a unicode object that,
    containing only ASCII characters, is a valid URI. This involves:

        * IDNA/Puny-encoding the domain name.
        * UTF8-quoting the path and querystring parts.

    See also RFC 3987.
    """
    assert type(url) == unicode

    parts = urlparse.urlsplit(url)
    if not parts.scheme or not parts.netloc:
        # apparently not an url
        return url

    # idna-encode domain
    hostname = parts.hostname.encode('idna')

    # UTF8-quote the other parts. We check each part individually if
    # it needs to be quoted - that should catch some additional user
    # errors, say for example an umlaut in the username even though
    # the path *is* already quoted.
    def quote(s, safe):
        s = s or ''
        # Triggers on non-ascii characters - another option would be:
        # urllib.quote(s.replace('%', '')) != s.replace('%', '')
        # which would trigger on all %-characters, e.g. "&".
        if s.encode('ascii', 'replace') != s or force_quote:
            return urllib.quote(s.encode('utf8'), safe=safe)
        return s
    username = quote(parts.username, '')
    password = quote(parts.password, safe='')
    path = quote(parts.path, safe='/')
    query = quote(parts.query, safe='&=')

    # put everything back together
    netloc = hostname
    if username or password:
        netloc = '@' + netloc
        if password:
            netloc = ':' + password + netloc
        netloc = username + netloc
    if parts.port:
        netloc += ':' + str(parts.port)

    return urlparse.urlunsplit([
        parts.scheme, netloc, path, query, parts.fragment])
A version with more extensive comments and doctests is part of the FeedPlatform code.
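For reference, a rough Python 3 translation of the same idea, sketched under the assumption that urllib.parse keeps the urlsplit/urlunsplit/quote API (the function name `asciify_url_py3` is mine, and this hasn’t been exercised against all the edge cases above):

```python
from urllib.parse import urlsplit, urlunsplit, quote

def asciify_url_py3(url, force_quote=False):
    """Rough Python 3 sketch of the IRI -> URI conversion."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return url  # apparently not an url - leave it alone

    # IDNA-encode the domain; the codec handles each label separately.
    hostname = parts.hostname.encode('idna').decode('ascii')

    def q(s, safe=''):
        s = s or ''
        # In Python 3 the ASCII round-trip needs an explicit decode.
        if force_quote or s.encode('ascii', 'replace').decode('ascii') != s:
            return quote(s.encode('utf8'), safe=safe)
        return s

    # put everything back together
    netloc = hostname
    username, password = q(parts.username), q(parts.password)
    if username or password:
        netloc = '@' + netloc
        if password:
            netloc = ':' + password + netloc
        netloc = username + netloc
    if parts.port:
        netloc += ':%d' % parts.port

    return urlunsplit((parts.scheme, netloc,
                       q(parts.path, safe='/'),
                       q(parts.query, safe='&='),
                       parts.fragment))

print(asciify_url_py3('http://www.bücher.de/Käse'))
# http://www.xn--bcher-kva.de/K%C3%A4se
```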
I subsequently found out that there is at the very least one other existing implementation of this, in httplib2. While that one doesn’t avoid double-quoting and won’t leave non-urls alone (both specific requirements of mine), the latter allows it to support partial urls. It also uses a custom quote function written after the spec, rather than relying on urllib.quote, which is interesting. I wonder what the practical differences are there. Finally, it has a bug where an auth portion in the domain will lead to invalid IDNA-encoding, but that should be rare anyway.
It might further be noteworthy that the SVN version of FeedParser also applies IDNA-encoding, but does so on the full string given, which corrupts the URL if there are non-ASCII characters in any non-domain part.