htmlspecialchars() in Python

I was just looking for a way to unescape HTML entities in Python. It turns out this is not quite as simple as you might expect. Not as simple as PHP's htmlspecialchars_decode() or html_entity_decode(), anyway (htmlspecialchars() itself goes the other way and escapes). There is a translation table in htmlentitydefs (name2codepoint), but you have to do the actual substitution yourself. Also, while that table covers named entities, I want to support numeric ones as well.
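To make the limitation concrete, here is a minimal sketch of the do-it-yourself approach using only the translation table. It assumes Python 3, where htmlentitydefs was renamed to html.entities; the names named_pat and decode_named are made up for this illustration.

```python
import re
# Python 3 location of the table; on Python 2 this is
# "from htmlentitydefs import name2codepoint".
from html.entities import name2codepoint

# Matches named references only, e.g. &amp; or &eacute; (hypothetical helper pattern).
named_pat = re.compile(r'&(\w+);')

def decode_named(text):
    """Replace named entities found in the table; leave everything else alone."""
    def repl(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)  # unknown name: keep the original text
    return named_pat.sub(repl, text)

print(decode_named('caf&eacute; &amp; cr&egrave;me'))  # café & crème
print(decode_named('&#233; and &#xe9;'))  # numeric references come back unchanged
```

The second call shows the gap: numeric references such as &#233; never match the pattern, so they need separate handling.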

I finally stumbled across the scrape.py module (URL in the docstring), which contains an htmldecode() function that works very well. Here it is, renamed to decode():

import re

# matches a character entity reference (decimal numeric, hexadecimal numeric, or named).
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?')
def decode(text):
    """
        Decode HTML entities in the given text.
        text should be a unicode string, as that is what we insert.

        This is from:
            http://zesty.ca/python/scrape.py
    """
    from htmlentitydefs import name2codepoint
    if type(text) is unicode:
        uchr = unichr
    else:
        uchr = lambda value: value > 255 and unichr(value) or chr(value)

    def entitydecode(match, uchr=uchr):
        entity = match.group(1)
        if entity.startswith('#x'):
            return uchr(int(entity[2:], 16))
        elif entity.startswith('#'):
            return uchr(int(entity[1:]))
        elif entity in name2codepoint:
            return uchr(name2codepoint[entity])
        else:
            return match.group(0)
    return charrefpat.sub(entitydecode, text)
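The listing above is Python 2 code (unicode, unichr, htmlentitydefs). For anyone on Python 3, a rough port might look like the sketch below. This is my own adaptation, not part of scrape.py; it assumes the input is a str and drops the byte-string branch. Since Python 3.4 the standard library's html.unescape() also does this job directly, so the port is mostly useful for seeing how the pieces fit together.

```python
import re
from html import unescape
from html.entities import name2codepoint

# Matches a character entity reference (decimal numeric, hexadecimal numeric, or named).
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?')

def decode(text):
    """Decode decimal, hexadecimal and named HTML entities in a str."""
    def entitydecode(match):
        entity = match.group(1)
        if entity.startswith('#x'):
            return chr(int(entity[2:], 16))     # hexadecimal reference
        elif entity.startswith('#'):
            return chr(int(entity[1:]))         # decimal reference
        elif entity in name2codepoint:
            return chr(name2codepoint[entity])  # named reference
        else:
            return match.group(0)               # unknown name: leave untouched
    return charrefpat.sub(entitydecode, text)

sample = '&lt;p&gt;caf&eacute; &#233; &#xe9;&lt;/p&gt;'
print(decode(sample))                      # <p>café é é</p>
print(decode(sample) == unescape(sample))  # True for this input
```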
