I was just looking for a way to unescape HTML entities in Python. Turns out this is not quite as simple as you might expect. Not as simple as PHP’s html_entity_decode(), anyway. There is a translation table in htmlentitydefs (name2codepoint), but you have to do the actual work yourself. Also, while that works for named entities, we want to support numeric ones as well.
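To make that limitation concrete, here is a minimal sketch of the table-only approach. It assumes Python 3, where htmlentitydefs lives on as html.entities; the helper name decode_named is made up for illustration and is not part of any library.

```python
import re
from html.entities import name2codepoint  # Python 2: htmlentitydefs

def decode_named(text):
    """Replace named entities such as &amp; using only the lookup table."""
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)  # unknown name: leave untouched
    return re.sub(r'&(\w+);', replace, text)

print(decode_named('caf&eacute; &amp; bar'))  # named entities are decoded
print(decode_named('&#233; &#xe9;'))          # numeric ones are left as-is
```

The second call shows the gap: the table knows nothing about numeric references, so those need separate handling.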
I finally stumbled across this module, which contains an htmldecode() function that works very well:
    import re

    # Matches a character entity reference (decimal numeric, hexadecimal
    # numeric, or named).
    charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?')

    def decode(text):
        """
        Decode HTML entities in the given text.

        text should be a unicode string, as that is what we insert.

        This is from: http://zesty.ca/python/scrape.py
        """
        from htmlentitydefs import name2codepoint

        # For unicode input always use unichr; for byte strings only fall
        # back to unichr when the code point does not fit in a single byte.
        if type(text) is unicode:
            uchr = unichr
        else:
            uchr = lambda value: value > 255 and unichr(value) or chr(value)

        def entitydecode(match, uchr=uchr):
            entity = match.group(1)
            if entity.startswith('#x'):
                # Hexadecimal reference, e.g. &#xe9;
                return uchr(int(entity[2:], 16))
            elif entity.startswith('#'):
                # Decimal reference, e.g. &#233;
                return uchr(int(entity[1:]))
            elif entity in name2codepoint:
                # Named reference, e.g. &eacute;
                return uchr(name2codepoint[entity])
            else:
                # Unknown entity: leave the text as it was.
                return match.group(0)

        return charrefpat.sub(entitydecode, text)
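The function above is written for Python 2 (unicode, unichr, htmlentitydefs). As a rough sketch of how the same idea might look on Python 3, assuming html.entities as the replacement for htmlentitydefs, something like the following should work. The name decode3 is hypothetical, chosen only to keep it apart from the original.

```python
import re
from html.entities import name2codepoint

# Decimal (&#233;), hexadecimal (&#xe9;), or named (&eacute;) reference,
# with an optional trailing semicolon.
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?')

def decode3(text):
    """Python 3 sketch of decode(); str is always Unicode here."""
    def entitydecode(match):
        entity = match.group(1)
        if entity.startswith('#x'):
            return chr(int(entity[2:], 16))
        elif entity.startswith('#'):
            return chr(int(entity[1:]))
        elif entity in name2codepoint:
            return chr(name2codepoint[entity])
        return match.group(0)  # unknown entity: leave as-is
    return charrefpat.sub(entitydecode, text)

print(decode3('&lt;p&gt; caf&#233; &#xe9; &amp; &bogus;'))
```

Named, decimal, and hexadecimal references are all decoded in one pass, and anything unrecognised (like &bogus; here) is passed through unchanged. On Python 3.4 and later the standard library's html.unescape() covers the same ground, so a hand-rolled version is mostly useful when you need to control exactly which entities get decoded.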