Dec 5
I was just looking for a way to unescape HTML entities in Python. It turns out this is not quite as simple as you might expect. Not as simple as calling PHP's html_entity_decode(), anyway. There is a translation table in htmlentitydefs (name2codepoint), but you have to do the actual substitution yourself. Also, while that table covers named entities such as &amp;, we want to support numeric ones such as &#38; and &#x26; as well.
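To make that limitation concrete, here is a minimal sketch of the do-it-yourself approach using only the translation table. It assumes Python 3, where htmlentitydefs is called html.entities. The helper name decode_named and the pattern named_pat are made up for illustration. Named entities get decoded, while numeric ones pass through untouched.

```python
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Hypothetical pattern: matches named entities only, e.g. &amp; or &eacute;
named_pat = re.compile(r'&(\w+);')

def decode_named(text):
    """Replace named entities using the translation table (illustrative helper)."""
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)  # unknown name: leave the text as it was
    return named_pat.sub(replace, text)

# The numeric reference &#233; is not in the table's reach and stays as-is.
print(decode_named('caf&eacute; &amp; &#233;'))
```

The pattern never matches "&#233;" because "#" is not a word character, which is exactly the gap the function below closes.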
I finally stumbled across the scrape.py module (http://zesty.ca/python/scrape.py), which contains an htmldecode() function that works very well:
import re

# Matches a character entity reference (decimal numeric, hexadecimal numeric, or named).
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?')

def decode(text):
    """
    Decode HTML entities in the given text.

    text should be a unicode string, as that is what we insert.

    This is from:
    http://zesty.ca/python/scrape.py
    """
    # Python 2 module; it is named html.entities in Python 3.
    from htmlentitydefs import name2codepoint

    if type(text) is unicode:
        uchr = unichr
    else:
        # For byte strings, only fall back to unicode when the code point needs it.
        uchr = lambda value: unichr(value) if value > 255 else chr(value)

    def entitydecode(match, uchr=uchr):
        entity = match.group(1)
        if entity.startswith('#x'):
            # Hexadecimal numeric reference, e.g. &#xe9;
            return uchr(int(entity[2:], 16))
        elif entity.startswith('#'):
            # Decimal numeric reference, e.g. &#233;
            return uchr(int(entity[1:]))
        elif entity in name2codepoint:
            # Named reference, e.g. &eacute;
            return uchr(name2codepoint[entity])
        else:
            # Unknown entity: leave the original text untouched.
            return match.group(0)

    return charrefpat.sub(entitydecode, text)
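
The function above is Python 2 code (unicode, unichr, htmlentitydefs). As a rough sketch of how it might be carried over to Python 3, assuming every input is a str: the name decode3 is hypothetical, and the comparison against html.unescape() (in the standard library since Python 3.4) is only there as a sanity check, not as something the original module does.

```python
import html
import re
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs

# Same pattern as above: decimal numeric, hexadecimal numeric, or named.
charrefpat = re.compile(r'&(#(\d+|x[\da-fA-F]+)|[\w.:-]+);?')

def decode3(text):
    """Hypothetical Python 3 port of decode(); str is always unicode, so chr() suffices."""
    def entitydecode(match):
        entity = match.group(1)
        if entity.startswith('#x'):
            return chr(int(entity[2:], 16))
        elif entity.startswith('#'):
            return chr(int(entity[1:]))
        elif entity in name2codepoint:
            return chr(name2codepoint[entity])
        return match.group(0)  # unknown entity: leave it alone
    return charrefpat.sub(entitydecode, text)

sample = '&lt;b&gt;caf&eacute;&lt;/b&gt; &#233; &#xe9;'
print(decode3(sample))
print(html.unescape(sample))  # standard-library equivalent on Python 3.4+
```

On Python 3.4 or later, html.unescape() alone is probably all you need; the port is mainly useful if you want to keep control over the pattern or over what happens to unknown entities.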
