Dec
5
While we’re at it, another Python snippet. While parsing websites, I also needed to normalize extracted blocks of text with respect to whitespace. The function below does exactly that. It
- Replaces multiple whitespace characters with a single one.
- Completely removes spaces at the beginning and at the end of a line.
Note that it is not touching the linebreaks themselves, regardless of how many there are.
def sanitize_whitespace(text):
def repl(match): # avoid "unmatched group" error
g = match.groups()
# basically "\1\2\3"
return (g[0] or '') + (g[1] or '') + (g[2] or '')
return sanitize_whitespace.pattern.sub(repl, force_unicode(text))
sanitize_whitespace.pattern = re.compile(r"""(?mxu)
# spaces before a line - capture the linebreak in 1
(^)[\ \t]+ |
# spaces after a line - capture the linebreak in 2
[\ \t]+($) |
# multiple spaces within a line - capture the replacement space in 3
([\ \t])[\ \t]+
""")
