While we’re at it, another Python snippet. While parsing websites, I also needed to normalize extracted blocks of text with respect to whitespace. The function below does exactly that. It
- Replaces multiple whitespace characters with a single one.
- Completely removes spaces at the beginning and at the end of a line.
Note that it is not touching the linebreaks themselves, regardless of how many there are.
def sanitize_whitespace(text): def repl(match): # avoid "unmatched group" error g = match.groups() # basically "123" return (g[0] or '') + (g[1] or '') + (g[2] or '') return sanitize_whitespace.pattern.sub(repl, force_unicode(text)) sanitize_whitespace.pattern = re.compile(r"""(?mxu) # spaces before a line - capture the linebreak in 1 (^)[ t]+ | # spaces after a line - capture the linebreak in 2 [ t]+($) | # multiple spaces within a line - capture the replacement space in 3 ([ t])[ t]+ """)