Python: Normalize whitespace

While we’re at it, another Python snippet. While parsing websites, I also needed to normalize extracted blocks of text with respect to whitespace. The function below does exactly that. It

  1. Replaces multiple whitespace characters with a single one.
  2. Completely removes spaces at the beginning and at the end of a line.

Note that it is not touching the linebreaks themselves, regardless of how many there are.

def sanitize_whitespace(text):
    def repl(match): # avoid "unmatched group" error
        g = match.groups()
        # basically "\1\2\3"
        return (g[0] or '') + (g[1] or '') + (g[2] or '')
    return sanitize_whitespace.pattern.sub(repl, force_unicode(text))
sanitize_whitespace.pattern = re.compile(r"""(?mxu)
    # spaces before a line - capture the linebreak in 1
    (^)[\ \t]+   |
    # spaces after a line - capture the linebreak in 2
    [\ \t]+($)   |
    # multiple spaces within a line - capture the replacement space in 3
    ([\ \t])[\ \t]+
""")

Leave a Reply