Python: Normalize whitespace

While we’re at it, here is another Python snippet. While parsing websites, I also needed to normalize the whitespace in extracted blocks of text. The function below does exactly that. It

  1. Replaces runs of multiple spaces or tabs within a line with a single one (the first character of the run is kept).
  2. Completely removes spaces and tabs at the beginning and at the end of each line.

Note that it does not touch the linebreaks themselves, regardless of how many there are.

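import re

def sanitize_whitespace(text):
    def repl(match):  # a callable avoids the "unmatched group" error of a plain template
        g = match.groups()
        # basically r"\1\2\3", with unmatched groups treated as ''
        return (g[0] or '') + (g[1] or '') + (g[2] or '')
    # The original called force_unicode(text), a helper not defined in this snippet
    # (presumably Django's); str() is assumed here as a Python 3 stand-in.
    return sanitize_whitespace.pattern.sub(repl, str(text))
sanitize_whitespace.pattern = re.compile(r"""(?mxu)
    # spaces before a line - capture the (empty) line start in 1
    (^)[ \t]+   |
    # spaces after a line - capture the (empty) line end in 2
    [ \t]+($)   |
    # multiple spaces within a line - capture the replacement space in 3
    ([ \t])[ \t]+
""")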
