Regex notes

These are my personal regex notes for daily work. I will keep updating them.

import re

re module

  • re.match(pattern, string) - if zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.
## match
re.match(r"project", "project manager")

<_sre.SRE_Match object; span=(0, 7), match='project'>

## no match: returns None, because "manager" is not at the beginning of the string
re.match(r"manager", "project manager")
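Because re.match returns None when the pattern does not match at the beginning, the result is usually checked before it is used. A minimal sketch of that check (the helper name starts_with is made up for this note, it is not part of the re module):

```python
import re

def starts_with(pattern, text):
    """Return the matched prefix, or None when text does not start with pattern."""
    m = re.match(pattern, text)
    # re.match returns None on failure, so guard before calling .group()
    return m.group(0) if m else None

print(starts_with(r"project", "project manager"))  # project
print(starts_with(r"manager", "project manager"))  # None
```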
  • re.search(pattern, string) - scan through the whole string and return a match object for the first location where the regular expression pattern produces a match (None if there is no match).
## use search
re.search(r"manager", "project manager")

<_sre.SRE_Match object; span=(8, 15), match='manager'>
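The match object shown above carries the matched text and its position. A short sketch of reading them back, using the same string as the example (the variable name m is only illustrative):

```python
import re

m = re.search(r"manager", "project manager")
if m:  # re.search returns None when nothing matches
    print(m.group())           # manager
    print(m.span())            # (8, 15)
    print(m.start(), m.end())  # 8 15
```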

  • re.compile(pattern) - compile a regular expression pattern into a regular expression object, which then provides search, match and the other methods. More efficient when the expression will be used several times.
## match again
regex = re.compile(r"project")
regex.match("project manager")

<_sre.SRE_Match object; span=(0, 7), match='project'>
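A compiled pattern pays off when the same expression is applied to many strings. A small sketch of that reuse (the list titles and the name role are invented sample data for this note):

```python
import re

# Compile once, then reuse the same object for every string.
role = re.compile(r"manager")

titles = ["project manager", "IT section", "Manager on duty", "account manager"]
# "Manager on duty" is skipped because matching is case sensitive by default.
hits = [t for t in titles if role.search(t)]
print(hits)  # ['project manager', 'account manager']
```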

  • re.split(pattern, string) - split string by the occurrences of pattern. If capturing parentheses are used in pattern, the text of all groups in the pattern is also returned as part of the resulting list.
## split with space
re.split(r' ', 'project manager, IT section')

['project', 'manager,', 'IT', 'section']

## split with more characters
re.split(r'[ ,]+', 'project manager, IT section')

['project', 'manager', 'IT', 'section']

## split with parentheses
re.split(r'([ ,]+)', 'project manager, IT section')

['project', ' ', 'manager', ', ', 'IT', ' ', 'section']

  • re.sub(pattern, repl, string) - replace the matches in string with repl.
## remove any of the special characters
re.sub(r"[,\.\:\;\!]+", "", "project manager, IT section...")

'project manager IT section'
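The repl argument can also refer back to groups captured by the pattern, and the optional count argument limits how many replacements are made. A brief sketch of both, reusing the sample strings from these notes (the variable names are illustrative):

```python
import re

# \1 and \2 in repl refer to the first and second captured groups
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "project manager")
print(swapped)  # manager project

# count=1 replaces only the first match (the comma), the dots stay
first_only = re.sub(r"[,.:;!]+", "", "project manager, IT section...", count=1)
print(first_only)  # project manager IT section...
```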

  • re.findall(pattern, string) - find all non-overlapping matches of pattern in string, return as a list of strings.
## find all runs of special characters in the string
re.findall(r"[,\.\:\;\!]+", "project manager, IT section...")

[',', '...']
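What re.findall returns depends on the groups in the pattern: with no group it returns the whole matches, with one group only the text of that group, and with several groups a list of tuples. A small sketch on the same sample string (variable names are illustrative):

```python
import re

text = "project manager, IT section..."

# no group: the whole match is returned
words = re.findall(r"\w+", text)
print(words)  # ['project', 'manager', 'IT', 'section']

# one group: only the text of the group is returned
before_comma = re.findall(r"(\w+),", text)
print(before_comma)  # ['manager']

# several groups: a list of tuples, one item per group
pairs = re.findall(r"(\w+) (\w+)", text)
print(pairs)  # [('project', 'manager'), ('IT', 'section')]
```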

Special characters

.
Matches any character except a newline.
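A quick sketch of the newline exception. The re.DOTALL flag, which is not covered elsewhere in these notes, makes the dot match a newline as well:

```python
import re

same_line = re.search(r"a.c", "abc")               # '.' matches 'b'
no_flag = re.search(r"a.c", "a\nc")                # '.' does not match '\n'
with_flag = re.search(r"a.c", "a\nc", re.DOTALL)   # with DOTALL it does

print(same_line is not None)  # True
print(no_flag is not None)    # False
print(with_flag is not None)  # True
```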

^
Matches the start of the string.

$
Matches the end of the string.
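By default ^ and $ anchor to the whole string. With the re.MULTILINE flag (an addition to what these notes cover) they also match at the start and end of each line. A short sketch with a made-up two-line string:

```python
import re

text = "project manager\nIT section"

first_default = re.findall(r"^\w+", text)
first_per_line = re.findall(r"^\w+", text, re.MULTILINE)
last_default = re.findall(r"\w+$", text)
last_per_line = re.findall(r"\w+$", text, re.MULTILINE)

print(first_default)   # ['project']
print(first_per_line)  # ['project', 'IT']
print(last_default)    # ['section']
print(last_per_line)   # ['manager', 'section']
```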

*
Matches 0 or more repetitions of the preceding expression. ab* will match 'a', 'ab', or 'abbbb' ('a' followed by any number of 'b's).

+
Matches 1 or more repetitions of the preceding expression. ab+ will match 'ab' or 'abbbb' ('a' followed by at least one 'b'), but not 'a'.

?
Matches 0 or 1 repetitions of the preceding expression. ab? will match 'a' or 'ab'. Useful for plural forms: birds? will match both 'bird' and 'birds'.
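The three qualifiers can be compared side by side by checking which candidate strings each pattern matches in full. A minimal sketch (the helper full is made up for this note; it relies on re.fullmatch, available since Python 3.4):

```python
import re

def full(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(full(r"ab*", ["a", "ab", "abbbb"]))  # ['a', 'ab', 'abbbb']
print(full(r"ab+", ["a", "ab", "abbbb"]))  # ['ab', 'abbbb']
print(full(r"ab?", ["a", "ab", "abbbb"]))  # ['a', 'ab']
print(full(r"birds?", ["bird", "birds"]))  # ['bird', 'birds']
```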

*?, +?, ??
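Non-greedy versions of *, + and ?. They match as little text as possible, whereas the plain *, + and ? are greedy and match as much text as possible.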
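The difference between a qualifier and its ?-suffixed form is easiest to see on a string with several possible end points. A small sketch with a made-up tag-like string:

```python
import re

text = "<a><b>"

# greedy: .* runs to the last '>' in the string
greedy = re.findall(r"<.*>", text)
# non-greedy: .*? stops at the first '>' it can
lazy = re.findall(r"<.*?>", text)

print(greedy)  # ['<a><b>']
print(lazy)    # ['<a>', '<b>']
```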

\
Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.

[]
[abc] will match any of 'a', 'b', or 'c'.
[a-z] will match any lowercase ASCII letter.
Special characters lose their special meaning inside sets. [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
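A short sketch of sets in use, on made-up input strings; note that no backslashes are needed for the special characters inside the brackets:

```python
import re

# range: lowercase ASCII letters only, so 'I', 'T' and the space are skipped
lower = re.findall(r"[a-z]", "IT section")
print(lower)  # ['s', 'e', 'c', 't', 'i', 'o', 'n']

# special characters are literal inside a set
literals = re.findall(r"[(+*)]", "(a+b)*c")
print(literals)  # ['(', '+', ')', '*']
```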

|
A|B will match either A or B. Any number of REs can be separated by '|' in this way. The alternatives are tried from left to right: once A matches, B is not tried, even if it would produce a longer match.
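Because alternatives are tried from left to right, their order can change the result. A minimal sketch with two overlapping alternatives (the variable names are illustrative):

```python
import re

# 'pro' is tried first and succeeds, so 'project' is never tried
short_first = re.search(r"pro|project", "project manager").group()
# putting the longer alternative first gives the longer match
long_first = re.search(r"project|pro", "project manager").group()

print(short_first)  # pro
print(long_first)   # project
```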

\b
Matches the empty string at a word boundary, that is, at the beginning or end of a word.
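A small sketch of \b used to match a whole word only, on a made-up string. The pattern needs a raw string, since "\b" in a normal string is the backspace character:

```python
import re

text = "man manager woman"

anywhere = re.findall(r"man", text)        # also inside 'manager' and 'woman'
whole_word = re.findall(r"\bman\b", text)  # only the standalone word

print(anywhere)    # ['man', 'man', 'man']
print(whole_word)  # ['man']
```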

\d
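Matches any Unicode decimal digit. This includes [0-9] and digit characters from many other scripts; it is equivalent to [0-9] only when the re.ASCII flag is used.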

\s
Matches Unicode whitespace characters (which includes [ \t\n\r\f\v])

\w
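For Unicode (str) patterns:
Matches Unicode word characters: most characters that can be part of a word in any language, plus digits and the underscore. Equivalent to [a-zA-Z0-9_] only when the re.ASCII flag is used.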
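For str patterns, \d and \w are Unicode-aware by default; the re.ASCII flag restricts them to the ASCII ranges. A short sketch of the difference, using a made-up string with an accented letter and Arabic-Indic digits (written as escapes so the source stays ASCII):

```python
import re

text = "caf\u00e9 \u0661\u0662\u0663"  # 'café' followed by Arabic-Indic digits 1 2 3

words_unicode = re.findall(r"\w+", text)
words_ascii = re.findall(r"\w+", text, re.ASCII)
digits_unicode = re.findall(r"\d", "7\u0663")
digits_ascii = re.findall(r"\d", "7\u0663", re.ASCII)

print(words_unicode)   # both words, including the accented letter and the digits
print(words_ascii)     # ['caf']
print(digits_unicode)  # two digits: '7' and the Arabic-Indic three
print(digits_ascii)    # ['7']
```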

Example

def clean(str_input):
    """Clean input string by keeping only word characters (letters, digits, underscore) and dots (.)

    input
    --------
    str_input : string
        any type of string

    output
    --------
    str_clean : string
        lower case string without any special characters in the pattern.

    notes
    --------
    - clean string
    - remove multiple spaces
    - lower case
    """
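    # anything that is not a word character or a dot counts as a separator
    sep = re.compile(r"[^\w.]")
    try:
        str_clean = sep.sub(" ", str_input)
        # raw string, so that \s is not read as a string escape
        str_clean = re.sub(r"\s+", " ", str_clean).strip()
        return str_clean.lower()
    except TypeError:
        # non-string input (e.g. None) cannot be cleaned
        return ""

clean(r"fwoejurw#5r[x]few2\.3540#%###&^^*}$%$#&%*&(!)")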

'fwoejurw 5r x few2 .3540'