Python regex tutorial

Python's built-in re module provides support for regular expressions. The module is part of the standard library in both Python 2 and Python 3.

This tutorial will cover the basics of using regex in Python. We will cover the following topics:

  1. The basics of regular expressions
  2. Using regex in Python
  3. Advanced regex features

The basics of regular expressions

A regular expression, or "regex", is a sequence of characters that forms a search pattern. When you search for data in a text, you can use this search pattern to describe what you are searching for.

A regex is formed by using a set of characters that are understood by the regex engine. These characters can be literal characters, or they can be special characters that denote different kinds of searches.

For example, the regex "a" will match the character "a" in a text. If you wanted to match the literal character "a" followed immediately by "b", you would use the regex "ab".

Special characters that are used in regex are called "metacharacters". Some of the most common metacharacters are ".", "^", "$", "*", "+", "?", "(", and ")".

The "." character is a metacharacter that denotes any character. For example, the regex "." will match any single character in a text, except a newline by default.

The "^" character is a metacharacter that denotes the start of a string. For example, the regex "^a" will match the character "a" at the start of a string.

The "$" character is a metacharacter that denotes the end of a string. For example, the regex "a$" will match the character "a" at the end of a string.

The "*" character is a metacharacter that denotes zero or more occurrences of the previous character. For example, the regex "a*" will match the character "a" zero or more times.

The "+" character is a metacharacter that denotes one or more occurrences of the previous character. For example, the regex "a+" will match the character "a" one or more times.

The "?" character is a metacharacter that denotes zero or one occurrence of the previous character. For example, the regex "a?" will match the character "a" zero times or one time.

The "(" and ")" characters are metacharacters that open and close a group. By default, a group is a capturing group: the text it matches is saved so that it can be retrieved or backreferenced later. We will cover backreferencing in more detail later in this tutorial.

A group that starts with "(?:" instead of "(" is a non-capturing group. Non-capturing groups are used to group characters together, for example so that a quantifier applies to all of them, but the text they match is not captured for backreferencing.
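The metacharacters described so far can be tried out directly. The following is a minimal sketch using Python's re module; the sample strings and variable names are made up for illustration.

```python
import re

# "." matches any single character except a newline
dot = re.search(r"c.t", "cat")
print(dot.group())                    # cat

# "^" anchors the pattern at the start, "$" at the end
starts_with_a = re.search(r"^a", "abc") is not None
ends_with_a = re.search(r"a$", "abc") is not None
print(starts_with_a, ends_with_a)     # True False

# "*", "+" and "?" control how often the preceding character may repeat
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")
optional = re.findall(r"ab?", "a ab abbb")
print(star)                           # ['a', 'ab', 'abbb']
print(plus)                           # ['ab', 'abbb']
print(optional)                       # ['a', 'ab', 'ab']
```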

Using regex in Python

Now that we've covered the basics of regex, let's see how we can use regex in Python.

To use the re module, we need to import it into our Python program:

import re

Once we have imported the re module, we can start using it in our program.

A common first step is to compile our regex. This is done using the compile() function:

regex = re.compile("a")

The compile() function takes a regex as an argument, and returns a regex object. We can then use this regex object to match against a text.

To match a regex against the beginning of a text, we use the match() method of the regex object:

result = regex.match("text")

The match() method takes a text as an argument, and returns a match object if the regex matches at the start of the text. Otherwise, it returns None. To find a match anywhere in the text, use the search() method instead.

The match object has several methods that we can use to get information about the match. The most important methods are:

  • group(): Returns the matched text
  • start(): Returns the start index of the match
  • end(): Returns the index just past the last character of the match

Let's see an example of how we can use these methods:

import re

regex = re.compile("a")
result = regex.match("text")

if result:
    print(result.group())
    print(result.start())
    print(result.end())
else:
    print("No match")

In this example, we have imported the re module and compiled a regex that matches the character "a". We have then used the match() method to match the regex against the text "text".

Since "text" does not start with "a", match() returns None, and we print "No match".

If we change the text to "a text", we get the following output:

a
0
1

As you can see, the match object contains the matched text ("a"), as well as the start and end indices of the match (0 and 1).
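Note that match() only looks for a match at the very beginning of the text, while search() scans the whole text. The short sketch below contrasts the two; the sample strings and variable names are chosen only for illustration.

```python
import re

regex = re.compile("a")

# match() only tries the pattern at the very start of the string
at_start = regex.match("a text")
not_at_start = regex.match("text a")

# search() scans the string and returns the first match it finds
found = regex.search("text a")

print(at_start.span())     # (0, 1)
print(not_at_start)        # None
print(found.span())        # (5, 6)
```

If the whole text has to match the regex, the fullmatch() method (Python 3.4 and later) can be used in the same way.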

Advanced regex features

In this section, we will cover some of the more advanced features of regex.

Character classes

Character classes are used to match a set of characters. For example, the regex "[abc]" will match the characters "a", "b", or "c".

We can also use character classes to match a range of characters. For example, the regex "[a-z]" will match any single lowercase ASCII letter.

We can also use character classes to match a set of characters that we want to exclude. For example, the regex "[^abc]" will match any character that is not "a", "b", or "c".
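As a small illustration of the three kinds of character class, the sketch below runs each of them over the same made-up sample string with re.findall(), which returns every non-overlapping match.

```python
import re

text = "Cabx 42!"

# A set: any one of "a", "b" or "c"
in_set = re.findall(r"[abc]", text)
# A range: any lowercase ASCII letter
lowercase = re.findall(r"[a-z]", text)
# A negated set: any character that is not "a", "b" or "c"
not_in_set = re.findall(r"[^abc]", text)

print(in_set)       # ['a', 'b']
print(lowercase)    # ['a', 'b', 'x']
print(not_in_set)   # ['C', 'x', ' ', '4', '2', '!']
```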

Shorthand character classes

There are a number of shorthand character classes that we can use in regex. These shorthand character classes are:

  • \d: Matches any digit
  • \D: Matches any non-digit
  • \w: Matches any word character
  • \W: Matches any non-word character
  • \s: Matches any whitespace character
  • \S: Matches any non-whitespace character

For example, the regex "\d\d\d" will match any three consecutive digits, such as "123".
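The shorthand classes can be sketched in Python as follows. The sample text is invented, and the patterns are written as raw strings (r"...") so that Python does not interpret the backslashes itself.

```python
import re

text = "Room 101, floor 3"

digits = re.findall(r"\d", text)        # every single digit
words = re.findall(r"\w+", text)        # runs of word characters
parts = re.split(r"\s+", text)          # split on runs of whitespace
three_digits = re.search(r"\d\d\d", text).group()

print(digits)         # ['1', '0', '1', '3']
print(words)          # ['Room', '101', 'floor', '3']
print(parts)          # ['Room', '101,', 'floor', '3']
print(three_digits)   # 101
```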

Quantifiers

Quantifiers are used to specify how many times the preceding character or group can occur. For example, the regex "a+" will match the character "a" one or more times.

The most common quantifiers are:

  • ?: Matches zero or one occurrences
  • *: Matches zero or more occurrences
  • +: Matches one or more occurrences
  • {n}: Matches exactly n occurrences
  • {n,}: Matches n or more occurrences
  • {,m}: Matches 0 to m occurrences
  • {n,m}: Matches at least n and at most m occurrences
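To see how the brace quantifiers behave, the sketch below checks a few sample words against each pattern. The helper matching() is hypothetical and exists only for this example; it uses re.fullmatch() (Python 3.4 and later) so that the entire word has to satisfy the quantifier.

```python
import re

words = ["a", "aa", "aaa", "aaaa"]

def matching(pattern):
    """Return the sample words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

exactly_two = matching(r"a{2}")
two_or_more = matching(r"a{2,}")
up_to_two = matching(r"a{,2}")
two_to_three = matching(r"a{2,3}")

print(exactly_two)    # ['aa']
print(two_or_more)    # ['aa', 'aaa', 'aaaa']
print(up_to_two)      # ['a', 'aa']
print(two_to_three)   # ['aa', 'aaa']
```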

Greedy and non-greedy matching

By default, regex is "greedy", meaning that it will try to match as much of the text as possible. For example, the regex "a+" will match the text "aaaa" as "aaaa", rather than stopping after the first "a".

We can make regex "non-greedy" by adding the "?" character after the quantifier. A non-greedy quantifier matches as little text as possible. For example, the regex "a+?" will match the text "aaaa" as "a".
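The difference is easiest to see side by side. In the sketch below, the HTML-like string is just a made-up sample; the greedy pattern runs to the last ">" it can reach, while the non-greedy one stops at the first.

```python
import re

# Greedy: take as many "a" characters as possible
greedy = re.match(r"a+", "aaaa").group()
# Non-greedy: take as few as possible (at least one for "+")
lazy = re.match(r"a+?", "aaaa").group()

html = "<b>bold</b>"
greedy_tag = re.search(r"<.+>", html).group()
lazy_tag = re.search(r"<.+?>", html).group()

print(greedy)       # aaaa
print(lazy)         # a
print(greedy_tag)   # <b>bold</b>
print(lazy_tag)     # <b>
```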

Capturing groups

Capturing groups are used to group characters together for the purposes of backreferencing. Backreferencing is used to match the same text that was matched by a previous capturing group.

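For example, the regex "(\d\d\d)-(\d\d\d\d)" will match the text "123-4567". Inside the regex, the backreference "\1" matches the same text that the first capturing group matched, and "\2" does the same for the second capturing group. After a successful match, the captured text is also available from the match object as group(1) and group(2).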

We can also use named capturing groups. Named capturing groups are specified with the syntax "(?P<name>...)". For example, the regex "(?P<area_code>\d\d\d)-(?P<number>\d\d\d\d)" will match the text "123-4567". We can then refer to the groups by name: inside the regex, "(?P=area_code)" is a backreference to the first capturing group, and on the match object, group("area_code") and group("number") return the captured text.
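A brief sketch of numbered groups, named groups, and a backreference follows. The repeated-word pattern and its sample sentence are illustrative additions and are not part of the phone-number example.

```python
import re

# Numbered groups are read from the match object with group(n)
m = re.match(r"(\d\d\d)-(\d\d\d\d)", "123-4567")
print(m.group(1), m.group(2))       # 123 4567

# Named groups can be read by name, or all at once with groupdict()
named = re.match(r"(?P<area_code>\d\d\d)-(?P<number>\d\d\d\d)", "123-4567")
print(named.group("area_code"))     # 123
print(named.groupdict())            # {'area_code': '123', 'number': '4567'}

# A backreference: \1 must repeat exactly what group 1 captured
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")
print(repeated.group())             # is is
```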

Lookahead and lookbehind

Lookahead and lookbehind assertions check the text that comes immediately after or before the current position, without including that text in the match.

For example, the regex "(?<=abc)def" will match the text "def" only if it is preceded by the text "abc".

The most common lookahead and lookbehind assertions are:

  • (?=...) : Positive lookahead
  • (?!...) : Negative lookahead
  • (?<=...) : Positive lookbehind
  • (?<!...) : Negative lookbehind
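The four assertions can be compared on a single string. The sketch below uses an invented price string; in each case only the digits are returned, because the text inside the assertion is checked but not consumed.

```python
import re

text = "price: $40, cost: 30 USD"

# Positive lookbehind: digits preceded by "$"
after_dollar = re.findall(r"(?<=\$)\d+", text)
# Negative lookbehind: whole numbers not preceded by "$"
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)
# Positive lookahead: digits followed by " USD"
before_usd = re.findall(r"\d+(?= USD)", text)
# Negative lookahead: whole numbers not followed by " USD"
not_before_usd = re.findall(r"\b\d+\b(?! USD)", text)

print(after_dollar)       # ['40']
print(not_after_dollar)   # ['30']
print(before_usd)         # ['30']
print(not_before_usd)     # ['40']
```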

Flags

Flags are used to modify the behaviour of regex. The most common flags are:

  • re.I : Makes the regex case-insensitive
  • re.M : Makes the ^ and $ metacharacters match the start and end of each line, rather than the start and end of the string
  • re.S : Makes the . metacharacter match any character, including newlines
  • re.U : Makes the \w, \W, \b, and \B metacharacters follow Unicode rules (in Python 3 this is already the default for text patterns, so the flag mainly matters in Python 2)
  • re.X : Allows you to use whitespace and comments in your regex

Flags can be specified when compiling a regex using the flags argument:

regex = re.compile("pattern", flags=re.I)

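Flags can also be specified when using the module-level match() function, which takes the pattern as its first argument. The match() method of a compiled regex object does not accept a flags argument:

result = re.match("pattern", "text", flags=re.I)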
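To round off, here is a short sketch of several flags in action. All patterns and sample strings are illustrative; the last pattern uses re.X so that it can be spread over several lines with comments.

```python
import re

# Flag given when compiling: the regex object remembers it
regex = re.compile("hello", flags=re.I)
case_insensitive = regex.match("HELLO world") is not None

# Flag given to a module-level function together with the pattern
module_level = re.match("hello", "HELLO", flags=re.I) is not None

# re.M lets "^" match at the start of every line
lines = "one\ntwo\nthree"
first_only = re.findall(r"^\w+", lines)
every_line = re.findall(r"^\w+", lines, flags=re.M)

# re.S lets "." match a newline as well
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", flags=re.S)

# re.X ignores whitespace in the pattern and allows comments
phone = re.compile(r"""
    (\d{3})   # first three digits
    -         # separator
    (\d{4})   # last four digits
""", flags=re.X)
parts = phone.match("123-4567").groups()

print(case_insensitive, module_level)   # True True
print(first_only)                       # ['one']
print(every_line)                       # ['one', 'two', 'three']
print(dot_default, dot_all is not None) # None True
print(parts)                            # ('123', '4567')
```

Several flags can be combined with the | operator, for example flags=re.I | re.M.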