Regular Expressions Simple Guide

A regular expression is a pattern that “matches” strings that have a particular form.

A common operation when editing text is to search for a given string of characters, sometimes with the purpose of replacing it with another string. Many "search and replace" facilities have the option of using regular expressions instead of simple strings of characters. You can think of a regular expression as a pattern that matches certain strings, namely all the strings in the language described by the regular expression. When a regular expression is used in a search operation, the goal is to find a string that matches the expression. This type of pattern matching is very useful.

In regular expressions, the alphabet usually includes all the characters on the keyboard. This leads to a problem, because regular expressions actually use two types of symbols: symbols that are members of the alphabet, and special symbols such as * and ) that are used to construct expressions. These special symbols, which are not part of the language being described but are used in the description, are called metacharacters.

Metacharacters

There are 12 special characters (also called metacharacters), each with its own special meaning.

S.No.  Metacharacter  Name                    Meaning
1      [              Opening Square Bracket  Starts a character class
2      ]              Closing Square Bracket  Ends a character class
3      \              Backslash               Starts an escape sequence
4      ^              Caret                   Matches the position at the start of the string the regex is applied to
5      $              Dollar Sign             Matches the position at the end of the string the regex is applied to
6      .              Period / Dot            Matches any single character
7      |              Pipe Symbol             Or; matches either the left side or the right side of the symbol
8      ?              Question Mark           Repeats the previous item zero or one time (previous item is optional)
9      *              Asterisk                Repeats the previous item zero or more times
10     +              Plus Sign               Repeats the previous item one or more times
11     (              Opening Round Bracket   Starts a group
12     )              Closing Round Bracket   Ends a group
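
As a rough illustration of the three repetition metacharacters in the table, the sketch below uses Python's standard re module. The pattern names and the helper full are made up for this example; they are not part of any library.

```python
import re

# Three hypothetical patterns that differ only in the quantifier after "b".
optional = re.compile(r"ab?c")   # b zero or one time
star = re.compile(r"ab*c")       # b zero or more times
plus = re.compile(r"ab+c")       # b one or more times


def full(pattern, text):
    """Return True when the pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None


# Print one row per test string: text, then the result for ?, * and +.
for text in ["ac", "abc", "abbc"]:
    print(text, full(optional, text), full(star, text), full(plus, text))
```

Only ab?c rejects "abbc" (too many b's), and only ab+c rejects "ac" (no b at all).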

Escape Sequence

An escape sequence indicates that you want to use one of the metacharacters as a literal. In a regular expression, an escape sequence is formed by placing the metacharacter \ (backslash) in front of the metacharacter you want to treat literally.

The backslash escapes special characters to suppress their special meaning. For example, + (plus sign) has a special meaning, but \+ matches a literal + character.
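
A minimal sketch of the difference, assuming Python's re module; the sample string and variable names are chosen only for illustration.

```python
import re

text = "1+1=2"

# Escaped: \+ stands for the literal plus sign.
literal_plus = re.findall(r"1\+1", text)

# Unescaped: 1+ means "one or more 1s", so the pattern 1+1 needs
# at least two consecutive 1s, which this text does not contain.
unescaped = re.findall(r"1+1", text)

# re.escape adds the backslashes for you when the text comes from elsewhere.
escaped_pattern = re.escape("1+1")

print(literal_plus)     # ['1+1']
print(unescaped)        # []
print(escaped_pattern)  # 1\+1
```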

Square Brackets / Character Class

To make it easier to deal with the large number of characters in the alphabet, character classes are introduced. A character class consists of a list of characters enclosed between brackets, [ and ]. A character class matches a single character, which can be any of the characters in the list. For example, [0123456789] matches any one of the digits 0 through 9. The same thing could be expressed as (0|1|2|3|4|5|6|7|8|9).

For convenience, a hyphen can be included in a character class to indicate a range of characters. This means that [0123456789] could also be written as [0-9] and that the regular expression [a-z] will match any single lowercase letter. A character class can include multiple ranges, so that [a-zA-Z] will match any letter, lowercase or uppercase.

[ ]: Matches any one of the characters listed inside the square brackets, at a single character position. For example:

  • [ab] matches a single character that is either a or b
  • [a-d] matches a single character that is one of the lowercase letters a through d
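
The character classes discussed in this section can be tried out with a short sketch like the following. It assumes Python's re module, and the sample text is arbitrary.

```python
import re

text = "abcd 2024 XYZ"

# Each class matches exactly one character per match,
# so findall returns a list of single characters.
either_a_or_b = re.findall(r"[ab]", text)
a_through_d = re.findall(r"[a-d]", text)
digits = re.findall(r"[0-9]", text)
letters = re.findall(r"[a-zA-Z]", text)

print(either_a_or_b)  # ['a', 'b']
print(a_through_d)    # ['a', 'b', 'c', 'd']
print(digits)         # ['2', '0', '2', '4']
print(letters)        # ['a', 'b', 'c', 'd', 'X', 'Y', 'Z']
```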

Negated Character Classes

^ means "not the following" when inside and at the start of [ ].

Typing a caret after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class. 

  • [^abc] matches any single character other than a, b or c

It is important to note that a negated character class still must match a character. For example, q[^u] does not mean: "q not followed by u". It means: "q followed by a character that is not u".
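
The point that a negated class must still consume a character can be checked with a small sketch, assuming Python's re module. The sample strings are illustrative only.

```python
import re

pattern = re.compile(r"q[^u]")

# "q" is the last character, so there is nothing left for [^u] to match.
at_end = pattern.search("Iraq")

# Here "q" is followed by a space, which is a character that is not "u".
followed_by_space = pattern.search("Iraq is")

# "q" is followed by "u", so the negated class refuses it.
followed_by_u = pattern.search("quit")

print(at_end)                           # None
print(repr(followed_by_space.group()))  # 'q '
print(followed_by_u)                    # None
```

Note that the successful match includes the space: the negated class matched it and made it part of the result.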

Repeating Character Classes

If you repeat a character class by using the ?, * or + operators, you will repeat the entire character class, and not just the character that it matched. For example, the regex [0-9]+ can match 837 as well as 222.

If you want to repeat the matched character, rather than the class, you will need to use back references. For example, ([0-9])\1+ will match 222 but not 837. When applied to the string 833337, it will match 3333 in the middle of this string.
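
The contrast between repeating the class and repeating the matched character can be sketched as follows, assuming Python's re module, where \1 inside the pattern refers back to the first group.

```python
import re

any_digits = re.compile(r"[0-9]+")        # repeats the class
same_digit = re.compile(r"([0-9])\1+")    # repeats the captured digit

# Repeating the class accepts any mix of digits.
class_837 = any_digits.fullmatch("837") is not None
class_222 = any_digits.fullmatch("222") is not None

# Repeating the back reference accepts only runs of one digit.
same_837 = same_digit.fullmatch("837") is not None
same_222 = same_digit.fullmatch("222") is not None

# Searching inside a longer string finds the run in the middle.
run = same_digit.search("833337").group()

print(class_837, class_222)  # True True
print(same_837, same_222)    # False True
print(run)                   # 3333
```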

Caret and Dollar

In most implementations, the metacharacter ^ can be used in a regular expression to match the beginning of a line of text, so that the expression ^[a-zA-Z]+ will only match a word that occurs at the start of a line. Similarly, $ is used as a metacharacter to match the end of a line.
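
How ^ and $ behave depends on the implementation. As one example, in Python's re module they anchor to the start and end of the whole string by default, and to each line only when the re.MULTILINE flag is given. The sketch below assumes that module; the sample text is made up.

```python
import re

text = "alpha beta\ngamma delta"

# Default mode: ^ matches only at the very start of the string.
first_word_of_string = re.findall(r"^[a-zA-Z]+", text)

# Multiline mode: ^ also matches right after each newline.
first_word_of_each_line = re.findall(r"^[a-zA-Z]+", text, re.MULTILINE)

# Multiline mode: $ matches at the end of each line.
last_word_of_each_line = re.findall(r"[a-zA-Z]+$", text, re.MULTILINE)

print(first_word_of_string)     # ['alpha']
print(first_word_of_each_line)  # ['alpha', 'gamma']
print(last_word_of_each_line)   # ['beta', 'delta']
```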

Back References / Parentheses

When regular expressions are used in search-and-replace operations, a regular expression is used for the search pattern. A search is made in a (typically long) string for a substring that matches the pattern, and then the substring is replaced by a specified replacement pattern. The replacement pattern is not used for matching and is not a regular expression. However, it can be more than just a simple string. It’s possible to include parts of the substring that is being replaced in the replacement string.

In many implementations, the notations \0, \1, ... , \9 are used for this purpose, although the exact syntax varies between tools. The first of these, \0, stands for the entire substring that is being replaced. The others are only available when parentheses are used in the search pattern. The notation \1 stands for "the part of the substring that matched the part of the search pattern beginning with the first ( in the pattern and ending with the matching )." Similarly, \2 represents whatever matched the part of the search pattern between the second pair of parentheses, and so on.
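
A minimal search-and-replace sketch, assuming Python's re module. Python accepts \1 and \2 in the replacement string, but it writes the whole match as \g<0> rather than \0, so that form is used here. The sample strings are invented for the example.

```python
import re

# Swap "Last, First" into "First Last" using the two parenthesized groups.
swapped = re.sub(r"([a-zA-Z]+), ([a-zA-Z]+)", r"\2 \1", "Doe, Jane")

# Wrap every run of digits in square brackets; \g<0> is the entire match.
wrapped = re.sub(r"[0-9]+", r"[\g<0>]", "call 555 now")

print(swapped)  # Jane Doe
print(wrapped)  # call [555] now
```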

Examples

gray|grey

It can match "gray" or "grey".

gr(a|e)y

It can match "gray" or "grey".

gr[a|e]y

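It can match "gray" or "grey", but it also matches "gr|y": inside a character class the pipe symbol is a literal character, not the "or" operator. To match only "gray" or "grey", write gr[ae]y.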

colou?r

It matches "color" or "colour". (zero or one occurrence of u)

.at

It matches any three-character string ending with at. For example: "hat", "cat", "bat", "rat", "fat"

[bc]at

It matches "bat" and "cat". (any single character listed inside the square brackets)
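
The examples above can be checked with a short sketch, assuming Python's re module. The helper matches is a made-up name that tests whether a pattern matches a whole string. One detail worth seeing in practice: gr[a|e]y also accepts "gr|y", because the pipe is an ordinary character inside square brackets.

```python
import re


def matches(pattern, text):
    """Return True when the pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


print(matches(r"gray|grey", "grey"))   # True
print(matches(r"gr(a|e)y", "gray"))    # True
print(matches(r"gr[a|e]y", "grey"))    # True
print(matches(r"gr[a|e]y", "gr|y"))    # True: | is literal inside [ ]
print(matches(r"gr[ae]y", "gr|y"))     # False
print(matches(r"colou?r", "color"))    # True
print(matches(r"colou?r", "colour"))   # True
print(matches(r".at", "hat"))          # True
print(matches(r"[bc]at", "cat"))       # True
print(matches(r"[bc]at", "rat"))       # False
```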

How a Regex Engine Works

The order of the characters inside a character class does not matter. Consider how the engine applies gr[ae]y, a regex that can match both "gray" and "grey", to the string "Is her hair grey or gray?". It will match grey, because that is the leftmost match.

Nothing noteworthy happens for the first twelve characters in the string. The engine fails to match "g" at each of these positions and continues with the next character in the string. When the engine arrives at the 13th character, "g" is matched. The engine then tries to match the remainder of the regex with the text. The next token in the regex is the literal "r", which matches the next character in the text. The third token, [ae], is then attempted at the following character in the text ("e").

The character class gives the engine two options: match "a" or match "e". It first attempts to match "a", and fails. Before the engine can decide that the regex cannot be matched with the text starting at character 13, it must try all the other permutations of the regex pattern, so it continues with the other option and finds that "e" matches. The last regex token is "y", which matches the following character as well. The engine has now found a complete match with the text starting at character 13. It returns "grey" as the match result and looks no further. The leftmost match was returned, even though "a" comes first in the character class and "gray" could have been matched later in the string. The engine simply did not get that far, because another equally valid match was found to the left of it.
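
The leftmost-match behaviour described above can be observed with a brief sketch, assuming Python's re module. Python counts positions from 0, so the 13th character has index 12.

```python
import re

text = "Is her hair grey or gray?"

# search stops at the first (leftmost) match it finds.
first = re.search(r"gr[ae]y", text)

# Reversing the order inside the class changes nothing.
first_reversed = re.search(r"gr[ea]y", text)

# findall keeps going and reports every match, left to right.
all_matches = re.findall(r"gr[ae]y", text)

print(first.group(), first.start())                    # grey 12
print(first_reversed.group(), first_reversed.start())  # grey 12
print(all_matches)                                     # ['grey', 'gray']
```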