Sunday, November 10, 2019

All you need to know about Regular Expressions

Part II: A short tutorial in Python and the backslash plague.

RIA KULSHRESTHA
Mar 5 · 5 min read
Assuming you know what Regular Expressions are (in case you don’t, check out Part 1 of this tutorial for a quick overview) we’ll now learn how to use them in Python. :)
The ‘re’ module provides an interface to the regular expression engine, and allows us to compile REs into objects and then perform matches with them.
We’ll start by importing the module. Then we will compile a regular expression, passed as a string, into a pattern object.
>>> import re
>>> pat_obj = re.compile('[a-z]+')
>>> print(pat_obj)
re.compile('[a-z]+')

What to do with pattern objects?

  • match(): Determine if the RE matches at the beginning of the string.
>>> m = pat_obj.match('helloworld')
>>> print(m)
<_sre.SRE_Match object; span=(0, 10), match='helloworld'>
# Note how the match stops at the first whitespace character.
>>> m = pat_obj.match('hello world')
>>> print(m)
<_sre.SRE_Match object; span=(0, 5), match='hello'>
# Note that it is case-sensitive.
>>> m = pat_obj.match('Helloworld')
>>> print(m)
None
# To ignore case
>>> pat_obj = re.compile('[a-z]+', re.IGNORECASE)
>>> m = pat_obj.match('Helloworld')
>>> print(m)
<_sre.SRE_Match object; span=(0, 10), match='Helloworld'>
  • search(): Scan through a string, looking for any location where this RE matches.
# Back to the case-sensitive pattern
>>> pat_obj = re.compile('[a-z]+')
# Note how it only returns the first match
>>> s = pat_obj.search('Hello World!')
>>> print(s)
<_sre.SRE_Match object; span=(1, 5), match='ello'>
  • To print all the matches,
    findall(): Find all substrings where the RE matches and return them as a list.
>>> s = pat_obj.findall('Hello World!')
>>> print(s)
['ello', 'orld']
# To find all the numbers in a string
>>> pat_obj_num = re.compile(r'\d+')
>>> pat_obj_num.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']
  • group(): Returns the string matched by the RE. Because honestly, that’s what you’re interested in. Ain’t nobody got time for all that information.
# Using group with search
>>> s = pat_obj.search('Hello World!')
>>> print(s)
<_sre.SRE_Match object; span=(1, 5), match='ello'>
>>> print(s.group())
ello
# Using group with match
>>> m = pat_obj.match("hello world")
>>> print(m)
<_sre.SRE_Match object; span=(0, 5), match='hello'>
>>> print(m.group())
hello
# Using group with findall
>>> m = pat_obj.findall("hello world")
>>> print(m)
['hello', 'world']
# Error! findall() returns a list of strings, not a match object
>>> print(m.group())
Traceback (most recent call last):
  ...
AttributeError: 'list' object has no attribute 'group'
  • span(): Returns a tuple containing the (start, end) positions of the match.
    start(), end(): Return the starting and ending position of the match respectively.
>>> pat_obj = re.compile('[a-z]+', re.IGNORECASE)
>>> m = pat_obj.match('Helloworld')
>>> print(m)
<_sre.SRE_Match object; span=(0, 10), match='Helloworld'>
>>> print(m.start())
0
>>> print(m.end())
10
>>> print(m.span())
(0, 10)
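Since match() and search() return None when nothing matches, calling group() on the result without a check can fail. Here is a minimal sketch of the usual guard. The helper name first_word is made up for illustration and is not part of the re module.

```python
import re

# Same case-sensitive pattern as in the examples above.
pat_obj = re.compile('[a-z]+')

def first_word(text):
    """Return the first run of lowercase letters in text, or None.

    first_word is a hypothetical helper for illustration only.
    """
    m = pat_obj.search(text)
    # search() returns None when there is no match, so check before
    # calling group(), start(), end() or span() on the result.
    if m is None:
        return None
    return m.group()

print(first_word('Hello World!'))  # ello
print(first_word('12345'))         # None

# findall() returns a plain list of strings, so loop over it
# instead of calling group() on it.
for word in pat_obj.findall('hello world'):
    print(word)
```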

Grouping

Groups are marked by the ( ) metacharacters. They group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier, such as *, +, ? or {m,n}.
Groups are numbered starting with 0. Group 0 is always present; it’s the whole RE, so match object methods all have group 0 as their default argument.
Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.
>>> pat_obj_group = re.compile('(a(b)c(d))e')
>>> m = pat_obj_group.match('abcde')
>>> print(m)
<_sre.SRE_Match object; span=(0, 5), match='abcde'>
# Note that m.group(0) returns the same string as m.group()
>>> print(m.group(0))
abcde
>>> print(m.group(1))
abcd
# Note the number is determined left to right
>>> print(m.group(2))
b
>>> print(m.group(3))
d
# Note that multiple arguments can be passed to group()
>>> print(m.group(2,1,3))
('b', 'abcd', 'd')
  • groups(): returns a tuple containing the strings for all the subgroups, from 1 up to however many there are.
>>> print(m.groups())
('abcd', 'b', 'd')
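Groups are handy for pulling fields out of structured text. As a rough sketch, assuming a made-up date format of YYYY-MM-DD (the pattern and the name date_pat are illustrative, not from the original tutorial):

```python
import re

# Hypothetical pattern: three groups for year, month and day.
date_pat = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

m = date_pat.match('2019-11-10')
if m:
    print(m.group(0))   # 2019-11-10 (the whole match)
    print(m.group(1))   # 2019 (first subgroup)
    # groups() returns all subgroups, so they can be unpacked directly.
    year, month, day = m.groups()
    print(year, month, day)  # 2019 11 10
```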

Substitution

sub(): Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, the string is returned unchanged.
repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth.
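Empty matches for the pattern are replaced only when not adjacent to a previous empty match. (The output of the last example below is from Python 3.6 or earlier; since Python 3.7, an empty match adjacent to a previous non-empty match is also replaced, so the same call returns -a-b--d-.)
>>> print(re.sub('x','-','abxd'))
ab-d
>>> print(re.sub('ab*','-','abxd'))
-xd
>>> print(re.sub('x*','-','abxd'))
-a-b-d-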
The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced.
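To make the count argument and the function form of repl concrete, here is a small sketch. The sample text and the helper name double are made up for illustration.

```python
import re

text = 'a1b22c333'

# count=1 replaces only the first (leftmost) match.
print(re.sub(r'\d+', '#', text, count=1))  # a#b22c333

# With count omitted, every match is replaced.
print(re.sub(r'\d+', '#', text))           # a#b#c#

def double(match):
    # Hypothetical helper: a repl function receives a match object
    # and must return the replacement string.
    return str(int(match.group()) * 2)

print(re.sub(r'\d+', double, text))        # a2b44c666
```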

The Backslash Plague

Passing regular expressions as plain strings keeps things simple, but it has one disadvantage. Regular expressions use the backslash character (‘\’) to let special characters be used without invoking their special meaning. This conflicts with Python’s use of the same character in string literals, where it changes how the following character is interpreted.
For example, ‘n’ by itself is simply a letter, but when you precede it with a backslash, it becomes \n, which is the newline character. Uh oh!
Let’s say you want to write a RE that matches the string ‘\section’, which might be found in a LaTeX file.
We’ll start with the desired string to be matched. Next, we must escape any backslashes and other metacharacters by preceding them with a backslash, so the string that must be passed to re.compile() is ‘\\section’. However, to express this as a Python string literal, both backslashes must be escaped again, resulting in the string ‘\\\\section’.
In short, to match a literal backslash, one has to write ‘\\\\’ as the RE string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.
The solution is to use Python’s raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with ‘r’, so r'\n' is a two-character string containing ‘\’ and ‘n’, while '\n' is a one-character string containing a newline.
Regular string and corresponding Raw string
"ab*" -> r"ab*"
"\\\\section" -> r"\\section"
"\\w+\\s+\\1" -> r"\w+\s+\1"

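The difference between regular and raw strings is easy to check in code. This is a short sketch. The sample line is an assumed piece of LaTeX input, not taken from the original tutorial.

```python
import re

# A raw string keeps the backslash; a regular string turns \n into a newline.
print(len(r'\n'))   # 2
print(len('\n'))    # 1

# Both literals below produce the same pattern string.
print('\\\\section' == r'\\section')   # True

# A line as it might appear in a LaTeX file (one literal backslash).
line = r'\section{Introduction}'
m = re.match(r'\\section', line)
print(m.group())    # \section
```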
Fun tools and sources to learn Regex

Towards Data Science

Sharing concepts, ideas, and codes.
