Wednesday 16 November 2016

What Are Regular Expressions? How to Use Regular Expressions in Python

Python's regular expression module is named "re". [Source: https://pymotw.com/2/re/]

What is a Regular Expression? (re)
  • Regular expressions are commonly abbreviated as regex or regexp.
  • They are mainly used for matching text patterns.
  • A large number of parsing problems are easier to solve with a regular expression than by creating a special-purpose lexer and parser.
  • Expressions can include literal text matching, repetition, pattern composition, branching, and other sophisticated rules.
  • Unix tools such as sed, grep, and awk use regular expressions internally to find a particular pattern.
How to use regular expressions in Python?
Step 1: Import the regular expression module:  import re
Step 2: Design the regular expression to be used for your application.
Step 3: Use the appropriate re function to parse the text or string.
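
The three steps above can be sketched as follows. The pattern and the sample text are made up for illustration and do not come from the original source.

```python
import re  # Step 1: import the regular expression module

# Step 2: design the pattern (hypothetical example: one or more digits)
pattern = r"\d+"

# Step 3: apply a suitable re function to the text
text = "Order 66 shipped in 3 days"  # hypothetical sample text
numbers = re.findall(pattern, text)
print(numbers)  # ['66', '3']
```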

What are raw Python strings?

When writing regular expressions in Python, it is recommended that you use raw strings instead of regular Python strings. Raw strings begin with a special prefix (r) and tell Python not to interpret backslashes as escape sequences, so they pass through unchanged to the regular expression engine.
This means that a pattern like r"\n\w" reaches the regular expression engine exactly as written, and you do not need to double every backslash as in "\\n\\w", which is much easier to read.
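
A small sketch of the difference between regular and raw strings. The decimal-number pattern and the sample sentence are assumptions chosen only for illustration.

```python
import re

# A normal string turns "\n" into one newline character;
# a raw string keeps the backslash and the letter n as two characters.
normal = "\n"
raw = r"\n"
print(len(normal), len(raw))  # 1 2

# Without the r prefix, each backslash must be doubled to get the same pattern.
same_pattern = (r"\d+\.\d+" == "\\d+\\.\\d+")
print(same_pattern)  # True

# Hypothetical pattern: a decimal number such as 3.14
match = re.search(r"\d+\.\d+", "pi is about 3.14")
print(match.group())  # 3.14
```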

Regular expression methods:
    The three functions used most often with regular expressions are:
  1.  re.match()
  2.  re.search()
  3.  re.findall()
re.match() - Matches at the Beginning

It matches the pattern only at the beginning of the string.

Usage:
re.match(pattern, string, flags=0)
      If zero or more characters at the beginning of string match the regular expression pattern, re.match() returns a corresponding match object. It returns None if the string does not match the pattern; note that this is different from a zero-length match.

    Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

Examples:
 re.match(r'c', 'abcdef')            >> No match
 re.match(r'cat', 'dog cat dog')     >> No match
 re.match(r'dog', 'dog cat dog')     >> Match


From the above examples, it is evident that re.match() tries to find the given pattern at the beginning of the string. If the pattern matches there, it returns a match object; otherwise it returns None.
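
The return values can be checked directly, including the zero-length case mentioned in the usage notes. This is a minimal sketch; the "x*" pattern is an assumption picked to force an empty match.

```python
import re

# Pattern found at the start: a match object is returned.
m = re.match(r"dog", "dog cat dog")
print(m.group(), m.start(), m.end())  # dog 0 3

# Pattern exists later in the string, but not at the start: None.
no_match = re.match(r"cat", "dog cat dog")
print(no_match)  # None

# "x*" allows zero repetitions, so it matches the empty string at position 0.
# This zero-length match is still a match object, not None.
empty = re.match(r"x*", "abc")
print(empty is not None, repr(empty.group()))  # True ''
```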

re.search() - Matches Anywhere

The search() method is similar to match(), but search() doesn’t restrict us to finding matches only at the beginning of the string, so searching for ‘cat’ in the example string below finds a match:

Examples #1:
re.search(r'cat', 'dog cat dog')     >> Match


Usage:
 re.search(pattern, string, flags=0)
      Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

Tip:
If you want to locate a match anywhere in the string, use search() instead of match().


Examples #2:
 re.search(r'c', "abcdef")     >> Match

Regular expressions beginning with '^' can be used with search() to restrict the match to the beginning of the string:

Examples #3:
 re.search("^c", "abcdef")  >> No Match
 re.search("^a", "abcdef")  >>  Match
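
A runnable version of the comparison, as a rough sketch. It uses the same "abcdef" string as the examples; the variable names are made up for illustration.

```python
import re

text = "abcdef"

# search() scans the whole string; match() only tries position 0.
found_by_search = re.search(r"c", text)
found_by_match = re.match(r"c", text)
print(found_by_search.start())  # 2
print(found_by_match)           # None

# Anchoring with ^ makes search() behave like match() on single-line text.
anchored_c = re.search(r"^c", text)
anchored_a = re.search(r"^a", text)
print(anchored_c)          # None
print(anchored_a.group())  # a
```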


How about multi-line matches?
Even in MULTILINE mode, re.match() only matches at the beginning of the string, not at the beginning of each line.
In contrast, search() with a regular expression beginning with '^' and the re.MULTILINE flag will match at the beginning of each line.

re.match('X', 'A\nB\nX', re.MULTILINE)    # No match
re.search('^X', 'A\nB\nX', re.MULTILINE)   # Match
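
The effect of the flag can be seen by running the same pattern with and without re.MULTILINE. This is a sketch using the same three-line string; the findall() call at the end is an extra illustration, not part of the original examples.

```python
import re

text = "A\nB\nX"  # three lines: A, B, X

# match() ignores line starts even with re.MULTILINE.
match_result = re.match(r"X", text, re.MULTILINE)
print(match_result)  # None

# With re.MULTILINE, ^ also matches right after every newline.
multi = re.search(r"^X", text, re.MULTILINE)
print(multi.start())  # 4

# Without the flag, ^ matches only at the very start of the string.
single = re.search(r"^X", text)
print(single)  # None

# findall() with ^ and re.MULTILINE picks up the first character of each line.
line_starts = re.findall(r"^\w", text, re.MULTILINE)
print(line_starts)  # ['A', 'B', 'X']
```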

re.findall() - All Matching Substrings
re.findall() returns a list of all substrings in the string that match the pattern.
>>> re.findall(r'dog', 'dog cat dog')   # Returns the list of strings that match the pattern
  ['dog', 'dog']
>>> re.findall(r'cat', 'dog cat dog')
  ['cat']
Note:
re.search() and re.match() return a single match object (or None), covering only the first match.
re.findall(), in contrast, returns all of the substrings of the input that match the pattern, without overlapping.
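
The difference in return types, and what "without overlapping" means, can be sketched as below. The "aaaa" and "bird" inputs are assumptions added for illustration.

```python
import re

# search() stops at the first match and returns one match object.
first = re.search(r"dog", "dog cat dog")
print(first.group(), first.start())  # dog 0

# findall() returns every match as a list of strings.
all_dogs = re.findall(r"dog", "dog cat dog")
print(all_dogs)  # ['dog', 'dog']

# Matches do not overlap: "aa" fits into "aaaa" at positions 0 and 2 only.
pairs = re.findall(r"aa", "aaaa")
print(pairs)  # ['aa', 'aa']

# No match gives an empty list rather than None.
none_found = re.findall(r"bird", "dog cat dog")
print(none_found)  # []
```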

re.compile() : Compile the Regular Expression Pattern

Compile a regular expression pattern into a regular expression object, which can be used for matching with its match() and search() methods.

pattern="test"
string="testpatterntest"
prog = re.compile(pattern)
result = prog.match(string)



The two lines above are equivalent to:
result = re.match(pattern, string)

Using re.compile() and keeping the resulting object makes reuse of the regular expression more efficient when the expression is used several times in a single program.
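
A minimal sketch of reusing one compiled pattern across several strings. The list of input lines and the variable names are hypothetical.

```python
import re

# Compile once, outside the loop.
word_test = re.compile(r"test")

# Hypothetical inputs for illustration.
lines = ["testpatterntest", "no luck here", "a test in the middle"]

results = []
for line in lines:
    at_start = word_test.match(line) is not None   # pattern at position 0?
    anywhere = word_test.search(line) is not None  # pattern anywhere?
    count = len(word_test.findall(line))           # number of occurrences
    results.append((at_start, anywhere, count))
    print(line, at_start, anywhere, count)
```

The compiled object offers the same match(), search(), and findall() calls as the module-level functions, minus the pattern argument.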

Python program Examples

re.search() :
The basic rules of regular expression search for a pattern within a string are:
  • The search proceeds through the string from start to end, stopping at the first match found
  • All of the pattern must be matched, but not all of the string
  • If match = re.search(pat, str) is successful, match is not None and in particular match.group() is the matching text 
Program #1    : In this example, match.group() is used to print the matching text

import re
patterns = ["this", "that"]
text= "Does this text match this string"
for pattern in patterns:
  match= re.search(pattern, text)
  if match is not None:
    print( "Found Match :", match.group())
  else:
    print('no match:', pattern )
  
Output:
  Found Match : this
  no match: that
 
Program #2    : In this example, re.search() returns a match object if the pattern is found.
From the match object, you can get the start index, the end index, the input string, and the pattern, as the program shows.
  
import re
patterns = ["this", "that"]
text= "Does this text match this string"
for pattern in patterns:
  matchObject= re.search(pattern, text)
  if matchObject is not None:
    startIndex = matchObject.start()
    endIndex =   matchObject.end()
    print('Found "%s" in "%s" from %d to %d ("%s")' %
          (matchObject.re.pattern, matchObject.string, startIndex,
           endIndex, text[startIndex:endIndex]))
  else:
    print('no match:', pattern )
 
Output:
Found "this" in "Does this text match this string" from 5 to 9 ("this")
no match: that 



Program #3 
findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.   

import re
testPattern = "abc"
# Named "text" rather than "str" to avoid shadowing the built-in str type
text = "abcbbbabcbbbbabc"
listStr = re.findall(testPattern, text)
print(listStr)

Output:
['abc', 'abc', 'abc'] 
 
Program #4 
findall() - use a "for" loop to display the matched strings.
 
import re
testPattern = "abc"
# Named "text" rather than "str" to avoid shadowing the built-in str type
text = "abcbbbabcbbbbabc"
listStr = re.findall(testPattern, text)
for match in listStr:
    print("The Match string :", match)

Output: 
The Match string : abc
The Match string : abc
The Match string : abc


Program #5
findall() - With Files 
For files, you may be in the habit of writing a loop to iterate over the
lines of the file, and you could then call findall() on each line. 
Instead, let findall() do the iteration for you -- much better! Just 
feed the whole file text into findall() and let it return a list of all 
the matches in a single step (recall that f.read() returns the whole 
text of a file in a single string):  
 
import re
testPattern = "ab"
# Open the file; the with block closes it automatically
with open('testfile.txt', 'r') as fp:
    # Feed the file text into findall();
    # it returns a list of all the found strings
    listStr = re.findall(testPattern, fp.read())
print(listStr)
 
Output (depends on the contents of testfile.txt):
['ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab'] 
 
Program #6
 Use finditer() rather than findall() 
finditer() returns an iterator that produces Match instances 
instead of the strings returned by findall().

import re
testPattern="muni" 
text="munixxxmungggmunixxxmuniaaamuni" 
for matchIter in re.finditer(testPattern, text):
    startIndex= matchIter.start()
    endIndex= matchIter.end()
    print("StartIndex:", startIndex, "EndIndex:", endIndex, matchIter.group())

Output:
StartIndex: 0 EndIndex: 4 muni
StartIndex: 13 EndIndex: 17 muni
StartIndex: 20 EndIndex: 24 muni
StartIndex: 27 EndIndex: 31 muni

 
Program #7 - re.compile
 
import re
regex_compiled_object = re.compile("this")
text= "Does this text match the pattern?"
if regex_compiled_object.search(text):
    print("found a match!")
else:
    print ("no match") 


Output: 
found a match!
