3. Regular expressions

3.1. Import

>>> import re

3.2. Usage

  • Text processing

    • Finding patterns

    • Data cleaning

  • Data validation

3.3. Functions within the re package

Function      Meaning and usage                                                        Result
re.match      Match at the start of the string: re.match(r"(\d+)\.(\d+)", "24.1632")   Match object or None
re.search     First occurrence anywhere: re.search(r'(?<=abc)def', 'abcdef')           Match object or None
re.split      Split by separator: re.split(r'\W+', 'a, b, c')                          List
re.findall    Find all occurrences: re.findall(r'\w+', "A B")                          List
re.finditer   Find all occurrences: re.finditer(r'\w+', "A B")                         Iterator of Match objects
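
The calls listed above can be tried out in one short script. This is only a sketch that reuses the example patterns from the table; the variable names are made up for illustration.

```python
import re

# re.match only matches at the start of the string; it returns a Match object or None.
m = re.match(r"(\d+)\.(\d+)", "24.1632")
print(m.groups())             # ('24', '1632')

# re.search scans the whole string and returns the first match.
s = re.search(r"(?<=abc)def", "abcdef")
print(s.group(0), s.start())  # def 3

# re.split cuts the string on every match of the separator pattern.
parts = re.split(r"\W+", "a, b, c")
print(parts)                  # ['a', 'b', 'c']

# re.findall returns a list of strings; re.finditer yields Match objects lazily.
words = re.findall(r"\w+", "A B")
spans = [w.span() for w in re.finditer(r"\w+", "A B")]
print(words, spans)           # ['A', 'B'] [(0, 1), (2, 3)]
```

Because re.match and re.search return None when nothing matches, test the result (for example with "if m is not None") before calling methods on it.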

3.4. Character classes

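Class    Meaning
.        Any character except a newline
^        Beginning of the string (of each line with re.MULTILINE)
$        End of the string (of each line with re.MULTILINE)
*        Zero or more occurrences
+        One or more occurrences
?        Zero or one occurrence
{n}      Exactly n occurrences
{n,m}    Between n and m occurrences
\d       Digit, same as [0-9]
\D       Non-digit, same as [^0-9]
\w       Word character, same as [a-zA-Z0-9_]
\W       Non-word character, same as [^a-zA-Z0-9_]
\s       Whitespace, same as [ \t\n\r\f\v]
[abc]    Any one of the characters a, b or c
[a-z]    Any character in the range a to z
()       Group

For str patterns the bracketed equivalents of \d, \w and \s hold exactly only with the re.ASCII flag; by default these classes also match Unicode digits, letters and whitespace.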
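
A few of the symbols above combined in small patterns. The sample strings and variable names are invented for illustration; treat this as a sketch, not a reference.

```python
import re

# Anchors plus a quantifier: the whole string must be exactly four digits.
year = re.compile(r"^\d{4}$")
is_year = [bool(year.match(text)) for text in ("2024", "20245")]
print(is_year)        # [True, False]

# {n,m} bounds the repetition; [a-z] is a character range; \b is a word boundary.
short_words = re.findall(r"\b[a-z]{2,3}\b", "a to the moon")
print(short_words)    # ['to', 'the']

# ? makes the preceding item optional; () captures a group.
spellings = re.findall(r"(colou?r)", "color and colour")
print(spellings)      # ['color', 'colour']

# \s+ matches a run of whitespace, \D matches any single non-digit.
squeezed = re.sub(r"\s+", " ", "a \t b\n c")
digits = re.sub(r"\D", "", "tel. 555-01-23")
print(squeezed)       # a b c
print(digits)         # 5550123
```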

3.5. Exercise - part 1

  • Create a function check_ip

    • The function checks whether an IP address is valid,

    • Run the function on the following dictionary of hosts

 {
     '127.0.0.1': {'correct': None},
     '8.8.8.8': {'correct': None},
     'x.x.x.x': {'correct': None}
 }

  • In place of x.x.x.x put any address from your network,

  • Set the correct flag of each host to the result of the check

Hint

You may use the following expression, or find or create a more precise one:

    ^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$
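
One possible starting point, using the expression from the hint. The pattern alone also accepts values such as 999.1.1.1, so this sketch adds a range check on each octet; that extra check and the name IP_PATTERN are assumptions for illustration, not part of the exercise text.

```python
import re

# The expression from the hint: four dot-separated groups of one to three digits.
IP_PATTERN = re.compile(r"^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$")


def check_ip(address):
    """Return True if address looks like a dotted-quad IPv4 address."""
    # fullmatch rejects a trailing newline, which "$" alone would tolerate.
    if IP_PATTERN.fullmatch(address) is None:
        return False
    # Assumption: additionally require every octet to be in 0..255,
    # which the pattern by itself does not enforce.
    return all(0 <= int(octet) <= 255 for octet in address.split("."))


hosts = {
    "127.0.0.1": {"correct": None},
    "8.8.8.8": {"correct": None},
    "x.x.x.x": {"correct": None},  # placeholder, replace with an address from your network
}

for address, info in hosts.items():
    info["correct"] = check_ip(address)
    print(address, info["correct"])

print(check_ip("999.1.1.1"))  # False: matches the pattern but fails the range check
```

For real applications the standard-library ipaddress module is a more robust choice than a hand-written expression.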

3.6. Exercise - part 2

  • Create a function check_email

    • The function checks whether an email address is valid
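
A minimal sketch of such a function. The pattern is a deliberately simple assumption (local part, "@", domain with at least one dot) and does not cover the full grammar of valid addresses; EMAIL_PATTERN is a made-up name.

```python
import re

# Assumption: a rough shape check only, local@domain.tld.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def check_email(address):
    """Return True if address has the rough shape of an email address."""
    return EMAIL_PATTERN.fullmatch(address) is not None


for candidate in ("jan@example.com", "jan@example", "jan example@x.com"):
    print(candidate, check_email(candidate))
```

Only the first candidate passes: the second has no dot in the domain and the third contains a space.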

3.7. Exercise - part 3

  • Using the requests library

    • Download the content of a page

  • Get all HTML tags

  • Get the human-readable words
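
A sketch of the two extraction steps. To stay runnable offline it works on a small inline sample; in the exercise the text would come from the page itself, for example requests.get(url).text. The sample HTML and the patterns are assumptions, and regular expressions are only a rough tool here, not a real HTML parser.

```python
import re

# Assumption: stand-in for the downloaded page content.
html = "<html><body><h1>Hello regex</h1><p>Hello <b>world</b></p></body></html>"

# Names of opening tags: "<", a letter, then further letters or digits.
tags = re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>", html)
print(tags)   # ['html', 'body', 'h1', 'p', 'b']

# Crude text extraction: blank out everything between "<" and ">",
# then collect runs of letters.
text = re.sub(r"<[^>]+>", " ", html)
words = re.findall(r"[A-Za-z]+", text)
print(words)  # ['Hello', 'regex', 'Hello', 'world']
```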

3.8. Exercise - part 4

  • Using the collections module

    • Count the occurrences of each word from Ex. part 3 (second point)

    • Get the 10 most frequent words

    • Get the 70 most frequent words
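
collections.Counter covers both the counting and the "top N" steps. A short sketch, with a hypothetical word list standing in for the result of part 3:

```python
from collections import Counter

# Hypothetical word list; in the exercise this comes from part 3.
words = ["hello", "regex", "hello", "world", "regex", "hello"]

counts = Counter(words)
print(counts["hello"])  # 3

# most_common(n) returns up to n (word, count) pairs, most frequent first.
top = counts.most_common(2)
print(top)              # [('hello', 3), ('regex', 2)]
```

Calling most_common(10) or most_common(70) works the same way and simply returns fewer pairs when there are fewer distinct words.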

3.9. Resources

  • Regular Expressions Cookbook by Steven Levithan, Jan Goyvaerts - book