Chapter 08 Python Course – Regular Expressions

Regular Expressions are a powerful tool used to find or replace specific patterns within strings. They are supported in many programming languages and are an essential skill, especially for tasks that frequently require text processing.

In this course, we will learn how to handle regular expressions using Python’s built-in module, re. This module provides nearly all functionalities of regular expressions, such as string searching and modification, and pattern matching.

Basic Concepts of Regular Expressions

Regular expressions are a way of searching for strings using specific patterns. They are supported by most text editors and are widely used in programming languages. Regular expressions can be considered a sort of mini-language, making them very useful for processing and analyzing strings.

Basic Components of Regular Expressions

  • Literal Characters: Characters that represent themselves; for example, a simply means the character a.
  • Meta Characters: Characters that have special meanings, such as ., ^, $, *, +, ?, [], {}, (), |, etc.

Important Meta Characters in Regular Expressions

  • .: Represents any single character (except a newline by default). For example, a.c matches ‘abc’, ‘a-c’, or ‘a1c’: an a, then any one character, then a c.
  • []: Represents one of several characters in the brackets. [abc] finds either a, b, or c.
  • ^: Represents the start of a string. For example, ^abc finds strings that start with ‘abc’.
  • $: Represents the end of a string. xyz$ finds strings that end with ‘xyz’.
  • *: Means the preceding character can repeat 0 or more times. For example, bo* matches patterns like ‘b’, ‘bo’, ‘boo’, ‘booo’, etc.
  • +: Means the preceding character can repeat 1 or more times. bo+ matches patterns like ‘bo’, ‘boo’, ‘booo’, etc.
  • ?: Means the preceding character can appear 0 or 1 time. colou?r can match both ‘color’ and ‘colour’.
  • {}: The number inside the braces specifies the number of repetitions. For example, a{2} means ‘aa’, and a{2,3} means ‘aa’ or ‘aaa’.
  • (): Specifies a group. This allows you to bundle an entire pattern or capture it for later use.
  • |: Acts as an OR operator meaning ‘A or B’. a|b means ‘a’ or ‘b’.
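The descriptions above can be checked with a short sketch. It uses re.fullmatch, which succeeds only when the entire string matches the pattern; the sample strings are illustrative choices, not part of the original list.

```python
import re

# Each entry: (pattern, string, whether the whole string should match)
checks = [
    (r"a.c", "abc", True),         # . matches any single character
    (r"a.c", "ac", False),         # . requires exactly one character
    (r"[abc]", "b", True),         # one character from the set
    (r"bo*", "b", True),           # * allows zero repetitions
    (r"bo+", "b", False),          # + requires at least one 'o'
    (r"colou?r", "color", True),   # ? makes 'u' optional
    (r"a{2,3}", "aaaa", False),    # at most three 'a's
    (r"a|b", "b", True),           # alternation
]

results = [bool(re.fullmatch(p, s)) for p, s, _ in checks]
for (p, s, expected), got in zip(checks, results):
    print(f"{p!r:12} {s!r:8} -> {got}")
```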

Using Regular Expressions in Python

The functionality for regular expressions in Python is provided through the re module. With it, you can validate, search, and modify strings using regular expression patterns.

Basic Usage of the re Module

import re

# Check if the string matches the regular expression pattern
pattern = r"^abc"
string = "abcdefg"
if re.match(pattern, string):
    print("Matches the regular expression!")
else:
    print("Does not match.")
    

The above code uses the regular expression ^abc to check if the string starts with ‘abc’. The match function searches from the start of the string, hence ‘abcdefg’ matches as it starts with ‘abc’.

Searching Anywhere in a String: re.search()

Unlike match(), search() can find a pattern anywhere in the string. For example, it will find patterns in the middle of the string.

import re

pattern = r"abc"
string = "xyzabcdef"

if re.search(pattern, string):
    print("Pattern found!")
else:
    print("Pattern not found.")
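The difference between the two functions is easiest to see side by side. The following is a minimal sketch using the same sample string; the variable names are illustrative.

```python
import re

text = "xyzabcdef"

at_start = re.match(r"abc", text)    # only tries position 0
anywhere = re.search(r"abc", text)   # scans the whole string

print(at_start)            # None, because the text starts with 'xyz'
print(anywhere.group())    # abc
print(anywhere.span())     # (3, 6)
```

The match object returned by search() also reports where the match was found, through span(), start(), and end().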
    

Finding All Patterns: re.findall()

This is used when you want to return all sections of the string that match the pattern as a list.

import re

pattern = r"a"
string = "banana"

matches = re.findall(pattern, string)
print(matches)
    

In the above example, the function returns a list of every ‘a’ found in the string ‘banana’: ['a', 'a', 'a'].
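One detail worth knowing is that findall() changes its return shape when the pattern contains capturing groups. The sketch below uses a made-up key=value string to show both cases.

```python
import re

text = "apple=3, banana=12"

whole = re.findall(r"\w+=\d+", text)        # no groups: full matches
pairs = re.findall(r"(\w+)=(\d+)", text)    # two groups: tuples of groups

print(whole)   # ['apple=3', 'banana=12']
print(pairs)   # [('apple', '3'), ('banana', '12')]
```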

Replacing Patterns: re.sub()

To replace matching patterns with another string, use the sub() function.

import re

pattern = r"a"
replacement = "o"
string = "banana"

new_string = re.sub(pattern, replacement, string)
print(new_string)
    

This code changes all ‘a’s in the string ‘banana’ to ‘o’, producing the result ‘bonono’.
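sub() also accepts a count argument and lets the replacement refer back to captured groups. A small sketch, with example strings chosen only for illustration:

```python
import re

# count limits how many replacements are made
first_only = re.sub(r"a", "o", "banana", count=1)

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(first_only)  # bonana
print(swapped)     # world hello
```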

Applications of Regular Expressions through Real Examples

Regular expressions are very effective for data validation, extraction, and manipulation. Here, we’ll explore the applications of regular expressions through real examples such as extracting phone numbers, emails, and URLs.

1. Extracting Phone Numbers

Phone numbers can exist in various formats, such as ‘(123) 456-7890’, ‘123.456.7890’, ‘123-456-7890’, etc. Let’s write a regular expression to extract them.

import re

pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
text = "Contact: (123) 456-7890, and 123-456-7890."

phone_numbers = re.findall(pattern, text)
print(phone_numbers)
    

The above regular expression can extract various formats of phone numbers.
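To see how the pattern behaves on all three formats mentioned earlier, and where it is looser than one might expect, here is a short sketch. The test sentences are made up for illustration.

```python
import re

pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"

text = "Call (123) 456-7890, 123.456.7890 or 123-456-7890."
found = re.findall(pattern, text)
print(found)  # ['(123) 456-7890', '123.456.7890', '123-456-7890']

# The separators are optional, so the pattern also matches
# the first ten digits of a longer run of digits.
loose = re.findall(pattern, "1234567890123")
print(loose)  # ['1234567890']
```

If matches inside longer numbers are a problem, the pattern would need extra constraints around it, such as lookarounds that reject neighboring digits.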

2. Validating and Extracting Email Addresses

Email addresses are typically in the format username@domain.extension. Here’s a regular expression to extract them:

import re

# Inside [...] the | character is a literal, so the class is written [A-Za-z]
pattern = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b"
text = "For inquiries, please email contact@example.com."

emails = re.findall(pattern, text)
print(emails)
    

This regular expression can extract various email addresses that follow the email format.
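Extraction uses findall(); validation of a single value is better served by re.fullmatch(), which requires the whole string to match. The sketch below assumes a helper named looks_like_email (a hypothetical name) and checks only the general shape, not whether the address actually exists.

```python
import re

# Same shape as the extraction pattern, without the \b anchors
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}")

def looks_like_email(value):
    """Return True if the whole string has the username@domain.extension shape."""
    return EMAIL_RE.fullmatch(value) is not None

print(looks_like_email("contact@example.com"))       # True
print(looks_like_email("contact@example"))           # False: no extension
print(looks_like_email("mail contact@example.com"))  # False: extra text
```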

3. Extracting URLs

Extracting URL links from web pages can also be useful. You can easily search for URLs within large texts using regular expressions.

import re

pattern = r"https?://(?:www\.)?\S+\.\S+"
text = "Our website is https://www.example.com. Please visit the link."

urls = re.findall(pattern, text)
print(urls)
    

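This example’s regular expression extracts URLs starting with http:// or https://. The ‘www.’ prefix is optional, and any run of non-whitespace characters containing a dot may follow. Because \S also matches punctuation, the sentence-ending period is captured here, so the result is ['https://www.example.com.']; strip trailing punctuation if that matters.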
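One simple way to deal with punctuation that sticks to the end of a URL is to trim it after matching. This is a sketch under the assumption that URLs of interest do not legitimately end in these characters.

```python
import re

pattern = r"https?://(?:www\.)?\S+\.\S+"
text = "Our website is https://www.example.com. Please visit the link."

raw = re.findall(pattern, text)
# \S also matches the sentence-ending period, so trim common trailing punctuation
cleaned = [url.rstrip(".,;:!?)") for url in raw]

print(raw)      # ['https://www.example.com.']
print(cleaned)  # ['https://www.example.com']
```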

Debugging and Optimizing Regular Expressions

While regular expressions are very powerful, errors can occur when writing complex patterns. Therefore, here are some tips to debug and optimize them.

Using Comments

Adding comments to regular expressions can make complex patterns easier to understand. In Python, you can add comments using the re.VERBOSE flag.

import re

pattern = r"""
\(?\d{3}\)?    # area code, optional parentheses
[-.\s]?         # separator after area code
\d{3}          # three-digit number
[-.\s]?         # separator between numbers
\d{4}          # last four-digit number
"""
text = "Here are the phone numbers (123) 456-7890 and 987-654-3210."
# re.VERBOSE ignores whitespace in the pattern and allows # comments
phone_numbers = re.findall(pattern, text, re.VERBOSE)
print(phone_numbers)
    

Writing Efficient Patterns

  • Use simple and clear patterns whenever possible to increase processing speed.
  • Narrow what each part of the pattern can match (for example, \d{3} instead of .*) so the engine has less to backtrack over.
  • Use character classes such as [-.\s] instead of long alternations of single characters.

Conclusion

Regular expressions are a powerful and flexible tool for string processing. They may seem complex at first, but once familiar, they become excellent tools for data searching, validation, and transformation. Through the above practical exercises, try out how to use regular expressions in real scenarios. Practice effectively processing various types of strings using Python’s re module.

4. Information Extraction from Text Data

Regular expressions are effectively used in natural language processing (NLP) and data analysis. For example, they can be utilized to search for specific keywords in customer feedback data or to extract numerical and currency information from financial data.

import re

# Customer feedback example
feedback = "The service at our bank was fantastic. I was especially impressed by the kindness of Agent Kim. Thank you!"

# Extract the sentence that mentions 'Agent Kim'
# [^.!?]* stays inside one sentence; [.!?] matches its closing punctuation
agent_pattern = r"[^.!?]*Agent Kim[^.!?]*[.!?]"
agent_feedback = re.search(agent_pattern, feedback)

if agent_feedback:
    print(agent_feedback.group().strip())  # Print only the matching sentence

Cautions When Using Regular Expressions

Regular expressions are very powerful tools, but improper use can lead to performance issues. In particular, when handling complex patterns, CPU usage can spike. To optimize, keep the following points in mind:

  • Use the simplest patterns possible and avoid unnecessary grouping.
  • Choose between greedy and non-greedy quantifiers deliberately; a quantifier that matches no more than it needs to can reduce backtracking.
  • When regular expressions are not needed, it is better to use string methods (e.g., str.find(), str.replace()).
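The difference between greedy and non-greedy matching, and the string-method alternative, can be sketched as follows. The HTML-like sample string is an illustrative assumption.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.+>", html)    # .+ grabs as much as possible
lazy = re.findall(r"<.+?>", html)     # .+? stops at the first '>'

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# For a fixed substring, a plain string method is enough
replaced = "banana".replace("a", "o")
print(replaced)  # bonono
```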

Debugging Regular Expressions

When writing regular expressions, unexpected results often occur. To address this, various online debugging tools can be utilized. These tools visually show the matching patterns of regular expressions, allowing for quick identification and correction of issues.

Extended Features of Regular Expressions

The Python re module offers additional functionalities using flags, in addition to basic regular expression functionalities. For example, there are features that ignore case sensitivity or are useful when dealing with multi-line strings:

  • re.IGNORECASE: Matches while ignoring case sensitivity.
  • re.MULTILINE: Makes ^ and $ match at the start and end of every line, not only at the start and end of the whole string.
  • re.DOTALL: The dot (.) matches all characters including newline characters.

import re

# Multi-line string
multiline_text = """first line
second line
third line"""

# Finding the start of lines in a multi-line example
multiline_pattern = r"^second"  # Finding the line that starts with 'second'

# Result of the match
matches = re.findall(multiline_pattern, multiline_text, re.MULTILINE)
print(matches)  # ['second']
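The other two flags can be illustrated in the same way. This is a minimal sketch with a made-up two-line string.

```python
import re

text = "Hello\nWORLD"

case_sensitive = re.findall(r"world", text)
case_insensitive = re.findall(r"world", text, re.IGNORECASE)

dot_default = re.search(r"Hello.WORLD", text)
dot_all = re.search(r"Hello.WORLD", text, re.DOTALL)

print(case_sensitive)       # []
print(case_insensitive)     # ['WORLD']
print(dot_default)          # None, since . does not match '\n' by default
print(dot_all is not None)  # True
```

Flags can be combined with the | operator, for example re.IGNORECASE | re.MULTILINE.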

Conclusion

In this lecture, we explored various ways to use regular expressions in Python. Regular expressions are a very powerful tool for string manipulation and can be applied in various fields. I hope the practical examples allow you to appreciate the usefulness of regular expressions. For those encountering regular expressions for the first time, they may seem complex and difficult, but by developing the ability to understand and apply patterns, they can become a highly efficient tool.

As you become more familiar with regular expressions through practice and repetition, you’ll acquire a powerful skill that allows you to easily solve complex string processing problems. I hope this lecture has greatly helped in laying the foundation of Python regular expressions.

By engaging with more practice and examples, familiarize yourself with regular expressions and enhance your data processing and analysis skills!

08-2 Python Tutorial – Getting Started with Regular Expressions

What is a Regular Expression?

A regular expression is a powerful tool for matching strings to specific patterns. It is mainly used for data validation, searching, and text processing tasks. Utilizing regular expressions in programming languages, especially in Python, allows you to easily handle complex pattern matching.

Using Regular Expressions in Python

The Python re module offers various functions related to regular expressions. Commonly used functions include match, search, findall, and finditer.


# Import the re module
import re

# Pattern matching example
pattern = re.compile(r'\d+')

# Search for numbers in a string
match = pattern.search("The cost is 1200 won.")
if match:
    print("Number found:", match.group())
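finditer, mentioned above, yields a match object for every occurrence, which is useful when positions are needed as well as text. A short sketch with an invented sentence:

```python
import re

pattern = re.compile(r"\d+")
text = "Order 12 costs 1200 won and ships in 3 days."

# Each match object carries both the matched text and its position
found = [(m.group(), m.start()) for m in pattern.finditer(text)]
print(found)  # [('12', 6), ('1200', 15), ('3', 37)]
```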
    

Basic Patterns in Regular Expressions

You can perform more complex pattern matching through commonly used metacharacters in regular expressions. For example:

  • . : Any single character
  • ^ : Start of the string
  • $ : End of the string
  • * : Zero or more repetitions
  • + : One or more repetitions
  • ? : Zero or one repetition

Advanced Pattern Matching

To use regular expressions more deeply, you need to understand advanced features such as grouping and capturing, lookaheads, and lookbehinds.


# Grouping example
pattern = re.compile(r'(\d{3})-(\d{3,4})-(\d{4})')
match = pattern.search("The phone number is 010-1234-5678.")
if match:
    print("Area code:", match.group(1))
    print("Middle number:", match.group(2))
    print("Last number:", match.group(3))
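Groups can also be given names with the (?P<name>...) syntax, which tends to read better than numeric indexes. In this sketch the group names prefix, middle, and last are illustrative choices.

```python
import re

pattern = re.compile(r"(?P<prefix>\d{3})-(?P<middle>\d{3,4})-(?P<last>\d{4})")
match = pattern.search("The phone number is 010-1234-5678.")

if match:
    print(match.group("prefix"))  # 010
    print(match.groupdict())      # {'prefix': '010', 'middle': '1234', 'last': '5678'}
```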
    

Useful Examples of Regular Expressions

Regular expressions can be used to identify and process various string patterns. For example, you can check the validity of an email address or extract URLs from text.

Practical Examples

We will explore applications of regular expressions through various real-world cases. This section will demonstrate how regular expressions can contribute to problem-solving with specific code examples.

Cautions When Using Regular Expressions

While regular expressions are a powerful tool, performance issues may arise at times. You should be cautious when applying them to very complex patterns or large datasets. Additionally, you should consider readability and maintainability when using them.

Conclusion

Regular expressions are a very useful feature in programming languages like Python. With sufficient practice and understanding, you can write code more efficiently and concisely.

08-1 Python Course – Exploring Regular Expressions

What is a Regular Expression?

Regular Expressions (regex or regexp for short) are strings used for searching, replacing, and extracting strings that match specific rules. They are mainly used in text processing to search multiple patterns or for data validation.

Basic Concepts of Regular Expressions

To understand the basic concepts of regular expressions, it is essential to know a few special characters that are commonly used.

Basic Patterns

  • Dot (.): Represents any single character.
  • Brackets ([]): Represents one of the characters inside the brackets. Example: [abc]
  • Caret (^): Represents the start of the string. Example: ^Hello
  • Dollar sign ($): Represents the end of the string. Example: world$
  • Asterisk (*): Indicates that the preceding character may occur zero or more times. Example: a*
  • Plus (+): Indicates that the preceding character may occur one or more times. Example: a+
  • Question mark (?): Indicates that the preceding character may occur zero or one time. Example: a?

Meta Characters

Meta characters are used with special meanings in regular expressions and often need to be escaped to be used literally as characters.

  • Backslash (\): An escape character used to treat a special character as a regular character. Example: \. matches a literal dot.
  • Pipe (|): The OR operator, which matches if any one of the alternative patterns matches. Example: a|b
  • Parentheses (()): Represent grouping and are used to create subpatterns. Example: (ab)

Using Regular Expressions in Python

In Python, the `re` module is used to handle regular expressions. This module provides various functions to easily work with regular expressions.

Functions in the re Module

  • re.match(): Checks if the beginning of a string matches the specified pattern.
  • re.search(): Searches the entire string for the first matching pattern.
  • re.findall(): Returns all substrings that match the pattern as a list.
  • re.finditer(): Returns an iterator that yields a match object for each match of the pattern.
  • re.sub(): Replaces substrings that match the pattern with another string.

Examples of Using Regular Expressions

Basic Usage Examples


import re

# Check if the start of the string is 'Hello'
result = re.match(r'^Hello', 'Hello, world!')
print(result)  # Returns a match object if successful, or None if failed.
    

Finding Patterns in a String


import re

search_result = re.search(r'world', 'Hello, world!')
print(search_result)  # Returns a match object for the matched portion.
    

Extracting All Matching Patterns


# Finding all 'a' characters in the string
all_matches = re.findall(r'a', 'banana')
print(all_matches)  # Returns a list of all matches found.
    

Transforming Strings Based on Patterns

You can use the re.sub() function to transform patterns in a string into other strings.


# Replace all whitespace with underscores
transformed_string = re.sub(r'\s', '_', 'Hello world!')
print(transformed_string)  # Output: 'Hello_world!'
    

Advanced Features of Regular Expressions

Grouping and Capturing

Grouping is very useful for capturing subpatterns of a regex for reuse or for performing specific tasks.


pattern = r'(\d+)-(\d+)-(\d+)'
string = 'Phone number: 123-456-7890'
match = re.search(pattern, string)

if match:
    print(match.group(0))  # Full matched string
    print(match.group(1))  # First group: 123
    print(match.group(2))  # Second group: 456
    print(match.group(3))  # Third group: 7890
    

Lookahead and Lookbehind

Lookahead and lookbehind check what comes after or before a position without including that text in the match. They are commonly used, but can be somewhat complex.

Using Lookahead


# Match 'abc' only when it is immediately followed by 'def' ('def' is not part of the match)
lookahead_pattern = r'abc(?=def)'
lookahead_string = 'abcdefghi'
lookahead_match = re.search(lookahead_pattern, lookahead_string)
print(lookahead_match)
    

Using Lookbehind


# Match 'abc' only when it is immediately preceded by '123' ('123' is not part of the match)
lookbehind_pattern = r'(?<=123)abc'
lookbehind_string = '123abc'
lookbehind_match = re.search(lookbehind_pattern, lookbehind_string)
print(lookbehind_match)
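Both assertions also have negative forms, (?!...) and (?<!...), which succeed when the given text is absent. The following sketch uses a made-up string and reports the start index of each match.

```python
import re

text = "abcdef abcxyz 123abc 456abc"

# The lookahead itself consumes nothing: only 'abc' is returned
positive = re.search(r"abc(?=def)", text).group()

# 'abc' NOT followed by 'def'
not_before_def = [m.start() for m in re.finditer(r"abc(?!def)", text)]

# 'abc' NOT preceded by '123'
not_after_123 = [m.start() for m in re.finditer(r"(?<!123)abc", text)]

print(positive)        # abc
print(not_before_def)  # [7, 17, 24]
print(not_after_123)   # [0, 7, 24]
```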
    

Comprehensive Example: Extracting Email Addresses

Regular expressions are especially useful for extracting email addresses from entered text.


email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
text = "Contact email: example@example.com or support@domain.com"
emails = re.findall(email_pattern, text)

print(emails)  # ['example@example.com', 'support@domain.com']
    

Summary

Regular expressions are a powerful tool in string processing, and Python's `re` module provides sufficient functionality to work with them. By understanding the basic syntax of regular expressions and practicing, one can easily handle complex text patterns. Regular practice and application of these techniques will help solve more complex string processing issues effectively.

07: Flying with Python

In this course, we will explore how to utilize the advanced features of Python to solve complex problems and write efficient code. The main topics we will cover include various programming paradigms, advanced data structures, and the powerful built-in module functionalities that Python offers.

1. Advanced Programming Paradigms

Python is a multi-paradigm programming language. It supports procedural, object-oriented, and functional programming, allowing you to take advantage of each as needed. In this section, we will focus primarily on advanced techniques in object-oriented programming (OOP) and functional programming.

1.1 In-depth Object-Oriented Programming

The basic concept of OOP starts with the understanding of classes and objects. However, to design more complex programs, you need to know other concepts as well.

1.1.1 Inheritance and Polymorphism

Inheritance is a feature where a new class inherits the properties and methods of an existing class. By using inheritance, the reusability of the code can be enhanced. Polymorphism allows for the same interface to be used for objects of different classes.

class Animal:
    def speak(self):
        pass

class Dog(Animal):
    def speak(self):
        return "Woof!"

class Cat(Animal):
    def speak(self):
        return "Meow!"

def animal_sound(animal):
    print(animal.speak())

dog = Dog()
cat = Cat()

animal_sound(dog)  # Woof!
animal_sound(cat)  # Meow!

The above example is an illustration of polymorphism. By having the speak() method in different class objects, it can be called in the same way within the animal_sound function.

1.1.2 Abstraction and Interfaces

An abstract class defines a common interface and contains one or more abstract methods that subclasses must implement. An interface can be thought of as a collection of these abstract methods. In Python, abstraction is implemented through the ABC class and the abstractmethod decorator of the abc module.

from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        pass

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3.1415 * self.radius * self.radius

circle = Circle(5)
print(circle.area())  # 78.5375

In the above example, the Shape class is an abstract class that defines the abstract method area. The Circle class inherits from Shape and implements the area method.
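What makes a class abstract in practice is that Python refuses to instantiate it until every abstract method is implemented. The sketch below repeats Shape and adds a hypothetical Square subclass to show this.

```python
from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        """Return the area of the shape."""

class Square(Shape):  # hypothetical subclass used for illustration
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side

try:
    Shape()  # abstract classes cannot be instantiated
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)

print(Square(3).area())  # 9
```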

1.2 Functional Programming

Functional programming uses pure functions to reduce side effects and implements complex behaviors through function composition. Python provides strong functional tools to encourage this style.

1.2.1 Lambda Functions

Lambda functions are anonymous functions defined typically with a single expression. They are useful for writing short and concise functions.

add = lambda x, y: x + y
print(add(5, 3))  # 8

In the above example, lambda defines an anonymous function that adds two parameters.

1.2.2 Higher-Order Functions

A higher-order function is a function that takes another function as an argument or returns one. Python’s built-in map and filter, and reduce from the functools module, are examples of this functional programming technique.

numbers = [1, 2, 3, 4, 5]
squared = map(lambda x: x**2, numbers)
print(list(squared))  # [1, 4, 9, 16, 25]

In the above example, the map function applies the lambda function to each element in the list to create a new iterator.
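filter and reduce follow the same idea. A minimal sketch using the same list; note that in Python 3 reduce must be imported from functools.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# filter keeps the elements for which the function returns True
evens = list(filter(lambda x: x % 2 == 0, numbers))

# reduce folds the list into a single value, starting from 0
total = reduce(lambda acc, x: acc + x, numbers, 0)

print(evens)  # [2, 4]
print(total)  # 15
```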

2. Advanced Data Structures

Utilizing advanced data structures allows for more efficient handling of complex data operations. Here we will address more complex data structures beyond basic types like lists and dictionaries.

2.1 Collections Module

The Python collections module provides several data structures with specialized purposes. Let’s take a look at a few of them.

2.1.1 defaultdict

defaultdict is a dictionary that automatically creates a default value when a non-existent key is referenced.

from collections import defaultdict

fruit_counter = defaultdict(int)
fruits = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']

for fruit in fruits:
    fruit_counter[fruit] += 1

print(fruit_counter)  # defaultdict(<class 'int'>, {'apple': 3, 'banana': 2, 'orange': 1})

This example demonstrates how to easily count each fruit using defaultdict.
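The default factory does not have to be int. As a sketch, defaultdict(list) groups items without checking whether a key already exists, and collections.Counter is a ready-made alternative when only counting is needed. Grouping by first letter is an illustrative choice.

```python
from collections import Counter, defaultdict

fruits = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']

# Group the fruit names by their first letter
by_letter = defaultdict(list)
for fruit in fruits:
    by_letter[fruit[0]].append(fruit)

print(dict(by_letter))

# Counter counts in one step
top = Counter(fruits).most_common(1)
print(top)  # [('apple', 3)]
```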

2.1.2 namedtuple

namedtuple creates a tuple subclass that is immutable like a regular tuple but also allows access to fields by name, which enhances the readability of the code.

from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p = Point(10, 20)

print(p.x, p.y)  # 10 20

By using namedtuple, fields can be accessed by name, allowing for clearer code.
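A few more properties are worth seeing in code: fields cannot be reassigned, _replace returns a modified copy, and the object still unpacks like an ordinary tuple. A short sketch:

```python
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p = Point(10, 20)

try:
    p.x = 99  # namedtuple fields are read-only
except AttributeError:
    print("fields cannot be reassigned")

moved = p._replace(x=99)    # returns a new Point, p is unchanged
print(moved)                # Point(x=99, y=20)
print(dict(p._asdict()))    # {'x': 10, 'y': 20}

x, y = p                    # unpacks like a regular tuple
print(x, y)                 # 10 20
```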

2.2 Heap Queue Module

The heapq module implements a heap queue algorithm, enabling a list to be used as a priority queue.

import heapq

numbers = [1, 3, 5, 7, 9, 2, 4, 6, 8, 0]
heapq.heapify(numbers)  # Rearrange the list in place into a min-heap

smallest = heapq.heappop(numbers)
print(smallest)  # 0

This allows for quick extraction of the minimum value in the data using a priority queue.
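For an actual priority queue, a common approach is to push (priority, item) tuples so that the lowest priority number comes out first. The task names below are invented for illustration.

```python
import heapq

tasks = []
heapq.heappush(tasks, (2, "write report"))
heapq.heappush(tasks, (1, "fix bug"))
heapq.heappush(tasks, (3, "refactor"))

# heappop always returns the smallest tuple, i.e. the lowest priority number
order = [heapq.heappop(tasks)[1] for _ in range(3)]
print(order)  # ['fix bug', 'write report', 'refactor']

# nsmallest returns the n smallest items without modifying the input
smallest_three = heapq.nsmallest(3, [1, 3, 5, 7, 9, 2, 4, 6, 8, 0])
print(smallest_three)  # [0, 1, 2]
```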

3. Utilizing Advanced Built-in Modules

The rich built-in modules of Python provide various functionalities. Here, we will introduce some modules for advanced tasks.

3.1 itertools Module

The itertools module offers useful functions for dealing with iterators. It is a powerful tool for repetitive data processing.

3.1.1 Combinations and Permutations

combinations selects elements without regard to order, while permutations treats each ordering of the selected elements as a distinct result.

from itertools import combinations, permutations

data = ['A', 'B', 'C']

# Combinations
print(list(combinations(data, 2)))  # [('A', 'B'), ('A', 'C'), ('B', 'C')]

# Permutations
print(list(permutations(data, 2)))  # [('A', 'B'), ('A', 'C'), ('B', 'A'), ('B', 'C'), ('C', 'A'), ('C', 'B')]

These functions allow for the quick generation of various list combinations.

3.1.2 Handling Iterator Collections

This module also provides infinite iterators, such as count, which counts upward from a starting value, and cycle, which repeats a sequence endlessly.

from itertools import count, cycle

# Infinite count
for i in count(10):
    if i > 15:
        break
    print(i, end=' ')  # 10 11 12 13 14 15

print()  # New line

# Periodic repetition
for i, char in zip(range(10), cycle('ABC')):
    print(char, end=' ')  # A B C A B C A B C A

The above example shows how to utilize infinite loops and periodic repetitions.
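Instead of breaking out of a loop by hand, itertools.islice can take a fixed number of items from an infinite iterator. A minimal sketch:

```python
from itertools import count, cycle, islice

# Take the first five values of an infinite count with step 2
first_evens = list(islice(count(0, 2), 5))

# Take seven values from an endlessly repeating sequence
repeated = list(islice(cycle('ABC'), 7))

print(first_evens)  # [0, 2, 4, 6, 8]
print(repeated)     # ['A', 'B', 'C', 'A', 'B', 'C', 'A']
```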

3.2 functools Module

The functools module provides functional programming tools, offering various utilities particularly useful for handling functions.

3.2.1 lru_cache Decorator

The @lru_cache decorator is used for memoization, storing computed results to avoid recalculating for the same input.

from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

print([fibonacci(n) for n in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]

In the code above, the computed results for the Fibonacci sequence are stored in the cache, saving execution time for the same input.
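The effect of the cache can be observed through cache_info(), which functions decorated with lru_cache provide. The sketch below calls the function once on a fresh cache and inspects the counters.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

result = fibonacci(30)
info = fibonacci.cache_info()

print(result)       # 832040
print(info.misses)  # 31: each n from 0 to 30 is computed once
print(info.hits)    # 28: repeated sub-calls are answered from the cache
```

Without the cache, the same call would recompute the smaller values a very large number of times.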

Conclusion

In this article, we have discussed advanced topics in Python. By effectively utilizing these features, complex problems can be solved efficiently, and high-level code can be written. Let's delve into more topics in the next course and advance towards becoming Python experts.