This article is part 4 of 4 in the series Python Regular Expressions Tutorial

Last Updated: Thursday 12th December 2013

We’ve covered a lot of ground in this series of articles, so let’s now put it all together and work through a real-life application.

A common task is to parse a Windows INI file, which holds key/value pairs grouped into sections. Each section starts with a header line in square brackets, followed by lines of the form key=value.

Let’s first write a bit of Python code that reads in a test file, line by line:
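As a minimal sketch of this step (the article's own test file is not reproduced here, so the sample contents and the helper names read_lines and demo_lines are assumptions made for illustration), reading a file line by line might look like this:

```python
import os
import tempfile

# Hypothetical sample data standing in for the article's test file.
SAMPLE_INI = "[section1]\nkey1=value1\nkey2=value2\n\n[section2]\nkey3=value3\n"


def read_lines(path):
    """Return the lines of a text file with the trailing newline removed."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def demo_lines():
    """Write the sample to a temporary file, then read it back line by line."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".ini", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write(SAMPLE_INI)
    try:
        return read_lines(tmp.name)
    finally:
        os.remove(tmp.name)


for line in demo_lines():
    print(line)
```

Stripping the newline up front keeps the later patterns simple, since $ then lines up with the true end of the text on each line.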

We will now extend this by writing some regular expressions that figure out what is on each line.

Identifying section headers

The first thing we’ll do is write a regular expression that recognizes a section header, that is, a line that starts and ends with square brackets. We could write such a regular expression like this: ^\[(.+)\]$

In plain English:

  • Match ^ (the start of the line).
  • Match a [ character (escaped, since [ normally has a special meaning in a regular expression).
  • Match one or more characters (the section name), captured in a group.
  • Match a ] character (escaping it is optional, since ] has no special meaning outside a character class).
  • Match $ (the end of the line).

If we update our code to use this regular expression:
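Here is a rough sketch of what that update could look like. The pattern is the one given above; the helper name section_name and the test lines are hypothetical, chosen only to keep the example self-contained.

```python
import re

# Pattern from the text: "[" at the start, "]" at the end, name captured.
SECTION_RE = re.compile(r"^\[(.+)\]$")


def section_name(line):
    """Return the section name if the line is a header, otherwise None."""
    match = SECTION_RE.match(line)
    return match.group(1) if match else None


# Hypothetical test lines.
for line in ["[section1]", "key1=value1", "[section2]"]:
    name = section_name(line)
    if name is not None:
        print("Found section:", name)
```

Note that this version is strict: a header with leading spaces, such as "  [section2]", is not recognized yet.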

We get this output:

Seems to work fine!

Handling white-space in section headers

It would be handy to handle white-space in section headers, so if somebody gave us an INI file that looked like this:

We would be able to handle the oddly-written second section header properly. Right now, our code doesn’t find it, so let’s update the regular expression to handle it: ^\s*\[\s*(.+?)\s*\]

In plain English:

  • Match ^ (the start of the line).
  • Match \s* (zero or more white-space characters).
  • Match a [ character.
  • Match \s* (zero or more white-space characters).
  • Match one or more characters, as few as possible (the section name), captured in a group.
  • Match \s* (zero or more white-space characters).
  • Match a ] character.

Note that we had to make the + quantifier (the one inside the group that captures the section name) non-greedy by writing +?, to stop it from swallowing any trailing spaces that appear before the closing ]. We also stop matching after the closing ], since we don’t care whether there’s anything on the line after it.
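To see why the non-greedy form matters, here is a small comparison. The test line is made up for illustration; the two patterns differ only in using + versus +? inside the group.

```python
import re

# Hypothetical header with extra spaces and trailing text after the "]".
line = "  [  section 2   ]  ; trailing text"

greedy = re.match(r"^\s*\[\s*(.+)\s*\]", line)
lazy = re.match(r"^\s*\[\s*(.+?)\s*\]", line)

# The greedy group keeps the spaces before "]"; the lazy one leaves
# them for the following \s* to consume.
print(repr(greedy.group(1)))  # 'section 2   '
print(repr(lazy.group(1)))    # 'section 2'
```

Spaces inside the name are kept in both cases; only the trailing ones differ.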

Now our code recognizes the weirdly formatted section name:
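A sketch of the updated code, under the assumption that the input looks roughly like the hypothetical lines below (the helper name find_sections is also made up for this example):

```python
import re

# Whitespace-tolerant header pattern from the text.
SECTION_RE = re.compile(r"^\s*\[\s*(.+?)\s*\]")

# Hypothetical input, including an oddly written second header.
SAMPLE_LINES = [
    "[section1]",
    "key1=value1",
    "   [   section 2   ]   ",
    "key2=value2",
]


def find_sections(lines):
    """Collect the cleaned-up name of every section header in lines."""
    names = []
    for line in lines:
        match = SECTION_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


print(find_sections(SAMPLE_LINES))
```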

This gives us the following output:

The second section header has been found and its name cleaned up.

Identifying key/value pairs

The next step is to write a regular expression that identifies the key/value pairs, maybe something like this: ^(.+)=(.+)$

In plain English:

  • Match ^ (the start of the line).
  • Match one or more characters (the key name), captured in a group.
  • Match the = character.
  • Match one or more characters (the key value), captured in a group.
  • Match $ (the end of the line).
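A quick sketch of this first key/value pattern in action. The helper name split_pair and the test lines are assumptions for illustration.

```python
import re

# First attempt from the text: anything, "=", anything.
KEY_VALUE_RE = re.compile(r"^(.+)=(.+)$")


def split_pair(line):
    """Return (key, value) if the line looks like key=value, else None."""
    match = KEY_VALUE_RE.match(line)
    return match.groups() if match else None


print(split_pair("key1=value1"))          # ('key1', 'value1')
print(split_pair("  key2  =  value2  "))  # ('  key2  ', '  value2  ')
print(split_pair("[section1]"))           # None
```

The second call shows the weakness of this version: any spaces around the key or the value end up inside the captured groups.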

Again, we’d like this regular expression to handle extraneous white-space, so let’s re-write it like this: ^\s*(.+?)\s*=\s*(.+?)\s*$

And our updated code now looks like this:
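The following is one possible shape for that code, not necessarily the original listing. It combines the two whitespace-tolerant patterns from the text; the function name describe, the output wording, and the sample lines are assumptions.

```python
import re

# Both whitespace-tolerant patterns from the text.
SECTION_RE = re.compile(r"^\s*\[\s*(.+?)\s*\]")
KEY_VALUE_RE = re.compile(r"^\s*(.+?)\s*=\s*(.+?)\s*$")

# Hypothetical test input with deliberately messy spacing.
SAMPLE_LINES = [
    "[section1]",
    "key1=value1",
    "",
    "   [   section 2   ]",
    "  key2  =  value 2  ",
]


def describe(line):
    """Classify one line; return a printable description or None."""
    match = SECTION_RE.match(line)
    if match:
        return "Section: {" + match.group(1) + "}"
    match = KEY_VALUE_RE.match(line)
    if match:
        key, value = match.groups()
        return "Key/value: {" + key + "} = {" + value + "}"
    return None  # blank or unrecognized line


for line in SAMPLE_LINES:
    result = describe(line)
    if result is not None:
        print(result)
```

Headers are tested first, so a line such as "[a=b]" is treated as a section rather than a key/value pair.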

We wrap the key names and values in curly braces when we print them out, so that we can see whether they have been trimmed correctly.

If we give it the following test input:

We get the following output:

  • https://github.com/morrissimo morrissimo

    Great plain-English walkthrough.

    A tweak: ignore end-of-line comments – e.g., “val1=foo # this text is a comment; ignore it!”. To make this tweak, just change the end-of-line token in the key-value regex to “\#”:
    ^\s*(.+?)\s*=\s*(.+?)\s*\#

    This makes the key-value regex ignore anything on a line after a “#” – even if it’s a line that starts with a comment.
