Robert's Perl Tutorial

http://www.sthomas.net/roberts-perl-tutorial.htm


A Worked Example: Date Change

Imagine you have a list of dates in the US format of month, day, year, as opposed to the rest of the world's logical notion of day, month, year. We need a regex to transpose the day and month. The dates are:

@dates=(
'01/22/95',
'5/15/87',
'8-13-96',
'5.27.78',
'6/16/1993'
);

The task can be split into steps such as:

  1. Match the first digit, or two digits. Capture this result.
  2. Match the delimiter, which appears to be one of / - .
  3. Match the second set of digits, and capture that result.
  4. Rebuild the string, but this time reversing the day and month.

That may not be all the steps, but it is certainly enough for a start. Planning a regex is important. So, first pass:

@dates=(
'01/22/95',
'5/15/87',
'8-13-96',
'5.27.78',
'6/16/1993'
);

foreach (@dates) {
	print;
	s#(\d\d)/(\d\d)#$2/$1#;
	print " $_\n";
}

Hmm. This hasn't worked for the dates delimited with - or . and two of the / dates, 5/15/87 and 6/16/1993, have come out wrong too. The first problem is pretty easy; we are just matching / , nothing else. The second problem arises because we are insisting on exactly two digits. Therefore, 5/15/87 is matched on the 15 and 87, not the 5 and 15. The date 6/16/1993 is matched on the 16 and the 19 of 1993.
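One way to confirm that diagnosis is to print what the two captures actually hold. This is only a minimal sketch; the %matched hash is an illustrative name, not part of the tutorial's code.

```perl
use strict;
use warnings;

# Record which part of each string the first-pass pattern matches.
my %matched;
foreach my $date ('01/22/95', '5/15/87', '6/16/1993') {
    if ($date =~ m#(\d\d)/(\d\d)#) {
        $matched{$date} = "$1/$2";
        print "$date matched on $1 and $2\n";
    }
}
```

The second and third lines of output should show 15 and 87, then 16 and 19, which is exactly the wrong pair in each case.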

We can fix both of those. First, we'll match either one or two digits. There are a few ways of doing this, such as \d{1,2} which means either one or two of the preceding item, or perhaps more easily \d\d? which means match one \d and the other digit is optional, hence the question mark. If we used \d+ then that would match all of 19988883, which is not a valid date, at least not as far as we are concerned.
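To see the difference between the three quantifiers, here is a small sketch that applies each one to the same string of digits. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $input = '19988883';

my ($one_or_two) = $input =~ /^(\d{1,2})/;   # at most two digits
my ($optional)   = $input =~ /^(\d\d?)/;     # the same thing, written differently
my ($greedy)     = $input =~ /^(\d+)/;       # as many digits as it can get

print "$one_or_two $optional $greedy\n";     # prints "19 19 19988883"
```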

Secondly, we'll use a character class for all the possible date delimiters. Here is just the loop with those amendments:

foreach (@dates) {
	print;
	s#(\d\d?)[/-.](\d\d?)#$2/$1#;
	print " $_\n";
}

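which fails. Examine the error statement carefully. The key word is 'range'. What range? Well, the range between / and . because - is the range operator within a character class. Perl reads /-. as 'every character from / to .', and because / comes after . in the character set, that range runs backwards and is invalid. So - is a special character, or a metacharacter, inside a character class. And to negate the special meaning of metacharacters we have to use a backslash.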

But wait! I don't hear you cry. Surely . is a metacharacter too? It is, but not within a character class so it doesn't need to be escaped.

foreach (@dates) {
	print;
	s#(\d\d?)[/\-.](\d\d?)#$2/$1#;
	print " $_\n";
}
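If you would like to check those two claims about character classes for yourself, a sketch along these lines should do it. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Outside a class, . matches any single character (except a newline).
my $dot_outside = ('x' =~ /^.$/)   ? 1 : 0;
# Inside a class, . matches only a literal full stop.
my $dot_inside  = ('x' =~ /^[.]$/) ? 1 : 0;
my $stop_inside = ('.' =~ /^[.]$/) ? 1 : 0;
# An escaped hyphen inside a class is a literal hyphen, not a range.
my $hyphen      = ('-' =~ m#^[/\-.]$#) ? 1 : 0;

print "$dot_outside $dot_inside $stop_inside $hyphen\n";   # prints "1 0 1 1"
```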

Nearly there. However, we are always replacing the delimiter with / which is messy. That's an easy fix:

foreach (@dates) {
	print;
	s#(\d\d?)([/\-.])(\d\d?)#$3$2$1#;
	print " $_\n";
}

so that fixes that. In case you were wondering, the . dot does not act as '1 of anything' inside a character class. It would defeat the object of the character class if it did. So it doesn't need escaping. There is a further improvement you can make to this regex:

$m='/.-';

foreach (@dates) {
	print;
	s#(\d\d?)([$m])(\d\d?)#$3$2$1#;
	print " $_\n";
}

which is good practice because you are bound to want to change your delimiters at some point, and putting them inside the regex is hardcoding, and we all know that ends in tears. You can also re-use the $m variable elsewhere.

Did you notice the difference between what we assign to $m and what we had before?

    /\-.
$m='/.-';

The difference is that the - is no longer escaped. Why not? Logic. Perl knows - is the range operator. Therefore, there must be a character to the immediate left and immediate right of it in order for it to work, for example e-f. In the string we assign to $m, the - is the last character, so once $m is interpolated into [$m] there is no character to the right of it and Perl doesn't interpret it as a range operator. Try this:

$m='/-.';

and watch it fail.
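You can watch it fail without killing the whole script by compiling the class inside an eval. The sketch below assumes a helper called class_compiles, which is a made-up name. It also shows one possible way round the ordering problem, the built-in quotemeta function, which escapes every non-word character so the position of the - no longer matters.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if [$chars] compiles as a character
# class, 0 if Perl rejects it.
sub class_compiles {
    my ($chars) = @_;
    return eval { my $re = qr/[$chars]/; 1 } ? 1 : 0;
}

my $hyphen_last   = class_compiles('/.-');             # - is last, so literal
my $hyphen_middle = class_compiles('/-.');             # /-. is a backwards range
my $quoted        = class_compiles(quotemeta('/-.'));  # every character escaped

print "$hyphen_last $hyphen_middle $quoted\n";         # prints "1 0 1"
```

Whether you prefer to keep the - at the end or to run the string through quotemeta is a matter of taste; the second is harder to break by accident.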

Something else that causes heartache is matching what you don't mean to. Try this:

@dates=(
'01/22/95',
'5/15/87',
'8-13-96',
'5.27.78',
'/16/1993',
'8/1/993',
);

$m='/.-';

foreach (@dates) {
	print;
	s#(\d\d?)([$m])(\d\d?)#$3$2$1# or print "Invalid date! ";
	print " $_\n";
}

The two invalid dates at the end are let through. If you wanted to check the validity of every possible date since the start of the modern calendar then you might be better off with a database rather than a regex, but we can do some basic checking. The important point is that we know the limitations of what we are doing.

What we can do is make sure of two things: that there are three sets of digits separated by our chosen delimiters, and that the last set of digits is either two digits, eg 99, 98, 87, or four digits, eg 1999, 1998, 1987.

How can we do this? Extend the match. After the second digit match we need to match the delimiter again, then either two digits or four digits. How about:

$m='/.-';

foreach (@dates) {
	print;
	s#(\d\d?)([$m])(\d\d?)[$m](\d\d|\d{4})#$3$2$1$2# or print "Invalid date! ";
	print " $_\n";
}

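which doesn't really work out. The problem is it lets 993 through. This is because \d\d is tried first and happily matches the 99 on the front of 993, and nothing in the regex says the year has to stop there. Furthermore, the replacement doesn't put the year back, so it goes missing from the end result.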
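To see exactly what the year group grabs from the bad date, here is a short sketch using the same pattern as a plain match. The $year variable is only there for the example.

```perl
use strict;
use warnings;

my $m = '/.-';
my $year;

if ('8/1/993' =~ m#(\d\d?)([$m])(\d\d?)[$m](\d\d|\d{4})#) {
    $year = $4;                        # just the first two digits of 993
    print "Year captured as $year\n";  # prints "Year captured as 99"
}
```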

The delimiter match is also faulty. We could match / as the first delimiter, and - as the second. So, three problems to fix:

foreach (@dates) {
	print;
	s#(\d\d?)([$m])(\d\d?)\2(\d\d|\d{4})$#$3$2$1$2$4# or print "Invalid!";
	print " $_\n";
}

This is now looking like a serious regex. Changes:

  1. We are re-using the second match, which is the delimiter, further on in the regex. That's what the \2 is. This ensures the second delimiter is the same as the first one, so 5/7-98 gets rejected.
  2. The $ on the end means end of string. Nothing allowed after that. So the regex now has to find either 2 or 4 digits at the end of the string, or it fails.
  3. Added the match of the year ($4) to the rebuild section of the regex.
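A quick check of the first two changes, as a sketch: the pattern is compiled once with qr and tried against a few strings, including the mixed-delimiter date mentioned above. The %result hash is an illustrative name.

```perl
use strict;
use warnings;

my $m  = '/.-';
my $re = qr#(\d\d?)([$m])(\d\d?)\2(\d\d|\d{4})$#;

my %result;
foreach my $date ('5/7/98', '5/7-98', '8/1/993', '6/16/1993') {
    $result{$date} = ($date =~ $re) ? 'ok' : 'rejected';
    print "$date: $result{$date}\n";
}
```

The two middle dates should be rejected, the first because the delimiters differ and the second because 993 is neither two nor four digits.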

Regex can be as complex as you need. The code above can be improved still further. We could reject all four-digit years that don't begin with either 19 or 20. The other problem with the code so far is that it would reject a string like '! 4/23/1972 !', which contains a valid date, because there are characters after the year. Both can be fixed:

@dates=(
'01/22/95',
'5/15/87',
'8-13-96',
'5.27.78',
'/16/1993',
'8/1/993',
'3/29/1854',
'! 4/23/1972 !',
);

$m='/.-';

foreach (@dates) {
	print;
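	s#(\d\d?)([$m])(\d\d?)\2(\d\d|(?:19|20)\d{2})($|\D)#$3$2$1$2$4$5# or print "Invalid!";
	print " $_\n";
}

We have now got a nested OR, and the inner OR is non-capturing for reasons of efficiency and readability. At the end we alternate between letting the regex match either an end of line or any non-digit, symbolised with \D. That last group is captured and put back as $5, because otherwise the non-digit it matched would be swallowed by the replacement, and '! 4/23/1972 !' would lose the space after the year.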
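The effect of making the inner group non-capturing is easiest to see by counting captures. This sketch takes only the year part of the pattern and anchors it at both ends; the array names are made up for the example.

```perl
use strict;
use warnings;

my $year = '1972';

# Inner group captures: the century turns up as an extra capture.
my @capturing     = $year =~ /^(\d\d|(19|20)\d{2})$/;
# Inner group is (?:...): it groups only, so the year is the only capture.
my @non_capturing = $year =~ /^(\d\d|(?:19|20)\d{2})$/;

print scalar(@capturing), " captures: @capturing\n";          # prints "2 captures: 1972 19"
print scalar(@non_capturing), " captures: @non_capturing\n";  # prints "1 captures: 1972"
```

With the capturing version, any group that came later in a longer regex would have its number pushed up by one, which is a good way to end up rebuilding the string with the wrong pieces.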

We could go on. It is often very difficult to write a regex that matches anything of even minor complexity with absolute certainty. Think about IP addresses, for example. What is important is to build the regex carefully, and understand what it can and cannot do. Reporting anything that fails to match, as the 'or print' above does, is a good idea too. Test your regex with all sorts of invalid data, and you'll understand what it can do.
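One way to do that testing is to keep a table of inputs alongside the results you expect, and run the regex over the lot every time you change it. What follows is only a sketch under a few assumptions: swap_day_month is a hypothetical helper, and this version of the pattern captures the trailing end-of-line or non-digit as $5 and writes it back, so that whatever follows the year is not lost.

```perl
use strict;
use warnings;

my $m = '/.-';

# Hypothetical helper: returns the text with day and month swapped,
# or undef if it contains no date that the regex accepts.
sub swap_day_month {
    my ($text) = @_;
    $text =~ s#(\d\d?)([$m])(\d\d?)\2(\d\d|(?:19|20)\d{2})($|\D)#$3$2$1$2$4$5#
        or return;
    return $text;
}

# Each case pairs an input with the expected result (undef = rejected).
my @cases = (
    [ '01/22/95',      '22/01/95'      ],
    [ '8-13-96',       '13-8-96'       ],
    [ '! 4/23/1972 !', '! 23/4/1972 !' ],
    [ '/16/1993',      undef           ],
    [ '8/1/993',       undef           ],
    [ '3/29/1854',     undef           ],
    [ '5/7-98',        undef           ],
);

my $failures = 0;
foreach my $case (@cases) {
    my ($input, $expected) = @$case;
    my $got = swap_day_month($input);
    my $ok  = (defined $got && defined $expected) ? ($got eq $expected)
            : (!defined $got && !defined $expected);
    $failures++ unless $ok;
    printf "%-6s %s\n", ($ok ? 'ok' : 'NOT ok'), $input;
}
print "$failures failure(s)\n";
```

Adding a new awkward date is then a one-line change to @cases, and any NOT ok line tells you straight away which assumption about the regex was wrong.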