Robert's Perl Tutorial

http://www.sthomas.net/roberts-perl-tutorial.htm


* + -- regexes become line noise

What if we didn't know what the email address was going to be ?

$_='My email address is <webslave@work.com>.';

print "Found it ! :$1:" if /(<.*>)/i;

When you see an if statement like this, read it right to left. The print statement is only executed if the condition to the right of the if is true.

We'll discuss this. Firstly, we have the opening parens ( . So everything from ( to ) will be put into $1 if the match is successful. Then the first character of what we are searching for, < . Then we have a dot, or period . . For this regex, we can assume . matches any character at all.

So we are now matching < followed by any character. The * means 0 or more of the previous character. The regex finishes by requiring > .
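If the right-to-left form is hard to read, the same test can be written the long way round. This is only a sketch: the variable $found is a name made up for illustration, not something from the example above.

```perl
use strict;
use warnings;

$_ = 'My email address is <webslave@work.com>.';

my $found;    # hypothetical variable, holds whatever the parens captured

# Same test as the one-line version, spelled out left to right.
if (/(<.*>)/) {
    $found = $1;    # everything matched between ( and )
    print "Found it ! :$found:\n";    # prints Found it ! :<webslave@work.com>:
}
```

Both forms do the same job. The one-line version is just shorter when there is only a single statement to run.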

This is important. Get the basics right and all regexes are easy (I read somewhere once). An example best illustrates the point. Slot this regex in instead:

$_='My email address is <webslave@work.com>.';

print "Found it ! :$1:" if /(<*>)/i;

What's happening here ?

The regex starts, logically, at the start of the string. This doesn't mean it starts at 'M', it starts just before 'M'. There is a 'nothing' between the string start and 'M'.

The regex is searching for <* , which is 0 or more < .

The first thing it finds is not < , but the nothing in between the start of the string and the 'M' from 'My email...'. Does this match ?

As the regex is looking for "0 or more" < , we can certainly say that there are 0 < at the start of the string. So the match is, so far, successful. We have dealt with <* .

However, the next item to match is > . Unfortunately, the next character in the string is 'M', from 'My email...'. The match fails at this point. Sure, it matched <* without any problem, but the complete match has to work.

The only two characters that can match successfully at this point are < or > . The 'point' being that <* has been matched successfully, and we need either > to complete the match or more of < to continue the '0 or more' match denoted by * .

'M' is neither of them, so the attempt fails at this point, having matched nothing more than the zero-length <* .

Quick clarification - the regex cannot successfully match < , then skip on ahead through the string until it matches > . The characters in the string between < > also need to match the regex, and they don't in this case.

All is not lost. Regexes are hardy little beasts and don't give up easily. An attempt is made to match the regex wherever possible. The regex system keeps trying the match at every possible place in the string, working towards the end.

Let's look at the attempt that starts just after the 'm' in 'work.com'.

Again, we have here 0 < . So the match works as before. After success on <* the next character is analysed. It is a > , so the match is successful.
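The walk along the string can be imitated by hand. The sketch below is not how Perl is implemented internally. It just uses pos() and the \G anchor to force one attempt at each starting offset in turn, and the variable names ($text, $hit_offset, $hit_text) are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'My email address is <webslave@work.com>.';

my ($hit_offset, $hit_text);    # hypothetical names for the first success

# Try the regex at offset 0, then 1, then 2, and so on.
for my $start (0 .. length $text) {
    pos($text) = $start;            # \G anchors the attempt at this offset
    if ($text =~ /\G(<*>)/g) {
        ($hit_offset, $hit_text) = ($start, $1);
        last;                       # stop at the first place that works
    }
}

print "First success at offset $hit_offset, \$1 is :$hit_text:\n";
# prints First success at offset 38, $1 is :>:
```

Every attempt before offset 38 fails, including the one that starts on the < at offset 20, because what follows the < there is a w and not a > .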

But, be warned. The match may be successful but your job is not done. Assuming the objective was to return the email address within the angle brackets, that regex is a miserable failure: all $1 contains is > . Watch for traps of this nature when regexing.

That's * explained. Just to consolidate, a quick look at:

$_='My email address is <webslave@work.com>.';
print "Match 1 worked :$1:" if /(<*)/i;

$_='<My email address is <webslave@work.com>.';
print "Match 2 worked :$1:" if /(<*)/i;

$_='My email address is <webslave@work.com<<<<>.';
print "Match 3 worked :$1:" if /(<*>)/i;

Match 1 is true. $1 is empty, but the match succeeds because there are 0 < at the very start of the string.

Match 2 works. This time there is a < at the very start of the string, so <* matches that one character and $1 contains < .

Match 3 works. After failing at the first < (the next character is w , not > ), the regex carries on along the string until it reaches the run of < near the end. It matches all four of them, and then the required > .
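To see exactly what ends up in $1 each time, the three tests can be run from a small table. A minimal sketch, with the array names @cases and @captured invented for the purpose:

```perl
use strict;
use warnings;

# Each entry: the string to search, and the regex to search it with.
my @cases = (
    [ 'My email address is <webslave@work.com>.',     qr/(<*)/  ],
    [ '<My email address is <webslave@work.com>.',    qr/(<*)/  ],
    [ 'My email address is <webslave@work.com<<<<>.', qr/(<*>)/ ],
);

my @captured;    # what $1 held after each successful match

for my $i (0 .. $#cases) {
    my ($string, $regex) = @{ $cases[$i] };
    if ($string =~ $regex) {
        push @captured, $1;
        printf "Match %d worked :%s: (length %d)\n", $i + 1, $1, length $1;
    }
}
# prints:
# Match 1 worked :: (length 0)
# Match 2 worked :<: (length 1)
# Match 3 worked :<<<<>: (length 5)
```

Printing the length makes the empty capture in Match 1 easier to spot than a pair of colons with nothing between them.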

Glad you followed that. Now, pay even closer attention ! Concentrate fully on the task at hand ! This should be straightforward now:

$_='HTML <I>munging</I> time !.';

/<I>(.*)<\/I>/i;

print "Found it ! $1\n";

Pretty much the same as the above, except the parens are moved so we return only what's inside the tags, not the tags themselves. Also note how / is escaped, like so: \/ . Otherwise Perl thinks that's the end of the regex.

Now, suppose we change $_ to :

$_='HTML <I>munging</I> time is here <I>again</I> !.';

and run it again. Interesting effect, eh ? This is known as Greedy Matching. What happens is that once Perl has matched the initial <I> , the .* grabs everything up to the end of the string, and Perl then works back from there until the rest of the regex, </I> , can match. So the longest possible string matches, and $1 runs from munging right through to again . This is fine unless you want the shortest string. And there is a solution:

/<I>(.*?)<\/I>/i;

Just add a question mark and Perl does stingy matching. No nationalistic jokes. I have Dutch and Scottish friends I don't want to offend.
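Here are the two versions side by side on the longer string, as a quick sketch. The variables $greedy and $stingy are names chosen for illustration. A match in list context hands back what the parens captured, so there is no need to go through $1.

```perl
use strict;
use warnings;

$_ = 'HTML <I>munging</I> time is here <I>again</I> !.';

# In list context the match returns the captured text directly.
my ($greedy) = /<I>(.*)<\/I>/i;     # .* takes as much as it can
my ($stingy) = /<I>(.*?)<\/I>/i;    # .*? takes as little as it can

print "Greedy : $greedy\n";    # prints Greedy : munging</I> time is here <I>again
print "Stingy : $stingy\n";    # prints Stingy : munging
```

The only difference between the two regexes is the question mark after the * .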