Robert's Perl Tutorial

http://www.sthomas.net/roberts-perl-tutorial.htm


More Matching

Assume we have:

$_='HTML <I>munging</I> time is here <I>again</I> !.';

and we want to find all the italic words. We know that /g will match globally, so surely this will work:

$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

$match=/<i>(.*?)<\/i>/ig;

print "$match\n";

except it returns 1, and there were definitely two matches. In scalar context the match operator returns true or false, not the number of matches. So you can test it for truth with if, while or unless. Incidentally, the s/// operator does return the number of substitutions.
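To see the two return values side by side, here is a minimal sketch. The variable names ($text, $matched, $copy, $changed) are invented for this illustration, and it uses strict and lexical variables rather than $_ as the rest of this page does.

```perl
use strict;
use warnings;

my $text = 'HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

# Scalar context: the match reports success or failure, not a count.
my $matched = $text =~ /<i>(.*?)<\/i>/ig;

# s///g reports how many substitutions it made.
my $copy    = $text;    # work on a copy so $text stays intact
my $changed = ($copy =~ s/<i>(.*?)<\/i>/<b>$1<\/b>/ig);

print "Match returned $matched\n";
print "Substitution returned $changed\n";
print "$copy\n";
```

The match still reports 1, while the substitution reports 2 because both italic pairs were rewritten.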

To return what is matched, you need to supply a list.

($match) = /<i>(.*?)<\/i>/i;

which handily puts the text captured by the first match into $match. Note that = is used (for assignment), as opposed to =~ (which points the regex at a variable other than $_).
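The two operators can also be combined when the text lives in a variable other than $_. The sketch below assumes a variable called $html (a name made up for this example) and shows the same regex in list and in scalar context.

```perl
use strict;
use warnings;

my $html = 'HTML <I>munging</I> time is here <I>again</I> !';

# =~ points the regex at $html; the parens around $first give list
# context, so = receives the captured text rather than a true/false value.
my ($first) = $html =~ /<i>(.*?)<\/i>/i;

# Without the parens the assignment is in scalar context again.
my $flag = $html =~ /<i>(.*?)<\/i>/i;

print "List context gave: $first\n";
print "Scalar context gave: $flag\n";
```

Only the parens on the left differ between the two assignments, which is what decides whether you get the captured text or a truth value.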

The parens around $match force list context in this case. There is just the one element in the list, but it is still a list. In list context the match returns whatever the parens in the regex captured, and that is what gets assigned to the list. Try adding another variable to the list:

$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

($word1, $word2) = /<i>(.*?)<\/i>/ig;

print "Word 1 is $word1 and Word 2 is $word2\n";

In the example above, notice that /g has been added so a global match is done. This means perl carries on matching even after it finds the first match. Of course, you might not know how many matches there will be, so you can just use an array, or any other type of list:

$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

@words = /<i>(.*?)<\/i>/ig;

foreach $word (@words) {
        print "Found $word\n";
}

and @words will grow to the appropriate size for the matches. You really can supply what you like to be assigned to:

($word1, @words[2..3], $last) = /<i>(.*?)<\/i>/ig;

You'll need more italics for that last one to work: four matches, one for each slot in the list. It was only a demonstration.
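For the curious, here is a sketch of that mixed assignment with enough italics to fill every slot. The sample string is made up for this illustration, and the variables are declared with my so it runs under strict.

```perl
use strict;
use warnings;

# Four italic words, one for each slot on the left-hand side.
my $text = '<I>one</I> <I>two</I> <I>three</I> <I>four</I>';

my ($word1, @words, $last);
($word1, @words[2..3], $last) = $text =~ /<i>(.*?)<\/i>/ig;

print "First: $word1\n";
print "Slice: $words[2] $words[3]\n";
print "Last:  $last\n";
```

The matches are handed out left to right, so elements 0 and 1 of @words stay undefined while elements 2 and 3 receive the second and third matches.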

There is another trick worth knowing. Because a regex returns true each time it matches, we can test that and do something every time it returns true. The ideal construct is the while loop, which means 'do something as long as the condition I'm testing is true'. In this case, we'll print out the match every time it is true.

$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

while (/<(.*?)>(.*?)<\/\1>/g) {
        print "Found the HTML tag $1 which has $2 inside\n";
}

So the while loop runs the regex, and if it is true, carries out the statements inside the block.

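Try running the program above without the /g. Notice how it loops forever? That's because without /g every test starts again from the beginning of the string and finds the same first match, so the expression always evaluates to true. By using /g we force each match to start where the previous one ended, until it eventually fails.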
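You can watch the match move along the string with the built-in pos function, which reports where the next /g match will start searching. This is a small sketch; the names $text and @offsets are invented for the illustration.

```perl
use strict;
use warnings;

my $text = 'HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

my @offsets;
while ($text =~ /<i>(.*?)<\/i>/ig) {
    # pos() is the offset where the next /g match on $text will start.
    push @offsets, pos($text);
    print "Matched '$1', next search starts at offset $offsets[-1]\n";
}

# Once the match finally fails, the position is reset.
my $after = defined pos($text) ? pos($text) : 'undefined';
print "Position after the loop is $after\n";
```

Each pass through the loop reports a larger offset, and once the match fails the position is cleared, so a later /g match on the same string starts from the beginning again.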

Now we know this, an easy way to find the number of matches is:

$_='HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

$found++ while /<i>.*?<\/i>/ig;

print "Found $found matches\n";

You don't need braces in this case because while is being used as a statement modifier: it comes after the single statement it controls, so there is no block to mark out.
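As a rough comparison, the sketch below counts the same matches three ways: the modifier form, the equivalent loop with braces, and a different idiom that assigns the matches to an empty list and reads that assignment in scalar context. The third is not the method described above, just a common alternative, and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = 'HTML <I>munging</I> time is here <I>again</I> ! What <EM>fun</EM> !';

# Statement-modifier form: no braces needed.
my $found = 0;
$found++ while $text =~ /<i>.*?<\/i>/ig;

# The same loop written out with a block.
my $found_block = 0;
while ($text =~ /<i>.*?<\/i>/ig) {
    $found_block++;
}

# Alternative idiom: a list assignment evaluated in scalar context
# yields the number of elements on its right-hand side.
my $count = () = $text =~ /<i>.*?<\/i>/ig;

print "Modifier form found $found matches\n";
print "Block form found $found_block matches\n";
print "List assignment found $count matches\n";
```

All three agree on this string; which one to use is mostly a matter of taste and of whether you need to do anything else inside the loop.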