* + -- regexes become line noise
What if we didn't know what the email address was going to be ?
$_='My email address is <webslave@work.com>.';
print "Found it ! :$1:" if /(<.*>)/i;
When you see an if statement like this, read it right to left. The print
statement is only executed if the expression to the right of the if is true.
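If the trailing if looks odd, it may help to see the same test written the long way round. This is only a sketch: the variable $found is my own addition for illustration, and the "\n" on the end of each print is just there to keep the output tidy.

```perl
use strict;
use warnings;

$_ = 'My email address is <webslave@work.com>.';

# Statement-modifier form: the condition comes last.
print "Found it ! :$1:\n" if /(<.*>)/i;

# Block form: the same test, with the condition first.
my $found;
if (/(<.*>)/i) {
    $found = $1;    # $1 holds whatever the parens captured
    print "Found it ! :$found:\n";
}
```

Both forms run the print only when the regex matches, so pick whichever reads better.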
We'll discuss this. Firstly, we have the opening parens ( .
So everything from ( to )
will be put into $1 if the match
is successful. Then the first character of what we are searching for, < . Then we have a dot, or period . . For this regex, we can assume . matches any character at all.
So we are now matching < followed by any
character. The * means 0 or more of the previous
character. The regex finishes by requiring > .
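The walk-through above can be written straight into the regex. The sketch below assumes the x modifier, which lets you spread a regex over several lines and comment each piece. The m{ } delimiters and the $text and $captured variables are my additions, not part of the original example.

```perl
use strict;
use warnings;

my $text = 'My email address is <webslave@work.com>.';

# In list context a match returns what the parens captured.
my ($captured) = $text =~ m{
    (       # start of the capture group
      <     # a literal opening angle bracket
      .*    # any character, 0 or more times
      >     # a literal closing angle bracket
    )       # end of the capture group
}x;

print "Found it ! :$captured:\n" if defined $captured;
```

With x switched on, whitespace inside the regex is ignored, so the layout is purely for the reader.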
This is important. Get the basics right and all regex are easy (I read somewhere once). An example best illustrates the point. Slot this regex in instead:
$_='My email address is <webslave@work.com>.';
print "Found it ! :$1:" if /(<*>)/i;
What's happening here ?
The regex starts, logically, at the start of the string. This doesn't mean it starts at 'M', it starts just before 'M'. There is a 'nothing' between the string start and 'M'.
The regex is searching for <* , which is 0
or more < .
The first thing it finds is not < , but
the nothing in between the start of the string and the 'M' from 'My email...'. Does
this match ?
As the regex is looking for "0 or more" < ,
we can certainly say that there are 0 < at
the start of the string. So the match is, so far, successful. We have dealt with <* .
However, the next item to match is > .
Unfortunately, the next item in the string is 'M', from 'My email...'. The match fails
at this point. Sure, it matched < without any
problem, but the complete match has to work.
The only two characters that can match successfully at this point are < or > .
The 'point' being that <* has been matched
successfully, and we need either > to
complete the match or more of < to continue
the '0 or more' match denoted by * .
'M' is neither of them, so the match fails at this point.
Quick clarification - the regex cannot successfully match < , then skip on ahead
through the string until it matches > . The characters in the string between < and >
also need to match the regex, and they don't in this case.
All is not lost. Regexes are hardy little beasts and don't give up easily. An attempt is made to match the regex wherever possible. The regex system keeps trying the match at every possible place in the string, working towards the end.
Let's look at the match when it reaches the 'm' in 'work.com'.
Again, we have here 0 < . So the match
works as before. After success on <* the next
character is analysed - it is a > , so the
match is successful.
But, be warned. The match may be successful but your job is not done. Assuming the objective was to return the email address within the angle brackets then that regex is a miserable failure. Watch for traps of this nature when regexing.
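To see exactly how miserable the failure is, you can ask Perl what it matched and where. A minimal sketch: $-[0] is Perl's built-in offset of the start of the last successful match, while $text, $captured and $start are names I made up for the example.

```perl
use strict;
use warnings;

my $text = 'My email address is <webslave@work.com>.';

my ($captured, $start);
if ($text =~ /(<*>)/) {
    $captured = $1;       # the text inside the parens
    $start    = $-[0];    # offset where the whole match began
    print "Matched :$captured: at offset $start\n";
}
```

All that comes back is the lone > near the end of the string, matched after 0 < . The email address is nowhere to be seen.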
That's * explained. Just to consolidate, a
quick look at:
$_='My email address is <webslave@work.com>.';
print "Match 1 worked :$1:" if /(<*)/i;
$_='<My email address is <webslave@work.com>.';
print "Match 2 worked :$1:" if /(<*)/i;
$_='My email address is <webslave@work.com<<<<>.';
print "Match 3 worked :$1:" if /(<*>)/i;
Match 1 is true. It doesn't return anything, but it is true
because there are 0 < at the very
start of the string.
Match 2 works. This time the string starts with < , so <* matches
that 1 < and returns it. The match is over before the second <
is ever looked at.
Match 3 works. After failing on the first < ,
it moves on through the string until it reaches the second. After that, there are plenty more
to match right up until the required ending.
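If you want to check the three results side by side, here is one way to do it. It is a sketch only: the @cases table, the qr// compiled regexes and the @captures array are my own scaffolding around the same three strings and regexes. I have left off the i modifier as there are no letters in these regexes for it to affect.

```perl
use strict;
use warnings;

# Each entry: [ label, string, regex ]
my @cases = (
    [ 'Match 1', 'My email address is <webslave@work.com>.',     qr/(<*)/  ],
    [ 'Match 2', '<My email address is <webslave@work.com>.',    qr/(<*)/  ],
    [ 'Match 3', 'My email address is <webslave@work.com<<<<>.', qr/(<*>)/ ],
);

my @captures;
for my $case (@cases) {
    my ($label, $string, $regex) = @$case;
    if ($string =~ $regex) {
        push @captures, $1;             # keep what the parens captured
        print "$label worked :$1:\n";
    }
}
```

Match 1 prints nothing between the colons, Match 2 prints a single < and Match 3 prints <<<<> .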
Glad you followed that. Now, pay even closer attention ! Concentrate fully on the task at hand ! This should be straightforward now:
$_='HTML <I>munging</I> time !.';
/<I>(.*)<\/I>/i;
print "Found it ! $1\n";
Pretty much the same as the above, except the parens are moved so
we return only what's inside the tags, not including the tags themselves. Also
note how / is escaped like so: \/ . Otherwise Perl thinks that's the end of
the regex.
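If the backslash bothers you, Perl also lets you choose different delimiters by writing the match with a leading m. The sketch below compares the two spellings. The variable names are mine, and m{ } is just one of several delimiter pairs you could pick.

```perl
use strict;
use warnings;

my $html = 'HTML <I>munging</I> time !.';

# Slash delimiters: the / in </I> must be escaped.
my ($with_slashes) = $html =~ /<I>(.*)<\/I>/i;

# Brace delimiters: the / is just an ordinary character.
my ($with_braces) = $html =~ m{<I>(.*)</I>}i;

print "Found it ! $with_braces\n";
```

Both regexes capture the same text. Only the punctuation around them differs.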
Now, suppose we change $_ to :
$_='HTML <I>munging</I> time is here <I>again</I> !.';
and run it again. Interesting effect, eh ? This is known as
Greedy Matching. What happens is that once Perl has matched the initial
<I> , the .* grabs everything right up to the end
of the string and then works back from there until it finds a </I> , so the longest string
matches. This is fine unless you want the shortest string. And there is a
solution:
/<I>(.*?)<\/I>/i;
Just add a question mark and Perl does stingy matching. No nationalistic jokes. I have Dutch and Scottish friends I don't want to offend.
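Here are the greedy and stingy versions run against the same string, so you can compare them directly. A small sketch, with variable names of my own choosing. The last match adds the g modifier, which goes a step beyond the example above: in list context it hands back every stingy match in one go.

```perl
use strict;
use warnings;

my $html = 'HTML <I>munging</I> time is here <I>again</I> !.';

my ($greedy) = $html =~ /<I>(.*)<\/I>/i;     # longest possible match
my ($stingy) = $html =~ /<I>(.*?)<\/I>/i;    # shortest possible match

# With the g modifier in list context, every stingy match is returned.
my @all = $html =~ /<I>(.*?)<\/I>/gi;

print "Greedy: $greedy\n";
print "Stingy: $stingy\n";
print "All:    @all\n";
```

The greedy version swallows everything from munging through to again, tags in the middle included. The stingy one stops at the first closing tag.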