Regular expressions

Greediness and Laziness

Example

Given the HTML text from the previous example, let us now try to extract the text inside the opening <A HREF=".."> tag, that is, everything between its < and > signs.

We would like to get the result:

A HREF="http://tarshish.md.biu.ac.il/assignment6.html"

However, note that the following code retrieves more than that:
#!/usr/bin/perl
use strict;
use warnings;

my $html = "<A HREF=\"http://tarshish.md.biu.ac.il/assignment6.html\"> Assignment 6 </A>.";

# Greedy: .* runs to the last > in the string, not the nearest one.
$html =~ /<(.*)>/;

print "$1\n";
Result:
A HREF="http://tarshish.md.biu.ac.il/assignment6.html"> Assignment 6 </A

Explanation

The * quantifier (like the + quantifier) is greedy: it first "grabs" as much text as it can, and gives characters back only as far as needed for the rest of the regular expression to match.

In the example above, it matched all the text until the last > sign of the HTML text, and not until the closest one.

To make a quantifier lazy, so that it "grabs" as little text as it can while still letting the rest of the regular expression match, write a question mark ? after it (*? or +?).
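
The difference between the greedy and lazy forms can be seen side by side in a minimal sketch. The sample string and the helper first_capture are assumptions made up for this illustration; they are not part of the original example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: apply $pattern to $text and return the first
# captured group, or undef if the pattern does not match.
sub first_capture {
    my ($text, $pattern) = @_;
    return $text =~ $pattern ? $1 : undef;
}

# Illustrative sample text with more than one > sign.
my $text = "<b>bold</b> and <i>italic</i>";

my $greedy_star = first_capture($text, qr/<(.*)>/);   # stops at the last >
my $lazy_star   = first_capture($text, qr/<(.*?)>/);  # stops at the first >
my $greedy_plus = first_capture($text, qr/<(.+)>/);   # + is greedy as well
my $lazy_plus   = first_capture($text, qr/<(.+?)>/);  # +? is its lazy form

print "greedy *: $greedy_star\n";
print "lazy   *: $lazy_star\n";
print "greedy +: $greedy_plus\n";
print "lazy   +: $lazy_plus\n";
```

Both greedy patterns capture everything up to the final > sign, while both lazy patterns capture only b, the text up to the nearest > sign.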

Example

The correct regular expression for the example above is thus:
$html =~ /<(.*?)>/;
Result:
A HREF="http://tarshish.md.biu.ac.il/assignment6.html"
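
As a further sketch, the same two patterns can be applied with the /g modifier in list context, which returns the captured text of every match. This goes slightly beyond the example above and assumes the same $html string; the array names are chosen only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $html = "<A HREF=\"http://tarshish.md.biu.ac.il/assignment6.html\"> Assignment 6 </A>.";

# In list context, a match with /g returns the captured group of every match.
my @lazy_tags   = $html =~ /<(.*?)>/g;   # one entry per tag
my @greedy_tags = $html =~ /<(.*)>/g;    # a single entry spanning both tags

print "lazy:   $_\n" for @lazy_tags;
print "greedy: $_\n" for @greedy_tags;
```

With the lazy pattern each tag is captured separately (the opening tag and /A), whereas the greedy pattern produces one capture that runs from the first < sign to the last > sign.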
