Regular expressions

Parentheses as memory - notes

Note 1

Instead of using the $1, $2, $3 ... special variables, you may assign the "remembered" substrings directly into variables.

For example, to extract the day, month, year, hour and minutes from a date and time entry, write:

#!/usr/bin/perl

print "Please enter date and time, as in \"08-OCT-2012  16:30\"\n";
my $entry = <STDIN>;
chomp($entry);   # remove the trailing newline (chop would remove any last character)

my ($day, $month, $year, $hour, $min) = $entry =~ /(\d\d)-(\w\w\w)-(\d\d\d\d)\s+(\d\d):(\d\d)/;

# here the "remembered" parts from the regular expression were directly
# assigned to a list of variables.

print "Month: $month\n";
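
If the input is not in the expected format, the match fails and the variables stay undefined. The following is a minimal sketch of one way to check for that, relying on the fact that a failed match returns an empty list in list context. The helper name parse_entry and the sample strings are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_entry is a hypothetical helper, named here only for illustration.
# In list context a successful match returns the captured substrings;
# a failed match returns an empty list.
sub parse_entry {
    my ($entry) = @_;
    my @fields = $entry =~ /(\d\d)-(\w\w\w)-(\d\d\d\d)\s+(\d\d):(\d\d)/;
    return @fields;
}

# Sample strings are made up for the demonstration.
for my $entry ("08-OCT-2012  16:30", "8 October 2012") {
    # A list assignment used as a condition is true only if the
    # right-hand side returned at least one element.
    if (my ($day, $month, $year, $hour, $min) = parse_entry($entry)) {
        print "Day $day, month $month, year $year, time $hour:$min\n";
    }
    else {
        print "Could not parse \"$entry\"\n";
    }
}
```

The second sample string has no dashes, so it takes the else branch instead of leaving the variables silently undefined.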

Example

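Extract atom serial numbers and coordinates from a PDB file. PDB records use fixed columns, so each field can be picked out by its position in the line. Here is a typical ATOM line:

ATOM      1  N   GLY A   2       1.888  -8.251  -2.511  1.00 36.63           N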

#!/usr/bin/perl

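while (<>) {
        /^ATOM/ or next;    # skip everything except ATOM records
        # fields sit in fixed columns: serial number in 7-11, then x, y, z in 31-54
        ($n, $x, $y, $z) = ($_ =~ /^.{6}(.{5}).{19}(.{8})(.{8})(.{8}).{22}../) or next;
        print "$n $x $y $z\n";
}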
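
Because the fields are cut out by column position, the captured strings still carry their padding blanks. The sketch below applies the same idea to a single hard-coded ATOM record and strips the padding afterwards. It assumes the sample line is padded to the standard PDB column layout, and it stops the pattern after the z coordinate so that shorter lines also match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample ATOM record, assumed to be padded to the fixed PDB column layout.
my $line = "ATOM      1  N   GLY A   2       1.888  -8.251  -2.511  1.00 36.63           N";

# Columns 7-11 hold the serial number, columns 31-54 hold x, y and z.
my ($n, $x, $y, $z) = $line =~ /^.{6}(.{5}).{19}(.{8})(.{8})(.{8})/
    or die "Not a complete ATOM line\n";

# Remove the leading and trailing blanks from each captured field.
for ($n, $x, $y, $z) {
    s/^\s+|\s+$//g;
}

print "Atom $n: x=$x y=$y z=$z\n";
```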

Note 2

Parentheses are also used in alternation, e.g.
$line =~ /ACM1_(HUMAN|RAT|MOUSE)/;
and for grouping characters before a quantifier, e.g.
$seq =~ /(GATA){2,}/;   # or
$seq =~ /(GATA)+/;

In both cases, the parentheses will also cause "remembering" of the substrings matched by the patterns enclosed by them.
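
A short sketch of both cases, using made-up identifiers and a made-up sequence. With alternation, $1 holds whichever alternative matched. With a quantified group, the group remembers only one copy of the unit, so the sketch wraps the whole repeat in a second pair of parentheses to remember the full run as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up identifiers; only the first two carry one of the listed species.
my @species;
for my $line ("ACM1_HUMAN", "ACM1_RAT", "ACM1_YEAST") {
    if ($line =~ /ACM1_(HUMAN|RAT|MOUSE)/) {
        push @species, $1;    # $1 holds the alternative that matched
    }
}
print "Species found: @species\n";

# Groups are numbered by their opening parenthesis: the outer pair is
# group 1 (the whole run), the inner pair is group 2 (one repeated unit).
my $seq = "CCGATAGATAGATATT";    # made-up sequence
my ($run, $unit) = $seq =~ /((GATA){2,})/;
print "Repeat run: $run, repeated unit: $unit\n";
```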

Example

$seq =~ /TT(GATA)*(\w*)/;

# $1 will contain "GATA" (or be undefined if TT is not followed by GATA)
# $2 will contain the rest of the sequence after the GATA repetitions.

To avoid "remembering" what a pair of parentheses matches, write ?: immediately after the opening parenthesis, e.g.

$seq =~ /TT(?:GATA)*(\w*)/;

# $1 will now contain the rest of the sequence after the GATA repetitions.
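
The difference can be checked side by side. This is a small sketch on made-up sequences, and the variable names are chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = "TTGATAGATACCGG";    # made-up sequence

# Capturing group: the GATA unit takes $1, the remainder takes $2.
my ($unit, $rest) = $seq =~ /TT(GATA)*(\w*)/;
print "capturing:     unit=$unit rest=$rest\n";

# Non-capturing group: (?:...) still groups for the *, but nothing is
# remembered, so the remainder moves up to $1.
my ($only) = $seq =~ /TT(?:GATA)*(\w*)/;
print "non-capturing: rest=$only\n";

# With zero GATA repeats the capturing group takes no part in the match,
# so its value is undefined.
my ($none, $tail) = "TTCCGG" =~ /TT(GATA)*(\w*)/;
print "no repeats:    unit is ", (defined $none ? $none : "undefined"),
      ", rest=$tail\n";
```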

Note 3

To back-reference a remembered substring inside a regular expression, write \1, \2, \3 ... instead of $1, $2, $3 ...

Example

To find out whether a sequence contains two copies of some trinucleotide (but not one directly after the other), write:
if ($seq =~ /(...).+\1/) { print "Repeated trinucleotide: $1\n"; }
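
The following sketch tries the pattern on a few made-up sequences. The helper name has_spaced_repeat is hypothetical. Since .+ needs at least one character between the two copies, a sequence where the copies are directly adjacent does not match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# has_spaced_repeat is a hypothetical helper name used for illustration.
# Returns the repeated trinucleotide, or undef when there is none.
sub has_spaced_repeat {
    my ($seq) = @_;
    return $seq =~ /(...).+\1/ ? $1 : undef;
}

# Made-up sequences: separated copies, adjacent copies, no repeat.
for my $seq ("ACGTTTACGA", "ACGACG", "ACGTTT") {
    my $tri = has_spaced_repeat($seq);
    print "$seq: ",
          (defined $tri ? "repeat of $tri" : "no separated repeat"), "\n";
}
```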

