Text Processing Functions

Examples of using the substr function

Example 1:

Let's have a look at the first two lines and the description line of a Swiss-Prot entry:
0123456789012345678901234...
|         |         | 
ID   ACM1_HUMAN     STANDARD;      PRT;   460 AA.
AC   P11229;
...
...
DE   MUSCARINIC ACETYLCHOLINE RECEPTOR M1.
...
...
To extract the field names and values, it is enough to specify their positions on the line.
Field: start at position 0 and take 2 characters.
Value: start at position 5 and continue to the end of the line.
(Note that positions are counted from 0.)
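
As a small illustration of these positions, here is a sketch that applies substr to the AC line of the entry above. The variable names are chosen for this example only. The last call uses a negative length, which leaves that many characters off the end of the string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample line copied from the entry shown above.
my $line = "AC   P11229;";

# Two characters starting at position 0: the field name.
my $field = substr ($line, 0, 2);

# Everything from position 5 to the end of the line: the value.
my $value = substr ($line, 5);

# A negative length leaves that many characters off the end,
# which drops the trailing semicolon here.
my $acc = substr ($line, 5, -1);

print "$field => $value ($acc)\n";
```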

In the program below we will extract three pieces of information from the Swiss-Prot file: the protein identification, accession number and description.

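#!/usr/bin/perl
use strict;
use warnings;

my $file = "sp.txt";

open (my $sp, "<", $file) or die "cannot open \"$file\": $!";

my ($id, $ac, $de) = ("", "", "");

while (my $line = <$sp>) {
   chomp ($line);
   next if length ($line) < 5;         # too short to hold a field and a value

   my $field = substr ($line, 0, 2);   # field name: 2 characters from position 0
   my $value = substr ($line, 5);      # value: position 5 to end of line

   if ($field eq "ID") {
      $id = $value;
   }
   elsif ($field eq "AC") {
      $ac = $value;
   }
   elsif ($field eq "DE") {
      $de = $value;
   }
}

close ($sp);

print "Identification: $id\n";
print "Accession No. : $ac\n";
print "Description   : $de\n";

Program output: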
Identification: ACM1_HUMAN     STANDARD;      PRT;   460 AA.
Accession No. : P11229;
Description   : MUSCARINIC ACETYLCHOLINE RECEPTOR M1.
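
The program above keeps only the last line it sees for each field. Swiss-Prot entries can contain several lines with the same field name (a long description, for instance, may continue over more than one DE line), so a possible variation is to join repeated lines. The sketch below is one way to do this, not the only one: collect_fields is a hypothetical helper written for this example, and the split DE lines are invented sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collect the values of the wanted fields from
# a list of lines, joining repeated fields with a single space.
sub collect_fields {
   my ($lines, @wanted) = @_;
   my %want = map { $_ => 1 } @wanted;
   my %value_of;

   foreach my $line (@$lines) {
      next if length ($line) < 6;        # too short to hold a value
      my $field = substr ($line, 0, 2);
      next unless $want{$field};
      my $value = substr ($line, 5);
      $value_of{$field} = exists $value_of{$field}
         ? "$value_of{$field} $value"
         : $value;
   }
   return %value_of;
}

# Sample lines; the description is split over two DE lines
# only for the sake of illustration.
my @lines = (
   "AC   P11229;",
   "DE   MUSCARINIC ACETYLCHOLINE",
   "DE   RECEPTOR M1.",
);

my %entry = collect_fields (\@lines, "ID", "AC", "DE");
print "Description   : $entry{DE}\n";
```

A hash keyed by field name also removes the need for one if block per field.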


Example 2:

Let's extract all accession numbers from the Swiss-Prot list of muscarinic acetylcholine receptors.

After the three title lines, the accession number occupies positions 12 - 17 of each line, that is, six characters starting at position 12 (counting from 0).

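#!/usr/bin/perl
use strict;
use warnings;

my $sp_list_file = "sp_list.txt";

open (my $sp, "<", $sp_list_file) or die "cannot open \"$sp_list_file\": $!";

foreach (1 .. 3) {   # skip title lines
   <$sp>;
}

my @acc_nos = ();

while (my $line = <$sp>) {
   next if length ($line) < 18;        # line too short to reach position 17
   my $acc = substr ($line, 12, 6);    # 6 characters starting at position 12
   push (@acc_nos, $acc);
}

close ($sp);

print "@acc_nos\n";

Program output: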

P11229 P12657 P04761 P08482 P41985 P30372 P08172 P06199 P10980 P41984 P20309 P11483 P08483 P49578 P41986 P17200 P08173 P32211 32211 P08485 P30544 P08912 P08911 P16395
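
Extraction by fixed position only works when every line follows the same layout. The sketch below shows one way to guard against lines that are too short or shifted by a column. accession_at_12 is a hypothetical helper, the sample lines are invented, and the pattern (one capital letter followed by five digits) is an assumption based on the accession numbers in the output above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the accession number found at
# positions 12 - 17 of a line, or nothing if the line is too short
# or the characters do not look like an accession number.
sub accession_at_12 {
   my ($line) = @_;
   return unless length ($line) >= 18;
   my $acc = substr ($line, 12, 6);
   # Assumed pattern: one capital letter followed by five digits.
   return unless $acc =~ /^[A-Z][0-9]{5}$/;
   return $acc;
}

# Invented sample lines; only the column positions matter.
my @lines = (
   "ACM1_HUMAN  P11229  (rest of line)",
   "short line",
   "ACM1_HUMAN   P11229 (shifted by one column)",
);

my @acc_nos = grep { defined } map { accession_at_12 ($_) } @lines;
print "@acc_nos\n";
```

Only the first sample line passes both checks, so the other two are skipped rather than contributing a truncated value to the list.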

