Text Processing / Regular Expressions

General guidelines

The data files that you need for this assignment are located inside a directory named /home/guest/source/ass6. These files may also be obtained from here.
NOTE: Assume that case-insensitivity is required wherever applicable.


    The file kinases_map.txt contains mapping information of human protein kinase genes, taken from the OMIM Gene Map (Online Mendelian Inheritance in Man) database. Each line contains information on one gene, where fields are separated by a pipe sign, or vertical bar (|).

    Fields description:

  1. Write a program that reads this file and writes the gene symbol, gene name and cytogenetic location to another file, in tab-delimited format.
  2. Modify your program to print only tyrosine kinase genes (i.e. only those whose gene name contains the word tyrosine).
  3. Modify your program to print only genes located on chromosome 9.
  4. Modify your program so that it asks the user for a year, and then prints only genes entered into the database AFTER that year.
  5. Assume you want to clone the genomic region coding for the P53 gene, including all relevant introns. The file p53_seq.txt contains the full genomic sequence of the P53 gene, in FASTA format. The part of the sequence that is translated to protein (including several introns) is between nucleotides 11717 - 18680.

    1. Start by defining the locations of the beginning and end of the coding region in variables. Thereafter, use those variables for extracting the coding sequence.
    2. Read the sequence from the file and store it in a scalar variable.
    3. Extract the part of the sequence that is translated to protein (including the introns) and store it in another variable.
    4. Validate that the coding sequence starts with an ATG codon and ends with a stop codon (either TAA, TAG or TGA).
    5. Check whether the coding sequence contains a restriction site for BamHI (cuts at GGATCC).
    6. Check whether the coding sequence contains a restriction site for BstSFI (cuts at CGryCG, where r is either G or A, and y is either C or T).
    7. Extract the gene gi number from the first line and print it before every output in a descriptive manner. E.g., "The gene gi35213 has a Drd I restriction site".
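The sub-steps of task 5 could be sketched roughly as follows. This is only an illustration in Python, not a reference solution. The function names (parse_fasta, extract_region, has_start_and_stop, has_site, gi_number, analyze) are hypothetical, and the code assumes that the first line of p53_seq.txt is an NCBI-style header of the form ">gi|35213|..." and that the given coordinates are 1-based and inclusive.

```python
import re

# Coding-region boundaries from the assignment text (assumed 1-based, inclusive).
CODING_START = 11717
CODING_END = 18680

# Recognition sites as regular expressions; r = [AG], y = [CT].
BAMHI_SITE = r"GGATCC"
RY_SITE = r"CG[AG][CT]CG"


def parse_fasta(text):
    """Split single-record FASTA text into (header, sequence)."""
    lines = text.strip().splitlines()
    header = lines[0].lstrip(">").strip()
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence


def extract_region(sequence, start, end):
    """Return the region start..end, counting from 1 and including both ends."""
    return sequence[start - 1:end]


def has_start_and_stop(coding):
    """True if the region begins with ATG and ends with TAA, TAG or TGA."""
    starts = re.match(r"ATG", coding, re.IGNORECASE)
    stops = re.search(r"(TAA|TAG|TGA)\Z", coding, re.IGNORECASE)
    return bool(starts and stops)


def has_site(coding, pattern):
    """True if the case-insensitive pattern occurs anywhere in the region."""
    return re.search(pattern, coding, re.IGNORECASE) is not None


def gi_number(header):
    """Return the digits after 'gi' (assumed header layout), or None."""
    match = re.search(r"gi\|?(\d+)", header, re.IGNORECASE)
    return match.group(1) if match else None


def analyze(fasta_text, start=CODING_START, end=CODING_END):
    """Run all checks and return one descriptive line per check."""
    header, sequence = parse_fasta(fasta_text)
    coding = extract_region(sequence, start, end)
    number = gi_number(header)
    label = f"gi{number}" if number else "(unknown gi)"
    codons = "has" if has_start_and_stop(coding) else "lacks"
    bamhi = "has" if has_site(coding, BAMHI_SITE) else "does not have"
    ry = "has" if has_site(coding, RY_SITE) else "does not have"
    return [
        f"The gene {label} {codons} a valid start and stop codon",
        f"The gene {label} {bamhi} a BamHI restriction site",
        f"The gene {label} {ry} a CGryCG restriction site",
    ]
```

To try it on the real data, one might read the file with open("p53_seq.txt").read() and print each line returned by analyze(). The start and end parameters are kept as arguments so the same checks can be run on a short test sequence.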

  6. Write a short program that receives a cDNA sequence from either STDIN or a file, validates that it contains only valid nucleotides, and prints it as an mRNA sequence (replacing every T with U).

    Examples of valid sequences: "TTTTAATTAAACGTAAAAAGGCAGG", "tcaccttcgacgacgcttagagcagatagacgat", and "tTacG". Example of an invalid sequence: "GTXCTTXXAAGGCNNNTACTTTYTCCRAGCC".
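One possible shape for task 6 is sketched below in Python. The names cdna_to_mrna, read_input and main are hypothetical. The sketch assumes that whitespace in the input (such as line breaks) should be ignored, and that the case of each letter should be preserved, so "t" becomes "u" and "T" becomes "U".

```python
import re
import sys


def cdna_to_mrna(sequence):
    """Validate a cDNA string and return it as mRNA (T -> U, t -> u)."""
    cleaned = re.sub(r"\s+", "", sequence)  # assumption: ignore whitespace
    if not re.fullmatch(r"[ACGT]+", cleaned, re.IGNORECASE):
        raise ValueError("sequence contains characters other than A, C, G, T")
    return cleaned.replace("T", "U").replace("t", "u")


def read_input(argv):
    """Read from the file named in argv[1] if given, otherwise from STDIN."""
    if len(argv) > 1:
        with open(argv[1]) as handle:
            return handle.read()
    return sys.stdin.read()


def main(argv):
    """Print the mRNA sequence, or an error message for invalid input."""
    try:
        print(cdna_to_mrna(read_input(argv)))
    except ValueError as error:
        print(f"Invalid sequence: {error}", file=sys.stderr)
        return 1
    return 0
```

To run it as a script, call main(sys.argv) under the usual if __name__ == "__main__": guard. Using re.fullmatch means the whole input must consist of valid nucleotides, so a single X, N, Y or R anywhere makes the sequence invalid.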


  7. Write a program that reads in a sequence in GenBank format (e.g. as in genbank_seq.txt), removes all spaces and line numbers, and writes the result to another file.
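For task 7, a minimal Python sketch might look like the following. The contents of genbank_seq.txt are not shown here, so the code makes an assumption: if the file is a full GenBank record, the sequence sits between a line starting with ORIGIN and a line starting with //, and otherwise the whole file is numbered sequence lines. The function names and the output filename in the usage note are hypothetical.

```python
import re


def clean_genbank_sequence(text):
    """Return the sequence with line numbers and all whitespace removed."""
    # Keep only what follows the ORIGIN line, if there is one.
    body = re.split(r"^ORIGIN.*$", text, maxsplit=1, flags=re.MULTILINE)[-1]
    # Drop the record terminator "//" and anything after it, if present.
    body = re.split(r"^//", body, maxsplit=1, flags=re.MULTILINE)[0]
    # Remove digits (line numbers), spaces and line breaks in one pass.
    return re.sub(r"[\d\s]+", "", body)


def convert_file(in_path, out_path):
    """Read a GenBank-format file and write the cleaned sequence to out_path."""
    with open(in_path) as source:
        sequence = clean_genbank_sequence(source.read())
    with open(out_path, "w") as target:
        target.write(sequence + "\n")
```

A call such as convert_file("genbank_seq.txt", "clean_seq.txt") would then produce the output file. Stripping every digit is safe only because a nucleotide sequence contains no digits of its own.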
