Text Processing / Regular Expressions
The data files that you need for this assignment are located inside a directory named
/home/guest/source/ass6. These files may also be obtained from here.
NOTE: Assume that case-insensitivity is required wherever applicable.
The file kinases_map.txt
contains mapping information of human protein kinase genes, taken from the
OMIM Gene Map (Online Mendelian Inheritance in Man) database.
Each line contains information on one gene, where fields are separated by a pipe sign, or vertical bar (|).
Date of entry to the database.
Cytogenetic location (chromosome number followed by cytogenetic
Accession number in OMIM.
Write a program that reads this file and writes the gene symbol, gene name,
and cytogenetic location to another file, in a tab-delimited format.
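One possible shape for this conversion is sketched below in Python, chosen here only for illustration. The column positions (SYMBOL_COL, NAME_COL, LOCATION_COL) and the function names are assumptions, since the field order of kinases_map.txt is not fully spelled out above; adjust them to the real file.

```python
# Hypothetical 0-based column positions; adjust to the real field order
# of kinases_map.txt before using this.
SYMBOL_COL, NAME_COL, LOCATION_COL = 0, 1, 2


def to_tab_delimited(line, columns=(SYMBOL_COL, NAME_COL, LOCATION_COL)):
    """Pick the requested pipe-separated fields and join them with tabs."""
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    return "\t".join(fields[i] for i in columns)


def convert(in_path, out_path):
    """Write one tab-delimited line per non-empty input line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(to_tab_delimited(line) + "\n")
```

Calling convert("kinases_map.txt", "kinases_out.txt") would then produce the output file; the output file name is likewise only an example.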
Modify your program to print only tyrosine kinase genes (i.e. only
where the gene name contains the word tyrosine).
Modify your program to print only genes located on chromosome 9.
Modify your program so that it asks the user for a year, and then prints only genes
entered into the database AFTER that year.
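The three filters above can each be written as a small regular-expression test. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, the cytogenetic location is assumed to begin with the chromosome number (for example 9q34), and the date field is assumed to contain a four-digit year. Check the real file format before relying on any of these.

```python
import re


def is_tyrosine_kinase(gene_name):
    """True if the gene name contains the word 'tyrosine', in any case."""
    return re.search(r"\btyrosine\b", gene_name, re.IGNORECASE) is not None


def on_chromosome(location, chromosome="9"):
    """True if the location starts with the given chromosome.

    The negative lookahead rejects a following digit, so asking for
    chromosome 1 does not accept 11 or 19.
    """
    pattern = re.escape(chromosome) + r"(?!\d)"
    return re.match(pattern, location.strip(), re.IGNORECASE) is not None


def entered_after(date_field, year):
    """True if the first four-digit year in the date is later than 'year'.

    Assumes the date field contains a four-digit year.
    """
    match = re.search(r"\b(\d{4})\b", date_field)
    return match is not None and int(match.group(1)) > year
```

The year itself could be obtained with something like int(input("Year: ")) before looping over the file.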
Assume you want to clone the genomic region coding for the P53 gene, including all relevant introns. The file p53_seq.txt
contains the full genomic sequence of the P53 gene, in FASTA format. The part of the sequence that is translated
to protein (including several introns) is between nucleotides 11717 - 18680.
Start by defining the locations of the beginning and end of the coding region in variables.
Thereafter, use those variables for extracting the coding sequence.
Read the sequence from the file and store it in a scalar variable.
Extract the part of the sequence that is translated to protein (including the introns) and store it in another variable.
Validate that the coding sequence starts with an ATG codon and ends with a stop codon (either TAA, TAG or TGA).
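The reading, extraction and validation steps might look roughly like this in Python. This is a sketch, not a reference solution: the function names are hypothetical, and the coordinates 11717 and 18680 are assumed to be 1-based and inclusive.

```python
import re

# Coordinates from the assignment text, assumed 1-based and inclusive.
CDS_START, CDS_END = 11717, 18680


def read_fasta_sequence(lines):
    """Join all sequence lines, skipping the '>' header line.

    'lines' may be an open file handle or any iterable of strings.
    """
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def extract_region(sequence, start, end):
    """Return the 1-based inclusive region start..end of the sequence."""
    return sequence[start - 1:end]


def has_valid_start_and_stop(coding):
    """True if the region starts with ATG and ends with TAA, TAG or TGA."""
    pattern = r"ATG.*(?:TAA|TAG|TGA)"
    return re.fullmatch(pattern, coding, re.IGNORECASE | re.DOTALL) is not None
```

With the real data, this would be used as extract_region(read_fasta_sequence(open("p53_seq.txt")), CDS_START, CDS_END).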
Check whether the coding sequence contains a restriction site for BamHI (cuts at GGATCC).
Check whether the coding sequence contains a restriction site for BstSFI (cuts at CGryCG, where r is either G or A, and y is either C or T).
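Both restriction-site checks reduce to a case-insensitive pattern search, with the ambiguity codes r and y expanded into character classes. The following Python sketch shows one way to do this; the helper names are hypothetical, and only the lowercase codes r and y, as written above, are expanded.

```python
import re

# Recognition sites as given in the assignment text.
SITES = {"BamHI": "GGATCC", "BstSFI": "CGryCG"}

# Lowercase ambiguity codes used in the text: r = G or A, y = C or T.
AMBIGUITY = {"r": "[GA]", "y": "[CT]"}


def site_to_regex(site):
    """Translate a recognition site into a case-insensitive regex."""
    pattern = "".join(AMBIGUITY.get(ch, re.escape(ch)) for ch in site)
    return re.compile(pattern, re.IGNORECASE)


def find_site(sequence, site):
    """Return the 1-based position of the first match, or None."""
    match = site_to_regex(site).search(sequence)
    return match.start() + 1 if match else None
```

For example, find_site(coding, SITES["BamHI"]) returns None when the site is absent, which can serve directly as the yes/no answer.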
Extract the gene gi number from the first line and print it before every output in a descriptive manner. E.g., "The gene gi35213 has a
Drd I restriction site".
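Extracting the gi number is a single capture group. The Python sketch below assumes the header follows the older NCBI style, in which "gi" is followed by an optional pipe sign and then digits (for example ">gi|35213|..."). That format and the function names are assumptions; check the first line of p53_seq.txt before using it.

```python
import re


def extract_gi(header):
    """Return the digits following 'gi' in a FASTA header, or None."""
    match = re.search(r"\bgi\|?(\d+)", header, re.IGNORECASE)
    return match.group(1) if match else None


def describe(gi, enzyme, found):
    """Build a descriptive message in the style of the example above."""
    verb = "has" if found else "does not have"
    return f"The gene gi{gi} {verb} a {enzyme} restriction site"
```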
Write a short program that receives a cDNA sequence
from either STDIN or a file, validates that it contains
only valid nucleotides, and prints it as an mRNA sequence (replace all T with U).
Examples of valid sequences: "TTTTAATTAAACGTAAAAAGGCAGG",
"tcaccttcgacgacgcttagagcagatagacgat", and "tTacG".
Example of an invalid sequence: "GTXCTTXXAAGGCNNNTACTTTYTCCRAGCC".
Write a program that reads in a sequence in GenBank format
(e.g. as in genbank_seq.txt),
removes all spaces and line numbers, and writes it to another file.
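If the input holds only the numbered sequence block (lines such as "1 gatcctccat atacaacggt"), a single substitution removes both the numbering and the whitespace. The Python sketch below rests on that assumption, and its function names are hypothetical. A full GenBank record with annotation lines would first need to be trimmed to the part between ORIGIN and //.

```python
import re


def strip_genbank_numbering(text):
    """Remove all digits and whitespace from a numbered sequence block."""
    return re.sub(r"[\d\s]+", "", text)


def clean_file(in_path, out_path):
    """Write the cleaned sequence from in_path to out_path as one line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.write(strip_genbank_numbering(src.read()) + "\n")
```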