Special Topics in CS: Bioinformatics
Homework 2
 



Investigating Alignments Using the Dot Matrix Technique


The goal of this homework assignment is to introduce students to the NCBI database and to the use of the dot matrix technique for analyzing alignments. Students should read the help and example information provided with the Dotlet tool for creating dot matrix plots. They will be asked to download several sequences from NCBI and analyze them using Dotlet. Dotlet can be found at several web sites. If the server for the site you are using seems very slow, you may want to use another site. Because the interface is not completely intuitive, students may want to read the "need help?" section before using the tool.

  1. Go to the web site and work through the "learn by example" section. Write a summary of the kinds of analysis that can be done using the dot matrix approach and the types of sequence features that can be discovered.

  2. The dot matrix technique can be used to compare a sequence to itself. In many double-stranded DNA viruses such as the herpesviruses, the segment of DNA where replication is initiated (the origin of replication) contains many short segments of repetitive DNA. Reverse complements of the repetitive sequences also occur. The origin of replication for Human herpesvirus 7 has been sequenced and is available at the NCBI web site with Accession number L40417. Go to NCBI, select "Nucleotide" in the Search box, and enter HHV7 origin in the "for" box. Then press Go. You will retrieve an entry for the complete genome of HHV7 and an entry for the origin binding protein site. Select the origin. If you scroll to the bottom of the page, you can see the listing for the DNA sequence. This sequence is presented in a format designed to be easy for humans to read, but the line numbers and spaces make it more difficult to process by machine. To get a listing of the DNA sequence in a format more appropriate for machine processing, select FASTA for the Display. After you have pressed the Display button, you will see the sequence in FASTA format. You can now cut the DNA sequence (not the line beginning with ">") and paste it into Dotlet. Compute the dot matrix comparison of the sequence to itself. Adjust the gray scale to reduce noise and make the features easier to see. You should also experiment with several different window lengths. Describe and analyze the alignment features that can be detected using this method. Produce a hard copy of the Dotlet screen showing the alignment to submit with this homework.

  3. Insulin is a very small protein that is essential for glucose metabolism. Search the web for information about insulin. Write a short paragraph that describes the function of insulin, where it is produced in the body, when it was discovered, and the diseases that result from insufficient insulin or from insulin resistance. Insulin is composed of two peptides, an A chain and a B chain. The gene for insulin encodes a "precursor" peptide that is produced in the pancreas but is not active. As the precursor protein folds, disulfide bonds form in the folded protein. A large center section of the molecule is then cut out to produce the active molecule consisting of the A and B chains. The sequences of the A and B chains of insulin are highly conserved, but the sequence of the central part that is cut out is less highly conserved. Why would you expect this to be the case? Cut and paste the sequences of the precursor peptide for both human and bovine (cow) insulin from the NCBI web site. Make sure you search for protein sequences this time. The Accession number for the bovine insulin precursor is P01317 and for the human insulin precursor is P01308. Compare the two sequences in Dotlet and report the results. Note that insulin is a very small protein, so the resulting dot matrix will be very small, but you should still be able to see the relevant features. You may also want to compare insulin from cats (feline) and dogs (canine) to the insulin for humans and cows. Submit printouts of the web pages showing your results in addition to the summaries.
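
For students who want to see what a dot matrix program computes, the following Python sketch may help. It is not Dotlet's implementation: Dotlet scores each window with a scoring matrix and displays the scores in gray scale, whereas this sketch simply counts identical positions in each window and marks a dot when the count reaches a threshold. The function names, the parameters window and min_matches, and the toy sequence are assumptions made for illustration only; they do not come from Dotlet or NCBI.

```python
def read_fasta_sequence(text):
    """Return the sequence from FASTA text, dropping the '>' header line."""
    lines = [line.strip() for line in text.splitlines()]
    return "".join(
        line for line in lines if line and not line.startswith(">")
    ).upper()


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (unknown bases become N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement.get(base, "N") for base in reversed(dna))


def dot_matrix(seq_x, seq_y, window=1, min_matches=1):
    """Build a simplified dot matrix.

    matrix[i][j] is True when the windows starting at seq_y[i] and
    seq_x[j] have at least min_matches identical positions.
    This identity count is a simplification of Dotlet's scoring.
    """
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(
                seq_y[i + k] == seq_x[j + k] for k in range(window)
            )
            row.append(matches >= min_matches)
        matrix.append(row)
    return matrix


def render(matrix):
    """Draw the matrix as text: '*' for a dot, '.' for no dot."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


if __name__ == "__main__":
    # Toy FASTA record invented for illustration; it is not real data.
    fasta_text = ">toy sequence\nACGTTTACGT\n"
    seq = read_fasta_sequence(fasta_text)

    # Self-comparison: repeats appear as diagonals off the main diagonal.
    print(render(dot_matrix(seq, seq, window=4, min_matches=4)))
    print()

    # Comparison against the reverse complement reveals inverted repeats.
    print(render(dot_matrix(seq, reverse_complement(seq), window=4, min_matches=4)))
```

Raising window and min_matches plays a role similar to lengthening the window and adjusting the gray scale in Dotlet: short chance matches disappear while longer repeated segments remain visible as diagonal lines.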

Due: Submit a hard copy of a report containing your answers to the questions on Friday, September 3, 2004, in the CSE main office (Butler 300).