[Standard] Sequence data format conversion (AW)


Bioinformatics tools use a large number of sequence data formats. A common problem in sequence analysis is therefore that your data are in one format while the program you want to use expects another. Most formats store sequence data in simple text files, and Perl is a great tool for converting between them.

Two common formats are FASTA, used mostly for data queries, and PHYLIP, used mostly for analysis. The following example dataset, containing three samples with 11 bases each, demonstrates both formats:


The {shell}>{/shell} character tells the program that the string following it is the sample header (or name). FASTA files often use the {shell}.fasta{/shell} or {shell}.fas{/shell} filename extensions.
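As a rough illustration of how the {shell}>{/shell} convention can be read by a program, here is a minimal sketch, written in Python rather than Perl purely for illustration. The function name `parse_fasta` and the sample data are hypothetical and not part of the course material. The sketch assumes that the whole header line after {shell}>{/shell} is the sample name and that a sequence may be wrapped over several lines.

```python
def parse_fasta(lines):
    """Return a dict mapping sample name -> sequence from FASTA-formatted lines."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: everything after '>' is the sample name.
            name = line[1:].strip()
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # A sequence may span several lines, so append each piece.
            records[name] += line
    return records


# Hypothetical data, invented for illustration only.
example = """>sample_b
ACGTACGTACG
>sample_a
ACGTA
CGTACA
""".splitlines()

records = parse_fasta(example)
```

Storing the records in a dictionary keyed by sample name makes it easy to look samples up or sort them later. Note that a repeated sample name would silently overwrite the earlier record in this sketch.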


The first line of the PHYLIP file contains a header that specifies the dimensions of the data matrix (the number of samples followed by the number of characters per sample) to help programs allocate memory before reading the rest of the data. PHYLIP files often use the {shell}.phylip{/shell} or {shell}.phy{/shell} filename extensions.
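To show how the header dimensions relate to the rest of the file, the following Python sketch reads a PHYLIP dataset and checks it against its own header. It assumes a sequential, "relaxed" layout in which each line holds a sample name, whitespace, and then the sequence. Strict PHYLIP instead pads names to a fixed width, and interleaved files are not handled here. The function name `read_phylip` and the data are hypothetical.

```python
def read_phylip(lines):
    """Parse relaxed, sequential PHYLIP lines and verify the header dimensions."""
    rows = [line.split() for line in lines if line.strip()]
    # Header: number of samples, then number of characters per sample.
    n_samples, n_chars = (int(field) for field in rows[0][:2])
    records = {}
    for fields in rows[1:]:
        # Assumption: the name has no spaces; the rest of the line is sequence.
        name, sequence = fields[0], "".join(fields[1:])
        if len(sequence) != n_chars:
            raise ValueError(
                f"{name}: expected {n_chars} characters, found {len(sequence)}"
            )
        records[name] = sequence
    if len(records) != n_samples:
        raise ValueError(f"expected {n_samples} samples, found {len(records)}")
    return records


# Hypothetical data, invented for illustration only.
example = ["2 11", "sample_a ACGTACGTACA", "sample_b ACGTACGTACG"]
records = read_phylip(example)
```

A check like this is a convenient way to confirm that a converter wrote a header consistent with the data it emitted.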


  1. For this project, you will write a program that converts sequence data stored in FASTA format into the corresponding PHYLIP format. You can use the data above as a simple test case while you develop the program, but your main aim is to convert the FASTA file contained in this archive: dogsnps.tar.gz. The dataset consists of thousands of SNPs from a large sample of dogs. The FASTA dataset is not sorted, but your program should output an alphabetically sorted PHYLIP dataset. Because the dataset is large, it has been compressed with gzip. You can download and extract the archive like this:

    Please compress your resulting PHYLIP file before emailing it to us together with your program. You can compress your file using this command:
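Independently of the command-line route, the sorting and compression steps of this task can also be sketched in Python. The function below is a hypothetical helper, not the expected solution. It assumes the sequences are already held in a dictionary keyed by sample name, writes the relaxed PHYLIP layout (name, one space, sequence), and uses the standard-library `gzip` module to write a compressed file directly.

```python
import gzip


def write_sorted_phylip_gz(records, path):
    """Write records (name -> sequence) as gzip-compressed, name-sorted PHYLIP."""
    # sorted() compares strings by code point, so uppercase sorts before lowercase.
    names = sorted(records)
    lengths = {len(records[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("expected a non-empty dataset with equal-length sequences")
    n_chars = lengths.pop()
    # "wt" opens the gzip file in text mode so ordinary strings can be written.
    with gzip.open(path, "wt") as handle:
        handle.write(f"{len(names)} {n_chars}\n")
        for name in names:
            handle.write(f"{name} {records[name]}\n")
```

Calling `write_sorted_phylip_gz(records, "dogs.phy.gz")`, where the output filename is only an example, would produce a file that can be inspected with standard gzip tools.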

Bonus points

  1. Give the program an option to read in a FASTA dataset and bootstrap (resample) the data to produce any number (let's say 100) of perturbed PHYLIP files, each built from a random sample of the original data, to facilitate statistical analyses. For each replicate, randomly pick as many sequence positions (columns in the data matrix) as there are in the original dataset, allowing the program to pick the same position more than once. See Exercise 5 from the Arrays session for clues. Document how the option works, but you do not need to send us all the replicates.
  2. Use a program such as PHYLIP’s NEIGHBOR (command-line tool, requires installation) or QuickTree (online server, no installation required) to recover the interrelationships between the samples from your new PHYLIP file by neighbor joining. Use FigTree (Java program) or EvolView (online server, no installation required) to visualize the tree. Send us the tree along with the rest of the files.
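The column resampling described in the first bonus item can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a reference solution: the function names and the `seed` parameter are hypothetical, and the data are assumed to be an aligned dictionary of equal-length sequences. The key point is that one set of column indices is drawn with replacement per replicate and then applied to every sample, so the columns stay aligned.

```python
import random


def bootstrap_columns(records, rng):
    """Return one replicate: columns drawn with replacement, same draw for all samples."""
    names = list(records)
    n_chars = len(records[names[0]])
    # Draw as many column indices as the original has; repeats are allowed.
    columns = [rng.randrange(n_chars) for _ in range(n_chars)]
    return {name: "".join(records[name][i] for i in columns) for name in names}


def bootstrap_replicates(records, n_replicates=100, seed=None):
    """Return a list of resampled datasets; pass a seed for reproducible draws."""
    rng = random.Random(seed)
    return [bootstrap_columns(records, rng) for _ in range(n_replicates)]


# Hypothetical data, invented for illustration only.
replicates = bootstrap_replicates({"a": "ACGT", "b": "TGCA"}, n_replicates=100, seed=1)
```

Each replicate has the same dimensions as the original dataset, so it can be written out with the same PHYLIP header.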