8. Control structures

Control structures such as if, else, elsif and unless are fundamental to almost any programming language. They are the primary way to introduce logic in a program and to control the path of execution. That said, most other programming languages lack the unless and elsif keywords. (Many languages have an equivalent of elsif, but with a less elegant spelling, such as elseif or even else if.)

if has already been introduced, and you have had plenty of exercises involving its use. But it is not always easy to write tests that capture precisely what you want, and sometimes you want to do something only if a test fails (that is, whatever you put after if was false). This is where else and elsif come in handy.

Other useful keywords for changing the flow of your program are next, which skips the rest of the current iteration of a loop and moves on to the next item in the list, and last, which skips the rest of the current iteration and also all remaining items in the list, ending the loop.

An example could look like the code below. Try it out and see if it does what you expected!
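
The following is a minimal sketch of how if, elsif, else, unless, next and last can work together in one loop. The list of numbers, the thresholds and the variable names are made up for illustration and are not taken from the course files.

```perl
use strict;
use warnings;

# Hypothetical example data, not taken from the course files.
my @values = (3, -1, 12, 7, 0, 25, 4);
my @report;

for my $value (@values) {
    # next: skip the rest of this iteration and continue with the next item.
    next if $value < 0;

    # last: leave the loop completely; the remaining items are never visited.
    last if $value > 20;

    if ($value == 0) {
        push @report, "$value is zero";
    }
    elsif ($value < 5) {
        push @report, "$value is small";
    }
    else {
        push @report, "$value is large";
    }

    # unless runs its block only when the condition is false.
    unless ($value % 2) {
        push @report, "$value is even";
    }
}

print "$_\n" for @report;
```

Note that the final 4 in the list is never reported, because last ends the loop when 25 is reached.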

For this exercise we will be filtering the file snp_data.vcf. To write the code you will want to know what the file looks like. A quick look with less -S snp_data.vcf will most likely not give the best overview: the headers and the data lines are too long and out of sync. A better way is to open the file in Excel or some other spreadsheet program, but then you will first have to rename it to snp_data.txt and tell Excel to split the rows on tabs. You will also have to leave your new favorite environment, the terminal window. An even better way is to open the file with Perl and print out the contents in a form that you can actually make sense of.

Exercise 1.

  1. Write a program that opens the file snp_data.vcf and prints out each line in the file.
  2. That was not much better than less, perhaps even worse. What we really want is the header of each column and the values on a row so that we can see what kind of value each column contains. Let's start with printing only the line with column headers (that’s the last comment line). Find the characteristic that makes it special and modify your program to print only that line.
  3. The line is still too long for most terminal windows, so now modify your program to print each column header on a separate row. (Hint: It will be useful to have the headers in an array later.)
  4. That’s better! Now we can see what columns we have, but there are an awful lot of samples. Limit the program to print only the first two sample headers and everything that comes before them. (Hint: What makes the sample headers special, and how can you use that to limit what elements you print?)
  5. Now the output is starting to become manageable, so this is a good time to start adding some example data as well. Modify your program to print the corresponding columns of the first row of data, on the same row as each of the headers. To avoid printing all lines in the file, you will somehow have to get out of the loop reading the file after reading the first data line. (Hint: With for you can loop over the indices of an array, and in that way over two arrays in parallel.)
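
The two ideas in the last hint, walking two arrays in parallel and leaving the loop early, can be sketched as follows. This is not a solution to the exercise: the lines are a made-up, tab-separated table held in an array instead of being read from snp_data.vcf, and the column names are invented.

```perl
use strict;
use warnings;

# Made-up, tab-separated lines standing in for a file; not real course data.
my @lines = (
    "#NAME\tPOS\tVALUE",
    "a\t10\t0.5",
    "b\t20\t0.7",
);

my @headers;
my @output;

for my $line (@lines) {
    chomp $line;

    if ($line =~ /^#/) {
        # Header line: strip the marker and remember the column names.
        $line =~ s/^#//;
        @headers = split /\t/, $line;
        next;
    }

    my @fields = split /\t/, $line;

    # Walk both arrays in parallel by looping over the shared index.
    for my $i (0 .. $#headers) {
        push @output, "$headers[$i]: $fields[$i]";
    }

    # Only the first data line is wanted, so leave the loop here.
    last;
}

print "$_\n" for @output;
```

The second data line is never processed, since last ends the outer loop right after the first one has been handled.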

You should now have a program that reads the vcf-file and prints something similar to this (the INFO-line is still a bit long, but at least we can now see what value corresponds to which header):

Exercise 2.

Now that we have warmed up and have a good view of the contents of the file, it is time to start filtering out the rows that we want. The exercise below can be completed in a lot of ways. Try to write the code so that the logic is easy to follow rather than as efficient or compact as possible.

  1. Make a new program that reads the vcf-file. The old one can still be good to have around, so it is best to make a new one. Start with just reading the file, skipping the comments and printing only the column header line. But again, limit the column header to two samples, just for easier reading of the output. Remember, Perl programmers are lazy, so copy the relevant parts from your previous program.
  2. Modify your program to also print all of the data rows, but only the first two samples. The output should be similar to the original file, just with shorter rows.
  3. As we will not use it here, remove the INFO column from the output. (Hint: The INFO column always has the same position in the table, both for the header and the data.)
  4. The columns REF and ALT contain the reference nucleotide and the alternative allele (SNP). Make your program skip over lines where the alternative allele has more than one value.
  5. Have your program print only transversions with a QUAL-value larger than 50, still ignoring rows with multiple alternative alleles. (Hint: There are fewer combinations that are transitions, so they are easier to match.)
  6. Allow also transitions, if they have a QUAL-value larger than 100.
  7. Add the condition that the first sample should not have the reference allele (denoted by the period).
  8. Limit your printing to the rows that have a position between 300 and 600 or between 1000 and 1300 on chromosome 1.
  9. Modify the program to filter the output on a value in the INFO column. Feel free to choose whichever you want. The meaning of each value is explained in the comments at the beginning of the file.
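
As a hedged sketch of how the conditions in steps 4 to 6 and 8 could be combined in an easy-to-follow way, the code below puts them in one small subroutine. The name keep_record, the made-up records, the assumption that chromosome 1 is written as "1", and the choice to treat the position ranges as inclusive are all illustrative and may not match snp_data.vcf. It relies on two facts: the transitions are A<->G and C<->T, and multiple alternative alleles in a VCF file are separated by commas.

```perl
use strict;
use warnings;

# The four transitions; every other single-base substitution is a transversion.
my %is_transition = map { $_ => 1 } qw(AG GA CT TC);

# Hypothetical helper: returns 1 if a record passes all filters, otherwise 0.
# Arguments: chromosome, position, REF, ALT, QUAL.
sub keep_record {
    my ($chrom, $pos, $ref, $alt, $qual) = @_;

    # Skip rows with more than one alternative allele (comma-separated).
    return 0 if $alt =~ /,/;

    # Assumption: chromosome 1 is written as "1", ranges are inclusive.
    return 0 unless $chrom eq '1';
    return 0 unless ($pos >= 300 && $pos <= 600)
                 || ($pos >= 1000 && $pos <= 1300);

    if ($is_transition{"$ref$alt"}) {
        # Transitions need the stricter quality threshold.
        return $qual > 100 ? 1 : 0;
    }
    else {
        # Everything else is a transversion.
        return $qual > 50 ? 1 : 0;
    }
}

# Made-up records: chromosome, position, REF, ALT, QUAL.
my @records = (
    ['1',  350, 'A', 'C',    60],   # transversion, good enough quality
    ['1',  400, 'A', 'G',    60],   # transition, quality too low
    ['1', 1100, 'C', 'T',   150],   # transition, high quality
    ['1',  700, 'A', 'C',   200],   # position outside both ranges
    ['1',  500, 'A', 'C,T', 200],   # two alternative alleles
    ['2',  500, 'A', 'C',   200],   # wrong chromosome
);

my @kept = grep { keep_record(@$_) } @records;
print join("\t", @$_), "\n" for @kept;
```

Returning early with return 0 for each failed condition keeps every test on its own line, which is usually easier to read than one long combined if.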