9. Strings and advanced sorting

This exercise deals with truncating and padding strings and rounding numbers, all of which are useful for presenting data in a human-readable way.
  1. Dig out a copy of one of your programs for reading the file snp_data.vcf and make a new copy.
  2. Edit the program so that the comment lines are read and printed unchanged, while in all other lines each column value is truncated to at most 7 characters.
  3. Now change it such that instead of separating the columns with tabs, each column is separated with two spaces.
  4. The output from 3 looks a mess because values in the same column have different lengths. Modify your program so that all values are exactly 7 characters long, even if the original value has fewer characters (pad with spaces). The text should be right-aligned in each column. (Hint: sprintf is quite useful…)
  5. Round (with proper rounding, not just truncating) the numbers in the QUAL column to one decimal.
  6. To indicate that some data has been truncated, replace the last three displayed characters of each truncated value with three dots ‘...’ (e.g. ‘HelloWorld’ becomes ‘Hell...’).
  7. Limit each line to 80 characters (not counting newlines).
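
The formatting steps above can be sketched on a single made-up row as follows. This is only one possible approach, not the intended solution: the helper name fit_field and the sample values are assumptions made for illustration, and reading snp_data.vcf and skipping its comment lines is left out.

```perl
use strict;
use warnings;

# Hypothetical helper: fit one value into a fixed-width, right-aligned field.
sub fit_field {
    my ($value, $width) = @_;
    # Too long: keep the first $width - 3 characters and mark the cut with dots.
    if (length($value) > $width) {
        $value = substr($value, 0, $width - 3) . '...';
    }
    # "%7s" (for $width = 7) pads with spaces on the left,
    # so the text ends up right-aligned.
    return sprintf("%${width}s", $value);
}

# Made-up values standing in for the columns of one data line.
my @fields = ('chr1', '1234567890', 'A', 'HelloWorld', '49.96');

# "%.1f" rounds to one decimal instead of just cutting digits off.
$fields[4] = sprintf('%.1f', $fields[4]);

# Join with two spaces, then cut the whole line to at most 80 characters.
my $line = join('  ', map { fit_field($_, 7) } @fields);
$line = substr($line, 0, 80);
print "$line\n";
```

Doing the truncation before the padding means a single sprintf call handles both short and long values.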
Sorting an array is as simple as sort(@data), isn’t it?
  1. Try it out by sorting the array @ary = (1..20);
  2. Not what you expected? By default, Perl sorts alphabetically, which means that 10 and 100 come before 2 since 1 comes before 2. Now fix your sort to do the right thing.
  3. Sorting numerically is useful, but you can also define your very own sort order by supplying a comparison subroutine. Try writing a sort function that sorts the numbers by their last digit first and then by their first digit. The output should be 10, 20, 1, 11, 2, 12, 3, 13, 4, 14, 5, 15, 6, 16, 7, 17, 8, 18, 9, 19. (Hints: it might help to give all numbers the same number of digits. The sorting subroutine should return 1 for a>b, 0 for a==b and -1 for a<b.)
  4. If you have done all the previous exercises, you are now experts at using Perl to calculate means. So, for a change, let’s calculate the median instead. Write a program that reads a file with numbers (for example the files found in allele_freqs.tar.gz or our favorite file snp_data.vcf) and calculates the median value. It might not be obvious to everyone what this has to do with sorting, but if you think about the definition of the median a little, it might become clearer…
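
As a rough illustration of how comparison blocks and comparison subroutines plug into sort, here is a small sketch. The data and the subroutine name by_length_then_alpha are made up, and the sort criterion is deliberately a different one from the exercise above.

```perl
use strict;
use warnings;

my @numbers = (10, 9, 100, 2);   # made-up values

# Default sort compares the values as strings.
my @as_strings = sort @numbers;                 # 10, 100, 2, 9

# The <=> operator compares numerically; it returns -1, 0 or 1.
my @as_numbers = sort { $a <=> $b } @numbers;   # 2, 9, 10, 100

# Hypothetical named comparison subroutine: shortest word first,
# ties broken alphabetically. sort sets $a and $b before each call.
sub by_length_then_alpha {
    return length($a) <=> length($b)
        || $a cmp $b;
}

my @words  = qw(pear fig apple kiwi);
my @sorted = sort by_length_then_alpha @words;  # fig, kiwi, pear, apple

print "@as_strings\n@as_numbers\n@sorted\n";
```

Chaining comparisons with || works because <=> and cmp return 0 for a tie, so the next criterion is only consulted when the previous one cannot decide.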
Sorting can also be useful on hashes. The method is exactly the same as for arrays; you just have to define what should be compared. For example, to sort the keys of a hash by their values:

my @sorted_keys = sort { $the_hash{$a} <=> $the_hash{$b} } keys %the_hash;

  1. Write a program that reads the data rows in the file snp_data.vcf into two hashes. Both hashes should have the concatenation of chromosome and position as key and the first hash (A) should have the whole line as value while the second hash (B) should have the QUAL column as value.
  2. Print the lines from the file numerically sorted from highest to lowest QUAL value. (Hint: The keys in your two hashes should be identical)
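
The idea of sorting the keys of one hash and using them to look things up in another can be sketched as below. The hash names, keys and values are invented for illustration and do not come from snp_data.vcf; filling the hashes from the file is left out.

```perl
use strict;
use warnings;

# Made-up stand-ins for the two hashes (hypothetical keys and values).
my %line_for = (
    'chr1_100' => 'line one',
    'chr1_200' => 'line two',
    'chr2_50'  => 'line three',
);
my %qual_for = (
    'chr1_100' => 12.5,
    'chr1_200' => 99,
    'chr2_50'  => 40.1,
);

# Swapping $a and $b reverses the order: highest value first.
my @keys_by_qual = sort { $qual_for{$b} <=> $qual_for{$a} } keys %qual_for;

# The same key looks up the matching entry in the other hash.
print "$line_for{$_}\n" for @keys_by_qual;
```

Because both hashes share their keys, the order computed from one hash can be applied directly to the other.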