4. Subroutines

As you write programs, you will notice that some blocks of code keep reappearing, such as the code for computing the mean or median of the values in an array. Such repetitive tasks are perfect to isolate into subroutines, which are distinct pieces of code that each perform one specific task. The benefits of subroutines are many:

  1. you isolate a specific block of code from the rest of the program, making it easier to find.
  2. the subroutine can be called several times throughout your program with different data each time, but you only need to write, optimize and maintain a single block of code instead of having that code repeated all over your program.
  3. when you write a subroutine you need to delineate the task that the subroutine is focused on, as opposed to the overall objective of your program. This is especially important when you are reusing the subroutine many times in the program. What is actually the point of the subroutine? How should you generalize the problem and the code so that it handles all the situations where it is needed?
  4. subroutines have names, which is a great opportunity to stress what that piece of code is meant to accomplish. Many text editors make it easy to locate and navigate among subroutines, for instance by listing them alphabetically beside the text field where you write your code.
  5. coming up with names for variables is sometimes… a pain. Subroutines (when used as intended) typically have their own set of local variables, such as scalars or arrays, that are meant to be used only within that code. This makes it possible to safely reuse variable names such as $i or $counter, which are often used for keeping track of incrementally increasing numbers (iterators) and which could otherwise conflict with other parts of your code that use the same name.
  6. subroutines can even be shared among programs to further help you reuse code. They are then bundled up into modules (also referred to as packages or libraries).
  7. In Perl, subroutines are also one of the intermediate stepping stones towards object oriented programming. Once you master subroutines, a whole new powerful programming paradigm is waiting just around the corner :-)
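
As a minimal sketch of benefits 2 and 5, the subroutine below is written once and called with two different arrays. The subroutine name largest and the data are made up for this illustration and do not come from the exercises. The variables declared with my exist only inside the subroutine.

```perl
use warnings;

# Hypothetical example data, not taken from the exercises.
my @heights = (12, 7, 31, 4);
my @weights = (3, 18, 9);

# Return the largest value among the arguments.
sub largest {
    my @values = @_;          # local copy of the arguments
    my $max    = $values[0];  # local to this subroutine
    foreach my $value (@values) {
        $max = $value if $value > $max;
    }
    return $max;
}

# The same block of code is reused with different data.
print "Largest height: ", largest(@heights), "\n";
print "Largest weight: ", largest(@weights), "\n";
```

If the way the largest value is found ever needs to change, only the body of largest has to be edited, not every place where it is called.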

1 – Introduction to Subroutines

Let's revisit the interesting and challenging problem of computing the mean from arrays!

Exercise

Start from this program […]. The subroutine simply prints some information about the array that it gets. Pay attention to the fact that we pass a text string before the data arrays to that subroutine and what we then do with it inside the subroutine. It is often convenient to pass such extra pieces of information to subroutines.
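
The idea of passing a text string before the data can be sketched as follows. This is not the exercise program itself: the subroutine name describe_array and the data are assumptions made for illustration. Perl hands all arguments to a subroutine as one flat list in @_, so a scalar placed first is easy to pick off with shift before the rest is treated as the array.

```perl
use warnings;

# Hypothetical data; the names are assumptions for illustration.
my @sample = (4, 8, 15, 16);

# The first argument is a label, the remaining ones are the data.
sub describe_array {
    my $label = shift;   # removes and returns the first element of @_
    my @data  = @_;      # what is left is the content of the array
    return "$label: " . scalar(@data) . " values, first is $data[0]";
}

print describe_array("Sample A", @sample), "\n";
```

Putting the label first matters: if it came after the array, the subroutine could not tell where the data ends without knowing the number of values in advance.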

In a mellow subroutine

  1. Examine and run the program.
  2. Make a subroutine that computes and returns the mean of an array, then print the means of the arrays in the main section of the program.

2 – Simplifying a program

As stressed above, subroutines can help cut down on the number of lines and repetitive areas of your code and at the same time highlight the code’s purpose.

Exercise

Here is a clunky solution to one of the array exercises in 4 – Basic statistics from array data, namely exercise #2 in the Advanced topics, about computing cumulative counts of the proportion of data points that are equal to or less than some number. The dataset consists of two arrays that are analyzed one after the other with a lot of code duplication.

KISS (that is an acronym, not an order)

  1. Examine and run the program.
  2. Create a subroutine to which you can pass one data array at a time and print the same information. Call the subroutine something like cumulative_counts.
  3. Tweak the subroutine to print the names of the samples, ind1 or ind2, respectively, before printing the results from each array.

3 – Encoding or decoding sequence qualities in a subroutine

Regardless of sequencing technology, there is always some degree of error involved in reading a genetic sequence in a sequencer. Some parts of a sequence will simply be observed with higher accuracy than others. It is therefore common to get a quality score associated with each base in a sequence. The score represents the level of confidence with which we can consider the read base to be the true base at that position in the sequence.

The genetic sequence is represented by a string of bases, such as ACGTC, saved in a file on the disk. For bioinformaticians working with the data, it is convenient if the associated quality scores are also represented by an equally long string. However, these scores typically range from 0 to 50, numbers that are written with either one or two digits, so it is not obvious how a quality string such as 12422134 should match the above sequence. Many sequence formats therefore use alphabetic characters and other symbols to represent qualities such that only a single character is needed even for larger numbers.
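
The mismatch can be illustrated with a small sketch. The five scores below are only one possible reading of 12422134, chosen as an assumption for this example. Once the numbers are joined into a string, the boundaries between them are lost and the string is longer than the sequence.

```perl
use warnings;

my $sequence  = "ACGTC";
my @qualities = (12, 4, 2, 21, 34);   # ASSUMPTION: one possible set of scores, one per base

# Joining the numbers directly loses the boundaries between them.
my $joined = join "", @qualities;

print "Sequence:       $sequence (", length($sequence), " characters)\n";
print "Quality string: $joined (", length($joined), " characters)\n";
```

The same eight digits could equally well be read as 1, 24, 22, 13, 4, which is why a one-character-per-score encoding is preferred.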

Exercise

Here we will use subroutines to encode and decode quality strings. The subroutines below use two functions that come with Perl:

  • chr() converts an integer to a character (based on its position in the so-called ASCII table).
  • ord() converts the character back to the corresponding integer.
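
A minimal sketch of how chr() and ord() can be combined is shown here. The subroutine names encode_qualities and decode_qualities, and the offset of 33 that shifts the scores into the printable part of the ASCII table, are assumptions made for this illustration. The subroutines in the exercise program may differ.

```perl
use warnings;

my $offset = 33;   # ASSUMPTION: scores are shifted by 33 before encoding

# Encode a list of integer qualities as a string of single characters.
sub encode_qualities {
    my @qualities = @_;
    return join "", map { chr($_ + $offset) } @qualities;
}

# Decode a quality string back into a list of integers.
sub decode_qualities {
    my $string = shift;
    return map { ord($_) - $offset } split //, $string;
}

my $encoded = encode_qualities(0, 10, 20, 40);
my @decoded = decode_qualities($encoded);

print "$encoded\n";    # prints !+5I
print "@decoded\n";    # prints 0 10 20 40
```

Each score becomes exactly one character, so the quality string has the same length as the list of scores it was built from.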

Different sequencing technologies and sequence formats encode qualities in different ways, and elaborating on those is beyond the scope of this exercise. The subroutines below were inspired by the MAQ project's documentation of the FASTQ format.

Show me the numbers

  1. Examine and run the program.
  2. Decode the symbols you got in @encoded and verify that the results match the original list of integers (well, at least the first 93 of them).
  3. Modify the program to instead encode the @ind1 and @ind2 arrays from earlier exercises.
  4. Report the mean quality for each of the four encoded sequence qualities below.
  5. Report the mean quality among all the four reads (the mean of the means).

4 – Introduction to Scope

The programs we have been writing so far have used global variables. The scope of these variables stretches over the whole program, i.e. they are accessible throughout the program and no other variable should really have the same name as these global ones. There is actually nothing stopping you from using the same name for a different variable that is meant for a different purpose, but you are likely to end up overwriting information in a variable in such a way that the program no longer functions correctly. You then have a bug in your code. Subroutines provide an opportunity to isolate variable names into their own scope, and so do loops such as foreach and control structures such as if.
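
Scope in loops and control structures can be sketched like this. The variable names and values are made up for the illustration. The first $total is a global variable, while the variables declared with my inside the curly brackets exist only within that block.

```perl
use warnings;

$total = 100;                 # no my: a global variable, visible in the whole program

foreach my $n (1, 2, 3) {     # $n exists only inside this loop
    my $total = $n * 2;       # a separate $total, local to the loop body
    print "inside loop: $total\n";
}

if ($total > 50) {            # this is the global $total again
    my $message = "still large";   # exists only inside this if block
    print "inside if: $message\n";
}

print "global total: $total\n";    # prints 100, the $total in the loop never touched it
```

Outside the if block, $message no longer exists, and outside the loop the name $total refers to the global variable again.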

Exercise

This program uses two counter variables called $i in two places for different purposes and in different ways.

Cope with scope

  1. Examine and run the program. How many times is the print_message() subroutine called (used)?
  2. Remove the my operator before $i = 0; in the print_message() subroutine. What is going on now?
  3. A second highly useful pragma in Perl is use strict; which encourages you to work with local variables that you declare with the my operator (there are a few other ones too, but let's ignore them for now). use strict; and use warnings; work great together to help you avoid common mistakes and bugs. Add use strict; on a new line after use warnings; and fix the program so that it runs without errors or warnings. Make a habit of including these two in your programs. It is well worth it.