3. Lists and arrays

1 – Introduction to arrays

Arrays are Perl’s way of keeping ordered lists. Each element in an array holds a single piece of information (a scalar value). You can think of an array as a column in a spreadsheet (or a row, if you prefer). Almost every bioinformatics tool you write will involve arrays of data in one way or another.
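To make the idea concrete, here is a minimal sketch of declaring an array and reading single elements from it. The array @codons and its contents are made up for illustration and are not part of the exercise program.

```perl
use strict;
use warnings;

# Hypothetical example data (not taken from the exercise program).
my @codons = ('ATG', 'GCC', 'TGA');

# Indices start at 0, so the first element is $codons[0].
my $first = $codons[0];

# Negative indices count from the end of the array.
my $last = $codons[-1];

# An array evaluated in scalar context gives its number of elements.
my $count = scalar @codons;

print "First: $first\n";   # First: ATG
print "Last: $last\n";     # Last: TGA
print "Count: $count\n";   # Count: 3
```

Note the change of sigil: the whole array is written @codons, while a single element is written $codons[0] because it is a scalar.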

Exercise

Start with this simple program and modify it (if needed) to solve the tasks below.

Array maneuvering

  1. Just run the program :-)
  2. Print the contents of the elements only at index 0 and 7 in @bases.
  3. Modify the program to print the base contained in each element on a new line. Hint: use a foreach loop to extract one element from the list at a time!
  4. Make the program print the bases this way but in reverse order.
  5. Just for giggles, modify the program to print the bases in alphabetically sorted order.
  6. Add some additional favorite bases to @bases and print the total number of elements in the array.
  7. Print the proportion of Gs in @bases.
  8. Print the proportion of Gs and Cs (the GC-content) in @bases.

2 – Populating arrays

We will now focus on how we add and remove data from arrays.
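As a quick reminder of the four functions that add or remove elements at either end of an array, here is a small sketch. The array @queue and its values are placeholders chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical starting data.
my @queue = ('b', 'c');

push    @queue, 'd';   # add to the end:       b c d
unshift @queue, 'a';   # add to the beginning: a b c d

# pop and shift remove an element AND return it,
# so the value can be stored in a scalar.
my $from_end   = pop   @queue;   # 'd'
my $from_start = shift @queue;   # 'a'

print "Removed: $from_start and $from_end\n";   # Removed: a and d
print "Left: @queue\n";                         # Left: b c
```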

Exercise

Start with this program and modify it (if needed) to solve the tasks below.

Array population

  1. Rearrange the chemicals so that they print in your own order of choice.
  2. Put each chemical in a new array called @ingredients using push.
  3. Use shift and then pop to remove elements from @ingredients. What ingredients do you have left in @ingredients?
  4. Use shift and then pop to remove elements but put the contents of each element in new scalars and print the contents of those scalars.

3 – Automatically populating arrays

Here we will add data to arrays using Perl’s built-in methods to generate series of numbers or letters.
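One built-in way to generate such series is the range operator (..). The sketch below assumes that this is the mechanism meant here. The array names are made up for illustration.

```perl
use strict;
use warnings;

# The range operator expands to every value between its two end points.
my @numbers = (1 .. 5);       # 1 2 3 4 5
my @letters = ('a' .. 'e');   # a b c d e

# Several ranges can be combined in one list.
my @mixed = (1 .. 3, 'x' .. 'z');   # 1 2 3 x y z

print "Numbers: @numbers\n";
print "Letters: @letters\n";
print "Mixed has ", scalar @mixed, " elements\n";   # Mixed has 6 elements
```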

Exercise

Start with this program and modify it (if needed) to solve the tasks below.

Automatic array population

  1. Print the number of elements in @list.
  2. Use shift to remove only the numbers from the array by coding a while loop. Verify that you succeeded by adding a new print statement that shows what you have left in the array.
  3. Use pop to remove only the letters from the array. Again, try to code a while loop that removes only the letters.
  4. Use pop to remove only the letters from the array in a while loop and put the letters in a new array called @letters as they are being removed one by one.

4 – Basic statistics from array data

Here we will compute basic statistics from values stored in arrays.

Exercise

Start with this program and modify it (if needed) to solve the tasks below. You can save the program under a new name each time you complete an exercise that required you to modify it.

Array manipulation

  1. Print the number of elements in both arrays. Are they equally long?
  2. What are the index numbers of the first and last elements in the two arrays? Hint: this actually returns the index of the last element in your array: $#ind1
  3. Loop over @ind1 and print the value of each element on a new line.
  4. Loop over @ind1 but print the position of each element (1, 2, 3 …) instead of the value (3, 6, 5 …). Hint: an incrementing counter can be used to keep track of where you are!
  5. Loop over @ind1 and print the position of each element followed by its value. Separate the two values by a tab character.
  6. Loop over @ind1 and print the position of each element followed by its value and the corresponding value in @ind2, all separated by a tab character. You should get many rows of output subdivided into three columns. Hint: retrieve specific values from the other array using $ind2[  ] where you specify the index in the square brackets.
  7. Repeat step 6, but first print titles for each column. Once you get the output right, redirect it to a file instead of the screen and congratulate yourself for having created a data table in the form of a tab-separated CSV file (or rather, a TSV file).
  8. Loop over the two arrays as before but in reverse.
  9. Loop over the arrays again (not in reverse) but increment all values in @ind1 and @ind2 by 5 before you print them.
  10. Loop over the arrays again and copy each value into a new array so that the new array alternates between values from @ind1 and @ind2 (3, 1, 6, 4 …). Verify that you succeeded.
  11. Copy the values of @ind2 into @ind1, adding the values to the end of the array. Verify that you succeeded.
  12. Same as in 11, but move the values from @ind2 into @ind1, removing each element and in effect making @ind2 shorter each time you move an element, until @ind2 is empty. Verify that you succeeded.
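Several of the tasks above depend on knowing where you are in an array. Here is a minimal sketch of one possible approach: looping over the index numbers rather than the values. The array @values is hypothetical and much shorter than @ind1 and @ind2.

```perl
use strict;
use warnings;

# Hypothetical data, standing in for a real data array.
my @values = (10, 20, 30);

my $last_index = $#values;         # index of the last element: 2
my $length     = scalar @values;   # number of elements: 3

# Loop over the indices 0 .. $#values and build one
# tab-separated row per element: position, then value.
my @rows;
for my $i (0 .. $#values) {
    my $position = $i + 1;         # positions count from 1, indices from 0
    push @rows, "$position\t$values[$i]";
}

print "$_\n" for @rows;
```

Because the loop variable is an index, the same $i can be used to look up the matching element in a second array of the same length.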

Array computation

  1. Compute the means of the values for @ind1 and @ind2, respectively. Coming up with exercises is hard so over the next few days, you will become an expert in computing means from arrays.
  2. At how many positions does each array have values larger than or equal to 10?
  3. At how many positions do both arrays have values larger than or equal to 10? Print a filtered table that contains only those positions. Hint: you can have nested if control structures.
  4. Find and print the maximum value across both arrays.
  5. At how many positions are the values in @ind1 larger than the values in @ind2 and vice versa? At how many positions are they the same?

Advanced topics (extra exercises)

  1. Arrange intervals that span five units (1 to 5, 6 to 10, 11 to 15, … ). The last interval should include the maximum value. For each array, count how many times you can assign a data point to each interval. Report the proportion of data that falls into each interval.
  2. Simplify the intervals to do a cumulative count instead (5 or less, 10 or less, 15 or less, …). Report the proportion of data that falls into each such interval.
  3. Compute and report the mean values for windows of 10 datapoints along each array. Repeat the same thing but use only 5 datapoints in each window.
  4. Let's visualize the data (somewhat)! Return to exercise Array manipulation #5 above but instead of printing the value of each element, print that number of # characters on each line to generate a simple bar plot. Do the same for @ind2 but use a * symbol instead. Hint: use the x operator to turn a value into a long string of symbols like this:
    '#' x $value
  5. Plot both arrays in the same graph. Put @ind2 in front of @ind1 so that you only see a value from @ind1 when it is larger than that of @ind2. You do this by adding # symbols corresponding to the difference between the two values to the string of *s at that position.
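The x operator mentioned in the hints repeats a string a given number of times. Here is a small sketch with made-up numbers, showing both a single bar and two repeated strings joined with the concatenation operator (.).

```perl
use strict;
use warnings;

# Hypothetical value to draw as a bar.
my $value = 4;

# Repeat the '#' character $value times.
my $bar = '#' x $value;
print "$bar\n";        # ####

# Two repeated strings can be joined with the . operator.
my $combined = ('*' x 3) . ('#' x 2);
print "$combined\n";   # ***##
```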

5 – Random selections (extra exercise)

Random numbers are useful when subsampling data, i.e. creating smaller subsets of data from a larger pool. Here we will spice things up by creating programs that behave slightly differently each time you run them, with the help of Perl’s built-in function for generating random numbers: rand. This function is not listed in LP, but it is described in the online Perl documentation (also available through perldoc -f rand). In its simplest form, rand returns a fractional value greater than or equal to 0 and less than 1 (e.g. 0.473914115297706, 0.302116682360978) that you can store in a scalar or act on directly. If we instead need random integers (e.g. 1, 43, 9, 4949), we can provide an upper boundary to rand and use a second built-in function, int, to keep only the integer part of that random number (whatever comes before the decimal point).
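Here is a minimal sketch of the two forms described above. The upper boundary of 10 and the variable names are arbitrary choices for illustration. The printed values differ from run to run.

```perl
use strict;
use warnings;

# Without an argument, rand returns a fraction: 0 <= value < 1.
my $fraction = rand();

# With an upper boundary, rand returns a fraction: 0 <= value < boundary.
# int then drops everything after the decimal point,
# leaving an integer from 0 to boundary - 1.
my $upper   = 10;   # arbitrary boundary chosen for this example
my $integer = int(rand($upper));

print "Fraction: $fraction\n";
print "Integer between 0 and ", $upper - 1, ": $integer\n";
```

Since the result of rand is always strictly less than the boundary, int(rand(10)) can return 0 but never 10.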

Exercise

Start from this program and modify it (if needed) to solve the tasks below.

Randomness

  1. Run the program a couple of times to get a feeling of how rand and int work.
  2. Modify the second random number generator to only return 0s or 1s as possible integers.
  3. Head back to exercise Array manipulation #6 above and insert code so that each position along the arrays only has a 50/50 chance of being printed as output. Run the program a couple of times. Are the subsampled tables you get each time always the same size?
  4. Modify that program so that you instead select and print values at 50 random positions, allowing the program to select the same position several times.
  5. Modify that program so that you instead select elements at 50 random positions, not allowing the program to select the same position several times. This is tricky and there are several ways to accomplish it. Hint: We have covered splice in LP. Perhaps you can use that function…
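As a reminder of what splice does, here is a small sketch that removes one element from the middle of an array. The array @pool and the chosen index are made up for illustration. How to combine this with rand is left for the exercise.

```perl
use strict;
use warnings;

# Hypothetical data.
my @pool = ('a' .. 'e');   # a b c d e

# splice(ARRAY, OFFSET, LENGTH) removes LENGTH elements starting at
# index OFFSET and returns them; the array closes the gap.
my ($removed) = splice(@pool, 2, 1);

print "Removed: $removed\n";   # Removed: c
print "Left: @pool\n";         # Left: a b d e
```

Unlike shift and pop, splice can remove elements from any position, and the array becomes shorter with every removal.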