6/7. Hashes and regular expressions


Exercise 1.

Regular expressions are used to match and extract subsets of data or text from, for instance, a file. Hashes can be used for keeping track of how many times a certain pattern is found. Here you will use a regular expression to extract only the DNA sequence from a FASTA file, and then a hash to calculate the proportion of G and C (the GC content) in that DNA sequence.

The DNA sequence can be found here: DNA

Start by reading the file line by line, using code such as this (the “print” command is only there to see that the file is being read):
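A minimal sketch of such a loop is given here. To run on its own it reads a few made-up FASTA lines from an in-memory filehandle; the example data and the variable names are assumptions, and with the real file you would open its path instead.

```perl
use strict;
use warnings;

# Made-up FASTA text standing in for the real file.
my $fasta = ">example_sequence\nACGTACGT\nGGCCTTAA\n";

# Opening a reference to a string gives an in-memory filehandle.
# With a real file, open its path instead, e.g.
#   open(my $in, '<', 'dna.fasta')   (hypothetical file name)
open(my $in, '<', \$fasta) or die "Cannot open input: $!";

my $lines_read = 0;
while (my $line = <$in>) {
    print $line;        # only here to confirm that lines are being read
    $lines_read++;
}
close($in);
```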

Now use a regular expression to identify the lines that contain only DNA sequence and store them in a variable. It is a good idea to remove the newline at the end of each line before storing the sequence data.
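One possible way to do this is sketched below on made-up lines. It assumes that sequence lines consist only of the letters A, C, G, T and N (upper or lower case); a real file may contain other symbols, so check your data before relying on this pattern.

```perl
use strict;
use warnings;

# Made-up FASTA lines; the header line starts with ">".
my @lines = (
    ">example_sequence some description\n",
    "ACGTACGT\n",
    "GGCCTTAA\n",
);

my $sequence = '';
foreach my $line (@lines) {
    chomp $line;                      # remove the trailing newline
    if ($line =~ /^[ACGTN]+$/i) {     # the whole line consists of bases only
        $sequence .= $line;           # append to the growing sequence
    }
}
print "$sequence\n";
```

The anchors “^” and “$” are what make the pattern apply to the whole line, so the header line is rejected even though it contains some of the same letters.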

When you have the complete DNA sequence stored in a variable, you can move on to counting the “G” and “C” bases in the sequence. To do that, you need a way of looping over every single base in the sequence (hint: arrays represent one such method) and counting what you see. As already pointed out, hashes are excellent for keeping track of counts. For instance, the code below adds every single word of a text stored as an array (@words) to a hash (called %WORD_COUNT), and the hash value for each word (the key) is increased by 1 every time the word is seen:
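A minimal sketch of such word counting, using a short made-up word list (the contents of @words are an assumption for illustration only):

```perl
use strict;
use warnings;

# Made-up list of words standing in for a real text.
my @words = qw(the gene and the genome and the gene);

my %WORD_COUNT;
foreach my $word (@words) {
    $WORD_COUNT{$word}++;    # creates the key on first sight, then adds 1
}

# Print each word together with the number of times it was seen.
foreach my $word (sort keys %WORD_COUNT) {
    print "$word\t$WORD_COUNT{$word}\n";
}
```

To get an array of single characters from a string, split(//, $string) can be used.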

By modifying this code you should be able to estimate the GC-content of the DNA sequence. Good luck!

Exercise 2.

Here we will use regular expressions and hashes to study the linguistics of the paper presenting the human genome sequence: Human_genome_paper. Start by testing the code pasted below and then continue to solve the other tasks.
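A minimal sketch of such code follows. It reads three made-up lines from an in-memory filehandle instead of the real paper, so the text and the variable names are assumptions; with the real file you would open its path instead.

```perl
use strict;
use warnings;

# Made-up lines standing in for the paper.
my $paper = "The human genome holds many surprises.\n"
          . "Each gene was annotated by hand.\n"
          . "Some genes are very long.\n";

# In-memory filehandle; replace \$paper with the path to the real file.
open(my $in, '<', \$paper) or die "Cannot open input: $!";

my @hits;
while (my $line = <$in>) {
    if ($line =~ /gene/) {    # true if the line contains "gene" anywhere
        print $line;
        push @hits, $line;    # keep the matching lines for later use
    }
}
close($in);
```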

The code above reads the paper and applies a simple regular expression to test if lines contain the letter sequence “gene”.

1. Modify this code so that each time a line contains “gene” it reports in what context “gene” was written. To do that, use the “match variables” to extract and print text that surrounds “gene”.

2. How many lines contained the letter sequence “gene”?

3. How many times is the letter sequence “gene” found in the text? (Hint: an “if” statement evaluates its condition only once. No matter how many places in a line match the regular expression, “if” will register only one of them. A “while” statement combined with the option modifier “g” (such as that presented below), on the other hand, repeats a block of code once for every match, until no further match is found.)

4. How many times can the word “gene” be found in the text? (Note the difference between word and “letter sequence”.)

5. How many words in the text start with “gene”? Which of these is most frequent?

6. Which is the most frequently used of all words in the text? Here you need to come up with a method that first counts the frequency of all words, and subsequently identifies the most frequently used one (by the way, the first part of this task is very similar to exercise 1 except here you will be counting words instead of letters).
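The difference between “if” and “while” with the “g” modifier, and the use of the match variables, can be sketched on a single made-up line. The example line and the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Made-up line of text.
my $line = "The gene list includes generic genes.";

# "if" tests the pattern once: at most one match is registered.
my $if_hits = 0;
if ($line =~ /gene/) {
    $if_hits++;
}

# "while" with the "g" modifier continues from where the previous
# match ended, so the block runs once for every match in the line.
my $while_hits = 0;
while ($line =~ /gene/g) {
    $while_hits++;
    # Match variables: $` is the text before the match,
    # $& the match itself, and $' the text after it.
    print "before: [$`]  match: [$&]  after: [$']\n";
}

print "if: $if_hits, while: $while_hits\n";
```

Here “generic” and “genes” both contain the letter sequence “gene”, so the “while” loop registers more matches than the “if” statement does.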

Exercise 3.

Here is a PLINK output file showing the results of an association analysis including a large number of genetic markers (SNPs): Data. In this file, each row shows how well a particular SNP is associated with a trait of interest (Bovinophobia in humans). Your task is to identify the best associated SNP, that is, the SNP with the lowest p-value (column 5). To complicate things, however, we are not interested in all SNPs in this file, but only in the subset listed in this file: id_list. The question thus is: which of the SNPs present in “id_list.txt” shows the best association with the trait?
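One way to approach this is to store the wanted SNP names as hash keys and then look up each row of the results in that hash. The sketch below uses made-up rows and assumes whitespace-separated columns with the SNP name in column 2 and the p-value in column 5; only the p-value column is stated above, so check the layout of the real file before reusing any of this.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Made-up result rows: SNP name assumed in column 2, p-value in column 5.
my @result_lines = (
    " CHR  SNP  BP    STAT  P",
    "   1  rs1  1000  2.1   0.04",
    "   1  rs2  2000  5.3   0.001",
    "   2  rs3  3000  0.4   0.6",
    "   2  rs4  4000  9.9   0.00001",
);
my @wanted_ids = qw(rs1 rs2 rs3);    # made-up subset of SNP names

# A hash with the wanted names as keys makes each lookup fast.
my %wanted = map { $_ => 1 } @wanted_ids;

my ($best_snp, $best_p);
foreach my $line (@result_lines) {
    my @fields = split ' ', $line;          # also ignores leading spaces
    next unless @fields >= 5;               # skip empty or short lines
    my ($snp, $p) = @fields[1, 4];
    next unless exists $wanted{$snp};       # skips the header and other SNPs
    next unless looks_like_number($p);      # skips values such as "NA"
    if (!defined $best_p or $p < $best_p) {
        ($best_snp, $best_p) = ($snp, $p);
    }
}
print "$best_snp\t$best_p\n";
```

In this made-up data rs4 has the lowest p-value overall, but it is not in the list, so it is ignored.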

Exercise 4.

A list containing information on all participants of this course can be found here: participants. Write a program that identifies the supervisor of each participant and reports how many students are supervised by each supervisor (wow, that’s really interesting!). This program can be divided into four main tasks: (1) reading the file line by line, (2) using a regular expression to extract the supervisor of each participant, (3) storing/counting supervisors using a hash, (4) printing the results.

But before you start, take a look at the participants file in the terminal window using the unix command:

This command shows the content of the file, line by line. How many lines are there? How many lines are there if you open the file using a text editor or Excel?

Also try looking at the file using this command:

This command wraps the lines of the file to show as much of the contents as possible on a single screen. Do you notice the “^M” characters? Here is what has happened: different systems use different characters to mark the end of a line. The participants file was generated on a Mac with classic Mac line endings, which use “carriage return” (CR) to indicate that a line has come to an end (shown as “^M” when the file is viewed with the less command). Unix instead uses “line feed” (LF) for this task. Therefore, in order for Perl (which runs on your Unix machines) to be able to read this file line by line, we first need to replace (substitute) all Mac newlines with Unix newlines. In Perl, CR is written as “\r” and LF as “\n”.

Thus, before you can even start reading the participants file line by line, you need to store its content using newlines that are recognized by Perl. This is a commonly encountered problem when programming, and one that is very handy to know how to solve. Save the file with correct newlines and proceed to solve the task presented at the top of this page.
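The substitution itself can be sketched as follows. The content is a made-up string with CR line endings, and the names and layout are assumptions that do not reflect the real participants file.

```perl
use strict;
use warnings;

# Made-up content with CR ("\r") as the only line ending.
my $content = "Student A\tSupervisor X\r"
            . "Student B\tSupervisor Y\r"
            . "Student C\tSupervisor X\r";

# Copy the content, then replace every CR with LF in the copy.
(my $unix = $content) =~ s/\r/\n/g;

# The text can now be divided into lines in the usual way.
my @lines = split /\n/, $unix;
print scalar(@lines), " lines\n";
print "$_\n" foreach @lines;
```

With a real file, the whole content can be read into one variable by temporarily undefining the input record separator (local $/;) before reading from the filehandle, and the converted text can then be printed to a new file.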