10. Modules, file tests, directories

Checking if a file exists

A common error when running a script that reads from a file is that the file is not where the script expects it to be. Depending on how you have written your open statement, this error might be caught, or the script might just fail silently (remember to turn on warnings, and it will not be completely silent!). One way to avoid this error is to check if the file exists before you try to open it. Another reason to check is that you might only want to use the data in the file if it is there, and otherwise just continue with the rest of the program without that extra information.
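As a starting point, the sketch below shows Perl's file test operators, which answer questions such as "does this path exist?" and "is it a regular file?". The subroutine name describe_path and the file name no-such-file.txt are made up for this illustration; only the operators -e, -f, -d and -l are part of Perl itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any course script):
# report what kind of thing $path points to.
sub describe_path {
    my ($path) = @_;
    # -e, -f and -d follow symbolic links, so test -l first.
    return 'symbolic link' if -l $path;
    return 'missing'       unless -e $path;
    return 'directory'     if -d $path;
    return 'regular file'  if -f $path;
    return 'something else';    # e.g. a socket or a device
}

# 'no-such-file.txt' is a placeholder name that is assumed not to exist.
for my $path ('.', 'no-such-file.txt') {
    printf "%-20s %s\n", $path, describe_path($path);
}
```

In your own program you would typically run such a test just before the open and print an error message if the answer is not what you expect.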

  1. Dig out an old program that opens and reads from a file, like the ones you wrote for exercise 8. Control structures.
  2. Try to run the program from a folder where the data file is not accessible by the name written in your program. What happens? If you were ambitious when you wrote your program, it will die with a message saying that the file does not exist; otherwise, it might fail silently and just not print anything.
  3. Modify your program to check that the file exists before it tries to open it. If the file is missing, write a polite error message and then exit.
  4. Before the file is opened, also make sure that it is a normal file and not a directory or a link.
  5. Optional: If the file does not exist, have your program ask the user for a new file name until it is able to find the file.
  6. Optional: Write a program that tries to open the file hello.txt. Before the open, check if the file exists. If it does not, check for the file hello.txt.gz and if that exists, gunzip it and then continue to open the original file hello.txt which should now exist. To test this program, you will have to create the file hello.txt and gzip it. This can be done with the following two commands:
  7. Optional: Write a program that keeps looping until the file “continue” appears in a specific folder. Add a line with sleep 1; in the loop, otherwise the loop will consume all your available CPU.

Using glob to find files

Sometimes you know only which directory to look in, not the actual names of the files you want to read. There are (at least) two ways to handle this. The simplest is to use file globbing.
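A minimal sketch of file globbing is shown here. The subroutine name list_entries is invented for the example, and the pattern assumes a directory path without spaces, since the built-in glob treats whitespace as a separator between patterns.

```perl
use strict;
use warnings;

# Hypothetical helper: return the entries directly inside $dir.
sub list_entries {
    my ($dir) = @_;
    # glob expands the shell-style pattern; '*' does not match
    # names that start with a dot.
    # Caveat: a $dir containing spaces would be split into several patterns.
    return glob("$dir/*");
}

# List whatever is in the current directory.
for my $entry (list_entries('.')) {
    print "$entry\n";
}
```

The returned names include the directory part you passed in, so they can be handed straight to open or to a file test.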

For this exercise we need a directory tree with some files. Download, unpack and run the script makeFiles.pl in a temporary directory. Unless you specify another name as a command line parameter, this script will create a directory named testdir. The new directory contains a random number of files in multiple levels of subdirectories. Each file contains ten rows with random numbers. The names of the files and directories are random, so it is not possible to tell files from directories by name. Feel free to look at the makeFiles.pl script to figure out how it works.

  1. Write a script that uses file globbing to find all files in the top level directory (testdir). Print the name of each file and whether it is a regular file or a directory.
  2. Put your glob code into a subroutine that takes a directory path as its only parameter.
  3. Make your subroutine call itself when it finds a directory in the list returned from glob, passing the path of that directory (not the top level directory) as the argument. This is called recursion.
  4. Optional: Calculate the mean of all numbers in all files. (Hint: Let your subroutine return the sum and count for all files it has seen.)
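If the idea of a subroutine calling itself is new to you, the following sketch may help. It is not a solution to the exercise: it only counts regular files, and it builds its own small directory tree with File::Temp rather than using testdir. The name count_files and the tree layout are made up for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: count regular files in $dir and everything below it.
sub count_files {
    my ($dir) = @_;
    my $count = 0;
    for my $entry (glob("$dir/*")) {
        next if -l $entry;                   # skip links to avoid endless loops
        if (-d $entry) {
            $count += count_files($entry);   # recurse with the subdirectory's path
        }
        elsif (-f $entry) {
            $count++;
        }
    }
    return $count;
}

# Build a tiny throw-away tree: one file at the top, two in a subdirectory.
my $top = tempdir(CLEANUP => 1);
mkdir "$top/sub" or die "Cannot create $top/sub: $!";
for my $name ("$top/a", "$top/sub/b", "$top/sub/c") {
    open my $fh, '>', $name or die "Cannot create $name: $!";
    print {$fh} "1\n";
    close $fh;
}

print count_files($top), "\n";    # prints 3
```

Note that glob is used in list context inside the for loop, so the full list of names is produced before the recursive call is made.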

Using a module

A good thing about Perl is that no matter what you want to do with it, chances are that someone has already written a module that does what you want, or at least parts of it. It is therefore always a good idea to look at CPAN before you start writing your own code. Installing modules can sometimes be challenging and is beyond the scope of this course, but luckily many good modules are included with perl itself as core modules, or installed by default anyway (depending on your operating system).

In this exercise we will make use of the module LWP::Simple, which adds a simple interface to the more complicated stuff hidden in libwww-perl. This is very useful for downloading data from the web, such as dumping a number of pages from some on-line database or, as in this case, a number of sequences from NCBI.

We will start with the example script NCBI_download.pl. This is a complete script, so start by just downloading and running it. Then we will take it apart to understand what it really does and how it could be modified to do other things.

If you tried the script, you got a lot of text written to your terminal. These are GenBank records, and you will probably want to redirect the output from the script into a file to keep the data it fetches.

Starting from the top, the first important thing is the line
use LWP::Simple;
This tells perl to locate and load the module LWP::Simple. If the module is not found or contains an error, the program will exit with an error here.

Next, the script defines some variables that will be used later. The most cryptic is probably the last one, $query. This defines the query to send to the NCBI nucleotide database and specifies to look for the terms 18S or SSU in any organism from the taxonomic group 147099 (Acoelomorpha). The query is exactly the same as you would write it in the NCBI web interface and has really nothing to do with Perl.

Next, the $esearch variable is defined to hold the web address to use when sending the query. Notice how the previously defined $utils and $db are used in the definition.

The next line is where much of the action is:

  • First, the strings stored in $esearch and $query are concatenated into a complete address (note how the $esearch string ends with ‘term=’). You can try and replace the line with a print $esearch . $query to see what is going on.
  • Second, the subroutine get is called with the address as argument. The subroutine get is defined in the LWP::Simple module and was imported to your program’s namespace when use LWP::Simple was called. This is somewhat bad behavior from a module and most modules will only export subroutines that you have explicitly asked for. (It is no problem in this case, but in a larger program you might want to use some other module which also exports the subroutine get. Then you have a collision, and the module you loaded last will win.)
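The two usual ways to keep control over what a module puts into your namespace are sketched below. To keep the example runnable without LWP::Simple installed, it uses the core modules File::Basename and File::Spec::Functions instead; the path testdir/data.txt is just a placeholder. The same syntax should apply to LWP::Simple, for example use LWP::Simple qw(get); to import only get, or use LWP::Simple (); followed by a call to LWP::Simple::get.

```perl
use strict;
use warnings;

# Empty import list: load the module but import nothing.
use File::Basename ();
# Explicit import list: import only the subroutine named here.
use File::Spec::Functions qw(catfile);

my $path = catfile('testdir', 'data.txt');      # imported name, called directly
my $name = File::Basename::basename($path);     # fully qualified name

print "$name\n";    # prints data.txt
```

With an explicit list you can see at the top of the program where every imported subroutine comes from, and two modules that both offer a subroutine with the same name no longer collide.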

The return value from the get is now used in a pattern match to extract some values which are saved as $Count, $QueryKey and $webEnv.

These are then used to build another query address which is used for another call to get.

Finally, the result from the second get is printed.
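The pattern match and the building of the second address can be tried without any network access, as in the sketch below. Everything in it is an assumption made for illustration: the reply text is hand-written and much shorter than a real reply, and the values of $utils and $db and the parameters in $efetch are not copied from NCBI_download.pl, so compare them with the real script.

```perl
use strict;
use warnings;

# ASSUMPTION: a hand-written stand-in for what the first get might return.
my $esearch_result = <<'XML';
<eSearchResult>
  <Count>42</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV_0123</WebEnv>
</eSearchResult>
XML

# The /s modifier lets '.' match newlines, so the tags may be on separate lines.
# In list context the match returns the three captured values.
my ($Count, $QueryKey, $webEnv) =
    $esearch_result =~ m{<Count>(\d+)</Count>.*<QueryKey>(\d+)</QueryKey>.*<WebEnv>(\S+)</WebEnv>}s
    or die "Unexpected reply format\n";

# ASSUMPTION: illustrative values, not necessarily those used in the script.
my $utils = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';
my $db    = 'nucleotide';

# Build the second address from the extracted values.
my $efetch = "$utils/efetch.fcgi?db=$db&rettype=gb&retmode=text"
           . "&query_key=$QueryKey&WebEnv=$webEnv";

print "$Count record(s) found\n";
print "$efetch\n";
```

Checking the result of the match before using the captured values means the program stops with a clear message if the reply does not look as expected.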

  1. Edit the script NCBI_download.pl so that it prints the value of $esearch and then stops.
  2. Edit the script NCBI_download.pl so that it prints the return value of the first get. What is it?
  3. Print out the value of $efetch and paste it into your browser’s address field. What do you get?
  4. Try changing the query string in the program and see what you will get instead.