5. Input and output

1 – Reading files from standard input

Exercise

In this exercise we will look at the two operators <STDIN> and <> (the diamond operator). Both can read data from standard input, but only one of them can also read data from one or more files passed as arguments on the command line. Start with this program, which simply reads a stream of data from standard input and prints it back out.

STDIN

  1. Make sure that you have the file snp_data.vcf from the UNIX exercise in your current directory.
  2. Either of these shell commands will make the program read and reprint the file to the screen:
  3. Modify the program to enumerate the lines and compute the number of characters on each line before reprinting them. The first line of the output could look something like:
    1 20 ##fileformat=VCFv4.1
    when you pipe it to less -S. A line-counting program like this is useful for debugging malformed input data files that some program refuses to read.
  4. Instead of piping to the program, try reading the file as an argument:
    This will not work when using the <STDIN> operator. Try the diamond operator instead, which handles both piping and command line arguments.
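Steps 3 and 4 can be sketched as follows. This is a minimal sketch, not a reference solution: it assumes the character count excludes the trailing newline (which agrees with the count of 20 in the example output above), and the helper name format_line is hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Build one output line: line number, character count, original text.
# The count excludes the trailing newline.
sub format_line {
    my ($number, $line) = @_;
    chomp $line;
    return join(' ', $number, length($line), $line) . "\n";
}

# The diamond operator reads from the files named in @ARGV,
# or from standard input when no file names are given.
# Replacing <> with <STDIN> would restrict the program to piped input.
my $number = 0;
while (my $line = <>) {
    $number++;
    print format_line($number, $line);
}
```

Saved under a name of your choice, say linecount.pl (a hypothetical name), the sketch should accept the file either as an argument (perl linecount.pl snp_data.vcf) or on standard input (perl linecount.pl < snp_data.vcf).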

2 – Invocation arguments

Arguments passed to a program on the command line can be used to configure its settings, select the functionality to run, point to files or other data, and so on. These kinds of arguments are often called options, switches, or flags. Perl stores all command-line arguments in the special array @ARGV.

Exercise

Here you will return to the first exercise we did on scalar data, but instead of hard-coding the allele frequencies in the script, you will read them from @ARGV and use them to compute Fst. Here is a demonstration of how @ARGV works:

ARRRGV
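A minimal sketch of such a demonstration, not necessarily identical to the course listing, simply prints every element of @ARGV on its own line. The helper name one_per_line is hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Return each argument followed by a newline, ready for printing.
sub one_per_line {
    my @args = @_;
    return map { "$_\n" } @args;
}

# @ARGV holds the command-line arguments in the order they were given;
# the program name itself is not included (it is available in $0).
print one_per_line(@ARGV);
```

Checking scalar(@ARGV) gives the number of arguments, which is one way to decide whether enough input was supplied before doing any computation.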

  1. Pass the arguments
    to the program when you run it on the command line:
    It should simply print them back, one per line.
  2. Modify the Fst program to treat the first two arguments in @ARGV as frequency input for the Fst calculation from the basic arithmetic session.
  3. You can give different numbers of arguments to your program. Make the program compute Fst only if you actually pass two arguments on the command line.
  4. Modify the program to read the allele frequencies and print Fst values from the two files contained in this archive. On the command line you can download and extract the archive like this:

    There are a million frequencies in a single column in each file (so don’t even think about opening them in Excel or something, but by all means have a quick look with less). The first frequency in one file should be compared to the first frequency in the other file, and so on. Hint: you can use both <STDIN> and <> in the same program. Use <STDIN> to read data from one file over standard input and <> to read data from the second file, which you instead pass as an argument.
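The hint in step 4 can be sketched as follows. The Fst formula used here, (Ht - Hs) / Ht for two equally sized populations with expected heterozygosity 2p(1 - p), is an assumption and may differ from the one used in the basic arithmetic session. The sub name fst is hypothetical as well.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# ASSUMED formula: Fst = (Ht - Hs) / Ht for two equally sized populations.
# Returns undef when Ht is zero (both populations fixed for the same allele).
sub fst {
    my ($p1, $p2) = @_;
    my $hs    = ((2 * $p1 * (1 - $p1)) + (2 * $p2 * (1 - $p2))) / 2;
    my $p_bar = ($p1 + $p2) / 2;
    my $ht    = 2 * $p_bar * (1 - $p_bar);
    return if $ht == 0;
    return ($ht - $hs) / $ht;
}

# Require exactly one file argument. Without it, <> would also read
# from standard input and compete with <STDIN> for the same lines.
if (@ARGV == 1) {
    while (defined(my $p1 = <STDIN>)) {
        my $p2 = <>;                 # next line from the file named in @ARGV
        last unless defined $p2;     # stop if the second file is shorter
        chomp($p1, $p2);
        my $value = fst($p1, $p2);
        my $text  = defined $value ? $value : 'NA';
        print "$text\n";
    }
}
```

With the script saved as, for example, fst_pairs.pl and the two data files called pop1.txt and pop2.txt (all hypothetical names), it would be run as perl fst_pairs.pl pop2.txt < pop1.txt.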

3 – Filehandles

The standard input stream and the diamond operator are a bit limited when it comes to dealing with files, as is printing output only to STDOUT, which is what we have been doing so far. Maybe you need to keep track of the name of an input file internally because the new files you generate should derive their names from it. Perhaps you need to print to multiple output files in your script, whereas STDOUT and STDERR provide only two output channels.

Exercise

In this exercise you will continue modifying the Fst program so that it reads the two data files using Perl’s open function and filehandles. We will use the three-argument version of open, which is the safest approach. We also use the logical operator or to check whether open succeeds. If open succeeds, it returns a nonzero value, which evaluates to true. If it fails, it returns the undefined value, which evaluates to false, and the reason is stored in the special variable $!. Typical reasons for failing to open a file for reading are that it does not exist or that it is not readable by the user. The or operator evaluates its right-hand side only if its left-hand side evaluated to false. Consider this program:

Can you handle this?
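A minimal sketch of a program along these lines, not necessarily identical to the course listing, is shown below. It assumes a single file name as argument and simply counts the lines in that file. The sub name count_lines is hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Open a file for reading and count its lines.
# open returns false on failure, so the right-hand side of "or" runs
# and die reports the reason stored in $!.
sub count_lines {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open '$path' for reading: $!\n";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++;
    }
    close($fh);
    return $count;
}

if (@ARGV) {
    my $file = $ARGV[0];
    print "$file has ", count_lines($file), " lines\n";
}
else {
    # No argument given: explain how to call the program.
    print STDERR "Usage: $0 FILE\n";
}
```

The three-argument form keeps the mode ('<') separate from the file name, so a name that happens to start with a character such as > or | cannot change how the file is opened.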

  1. Run the program: i) without passing an argument; ii) with a bogus argument that does not actually match a file; and lastly iii) provide one of the allele frequency files.
  2. Modify your Fst computing program to open the two files using open and to only do so when two arguments are provided.
  3. Instead of printing to STDOUT, modify the program to print the output to a file whose name is the concatenation of the name of the first file and “.fst”.
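Steps 2 and 3 can be sketched as follows. This is a sketch under stated assumptions rather than a reference solution: the Fst formula, (Ht - Hs) / Ht for two equally sized populations, is assumed and may differ from the one used in the basic arithmetic session, and the sub names fst and write_fst_file are hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# ASSUMED formula: Fst = (Ht - Hs) / Ht for two equally sized populations.
# Returns undef when Ht is zero.
sub fst {
    my ($p1, $p2) = @_;
    my $hs    = ((2 * $p1 * (1 - $p1)) + (2 * $p2 * (1 - $p2))) / 2;
    my $p_bar = ($p1 + $p2) / 2;
    my $ht    = 2 * $p_bar * (1 - $p_bar);
    return if $ht == 0;
    return ($ht - $hs) / $ht;
}

# Read two frequency files in step and write one Fst value per line
# to "<first file>.fst". Returns the name of the output file.
sub write_fst_file {
    my ($file1, $file2) = @_;
    my $out_name = "$file1.fst";
    open(my $in1, '<', $file1)    or die "Cannot open '$file1': $!\n";
    open(my $in2, '<', $file2)    or die "Cannot open '$file2': $!\n";
    open(my $out, '>', $out_name) or die "Cannot write '$out_name': $!\n";
    while (defined(my $p1 = <$in1>)) {
        my $p2 = <$in2>;
        last unless defined $p2;     # stop if the second file is shorter
        chomp($p1, $p2);
        my $value = fst($p1, $p2);
        my $text  = defined $value ? $value : 'NA';
        print {$out} "$text\n";      # print to the output filehandle
    }
    close($out) or die "Cannot close '$out_name': $!\n";
    return $out_name;
}

# Only do the work when exactly two file names are given.
if (@ARGV == 2) {
    print "Wrote ", write_fst_file(@ARGV), "\n";
}
```

Putting the filehandle in braces, as in print {$out}, makes it explicit that $out is the destination rather than the first item to print.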