How to Use
Contents
- How to use Linux
- How to Login
- from Mac or LINUX
- from Windows
- Files
- Two writable areas
- Public Databases
- Qsub
- Qsub'ing Script Files
- Qsub Related Topics
- Qstat
- Qstat Monitor
- Getting Specific Information out of Qstat
- Qdel
- Screen and qsub -I
- Using Screen
- Submitting via Screen
- Screen Related Topics
- Multiple Nodes: MPI
- NAMD
- MPI related Links
- Multiple Nodes: PVM
- PVM Related Links
- Communication
- Sharing Files on the Web
- Locking your Web files with a Password
- Listing All Users
- Spam Note!
- Listing Current Cluster Users
- Spam Note!
- Examples
- Reusing Scripts
- Running many input files
How to use Linux
Please refer to Linux Essentials
How to Login
from Mac or LINUX
- Open terminal and type:
ssh -X your_username@biocluster.ucr.edu
from Windows
- Open PuTTY and select ssh. Download PuTTY if you do not have it.
- Provide host name and session name
hostname:
biocluster.ucr.edu
- Enter your identity information
username:
your username
password:
your password
Setup for graphics emulation. Download and install Xming if you do not have it.
Use WinSCP for file exchange. Download and install WinSCP if you do not have it.
Files
Please read about the NFS-and-Caching-Problem
Two writable areas
The cluster nodes have fast network access to the user-writable areas of Biocluster:
/home
/srv/projects
From the cluster nodes, /home can be read and modified, but /srv/projects is only for reading.
Public Databases
The NCBI, PFAM, and Uniprot databases are located on Biocluster in
/srv/projects/db
You can also view our directory of databases on the web.
The last update was in July 2008.
Qsub
Submit, view, and delete your computational tasks on Biocluster with the queue management utilities: qsub, qstat, and qdel.
This documentation describes the traditional way of using qsub, but some researchers may find it more convenient to use the qsub -I interactive approach. To learn more, see Screen and qsub -I
Qsub'ing Script Files
- The script file must start with the line
#!/bin/sh
to tell the system to use /bin/sh to run the commands listed in the rest of the file. The purpose of these commands is to set up and run the application.
For a simple single CPU-core task, you might consider writing a script that will run blast on a single node, using 1 CPU-core.
Without the queuing system, you would have run the following on the command line:
blastall -p blastp -d /data/NCBI/blast/nr/nr -i ~/blast/proteins.fa -e 1e-19
You can transform the above command into a script by placing the following into a file (in this example the file is named blast.sh):
#!/bin/sh
blastall -p blastp -d /data/NCBI/blast/nr/nr -i ~/blast/proteins.fa -e 1e-19
A more reusable version of the script keeps each setting in a variable:
#!/bin/sh
DIR=~/blast
INPUT=proteins.fa
PROGRAM=blastp
DATABASE=/data/NCBI/blast/nr/nr
OPTIONS="-e 1e-10"
cd $DIR
blastall -p $PROGRAM -d $DATABASE -i $INPUT $OPTIONS
Setting a variable for each part of the blast command makes the script easier to understand and to adapt for future blast runs.
The DIR value is the working directory in which the job will run. Without the cd $DIR command, the task would run in your home directory.
INPUT, DATABASE, and PROGRAM are passed as is to blastall. The value of OPTIONS is in quotes because it includes a space.
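The effect of the quotes around the OPTIONS value can be checked in any POSIX shell, without the cluster. The following is a minimal sketch: the name word_count is only illustrative, and set -- is used here purely to count the words that an unquoted $OPTIONS expands to.

```shell
#!/bin/sh
# The quotes keep "-e 1e-10" together as ONE value during the assignment.
OPTIONS="-e 1e-10"

# Used unquoted, $OPTIONS is split at the space into separate words, which
# is what blastall needs: the flag -e followed by its value 1e-10.
set -- $OPTIONS
word_count=$#

echo "blastall would receive $word_count extra arguments: $1 and $2"
```

Without the quotes in the assignment, the shell would set OPTIONS to -e only and then try to run 1e-10 as a command.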
This script will run a blast on an available node of the cluster.
All output from the program is saved into files.
Submit the file by using qsub as follows:
qsub blast.sh
The output will look like: 123.biocluster.ucr.edu
The task is now queued, and it will run when resources become available. The number 123 in the output of qsub is the "Job ID Number" which is used to track the progress of your task with qstat.
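If a script needs the Job ID Number on its own, it can be cut out of the qsub output with a shell parameter expansion. This is a small sketch that uses the sample output shown above; the variable names job and job_id are illustrative.

```shell
#!/bin/sh
# Sample qsub output copied from the text above. On the cluster you would
# capture the real value with:  job=$(qsub blast.sh)
job="123.biocluster.ucr.edu"

# Remove everything from the first dot onward, leaving only the number.
job_id=${job%%.*}

echo "Job ID Number: $job_id"
```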
Qsub Related Topics
Screen and qsub -I is an alternative that you should use when developing (modifying and testing) a new program.
Qstat
To list the currently running and queued tasks, run one of:
qstat # this form is easy to type
qstat -n # this also lists the nodes that are being used
qstat -nu `whoami` # this lists only your own tasks
Example task submission
echo sleep 20 | qsub
Output
15568.biocluster.ucr.edu
List submitted tasks
qstat -nu `whoami`
Output
biocluster.ucr.edu:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
15568.biocluster.ucr alevchuk batch    STDIN         989     1  --     --    -- R    --
    node04/0
Once the job is no longer listed, it has finished. For the blast.sh job submitted earlier, the files blast.sh.o123 and blast.sh.e123 will then be available. The first part of each name is the name of your job script, and the number at the end is the Job ID number. The o file holds the normal output, which in the case of blast is the file we are interested in. The e file holds the error output. Both files will always exist, but the error output will usually be empty.
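The naming rule for the two result files can be written down as a short sketch. It assumes the blast.sh script and the Job ID Number 123 from the example; the variable names are illustrative.

```shell
#!/bin/sh
script=blast.sh   # name of the submitted job script
job_id=123        # Job ID Number reported by qsub

out_file="$script.o$job_id"   # normal output
err_file="$script.e$job_id"   # error output

# -s is true only if the file exists and is not empty.
if [ -s "$err_file" ]; then
    status="errors reported in $err_file"
else
    status="no error output in $err_file"
fi
echo "$status"
```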
Qstat Monitor
Our custom qstat monitor can give you visual feedback about the state of the cluster.
To see the live data:
- Log in to Biocluster with X11 graphics forwarding enabled
ssh biocluster -X
- Run
qstatMonitor
- The raw data and a how-to for the additional features are printed to the standard output of the R scripts.
Getting Specific Information out of Qstat
Sometimes you need to pull the Job IDs out of qstat, but they can be hard to find among all the data that qstat displays.
Here is how to narrow the output down.
Show only the tasks that I am running:
qstat -u `whoami` -n
Show only my tasks that are running on particular nodes (here node02, node03, node10, and node11):
qstat -u `whoami` -n | grep -E -B 1 '(node02)|(node03)|(node10)|(node11)'
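To reduce the listing to the bare Job ID numbers, one option is a short awk filter. This is a sketch, not a site-provided tool: it runs on sample lines copied from the qstat output shown earlier on this page, and the variable names are illustrative.

```shell
#!/bin/sh
# Sample lines copied from the qstat -nu output shown earlier on this page.
# On the cluster, replace the printf with:  qstat -u `whoami` -n
sample='15568.biocluster.ucr alevchuk batch STDIN 989 1 -- -- -- R --
node04/0'

# Keep only lines whose first field starts with digits followed by a dot,
# then cut that field at the first dot to leave the bare Job ID number.
job_ids=$(printf '%s\n' "$sample" |
    awk '$1 ~ /^[0-9]+\./ { sub(/\..*/, "", $1); print $1 }')

echo "$job_ids"
```

Header lines and node lines such as node04/0 do not start with digits followed by a dot, so they are skipped.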
Qdel
To cancel a task that is in the queue or running:
- Locate the Job ID with qstat
qstat -u `whoami`
Output
biocluster.ucr.edu:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
1234.biocluster.ucr  alevchuk batch    STDIN         989     1  --     --    -- R    --
- Run
qdel 1234
Where 1234 is the Job ID number of your task
Screen and qsub -I
This alternative method has the following advantages:
- Useful for developing (modifying and testing) a new program
- Programs run in the foreground
- Output and error messages are printed to your screen immediately
- Running programs can be suspended (with Ctrl-Z), terminated (with Ctrl-C), and restarted just like any other Linux process
- Avoids problems with NFS and caching. When you modify the program directly on the node, you are guaranteed that the run will use your latest modification and not some earlier version, as often happens with the standard qsub approach. (See NFS and Caching Problem)
The disadvantages are:
- You must know how screen works, which keys to press to detach, and how to attach to an existing screen session
- You must take extra care to release computing resources once you are done with them
- This approach involves extra preparation and cleanup effort, which takes more time than simply running qsub ./some-script (See qsub)
Using Screen
If you have not used the screen utility before, then please refer to Linux-Essentials/Screen
Submitting via Screen
- Start a new screen session
screen -S a_simple_name_for_your_computational_task
- Start a new qsub -I session
Make sure that all qsub -I sessions are exited once your computations are complete. (See Cleaning-Up after qsub -I Sessions)
qsub -I
This will reserve one CPU core on the cluster.
If you do not want to leave the screen session open after the qsub -I session has exited, then use
qsub -I; exit
- cd into the directory where your program is
cd your_directory
- (Optional) Make changes to executable files or input files
- Let's assume that you have an executable ./your_program. Then run
./your_program; exit
After starting the job in this way, the screen session can be detached by pressing Ctrl-a d. After that you will be back in Biocluster.
Your program will take some time to run, and exit will not do anything until the program finishes.
Please note: the exit portion is very important. Without it, the qsub -I session will continue reserving the resources even after the computation is complete. (See Cleaning-Up after qsub -I Sessions)
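How the program and exit are joined matters for releasing the resources. The sketch below can be run in any POSIX shell without the cluster: false stands in for a program that ends with an error, and each sh -c is a throwaway shell standing in for the qsub -I session. The variable names are illustrative.

```shell
#!/bin/sh
# program; exit  -> exit runs no matter how the program ended.
# "|| true" only keeps this demo going; the throwaway shell passes on the
# failure status of "false".
with_semicolon=$(sh -c 'false; exit; echo "session still open"') || true

# program && exit -> exit runs only if the program succeeded.
with_and=$(sh -c 'false && exit; echo "session still open"')

echo "after ';'  : ${with_semicolon:-session closed}"
echo "after '&&' : ${with_and:-session closed}"
```

With ; the session always closes and the reservation is released. With && a failed program leaves the session open, which is convenient for inspecting the error but means you have to exit by hand.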
Screen Related Topics
Running multiple R scripts on Biocluster - 2009-04-11
qsub - the standard and non-interactive way of submitting computational tasks
Linux-Essentials/Screen
Cleaning-Up after qsub -I Sessions
Multiple Nodes: MPI
NAMD

"NAMD is a parallel molecular dynamics code designed for high-performance simulation of large biomolecular systems." - Theoretical and Computational Biophysics Group at UIUC
Launching
The main script that runs NAMD is namd-start, but first you need to open a screen session and reserve cluster nodes:
screen -S namd-session
qsub -I -l nodes=8:ppn=8 && exit
cd into the directory that contains the input file
namd-start <input-file> --avoid node19,node24 && exit
Related Topics
Monitoring Performance
- There is a record of all NAMD runs on the cluster. It shows:
- Users who launched the task
Performance in milliseconds per step
- Nodes used
- Number of steps in the task
- Example
- The graph of the current data is here
Reading the results
- Detach from the screen session
- Run tail -f on the output file:
tail -f <output-file>
The script namd-start will help you locate the output file.
How does our performance compare to other HPC infrastructures?
MPI related Links
Official Homepage at the Mathematics and Computer Science Division of Argonne National Laboratory
Multiple Nodes: PVM
Currently we are not running any system-wide PVM applications.
Version 3.4.5 is installed on Biocluster and all cluster nodes. You can get more information by running:
man pvm
man pvm_intro
PVM Related Links
Communication
Sharing Files on the Web
Simply move the files to ~/.html when you want to share them.
So, from Biocluster:
mkdir hello-www
echo '<h1>Hello WWW!</h1>' > ./hello-www/hello.html
mv hello-www ~/.html/
Now, point your web browser to http://biocluster.ucr.edu/~alevchuk/hello-www/
Replace alevchuk with your own username.
Locking your Web files with a Password
In rare cases, you may need to lock some of your web files with a password:
- Email <aleksandr DOT levchuk AT ucr DOT edu> and ask to enable password locking.
- Run this set of commands:
touch ~/.html/.htpasswd
htpasswd ~/.html/.htpasswd webuser
This will ask you to create the new password
- Go to the directory that you want to lock
mkdir ~/.html/locked_dir
cd ~/.html/locked_dir
You can choose a different directory name.
- Now run this command:
echo 'AuthName "Please login. The username is webuser."
AuthType Basic
AuthUserFile /home/alevchuk/.html/.htpasswd
require user webuser' > .htaccess
Replace /home/alevchuk with your own home directory. To find it, run echo ~
- Now, test it out by pointing your web browser to http://biocluster.ucr.edu/~alevchuk/locked_dir
Replace alevchuk and locked_dir with your username and your directory name.
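As an alternative to typing your home directory by hand, the .htaccess file can be written with a here-document so that the shell fills in $HOME. This is a sketch under the assumptions of the steps above (the locked_dir directory name and the webuser account); the variable name dir is illustrative.

```shell
#!/bin/sh
# locked_dir is the example directory name used in the steps above.
dir="$HOME/.html/locked_dir"
mkdir -p "$dir"

# The heredoc delimiter is unquoted, so $HOME is expanded when the file is
# written and the absolute path ends up in .htaccess.
cat > "$dir/.htaccess" <<EOF
AuthName "Please login. The username is webuser."
AuthType Basic
AuthUserFile $HOME/.html/.htpasswd
require user webuser
EOF

cat "$dir/.htaccess"
```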
Listing All Users
Simply run
all-users
You can generate an email list of particular users by running
all-users | grep -E '(alevchuk)|(tgirke)' | awk -F\\t '// {print " " $1 " <" $3 ">"}'
This will output
Thomas Girke <<tgirke AT citrus DOT ucr DOT edu>>
Aleksandr Levchuk <<alevchuk AT gmail DOT com>>
Spam Note!
When contacting a number of our users, please put the recipients in the BCC email address field.
Optionally, the names (not the email addresses) of the recipients can be listed in the body of the email.
Following this rule inhibits a type of spam attack that takes advantage of crawling recipients' inboxes.
Out-of-date email addresses? Please email <alevchuk AT gmail DOT com>
Listing Current Cluster Users
To get a list of UNIX user names run
qstat | awk '// {print $3}' | sort | uniq | grep "^[^-N]"
To get the list of real names run
grep <(all-users) -f <(qstat | awk '// {print $3}' | \
    sort | uniq | grep "^[^-N]") | awk -F\\t '// {print $1}'
To get the list of emails run
grep <(all-users) -f <(qstat | awk '// {print $3}' | \
    sort | uniq | grep "^[^-N]") | awk -F\\t '// {print " " $1 " <" $3 "> "}'
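All three commands above share the same user-name filter. The sketch below shows what it does, step by step, on a few lines that imitate plain qstat output. The sample lines are an illustrative stand-in and were not captured from the cluster; only the third column matters here.

```shell
#!/bin/sh
# Illustrative stand-in for plain qstat output (not captured from the
# cluster); the third whitespace-separated field holds the user name.
sample='Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
15568.biocluster          STDIN            alevchuk               0 R batch
15569.biocluster          blast.sh         tgirke                 0 Q batch
15570.biocluster          blast.sh         alevchuk               0 Q batch'

# Field 3 of the header line is "Name" and field 3 of the separator line is
# a run of dashes; grep "^[^-N]" drops every line starting with "-" or "N".
users=$(printf '%s\n' "$sample" | awk '{print $3}' | sort | uniq | grep "^[^-N]")

echo "$users"
```

One side effect of this filter is that a user name beginning with a capital N would be dropped as well.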
Spam Note!
When contacting a number of our users, please put the recipients in the BCC email address field.
Optionally, the names (not the email addresses) of the recipients can be listed in the body of the email.
Following this rule inhibits a type of spam attack that takes advantage of crawling recipients' inboxes.
Out-of-date email addresses? Please email <alevchuk AT gmail DOT com>
Examples
Reusing Scripts
- The simplest, but worst, way to reuse shell scripts has already been hinted at: use multiple directories, one for each job, each with a slightly modified script. The previous BLAST script illustrates the idea. Imagine you wish to run a set of proteins against multiple databases, in this case Arabidopsis thaliana and Oryza sativa. You can simply make two directories, ~/blast/ath and ~/blast/osa, and place a copy of the blast script, named blast.sh, in each directory with the INPUT and DATABASE variables modified. However, for more than a few combinations of inputs and databases, this becomes increasingly tedious.
A simple evolution of this idea lets you use the same file for multiple qsub tasks. By leaving some settings unspecified, you can make a script act as a generic schema for a task. For instance, blast is usually run in much the same way each time. The following file is a generic schema for blast:
#!/bin/sh
#PBS -v DB,IN
PROGRAM=blastp
OPTIONS="-e 1e-10"
cd $PBS_O_WORKDIR
blastall -p $PROGRAM -d $DB -i $IN $OPTIONS
This variant makes a couple of changes. First, the script sets only the PROGRAM and OPTIONS variables, since they tend to be the same for many jobs. Second, the script changes directory not to a directory specified in the script, but to $PBS_O_WORKDIR, which is the directory qsub was run in. Finally, blastall is run. As a general schema, this should be saved in an easy-to-remember location such as ~/torque/blastp.sh, where future schemas can go.
However, DB and IN are not set in the script. Where do they come from? The answer lies in the second line, a directive telling qsub to send the DB and IN environment variables to the remote host. They are set on the qsub command line; because they are environment variables, they go before the command:
DB=/data/TIGR/arabidopsis/ATH1.pep IN=input.fa qsub ~/torque/blastp.sh
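The VAR=value command form used here is ordinary shell syntax and can be tried without the cluster. In this sketch, sh -c stands in for qsub; the variable names seen and after are illustrative.

```shell
#!/bin/sh
# Start clean so the check at the end is meaningful.
unset DB IN

# sh -c stands in for qsub: like any command, it sees the variables that
# are assigned on the same line in front of it.
seen=$(DB=/data/TIGR/arabidopsis/ATH1.pep IN=input.fa sh -c 'echo "$DB $IN"')
echo "inside the command: $seen"

# The assignments applied to that one command only.
after=${DB:-unset}
echo "back in the calling shell: DB is $after"
```

Because the assignments last for a single command, the same schema file can be submitted again on the next line with a different database or input without any cleanup in between.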
Running many input files
Many programs, such as blast, run relatively quickly on each piece of input but usually have a large number of input chunks. In the case of blast, a chunk is a single sequence. In general, a chunk is the smallest amount of input that can be run independently. A little shell scripting can speed up the processing of a large number of blast chunks. The idea is to split the data into a few smaller groups of chunks and then run these groups as independent jobs. For example, a blast job of 1000 sequences can be sped up by running 5 jobs of 200 sequences.
The first step is to create the smaller input files. How to do this depends on the program, but almost any program that takes sequences as input can make use of seqsplit-fasta from Biosquid. This utility takes a sequence file and converts it into many smaller files. It can do this in one of two ways: it can create files with a given number of sequences each (with the last file possibly being smaller), or it can create a given number of files of approximately equal size. You specify the input file, the number of sequences per file or the number of files, a prefix for the output files, and a file format (if not fasta).
seqsplit-fasta -i proteins.fa -l 100 -p proteins-
This will create files of 100 sequences named proteins-01, proteins-02 and so on.
seqsplit-fasta -i proteins.fa -L 100 -p proteins-
That will create the files named proteins-01, proteins-02, and so on up to proteins-100.
seqsplit-fasta -i proteins.fa -L 100 -s 3 -p proteins-
The additional -s 3 will make the files named proteins-001, proteins-002 and so forth.
If each segment of the input has a constant number of lines or bytes, then split can be used to split the input:
split --lines=100 --numeric-suffixes --suffix-length=3 proteins.fa proteins-
This will create files with 100 lines in each. The files will be named proteins-000, proteins-001 and so forth.
If the data is small enough, then this can always be done by hand.
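When sequences span a varying number of lines, splitting by line count would cut records in half. As a rough sketch of the idea behind splitting by sequence count (this is not seqsplit-fasta, and the three-record input file is made up for the demonstration), awk can start a new output file at every n-th header line:

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d)

# A tiny stand-in for proteins.fa: three records, the second spans two lines.
printf '>s1\nAAA\n>s2\nCCC\nGGG\n>s3\nTTT\n' > "$tmp/proteins.fa"

# n      = sequences per output file
# prefix = start of each output file name
# A new output file is opened at every n-th header line (lines starting ">").
awk -v n=2 -v prefix="$tmp/proteins-" '
    /^>/ {
        if (count % n == 0) {
            if (out != "") close(out)
            out = sprintf("%s%03d", prefix, ++part)
        }
        count++
    }
    { print > out }
' "$tmp/proteins.fa"

ls "$tmp"
```

Here proteins-001 receives the first two records and proteins-002 the remaining one. The sketch assumes that the input starts with a header line.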
Now, we need to write a script file capable of taking any input we want. To do this, we will use the blast.sh from Reusing Scripts, modified so that only the input files need to be specified and so that it operates on multiple files:
#!/bin/sh
#PBS -v IN
DB=/data/NCBI/blast/nr/nr
PROGRAM=blastp
OPTIONS="-e 1e-10"
cd $PBS_O_WORKDIR
for input in $IN; do
    blastall -p $PROGRAM -d $DB -i $input $OPTIONS
done
With only a few input files, each one can be submitted by hand:
IN=proteins-01 qsub blast.sh
IN=proteins-02 qsub blast.sh
IN=proteins-03 qsub blast.sh
IN=proteins-04 qsub blast.sh
However, if there are dozens or hundreds of files, a more complicated command line is needed. While it can be run straight from the command line, it will likely be easier to save it to a file (a good name would be run-on-all-proteins.sh) and make that file executable. First, we use find to create a list of filenames to run. Then we pipe that output into xargs to print a certain number of them per line. Finally, we send that into a small while loop that submits our jobs. The full command line looks like this (formatted for readability):
#!/bin/sh
find . -name "proteins-*" | xargs -n 10 echo | while read IN; do
    export IN
    qsub blast.sh
done
chmod +x ./run-on-all-proteins.sh
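Before submitting for real, it can help to see how the pipeline groups the files. The following dry run is a sketch: it creates 25 empty stand-in files in a scratch directory and prints each qsub command instead of running it. The file count and the variable names are chosen only for the demonstration.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Create 25 empty stand-in input files: proteins-001 ... proteins-025.
i=1
while [ "$i" -le 25 ]; do
    : > "$tmp/proteins-$(printf '%03d' "$i")"
    i=$((i + 1))
done

# Same pipeline as above, but echo prints each qsub command instead of
# submitting it.
commands=$(find "$tmp" -name "proteins-*" | xargs -n 10 echo |
    while read IN; do
        echo "IN=\"$IN\" qsub blast.sh"
    done)

batches=$(printf '%s\n' "$commands" | wc -l)
echo "$commands"
echo "jobs that would be submitted: $((batches))"
```

With xargs -n 10, the 25 files are grouped into three jobs: two with ten input files each and one with the remaining five.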
