Asgard - FAQ for new users of the Beowulf

Content

  1. How do I get an account on the Beowulf?
  2. How do I connect to the machine and what's its name?
  3. Connecting without password to asgard.
  4. How do I compile an MPI program on the machine?
  5. And running the compiled program?
  6. One or two processors per node (ppn=2) due to memory or multiple threads?
  7. What do I do if I exceed my walltime limit?
  8. How can I get more information about my submitted jobs?
  9. More about the queueing system?
  10. Distributing the data before starting the computations?

Questions and Answers

How do I get an account on the Beowulf?
Fill in this form and send it to your local Beowulf coordinator (see "Group reference people" at the bottom) by e-mail.
          Account Form for Hreidar       (GMS 060830)
          ========================

Institution (mark one of following)
  math   (D-MATH)
  phys   (D-PHYS)
  werk   (D-MATL)
  geop   (D-ERDW)
  biol   (D-BIOL)
  bio2   (D-BIOL Prof. Ban)
  bio3   (D-BIOL Prof. Pelkmans)
  other

Group:

Name:
First Name:
Phone:
E-Mail:

Account validity:
If neither a limiting date, nor a length of time for which the
account should remain valid, nor the word "indefinite" is given,
the validity of the account will be restricted to 6 months.

Login Name (nethz-username):
Initial Password (first 4 letters):
The remaining characters of the password will be sent to the
applicant with the confirmation of the account creation. The
resulting initial password must be changed at the first login.

Shell (mark one of following)
  bash
  tcsh

To be e-mailed to the System Management by the Group Contact

Your local Beowulf coordinator will pass this form to the system administrator and you will be informed when and if your account is created.


How do I connect to the machine and what's its name?
Access to the machine is allowed only through secure shell (ssh). This applies to login as well as file transfer. On special occasions file transfer might also be done by an ftp connection initiated FROM Asgard towards the external machine.
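For example, copying a file to and from Asgard with scp, the secure copy command that comes with ssh, might look like this (a sketch; "results.dat" and "username" are placeholders):

scp results.dat username@asgard.ethz.ch:
scp username@asgard.ethz.ch:results.dat .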

The name of the machine is asgard[.ethz.ch], so that the login will usually be effected by the command "ssh asgard".

On Unix machines ssh is usually available or can be installed by the (local) system manager. The sources are available e.g. at the Sunsite mirror of Switch. For PCs and Macs there is a site license available at the ETHZ (search for "secure").

It is recommended to log out if you are not using the machine for a period of several hours or overnight (otherwise you will miss the news in the motd).

The machine is configured to use ssh 2 (there is no fall back to ssh 1).


Connecting without password to asgard.
Generate a key pair on your normal account with ssh-keygen. This key pair consists of a private and a public key. The private key can be protected with a passphrase. Since the file containing the private key is protected with your normal login password, this is not really necessary: you can give an empty passphrase (i.e. hit enter). Then copy the public key (you can find it in $HOME/.ssh/identity.pub) to asgard into the file $HOME/.ssh/authorized_keys. Now you should be able to log into asgard without giving the password on asgard. If you left the passphrase empty, you don't need to type anything.
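The whole procedure, run from your normal account, might look roughly like this (a sketch assuming an OpenSSH-style client; on newer ssh versions the key file may be called id_rsa.pub or id_dsa.pub instead of identity.pub):

ssh-keygen                                 # hit enter for an empty passphrase
scp $HOME/.ssh/identity.pub asgard:mykey.pub
ssh asgard "mkdir -p .ssh; cat mykey.pub >> .ssh/authorized_keys; rm mykey.pub"
ssh asgard                                 # should now log in without a password prompt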

If the above statement "you don't need to type anything" is not true and you are asked "Are you sure you want to continue connecting (yes/no)?" because the host is not known to your system, then your configuration needs a small fix. If you answer yes to that question, ssh tries to add the host key of the remote host to the file known_hosts. If this file is not found or not writable, the operation fails and you are asked over and over again. You can avoid this by specifying

UserKnownHostsFile $HOME/.ssh/known_hosts
in the file $HOME/.ssh/config.
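A minimal $HOME/.ssh/config could then look like this (a sketch; the Host and HostName lines are an illustrative addition that restricts the setting to asgard):

Host asgard
    HostName asgard.ethz.ch
    UserKnownHostsFile ~/.ssh/known_hosts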


How do I compile an MPI program on the machine?
Log into the machine and compile the program using mpicc (for LAM) or /usr/local/apli/mpich/bin/mpicc (for MPICH) instead of gcc. This automatically adds the necessary include paths, libraries and library paths.
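For example (a sketch; "myprog" is a placeholder program name):

# with LAM
mpicc -O2 -o myprog myprog.c
# with MPICH
/usr/local/apli/mpich/bin/mpicc -O2 -o myprog myprog.c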


And running the compiled program?
You need to submit a job to the queueing system. This is done with the qsub command. The method is somewhat different for LAM and MPICH: I have set up an example for both of them: LAM example and MPICH example.

Submitting the job is done using

qsub -l nodes=4 pbs.script
where 4 is the number of requested processors.

When you enter the given qsub command, a job identifier is returned. It may look like this:

$ qsub -l nodes=4 pbs.mpich
39717.asgard01.ethz.ch
When the job has finished, two new files should appear in the directory: pbs.mpich.o39717 (the standard output of the pbs.mpich script) and pbs.mpich.e39717 (its standard error).
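For orientation, a batch script along the lines of the MPICH example might look roughly like this (a sketch only, not the official example; the mpirun path, the use of $PBS_NODEFILE and the program name "myprog" are assumptions):

#!/bin/bash
#PBS -l nodes=4
cd $PBS_O_WORKDIR
# $PBS_NODEFILE lists the nodes allocated to this job
/usr/local/apli/mpich/bin/mpirun -machinefile $PBS_NODEFILE -np 4 ./myprog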


One or two processors per node (ppn=2) due to memory or multiple threads?
All Asgard nodes are (at the moment) double-CPU machines with 1GB memory and 1GB swap space per node. This poses certain requirements on the usage of the nodes. Some of the main points are described in the following paragraphs.
Single-process jobs
If your process uses two processors (threading), you should reserve both of them by using ppn=2 in the qsub call. If you don't do that, you are potentially allowing another job to run on the same node. This job will then take away from you the resources that you expect to be using.

If your process needs between 1/2 and 1 GB of memory, you should reserve the whole node by specifying ppn=2. If you don't do that, you are risking that either another user or even another of your own jobs using between 1/2 and 1 GB of memory will be run on the same node. The two jobs together could then need more than the available physical memory and start swapping. In such a case the walltime can potentially increase by a factor of 10 or more, causing the job to be aborted due to the time limit.

If your process needs between 1 and 2 GB of memory, it is paramount to use the ppn=2 parameter. Even so, it is practically guaranteed that your job will swap and the CPU-time/wallclock ratio will become pretty bad. You should start wondering whether Asgard is the right machine for you (or whether you have money to buy more memory for it).

If your process needs more than 2 GB of memory, you should run it somewhere else. I have seen jobs saying they used more than that, but I would attribute it more to bad accounting or good luck than to anything else.

Parallel jobs
If you are running parallel jobs (i.e. using more than 1 node), you should practically ALWAYS use the ppn=2 parameter. That guarantees that you are not sharing a node with another user who is doing strange things, it makes things easier for the scheduler (which does NOT guarantee that two of your ppn=1 jobs will run on the same nodes), and it helps with the automatic process deletion at the end of your job.

With MPI you do not have to reprogram your code to be able to run two processes on one node, but depending on the MPI version you use, your batch script might need a slight adjustment.

Of course, all the comments made about single-process job memory still apply (i.e. "1/2 to 1 GB: one process per node, 1-2 GB: you will be SLOW, more than 2 GB: forget it").

If you want to use both CPUs on a node, add ppn=2 as shown in this example:

qsub -l nodes=4:ppn=2 pbs.lam
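The same request can also be placed in the batch script itself as a PBS directive (standard PBS syntax; the walltime value is an illustrative addition):

#PBS -l nodes=4:ppn=2,walltime=1200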


What do I do if I exceed my walltime limit?
An empty pbs.mpich.e39717 means no errors. If you get
=>> PBS: job killed: walltime 654 exceeded limit 600
your job was running too long. Try using
qsub -l nodes=4,walltime=1200 pbs.mpich
where 1200 is the number of seconds your job is allowed to run. See the man page of qsub for more information.


How can I get more information about my submitted jobs?
You can get a list of all jobs submitted to a specific queueing server using qstat:
$ qstat -a                       

asgard01.ethz.ch: 
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
37949.asgard01. parcom04 small    job_s3        --    4  --    --  00:10 R   -- 
39717.asgard01. pfrauenf small    pbs         30267   4  --    --  00:10 R 00:01
39720.asgard01. parcom14 small    job_a1      30424   8  --    --  00:10 R 00:00
$ qstat -a @n69 

n69.asgard.net: 
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
47323.asgard01. avoellmy large    transims.b  18184  96  --  500mb 08:00 R 00:04
With qstatall you can get information about the jobs on all queueing servers, but beware: there may be quite a lot of them, so use qstatall -u username.

Or you can get more information about a particular job:

$ qstat -f 39717.asgard01.ethz.ch
Job Id: 39717.asgard01.ethz.ch
    Job_Name = pbs
    Job_Owner = pfrauenf@gate01.asgard.net
    resources_used.cput = 00:00:39
    resources_used.mem = 6668kb
    resources_used.vmem = 23444kb
    resources_used.walltime = 00:01:39
    job_state = R
    queue = small
    server = asgard01.ethz.ch
    Checkpoint = u
    ctime = Fri May 12 10:25:11 2000
    Error_Path = asgard01.ethz.ch:/asgard/home/math/pfrauenf/mpich/latbw/pbs.e3
        9717
    exec_host = n77/1+n76/1+n75/1+n74/1
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Fri May 12 10:25:12 2000
    Output_Path = asgard01.ethz.ch:/asgard/home/math/pfrauenf/mpich/latbw/pbs.o
        39717
    Priority = 0
    qtime = Fri May 12 10:25:11 2000
    Rerunable = True
    Resource_List.nodect = 4
    Resource_List.nodes = 4
    Resource_List.walltime = 00:10:00
    session_id = 30267
    Variable_List = PBS_O_HOME=/asgard/home/math/pfrauenf,
        PBS_O_LANG=de_CH.ISO-8859-1,PBS_O_LOGNAME=pfrauenf,
        PBS_O_PATH=/usr/local/apli/KAI/KCC.flex-3.4g-1/KCC_BASE/bin:/usr/local
        /apli/bin:/usr/pbs/bin:/asgard/home/math/pfrauenf/bin:/usr/local/bin:/u
        sr/bin:/usr/X11R6/bin:/bin:/usr/games/bin:/usr/games:/opt/gnome/bin:.,
        PBS_O_MAIL=/var/spool/mail/pfrauenf,PBS_O_SHELL=/bin/bash,
        PBS_O_HOST=asgard01.ethz.ch,
        PBS_O_WORKDIR=/asgard/home/math/pfrauenf/mpich/latbw,
        PBS_O_QUEUE=feed01
    comment = Job started on Fri May 12 at 10:25
    etime = Fri May 12 10:25:11 2000


More about the queueing system?
Have a look at the man pages of qalter, qdel, qhold, qmove, qmsg, qrerun, qrls, qselect, qstat and qsub.

qservers shows how many jobs are submitted to the different queue servers:

$ qservers   
Server            Max  Tot  Que  Run  Hld  Wat  Trn  Ext Status
---------------- ---- ---- ---- ---- ---- ---- ---- ---- ----------
asgard01.ethz.ch    0    3    0    3    0    0    0    0 Active    
n1.asgard.net       0    0    0    0    0    0    0    0 Idle      
n23.asgard.net      0   84   36   48    0    0    0    0 Active    
n46.asgard.net      0  141    5  136    0    0    0    0 Active    
n69.asgard.net      0    1    0    1    0    0    0    0 Active

qstatall -Q shows the distribution of the jobs in the queues:

$ qstatall -Q
Queue             Max  Tot  Ena  Str  Que  Run  Hld  Wat  Trn  Ext Type
---------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----------
creeper_r           0    0  yes  yes    0    0    0    0    0    0 Route     
feed01              0    0  yes  yes    0    0    0    0    0    0 Route     
small               0    1  yes  yes    0    1    0    0    0    0 Execution 
medium              0    1  yes  yes    1    0    0    0    0    0 Execution 
midnight            0    0  yes   no    0    0    0    0    0    0 Execution 
l1                  0    0  yes  yes    0    0    0    0    0    0 Route     
l2                  0    0  yes  yes    0    0    0    0    0    0 Route     
large_r             0    0  yes  yes    0    0    0    0    0    0 Route     
runner_r            0    0  yes  yes    0    0    0    0    0    0 Route     
mystery             0    0  yes   no    0    0    0    0    0    0 Route     
huge                0    0  yes   no    0    0    0    0    0    0 Route     
feed1               0    0  yes  yes    0    0    0    0    0    0 Route     
huge_a              0    0  yes   no    0    0    0    0    0    0 Execution 
feed23              0    0  yes  yes    0    0    0    0    0    0 Route     
creeper            48   83  yes  yes   35   48    0    0    0    0 Execution 
feed46              0    0  yes  yes    0    0    0    0    0    0 Route     
runner            138  141  yes  yes    5  136    0    0    0    0 Execution 
feed69              0    0  yes  yes    0    0    0    0    0    0 Route     
large               0    1  yes  yes    0    1    0    0    0    0 Execution


Distributing the data before starting the computations?
In the Modus operandi, sections 'Disk space' and 'Efficiency tips', users are advised to copy their data (and programs) to the local scratch disk of each compute node before starting the actual computation.

The following predefined variables are available:

HOME
Your home directory.
HOME_SRV
Server for the HOME directory.
WORK
Your working directory. By default identical to HOME. If you need a REALLY big space you can apply for it separately (through your contact person).
WORK_SRV
Server for the WORK directory.
ARCH
Your archive directory. By default identical to HOME. If some of your data is not used on a daily production basis but eats up your HOME or WORK space, you should consider using the archive server.
CAUTION! This directory is NOT to be used by batch jobs.
CAUTION! This directory should be limited to files which change only sporadically (if at all).
You can apply for ARCH space through your contact person.
ARCH_SRV
Server for the ARCH directory.
LOCAL_SCR
Your scratch directory on the compute node. If it doesn't exist, you can create it yourself (mkdir $LOCAL_SCR). The files in this space should exist only while you are using the compute node (your responsibility).

Examples for setting up the local scratch directory and copying the data file from the file server:

bash Shell
if [ ! -d $LOCAL_SCR ] ; then mkdir $LOCAL_SCR ; fi
rcp $WORK_SRV:$WORK/file $LOCAL_SCR
csh and tcsh Shells
if (! -d $LOCAL_SCR) mkdir $LOCAL_SCR
rcp $WORK_SRV:$WORK/file $LOCAL_SCR
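Put together in a bash batch script, the staging could look roughly like this (a sketch; the program name "myprog", the file names and the copy-back of the results are illustrative assumptions):

#!/bin/bash
#PBS -l nodes=1:ppn=2,walltime=3600
# create the local scratch directory and stage program and data onto the node
if [ ! -d $LOCAL_SCR ] ; then mkdir $LOCAL_SCR ; fi
rcp $WORK_SRV:$WORK/myprog $LOCAL_SCR
rcp $WORK_SRV:$WORK/input.dat $LOCAL_SCR
cd $LOCAL_SCR
./myprog input.dat > output.dat
# copy the results back to the file server before the job ends
rcp output.dat $WORK_SRV:$WORK/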

