= Job Management =
The queuing system (resource manager) is the heart of the distributed computing on a cluster. It consists of three parts: the server, the scheduler, and the machine-oriented mini-server (MOM) executing the jobs.
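Once everything is installed (see below), a quick way to see which of these daemons are running on a given host is a simple process check. This is only a sketch; it assumes the daemons are started from ''/usr/local/sbin'' as described in the remainder of this guide. {{{
# Show which queuing system daemons are active on this host
for daemon in pbs_server pbs_sched maui pbs_mom; do
    if pgrep -x "$daemon" > /dev/null; then
        echo "$daemon: running"
    else
        echo "$daemon: not running"
    fi
done
}}}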

We are assuming the following configuration:

||PBS TORQUE ||version 2.1.8    ||server, basic scheduler, mom ||[source:Externals/Cluster/torque-2.1.8.tgz download from repository]
||MAUI       ||version 3.2.6p18 ||scheduler                    ||[source:Externals/Cluster/maui-3.2.6p18.tgz download from repository]

Please check the distributors' websites for newer versions:

||PBS TORQUE ||http://www.clusterresources.com/pages/products/torque-resource-manager.php
||MAUI       ||http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php

The install directories for ''TORQUE'' and ''MAUI'' will be:

||PBS TORQUE ||''/var/spool/torque''
||MAUI       ||''/var/spool/maui''

== TORQUE ==

=== Register new services ===
Edit ''/etc/services'' and add at the end: {{{
# PBS/Torque services
pbs        15001/tcp  # pbs_server
pbs        15001/udp  # pbs_server
pbs_mom    15002/tcp  # pbs_mom <-> pbs_server
pbs_mom    15002/udp  # pbs_mom <-> pbs_server
pbs_resmom 15003/tcp  # pbs_mom resource management
pbs_resmom 15003/udp  # pbs_mom resource management
pbs_sched  15004/tcp  # pbs scheduler (pbs_sched)
pbs_sched  15004/udp  # pbs scheduler (pbs_sched)
}}}
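To double-check that the new entries are picked up by the system's service database, they can be queried with ''getent'' (a quick sanity check; the output format may vary slightly between distributions): {{{
# Each lookup should print the name and port/protocol registered above
getent services pbs pbs_mom pbs_resmom pbs_sched
}}}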

=== Setup and Configuration on the Master Node ===
Extract and build the TORQUE distribution on the master node. Configure the server, the MOM, and the clients to use secure file transfer (scp). {{{
export TORQUECFG=/var/spool/torque
tar -xzvf TORQUE.tar.gz
cd TORQUE
}}}

Configure for a 64-bit machine with the following compiler options: {{{
FFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona'', for AMD Opteron processors use ''-march=opteron''.
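The flags are not applied automatically; one common way is to export them in the shell before running ''configure'' (a sketch for an Opteron node; adjust ''-march'' as noted above): {{{
# Export build flags so that ./configure and make pick them up
export CFLAGS="-m64 -march=opteron -O3 -fPIC"
export CXXFLAGS="-m64 -march=opteron -O3 -fPIC"
export FFLAGS="-m64 -march=opteron -O3 -fPIC"
export LDFLAGS="-L/usr/local/lib -L/usr/local/lib64"
}}}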

Configure, build, and install: {{{
./configure --prefix=/usr/local --with-spooldir=$TORQUECFG
make
make install
}}}
If not configured otherwise, the binaries are installed in ''/usr/local/bin'' and ''/usr/local/sbin''.

Initialise/configure the queuing system's server daemon (''pbs_server''): {{{
pbs_server -t create
}}}

Set the PBS operators and managers (each entry must be a valid user name) and enable scheduling. Note that the server name itself (''master01.procksi.local'') is stored in the file ''$TORQUECFG/server_name'' and is not set via ''qmgr''. {{{
qmgr

set server scheduling = true
set server operators = ","
set server managers = ","

}}}
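Operators and managers are specified as ''user@host'' entries. Purely as an illustration (the real account names depend on the local setup), granting these rights to the ''procksi'' and ''root'' accounts on the master node could look like this inside ''qmgr'': {{{
# Example values only - replace with the actual administrative accounts
set server operators = procksi@master01.procksi.local
set server operators += root@master01.procksi.local
set server managers = procksi@master01.procksi.local
set server managers += root@master01.procksi.local
}}}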

Allow only ''procksi'' and ''root'' to submit jobs into the queue: {{{

set server acl_users = "root,procksi"
set server acl_user_enable = true

}}}

Set email address for email that is sent by PBS: {{{

set server mail_from =

}}}

Allow job submissions from the slave hosts (only):
'''ATTENTION: NEEDS TO BE CHECKED. DOES NOT WORK PROPERLY YET!!''' {{{

set server allow_node_submit = true
set server submit_hosts = master01.procksi.local
set server submit_hosts += slave01.procksi.local
set server submit_hosts += slave02.procksi.local
set server submit_hosts += slave03.procksi.local
set server submit_hosts += slave04.procksi.local
}}}

Restrict the nodes that can access the PBS server: {{{

set server acl_hosts = master01.procksi.local
set server acl_hosts += slave01.procksi.local
set server acl_hosts += slave02.procksi.local
set server acl_hosts += slave03.procksi.local
set server acl_hosts += slave04.procksi.local

set server acl_host_enable = true

}}}

Also set the following in ''torque.cfg'' so that the internal interface is used: {{{
SERVERHOST master01.procksi.local
ALLOWCOMPUTEHOSTSUBMIT true
}}}

Configure default node to be used (see below): {{{

set server default_node = slave

}}}

Set the default queue to ''batch'': {{{

set server default_queue=batch

}}}

Configure the main queue ''batch'': {{{

create queue batch queue_type=execution
set queue batch started=true
set queue batch enabled=true
set queue batch resources_default.nodes=1

}}}

Configure the queue ''test'' accordingly.
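For example, a ''test'' queue mirroring the ''batch'' settings could be created like this (adapt the defaults as needed): {{{
create queue test queue_type=execution
set queue test started=true
set queue test enabled=true
set queue test resources_default.nodes=1
}}}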

Specify all compute nodes to be used by creating/editing ''$TORQUECFG/server_priv/nodes''. This may include the machine on which pbs_server runs. If a compute node has more than one processor, add ''np=X'' after its name, where X is the number of processors. Node attributes can be added so that a subset of nodes can be requested at submission time. {{{
master01.procksi.local np=1 procksi master xeon
slave01.procksi.local np=2 procksi slave xeon
slave02.procksi.local np=2 procksi slave xeon
slave03.procksi.local np=4 procksi slave opteron
slave04.procksi.local np=4 procksi slave opteron
}}}

Although the master node (''master01'') has two processors as well, we only allow one processor to be used for the queueing system as the other processor will be used for handling all frontend communication and I/O. (Make sure that hyperthreading technology is disabled on the head node and all compute nodes!)

Request a job to be run on specific nodes (at submission time):

* Run on any compute node: {{{
qsub -q batch -l nodes=1:procksi
}}}
* Run on any slave node: {{{
qsub -q batch -l nodes=1:slave
}}}
* Run on the master node: {{{
qsub -q batch -l nodes=1:master
}}}
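The same resource requests can also be placed inside a job script as ''#PBS'' directives, so they do not have to be repeated on the command line. A minimal sketch (script name and contents are examples only): {{{
#!/bin/bash
#PBS -q batch
#PBS -l nodes=1:slave
#PBS -N procksi_test
#PBS -j oe

# Run in the directory the job was submitted from
cd $PBS_O_WORKDIR
echo "Running on $(hostname)"
}}}
Such a script would then be submitted simply with ''qsub procksi_test.sh''.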

=== Setup and Configuration on the Slave Nodes ===
Extract and build the TORQUE distribution on each slave node. Configure the MOM and the clients to use secure file transfer (scp). {{{
export TORQUECFG=/var/spool/torque
tar -xzvf TORQUE.tar.gz
cd TORQUE
}}}

Configure for a 64-bit machine with the following compiler options: {{{
FFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona'', for AMD Opteron processors use ''-march=opteron''.

Configure, build, and install: {{{
./configure --prefix=/usr/local --with-spooldir=$TORQUECFG --disable-server --enable-mom --enable-clients --with-default-server=master01.procksi.local
make
make install
}}}

Configure the compute nodes by creating/editing ''$TORQUECFG/mom_priv/config''. The ''$pbsserver'' line specifies the PBS server, the ''$restricted'' line specifies hosts that can be trusted to access MOM services as non-root, and the ''$usecp'' line allows copying data via NFS without using scp. {{{
$pbsserver master01.procksi.local
$loglevel 255
$restricted master01.procksi.local
$usecp master01.procksi.local:/home/procksi /home/procksi
}}}
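A MOM that is already running does not pick up changes to this file automatically; should it need to be reconfigured later, the simplest approach is to restart it (sketch): {{{
# Restart pbs_mom so that mom_priv/config is re-read
pkill -x pbs_mom
/usr/local/sbin/pbs_mom
}}}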

Start the queueing system (manually) in the correct order:

* Start the mom: {{{
/usr/local/sbin/pbs_mom
}}}
* Kill the server: {{{
/usr/local/sbin/qterm -t quick
}}}
* Start the server: {{{
/usr/local/sbin/pbs_server
}}}
* Start the scheduler: {{{
/usr/local/sbin/pbs_sched
}}}

If you want to use MAUI as the final scheduler, remember to kill ''pbs_sched'' after testing the TORQUE installation.

Check that all nodes are properly configured and reporting correctly: {{{
qstat -q
pbsnodes -a
}}}
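A short end-to-end test is to submit a trivial job and watch it run through the queue (sketch; assumes submission as a user allowed by ''acl_users'', e.g. ''procksi''): {{{
# Submit a 30-second dummy job to the batch queue and monitor it
echo "sleep 30" | qsub -q batch -l nodes=1:slave
qstat -a
}}}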

=== Prologue and Epilogue Scripts ===
Get [repos:Externals/Cluster/procksi_pbs.tgz] from the repository and untar it: {{{
tar -xvzf procksi_pbs.tgz
}}}

The ''prologue'' script is executed just before the submitted job starts. Here, it generates a unique temp directory for each job in ''/scratch''.
It must be installed on each NODE (master, slave): {{{
cp ./pbs/NODE/var/spool/torque/mom/priv/prologue $TORQUECFG/mom_priv
chmod 500 $TORQUECFG/mom_priv/prologue
}}}

The ''epilogue'' script is executed right after the submitted job has ended. Here, it deletes the job's temp directory from ''/scratch''. It must be installed on each NODE (master, slave): {{{
cp ./pbs/NODE/var/spool/torque/mom/priv/epilogue $TORQUECFG/mom_priv
chmod 500 $TORQUECFG/mom_priv/epilogue
}}}
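The scripts shipped in ''procksi_pbs.tgz'' should be used as provided. Purely to illustrate the mechanism, a stripped-down prologue could look like the sketch below; ''pbs_mom'' passes the job id as the first argument and the user name as the second: {{{
#!/bin/bash
# prologue (sketch): create a per-job temp directory in /scratch
# $1 = job id, $2 = user name (arguments supplied by pbs_mom)
mkdir -p /scratch/$1
chown $2 /scratch/$1
exit 0
}}}
and the matching epilogue would remove the directory again: {{{
#!/bin/bash
# epilogue (sketch): remove the per-job temp directory
# $1 = job id (argument supplied by pbs_mom)
rm -rf /scratch/$1
exit 0
}}}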

== MAUI ==

=== Register new services ===
Edit ''/etc/services'' and add at the end: {{{
# PBS/MAUI services
pbs_maui 42559/tcp  # pbs scheduler (maui)
pbs_maui 42559/udp  # pbs scheduler (maui)
}}}

=== Setup and Configuration on the Head Node ===
Extract and build the MAUI distribution. {{{
export MAUIDIR=/usr/local/maui
tar -xzvf MAUI.tar.gz
cd MAUI
}}}

Configure for a 64-bit machine with the following compiler options: {{{
FFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona'', for AMD Opteron processors use ''-march=opteron''.

Configure, build, and install: {{{
./configure --with-pbs=$TORQUECFG --with-spooldir=$MAUIDIR
make
make install
}}}

Fine-tune MAUI in ''$MAUIDIR/maui.cfg'': {{{
SERVERHOST master01.procksi.local

# primary admin must be first in list
ADMIN1 procksi
ADMIN1 root

# Resource Manager Definition
RMCFG[MASTER01.PROCKSI.LOCAL] TYPE=PBS@RMNMHOST@ PORT=15001 EPORT=15004
# EPORT may alternatively have to be 15017 - try!

SERVERPORT 42559
SERVERMODE NORMAL

# Node Allocation:
NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITY='-JOBCOUNT'
}}}

Start the MAUI scheduler manually. Make sure that ''pbs_sched'' is no longer running.

* Start the scheduler: {{{
/usr/local/sbin/maui
}}}
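Once MAUI is up, its client commands can be used to confirm that it is talking to the PBS server (a quick check): {{{
showq          # list idle/active jobs as seen by MAUI
diagnose -n    # show the state of all compute nodes
}}}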

Get [repos:Externals/Cluster/procksi_pbs.tgz] from the repository and untar it: {{{
tar -xvzf procksi_pbs.tgz
}}}

Make the entire queuing system (Torque + Maui) start at bootup: {{{
cp ./pbs/master/etc/init.d/pbs_* /etc/init.d/
/sbin/chkconfig --add pbs_mom
/sbin/chkconfig --add pbs_maui
/sbin/chkconfig --add pbs_server
/sbin/chkconfig pbs_mom on
/sbin/chkconfig pbs_maui on
/sbin/chkconfig pbs_server on
}}}
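The runlevel registration can be verified afterwards: {{{
# All three services should be listed as "on" for the default runlevels
/sbin/chkconfig --list | grep pbs_
}}}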

If you want to use the simple scheduler that comes with PBS Torque, then substitute ''pbs_maui'' with ''pbs_sched''.

=== Setup and Configuration on the Slave Nodes ===
Get [repos:Externals/Cluster/procksi_pbs.tgz] from the repository and untar it: {{{
tar -xvzf procksi_pbs.tgz
}}}

Make the entire queuing system start at bootup: {{{
cp ./pbs/slave/etc/init.d/pbs_mom /etc/init.d/
/sbin/chkconfig --add pbs_mom
/sbin/chkconfig pbs_mom on
}}}