JobManagement » History » Version 5

Paweł Widera, 11/28/2008 08:20 PM
More complex priority function for maui added.

= Job Management =
The queuing system (resource manager) is the heart of distributed computing on a cluster. It consists of three parts: the server, the scheduler, and the machine-oriented mini-server (MOM) that executes the jobs.

We assume the following configuration:

 ||PBS TORQUE||version 2.1.8||server, basic scheduler, mom||[source:Externals/Cluster/torque-2.1.8.tgz download from repository]||
 ||MAUI||version 3.2.6p18||scheduler||[source:Externals/Cluster/maui-3.2.6p18.tgz download from repository]||

Please check the distributors' websites for newer versions:

 ||PBS TORQUE||http://www.clusterresources.com/pages/products/torque-resource-manager.php||
 ||MAUI||http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php||

The install directories for ''TORQUE'' and ''MAUI'' will be:

 ||PBS TORQUE||''/var/spool/torque''||
 ||MAUI||''/var/spool/maui''||

== TORQUE ==

=== Register new services ===
Edit ''/etc/services'' and add at the end:
{{{
# PBS/Torque services
pbs           15001/tcp    # pbs_server
pbs           15001/udp    # pbs_server
pbs_mom       15002/tcp    # pbs_mom <-> pbs_server
pbs_mom       15002/udp    # pbs_mom <-> pbs_server
pbs_resmom    15003/tcp    # pbs_mom resource management
pbs_resmom    15003/udp    # pbs_mom resource management
pbs_sched     15004/tcp    # pbs scheduler (pbs_sched)
pbs_sched     15004/udp    # pbs scheduler (pbs_sched)
}}}
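
If the entries are added by a script rather than by hand, it helps to make the step idempotent so that re-running it does not duplicate lines. The following is a minimal sketch; the function name ''add_service_entry'' and its argument order are assumptions made for this example and are not part of TORQUE.

```shell
#!/bin/sh
# Append one service entry to a services file unless an entry with the
# same name, port and protocol is already present.
# Usage: add_service_entry FILE NAME PORT PROTO COMMENT
add_service_entry() {
    file=$1; name=$2; port=$3; proto=$4; comment=$5
    # Look for "<name><whitespace><port>/<proto>" at the start of a line.
    if grep -Eq "^${name}[[:space:]]+${port}/${proto}([[:space:]]|\$)" "$file" 2>/dev/null; then
        return 0
    fi
    printf '%-13s %s/%s    # %s\n' "$name" "$port" "$proto" "$comment" >> "$file"
}

# Example (as root, and preferably tried on a copy of the file first):
# for proto in tcp udp; do
#     add_service_entry /etc/services pbs_mom 15002 "$proto" "pbs_mom <-> pbs_server"
# done
```

Calling the function twice with the same arguments leaves a single line in the file.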

=== Setup and Configuration on the Master Node ===
Extract and build the TORQUE distribution on the master node. Configure the server, monitor, and clients to use secure file transfer (scp).
{{{
export TORQUECFG=/var/spool/torque
tar -xzvf TORQUE.tar.gz
cd TORQUE
}}}

Configuration for a 64-bit machine with the following compiler options:
{{{
FFLAGS   = "-m64 -march=[Add Architecture]  -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture]  -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture]  -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona''; for AMD Opteron processors use ''-march=opteron''.

Configure, build, and install:
{{{
./configure  --prefix=/usr/local --with-spooldir=$TORQUECFG
make
make install
}}}
Unless configured otherwise, binaries are installed in ''/usr/local/bin'' and ''/usr/local/sbin''.

Initialise and configure the queuing system's server daemon (''pbs_server''):
{{{
pbs_server -t create
}}}

Set the PBS operators and managers (each must be a valid user name):
{{{
qmgr
> set server_name = master01.procksi.local
> set server scheduling = true
> set server operators  = "root@master01.procksi.local,procksi@master01.procksi.local"
> set server managers  = "root@master01.procksi.local,procksi@master01.procksi.local"
}}}

Allow only ''procksi'' and ''root'' to submit jobs into the queue:
{{{
> set server acl_users = "root,procksi"
> set server acl_user_enable = true
}}}

Set the sender address for email sent by PBS:
{{{
> set server mail_from = pbs@procksi.net
}}}

Allow submissions from slave hosts (only):
'''ATTENTION: NEEDS TO BE CHECKED. DOES NOT WORK PROPERLY YET!! '''
{{{
{{{
> set server allow_node_submit = true
> set server submit_hosts = master01.procksi.local
                            slave01.procksi.local
                            slave02.procksi.local
                            slave03.procksi.local
                            slave04.procksi.local
}}}

Restrict nodes that can access the PBS server:
{{{
> set server acl_hosts = master01.procksi.local
                         slave01.procksi.local
                         slave02.procksi.local
                         slave03.procksi.local
                         slave04.procksi.local
> set server acl_host_enable = true
}}}

And set in ''torque.cfg'' in order to use the internal interface:
{{{
SERVERHOST              master01.procksi.local
ALLOWCOMPUTEHOSTSUBMIT  true
}}}
}}}

Configure the default node to be used (see below):
{{{
> set server default_node = slave
}}}

Set the default queue to ''batch'':
{{{
> set server default_queue=batch
}}}

Configure the main queue ''batch'':
{{{
> create queue batch queue_type=execution
> set queue batch started=true
> set queue batch enabled=true
> set queue batch resources_default.nodes=1
}}}

Configure the queue ''test'' accordingly.
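
Since the ''test'' queue repeats the ''batch'' commands with a different name, the commands can be generated rather than retyped. The sketch below only prints the same four ''qmgr'' commands for a given queue name; the helper name ''print_queue_setup'' and its optional second argument (the default node count) are assumptions for this example.

```shell
#!/bin/sh
# Print the qmgr commands that create and enable an execution queue.
# Usage: print_queue_setup QUEUE [DEFAULT_NODES]
print_queue_setup() {
    queue=$1
    nodes=${2:-1}    # default node count, 1 unless given
    cat <<EOF
create queue $queue queue_type=execution
set queue $queue started=true
set queue $queue enabled=true
set queue $queue resources_default.nodes=$nodes
EOF
}

# Example: review the output first, then feed it to qmgr on the master node.
# print_queue_setup test
# print_queue_setup test | qmgr
```

Printing the commands first, instead of running them directly, makes it easy to check them before they change the server configuration.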

Specify all compute nodes to be used by creating or editing ''$TORQUECFG/server_priv/nodes''. This may include the machine on which ''pbs_server'' runs. If a compute node has more than one processor, add ''np=X'' after its name, where X is the number of processors. Add node attributes so that a subset of nodes can be requested at submission time.
{{{
master01.procksi.local  np=2  procksi  master  xeon
slave01.procksi.local   np=2  procksi  slave   xeon
slave02.procksi.local   np=2  procksi  slave   xeon
slave03.procksi.local   np=4  procksi  slave   opteron
slave04.procksi.local   np=4  procksi  slave   opteron
}}}
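
A typo in the nodes file is easy to miss, so a quick sanity check before restarting the server can save time. The following sketch only verifies that every ''np='' token is a positive integer and counts the node lines; it is not a TORQUE tool, and the name ''check_nodes_file'' is an assumption for this example.

```shell
#!/bin/sh
# Check a nodes file: every np= token must be a positive integer.
# Prints one message per bad token, then the number of node lines.
# Returns non-zero if any bad token was found.
check_nodes_file() {
    awk '
        /^[ \t]*$/ { next }                       # skip blank lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^np=/ && $i !~ /^np=[1-9][0-9]*$/) {
                    printf "line %d: bad processor count \"%s\"\n", NR, $i
                    bad = 1
                }
            }
            total++
        }
        END { printf "%d node(s) checked\n", total; exit bad }
    ' "$1"
}

# Example:
# check_nodes_file "$TORQUECFG/server_priv/nodes"
```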

Although the master node (''master01'') also has two processors, we allow only one of them to be used by the queuing system; the other handles all frontend communication and I/O. (Make sure that hyper-threading is disabled on the head node and all compute nodes!)

Request that a job be run on specific nodes (on submission):
 * Run on any compute node:
 {{{
 qsub -q batch -l nodes=1:procksi
 }}}
 * Run on any slave node:
 {{{
 qsub -q batch -l nodes=1:slave
 }}}
 * Run on master node:
 {{{
 qsub -q batch -l nodes=1:master
 }}}

=== Setup and Configuration on the Slave Nodes ===
Extract and build the TORQUE distribution on each slave node. Configure the monitor and clients to use secure file transfer (scp).
{{{
export TORQUECFG=/var/spool/torque
tar -xzvf TORQUE.tar.gz
cd TORQUE
}}}

Configuration for a 64-bit machine with the following compiler options:
{{{
FFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona''; for AMD Opteron processors use ''-march=opteron''.

Configure, build, and install:
{{{
./configure  --prefix=/usr/local --with-spooldir=$TORQUECFG --disable-server --enable-mom --enable-clients --with-default-server=master01.procksi.local
make
make install
}}}

Configure the compute nodes by creating or editing ''$TORQUECFG/mom_priv/config''. ''$pbsserver'' specifies the PBS server, ''$loglevel'' sets the logging verbosity, ''$restricted'' specifies hosts that can be trusted to access mom services as non-root, and ''$usecp'' allows copying data via NFS without using scp.
{{{
$pbsserver   master01.procksi.local
$loglevel    255
$restricted  master01.procksi.local
$usecp       master01.procksi.local:/home/procksi   /home/procksi
}}}
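
Because the same file is needed on every slave node, it can be generated from two parameters instead of being edited by hand on each machine. This is a minimal sketch that prints the four directives shown above; the function name ''write_mom_config'' and its arguments are assumptions for this example.

```shell
#!/bin/sh
# Print a mom_priv/config for the given PBS server and shared home directory.
# Usage: write_mom_config SERVER SHARED_HOME
write_mom_config() {
    server=$1
    shared_home=$2
    # \$ keeps the leading dollar sign of each directive literal.
    cat <<EOF
\$pbsserver   $server
\$loglevel    255
\$restricted  $server
\$usecp       $server:$shared_home   $shared_home
EOF
}

# Example (on each slave node, as root):
# write_mom_config master01.procksi.local /home/procksi > "$TORQUECFG/mom_priv/config"
```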

Start the queuing system (manually) in the correct order:
 * Start the mom:
 {{{
 /usr/local/sbin/pbs_mom
 }}}
 * Kill the server:
 {{{
 /usr/local/sbin/qterm -t quick
 }}}
 * Start the server:
 {{{
 /usr/local/sbin/pbs_server
 }}}
 * Start the scheduler:
 {{{
 /usr/local/sbin/pbs_sched
 }}}

If you want to use MAUI as the final scheduler, remember to kill ''pbs_sched'' after testing the TORQUE installation.

Check that all nodes are properly configured and reporting correctly:
{{{
qstat  -q
pbsnodes -a
}}}
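
On a larger cluster the full ''pbsnodes -a'' listing is long, and a small filter helps to spot problem nodes. The sketch below assumes the usual layout of that output (an unindented node name followed by indented ''key = value'' lines) and reports nodes whose state contains ''down'', ''offline'' or ''unknown''. The function name ''list_unavailable_nodes'' is an assumption for this example.

```shell
#!/bin/sh
# Read "pbsnodes -a" style output on stdin and print "<node> <state>"
# for every node whose state contains down, offline or unknown.
list_unavailable_nodes() {
    awk '
        /^[^ \t]/ { node = $1; next }    # unindented line: node name
        $1 == "state" && $2 == "=" && $3 ~ /down|offline|unknown/ {
            print node, $3
        }
    '
}

# Example:
# pbsnodes -a | list_unavailable_nodes
```

No output means that no node reported one of those states.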

=== Prologue and Epilogue Scripts ===
Get [repos:Externals/procksi_pbs.tgz] from the repository and untar it:
{{{
tar -xvzf procksi_pbs.tgz
}}}

The ''prologue'' script is executed just before the submitted job starts. Here, it generates a unique temp directory for each job in ''/scratch''.
It must be installed on each NODE (master, slave):
{{{
cp ./pbs/NODE/var/spool/torque/mom/priv/prologue $TORQUECFG/mom_priv
chmod 500 $TORQUECFG/mom_priv/prologue
}}}

The ''epilogue'' script is executed right after the submitted job has ended. Here, it deletes the job's temp directory from ''/scratch''. It must be installed on each NODE (master, slave):
{{{
cp ./pbs/NODE/var/spool/torque/mom/priv/epilogue $TORQUECFG/mom_priv
chmod 500 $TORQUECFG/mom_priv/epilogue
}}}
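
The actual scripts ship in ''procksi_pbs.tgz'' and are not reproduced here. Purely as an illustration of the idea, the following hypothetical helpers create and remove a per-job directory named after the job id, which TORQUE passes to prologue and epilogue scripts as their first argument. The function names and the ''SCRATCH_BASE'' variable are assumptions for this sketch.

```shell
#!/bin/sh
# Hypothetical helpers illustrating a per-job scratch directory.
# SCRATCH_BASE defaults to /scratch; override it for testing.

job_scratch_dir() {
    printf '%s/%s\n' "${SCRATCH_BASE:-/scratch}" "$1"
}

# Prologue side: create a private directory for the job.
create_job_scratch() {
    [ -n "$1" ] || return 1          # refuse an empty job id
    dir=$(job_scratch_dir "$1")
    mkdir -p "$dir" && chmod 700 "$dir"
}

# Epilogue side: remove the job's directory again.
remove_job_scratch() {
    [ -n "$1" ] || return 1          # an empty id would target the base directory
    dir=$(job_scratch_dir "$1")
    rm -rf "$dir"
}

# In a prologue script:  create_job_scratch "$1"
# In an epilogue script: remove_job_scratch "$1"
```

A real prologue would typically also hand ownership of the directory to the job's user, which is omitted here.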

== MAUI ==

=== Register new services ===
Edit ''/etc/services'' and add at the end:
{{{
# PBS/MAUI services
pbs_maui  42559/tcp    # pbs scheduler (maui)
pbs_maui  42559/udp    # pbs scheduler (maui)
}}}

=== Setup and Configuration on the Head Node ===
Extract and build the MAUI distribution.
{{{
export MAUIDIR=/var/spool/maui
tar -xzvf MAUI.tar.gz
cd MAUI
}}}

Configuration for a 64-bit machine with the following compiler options:
{{{
FFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona''; for AMD Opteron processors use ''-march=opteron''.

Configure, build, and install:
{{{
./configure --with-pbs=$TORQUECFG --with-spooldir=$MAUIDIR
make
make install
}}}

Fine-tune MAUI in ''$MAUIDIR/maui.cfg'':
{{{
SERVERHOST            master01.procksi.local

# primary admin must be first in list
ADMIN1                procksi root

# Resource Manager Definition
# EPORT can alternatively be 15017 (still to be tried)
RMCFG[MASTER01.PROCKSI.LOCAL] TYPE=PBS PORT=15001 EPORT=15004

SERVERPORT  42559
SERVERMODE  NORMAL

# Node Allocation:
# JOBCOUNT  number of jobs currently running on node
# LOAD      current 1 minute load average
# AMEM      real memory currently available to batch jobs
# APROCS    processors currently available to batch jobs
# PREF      node meets job specific resource preferences

NODEALLOCATIONPOLICY  PRIORITY
NODECFG[DEFAULT] PRIORITYF='-JOBCOUNT - 2*LOAD + 0.5*AMEM + 0.25*APROCS + PREF'
}}}
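
To get a feel for how the terms of the priority function weigh against each other, the expression can be evaluated by hand for a few sample nodes. The sketch below simply computes the same arithmetic with ''awk''; it does not query MAUI, the input values are made up, and the units in which MAUI reports ''AMEM'' and ''PREF'' are not confirmed here.

```shell
#!/bin/sh
# Evaluate -JOBCOUNT - 2*LOAD + 0.5*AMEM + 0.25*APROCS + PREF
# for one node and print the result with two decimals.
# Usage: node_priority JOBCOUNT LOAD AMEM APROCS PREF
node_priority() {
    awk -v jobcount="$1" -v load="$2" -v amem="$3" -v aprocs="$4" -v pref="$5" \
        'BEGIN { printf "%.2f\n", -jobcount - 2*load + 0.5*amem + 0.25*aprocs + pref }'
}

# Example with made-up values:
# node_priority 1 0.5 2048 2 0
```

With the sample values 1, 0.5, 2048, 2 and 0 the terms are -1 - 1 + 1024 + 0.5 + 0, so the function prints 1022.50. If ''AMEM'' is a large number such as a megabyte count, the memory term dominates the other terms, which is worth keeping in mind when tuning the weights.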

Start the MAUI scheduler manually. Make sure that ''pbs_sched'' is no longer running.

 * Start the scheduler:
 {{{
 /usr/local/sbin/maui
 }}}

Get [repos:Externals/Cluster/procksi_pbs.tgz] from the repository and untar it:
{{{
tar -xvzf procksi_pbs.tgz
}}}

Make the entire queuing system (Torque + Maui) start at bootup:
{{{
cp ./pbs/master/etc/init.d/pbs_* /etc/init.d/
/sbin/chkconfig  --add pbs_mom
/sbin/chkconfig  --add pbs_maui
/sbin/chkconfig  --add pbs_server
/sbin/chkconfig  pbs_mom  on
/sbin/chkconfig  pbs_maui  on
/sbin/chkconfig  pbs_server  on
}}}

If you want to use the simple scheduler that comes with PBS Torque, replace ''pbs_maui'' with ''pbs_sched''.

=== Setup and Configuration on the Slave Nodes ===
Get [repos:Externals/Cluster/procksi_pbs.tgz] from the repository and untar it:
{{{
tar -xvzf procksi_pbs.tgz
}}}

Make the entire queuing system start at bootup:
{{{
cp ./pbs/slave/etc/init.d/pbs_mom /etc/init.d/
/sbin/chkconfig  --add pbs_mom
/sbin/chkconfig  pbs_mom  on
}}}