JobManagement » History » Version 7

Paweł Widera, 11/28/2008 08:43 PM
Maui grid status commands added

h1. Job Management

The queuing system (resource manager) is the heart of distributed computing on a cluster. It consists of three parts: the server, the scheduler, and the machine-oriented mini-server (MOM) that executes the jobs.

We assume the following configuration:

|PBS TORQUE|version 2.1.8|server, basic scheduler, MOM|[source:Externals/Cluster/torque-2.1.8.tgz download from repository]|
|MAUI|version 3.2.6p18|scheduler|[source:Externals/Cluster/maui-3.2.6p18.tgz download from repository]|

Please check the distributors' websites for newer versions:

|PBS TORQUE|http://www.clusterresources.com/pages/products/torque-resource-manager.php|
|MAUI|http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php|

The install directories for _TORQUE_ and _MAUI_ will be:

|PBS TORQUE|_/var/spool/torque_|
|MAUI|_/var/spool/maui_|

h2. TORQUE

h3. Register new services

Edit _/etc/services_ and add at the end:
<pre>
# PBS/Torque services
pbs           15001/tcp    # pbs_server
pbs           15001/udp    # pbs_server
pbs_mom       15002/tcp    # pbs_mom <-> pbs_server
pbs_mom       15002/udp    # pbs_mom <-> pbs_server
pbs_resmom    15003/tcp    # pbs_mom resource management
pbs_resmom    15003/udp    # pbs_mom resource management
pbs_sched     15004/tcp    # pbs scheduler (pbs_sched)
pbs_sched     15004/udp    # pbs scheduler (pbs_sched)
</pre>
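
If the entries have to be added on several machines, the edit can be scripted. The following is only a sketch: the function name @add_pbs_services@ is made up for this page, and it assumes that a name/port/protocol combination already present in the file should be left untouched.

```shell
#!/bin/sh
# Append the PBS/Torque entries to a services file, skipping entries that
# are already present, so the function can be run more than once.
# Hypothetical helper; usage (as root): add_pbs_services /etc/services
add_pbs_services() {
    services_file=$1
    while read -r name port comment; do
        for proto in tcp udp; do
            # Only append when no line for this name/port/protocol exists yet
            if ! grep -Eq "^${name}[[:space:]]+${port}/${proto}([[:space:]]|\$)" "$services_file"; then
                printf '%-13s %s/%s    # %s\n' "$name" "$port" "$proto" "$comment" >> "$services_file"
            fi
        done
    done <<'EOF'
pbs 15001 pbs_server
pbs_mom 15002 pbs_mom <-> pbs_server
pbs_resmom 15003 pbs_mom resource management
pbs_sched 15004 pbs scheduler (pbs_sched)
EOF
}
```

Calling the function a second time on the same file adds nothing, which makes it safe to include in a node setup script.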

h3. Setup and Configuration on the Master Node

Extract and build the TORQUE distribution on the master node. Configure the server, MOM, and clients to use secure file transfer (scp).
<pre>
export TORQUECFG=/var/spool/torque
tar -xzvf TORQUE.tar.gz
cd TORQUE
</pre>

Configure for a 64-bit machine with the following compiler options:
<pre>
FFLAGS   = "-m64 -march=[Add Architecture]  -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture]  -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture]  -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
</pre>
*Attention*: For Intel Xeon processors use _-march=nocona_, for AMD Opteron processors use _-march=opteron_.
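
One way to pass these options to @./configure@ is to export them as environment variables first. The sketch below picks the @-march@ value from a CPU model string; the function names are made up for this page, and it assumes that the model string contains the word "Xeon" or "Opteron".

```shell
#!/bin/sh
# Map a CPU model string to the -march value recommended above.
march_for_cpu() {
    case $1 in
        *Xeon*)    echo nocona ;;
        *Opteron*) echo opteron ;;
        *)         return 1 ;;   # unknown CPU: choose the value by hand
    esac
}

# Print export statements for the compiler flags, ready to be eval'ed
# in the shell that will run ./configure.
print_build_flags() {
    march=$(march_for_cpu "$1") || return 1
    opt="-m64 -march=$march -O3 -fPIC"
    printf 'export FFLAGS="%s"\n' "$opt"
    printf 'export CFLAGS="%s"\n' "$opt"
    printf 'export CXXFLAGS="%s"\n' "$opt"
    printf 'export LDFLAGS="%s"\n' "-L/usr/local/lib -L/usr/local/lib64"
}
```

On Linux, something like @eval "$(print_build_flags "$(grep -m1 'model name' /proc/cpuinfo)")"@ would set the variables for the current shell. Check the printed flags before building.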

Configure, build, and install:
<pre>
./configure  --prefix=/usr/local --with-spooldir=$TORQUECFG
make
make install
</pre>
Unless configured otherwise, binaries are installed in _/usr/local/bin_ and _/usr/local/sbin_.

Initialise/configure the queuing system's server daemon (_pbs_server_):
<pre>
pbs_server -t create
</pre>

Set the PBS operators and managers (each must be a valid user name):
<pre>
qmgr
> set server_name = master01.procksi.local
> set server scheduling = true
> set server operators = "root@master01.procksi.local,procksi@master01.procksi.local"
> set server managers = "root@master01.procksi.local,procksi@master01.procksi.local"
</pre>

Allow only _procksi_ and _root_ to submit jobs into the queue:
<pre>
> set server acl_users = "root,procksi"
> set server acl_user_enable = true
</pre>

Set the sender address for email sent by PBS:
<pre>
> set server mail_from = pbs@procksi.net
</pre>

Allow submissions from slave hosts (only):
*ATTENTION: NEEDS TO BE CHECKED. DOES NOT WORK PROPERLY YET!*
<pre>
> set server allow_node_submit = true
> set server submit_hosts = master01.procksi.local
                            slave01.procksi.local
                            slave02.procksi.local
                            slave03.procksi.local
                            slave04.procksi.local
</pre>

Restrict nodes that can access the PBS server:
<pre>
> set server acl_hosts = master01.procksi.local
                         slave01.procksi.local
                         slave02.procksi.local
                         slave03.procksi.local
                         slave04.procksi.local
> set server acl_host_enable = true
</pre>

Then, in order to use the internal interface, set the following in _torque.cfg_:
<pre>
SERVERHOST              master01.procksi.local
ALLOWCOMPUTEHOSTSUBMIT  true
</pre>

Configure the default node to be used (see below):
<pre>
> set server default_node = slave
</pre>

Set the default queue to _batch_:
<pre>
> set server default_queue=batch
</pre>

Configure the main queue _batch_:
<pre>
> create queue batch queue_type=execution
> set queue batch started=true
> set queue batch enabled=true
> set queue batch resources_default.nodes=1
</pre>

Configure the queue _test_ accordingly.
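
Since _qmgr_ also reads directives from standard input, the settings can be generated by a script instead of being typed at the prompt. The sketch below covers only part of the settings above; the function names are made up for this page, and the queues are emitted before the server settings on the assumption that @default_queue@ should refer to a queue that already exists.

```shell
#!/bin/sh
# Print the qmgr directives for an execution queue with the settings used above.
queue_commands() {
    queue=$1
    cat <<EOF
create queue $queue queue_type=execution
set queue $queue started=true
set queue $queue enabled=true
set queue $queue resources_default.nodes=1
EOF
}

# Print the directives for both queues, followed by the server-level settings.
server_commands() {
    queue_commands batch
    queue_commands test
    cat <<'EOF'
set server scheduling = true
set server acl_users = "root,procksi"
set server acl_user_enable = true
set server default_node = slave
set server default_queue = batch
EOF
}
```

Running @server_commands@ on its own prints the directives for review; @server_commands | qmgr@ on the master node applies them.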

Specify all compute nodes to be used by creating/editing _$TORQUECFG/server_priv/nodes_. This may include the machine on which pbs_server runs. If a compute node has more than one processor, add np=X after its name, where X is the number of processors. Add node attributes so that a subset of nodes can be requested at submission time.
<pre>
master01.procksi.local  np=2  procksi  master  xeon
slave01.procksi.local   np=2  procksi  slave   xeon
slave02.procksi.local   np=2  procksi  slave   xeon
slave03.procksi.local   np=4  procksi  slave   opteron
slave04.procksi.local   np=4  procksi  slave   opteron
</pre>
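
A quick sanity check of the nodes file is to count the processors it declares and to list the nodes that carry a given attribute. This is a small sketch with made-up function names; it assumes the layout shown above (node name first, then @np=X@ and attributes separated by whitespace) and counts a node without @np=@ as one processor.

```shell
#!/bin/sh
# Sum the np= values in a TORQUE nodes file; a node without np= counts as 1.
total_procs() {
    awk '
        /^[[:space:]]*(#|$)/ { next }          # skip comments and blank lines
        {
            np = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^np=[0-9]+$/) np = substr($i, 4) + 0
            total += np
        }
        END { print total + 0 }
    ' "$1"
}

# List the nodes that carry a given attribute (for example "slave").
nodes_with_attr() {
    awk -v attr="$2" '
        /^[[:space:]]*(#|$)/ { next }
        { for (i = 2; i <= NF; i++) if ($i == attr) { print $1; break } }
    ' "$1"
}
```

For example, @nodes_with_attr $TORQUECFG/server_priv/nodes opteron@ lists the nodes that a job requesting @nodes=1:opteron@ could run on.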

Although the master node (_master01_) has two processors as well, we only allow one processor to be used for the queuing system, as the other processor handles all frontend communication and I/O. (Make sure that hyperthreading technology is disabled on the head node and all compute nodes!)

Request that a job be run on specific nodes at submission time:
* Run on any compute node:
<pre>
 qsub -q batch -l nodes=1:procksi
</pre>
* Run on any slave node:
<pre>
 qsub -q batch -l nodes=1:slave
</pre>
* Run on master node:
<pre>
 qsub -q batch -l nodes=1:master
</pre>
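
The same requests can be written into the job script as @#PBS@ directives, so that they do not have to be repeated on the command line. The helper below is a sketch with a made-up name; it writes a minimal job script for one of the node attributes defined above.

```shell
#!/bin/sh
# Write a minimal PBS job script that requests one node with the given attribute.
# Arguments: output file, node attribute (procksi, slave, or master), command to run.
write_job_script() {
    script=$1
    attr=$2
    shift 2
    {
        echo '#!/bin/sh'
        echo '#PBS -q batch'
        echo "#PBS -l nodes=1:$attr"
        echo 'cd "$PBS_O_WORKDIR"'      # start in the directory qsub was called from
        echo "$*"
    } > "$script"
}
```

For example, @write_job_script job.sh slave hostname@ followed by @qsub job.sh@ should report the name of the slave node the job ran on.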

h3. Setup and Configuration on the Slave Nodes

Extract and build the TORQUE distribution on each slave node. Configure the MOM and clients to use secure file transfer (scp).
<pre>
export TORQUECFG=/var/spool/torque
tar -xzvf TORQUE.tar.gz
cd TORQUE
</pre>

Configure for a 64-bit machine with the following compiler options:
<pre>
FFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
</pre>
*Attention*: For Intel Xeon processors use _-march=nocona_, for AMD Opteron processors use _-march=opteron_.

Configure, build, and install:
<pre>
./configure  --prefix=/usr/local --with-spooldir=$TORQUECFG --disable-server --enable-mom --enable-clients --with-default-server=master01.procksi.local
make
make install
</pre>

Configure the compute nodes by creating/editing _$TORQUECFG/mom_priv/config_. The _$pbsserver_ line specifies the PBS server, _$loglevel_ sets the logging verbosity, _$restricted_ specifies hosts which can be trusted to access MOM services as non-root, and _$usecp_ allows copying data via NFS without using scp.
<pre>
$pbsserver   master01.procksi.local
$loglevel    255
$restricted  master01.procksi.local
$usecp       master01.procksi.local:/home/procksi   /home/procksi
</pre>

Start the queuing system (manually) in the correct order:
* Start the MOM:
<pre>
 /usr/local/sbin/pbs_mom
</pre>
* Stop the server:
<pre>
 /usr/local/bin/qterm -t quick
</pre>
* Start the server:
<pre>
 /usr/local/sbin/pbs_server
</pre>
* Start the scheduler:
<pre>
 /usr/local/sbin/pbs_sched
</pre>

If you want to use MAUI as the final scheduler, remember to stop _pbs_sched_ after testing the TORQUE installation.

Check that all nodes are properly configured and reporting correctly:
<pre>
qstat -q
pbsnodes -a
</pre>
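
With more than a handful of nodes, the output of @pbsnodes -a@ gets long. The following sketch reduces it to one line per node; the function names are made up, and it assumes the usual layout in which each node name starts an unindented line and is followed by indented @key = value@ lines.

```shell
#!/bin/sh
# Read `pbsnodes -a` style output on stdin and print "node state" pairs.
node_states() {
    awk '
        /^[^[:space:]]/         { node = $1 }        # unindented line: node name
        /^[[:space:]]+state = / { print node, $3 }   # indented "state = ..." line
    '
}

# Print only the nodes whose state is neither "free" nor "job-exclusive".
problem_nodes() {
    node_states | awk '$2 != "free" && $2 != "job-exclusive" { print $1 }'
}
```

@pbsnodes -a | problem_nodes@ then prints nothing when every node is available or busy with a job. Treating all other states as a problem is a simplification made for this sketch.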

h3. Prologue and Epilogue Scripts

Get [repos:Externals/procksi_pbs.tgz] from the repository and unpack it:
<pre>
tar -xzvf procksi_pbs.tgz
</pre>

The _prologue_ script is executed just before the submitted job starts. Here, it generates a unique temp directory for each job in _/scratch_. It must be installed on each NODE (master, slave):
<pre>
cp ./pbs/NODE/var/spool/torque/mom/priv/prologue $TORQUECFG/mom_priv
chmod 500 $TORQUECFG/mom_priv/prologue
</pre>

The _epilogue_ script is executed right after the submitted job has ended. Here, it deletes the job's temp directory from _/scratch_. It must be installed on each NODE (master, slave):
<pre>
cp ./pbs/NODE/var/spool/torque/mom/priv/epilogue $TORQUECFG/mom_priv
chmod 500 $TORQUECFG/mom_priv/epilogue
</pre>
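
To illustrate what such a script pair does, here is a minimal sketch. It is not the content of _procksi_pbs.tgz_: the function names and the @SCRATCH_BASE@ variable are made up, and the directory is simply named after the job id. TORQUE passes the job id as the first argument and the job owner as the second argument to both scripts.

```shell
#!/bin/sh
# Hypothetical helpers for a prologue/epilogue pair; the scripts shipped in
# procksi_pbs.tgz may differ. SCRATCH_BASE defaults to /scratch as in the text.
SCRATCH_BASE=${SCRATCH_BASE:-/scratch}

# Prologue side: create a private temp directory named after the job id.
create_job_dir() {
    job_id=$1
    job_user=$2
    dir="$SCRATCH_BASE/$job_id"
    mkdir -p "$dir" && chmod 700 "$dir" || return 1
    # Hand the directory to the job owner when running as root (as pbs_mom does).
    if [ "$(id -u)" -eq 0 ] && [ -n "$job_user" ]; then
        chown "$job_user" "$dir"
    fi
}

# Epilogue side: remove the job's temp directory again.
remove_job_dir() {
    job_id=$1
    [ -n "$job_id" ] || return 1      # never run rm -rf on the base directory itself
    rm -rf "${SCRATCH_BASE:?}/$job_id"
}
```

A prologue built on this would call @create_job_dir "$1" "$2"@, and the matching epilogue would call @remove_job_dir "$1"@.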

h2. MAUI

h3. Register new services

Edit _/etc/services_ and add at the end:
<pre>
# PBS/MAUI services
pbs_maui  42559/tcp    # pbs scheduler (maui)
pbs_maui  42559/udp    # pbs scheduler (maui)
</pre>

h3. Setup and Configuration on the Head Node

Extract and build the MAUI distribution.
<pre>
export MAUIDIR=/var/spool/maui
tar -xzvf MAUI.tar.gz
cd MAUI
</pre>

Configure for a 64-bit machine with the following compiler options:
<pre>
FFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
</pre>
*Attention*: For Intel Xeon processors use _-march=nocona_, for AMD Opteron processors use _-march=opteron_.

Configure, build, and install:
<pre>
./configure --with-pbs=$TORQUECFG --with-spooldir=$MAUIDIR
make
make install
</pre>

Fine-tune MAUI in _$MAUIDIR/maui.cfg_:
<pre>
SERVERHOST            master01.procksi.local

# primary admin must be first in list
ADMIN1                procksi root

# Resource Manager Definition
# EPORT can alternatively be 15017 (still to be tried)
RMCFG[MASTER01.PROCKSI.LOCAL] TYPE=PBS PORT=15001 EPORT=15004

SERVERPORT  42559
SERVERMODE  NORMAL

# Node Allocation:
# JOBCOUNT  number of jobs currently running on node
# LOAD      current 1 minute load average
# AMEM      real memory currently available to batch jobs
# APROCS    processors currently available to batch jobs
# PREF      node meets job specific resource preferences

NODEALLOCATIONPOLICY  PRIORITY
NODECFG[DEFAULT] PRIORITYF='-JOBCOUNT - 2*LOAD + 0.5*AMEM + 0.25*APROCS + PREF'
</pre>
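
To get a feeling for how the terms of the priority function weigh against each other, it can be evaluated by hand. The sketch below does the arithmetic with _awk_; the function name and the input values are made up for illustration, and the units in which MAUI reports each quantity are not checked here.

```shell
#!/bin/sh
# Evaluate -JOBCOUNT - 2*LOAD + 0.5*AMEM + 0.25*APROCS + PREF for one node.
# Arguments: jobcount load amem aprocs pref
node_priority() {
    awk -v j="$1" -v l="$2" -v m="$3" -v p="$4" -v f="$5" \
        'BEGIN { printf "%g\n", -j - 2*l + 0.5*m + 0.25*p + f }'
}
```

For example, @node_priority 1 0.5 2048 2 0@ gives 1022.5 (-1 - 1 + 1024 + 0.5 + 0), while an idle node with four free processors, @node_priority 0 0 2048 4 0@, gives 1025. With inputs of this size the memory term dominates, so the weights may need adjusting if job count or load should matter more.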

Start the MAUI scheduler manually. Make sure that pbs_sched is no longer running.

* Start the scheduler:
<pre>
 /usr/local/sbin/maui
</pre>

Get [repos:Externals/Cluster/procksi_pbs.tgz] from the repository and unpack it:
<pre>
tar -xzvf procksi_pbs.tgz
</pre>

Make the entire queuing system (Torque + Maui) start at bootup:
<pre>
cp ./pbs/master/etc/init.d/pbs_* /etc/init.d/
/sbin/chkconfig  --add pbs_mom
/sbin/chkconfig  --add pbs_maui
/sbin/chkconfig  --add pbs_server
/sbin/chkconfig  pbs_mom  on
/sbin/chkconfig  pbs_maui  on
/sbin/chkconfig  pbs_server  on
</pre>

If you want to use the simple scheduler that comes with PBS Torque, then substitute _pbs_maui_ with _pbs_sched_.

h3. Setup and Configuration on the Slave Nodes

Get [repos:Externals/Cluster/procksi_pbs.tgz] from the repository and unpack it:
<pre>
tar -xzvf procksi_pbs.tgz
</pre>

Make the entire queuing system start at bootup:
<pre>
cp ./pbs/slave/etc/init.d/pbs_mom /etc/init.d/
/sbin/chkconfig  --add pbs_mom
/sbin/chkconfig  pbs_mom  on
</pre>

h3. Monitoring Grid Status

* display queue information (active/idle jobs)
<pre>
 showq
</pre>
* current and historical scheduling statistics
<pre>
 showstats -v
</pre>
* display job state and resources information
<pre>
 checkjob <JOB_ID>
</pre>
* display node state and resources information
<pre>
 checknode <NODE_NAME>
</pre>