Anonymous, 09/17/2007 11:53 AM

= Job Management =
The queuing system (resource manager) is the heart of distributed computing on a cluster. It consists of three parts: the server, the scheduler, and the machine-oriented mini-server (MOM) that executes the jobs.

We assume the following configuration:

 ||PBS TORQUE||version 2.1.8||server, basic scheduler, mom||
 ||MAUI||version 3.2.6.p18||scheduler||

The sources can be obtained from:

 ||PBS TORQUE||http://www.clusterresources.com/pages/products/torque-resource-manager.php||
 ||MAUI||http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php||

The install directories for ''TORQUE'' and ''MAUI'' will be:

 ||PBS TORQUE||''/var/spool/torque''||
 ||MAUI||''/var/spool/maui''||


== TORQUE ==

=== Register new services ===
Edit ''/etc/services'' and add at the end:
{{{
# PBS/Torque services
pbs           15001/tcp    # pbs_server
pbs           15001/udp    # pbs_server
pbs_mom       15002/tcp    # pbs_mom <-> pbs_server
pbs_mom       15002/udp    # pbs_mom <-> pbs_server
pbs_resmom    15003/tcp    # pbs_mom resource management
pbs_resmom    15003/udp    # pbs_mom resource management
pbs_sched     15004/tcp    # pbs scheduler (pbs_sched)
pbs_sched     15004/udp    # pbs scheduler (pbs_sched)
}}}
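
Because the same edit has to be repeated on every node, it can help to script it so that running it twice does not duplicate entries. The following is only a sketch: the `add_service` helper and the `SERVICES_FILE` variable are our own names, and the file defaults to a scratch copy so that the real ''/etc/services'' is never modified by accident.

```shell
#!/bin/sh
# Sketch: append a service entry only if that name/port pair is not listed yet.
# SERVICES_FILE is a hypothetical override; it defaults to a scratch file.
SERVICES_FILE="${SERVICES_FILE:-$(mktemp)}"

add_service() {
    # $1 = service name, $2 = port/protocol, $3 = comment
    if ! grep -Eq "^$1[[:space:]]+$2([[:space:]]|\$)" "$SERVICES_FILE"; then
        printf '%-13s %-12s # %s\n' "$1" "$2" "$3" >> "$SERVICES_FILE"
    fi
}

# Register every TORQUE service for both protocols.
for proto in tcp udp; do
    add_service pbs        "15001/$proto" "pbs_server"
    add_service pbs_mom    "15002/$proto" "pbs_mom <-> pbs_server"
    add_service pbs_resmom "15003/$proto" "pbs_mom resource management"
    add_service pbs_sched  "15004/$proto" "pbs scheduler (pbs_sched)"
done
```

To apply it for real, one would run the script as root with `SERVICES_FILE=/etc/services`, after checking the result on the scratch copy.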

=== Setup and Configuration on the Master Node ===
Extract and build the TORQUE distribution on the master node. Configure the server, MOM, and clients to use secure file transfer (scp).
{{{
export TORQUECFG=/var/spool/torque
tar -xzvf TORQUE.tar.gz
cd TORQUE
}}}

Configure for a 64-bit machine with the following compiler options:
{{{
FFLAGS   = "-m64 -march=[Add Architecture]  -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture]  -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture]  -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona''; for AMD Opteron processors use ''-march=opteron''.
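
One possible way to set these flags in the shell before running ''configure'' is sketched below. It assumes a Linux node, where the CPU vendor can be read from ''/proc/cpuinfo'', and it assumes that the vendor alone is enough to choose between the two architectures named above. The helper name `pick_march` is ours.

```shell
#!/bin/sh
# Sketch: map a CPU vendor string to the -march value named above.
pick_march() {
    case "$1" in
        GenuineIntel) echo nocona ;;
        AuthenticAMD) echo opteron ;;
        *) return 1 ;;
    esac
}

# Read the vendor from /proc/cpuinfo (Linux only); empty if unavailable.
vendor=$(awk -F': *' '/^vendor_id/ { print $2; exit }' /proc/cpuinfo 2>/dev/null)

if march=$(pick_march "$vendor"); then
    # Export the flags so that a later ./configure run picks them up.
    export FFLAGS="-m64 -march=$march -O3 -fPIC"
    export CFLAGS="$FFLAGS" CXXFLAGS="$FFLAGS"
    export LDFLAGS="-L/usr/local/lib -L/usr/local/lib64"
else
    echo "unknown CPU vendor '$vendor': set -march by hand" >&2
fi
```

The snippet has to be sourced, or pasted into the shell that later runs ''configure'', for the exported variables to take effect.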

Configure, build, and install:
{{{
./configure --prefix=/usr/local --with-spooldir=$TORQUECFG
make
make install
}}}
Unless configured otherwise, binaries are installed in ''/usr/local/bin'' and ''/usr/local/sbin''.

Initialise and configure the queuing system's server daemon (''pbs_server''):
{{{
pbs_server -t create
}}}

Set the PBS operators and managers (each must be a valid user name):
{{{
qmgr
> set server_name = master01.procksi.local
> set server scheduling = true
> set server operators = "root@master01.procksi.local,procksi@master01.procksi.local"
> set server managers = "root@master01.procksi.local,procksi@master01.procksi.local"
}}}

Allow only ''procksi'' and ''root'' to submit jobs into the queue:
{{{
> set server acl_users = "root,procksi"
> set server acl_user_enable = true
}}}

Set the sender address for email sent by PBS:
{{{
> set server mail_from = pbs@procksi.net
}}}

Allow submissions from slave hosts (only):
'''ATTENTION: NEEDS TO BE CHECKED. DOES NOT WORK PROPERLY YET!'''
{{{
{{{
> set server allow_node_submit = true
> set server submit_hosts = master01.procksi.local
> set server submit_hosts += slave01.procksi.local
> set server submit_hosts += slave02.procksi.local
> set server submit_hosts += slave03.procksi.local
> set server submit_hosts += slave04.procksi.local
}}}

Restrict nodes that can access the PBS server:
{{{
> set server acl_hosts = master01.procksi.local
> set server acl_hosts += slave01.procksi.local
> set server acl_hosts += slave02.procksi.local
> set server acl_hosts += slave03.procksi.local
> set server acl_hosts += slave04.procksi.local
> set server acl_host_enable = true
}}}

And set the following in ''torque.cfg'' in order to use the internal interface:
{{{
SERVERHOST              master01.procksi.local
ALLOWCOMPUTEHOSTSUBMIT  true
}}}
}}}

Configure the default node to be used (see below):
{{{
> set server default_node = slave
}}}

Set the default queue to ''batch'':
{{{
> set server default_queue=batch
}}}

Configure the main queue ''batch'':
{{{
> create queue batch queue_type=execution
> set queue batch started=true
> set queue batch enabled=true
> set queue batch resources_default.nodes=1
}}}

Configure the queue ''test'' accordingly.
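
Since both queues use the same settings, the directives can be generated rather than typed twice. This is a sketch under the assumption that the queue name is the only thing that differs. The helper name `queue_directives` is ours, and the sketch only prints the directives.

```shell
#!/bin/sh
# Sketch: print the qmgr directives for one execution queue.
queue_directives() {
    # $1 = queue name (assumed to contain no whitespace)
    cat <<EOF
create queue $1 queue_type=execution
set queue $1 started=true
set queue $1 enabled=true
set queue $1 resources_default.nodes=1
EOF
}

# Preview the directives for the test queue.
queue_directives test
```

On the master node the output could then be fed to ''qmgr'' on standard input, for example `queue_directives test | qmgr`.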

Specify all compute nodes to be used by creating or editing ''$TORQUECFG/server_priv/nodes''. This may include the machine on which ''pbs_server'' runs. If a compute node has more than one processor, add ''np=X'' after its name, where X is the number of processors. Add node attributes so that a subset of nodes can be requested at submission time.
{{{
master01.procksi.local  np=2  procksi  master  xeon
slave01.procksi.local   np=2  procksi  slave   xeon
slave02.procksi.local   np=2  procksi  slave   xeon
slave03.procksi.local   np=4  procksi  slave   opteron
slave04.procksi.local   np=4  procksi  slave   opteron
}}}

Although the master node (''master01'') has two processors as well, we allow only one of them to be used by the queuing system; the other handles all frontend communication and I/O. (Make sure that hyperthreading is disabled on the head node and on all compute nodes!)
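
To see how many processors a given attribute makes available, the nodes file can be summed with a few lines of awk. This is a sketch with our own helper name `count_procs`. It assumes the simple layout shown above (host name first, then ''np=X'' and attributes separated by whitespace) and works on a scratch copy of the file.

```shell
#!/bin/sh
# Sketch: total the np values of all nodes that carry a given attribute.
count_procs() {
    # $1 = attribute name, $2 = nodes file
    awk -v prop="$1" '
        /^[ \t]*(#|$)/ { next }                 # skip comments and blank lines
        {
            np = 1; has = 0                     # np defaults to 1 when omitted
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^np=/) np = substr($i, 4) + 0
                else if ($i == prop) has = 1
            }
            if (has) total += np
        }
        END { print total + 0 }
    ' "$2"
}

nodes_file=$(mktemp)   # stand-in for $TORQUECFG/server_priv/nodes
cat > "$nodes_file" <<'EOF'
master01.procksi.local  np=2  procksi  master  xeon
slave01.procksi.local   np=2  procksi  slave   xeon
slave02.procksi.local   np=2  procksi  slave   xeon
slave03.procksi.local   np=4  procksi  slave   opteron
slave04.procksi.local   np=4  procksi  slave   opteron
EOF

count_procs slave "$nodes_file"
```

With the file above, the call prints 12, the number of processors on the four slave nodes.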

Request that a job run on specific nodes at submission time:
 * Run on any compute node:
 {{{
 qsub -q batch -l nodes=1:procksi
 }}}
 * Run on any slave node:
 {{{
 qsub -q batch -l nodes=1:slave
 }}}
 * Run on the master node:
 {{{
 qsub -q batch -l nodes=1:master
 }}}
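
The same requests can also be written into the job script as `#PBS` directives instead of being passed on the command line. The sketch below writes such a script to a temporary file. The helper name `write_job_script` and the `hostname` payload are placeholders of ours, not part of the installation.

```shell
#!/bin/sh
# Sketch: write a minimal job script that requests one node with a given attribute.
write_job_script() {
    # $1 = node attribute (procksi, slave or master), $2 = output file
    cat > "$2" <<EOF
#!/bin/sh
#PBS -q batch
#PBS -l nodes=1:$1
# Placeholder payload: report which host ran the job.
hostname
EOF
}

job=$(mktemp)
write_job_script slave "$job"
cat "$job"
```

The generated file would then be submitted with `qsub "$job"`.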


=== Setup and Configuration on the Slave Nodes ===
Extract and build the TORQUE distribution on each slave node. Configure the MOM and clients to use secure file transfer (scp).
{{{
export TORQUECFG=/var/spool/torque
tar -xzvf TORQUE.tar.gz
cd TORQUE
}}}

Configure for a 64-bit machine with the following compiler options:
{{{
FFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona''; for AMD Opteron processors use ''-march=opteron''.

Configure, build, and install:
{{{
./configure --prefix=/usr/local --with-spooldir=$TORQUECFG --disable-server --enable-mom --enable-clients --with-default-server=master01.procksi.local
make
make install
}}}

Configure the compute nodes by creating or editing ''$TORQUECFG/mom_priv/config''. ''$pbsserver'' names the PBS server, ''$loglevel'' sets the logging verbosity, ''$restricted'' lists the hosts that are trusted to access mom services as non-root, and ''$usecp'' allows data to be copied via NFS without using scp.
{{{
$pbsserver   master01.procksi.local
$loglevel    255
$restricted  master01.procksi.local
$usecp       master01.procksi.local:/home/procksi   /home/procksi
}}}
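
As the file is identical on every slave node, it can be generated instead of edited by hand. This is a sketch with our own helper name `write_mom_config`. It writes to a temporary file, which stands in for ''$TORQUECFG/mom_priv/config''.

```shell
#!/bin/sh
# Sketch: generate the mom configuration for a given server and shared directory.
write_mom_config() {
    # $1 = PBS server host, $2 = directory shared via NFS, $3 = output file
    # The backslashes keep the leading $ of each directive literal.
    cat > "$3" <<EOF
\$pbsserver   $1
\$loglevel    255
\$restricted  $1
\$usecp       $1:$2   $2
EOF
}

out=$(mktemp)   # stand-in for $TORQUECFG/mom_priv/config
write_mom_config master01.procksi.local /home/procksi "$out"
cat "$out"
```

After the output has been checked, the third argument would be pointed at the real configuration file on each node.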

Start the queuing system (manually) in the correct order:
 * Start the mom:
 {{{
 /usr/local/sbin/pbs_mom
 }}}
 * Kill the server:
 {{{
 /usr/local/sbin/qterm -t quick
 }}}
 * Start the server:
 {{{
 /usr/local/sbin/pbs_server
 }}}
 * Start the scheduler:
 {{{
 /usr/local/sbin/pbs_sched
 }}}

If you want to use MAUI as the final scheduler, remember to stop ''pbs_sched'' after testing the TORQUE installation.

Check that all nodes are properly configured and reporting correctly:
{{{
qstat -q
pbsnodes -a
}}}
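
With more than a handful of nodes, the output of ''pbsnodes -a'' is easier to check with a small filter. The sketch below assumes the usual layout, in which each node name starts in the first column and its attributes (such as `state = free`) are indented below it. The helper name `nodes_not_free` is ours. Note that a node that is busy running jobs is also reported, because its state is not `free`.

```shell
#!/bin/sh
# Sketch: list nodes whose state is not "free".
# Reads `pbsnodes -a` style text on standard input.
nodes_not_free() {
    awk '
        /^[^ \t]/ { node = $1 }                              # unindented line: node name
        $1 == "state" && $3 != "free" { print node ": " $3 } # indented "state = ..." line
    '
}

# Usage on the master node:
#   pbsnodes -a | nodes_not_free
```

No output from the filter means that every node reported the state `free`.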


=== Prologue and Epilogue Scripts ===
Get [repos:Externals/procksi_pbs.tgz] from the repository and unpack it:
{{{
tar -xvzf procksi_pbs.tgz
}}}

The ''prologue'' script is executed just before the submitted job starts. Here, it creates a unique temporary directory for each job in ''/scratch''. It must be installed on each node:
{{{
cp ./prologue $TORQUECFG/mom_priv
chmod 500 $TORQUECFG/mom_priv/prologue
}}}
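
For readers without access to the archive, the idea can be sketched as follows. This is not the script from ''procksi_pbs.tgz'', whose contents may differ. It relies on TORQUE passing the job id as the first argument and the job owner's user name as the second, and it uses the job id as the directory name. `SCRATCH_ROOT` and `create_job_scratch` are names of ours.

```shell
#!/bin/sh
# Sketch of a prologue-style helper; the real script may differ.
SCRATCH_ROOT="${SCRATCH_ROOT:-/scratch}"   # hypothetical override, handy for testing

create_job_scratch() {
    # $1 = job id, $2 = user name of the job owner
    dir="$SCRATCH_ROOT/$1"
    mkdir -p "$dir" || return 1
    chmod 700 "$dir"
    # Handing the directory to the job owner only works when running as root.
    if [ "$(id -u)" -eq 0 ]; then
        chown "$2" "$dir" || return 1
    fi
    echo "$dir"
}

# In an installed prologue the last line would be something like:
#   create_job_scratch "$1" "$2" >/dev/null || exit 1
```

A non-zero exit status tells TORQUE that the prologue failed, which is why the call ends in `|| exit 1`.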

The ''epilogue'' script is executed right after the submitted job has ended. Here, it deletes the job's temporary directory from ''/scratch''. It must be installed on each node:
{{{
cp ./epilogue $TORQUECFG/mom_priv
chmod 500 $TORQUECFG/mom_priv/epilogue
}}}
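
The clean-up step can be sketched in the same spirit. Again, this is not the script from the archive. It assumes that the directory is named after the job id, which TORQUE passes to the epilogue as the first argument, and it refuses ids that would make the path point outside the job's own directory. `SCRATCH_ROOT` and `remove_job_scratch` are names of ours.

```shell
#!/bin/sh
# Sketch of an epilogue-style helper; the real script may differ.
SCRATCH_ROOT="${SCRATCH_ROOT:-/scratch}"   # hypothetical override, handy for testing

remove_job_scratch() {
    # $1 = job id
    case "$1" in
        ''|*/*|.|..)
            # An empty id or one containing a slash could delete the wrong directory.
            echo "refusing to clean up for job id '$1'" >&2
            return 1
            ;;
    esac
    rm -rf -- "$SCRATCH_ROOT/$1"
}

# In an installed epilogue the last line would be something like:
#   remove_job_scratch "$1"
```

The guard matters because the epilogue runs with elevated privileges, so an empty argument would otherwise remove the whole scratch area.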


== MAUI ==

=== Register new services ===
Edit ''/etc/services'' and add at the end:
{{{
# PBS/MAUI services
pbs_maui  42559/tcp    # pbs scheduler (maui)
pbs_maui  42559/udp    # pbs scheduler (maui)
}}}


=== Setup and Configuration on the Head Node ===
Extract and build the MAUI distribution.
{{{
export MAUIDIR=/usr/local/maui
tar -xzvf MAUI.tar.gz
cd MAUI
}}}

Configure for a 64-bit machine with the following compiler options:
{{{
FFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona''; for AMD Opteron processors use ''-march=opteron''.

Configure, build, and install:
{{{
./configure --with-pbs=$TORQUECFG --with-spooldir=$MAUIDIR
make
make install
}}}

Fine-tune MAUI in ''$MAUIDIR/maui.cfg'':
{{{
SERVERHOST            master01.procksi.local

# primary admin must be first in list
ADMIN1                procksi root

# Resource Manager Definition
# EPORT could alternatively be 15017 (still to be tried)
RMCFG[MASTER01.PROCKSI.LOCAL] TYPE=PBS PORT=15001 EPORT=15004

SERVERPORT            42559
SERVERMODE            NORMAL

# Node Allocation:
NODEALLOCATIONPOLICY  PRIORITY
NODECFG[DEFAULT]      PRIORITY='- JOBCOUNT'
}}}


Start the MAUI scheduler manually. Make sure that ''pbs_sched'' is no longer running.

 * Start the scheduler:
 {{{
 /usr/local/sbin/maui
 }}}

Make the entire queuing system start at bootup:
{{{
cp ./pbs_master-node /etc/init.d/pbs
/sbin/chkconfig --add pbs
/sbin/chkconfig pbs on
}}}


=== Setup and Configuration on the Slave Nodes ===
Get [repos:Externals/Cluster/procksi_pbs.tgz] from the repository and unpack it:
{{{
tar -xvzf procksi_pbs.tgz
}}}

Make the entire queuing system start at bootup:
{{{
cp ./pbs_compute-node /etc/init.d/pbs
/sbin/chkconfig --add pbs
/sbin/chkconfig pbs on
}}}