= Job Management =
The queuing system (resource manager) is the heart of distributed computing on a cluster. It consists of three parts: the server, the scheduler, and the machine-oriented mini-servers (MOMs) that execute the jobs.
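
For orientation, the three parts map onto the daemons that are installed below (paths assume the ''/usr/local'' prefix used throughout this guide):
{{{
/usr/local/sbin/pbs_server   # server: owns the queues, runs on the master node
/usr/local/sbin/pbs_sched    # basic scheduler (later replaced by MAUI)
/usr/local/sbin/pbs_mom      # MOM: executes the jobs on each compute node
}}}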

We are assuming the following configuration:

 ||PBS TORQUE ||version 2.1.8     ||server, basic scheduler, mom ||
 ||MAUI       ||version 3.2.6.p18 ||scheduler ||

The sources can be obtained from:

 ||PBS TORQUE ||http://www.clusterresources.com/pages/products/torque-resource-manager.php ||
 ||MAUI       ||http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php ||

The install directories for ''TORQUE'' and ''MAUI'' will be:

 ||PBS TORQUE ||''/var/spool/torque'' ||
 ||MAUI       ||''/var/spool/maui'' ||


== TORQUE ==

=== Register new services ===
Edit ''/etc/services'' and add at the end:
{{{
# PBS/Torque services
pbs           15001/tcp    # pbs_server
pbs           15001/udp    # pbs_server
pbs_mom       15002/tcp    # pbs_mom <-> pbs_server
pbs_mom       15002/udp    # pbs_mom <-> pbs_server
pbs_resmom    15003/tcp    # pbs_mom resource management
pbs_resmom    15003/udp    # pbs_mom resource management
pbs_sched     15004/tcp    # pbs scheduler (pbs_sched)
pbs_sched     15004/udp    # pbs scheduler (pbs_sched)
}}}


=== Setup and Configuration on the Master Node ===
Extract and build the TORQUE distribution on the master node. Configure the server, the monitor (MOM), and the clients to use secure file transfer (scp).
{{{
export TORQUECFG=/var/spool/torque
tar -xzvf TORQUE.tar.gz
cd TORQUE
}}}

Configure for a 64-bit machine with the following compiler options:
{{{
FFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona'', for AMD Opteron processors use ''-march=opteron''.

Configure, build, and install:
{{{
./configure --prefix=/usr/local --with-spooldir=$TORQUECFG
make
make install
}}}
If not configured otherwise, binaries are installed in ''/usr/local/bin'' and ''/usr/local/sbin''.
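
As a quick sanity check (purely illustrative), confirm that the binaries ended up in these directories:
{{{
ls /usr/local/sbin/pbs_server /usr/local/sbin/pbs_mom /usr/local/bin/qmgr /usr/local/bin/qsub
}}}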

Initialise/configure the queuing system's server daemon (''pbs_server''):
{{{
pbs_server -t create
}}}

Set the PBS operators and managers (must be valid user names):
{{{
qmgr
> set server_name = master01.procksi.local
> set server scheduling = true
> set server operators = "root@master01.procksi.local,procksi@master01.procksi.local"
> set server managers  = "root@master01.procksi.local,procksi@master01.procksi.local"
}}}

Allow only ''procksi'' and ''root'' to submit jobs into the queue:
{{{
> set server acl_users = "root,procksi"
> set server acl_user_enable = true
}}}

Set the sender address for email sent by PBS:
{{{
> set server mail_from = pbs@procksi.net
}}}


'''The following section needs to be checked!''' Allow submissions from slave hosts (only):
{{{
> set server allow_node_submit = true
> set server submit_hosts = master01.procksi.local
                            slave01.procksi.local
                            slave02.procksi.local
                            slave03.procksi.local
                            slave04.procksi.local
}}}

Restrict nodes that can access the PBS server:
{{{
> set server acl_hosts = master01.procksi.local
                         slave01.procksi.local
                         slave02.procksi.local
                         slave03.procksi.local
                         slave04.procksi.local
> set server acl_host_enable = true
}}}

Additionally, set the following in ''torque.cfg'' so that the internal interface is used:
{{{
SERVERHOST              master01.procksi.local
ALLOWCOMPUTEHOSTSUBMIT  true
}}}

Configure default node to be used (see below):
{{{
> set server default_node = slave
}}}

Set the default queue to ''batch'':
{{{
> set server default_queue = batch
}}}

Configure the main queue ''batch'':
{{{
> create queue batch queue_type=execution
> set queue batch started=true
> set queue batch enabled=true
> set queue batch resources_default.nodes=1
}}}

Configure the queue ''test'' accordingly.
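A minimal sketch of what this could look like, assuming ''test'' should mirror the ''batch'' settings (the resource defaults are an assumption):
{{{
> create queue test queue_type=execution
> set queue test started=true
> set queue test enabled=true
> set queue test resources_default.nodes=1
}}}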

Specify all compute nodes to be used by creating/editing ''$TORQUECFG/server_priv/nodes''. This may include the machine on which pbs_server runs. If a compute node has more than one processor, add ''np=X'' after its name, with X being the number of processors. Add node attributes so that a subset of nodes can be requested at submission time.
{{{
master01.procksi.local  np=2  procksi  master  xeon
slave01.procksi.local   np=2  procksi  slave   xeon
slave02.procksi.local   np=2  procksi  slave   xeon
slave03.procksi.local   np=4  procksi  slave   opteron
slave04.procksi.local   np=4  procksi  slave   opteron
}}}

Although the master node (''master01'') also has two processors, we only allow one of them to be used by the queueing system, as the other processor handles all frontend communication and I/O. (Make sure that hyperthreading is disabled on the head node and all compute nodes!)
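
A rough way to check this on Linux (assuming x86 hardware and the usual ''/proc/cpuinfo'' fields) is to compare logical and physical CPU counts:
{{{
grep -c '^processor' /proc/cpuinfo    # logical CPUs seen by the kernel
grep -m1 'cpu cores' /proc/cpuinfo    # physical cores per package
grep -m1 'siblings'  /proc/cpuinfo    # logical CPUs per package; > cores suggests hyperthreading
}}}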

Request a job to be run on specific nodes (at submission):
 * Run on any compute node:
 {{{
 qsub -q batch -l nodes=1:procksi
 }}}
 * Run on any slave node:
 {{{
 qsub -q batch -l nodes=1:slave
 }}}
 * Run on master node:
 {{{
 qsub -q batch -l nodes=1:master
 }}}



=== Setup and Configuration on the Slave Nodes ===
Extract and build the TORQUE distribution on each slave node. Configure the monitor (MOM) and the clients to use secure file transfer (scp).
{{{
export TORQUECFG=/var/spool/torque
tar -xzvf TORQUE.tar.gz
cd TORQUE
}}}

Configure for a 64-bit machine with the following compiler options:
{{{
FFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona'', for AMD Opteron processors use ''-march=opteron''.

Configure, build, and install:
{{{
./configure --prefix=/usr/local --with-spooldir=$TORQUECFG --disable-server --enable-mom --enable-clients --with-default-server=master01.procksi.local
make
make install
}}}

Configure the compute nodes by creating/editing ''$TORQUECFG/mom_priv/config''. The ''$pbsserver'' line specifies the PBS server, the ''$restricted'' line specifies hosts that are trusted to access mom services as non-root, and the ''$usecp'' line allows copying data via NFS without using SCP.
{{{
$pbsserver   master01.procksi.local
$loglevel    255
$restricted  master01.procksi.local
$usecp       master01.procksi.local:/home/procksi   /home/procksi
}}}

Start the queueing system (manually) in the correct order:
 * Start the mom:
 {{{
 /usr/local/sbin/pbs_mom
 }}}
 * Kill the server:
 {{{
 /usr/local/sbin/qterm -t quick
 }}}
 * Start the server:
 {{{
 /usr/local/sbin/pbs_server
 }}}
 * Start the scheduler:
 {{{
 /usr/local/sbin/pbs_sched
 }}}

If you want to use MAUI as the final scheduler, remember to kill ''pbs_sched'' after testing the TORQUE installation.
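For example (assuming the ''killall'' utility is available on the node):
{{{
killall pbs_sched
}}}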


Check that all nodes are properly configured and reporting correctly:
{{{
qstat -q
pbsnodes -a
}}}


=== Prologue and Epilogue Scripts ===
Get [repos:Externals/procksi_pbs.tgz] from the repository and untar it:
{{{
tar -xvzf procksi_pbs.tgz
}}}

The ''prologue'' script is executed just before the submitted job starts. Here, it generates a unique temp directory for each job in ''/scratch''. It must be installed on each node:
{{{
cp ./prologue $TORQUECFG/mom_priv
chmod 500 $TORQUECFG/mom_priv/prologue
}}}
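
The actual script ships in ''procksi_pbs.tgz''; purely as an illustration, a minimal prologue along these lines (the exact layout is an assumption) could look like:
{{{
#!/bin/sh
# TORQUE passes the job id as $1, the user as $2 and the group as $3.
JOBID=$1
USER=$2
GROUP=$3

# Create a per-job scratch directory and hand it over to the job owner.
mkdir -p /scratch/$JOBID
chown $USER:$GROUP /scratch/$JOBID

# A non-zero exit status would cause the job to be aborted.
exit 0
}}}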

The ''epilogue'' script is executed right after the submitted job has ended. Here, it deletes the job's temp directory from ''/scratch''. It must be installed on each node:
{{{
cp ./epilogue $TORQUECFG/mom_priv
chmod 500 $TORQUECFG/mom_priv/epilogue
}}}
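
Again purely as an illustration (the shipped script may differ), the matching epilogue might simply remove that directory:
{{{
#!/bin/sh
# TORQUE passes the job id as $1 to the epilogue as well.
JOBID=$1

# Remove the per-job scratch directory created by the prologue.
rm -rf /scratch/$JOBID

exit 0
}}}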


== MAUI ==

=== Register new services ===
Edit ''/etc/services'' and add at the end:
{{{
# PBS/MAUI services
pbs_maui  42559/tcp    # pbs scheduler (maui)
pbs_maui  42559/udp    # pbs scheduler (maui)
}}}


=== Setup and Configuration on the Head Node ===
Extract and build the MAUI distribution.
{{{
export MAUIDIR=/usr/local/maui
tar -xzvf MAUI.tar.gz
cd MAUI
}}}

Configure for a 64-bit machine with the following compiler options:
{{{
FFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona'', for AMD Opteron processors use ''-march=opteron''.

Configure, build, and install:
{{{
./configure --with-pbs=$TORQUECFG --with-spooldir=$MAUIDIR
make
make install
}}}

Fine-tune MAUI in ''$MAUIDIR/maui.cfg'':
{{{
SERVERHOST            master01.procksi.local

# primary admin must be first in list
ADMIN1                procksi root

# Resource Manager Definition
RMCFG[MASTER01.PROCKSI.LOCAL]  TYPE=PBS@RMNHOST@  PORT=15001  EPORT=15004
# (EPORT can alternatively be 15017 - try!)

SERVERPORT            42559
SERVERMODE            NORMAL

# Node Allocation
NODEALLOCATIONPOLICY  PRIORITY
NODECFG[DEFAULT]      PRIORITY='-JOBCOUNT'
}}}


Start the MAUI scheduler manually. Make sure that ''pbs_sched'' is no longer running.
 * Start the scheduler:
 {{{
 /usr/local/sbin/maui
 }}}
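
To verify that MAUI is running and talking to the PBS server, its standard client commands can be used (assuming MAUI's client tools are on the PATH):
{{{
showq                              # jobs as seen by MAUI (active/idle/blocked)
checknode master01.procksi.local   # detailed state of a single node
}}}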

Make the entire queuing system start at bootup:
{{{
cp ./pbs_master-node /etc/init.d/pbs
/sbin/chkconfig --add pbs
/sbin/chkconfig pbs on
}}}


=== Setup and Configuration on the Slave Nodes ===
Get [repos:Externals/Cluster/procksi_pbs.tgz] from the repository and untar it:
{{{
tar -xvzf procksi_pbs.tgz
}}}

Make the entire queuing system start at bootup:
{{{
cp ./pbs_compute-node /etc/init.d/pbs
/sbin/chkconfig --add pbs
/sbin/chkconfig pbs on
}}}