= Job Management =

The queuing system (resource manager) is the heart of distributed computing on a cluster. It consists of three parts: the server, the scheduler, and the machine-oriented mini-server (MOM) that executes the jobs.

We are assuming the following configuration:

||PBS TORQUE ||version 2.1.8 ||server, basic scheduler, mom ||[source:Externals/Cluster/torque-2.1.8.tgz download from repository]
||MAUI ||version 3.2.6p18 ||scheduler ||[source:Externals/Cluster/maui-3.2.6p18.tgz download from repository]


Please check the distributors' websites for newer versions:

||PBS TORQUE ||http://www.clusterresources.com/pages/products/torque-resource-manager.php
||MAUI ||http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php


The install directories for ''TORQUE'' and ''MAUI'' will be:

||PBS TORQUE ||''/var/spool/torque''
||MAUI ||''/var/spool/maui''


== TORQUE ==

=== Register new services ===
Edit ''/etc/services'' and add at the end:
{{{
# PBS/Torque services
pbs         15001/tcp    # pbs_server
pbs         15001/udp    # pbs_server
pbs_mom     15002/tcp    # pbs_mom <-> pbs_server
pbs_mom     15002/udp    # pbs_mom <-> pbs_server
pbs_resmom  15003/tcp    # pbs_mom resource management
pbs_resmom  15003/udp    # pbs_mom resource management
pbs_sched   15004/tcp    # pbs scheduler (pbs_sched)
pbs_sched   15004/udp    # pbs scheduler (pbs_sched)
}}}
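
To double-check that the entries were added correctly, you can simply grep for them (optional):
{{{
grep '^pbs' /etc/services
}}}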


=== Setup and Configuration on the Master Node ===
Extract and build the TORQUE distribution on the master node. Configure the server, MOM, and clients to use secure file transfer (scp).
{{{
export TORQUECFG=/var/spool/torque
tar -xzvf TORQUE.tar.gz
cd TORQUE
}}}

Configure for a 64-bit machine with the following compiler options:
{{{
FFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona'', for AMD Opteron processors use ''-march=opteron''.

Configure, build, and install:
{{{
./configure --prefix=/usr/local --with-spooldir=$TORQUECFG
make
make install
}}}
If not configured otherwise, binaries are installed in ''/usr/local/bin'' and ''/usr/local/sbin''.

Initialise/configure the queuing system's server daemon (''pbs_server''):
{{{
pbs_server -t create
}}}

Set the PBS operators and managers (must be valid user names):
{{{
qmgr
> set server_name = master01.procksi.local
> set server scheduling = true
> set server operators = "root@master01.procksi.local,procksi@master01.procksi.local"
> set server managers = "root@master01.procksi.local,procksi@master01.procksi.local"
}}}
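
The same settings can also be applied non-interactively with ''qmgr -c'', which is handy for scripting, and the current configuration can be printed for verification (the commands below only repeat and inspect what was set above):
{{{
qmgr -c "set server scheduling = true"
qmgr -c "print server"
}}}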

Allow only ''procksi'' and ''root'' to submit jobs into the queue:
{{{
> set server acl_users = "root,procksi"
> set server acl_user_enable = true
}}}

Set the sender address for email sent by PBS:
{{{
> set server mail_from = pbs@procksi.net
}}}

Allow submissions from slave hosts (only):
'''ATTENTION: NEEDS TO BE CHECKED. DOES NOT WORK PROPERLY YET!!'''
{{{
> set server allow_node_submit = true
> set server submit_hosts = master01.procksi.local
                            slave01.procksi.local
                            slave02.procksi.local
                            slave03.procksi.local
                            slave04.procksi.local
}}}


Restrict nodes that can access the PBS server:
{{{
> set server acl_hosts = master01.procksi.local
                         slave01.procksi.local
                         slave02.procksi.local
                         slave03.procksi.local
                         slave04.procksi.local
> set server acl_host_enable = true
}}}

In addition, set the following in ''torque.cfg'' in order to use the internal interface:
{{{
SERVERHOST master01.procksi.local
ALLOWCOMPUTEHOSTSUBMIT true
}}}

Configure the default node to be used (see below):
{{{
> set server default_node = slave
}}}


Set the default queue to ''batch'':
{{{
> set server default_queue=batch
}}}

Configure the main queue ''batch'':
{{{
> create queue batch queue_type=execution
> set queue batch started=true
> set queue batch enabled=true
> set queue batch resources_default.nodes=1
}}}

Configure the queue ''test'' accordingly.
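
For example, it can follow the same pattern as ''batch''; the walltime limit below is only an illustration, adjust it to your needs:
{{{
> create queue test queue_type=execution
> set queue test started=true
> set queue test enabled=true
> set queue test resources_default.nodes=1
> set queue test resources_max.walltime=01:00:00
}}}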

Specify all compute nodes to be used by creating/editing ''$TORQUECFG/server_priv/nodes''. This may include the same machine where pbs_server will run. If a compute node has more than one processor, just add np=X after its name, with X being the number of processors. Add node attributes so that a subset of nodes can be requested during the submission stage.
{{{
master01.procksi.local np=2 procksi master xeon
slave01.procksi.local np=2 procksi slave xeon
slave02.procksi.local np=2 procksi slave xeon
slave03.procksi.local np=4 procksi slave opteron
slave04.procksi.local np=4 procksi slave opteron
}}}

Although the master node (''master01'') has two processors as well, we only allow one processor to be used for the queueing system, as the other processor will be used for handling all frontend communication and I/O. (Make sure that hyperthreading technology is disabled on the head node and all compute nodes!)

Request a job to be run on specific nodes (on submission); a complete submission example follows the list:
* Run on any compute node:
{{{
qsub -q batch -l nodes=1:procksi
}}}
* Run on any slave node:
{{{
qsub -q batch -l nodes=1:slave
}}}
* Run on the master node:
{{{
qsub -q batch -l nodes=1:master
}}}
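
A complete submission might look as follows; ''myjob.sh'' is only a placeholder script used for illustration:
{{{
# create a trivial job script
cat > myjob.sh << 'EOF'
#!/bin/sh
hostname
sleep 30
EOF

# submit it to the batch queue, restricted to slave nodes, and check the queue
qsub -q batch -l nodes=1:slave myjob.sh
qstat -a
}}}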



=== Setup and Configuration on the Slave Nodes ===
Extract and build the TORQUE distribution on each slave node. Configure the MOM and clients to use secure file transfer (scp).
{{{
export TORQUECFG=/var/spool/torque
tar -xzvf TORQUE.tar.gz
cd TORQUE
}}}

Configure for a 64-bit machine with the following compiler options:
{{{
FFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona'', for AMD Opteron processors use ''-march=opteron''.

Configure, build, and install:
{{{
./configure --prefix=/usr/local --with-spooldir=$TORQUECFG --disable-server --enable-mom --enable-clients --with-default-server=master01.procksi.local
make
make install
}}}

Configure the compute nodes by creating/editing ''$TORQUECFG/mom_priv/config''. The ''$pbsserver'' line specifies the PBS server, ''$loglevel'' sets the logging verbosity, ''$restricted'' lists hosts that are trusted to access MOM services as non-root, and ''$usecp'' allows copying data via NFS without using scp.
{{{
$pbsserver master01.procksi.local
$loglevel 255
$restricted master01.procksi.local
$usecp master01.procksi.local:/home/procksi /home/procksi
}}}

Start the queueing system (manually) in the correct order:
* Start the mom:
{{{
/usr/local/sbin/pbs_mom
}}}
* Kill the server:
{{{
/usr/local/sbin/qterm -t quick
}}}
* Start the server:
{{{
/usr/local/sbin/pbs_server
}}}
* Start the scheduler:
{{{
/usr/local/sbin/pbs_sched
}}}

If you want to use MAUI as the final scheduler, remember to kill ''pbs_sched'' after testing the TORQUE installation.


Check that all nodes are properly configured and reporting correctly:
{{{
qstat -q
pbsnodes -a
}}}
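
As an additional, optional check you can query an individual MOM and push a throw-away job through the queue; ''momctl'' and piping a command into ''qsub'' are standard TORQUE usage, and the 30-second sleep is arbitrary:
{{{
momctl -d 3 -h slave01.procksi.local
echo "sleep 30" | qsub -q batch
qstat -a
}}}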


=== Prologue and Epilogue Scripts ===
Get [repos:Externals/procksi_pbs.tgz] from the repository and untar it:
{{{
tar -xvzf procksi_pbs.tgz
}}}

The ''prologue'' script is executed just before the submitted job starts. Here, it generates a unique temp directory for each job in ''/scratch''.
It must be installed on each NODE (master, slave):
{{{
cp ./pbs/NODE/var/spool/torque/mom/priv/prologue $TORQUECFG/mom_priv
chmod 500 $TORQUECFG/mom_priv/prologue
}}}
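
For orientation, a minimal prologue along these lines might look like the sketch below; the actual script shipped in ''procksi_pbs.tgz'' may differ. The argument convention ($1 = job id, $2 = job owner) is the one TORQUE uses when calling the script:
{{{
#!/bin/sh
# Minimal sketch of a prologue: create a per-job scratch directory.
JOBID=$1        # job id, passed in by pbs_mom
JOBUSER=$2      # job owner's user name
mkdir -p /scratch/$JOBID
chown $JOBUSER /scratch/$JOBID
exit 0
}}}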

The ''epilogue'' script is executed right after the submitted job has ended. Here, it deletes the job's temp directory from ''/scratch''. It must be installed on each NODE (master, slave):
{{{
cp ./pbs/NODE/var/spool/torque/mom/priv/epilogue $TORQUECFG/mom_priv
chmod 500 $TORQUECFG/mom_priv/epilogue
}}}
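
The matching epilogue is essentially the reverse operation; again only a sketch, assuming the same /scratch layout as above:
{{{
#!/bin/sh
# Minimal sketch of an epilogue: remove the per-job scratch directory.
JOBID=$1        # job id, passed in by pbs_mom
rm -rf /scratch/$JOBID
exit 0
}}}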


== MAUI ==

=== Register new services ===
Edit ''/etc/services'' and add at the end:
{{{
# PBS/MAUI services
pbs_maui    42559/tcp    # pbs scheduler (maui)
pbs_maui    42559/udp    # pbs scheduler (maui)
}}}


=== Setup and Configuration on the Head Node ===
Extract and build the MAUI distribution.
{{{
export MAUIDIR=/var/spool/maui
tar -xzvf MAUI.tar.gz
cd MAUI
}}}

Configure for a 64-bit machine with the following compiler options:
{{{
FFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CFLAGS   = "-m64 -march=[Add Architecture] -O3 -fPIC"
CXXFLAGS = "-m64 -march=[Add Architecture] -O3 -fPIC"
LDFLAGS  = "-L/usr/local/lib -L/usr/local/lib64"
}}}
'''Attention''': For Intel Xeon processors use ''-march=nocona'', for AMD Opteron processors use ''-march=opteron''.

Configure, build, and install:
{{{
./configure --with-pbs=$TORQUECFG --with-spooldir=$MAUIDIR
make
make install
}}}

Fine-tune MAUI in ''$MAUIDIR/maui.cfg'':
{{{
SERVERHOST master01.procksi.local

# primary admin must be first in list
ADMIN1 procksi
ADMIN1 root

# Resource Manager Definition
RMCFG[MASTER01.PROCKSI.LOCAL] TYPE=PBS@RMNHOST@ PORT=15001 EPORT=15004
# EPORT can alternatively be 15017 - try!

SERVERPORT 42559
SERVERMODE NORMAL

# Node Allocation:
NODEALLOCATIONPOLICY PRIORITY
NODECFG[DEFAULT] PRIORITY='- JOBCOUNT'
}}}


Start the MAUI scheduler manually. Make sure that ''pbs_sched'' is no longer running.

* Start the scheduler:
{{{
/usr/local/sbin/maui
}}}
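
To confirm that MAUI is running and can talk to ''pbs_server'', the standard MAUI client commands can be used, e.g.:
{{{
showq
checknode slave01.procksi.local
}}}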


Get [repos:Externals/Cluster/procksi_pbs.tgz] from the repository and untar it:
{{{
tar -xvzf procksi_pbs.tgz
}}}

Make the entire queuing system (Torque + Maui) start at bootup:
{{{
cp ./pbs/master/etc/init.d/pbs_* /etc/init.d/
/sbin/chkconfig --add pbs_mom
/sbin/chkconfig --add pbs_maui
/sbin/chkconfig --add pbs_server
/sbin/chkconfig pbs_mom on
/sbin/chkconfig pbs_maui on
/sbin/chkconfig pbs_server on
}}}
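
To double-check the runlevel registration, and to start the services immediately without a reboot (assuming Red Hat style init scripts, which is what ''chkconfig'' implies), you can run:
{{{
/sbin/chkconfig --list | grep pbs
/sbin/service pbs_mom start
/sbin/service pbs_server start
/sbin/service pbs_maui start
}}}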

If you want to use the simple scheduler that comes with PBS Torque instead, substitute ''pbs_maui'' with ''pbs_sched''.


=== Setup and Configuration on the Slave Nodes ===
Get [repos:Externals/Cluster/procksi_pbs.tgz] from the repository and untar it:
{{{
tar -xvzf procksi_pbs.tgz
}}}

Make the entire queuing system start at bootup:
{{{
cp ./pbs/slave/etc/init.d/pbs_mom /etc/init.d/
/sbin/chkconfig --add pbs_mom
/sbin/chkconfig pbs_mom on
}}}
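
On a slave only the MOM needs to run; it can be started immediately without a reboot (again assuming a Red Hat style ''service'' wrapper):
{{{
/sbin/service pbs_mom start
}}}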