ClusterMonitoring » History » Version 6
Anonymous, 08/18/2008 01:34 PM
1 | 1 | Anonymous | = Cluster Monitoring = |
---|---|---|---|
2 | 1 | Anonymous | The cluster resources and performance need to be constantly monitored, and the users need to be tracked. |
3 | 1 | Anonymous | |
4 | 1 | Anonymous | We assume the following configuration: |
5 | 2 | Anonymous | ||!StorMan || 2.12_B928 |
6 | 1 | Anonymous | ||Ganglia || 3.0.4 |
7 | 1 | Anonymous | ||!JobMonarch || 0.2 |
8 | 5 | Anonymous | ||Monit || 4.9.2 |
9 | 2 | Anonymous | |
10 | 2 | Anonymous | |
11 | 2 | Anonymous | == !StorMan == |
12 | 3 | Anonymous | !StorMan is the ''DELL RAID Storage Manager'' (RSM) for RAID systems. The agent has ''/dev/aac0'' open and listens on ports 34571, 34572, 34573. |
13 | 2 | Anonymous | |
14 | 2 | Anonymous | * Download [repos:Externals/Cluster/RSM-2.12_B928_Linux.tgz] from the repository and unpack it. Enter at the command line of the master node: |
15 | 2 | Anonymous | {{{ |
16 | 2 | Anonymous | tar -xzf RSM-2.12_B928_Linux.tgz |
17 | 2 | Anonymous | }}} |
18 | 2 | Anonymous | |
19 | 2 | Anonymous | * Install the RSM ignoring the dependencies. Enter at the command line of the master node: |
20 | 2 | Anonymous | {{{ |
21 | 2 | Anonymous | rpm -Uv --nodeps StorMan-2.12.i386.rpm |
22 | 2 | Anonymous | }}} |
23 | 2 | Anonymous | |
24 | 2 | Anonymous | * Make sure that the RSM starts at boot time. Enter at the command line of the master node: |
25 | 2 | Anonymous | {{{ |
26 | 2 | Anonymous | cd /usr/StorMan |
27 | 2 | Anonymous | cp ./stor_agent /etc/init.d/rsd |
28 | 2 | Anonymous | /sbin/chkconfig --add rsd |
29 | 2 | Anonymous | /sbin/chkconfig rsd on |
30 | 2 | Anonymous | }}} |
31 | 2 | Anonymous | |
32 | 2 | Anonymous | * Start the RSM. Enter at the command line of the master node: |
33 | 2 | Anonymous | {{{ |
34 | 2 | Anonymous | /sbin/service rsd start |
35 | 2 | Anonymous | }}} |
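
Once the agent is running, it may be worth confirming that it is listening on the ports listed above. The following is a minimal sketch, not part of the RSM tooling: it assumes the ports are TCP and that the check runs on the master node itself. The function names `port_is_open` and `check_ports` are made up for this illustration.

```python
import socket

# Ports taken from the text above; that they are TCP ports is an assumption.
RSM_PORTS = (34571, 34572, 34573)


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host="localhost", ports=RSM_PORTS):
    """Map each port to True (open) or False (closed or unreachable)."""
    return {port: port_is_open(host, port) for port in ports}
```

Calling `check_ports()` on the master node returns a dictionary keyed by port number; any `False` entry suggests that the agent is not running or is bound to a different port.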
36 | 2 | Anonymous | |
37 | 4 | Anonymous | == Further RAID Monitoring == |
38 | 4 | Anonymous | |
39 | 4 | Anonymous | === Command Line Interface (CLI) for Dell's RAID Controller === |
40 | 4 | Anonymous | The manual can be found at [http://support.euro.dell.com/support/edocs/storage/CS6CH/en/ug/dell_ceg.htm] |
41 | 4 | Anonymous | |
42 | 4 | Anonymous | |
43 | 4 | Anonymous | === Comments by William Armitage === |
44 | 4 | Anonymous | On procksi0 I have dropped the RAID tools into /usr/local/afa: |
45 | 4 | Anonymous | |
46 | 4 | Anonymous | {{{ |
47 | 4 | Anonymous | root@master01:/usr/local/afa# ls -l /usr/local/afa |
48 | 4 | Anonymous | total 3032 |
49 | 4 | Anonymous | -rwxr--r-- 1 wja procksi 1893976 Nov 29 11:51 afacli |
50 | 4 | Anonymous | -rw-r--r-- 1 root root 0 Nov 29 11:58 cfg.log |
51 | 4 | Anonymous | -rw-r--r-- 1 wja procksi 572 Nov 29 11:51 getcfg.afa |
52 | 4 | Anonymous | -rw-r--r-- 1 root root 165 Dec 19 12:34 i2 |
53 | 4 | Anonymous | -rw-r--r-- 1 root root 2050 Dec 19 12:36 i2.log |
54 | 4 | Anonymous | -rw-r--r-- 1 root root 159 Dec 11 18:11 i2.orig |
55 | 4 | Anonymous | -rw-r--r-- 1 root root 325 Dec 19 12:34 i3 |
56 | 4 | Anonymous | -rw-r--r-- 1 root root 1153256 Dec 19 12:38 i3.log |
57 | 4 | Anonymous | -rw-r--r-- 1 root root 98 Nov 29 14:47 input |
58 | 4 | Anonymous | -rwxr--r-- 1 wja procksi 595 Nov 29 11:51 MAKEDEV.afa |
59 | 4 | Anonymous | -rw-r--r-- 1 root root 1152 Nov 30 12:39 output |
60 | 4 | Anonymous | -rw-r--r-- 1 root root 1152 Nov 29 14:48 output.0 |
61 | 4 | Anonymous | -rw-r--r-- 1 root root 1152 Nov 29 15:23 output.1 |
62 | 4 | Anonymous | -rw-r--r-- 1 root root 1152 Nov 29 18:04 output.2 |
63 | 4 | Anonymous | }}} |
64 | 4 | Anonymous | |
65 | 4 | Anonymous | afacli is from http://linux.dell.com/storage.shtml under the section AACRAID > Management Utility > afa-apps-snmp.2807420-A04.tar.gz; untar and pull apart afaapps-4.1-0.i386.rpm. |
66 | 4 | Anonymous | |
67 | 4 | Anonymous | [http://support.dell.com/support/downloads/download.aspx?c=us&l=en&s=gen&releaseid=R85529&formatcnt=1&fileid=112003] |
68 | 4 | Anonymous | |
69 | 4 | Anonymous | If you don't have /dev/afa0, create it with: |
70 | 4 | Anonymous | {{{ |
71 | 4 | Anonymous | cd /dev |
72 | 4 | Anonymous | /usr/local/afa/MAKEDEV.afa afa0 |
73 | 4 | Anonymous | }}} |
74 | 4 | Anonymous | It disappeared after the reboot and needed recreating. |
75 | 4 | Anonymous | |
76 | 4 | Anonymous | afacli has been described as a bad port of a DOS program. |
77 | 4 | Anonymous | Used interactively, it does weird things to the terminal, so feed it scripts instead. |
78 | 4 | Anonymous | It echoes to standard output in addition to any logging that is set, but it uses escape |
79 | 4 | Anonymous | codes that write to the alternate screen in a colour xterm and then immediately |
80 | 4 | Anonymous | switches back at the end. |
81 | 4 | Anonymous | |
82 | 4 | Anonymous | The "input" script comes from |
83 | 4 | Anonymous | [http://linux.dell.com/files/aacraid/nagios/check_raid_pl.txt][[br]] |
84 | 4 | Anonymous | {{{output log "output"}}} |
85 | 4 | Anonymous | |
86 | 4 | Anonymous | The more detailed script "i2" comes from |
87 | 4 | Anonymous | [http://www.techno-obscura.com/~delgado/notes/sles9-NagiosAfacli.html][[br]] |
88 | 4 | Anonymous | {{{output i2.out}}} |
89 | 4 | Anonymous | |
90 | 4 | Anonymous | "i3" dumps the controller logs. It is based on |
91 | 4 | Anonymous | [http://threebit.net/mail-archive/centos/msg02033.html][[br]] |
92 | 4 | Anonymous | {{{output i3.out}}} |
93 | 2 | Anonymous | |
94 | 5 | Anonymous | == Monitoring of Services: Monit == |
95 | 5 | Anonymous | You can find more about ''monit'' at [http://mon.wiki.kernel.org/]. |
96 | 1 | Anonymous | |
97 | 6 | Anonymous | * Add the DAG repository on the ''master node'' and ''slave nodes''. Enter at the command line as ''root'': |
98 | 5 | Anonymous | {{{ |
99 | 5 | Anonymous | wget http://apt.sw.be/redhat/el5/en/x86_64/rpmforge/RPMS/rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm |
100 | 5 | Anonymous | rpm -Uvh rpmforge-release-0.3.6-1.el5.rf.x86_64.rpm |
101 | 5 | Anonymous | }}} |
102 | 5 | Anonymous | |
103 | 6 | Anonymous | * Install ''monit'' on the ''master node'' and ''slave nodes''. Enter at the command line as ''root'': |
104 | 5 | Anonymous | {{{ |
105 | 5 | Anonymous | yum install monit |
106 | 1 | Anonymous | }}} |
107 | 1 | Anonymous | |
108 | 6 | Anonymous | * General configuration on the ''master node'' and ''slave nodes''. Edit ''/etc/monit.conf'' as ''root'': |
109 | 1 | Anonymous | |
110 | 6 | Anonymous | Start Monit as a daemon and check the services every 2 minutes: |
111 | 6 | Anonymous | {{{ |
112 | 6 | Anonymous | set daemon 120 |
113 | 6 | Anonymous | }}} |
114 | 6 | Anonymous | |
115 | 6 | Anonymous | Use ''syslog'' in order to rotate log files: |
116 | 6 | Anonymous | {{{ |
117 | 6 | Anonymous | set logfile syslog facility log_daemon |
118 | 6 | Anonymous | }}} |
119 | 6 | Anonymous | |
120 | 6 | Anonymous | Set list of mailservers to be used for alert delivery: |
121 | 6 | Anonymous | {{{ |
122 | 6 | Anonymous | set mailserver master01.procksi.local, # primary mailserver |
123 | 6 | Anonymous | marian.cs.nott.ac.uk # fallback relay |
124 | 6 | Anonymous | }}} |
125 | 6 | Anonymous | |
126 | 6 | Anonymous | Change the alert message format: |
127 | 6 | Anonymous | {{{ |
128 | 6 | Anonymous | set mail-format { from: monit@procksi.net } |
129 | 6 | Anonymous | }}} |
130 | 6 | Anonymous | |
131 | 6 | Anonymous | Set the alert recipient(s): |
132 | 6 | Anonymous | {{{ |
133 | 6 | Anonymous | set alert procksi@cs.nott.ac.uk # receive all alerts |
134 | 6 | Anonymous | }}} |
135 | 6 | Anonymous | |
136 | 6 | Anonymous | |
137 | 5 | Anonymous | * Configure services to be monitored: |
138 | 5 | Anonymous | [TBC] |
139 | 5 | Anonymous | |
140 | 5 | Anonymous | * Configure files to be monitored: |
141 | 1 | Anonymous | [TBC] |
142 | 5 | Anonymous | |
143 | 5 | Anonymous | * Configure devices to be monitored: |
144 | 1 | Anonymous | [TBC] |
145 | 5 | Anonymous | |
146 | 6 | Anonymous | * Make the ''monit'' daemon start at bootup. Enter at the command line as ''root'' on the ''master node'' and ''slave nodes'': |
147 | 5 | Anonymous | {{{ |
148 | 5 | Anonymous | /sbin/chkconfig monit on |
149 | 5 | Anonymous | }}} |
150 | 6 | Anonymous | * Start the ''monit'' daemon. Enter at the command line as ''root'' on the ''master node'' and ''slave nodes'': |
151 | 5 | Anonymous | {{{ |
152 | 5 | Anonymous | /sbin/service monit start |
153 | 5 | Anonymous | }}} |
154 | 5 | Anonymous | |
155 | 5 | Anonymous | |
156 | 5 | Anonymous | == Monitoring of Cluster Resources: Ganglia == |
157 | 1 | Anonymous | ''Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.'' |
158 | 1 | Anonymous | |
159 | 1 | Anonymous | * Download the latest release of the ''Ganglia Monitoring Core'' from [http://ganglia.sourceforge.net/ http://ganglia.sourceforge.net]. |
160 | 1 | Anonymous | * Install Ganglia into ''/usr/local/ganglia'', its web frontend into ''/usr/local/ganglia/html/'', and its databases into ''/usr/local/ganglia/rrds/''. |
161 | 1 | Anonymous | * Install the ''Ganglia Monitoring Daemon'' (gmond) on each node, and the ''Ganglia Meta Daemon'' (gmetad) on the head node. |
162 | 1 | Anonymous | |
163 | 1 | Anonymous | === Ganglia Monitoring Daemon === |
164 | 1 | Anonymous | * Configure, build and install Ganglia on each slave node (only with ''gmond''): |
165 | 1 | Anonymous | {{{ |
166 | 1 | Anonymous | ./configure --prefix=/usr/local |
167 | 1 | Anonymous | }}} |
168 | 1 | Anonymous | and on the master node (with ''gmond'' and ''gmetad''): |
169 | 1 | Anonymous | {{{ |
170 | 1 | Anonymous | ./configure --prefix=/usr/local --with-gmetad |
171 | 1 | Anonymous | }}} |
172 | 1 | Anonymous | |
173 | 1 | Anonymous | * Initialise the configuration file for ''gmond'': |
174 | 1 | Anonymous | {{{ |
175 | 1 | Anonymous | gmond --default_config > /etc/gmond.conf |
176 | 1 | Anonymous | }}} |
177 | 1 | Anonymous | |
178 | 1 | Anonymous | * Configure the ''Ganglia Monitoring Daemon'' in ''/etc/gmond.conf'': |
179 | 1 | Anonymous | * Set the name of the cluster: |
180 | 1 | Anonymous | {{{ |
181 | 1 | Anonymous | cluster { |
182 | 1 | Anonymous | name = "ProCKSI" |
183 | 1 | Anonymous | } |
184 | 1 | Anonymous | }}} |
185 | 1 | Anonymous | * Set the IP address and port for multicast data exchange: |
186 | 1 | Anonymous | {{{ |
187 | 1 | Anonymous | udp_send_channel { |
188 | 1 | Anonymous | mcast_join = 239.2.11.71 |
189 | 1 | Anonymous | port = 8649 |
190 | 1 | Anonymous | } |
191 | 1 | Anonymous | udp_recv_channel { |
192 | 1 | Anonymous | mcast_join = 239.2.11.71 |
193 | 1 | Anonymous | port = 8649 |
194 | 1 | Anonymous | bind = 239.2.11.71 |
195 | 1 | Anonymous | } |
196 | 1 | Anonymous | }}} |
197 | 1 | Anonymous | |
198 | 1 | Anonymous | * Copy start-up script for ''gmond'': |
199 | 1 | Anonymous | {{{ |
200 | 1 | Anonymous | cp ./gmond/gmond.init /etc/init.d/gmond |
201 | 1 | Anonymous | }}} |
202 | 1 | Anonymous | |
203 | 1 | Anonymous | * Add an additional route for correct data exchange via multicast using the ''internal'' interface (''eth0''). Modify ''/etc/init.d/gmond'': |
204 | 1 | Anonymous | {{{ |
205 | 1 | Anonymous | #Add multicast route to internal interface |
206 | 1 | Anonymous | /sbin/route add -host 239.2.11.71 dev eth0 |
207 | 1 | Anonymous | daemon $GMOND |
208 | 1 | Anonymous | }}} |
209 | 1 | Anonymous | {{{ |
210 | 1 | Anonymous | #Remove multicast route to internal interface |
211 | 1 | Anonymous | /sbin/route delete -host 239.2.11.71 dev eth0 |
212 | 1 | Anonymous | killproc gmond |
213 | 1 | Anonymous | }}} |
214 | 1 | Anonymous | * Make the Ganglia Monitoring Daemon start at bootup. |
215 | 1 | Anonymous | {{{ |
216 | 1 | Anonymous | /sbin/chkconfig gmond on |
217 | 1 | Anonymous | }}} |
218 | 1 | Anonymous | * Start the Ganglia Monitoring Daemon: |
219 | 1 | Anonymous | {{{ |
220 | 1 | Anonymous | /sbin/service gmond start |
221 | 1 | Anonymous | }}} |
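
To check that the nodes actually see each other over multicast, one option is to read the XML report that gmond sends to any client connecting to its TCP accept channel. The sketch below assumes the default `tcp_accept_channel` on port 8649 (as in the generated default configuration) and that the report contains `CLUSTER` and `HOST` elements with `NAME` attributes. The function names are made up for this illustration.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML report that gmond writes to a TCP client before closing."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(4096)
            if not data:  # gmond closes the connection after the report
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_by_cluster(xml_data):
    """Return {cluster name: sorted list of host names} from a gmond report."""
    root = ET.fromstring(xml_data)
    return {
        cluster.get("NAME"): sorted(h.get("NAME") for h in cluster.iter("HOST"))
        for cluster in root.iter("CLUSTER")
    }
```

Running `hosts_by_cluster(read_gmond_xml())` on any node should list every node of the cluster under the name set in the `cluster` section; if only the local host shows up, the multicast route is a likely suspect.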
222 | 1 | Anonymous | |
223 | 1 | Anonymous | === Ganglia Meta Daemon === |
224 | 1 | Anonymous | * Install and configure the ''Ganglia Meta Daemon'' (gmetad) on the master node. |
225 | 1 | Anonymous | |
226 | 1 | Anonymous | * Make the Ganglia Meta Daemon start at bootup. |
227 | 1 | Anonymous | {{{ |
228 | 1 | Anonymous | /sbin/chkconfig --add gmetad |
229 | 1 | Anonymous | /sbin/chkconfig gmetad on |
230 | 1 | Anonymous | }}} |
231 | 1 | Anonymous | * Start the Ganglia Meta Daemon: |
232 | 1 | Anonymous | {{{ |
233 | 1 | Anonymous | /sbin/service gmetad start |
234 | 1 | Anonymous | }}} |
235 | 1 | Anonymous | |
236 | 1 | Anonymous | * If the pie charts do not show up, install the ''php-gd'' package. |
237 | 1 | Anonymous | |
238 | 1 | Anonymous | |
239 | 1 | Anonymous | === Further Customisation === |
240 | 1 | Anonymous | In order to display more fine-grained time intervals, edit the following files in ''/usr/local/ganglia/html/'': |
241 | 1 | Anonymous | * '''header.php''' |
242 | 1 | Anonymous | {{{ |
243 | 1 | Anonymous | if (!$physical) { |
244 | 1 | Anonymous | $context_ranges[]="10 minutes"; |
245 | 1 | Anonymous | $context_ranges[]="20 minutes"; |
246 | 1 | Anonymous | $context_ranges[]="30 minutes"; |
247 | 1 | Anonymous | $context_ranges[]="1 hour"; |
248 | 1 | Anonymous | $context_ranges[]="2 hours"; |
249 | 1 | Anonymous | $context_ranges[]="4 hours"; |
250 | 1 | Anonymous | $context_ranges[]="8 hours"; |
251 | 1 | Anonymous | $context_ranges[]="12 hours"; |
252 | 1 | Anonymous | $context_ranges[]="1 day"; |
253 | 1 | Anonymous | $context_ranges[]="2 days"; |
254 | 1 | Anonymous | $context_ranges[]="week"; |
255 | 1 | Anonymous | $context_ranges[]="month"; |
256 | 1 | Anonymous | $context_ranges[]="year"; |
257 | 1 | Anonymous | }}} |
258 | 1 | Anonymous | |
259 | 1 | Anonymous | * '''get_context.php''' |
260 | 1 | Anonymous | {{{ |
261 | 1 | Anonymous | switch ($range) { |
262 | 1 | Anonymous | case "10 minutes": $start = -600; break; |
263 | 1 | Anonymous | case "20 minutes": $start = -1200; break; |
264 | 1 | Anonymous | case "30 minutes": $start = -1800; break; |
265 | 1 | Anonymous | case "1 hour": $start = -3600; break; |
266 | 1 | Anonymous | case "2 hours": $start = -7200; break; |
267 | 1 | Anonymous | case "4 hours": $start = -14400; break; |
268 | 1 | Anonymous | case "8 hours": $start = -28800; break; |
269 | 1 | Anonymous | case "12 hours": $start = -43200; break; |
270 | 1 | Anonymous | case "1 day": $start = -86400; break; |
271 | 1 | Anonymous | case "2 days": $start = -172800; break; |
272 | 1 | Anonymous | case "week": $start = -604800; break; |
273 | 1 | Anonymous | case "month": $start = -2419200; break; |
274 | 1 | Anonymous | case "year": $start = -31449600; break; |
275 | 1 | Anonymous | }}} |
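
The offsets in the switch statement are plain second counts. The short sketch below recomputes them from named units, which makes it easier to add further ranges without arithmetic slips. It is only a cross-check written for this page, not part of Ganglia; note that "month" and "year" as used above correspond to 4 and 52 weeks respectively, not calendar lengths.

```python
# Seconds per unit.
MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR
WEEK = 7 * DAY

# Range label -> start offset in seconds, mirroring the switch statement above.
RANGES = {
    "10 minutes": -10 * MINUTE,
    "20 minutes": -20 * MINUTE,
    "30 minutes": -30 * MINUTE,
    "1 hour": -1 * HOUR,
    "2 hours": -2 * HOUR,
    "4 hours": -4 * HOUR,
    "8 hours": -8 * HOUR,
    "12 hours": -12 * HOUR,
    "1 day": -1 * DAY,
    "2 days": -2 * DAY,
    "week": -1 * WEEK,
    "month": -4 * WEEK,   # 28 days
    "year": -52 * WEEK,   # 364 days
}

if __name__ == "__main__":
    for label, start in RANGES.items():
        print(f'case "{label}": $start = {start}; break;')
```

Run as a script, it prints one `case` line per range in the same form as the PHP excerpt.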
276 | 1 | Anonymous | |
277 | 1 | Anonymous | |
278 | 1 | Anonymous | == !JobMonarch == |
279 | 1 | Anonymous | !JobMonarch is an add-on to Ganglia which provides PBS job monitoring through the web browser. |
280 | 1 | Anonymous | |
281 | 1 | Anonymous | See [http://subtrac.rc.sara.nl/oss/jobmonarch/wiki/Documentation http://subtrac.rc.sara.nl/oss/jobmonarch/wiki/Documentation] for information on requirements, configuration and installation. |
282 | 1 | Anonymous | |
283 | 1 | Anonymous | '''Attention''': Does not work properly yet. |
284 | 1 | Anonymous | |
285 | 1 | Anonymous | |
286 | 1 | Anonymous | == Domain Usage Monitoring == |
287 | 1 | Anonymous | All HTML documents must contain the following code in order to be tracked correctly. |
288 | 1 | Anonymous | |
289 | 1 | Anonymous | {{{ |
290 | 1 | Anonymous | <!-- Site Meter --> |
291 | 1 | Anonymous | <script type="text/javascript" src="http://s18.sitemeter.com/js/counter.js?site=s18procksi"> |
292 | 1 | Anonymous | </script> |
293 | 1 | Anonymous | <noscript> |
294 | 1 | Anonymous | <a href="http://s18.sitemeter.com/stats.asp?site=s18procksi" target="_top"> |
295 | 1 | Anonymous | <img src="http://s18.sitemeter.com/meter.asp?site=s18procksi" |
296 | 1 | Anonymous | alt="Site Meter" border="0"/> |
297 | 1 | Anonymous | </a> |
298 | 1 | Anonymous | </noscript> |
299 | 1 | Anonymous | |
300 | 1 | Anonymous | <!-- Copyright (c)2006 Site Meter --> |
301 | 1 | Anonymous | }}} |