DataStandardisation » History » Version 3

Anonymous, 07/26/2007 04:29 PM

== Standardising Results with XML ==

  * '''Original Proposal by Dr. Daniel'''

 ProCKSI utilises a variety of similarity comparison methods (e.g. USM, MaxCMO, TMalign, ...), each producing different similarity measures (e.g. Zscore, TMscore, RMSD, ...). Each comparison method produces output in a different format, with additional content such as alignments, rotation matrices, etc. Some of them produce just one output file, others a set of linked HTML files.

 The similarity comparisons are performed on compute nodes, while the database that will contain all results is located on the head node. Thus, all results must be parsed and transmitted (in a compressed form) from the compute nodes to the head node before they can be made available in the database. I have devised a very general concept: in a first step, the results from the different methods are parsed directly on the compute node and translated into a standardised format; this format is then parsed again on the head node and entered into the database.

 Hence, I have designed the prototype of an XML document that will be used to store the results of similarity comparisons of pairs of protein structures with different comparison methods.

{{{

<SimilarityComparison>
    <Job>
        <ID> </ID>
        <Label> </Label>
    </Job>

    <Structures>
        <Structure>
            <ID> </ID>
            <Label> </Label>
        </Structure>
        <Structure>
            <ID> </ID>
            <Label> </Label>
        </Structure>
    </Structures>

    <Method>
        <ID> </ID>
        <Name> </Name>

        <Messages>
            <Errors>
                <Error> </Error>
            </Errors>
            <Warnings>
                <Warning> </Warning>
            </Warnings>
            <Notices>
                <Notice> </Notice>
            </Notices>
        </Messages>

        <Measures>
            <Measure>
                <Name> </Name>
                <Value> </Value>
            </Measure>
        </Measures>

        <Alignments>
            <Alignment> </Alignment>
        </Alignments>

        <Matrices>
            <Matrix>
                <Name> </Name>
                <Content> </Content>
            </Matrix>
        </Matrices>

        <Files>
            <File>
                <Label> </Label>
                <Name> </Name>
            </File>
        </Files>
    </Method>

</SimilarityComparison>
}}}
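
 The two-step flow described above (parse on the compute node, translate to the standardised format, compress, parse again on the head node) could be sketched roughly as follows. This is only a minimal illustration using Python's standard library: the function names `build_result` and `read_result`, the choice of gzip for compression, and the sample values are all assumptions, not part of the proposal.

```python
import gzip
import xml.etree.ElementTree as ET


def build_result(job_id, structure_ids, method_name, measures):
    """Compute-node side: wrap already-parsed measures in the proposed layout."""
    root = ET.Element("SimilarityComparison")
    job = ET.SubElement(root, "Job")
    ET.SubElement(job, "ID").text = str(job_id)

    structures = ET.SubElement(root, "Structures")
    for sid in structure_ids:
        structure = ET.SubElement(structures, "Structure")
        ET.SubElement(structure, "ID").text = str(sid)

    method = ET.SubElement(root, "Method")
    ET.SubElement(method, "Name").text = method_name
    measures_el = ET.SubElement(method, "Measures")
    for name, value in measures.items():
        measure = ET.SubElement(measures_el, "Measure")
        ET.SubElement(measure, "Name").text = name
        ET.SubElement(measure, "Value").text = str(value)

    # Compress the serialised document before sending it to the head node
    # (gzip is an assumption; the proposal only says "compressed").
    return gzip.compress(ET.tostring(root, encoding="utf-8"))


def read_result(payload):
    """Head-node side: decompress, parse, and flatten into database-ready rows."""
    root = ET.fromstring(gzip.decompress(payload))
    job_id = root.findtext("Job/ID")
    method_name = root.findtext("Method/Name")
    structure_ids = tuple(
        s.findtext("ID") for s in root.findall("Structures/Structure")
    )
    return [
        (job_id, method_name, structure_ids,
         m.findtext("Name"), float(m.findtext("Value")))
        for m in root.findall("Method/Measures/Measure")
    ]


# Placeholder values for illustration only.
payload = build_result(1, ["1ASH", "1HLB"], "MaxCMO", {"Overlap": 125.0})
rows = read_result(payload)
```

 Each row returned by `read_result` carries the job ID, method name, structure pair, measure name and value, which is one possible shape for a database insert.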

 * '''Comments from Pawel'''

{{{

On Mon, 23 Apr 2007 18:24:02 +0100
"Dr. Daniel Barthel" <daniel.barthel@nottingham.ac.uk> wrote:

> > Could I ask you to have a look and tell me what you think, please?

First of all I would advise lowercase. It's just less error prone.

> > 	<Job>
> > 		<ID> </ID>
> > 		<Label> </Label>
> > 	</Job>

I think you should avoid putting everything into this XML file. As this
is being designed to hold results only, other things like the job label
should be stored in the database beforehand and referenced in the
results only by id.

So I propose something as simple as:
<job id="xxx" />

> > 	<Structures>
> > 		<Structure>
> > 			<ID> </ID>
> > 			<Label> </Label>
> > 		</Structure>
> > 		<Structure>
> > 			<ID> </ID>
> > 			<Label> </Label>
> > 		</Structure>
> > 	</Structures>

You don't need to enclose a list of elements inside another structure.
Using tree traversal or an XPath query it is not hard to retrieve a list
of all elements with the same name.

So I would do it like this:
<structure id="xxx" />
<structure id="yyy" />
However, it is less readable for a human that way.

> > 	<Method>
> > 		<ID> </ID>
> > 		<Name> </Name>

The same as for labels applies here:
<method id="xxx">
...

> > 		<Messages>
> > 			<Errors>
> > 				<Error> </Error>
> > 			</Errors>
> > 			<Warnings>
> > 				<Warning> </Warning>
> > 			</Warnings>
> > 			<Notices>
> > 				<Notice> </Notice>
> > 			</Notices>
> > 		</Messages>

It would be more flexible to use a type attribute, in case other kinds
of messages need to be added later.

<messages>
  <item type="error">xxx</item>
  <item type="warning">yyy</item>
  <item type="abc">zzz</item>
</messages>

> > 		<Measures>
> > 			<Measure>
> > 				<Name> </Name>
> > 				<Value> </Value>
> > 			</Measure>
> > 		</Measures>
> >
> > 		<Alignments>
> > 			<Alignment> </Alignment>
> > 		</Alignments>
> >
> > 		<Matrices>
> > 			<Matrix>
> > 				<Name> </Name>
> > 				<Content> </Content>
> > 			</Matrix>
> > 		</Matrices>
> >
> > 		<Files>
> > 			<File>
> > 				<Label> </Label>
> > 				<Name> </Name>
> > 			</File>
> > 		</Files>


I don't fully understand the idea behind it. My impression is that
a subset of these elements could be part of the method output. If so,
I would change it to something less verbose:

<results>
  <value>666</value>
  <alignment>xxx</alignment>
  <matrix name="abc">yyy</matrix>
  <file name="qwe.asd"/>
</results>

If a single XML file is going to hold more than a single
comparison, the structure could be modified like this:

<results>
  <comparison id="xxx">
    <messages>
      <item...
    </messages>
    <value...
    ...
    <file...
  </comparison>
  ...
  <comparison id="zzz">
    ...
  </comparison>
</results>

And when the structure is defined it would be nice to also have the DTD
(Document Type Definition) file for easy validation of its correctness.
}}}
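
 Pawel's points about retrieving same-named elements without a wrapper, and about filtering messages by a type attribute, could be illustrated with the limited XPath support in Python's standard `xml.etree.ElementTree`. The sample document below is a hypothetical combination of the fragments from the email (placing `<structure>` inside `<comparison>` is an assumption), and all ids and values are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in the lowercase, attribute-based style suggested
# in the comments; ids and values are placeholders.
DOC = """
<results>
  <comparison id="c1">
    <structure id="s1" />
    <structure id="s2" />
    <messages>
      <item type="error">xxx</item>
      <item type="warning">yyy</item>
    </messages>
    <value>666</value>
  </comparison>
</results>
"""

root = ET.fromstring(DOC)

# Select one comparison by its id attribute.
comparison = root.find("comparison[@id='c1']")

# No <structures> wrapper is needed: findall returns every child
# element with the given name.
structure_ids = [s.get("id") for s in comparison.findall("structure")]

# The type attribute lets one query pick out a single kind of message.
errors = [i.text for i in comparison.findall("messages/item[@type='error']")]

value = float(comparison.findtext("value"))
```

 A new message kind then needs no schema change on the reading side; only the value in the `[@type='...']` predicate differs.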

 * '''Azhar's modified proposal for protein multiverse on university grid'''

 My view is to use one XML file for all pairwise results of one method (by method I mean a structure comparison algorithm, e.g. DaliLite, USM). I am trying to develop a prototype that can read a set of structures from an input directory and run each method on these structures sequentially on one machine. The complete output of each method would be written to an XML file for that method.

 Therefore, I propose the following XML specification for similarity comparison output:

{{{

<Method Name="MaxCMO">

  <Pair No="0" Structure1="1QA9A-1.PDB" Structure2="1QA9A-1.PDB">
    <Measures Align="83" Overlap_No="124" Seq_matches="3" Seq_mismatches="80" Seq_identity="3.61"/>
  </Pair>

  <Pair No="1" Structure1="1QA9A-1.PDB" Structure2="1ASH_-1.PDB">
    <Measures Align="98" Overlap_No="125" Seq_matches="6" Seq_mismatches="92" Seq_identity="6.12"/>
  </Pair>

  <Pair No="2" Structure1="1QA9A-1.PDB" Structure2="1JHGA-1.PDB">
    <Measures Align="84" Overlap_No="102" Seq_matches="3" Seq_mismatches="81" Seq_identity="3.57"/>
  </Pair>

         .
         .
         .

  <Pair No="274" Structure1="1MYT_-1.PDB" Structure2="1HLB_-1.PDB">
    <Measures Align="113" Overlap_No="252" Seq_matches="13" Seq_mismatches="100" Seq_identity="11.50"/>
  </Pair>

  <Pair No="275" Structure1="1HLB_-1.PDB" Structure2="1HLB_-1.PDB">
    <Measures Align="147" Overlap_No="316" Seq_matches="36" Seq_mismatches="111" Seq_identity="24.49"/>
  </Pair>

</Method>

}}}

 Similarly, there would be one XML file for each of the remaining methods.

 These files could finally be merged to obtain the all-against-all results for the whole PDB, for each method separately, in a searchable and updatable XML file.

 So far I have implemented all methods except USM.
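
 The merging step mentioned above could be sketched as follows, again as a rough illustration only. The wrapper element `Results`, the helper names `merge_methods` and `lookup`, and the second method (`ExampleMethod` with a placeholder `Score`) are assumptions; only the MaxCMO pair is taken from the specification above.

```python
import xml.etree.ElementTree as ET

# One document per method. The MaxCMO pair is copied from the example above;
# "ExampleMethod" and its Score value are placeholders, not real results.
MAXCMO = """
<Method Name="MaxCMO">
  <Pair No="1" Structure1="1QA9A-1.PDB" Structure2="1ASH_-1.PDB">
    <Measures Align="98" Overlap_No="125" Seq_matches="6" Seq_mismatches="92" Seq_identity="6.12"/>
  </Pair>
</Method>
"""
OTHER = """
<Method Name="ExampleMethod">
  <Pair No="1" Structure1="1QA9A-1.PDB" Structure2="1ASH_-1.PDB">
    <Measures Score="0.0"/>
  </Pair>
</Method>
"""


def merge_methods(method_docs):
    """Collect the per-method documents under one hypothetical root element."""
    root = ET.Element("Results")
    for doc in method_docs:
        root.append(ET.fromstring(doc))
    return root


def lookup(root, method, structure1, structure2, measure):
    """Return one measure (as a string) for a structure pair, or None."""
    for pair in root.findall(f"Method[@Name='{method}']/Pair"):
        if (pair.get("Structure1") == structure1
                and pair.get("Structure2") == structure2):
            return pair.find("Measures").get(measure)
    return None


merged = merge_methods([MAXCMO, OTHER])
align = lookup(merged, "MaxCMO", "1QA9A-1.PDB", "1ASH_-1.PDB", "Align")
```

 Updating a pair amounts to changing an attribute on its `Measures` element and re-serialising the tree with `ET.tostring(merged)`. For results covering the whole PDB, a linear scan like `lookup` would likely be too slow, so an index or a database would be the more realistic choice.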