DataStandardisation » History » Version 2

Anonymous, 07/25/2007 11:22 AM

== Standardising Results with XML ==  

 * '''Original Proposal by Dr. Daniel'''

 ProCKSI utilises a variety of similarity comparison methods (e.g. USM, MaxCMO, TMalign, ...), each producing different similarity measures (e.g. Zscore, TMscore, RMSD, ...). Each comparison method writes its output in a different format and adds further content such as alignments, rotation matrices, etc. Some of them produce just one output file, others a set of linked HTML files.

 The similarity comparisons are performed on compute nodes, while the database that will contain all results is located on the head node. Thus, all results must be parsed and transmitted (in compressed form) from the compute node to the head node before they can be made available in the database. I have devised a very general concept: the results from the different methods are first parsed directly on the compute node and translated into a standardised format, which is then parsed again on the head node and entered into the database.

 Hence, I have designed a prototype of an XML document to store the results of similarity comparisons of pairs of protein structures with different comparison methods.

 Could I ask you to have a look and tell me what you think, please?

{{{
<SimilarityComparison>
    <Job>
        <ID> </ID>
        <Label> </Label>
    </Job>

    <Structures>
        <Structure>
            <ID> </ID>
            <Label> </Label>
        </Structure>
        <Structure>
            <ID> </ID>
            <Label> </Label>
        </Structure>
    </Structures>

    <Method>
        <ID> </ID>
        <Name> </Name>

        <Messages>
            <Errors>
                <Error> </Error>
            </Errors>
            <Warnings>
                <Warning> </Warning>
            </Warnings>
            <Notices>
                <Notice> </Notice>
            </Notices>
        </Messages>

        <Measures>
            <Measure>
                <Name> </Name>
                <Value> </Value>
            </Measure>
        </Measures>

        <Alignments>
            <Alignment> </Alignment>
        </Alignments>

        <Matrices>
            <Matrix>
                <Name> </Name>
                <Content> </Content>
            </Matrix>
        </Matrices>

        <Files>
            <File>
                <Label> </Label>
                <Name> </Name>
            </File>
        </Files>
    </Method>

</SimilarityComparison>
}}}
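
The two-step flow described above (parse on the compute node, compress, parse again on the head node) could be sketched roughly as follows. This is only an illustration in Python using the standard library: the function names, the choice of `zlib` for compression, and the sample values are assumptions and not part of the proposal, and only a few of the proposed elements are written.

```python
import xml.etree.ElementTree as ET
import zlib


def build_result_xml(job_id, structure_ids, method_name, measures):
    """Compute-node side: wrap one parsed comparison result in the standard format."""
    root = ET.Element("SimilarityComparison")
    job = ET.SubElement(root, "Job")
    ET.SubElement(job, "ID").text = str(job_id)

    structures = ET.SubElement(root, "Structures")
    for sid in structure_ids:
        structure = ET.SubElement(structures, "Structure")
        ET.SubElement(structure, "ID").text = str(sid)

    method = ET.SubElement(root, "Method")
    ET.SubElement(method, "Name").text = method_name
    measures_el = ET.SubElement(method, "Measures")
    for name, value in measures.items():
        measure = ET.SubElement(measures_el, "Measure")
        ET.SubElement(measure, "Name").text = name
        ET.SubElement(measure, "Value").text = str(value)
    return ET.tostring(root, encoding="utf-8")


def pack(xml_bytes):
    """Compress the document before it is sent to the head node."""
    return zlib.compress(xml_bytes)


def unpack_and_parse(payload):
    """Head-node side: decompress and turn the document into a flat record."""
    root = ET.fromstring(zlib.decompress(payload))
    return {
        "job_id": root.findtext("Job/ID"),
        "structures": [s.findtext("ID") for s in root.findall("Structures/Structure")],
        "method": root.findtext("Method/Name"),
        "measures": {
            m.findtext("Name"): float(m.findtext("Value"))
            for m in root.findall("Method/Measures/Measure")
        },
    }


# Placeholder values, used purely for illustration.
payload = pack(build_result_xml(1, ["s1", "s2"], "USM", {"Zscore": 4.2}))
record = unpack_and_parse(payload)
```

The resulting `record` dictionary is the kind of flat structure that could then be written to the database; how that insert is done is outside this sketch.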

 * '''Comments from Pawel'''
{{{
On Mon, 23 Apr 2007 18:24:02 +0100
"Dr. Daniel Barthel" <daniel.barthel@nottingham.ac.uk> wrote:

> > Could I ask you to have a look and tell me what you think, please?

First of all I would advise lowercase. It's just less error-prone.

> > 	<Job>
> > 		<ID> </ID>
> > 		<Label> </Label>
> > 	</Job>

I think you should avoid putting everything into this XML file. As this
is being designed to hold results only, other things like the job label
should be stored in the database beforehand and referenced in the
results only by id.

So I propose something as simple as:
<job id="xxx" />

> > 	<Structures>
> > 		<Structure>
> > 			<ID> </ID>
> > 			<Label> </Label>
> > 		</Structure>
> > 		<Structure>
> > 			<ID> </ID>
> > 			<Label> </Label>
> > 		</Structure>
> > 	</Structures>

You don't need to enclose a list of elements inside another structure.
Using tree traversal or an XPath query it is not hard to retrieve a list
of all elements with the same name.

So I would do it like this:
<structure id="xxx" />
<structure id="yyy" />
However, it is less readable for a human that way.

> > 	<Method>
> > 		<ID> </ID>
> > 		<Name> </Name>

The same as for labels applies here:
<method id="xxx">
...

> > 		<Messages>
> > 			<Errors>
> > 				<Error> </Error>
> > 			</Errors>
> > 			<Warnings>
> > 				<Warning> </Warning>
> > 			</Warnings>
> > 			<Notices>
> > 				<Notice> </Notice>
> > 			</Notices>
> > 		</Messages>

It would be more flexible to use a type attribute in case other kinds
of messages need to be added later.

<messages>
  <item type="error">xxx</item>
  <item type="warning">yyy</item>
  <item type="abc">zzz</item>
</messages>

> > 		<Measures>
> > 			<Measure>
> > 				<Name> </Name>
> > 				<Value> </Value>
> > 			</Measure>
> > 		</Measures>
> >
> > 		<Alignments>
> > 			<Alignment> </Alignment>
> > 		</Alignments>
> >
> > 		<Matrices>
> > 			<Matrix>
> > 				<Name> </Name>
> > 				<Content> </Content>
> > 			</Matrix>
> > 		</Matrices>
> >
> > 		<Files>
> > 			<File>
> > 				<Label> </Label>
> > 				<Name> </Name>
> > 			</File>
> > 		</Files>


I don't fully understand the idea behind it. My impression is that
a subset of these elements could be part of the method output. If so,
I would change it to something less verbose:

<results>
  <value>666</value>
  <alignment>xxx</alignment>
  <matrix name="abc">yyy</matrix>
  <file name="qwe.asd"/>
</results>

If a single XML file is going to hold more than a single
comparison, the structure could be modified like this:

<results>
  <comparison id="xxx">
    <messages>
      <item...
    </messages>
    <value...
    ...
    <file...
  </comparison>
  ...
  <comparison id="zzz">
    ...
  </comparison>
</results>

And once the structure is defined, it would be nice to also have the
DTD (Document Type Definition) file for easy validation of its
correctness.
}}}
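
To illustrate the point about retrieving same-named elements without a wrapper, and about selecting messages by a type attribute, here is a small Python sketch using the standard library. The sample document is assembled from the snippets in the comments above; placing `<job>`, `<structure>` and `<method>` inside `<comparison>`, and all id and message values, are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document built from the snippets in the comments above;
# the nesting inside <comparison> and all values are assumptions.
DOC = """
<results>
  <comparison id="xxx">
    <job id="j1" />
    <structure id="s1" />
    <structure id="s2" />
    <method id="m1" />
    <messages>
      <item type="error">first</item>
      <item type="warning">second</item>
      <item type="warning">third</item>
    </messages>
    <value>666</value>
    <matrix name="abc">yyy</matrix>
    <file name="qwe.asd" />
  </comparison>
</results>
"""

root = ET.fromstring(DOC)
comparison = root.find("comparison")

# No <structures> wrapper is needed: findall returns every child with that name.
structure_ids = [s.get("id") for s in comparison.findall("structure")]

# The type attribute lets new message kinds appear without new element names.
warnings = [i.text for i in comparison.findall("messages/item[@type='warning']")]

# The same predicate syntax selects a matrix by its name attribute.
matrix = comparison.find("matrix[@name='abc']").text
```

Note that `xml.etree.ElementTree` supports only a subset of XPath and does not validate against a DTD, so checking a document against the suggested DTD would need a validating parser.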

 * '''Azhar's modified proposal for protein multiverse on university grid'''

 My view is to use one XML file for all pairwise results of one method (by method I mean a structure comparison algorithm, e.g. DaliLite or USM). I am trying to develop a prototype that reads a set of structures from an input directory and runs each method on these structures sequentially on one machine. The complete output of each method would be written to an XML file for that method.

 Therefore, I propose the following XML specification for the similarity comparison output:

{{{
<Method Name="MaxCMO">

  <Pair No="0" Structure1="1QA9A-1.PDB" Structure2="1QA9A-1.PDB">
    <Measures Align="83" Overlap_No="124" Seq_matches="3" Seq_mismatches="80" Seq_identity="3.61"/>
  </Pair>

  <Pair No="1" Structure1="1QA9A-1.PDB" Structure2="1ASH_-1.PDB">
    <Measures Align="98" Overlap_No="125" Seq_matches="6" Seq_mismatches="92" Seq_identity="6.12"/>
  </Pair>

  <Pair No="2" Structure1="1QA9A-1.PDB" Structure2="1JHGA-1.PDB">
    <Measures Align="84" Overlap_No="102" Seq_matches="3" Seq_mismatches="81" Seq_identity="3.57"/>
  </Pair>

         .
         .
         .

  <Pair No="274" Structure1="1MYT_-1.PDB" Structure2="1HLB_-1.PDB">
    <Measures Align="113" Overlap_No="252" Seq_matches="13" Seq_mismatches="100" Seq_identity="11.50"/>
  </Pair>

  <Pair No="275" Structure1="1HLB_-1.PDB" Structure2="1HLB_-1.PDB">
    <Measures Align="147" Overlap_No="316" Seq_matches="36" Seq_mismatches="111" Seq_identity="24.49"/>
  </Pair>

</Method>
}}}
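
A per-method file in this layout could be generated along the following lines. This Python sketch is not the prototype itself: `write_method_results`, `dummy_compare` and the measure it returns are hypothetical stand-ins. It assumes every unordered pair is compared once, including each structure with itself, which is consistent with the sample above (pair 0 is a self-comparison, and numbering from 0 to 275 gives 276 pairs, i.e. 23 × 24 / 2 for 23 structures), though the proposal does not state this explicitly.

```python
import itertools
import xml.etree.ElementTree as ET


def write_method_results(method_name, structures, compare):
    """Build a <Method> element holding one <Pair> per structure pair.

    `compare` is a hypothetical callable that returns a dict mapping
    measure names to values for two structure file names.
    """
    method = ET.Element("Method", Name=method_name)
    # Unordered pairs including self-comparisons, in input order.
    pairs = itertools.combinations_with_replacement(structures, 2)
    for number, (first, second) in enumerate(pairs):
        pair = ET.SubElement(
            method, "Pair", No=str(number), Structure1=first, Structure2=second
        )
        measures = {name: str(value) for name, value in compare(first, second).items()}
        ET.SubElement(pair, "Measures", measures)
    return method


def dummy_compare(first, second):
    # Stand-in for a real comparison method; the value carries no meaning.
    return {"Align": len(first) + len(second)}


names = ["1QA9A-1.PDB", "1ASH_-1.PDB", "1JHGA-1.PDB"]
method = write_method_results("MaxCMO", names, dummy_compare)
xml_text = ET.tostring(method, encoding="unicode")
```

With three input structures this yields six `<Pair>` elements, starting with the same three pairs as the sample file.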

 Similarly, there would be one XML file for each of the remaining methods.

 Finally, these files could be merged to obtain the all-against-all results for the whole PDB, with each method kept separately, in one searchable and updatable XML file.
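
The merge step might look like the sketch below, again in Python with the standard library. The `<Results>` root element, the helper names and the second placeholder method are assumptions; the proposal does not specify how the merged file is structured.

```python
import xml.etree.ElementTree as ET


def merge_method_documents(documents):
    """Combine per-method XML strings under one root element.

    The <Results> root name is an assumption; the proposal does not name it.
    """
    root = ET.Element("Results")
    for text in documents:
        method = ET.fromstring(text)
        if method.tag != "Method":
            raise ValueError(f"expected <Method>, got <{method.tag}>")
        root.append(method)
    return root


def lookup(root, method_name, structure1, structure2):
    """Return the measures of one pair as a dict, or None if absent."""
    for pair in root.findall(f"Method[@Name='{method_name}']/Pair"):
        if pair.get("Structure1") == structure1 and pair.get("Structure2") == structure2:
            measures = pair.find("Measures")
            return dict(measures.attrib) if measures is not None else None
    return None


maxcmo = """<Method Name="MaxCMO">
  <Pair No="1" Structure1="1QA9A-1.PDB" Structure2="1ASH_-1.PDB">
    <Measures Align="98" Overlap_No="125" Seq_matches="6" Seq_mismatches="92" Seq_identity="6.12"/>
  </Pair>
</Method>"""
other = '<Method Name="OtherMethod"></Method>'  # placeholder for a second file

merged = merge_method_documents([maxcmo, other])
hit = lookup(merged, "MaxCMO", "1QA9A-1.PDB", "1ASH_-1.PDB")
```

Keeping each `<Method>` element intact under the common root leaves the per-method results separate while allowing a single query over all of them.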

 So far I have implemented all methods except USM.