Version 1 (Anonymous, 07/25/2007 10:40 AM) → Version 2/14 (Anonymous, 07/25/2007 11:22 AM)

== Standardising Results with XML ==

* '''Original Proposal by Dr. Daniel'''





Hi guys

ProCKSI utilises a variety of similarity comparison methods (e.g. USM, MaxCMO, TMalign, ...), each producing different similarity measures (e.g. Zscore, TMscore, RMSD, ...). Each comparison method also produces output in a different format, with additional content such as alignments, rotation matrices, etc. Some of them produce just one output file, others a set of linked HTML files.

The similarity comparisons are performed on compute nodes, while the database that will contain all results is located on the head node. Thus, all results must be parsed and transmitted (in compressed form) from the compute node to the head node before they can be made available in the database. I have devised a very general concept in which the results from the different methods are first parsed directly on the compute node and translated into a standardised format, which is then parsed again on the head node and entered into the database.

Hence, I have designed the prototype of an XML document that shall be used to store the results of similarity comparisons of pairs of protein structures with different comparison methods.

Could I ask you to have a look and tell me what you think, please?

{{{

<SimilarityComparison>
<Job>
<ID> </ID>
<Label> </Label>
</Job>

<Structures>
<Structure>
<ID> </ID>
<Label> </Label>
</Structure>
<Structure>
<ID> </ID>
<Label> </Label>
</Structure>
</Structures>

<Method>
<ID> </ID>
<Name> </Name>

<Messages>
<Errors>
<Error> </Error>
</Errors>
<Warnings>
<Warning> </Warning>
</Warnings>
<Notices>
<Notice> </Notice>
</Notices>
</Messages>

<Measures>
<Measure>
<Name> </Name>
<Value> </Value>
</Measure>
</Measures>

<Alignments>
<Alignment> </Alignment>
</Alignments>

<Matrices>
<Matrix>
<Name> </Name>
<Content> </Content>
</Matrix>
</Matrices>

<Files>
<File>
<Label> </Label>
<Name> </Name>
</File>
</Files>
</Method>

</SimilarityComparison>
}}}
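
As a rough illustration of the round trip described above, the following Python sketch builds a small document in the proposed layout on the compute-node side, compresses it, and parses it again on the head-node side. The helper names (`build_result`, `pack`, `unpack`) and the choice of gzip are assumptions made for this example only; the proposal does not prescribe them, and only a subset of the proposed elements is filled in.

```python
import gzip
import xml.etree.ElementTree as ET


def build_result(job_id, structure_ids, method_name, measures):
    """Build a <SimilarityComparison> tree (hypothetical helper)."""
    root = ET.Element("SimilarityComparison")
    job = ET.SubElement(root, "Job")
    ET.SubElement(job, "ID").text = str(job_id)

    structures = ET.SubElement(root, "Structures")
    for sid in structure_ids:
        structure = ET.SubElement(structures, "Structure")
        ET.SubElement(structure, "ID").text = str(sid)

    method = ET.SubElement(root, "Method")
    ET.SubElement(method, "Name").text = method_name
    measures_el = ET.SubElement(method, "Measures")
    for name, value in measures.items():
        measure = ET.SubElement(measures_el, "Measure")
        ET.SubElement(measure, "Name").text = name
        ET.SubElement(measure, "Value").text = str(value)
    return root


def pack(root):
    """Compute node: serialise the tree and compress it for transfer."""
    return gzip.compress(ET.tostring(root, encoding="utf-8"))


def unpack(payload):
    """Head node: decompress, parse, and collect the measures by name."""
    root = ET.fromstring(gzip.decompress(payload))
    return {
        m.findtext("Name"): float(m.findtext("Value"))
        for m in root.iter("Measure")
    }
```

On the head node, the dictionary returned by `unpack` would be the input for whatever code writes to the database; that step is left out here because the database schema is not part of the proposal.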

* '''Comments from Pawel'''
{{{

On Mon, 23 Apr 2007 18:24:02 +0100
"Dr. Daniel Barthel" <daniel.barthel@nottingham.ac.uk> wrote:

> > Could I ask you to have a look and tell me what you think, please?

First of all I would advise lowercase. It's just less error-prone.

> > <Job>
> > <ID> </ID>
> > <Label> </Label>
> > </Job>

I think you should avoid putting everything into this XML file. As this
is being designed to hold results only, other things like the job label
should be stored in the database beforehand and referenced in the
results only by id.

So I propose something as simple as:
<job id="xxx" />

> > <Structures>
> > <Structure>
> > <ID> </ID>
> > <Label> </Label>
> > </Structure>
> > <Structure>
> > <ID> </ID>
> > <Label> </Label>
> > </Structure>
> > </Structures>

You don't need to enclose a list of elements inside another structure.
Using tree traversal or an XPath query, it is not hard to retrieve a
list of all elements with the same name.

So I would do it like this:
<structure id="xxx" />
<structure id="yyy" />
However, it is less readable for a human that way.

> > <Method>
> > <ID> </ID>
> > <Name> </Name>

The same as for labels applies here:
<method id="xxx">
...

> > <Messages>
> > <Errors>
> > <Error> </Error>
> > </Errors>
> > <Warnings>
> > <Warning> </Warning>
> > </Warnings>
> > <Notices>
> > <Notice> </Notice>
> > </Notices>
> > </Messages>

It would be more flexible to use a type attribute, in case other kinds
of messages need to be added later.

<messages>
<item type="error">xxx</item>
<item type="warning">yyy</item>
<item type="abc">zzz</item>
</messages>

> > <Measures>
> > <Measure>
> > <Name> </Name>
> > <Value> </Value>
> > </Measure>
> > </Measures>
> >
> > <Alignments>
> > <Alignment> </Alignment>
> > </Alignments>
> >
> > <Matrices>
> > <Matrix>
> > <Name> </Name>
> > <Content> </Content>
> > </Matrix>
> > </Matrices>
> >
> > <Files>
> > <File>
> > <Label> </Label>
> > <Name> </Name>
> > </File>
> > </Files>

I don't fully understand the idea behind this. My impression is that
a subset of these elements could be part of the method output. If so,
I would change it to something less verbose:

<results>
<value>666</value>
<alignment>xxx</alignment>
<matrix name="abc">yyy</matrix>
<file name="qwe.asd"/>
</results>

If a single XML file is going to hold more than a single
comparison the structure could be modified like this:

<results>
<comparison id="xxx">
<messages>
<item...
</messages>
<value...
...
<file...
</comparison>
...
<comparison id="zzz">
...
</comparison>
</results>

And once the structure is defined, it would also be nice to have a DTD (Document Type Definition) file for easy validation of its
correctness.
}}}
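
Two of the suggestions above (dropping the wrapper elements and using a `type` attribute on messages) rely on elements being easy to select by name or attribute. A minimal Python sketch of that, using the limited XPath support in the standard library, might look as follows. The enclosing `<comparison>` root, the nesting of `<messages>` inside `<method>`, and the function names are assumptions for illustration; the ids and texts are the placeholders from the e-mail.

```python
import xml.etree.ElementTree as ET

# Compact document following the attribute-based suggestions above.
# The <comparison> root element is an assumption; values are placeholders.
SAMPLE = """
<comparison>
  <job id="xxx" />
  <structure id="xxx" />
  <structure id="yyy" />
  <method id="xxx">
    <messages>
      <item type="error">xxx</item>
      <item type="warning">yyy</item>
    </messages>
  </method>
</comparison>
"""


def structure_ids(root):
    """Collect all <structure> ids without needing a <structures> wrapper."""
    return [s.get("id") for s in root.findall("structure")]


def messages_of_type(root, kind):
    """Select message items by their type attribute, at any depth."""
    return [item.text for item in root.findall(f".//item[@type='{kind}']")]
```

A new message kind then needs no change to the parsing code: `messages_of_type(root, "abc")` simply returns an empty list until such items appear.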


* '''Azhar's modified proposal for protein multiverse on university grid'''

My view is to use one XML file for all pairwise results of one method (by method I mean a structure comparison algorithm, e.g. DaliLite or USM). I am trying to develop a prototype that reads a set of structures from an input directory and runs each method on these structures sequentially on one machine. The complete output of each method would be written to an XML file for that method.

Therefore, I propose (for my experiment on the university grid) the following XML specification for similarity comparison output:

{{{

<Method Name="MaxCMO">

<Pair No="0" Structure1="1QA9A-1.PDB" Structure2="1QA9A-1.PDB">
<Measures Align="83" Overlap_No="124" Seq_matches="3" Seq_mismatches="80" Seq_identity="3.61"/>
</Pair>

<Pair No="1" Structure1="1QA9A-1.PDB" Structure2="1ASH_-1.PDB">
<Measures Align="98" Overlap_No="125" Seq_matches="6" Seq_mismatches="92" Seq_identity="6.12"/>
</Pair>

<Pair No="2" Structure1="1QA9A-1.PDB" Structure2="1JHGA-1.PDB">
<Measures Align="84" Overlap_No="102" Seq_matches="3" Seq_mismatches="81" Seq_identity="3.57"/>
</Pair>

.
.
.

<Pair No="274" Structure1="1MYT_-1.PDB" Structure2="1HLB_-1.PDB">
<Measures Align="113" Overlap_No="252" Seq_matches="13" Seq_mismatches="100" Seq_identity="11.50"/>
</Pair>

<Pair No="275" Structure1="1HLB_-1.PDB" Structure2="1HLB_-1.PDB">
<Measures Align="147" Overlap_No="316" Seq_matches="36" Seq_mismatches="111" Seq_identity="24.49"/>
</Pair>

</Method>

}}}
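
To give an idea of how such a per-method file could be consumed, here is a small Python sketch that reads the `<Pair>` elements into a dictionary keyed by the two structure names. The sample text is the first two pairs from the listing above; the function name `read_pairs` and the decision to convert every measure to a float are assumptions for this example.

```python
import xml.etree.ElementTree as ET

# First two pairs of the MaxCMO listing above.
SAMPLE = """
<Method Name="MaxCMO">
<Pair No="0" Structure1="1QA9A-1.PDB" Structure2="1QA9A-1.PDB">
<Measures Align="83" Overlap_No="124" Seq_matches="3" Seq_mismatches="80" Seq_identity="3.61"/>
</Pair>
<Pair No="1" Structure1="1QA9A-1.PDB" Structure2="1ASH_-1.PDB">
<Measures Align="98" Overlap_No="125" Seq_matches="6" Seq_mismatches="92" Seq_identity="6.12"/>
</Pair>
</Method>
"""


def read_pairs(xml_text):
    """Return (method name, {(structure1, structure2): {measure: value}})."""
    root = ET.fromstring(xml_text)
    results = {}
    for pair in root.findall("Pair"):
        key = (pair.get("Structure1"), pair.get("Structure2"))
        measures = pair.find("Measures")
        # Every attribute of <Measures> is treated as one numeric measure.
        results[key] = {name: float(value) for name, value in measures.attrib.items()}
    return root.get("Name"), results
```

Because the measures are stored as attributes, a method with a different set of measures can use the same reader unchanged, as long as all its values are numeric.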

Similarly, there would be one XML file for each of the remaining methods.

These files could finally be merged to obtain the all-by-all results for the whole PDB, with each method kept separately, in a searchable and updatable XML file.

So far I have implemented all methods except USM.
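
One possible reading of this merge step is sketched below in Python: several per-method documents are combined under a single root, and the `<Pair>` elements of documents with the same method name are collected under one `<Method>` element. The root tag `Results` and the function name `merge_methods` are hypothetical; the specification above does not define them.

```python
import xml.etree.ElementTree as ET


def merge_methods(xml_texts, root_tag="Results"):
    """Merge per-method XML documents, keeping each method separate."""
    merged = ET.Element(root_tag)  # hypothetical wrapper element
    by_name = {}
    for text in xml_texts:
        method = ET.fromstring(text)
        name = method.get("Name")
        if method.tag != "Method" or name is None:
            raise ValueError("expected a <Method Name=...> document")
        target = by_name.get(name)
        if target is None:
            # First document for this method: create its <Method> element.
            target = ET.SubElement(merged, "Method", Name=name)
            by_name[name] = target
        # Append this document's pairs to the method's combined list.
        target.extend(method.findall("Pair"))
    return merged
```

The merged tree stays searchable with the same kind of queries as the individual files, for example `merged.find("Method[@Name='MaxCMO']")`.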