DataStandardisation » History » Version 13

Paweł Widera, 11/23/2007 07:59 PM
Experiment and method wrapping tags introduced.

h1. The [[ProCKSI]] "stand-alone" applications

h2. [[ProCKSI]]/Comparison

[[ProCKSI]]/Comparison integrates a variety of similarity comparison methods (e.g. USM, [[MaxCMO]], TMalign, ...), each producing different similarity measures (e.g. Zscore, TMscore, RMSD, ...). Each comparison method writes its output in a different format and with different additional content, such as alignments or rotation matrices. Some methods produce a single output file, others a set of linked HTML files.

Additionally, there are pre-processing methods, e.g. extracting models/chains from PDB structures, or preparing contact maps from structures/chains.

h2. [[ProCKSI]]/Consensus

[[ProCKSI]]/Consensus integrates a variety of post-processing methods that all operate on an entire dataset in the form of similarity matrices or trees. The former can be clustered or combined to form a consensus using a Total Evidence approach; the latter can be combined using a Total Consensus approach. Each of these methods has its own input parameters and produces output in a different format.

h1. Why is Data Standardisation necessary?

The goal can be described as follows:
Allow the [[ProCKSI]] "stand-alone" applications to

 1. be developed independently of the [[ProCKSI]]/Server (incl. webserver/database), so that collaborators can seamlessly integrate their own methods. One might even consider making the code publicly available and letting the community improve it.
 2. run on any (Linux) machine that has the necessary methods installed. This can be a collaborator's desktop machine, the [[ProCKSI]] cluster, the University Cluster, or even a machine on the Grid.
 3. further distribute the given task using local machines, Grid and Web Service technology in order to obtain their results without the need to schedule everything from one central point (_Orchestration_ vs. _Choreography_).
 4. return their results in a standardised format that can easily be integrated into the [[ProCKSI]]/Database and thus be reused by the [[ProCKSI]]/Server and all other experiments "on the command line".

h1. Standardising Results with XML

The principal API for the [[ProCKSI]] "stand-alone" applications can be visualised as follows:

[[Image(ProCKSI-core-API.png)]]

A single XML file is fed into the [[ProCKSI]]/Comparison and [[ProCKSI]]/Consensus applications. It describes the entire dataset, all tasks and the necessary input parameters. At the end, a single XML output file is written, which may link to further external files in specific formats (e.g. PDF, CM, ...) if necessary.

h2. Input Specifications

These are the specifications for the XML input file:

By default, all possible results of the requested methods are returned. Results that are not needed can be excluded explicitly. A log file is generated if a file name is provided.

Optional tags: *exclude* (measure, result), *log*

Optional attributes: *description*

<pre>
<package id="ID" description="TEXT">
  <log filename="FILENAME" />

  <dataset type="structure|tree|contact map|similarity matrix">
    <item id="ID" label="TEXT" filename="FILENAME" />
    :::
    <item id="ID" label="TEXT" filename="FILENAME" />
  </dataset>

  <experiment id="ID" name="NAME">
    <method id="ID" name="NAME">
      <parameters>
        <param name="TEXT">VALUE</param>
        :::
        <param name="TEXT">VALUE</param>
      </parameters>

      <exclude>
        <measure>NAME</measure>
        :::
        <measure>NAME</measure>

        <result>NAME</result>
        :::
        <result>NAME</result>
      </exclude>
    </method>
    :::
    <method>
      ...
    </method>
  </experiment>
  :::
  <experiment>
    ...
  </experiment>
</package>
</pre>
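
As an illustration of the input template above, the following sketch builds such a package with Python's standard @xml.etree.ElementTree@ module. The helper name @build_package@, the file names, the method name and the parameter are hypothetical and only serve to show how the tags nest; this is not part of [[ProCKSI]].

```python
import xml.etree.ElementTree as ET


def build_package(package_id, dataset_type, items, experiments, log_file=None):
    """Build an input <package> element following the template above.

    items:       list of (id, label, filename) tuples
    experiments: list of (exp_id, exp_name, methods); each method is a dict
                 with keys "id", "name", optional "params" (dict) and optional
                 "exclude_measures" / "exclude_results" (lists of names)
    """
    package = ET.Element("package", id=package_id)
    if log_file is not None:
        ET.SubElement(package, "log", filename=log_file)

    dataset = ET.SubElement(package, "dataset", type=dataset_type)
    for item_id, label, filename in items:
        ET.SubElement(dataset, "item", id=item_id, label=label, filename=filename)

    for exp_id, exp_name, methods in experiments:
        experiment = ET.SubElement(package, "experiment", id=exp_id, name=exp_name)
        for m in methods:
            method = ET.SubElement(experiment, "method", id=m["id"], name=m["name"])
            parameters = ET.SubElement(method, "parameters")
            for name, value in m.get("params", {}).items():
                ET.SubElement(parameters, "param", name=name).text = str(value)

            # <exclude> is optional, so only emit it when something is excluded.
            measures = m.get("exclude_measures", [])
            results = m.get("exclude_results", [])
            if measures or results:
                exclude = ET.SubElement(method, "exclude")
                for name in measures:
                    ET.SubElement(exclude, "measure").text = name
                for name in results:
                    ET.SubElement(exclude, "result").text = name
    return package


# Hypothetical example values: ids, file names and the parameter are made up.
package = build_package(
    "p1",
    "structure",
    items=[("s1", "protein 1", "s1.pdb"), ("s2", "protein 2", "s2.pdb")],
    experiments=[
        ("e1", "pairwise", [
            {"id": "m1", "name": "ExampleComparison",
             "params": {"example_param": 1},
             "exclude_measures": ["RMSD"]},
        ]),
    ],
    log_file="run.log",
)
xml_text = ET.tostring(package, encoding="unicode")
print(xml_text)
```

Generating the file programmatically rather than by hand keeps the nesting of experiments, methods and parameters consistent across packages.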

The input data can be protein structures, similarity trees, contact maps or similarity matrices. All specified methods must be able to operate on the given data files. This dependency could be verified automatically using XML Schema.
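
The text proposes XML Schema for this check. As a rough stand-in, the sketch below performs the same kind of check in plain Python. The mapping @ACCEPTED_INPUT@ and the method names in it are hypothetical; which dataset types each real method accepts is not stated on this page.

```python
import xml.etree.ElementTree as ET

# Dataset types listed in the input template.
DATASET_TYPES = {"structure", "tree", "contact map", "similarity matrix"}

# Hypothetical mapping from method name to the dataset types it accepts.
ACCEPTED_INPUT = {
    "ExampleComparison": {"structure", "contact map"},
    "ExampleClustering": {"similarity matrix"},
}


def check_package(xml_text):
    """Return a list of problems found in an input package (empty if none)."""
    root = ET.fromstring(xml_text)
    dataset = root.find("dataset")
    if dataset is None:
        return ["package has no <dataset>"]

    problems = []
    dtype = dataset.get("type")
    if dtype not in DATASET_TYPES:
        problems.append(f"unknown dataset type: {dtype!r}")

    # Every method of every experiment must accept the dataset type.
    for method in root.iter("method"):
        name = method.get("name")
        accepted = ACCEPTED_INPUT.get(name)
        if accepted is None:
            problems.append(f"unknown method: {name!r}")
        elif dtype not in accepted:
            problems.append(f"method {name!r} cannot operate on {dtype!r} data")
    return problems


example = """
<package id="p1">
  <dataset type="tree">
    <item id="t1" label="tree 1" filename="t1.txt" />
  </dataset>
  <experiment id="e1" name="demo">
    <method id="m1" name="ExampleComparison" />
  </experiment>
</package>
"""
problems = check_package(example)
print(problems)
```

An empty list means the package passed the check; each entry otherwise names one mismatch between a method and the dataset type.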

h2. Output Specifications

These are the specifications for the XML output file:

Optional tags: *log*, *message*, *similarity* (used only if the output type is _comparison_)

Optional attributes: *description*, *node*, *start*, *end*, *ref_id* (only if the output type is not _composition_), *ref_id2* (only if the output type is _comparison_)

<pre>
<package id="ID" description="TEXT" node="TEXT" start="TIME" end="TIME">
  <log filename="FILENAME" />

  <message type="error|warning|info">TEXT</message>
  :::
  <message type="error|warning|info">TEXT</message>

  <dataset type="structure|tree|contact map|similarity matrix">
    <item id="ID" label="TEXT" filename="FILENAME" />
    :::
    <item id="ID" label="TEXT" filename="FILENAME" />
  </dataset>

  <experiment id="ID" name="NAME">
    <method id="ID" name="NAME">
      <parameters>
        <parameter name="TEXT">VALUE</parameter>
        :::
        <parameter name="TEXT">VALUE</parameter>
      </parameters>

      <results type="transformation|comparison|composition" ref_id="" ref_id2=" ">
        <message type="error|warning|info">TEXT</message>
        :::
        <message type="error|warning|info">TEXT</message>

        <similarity measure="NAME">VALUE</similarity>
        :::
        <similarity measure="NAME">VALUE</similarity>

        <file type="TEXT" label="TEXT" name="FILENAME" />
        :::
        <file type="TEXT" label="TEXT" name="FILENAME" />
      </results>
      :::
      <results>
        ...
      </results>
    </method>
    :::
    <method>
      ...
    </method>
  </experiment>
  :::
  <experiment>
    ...
  </experiment>
</package>
</pre>

Messages (errors, warnings or additional information) can be reported at the global level or at the method level. The dataset and parameter set defined in the input file (package) can be repeated in the output (results) if needed, making the output self-contained. The output can be a 1->1 transformation (e.g. structure -> contact map), a 2->1 comparison (e.g. 2*structure -> similarity measure) or an N->1 composition (e.g. N*tree -> total tree, or N*similarity matrix -> consensus similarity matrix). Results other than similarity measures for a pair of proteins are stored in external files and are only referenced from the XML file.
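
A consumer of the output file, such as a database import script, could read it roughly as sketched below. This is only an illustration: it assumes that similarity values are numeric, and that for a _comparison_ result @ref_id@ and @ref_id2@ identify the two compared items. The function name @read_results@ and all values in the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_results(xml_text):
    """Collect similarity values, external file references and messages."""
    root = ET.fromstring(xml_text)
    similarities = []
    files = []
    for method in root.iter("method"):
        for results in method.findall("results"):
            kind = results.get("type")
            if kind == "comparison":
                # Assumption: ref_id and ref_id2 name the two compared items.
                pair = (results.get("ref_id"), results.get("ref_id2"))
                for sim in results.findall("similarity"):
                    similarities.append(
                        (method.get("name"), pair, sim.get("measure"), float(sim.text))
                    )
            # Everything else is stored externally and only referenced here.
            for ref in results.findall("file"):
                files.append((kind, ref.get("type"), ref.get("name")))

    # Global and per-result messages, in document order.
    messages = [(m.get("type"), m.text) for m in root.iter("message")]
    return similarities, files, messages


sample = """
<package id="p1" node="node01">
  <message type="info">started</message>
  <dataset type="structure">
    <item id="s1" label="protein 1" filename="s1.pdb" />
    <item id="s2" label="protein 2" filename="s2.pdb" />
  </dataset>
  <experiment id="e1" name="demo">
    <method id="m1" name="ExampleComparison">
      <results type="comparison" ref_id="s1" ref_id2="s2">
        <message type="warning">example warning</message>
        <similarity measure="RMSD">1.5</similarity>
        <file type="alignment" label="s1 vs s2" name="s1_s2.aln" />
      </results>
    </method>
  </experiment>
</package>
"""
similarities, files, messages = read_results(sample)
print(similarities)
print(files)
print(messages)
```

Because every method writes the same structure, one reader of this kind can serve all comparison methods instead of one parser per output format.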

Alignment data could be described directly in the XML file, as there is no single alignment format used by all programs.