DataStandardisation » History » Version 14

Paweł Widera, 08/27/2013 03:19 AM
Files re-attached after migration from trac.

h1. The ProCKSI "stand-alone" applications

h2. ProCKSI/Comparison

ProCKSI/Comparison integrates a variety of similarity comparison methods (e.g. USM, MaxCMO, TM-align, ...), each producing different similarity measures (e.g. Z-score, TM-score, RMSD, ...). Each comparison method writes its output in a different format and with different additional content such as alignments, rotation matrices, etc. Some methods produce just one output file, others a set of linked HTML files.

Additionally, there are pre-processing methods, e.g. extracting models/chains from PDB structures, or preparing contact maps from structures/chains.

h2. ProCKSI/Consensus

ProCKSI/Consensus integrates a variety of post-processing methods that all deal with an entire dataset in the form of similarity matrices or trees. The former can be clustered or combined to form a consensus using a Total Evidence approach; the latter can be combined using a Total Consensus approach. Each of these methods has its own input parameters and produces output in a different format.

h1. Why is Data Standardisation necessary?

The goal is to allow the ProCKSI "stand-alone" applications to:
* be developed independently of the ProCKSI/Server (incl. webserver/database), and allow collaborators to integrate their own methods seamlessly. One might even consider making the code publicly available so that the community can improve it.
* run on any (Linux) machine that has the necessary methods installed. This can be a collaborator's desktop machine, the ProCKSI cluster, the University Cluster, or even a machine on the Grid.
* further distribute the given task using local machines, Grid and Web Service technology in order to obtain their results without the need to schedule everything from one central point (_Orchestration_ vs. _Choreography_).
* return their results in a standardised format that can easily be integrated into the ProCKSI/Database and thus reused by the ProCKSI/Server and by all other experiments "on the command line".

h1. Standardising Results with XML

The principal API for the ProCKSI "stand-alone" applications can be visualised as follows:

!ProCKSI-core-API.png!

One file in XML format is fed into the ProCKSI/Comparison and ProCKSI/Consensus applications, describing the entire dataset, all tasks and the necessary input parameters. At the end, one output file in XML format is written, which may link to further external files in specific formats (e.g. PDF, CM, ...) if necessary.

h2. Input Specifications

These are the specifications for the XML input file:

By default, all possible results from the requested methods are returned. Results that are not needed can be excluded explicitly. A log file is generated if a file name is provided.

Optional tags: *exclude* (measure, result), *log*

Optional attributes: *description*

<pre>
<package id="ID" description="TEXT">
  <log filename="FILENAME" />

  <dataset type="structure|tree|contact map|similarity matrix">
    <item id="ID" label="TEXT" filename="FILENAME" />
    :::
    <item id="ID" label="TEXT" filename="FILENAME" />
  </dataset>

  <experiment id="ID" name="NAME">
    <method id="ID" name="NAME">
      <parameters>
        <param name="TEXT">VALUE</param>
        :::
        <param name="TEXT">VALUE</param>
      </parameters>

      <exclude>
        <measure>NAME</measure>
        :::
        <measure>NAME</measure>

        <result>NAME</result>
        :::
        <result>NAME</result>
      </exclude>
    </method>
    :::
    <method>
      ...
    </method>
  </experiment>
  :::
  <experiment>
    ...
  </experiment>
</package>
</pre>

The input data can be protein structures, similarity trees, contact maps or similarity matrices. All specified methods should be able to operate on the given type of data file. This dependency could be verified automatically using XML Schema.
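
The dependency between methods and dataset types can be illustrated with a small check. The sketch below is plain Python rather than XML Schema, and it rests on assumptions: the sample file contents, the parameter name and the @ACCEPTED_TYPE@ mapping are hypothetical and are not part of the specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical input file following the template above; all values are made up.
SAMPLE_INPUT = """
<package id="p1" description="example">
  <dataset type="structure">
    <item id="s1" label="first" filename="first.pdb" />
    <item id="s2" label="second" filename="second.pdb" />
  </dataset>
  <experiment id="e1" name="example">
    <method id="m1" name="TM-align">
      <parameters>
        <param name="example_param">1</param>
      </parameters>
      <exclude>
        <measure>RMSD</measure>
      </exclude>
    </method>
  </experiment>
</package>
"""

# Assumed mapping from method name to the dataset type it accepts.
ACCEPTED_TYPE = {"TM-align": "structure", "MaxCMO": "contact map"}


def incompatible_methods(package, accepted_type):
    """Return the names of methods that cannot operate on the dataset type.

    Methods missing from the mapping are reported as well, because nothing
    is known about the data they accept.
    """
    dataset_type = package.find("dataset").get("type")
    return [
        method.get("name")
        for method in package.iter("method")
        if accepted_type.get(method.get("name")) != dataset_type
    ]


package = ET.fromstring(SAMPLE_INPUT)
print(incompatible_methods(package, ACCEPTED_TYPE))
```

An empty list means that every requested method accepts the dataset type given in the package.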

h2. Output Specifications

These are the specifications for the XML output file:

Optional tags: *log*, *message*, *similarity* (used only if output is a _comparison_)

Optional attributes: *description*, *node*, *start*, *end*, *ref_id* (only if output type is _composition_), *ref_id2* (only if output type is not _comparison_)

<pre>
<package id="ID" description="TEXT" node="TEXT" start="TIME" end="TIME">
  <log filename="FILENAME" />

  <message type="error|warning|info">TEXT</message>
  :::
  <message type="error|warning|info">TEXT</message>

  <dataset type="structure|tree|contact map|similarity matrix">
    <item id="ID" label="TEXT" filename="FILENAME" />
    :::
    <item id="ID" label="TEXT" filename="FILENAME" />
  </dataset>

  <experiment id="ID" name="NAME">
    <method id="ID" name="NAME">
      <parameters>
        <parameter name="TEXT">VALUE</parameter>
        :::
        <parameter name="TEXT">VALUE</parameter>
      </parameters>

      <results type="transformation|comparison|composition" ref_id="" ref_id2="">
        <message type="error|warning|info">TEXT</message>
        :::
        <message type="error|warning|info">TEXT</message>

        <similarity measure="NAME">VALUE</similarity>
        :::
        <similarity measure="NAME">VALUE</similarity>

        <file type="TEXT" label="TEXT" name="FILENAME" />
        :::
        <file type="TEXT" label="TEXT" name="FILENAME" />
      </results>
      :::
      <results>
        ...
      </results>
    </method>
    :::
    <method>
      ...
    </method>
  </experiment>
  :::
  <experiment>
    ...
  </experiment>
</package>
</pre>

A message (an error, a warning or additional information) can be passed at the global level or at the method level. The dataset and parameter set defined in the input file (package) can be repeated in the output (results) if needed, which makes the output self-contained. The output can be a 1->1 transformation (e.g. structure -> contact map), a 2->1 comparison (e.g. 2*structure -> similarity measure) or an N->1 composition (e.g. N*tree -> total tree, or N*similarity matrix -> consensus similarity matrix). Results other than the similarity measures for a pair of proteins are stored in external files and are only referenced from the XML file.
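
To show how a consumer might read such an output file, here is a minimal sketch in Python. It assumes a hypothetical output document: the message texts, the similarity value and the external file name are invented for illustration, and the optional attributes are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical output file following the template above; all values are made up.
SAMPLE_OUTPUT = """
<package id="p1">
  <message type="info">global message</message>
  <experiment id="e1" name="example">
    <method id="m1" name="TM-align">
      <results type="comparison">
        <message type="warning">method-level message</message>
        <similarity measure="TM-score">0.5</similarity>
        <file type="alignment" label="alignment" name="s1_s2.aln" />
      </results>
    </method>
  </experiment>
</package>
"""


def collect_similarities(package):
    """Return (method name, measure, value) for every similarity of a comparison."""
    found = []
    for method in package.iter("method"):
        for results in method.findall("results"):
            if results.get("type") != "comparison":
                continue  # similarity tags are used only for comparisons
            for sim in results.findall("similarity"):
                found.append((method.get("name"), sim.get("measure"), float(sim.text)))
    return found


def collect_messages(package):
    """Return (level, type, text) for global and method-level messages."""
    found = [("global", m.get("type"), m.text) for m in package.findall("message")]
    found += [
        ("method", m.get("type"), m.text)
        for m in package.findall("experiment/method/results/message")
    ]
    return found


package = ET.fromstring(SAMPLE_OUTPUT)
print(collect_similarities(package))
print(collect_messages(package))
```

External results such as alignments are not read here; only their @file@ references are present in the XML, in line with the description above.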

The alignment data could be described in the XML file, as there is no single format used by all programs.