Version 13 - History - DataStandardisation - ProCKSI - Redmine

Paweł Widera, 11/23/2007 07:59 PM
Experiment and method wrapping tags introduced.

+h1. The [[ProCKSI]] "stand-alone" applications

+h2. [[ProCKSI]]/Comparison

+[[ProCKSI]]/Comparison integrates a variety of similarity comparison methods (e.g. USM, [[MaxCMO]], TMalign, ...), each producing different similarity measures (e.g. Zscore, TMscore, RMSD, ...). Each comparison method produces output in a different format, with additional content such as alignments, rotation matrices, etc. Some of them produce just one output file, others a set of linked HTML files.

+Additionally, there are pre-processing methods, e.g. extracting models/chains from PDB structures, or preparing contact maps from structures/chains.

+h2. [[ProCKSI]]/Consensus

+[[ProCKSI]]/Consensus integrates a variety of post-processing methods that all deal with an entire dataset in the form of similarity matrices or trees. The former can be clustered or combined in order to form a consensus using a Total Evidence approach, the latter can be combined using a Total Consensus approach. Each of these methods has its own input parameters and produces output in a different format.

+h1. Why is Data Standardisation necessary?

+The goal can be described as follows:
+Allow the [[ProCKSI]] "stand-alone" applications to
+. be developed independently from the [[ProCKSI]]/Server (incl. webserver/database), and allow collaborators to seamlessly integrate their own methods. One might even think of making the code publicly available and allowing the community to improve it.
+. run on any (Linux) machine that has the necessary methods installed. This can be either a collaborator's desktop machine, the [[ProCKSI]] cluster, the University Cluster, or even a machine on the Grid.
+. further distribute the given task using local machines, Grid and Web Service technology in order to obtain their results without the need to schedule everything from one central point (_Orchestration_ vs. _Choreography_).
+. return their results in a standardised format that can easily be integrated into the [[ProCKSI]]/Database and thus reused by the [[ProCKSI]]/Server and all other experiments "on the command line".

+h1. Standardising Results with XML

+The principle of the API for the [[ProCKSI]] "stand-alone" applications can be visualised as follows:

+[[Image(ProCKSI-core-API.png)]]

+One file in XML format is fed into the [[ProCKSI]]/Comparison and [[ProCKSI]]/Consensus applications, describing the entire dataset, all tasks and the necessary input parameters. At the end, one output file in XML format is written, which might link to further external files in specific formats (e.g. PDF, CM, ...) if necessary.

+h2. Input Specifications

+These are the specifications for the XML input file:

+In principle, all possible results from the requested methods are returned. Results that are not needed can be excluded on request. A log file is generated if a file name is provided.

+Optional tags: *exclude* (measure, result), *log*

+Optional attributes: *description*

+<pre>
+<package id="ID" description="TEXT">
+  <log filename="FILENAME" />

+  <dataset type="structure|tree|contact map|similarity matrix">
+    <item id="ID" label="TEXT" filename="FILENAME" />
+    :::
+    <item id="ID" label="TEXT" filename="FILENAME" />
+  </dataset>

+  <experiment id="ID" name="NAME">
+    <method id="ID" name="NAME">
+      <parameters>
+        <param name="TEXT">VALUE</param>
+        :::
+        <param name="TEXT">VALUE</param>
+      </parameters>

+      <exclude>
+        <measure>NAME</measure>
+        :::
+        <measure>NAME</measure>

+        <result>NAME</result>
+        :::
+        <result>NAME</result>
+      </exclude>
+    </method>
+    :::
+    <method>
+      ...
+    </method>
+  </experiment>
+  :::
+  <experiment>
+    ...
+  </experiment>
+</package>
+</pre>
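
As an illustration of the layout above, the following is a minimal sketch of how an input package could be assembled with Python's standard @xml.etree.ElementTree@ module. The function name @build_package@, its arguments, and all ids, labels, file names, and method, parameter and measure names in the usage lines are hypothetical; only the tag and attribute names are taken from the specification.

```python
import xml.etree.ElementTree as ET


def build_package(package_id, dataset_type, items, experiments, log_filename=None):
    """Build an input <package> element following the layout above.

    items: list of (id, label, filename) tuples.
    experiments: list of dicts with keys "id", "name" and "methods"; each
    method is a dict with "id", "name", optional "params" (dict) and optional
    "exclude_measures" / "exclude_results" (lists of names).
    """
    package = ET.Element("package", id=package_id)
    if log_filename is not None:
        # <log> is optional: it is only written when a file name is given.
        ET.SubElement(package, "log", filename=log_filename)

    dataset = ET.SubElement(package, "dataset", type=dataset_type)
    for item_id, label, filename in items:
        ET.SubElement(dataset, "item", id=item_id, label=label, filename=filename)

    for exp in experiments:
        exp_el = ET.SubElement(package, "experiment", id=exp["id"], name=exp["name"])
        for method in exp["methods"]:
            m_el = ET.SubElement(exp_el, "method", id=method["id"], name=method["name"])
            params_el = ET.SubElement(m_el, "parameters")
            for name, value in method.get("params", {}).items():
                ET.SubElement(params_el, "param", name=name).text = str(value)
            measures = method.get("exclude_measures", [])
            results = method.get("exclude_results", [])
            if measures or results:
                # <exclude> is optional: it is only written when something is excluded.
                excl = ET.SubElement(m_el, "exclude")
                for name in measures:
                    ET.SubElement(excl, "measure").text = name
                for name in results:
                    ET.SubElement(excl, "result").text = name
    return package


# Hypothetical usage: every value below is a placeholder.
package = build_package(
    "pkg1",
    "structure",
    items=[("s1", "first structure", "a.pdb"), ("s2", "second structure", "b.pdb")],
    experiments=[{
        "id": "e1",
        "name": "example",
        "methods": [{
            "id": "m1",
            "name": "example-method",
            "params": {"example-parameter": 7},
            "exclude_measures": ["example-measure"],
        }],
    }],
    log_filename="run.log",
)
xml_text = ET.tostring(package, encoding="unicode")
```

Building the tree programmatically rather than by string concatenation leaves attribute quoting and escaping to the library.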

+The input data can be protein structures, similarity trees, contact maps or similarity matrices. All specified methods should be able to operate on the given data files. This dependency could be verified automatically using XML Schema.
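
The dependency between methods and dataset types can be illustrated with a small check. The sketch below uses plain Python instead of XML Schema, because the Python standard library does not include a schema validator; it is a stand-in for the idea, not the proposed mechanism. The function @check_package@, the @accepted_types@ mapping and the method names in the example are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Dataset types listed in the input specification.
DATASET_TYPES = {"structure", "tree", "contact map", "similarity matrix"}


def check_package(xml_text, accepted_types):
    """Return a list of problems found in an input package.

    accepted_types maps a method name to the set of dataset types it can
    read; this mapping is an assumption of the sketch, not part of the format.
    """
    root = ET.fromstring(xml_text)
    dataset = root.find("dataset")
    if dataset is None:
        return ["package has no <dataset>"]

    problems = []
    dataset_type = dataset.get("type")
    if dataset_type not in DATASET_TYPES:
        problems.append(f"unknown dataset type: {dataset_type!r}")

    for method in root.iterfind("experiment/method"):
        name = method.get("name")
        if name not in accepted_types:
            problems.append(f"unknown method: {name!r}")
        elif dataset_type not in accepted_types[name]:
            problems.append(f"method {name!r} cannot read {dataset_type!r} data")
    return problems


# Hypothetical package: a tree dataset with one incompatible method.
example = """
<package id="p1">
  <dataset type="tree">
    <item id="t1" label="first tree" filename="t1.nwk" />
  </dataset>
  <experiment id="e1" name="demo">
    <method id="m1" name="needs-structures" />
    <method id="m2" name="combines-trees" />
  </experiment>
</package>
"""
rules = {"needs-structures": {"structure"}, "combines-trees": {"tree"}}
problems = check_package(example, rules)
```

A schema-based check would express the same rules declaratively, so that the package could be rejected before any method is started.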

+h2. Output Specifications

+These are the specifications for the XML output file:

+Optional tags: *log*, *message*, *similarity* (used only if output is a _comparison_)

+Optional attributes: *description*, *node*, *start*, *end*, *ref_id* (only if output type is _composition_), *ref_id2* (only if output type is not _comparison_)

+<pre>
+<package id="ID" description="TEXT" node="TEXT" start="TIME" end="TIME">
+  <log filename="FILENAME" />

+  <message type="error|warning|info">TEXT</message>
+  :::
+  <message type="error|warning|info">TEXT</message>

+  <dataset type="structure|tree|contact map|similarity matrix">
+    <item id="ID" label="TEXT" filename="FILENAME" />
+    :::
+    <item id="ID" label="TEXT" filename="FILENAME" />
+  </dataset>

+  <experiment id="ID" name="NAME">
+    <method id="ID" name="NAME">
+      <parameters>
+        <parameter name="TEXT">VALUE</parameter>
+        :::
+        <parameter name="TEXT">VALUE</parameter>
+      </parameters>

+      <results type="transformation|comparison|composition" ref_id="ID" ref_id2="ID">
+        <message type="error|warning|info">TEXT</message>
+        :::
+        <message type="error|warning|info">TEXT</message>

+        <similarity measure="NAME">VALUE</similarity>
+        :::
+        <similarity measure="NAME">VALUE</similarity>

+        <file type="TEXT" label="TEXT" name="FILENAME" />
+        :::
+        <file type="TEXT" label="TEXT" name="FILENAME" />
+      </results>
+      :::
+      <results>
+        ...
+      </results>
+    </method>
+    :::
+    <method>
+      ...
+    </method>
+  </experiment>
+  :::
+  <experiment>
+    ...
+  </experiment>
+</package>
+</pre>

+A message, which can be an error, a warning or additional information, can be passed at the global or the method level. The dataset and parameter set defined in the input file (package) can be repeated in the output (results) if needed (self-contained output). The output can be a 1->1 transformation (e.g. structure -> contact map), a 2->1 comparison (e.g. 2*structure -> similarity measure) or an N->1 composition (e.g. N*tree -> total tree or N*similarity matrix -> consensus similarity matrix). Results other than similarity measures for a pair of proteins are stored in external files and are only referenced from the XML file.

+The alignment data could be described in the XML file, as there is no single format used by all programs.
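
To show how a consumer might read the output format described above, here is a minimal sketch using Python's standard @xml.etree.ElementTree@ module. It collects the global messages, the similarity values of _comparison_ results, and the referenced external files. It assumes that @ref_id@ and @ref_id2@ hold the ids of the two compared items and that each similarity VALUE is numeric; neither is stated by the specification. The function @read_results@ and all values in the example document are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_results(xml_text):
    """Collect messages, similarity values and file references from an output package."""
    root = ET.fromstring(xml_text)
    summary = {
        # Messages that are direct children of <package> are the global ones.
        "global_messages": [(m.get("type"), m.text) for m in root.findall("message")],
        "similarities": [],
        "files": [],
    }
    for method in root.iterfind("experiment/method"):
        for results in method.findall("results"):
            kind = results.get("type")
            if kind == "comparison":
                for sim in results.findall("similarity"):
                    summary["similarities"].append((
                        method.get("name"),
                        results.get("ref_id"),   # assumed: id of the first item
                        results.get("ref_id2"),  # assumed: id of the second item
                        sim.get("measure"),
                        float(sim.text),         # assumed: VALUE is numeric
                    ))
            # External files are only referenced, never embedded.
            for f in results.findall("file"):
                summary["files"].append((kind, f.get("type"), f.get("name")))
    return summary


# Hypothetical output document with placeholder values.
example = """
<package id="p1" node="node01">
  <message type="info">started</message>
  <dataset type="structure">
    <item id="s1" label="first" filename="a.pdb" />
    <item id="s2" label="second" filename="b.pdb" />
  </dataset>
  <experiment id="e1" name="demo">
    <method id="m1" name="example-method">
      <results type="comparison" ref_id="s1" ref_id2="s2">
        <similarity measure="example-measure">0.5</similarity>
        <file type="alignment" label="alignment" name="s1_s2.aln" />
      </results>
    </method>
  </experiment>
</package>
"""
summary = read_results(example)
```

Keeping the similarity values in the XML file and everything else in referenced files means a reader like this one never has to parse method-specific formats to fill a similarity matrix.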