DataStandardisation » History » Version 11

Anonymous, 11/01/2007 01:11 PM
Finalise Specifications; Diversify ProCKSI/Comparison and ProCKSI/Consensus

= The ProCKSI "stand-alone" applications =

== ProCKSI/Comparison ==
ProCKSI/Comparison integrates a variety of similarity comparison methods (e.g. USM, MaxCMO, TMalign, ...), each producing different similarity measures (e.g. Zscore, TMscore, RMSD, ...). Each comparison method writes its output in a different format and with different additional content, such as alignments or rotation matrices. Some of them produce just one output file, others a set of linked HTML files.

Additionally, there are pre-processing methods, e.g. extracting models/chains from PDB structures, or preparing contact maps from structures/chains.

== ProCKSI/Consensus ==
ProCKSI/Consensus integrates a variety of post-processing methods that all deal with an entire dataset in the form of similarity matrices or trees. The former can be clustered or combined in order to form a consensus using a Total Evidence approach; the latter can be combined using a Total Consensus approach. Each of these methods has its own input parameters and produces output in a different format.

= Why is Data Standardisation necessary? =
The goal can be described as follows: [[br]]
Allow the ProCKSI "stand-alone" applications to
 1. be developed independently of the ProCKSI/Server (incl. webserver/database), and allow collaborators to seamlessly integrate their own methods. One might even think of making the code publicly available and allowing the community to improve it.
 2. run on any (Linux) machine that has the necessary methods installed. This can be a collaborator's desktop machine, the ProCKSI cluster, the University Cluster, or even a machine on the Grid.
 3. further distribute the given task using local machines, Grid and Web Service technology in order to obtain their results without the need to schedule everything from one central point (''Orchestration'' vs. ''Choreography'').
 4. return their results in a standardised format that can easily be integrated into ProCKSI/Database and thus reused by the ProCKSI/Server and all other experiments "on the command line".

= Standardising Results with XML =

The principal API for the ProCKSI "stand-alone" applications can be visualised as follows:

[[Image(ProCKSI-core-API.png)]]

One file in XML format is fed into the ProCKSI/Comparison and ProCKSI/Consensus applications, describing the entire dataset, all tasks and the necessary input parameters. At the end, one output file in XML format is written, which may link to further external files in specific formats (e.g. PDF, CM, ...) if necessary.
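
The "one XML file in, one XML file out" contract might be sketched in Python as follows. This is only an illustration, not ProCKSI code: the function name `run_package`, the ISO 8601 time format and the use of the host name for `node` are assumptions, and the actual method calls are left out. The `node`, `start` and `end` attributes follow the output specification on this page.

```python
import socket
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def run_package(input_xml):
    """Read one input package and return one output package as XML text.

    Hypothetical helper for illustration only.
    """
    package_in = ET.fromstring(input_xml)

    start = datetime.now(timezone.utc).isoformat()  # time format is an assumption
    # ... the requested methods would run here (omitted in this sketch) ...
    end = datetime.now(timezone.utc).isoformat()

    # The output package reuses the input id and records where and when it ran.
    package_out = ET.Element("package", {
        "id": package_in.get("id", ""),
        "node": socket.gethostname(),
        "start": start,
        "end": end,
    })
    # Repeat the dataset so that the output is self-contained.
    for dataset in package_in.findall("dataset"):
        package_out.append(dataset)
    return ET.tostring(package_out, encoding="unicode")


example = """<package id="p1">
  <dataset type="structure">
    <item id="a" label="1abc" filename="1abc.pdb" />
  </dataset>
</package>"""

out = ET.fromstring(run_package(example))
print(out.get("id"), out.find("dataset").get("type"))
```

Keeping the whole exchange in two files means that a caller only has to ship one file to whatever machine runs the methods and collect one file afterwards.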

== Input Specifications ==
These are the specifications for the XML input file:

In principle, all possible results from the requested methods are returned. Results that are not needed can be excluded on request. A log file is generated if a file name is provided.

Optional tags: '''exclude''' (measure, result), '''log'''[[BR]]
Optional attributes: '''description'''

{{{
<package id="ID" description="TEXT">
  <log filename="FILENAME" />

  <dataset type="structure|tree|contact map|similarity matrix">
    <item id="ID" label="TEXT" filename="FILENAME" />
    :::
    <item id="ID" label="TEXT" filename="FILENAME" />
  </dataset>

  <parameterset id="ID" name="TEXT">
    <param name="TEXT">VALUE</param>
    :::
    <param name="TEXT">VALUE</param>

    <exclude>
      <measure>NAME</measure>
      :::
      <measure>NAME</measure>

      <result>NAME</result>
      :::
      <result>NAME</result>
    </exclude>
  </parameterset>

</package>
}}}
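
The '''exclude''' mechanism could be handled along the following lines. This is a minimal sketch: the helper names, the parameter name `cutoff`, the result name `alignment` and all numeric values are made up for illustration and are not part of the specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical parameterset following the input template above.
PARAMETERSET = """<parameterset id="ps1" name="demo">
  <param name="cutoff">7.0</param>
  <exclude>
    <measure>RMSD</measure>
    <result>alignment</result>
  </exclude>
</parameterset>"""


def excluded_names(parameterset):
    """Return the sets of measure and result names listed under <exclude>."""
    measures = {m.text.strip() for m in parameterset.findall("exclude/measure") if m.text}
    results = {r.text.strip() for r in parameterset.findall("exclude/result") if r.text}
    return measures, results


def filter_measures(values, excluded):
    """Keep every computed measure unless it was explicitly excluded."""
    return {name: value for name, value in values.items() if name not in excluded}


measures, results = excluded_names(ET.fromstring(PARAMETERSET))
computed = {"Zscore": 5.2, "TMscore": 0.71, "RMSD": 2.4}  # made-up values
print(filter_measures(computed, measures))
```

Because everything is returned by default, an input file without an '''exclude''' tag yields two empty sets and nothing is filtered out.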

The input data can be protein structures, similarity trees, contact maps or similarity matrices. All specified methods should be able to operate on the given data files. This dependency could be verified automatically using XML Schema.
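
Until a schema exists, a few structural checks can be done in plain Python. The sketch below is a lightweight stand-in, not an XML Schema: the function name `check_package` and the rule that item ids must be unique are assumptions. It does not check whether a method can operate on a given dataset type, because this page does not list which methods accept which types.

```python
import xml.etree.ElementTree as ET

# Dataset types taken from the input template above.
ALLOWED_TYPES = {"structure", "tree", "contact map", "similarity matrix"}


def check_package(xml_text):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    package = ET.fromstring(xml_text)
    if package.tag != "package" or not package.get("id"):
        problems.append("root element must be <package> with an id")
    for dataset in package.findall("dataset"):
        if dataset.get("type") not in ALLOWED_TYPES:
            problems.append("unknown dataset type: %r" % dataset.get("type"))
        seen = set()
        for item in dataset.findall("item"):
            item_id = item.get("id")
            if item_id in seen:  # uniqueness is an assumption of this sketch
                problems.append("duplicate item id: %r" % item_id)
            seen.add(item_id)
            if not item.get("filename"):
                problems.append("item %r has no filename" % item_id)
    return problems


good = ('<package id="p1"><dataset type="tree">'
        '<item id="t1" label="A" filename="a.nwk" /></dataset></package>')
bad = ('<package id="p2"><dataset type="sequence">'
       '<item id="x" label="A" filename="" /></dataset></package>')
print(check_package(good))
print(check_package(bad))
```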

== Output Specifications ==
These are the specifications for the XML output file:

Optional tags: '''log''', '''message''', '''similarity''' (used only if output is a ''comparison'') [[BR]]
Optional attributes: '''description''', '''node''', '''start''', '''end''', '''ref_id''' (only if output type is not ''composition''), '''ref_id2''' (only if output type is ''comparison'')

{{{
<package id="ID" description="TEXT" node="TEXT" start="TIME" end="TIME">
  <log filename="FILENAME" />

  <message type="error|warning|info">TEXT</message>
  :::
  <message type="error|warning|info">TEXT</message>

  <dataset type="structure|tree|contact map|similarity matrix">
    <item id="ID" label="TEXT" filename="FILENAME" />
    :::
    <item id="ID" label="TEXT" filename="FILENAME" />
  </dataset>

  <parameterset>
    <method id="ID" name="NAME">
      <parameter name="TEXT">VALUE</parameter>
      :::
      <parameter name="TEXT">VALUE</parameter>
    </method>
    :::
    <method ...>
      ...
    </method>
  </parameterset>

  <results type="transformation|comparison|composition" ref_id="" ref_id2="">
    <method id="ID">
      <message type="error|warning|info">TEXT</message>
      :::
      <message type="error|warning|info">TEXT</message>

      <similarity measure="NAME">VALUE</similarity>
      :::
      <similarity measure="NAME">VALUE</similarity>

      <file type="TEXT" label="TEXT" name="FILENAME" />
      :::
      <file type="TEXT" label="TEXT" name="FILENAME" />
    </method>
  </results>

</package>
}}}

A message (an error, a warning or additional information) can be passed at the global level or at the method level. The dataset and parameterset defined in the input file (package) can be repeated in the output (results) if needed, which makes the output self-contained. The output can be a 1->1 transformation (e.g. structure -> contact map), a 2->1 comparison (e.g. 2*structure -> similarity measure) or an N->1 composition (e.g. N*tree -> total tree or N*similarity matrix -> consensus similarity matrix). Results other than similarity measures for a pair of proteins are stored in external files and are only referenced from the XML file.

The alignment data could be described in the XML file, as there is no single format used by all programs.
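
As an illustration of how a consumer might read such an output file, the sketch below collects the values of one similarity measure from all ''comparison'' results. Several things here are assumptions rather than part of the specification: that one package may hold several `results` elements, that a comparison carries both `ref_id` and `ref_id2`, the function name `pairwise_similarities`, and all ids, file names and values in the sample.

```python
import xml.etree.ElementTree as ET

# Made-up output package following the output template above.
OUTPUT = """<package id="p1">
  <results type="comparison" ref_id="a" ref_id2="b">
    <method id="m1">
      <similarity measure="TMscore">0.71</similarity>
      <similarity measure="RMSD">2.4</similarity>
      <file type="alignment" label="a vs b" name="a_b.aln" />
    </method>
  </results>
  <results type="comparison" ref_id="a" ref_id2="c">
    <method id="m1">
      <similarity measure="TMscore">0.35</similarity>
    </method>
  </results>
</package>"""


def pairwise_similarities(xml_text, measure):
    """Map (ref_id, ref_id2, method id) to the value of one measure."""
    table = {}
    package = ET.fromstring(xml_text)
    for results in package.findall("results"):
        if results.get("type") != "comparison":
            continue  # only 2->1 comparisons carry <similarity> values
        pair = (results.get("ref_id"), results.get("ref_id2"))
        for method in results.findall("method"):
            for sim in method.findall("similarity"):
                if sim.get("measure") == measure:
                    table[pair + (method.get("id"),)] = float(sim.text)
    return table


print(pairwise_similarities(OUTPUT, "TMscore"))
```

A table of this shape is one possible starting point for filling a similarity matrix; external files such as alignments are only referenced by name and have to be read separately.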