DataStandardisation » History » Version 11
Version 10 (Anonymous, 10/29/2007 08:58 AM) → Version 11/14 (Anonymous, 11/01/2007 01:11 PM)
= The ProCKSI "stand-alone" applications =
== ProCKSI/Comparison ==
ProCKSI/Comparison integrates a variety of similarity comparison methods (e.g. USM, MaxCMO, TMalign, ...), each producing different similarity measures (e.g. Zscore, TMscore, RMSD, ...). Each of the comparison methods produces output in a different format and with additional content such as alignments, rotation matrices, etc. Some of them produce just one output file, others a set of linked HTML files.
Additionally, there are pre-processing methods, e.g. extracting models/chains from PDB structures, or preparing contact maps from structures/chains.
== ProCKSI/Consensus ==
ProCKSI/Consensus integrates a variety of post-processing methods that all deal with an entire dataset in the form of similarity matrices or trees. The former can be clustered or combined in order to form a consensus using a Total Evidence approach, the latter can be combined using a Total Consensus approach. Each of these methods has its own input parameters and produces output in a different format.
= Why is Data Standardisation necessary? =
The goal can be described as follows: [[br]]
Allow the ProCKSI "stand-alone" applications to
1. be developed independently from ProCKSI/Server (incl. webserver/database), and allow collaborators to seamlessly integrate their own methods. One might even think of making the code publicly available and allowing the community to improve it.
2. run on any (Linux) machine that has the necessary methods installed. This can be either a collaborator's desktop machine, the ProCKSI cluster, the University Cluster, or even a machine on the Grid.
3. further distribute the given task using local machines, Grid and Web Service technology in order to obtain their results without the need to schedule everything from one central point (''Orchestration'' vs. ''Choreography'').
4. return their results in a standardised format that can easily be integrated into ProCKSI/Database and thus reused by ProCKSI/Server and all other experiments "on the command line".
= Standardising Results with XML =
The principal API for the ProCKSI "stand-alone" applications can be visualised as follows:
[[Image(ProCKSI-core-API.png)]]
One file in XML format is fed into the ProCKSI/Comparison and ProCKSI/Consensus applications, describing the entire dataset, all tasks and the necessary input parameters. At the end, one output file in XML format is written, which might link to further external files in specific formats (e.g. PDF, CM, ...) if necessary.
== Input Specifications ==
These are the specifications for the XML input file:
In principle, all possible results from the requested methods are returned. All unnecessary results can be requested to be excluded. A log file is generated if a file name is provided.
Optional tags: '''exclude''' (measure, result), '''log'''[[BR]]
Optional attributes: '''description'''
{{{
<package id="ID" description="TEXT">
<log filename="FILENAME" />
<dataset type="structure|tree|contact map|similarity matrix">
<item id="ID" label="TEXT" filename="FILENAME" />
:::
<item id="ID" label="TEXT" filename="FILENAME" />
</dataset>
<parameterset id="ID" name="TEXT">
<param name="TEXT">VALUE</param>
:::
<param name="TEXT">VALUE</param>
<exclude>
<measure>NAME</measure>
:::
<measure>NAME</measure>
<result>NAME</result>
:::
<result>NAME</result>
</exclude>
</parameterset>
</package>
}}}
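The template above could be filled in programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the element and attribute names are taken from the template, while the function name `build_input_package` and all IDs, labels, file names and parameter names are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET


def build_input_package(package_id, dataset_type, items, params,
                        parameterset_id="ps1", parameterset_name="example",
                        exclude_measures=(), log_filename=None):
    """Build an input <package> element following the template above.

    items  : iterable of (id, label, filename) tuples
    params : dict mapping parameter names to values
    All concrete values passed in are the caller's own; nothing here is
    prescribed by ProCKSI.
    """
    package = ET.Element("package", id=package_id)

    # <log> is optional: only written if a file name is provided.
    if log_filename:
        ET.SubElement(package, "log", filename=log_filename)

    dataset = ET.SubElement(package, "dataset", type=dataset_type)
    for item_id, label, filename in items:
        ET.SubElement(dataset, "item", id=item_id, label=label,
                      filename=filename)

    pset = ET.SubElement(package, "parameterset", id=parameterset_id,
                         name=parameterset_name)
    for name, value in params.items():
        ET.SubElement(pset, "param", name=name).text = str(value)

    # <exclude> is optional: lists the measures that should not be returned.
    if exclude_measures:
        exclude = ET.SubElement(pset, "exclude")
        for measure in exclude_measures:
            ET.SubElement(exclude, "measure").text = measure

    return package


if __name__ == "__main__":
    pkg = build_input_package(
        "job1", "structure",
        [("s1", "first structure", "first.pdb"),
         ("s2", "second structure", "second.pdb")],
        {"cutoff": 7.0},                 # hypothetical parameter name
        exclude_measures=["RMSD"],
        log_filename="job1.log",
    )
    print(ET.tostring(pkg, encoding="unicode"))
```

Building the file through an XML library rather than by string concatenation keeps attribute values correctly escaped, which matters for free-text fields such as `label` and `description`.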
The data used as an input could be protein structures, similarity trees, contact maps or similarity matrices. All specified methods should be able to operate on given data files. This dependency could be verified automatically using XML Schema.
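As a rough illustration of the kind of checks such a verification could perform, the sketch below tests a few structural rules by hand in Python. It is a stand-in, not an XML Schema: the function name `check_input_package` is hypothetical, and the assumption that `id` and `filename` are mandatory on every `item` is ours, since the proposal does not state which attributes are required.

```python
import xml.etree.ElementTree as ET

# The four dataset types listed in the input template.
ALLOWED_DATASET_TYPES = {"structure", "tree", "contact map", "similarity matrix"}


def check_input_package(xml_text):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    root = ET.fromstring(xml_text)

    if root.tag != "package":
        problems.append("root element is not <package>")
    if not root.get("id"):
        problems.append("<package> has no id")

    dataset = root.find("dataset")
    if dataset is None:
        problems.append("no <dataset> element")
        return problems

    if dataset.get("type") not in ALLOWED_DATASET_TYPES:
        problems.append("unknown dataset type: %r" % dataset.get("type"))

    # Assumption: every item needs at least an id and a file name.
    for item in dataset.findall("item"):
        for attr in ("id", "filename"):
            if not item.get(attr):
                problems.append("<item> is missing attribute %s" % attr)

    return problems
```

A real XML Schema would express the same rules declaratively, which is why it is the approach suggested in the text; whether a given method can operate on a given dataset type would additionally need a table of methods and their accepted types.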
== Output Specifications ==
These are the specifications for the XML output file:
Optional tags: '''log''', '''message''', '''similarity''' (used only if output is a ''comparison'') [[BR]]
Optional attributes: '''description''', '''node''', '''start''', '''end''', '''ref_id''' (only if output type is ''composition''), '''ref_id2''' (only if output type is not ''comparison'')
{{{
<package id="ID" description="TEXT" node="TEXT" start="TIME" end="TIME">
<log filename="FILENAME" />
<message type="error|warning|info">TEXT</message>
:::
<message type="error|warning|info">TEXT</message>
<dataset type="structure|tree|contact map|similarity matrix">
<item id="ID" label="TEXT" filename="FILENAME" />
:::
<item id="ID" label="TEXT" filename="FILENAME" />
</dataset>
<parameterset>
<method id="ID" name="NAME">
<parameter name="TEXT">VALUE</parameter>
:::
<parameter name="TEXT">VALUE</parameter>
</method>
:::
<method ...>
...
</method>
</parameterset>
<results type="transformation|comparison|composition" ref_id="ID" ref_id2="ID">
<method id="ID">
<message type="error|warning|info">TEXT</message>
:::
<message type="error|warning|info">TEXT</message>
<similarity measure="NAME">VALUE</similarity>
:::
<similarity measure="NAME">VALUE</similarity>
<file type="TEXT" label="TEXT" name="FILENAME" />
:::
<file type="TEXT" label="TEXT" name="FILENAME" />
</method>
</results>
</package>
}}}
A message, being an error, a warning or additional information, can be passed at the global or at the method level. The dataset and parameterset defined in the input file (package) can be repeated in the output (results) if needed, making the output self-contained. The output can be a 1->1 transformation (e.g. structure -> contact map), a 2->1 comparison (e.g. 2*structure -> similarity measure) or an N->1 composition (e.g. N*tree -> total tree or N*similarity matrix -> consensus similarity matrix). Results other than similarity measures for a pair of proteins are stored in external files and are only referenced from the XML file.
The alignment data could be described in the XML file, as there is no single format used by all programs.
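To show how a consumer such as the database import might read this output, here is a minimal sketch in Python using the standard `xml.etree.ElementTree` module. The sample document follows the output template; the function name `read_results`, the timestamp format and all IDs, values and file names in the sample are invented for illustration, and the optional `ref_id`/`ref_id2` attributes are left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical output file content, shaped like the output template.
SAMPLE_OUTPUT = """\
<package id="job1" node="node01" start="2007-11-01T13:00:00" end="2007-11-01T13:05:00">
  <message type="info">finished</message>
  <results type="comparison">
    <method id="m1">
      <message type="warning">chain A only</message>
      <similarity measure="TMscore">0.87</similarity>
      <similarity measure="RMSD">1.42</similarity>
      <file type="alignment" label="pairwise alignment" name="s1_s2.aln" />
    </method>
  </results>
</package>
"""


def read_results(xml_text):
    """Return (global_messages, records) extracted from an output package."""
    root = ET.fromstring(xml_text)

    # Messages that are direct children of <package> are global messages.
    global_messages = [(m.get("type"), m.text) for m in root.findall("message")]

    records = []
    for results in root.findall("results"):
        for method in results.findall("method"):
            records.append({
                "type": results.get("type"),
                "method": method.get("id"),
                # Similarity values are stored inline in the XML file.
                "similarities": {s.get("measure"): float(s.text)
                                 for s in method.findall("similarity")},
                # Everything else is only referenced by file name.
                "files": [f.get("name") for f in method.findall("file")],
                # Messages nested in <method> are method-level messages.
                "messages": [(m.get("type"), m.text)
                             for m in method.findall("message")],
            })
    return global_messages, records


if __name__ == "__main__":
    messages, records = read_results(SAMPLE_OUTPUT)
    print(messages)
    for record in records:
        print(record)
```

Keeping global and method-level messages apart mirrors the two levels described above, so an error in one method does not have to invalidate the results of the others.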