Version 14 (Anonymous, 11/01/2007 12:58 PM) → Version 15/16 (Anonymous, 11/01/2007 12:58 PM)


h1. Data Storage

This page describes the design of the database that is, or will be, used to store all necessary pieces of information obtained from the "stand-alone" [[ProCKSI]]/Comparison and [[ProCKSI]]/Consensus applications (see [[DataStandardisation]]).

h2. Database Design for the (static) Protein Multiverse

The database stores results from _Transformations_, _Comparisons_ and _Aggregations_:
* A _Transformation_ is a process that derives ONE (main) _Result_ from ONE single input file.
+Example+: The transformation of a _Structure_, _Tree_, _SimilarityMatrix_, etc., using a certain _Method_ with a certain _ParameterSet_, produces a contact map, a tree, ...

* A _Comparison_ is a process that derives ONE (main) _Result_ from TWO input files.
+Example+: The comparison of _Structures_, _Trees_, etc., using a _Method_ with a certain _ParameterSet_, produces a similarity value and an alignment.

* An _Aggregation_ is a process that derives ONE (main) _Result_ from SEVERAL input files that are grouped together into _DataSets_.
+Example+: The aggregation of _SimilarityMatrices_, _Trees_, etc., using a _Method_ with a certain _ParameterSet_, produces a consensus similarity matrix, a consensus tree, ...

!ProteinMultiverseDataBase6.png!



* There are multiple (similarity comparison) _Methods_: e.g. USM, [[MaxCMO]], DaliLite, ...

* Each _Method_ is executed with a specific _ParameterSet_, which is a combination of different _Parameters_ and their values: e.g. [[MaxCMO]]/restarts/10, USM/compressor/bzip2, ...

* If a _Method_ does not accept any _Parameters_, the _ParameterSet_ still exists but is empty; e.g. DaliLite, CE, ...

* Each _Method_ produces multiple similarity _Measures_: e.g. DaliLite/Z, FAST/Z, [[MaxCMO]]/Overlap, ...
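
The relationship between _Methods_ and _ParameterSets_ described above can be illustrated with a minimal Python sketch. The class name, the @of@ and @label@ helpers, and the choice of storing values as strings are assumptions made for illustration only; they are not part of the actual database design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterSet:
    """Immutable collection of (parameter name, value) pairs; may be empty."""
    items: frozenset = frozenset()

    @classmethod
    def of(cls, **params):
        # Hypothetical convenience constructor: values are stored as strings.
        return cls(frozenset((name, str(value)) for name, value in params.items()))

    def label(self, method):
        # Render in the "Method/parameter/value" style used in the list above.
        if not self.items:
            return method
        return ", ".join(f"{method}/{name}/{value}" for name, value in sorted(self.items))


maxcmo_run = ParameterSet.of(restarts=10)
usm_run = ParameterSet.of(compressor="bzip2")
dalilite_run = ParameterSet()  # method without parameters: the set exists but is empty

print(maxcmo_run.label("MaxCMO"))
print(usm_run.label("USM"))
print(dalilite_run.label("DaliLite"))
```

Making the set immutable and hashable means that two runs with identical parameters compare equal, which is one possible way to avoid storing the same _ParameterSet_ twice.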



* Each _Structure_ is uniquely determined by its PDB code, model and chain. (Domains are not taken into account yet.) The location of the PDB file is given, together with a link to a _Container_ file that holds further information in XML format: e.g. sequence, secondary structure, experimental resolution, ...

* Each _Structure_ is extended by further classification information from _CATH_ and _SCOP_ in separate relations.

* Multiple _Structures_ can be grouped together into _DataSets_, which are needed for _Aggregations_.

* The location of the _Containers_ in which results are stored can be found in the _Transformations_, _Comparisons_, and _Aggregations_ relations, respectively.

* Additionally, similarity values from _Comparisons_ are stored directly in the database for quicker access. Alignments could be accessed in the same way, as soon as a standardised format has been defined.
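
As a rough illustration of the relations listed above, the following sketch builds a heavily simplified schema in an in-memory SQLite database. All table and column names, the file paths, the PDB codes and the similarity value are invented for this example; the actual design is the one shown in the diagram, and it contains more relations (e.g. _Transformations_, _Aggregations_, _DataSets_, _CATH_, _SCOP_) than are sketched here.

```python
import sqlite3

# Hypothetical, simplified schema; names are assumptions, not the real design.
SCHEMA = """
CREATE TABLE methods (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE parameter_sets (
    id INTEGER PRIMARY KEY,
    method_id INTEGER NOT NULL REFERENCES methods(id)
);
CREATE TABLE parameters (
    parameter_set_id INTEGER NOT NULL REFERENCES parameter_sets(id),
    name TEXT NOT NULL,
    value TEXT NOT NULL,
    PRIMARY KEY (parameter_set_id, name)
);
CREATE TABLE structures (
    id INTEGER PRIMARY KEY,
    pdb_code TEXT NOT NULL,
    model INTEGER NOT NULL,
    chain TEXT NOT NULL,
    pdb_path TEXT NOT NULL,
    container_path TEXT,
    UNIQUE (pdb_code, model, chain)  -- a structure is identified by code, model, chain
);
CREATE TABLE comparisons (
    id INTEGER PRIMARY KEY,
    structure_a INTEGER NOT NULL REFERENCES structures(id),
    structure_b INTEGER NOT NULL REFERENCES structures(id),
    parameter_set_id INTEGER NOT NULL REFERENCES parameter_sets(id),
    container_path TEXT  -- external file holding alignments etc.
);
CREATE TABLE similarity_values (
    comparison_id INTEGER NOT NULL REFERENCES comparisons(id),
    measure TEXT NOT NULL,
    value REAL NOT NULL,
    PRIMARY KEY (comparison_id, measure)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder rows, made up purely to exercise the schema.
conn.execute("INSERT INTO methods (id, name) VALUES (1, 'MaxCMO')")
conn.execute("INSERT INTO parameter_sets (id, method_id) VALUES (1, 1)")
conn.execute("INSERT INTO parameters VALUES (1, 'restarts', '10')")
conn.executemany(
    "INSERT INTO structures (id, pdb_code, model, chain, pdb_path) VALUES (?, ?, ?, ?, ?)",
    [(1, "1abc", 1, "A", "pdb/1abc.pdb"), (2, "2xyz", 1, "B", "pdb/2xyz.pdb")],
)
conn.execute(
    "INSERT INTO comparisons (id, structure_a, structure_b, parameter_set_id) VALUES (1, 1, 2, 1)"
)
conn.execute("INSERT INTO similarity_values VALUES (1, 'Overlap', 42.0)")

# Similarity values can be read directly, without opening any container file.
row = conn.execute(
    """
    SELECT m.name, v.measure, v.value
    FROM similarity_values v
    JOIN comparisons c ON c.id = v.comparison_id
    JOIN parameter_sets p ON p.id = c.parameter_set_id
    JOIN methods m ON m.id = p.method_id
    """
).fetchone()
print(row)
```

Keeping the _Measures_ in their own relation, keyed by comparison and measure name, is one way to let a single _Comparison_ carry several similarity values.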

Note that this design does not allow _DataSets_ to comprise files other than _Structures_, although some of the _Results_ need to be grouped into a _DataSet_, too.
+Example+: Contact maps that have been produced by a _Transformation_ of _Structures_ and that are available from within the _Containers_ need to form a _DataSet_ in order to act as input for the _Comparisons_ with the USM or [[MaxCMO]] _Methods_.

h3. Storing Further Information and Results externally

Similarity values are stored directly in the relational database. All further information regarding one structure (e.g. sequence, resolution, ...) or regarding a pair of structures (e.g. alignment, rotation/translation matrices, ...) is stored in external files.
For storing further information for _single structures_, there are several approaches:

* All information in one file: file too big

* All information in separate files grouped by the protein structure

For storing further information for _pairs of structures_, there are several approaches:

* All information in separate files grouped by methods: files too big

* All information in separate files grouped by pairs: too many files

* All information in separate files grouped by the first structure: files with unbalanced sizes

* All information in separate files with fixed size:
A "bin-packing" algorithm decides where to put new information, and opens a new "bin" if necessary. "Bins" must be rebalanced from time to time in order to provide fast retrieval of information.
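
The fixed-size approach can be sketched as follows. The page does not say which bin-packing heuristic is intended, so this example assumes the simple first-fit rule; the @Bin@ class, the @place@ function, the record keys and the sizes are all hypothetical. Rebalancing of bins is not covered.

```python
from dataclasses import dataclass, field


@dataclass
class Bin:
    """One fixed-size external file ("bin") and the record keys stored in it."""
    capacity: int
    used: int = 0
    keys: list = field(default_factory=list)

    def fits(self, size):
        return self.used + size <= self.capacity


def place(bins, key, size, capacity):
    """First-fit: store the record in the first bin with enough room,
    opening a new bin if none fits. Returns the index of the chosen bin."""
    if size > capacity:
        raise ValueError("record is larger than a single bin")
    for index, candidate in enumerate(bins):
        if candidate.fits(size):
            candidate.used += size
            candidate.keys.append(key)
            return index
    bins.append(Bin(capacity=capacity, used=size, keys=[key]))
    return len(bins) - 1


bins = []
# Made-up (pair key, record size) entries for illustration.
records = [("1abc-2xyz", 60), ("1abc-3def", 50), ("2xyz-3def", 30), ("1abc-4ghi", 40)]
# The resulting mapping plays the role of a lookup index: pair -> bin number.
index = {key: place(bins, key, size, capacity=100) for key, size in records}
print(index)
```

The returned mapping from pair to bin number is the kind of information the database would need to keep in order to find a given result again.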

h2. Extended Database Design for (dynamic) Management of Experiments (ProCKSI)

This has not been modelled yet, but the database for the (static) Protein Multiverse was designed with the [[ProCKSI]] integration in mind.

Some remarks:

* _Experiments_ (formerly _Requests_) apply _Methods_ to a _DataSet_ with a certain _ParameterSet_.
* _Packages_ (formerly _Jobs_) deal with a subset of a _DataSet_ and a subset of the requested _Methods_, partitioning the 3D problem space, and are calculated using [[ProCKSI]]'s "stand-alone" core application "in one go". If they are sent to a queuing system, they become a _Job_ there.

* It has to be discussed whether there is still a need for a _Tasks_ relation in the database, whose entries have always been rather _RequestMethods_.
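
Since this part has not been modelled yet, the following is only one possible reading of how an _Experiment_ could be split into _Packages_. It assumes that the three dimensions of the problem space are (first structure, second structure, method) and cuts each dimension into blocks. The function names and block sizes are hypothetical, and the symmetry of pairwise comparisons is deliberately ignored.

```python
from itertools import product


def chunks(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def make_packages(dataset, methods, structures_per_block, methods_per_block):
    """Partition the structure x structure x method space into packages.

    Each package covers one block of "row" structures, one block of
    "column" structures and one block of methods.
    """
    structure_blocks = list(chunks(dataset, structures_per_block))
    method_blocks = list(chunks(methods, methods_per_block))
    return [
        {"rows": rows, "cols": cols, "methods": block}
        for rows, cols, block in product(structure_blocks, structure_blocks, method_blocks)
    ]


dataset = ["s1", "s2", "s3", "s4"]  # placeholder structure identifiers
methods = ["USM", "MaxCMO", "DaliLite"]
packages = make_packages(dataset, methods, structures_per_block=2, methods_per_block=2)

# Every (structure, structure, method) cell is covered by exactly one package.
covered = sum(len(p["rows"]) * len(p["cols"]) * len(p["methods"]) for p in packages)
print(len(packages), covered)
```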