DataStorage » History » Version 15

Anonymous, 11/01/2007 12:58 PM
Diversify ProCKSI/Comparison and ProCKSI/Consensus

h1. Data Storage

This page describes the design of the database that is (or will be) used to store all necessary pieces of information obtained from the "stand-alone" [[ProCKSI]]/Comparison and [[ProCKSI]]/Consensus applications (see [[DataStandardisation]]).

h2. Database Design for the (static) Protein Multiverse

The database stores results from _Transformations_, _Comparisons_ and _Aggregations_:
* A _Transformation_ is a process that derives ONE (main) _Result_ from ONE single input file. [[br]]
   +Example+: The transformation of a _Structure_, _Tree_, _!SimilarityMatrix_, etc., using a certain _Method_ with a certain _!ParameterSet_, produces a contact map, a tree, ...
* A _Comparison_ is a process that derives ONE (main) _Result_ from TWO input files. [[br]]
   +Example+: The comparison of two _Structures_, _Trees_, etc., using a _Method_ with a certain _!ParameterSet_, produces a similarity value and an alignment.
* An _Aggregation_ is a process that derives ONE (main) _Result_ from SEVERAL input files that are grouped together into _!DataSets_. [[br]]
   +Example+: The aggregation of _!SimilarityMatrices_, _Trees_, etc., using a _Method_ with a certain _!ParameterSet_, produces a consensus similarity matrix, a consensus tree, ...
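The three process kinds above differ mainly in how many input files they accept. A minimal sketch of that rule, assuming hypothetical names (@Process@, @make_process@, @ARITY@) that do not come from the actual database design, and assuming that "several" means at least two inputs:

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence, Tuple

# Allowed number of inputs per process kind: (minimum, maximum).
# A maximum of None means "no upper limit". The lower bound of 2 for
# an Aggregation is an assumption; the text only says "several".
ARITY: Mapping[str, Tuple[int, Optional[int]]] = {
    "Transformation": (1, 1),
    "Comparison": (2, 2),
    "Aggregation": (2, None),
}


@dataclass(frozen=True)
class Process:
    kind: str
    method: str
    parameter_set: Tuple[Tuple[str, str], ...]  # sorted (name, value) pairs
    inputs: Tuple[str, ...]


def make_process(kind: str, method: str,
                 parameters: Mapping[str, str],
                 inputs: Sequence[str]) -> Process:
    """Build a Process after checking the input count for its kind."""
    if kind not in ARITY:
        raise ValueError(f"unknown process kind: {kind}")
    low, high = ARITY[kind]
    n = len(inputs)
    if n < low or (high is not None and n > high):
        raise ValueError(f"{kind} cannot take {n} input file(s)")
    # An empty mapping yields an empty (but existing) parameter set.
    return Process(kind, method, tuple(sorted(parameters.items())), tuple(inputs))
```

For example, @make_process("Comparison", "MaxCMO", {"restarts": "10"}, ["a.pdb", "b.pdb"])@ is accepted, whereas passing the same two (placeholder) files to a "Transformation" raises a @ValueError@.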

[[Image(ProteinMultiverseDataBase6.png)]]

* There are multiple (similarity comparison) _Methods_: e.g. USM, [[MaxCMO]], DaliLite, ...
* Each _Method_ is executed with a specific _!ParameterSet_, which is a combination of different _Parameters_ with their values: e.g. [[MaxCMO]]/restarts/10, USM/compressor/bzip2, ...
* If a _Method_ does not accept any _Parameters_, the _!ParameterSet_ still exists but is empty: e.g. DaliLite, CE, ...
* Each _Method_ produces multiple similarity _Measures_: e.g. DaliLite/Z, FAST/Z, [[MaxCMO]]/Overlap, ...

* Each _Structure_ is uniquely determined by its PDB code, model and chain. (Domains are not taken into account yet.) The location of the PDB file is stored, together with a link to a _Container_ file that holds further information in XML format: e.g. sequence, secondary structure, experimental resolution, ...
* Each _Structure_ is extended by further classification information from _CATH_ and _SCOP_ in separate relations.
* Multiple _Structures_ can be grouped together into _!DataSets_, which are needed for _Aggregations_.

* The location of the _Containers_ in which results are stored can be found in the _Transformations_, _Comparisons_, and _Aggregations_ relations, respectively.
* Additionally, similarity values from _Comparisons_ are stored directly in the database for quicker access. Alignments could be accessed in the same way as soon as a standardised format has been defined.
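Two of the points above translate directly into relational constraints: a structure is identified by PDB code, model and chain, and similarity values sit next to a pointer to the external container. The following is only an illustrative sketch using Python's built-in @sqlite3@ module; all table and column names are assumptions and are not taken from the diagram, and parameter sets, data sets and the CATH/SCOP relations are left out for brevity.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not the real design.
SCHEMA = """
CREATE TABLE structures (
    id        INTEGER PRIMARY KEY,
    pdb_code  TEXT    NOT NULL,
    model     INTEGER NOT NULL,
    chain     TEXT    NOT NULL,
    pdb_file  TEXT    NOT NULL,   -- location of the PDB file
    container TEXT,               -- location of the XML container file
    UNIQUE (pdb_code, model, chain)
);
CREATE TABLE comparisons (
    id          INTEGER PRIMARY KEY,
    structure_a INTEGER NOT NULL REFERENCES structures(id),
    structure_b INTEGER NOT NULL REFERENCES structures(id),
    method      TEXT    NOT NULL,
    measure     TEXT    NOT NULL,
    value       REAL    NOT NULL, -- similarity value kept in the database
    container   TEXT              -- where alignments etc. are stored
);
"""


def open_sketch_db() -> sqlite3.Connection:
    """Create an in-memory database holding the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

With this layout, inserting the same (PDB code, model, chain) triple twice fails with @sqlite3.IntegrityError@, and a similarity value can be read with a plain @SELECT@ without opening any container file.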

Note that this design does not allow _!DataSets_ to comprise files other than _Structures_, although some of the _Results_ need to be grouped into a _!DataSet_, too. [[br]]
+Example+: Contact maps that have been produced by a _Transformation_ of _Structures_ and that are available from within the _Containers_ need to form a _!DataSet_ in order to act as input for _Comparisons_ with the USM or [[MaxCMO]] _Methods_.

h3. Storing Further Information and Results Externally

Similarity values are stored directly in the relational database. All further information regarding one structure (e.g. sequence, resolution, ...) or regarding a pair of structures (e.g. alignment, rotation/translation matrices, ...) is stored in external files. [[br]]
For storing further information for _single structures_, there are several approaches:
* All information in one file: the file becomes too big
* All information in separate files grouped by the protein structure

For storing further information for _pairs of structures_, there are several approaches:
* All information in separate files grouped by methods: files too big
* All information in separate files grouped by pairs: too many files
* All information in separate files grouped by the first structure: files with unbalanced sizes
* All information in separate files with fixed size: [[br]]
   A "bin-packing" algorithm decides where to put new information, and opens a new "bin" if necessary. "Bins" must be balanced from time to time in order to keep retrieval fast.
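The fixed-size approach can be illustrated with a small sketch. The page does not say which bin-packing strategy is intended, so the code below assumes the simple first-fit heuristic; the class name @BinStore@, the record keys and the byte sizes are all hypothetical, and the periodic rebalancing step is not shown.

```python
from typing import Dict, List


class BinStore:
    """First-fit placement of records into fixed-capacity bins (files)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.used: List[int] = []        # space used in each bin
        self.index: Dict[str, int] = {}  # record key -> bin number

    def add(self, key: str, size: int) -> int:
        """Place a record and return the number of the bin that holds it."""
        if key in self.index:
            raise KeyError(f"record already stored: {key}")
        if size > self.capacity:
            raise ValueError("record is larger than a whole bin")
        # First fit: take the first existing bin with enough free space.
        for bin_no, used in enumerate(self.used):
            if used + size <= self.capacity:
                self.used[bin_no] += size
                self.index[key] = bin_no
                return bin_no
        # No existing bin fits, so open a new one.
        self.used.append(size)
        self.index[key] = len(self.used) - 1
        return self.index[key]
```

The @index@ dictionary plays the role that a database column would play in practice: it records which bin holds the information for a given pair, so that retrieval does not have to scan every file.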

h2. Extended Database Design for (dynamic) Management of Experiments (ProCKSI)

This has not been modelled yet, but the database for the (static) Protein Multiverse was designed with the [[ProCKSI]] integration in mind.

Some remarks:
* _Experiments_ (formerly _Requests_) apply _Methods_ to a _!DataSet_ with a certain _!ParameterSet_.
* _Packages_ (formerly _Jobs_) deal with a subset of a _!DataSet_ and a subset of the requested _Methods_, partitioning the 3D problem space, and are calculated using [[ProCKSI]]'s "stand-alone" core application "in one go". If they are sent to a queuing system, they become a _Job_ there.
* It remains to be discussed whether a _Tasks_ relation is still needed in the database; _Tasks_ have always been closer to _!RequestMethods_.
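The partitioning of an experiment into packages can be sketched as follows. This is a hedged illustration only: it assumes that the three axes of the problem space are structures x structures x methods, that every ordered pair of structures is compared, and that fixed block sizes are used. The function names and block-size parameters are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

Block = Tuple[str, ...]
Package = Tuple[Block, Block, Block]  # (row structures, column structures, methods)


def chunks(items: Sequence[str], size: int) -> List[Block]:
    """Split a sequence into consecutive blocks of at most `size` items."""
    if size <= 0:
        raise ValueError("block size must be positive")
    return [tuple(items[i:i + size]) for i in range(0, len(items), size)]


def make_packages(dataset: Sequence[str], methods: Sequence[str],
                  structures_per_block: int,
                  methods_per_block: int) -> List[Package]:
    """Cut the structures x structures x methods cube into packages."""
    structure_blocks = chunks(dataset, structures_per_block)
    method_blocks = chunks(methods, methods_per_block)
    # One package per combination of row block, column block and method block.
    return list(product(structure_blocks, structure_blocks, method_blocks))
```

Each resulting package is small enough to be run "in one go", and the packages together cover every (structure, structure, method) combination exactly once.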