Paweł Widera, 08/27/2013 03:36 AM
Files re-attached

h1. Data Storage

This page describes the design of the database that is (or will be) used to store all necessary pieces of information obtained from the "stand-alone" ProCKSI/Comparison and ProCKSI/Consensus applications (see [[DataStandardisation]]).

h2. Database Design for the (static) Protein Multiverse

The database stores results from _Transformations_, _Comparisons_ and _Aggregations_:
* A _Transformation_ is a process that derives ONE (main) _Result_ from ONE single input file.
  +Example+: The transformation of a _Structure_, _Tree_, _SimilarityMatrix_, etc., using a certain _Method_ with a certain _ParameterSet_, produces a contact map, a tree, ...
* A _Comparison_ is a process that derives ONE (main) _Result_ from TWO input files.
  +Example+: The comparison of _Structures_, _Trees_, etc., using a _Method_ with a certain _ParameterSet_, produces a similarity value and an alignment.
* An _Aggregation_ is a process that derives ONE (main) _Result_ from SEVERAL input files that are grouped together into _DataSets_.
  +Example+: The aggregation of _SimilarityMatrices_, _Trees_, etc., using a _Method_ with a certain _ParameterSet_, produces a consensus similarity matrix, a consensus tree, ...
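The three process types differ mainly in how many input files they accept. The following is a minimal sketch of that rule, not part of ProCKSI itself: the class name @Process@, the @ARITY@ table and the file names are hypothetical, and the lower bound of two inputs for an _Aggregation_ is an assumption about what "SEVERAL" means.

```python
from dataclasses import dataclass

# Hypothetical arity table: (minimum, maximum) number of input files.
# None means "no upper bound". The minimum of 2 for an Aggregation is
# an assumption; the design only says "SEVERAL" input files.
ARITY = {
    "Transformation": (1, 1),
    "Comparison": (2, 2),
    "Aggregation": (2, None),
}


@dataclass(frozen=True)
class Process:
    kind: str          # "Transformation", "Comparison" or "Aggregation"
    method: str        # e.g. "USM"
    parameters: tuple  # ParameterSet as (name, value) pairs, may be empty
    inputs: tuple      # input files

    def __post_init__(self):
        # Reject a process whose number of inputs does not match its kind.
        low, high = ARITY[self.kind]
        count = len(self.inputs)
        if count < low or (high is not None and count > high):
            raise ValueError(
                f"{self.kind} does not accept {count} input file(s)"
            )
```

For instance, @Process("Comparison", "USM", (("compressor", "bzip2"),), ("a.pdb", "b.pdb"))@ is accepted, while a _Transformation_ with two inputs raises a @ValueError@.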

!ProteinMultiverseDataBase-6.png!

* There are multiple (similarity comparison) _Methods_: e.g. USM, MaxCMO, DaliLite, ...
* Each _Method_ is executed with a specific _ParameterSet_, which is a combination of different _Parameters_ with their values: e.g. MaxCMO/restarts/10, USM/compressor/bzip2, ...
* If a _Method_ does not accept any _Parameters_, the _ParameterSet_ still exists but is empty: e.g. DaliLite, CE, ...
* Each _Method_ produces multiple similarity _Measures_: e.g. DaliLite/Z, FAST/Z, MaxCMO/Overlap, ...

* Each _Structure_ is uniquely determined by its PDB code, model and chain. (Domains are not taken into account yet.) The location of the PDB file is given, together with a link to a _Container_ file that holds further information in XML format: e.g. sequence, secondary structure, experimental resolution, ...
* Each _Structure_ is extended by further classification information from _CATH_ and _SCOP_ in separate relations.
* Multiple _Structures_ can be grouped together into _DataSets_, which are needed for _Aggregations_.

* The location of the _Containers_ in which results are stored can be found in the _Transformations_, _Comparisons_, and _Aggregations_ relations, respectively.
* Additionally, similarity values from _Comparisons_ are stored directly in the database for quicker access. Alignments could be accessed in the same way, as soon as a standardised format has been defined.
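To make the relations listed above more concrete, here is a rough sketch using Python's built-in @sqlite3@ module. It is not the actual ProCKSI schema: all table and column names, the file paths, the PDB codes @1abc@ and @2xyz@ and the similarity value are made up for illustration, and only the _Comparisons_ part of the design is covered.

```python
import sqlite3

# Hypothetical schema; every table and column name is an assumption.
SCHEMA = """
CREATE TABLE methods (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE parameter_sets (            -- may have no parameters at all
    id        INTEGER PRIMARY KEY,
    method_id INTEGER NOT NULL REFERENCES methods(id)
);
CREATE TABLE parameters (
    parameter_set_id INTEGER NOT NULL REFERENCES parameter_sets(id),
    name  TEXT NOT NULL,
    value TEXT NOT NULL,
    PRIMARY KEY (parameter_set_id, name)
);
CREATE TABLE measures (
    id        INTEGER PRIMARY KEY,
    method_id INTEGER NOT NULL REFERENCES methods(id),
    name      TEXT NOT NULL,
    UNIQUE (method_id, name)
);
CREATE TABLE structures (
    id             INTEGER PRIMARY KEY,
    pdb_code       TEXT NOT NULL,
    model          INTEGER NOT NULL,
    chain          TEXT NOT NULL,
    pdb_path       TEXT NOT NULL,
    container_path TEXT,                 -- XML container with further data
    UNIQUE (pdb_code, model, chain)      -- a Structure is identified by all three
);
CREATE TABLE comparisons (
    id               INTEGER PRIMARY KEY,
    structure_a      INTEGER NOT NULL REFERENCES structures(id),
    structure_b      INTEGER NOT NULL REFERENCES structures(id),
    parameter_set_id INTEGER NOT NULL REFERENCES parameter_sets(id),
    container_path   TEXT NOT NULL       -- where the full result is stored
);
CREATE TABLE similarities (              -- values kept in the database for quick access
    comparison_id INTEGER NOT NULL REFERENCES comparisons(id),
    measure_id    INTEGER NOT NULL REFERENCES measures(id),
    value         REAL NOT NULL,
    PRIMARY KEY (comparison_id, measure_id)
);
"""


def build_demo():
    """Create an in-memory database holding one made-up comparison."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO methods (id, name) VALUES (1, 'MaxCMO')")
    conn.execute("INSERT INTO parameter_sets (id, method_id) VALUES (1, 1)")
    conn.execute("INSERT INTO parameters VALUES (1, 'restarts', '10')")
    conn.execute("INSERT INTO measures (id, method_id, name) VALUES (1, 1, 'Overlap')")
    conn.executemany(
        "INSERT INTO structures (id, pdb_code, model, chain, pdb_path) "
        "VALUES (?, ?, ?, ?, ?)",
        [(1, "1abc", 1, "A", "pdb/1abc.pdb"), (2, "2xyz", 1, "A", "pdb/2xyz.pdb")],
    )
    conn.execute("INSERT INTO comparisons VALUES (1, 1, 2, 1, 'containers/cmp_1.xml')")
    conn.execute("INSERT INTO similarities VALUES (1, 1, 0.5)")
    return conn


def similarity(conn, pdb_a, pdb_b, measure):
    """Look up a stored similarity value without opening any container.

    Simplification: structures are matched by PDB code only, ignoring
    model and chain.
    """
    row = conn.execute(
        """
        SELECT s.value
        FROM similarities s
        JOIN comparisons c ON c.id = s.comparison_id
        JOIN measures m    ON m.id = s.measure_id
        JOIN structures a  ON a.id = c.structure_a
        JOIN structures b  ON b.id = c.structure_b
        WHERE a.pdb_code = ? AND b.pdb_code = ? AND m.name = ?
        """,
        (pdb_a, pdb_b, measure),
    ).fetchone()
    return None if row is None else row[0]
```

In this sketch a _ParameterSet_ without _Parameters_ is simply a row in @parameter_sets@ with no matching rows in @parameters@, which mirrors the "exists but is empty" rule for methods such as DaliLite.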

Note that this design does not allow _DataSets_ to contain files other than _Structures_, although some of the _Results_ need to be grouped into a _DataSet_, too.
+Example+: Contact maps that have been produced by a _Transformation_ of _Structures_ and that are available from within the _Containers_ need to form a _DataSet_ in order to act as input for the _Comparisons_ with the USM or MaxCMO _Methods_.

h3. Storing Further Information and Results Externally

Similarity values are stored directly in the relational database. All further information regarding one structure (e.g. sequence, resolution, ...) or regarding a pair of structures (e.g. alignment, rotation/translation matrices, ...) is stored in external files.
For storing further information for _single structures_, there are several approaches:
* All information in one file: file too big
* All information in separate files grouped by the protein structure

For storing further information for _pairs of structures_, there are several approaches:
* All information in separate files grouped by methods: files too big
* All information in separate files grouped by pairs: too many files
* All information in separate files grouped by the first structure: files with unbalanced sizes
* All information in separate files with fixed size:
  A "bin-packing" algorithm decides where to put new information, and opens a new "bin" if necessary. "Bins" must be rebalanced from time to time in order to provide fast retrieval of information.
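The fixed-size approach could look roughly like the sketch below. The page does not say which bin-packing strategy is intended, so the first-fit heuristic used here is an assumption, as are the names @Bin@ and @place@ and the idea of measuring records by a size in arbitrary units. Rebalancing of bins is not covered.

```python
from dataclasses import dataclass, field


@dataclass
class Bin:
    """One fixed-size external file (hypothetical representation)."""
    capacity: int
    used: int = 0
    records: dict = field(default_factory=dict)  # record key -> record size


def place(bins, key, size, capacity):
    """First-fit placement: use the first bin with enough free space,
    and open a new bin if none fits. Returns the index of the chosen bin."""
    if size > capacity:
        raise ValueError("record is larger than a whole bin")
    for index, current in enumerate(bins):
        if current.capacity - current.used >= size:
            current.records[key] = size
            current.used += size
            return index
    new_bin = Bin(capacity=capacity, used=size, records={key: size})
    bins.append(new_bin)
    return len(bins) - 1
```

The returned bin index is what a lookup table would need to store per pair of structures, so that a record can later be found without scanning every file.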

h2. Extended Database Design for (dynamic) Management of Experiments (ProCKSI)

This has not been modelled yet, but the database for the (static) Protein Multiverse was designed with the ProCKSI integration in mind.

Some remarks:
* _Experiments_ (formerly _Requests_) apply _Methods_ to a _DataSet_ with a certain _ParameterSet_.
* _Packages_ (formerly _Jobs_) deal with a subset of a _DataSet_ and a subset of the requested _Methods_, partitioning the 3D problem space, and are calculated using ProCKSI's "stand-alone" core application "in one go". If they are sent to a queuing system, they become a _Job_ there.
* It has to be discussed whether a _Tasks_ relation is still needed in the database, since _Tasks_ have always been more like _!RequestMethods_.
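The partitioning into _Packages_ mentioned in the remarks can be illustrated with a small sketch. It assumes that the "3D problem space" means structures x structures x methods and that it is cut into rectangular blocks; the function names, the block sizes and the dictionary layout of a package are all hypothetical.

```python
from itertools import product


def chunks(items, size):
    """Split a list into consecutive blocks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def make_packages(structures, methods, structures_per_block, methods_per_block):
    """Cut the structures x structures x methods space into packages.

    Each package covers one block of row structures, one block of column
    structures and one block of methods, so it can be computed in one go.
    """
    structure_blocks = chunks(list(structures), structures_per_block)
    method_blocks = chunks(list(methods), methods_per_block)
    return [
        {"rows": rows, "cols": cols, "methods": block}
        for rows, cols, block in product(
            structure_blocks, structure_blocks, method_blocks
        )
    ]
```

With four structures, two methods, two structures per block and one method per block, this produces eight packages that together cover every (structure, structure, method) combination exactly once.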