DataStorage » History » Version 16
Paweł Widera, 08/27/2013 03:36 AM
Files re-attached
1 | 15 | Anonymous | h1. Data Storage |
---|---|---|---|
2 | 1 | Anonymous | |
3 | 16 | Paweł Widera | This page describes the design of the database that is (or will be) used to store all necessary pieces of information obtained from the "stand-alone" ProCKSI/Comparison and ProCKSI/Consensus applications (see [[DataStandardisation]]). |
4 | 1 | Anonymous | |
5 | 1 | Anonymous | h2. Database Design for the (static) Protein Multiverse |
6 | 1 | Anonymous | |
7 | 1 | Anonymous | The database stores results from _Transformations_, _Comparisons_ and _Aggregations_: |
8 | 16 | Paweł Widera | * A _Transformation_ is a process that derives ONE (main) _Result_ from ONE single input file. |
9 | 16 | Paweł Widera | +Example+: The transformation of a _Structure_, _Tree_, _SimilarityMatrix_, etc., using a certain _Method_ with a certain _ParameterSet_, produces a contact map, a tree, ... |
10 | 16 | Paweł Widera | * A _Comparison_ is a process that derives ONE (main) _Result_ from TWO input files. |
11 | 16 | Paweł Widera | +Example+: The comparison of _Structures_, _Trees_, etc., using a _Method_ with a certain _ParameterSet_, produces a similarity value and an alignment. |
12 | 16 | Paweł Widera | * An _Aggregation_ is a process that derives ONE (main) _Result_ from SEVERAL input files that are grouped together into _DataSets_. |
13 | 16 | Paweł Widera | +Example+: The aggregation of _SimilarityMatrices_, _Trees_, using a _Method_ with a certain _ParameterSet_, produces a consensus similarity matrix, a consensus tree, ... |
14 | 1 | Anonymous | |
15 | 16 | Paweł Widera | !ProteinMultiverseDataBase-6.png! |
16 | 1 | Anonymous | |
17 | 16 | Paweł Widera | * There are multiple (similarity comparison) _Methods_: e.g. USM, MaxCMO, DaliLite, ... |
18 | 16 | Paweł Widera | * Each _Method_ is executed with a specific _ParameterSet_, which is a combination of different _Parameters_ with their values: e.g. MaxCMO/restarts/10, USM/compressor/bzip2, ... |
19 | 16 | Paweł Widera | * If a _Method_ does not accept any _Parameters_, the _ParameterSet_ does exist but is empty; e.g. DaliLite, CE, ... |
20 | 16 | Paweł Widera | * Each _Method_ produces multiple similarity _Measures_: e.g. DaliLite/Z, FAST/Z, MaxCMO/Overlap, ... |
21 | 15 | Anonymous | |
22 | 15 | Anonymous | * Each _Structure_ is uniquely determined by its PDB code, model and chain. (Domains are not taken into account yet.) For each _Structure_, the database records the location of the PDB file and a link to a _Container_ file that holds further information in XML format: e.g. sequence, secondary structure, experimental resolution, ... |
23 | 1 | Anonymous | * Each _Structure_ is extended by further classification information from _CATH_ and _SCOP_ in separate relations. |
24 | 16 | Paweł Widera | * Multiple _Structures_ can be grouped together into _DataSets_, which are needed for _Aggregations_. |
25 | 15 | Anonymous | |
26 | 1 | Anonymous | * The location of the _Containers_ in which results are stored can be found in the _Transformations_, _Comparisons_, and _Aggregations_ relations, respectively. |
27 | 15 | Anonymous | * Additionally, similarity values from _Comparisons_ are stored directly in the database for quicker access. Alignments could be accessed in the same way, as soon as a standardised format has been defined. |
28 | 1 | Anonymous | |
29 | 16 | Paweł Widera | Note that this design does not allow _DataSets_ to comprise files other than _Structures_, although some of the _Results_ need to be grouped into a _DataSet_, too. |
30 | 16 | Paweł Widera | +Example+: Contact maps that have been produced by a _Transformation_ of _Structures_ and that are available from within the _Containers_ need to form a _DataSet_ in order to act as input for the _Comparisons_ with the USM or MaxCMO _Methods_. |
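The relations described above can be sketched as a small SQLite schema. This is only an illustration under stated assumptions: every table and column name below is hypothetical (the actual schema is the one shown in the diagram), and the sketch covers only _Structures_, _Methods_, _ParameterSets_, _Measures_, _Comparisons_ and their similarity values. _Transformations_, _Aggregations_, _DataSets_ and the _CATH_/_SCOP_ relations are left out for brevity.

```python
import sqlite3

# Hypothetical table and column names; not the project's actual schema.
SCHEMA = """
CREATE TABLE structures (
    id             INTEGER PRIMARY KEY,
    pdb_code       TEXT NOT NULL,
    model          INTEGER NOT NULL,
    chain          TEXT NOT NULL,
    pdb_path       TEXT,              -- location of the PDB file
    container_path TEXT,              -- XML container with further information
    UNIQUE (pdb_code, model, chain)   -- a structure is identified by these three
);
CREATE TABLE methods (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- A parameter set always exists; it is "empty" if it has no parameter rows.
CREATE TABLE parameter_sets (
    id        INTEGER PRIMARY KEY,
    method_id INTEGER NOT NULL REFERENCES methods(id)
);
CREATE TABLE parameters (
    parameter_set_id INTEGER NOT NULL REFERENCES parameter_sets(id),
    name             TEXT NOT NULL,
    value            TEXT NOT NULL,
    PRIMARY KEY (parameter_set_id, name)
);
CREATE TABLE measures (
    id        INTEGER PRIMARY KEY,
    method_id INTEGER NOT NULL REFERENCES methods(id),
    name      TEXT NOT NULL,
    UNIQUE (method_id, name)
);
CREATE TABLE comparisons (
    id               INTEGER PRIMARY KEY,
    structure_a      INTEGER NOT NULL REFERENCES structures(id),
    structure_b      INTEGER NOT NULL REFERENCES structures(id),
    parameter_set_id INTEGER NOT NULL REFERENCES parameter_sets(id),
    container_path   TEXT             -- external file with alignment etc.
);
-- Similarity values are kept in the database itself for quick access.
CREATE TABLE similarities (
    comparison_id INTEGER NOT NULL REFERENCES comparisons(id),
    measure_id    INTEGER NOT NULL REFERENCES measures(id),
    value         REAL NOT NULL,
    PRIMARY KEY (comparison_id, measure_id)
);
"""


def create_database(path=":memory:"):
    """Open a database at `path` and create the sketch schema in it."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def similarity(conn, comparison_id, measure_name):
    """Return one stored similarity value, or None if it is not present."""
    row = conn.execute(
        "SELECT s.value FROM similarities s "
        "JOIN measures m ON m.id = s.measure_id "
        "WHERE s.comparison_id = ? AND m.name = ?",
        (comparison_id, measure_name),
    ).fetchone()
    return row[0] if row else None
```

In this sketch a method without parameters, such as DaliLite, gets a row in @parameter_sets@ and no rows in @parameters@, which mirrors the rule that the _ParameterSet_ exists but is empty.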
31 | 15 | Anonymous | |
32 | 15 | Anonymous | h3. Storing Further Information and Results Externally |
33 | 15 | Anonymous | |
34 | 16 | Paweł Widera | Similarity values are stored directly in the relational database. All further information regarding one structure (e.g. sequence, resolution, ...) or regarding a pair of structures (e.g. alignment, rotation/translation matrices, ...) is stored in external files. |
35 | 1 | Anonymous | For storing further information for _single structures_, there are several approaches: |
36 | 15 | Anonymous | * All information in one file: file too big |
37 | 15 | Anonymous | * All information in separate files grouped by the protein structure |
38 | 15 | Anonymous | |
39 | 15 | Anonymous | For storing further information for _pairs of structures_, there are several approaches: |
40 | 15 | Anonymous | * All information in separate files grouped by methods: files too big |
41 | 1 | Anonymous | * All information in separate files grouped by pairs: too many files |
42 | 1 | Anonymous | * All information in separate files grouped by the first structure: files with unbalanced sizes |
43 | 16 | Paweł Widera | * All information in separate files with fixed size: |
44 | 16 | Paweł Widera | A "bin-packing" algorithm decides where to put new information and opens a new "bin" if necessary. "Bins" must be rebalanced from time to time in order to keep the retrieval of information fast. |
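The fixed-size option can be illustrated with a first-fit placement rule. This is a minimal sketch under stated assumptions: the page does not say which bin-packing heuristic is intended, so first-fit is used here only as a simple stand-in, and the class name @BinStore@, its methods and the size unit are all hypothetical. The periodic rebalancing mentioned above is not implemented.

```python
class BinStore:
    """Places records into fixed-capacity bins using first-fit (illustrative)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.bins = []    # one list of (key, size) tuples per bin (i.e. per file)
        self.index = {}   # key -> bin number, so a record can be located quickly

    def used(self, bin_no):
        """Total size already stored in the given bin."""
        return sum(size for _, size in self.bins[bin_no])

    def add(self, key, size):
        """Store a record in the first bin with room; open a new bin if none has."""
        if size > self.capacity:
            raise ValueError("record is larger than a whole bin")
        if key in self.index:
            raise KeyError("record already stored: %r" % (key,))
        for bin_no in range(len(self.bins)):
            if self.used(bin_no) + size <= self.capacity:
                break
        else:
            # No existing bin has enough free space: open a new one.
            self.bins.append([])
            bin_no = len(self.bins) - 1
        self.bins[bin_no].append((key, size))
        self.index[key] = bin_no
        return bin_no
```

A key could be a pair of structure identifiers, for example @store.add(("1abc", "2xyz"), 60)@; the @index@ dictionary then tells which file holds the information for that pair without scanning every bin.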
45 | 15 | Anonymous | |
46 | 1 | Anonymous | h2. Extended Database Design for (dynamic) Management of Experiments (ProCKSI) |
47 | 15 | Anonymous | |
48 | 16 | Paweł Widera | This has not been modelled yet, but the database for the (static) Protein Multiverse was designed with the ProCKSI integration in mind. |
49 | 15 | Anonymous | |
50 | 1 | Anonymous | Some remarks: |
51 | 16 | Paweł Widera | * _Experiments_ (formerly _Requests_) apply _Methods_ to a _DataSet_ with a certain _ParameterSet_. |
52 | 16 | Paweł Widera | * _Packages_ (formerly _Jobs_) deal with a subset of a _DataSet_ and a subset of the requested _Methods_, partitioning the 3D problem space, and are calculated using ProCKSI's "stand-alone" core application "in one go". If they are sent to a queuing system, they become a _Job_ there. |
53 | 15 | Anonymous | * It has to be discussed whether a _Tasks_ relation is still needed in the database, since tasks have always been rather _!RequestMethods_. |
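The partitioning into _Packages_ mentioned in the remarks above can be sketched as follows. This is an illustration only, since this part has not been modelled yet. It assumes that the three axes of the problem space are the first structure, the second structure and the method, and that the full all-against-all cube is computed; the function names and block sizes are hypothetical.

```python
from itertools import product


def chunks(items, size):
    """Split a list into consecutive blocks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def make_packages(structures, methods, structures_per_block, methods_per_block):
    """Cut the (structure x structure x method) cube into box-shaped packages."""
    s_blocks = chunks(structures, structures_per_block)
    m_blocks = chunks(methods, methods_per_block)
    return [
        {"rows": rows, "cols": cols, "methods": meths}
        for rows, cols, meths in product(s_blocks, s_blocks, m_blocks)
    ]


def tasks_in(package):
    """List every (first structure, second structure, method) cell of a package."""
    return [
        (a, b, m)
        for a in package["rows"]
        for b in package["cols"]
        for m in package["methods"]
    ]
```

Each package is a box of the cube that the core application could process "in one go"; together the boxes cover every cell exactly once, so no comparison is computed twice or missed.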