DataManagement » History » Version 7

Paweł Widera, 12/09/2013 02:48 PM
Newline formatting corrected.

h1. Data Management and Representation

h2. Definitions

* *User* (to be implemented in the future):
*** Represented by a unique email address
*** Must authenticate to gain access to user data (requests, personalised settings)
*** Can manage multiple requests

* *Request*:
*** Unique handle for the combination of a dataset, tasks, and request parameters
*** Request parameters: e.g. request description, settings for notification by email

* *Task*:
*** Something to be performed with the given dataset

     e.g. calculation of PDB structure pictures (1D), or comparison of pairs of proteins with a given similarity method (2D)
*** Task parameters: e.g. parameters for each comparison method, output parameters for picture generation, ...

* *Job*:
*** Everything that lives in a queue

     e.g. local queue (ProCKSI cluster), remote queue (University cluster), external queue (web service, grid)
*** Currently, a job is equal to a task:

     e.g. task = pairwise comparison of the proteins in the _entire_ dataset with _several_ given similarity methods

     jobs = _separate_ jobs, each calculating all pairwise comparisons of the entire dataset with _one_ similarity method
*** Future plans:

     Divide the 3D problem space into subsets of datasets and methods, each subset being an independent job

     See the next section for further details on the _3D Problem Space_

* *Dataset*:
*** Currently: Collection of PDB structures, previously calculated similarity matrices
*** Future plans: Previously calculated similarity matrices should be uploaded in a post-processing step, not in a pre-processing step (#28)

* *Results*:
*** Currently: Entire similarity matrices from different sources
*** Future plans: Generate similarity matrices directly from single pairwise comparison results stored in the database
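
The definitions above can be read as a small data model. The following is only a minimal sketch in Python, not the actual ProCKSI schema: the class names follow the terms defined above, but every field name (@handle@, @structures@, @matrices@, @parameters@, @status@) and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Something to be performed with the dataset, e.g. one similarity method."""
    name: str
    parameters: dict = field(default_factory=dict)  # hypothetical task parameters
    status: str = "P"                               # P = prepared


@dataclass
class Dataset:
    """Collection of PDB structures plus previously calculated matrices."""
    structures: list = field(default_factory=list)
    matrices: list = field(default_factory=list)


@dataclass
class Request:
    """Unique handle for the combination of a dataset, tasks, and parameters."""
    handle: str
    dataset: Dataset
    tasks: list = field(default_factory=list)
    parameters: dict = field(default_factory=dict)


# Hypothetical example values, for illustration only.
request = Request(
    handle="req-0001",
    dataset=Dataset(structures=["structure_a.pdb", "structure_b.pdb"]),
    tasks=[Task("USM"), Task("MaxCMO")],
    parameters={"description": "demo request", "notify_by_email": True},
)
```

A _User_ would own several such requests once user management is implemented; it is left out here because it is not yet part of the system.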

h2. The 3D Problem and Solution Spaces

* *Problem Space*:
   The problem space for an all-against-all comparison of a dataset of P protein structures using M different similarity comparison methods can be represented as a 3D cube:

   x: Dataset: list of proteins

   y: Dataset: list of proteins

   z: Tasks: list of similarity comparison methods
* *Partitioning the Problem Space*:

   For the most efficient calculation of all cells, the 3D problem space can be subdivided into sub-cubes, which are called jobs when placed into the queue of a queuing system. Examples:

   a. Comparison of _one pair of proteins_ using _one method_ in the task list => PxPxM jobs, each performing 1 comparison
   b. All-against-all comparison of the _entire dataset_ with _one method_ => M jobs, each performing PxP comparisons
   c. Comparison of _one pair of proteins_ using _all methods_ in the task list => PxP jobs, each performing M comparisons
   d. Intelligent partitioning of the 3D problem space, comparing a subset of proteins with a subset of methods
* *Solution Space*:

   Each similarity comparison _method_ can provide several similarity _measures_.

   For one slice in the 3D problem space using one particular method, we might get several slices in the 3D solution space providing several measures.
* *Special Cases*:

   The 3D problem space is reduced to a 2D problem space (1xPxM) when using methods that do not compare pairs of proteins but work on a single protein, e.g. calculating the PDB picture or getting additional data from the iHOP web service.
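
The partitioning schemes a, b, and c listed above can be sketched as follows. This is an illustrative Python sketch only, not the scheduler's code: the function name @partition_jobs@, the scheme labels, and the protein and method names are assumptions. A job is modelled simply as a list of (protein, protein, method) cells of the cube.

```python
from itertools import product


def partition_jobs(proteins, methods, scheme):
    """Split the P x P x M problem space into jobs (lists of cube cells)."""
    # All-against-all comparison: P x P ordered pairs of proteins.
    pairs = list(product(proteins, proteins))
    if scheme == "a":  # one pair, one method per job
        return [[(x, y, m)] for (x, y) in pairs for m in methods]
    if scheme == "b":  # entire dataset, one method per job
        return [[(x, y, m) for (x, y) in pairs] for m in methods]
    if scheme == "c":  # one pair, all methods per job
        return [[(x, y, m) for m in methods] for (x, y) in pairs]
    raise ValueError(f"unknown scheme: {scheme!r}")


proteins = ["p1", "p2", "p3"]   # P = 3, hypothetical names
methods = ["USM", "MaxCMO"]     # M = 2

jobs_a = partition_jobs(proteins, methods, "a")  # PxPxM jobs of 1 comparison
jobs_b = partition_jobs(proteins, methods, "b")  # M jobs of PxP comparisons
jobs_c = partition_jobs(proteins, methods, "c")  # PxP jobs of M comparisons
```

Scheme d (intelligent partitioning) is deliberately omitted, since the text does not specify how the subsets of proteins and methods would be chosen.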

h2. Lifecycle of Requests

A new request is submitted in the browser:

* The request is registered in the database; the request parameters, the dataset (structures and matrices), and the user's tasks are added.

   Request Status: P = Prepared

The _cron_ administrative tools check the status of all _tasks_ periodically and set the status of the _request_ accordingly:

* As soon as the first task has been queued (Q) or has progressed further (R, C, F), the request is said to be _running_ and its status is changed in the database.

   Request Status: R = Running
* As soon as the last task has finished successfully (F) or with errors (E), the request is said to have _finished_ and its status is changed in the database. The user gets a notification email if requested.

   Request Status: F = Finished
* If a request has finished and the expiration date (soft limit) has been exceeded, the request is said to have _expired_ and its status is changed in the database. The user gets a notification email if requested.

   Request Status: X = Expired
* If a request has expired and the deletion date (hard limit) has been exceeded, the complete request, including all tasks and data, is deleted from the database and hard disk. The user gets a notification email if requested.
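
The rules above amount to deriving the request status from the task statuses and two dates. The sketch below is a hedged illustration in Python, not the actual _cron_ tool: the function name @request_status@, its parameters, and the @"DELETE"@ marker are hypothetical. It also assumes that the deletion date lies after the expiration date, and that any task status other than P counts as progress.

```python
from datetime import datetime

FINAL_TASK_STATES = {"F", "E"}  # finished successfully / finished with errors


def request_status(task_statuses, now, expires_at, deletes_at):
    """Derive the request status (P, R, F, X) from its task statuses."""
    finished = bool(task_statuses) and all(
        s in FINAL_TASK_STATES for s in task_statuses
    )
    if finished:
        if now > deletes_at:   # hard limit exceeded
            return "DELETE"    # hypothetical marker: remove request and data
        if now > expires_at:   # soft limit exceeded
            return "X"
        return "F"
    if any(s != "P" for s in task_statuses):
        return "R"             # at least one task has been queued or run
    return "P"


# Hypothetical dates for illustration.
expires_at = datetime(2013, 2, 1)
deletes_at = datetime(2013, 3, 1)
status = request_status(["F", "E"], datetime(2013, 2, 15), expires_at, deletes_at)
```

In this example all tasks have finished and the soft limit has passed but the hard limit has not, so the request would be marked as expired.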

h2. Lifecycle of Tasks

*Attention:* Currently, a _Job_ equals a _Task_.

A new request is submitted in the browser:

* All _Tasks_ that the user has selected to be performed are prepared and registered in the database.

   Task Status: P = Prepared

The _cron_ administrative tools (_scheduler_) submit the prepared tasks as jobs to the queuing system.

   Task Status: Q = Queued

   In a future version, the _scheduler_ should analyse all tasks, partition the 3D problem space into sub-cubes, and submit these as jobs to the queuing system.

* If a _Task_ starts, it changes its own status in the database.

   In a future version, the _scheduler_ should check the status of a task/job directly in the PBS queuing system, detect if it has started, and set the status in the database accordingly.

   Task Status: R = Running

* If a _Task_ reaches its end, it changes its own status in the database.

   In a future version, the _scheduler_ should check the status of a task/job directly in the PBS queuing system, detect if it has finished, and set the status in the database accordingly.

   Task Status: C = Completed

* If a _Task_ has been completed, its results are post-processed (e.g. registered in the database), the status of the task is changed in the database, and expiration and deletion dates are set. The user gets a notification email if requested.

   Task Status: F = Finished (successfully)
* If any serious problems have occurred, the status of the task is changed in the database accordingly.

   Task Status: E = Errors
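
The task lifecycle above behaves like a small state machine. The following Python sketch is one possible reading of it, not the system's implementation: the names @ALLOWED@ and @advance@ are hypothetical, and it assumes that the error state E can be reached from any state that is not yet final, which the text does not state explicitly.

```python
# Allowed task status transitions: P -> Q -> R -> C -> F, with E on errors.
ALLOWED = {
    "P": {"Q", "E"},  # prepared  -> queued
    "Q": {"R", "E"},  # queued    -> running
    "R": {"C", "E"},  # running   -> completed
    "C": {"F", "E"},  # completed -> finished (after post-processing)
    "F": set(),       # final: finished successfully
    "E": set(),       # final: errors
}


def advance(current, new):
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal task transition {current} -> {new}")
    return new


# Walk one task through the successful path.
status = "P"
for step in ("Q", "R", "C", "F"):
    status = advance(status, step)
```

Keeping the transitions in one table makes it easy to reject inconsistent updates, whether the task sets its own status or a future scheduler sets it after querying the queuing system.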

h2. Task and Job Dependencies

* Some tasks must have finished successfully before a dependent task can be started:

   e.g. _Contacts_ must have been calculated before the _USM_ and _MaxCMO_ similarities can be calculated
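
A dependency check of this kind could look like the sketch below. It is a minimal Python illustration under stated assumptions: the table @DEPENDS_ON@ and the function @ready_tasks@ are hypothetical names, only the _Contacts_ dependency mentioned above is encoded, and a dependency counts as satisfied only when its status is F (finished successfully).

```python
# Task name -> names of tasks that must have finished successfully first.
DEPENDS_ON = {
    "Contacts": set(),
    "USM": {"Contacts"},
    "MaxCMO": {"Contacts"},
}


def ready_tasks(statuses):
    """Return prepared tasks whose dependencies have all status F."""
    return sorted(
        name
        for name, status in statuses.items()
        if status == "P"
        and all(statuses.get(dep) == "F" for dep in DEPENDS_ON.get(name, ()))
    )


before = ready_tasks({"Contacts": "P", "USM": "P", "MaxCMO": "P"})
after = ready_tasks({"Contacts": "F", "USM": "P", "MaxCMO": "P"})
```

Before _Contacts_ has finished, it is the only task that may be queued; afterwards both _USM_ and _MaxCMO_ become eligible.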