Version 6 - History - DataManagement - ProCKSI - Redmine

Version 5 (Anonymous, 07/16/2007 03:06 PM) → Version 6/7 (Anonymous, 07/16/2007 03:06 PM)

h1. Data Management and Representation

h2. Definitions

* *User* (to be implemented in the future):
** Represented by a unique email address
** Authentication to gain access to user data (requests, personalised settings)
** Manage multiple requests

* *Request*:
** Unique handle for the combination of a dataset, tasks, and request parameters
** Request parameters: e.g. request description, settings for notification by email

* *Task*:
** Something to be performed with the given dataset
e.g. calculation of PDB structure pictures (1D), or comparison of pairs of proteins with a given similarity method (2D)
** Task parameters: e.g. parameters for each comparison method, output parameters for picture generation, ...

* *Job*:
** Everything that lives in a queue
e.g. local queue (ProCKSI cluster), remote queue (University cluster), external queue (web service, grid)
** Currently, a job is equal to a task:
e.g. task = pairwise comparison of the proteins in the _entire_ dataset with _several_ given similarity methods
jobs = _separate_ jobs calculating all pairwise comparisons of the entire dataset with _one_ similarity method
** Future plans:
Divide the 3D problem space into subsets of datasets and methods, each subset being an independent job
See the next section for further details on the _3D Problem Space_

* *Dataset*:
** Currently: Collection of PDB structures, previously calculated similarity matrices
** Future plans: Previously calculated similarity matrices should be uploaded in a post-processing step, not in a pre-processing step (#28)

* *Results*:
** Currently: Entire similarity matrices from different sources
** Future plans: Generate similarity matrices directly from single pairwise comparison results stored in the database
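The definitions above can be sketched as a small data model. This is only an illustration, not the ProCKSI schema: all class names, field names, and the helper @jobs_for_task@ are assumptions chosen to mirror the terms defined above, and the split into one job per similarity method follows the example given under _Job_.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dataset:
    # Currently: PDB structures plus previously calculated similarity matrices.
    structures: List[str] = field(default_factory=list)
    matrices: List[str] = field(default_factory=list)


@dataclass
class Task:
    # Something to be performed with the dataset, e.g. several similarity methods.
    methods: List[str]
    parameters: Dict[str, str] = field(default_factory=dict)


@dataclass
class Job:
    # Everything that lives in a queue (hypothetical queue names: local, remote, external).
    method: str
    queue: str = "local"


@dataclass
class Request:
    # Unique handle for the combination of a dataset, tasks, and request parameters.
    handle: str
    dataset: Dataset
    tasks: List[Task] = field(default_factory=list)
    parameters: Dict[str, str] = field(default_factory=dict)


def jobs_for_task(task: Task, queue: str = "local") -> List[Job]:
    """Split a task into separate jobs, one per similarity method."""
    return [Job(method=method, queue=queue) for method in task.methods]
```

A request with one task that names two methods would then yield two jobs, each covering the entire dataset with a single method.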

h2. The 3D Problem and Solution Spaces

* *Problem Space*:
The problem space for an all-against-all comparison of a dataset of P protein structures using M different similarity comparison methods can be represented as a 3D cube:
x: Dataset: list of proteins
y: Dataset: list of proteins
z: Tasks: list of similarity comparison methods

* *Partitioning the Problem Space*:
For the most efficient calculation of all cells in the 3D problem space, it can be subdivided into sub-cubes, which are called jobs when placed into the queue of a queuing system. Examples:
a. Comparison of _one pair of proteins_ using _one method_ in the task list => PxPxM jobs, each performing 1 comparison
b. All-against-all comparison of the _entire dataset_ with _one method_ => M jobs, each performing PxP comparisons
c. Comparison of _one pair of proteins_ using _all methods_ in the task list => PxP jobs, each performing M comparisons
d. Intelligent partitioning of the 3D problem space, comparing a subset of proteins with a subset of methods

* *Solution Space*:
Each similarity comparison _method_ can provide several similarity _measures_.
For one slice in the 3D problem space using one particular method, we might get several slices in the 3D solution space, one per measure.

* *Special Cases*:
The 3D problem space is reduced to a 2D problem space (1xPxM) when using methods that do not compare pairs of proteins but work on one single protein, e.g. calculating the PDB picture, or getting additional data from the iHOP web service.
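The partitioning strategies a, b, and c listed above can be sketched as follows. This is an illustration only: the function name @partition@ and the representation of a cell as a tuple are assumptions, and strategy d (intelligent partitioning) is deliberately left out because the page does not define it.

```python
from itertools import product
from typing import List, Tuple

# One cell of the problem cube: (protein on x, protein on y, method on z).
Cell = Tuple[str, str, str]


def partition(proteins: List[str], methods: List[str], strategy: str) -> List[List[Cell]]:
    """Split the PxPxM problem cube into jobs (lists of cells)."""
    pairs = list(product(proteins, proteins))
    if strategy == "a":
        # One pair, one method: PxPxM jobs with 1 comparison each.
        return [[(x, y, m)] for (x, y) in pairs for m in methods]
    if strategy == "b":
        # Entire dataset, one method: M jobs with PxP comparisons each.
        return [[(x, y, m) for (x, y) in pairs] for m in methods]
    if strategy == "c":
        # One pair, all methods: PxP jobs with M comparisons each.
        return [[(x, y, m) for m in methods] for (x, y) in pairs]
    raise ValueError(f"unknown strategy: {strategy!r}")
```

For P = 3 proteins and M = 2 methods, the three strategies produce 18, 2, and 9 jobs respectively, and in every case the jobs together cover all 18 cells exactly once.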

h2. Lifecycle of Requests

A new request is submitted in the browser:

* The request is registered in the database, and the request parameters, the dataset (structures and matrices), and the user tasks are added.
Request Status: P = Prepared

The _cron_ administration tools check the status of all _tasks_ periodically and set the status of the _request_ accordingly:

* As soon as the first task has been queued (Q) or has been processed even further (R, C, F), the request is said to be _running_ and the status of the request is changed in the database.
Request Status: R = Running

* As soon as the last task has finished successfully (F) or with errors (E), the request is said to have _finished_ and its status is changed in the database. The user gets a notification email if requested.
Request Status: F = Finished

* If a request has finished and the expiration date (soft limit) has been exceeded, the request is said to have _expired_ and its status is changed in the database. The user gets a notification email if requested.
Request Status: X = Expired

* If a request has expired and the deletion date (hard limit) has been exceeded, the complete request, including all tasks and data, is deleted from the database and the hard disk. The user gets a notification email if requested.
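The rules above for deriving the request status from the task statuses can be sketched as follows. This is a minimal sketch and not the actual _cron_ tool: the function names and parameters are assumptions, and treating a task in status E as "processed further" when other tasks are still pending is also an assumption, since the text only lists Q, R, C, and F for that case.

```python
from datetime import datetime
from typing import Iterable, Optional

FINAL = {"F", "E"}                    # finished successfully or with errors
STARTED = {"Q", "R", "C", "F", "E"}   # queued or processed further (E is an assumption)


def request_status(task_statuses: Iterable[str], now: datetime,
                   expires_at: Optional[datetime] = None) -> str:
    """Derive the request status (P, R, F, or X) from its task statuses."""
    statuses = list(task_statuses)
    if statuses and all(s in FINAL for s in statuses):
        # All tasks are done; the soft limit decides between finished and expired.
        if expires_at is not None and now > expires_at:
            return "X"
        return "F"
    if any(s in STARTED for s in statuses):
        return "R"
    return "P"


def should_delete(status: str, now: datetime,
                  deletes_at: Optional[datetime] = None) -> bool:
    """An expired request is deleted once the hard limit has been exceeded."""
    return status == "X" and deletes_at is not None and now > deletes_at
```

Keeping the deletion check separate from the status function reflects that the text defines no status letter for a deleted request.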

h2. Lifecycle of Tasks

*Attention:* Currently, a _Job_ equals a _Task_.

A new request is submitted in the browser:

* All _Tasks_ that the user has selected to be performed are prepared and registered in the database.
Status of Tasks: P = Prepared

* The _cron_ administration tools (_scheduler_) submit the jobs to the queuing system.
Status of Tasks: Q = Queued
In a future version, the scheduler should analyse all tasks, partition the 3D problem space into sub-cubes, and submit these as jobs to the queuing system.

* If a _Task_ starts, it changes its own status in the database.
In a future version, the _scheduler_ should check the status of a task/job directly in the PBS queuing system, detect if it has started, and set the status in the database accordingly.
Status of Tasks: R = Running

* If a _Task_ reaches its end, it changes its own status in the database.
In a future version, the _scheduler_ should check the status of a task/job directly in the PBS queuing system, detect if it has finished, and set the status in the database accordingly.
Status of Tasks: C = Completed

* If a _Task_ has been completed, its results are post-processed (e.g. registered in the database), the status of the task is changed in the database, and an expiration date and a deletion date are set. The user gets a notification email if requested.
Task Status: F = Finished (successfully)

* If any serious problems have occurred, the status of the task is changed in the database accordingly.
Task Status: E = Errors
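The task lifecycle above reads as a small state machine, which can be sketched as follows. The transition table and the function name @advance@ are assumptions for illustration; in particular, allowing a change to E from every non-final status is an assumption, since the text does not say at which stages errors can be recorded.

```python
from typing import Dict, Set

# Allowed status changes: P(repared) -> Q(ueued) -> R(unning) -> C(ompleted) -> F(inished).
# Assumption: E(rrors) can be reached from any non-final status.
TRANSITIONS: Dict[str, Set[str]] = {
    "P": {"Q", "E"},
    "Q": {"R", "E"},
    "R": {"C", "E"},
    "C": {"F", "E"},
    "F": set(),
    "E": set(),
}


def advance(current: str, new: str) -> str:
    """Return the new task status, or raise ValueError if the change is not allowed."""
    if new not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal task status change: {current} -> {new}")
    return new
```

Checking each change against such a table would catch a task that skips a stage, for example one that moves from P directly to R without having been queued.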

h2. Task and Job Dependencies

* Some tasks must have finished successfully before a dependent task can be started:
e.g. _Contacts_ must have been calculated before _USM_ and _MaxCMO_ similarities can be calculated
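The dependency rule above can be sketched as a check that selects the tasks that are ready to start. The dependency table only encodes the one example given (Contacts before USM and MaxCMO); the function name @startable@ and the use of the task status letters P and F are assumptions for illustration.

```python
from typing import Dict, List, Set

# Task name -> tasks that must have finished successfully (F) first.
DEPENDENCIES: Dict[str, Set[str]] = {
    "Contacts": set(),
    "USM": {"Contacts"},
    "MaxCMO": {"Contacts"},
}


def startable(statuses: Dict[str, str],
              dependencies: Dict[str, Set[str]] = DEPENDENCIES) -> List[str]:
    """Return the prepared tasks whose dependencies have all finished successfully."""
    return sorted(
        task
        for task, status in statuses.items()
        if status == "P"
        and all(statuses.get(dep) == "F" for dep in dependencies.get(task, set()))
    )
```

With all three tasks prepared, only Contacts is returned; once Contacts reaches F, both USM and MaxCMO become startable, and if Contacts ends in E, neither does.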