MidPoint 3.8 and later
Starting with midPoint 3.8, the tasks module provides the following functionality:
- A task can distribute the work to multiple nodes at once (not only to multiple threads as it was up to midPoint 3.7.x).
- A task can be resumed at the place where it was suspended (not always from the beginning as it was up to midPoint 3.7.x).
This is implemented using bucket-based work state management along with configurable task partitioning.
In this article we describe the overall picture and the details of work segmentation definition. Related topics, including task partitioning, are covered in separate articles.
Bucket-based work state management
The work is divided into buckets - abstract chunks of work.
Usually the work consists of iteration over a set of objects: either stored in midPoint repository (e.g. for recomputation task) or stored on a resource (e.g. import, reconciliation or live synchronization). So, the most natural way of segmentation of the work into buckets is by defining a bucket as a set of objects for which a particular item - let us call it discriminator - has a value in a given interval. The interval can be numeric, alphanumeric, or of anything comparable (e.g. timestamps). In the future, OIDs can be used for segmentation as well.
There are other possibilities as well. For example, one could segment users according to employee type, organization membership, and so on. Work buckets can be defined using arbitrary search filter(s) over the set of objects.
Basic bucket state
A major distinction for a work bucket is: is it complete or not? The work bucket is declared complete if there's no work that can be done on it. It does not mean that all the objects were successfully processed, though. Some of them might incur failures; however, this is considered a normal situation and such objects are treated as processed. (Re-processing of such objects can be implemented in the future, if needed.) The other state is ready, meaning that there is some part of the bucket (maybe all of it) that needs to be processed.
Buckets are kept in the task workState data structure. This allows us to track the progress made (at a coarse-grained level) and to restart the work from the last known point if necessary.
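A minimal sketch of how bucket-level state makes a task resumable. The Bucket class and the process_all function are hypothetical names used for illustration only and are not midPoint's API; only the ready and complete states come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Bucket:
    # Hypothetical model of a work bucket; not midPoint's actual data structure.
    sequence: int
    state: str = "ready"  # "ready" or "complete", as described above


def process_all(buckets, handle_bucket):
    """Process every bucket that is still ready; skip those already complete."""
    processed = []
    for bucket in buckets:
        if bucket.state == "complete":
            continue  # finished before the suspension; nothing left to do
        handle_bucket(bucket)
        # "Complete" means no more work remains, even if some objects failed.
        bucket.state = "complete"
        processed.append(bucket.sequence)
    return processed


# Simulate a task resumed after buckets 1 and 2 were already finished.
work_state = [Bucket(1, "complete"), Bucket(2, "complete"), Bucket(3), Bucket(4)]
resumed = process_all(work_state, lambda bucket: None)
```

Here only buckets 3 and 4 are processed on resumption, which is the coarse-grained restart described above.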
Multi-node work distribution
Buckets allow us not only to track the progress, but to easily distribute the work among multiple worker tasks, with the intention of their distribution among cluster nodes.
In such a multi-node scenario there is a coordinator task and worker tasks. The coordinator holds the authoritative list of buckets to be processed. Each worker tries to grab one or more buckets to work on. Such buckets are then copied from the coordinator's workState into the worker's one. To record that they were allocated, their state in the coordinator's workState is marked as delegated. (This is the third possible state besides ready and complete.) After the bucket is processed, it is removed from the worker's workState and marked in the coordinator's workState as complete.
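The state transitions above can be sketched as follows. Coordinator, Worker, grab and mark_complete are hypothetical names chosen for this illustration; the sketch runs in a single process and ignores the concurrency control that a real cluster would need when several workers grab buckets at once.

```python
class Coordinator:
    """Hypothetical holder of the authoritative bucket list (illustration only)."""

    def __init__(self, bucket_ids):
        self.states = {bucket_id: "ready" for bucket_id in bucket_ids}

    def grab(self, count=1):
        """Hand out up to `count` ready buckets and mark them delegated."""
        ready = [b for b, s in self.states.items() if s == "ready"][:count]
        for bucket_id in ready:
            self.states[bucket_id] = "delegated"
        return ready

    def mark_complete(self, bucket_id):
        self.states[bucket_id] = "complete"


class Worker:
    def __init__(self, coordinator):
        self.coordinator = coordinator
        self.buckets = []  # the worker's own copy of its allocated buckets

    def run_once(self, count=1):
        self.buckets = self.coordinator.grab(count)
        for bucket_id in list(self.buckets):
            # ... the objects in the bucket would be processed here ...
            self.buckets.remove(bucket_id)
            self.coordinator.mark_complete(bucket_id)


coordinator = Coordinator([1, 2, 3])
worker = Worker(coordinator)
delegated = coordinator.grab(1)  # bucket 1 is delegated to some other worker
worker.run_once(count=1)         # this worker takes bucket 2 and completes it
```

After these calls the coordinator's view holds one bucket in each of the three states: delegated, complete and ready.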
Minor bucket state a.k.a. bucket progress (future plans)
This is only an idea for future work.
Sometimes we want to be able to track the progress in more detail to avoid needless re-processing of objects from the start of the current bucket to the place where the processing stopped. (This might be crucial for situations where the whole processing consists of a single bucket.)
The most typical way to track in-bucket progress is to:
- sort processed objects by some progress-tracking property (OID, icfs:name, icfs:uid, or basically anything);
- remember last processed object's progress-tracking property value.
Note that object ordering is not required to manage major bucket state. Nor must the property used for bucket segmentation (the discriminator) be the same as the minor progress-tracking property, although they will probably be the same in the majority of cases.
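The two steps listed above can be sketched as follows. Since this feature is described only as a future idea, the function and its parameters (process_bucket, handled, stop_before) are purely hypothetical, and the sketch assumes that the progress-tracking values are unique and comparable.

```python
def process_bucket(objects, key, handled, last_processed=None, stop_before=None):
    """Process objects in key order, skipping those at or below last_processed.

    Returns the key of the last object processed so that a later run can resume.
    `stop_before` simulates a suspension just before the object with that key.
    """
    for obj in sorted(objects, key=key):
        value = key(obj)
        if last_processed is not None and value <= last_processed:
            continue  # already handled in a previous run
        if stop_before is not None and value == stop_before:
            break  # simulated suspension
        handled.append(value)
        last_processed = value  # remember the progress-tracking value
    return last_processed


def by_name(user):
    return user["name"]


users = [{"name": "carol"}, {"name": "alice"}, {"name": "bob"}, {"name": "dave"}]
handled = []
checkpoint = process_bucket(users, by_name, handled, stop_before="carol")
final = process_bucket(users, by_name, handled, last_processed=checkpoint)
```

The second call continues with the first object after the remembered value, so no object is handled twice.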
Configuring work segmentation into buckets
The following structure is used (embedded in task's
Kind of task with respect to the work state management:
Besides these values, there is also
|buckets||How buckets are created, delegated, completed, how they are translated into objects for processing.|
|workers||How workers are created and managed. This is applicable only to tasks of |
|partitions||How subtasks for individual partitions are created and managed. This is applicable only to tasks of |
Bucket segmentation definition
The segmentation is defined like this:
This is to be read as follows: the discriminator (ri:uid attribute) is expected to have a numeric value from 0 to 99999 (inclusive) and we want to divide this range into 100 buckets. So the first one will contain values from 0 to 999, the second one from 1000 to 1999, then 2000-2999, etc. The last one (the 100th) will contain values from 99000 to 99999, inclusive.
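The arithmetic behind this division can be checked with a short sketch. The numeric_buckets function is a hypothetical helper, not midPoint code; it treats the upper bound as exclusive, so the range 0 to 99999 inclusive is written as 0 to 100000. Rejecting ranges that do not divide evenly is an assumption of this sketch only.

```python
def numeric_buckets(start, end, number_of_buckets):
    """Split [start, end) into equally sized [lower, upper) intervals."""
    size, remainder = divmod(end - start, number_of_buckets)
    if remainder:
        # Assumption of this sketch: uneven ranges are simply not handled.
        raise ValueError("range is not evenly divisible in this sketch")
    return [
        (start + i * size, start + (i + 1) * size)
        for i in range(number_of_buckets)
    ]


buckets = numeric_buckets(0, 100000, 100)
first, second, last = buckets[0], buckets[1], buckets[-1]
```

With an exclusive upper bound, the interval (99000, 100000) covers the values 99000 to 99999, matching the last bucket described above.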
Current implementation supports the following segmentation definitions:
|all definitions||discriminator||Item whose values will be used to segment objects into buckets (if applicable). Usually required.|
|matchingRule||Matching rule to be applied when creating filters (if applicable). Optional.|
|numberOfBuckets||Number of buckets to be created (if applicable). Optional.|
|numericSegmentation||from||Start of the processing space (inclusive). If omitted, 0 is assumed.|
End of the processing space (exclusive). If not present, both
Size of one bucket. If not present it is computed as the total processing space divided by number of buckets (i.e.
Characters that make up the prefix or interval. Currently, the string segmentation is done by creating all possible boundaries (by combining
This is a multivalued property: the first value contains characters that occupy the first place in the boundary. The second value contains characters destined for the second place, etc.
An example: if
Another example: if
Beware: current implementation requires that the characters are specified in the order that complies with the matching rule used. Otherwise, empty intervals might be generated, like when using "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" there will be an interval of e.g. "values greater than
If a value
|oidSegmentation||The same as stringSegmentation but providing defaults of |
|explicitSegmentation||content||Explicit content of work buckets to be used. This is useful e.g. when dealing with filter-based buckets. But any other bucket content (e.g. numeric intervals, string intervals, string prefixes) might be used here as well.|
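As a rough illustration of the string segmentation described in the table, the sketch below builds boundaries by combining the characters given for each position and then turns a boundary list into intervals. The function names are hypothetical, and the sketch assumes that the characters are listed in an order consistent with the comparison used; it is not the actual midPoint implementation.

```python
from itertools import product


def boundaries(characters_per_position):
    """Combine one character from each position into every possible boundary."""
    return ["".join(combo) for combo in product(*characters_per_position)]


def intervals(bounds):
    """Turn ordered boundaries into (lower, upper) pairs; None marks an open end."""
    edges = [None] + list(bounds) + [None]
    return list(zip(edges[:-1], edges[1:]))


hex_digits = "0123456789abcdef"
two_place = boundaries([hex_digits, hex_digits])  # 16 * 16 combinations
abc_intervals = intervals(boundaries(["abc"]))
```

For instance, sixteen characters at each of two positions give 256 combinations, while three single-character boundaries give four intervals, including the two open-ended ones.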
oidSegmentation is the easiest one to use when dealing with repository objects. The following creates 16^2 = 256 segments.
The following configuration provides string interval buckets:
- less than
- greater or equal a, less than
- greater or equal b, less than
- greater or equal y, less than
- greater or equal
(comparison is done on normalized form of the
The following configuration provides three buckets. The first comprises identifier values less than 123. The second comprises values from 123 (inclusive) to 200 (exclusive). And the last one contains values greater than or equal to 200.
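To make the boundary handling of these three buckets concrete, here is a minimal sketch that assigns a value to a bucket. The bucket_index function and the BOUNDARIES list are hypothetical names for this illustration; they only mirror the inclusive lower and exclusive upper bounds stated above.

```python
from bisect import bisect_right

# Boundaries between the three explicit buckets described above.
BOUNDARIES = [123, 200]


def bucket_index(value, boundaries=BOUNDARIES):
    """Return 0 for value < 123, 1 for 123 <= value < 200, 2 for value >= 200."""
    return bisect_right(boundaries, value)


samples = {value: bucket_index(value) for value in (122, 123, 199, 200)}
```

Because lower bounds are inclusive and upper bounds exclusive, every value falls into exactly one bucket.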
The following configuration provides four buckets. The first three correspond to users with
administrative. The last one corresponds to user with no