HGQ Design Doc

Group Quota Design

Motivating Scenarios

??? What are some good use cases here? What didn't the old code do that the new code can?

Some questions we'd like customer use cases to address:

High Level Design and Definitions

The HGQ design is intended to allow an administrator to restrict the aggregate number of slots running jobs submitted by groups of users.

These sets of users are organized into hierarchical groups, with "none" being the name of the root group. The admin is expected to assign a quota to every leaf and interior node in the tree, except for the root. An assigned quota can be an absolute number or a floating-point number from 0 to 1, which represents a fraction of the immediate parent's quota. If absolute, it represents a weighted number of slots, where each slot is multiplied by a configurable weight, which defaults to the number of cores. All named groups must be predeclared in the config file. Note that the quota is independent of user priority.


Can we get crisp definitions of each of the fields in the GroupEntry structure?

Here is some annotation from the meeting on fields that didn't already have in-code doc:

    // these are set from configuration
    string name;
    double config_quota;  // could be static (>=1) or dynamic (0<x<1)
    bool static_quota;    // flag for whether config_quota is static or dynamic
    bool accept_surplus;  // true if this group will accept surplus
    bool autoregroup;     // true if this group will participate in the autoregroup phase

    // current usage information coming into this negotiation cycle
    double usage;  // accountant's value for usage under this group
    ClassAdListDoesNotDeleteAds* submitterAds;  // list of submitter ads under this group
    double priority;  // group's priority from the accountant

Meaning of quota for "static" quota:

The static quota for a given group indicates the minimum number of machines/slots that group is expected to be allocated, given sufficient demand. The sum of the static quotas of all the child nodes of any given parent must be less than or equal to the parent's static quota.

The sum of the children's static quotas may be less than the parent's quota. If so, the remainder is assigned to the parent.
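
As a minimal sketch of the static-quota rule above (not the negotiator's actual code), the hypothetical helper below checks that the children's static quotas fit within the parent's quota and computes the remainder that stays with the parent. The name `parent_remainder` and the choice to report an oversubscribed parent by returning `false` are assumptions made for illustration.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper: given a parent's static quota and the static quotas of
// its children, compute the remainder that stays with the parent.
// Returns false (and leaves `remainder` untouched) if the children's quotas
// sum to more than the parent's quota.
bool parent_remainder(double parent_quota,
                      const std::vector<double>& child_quotas,
                      double& remainder) {
    double child_sum =
        std::accumulate(child_quotas.begin(), child_quotas.end(), 0.0);
    if (child_sum > parent_quota) {
        return false;  // children oversubscribe the parent
    }
    remainder = parent_quota - child_sum;  // assigned to the parent itself
    return true;
}
```

For example, a parent with static quota 100 and children with quotas 30 and 50 keeps a remainder of 20.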

For dynamic (proportional) quota

A dynamic (proportional) quota indicates the fraction of the parent node's resources the group is expected to be allocated, given sufficient demand. If the children of a node have proportional quotas, each child is assigned an absolute quota equal to its proportion multiplied by the parent's quota.

The sum of all sibling quotas should be <= 1.0. If it is not, the quotas are normalized to sum to 1.0 and a warning message is issued.
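
The conversion from dynamic to absolute quota described above can be sketched as follows. This is an illustration under stated assumptions, not the negotiator's code: the function name `absolute_from_dynamic` is hypothetical, and the warning message is only noted in a comment.

```cpp
#include <numeric>
#include <vector>

// Convert sibling dynamic quotas (fractions of the parent) into absolute
// quotas. If the fractions sum to more than 1.0 they are scaled down so that
// they sum to 1.0; a real implementation would also log a warning here.
std::vector<double> absolute_from_dynamic(double parent_quota,
                                          std::vector<double> fractions) {
    double sum = std::accumulate(fractions.begin(), fractions.end(), 0.0);
    if (sum > 1.0) {
        for (double& f : fractions) {
            f /= sum;  // normalize so the siblings sum to 1.0
        }
    }
    for (double& f : fractions) {
        f *= parent_quota;  // fraction of parent -> absolute quota
    }
    return fractions;
}
```

With a parent quota of 100, fractions 0.25 and 0.5 become 25 and 50. Fractions 0.75 and 0.75 are first normalized to 0.5 each and become 50 and 50.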

Specifying Quotas

Each job then specifies which group it belongs to with the +AccountingGroup = "group_name.username" syntax. See also: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2728
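
To illustrate the "group_name.username" form, here is a small sketch that separates the two parts. It assumes that the user name is everything after the last '.', so that hierarchical group names such as "A.B" are kept whole; that rule and the function name `split_accounting_group` are assumptions for illustration, not a description of how the negotiator resolves groups.

```cpp
#include <string>
#include <utility>

// Split "group_name.username" at the last '.'.
// Assumption: group names such as "A.B" may contain dots, user names do not.
// Returns {group, user}; with no '.', the group is empty and the whole
// string is treated as the user name.
std::pair<std::string, std::string>
split_accounting_group(const std::string& value) {
    std::string::size_type dot = value.rfind('.');
    if (dot == std::string::npos) {
        return std::make_pair(std::string(), value);
    }
    return std::make_pair(value.substr(0, dot), value.substr(dot + 1));
}
```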

Quota Terminology

Note: The term "quota" is overloaded. Sometimes in the code and documentation, it means "the amount assigned by the administrator to a group" (entry->config_quota). It may also be the value translated from configured quota to actual (possibly weighted) slot quantity (entry->quota). The quantity finally assigned to a group, after quota computation and surplus sharing and fractional-quota distribution, is referred to as 'allocated' (entry->allocated).


First, the code builds up a data structure which describes each group: its position in the tree, the administratively configured quota, whether that quota is static or dynamic, and whether the group accepts surplus or participates in autoregroup. For each group, the current weighted usage is fetched from the accountant, as is the current userprio. The numbers of running and idle jobs are copied from each submitter's ad and summed into the corresponding group structure. Note that the number of running jobs also includes jobs running in flocked-to pools. Each group also contains a list of all the related submitter ads.

If autoregroup is on, the submitters are also appended to the root's list of submitter ads.

After (weighted) slot quotas are assigned to all the group entries, surplus sharing is computed for all groups in the hierarchy configured to accept surplus. Following surplus sharing, when slot weighting is not enabled, any fractional quota allocations are consolidated and distributed in a round robin fashion.

Surplus Sharing

The primary purpose of surplus sharing is to allow group quotas to "float" locally based on demand. For example, if one configures groups A, A.B, and A.C, where group A does not share surplus but A.B and A.C do, then A.B and A.C can float against each other while maintaining the constraint that quota(A.B) + quota(A.C) <= quota(A). Surplus quota is always shared at the lowest possible level before being passed upwards.

The basic principle for surplus sharing is: surplus quota is distributed among sibling groups in proportion to assigned quota. For example, if group A has twice the quota of group B, group A will be awarded twice the surplus. Some additional points:

  • available surplus consists of any surplus shared from the level above in the hierarchy, plus any surplus coming up from sibling sub-trees
  • any groups with surplus sharing not enabled do not participate in surplus distribution
  • if a group does not need all of its potential surplus, any it does not use will be shared among remaining participating groups
  • the parent group of siblings participates in sharing, effectively as another sibling
  • any surplus unused after sharing among siblings (and parent) is sent up the hierarchy to be shared at the level above
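
The sharing rules listed above can be sketched for a single level of the hierarchy as follows. This is a simplified illustration, not the negotiator's implementation: the `Sibling` struct, its `demand` field, and `share_surplus` are hypothetical names, and the parent can be modeled by adding it to the vector as one more entry.

```cpp
#include <algorithm>
#include <vector>

struct Sibling {          // hypothetical, simplified stand-in for a group entry
    double quota;         // assigned quota, used as the sharing weight
    double demand;        // how much surplus the group could use
    bool accept_surplus;  // groups that do not accept surplus are skipped
    double awarded;       // surplus awarded so far (output)
};

// Distribute `surplus` among the siblings in proportion to quota. A group
// never receives more than its demand; whatever it cannot use is re-shared
// among the remaining participants. Returns the surplus left over, which
// would be sent up to the level above.
double share_surplus(std::vector<Sibling>& sibs, double surplus) {
    const double eps = 1e-9;
    while (surplus > eps) {
        // Total weight of groups that can still take surplus.
        double weight = 0.0;
        for (const Sibling& s : sibs) {
            if (s.accept_surplus && s.demand - s.awarded > eps) weight += s.quota;
        }
        if (weight <= 0.0) break;  // nobody left to take surplus

        double given = 0.0;
        for (Sibling& s : sibs) {
            if (!s.accept_surplus || s.demand - s.awarded <= eps) continue;
            double share = surplus * s.quota / weight;
            double take = std::min(share, s.demand - s.awarded);
            s.awarded += take;
            given += take;
        }
        surplus -= given;
    }
    return surplus;
}
```

With quotas 20 and 10, ample demand, and a surplus of 30, the groups are awarded 20 and 10 respectively, matching the "twice the quota, twice the surplus" rule.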

Fractional Quota Consolidation

When slot weighting is not enabled, fractional quota values for groups are consolidated and distributed in round robin fashion to ensure that all quotas are integer values.
  • available remainder for consolidation consists of remainder coming from upper level in hierarchy, combined with any remainder coming up from sibling subtrees
  • remainders are not accepted by groups not accepting surplus
  • siblings having received remainder least recently are favored in round robin - siblings are ordered by time of last receipt of a remainder
  • remainder unused at a level is sent up to parent
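
A minimal sketch of the consolidation step for one set of siblings follows. It makes several simplifying assumptions that are not stated in this document: every group's allocation is rounded down first, each eligible group receives at most one whole slot per pass, demand is ignored, and "time of last receipt" is modeled as a logical counter. The names `GroupShare` and `consolidate` are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct GroupShare {       // hypothetical, simplified group record
    double allocated;     // possibly fractional allocation (in/out)
    bool accept_surplus;  // groups not accepting surplus take no remainder
    long last_rr;         // logical time a remainder was last received
};

// Round each allocation down, pool the fractional parts together with the
// remainder coming from elsewhere (`incoming`), and hand out whole slots to
// eligible groups, least recently served first. Returns the remainder that
// was not handed out, which would be sent up to the parent.
double consolidate(std::vector<GroupShare>& groups, double incoming, long& clock) {
    double pool = incoming;
    std::vector<GroupShare*> order;
    for (GroupShare& g : groups) {
        double whole = std::floor(g.allocated);
        pool += g.allocated - whole;
        g.allocated = whole;
        if (g.accept_surplus) order.push_back(&g);
    }
    // Siblings that received a remainder least recently go first.
    std::stable_sort(order.begin(), order.end(),
                     [](const GroupShare* a, const GroupShare* b) {
                         return a->last_rr < b->last_rr;
                     });
    for (GroupShare* g : order) {
        if (pool < 1.0 - 1e-9) break;  // less than one whole slot left
        g->allocated += 1.0;
        g->last_rr = ++clock;
        pool -= 1.0;
    }
    return pool;
}
```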

Allocation Rounds

Allocation rounds address the scenario where jobs submitted under an accounting group do not satisfy mutual job/slot requirements for enough slots to achieve the group's quota. When GROUP_QUOTA_MAX_ALLOCATION_ROUNDS > 1, each group that has not met its allocated quota has its 'requested' value reset to its current (weighted) usage; that is, it is assumed that no further jobs under that group will match slots until the next negotiation cycle. This frees up the unused quota for other groups that may be able to use it as surplus.

The following steps are iterated GROUP_QUOTA_MAX_ALLOCATION_ROUNDS times:

  1. (starting after 1st round) re-set 'requested' values to current usage
  2. (re)compute quota allocations
  3. allow all groups to renegotiate
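
The three steps above can be sketched as a loop. This is an outline under stated assumptions rather than the negotiator's code: `RoundGroup` is a hypothetical simplified record, and the quota computation and negotiation are passed in as callbacks because their details are outside this sketch.

```cpp
#include <functional>
#include <vector>

struct RoundGroup {    // hypothetical, simplified group record
    double usage;      // current (weighted) usage
    double requested;  // demand used as input to quota computation
    double allocated;  // result of quota computation
};

typedef std::function<void(std::vector<RoundGroup>&)> GroupStep;

// Iterate the allocation rounds: from the second round on, any group that
// did not reach its allocation has 'requested' reset to its current usage,
// then allocations are recomputed and all groups negotiate again.
void run_allocation_rounds(std::vector<RoundGroup>& groups, int max_rounds,
                           const GroupStep& compute_allocations,
                           const GroupStep& negotiate) {
    for (int round = 0; round < max_rounds; ++round) {
        if (round > 0) {
            for (RoundGroup& g : groups) {
                if (g.usage < g.allocated) g.requested = g.usage;
            }
        }
        compute_allocations(groups);  // step 2: (re)compute quota allocations
        negotiate(groups);            // step 3: allow groups to renegotiate
    }
}
```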

Round Robin Rate

Round robin rate is a method to address the 'overlapping effective pool' problem: a scenario where the jobs in two or more accounting groups are in fact competing for a subset of the total available resources. For example, suppose a pool has 100 Linux machines and 100 Windows machines, and 200 jobs from 2 accounting groups are competing only for the Linux machines. Without intervention, the first group to negotiate can acquire all 100 Linux machines and starve the second group.

To address this problem, there is a loop around negotiation that operates like so:

  1. (initialize all quota limits at zero)
  2. increase each quota limit by the round robin rate (up to allocated quota)
  3. run negotiation with those limits
  4. repeat

Round robin rate is configured via GROUP_QUOTA_ROUND_ROBIN_RATE, which defaults to "infinity" and thereby emulates legacy behavior.

(note: There is some interest in developing alternative approaches to allocation rounds and round robin rate that require fewer nested loops on top of basic negotiation)

Accounting Group Negotiation Order

We sort the groups in "starvation order" by GROUP_SORT_EXPR, which defaults to the ratio of current group usage to configured group quota.

Finally, we negotiate with each group in that order, with a quota limited as calculated above.
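
The default ordering can be sketched as a sort on usage divided by quota. The sketch assumes that the most starved group (lowest ratio) negotiates first and that groups with a zero quota sort last; neither detail is stated in this document. `GroupOrderEntry` and the function names are hypothetical.

```cpp
#include <algorithm>
#include <limits>
#include <string>
#include <vector>

struct GroupOrderEntry {  // hypothetical, simplified group record
    std::string name;
    double usage;  // current group usage
    double quota;  // configured group quota
};

// Default sort key: usage / quota. Groups with no quota sort last
// (an assumption made for this sketch).
double starvation_ratio(const GroupOrderEntry& g) {
    if (g.quota > 0.0) return g.usage / g.quota;
    return std::numeric_limits<double>::max();
}

// Order groups so that the one using the smallest share of its quota
// negotiates first. stable_sort keeps the input order for ties.
void sort_by_starvation(std::vector<GroupOrderEntry>& groups) {
    std::stable_sort(groups.begin(), groups.end(),
                     [](const GroupOrderEntry& a, const GroupOrderEntry& b) {
                         return starvation_ratio(a) < starvation_ratio(b);
                     });
}
```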