When enabled, mod-data-import will split large jobs into smaller “chunks,” adding each chunk to a queue and dynamically ordering them so that jobs are run fairly alongside one another, considering metrics such as job size, tenant usage (for multi-tenant environments), and how long a job has been waiting.  The algorithm for selecting which chunk will be run next is highly configurable, allowing experimentation and "dialing in" of parameters for specific tenants and deployments.  Details on how this algorithm works, as well as how to customize it, may be found below:

Approach

When a worker becomes available, it will calculate and assign every waiting chunk a single numerical “score.” This score combines many factors according to the configured parameters, and is designed to represent a holistic view of the chunk, including the job size, waiting time, and more.  Higher scores are better.
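As a rough mental model, and assuming the individual metric scores are simply added together (the exact combination is defined by the implementation), the score of a chunk can be thought of as:

$$\text{score}(\text{chunk}) = S_{\text{job size}} + S_{\text{age}} + S_{\text{tenant usage}} + S_{\text{part number}}$$

where each term is computed from its metric and parameters as described under "Factors considered" and "Calculations" below.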

Factors considered

| Metric | Calculation type (see "Implementation notes") | Parameters |
| --- | --- | --- |
| Job size | Unbounded logarithmic | SCORE_JOB_SIZE_SMALLEST, SCORE_JOB_SIZE_LARGEST; reference is SCORE_JOB_SIZE_REFERENCE |
| Age | Bounded logarithmic | SCORE_AGE_NEWEST, SCORE_AGE_OLDEST; SCORE_AGE_EXTREME_VALUE above some threshold (SCORE_AGE_EXTREME_THRESHOLD minutes) |
| Tenant usage | Linear | SCORE_TENANT_USAGE_MIN, SCORE_TENANT_USAGE_MAX |
| Part number | Unbounded logarithmic | SCORE_PART_NUMBER_FIRST, SCORE_PART_NUMBER_LAST; reference is SCORE_PART_NUMBER_LAST_REFERENCE |

Job size

This metric considers the total size of the job, in records.  It allows smaller jobs to be prioritized over larger ones; for instance, if a large job has been running for many hours (or would otherwise have priority), a job with only a handful of records can "skip the line" and get processed next.
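For a rough sense of scale (this assumes the unbounded-logarithmic form sketched under "Calculations" below, together with the sample values SCORE_JOB_SMALLEST = 40, SCORE_JOB_LARGEST = -40, and SCORE_JOB_REFERENCE = 100000 from the environment-variable table), a 100-record job would score roughly

$$40 + (-40 - 40) \cdot \frac{\log 100}{\log 100000} = 40 - 80 \cdot 0.4 = 8,$$

while a 100,000-record job would score -40, so the small job gets a meaningful boost without dwarfing the other metrics.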

...

This metric is logarithmic for implementation-specific reasons, but since it is intended to be used over a very small range, this does not matter much in practice.

Environment variables

| Parameter | Sample value | Justification/Notes |
| --- | --- | --- |
| SCORE_JOB_SMALLEST | 40 | |
| SCORE_JOB_LARGEST | -40 | Larger jobs should be deprioritized |
| SCORE_JOB_REFERENCE | 100000 | |
| SCORE_AGE_NEWEST | 0 | New jobs get no boost |
| SCORE_AGE_OLDEST | 150 | More important than job + chunk size |
| SCORE_AGE_EXTREME_THRESHOLD_MINUTES | 4320 | 72 hours. This should probably be confirmed/updated |
| SCORE_AGE_EXTREME_VALUE | 10000 | Jump to the top of the queue |
| SCORE_TENANT_USAGE_MIN | 100 | If the tenant has no jobs running, it should be prioritized |
| SCORE_TENANT_USAGE_MAX | -200 | If the tenant is using all available workers, it should be significantly deprioritized. If no other tenants are competing, this will not matter (since all jobs would be offset equally) |
| SCORE_PART_NUMBER_FIRST | 1 | Very small; we only want to order parts amongst others within a job (which would likely have the same score otherwise) |
| SCORE_PART_NUMBER_LAST | 0 | The last chunk will likely have a higher score due to the chunk size metric |
| SCORE_PART_NUMBER_LAST_REFERENCE | 100 | Does not really matter due to the small range |

Implementation notes

Want to disregard a metric?

To disregard/ignore a metric, simply set its parameters to zero.  For example, to emulate first-come-first-serve, set all but the age-related parameters to zero (and, since there are no other metrics to compete with, simple values of newest=0, oldest=1, extreme_threshold=9999, extreme_value=2, or any other combination where newest < oldest < extreme_value, will work).

Customization tips

To make it easier to visualize this algorithm, see this playground https://codesandbox.io/s/di-scoring-playground-x4kqrw?file=/src/ChunkAdd.tsx.  This makes it easy to simulate tenant usage and jobs of different ages, sizes, and tenants.

...

Whenever a worker looks for a job, we log all calculated scores, to make it easier to calibrate in production.  Look for Current worker tenant usage (lists how many workers are in use by each tenant) and Calculated scores (lists the score for each job, listed as tenant/queue chunk ID/score).

Calculations

Warning to the reader: this section gets extremely technical.

Unbounded logarithmic

These represent potentially unbounded values. As such, they do not have a hard upper limit; instead, we define a reference/expected value that represents a typical upper bound. However, since we are using a logarithm, the effect of values past the expected bound is minimal.
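One plausible formulation, consistent with the example below (the exact implementation may differ), maps a value $v$ to a score between the lower score $s_{\text{low}}$ and the upper score $s_{\text{high}}$ using the reference value $v_{\text{ref}}$:

$$\text{score}(v) = s_{\text{low}} + (s_{\text{high}} - s_{\text{low}}) \cdot \frac{\log v}{\log v_{\text{ref}}}$$

With $s_{\text{low}} = 0$, $s_{\text{high}} = 5$, and $v_{\text{ref}} = 32$, this reproduces the table below; values beyond the reference keep increasing the score, but only slowly.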

To interactively see how this calculation works, see https://codesandbox.io/s/di-unbounded-logarithmic-playground-yf4yyz


For example: the size of a chunk. If we expect chunks to typically have a size from 1 to 32, we could define scores 0 to 5. With this, we would get the following scores:

| Value range | Score range |
| --- | --- |
| [1, 2] | (0, 1] |
| [3, 4] | (1, 2] |
| [5, 8] | (2, 3] |
| [9, 16] | (3, 4] |
| [17, 32] | (4, 5]. This is the upper bound of the expected range, but since the real value could be infinite, it can keep going… |
| [33, 64] | (5, 6] |
| … | … towards ∞ |
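A minimal sketch of how this could be implemented in Java (an illustration of the formula above, not necessarily the exact code used by mod-data-import):

```java
public final class ScoreExamples {

  /**
   * Illustrative unbounded-logarithmic score: reaches upperScore at the
   * reference value, but keeps growing slowly for larger values.
   * This is a sketch; the real implementation may differ in details.
   */
  public static double unboundedLogScore(
      double value, double lowerScore, double upperScore, double reference) {
    if (value <= 1) {
      // log(value) would be non-positive here; clamp to the lower score
      return lowerScore;
    }
    return lowerScore + (upperScore - lowerScore) * Math.log(value) / Math.log(reference);
  }

  public static void main(String[] args) {
    // Reproduces the example table above (scores 0..5, reference 32)
    System.out.println(unboundedLogScore(8, 0, 5, 32));  // 3.0
    System.out.println(unboundedLogScore(32, 0, 5, 32)); // 5.0
    System.out.println(unboundedLogScore(64, 0, 5, 32)); // 6.0
  }
}
```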

Why is this a good approach?

This may seem overcomplicated; however, the math above can be written in one or two lines of code. Some other questions I asked myself while deciding on this were:

  • Why not just use a linear approach? We want to group things into classes of different sizes; for example, jobs could be “small” (e.g. 0-100), “medium” (100-1000), and “large” (1000-10000+). Two “medium” jobs could be 900 records apart, 9x the entire range of “small”, diluting the differences within that class. By using a logarithmic scale, we can retain granularity within each group.

  • Why not just have classes/ranges like those mentioned above? We could, but the immediate follow-up questions are “how many ranges should there be?” and “what range should each one cover?” We could attempt to answer these, but we cannot anticipate the needs of the future. For the latter, we could add configuration variables, but sorting into just five groups would already introduce fifteen variables (a lower bound, upper bound, and score for each), which seems overcomplicated.

Bounded logarithmic

This acts like the logarithmic calculation; however, upon reaching the reference value/expected upper bound (EXTREME_THRESHOLD), the EXTREME_VALUE is used instead. This is useful for something like age, where we want to gradually increase the score over time, but after a certain amount of time we want that job to finish ASAP (effectively be bumped to the top). The EXTREME_VALUE represents the maximum we ever want this metric to reach.
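One plausible way to write this (again, the exact implementation may differ) is as a piecewise function of the age $a$ in minutes:

$$\text{score}(a) = \begin{cases} s_{\text{newest}} + (s_{\text{oldest}} - s_{\text{newest}}) \cdot \dfrac{\log(1 + a)}{\log(1 + a_{\text{threshold}})} & a < a_{\text{threshold}} \\[1ex] s_{\text{extreme}} & a \ge a_{\text{threshold}} \end{cases}$$

so the score rises from the “newest” score toward the “oldest” score as the chunk waits, then jumps to the extreme value once the threshold is crossed.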


Linear

This works for percentages; if the value corresponds to 50%, then the score will be halfway between the upper and lower scores.
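For tenant usage, the only linear metric above, this plausibly looks like the following (assuming the input fraction $f$ is the tenant's share of the available workers):

$$\text{score}(f) = s_{\text{min}} + (s_{\text{max}} - s_{\text{min}}) \cdot f, \qquad f \in [0, 1]$$

With the sample values SCORE_TENANT_USAGE_MIN = 100 and SCORE_TENANT_USAGE_MAX = -200, a tenant using half of the workers would score $100 + (-200 - 100) \cdot 0.5 = -50$.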


Adding additional metrics

The code backing this is highly extensible; to add your own factors, create a new class extending QueueItemRanker with a single method score(DataImportQueueItem queueItem, Map<String, Long> tenantUsage).  Then, to include it in the calculations, simply add your custom ranker to the score method of QueueItemHolisticRanker.
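As a rough sketch of what such a ranker could look like (QueueItemRanker, DataImportQueueItem, and the score signature come from the text above; the double return type, the getTenant() accessor, and the boost logic are assumptions for illustration only):

```java
import java.util.Map;

/**
 * Hypothetical custom ranker: gives a fixed boost to chunks belonging to a
 * particular tenant, on top of the existing metrics. Illustration only.
 */
public class ExampleCustomRanker extends QueueItemRanker {

  private static final double BOOST = 25.0;

  @Override
  public double score(DataImportQueueItem queueItem, Map<String, Long> tenantUsage) {
    // "getTenant" is an assumed accessor; adapt to the actual queue item fields.
    if ("priority-tenant".equals(queueItem.getTenant())) {
      return BOOST;
    }
    return 0.0;
  }
}
```

To include it in the overall calculation, you would then add an instance of this ranker to the score method of QueueItemHolisticRanker, as described above.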