When enabled, mod-data-import will split large jobs into smaller “chunks,” adding each chunk to a queue and dynamically ordering the queue to ensure a fair distribution of the jobs running at any given time, considering metrics such as job size, tenant usage (for multi-tenant environments), and how long a job has been waiting. The algorithm for selecting which chunk runs next is highly configurable, allowing experimentation and "dialing in" of parameters for specific tenants and deployments. Details on how this algorithm works, as well as how to customize it, may be found below.
## Approach
When a worker becomes available, it calculates and assigns every waiting chunk a single numerical “score.” This score combines many factors according to the configured parameters and is designed to represent a holistic view of the chunk, including the job size, waiting time, and more. Higher scores are better; the chunk with the highest score runs next.
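As a minimal sketch of this selection step (hypothetical names, not the module's actual code):

```typescript
// Minimal sketch: when a worker becomes available, score every waiting chunk
// and pick the one with the highest score. `score` stands in for the
// parameter-driven calculation described below.
function pickNextChunk<T>(waiting: T[], score: (chunk: T) => number): T | undefined {
  let best: T | undefined;
  let bestScore = -Infinity;
  for (const chunk of waiting) {
    const s = score(chunk);
    if (s > bestScore) {
      bestScore = s;
      best = chunk;
    }
  }
  return best;
}
```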
### Factors considered
Metric | Calculation type (see "Implementation notes") | Parameters |
---|---|---|
Job size | Unbounded logarithmic | `SCORE_JOB_SMALLEST`, `SCORE_JOB_LARGEST`, `SCORE_JOB_REFERENCE` |
Age | Bounded logarithmic | `SCORE_AGE_NEWEST`, `SCORE_AGE_OLDEST`, `SCORE_AGE_EXTREME_THRESHOLD_MINUTES`, `SCORE_AGE_EXTREME_VALUE` |
Tenant usage | Linear | `SCORE_TENANT_USAGE_MIN`, `SCORE_TENANT_USAGE_MAX` |
Part number | Unbounded logarithmic | `SCORE_PART_NUMBER_FIRST`, `SCORE_PART_NUMBER_LAST`, `SCORE_PART_NUMBER_LAST_REFERENCE` |
#### Job size
This metric considers the total size of the job, in records. This allows control over prioritizing smaller jobs over larger ones; for instance, if a large job has been running for many hours (or would otherwise have priority), it may be desired for a job with only a handful of records to be able to "skip the line" and get processed next.
This is computed on a logarithmic scale, meaning that every doubling of the value moves the score one equal step across the range. For example, if the score ranges from 5 (smallest) to 0 (largest), and the reference value is 32, a job with size 1 gets score 5, size 2 gets score 4, size 4 gets score 3, size 8 gets score 2, size 16 gets score 1, and size 32 gets score 0. For more details on the calculation, see “Implementation notes” below.
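As a rough sketch, the example above is consistent with a calculation like the following (names here are illustrative, not the module's actual identifiers):

```typescript
// Unbounded logarithmic score: each doubling of `value` moves the score one
// equal step across the range; values past `reference` keep extrapolating.
function scoreUnboundedLog(
  value: number,
  scoreAtOne: number, // score given to a value of 1
  scoreAtRef: number, // score given to the reference value
  reference: number   // a "typical large" value, not a hard maximum
): number {
  const fraction = Math.log2(Math.max(value, 1)) / Math.log2(reference);
  return scoreAtOne + (scoreAtRef - scoreAtOne) * fraction;
}

console.log(scoreUnboundedLog(1, 5, 0, 32));  // 5
console.log(scoreUnboundedLog(8, 5, 0, 32));  // 2
console.log(scoreUnboundedLog(32, 5, 0, 32)); // 0
```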
#### Age
This metric considers how long this chunk has been waiting, based on when the user selected "Run" in the interface. This control is useful since it allows jobs that have been waiting longer to be prioritized.
Additionally, this metric considers an "extreme value," a failsafe that prevents other factors from de-prioritizing very old jobs. For example, if the normal newest-to-oldest scores range from 0 to 100 while job size ranges from 0 to 1000, job size could easily outweigh age and keep bumping an old but large job to the back. An example usage of this failsafe could be to ensure no job waits more than 24 hours: with a threshold of 24 hours, the extreme value could be set to something like 10000, more than enough to jump any job waiting longer than a day to the top of the queue.
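A sketch of how these pieces could fit together, assuming "bounded" means the input is capped at the failsafe threshold (an illustrative reading with hypothetical names, not the module's actual code):

```typescript
// Bounded logarithmic age score with an extreme-value failsafe.
function scoreAge(
  ageMinutes: number,
  scoreNewest: number,             // score for a brand-new chunk
  scoreOldest: number,             // score as age approaches the threshold
  extremeThresholdMinutes: number, // failsafe cutoff, e.g. 1440 for 24 hours
  extremeValue: number             // large score that overrides all other factors
): number {
  if (ageMinutes > extremeThresholdMinutes) {
    return extremeValue; // failsafe: very old chunks jump the queue
  }
  const capped = Math.min(Math.max(ageMinutes, 1), extremeThresholdMinutes);
  const fraction = Math.log2(capped) / Math.log2(extremeThresholdMinutes);
  return scoreNewest + (scoreOldest - scoreNewest) * fraction;
}
```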
#### Tenant usage
This considers how many workers are currently in use by each tenant; jobs from a tenant that is currently saturating the queue are deprioritized, pushing toward an even distribution among all tenants. Note that this has no effect in a single-tenant environment (or when only one tenant is currently importing data), since all jobs' scores would be affected equally.
This is done on a "linear" scale, making it percentage-based: a tenant using 25% of the workers receives a score 25% of the way between the minimum and maximum values.
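As a sketch (illustrative names), the linear scale is a plain interpolation:

```typescript
// Linear tenant-usage score: interpolate by the tenant's share of the workers.
function scoreTenantUsage(
  usedByTenant: number, // workers this tenant currently occupies
  totalWorkers: number,
  scoreMin: number,     // score when the tenant uses no workers
  scoreMax: number      // score when the tenant uses every worker
): number {
  const fraction = totalWorkers === 0 ? 0 : usedByTenant / totalWorkers;
  return scoreMin + (scoreMax - scoreMin) * fraction;
}

console.log(scoreTenantUsage(1, 4, 0, -100)); // -25: 25% of the way from 0 to -100
```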
#### Part number
Lastly, this metric is useful to ensure chunks run in order (otherwise chunks from the same job could run in a non-deterministic order). It is recommended to keep this range very small (e.g. just 0 to -1): every chunk in the same job would otherwise have the same score, so only a tiny nudge is needed to break ties.
This is logarithmic for implementation-specific reasons, but since it is intended to be used over a very small range, this does not matter much in practice.
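For example, with a first score of 0, a last score of -1, and a reference of 100 (the sample values below), part 1 scores 0, part 10 scores about -0.5, and part 100 scores -1; all else being equal, earlier chunks therefore always run before later ones.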
## Environment variables
Parameter | Sample value | Justification/Notes |
---|---|---|
`SCORE_JOB_SMALLEST` | | |
`SCORE_JOB_LARGEST` | | Larger jobs should be deprioritized |
`SCORE_JOB_REFERENCE` | | |
`SCORE_AGE_NEWEST` | 0 | New jobs get no boost |
`SCORE_AGE_OLDEST` | | More important than job + chunk size |
`SCORE_AGE_EXTREME_THRESHOLD_MINUTES` | 4320 | 72 hours; this should probably be confirmed/updated |
`SCORE_AGE_EXTREME_VALUE` | 10000 | Jump to the top of the queue |
`SCORE_TENANT_USAGE_MIN` | | If the tenant has no jobs running, it should be prioritized |
`SCORE_TENANT_USAGE_MAX` | | If the tenant is using all available workers, it should be significantly deprioritized. If no other tenants are competing, this will not matter (since all jobs would be offset equally) |
`SCORE_PART_NUMBER_FIRST` | 0 | Very small; we only want to order parts relative to others within the same job (which would likely have the same score otherwise) |
`SCORE_PART_NUMBER_LAST` | -1 | The last chunk will likely have a higher score due to the chunk size metric |
`SCORE_PART_NUMBER_LAST_REFERENCE` | 100 | Does not really matter due to the small range |
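For example, a deployment could supply these as ordinary environment variables (the values here are the illustrative samples from the table, not recommendations):

```sh
# Illustrative sample values only; tune for your own tenants and workload.
export SCORE_AGE_NEWEST=0
export SCORE_AGE_EXTREME_THRESHOLD_MINUTES=4320  # 72 hours
export SCORE_AGE_EXTREME_VALUE=10000
export SCORE_PART_NUMBER_FIRST=0
export SCORE_PART_NUMBER_LAST=-1
export SCORE_PART_NUMBER_LAST_REFERENCE=100
```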
## Implementation notes

### Customization tips
To make it easier to visualize this algorithm, see this playground: https://codesandbox.io/s/di-scoring-playground-x4kqrw?file=/src/ChunkAdd.tsx. It makes it easy to simulate tenant usage and jobs of different ages, sizes, and tenants.
Logarithmic factors can often be difficult to calibrate, since their inputs are potentially unbounded, making it hard to choose a good reference value. When determining a reference, don't look for the "largest possible" value; just use a "typical" or "expected" large value. Values that exceed the reference still have their scores calculated; the result simply falls outside the nominal range. To aid in this calibration, we developed https://codesandbox.io/s/di-unbounded-logarithmic-playground-yf4yyz, which shows score ranges for a given lower and upper value.
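For instance, using the sketch formula from "Job size" above with a range of 5 to 0 and a reference of 32, a job that exceeds the reference simply scores below the nominal floor:

```typescript
// Values past the reference extrapolate rather than clamp.
const score = (value: number) => 5 + (0 - 5) * (Math.log2(value) / Math.log2(32));
console.log(score(32));  //  0  (exactly at the reference)
console.log(score(128)); // -2  (beyond the reference: outside the nominal range)
```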
### Logging

Whenever a worker looks for a job, all calculated scores are logged, making it easier to calibrate in production. Look for log lines such as `Current worker tenant usage`.
### Scoring math

Each factor maps its input onto its configured score range using one of the three calculation types listed under "Factors considered" (unbounded logarithmic, bounded logarithmic, or linear); the per-factor scores are then combined into the chunk's single overall score, which the holistic ranker uses to pick the next chunk.
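The following is a reconstruction from the worked examples in this document (not a transcription of the module's code), where $s_{\mathrm{lo}}$ is the score configured for an input value of 1, $s_{\mathrm{hi}}$ the score at the reference value $r$, and $v$ the input value:

```latex
% Unbounded logarithmic (job size, part number): extrapolates past r
s(v) = s_{\mathrm{lo}} + (s_{\mathrm{hi}} - s_{\mathrm{lo}}) \cdot \frac{\log_2 v}{\log_2 r}

% Bounded logarithmic (age): input is capped at the top of the range
s(v) = s_{\mathrm{lo}} + (s_{\mathrm{hi}} - s_{\mathrm{lo}}) \cdot \frac{\log_2 \min(v, r)}{\log_2 r}

% Linear (tenant usage): interpolate by the fraction f of workers in use
s(f) = s_{\mathrm{min}} + (s_{\mathrm{max}} - s_{\mathrm{min}}) \cdot f
```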