PRIORIS: Dynamic Adapting Scheduling for HPC — Eliminating Job Failure through Robust Resource Allocation

Vishesh Goyal, Dr. Pavithra N.

2025 · High Performance Computing · Job Scheduling · Dynamic Scheduling · Resource Allocation · Dependency Management · Python · IEEE Published · Systems Design · Algorithm Design

Overview

Every large-scale computing cluster, from cloud platforms to supercomputers, runs on a job scheduler. Most of those schedulers rely on decades-old algorithms that have no view of what is happening in the system right now. PRIORIS is an IEEE-published adaptive job scheduling framework for High Performance Computing environments that replaces static scheduling and failure prediction with real-time resource awareness, dependency-driven job promotion, and starvation prevention. Evaluated on 5,000 synthetic jobs, it reduced makespan by 24.7% and average wait time by 31.5% compared to the standard First-Come-First-Served (FCFS) baseline.

About This Research

Most HPC job schedulers aren't bad — they're just static in a dynamic world. First-Come-First-Served doesn't know a high-priority job is waiting behind a resource hog. Shortest Job First can't handle job dependencies. Failure prediction models require historical training data and break down in new environments. None of them adapt to what's actually happening in the system right now. PRIORIS takes a different approach entirely. Instead of predicting failures, it avoids them by checking real-time resource availability before every job dispatch and dynamically reordering the queue based on current system state.

Core Algorithm

The heart of PRIORIS is a Calculated Priority Metric (CPM) that integrates base priority, estimated runtime, and resource cost into a single scheduling score. Short, lightweight jobs get promoted. Heavy resource consumers get appropriately penalised. Every position in the queue is earned — this is not round-robin or FCFS.
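
This summary does not reproduce the exact CPM formula, so the sketch below only illustrates the idea with a hypothetical weighted sum. The `Job` fields, the `cpm` function, and all three weights are assumptions chosen for the example, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    base_priority: float   # higher means more important
    est_runtime: float     # estimated runtime in seconds
    resource_cost: float   # normalised resource demand in [0, 1]


def cpm(job: Job, w_priority: float = 1.0, w_runtime: float = 0.01,
        w_cost: float = 5.0) -> float:
    """Hypothetical CPM: reward base priority, penalise runtime and cost."""
    return (w_priority * job.base_priority
            - w_runtime * job.est_runtime
            - w_cost * job.resource_cost)


jobs = [
    Job("long_heavy", base_priority=5, est_runtime=600, resource_cost=0.8),
    Job("urgent_heavy", base_priority=9, est_runtime=300, resource_cost=0.6),
    Job("short_light", base_priority=5, est_runtime=60, resource_cost=0.1),
]

# Highest score first: the queue is ordered by CPM, not by arrival order.
queue = sorted(jobs, key=cpm, reverse=True)
print([job.name for job in queue])
```

With these illustrative weights the short, light job overtakes an urgent but expensive one, and the long, heavy job sinks to the back. Different weights would shift that balance, which is the kind of trade-off a single combined score makes explicit.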

Four Mechanisms

Dependency-Driven Promotion

If Job A depends on Job B, Job B is automatically promoted by a distance proportional to Job A's runtime. This prevents high-priority jobs from stalling on unresolved dependencies — a problem that SLURM and PBS Pro don't handle without external configuration.
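
As a rough sketch of that promotion rule, the function below moves a prerequisite forward by a number of queue positions that scales with the dependent job's runtime. The function name, the list-of-names queue, and the `positions_per_second` constant are all hypothetical; the paper's actual proportionality constant is not given here.

```python
def promote_dependency(queue, dependent, prerequisite, runtimes,
                       positions_per_second=0.01):
    """Return a new queue with `prerequisite` moved toward the front.

    The promotion distance grows linearly with the dependent job's runtime.
    `positions_per_second` is a hypothetical scaling constant.
    """
    steps = max(1, round(runtimes[dependent] * positions_per_second))
    old_index = queue.index(prerequisite)
    new_index = max(0, old_index - steps)       # never move past the front
    reordered = [job for job in queue if job != prerequisite]
    reordered.insert(new_index, prerequisite)
    return reordered


queue = ["A", "C", "D", "E", "B"]    # A depends on B, but B is queued last
runtimes = {"A": 400, "B": 40, "C": 120, "D": 90, "E": 60}

promoted = promote_dependency(queue, dependent="A", prerequisite="B",
                              runtimes=runtimes)
print(promoted)
```

In this toy queue a 400-second dependent pulls its prerequisite forward four places, which puts B ahead of A. A shorter dependent would pull B forward by fewer places.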

Dynamic Resource Checking

Before any job executes, the scheduler verifies live CPU, memory, disk I/O, and network bandwidth availability. Jobs that can't run right now don't block the queue — they're pushed down and reconsidered at the next cycle.
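
One way to picture a single scheduling cycle is sketched below. A plain dictionary stands in for the live resource readings, and the resource keys, `fits`, and `run_cycle` are names invented for this example. Treating "pushed down" as "kept for the next cycle" is also an assumption.

```python
RESOURCES = ("cpu", "memory_gb", "disk_io_mbps", "net_mbps")


def fits(required, available):
    """True if every tracked resource is available in the required amount."""
    return all(required.get(r, 0) <= available.get(r, 0) for r in RESOURCES)


def run_cycle(queue, available):
    """One scheduling pass: dispatch what fits, defer what does not.

    `queue` is a list of (name, required) pairs in priority order.
    Returns (dispatched names, deferred queue, remaining resources).
    """
    remaining = dict(available)
    dispatched, deferred = [], []
    for name, required in queue:
        if fits(required, remaining):
            for r in RESOURCES:
                remaining[r] = remaining.get(r, 0) - required.get(r, 0)
            dispatched.append(name)
        else:
            deferred.append((name, required))   # reconsidered next cycle
    return dispatched, deferred, remaining


# Snapshot standing in for live measurements (illustrative numbers).
available = {"cpu": 16, "memory_gb": 64, "disk_io_mbps": 500, "net_mbps": 1000}
queue = [
    ("big", {"cpu": 32, "memory_gb": 48, "disk_io_mbps": 100, "net_mbps": 100}),
    ("small", {"cpu": 4, "memory_gb": 8, "disk_io_mbps": 100, "net_mbps": 100}),
    ("medium", {"cpu": 8, "memory_gb": 32, "disk_io_mbps": 200, "net_mbps": 200}),
]

dispatched, deferred, remaining = run_cycle(queue, available)
print(dispatched, [name for name, _ in deferred], remaining)
```

The oversized job at the head of the queue is deferred instead of blocking the two jobs behind it, which is the behaviour the paragraph above describes.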

Waiting Queue with Anti-Starvation

Jobs pushed down more than 3 times enter a dedicated waiting queue that is prioritised over the main queue every 4th scheduling cycle. No job waits forever.
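
The two thresholds above (more than 3 push-downs, every 4th cycle) can be sketched as a small bookkeeping class. The class name and its methods are hypothetical. Two behaviours are assumptions rather than details from the paper: a deferred job rejoins the back of the main queue, and on ordinary cycles the waiting queue is considered after the main queue.

```python
from collections import deque

MAX_PUSHDOWNS = 3     # from the text: more than 3 push-downs triggers the move
WAITING_PERIOD = 4    # from the text: waiting queue goes first every 4th cycle


class StarvationGuard:
    def __init__(self):
        self.main = deque()
        self.waiting = deque()
        self.pushdowns = {}
        self.cycle = 0

    def submit(self, job):
        self.main.append(job)
        self.pushdowns[job] = 0

    def push_down(self, job):
        """Record a deferral; move the job to the waiting queue after too many."""
        self.pushdowns[job] += 1
        self.main.remove(job)
        if self.pushdowns[job] > MAX_PUSHDOWNS:
            self.waiting.append(job)
        else:
            self.main.append(job)    # assumption: re-queued at the back

    def next_order(self):
        """Advance one cycle and return the order jobs are considered in."""
        self.cycle += 1
        if self.cycle % WAITING_PERIOD == 0:
            return list(self.waiting) + list(self.main)
        return list(self.main) + list(self.waiting)


guard = StarvationGuard()
for job in ("big", "a", "b"):
    guard.submit(job)
for _ in range(4):                   # "big" is deferred four times
    guard.push_down("big")

orders = [guard.next_order() for _ in range(4)]
print(orders)
```

Here the job that was pushed down four times lands in the waiting queue and is considered first on the fourth cycle.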

Limited Parallelism

Up to 2 jobs with adjacent priority scores may execute simultaneously, maximising utilisation without oversubscription.
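
A minimal sketch of that cap follows, assuming "adjacent" means neighbouring positions in the score-sorted queue and using CPU count as the only resource for brevity. `pick_batch` and the tuple layout are hypothetical names for this example.

```python
MAX_PARALLEL = 2    # from the text: at most two jobs run together


def pick_batch(queue, free_cpus):
    """Pick the head job plus, if it fits, its immediate neighbour.

    `queue` is a list of (name, score, cpus) sorted by descending score.
    """
    batch, used = [], 0
    for name, _score, cpus in queue[:MAX_PARALLEL]:
        if used + cpus > free_cpus:
            break           # stop rather than skip, so the batch stays adjacent
        batch.append(name)
        used += cpus
    return batch


queue = [("a", 9.0, 8), ("b", 8.5, 4), ("c", 8.0, 2)]

print(pick_batch(queue, free_cpus=16))   # both head jobs fit
print(pick_batch(queue, free_cpus=10))   # only the head job fits
```

Stopping at the first job that does not fit, instead of skipping ahead to a smaller one, is a design choice in this sketch: it keeps the pair adjacent in priority and never commits more CPUs than are free.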

Benchmark Comparison

The table below compares four scheduling strategies, PRIORIS included.

Scheduler	Handles Dependencies	Real-Time Adaptation	Prior Data Required
FCFS	No	No	No
Shortest Job First	No	No	No
Failure Prediction Model	No	Partial	Yes
PRIORIS	Yes	Yes	No

PRIORIS outperformed every baseline, including the failure prediction model that requires historical training data, without needing any prior data at all.
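
For readers unfamiliar with the two metrics quoted in this summary, the toy example below shows how makespan, average wait time, and a percentage reduction can be computed. The job list, the single-machine model, and the function names are illustrative assumptions and have nothing to do with the paper's 5,000-job workload.

```python
def simulate_serial(jobs):
    """Run jobs back to back on one machine, all submitted at t=0.

    `jobs` is a list of (name, runtime) in execution order.
    Returns (makespan, average wait before start).
    """
    clock, waits = 0, []
    for _name, runtime in jobs:
        waits.append(clock)      # time spent queued before starting
        clock += runtime
    return clock, sum(waits) / len(waits)


def percent_reduction(baseline, improved):
    """Relative improvement over the baseline, in percent."""
    return 100 * (baseline - improved) / baseline


arrival_order = [("long", 100), ("medium", 40), ("short", 10)]
short_first = sorted(arrival_order, key=lambda job: job[1])

fcfs_makespan, fcfs_wait = simulate_serial(arrival_order)
sjf_makespan, sjf_wait = simulate_serial(short_first)

print(fcfs_makespan, fcfs_wait)
print(sjf_makespan, sjf_wait)
print(percent_reduction(fcfs_wait, sjf_wait))
```

On a single machine, reordering changes the average wait but not the makespan, since the total work is the same. Makespan improvements of the kind reported above have to come from running jobs in parallel or from avoiding idle or wasted time.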

Known Limitation

500 dependency violations remain under specific edge-case scheduling sequences. This is acknowledged in the paper and is the primary target for the next iteration.

Most schedulers ask "will this job fail?" PRIORIS asks "do we have the resources to run this job right now?" That shift — from prediction to prevention — is simpler, more interpretable, and more robust.

Published at the 2025 IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS), Kuala Lumpur.

DOI: 10.1109/I2CACIS65476.2025.11101086
