Gang scheduling

In computer science, gang scheduling is a scheduling algorithm for parallel systems that schedules related threads or processes to run simultaneously on different processors. Usually these will be threads all belonging to the same process, but they may also be from different processes, where the processes could have a producer-consumer relationship or come from the same MPI program.

Gang scheduling is used to ensure that if two or more threads or processes communicate with each other, they will all be ready to communicate at the same time. If they were not gang-scheduled, one could wait to send a message to, or receive a message from, another that is sleeping, and vice versa. When processors are over-subscribed and gang scheduling is not used within a group of processes or threads that communicate with each other, each communication event could suffer the overhead of a context switch.

Gang scheduling is based on a data structure called the Ousterhout matrix. In this matrix each row represents a time slice, and each column a processor. The threads or processes of each job are packed into a single row of the matrix. During execution, coordinated context switching is performed across all nodes to switch from the processes in one row to those in the next row.
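The matrix layout described above can be illustrated with a minimal Python sketch. The source says only that each job is packed into a single row. The packing policy used here (place a job in the first row with enough free columns, otherwise open a new row) and the names `pack_jobs` and `run_slices` are assumptions made for illustration, not part of any particular scheduler.

```python
from typing import Dict, Iterator, List, Optional

Row = List[Optional[str]]  # one time slice; each column is one processor


def pack_jobs(jobs: Dict[str, int], num_processors: int) -> List[Row]:
    """Pack every job into a single row of an Ousterhout-style matrix.

    `jobs` maps a job name to the number of threads it needs.
    Assumed policy: use the first row with enough free columns,
    otherwise open a new row.
    """
    matrix: List[Row] = []
    for name, size in jobs.items():
        if size > num_processors:
            raise ValueError(f"job {name!r} needs more processors than exist")
        for row in matrix:
            free = [i for i, cell in enumerate(row) if cell is None]
            if len(free) >= size:
                for i in free[:size]:
                    row[i] = name
                break
        else:
            # No existing row can hold the whole gang: open a new time slice.
            new_row: Row = [None] * num_processors
            for i in range(size):
                new_row[i] = name
            matrix.append(new_row)
    return matrix


def run_slices(matrix: List[Row], num_slices: int) -> Iterator[Row]:
    """Yield the row that runs in each time slice, cycling through the rows."""
    if not matrix:
        return
    for t in range(num_slices):
        # Coordinated context switch: every processor moves to the next row.
        yield matrix[t % len(matrix)]


matrix = pack_jobs({"A": 3, "B": 2, "C": 1, "D": 4}, num_processors=4)
for row in run_slices(matrix, 4):
    print(row)
```

With four processors, jobs A and C share the first row, B occupies the second row with two idle columns, and D fills the third. The idle columns in B's row show the fragmentation cost of keeping a gang together in one time slice.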

Gang scheduling is stricter than coscheduling. It requires all threads of the same process to run concurrently, while coscheduling allows for fragments, which are sets of threads that do not run concurrently with the rest of the gang.

Gang scheduling was implemented and used in production mode on several parallel machines, most notably the Connection Machine CM-5.

Types

Bag of gangs (BoG)

In gang scheduling, tasks are mapped one-to-one onto processors, so each task runs on its own processor. Usually, jobs are considered as independent gangs, but with a bag of gangs scheme, all the gangs can be combined and sent together to the system. A BoG is complete only when all the gangs that belong to it have completed their execution.

The response time is further affected when a priority job arrives. A priority job takes precedence over all other jobs, including those currently executing on the processors. When it arrives, the sub-gang that is currently executing is stopped, and all the progress it has made is lost and must be redone. This interruption further delays the total response time of the BoG.

Largest gang first served (LGFS)

In this scheme, tasks are placed in a queue ordered by job size, with the tasks belonging to the largest gang scheduled first. This method tends to starve smaller jobs of resources and is therefore unsuitable for systems where the number of processors is comparatively low.
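The queue ordering can be sketched in Python under some simplifying assumptions that are not in the source: every gang runs for exactly one round, the queue is served strictly from its head (no backfilling), and ties in size keep arrival order. The function name `lgfs_rounds` is hypothetical.

```python
from collections import deque
from typing import List, Tuple


def lgfs_rounds(gangs: List[Tuple[str, int]], num_processors: int) -> List[List[str]]:
    """Group gangs into rounds, always serving the largest waiting gang first.

    `gangs` is a list of (name, size) pairs in arrival order.
    """
    for name, size in gangs:
        if size > num_processors:
            raise ValueError(f"gang {name!r} can never fit on {num_processors} processors")

    # Largest gang at the head of the queue; sorted() is stable, so equal
    # sizes stay in arrival order.
    queue = deque(sorted(gangs, key=lambda g: g[1], reverse=True))
    rounds: List[List[str]] = []
    while queue:
        free = num_processors
        running: List[str] = []
        # Strict head-of-queue service: stop at the first gang that does not fit.
        while queue and queue[0][1] <= free:
            name, size = queue.popleft()
            free -= size
            running.append(name)
        rounds.append(running)
    return rounds


print(lgfs_rounds([("small", 1), ("big", 4), ("mid", 2)], num_processors=4))
```

On four processors the gang named `big` takes the whole first round, and `small` waits for the second round even though it arrived first. This is the starvation effect described above, and it is more pronounced when few processors are available.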

Blocking case: The processors assigned to the interrupted jobs are blocked and cannot execute other jobs in their queue until the jobs from the damaged processors are cleared.

Scheduling algorithm

General case: In the general case, a central node is designated in the network to handle task allocation and resource allocation. It maintains the information in an Ousterhout matrix. In strict gang scheduling, one row is selected at a time, after which a node scheduler schedules a process in the respective cell of that row.
Processor/memory module (also called a processing element, or PE).
Two-way network that allows one-to-one communication.
A synchronizer that synchronizes all PEs after a constant interval.

The synchronization algorithm is performed in two stages.

The local memory of the node is utilized as the swap space for pre-empted jobs. The main advantages of the SHARE scheduled system are that it guarantees the service time for accepted jobs and supports both batch and interactive jobs.

Synchronization:

Each gang of processes utilizing the same resources is mapped to a different processor. The SHARE system primarily consists of three collaborating modules.

2. Best fit. Unlike first fit, the used slots are sorted by capacity rather than taken in sequential order. The slot with the smallest sufficient capacity is chosen. If none of the used slots has sufficient capacity, then only one new slot is opened. Once the new slot is opened, the processing elements (PEs) are allocated in the slot in sequential order as per the previous algorithm.
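The best-fit selection rule can be sketched as follows. The representation is an assumption made for illustration: each slot is reduced to its number of free PEs, and the function name `best_fit` is hypothetical. Sequential allocation of PEs inside the chosen slot is not modeled.

```python
from typing import List


def best_fit(slots_free: List[int], job_size: int, num_pes: int) -> int:
    """Choose a slot for a job of `job_size` PEs and return its index.

    `slots_free[i]` is the number of free PEs in used slot i; the list is
    updated in place. A new slot with `num_pes` free PEs is opened only
    when no used slot has sufficient capacity.
    """
    if job_size > num_pes:
        raise ValueError("job needs more PEs than a slot provides")

    # Candidate slots ordered by capacity: pick the smallest sufficient one.
    candidates = [(free, i) for i, free in enumerate(slots_free) if free >= job_size]
    if candidates:
        _, index = min(candidates)
    else:
        # No used slot is large enough: open exactly one new slot.
        slots_free.append(num_pes)
        index = len(slots_free) - 1

    slots_free[index] -= job_size
    return index


slots = [3, 1, 2]                     # free PEs in three used slots
print(best_fit(slots, 2, num_pes=4))  # picks the slot with 2 free PEs
print(best_fit(slots, 4, num_pes=4))  # nothing fits, so a new slot is opened
print(slots)
```

Choosing the tightest slot leaves larger holes available for larger jobs, at the cost of scanning all used slots for each placement.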