Description
Required Info:
- AWS ParallelCluster version: 2.9.1
- Full cluster configuration (no credentials or personal data):
[aws]
aws_region_name = us-west-2
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
[global]
cluster_template = default
update_check = true
sanity_check = true
[cluster default]
key_name = ec2-admin
base_os = centos7
scheduler = slurm
master_instance_type = m5.large
vpc_settings = default
queue_settings = compute
tags = {"purpose": "development-ondemand"}
custom_ami = ami-0cbdde0662f037f7c
master_root_volume_size = 100
compute_root_volume_size = 100
[vpc default]
vpc_id = vpc-fa52d99e
master_subnet_id = subnet-00bbd054b223b7501
vpc_security_group_id = sg-74e42c12
use_public_ips = false
[queue compute]
enable_efa = false
compute_resource_settings = default
[compute_resource default]
instance_type = m5.xlarge
- Cluster name: slurm-test
- ARN of the cluster CloudFormation main stack: arn:aws:cloudformation:us-west-2:171496337684:stack/parallelcluster-slurm-test/047735a0-09a4-11eb-8cb8-067714b800aa
Bug description and how to reproduce:
When a node that is running a job is terminated, the time it takes for Slurm to update the job/node state is quite variable: sometimes a few minutes, but I've seen it take nearly an hour. Here is an example run:
first job running:
[ip-172-31-58-150]keien:slurm-test [14:06]
$ cat requeue.sh
#!/bin/bash
hostname
date
sleep 600
[ip-172-31-58-150]keien:~ [14:06]
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute requeue. keien R 2:28 1 compute-dy-m5xlarge-1
[ip-172-31-58-150]keien:~ [14:06]
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 9 idle~ compute-dy-m5xlarge-[2-10]
compute* up infinite 1 mix compute-dy-m5xlarge-1
[ip-172-31-58-150]keien:slurm-test [14:07]
$ cat slurm-2.out
compute-dy-m5xlarge-1
Thu Oct 8 14:03:45 PDT 2020
terminate node:
[compute-dy-m5xlarge-1]keien:~ [14:07]
$ sudo shutdown -h now
[sudo] password for keien:
Connection to compute-dy-m5xlarge-1 closed by remote host.
Connection to compute-dy-m5xlarge-1 closed.
right after termination:
[ip-172-31-58-150]keien:slurm-test [14:08]
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute requeue. keien R 5:49 1 compute-dy-m5xlarge-1
[ip-172-31-58-150]keien:slurm-test [14:09]
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 9 idle~ compute-dy-m5xlarge-[2-10]
compute* up infinite 1 mix compute-dy-m5xlarge-1
attempting to salloc while in this state:
[ip-172-31-58-150]keien:slurm-test [14:15]
$ salloc bash
salloc: Granted job allocation 3
keien@ip-172-31-58-150 ~/ws/slurm-test $ srun hostname
srun: error: Task launch for 3.0 failed on node compute-dy-m5xlarge-1: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
the salloc job gets stuck in CG after the above:
[ip-172-31-58-150]keien:slurm-test [14:17]
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute bash keien CG 1:03 1 compute-dy-m5xlarge-1
2 compute requeue. keien R 14:02 1 compute-dy-m5xlarge-1
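For anyone else hitting this stuck-CG state, scontrol gives a bit more detail about what the controller still believes about the node and the job (just a sketch of the commands; the node/job IDs are from this run):
scontrol show node compute-dy-m5xlarge-1   # State/Reason and when slurmctld last heard from the node
scontrol show job 3                        # the CG job still bound to the dead node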
job restarted:
[ip-172-31-58-150]keien:slurm-test [15:15]
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute requeue. keien R 4:38 1 compute-dy-m5xlarge-2
[ip-172-31-58-150]keien:slurm-test [15:15]
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 9 idle~ compute-dy-m5xlarge-[1,3-10]
compute* up infinite 1 mix compute-dy-m5xlarge-2
[ip-172-31-58-150]keien:slurm-test [15:15]
$ cat slurm-2.out
compute-dy-m5xlarge-2
Thu Oct 8 15:10:45 PDT 2020
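To reproduce this without an interactive session, something like the following should work (a sketch, not an exact replay of the run above; the node name is taken from squeue, and the watch loop is only there to measure the delay):
# submit the sleep job, terminate its instance, then watch how long slurmctld takes to notice
sbatch requeue.sh
sleep 300                                 # give the dynamic node time to come up and start the job
NODE=$(squeue -h -o "%R" -n requeue.sh)   # e.g. compute-dy-m5xlarge-1
ssh -t "$NODE" sudo shutdown -h now       # kill the instance out from under the job (prompts for sudo password in my setup)
while true; do                            # timestamped view of node and job state
  date
  sinfo -N -n "$NODE" -o "%N %T"
  squeue -o "%i %T %R"
  sleep 60
done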
logs:
[root@ip-172-31-58-150 log]# cat slurmctld.log
[2020-10-08T13:33:39.952] error: chdir(/var/log): Permission denied
[2020-10-08T13:33:39.953] slurmctld version 20.02.4 started on cluster parallelcluster
[2020-10-08T13:33:42.279] No memory enforcing mechanism configured.
[2020-10-08T13:33:42.279] layouts: no layout to initialize
[2020-10-08T13:33:42.711] layouts: loading entities/relations information
[2020-10-08T13:33:42.711] error: Could not open node state file /var/spool/slurm.state/node_state: No such file or directory
[2020-10-08T13:33:42.711] error: NOTE: Trying backup state save file. Information may be lost!
[2020-10-08T13:33:42.711] No node state file (/var/spool/slurm.state/node_state.old) to recover
[2020-10-08T13:33:42.711] error: Could not open job state file /var/spool/slurm.state/job_state: No such file or directory
[2020-10-08T13:33:42.711] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-10-08T13:33:42.711] No job state file (/var/spool/slurm.state/job_state.old) to recover
[2020-10-08T13:33:42.712] select/cons_tres: select_p_node_init
[2020-10-08T13:33:42.712] select/cons_tres: preparing for 3 partitions
[2020-10-08T13:33:42.997] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:42.998] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:42.999] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:42.999] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.000] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.000] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.001] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.002] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.002] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.003] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.004] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.004] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.005] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.005] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.006] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.007] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.007] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.008] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.008] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.009] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.014] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.014] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.014] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.014] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.016] error: Could not open reservation state file /var/spool/slurm.state/resv_state: No such file or directory
[2020-10-08T13:33:43.016] error: NOTE: Trying backup state save file. Reservations may be lost
[2020-10-08T13:33:43.016] No reservation state file (/var/spool/slurm.state/resv_state.old) to recover
[2020-10-08T13:33:43.016] error: Could not open trigger state file /var/spool/slurm.state/trigger_state: No such file or directory
[2020-10-08T13:33:43.016] error: NOTE: Trying backup state save file. Triggers may be lost!
[2020-10-08T13:33:43.016] No trigger state file (/var/spool/slurm.state/trigger_state.old) to recover
[2020-10-08T13:33:43.016] read_slurm_conf: backup_controller not specified
[2020-10-08T13:33:43.016] Reinitializing job accounting state
[2020-10-08T13:33:43.016] select/cons_tres: reconfigure
[2020-10-08T13:33:43.016] select/cons_tres: select_p_node_init
[2020-10-08T13:33:43.016] select/cons_tres: preparing for 3 partitions
[2020-10-08T13:33:43.016] Running as primary controller
[2020-10-08T13:33:43.122] No parameter for mcs plugin, default values set
[2020-10-08T13:33:43.122] mcs: MCSParameters = (null). ondemand set.
[2020-10-08T13:34:43.393] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-10-08T13:38:43.546] error: Could not open job state file /var/spool/slurm.state/job_state: No such file or directory
[2020-10-08T13:38:43.546] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-10-08T13:38:43.546] No job state file (/var/spool/slurm.state/job_state.old) found
[2020-10-08T13:59:04.324] _slurm_rpc_submit_batch_job: JobId=2 InitPrio=4294901759 usec=1258
[2020-10-08T13:59:04.472] sched: Allocate JobId=2 NodeList=compute-dy-m5xlarge-1 #CPUs=1 Partition=compute
[2020-10-08T14:03:28.136] Node compute-dy-m5xlarge-1 now responding
[2020-10-08T14:03:43.670] job_time_limit: Configuration for JobId=2 complete
[2020-10-08T14:03:43.670] Resetting JobId=2 start time for node power up
[2020-10-08T14:13:43.077] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:15:27.168] sched: _slurm_rpc_allocate_resources JobId=3 NodeList=compute-dy-m5xlarge-1 usec=284
[2020-10-08T14:15:27.168] prolog_running_decr: Configuration for JobId=3 is complete
[2020-10-08T14:15:38.897] job_step_signal: JobId=3 StepId=0 not found
[2020-10-08T14:16:10.000] job_step_signal: JobId=3 StepId=0 not found
[2020-10-08T14:16:30.292] _job_complete: JobId=3 WEXITSTATUS 233
[2020-10-08T14:16:30.292] _job_complete: JobId=3 done
[2020-10-08T14:18:43.275] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:19:43.321] Resending TERMINATE_JOB request JobId=3 Nodelist=compute-dy-m5xlarge-1
[2020-10-08T14:23:43.479] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:28:43.768] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:33:43.965] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:38:43.163] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:43:43.368] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:48:43.566] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:53:43.896] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:58:43.102] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T15:03:09.284] requeue job JobId=2 due to failure of node compute-dy-m5xlarge-1
[2020-10-08T15:03:09.284] Requeuing JobId=2
[2020-10-08T15:03:09.284] cleanup_completing: JobId=3 completion process took 2799 seconds
[2020-10-08T15:03:09.284] error: Nodes compute-dy-m5xlarge-1 not responding, setting DOWN
[2020-10-08T15:05:30.789] backfill: Started JobId=2 in compute on compute-dy-m5xlarge-2
[2020-10-08T15:10:33.585] Node compute-dy-m5xlarge-2 now responding
[2020-10-08T15:10:43.589] job_time_limit: Configuration for JobId=2 complete
[2020-10-08T15:10:43.589] Resetting JobId=2 start time for node power up
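For reference, the timeline above presumably depends on the controller's node timeout settings; I haven't confirmed which one accounts for the ~55 minutes, but they can be checked with something like:
scontrol show config | grep -E 'SlurmdTimeout|ResumeTimeout|SuspendTime'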
[root@ip-172-31-58-150 parallelcluster]# cat clustermgtd
2020-10-08 13:33:59,248 INFO [clustermgtd:main] ClusterManager Startup
2020-10-08 13:33:59,248 INFO [clustermgtd:_get_config] Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2020-10-08 13:33:59,251 INFO [clustermgtd:set_config] Applying new clustermgtd config: ClustermgtdConfig(_config=<configparser.ConfigParser object at 0x7fadf58e8f60>, region='us-west-2', cluster_name='slurm-test', dynamodb_table='parallelcluster-slurm-test', master_private_ip='172.31.58.150', master_hostname='ip-172-31-58-150.us-west-2.compute.internal cerebras.aws', instance_name_type_mapping={'m5xlarge': 'm5.xlarge'}, _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, loop_time=60, disable_all_cluster_management=False, heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', boto3_config=<botocore.config.Config object at 0x7fadf3fd24a8>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_clustermgtd_logging.conf', disable_ec2_health_check=False, disable_scheduled_event_health_check=False, health_check_timeout=180, disable_all_health_checks=False, launch_max_batch_size=500, update_node_address=True, terminate_max_batch_size=1000, node_replacement_timeout=3600, terminate_drain_nodes=True,
terminate_down_nodes=True, orphaned_instance_timeout=120, hosted_zone='Z08971062IG8KMY8Z2MYN', dns_domain='slurm-test.pcluster', use_private_hostname=False)
2020-10-08 13:34:00,432 INFO [clustermgtd:_get_config] Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2020-10-08 13:34:00,434 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2020-10-08 13:34:00,576 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2020-10-08 13:34:00,576 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2020-10-08 13:34:05,749 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2020-10-08 13:34:06,580 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2020-10-08 13:34:07,018 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2020-10-08 13:34:07,019 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2020-10-08 13:34:07,019 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
[root@ip-172-31-58-150 parallelcluster]# cat slurm_resume.log
2020-10-08 13:59:05,759 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: compute-dy-m5xlarge-1
2020-10-08 13:59:05,934 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['compute-dy-m5xlarge-1']
2020-10-08 13:59:07,620 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('compute-dy-m5xlarge-1', EC2Instance(id='i-077d1784d495dd49f', private_ip='172.31.55.18', hostname='ip-172-31-55-18', launch_time=datetime.datetime(2020, 10, 8, 20, 59, 7, tzinfo=tzutc())))"]
2020-10-08 13:59:07,620 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2020-10-08 13:59:07,652 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2020-10-08 13:59:07,652 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z08971062IG8KMY8Z2MYN - slurm-test.pcluster
2020-10-08 13:59:08,503 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2020-10-08 13:59:08,504 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['compute-dy-m5xlarge-1']
2020-10-08 13:59:08,506 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
2020-10-08 15:05:32,031 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: compute-dy-m5xlarge-2
2020-10-08 15:05:32,105 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['compute-dy-m5xlarge-2']
2020-10-08 15:05:33,728 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('compute-dy-m5xlarge-2', EC2Instance(id='i-01186bf7ece5d1a85', private_ip='172.31.49.131', hostname='ip-172-31-49-131', launch_time=datetime.datetime(2020, 10, 8, 22, 5, 33, tzinfo=tzutc())))"]
2020-10-08 15:05:33,728 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2020-10-08 15:05:33,763 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2020-10-08 15:05:33,763 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z08971062IG8KMY8Z2MYN - slurm-test.pcluster
2020-10-08 15:05:34,257 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2020-10-08 15:05:34,258 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['compute-dy-m5xlarge-2']
2020-10-08 15:05:34,260 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
[root@ip-172-31-58-150 parallelcluster]# cat slurm_suspend.log
2020-10-08 15:03:10,025 - [slurm_plugin.suspend:main] - INFO - Suspending following nodes. Clustermgtd will cleanup orphaned instances: compute-dy-m5xlarge-1
2020-10-08 15:03:10,026 - [slurm_plugin.suspend:main] - INFO - SuspendProgram finished. Nodes will be available after SuspendTimeout
As you can see, it took just under an hour for the node above to be marked down.
I'd really like Slurm to update the node state much faster, so we can avoid both the long restart time and having a bunch of jobs get stuck trying to run on a dead node.
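One workaround I'm considering (untested on this cluster) is to force the node down as soon as we know the instance is gone, so the job is requeued immediately instead of waiting out the timeout:
# untested sketch: mark the dead node DOWN by hand so Slurm requeues its job right away
sudo /opt/slurm/bin/scontrol update NodeName=compute-dy-m5xlarge-1 State=DOWN Reason="instance terminated"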