Description
Required Info:
- AWS ParallelCluster version: 2.9.1
- Full cluster configuration (no credentials or personal data):
[aws]
aws_region_name = us-west-2
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
[global]
cluster_template = default
update_check = true
sanity_check = true
[cluster default]
key_name = ec2-admin
base_os = centos7
scheduler = slurm
master_instance_type = m5.large
vpc_settings = default
queue_settings = compute
tags = {"purpose": "development-ondemand"}
custom_ami = ami-0cbdde0662f037f7c
master_root_volume_size = 100
compute_root_volume_size = 100
[vpc default]
vpc_id = vpc-fa52d99e
master_subnet_id = subnet-00bbd054b223b7501
vpc_security_group_id = sg-74e42c12
use_public_ips = false
[queue compute]
enable_efa = false
compute_resource_settings = default
[compute_resource default]
instance_type = m5.xlarge
- Cluster name: slurm-test
- ARN of the cluster CloudFormation main stack: arn:aws:cloudformation:us-west-2:171496337684:stack/parallelcluster-slurm-test/047735a0-09a4-11eb-8cb8-067714b800aa
Bug description and how to reproduce:
When a node that is running a job is terminated, the time it takes for Slurm to update the job/node state is quite variable: sometimes a few minutes, but I've seen it take nearly an hour. Here is an example run:
first job running:
[ip-172-31-58-150]keien:slurm-test [14:06]
$ cat requeue.sh
#!/bin/bash
hostname
date
sleep 600
[ip-172-31-58-150]keien:~ [14:06]
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute requeue. keien R 2:28 1 compute-dy-m5xlarge-1
[ip-172-31-58-150]keien:~ [14:06]
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 9 idle~ compute-dy-m5xlarge-[2-10]
compute* up infinite 1 mix compute-dy-m5xlarge-1
[ip-172-31-58-150]keien:slurm-test [14:07]
$ cat slurm-2.out
compute-dy-m5xlarge-1
Thu Oct 8 14:03:45 PDT 2020
terminate node:
[compute-dy-m5xlarge-1]keien:~ [14:07]
$ sudo shutdown -h now
[sudo] password for keien:
Connection to compute-dy-m5xlarge-1 closed by remote host.
Connection to compute-dy-m5xlarge-1 closed.
right after termination:
[ip-172-31-58-150]keien:slurm-test [14:08]
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute requeue. keien R 5:49 1 compute-dy-m5xlarge-1
[ip-172-31-58-150]keien:slurm-test [14:09]
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 9 idle~ compute-dy-m5xlarge-[2-10]
compute* up infinite 1 mix compute-dy-m5xlarge-1
attempting to salloc while in this state:
[ip-172-31-58-150]keien:slurm-test [14:15]
$ salloc bash
salloc: Granted job allocation 3
keien@ip-172-31-58-150 ~/ws/slurm-test $ srun hostname
srun: error: Task launch for 3.0 failed on node compute-dy-m5xlarge-1: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
the salloc job gets stuck in CG after the above:
[ip-172-31-58-150]keien:slurm-test [14:17]
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 compute bash keien CG 1:03 1 compute-dy-m5xlarge-1
2 compute requeue. keien R 14:02 1 compute-dy-m5xlarge-1
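For anyone else hitting this stuck-CG state, scontrol gives a bit more detail about what the controller still believes about the node and the job (just a sketch of the commands; the node/job IDs are from this run):
scontrol show node compute-dy-m5xlarge-1   # State/Reason and when slurmctld last heard from the node
scontrol show job 3                        # the CG job still bound to the dead node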
job restarted:
[ip-172-31-58-150]keien:slurm-test [15:15]
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute requeue. keien R 4:38 1 compute-dy-m5xlarge-2
[ip-172-31-58-150]keien:slurm-test [15:15]
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 9 idle~ compute-dy-m5xlarge-[1,3-10]
compute* up infinite 1 mix compute-dy-m5xlarge-2
[ip-172-31-58-150]keien:slurm-test [15:15]
$ cat slurm-2.out
compute-dy-m5xlarge-2
Thu Oct 8 15:10:45 PDT 2020
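To reproduce this without an interactive session, something like the following should work (a sketch, not an exact replay of the run above; the node name is taken from squeue, and the watch loop is only there to measure the delay):
# submit the sleep job, terminate its instance, then watch how long slurmctld takes to notice
sbatch requeue.sh
sleep 300                                 # give the dynamic node time to come up and start the job
NODE=$(squeue -h -o "%R" -n requeue.sh)   # e.g. compute-dy-m5xlarge-1
ssh -t "$NODE" sudo shutdown -h now       # kill the instance out from under the job (prompts for sudo password in my setup)
while true; do                            # timestamped view of node and job state
  date
  sinfo -N -n "$NODE" -o "%N %T"
  squeue -o "%i %T %R"
  sleep 60
done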
logs:
[root@ip-172-31-58-150 log]# cat slurmctld.log
[2020-10-08T13:33:39.952] error: chdir(/var/log): Permission denied
[2020-10-08T13:33:39.953] slurmctld version 20.02.4 started on cluster parallelcluster
[2020-10-08T13:33:42.279] No memory enforcing mechanism configured.
[2020-10-08T13:33:42.279] layouts: no layout to initialize
[2020-10-08T13:33:42.711] layouts: loading entities/relations information
[2020-10-08T13:33:42.711] error: Could not open node state file /var/spool/slurm.state/node_state: No such file or directory
[2020-10-08T13:33:42.711] error: NOTE: Trying backup state save file. Information may be lost!
[2020-10-08T13:33:42.711] No node state file (/var/spool/slurm.state/node_state.old) to recover
[2020-10-08T13:33:42.711] error: Could not open job state file /var/spool/slurm.state/job_state: No such file or directory
[2020-10-08T13:33:42.711] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-10-08T13:33:42.711] No job state file (/var/spool/slurm.state/job_state.old) to recover
[2020-10-08T13:33:42.712] select/cons_tres: select_p_node_init
[2020-10-08T13:33:42.712] select/cons_tres: preparing for 3 partitions
[2020-10-08T13:33:42.997] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:42.998] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:42.999] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:42.999] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.000] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.000] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.001] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.002] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.002] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.003] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.004] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.004] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.005] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.005] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.006] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.007] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.007] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.008] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.008] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.009] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.010] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.011] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.012] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.013] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.014] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.014] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.014] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.014] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia0): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia1): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia2): No such file or directory
[2020-10-08T13:33:43.015] error: gres_device_major: stat(/dev/nvidia3): No such file or directory
[2020-10-08T13:33:43.016] error: Could not open reservation state file /var/spool/slurm.state/resv_state: No such file or directory
[2020-10-08T13:33:43.016] error: NOTE: Trying backup state save file. Reservations may be lost
[2020-10-08T13:33:43.016] No reservation state file (/var/spool/slurm.state/resv_state.old) to recover
[2020-10-08T13:33:43.016] error: Could not open trigger state file /var/spool/slurm.state/trigger_state: No such file or directory
[2020-10-08T13:33:43.016] error: NOTE: Trying backup state save file. Triggers may be lost!
[2020-10-08T13:33:43.016] No trigger state file (/var/spool/slurm.state/trigger_state.old) to recover
[2020-10-08T13:33:43.016] read_slurm_conf: backup_controller not specified
[2020-10-08T13:33:43.016] Reinitializing job accounting state
[2020-10-08T13:33:43.016] select/cons_tres: reconfigure
[2020-10-08T13:33:43.016] select/cons_tres: select_p_node_init
[2020-10-08T13:33:43.016] select/cons_tres: preparing for 3 partitions
[2020-10-08T13:33:43.016] Running as primary controller
[2020-10-08T13:33:43.122] No parameter for mcs plugin, default values set
[2020-10-08T13:33:43.122] mcs: MCSParameters = (null). ondemand set.
[2020-10-08T13:34:43.393] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-10-08T13:38:43.546] error: Could not open job state file /var/spool/slurm.state/job_state: No such file or directory
[2020-10-08T13:38:43.546] error: NOTE: Trying backup state save file. Jobs may be lost!
[2020-10-08T13:38:43.546] No job state file (/var/spool/slurm.state/job_state.old) found
[2020-10-08T13:59:04.324] _slurm_rpc_submit_batch_job: JobId=2 InitPrio=4294901759 usec=1258
[2020-10-08T13:59:04.472] sched: Allocate JobId=2 NodeList=compute-dy-m5xlarge-1 #CPUs=1 Partition=compute
[2020-10-08T14:03:28.136] Node compute-dy-m5xlarge-1 now responding
[2020-10-08T14:03:43.670] job_time_limit: Configuration for JobId=2 complete
[2020-10-08T14:03:43.670] Resetting JobId=2 start time for node power up
[2020-10-08T14:13:43.077] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:15:27.168] sched: _slurm_rpc_allocate_resources JobId=3 NodeList=compute-dy-m5xlarge-1 usec=284
[2020-10-08T14:15:27.168] prolog_running_decr: Configuration for JobId=3 is complete
[2020-10-08T14:15:38.897] job_step_signal: JobId=3 StepId=0 not found
[2020-10-08T14:16:10.000] job_step_signal: JobId=3 StepId=0 not found
[2020-10-08T14:16:30.292] _job_complete: JobId=3 WEXITSTATUS 233
[2020-10-08T14:16:30.292] _job_complete: JobId=3 done
[2020-10-08T14:18:43.275] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:19:43.321] Resending TERMINATE_JOB request JobId=3 Nodelist=compute-dy-m5xlarge-1
[2020-10-08T14:23:43.479] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:28:43.768] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:33:43.965] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:38:43.163] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:43:43.368] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:48:43.566] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:53:43.896] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T14:58:43.102] error: Nodes compute-dy-m5xlarge-1 not responding
[2020-10-08T15:03:09.284] requeue job JobId=2 due to failure of node compute-dy-m5xlarge-1
[2020-10-08T15:03:09.284] Requeuing JobId=2
[2020-10-08T15:03:09.284] cleanup_completing: JobId=3 completion process took 2799 seconds
[2020-10-08T15:03:09.284] error: Nodes compute-dy-m5xlarge-1 not responding, setting DOWN
[2020-10-08T15:05:30.789] backfill: Started JobId=2 in compute on compute-dy-m5xlarge-2
[2020-10-08T15:10:33.585] Node compute-dy-m5xlarge-2 now responding
[2020-10-08T15:10:43.589] job_time_limit: Configuration for JobId=2 complete
[2020-10-08T15:10:43.589] Resetting JobId=2 start time for node power up
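For reference, the timeline above presumably depends on the controller's node timeout settings; I haven't confirmed which one accounts for the ~55 minutes, but they can be checked with something like:
scontrol show config | grep -E 'SlurmdTimeout|ResumeTimeout|SuspendTime'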
[root@ip-172-31-58-150 parallelcluster]# cat clustermgtd
2020-10-08 13:33:59,248 INFO [clustermgtd:main] ClusterManager Startup
2020-10-08 13:33:59,248 INFO [clustermgtd:_get_config] Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2020-10-08 13:33:59,251 INFO [clustermgtd:set_config] Applying new clustermgtd config: ClustermgtdConfig(_config=<configparser.ConfigParser object at 0x7fadf58e8f60>, region='us-west-2', cluster_name='slurm-test', dynamodb_table='parallelcluster-slurm-test', master_private_ip='172.31.58.150', master_hostname='ip-172-31-58-150.us-west-2.compute.internal cerebras.aws', instance_name_type_mapping={'m5xlarge': 'm5.xlarge'}, _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, loop_time=60, disable_all_cluster_management=False, heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', boto3_config=<botocore.config.Config object at 0x7fadf3fd24a8>, logging_config='/opt/parallelcluster/pyenv/versions/3.6.9/envs/node_virtualenv/lib/python3.6/site-packages/slurm_plugin/logging/parallelcluster_clustermgtd_logging.conf', disable_ec2_health_check=False, disable_scheduled_event_health_check=False, health_check_timeout=180, disable_all_health_checks=False, launch_max_batch_size=500, update_node_address=True, terminate_max_batch_size=1000, node_replacement_timeout=3600, terminate_drain_nodes=True,
terminate_down_nodes=True, orphaned_instance_timeout=120, hosted_zone='Z08971062IG8KMY8Z2MYN', dns_domain='slurm-test.pcluster', use_private_hostname=False)
2020-10-08 13:34:00,432 INFO [clustermgtd:_get_config] Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2020-10-08 13:34:00,434 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2020-10-08 13:34:00,576 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2020-10-08 13:34:00,576 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2020-10-08 13:34:05,749 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving list of EC2 instances associated with the cluster
2020-10-08 13:34:06,580 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2020-10-08 13:34:07,018 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2020-10-08 13:34:07,019 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2020-10-08 13:34:07,019 - [slurm_plugin.clustermgtd:_terminate_orphaned_instances] - INFO - Checking for orphaned instance
[root@ip-172-31-58-150 parallelcluster]# cat slurm_resume.log
2020-10-08 13:59:05,759 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: compute-dy-m5xlarge-1
2020-10-08 13:59:05,934 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['compute-dy-m5xlarge-1']
2020-10-08 13:59:07,620 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('compute-dy-m5xlarge-1', EC2Instance(id='i-077d1784d495dd49f', private_ip='172.31.55.18', hostname='ip-172-31-55-18', launch_time=datetime.datetime(2020, 10, 8, 20, 59, 7, tzinfo=tzutc())))"]
2020-10-08 13:59:07,620 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2020-10-08 13:59:07,652 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2020-10-08 13:59:07,652 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z08971062IG8KMY8Z2MYN - slurm-test.pcluster
2020-10-08 13:59:08,503 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2020-10-08 13:59:08,504 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['compute-dy-m5xlarge-1']
2020-10-08 13:59:08,506 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
2020-10-08 15:05:32,031 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: compute-dy-m5xlarge-2
2020-10-08 15:05:32,105 - [slurm_plugin.common:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['compute-dy-m5xlarge-2']
2020-10-08 15:05:33,728 - [slurm_plugin.common:_update_slurm_node_addrs] - INFO - Nodes are now configured with instances: (x1) ["('compute-dy-m5xlarge-2', EC2Instance(id='i-01186bf7ece5d1a85', private_ip='172.31.49.131', hostname='ip-172-31-49-131', launch_time=datetime.datetime(2020, 10, 8, 22, 5, 33, tzinfo=tzutc())))"]
2020-10-08 15:05:33,728 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2020-10-08 15:05:33,763 - [slurm_plugin.common:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2020-10-08 15:05:33,763 - [slurm_plugin.common:_update_dns_hostnames] - INFO - Updating DNS records for Z08971062IG8KMY8Z2MYN - slurm-test.pcluster
2020-10-08 15:05:34,257 - [slurm_plugin.common:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2020-10-08 15:05:34,258 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['compute-dy-m5xlarge-2']
2020-10-08 15:05:34,260 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
[root@ip-172-31-58-150 parallelcluster]# cat slurm_suspend.log
2020-10-08 15:03:10,025 - [slurm_plugin.suspend:main] - INFO - Suspending following nodes. Clustermgtd will cleanup orphaned instances: compute-dy-m5xlarge-1
2020-10-08 15:03:10,026 - [slurm_plugin.suspend:main] - INFO - SuspendProgram finished. Nodes will be available after SuspendTimeout
As you can see, it took just under an hour for the node above to be marked down.
I'd really like Slurm to update the node state much faster, so we can avoid both the long restart time and having a bunch of jobs get stuck trying to run on a dead node.
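One workaround I'm considering (untested on this cluster) is to force the node down as soon as we know the instance is gone, so the job is requeued immediately instead of waiting out the timeout:
# untested sketch: mark the dead node DOWN by hand so Slurm requeues its job right away
sudo /opt/slurm/bin/scontrol update NodeName=compute-dy-m5xlarge-1 State=DOWN Reason="instance terminated"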