Describe the bug MountTmpfsAtTemp=false doesn't always seem to take effect. A large proportion of instances come up with tmpfs still mounted at /tmp, and are then terminated.
This causes a lot of instance churn on scale-up.
I suspect something else running on startup writes to the /tmp dir at around the same time that bk-mount-instance-storage.sh runs, keeping the mount busy and causing the unmount to fail. Is there anything in the AMI that might do that?
Note: it's entirely possible that something in our custom AMI, which we build on top of the Buildkite-provided one, is responsible for this. If you can't reproduce it, that is still useful information telling us we need to audit our startup processes.
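To help audit what touches /tmp during startup, one quick check is to scan /proc for open file descriptors pointing under /tmp. This is only a sketch for diagnosis; the scan_tmp_holders helper is illustrative and not part of the stack's scripts. It works without fuser or lsof, and fd directories it cannot read (other users' processes, when run unprivileged) are silently skipped.

```shell
#!/bin/sh
# List processes holding files open under /tmp by scanning /proc.
# Output: "<pid> <path>" per open file. Illustrative helper only.
scan_tmp_holders() {
  for pid in /proc/[0-9]*; do
    for fd in "$pid"/fd/*; do
      # readlink fails for unreadable fds (permissions) or races
      # where the fd closed between glob and readlink; skip those.
      target=$(readlink "$fd" 2>/dev/null) || continue
      case "$target" in
        /tmp/*) printf '%s %s\n' "${pid##*/}" "$target" ;;
      esac
    done
  done
}

scan_tmp_holders
```

Running this (ideally as root) right after bk-mount-instance-storage.sh fails should name the process keeping /tmp busy.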
Steps To Reproduce
Steps to reproduce the behavior:
Create an elastic-ci stack on version v6.21.0
Set MountTmpfsAtTemp to false
Wait for the Auto Scaling group to come online
Observe that some instances transition to the InService state, but are terminated after ~1 minute.
Expected behavior
When MountTmpfsAtTemp is set to false, the instance runs systemctl mask --now tmp.mount on startup, which correctly unmounts the tmpfs from the /tmp directory. We observe this on a small number of the instances that come up:
[yuchuanyuan@ip-10-0-102-149 ~]$ sudo cat /var/log/elastic-stack.log
Starting /usr/local/bin/bk-mount-instance-storage.sh...
Disabling automatic mount of tmpfs at /tmp
Created symlink /etc/systemd/system/tmp.mount → /dev/null.
Mounting instance storage...
No NVMe drives to mount.
Please check that your instance type supports instance storage.
<truncated>
[yuchuanyuan@ip-10-0-102-149 ~]$ systemctl status tmp.mount
○ tmp.mount
Loaded: masked (Reason: Unit tmp.mount is masked.)
Active: inactive (dead) since Tue 2024-06-11 20:02:24 UTC; 1h 18min ago
Duration: 8.556s
CPU: 7ms
Jun 11 20:02:24 ip-10-0-102-149.ec2.internal systemd[1]: Unmounting tmp.mount - /tmp...
Jun 11 20:02:24 ip-10-0-102-149.ec2.internal systemd[1]: tmp.mount: Deactivated successfully.
Jun 11 20:02:24 ip-10-0-102-149.ec2.internal systemd[1]: Unmounted tmp.mount - /tmp.
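The "Created symlink ... → /dev/null" line in the log is the mask step: masking a systemd unit is just a symlink from the unit name to /dev/null, and --now additionally stops (here, unmounts) the unit. A sketch of that mechanism, done in a scratch directory so it is safe to run anywhere (the real path is /etc/systemd/system):

```shell
#!/bin/sh
# Sketch of what `systemctl mask --now tmp.mount` does under the hood,
# using a scratch directory instead of /etc/systemd/system.
unitdir=$(mktemp -d)

# 1. Mask: symlink the unit name to /dev/null so systemd can never
#    load it again.
ln -s /dev/null "$unitdir/tmp.mount"

# 2. With --now, systemd additionally runs the equivalent of
#    `systemctl stop tmp.mount`, i.e. `umount /tmp` -- the step that
#    fails with "target is busy" in the bad case below. (Not run here.)

readlink "$unitdir/tmp.mount"   # prints /dev/null

rm -rf "$unitdir"
```

Since the mask symlink is created unconditionally, a failed --now leaves the unit masked but still mounted, which matches the "Loaded: masked" / "Active: active (mounted)" combination below.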
Actual behaviour
On a large number of instances, instead we see the following:
[yuchuanyuan@ip-10-0-114-131 ~]$ sudo cat /var/log/elastic-stack.log
Starting /usr/local/bin/bk-mount-instance-storage.sh...
Disabling automatic mount of tmpfs at /tmp
Created symlink /etc/systemd/system/tmp.mount → /dev/null.
Job failed. See "journalctl -xe" for details.
/usr/local/bin/bk-mount-instance-storage.sh errored with exit code 1 on line 33.
Starting /usr/local/bin/bk-configure-docker.sh...
Sourcing /usr/local/lib/bk-configure-docker.sh...
<truncated>
[yuchuanyuan@ip-10-0-114-131 ~]$ systemctl status tmp.mount
● tmp.mount - /tmp
Loaded: masked (Reason: Unit tmp.mount is masked.)
Active: active (mounted) (Result: exit-code) since Tue 2024-06-11 20:35:25 UTC; 24s ago
Where: /tmp
What: tmpfs
Tasks: 0 (limit: 9247)
Memory: 44.0K
CPU: 6ms
CGroup: /system.slice/tmp.mount
Jun 11 20:35:25 ip-10-0-114-131.ec2.internal systemd[1]: Unmounting tmp.mount - /tmp...
Jun 11 20:35:25 ip-10-0-114-131.ec2.internal umount[1923]: umount: /tmp: target is busy.
Jun 11 20:35:25 ip-10-0-114-131.ec2.internal systemd[1]: tmp.mount: Mount process exited, code=exited, status=32/n/a
Jun 11 20:35:25 ip-10-0-114-131.ec2.internal systemd[1]: Failed unmounting tmp.mount - /tmp.
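Since "target is busy" is a transient race with whatever other startup process has files open under /tmp, one generic mitigation is to retry the unmount with a backoff. This is only a sketch of that idea, not necessarily what #1327 implements; retry_cmd and its parameters are illustrative.

```shell
#!/bin/sh
# Retry a command up to $1 times with linear backoff, for transient
# failures such as `umount: /tmp: target is busy`. Illustrative
# helper, not part of the stack's scripts.
retry_cmd() {
  attempts=$1; shift
  i=1
  while :; do
    "$@" && return 0
    [ "$i" -ge "$attempts" ] && return 1
    sleep "$i"          # back off: 1s, 2s, 3s, ...
    i=$((i + 1))
  done
}

# Intended use (not run here; requires root and systemd):
#   retry_cmd 5 systemctl mask --now tmp.mount
```

This only papers over the race; identifying and reordering the startup process that holds /tmp open would be the cleaner fix.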
Hey @yyc, I haven't reproduced yet, and I'm not sure what, if anything, we would be running that is using /tmp concurrently with stack setup. However, I have an idea for a workaround in #1327. WDYT?
@DrJosh9000 Yes I've tested that and it looks like it works! Thanks for the quick fix :)
Stack parameters: AMI ami-04ca34320055d861c