Failure to spawn session

I am unable to spawn a session on the science console this morning, regardless of the size of server requested. I wonder if it’s related to the job I ran on the XL image last night, which produced a large number of files. Here are the last few log messages before it failed:

2025-10-29T12:44:58Z [Warning] Setting volume ownership for c111cf2a-bbc6-41e8-b848-ffed59fb5efc/volumes/kubernetes.io~csi/pvc-cea232cf-699d-4b1d-bfc3-25b2533a4302/mount is taking longer than expected, consider using OnRootMismatch - https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods
2025-10-29T12:45:58Z [Warning] Setting volume ownership for c111cf2a-bbc6-41e8-b848-ffed59fb5efc/volumes/kubernetes.io~csi/pvc-cea232cf-699d-4b1d-bfc3-25b2533a4302/mount, processed 182827 files.
2025-10-29T12:46:58Z [Warning] Setting volume ownership for c111cf2a-bbc6-41e8-b848-ffed59fb5efc/volumes/kubernetes.io~csi/pvc-cea232cf-699d-4b1d-bfc3-25b2533a4302/mount, processed 305511 files.
2025-10-29T12:47:58Z [Warning] Setting volume ownership for c111cf2a-bbc6-41e8-b848-ffed59fb5efc/volumes/kubernetes.io~csi/pvc-cea232cf-699d-4b1d-bfc3-25b2533a4302/mount, processed 429705 files.
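For context on the warning's suggestion: `OnRootMismatch` is a value of the pod-level `fsGroupChangePolicy` field described in the Kubernetes docs linked in the log line. With it set, the kubelet skips the full recursive ownership walk (the per-file `processed N files` pass above) whenever the volume root already has the expected group. A sketch of where the field lives (the field names come from the linked docs; the pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod          # illustrative
spec:
  securityContext:
    fsGroup: 1000                          # group applied to mounted volumes
    fsGroupChangePolicy: "OnRootMismatch"  # only chown/chmod the whole volume
                                           # when the volume root's ownership
                                           # is wrong, instead of walking
                                           # every file on each mount
  containers:
    - name: app
      image: busybox         # illustrative
```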

Can you give this another try? We’ve confirmed that others are able to start up instances. If you won’t lose anything by logging out and logging back in, do that. Otherwise, just try to start a session.


Do you know the number of files, or ballpark order of magnitude?

Roughly 500,000 files, give or take 100,000.


Trying to spawn a new one now after logging out and back in… it seems to hang for a while after a `processed 429494 files` log message. I'll update as soon as it either succeeds or fails… and it failed.

Another attempt failed, but the log messages have changed:

2025-10-29T14:44:48Z [Warning] 0/6 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {jupyter.org/used-by-singleuser: true}. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
2025-10-29T14:45:01Z [Warning] 0/6 nodes are available: 1 node(s) didn't match PersistentVolume's node affinity, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {jupyter.org/used-by-singleuser: true}. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
2025-10-29T14:45:04Z [Normal] Successfully assigned jupyterhub/jupyter-zclaytor to ip-10-0-192-233.ec2.internal
Spawn failed: 

i.e., it’s not saying anything about processing files anymore.

I’ll have to come back to this tomorrow.

This may not be relevant to this issue, but just in case it is, one of the TIKE devs at STScI told me they had a similar problem before:

…caused by a database that keeps track of file changes for real-time collaboration, so it might be a different underlying issue than the one you’re having (just in case though: if you have a jupyter_ystore.db, clearing that solved our issue)
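For anyone who hits the real-time-collaboration variant of this, a minimal sketch of clearing that database (its location varies by deployment, so this only searches and prints; the delete command is left commented out until you have confirmed the hits):

```shell
# Locate the collaboration ystore database under the home directory.
# Depth and name pattern are assumptions; confirm what find reports
# before deleting anything.
find "$HOME" -maxdepth 4 -name '*jupyter_ystore.db' 2>/dev/null

# Once confirmed, remove it; the database is recreated on the next
# server start, so only unsaved collaborative state is lost.
# find "$HOME" -maxdepth 4 -name '*jupyter_ystore.db' -delete
```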

What path did you save the files to? We will be increasing the spawn timeout this evening with an update.

They’re in ~/projects/lcss_scratch/wavelets/ if I recall correctly. Inside are a few dozen subdirectories of the form s0001, and all of the files live in those subdirectories.
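In case it helps with reproducing, a quick way to count what's actually in there (path as above; the `s*` glob assumes all sector directories share that prefix):

```shell
# Count files in each sector subdirectory, then the grand total.
WAVELETS_DIR="$HOME/projects/lcss_scratch/wavelets"

for d in "$WAVELETS_DIR"/s*/; do
  printf '%-50s %s\n' "$d" "$(find "$d" -type f | wc -l)"
done
echo "total: $(find "$WAVELETS_DIR" -type f | wc -l)"
```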

By the way, if you need to delete the wavelets directory to make things work again, that's fine. I backed up enough of it before I logged off.

I am currently trying to reproduce the issue to verify the fix.

I have tarred the contents of the ~/projects/lcss_scratch directory to ~/projects/lcss_scratch_backup_20251030_154138.tar.gz; it contained over 700,000 files. We have not implemented the fix for this yet, but you should be able to open your session now. Let me know if you have any issues.
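For the record, a backup along these lines (the exact flags are an assumption; the paths follow the ones quoted above):

```shell
# Archive the scratch directory into a single compressed tarball,
# then verify the archive before removing anything.
tar -czf ~/projects/lcss_scratch_backup_20251030_154138.tar.gz \
    -C ~/projects lcss_scratch

# Sanity check: count the entries stored in the archive.
tar -tzf ~/projects/lcss_scratch_backup_20251030_154138.tar.gz | wc -l
```

Collapsing hundreds of thousands of small files into one archive also sidesteps the per-file ownership walk on future mounts.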

Thanks! I can confirm that I can now open a session as normal. I promise to be more careful with file creation in the future 🙂

By the way, yesterday I read the discussion of Fornax Storage Resources in the documentation. Would this have been avoided if I had been saving files to, e.g., shared storage? Or would I have broken access for everyone? (Or something in between?)

The shared storage is an NFS mount and does not have fsGroup ownership issues (NFS handles permissions differently), but it would be much lower performance.

You definitely do not want to be writing a million files to the EFS storage; don't do that. We would like to understand the use case, i.e., what you're doing. The particular issue we've run into here is an underlying Kubernetes storage-driver behavior that gets triggered. Maybe we can set up a call with you tomorrow or Monday?

Noted, thanks for the information. I have some time for a call on Monday, but in the meantime, the use case is that I am running frequency transforms for every TESS light curve to build an experimental similarity search for light curves for MAST: 1.6 million light curves -> 1.6 million transforms, and I'm storing them in the same file hierarchy that the TESS light curves use in S3.

We deployed a fix; it should be resolved. Feel free to reopen if there are any further issues.