Failure to spawn session

I am unable to spawn a session on the science console this morning, regardless of the size of server requested. I wonder if it’s related to the job I ran on the XL image last night, which produced a large number of files. Here are the last few log messages before it failed:

2025-10-29T12:44:58Z [Warning] Setting volume ownership for c111cf2a-bbc6-41e8-b848-ffed59fb5efc/volumes/kubernetes.io~csi/pvc-cea232cf-699d-4b1d-bfc3-25b2533a4302/mount is taking longer than expected, consider using OnRootMismatch - https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods
2025-10-29T12:45:58Z [Warning] Setting volume ownership for c111cf2a-bbc6-41e8-b848-ffed59fb5efc/volumes/kubernetes.io~csi/pvc-cea232cf-699d-4b1d-bfc3-25b2533a4302/mount, processed 182827 files.
2025-10-29T12:46:58Z [Warning] Setting volume ownership for c111cf2a-bbc6-41e8-b848-ffed59fb5efc/volumes/kubernetes.io~csi/pvc-cea232cf-699d-4b1d-bfc3-25b2533a4302/mount, processed 305511 files.
2025-10-29T12:47:58Z [Warning] Setting volume ownership for c111cf2a-bbc6-41e8-b848-ffed59fb5efc/volumes/kubernetes.io~csi/pvc-cea232cf-699d-4b1d-bfc3-25b2533a4302/mount, processed 429705 files.
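For context on the warning's suggestion: `OnRootMismatch` is a value of the pod-level `fsGroupChangePolicy` field described in the Kubernetes docs linked in the log line. With it set, the kubelet skips the full recursive ownership walk (the per-file `processed N files` pass above) whenever the volume root already has the expected group. A sketch of where the field lives (the field names come from the linked docs; the pod name and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod          # illustrative
spec:
  securityContext:
    fsGroup: 1000                          # group applied to mounted volumes
    fsGroupChangePolicy: "OnRootMismatch"  # only chown/chmod the whole volume
                                           # when the volume root's ownership
                                           # is wrong, instead of walking
                                           # every file on each mount
  containers:
    - name: app
      image: busybox         # illustrative
```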

Can you give this another try? We’ve confirmed that others are able to start up instances. If you won’t lose anything by logging out and logging back in, do that. Otherwise, just try to start a session.


Do you know the number of files, or ballpark order of magnitude?

Roughly 500,000 files, give or take 100,000.


Trying to spawn a new one now after logging out and back in… it seems to hang for a while after a `processed 429494 files` log message. I'll update as soon as it either succeeds or fails… and it failed.

Another attempt failed, but the log messages have changed:

2025-10-29T14:44:48Z [Warning] 0/6 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {jupyter.org/used-by-singleuser: true}. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
2025-10-29T14:45:01Z [Warning] 0/6 nodes are available: 1 node(s) didn't match PersistentVolume's node affinity, 2 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {jupyter.org/used-by-singleuser: true}. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling.
2025-10-29T14:45:04Z [Normal] Successfully assigned jupyterhub/jupyter-zclaytor to ip-10-0-192-233.ec2.internal
Spawn failed: 

i.e., it’s not saying anything about processing files anymore.

I’ll have to come back to this tomorrow.

This may not be relevant to this issue, but just in case it is, one of the TIKE devs at STScI told me they had a similar problem before:

…caused by a database that keeps track of file changes for real-time collaboration, so it might be a different underlying issue than the one you’re having (just in case though: if you have a jupyter_ystore.db, clearing that solved our issue)
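For anyone who hits the real-time-collaboration variant of this, a minimal sketch of clearing that database (its location varies by deployment, so this only searches and prints; the delete command is left commented out until you have confirmed the hits):

```shell
# Locate the collaboration ystore database under the home directory.
# Depth and name pattern are assumptions; confirm what find reports
# before deleting anything.
find "$HOME" -maxdepth 4 -name '*jupyter_ystore.db' 2>/dev/null

# Once confirmed, remove it; the database is recreated on the next
# server start, so only unsaved collaborative state is lost.
# find "$HOME" -maxdepth 4 -name '*jupyter_ystore.db' -delete
```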

What path did you save the files to? We will be increasing the spawn timeout this evening with an update.

They’re in ~/projects/lcss_scratch/wavelets/ if I recall correctly. Inside are a few dozen subdirectories of the form s0001, and all of the files live in those subdirectories.
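In case it helps with reproducing, a quick way to count what's actually in there (path as above; the `s*` glob assumes all sector directories share that prefix):

```shell
# Count files in each sector subdirectory, then the grand total.
WAVELETS_DIR="$HOME/projects/lcss_scratch/wavelets"

for d in "$WAVELETS_DIR"/s*/; do
  printf '%-50s %s\n' "$d" "$(find "$d" -type f | wc -l)"
done
echo "total: $(find "$WAVELETS_DIR" -type f | wc -l)"
```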

By the way, if you need to delete the wavelets directory to make things work again, that's fine. I backed up enough of it before I logged off.

I am currently trying to reproduce the issue to verify the fix.

I have tarred the contents of the ~/projects/lcss_scratch directory to ~/projects/lcss_scratch_backup_20251030_154138.tar.gz; it contained over 700,000 files. We have not implemented the fix for this yet, but you should be able to open your session now. Let me know if you have any issues.
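For the record, a backup along these lines (the exact flags are an assumption; the paths follow the ones quoted above):

```shell
# Archive the scratch directory into a single compressed tarball,
# then verify the archive before removing anything.
tar -czf ~/projects/lcss_scratch_backup_20251030_154138.tar.gz \
    -C ~/projects lcss_scratch

# Sanity check: count the entries stored in the archive.
tar -tzf ~/projects/lcss_scratch_backup_20251030_154138.tar.gz | wc -l
```

Collapsing hundreds of thousands of small files into one archive also sidesteps the per-file ownership walk on future mounts.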

Thanks! I can confirm that I can now open a session as normal. I promise to be more careful with file creation in the future 🙂

By the way, yesterday I read the discussion of Fornax Storage Resources in the documentation. Would this have been avoided if I had been saving files to, e.g., shared storage? Or would I have broken access for everyone? (Or something in between?)

The shared storage is an NFS mount and does not have fsGroup ownership issues (NFS handles permissions differently), but it would be much lower performance.

You definitely do not want to be writing a million files to the EFS storage; don't do that. We would like to understand the use case, i.e., what you're doing. The particular issue we've run into here is an underlying Kubernetes storage-driver behavior that gets triggered. Maybe we can set up a call with you tomorrow or Monday?

Noted, thanks for the information. I have some time for a call on Monday, but in the meantime, the use case is that I am running frequency transforms for every TESS light curve to build an experimental similarity search for light curves for MAST: 1.6 million light curves -> 1.6 million transforms, and I'm storing them in the same file hierarchy that the TESS light curves use in S3.

We deployed a fix; it should be resolved. Feel free to reopen if there are any further issues.