I am unable to spawn a session on the science console this morning, regardless of the size of the server requested. I wonder if it’s related to the job I ran on the XL image last night, which produced a large number of files. Here are the last few log messages before it failed:
2025-10-29T12:44:58Z [Warning] Setting volume ownership for c111cf2a-bbc6-41e8-b848-ffed59fb5efc/volumes/kubernetes.io~csi/pvc-cea232cf-699d-4b1d-bfc3-25b2533a4302/mount is taking longer than expected, consider using OnRootMismatch - https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#configure-volume-permission-and-ownership-change-policy-for-pods
2025-10-29T12:45:58Z [Warning] Setting volume ownership for c111cf2a-bbc6-41e8-b848-ffed59fb5efc/volumes/kubernetes.io~csi/pvc-cea232cf-699d-4b1d-bfc3-25b2533a4302/mount, processed 182827 files.
2025-10-29T12:46:58Z [Warning] Setting volume ownership for c111cf2a-bbc6-41e8-b848-ffed59fb5efc/volumes/kubernetes.io~csi/pvc-cea232cf-699d-4b1d-bfc3-25b2533a4302/mount, processed 305511 files.
2025-10-29T12:47:58Z [Warning] Setting volume ownership for c111cf2a-bbc6-41e8-b848-ffed59fb5efc/volumes/kubernetes.io~csi/pvc-cea232cf-699d-4b1d-bfc3-25b2533a4302/mount, processed 429705 files.
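For reference, the warning above links to the Kubernetes docs for `fsGroupChangePolicy`; the suggested mitigation is setting it to `OnRootMismatch` in the pod’s `securityContext`, which skips the recursive chown when the volume root already has the right ownership. A rough sketch of what that looks like in a pod spec (pod name, group id, and image are hypothetical; the real spec would live in the hub’s pod template):

```shell
# Write an illustrative pod spec showing the OnRootMismatch policy.
cat > /tmp/fsgroup-patch.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: example-session            # hypothetical pod name
spec:
  securityContext:
    fsGroup: 1000                  # hypothetical group id
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
  - name: main
    image: registry.example/science-console:latest  # hypothetical image
EOF
grep 'fsGroupChangePolicy' /tmp/fsgroup-patch.yaml
```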
Can you give this another try? We’ve confirmed that others are able to start up instances. If you won’t lose anything by logging out and logging back in, do that. Otherwise, just try to start a session.
Trying to spawn a new one now after logging out and back in… it seems to be hanging after a “processed 429494 files” log message. I’ll update as soon as it either succeeds or fails… and it failed.
This may not be relevant to this issue, but just in case it is, one of the TIKE devs at STScI told me they had a similar problem before:
…caused by a database that keeps track of file changes for real-time collaboration, so it might be a different underlying issue than the one you’re having (just in case though: if you have a jupyter_ystore.db, clearing that solved our issue)
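In case the clearing step is useful to anyone: the idea is just to locate and delete the ystore database file. The real file typically sits somewhere under the user’s Jupyter data directory; the sketch below uses a throwaway directory and a stand-in file so it is safe to run as-is:

```shell
# Sketch: locate and clear a jupyter_ystore.db file (back up the real one first).
demo=/tmp/ystore_demo
rm -rf "$demo" && mkdir -p "$demo"
touch "$demo/jupyter_ystore.db"    # stand-in for the real database file
find "$demo" -name 'jupyter_ystore.db' -print -delete
```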
The files are in ~/projects/lcss_scratch/wavelets/, if I recall correctly. Inside are several dozen subdirectories of the form s0001, and all the files live in those subdirectories.
I have tarballed the contents of ~/projects/lcss_scratch to ~/projects/lcss_scratch_backup_20251030_154138.tar.gz; the directory contained over 700,000 files. We have not implemented the fix for this yet, but you should be able to open your session now. Let me know if you have any issues.
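The backup step amounts to archiving the scratch tree and then listing the archive to count its members. A minimal sketch with illustrative paths (not the real ~/projects/lcss_scratch):

```shell
# Build a tiny stand-in for the scratch tree, archive it, then list members.
src=/tmp/lcss_demo
rm -rf "$src" && mkdir -p "$src/s0001" "$src/s0002"
touch "$src/s0001/a.dat" "$src/s0002/b.dat"
tar -czf /tmp/lcss_demo_backup.tar.gz -C "$src" .
tar -tzf /tmp/lcss_demo_backup.tar.gz | wc -l   # member count includes directories
```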
Thanks! I confirm that I can now open the session as normal. I promise to be more careful with file creation in the future.
By the way, yesterday I read the discussion of Fornax Storage Resources in the documentation. Would this have been avoided if I had been saving files to, e.g., shared storage? Or would I have broken access for everyone? (Or something in between?)
The shared storage is an NFS mount and does not have the fsGroup ownership issue (NFS handles permissions differently), but it would be much lower performance.
You definitely do not want to be writing a million files to the EFS storage; don’t do that. We would like to understand the use case and what you’re doing. The particular issue we’ve run into here is an underlying Kubernetes storage-driver behavior that is getting triggered. Maybe we can set up a call with you tomorrow or Monday?
Noted, thanks for the information. I have some time for a call on Monday, but in the meantime the use case is that I am running frequency transforms for every TESS light curve to build an experimental similarity search over light curves for MAST: 1.6 million light curves → 1.6 million transforms, and I’m storing them in the same file hierarchy that the TESS light curves use in S3.
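One way to keep the sector layout while avoiding the file-count problem would be to collapse each sector subdirectory into a single archive, so the volume holds dozens of files instead of ~1.6 million. A rough sketch under illustrative paths (not the real ~/projects tree; whether this fits the similarity-search access pattern is a question for the call):

```shell
# Stand-in for the wavelets tree: one sector directory with a few transforms.
root=/tmp/wavelets_demo
rm -rf "$root" && mkdir -p "$root/s0001"
for i in 1 2 3; do touch "$root/s0001/transform_$i.dat"; done

# Collapse each sXXXX directory into one archive, then drop the directory.
for d in "$root"/s*/; do
  tar -czf "${d%/}.tar.gz" -C "$d" . && rm -r "$d"
done
ls "$root"
```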