I am reducing a lot (~100s) of XMM-Newton datasets using Python notebooks that call terminal SAS commands. When I want to reduce more than one dataset on my local machine, I make sure to use a separate Jupyter kernel for each dataset (or each loop over datasets), so that the environment variables SAS uses (e.g. $SAS_ODF) don’t get mixed up between datasets. Now I want to try reducing a few tens of datasets in parallel here on Fornax, and have a couple of questions:
What would be the best approach to make sure the notebooks use separate kernels? What I am doing now is creating copies of the sas environment following the instructions in this thread and then running separate notebooks in my-sas1, my-sas2… and so on. I want to make sure that this is truly using separate kernels, and see whether there’s a simpler way.
Is there a cleverer way to parallelize this? Making copies of environments and notebooks becomes onerous once we’re talking about ~100s of datasets. I wonder if this could be automated further, so that everything can be done in about a day on the XL server instead of many days on the smallest ones.
I asked @rjtanner about this, and unfortunately it seems there is no obvious way to make the SAS software work without separate kernels/environments.
In order to parallelize, you don’t need a separate kernel for each run. Calling the command line tool as a subprocess and passing the environment variables to each call will ensure the calls are isolated. This is how I usually do it.
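To convince yourself that the isolation really holds, here is a minimal check, not a SAS call. It assumes nothing beyond the Python standard library; the helper name echo_sas_odf and the two paths are made up for illustration. Each child process gets its own copy of the environment and reports the SAS_ODF value it sees, while the parent's os.environ is left untouched.

```python
import os
import subprocess
import sys


def echo_sas_odf(value):
    """Run a child process with its own SAS_ODF and return what the child sees."""
    envs = os.environ.copy()   # private copy; os.environ itself is not modified
    envs['SAS_ODF'] = value
    # The child only prints the variable; a real call would run a SAS task instead.
    child_code = "import os; print(os.environ['SAS_ODF'])"
    result = subprocess.run(
        [sys.executable, '-c', child_code],
        env=envs, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Hypothetical paths, used only to show that each child sees its own value
seen = [echo_sas_odf(path) for path in ('/data/obs_A/ODF', '/data/obs_B/ODF')]
print(seen)
```

Because the variable is set on the copied dictionary rather than on os.environ, the same pattern stays safe when many calls run at the same time.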
In the rgsproc script below:
obsids defines the list of observation IDs you want to process, assumed already downloaded.
run_rgsproc is an example of the function that you want to run. Change this as you wish following the skeleton. Its job is to process a single observation.
For completeness, note that you may need to run cifbuild and odfingest before rgsproc, following the standard analysis procedure. Those can be parallelized in a similar fashion.
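One way to chain several tasks for a single observation is sketched below. This is an illustration under assumptions, not the standard procedure itself: run_steps is a hypothetical helper, and the environment values in the commented usage are placeholders to fill in for your own data. It runs the commands in order in one working directory, updates a private copy of the environment before each step, and stops at the first non-zero return code.

```python
import os
import subprocess


def run_steps(steps, workdir, base_env=None):
    """Run (command, extra_env) pairs in order inside workdir.

    Returns (None, 0) if every step succeeds, otherwise
    (failed_command, returncode) for the first step that fails.
    """
    # Private copy of the environment, so parallel runs do not interfere
    envs = dict(os.environ if base_env is None else base_env)
    for command, extra_env in steps:
        envs.update(extra_env)   # variables that this and later steps need
        result = subprocess.run(
            command, shell=True, cwd=workdir, env=envs, check=False
        )
        if result.returncode != 0:
            return command, result.returncode
    return None, 0


# Hypothetical usage for one observation; the '...' values are placeholders:
# steps = [
#     ('cifbuild', {'SAS_ODF': '...'}),
#     ('odfingest', {'SAS_CCF': '...'}),
#     ('rgsproc', {'SAS_ODF': '...'}),
# ]
# failed_command, returncode = run_steps(steps, f'{obsid}/ODF')
```

A function like this can be handed to pool.map in the same way as a single-task function, with one call per obsid.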
Run the script below in the sas environment in the terminal (activate it with micromamba activate sas) or using the sas notebook kernel.
import os
import subprocess
from multiprocessing import Pool, cpu_count


def run_rgsproc(obsid):
    """Set up and run rgsproc.

    This is the main function that processes a single observation."""
    # Example steps may include:
    # work in obsid/ODF; the directory is passed to subprocess.run via cwd=
    # rather than os.chdir, so a worker that is reused for another obsid
    # still resolves the relative path correctly
    odf_dir = os.path.abspath(f'{obsid}/ODF')
    # clone os.environ to isolate from other parallel runs
    envs = os.environ.copy()
    # add variables needed for SAS
    envs['SAS_CCF'] = '...'
    envs['SAS_ODF'] = '...'
    # set up the command
    # add extra arguments as needed, e.g. withmlambdacolumn=yes bkgcorrect=no ..
    command = 'rgsproc'
    # call the command as a subprocess so we can pass our envs dictionary.
    # log the output to a file in obsid/ODF
    with open(os.path.join(odf_dir, f'{obsid}_rgsproc.log'), 'w') as logfile:
        result = subprocess.run(
            command, shell=True, text=True, env=envs, check=False,
            cwd=odf_dir, stdout=logfile, stderr=subprocess.STDOUT
        )
    return result.returncode


# Now make the parallel call
obsids = ['0606321501', '0606321601', ...]
nprocesses = min(cpu_count(), len(obsids))
with Pool(nprocesses) as pool:
    result = pool.map(run_rgsproc, obsids)
# check 'result' for the return codes from the commands
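Checking the return codes could look something like the sketch below. It relies on pool.map returning results in the same order as its input, and it assumes that a non-zero return code means the task failed. The helper name summarize_runs and the example return codes are made up for illustration.

```python
def summarize_runs(obsids, returncodes):
    """Pair each obsid with its return code and split into succeeded / failed."""
    succeeded = [o for o, rc in zip(obsids, returncodes) if rc == 0]
    failed = {o: rc for o, rc in zip(obsids, returncodes) if rc != 0}
    return succeeded, failed


# Hypothetical return codes standing in for the 'result' list from pool.map
succeeded, failed = summarize_runs(['0606321501', '0606321601'], [0, 1])
for obsid, rc in failed.items():
    print(f'{obsid}: rgsproc exited with code {rc}; '
          f'see {obsid}/ODF/{obsid}_rgsproc.log')
```

The failed observations can then be rerun on their own after inspecting their log files.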
Thanks for this. Took me a while, but it seems to work!
The general downside is that for workloads like CCD/grating data reduction, the CPU-to-memory ratio of every available allocation is far from optimal. The current server options provide 4 GB of RAM per CPU, but for workflows that reduce large numbers of datasets, something like 1-2 GB of RAM per CPU would be much more appropriate. I'm not sure whether this is because CPUs are the more expensive resource, or for other reasons.