List all files in a directory in an S3 bucket

I have two related questions, the first is a specific problem, the second is to see if there is a better way of doing this in general.

The specific problem I am having is I am trying to list all the files in a specific directory in the s3 bucket. I am running the command:

import os

obsid = '0123700101'
level = 'PPS'
os.system(f'aws s3 ls s3://nasa-heasarc/xmm/data/rev0/{obsid}/{level}')

and I am getting the following error:

An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

The general problem I am trying to solve is this: I need a way of downloading a specific file, and/or a group of files with names that follow a specific pattern. For example, all '*.PNG' files, or all files with 'SUMM' or 'EPX' in the name. If I know the exact filename I can download that file using:

obsid = '0123700101'
repo = 'aws'
PPSfile = 'P0123700101EPX000OEXPMP8000.PNG'

query = """SELECT * FROM xmmmaster WHERE obsid='{0}'""".format(obsid)
tab = Heasarc.query_tap(query).to_table()
data_source = Heasarc.locate_data(tab, catalog_name='xmmmaster')
data_source[repo] = data_source[repo]+'PPS/'+PPSfile
Heasarc.download_data(download_link,host=repo,location=f'./{obsid}/PPS')

but if I don’t know the exact filename I can’t download it. So I tried getting a list of all files in the directory in the s3 bucket to get the exact filenames and I got the error.

Is there a way of formatting the query to return the filenames? Or a better way of doing this?

Hi Ryan,

One alternative to this approach that I might suggest is using the s3fs Python module, rather than calling the AWS S3 CLI.

You would need to install it in the Conda environment you’re working in:

micromamba install s3fs

From there you can instantiate a class that lets you perform some common filesystem operations, including listing directories:

from s3fs import S3FileSystem

# Omitting anon=True would cause s3fs to complain that no credentials
#  have been supplied later on - we don't need credentials because
#  HEASARC is an open data bucket
current_s3 = S3FileSystem(anon=True)

# Now you can run ls on the HEASARC bucket like this
obsid = '0123700101'
level = 'PPS'
s3_uri = f's3://nasa-heasarc/xmm/data/rev0/{obsid}/{level}'

all_files = current_s3.ls(s3_uri)

Where all_files is a Python list.
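Since ls() returns plain Python strings, the wildcard matching from the original question can be done with the standard-library fnmatch module. A minimal sketch (the filenames below are illustrative stand-ins for what ls() would return, not real archive contents):

```python
from fnmatch import fnmatch

# Illustrative keys of the shape current_s3.ls() returns
# (full keys, bucket name included, no 's3://' prefix)
all_files = [
    'nasa-heasarc/xmm/data/rev0/0123700101/PPS/P0123700101EPX000OEXPMP8000.PNG',
    'nasa-heasarc/xmm/data/rev0/0123700101/PPS/P0123700101CAX000CATPLT0000.PDF',
    'nasa-heasarc/xmm/data/rev0/0123700101/PPS/P0123700101OBX000SUMMAR0000.HTM',
]

# All PNG products, via glob-style matching on the key
pngs = [f for f in all_files if fnmatch(f, '*.PNG')]

# Files with 'SUMM' anywhere in the filename proper
summ = [f for f in all_files if 'SUMM' in f.rsplit('/', 1)[-1]]

print(pngs)
print(summ)
```

Each matched key can then be handed to current_s3.download(key, local_name) in a loop to fetch just the files you want.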

From there, s3fs can be used as a drop-in replacement for Python's 'with open' context manager, data can be streamed into memory directly from S3 using astropy's fits.open, or files can be downloaded (which I imagine is what you need for XMM) using:

current_s3.download('nasa-heasarc/xmm/data/rev0/0201903501/PPS/PP0201903501PPSMSG000_0.ASC', 'PP0201903501PPSMSG000_0.ASC')

Using s3fs is an option as @djtuner notes.

To answer the question directly: the buckets are public, so aws needs to be told that explicitly by adding --no-sign-request (the equivalent of anon=True in s3fs):

os.system(f'aws s3 ls --no-sign-request s3://nasa-heasarc/xmm/data/rev0/{obsid}/{level}')

For the second part, I think your code should work if you fix the last line (data_source instead of download_link):

Heasarc.download_data(data_source, host=repo, location=f'./{obsid}/PPS')

I have two comments. First, aws s3 ls is sensitive to a trailing slash, at least in my experience: you have to include it to see the directory's contents, or it just shows you the directory itself.

Second, you can use one of our APIs to find the files of a particular type for you.

I am not sure I know how to do this. Can you give an example?

The simplest way is to use the Xamin command line interface. It’s described down in the long CLI doc in the Products section. So the command would be like

java -jar ~/lib/users.jar table=chanmaster position='ty pyx' messages=none products noproducts=link,point filterstring='*/*evt*'

for instance to fetch the event files. The filtering is done on the file names themselves, following how Browse does it. I don't think the CLI has an option to spit them out as S3 addresses (something we should add, since the GUI does it), but it's easy to tack the S3 prefix onto the front of the URL in place of /FTP.
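The /FTP-to-S3 rewrite mentioned above can be sketched as a one-line string substitution. This is a sketch under the assumption, based on the bucket paths earlier in the thread, that the S3 bucket mirrors the FTP tree under s3://nasa-heasarc/:

```python
def ftp_path_to_s3_uri(ftp_path: str) -> str:
    """Rewrite a HEASARC /FTP/... product path into an s3:// URI.

    Assumes s3://nasa-heasarc/ mirrors the FTP tree, as in the
    xmm/data/rev0/... paths shown elsewhere in this thread.
    """
    prefix = '/FTP/'
    if not ftp_path.startswith(prefix):
        raise ValueError(f'not an /FTP path: {ftp_path}')
    return 's3://nasa-heasarc/' + ftp_path[len(prefix):]

print(ftp_path_to_s3_uri(
    '/FTP/xmm/data/rev0/0123700101/PPS/P0123700101EPX000OEXPMP8000.PNG'))
```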

Our ObsTAP service will also make this easier.

Now into the weeds. There are a lot of ways with the VO services, and in the end they all come back to DataLink. Abdu, you and I should talk about how we should use it in astroquery to expand its data products abilities. I wrote a tutorial for the navo-workshop on how to do it with PyVO (pre-pandemic even), but the others never liked it and it was never merged in. It's arguable that end users shouldn't have to do this, but astroquery.heasarc can do it for them.

Let me know if anybody wants all the gory details and we’ll talk offline. In the end you end up with an API call like

…/xamin/vo/datalink?datalink_key=obsid&id=ivo://nasa.heasarc/xrismmastr?000120000/xrism.obs.resolve.event_cl.evt

But how you discover how to use that is the hard part. It’s meant for internals only.

I want the gory details. Ultimately it doesn’t matter how complicated it is, because the commands are going to be buried inside pySAS where 99.998% of pySAS users will never see it.

The purpose is to cover the few cases where users will need only one or two PPS files for XMM analysis and they don’t want to download >1000 PPS files for a single Obs ID.


As Tess mentioned, the trailing / is required to list object keys under a common prefix, and there's also --recursive if needed. However, 's3 ls' doesn't support globbing patterns like *.PNG (you need to grep or filter in code), though other 'aws s3' commands like 's3 cp' do have glob-style --include '*.html' / --exclude '*.json' filters.

% aws s3 ls s3://nasa-heasarc/xmm/data/rev0/0123700101/PPS/
2023-08-04 15:36:50     203657 P0123700101CAX000CATPLT0000.PDF
2023-08-04 15:36:50       5888 P0123700101CAX000D0001A0000.HTM
2023-08-04 15:36:50       5848 P0123700101CAX000D0003A0000.HTM
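Since 's3 ls' has no glob filter of its own, one option is to capture its text output and filter in code. A sketch of the parsing side, assuming listing lines of the shape shown above (date, time, size, key); the helper function name is mine, not part of any API:

```python
from fnmatch import fnmatch

def filter_s3_listing(listing: str, pattern: str) -> list:
    """Pull object names out of 'aws s3 ls' text output and glob-filter them.

    Each listing line looks like: '2023-08-04 15:36:50 203657 NAME'.
    """
    names = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) >= 4:          # date, time, size, key
            names.append(parts[3])
    return [n for n in names if fnmatch(n, pattern)]

# Sample listing text of the shape shown earlier in the thread
sample = """\
2023-08-04 15:36:50 203657 P0123700101CAX000CATPLT0000.PDF
2023-08-04 15:36:50 5888 P0123700101CAX000D0001A0000.HTM
2023-08-04 15:36:50 5848 P0123700101CAX000D0003A0000.HTM"""

print(filter_s3_listing(sample, '*.HTM'))
```

In a real run you would feed it the stdout of something like subprocess.run(['aws', 's3', 'ls', '--no-sign-request', uri], capture_output=True, text=True) rather than a hard-coded string.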

This doc describes what AWS is doing to make the flat key/value store look like a hierarchy, and describes the slash a bit in the “important” callout box: Organizing objects in the Amazon S3 console by using folders - Amazon Simple Storage Service


I’m marking @azoghbi’s response as the solution since it solves the immediate issue. I needed to add --no-sign-request to the command, and also note that a trailing slash is needed, as mentioned by @tjaffe.

The larger issue of getting filenames and copying specific files from the archive will take some more testing.
