Mashup API Tutorial

An introduction to using the Mashup API to query MAST data and catalogs programmatically.

To start with, here are all the imports we need:

In [4]:
import sys
import os
import time
import re
import json

try: # Python 3.x
    from urllib.parse import quote as urlencode
    from urllib.request import urlretrieve
except ImportError:  # Python 2.x
    from urllib import pathname2url as urlencode
    from urllib import urlretrieve

try: # Python 3.x
    import http.client as httplib 
except ImportError:  # Python 2.x
    import httplib   

from astropy.table import Table
import numpy as np

import pprint
pp = pprint.PrettyPrinter(indent=4)

Basic MAST Query

Here we will perform a basic MAST query on M101, equivalent to choosing "All MAST Observations" and searching for M101 in the Portal.

We will then select some observations, view their data products, and download some data.

Step 0: Mashup Request

All mashup requests (except direct download requests) have the same form:

  • HTTPS connect to MAST server
  • POST Mashup request to /api/v0/invoke
  • Mashup request is of the form "request={request json object}"

Because every request looks the same, we will write a function to handle the HTTPS interaction, taking in a Mashup request and returning the server response.

In [2]:
def mastQuery(request):
    """Perform a MAST query.
    
        Parameters
        ----------
        request (dictionary): The Mashup request json object
        
        Returns head,content where head is the response HTTP headers, and content is the returned data"""
    
    server='mast.stsci.edu'

    # Grab Python Version 
    version = ".".join(map(str, sys.version_info[:3]))

    # Create Http Header Variables
    headers = {"Content-type": "application/x-www-form-urlencoded",
               "Accept": "text/plain",
               "User-agent":"python-requests/"+version}

    # Encoding the request as a json string
    requestString = json.dumps(request)
    requestString = urlencode(requestString)
    
    # opening the https connection
    conn = httplib.HTTPSConnection(server)

    # Making the query
    conn.request("POST", "/api/v0/invoke", "request="+requestString, headers)

    # Getting the response
    resp = conn.getresponse()
    head = resp.getheaders()
    content = resp.read().decode('utf-8')

    # Close the https connection
    conn.close()

    return head,content
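
To make the request format concrete, here is a minimal sketch that builds the same POST body as mastQuery without contacting the server. The helper name buildRequestBody is hypothetical (it is not part of the Mashup API), and the imports assume Python 3.

```python
import json
from urllib.parse import quote, unquote

def buildRequestBody(request):
    """Return the form-encoded POST body for a Mashup request dictionary."""
    # Serialize the dictionary to JSON, then percent-encode it for the form body
    return "request=" + quote(json.dumps(request))

body = buildRequestBody({'service': 'Mast.Name.Lookup',
                         'params': {'input': 'M101', 'format': 'json'}})
print(body)

# Decoding the body recovers the original JSON object
decoded = json.loads(unquote(body[len("request="):]))
print(decoded)
```

Printing the body this way can help when debugging a request that the server rejects.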

Step 1: Name Resolver

The first step of this query is to "resolve" M101 into a position on the sky. To do this we use the Mast.Name.Lookup Mashup service.

As with all of our services, we recommend using the json format, as that output is most easily parsed.

In [3]:
objectOfInterest = 'M101'

resolverRequest = {'service':'Mast.Name.Lookup',
                     'params':{'input':objectOfInterest,
                               'format':'json'},
                     }

headers,resolvedObjectString = mastQuery(resolverRequest)

resolvedObject = json.loads(resolvedObjectString)

pp.pprint(resolvedObject)
{   'resolvedCoordinate': [   {   'cacheDate': 'Apr 12, 2017 9:28:27 PM',
                                  'cached': True,
                                  'canonicalName': 'MESSIER 101',
                                  'decl': 54.34895,
                                  'objectType': 'G',
                                  'ra': 210.80227,
                                  'radius': 0.24000000000000002,
                                  'resolver': 'NED',
                                  'resolverTime': 114,
                                  'searchRadius': -1.0,
                                  'searchString': 'm101'}],
    'status': ''}

The resolver returns a variety of information about the resolved object; for our purposes, all we need are the RA and Dec:

In [4]:
objRa = resolvedObject['resolvedCoordinate'][0]['ra']
objDec = resolvedObject['resolvedCoordinate'][0]['decl']
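
Indexing [0] directly will fail if the list of resolved coordinates is empty. The sketch below wraps the lookup in a hypothetical helper, getCoordinates, that raises a clearer error in that case. It assumes an unresolved name produces an empty or missing 'resolvedCoordinate' list, which is our assumption rather than documented behavior.

```python
def getCoordinates(resolvedObject):
    """Return (ra, dec) in degrees from a Mast.Name.Lookup response dictionary."""
    # Assumption: an unresolved name gives an empty or missing list
    coords = resolvedObject.get('resolvedCoordinate', [])
    if not coords:
        raise ValueError("Name could not be resolved")
    first = coords[0]
    return first['ra'], first['decl']

# Trimmed copy of the response shown above
sampleResponse = {'resolvedCoordinate': [{'canonicalName': 'MESSIER 101',
                                          'ra': 210.80227,
                                          'decl': 54.34895}],
                  'status': ''}
sampleRa, sampleDec = getCoordinates(sampleResponse)
print(sampleRa, sampleDec)
```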

Step 2: MAST Query

Now that we have the RA and Dec we can perform the MAST query on M101. To do this we will use the Mashup service Mast.Caom.Cone. The output of this query is the information that gets loaded into the grid when running a Portal query.

Because M101 has been observed many times, there will be several thousand results. We can use the Mashup 'page' and 'pagesize' properties to control how we view these results, either by choosing a pagesize large enough to accommodate all of the results, or by choosing a smaller pagesize and paging through them using the page property. The json response object will include information about paging, so check that to see whether you need to collect additional results.

Note: page and pagesize must both be specified (or neither). If only one is specified, it will be ignored.

In [5]:
mastRequest = {'service':'Mast.Caom.Cone',
               'params':{'ra':objRa,
                         'dec':objDec,
                         'radius':0.2},
               'format':'json',
               'pagesize':2000,
               'page':1,
               'removenullcolumns':True,
               'removecache':True}

headers,mastDataString = mastQuery(mastRequest)

mastData = json.loads(mastDataString)

print(mastData.keys())
print("Query status:",mastData['status'])
dict_keys(['fields', 'msg', 'paging', 'status', 'data'])
Query status: COMPLETE
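
If the results do not fit on one page, we can page through them. The following is a rough sketch, not an official recipe: collectAllPages is a hypothetical helper, and it assumes that the 'pagesFiltered' entry of the 'paging' dictionary holds the total number of pages. It takes the query function as an argument, so mastQuery can be passed in directly. Here we exercise it with a small stand-in function instead of the real server.

```python
import json

def collectAllPages(queryFunction, request, pagesize=2000):
    """Page through a Mashup query and return the combined list of rows.

    queryFunction must behave like mastQuery: it takes a request
    dictionary and returns (headers, contentString).
    """
    rows = []
    page = 1
    while True:
        # Copy the request and set both paging properties together
        pagedRequest = dict(request, page=page, pagesize=pagesize)
        headers, content = queryFunction(pagedRequest)
        result = json.loads(content)
        rows.extend(result.get('data', []))
        # Assumption: 'pagesFiltered' is the total number of pages
        lastPage = result.get('paging', {}).get('pagesFiltered', page)
        if page >= lastPage:
            break
        page += 1
    return rows

def fakeQuery(request):
    """Stand-in for mastQuery that serves five rows from memory."""
    fakeRows = [{'obsid': i} for i in range(5)]
    size, page = request['pagesize'], request['page']
    chunk = fakeRows[(page - 1) * size: page * size]
    pages = -(-len(fakeRows) // size)  # ceiling division
    response = {'status': 'COMPLETE', 'data': chunk,
                'paging': {'page': page, 'pageSize': size,
                           'pagesFiltered': pages}}
    return {}, json.dumps(response)

pagedRows = collectAllPages(fakeQuery, {'service': 'Mast.Caom.Cone'}, pagesize=2)
print(len(pagedRows))
```

This sketch does not check the 'status' field, so a real version would also want to handle a query that has not yet completed.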

In the json response object, the "fields" dictionary holds the column names and types. The column names are not the formatted column headings that appear in the Portal grid (these are not guaranteed to be unique), but instead are the column names from the database. These names can be accessed in the Portal by hovering over a column name, or in the details pane of "Show Details." Details about returned columns for various queries can be found in the "Related Pages" section of the API documentation.

In [6]:
pp.pprint(mastData['fields'][:5])
[   {'name': 'dataproduct_type', 'type': 'string'},
    {'name': 'obs_collection', 'type': 'string'},
    {'name': 'instrument_name', 'type': 'string'},
    {'name': 'project', 'type': 'string'},
    {'name': 'filters', 'type': 'string'}]

The data is found (predictably) under the "data" keyword. The data is a list of dictionaries, where each row corresponds to one observation collection (just as in the Portal grid):

In [7]:
pp.pprint(mastData['data'][0])
{   '_selected_': None,
    'calib_level': 2,
    'dataURL': None,
    'dataproduct_type': 'cube',
    'distance': 0,
    'em_max': 394.2,
    'em_min': 301.4,
    'filters': 'U',
    'instrument_name': 'UVOT',
    'jpegURL': 'http://archive.stsci.edu/cgi-bin/hla/fitscut.cgi?red=sw00030896001uuu[6]&size=ALL&output_size=2320',
    'objID': 15000541440,
    'obs_collection': 'SWIFT',
    'obs_id': '00030896001',
    'obs_title': None,
    'obsid': 15000804761,
    'project': None,
    'proposal_id': None,
    'proposal_pi': None,
    's_dec': 54.3320323874192,
    's_ra': 210.884468835902,
    's_region': 'POLYGON -148.68197799999996 54.250447 -148.72010699999998 '
                '54.380959 -148.771094 54.525161 -148.80113844209316 '
                '54.522015545787568 -148.80339800000002 54.528132 '
                '-149.00890800000002 54.505617 -149.22571413697636 '
                '54.475317193971954 -149.23336 54.474477 -149.23331320240226 '
                '54.4742473291218 -149.25735499999996 54.470859 '
                '-149.25591688496976 54.465950017458638 -149.37013000000002 '
                '54.452035 -149.36860315816716 54.445640407505977 '
                '-149.38111000000004 54.443785 -149.339132 54.309445 '
                '-149.32235856865978 54.265305792795992 -149.31201271706755 '
                '54.2313318148205 -149.29647799999998 54.175928999999996 '
                '-149.28874914509623 54.176570406780428 -149.28693099999998 '
                '54.171759 -149.13378760314151 54.189321891688223 -149.070602 '
                '54.194462 -148.83541400000001 54.220082 -148.83659838226143 '
                '54.22526868924821 -148.83561599999996 54.225382 '
                '-148.83630396949408 54.228123086154028 -148.706526 54.241669 '
                '-148.70769828707938 54.247103078929328 -148.68197799999996 '
                '54.250447 -148.68197799999996 54.250447',
    't_exptime': 1556.7224828509793,
    't_max': 54160.8269792,
    't_min': 54160.089537,
    't_obs_release': None,
    'target_classification': None,
    'target_name': 'M101ULX-1',
    'wavelength_region': 'OPTICAL'}

The data table can be used as is, but it can also be translated into different formats depending on user preference. Here we will demonstrate how to put the results of a MAST query into an Astropy table.

In [8]:
mastDataTable = Table()

for col,atype in [(x['name'],x['type']) for x in mastData['fields']]:
    if atype=="string":
        atype="str"
    if atype=="boolean":
        atype="bool"
    mastDataTable[col] = np.array([x.get(col,None) for x in mastData['data']],dtype=atype)
    
print(mastDataTable)
dataproduct_type obs_collection instrument_name ...    distance   _selected_
---------------- -------------- --------------- ... ------------- ----------
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
            cube          SWIFT            UVOT ...           0.0      False
             ...            ...             ... ...           ...        ...
           image            HLA         ACS/WFC ... 21.7850555924      False
           image            HLA         ACS/WFC ... 21.7850555924      False
           image            HLA         ACS/WFC ... 21.7850555924      False
           image            HLA         ACS/WFC ... 21.7850555924      False
           image            HLA         ACS/WFC ... 21.7850555924      False
           image            HLA         ACS/WFC ... 21.7850652219      False
           image            HLA         ACS/WFC ... 21.7850652219      False
           image            HLA         ACS/WFC ... 21.7850652219      False
           image            HLA         ACS/WFC ... 21.7850652219      False
           image            HLA         ACS/WFC ...  21.785098363      False
           image            HLA         ACS/WFC ...  21.785098363      False
Length = 2000 rows

At this point we are ready to do analysis on these observations. However, if we want to access the actual data products, there are a few more steps.

Step 2 tangent: filtered query

An alternative to the cone search query is the filtered query. This is analogous to the Advanced Search in the Portal and results in the same list of observations as the cone search, only filtered on other criteria. The services we'll use to do this are Mast.Caom.Filtered and Mast.Caom.Filtered.Position.

Filtered queries can often end up being quite large, so we will first do a query that just returns the number of results and decide if it is manageable before we do the full query. We do this by supplying the parameter "columns":"COUNT_BIG(*)".

In [14]:
mashupRequest = {"service":"Mast.Caom.Filtered",
                 "format":"json",
                 "params":{
                     "columns":"COUNT_BIG(*)",
                     "filters":[
                         {"paramName":"filters",
                          "values":["NUV","FUV"],
                          "separator":";"
                         },
                         {"paramName":"t_max",
                          "values":[{"min":52264.4586,"max":54452.8914}], #MJD
                         },
                         {"paramName":"obsid",
                          "values":[],
                          "freeText":"%200%"}
                     ]}}
    
headers,outString = mastQuery(mashupRequest)
countData = json.loads(outString)

pp.pprint(countData)
{   'data': [{'Column1': 1068}],
    'fields': [{'name': 'Column1', 'type': 'string'}],
    'msg': '',
    'paging': {   'page': 1,
                  'pageSize': 1,
                  'pagesFiltered': 1,
                  'rows': 1,
                  'rowsFiltered': 1,
                  'rowsTotal': 1},
    'status': 'COMPLETE'}

1,068 isn't too many observations so we can go ahead and request them. The only thing we need to do differently is change "columns":"COUNT_BIG(*)" to "columns":"*".

In [16]:
mashupRequest = {"service":"Mast.Caom.Filtered",
                 "format":"json",
                 "params":{
                     "columns":"*",
                     "filters":[
                         {"paramName":"filters",
                          "values":["NUV","FUV"],
                          "separator":";"
                         },
                         {"paramName":"t_max",
                          "values":[{"min":52264.4586,"max":54452.8914}], #MJD
                         },
                         {"paramName":"obsid",
                          "values":[],
                          "freeText":"%200%"}
                     ]}}
    
headers,outString = mastQuery(mashupRequest)
filteredData = json.loads(outString)

print(filteredData.keys())
print("Query status:",filteredData['status'])
dict_keys(['status', 'msg', 'data', 'fields', 'paging'])
Query status: COMPLETE
In [17]:
pp.pprint(filteredData['data'][0])
{   'calib_level': 2,
    'dataURL': None,
    'dataproduct_type': 'image',
    'em_max': 300.7,
    'em_min': 169.3,
    'filters': 'NUV',
    'instrument_name': 'GALEX',
    'jpegURL': 'http://galex.stsci.edu/data/GR6/pipe/01-vsn/05082-NGA_Tol0618m402/d/01-main/0001-img/07-try/qa/NGA_Tol0618m402-xd-int_2color.jpg',
    'objID': 1000001200,
    'obs_collection': 'GALEX',
    'obs_id': '2484652324694261760',
    'obs_title': None,
    'obsid': 1000001200,
    'project': 'NGS',
    'proposal_id': None,
    'proposal_pi': None,
    's_dec': -40.1602476853901,
    's_ra': 95.1588307864903,
    's_region': 'CIRCLE ICRS  95.15883079 -40.16024769 0.625',
    't_exptime': 7268.5,
    't_max': 53396.30620370371,
    't_min': 53041.70260416667,
    't_obs_release': 55327.10813,
    'target_classification': None,
    'target_name': 'NGA_Tol0618m402',
    'wavelength_region': 'UV'}
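
The t_min and t_max values, and the t_max filter range used above, are Modified Julian Dates (MJD). As a convenience, here is a small sketch for converting between MJD and Python datetime objects. The helper names are our own, and the conversion ignores time-scale subtleties such as leap seconds, so treat it as approximate.

```python
from datetime import datetime, timedelta

MJD_EPOCH = datetime(1858, 11, 17)  # MJD 0 corresponds to 1858-11-17 00:00

def toMjd(date):
    """Convert a naive datetime to a Modified Julian Date."""
    return (date - MJD_EPOCH) / timedelta(days=1)

def fromMjd(mjd):
    """Convert a Modified Julian Date to a naive datetime."""
    return MJD_EPOCH + timedelta(days=mjd)

print(toMjd(datetime(2000, 1, 1)))
print(fromMjd(53396.30620370371))
```

With these helpers, a date range for a t_max filter can be written as {"min": toMjd(start), "max": toMjd(end)}.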

To add position to a filtered query we use the service Mast.Caom.Filtered.Position and add a new parameter "position":"positionString", where positionString gives the RA, Dec, and radius in degrees, comma separated as in the example below ("ra, dec, radius").

In [18]:
mashupRequest = {
        "service":"Mast.Caom.Filtered.Position",
        "format":"json",
        "params":{
            "columns":"COUNT_BIG(*)",
            "filters":[
                {"paramName":"dataproduct_type",
                 "values":["cube"]
                }],
            "position":"210.8023, 54.349, 0.24"
        }}

headers,outString = mastQuery(mashupRequest)
countData = json.loads(outString)

pp.pprint(countData)
{   'data': [{'Column1': 789}],
    'fields': [{'name': 'Column1', 'type': 'string'}],
    'msg': '',
    'paging': {   'page': 1,
                  'pageSize': 1,
                  'pagesFiltered': 1,
                  'rows': 1,
                  'rowsFiltered': 1,
                  'rowsTotal': 1},
    'status': 'COMPLETE'}
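
The count-first pattern used in this section repeats the same request with only "columns" changed, so it can be convenient to build both requests from one place. The sketch below does this with a hypothetical helper, buildFilteredRequest. It only constructs the request dictionaries shown above and does not contact the server.

```python
def buildFilteredRequest(filters, columns="*", position=None):
    """Build a Mast.Caom.Filtered or Mast.Caom.Filtered.Position request."""
    params = {"columns": columns, "filters": filters}
    service = "Mast.Caom.Filtered"
    if position is not None:
        # A position string switches to the positional service
        service = "Mast.Caom.Filtered.Position"
        params["position"] = position
    return {"service": service, "format": "json", "params": params}

cubeFilter = [{"paramName": "dataproduct_type", "values": ["cube"]}]
m101Position = "210.8023, 54.349, 0.24"

# First ask only for the number of matching rows...
countRequest = buildFilteredRequest(cubeFilter, columns="COUNT_BIG(*)",
                                    position=m101Position)
# ...then, if the count is manageable, ask for the rows themselves
fullRequest = buildFilteredRequest(cubeFilter, position=m101Position)

print(countRequest["service"], countRequest["params"]["columns"])
print(fullRequest["service"], fullRequest["params"]["columns"])
```

Either dictionary can be passed to mastQuery. As the output above shows, the count comes back under countData['data'][0]['Column1'].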

Step 3: Getting Data Products

Before we can download observational data, we need to figure out what data products are associated with the observation(s) we are interested in. To do that we will use the Mast.Caom.Products service. This service takes the "obsid" ("Product Group ID" is the formatted label visible in the Portal) and returns information about the associated data products. This query can be thought of as somewhat analogous to adding an observation to the basket in the Portal.

In [9]:
interestingObservation = mastDataTable[1300]
print("Observation:",
      [interestingObservation[x] for x in ['dataproduct_type', 'obs_collection', 'instrument_name']])
Observation: ['image', 'HST', 'STIS/CCD']
In [10]:
obsid = interestingObservation['obsid']

productRequest = {'service':'Mast.Caom.Products',
                 'params':{'obsid':obsid},
                 'format':'json',
                 'pagesize':100,
                 'page':1}   

headers,obsProductsString = mastQuery(productRequest)

obsProducts = json.loads(obsProductsString)

print("Number of data products:",len(obsProducts["data"]))
print("Product information column names:")
pp.pprint(obsProducts['fields'])
Number of data products: 5
Product information column names:
[   {'name': 'obsID', 'type': 'string'},
    {'name': 'obs_collection', 'type': 'string'},
    {'name': 'dataproduct_type', 'type': 'string'},
    {'name': 'obs_id', 'type': 'string'},
    {'name': 'description', 'type': 'string'},
    {'name': 'type', 'type': 'string'},
    {'name': 'dataURI', 'type': 'string'},
    {'name': 'productType', 'type': 'string'},
    {'name': 'productGroupDescription', 'type': 'string'},
    {'name': 'productSubGroupDescription', 'type': 'string'},
    {'name': 'productDocumentationURL', 'type': 'string'},
    {'name': 'project', 'type': 'string'},
    {'name': 'prvversion', 'type': 'string'},
    {'name': 'productFilename', 'type': 'string'},
    {'name': 'size', 'type': 'int'},
    {'name': '_selected_', 'type': 'boolean'}]

We might not want to download all of the available products, so let's take a closer look and see which ones are important.

In [14]:
pp.pprint([x.get('productType',"") for x in obsProducts["data"]])
['AUXILIARY', 'AUXILIARY', 'SCIENCE', 'SCIENCE', 'SCIENCE']
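
For observations with many products, a tally is easier to read than the full list. Here is a short sketch using collections.Counter on a trimmed stand-in for obsProducts["data"]. The sample list simply mirrors the product types printed above.

```python
from collections import Counter

# Trimmed stand-in for obsProducts["data"] from the cell above
sampleProducts = [{'productType': 'AUXILIARY'},
                  {'productType': 'AUXILIARY'},
                  {'productType': 'SCIENCE'},
                  {'productType': 'SCIENCE'},
                  {'productType': 'SCIENCE'}]

# Count how many products there are of each type
typeCounts = Counter(x.get('productType', "") for x in sampleProducts)
print(typeCounts)
```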

Let's download all of the science products. We'll start by making an astropy table containing just the science product information. Then we'll download the datafiles using two different methods.

In [15]:
sciProdArr = [x for x in obsProducts['data'] if x.get("productType",None) == 'SCIENCE']
scienceProducts = Table()

for col,atype in [(x['name'],x['type']) for x in obsProducts['fields']]:
    if atype=="string":
        atype="str"
    if atype=="boolean":
        atype="bool"
    if atype == "int":
        atype = "float" # array may contain nan values, and they do not exist in numpy integer arrays
    scienceProducts[col] = np.array([x.get(col,None) for x in sciProdArr],dtype=atype)

print("Number of science products:",len(scienceProducts))
print(scienceProducts)
Number of science products: 3
  obsID    obs_collection dataproduct_type ...    size    _selected_
---------- -------------- ---------------- ... ---------- ----------
2003488089            HST            image ...  2255040.0      False
2003488089            HST            image ... 12144960.0      False
2003488089            HST            image ... 10532160.0      False

Step 4a: Downloading products using the bundler

This is how downloading is done through the Portal. If we pick the right fields and build the query correctly, it will always work, and we don't need to care how our particular data products are being accessed. The downside is that this is a more complicated way to access data products.

We will use the Mast.Bundle.Request service to download all of our desired data products as a gzipped tarball.

The fields we need to create the download request are: dataURI, description, and dataproduct_type. We will also use obs_collection, obs_id, and productFilename to create a download path for each file that is guaranteed to be unique.

In [16]:
urls = scienceProducts['dataURI']
descriptions = scienceProducts['description'] 
productTypes = scienceProducts['dataproduct_type']
outPaths = ["mastFiles/"+x['obs_collection']+'/'+x['obs_id']+'/'+x['productFilename'] for x in scienceProducts]
zipFilename = "mastDownload"
extension = "tar.gz"

Now that we have collected all the information we need, we can build the download request. Note that two of the parameters (urlList and pathList) take comma separated strings, while another two (descriptionList and productTypeList) take lists. This is a known issue and will be fixed, but for now that is how it works.

This query may take some time. If we are returned a status of EXECUTING, we will simply rerun the query until it completes.

In [17]:
mashupRequest = {"service":"Mast.Bundle.Request",
                 "params":{"urlList":",".join(urls),
                           "filename":zipFilename,
                           "pathList":",".join(outPaths),
                           "descriptionList":list(descriptions),
                           "productTypeList":list(productTypes),
                           "extension":extension},
                 "format":"json",
                 "page":1,
                 "pagesize":1000}  

headers,bundleString = mastQuery(mashupRequest)
bundleInfo = json.loads(bundleString)

pp.pprint(bundleInfo)
{   'bytesStreamed': 24933767,
    'manifestUrl': 'https://dwmastiisv3.stsci.edu/portal/Download/stage/anonymous/public/76c7ce65-c2b6-4255-b93f-8cfdcee93238/mastDownload_MANIFEST.HTML',
    'msg': '',
    'progress': 1,
    'status': 'COMPLETE',
    'statusList': {   'mast:HST/product/o4qpf3ecq_flt.fits': 'COMPLETE',
                      'mast:HST/product/o4qpf3ecq_raw.fits': 'COMPLETE',
                      'mast:HST/product/o4qpf3ecq_x2d.fits': 'COMPLETE'},
    'url': 'https://dwmastiisv3.stsci.edu/portal/Download/stage/anonymous/public/76c7ce65-c2b6-4255-b93f-8cfdcee93238/mastDownload.tar.gz'}
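
The request above completed on the first try. When it does not, rerunning on an EXECUTING status could be sketched as a small polling loop. The helper waitForBundle, its delay and maxTries parameters, and the stand-in query function are all our own illustrative choices. In real use we would pass mastQuery and the bundle request.

```python
import json
import time

def waitForBundle(queryFunction, request, delay=5, maxTries=20):
    """Resubmit a request until its status is no longer EXECUTING."""
    for attempt in range(maxTries):
        headers, content = queryFunction(request)
        info = json.loads(content)
        if info.get('status') != 'EXECUTING':
            return info
        # Give the server some time before asking again
        time.sleep(delay)
    raise RuntimeError("Request still EXECUTING after %d tries" % maxTries)

# Stand-in for mastQuery: reports EXECUTING twice, then COMPLETE
fakeStatuses = iter(['EXECUTING', 'EXECUTING', 'COMPLETE'])

def fakeBundleQuery(request):
    return {}, json.dumps({'status': next(fakeStatuses)})

bundleStatus = waitForBundle(fakeBundleQuery, {}, delay=0)
print(bundleStatus['status'])
```

Capping the number of tries keeps a stuck request from looping forever.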

The information returned from this query tells us the status of each file we tried to download and gives us two important URLs. The 'manifestUrl' displays information about each downloaded file, and if it was not downloaded, gives the associated error message. This is the manifest.html document that you get when downloading through the Portal. The 'url' is the location of the actual file containing our data. We can download this file using any method we like.

In [18]:
urlretrieve(bundleInfo['url'], zipFilename+"."+extension)
Out[18]:
('mastDownload.tar.gz', <http.client.HTTPMessage at 0x10af41630>)
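
Once the tarball is on disk it still needs to be unpacked. Here is one way to do that with the standard library tarfile module. The helper unpackBundle is hypothetical, and the demonstration runs on a small stand-in tarball built in a temporary directory rather than on a real MAST bundle.

```python
import os
import tarfile
import tempfile

def unpackBundle(tarPath, destination="."):
    """Extract a gzipped tarball and return the member names it contains."""
    with tarfile.open(tarPath, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=destination)
    return names

# Build a small stand-in tarball (not a real MAST bundle)
workDir = tempfile.mkdtemp()
samplePath = os.path.join(workDir, "example.fits")
with open(samplePath, "wb") as f:
    f.write(b"placeholder")
tarPath = os.path.join(workDir, "mastDownload.tar.gz")
with tarfile.open(tarPath, "w:gz") as tar:
    tar.add(samplePath, arcname="mastFiles/HST/example.fits")

# Unpack it and list what was inside
extracted = unpackBundle(tarPath, os.path.join(workDir, "unpacked"))
print(extracted)
```

For the download above, the call would be unpackBundle(zipFilename+"."+extension). Because extractall writes whatever paths the archive contains, only unpack archives from sources you trust.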

Step 4b: Direct Download

Instead of going through the Mast.Bundle.Request service, we can download the data files directly, one at a time, using the information in the 'dataURI' field. This field can contain either a URL or a URI for the data. If we have a data URL, we can use our favorite download method to access it; if we have a URI, we need to go through the MAST download service.

We will loop through the files and download them, saving each under the name given in 'productFilename'. These file names are not guaranteed to be unique, so proceed with caution. In this case all the files we wish to download have different names, so it's okay.

In [19]:
for row in scienceProducts:     
    if "http" in row['dataURI']: # link is url, so can just download 
        urlretrieve(row['dataURI'], row['productFilename'])
    else: # link is uri, so need to go through direct download request
        server='mast.stsci.edu'
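        # Remove only the leading "mast:" prefix before sending to the download service
        # (lstrip('mast:') would strip any run of the characters m, a, s, t, and :)
        uri = re.sub(r'^mast:', '', row['dataURI'])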
        conn = httplib.HTTPSConnection(server)
        conn.request("GET", "/api/v0/download/file/"+uri)
        resp = conn.getresponse()
        fileContent = resp.read()
        with open(row['productFilename'],'wb') as FLE:
            FLE.write(fileContent)
        conn.close()
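
Since product file names are not guaranteed to be unique, one option is to save each file under a per-observation directory, much like the paths we built for the bundler in Step 4a. The sketch below uses a hypothetical helper, localPath. The sample values are illustrative only, and exist_ok requires Python 3.

```python
import os
import tempfile

def localPath(product, root="mastFiles"):
    """Return a per-observation path for a product and create its directory."""
    path = os.path.join(root, product['obs_collection'], product['obs_id'],
                        product['productFilename'])
    # Make sure the target directory exists before anything is written to it
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path

# Illustrative values only; a real row would come from scienceProducts
sampleProduct = {'obs_collection': 'HST',
                 'obs_id': 'o4qpf3ecq',
                 'productFilename': 'o4qpf3ecq_flt.fits'}
sampleOutPath = localPath(sampleProduct, root=tempfile.mkdtemp())
print(sampleOutPath)
```

In the loop above, localPath(row) could then replace row['productFilename'] as the output file name. Table rows support the same key lookups, so this should work with them unchanged.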