Below are two sets of metadata that the new archive will store for Images and for the Image Sets to which they belong. Where relevant, each entry lists the equivalent field in the Virtual Observatory ObsCore Data Model, a potential source from which to obtain the data, and a comment. The Images table was written specifically with spatial images in mind. Other data products (spectra, time series, etc.) would offer different levels of granularity in their header information.

Images Database Table:

| Column Name | Units | VO Equivalent | Source | Comments |
|---|---|---|---|---|
| sourcename | | target_name | OBJECT Keyword | |
| ra | deg | s_ra | Center of RA Axis | |
| dec | deg | s_dec | Center of Dec Axis | |
| image_field_of_view | deg | s_fov | Quadratic Mean of the ra & dec extents (NAXISn*CDELTn) | |
| spatial_region | | s_region | Derived from spatial data | STC-S String Defined in the TAP (Dowler, et al 2010) |
| ra_element_count | | s_xel1 | Relevant NAXISn Keyword | |
| ra_pixel_size | | | abs(Relevant CDELTn) | |
| dec_element_count | | s_xel2 | Relevant NAXISn Keyword | |
| dec_pixel_size | | | abs(Relevant CDELTn) | |
| spatial_resolution | arcsec | s_resolution | sqrt(BMAJ*BMIN) | Geometric mean of the synthesized beam axes. |
| beam_axis_ratio | | | BMAJ/BMIN | Ratio of the synthesized beam axes. |
| starttime | MJD | t_min | | These values should be obtained by some larger-scale process. VlassMgr, in the case of quicklook images. |
| endtime | MJD | t_max | | See starttime. |
| exposure_time | s | t_exptime | | See starttime. |
| min_frequency | Hz | em_min | Minimum of Spectral Axis | |
| max_frequency | Hz | em_max | Maximum of Spectral Axis | |
| rest_frequency | Hz | | RESTFRQ Keyword | Used for transformations |
| band_code | | | weblog | Requires outside information source for accuracy. |
| polarization_id | | pol_states | Polarization CRVALn value | CASA uses the CRVALn value to convey polarization information, with [1,2,3,4] mapped to [I,Q,U,V]. Default to 'None' in case of other values. |
| telescope | | instrument_name | TELESCOP Keyword | This must accommodate images for multiple instruments (e.g. VLA + Single Dish) |
| file_id | | | automatically generated | Link to information about the physical file |
| image_id | | obs_id | automatically generated | Unique Identifier for the Image |
| image_units | | o_ucd | BTYPE & BUNIT Keywords | Description of the physical quantity measured in the image |
| max_intensity | image_units | | image table values | |
| min_intensity | image_units | | image table values | |
| rms_noise | image_units | | weblog | Both of these will need to come from the last stage of the imaging process, to maintain accuracy. For quicklook images, that is stage7. That may change (and will likely be different for the Single Epoch products). |
| thumbnail | | | weblog, find the matching thumbnail image generated. | |
| tags | | | | Internal tagging system to facilitate searches. |
| ra_pixel_size | | | Appropriate CDELTn | |
| dec_pixel_size | | | Appropriate CDELTn | |
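
As an illustration of how a few of the derived columns above could be computed at ingestion, here is a minimal Python sketch. It assumes the FITS header has already been read into a plain dict, that BMAJ and BMIN are given in degrees (the usual CASA/AIPS convention, but an assumption here), and that the RA, Dec, and Stokes axes can be recognized by the start of their CTYPEn values. The function names `find_axis` and `derive_image_columns` are hypothetical, not part of any existing ingestion code.

```python
import math

# Stokes codes as described in the table: CRVALn 1..4 -> I, Q, U, V.
STOKES_BY_CODE = {1: "I", 2: "Q", 3: "U", 4: "V"}


def find_axis(header, prefix):
    """Return the 1-based axis number whose CTYPEn starts with `prefix`, or None."""
    for n in range(1, int(header.get("NAXIS", 0)) + 1):
        if str(header.get(f"CTYPE{n}", "")).upper().startswith(prefix):
            return n
    return None


def derive_image_columns(header):
    """Compute a subset of the Images table columns from a header dict."""
    ra_ax = find_axis(header, "RA")
    dec_ax = find_axis(header, "DEC")
    pol_ax = find_axis(header, "STOKES")

    # Pixel sizes are stored as absolute values (CDELT is often negative for RA).
    ra_pixel_size = abs(header[f"CDELT{ra_ax}"])
    dec_pixel_size = abs(header[f"CDELT{dec_ax}"])
    ra_extent = header[f"NAXIS{ra_ax}"] * ra_pixel_size
    dec_extent = header[f"NAXIS{dec_ax}"] * dec_pixel_size

    # Only a single-plane Stokes axis is handled; anything else falls back
    # to 'None', following the table's stated default.
    polarization = "None"
    if pol_ax is not None and header.get(f"NAXIS{pol_ax}", 1) == 1:
        polarization = STOKES_BY_CODE.get(header[f"CRVAL{pol_ax}"], "None")

    bmaj, bmin = header["BMAJ"], header["BMIN"]  # assumed to be in degrees
    return {
        "ra_element_count": header[f"NAXIS{ra_ax}"],
        "dec_element_count": header[f"NAXIS{dec_ax}"],
        "ra_pixel_size": ra_pixel_size,
        "dec_pixel_size": dec_pixel_size,
        # Quadratic mean (RMS) of the two extents, in degrees.
        "image_field_of_view": math.sqrt((ra_extent**2 + dec_extent**2) / 2.0),
        # Geometric mean of the beam axes, converted from degrees to arcsec.
        "spatial_resolution": 3600.0 * math.sqrt(bmaj * bmin),
        "beam_axis_ratio": bmaj / bmin,
        "polarization_id": polarization,
    }
```

The time, band, and noise columns are deliberately left out, since the table says they must come from outside the FITS header.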

FITS Data Description Keywords:

For the purpose of generality, FITS provides a detail-independent method of data access.  It's easier to think of the data axis descriptors in groupings by their axis number.  The NAXIS keyword provides the total number of dimensions within the data.  For the nth dimension of the data, we have a set of descriptor keywords which should be considered and used together:

  • NAXISn  - Total data size along this axis
  • CRPIXn  - Our reference location
  • CRVALn - The physical value at our reference location
  • CDELTn - The increment along the axis
  • CTYPEn - The axis label
  • CUNITn  - The axis units

The CTYPEn and CUNITn values provide information about the axis to which this group of values applies.  The rest of the keywords can then be used to calculate points of interest upon that axis.  For instance, in axes longer than 1, we have:

Minimum:  CRVALn + CDELTn*(1-CRPIXn)

Center:      CRVALn + CDELTn*((NAXISn+1)/2 - CRPIXn)

Maximum: CRVALn + CDELTn*(NAXISn - CRPIXn)

For the Frequency axis, which only has a single point (NAXISn=1), our calculations are simpler:

Minimum:  CRVALn - CDELTn/2

Center :     CRVALn

Maximum: CRVALn + CDELTn/2
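
The two cases above can be sketched in Python as follows. This is only an illustration under stated assumptions: the header is a plain dict, the axis is treated as linear (no projection effects), and the center of a multi-pixel axis is taken as the midpoint of the first and last pixel values, i.e. pixel (NAXISn+1)/2. Sorting the endpoints is an addition of this sketch, so that axes with a negative CDELTn (typical for RA) still report the smaller value as the minimum. The function name `axis_range` is hypothetical.

```python
def axis_range(header, n):
    """Return (minimum, center, maximum) of axis n from its FITS keywords."""
    naxis = header[f"NAXIS{n}"]
    crval = header[f"CRVAL{n}"]
    cdelt = header[f"CDELT{n}"]
    crpix = header[f"CRPIX{n}"]

    if naxis == 1:
        # Single-point axis (e.g. frequency): the one pixel spans CDELTn
        # around CRVALn. Assumes the reference pixel is that single pixel.
        first = crval - cdelt / 2.0
        last = crval + cdelt / 2.0
        center = crval
    else:
        # Values at the first and last pixel along the axis.
        first = crval + cdelt * (1 - crpix)
        last = crval + cdelt * (naxis - crpix)
        center = (first + last) / 2.0

    # CDELTn may be negative, so order the endpoints explicitly.
    return min(first, last), center, max(first, last)
```

For example, a single-channel frequency axis with CRVALn = 3e9 and CDELTn = 2e9 would give min_frequency = 2e9 Hz and max_frequency = 4e9 Hz.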

Image Sets Database Table:

The Image Set information will need to come from outside sources, as most of the information is not guaranteed to be in the FITS files themselves.  Vlass Manager holds all the needed information for their images, but future development will need to provide the relevant metadata as image sources broaden beyond VLASS.

| Column Name | VO Equivalent | Source | Comment |
|---|---|---|---|
| image_set_id | obs_id | automatically generated | Unique Identifier for the Image Set |
| project_code | | | Required to facilitate Ingestion |
| configuration | | VlassMgr | This will need to hold the entire list used for the imaging. |
| collection_name | obs_collection | VlassMgr | |
| calibration_level | calib_level | VlassMgr | As defined by the VO in their 0-4 system |
| product_file_id | | automatically generated | Link to the imaging products tar file |
| tags | | | Internal tags which apply to all images of the set |


VO ObsCore Remaining Fields:

| VO Requirement | Value | Source |
|---|---|---|
| access_url | | |
| access_estsize | | files.filesize, or combined value for an image set |
| dataproduct_type | 'image' | default |
| access_format | 'fits' | default |
| obs_publisher_did | | Obtained upon registering with the Virtual Observatory |
| facility_name | 'NRAO' | default |
| t_resolution | | images.exposure_time |
| t_xel | 1 | default |
| em_res_power | null | default |
| em_xel | 1 | default |
| pol_xel | 1 | default |
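
To show how the column mappings and defaults listed on this page might come together, here is a rough Python sketch that builds an ObsCore-style row from an Images record. The dict-based record, the `filesize` argument, and the names `OBSCORE_DEFAULTS`, `IMAGE_TO_OBSCORE`, and `obscore_row` are all hypothetical. The sketch leaves out access_url and obs_publisher_did, whose sources are not yet defined, and em_min/em_max, since the frequency columns would need a conversion rather than a rename.

```python
# Constant values taken from the "VO ObsCore Remaining Fields" table.
OBSCORE_DEFAULTS = {
    "dataproduct_type": "image",
    "access_format": "fits",
    "facility_name": "NRAO",
    "t_xel": 1,
    "em_res_power": None,
    "em_xel": 1,
    "pol_xel": 1,
}

# Images table column -> ObsCore field, for columns that map by simple rename.
IMAGE_TO_OBSCORE = {
    "sourcename": "target_name",
    "ra": "s_ra",
    "dec": "s_dec",
    "image_field_of_view": "s_fov",
    "spatial_region": "s_region",
    "ra_element_count": "s_xel1",
    "dec_element_count": "s_xel2",
    "spatial_resolution": "s_resolution",
    "starttime": "t_min",
    "endtime": "t_max",
    "exposure_time": "t_exptime",
    "polarization_id": "pol_states",
    "telescope": "instrument_name",
    "image_id": "obs_id",
    "image_units": "o_ucd",
}


def obscore_row(image, filesize=None):
    """Build an ObsCore-style dict from an Images record (a dict).

    Missing image columns become None, mirroring nullable database columns.
    """
    row = dict(OBSCORE_DEFAULTS)
    for column, vo_field in IMAGE_TO_OBSCORE.items():
        row[vo_field] = image.get(column)
    # t_resolution is sourced from images.exposure_time per the table.
    row["t_resolution"] = image.get("exposure_time")
    # Passed through as given; no unit conversion is attempted here.
    row["access_estsize"] = filesize
    return row
```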


Thumbnails

We won't store thumbnails in NGAS as such. Instead, for each thumbnail we will:

  1. compute the sha1 hash of the file
  2. store the file in a filesystem at $ROOT/$1/$2/$3/$FILENAME, where $ROOT is a CAPO property that maps to the root of the filesystem, $1 is the first two characters of the sha1 sum, $2 the second two, and $3 the third two.
  3. in the metadata database we will store the $1/$2/$3/$FILENAME path
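
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the actual ingestion code: the root directory is passed in directly rather than looked up from CAPO, and the function names `thumbnail_relative_path` and `store_thumbnail` are hypothetical.

```python
import hashlib
import os
import shutil


def thumbnail_relative_path(file_path):
    """Return the $1/$2/$3/$FILENAME path derived from the file's sha1 sum."""
    sha1 = hashlib.sha1()
    with open(file_path, "rb") as f:
        # Hash in chunks so large files need not fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            sha1.update(chunk)
    digest = sha1.hexdigest()
    # First, second, and third pairs of hex characters of the digest.
    return "/".join(
        [digest[0:2], digest[2:4], digest[4:6], os.path.basename(file_path)]
    )


def store_thumbnail(file_path, root):
    """Copy the thumbnail under `root` and return the path to record."""
    relative = thumbnail_relative_path(file_path)
    destination = os.path.join(root, *relative.split("/"))
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    shutil.copy2(file_path, destination)
    # This relative path is what would go into the metadata database.
    return relative
```

Spreading files across three levels of two-character directories keeps any single directory from growing too large, while the stored relative path stays valid if $ROOT is later moved.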

34 Comments

  1. What would target_name be in the context of, say, quicklook images?

    Do we want to allow none, any or all of these database columns to be null?

    1. Currently the quicklook images do not populate the OBJECT keyword, so allowing that to be null would be a good idea.

      The time-domain information (start, stop, integration time) might also need to be nullable, since that information isn't always provided (either by a time axis in the file or by an outside source).  For the start & stop time we could fall back upon the times at the project level, though those values are nullable themselves.

      1. In general, the "sourcename" parameter should be optional. Allowing this to be null is an acceptable solution.

  2. This has likely been thought of, but for columns that have scan/EB analogs (positions, frequencies, whatever), I'd use the same units and, where possible, the same column names, to make it easier for the front end to search positions across EBs and images without converting anything.

    1. I certainly had missed that opportunity.  I've done some renaming of the columns that have analogs in the EB/scan/subscan tables to match their names.  

  3. The OBJECT keyword will be filled once CAS-10664 is fixed - it's a bug in tclean/MFS. Of course what it will be filled with is a little unclear in the case of VLASS quicklook images. I suggest the VLASS quicklook sub-mosaic images have the OBJECT keyword set to be the center of the sub-mosaic e.g. J123456.7+123456.

    1. Thanks for the link to that bug, it'll be nice when that's in. 

      A slightly more pressing concern is that the polarization information in the header is incomplete as far as I can tell.  For VLASS Single Epoch Imaging, we'll need to be able to distinguish between the I, Q, and U images (apparently they're not doing V).  I've searched for a ticket, but have found nothing.  Do you know of one, or do we need to get this on the radar for the SE processing?

      1. Good point, I am not sure what would come out of the SE pipeline. CASA seems to do the correct thing - there is a Stokes axis (CTYPE4='STOKES') with appropriate values set for CRVAL4, CDELT4 and CRPIX4. So assuming the pipeline uses the same code we should be fine I think.

        1. I do see that CASA puts the fields in, but the pipeline (or the parameters given for quicklook creation) doesn't provide much that seems intuitively useful to me:

          CTYPE4  = 'STOKES  '
          CRVAL4  =   1.000000000000E+00
          CDELT4  =   1.000000000000E+00
          CRPIX4  =   1.000000000000E+00
          CUNIT4  = '        '

          However, I don't have anything but intensity to check against offhand (the quicklook images didn't handle polarization).  Am I missing a piece of the CASA rosetta stone?

          1. Yes, this is a special thing... The way it is done is that Stokes I,Q,U and V are mapped to values 1,2,3,4 in the Stokes axis. 

            So if you have a cube, you have NAXIS4=4, CRVAL4=1, CRPIX4=1 and CDELT4=1, and your four planes then have STOKES=1,2,3,4 corresponding to I,Q,U,V. If you just have a single plane of Stokes Q, for example, then NAXIS4=1, CRVAL4=2, CRPIX4=1 and CDELT4=1, etc.

            1. Oh, great, puts to rest my biggest concern (a potential CASA change needed).  Now I just need to negotiate getting the time information from the Vlass Manager. 


  4. Some detailed comments now that I have had a chance to look this over in more detail:

    1) replace "spacial" with "spatial" throughout

    2) Some VO keywords are associated with specific units (see http://www.ivoa.net/documents/ObsCore/20170509/REC-ObsCore-v1.1-20170509.pdf) , we should probably match these. In fact I would add a column with units in to the tables above so we don't forget. e.g.

    • s_resolution - arcsec
    • t_exptime in seconds
    • ra, dec, fov in degrees
    • resolving power dimensionless (frequency/(frequency resolution)) in our case

    3) em_min and em_max seem to be wavelengths (with units of meters) in the VO standard, so our min_frequency and max_frequency would not map directly to those, we would need a conversion.


  5. spacial_region is also in the Images db table and should be updated.

  6. Fixed the spacial→spatial issue both on the page and in the database. 

    A question about the conversion between frequency and wavelength: FITS doesn't provide a standard header element for a velocity frame of reference.  The quicklook image headers show SPECSYS, VELREF, and RESTFRQ elements.  Do we know which of these, if any, would be consistently populated by CASA?

    Or am I over complicating matters, and we can fall back on a simple c/ν conversion?

    1. All three seem to be populated by CASA, even for continuum images. The c/ν conversion is problematic due to the convention usually used by radio astronomers, which equates frequency channel widths to velocity channel widths.
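
Setting the frame-of-reference concern aside, the simple c/ν mapping from min_frequency/max_frequency (Hz) to the ObsCore em_min/em_max (wavelengths in meters) could look like the Python sketch below. It ignores SPECSYS/VELREF entirely, so it is only a first approximation. The function names are hypothetical. The resolving power follows the frequency/(frequency resolution) definition mentioned in the comments above.

```python
# Speed of light in vacuum, in m/s (exact by definition of the meter).
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def frequency_range_to_em(min_frequency_hz, max_frequency_hz):
    """Convert a frequency range in Hz to (em_min, em_max) in meters.

    Wavelength decreases as frequency increases, so the highest frequency
    gives the shortest wavelength (em_min) and vice versa.
    """
    em_min = SPEED_OF_LIGHT_M_PER_S / max_frequency_hz
    em_max = SPEED_OF_LIGHT_M_PER_S / min_frequency_hz
    return em_min, em_max


def resolving_power(center_frequency_hz, frequency_resolution_hz):
    """Dimensionless resolving power: frequency / frequency resolution."""
    return center_frequency_hz / abs(frequency_resolution_hz)
```

For a 2-4 GHz image this gives em_min of about 0.075 m and em_max of about 0.15 m.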

  7. For consistency in naming, wouldn't we want source_name, start_time, end_time?  Also, facilty_name→facility_name.


    But more importantly, is this going to fit into a framework that allows for more than just images?  A trivial example would be, say, the observing log.  A more complicated one would be a PSRFITS file, or a realfast SDM+BDF dataset.  The framework should allow for more than just images to be stuffed into the archive, and should also retain linkages (when there is a clear one) between items.  Now, for VLASS images, there might not be a one-to-one association with a filesetID (the data resultant from a single EB), depending on how the imaging was done.  But for other items (observing logs, for instance), we want clear associations between items (usually a parent→child type association).


    1. sourcename, starttime, and endtime are chosen for consistency with the execblock & scan tables in the database (so the UI  can ask for the same column names).  My initial versions of those had underscores, but I revised them to fit with the existing nomenclature.  I'll fix that typo presently.

      I gather that VO interactivity is the larger framework we'll be using for the more general exposure of data products, at least to start with.  Some of these changes from the initial images table were made to mesh more easily with the VO OBSCORE.  We'll need to keep to that coverage of OBSCORE as we add new types of files to what we ingest into the archive. 

      We already use the type of parent-child structure you have in mind for the calibration tar files, so I could see that easily being extended to things like the observation logs or other tightly coupled items in the archive.  

      Images present a particular problem due to their multiple sources:  1 EB, many EBs, 4 EBs + GBT data, even other images (in the future).  Any set of images ingested into the archive is brought in with a tar file of 'image products' from the creating CASA run (currently what's in the VLASS quicklook cache with their images, to be further refined later).  Those image products will be delivered alongside the image when it is retrieved from the archive, and will contain the linkage back to the image source(s).   


  8. Some comments on the spatial information:

    The "field_of_view" column name is going to be confusing to our users because there is a concept of "FoV" of an interferometer (the primary beam), which is not necessarily the same as the size of the image (which I think is what is meant here). Can we call it something different, like "image_size"?

    I have no idea what "spatial_region" refers to, and how it is different from the image size. Please could someone explain?

    I'm not sure of the relevance of ra_element_count and dec_element_count as a means of searching for images in the archive, because they depend on the pixel size chosen for the imaging, which can be arbitrary. The only reason to keep them, in my view, is for VO compatibility.

    The spatial_resolution column should carry the geometric mean of the PSF, sqrt(BMAJ*BMIN), and probably the PSF axis ratio, BMAJ/BMIN, rather than the pixel size.

    1. spatial_region is a VO specific thing.  It's another description of the location imaged (meant for doing searches by defining a geometric region on the sky and looking for overlaps).  It is somewhat redundant, but I suspect it will be simpler to perform the transformation once upon ingestion.

      The fields in this table are meant to be a super-set of the VO & front-end search information, so not all the fields are going to show up directly in the search interface.  Some (like the element counts) can simply be left out of what we present. 

      Your calculation for the spatial_resolution is far better than mine.  I was mostly going by the FITS standard, but I knew there would be cases where those simple calculations were inappropriate.   The axis ratio is a good idea too, I'll add it.

      When it comes to a specific image, isn't the image size the same as the field of view?  I know there is far more flexibility in where you can image from an observation, but this is the specific result of an imaging process.  Wouldn't the wider field of view be more appropriately tied to the execution block?


      1. The field of view question is more one of confusing terminology, because for interferometers it can mean something different (indeed, for optical/IR, the field of view of an instrument can be different from the image one chooses to archive from it). If others don't think it could be confusing then it's OK to keep, but image_size is maybe clearer.

        1. Ah, I was misinterpreting what you were saying.  I see your point, but I see a similar potential issue with image_size (as in kilobytes, or degrees?).  How about image_extent? 

          1. Perhaps "image_field_of_view" would be sufficient to clarify.

  9. Some comments on the temporal information:

    starttime probably maps to t_min rather than t_max, and vice versa for endtime.

    exposure_time is a difficult item to quantify, it isn't something we carry along with the data, and will be a function of flagging applied to the input visibility data. It may be possible to come up with something to go in this slot but it will take some work, so "null" should be an option for right now.

    1. Fixed the mix up in starttime/endtime and t_min/t_max.

      I agree with the issue of the exposure time.  I mostly put it on there to provoke discussion.  I suspect the endpoints of the observations will do for most purposes. 

  10. Some comments on spectral information:

    min_frequency/max_frequency should be separated from observing_band, since there is frequency overlap in the receiver responses and, e.g., 39.5GHz can be covered by either the Ka-band or Q-band receiver at the VLA.

    observing_band should have the option to include more than one receiver band name associated with an image.

    We will need to add a center_frequency.

    spectral_resolution should be the resolution in frequency, not resolving power as for optical/IR. We should be able to convert between them for VO compatibility, but for most of our users this means a channel width (in Hz/kHz/MHz, etc.). We should also include a rest_frequency, and be able to convert spectral_resolution and center_frequency to velocity given the rest_frequency, via the archive search interface.


  11. The intent is for the min/max frequency values to take the place of the observing_band column in the initial table definition.  As of now, there's no plan to carry the band identifiers, simply the frequency values.  However, we should have a plan in place for where we will perform that mapping. 

    It seems to me that the closest we can get to a channel width is the bandwidth (CDELTn) information in the header.  We could carry that, along with a center_frequency & rest_frequency.  However, there is a problem with this assumption. 

    Multi-band images (or even single-band observations with discontinuous spectral windows) raise an interesting issue:  in most cases the 'bandwidth' (CDELTn) provided by the FITS header would not relate to the actual span of frequencies summed for the image.   That would certainly require some sort of caveat or explanation.  CASA does provide spectral window numbers, but that does not ensure that the data span the entire reported bandwidth (unless I'm missing another piece of the CASA rosetta stone?). 

    Unfortunately, I don't have any resolution to this conundrum right now, let alone a good one.  Thoughts?

    1. I think that the bandwidth carried by CASA is the max-min frequency. It can't do much else, because if individual channels are flagged I don't think the effective bandwidth is something that is tracked by CASA.

  12. You still may want the observing band if possible, because there are many frequencies which can be observed at two bands, and sometimes you may want to know which one.


    Now, it's a moot point for VLASS, but if we're making a general framework, I think you want to carry band information where it's known.


    1. Indeed. People will want to search on observing_band as well as min/max frequency.

  13. I have a question about how flexible the CASA pipeline is for imaging:  I know you can subdivide a MS by polarization and spectral window.  Can that be done by time as well?  I was considering moving the start/end/exposure times into the image_products table, if they would be shared among images generated together. 

    The CASA imaging documentation doesn't seem to lend itself toward a time-series of images, but I could be missing something easily enough. 

  14. At the moment it is very inflexible, as far as the VLA is concerned. But that will change for SRDP, which Jeff Kern should comment on.

  15. I think we should start a conversation about VLASS specific metadata, I believe we are at that point. I'm thinking epoch, I'm thinking type (quicklook, continuum, tapered, coarse cube, fine cube), cumulative. Maybe some notion of which cube a specific image layer belongs to.

    I also think we should note any of the fields above that are specific to VLASS, or at least not generic, and abstract them.

    Those are strawman ideas. Discuss.

  16. None of the items are VLASS specific at the moment. The other things you note are currently all encoded in the product name for VLASS (which is good, because there's nothing in the metadata in the FITS files that provides this information).

  17. Earlier this week I had some discussions with people from the CADC regarding the metadata for VLASS images. CADC has ingested our VLASS1.1 Quick Look images into their archive, but noted they are missing a value for the INSTRUMENT keyword. For other telescopes the format of this keyword is <receiver>-<backend>, the equivalent of which for VLASS would be 'S-WIDAR'. They have hardcoded 'S-WIDAR' into their metadata for now, and we should consider whether we want this in our metadata as well. Images from multiple receiver bands could be specified as, e.g., SC-WIDAR for composite S and C band data. At present, this keyword is not in FITS files produced by CASA, but we can investigate putting it in the pipeline weblog for future harvesting.