Below are two sets of metadata that the new archive will store for Images and for the Image Sets to which they belong. Where relevant, each entry lists the equivalent field in the Virtual Observatory ObsCore Data Model, a potential source from which to obtain the data, and a comment. The images table was written specifically with spatial images in mind. Other data products (spectra, time series, etc.) would have different levels of granularity in the header information available.
Images Database Table:
Column Name | Units | VO Equivalent | Source | Comments |
---|---|---|---|---|
sourcename | | target_name | OBJECT Keyword | |
ra | deg | s_ra | Center of Ra Axis | |
dec | deg | s_dec | Center of Dec Axis | |
image_field_of_view | deg | s_fov | Quadratic Mean of the ra & dec extents (NAXISn*CDELTn) | |
spatial_region | | s_region | Derived from spatial Data | STC-S String Defined in the TAP (Dowler, et al 2010) |
ra_element_count | | s_xel1 | Relevant NAXISn Keyword | |
ra_pixel_size | | | abs(Relevant CDELTn) | |
dec_element_count | | s_xel2 | Relevant NAXISn Keyword | |
dec_pixel_size | | | abs(Relevant CDELTn) | |
spatial_resolution | arcsec | s_resolution | sqrt(BMAJ*BMIN) | Geometric mean of the synthesized beam axes. |
beam_axis_ratio | | | BMAJ/BMIN | Ratio of the synthesized beam axes. |
starttime | MJD | t_min | These values should be obtained by some larger-scale process. VlassMgr, in the case of quicklook images. | |
endtime | MJD | t_max | | |
exposure_time | s | t_exptime | | |
min_frequency | Hz | em_min | Minimum of Spectral Axis | |
max_frequency | Hz | em_max | Maximum of Spectral Axis | |
rest_frequency | Hz | | RESTFRQ Keyword | Used for transformations |
band_code | | | weblog | Requires outside information source for accuracy. |
polarization_id | | pol_states | Polarization CRVALn value | CASA uses the CRVALn value to convey polarization information, with [1,2,3,4] mapped to [I,Q,U,V]. Default to 'None' in case of other values. |
telescope | | instrument_name | TELESCOP Keyword | This must accommodate images for multiple instruments (i.e. VLA + Single Dish) |
file_id | | | automatically generated | link to information about the physical file |
image_id | | obs_id | automatically generated | Unique Identifier for the Image |
image_units | | o_ucd | BTYPE & BUNIT Keywords | Description of the physical quantity measured in the image |
max_intensity | image_units | | image table values | |
min_intensity | image_units | | image table values | |
rms_noise | image_units | | weblog | Both of these will need to come from the last stage of the imaging process, to maintain accuracy. For quicklook images, that is stage7. That may change (and will likely be different for the Single Epoch products). |
thumbnail | | | weblog, find the matching thumbnail image generated. | |
tags | | | | Internal tagging system to facilitate searches. |
ra_pixel_size | | | Appropriate CDELTn | |
dec_pixel_size | | | Appropriate CDELTn | |
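The derived spatial quantities in the table (image_field_of_view, spatial_resolution, beam_axis_ratio) could be computed along the following lines. This is a minimal sketch rather than the ingestion code: the function names are hypothetical, and it assumes that BMAJ, BMIN, and the CDELTn values are given in degrees, so the beam is converted to arcsec to match the Units column.

```python
import math


def image_field_of_view(naxis_ra, cdelt_ra, naxis_dec, cdelt_dec):
    """Quadratic mean of the RA and Dec extents (NAXISn*CDELTn), in degrees."""
    ra_extent = abs(naxis_ra * cdelt_ra)      # CDELT is often negative for RA
    dec_extent = abs(naxis_dec * cdelt_dec)
    return math.sqrt((ra_extent ** 2 + dec_extent ** 2) / 2.0)


def spatial_resolution_arcsec(bmaj_deg, bmin_deg):
    """Geometric mean of the synthesized beam axes, converted to arcsec.

    Assumes BMAJ and BMIN are in degrees.
    """
    return math.sqrt(bmaj_deg * bmin_deg) * 3600.0


def beam_axis_ratio(bmaj, bmin):
    """Ratio of the synthesized beam axes (unitless)."""
    return bmaj / bmin
```

For example, a 100 x 100 pixel image with 0.01 deg pixels has an image_field_of_view of about 1 deg, and a beam of 4 x 1 arcsec gives a spatial_resolution of about 2 arcsec and a beam_axis_ratio of 4.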
FITS Data Description Keywords:
For the purpose of generality, FITS provides a detail-independent method of data access. It's easier to think of the data axis descriptors in groupings by their axis number. The NAXIS keyword provides the total number of dimensions within the data. For the nth dimension of the data, we have a set of descriptor keywords which should be considered and used together:
- NAXISn - Total data size along this axis
- CRPIXn - Our reference location
- CRVALn - The physical value at our reference location
- CDELTn - The increment along the axis
- CTYPEn - The axis label
- CUNITn - The axis units
The CTYPEn and CUNITn values describe the axis to which this group of keywords applies. The remaining keywords can then be used to calculate points of interest along that axis. For instance, for axes longer than 1 (taking values at pixel centers, with the first pixel numbered 1), we have:
Minimum: CRVALn + CDELTn*(1 - CRPIXn)
Center: CRVALn + CDELTn*((NAXISn + 1)/2 - CRPIXn)
Maximum: CRVALn + CDELTn*(NAXISn - CRPIXn)
If CDELTn is negative (as is typical for the RA axis), the Minimum and Maximum expressions swap roles.
For the Frequency axis, which only has a single point (NAXISn=1), our calculations are simpler:
Minimum: CRVALn - CDELTn/2
Center: CRVALn
Maximum: CRVALn + CDELTn/2
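The two cases above can be folded into one helper. The sketch below is illustrative only: the function name is hypothetical, it treats the axis as purely linear (no sky projection effects), and it takes the center of a multi-pixel axis as the midpoint between the first and last pixel values.

```python
def axis_extent(naxis, crpix, crval, cdelt):
    """Return (minimum, center, maximum) for one linear FITS axis.

    naxis, crpix, crval, cdelt are the NAXISn, CRPIXn, CRVALn and CDELTn
    values for the axis of interest.
    """
    if naxis == 1:
        # Single-point axis (e.g. frequency): the edges sit half an
        # increment either side of the reference value.
        half_width = abs(cdelt) / 2.0
        return (crval - half_width, crval, crval + half_width)

    # Values at the centers of the first and last pixels (pixels count from 1).
    first = crval + cdelt * (1 - crpix)
    last = crval + cdelt * (naxis - crpix)
    center = (first + last) / 2.0
    # CDELTn may be negative, so sort the endpoints explicitly.
    return (min(first, last), center, max(first, last))
```

For a single-channel frequency axis with CRVALn = 3.0e9 and CDELTn = 2.0e9, this returns 2.0e9, 3.0e9, and 4.0e9 Hz for the minimum, center, and maximum.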
Image Sets Database Table:
The Image Set information will need to come from outside sources, as most of the information is not guaranteed to be in the FITS files themselves. Vlass Manager holds all the needed information for their images, but future development will need to provide the relevant metadata as image sources broaden beyond VLASS.
Column Name | VO Equivalent | Source | Comment |
---|---|---|---|
image_set_id | obs_id | automatically generated | Unique Identifier for the Image Set |
project_code | | | Required to facilitate Ingestion |
configuration | | VlassMgr | This will need to hold the entire list used for the imaging. |
collection_name | obs_collection | VlassMgr | |
calibration_level | calib_level | VlassMgr | As defined by the VO in their 0-4 system |
product_file_id | | automatically generated | Link to the imaging products tar file |
tags | | | internal tags which apply to all images of the set |
VO ObsCore Remaining Fields:
VO Requirement | Value | Source |
---|---|---|
access_url | ||
access_estsize | files.filesize, or combined value for an image set | |
dataproduct_type | 'image' | default |
access_format | 'fits' | default |
obs_publisher_did | Obtained upon registering with the Virtual Observatory | |
facility_name | 'NRAO' | default |
t_resolution | images.exposure_time | |
t_xel | 1 | default |
em_res_power | null | default |
em_xel | 1 | default |
pol_xel | 1 | default |
Thumbnails
We won't store thumbnails in NGAS as such. For each thumbnail, we will:
- compute the sha1 hash of the file
- store the file in a filesystem at $ROOT/$1/$2/$3/$FILENAME, where $ROOT is a CAPO property that maps to the root of the filesystem, $1 is the first two characters of the sha1 sum, $2 the second two, and $3 the third two.
- in the metadata database we will store the $1/$2/$3/$FILENAME path
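The steps above can be sketched as follows. This is an illustration under stated assumptions, not the archive's implementation: the function names are hypothetical, and the root directory is passed in as a plain argument (looking up $ROOT from CAPO is left out).

```python
import hashlib
import os
import shutil


def thumbnail_relative_path(data: bytes, filename: str) -> str:
    """Build the $1/$2/$3/$FILENAME path from the sha1 of the file contents."""
    digest = hashlib.sha1(data).hexdigest()
    # $1, $2, $3 are the first, second, and third pairs of hex characters.
    return f"{digest[0:2]}/{digest[2:4]}/{digest[4:6]}/{filename}"


def store_thumbnail(root: str, source_path: str) -> str:
    """Copy a thumbnail under root and return the path to store in the database."""
    with open(source_path, "rb") as handle:
        data = handle.read()
    relative = thumbnail_relative_path(data, os.path.basename(source_path))
    destination = os.path.join(root, *relative.split("/"))
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    shutil.copyfile(source_path, destination)
    return relative
```

Spreading files across three levels of two-character directories keeps any single directory from accumulating a very large number of entries, and the stored relative path stays valid if $ROOT is later moved.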
34 Comments
Stephan Witz
What would target_name be in the context of, say, quicklook images?
Do we want to allow none, any or all of these database columns to be null?
James Sheckard
Currently the quicklook images do not populate the OBJECT keyword, so allowing that to be null would be a good idea.
The time-domain information (start, stop, integration time) might also need to be nullable, since the information may not be provided (either by a time axis in the file or by an outside source). For the start & stop times we could fall back upon the times at the project level, though those values are nullable themselves.
Claire Chandler
In general, the "sourcename" parameter should be optional. Allowing this to be null is an acceptable solution.
Stephan Witz
So this has likely been thought of, but for columns that have scan/EB analogs (positions, frequencies, whatever), I'd use the same units, and where possible, column names, to make it easier for the front end to search positions across EBs and images without converting stuff.
James Sheckard
I certainly had missed that opportunity. I've done some renaming of the columns that have analogs in the EB/scan/subscan tables to match their names.
Mark Lacy
The OBJECT keyword will be filled once CAS-10664 is fixed - it's a bug in tclean/MFS. Of course what it will be filled with is a little unclear in the case of VLASS quicklook images. I suggest the VLASS quicklook sub-mosaic images have the OBJECT keyword set to be the center of the sub-mosaic e.g. J123456.7+123456.
James Sheckard
Thanks for the link to that bug, it'll be nice when that's in.
A slightly more pressing concern is that the polarization information in the header is incomplete as far as I can tell. For VLASS Single Epoch Imaging, we'll need to be able to distinguish between the I, Q, and U images (apparently they're not doing V). I've searched for a ticket, but have found nothing. Do you know of one, or do we need to get this on the radar for the SE processing?
Mark Lacy
Good point, I am not sure what would come out of the SE pipeline. CASA seems to do the correct thing - there is a Stokes axis (CTYPE4='STOKES') with appropriate values set for CRVAL4, CDELT4 and CRPIX4. So assuming the pipeline uses the same code we should be fine I think.
James Sheckard
I do see that CASA puts the fields in, but the pipeline (or the parameters given for quicklook creation) don't provide much that seems intuitively useful to me:
However, I don't have anything but intensity to check against offhand (the quicklook images didn't handle polarization). Am I missing a piece of the CASA rosetta stone?
Mark Lacy
Yes, this is a special thing... The way it is done is that Stokes I,Q,U and V are mapped to values 1,2,3,4 in the Stokes axis.
So if you have a cube, you have NAXIS4=4, CRVAL4=1, CRPIX4=1 and CDELT4=1, and your four planes then have STOKES=1,2,3,4 corresponding to I,Q,U,V. If you just have a single plane of Stokes Q, for example, then NAXIS4=1, CRVAL4=2, CRPIX4=1 and CDELT4=1 etc.
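A small sketch of how the ingestion side might decode the Stokes axis using the 1,2,3,4 to I,Q,U,V mapping. The function name is hypothetical, and the 'None' fallback for any other value follows the polarization_id comment in the Images table rather than anything CASA itself defines.

```python
STOKES_LABELS = {1: "I", 2: "Q", 3: "U", 4: "V"}


def polarization_ids(naxis, crval, crpix, cdelt):
    """List the polarization label for each plane along the Stokes axis.

    Arguments are the NAXISn, CRVALn, CRPIXn and CDELTn values of the axis
    whose CTYPEn is 'STOKES'.
    """
    labels = []
    for pixel in range(1, naxis + 1):
        value = crval + cdelt * (pixel - crpix)
        code = round(value)
        if abs(value - code) < 1e-6:
            # Codes outside 1..4 fall back to 'None'.
            labels.append(STOKES_LABELS.get(code, "None"))
        else:
            labels.append("None")
    return labels
```

With the full-cube values (NAXIS4=4, CRVAL4=1, CRPIX4=1, CDELT4=1) this yields I, Q, U, V in plane order.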
James Sheckard
Oh, great, puts to rest my biggest concern (a potential CASA change needed). Now I just need to negotiate getting the time information from the Vlass Manager.
Mark Lacy
Some detailed comments now that I have had a chance to look this over in more detail:
1) replace "spacial" with "spatial" throughout
2) Some VO keywords are associated with specific units (see http://www.ivoa.net/documents/ObsCore/20170509/REC-ObsCore-v1.1-20170509.pdf); we should probably match these. In fact I would add a column with units to the tables above so we don't forget. e.g.
3) em_min and em_max seem to be wavelengths (with units of meters) in the VO standard, so our min_frequency and max_frequency would not map directly to those, we would need a conversion.
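As a rough sketch of what that conversion might look like, assuming a plain vacuum c/ν relation and ignoring any reference-frame or velocity-convention subtleties (the function name is hypothetical):

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def frequency_range_to_em_bounds(min_frequency_hz, max_frequency_hz):
    """Convert a frequency range in Hz to (em_min, em_max) wavelengths in meters.

    The highest frequency gives the shortest wavelength, so the bounds swap.
    """
    em_min = SPEED_OF_LIGHT_M_PER_S / max_frequency_hz
    em_max = SPEED_OF_LIGHT_M_PER_S / min_frequency_hz
    return em_min, em_max
```

A 2 to 4 GHz image, for example, would map to em_min of roughly 0.075 m and em_max of roughly 0.150 m.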
Rick Lively
spacial_region is also in the Images db table and should be updated.
James Sheckard
Fixed the spacial→spatial issue both on the page and in the database.
A question about the conversion between frequency and wavelength: FITS doesn't provide a standard header element for a velocity frame of reference. The quicklook image headers show SPECSYS, VELREF, and RESTFRQ elements. Do we know which of these, if any, would be consistently populated by CASA?
Or am I overcomplicating matters, and we can fall back on a simple c/ν conversion?
Mark Lacy
All three seem to be populated by CASA, even for continuum images. The c/ν conversion is problematic because of the convention usually used by radio astronomers, which equates frequency channel widths to velocity channel widths.
Bryan Butler
For consistency in naming, wouldn't we want source_name, start_time, end_time? Also, facilty_name→facility_name.
But more importantly, is this going to fit into a framework that allows for more than just images? A trivial example would be, say, the observing log. A more complicated one would be a PSRFITS file, or a realfast SDM+BDF dataset. The framework should allow for more than just images to be stuffed into the archive, and should also retain linkages (when there is a clear one) between items. Now, for VLASS images, there might not be a one-to-one association with a filesetID (the data resultant from a single EB), depending on how the imaging was done. But for other items (observing logs, for instance), we want clear associations between items (usually a parent→child type association).
James Sheckard
sourcename, starttime, and endtime are chosen for consistency with the execblock & scan tables in the database (so the UI can ask for the same column names). My initial versions of those had underscores, but I revised them to fit with the existing nomenclature. I'll fix that typo presently.
I gather that VO interactivity is the larger framework we'll be using for the more general exposure of data products, at least to start with. Some of these changes from the initial images table were made to mesh more easily with the VO OBSCORE. We'll need to keep to that coverage of OBSCORE as we add new types of files to what we ingest into the archive.
We already use the type of parent-child structure you have in mind for the calibration tar files, so I could see that easily being extended to things like the observation logs or other tightly coupled items in the archive.
Images present a particular problem due to their multiple sources: 1 EB, many EBs, 4 EBs + GBT data, even other images (in the future). Any set of images ingested into the archive is brought in with a tar file of 'image products' from the creating CASA run (currently what's in the VLASS quicklook cache with their images, to be further refined later). Those image products will be delivered alongside that image when it is retrieved from the archive. Those image products will contain the linkage back to the image source(s).
Claire Chandler
Some comments on the spatial information:
The "field_of_view" column name is going to be confusing to our users because there is a concept of "FoV" of an interferometer (the primary beam), which is not necessarily the same as the size of the image (which I think is what is meant here). Can we call it something different, like "image_size"?
I have no idea what "spatial_region" refers to, and how it is different from the image size. Please could someone explain?
I'm not sure of the relevance of ra_element_count and dec_element_count as a means of searching for images in the archive, because they depend on the pixel size chosen for the imaging, which can be arbitrary. The only reason to keep them, in my view, is for VO compatibility.
The spatial_resolution column should carry the geometric mean of the PSF, sqrt(BMAJ*BMIN), and probably the PSF axis ratio, BMAJ/BMIN, rather than the pixel size.
James Sheckard
spatial_region is a VO specific thing. It's another description of the location imaged (meant for doing searches by defining a geometric region on the sky and looking for overlaps). It is somewhat redundant, but I suspect it will be simpler to perform the transformation once upon ingestion.
The fields in this table are meant to be a super-set of the VO & front-end search information, so not all the fields are going to show up directly in the search interface. Some (like the element counts) can simply be left out of what we present.
Your calculation for the spatial_resolution is far better than mine. I was mostly going by the FITS standard, but I knew there would be cases where those simple calculations were inappropriate. The axis ratio is a good idea too, I'll add it.
When it comes to a specific image, isn't the image size the same as the field of view? I know there is far more flexibility in where you can image from an observation, but this is the specific result of an imaging process. Wouldn't the wider field of view be more appropriately tied to the execution block?
Claire Chandler
The field of view question is more one of confusing terminology, because for interferometers it can mean something different (indeed, for optical/IR, the field of view of an instrument can be different from the image one chooses to archive from it). If others don't think it could be confusing then it's OK to keep, but image_size is maybe clearer.
James Sheckard
Ah, I was misinterpreting what you were saying. I see your point, but I see a similar potential issue with image_size (as in kilobytes, or degrees?). How about image_extent?
Claire Chandler
Perhaps "image_field_of_view" would be sufficient to clarify.
Claire Chandler
Some comments on the temporal information:
starttime probably maps to t_min rather than t_max, and vice versa for endtime.
exposure_time is a difficult item to quantify, it isn't something we carry along with the data, and will be a function of flagging applied to the input visibility data. It may be possible to come up with something to go in this slot but it will take some work, so "null" should be an option for right now.
James Sheckard
Fixed the mix up in starttime/endtime and t_min/t_max.
I agree with the issue of the exposure time. I mostly put it on there to provoke discussion. I suspect the endpoints of the observations will do for most purposes.
Claire Chandler
Some comments on spectral information:
min_frequency/max_frequency should be separated from observing_band, since there is frequency overlap in the receiver responses and, e.g., 39.5GHz can be covered by either the Ka-band or Q-band receiver at the VLA.
observing_band should have the option to include more than one receiver band name associated with an image.
We will need to add a center_frequency.
spectral_resolution should be the resolution in frequency, not resolving power as for optical/IR. We should be able to convert between them for VO compatibility, but for most of our users this means a channel width (in Hz/kHz/MHz, etc.). We should also include a rest_frequency, and be able to convert spectral_resolution and center_frequency to velocity given the rest_frequency, via the archive search interface.
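A sketch of that frequency-to-velocity conversion, assuming the radio velocity convention, v = c(ν0 - ν)/ν0. The thread does not say which convention the archive interface would use, so both that choice and the function names here are assumptions.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458


def radio_velocity_km_s(frequency_hz, rest_frequency_hz):
    """Velocity in km/s in the radio convention: v = c * (nu0 - nu) / nu0."""
    return SPEED_OF_LIGHT_KM_PER_S * (rest_frequency_hz - frequency_hz) / rest_frequency_hz


def channel_width_km_s(channel_width_hz, rest_frequency_hz):
    """Velocity width in km/s of a channel of the given frequency width.

    In the radio convention this is simply c * |delta_nu| / nu0.
    """
    return SPEED_OF_LIGHT_KM_PER_S * abs(channel_width_hz) / rest_frequency_hz
```

For a rest_frequency of 1 GHz, a 1 MHz channel corresponds to a velocity width of about 299.8 km/s.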
James Sheckard
The intent is for the min/max frequency values to take the place of the observing_band column in the initial table definition. As of now, there's no plan to carry the band identifiers, simply the frequency values. However, we should have a plan in place for where we will perform that mapping.
It seems to me that the closest we can get to a channel width is the bandwidth (CDELTn) information in the header. We could carry that, along with a center_frequency & rest_frequency. However, there is a problem with this assumption.
Multi-band images (or even single-band observations with discontinuous spectral windows) raise an interesting issue: the 'bandwidth' (CDELTn) provided by the FITS header would not relate to the actual span of frequencies summed for the image in most cases. That would certainly require some sort of caveat or explanation. CASA does provide spectral window numbers, but that does not ensure that the data span the entire reported bandwidth (unless I'm missing another piece of the CASA rosetta stone?).
Unfortunately, I don't have any resolution to this conundrum right now, let alone a good one. Thoughts?
Claire Chandler
I think that the bandwidth carried by CASA is the max-min frequency. It can't do much else, because if individual channels are flagged I don't think the effective bandwidth is something that is tracked by CASA.
Bryan Butler
You still may want the observing band if possible, because there are many frequencies which can be observed at two bands, and sometimes you may want to know which one.
Now, it's a moot point for VLASS, but if we're making a general framework, I think you want to carry band information where it's known.
Claire Chandler
Indeed. People will want to search on observing_band as well as min/max frequency.
James Sheckard
I have a question about how flexible the CASA pipeline is for imaging: I know you can subdivide a MS by polarization and spectral window. Can that be done by time as well? I was considering moving the start/end/exposure times into the image_products table, if they would be shared among images generated together.
The CASA imaging documentation doesn't seem to lend itself toward a time-series of images, but I could be missing something easily enough.
Claire Chandler
At the moment it is very inflexible, as far as the VLA is concerned. But that will change for SRDP, which Jeff Kern should comment on.
Stephan Witz
I think we should start a conversation about VLASS specific metadata, I believe we are at that point. I'm thinking epoch, I'm thinking type (quicklook, continuum, tapered, coarse cube, fine cube), cumulative. Maybe some notion of which cube a specific image layer belongs to.
I also think we should note any of the fields above that are specific to VLASS, or at least not generic, and abstract them.
Those are strawman ideas. Discuss.
Claire Chandler
None of the items are VLASS specific at the moment. The other things you note are currently all encoded in the product name for VLASS (which is good, because there's nothing in the metadata in the FITS files that provides this information).
Claire Chandler
Earlier this week I had some discussions with people from the CADC regarding the metadata for VLASS images. CADC has ingested our VLASS1.1 Quick Look images into their archive, but noted they are missing a value for the INSTRUMENT keyword. For other telescopes the format of this keyword is <receiver>-<backend>, the equivalent of which for VLASS would be 'S-WIDAR'. They have hardcoded 'S-WIDAR' into their metadata for now, and we should consider whether we want this in our metadata as well. Images from multiple receiver bands could be specified as, e.g., SC-WIDAR for composite S and C band data. At present, this keyword is not in FITS files produced by CASA, but we can investigate putting it in the pipeline weblog for future harvesting.