% This is uvh5_memo.tex, a memo on the uvh5 format
% The minted package is used for syntax highlighting of code, and requires the
% pygments package to be installed. pdflatex also needs to be invoked with the
% -shell-escape option. To compile this document:
% $ pdflatex -shell-escape uvh5_memo.tex
% This should compile the document into uvh5_memo.pdf
\documentclass[11pt, oneside]{article}
\usepackage{geometry}
\geometry{letterpaper}
\usepackage{graphicx}
\usepackage[titletoc,toc,title]{appendix}
\usepackage{amssymb}
\usepackage{physics}
\usepackage{array}
\usepackage{makecell}
\usepackage{hyperref}
\hypersetup{
  colorlinks = true
}
\usepackage{cleveref}
\crefformat{footnote}{#2\footnotemark[#1]#3}
\usepackage{minted}

\title{Memo: UVH5 file format}
\author{Paul La Plante, and the pyuvdata team}
\date{November 28, 2018\\ Revised April 2, 2021\\ Revised July 14, 2022 }

\begin{document}
\maketitle
\tableofcontents

\section{Introduction}
\label{sec:intro}
This memo introduces a new HDF5\footnote{\url{https://www.hdfgroup.org/}}-based file format of a UVData object in \verb+pyuvdata+\footnote{\url{https://github.com/HERA-Team/pyuvdata}}, a python package that provides an interface to interferometric data. Here, we describe the required and optional elements and the structure of this file format, called \textit{UVH5}. Note that this file format is specifically designed to represent UVData objects. Other HDF5-based datasets for radio interferometers, such as katdal\footnote{\url{https://github.com/ska-sa/katdal}} or HDFITS\footnote{\url{https://github.com/telegraphic/fits2hdf}} \textit{are not compatible} with the standard as defined here. We refer the reader to the documentation of those other formats to find out more about them. We assume that the user has a working knowledge of HDF5 and the associated python bindings in the package \verb+h5py+\footnote{\url{https://www.h5py.org/}}, as well as UVData objects in pyuvdata.
For more information about HDF5, please visit \url{https://portal.hdfgroup.org/display/HDF5/HDF5}. For more information about the parameters present in a UVData object, please visit \url{http://pyuvdata.readthedocs.io/en/latest/uvdata_parameters.html}. An example for how to interact with UVData objects in pyuvdata is available at \url{http://pyuvdata.readthedocs.io/en/latest/tutorial.html}. Note that throughout the documentation, we assume a row-major convention (i.e., C-ordering) for the dimension specification of multi-dimensional arrays. For example, for a two-dimensional array with shape ($N$, $M$), the $M$-dimension is varying fastest, and is contiguous in memory. This convention is the same as Python and the underlying C-based HDF5 library. Users of languages with the opposite column-major convention (i.e., Fortran-ordering, seen also in MATLAB and Julia) must transpose these axes. \section{Overview} \label{sec:overview} A UVH5 object contains the interferometric data from a radio telescope, as well as the associated metadata necessary to interpret it. A UVH5 file contains two primary HDF5 groups: the \verb+Header+ group, which contains the metadata, and the \verb+Data+ group, which contains the data itself, the flags, and information about the number of samples corresponding to the data. Datasets in the \verb+Data+ group are also typically passed through HDF5's compression pipeline, to reduce the amount of on-disk space required to store the data. However, because HDF5 is aware of any compression applied to a dataset, there is little that the user has to explicitly do when reading data. For users interested in creating new files, the use of compression is not strictly required by the UVH5 format, again because the HDF5 file is self-documenting in this regard. However, be warned that most UVH5 files ``in the wild'' typically feature compression of datasets in the \verb+Data+ group. 
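
As a rough illustration of the layout described above, the following Python sketch uses \verb+h5py+ to write a toy file containing a \verb+Header+ and a \verb+Data+ group, and then reads it back. The file name, the array sizes, and the choice of the LZF filter are assumptions made for this example only; a valid UVH5 file carries the full set of datasets described in the rest of this memo.

\begin{minted}{python}
import os
import tempfile

import h5py
import numpy as np

# Toy dimensions, chosen only for illustration.
nblts, nfreqs, npols = 6, 4, 2

# Hypothetical file name inside a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "toy.uvh5")

with h5py.File(path, "w") as f:
    header = f.create_group("Header")
    header["Nblts"] = nblts
    header["Nfreqs"] = nfreqs
    header["Npols"] = npols
    data = f.create_group("Data")
    # Compression is optional; HDF5 records which filter was applied.
    data.create_dataset(
        "visdata",
        data=np.ones((nblts, nfreqs, npols), dtype=np.complex64),
        compression="lzf",
    )

with h5py.File(path, "r") as f:
    groups = sorted(f.keys())
    shape = f["Data/visdata"].shape
    compression = f["Data/visdata"].compression
    vis = f["Data/visdata"][()]  # decompressed transparently on read

print(groups, shape, compression)
\end{minted}

The reading half of the sketch never refers to the filter by name, which is the sense in which compression requires little explicit handling by the user.
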
In the discussion below, we describe the required and optional datasets in the various groups. We note in parentheses the corresponding attribute of a UVData object. Note that in nearly all cases, the names are coincident, to make things as transparent as possible to the user.

\section{Header}
\label{sec:header}
The \verb+Header+ group of the file contains the metadata necessary to interpret the data. We begin with the required parameters, then continue to optional ones. Unless otherwise noted, all datasets are scalars (i.e., not arrays). The precision of the data type is also not specified as part of the format, because in general the user is free to set it according to the desired use case (and HDF5 records the precision and endianness when generating datasets). When using the standard \verb+h5py+-based implementation in pyuvdata, this typically results in 32-bit integers and double precision floating point numbers. Each entry in the list contains \textbf{(1)} the exact name of the dataset in the HDF5 file, in boldface, \textbf{(2)} the expected datatype of the dataset, in italics, \textbf{(3)} a brief description of the data, and \textbf{(4)} the name of the corresponding attribute on a UVData object. Note that unlike in other formats, names of HDF5 datasets can be quite long, and so in most cases the name of the dataset corresponds to the name of the UVData attribute.

Note that string datatypes should be handled with care. See Appendix~\ref{appendix:strings} for how to define them appropriately for interoperability between different HDF5 implementations.

\subsection{Required Parameters}
\label{sec:req_params}
\begin{itemize}
\item \textbf{latitude}: \textit{float} The latitude of the telescope site, in degrees. (\textit{latitude})
\item \textbf{longitude}: \textit{float} The longitude of the telescope site, in degrees. (\textit{longitude})
\item \textbf{altitude}: \textit{float} The altitude of the telescope site, in meters.
(\textit{altitude})
\item \textbf{telescope\_name}: \textit{string} The name of the telescope used to take the data. The value is used to check that metadata is self-consistent for known telescopes in pyuvdata. (\textit{telescope\_name})
\item \textbf{instrument}: \textit{string} The name of the instrument, typically the telescope name. (\textit{instrument})
\item \textbf{history}: \textit{string} The history of the data file. (\textit{history})
\item \textbf{Nants\_data}: \textit{int} The number of antennas that data in the file corresponds to. May be smaller than the number of antennas in the array. (\textit{Nants\_data})
\item \textbf{Nants\_telescope}: \textit{int} The number of antennas in the array. May be larger than the number of antennas with data corresponding to them. (\textit{Nants\_telescope})
\item \textbf{ant\_1\_array}: \textit{int} An array of the first antenna numbers corresponding to baselines present in the data. All entries in this array must exist in the antenna\_numbers array. This is a one-dimensional array of size Nblts. (\textit{ant\_1\_array})
\item \textbf{ant\_2\_array}: \textit{int} An array of the second antenna numbers corresponding to baselines present in the data. All entries in this array must exist in the antenna\_numbers array. This is a one-dimensional array of size Nblts. (\textit{ant\_2\_array})
\item \textbf{antenna\_numbers}: \textit{int} An array of the numbers of the antennas present in the radio telescope (note that these are not indices: they do not need to start at zero or be contiguous). This is a one-dimensional array of size Nants\_telescope. Note there must be one entry for every unique antenna in ant\_1\_array and ant\_2\_array, but there may be additional entries. (\textit{antenna\_numbers})
\item \textbf{antenna\_names}: \textit{string} An array of the names of antennas present in the radio telescope. This is a one-dimensional array of size Nants\_telescope.
Note there must be one entry for every unique antenna in ant\_1\_array and ant\_2\_array, but there may be additional entries. (\textit{antenna\_names})
\item \textbf{Nbls}: \textit{int} The number of baselines present in the data. For full cross-correlation data (including auto-correlations), this should be Nants\_data$\times$(Nants\_data+1)/2. (\textit{Nbls})
\item \textbf{Nblts}: \textit{int} The number of baseline-times (i.e., the number of spectra) present in the data. Note that this value need not be equal to Nbls $\times$ Ntimes. (\textit{Nblts})
\item \textbf{Nspws}: \textit{int} The number of spectral windows present in the data. (\textit{Nspws})
\item \textbf{Nfreqs}: \textit{int} The total number of frequency channels in the data across all spectral windows. (\textit{Nfreqs})
\item \textbf{Npols}: \textit{int} The number of polarization products in the data. (\textit{Npols})
\item \textbf{Ntimes}: \textit{int} The number of time samples present in the data. (\textit{Ntimes})
\item \textbf{uvw\_array}: \textit{float} An array of the uvw-coordinates corresponding to each observation in the data. This is a two-dimensional array of size (Nblts, 3). Units are in meters. (\textit{uvw\_array})
\item \textbf{time\_array}: \textit{float} An array of the Julian Date corresponding to the temporal midpoint of the corresponding baseline's integration. This is a one-dimensional array of size Nblts. (\textit{time\_array})
\item \textbf{integration\_time}: \textit{float} An array of the duration in seconds of each integration. This is a one-dimensional array of size Nblts. (\textit{integration\_time})
\item \textbf{freq\_array}: \textit{float} An array of all the frequencies (for all spectral windows) stored in the file in Hertz. This is a one-dimensional array of size (Nfreqs). (\textit{freq\_array})
\item \textbf{channel\_width}: \textit{float} An array of the widths of the frequency channels in the file in Hertz. This is a one-dimensional array of size (Nfreqs).
(\textit{channel\_width}) \item \textbf{spw\_array}: \textit{int} An array of the spectral windows in the file. This is a one-dimensional array of size Nspws. (\textit{spw\_array}) \item \textbf{flex\_spw}: \textit{python bool}\footnote{Note that this is \textit{not} the same as the \texttt{H5T\_NATIVE\_HBOOL} type; instead, it is an \texttt{H5Tenum} type, with an explicit \texttt{TRUE} and \texttt{FALSE} value. Such a type is created automatically when using \texttt{h5py}, both for Python \texttt{bool} and numpy \texttt{np.bool\_} types. See Appendix~\ref{appendix:boolean} for an example of how to define this in C. Such a definition should follow analogously in other languages.} Whether the data are saved using flexible spectral windows. If more than one spectral window is present in the data, this must be \texttt{True}. See Sec.~\ref{sec:flex_spw} for a discussion of the details. (\textit{flex\_spw}) \item \textbf{polarization\_array}: \textit{int} An array of the polarizations contained in the file. This is a one-dimensional array of size Npols. Note that the polarizations should be stored as an integer, and use the convention defined in AIPS Memo 117. (\textit{polarization\_array}) \item \textbf{antenna\_positions}: \textit{float} An array of the antenna coordinates relative to the reference position of the radio telescope array, which is implicitly defined by the \textit{latitude}, \textit{longitude}, and \textit{altitude} (LLA) parameters. More explicitly, these are the ECEF coordinates of individual antennas minus the ECEF coordinates of the reference telescope position, such that the telescope position plus the values stored in \textit{antenna\_positions} equals the position of individual elements in ECEF. The conversion between LLA and ECEF is given by WGS84. This is a two-dimensional array of size (Nants\_telescope, 3). 
(\textit{antenna\_positions}) \item \textbf{phase\_center\_catalog}: A series of nested datasets, similar to a dict in python (\textit{phase\_center\_catalog}). The top level keys are integers giving the phase center catalog IDs which are used to identify which baseline-times are phased to which phase center via the \textit{phase\_center\_id\_array}. The next level keys must include: \begin{itemize} \item \textbf{cat\_name}: \textit{string} The phase center catalog name. This does not have to be unique, non-unique values can be used to indicate sets of phase centers that make up a mosaic observation. \item \textbf{cat\_type}: \textit{string} One of four allowed values: \textbf{(1)} sidereal, \textbf{(2)} ephem, \textbf{(3)} driftscan, \textbf{(4)} unprojected. Sidereal means a phase center that is fixed in RA and Dec in a given celestial frame. Ephem means a phase center that has an RA and Dec that moves with time. Driftscan means a phase center with a fixed azimuth and elevation (note that this includes w-projection, even at zenith). Unprojected means no phasing, including w-projection, has been applied. \item \textbf{cat\_lon}: \textit{float} The longitudinal coordinate of the phase center, either a single value or a one dimensional array of length Npts (the number of ephemeris data points) for ephem type phase centers. This is commonly RA, but can also be galactic longitude. It is azimuth for driftscan phase centers. \item \textbf{cat\_lat}: \textit{float} The latitudinal coordinate of the phase center, either a single value or a one dimensional array of length Npts (the number of ephemeris data points) for ephem type phase centers. This is commonly Dec, but can also be galactic latitude. It is elevation (altitude) for driftscan phase centers. \item \textbf{cat\_frame}: \textit{string} The coordinate frame that the phase center coordinates are defined in. It must be an astropy supported frame (e.g. fk4, fk5, icrs, gcrs, cirs, galactic). 
\end{itemize}
And may include:
\begin{itemize}
\item \textbf{cat\_epoch}: \textit{float} The epoch in years for the phase center coordinate. For most frames this is the Julian epoch (e.g. 2000.0 for j2000) but for the FK4 frame this will be treated as the Bessel-Newcomb epoch (e.g. 1950.0 for B1950). This parameter is not used for frames without an epoch (e.g. ICRS) unless there is proper motion (specified in the cat\_pm\_ra and cat\_pm\_dec keys).
\item \textbf{cat\_times}: \textit{float} Time in Julian Date for ephemeris points, a one dimensional array of length Npts (the number of ephemeris data points). Only used for ephem type phase centers.
\item \textbf{cat\_pm\_ra}: \textit{float} (sidereal only) Proper motion in RA in milliarcseconds per year for the source.
\item \textbf{cat\_pm\_dec}: \textit{float} (sidereal only) Proper motion in Dec in milliarcseconds per year for the source.
\item \textbf{cat\_dist}: \textit{float} Distance to the source in parsec (useful if parallax is important), either a single value or a one dimensional array of length Npts (the number of ephemeris data points) for ephem type phase centers.
\item \textbf{cat\_vrad}: \textit{float} Radial velocity of the source in km/sec, either a single value or a one dimensional array of length Npts (the number of ephemeris data points) for ephem type phase centers.
\item \textbf{info\_source}: \textit{string} Information about provenance of the source details. Typically this is set either to ``file'' if it originates from a file read operation, or to ``user'' if it was added because of a call to the \verb+phase()+ method in \verb+pyuvdata+. But it can also be set to contain more detailed information.
\end{itemize}
\item \textbf{phase\_center\_id\_array}: \textit{int} A one dimensional array of length Nblts containing the cat\_id from the phase\_center\_catalog that each baseline-time is phased to.
(\textit{phase\_center\_id\_array})
\item \textbf{phase\_center\_app\_ra}: \textit{float} Apparent right ascension of the phase center in the topocentric frame of the observatory, in radians. This is a one-dimensional array of size Nblts. In the event that there are multiple phase centers, the phase\_center\_id\_array can be used to identify which phase center is used for this calculation. For unprojected phase types, this is just the apparent LST (LAST). (\textit{phase\_center\_app\_ra})
\item \textbf{phase\_center\_app\_dec}: \textit{float} Apparent declination of the phase center in the topocentric frame of the observatory, in radians. This is a one-dimensional array of size Nblts. In the event that there are multiple phase centers, the phase\_center\_id\_array can be used to identify which phase center is used for this calculation. For unprojected phase types, this is just the telescope latitude. (\textit{phase\_center\_app\_dec})
\item \textbf{phase\_center\_frame\_pa}: \textit{float} Position angle between the hour circle (which is a great circle that goes through the target position and both poles) in the apparent/topocentric frame, and the frame given in the \textit{phase\_center\_catalog} under the \textit{cat\_frame} dataset. This is a one dimensional array of length Nblts. In the event that there are multiple phase centers with different frames, the phase\_center\_id\_array can be used to identify which frame is used for each baseline-time in this calculation. This is set to zero for unprojected phase types. (\textit{phase\_center\_frame\_pa})
\item \textbf{version}: \textit{string} The version of the UVH5 file. The latest version (and the one described in this memo) is Version 1.1. Note it should be a string, such as \verb+`1.1'+. See Sec.~\ref{sec:version_history} for the version history of the UVH5 specification.
(\textit{version})
\end{itemize}

\subsection{Optional Parameters}
\label{sec:opt_params}
\begin{itemize}
\item \textbf{flex\_spw\_id\_array}: \textit{int} The mapping of individual channels along the frequency axis to individual spectral windows, as listed in the \textit{spw\_array}. This is a one-dimensional array of size (Nfreqs). Note this is \textbf{required} if the file uses flexible spectral windows (see Sec.~\ref{sec:flex_spw}). (\textit{flex\_spw\_id\_array})
\item \textbf{dut1}: \textit{float} The difference between UT1 (defined with respect to the Earth's angle of rotation, which includes whole and partial ``leap seconds'') and UTC (which \emph{only} includes whole leap seconds), in seconds, with typical precision of 1 ms. AIPS Memo 117 calls it \verb+UT1UTC+. Note that this is slightly different from the value DUT1 which is broadcast by various time signal services (e.g., NIST), which supply this difference only to a precision of 0.1 seconds. (\textit{dut1})
\item \textbf{earth\_omega}: \textit{float} Earth's rotation rate in degrees per day. Note the difference in units, which is inherited from the way this quantity is handled in UVFITS datasets (AIPS Memo 117 calls it \verb+DEGPDY+). (\textit{earth\_omega})
\item \textbf{gst0}: \textit{float} Greenwich sidereal time at midnight on reference date, in degrees. AIPS Memo 117 calls it \verb+GSTIA0+. (\textit{gst0})
\item \textbf{rdate}: \textit{string} Date for which GST0 (or whichever time saved in that field) applies. Note this is different from how UVFITS handles this quantity, which is saved as a float rather than a string. The user is encouraged to ensure it is being handled self-consistently for their desired application. (\textit{rdate})
\item \textbf{timesys}: \textit{string} Time system. pyuvdata currently only supports UTC. (\textit{timesys})
\item \textbf{x\_orientation}: \textit{string} The orientation of the x-arm of a dipole antenna. It is assumed to be the same for all antennas in the dataset.
For instance, ``East'' or ``North'' may be used. (\textit{x\_orientation})
\item \textbf{antenna\_diameters}: \textit{float} An array of the diameters of the antennas in meters. This is a one-dimensional array of size (Nants\_telescope). (\textit{antenna\_diameters})
\item \textbf{uvplane\_reference\_time}: \textit{int} The time at which the phase center is normal to the chosen UV plane for phasing. Used for interoperability with the FHD package\footnote{\url{https://github.com/EoRImaging/FHD}}. (\textit{uvplane\_reference\_time})
\item \textbf{lst\_array}: \textit{float} An array corresponding to the local sidereal time of the center of each observation in the data in units of radians. If it is not specified, it is calculated from the latitude/longitude and the time\_array. Saving it in the file can be useful for files with many values in the \textit{time\_array}, which would be expensive to recompute. (\textit{lst\_array})
\end{itemize}

\subsection{Extra Keywords}
\label{sec:extra_keywords}
UVData objects support ``extra keywords'', which are additional bits of arbitrary metadata useful to carry around with the data but which are not formally supported as a reserved keyword in the \verb+Header+. In a UVH5 file, extra keywords are handled by creating a datagroup called \verb+extra_keywords+ inside the \verb+Header+ datagroup. In a UVData object, extra keywords are expected to be scalars, but UVH5 makes no formal restriction on this. Also, when possible, these quantities should be HDF5 datatypes, to support interoperability between UVH5 readers. Inside of the extra\_keywords datagroup, each extra keyword is saved as a key-value pair using a dataset, where the name of the extra keyword is the name of the dataset and its corresponding value is saved in the dataset. Though HDF5 attributes can also be used to save additional metadata, this is not recommended, due to the lack of support inside of pyuvdata for ensuring the attributes are properly saved when writing out.
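
The key-value layout described above can be sketched with \verb+h5py+ as follows. The keyword names and values are invented for this example, and the string handling relies on \verb+h5py+ defaults rather than on the recommendations of Appendix~\ref{appendix:strings}, so this should be read as an illustration of the group structure only.

\begin{minted}{python}
import os
import tempfile

import h5py

# Hypothetical extra keywords; names and values are made up for this example.
extra_keywords = {"observer": "test-user", "n_flagged_ants": 3, "tau": 0.25}

path = os.path.join(tempfile.mkdtemp(), "extra_keywords_demo.uvh5")

with h5py.File(path, "w") as f:
    header = f.create_group("Header")
    extra = header.create_group("extra_keywords")
    for key, value in extra_keywords.items():
        # One scalar dataset per keyword: the dataset name is the keyword.
        extra[key] = value

with h5py.File(path, "r") as f:
    read_back = {}
    for key, dset in f["Header/extra_keywords"].items():
        value = dset[()]
        # Recent h5py versions return variable-length strings as bytes.
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        read_back[key] = value

print(read_back)
\end{minted}
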
\section{Data} \label{sec:data} In addition to the \verb+Header+ datagroup in the root namespace, there must be one called \verb+Data+. This datagroup saves the visibility data, flags, and number of samples corresponding to each entry. All three datasets must be present in a valid UVH5 file. They are also all expected to be the same shape: (Nblts, Nfreqs, Npols). Note that due to the intermixing of the baseline and time axes, it is \textit{not} required for data to exist for every baseline and time in the file. This behavior is similar to UVFITS and MIRIAD file formats. Also note that there is no explicit ordering required for the baseline-time axis. A common ordering is to write the data in ``correlator order'', and have all baselines for a single time $t_i$, followed by all baselines for the next time $t_{i+1}$, etc. However, this is merely a convention, and is not explicitly required for the UVH5 format. \subsection{Visdata Dataset} \label{sec:visdata} The visibility data is saved as a dataset named \verb+visdata+. It should be a 3-dimensional, complex-type dataset with shape (Nblts, Nfreqs, Npols). Most commonly this is saved as an 8-byte complex number (a 4-byte float for the real and imaginary parts), though some flexibility is possible. 16-byte complex floating point numbers (composed of two 8-byte floats), as well as 8-byte complex integers (two 4-byte signed integers), are also common. In all cases, a compound datatype is defined, with an \verb+`r'+ field and an \verb+`i'+ field, corresponding to the real and imaginary parts, respectively. The real and imaginary types must also be the same datatype. For instance, they should both be 8-byte floating point numbers, or 32-bit (4-byte) integers. Mixing datatypes between the real and imaginary parts is not allowed. 
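
The compound layout described above can be mimicked in \verb+numpy+ with a structured datatype. The sketch below is illustrative only: the helper function names are hypothetical, and it shows the in-memory layout for the complex-integer case (two 4-byte signed integers) rather than the step of writing the dataset to a file.

\begin{minted}{python}
import numpy as np

# Compound datatype with real ('r') and imaginary ('i') fields of the same type.
complex_int_dtype = np.dtype([("r", "<i4"), ("i", "<i4")])


def to_compound(vis):
    """Pack a complex array with integer-valued parts into the compound layout."""
    out = np.empty(vis.shape, dtype=complex_int_dtype)
    out["r"] = vis.real.astype(np.int32)
    out["i"] = vis.imag.astype(np.int32)
    return out


def to_complex(compound):
    """Unpack the compound layout into a standard complex array."""
    return compound["r"] + 1j * compound["i"]


# Toy array with shape (Nblts, Nfreqs, Npols) = (2, 3, 1).
vis = (np.arange(6) - 1j * np.arange(6)).reshape(2, 3, 1)
packed = to_compound(vis)
unpacked = to_complex(packed)
print(packed.dtype.itemsize, unpacked.dtype)
\end{minted}
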
Using \verb+h5py+, the datatype for \verb+visdata+ can be specified as \verb+`c8'+ (8-byte complex numbers, corresponding to the \verb+np.complex64+ datatype) or \verb+`c16'+ (16-byte complex numbers, corresponding to the \verb+np.complex128+ datatype) out-of-the-box, with no special handling by the user. \verb+h5py+ transparently handles the definition of the compound datatype. For examples of how to handle complex integer datatypes in \verb+h5py+, see Appendix~\ref{appendix:integers}.

\subsubsection{Conjugation Convention}
A cross-correlation between two antennas is defined by the baseline connecting them, and the conjugation of one of the input data streams. Accordingly, the \textit{uvw} coordinates and the conjugation of the visibility data are interconnected, based on the definition of one's coordinate system. For UVH5 files, it is assumed that the convention for the Radio Interferometer Measurement Equation (RIME) of a visibility $\mathcal{V}$ for antennas $i$ and $j$ is as follows \cite{tms}:
\begin{equation}
\mathcal{V}(u_j - u_i, v_j - v_i) = \int \dd{l} \dd{m} I(l, m) g_i(l, m) e^{-2\pi i (u_i l + v_i m)} g_j^*(l, m) e^{2\pi i (u_j l + v_j m)}.
\end{equation}
That is, the baseline vector defined by the $uvw$ coordinates is directed from antenna $i$ to antenna $j$ (so the baseline vector can be computed as $\vb{r_j} - \vb{r_i}$, where $\vb{r}$ is the position vector of a given antenna), and the data corresponding to antenna $j$ is conjugated. Note that if a file is generated with the opposite convention, it is usually sufficient to multiply the $uvw$ coordinates by $-1$ and conjugate the visibility data (the \verb+visdata+ dataset, which corresponds to the \verb+data_array+ attribute of a UVData object) to generate a self-consistent dataset.

\subsection{Flags Dataset}
\label{sec:flags}
The flags corresponding to the data are saved as a dataset named \verb+flags+. It is a 3-dimensional, boolean-type dataset with shape (Nblts, Nfreqs, Npols). Values of True correspond to instances of flagged data, and False is non-flagged.
Note that the boolean type of the data is \textit{not} the HDF5-provided \verb+H5T_NATIVE_HBOOL+, and instead is defined to conform to the \verb+h5py+ implementation of the numpy boolean type. When creating this dataset from \verb+h5py+, one can specify the datatype as \verb+np.bool_+. Behind the scenes, this defines an HDF5 enum datatype. See Appendix~\ref{appendix:boolean} for an example of how to write a compatible dataset from C.

As with the nsamples dataset discussed below, compression is typically applied to the flags dataset. The LZF filter (bundled with every installation of \verb+h5py+, and available as a separately compiled filter for other HDF5 installations) provides a good compromise between speed and compression, and is used in most HERA datasets. Note that HDF5 supports many other types of filters, such as ZLIB, SZIP, and BZIP2.\footnote{For more information, see \href{https://portal.hdfgroup.org/display/HDF5/Using+Compression+in+HDF5}{the documentation on using compression filters in HDF5}.} In the special case of single-valued arrays, the dataset occupies virtually no disk space.

\subsection{Nsamples Dataset}
\label{sec:nsamples}
The number of data points averaged into each data entry is saved as a dataset named \verb+nsamples+. It is a 3-dimensional, floating-point type dataset with shape (Nblts, Nfreqs, Npols). Note that it is \textit{not} required to be an integer, and should \textit{not} be saved as an integer type. The product of the integration\_time array and the data in the nsamples array reflects the total amount of time that went into a visibility. The best practice is for the nsamples dataset to track flagging within an integration time (leading to a decrease of the nsamples array value to be less than 1) and LST averaging (leading to an increase in the nsamples array value). Datasets that have not been LST averaged should have values in nsamples that are less than or equal to 1.
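
As a small numerical illustration of the relation above, the following sketch computes the total time contributing to each visibility from toy \verb+integration_time+ and \verb+nsamples+ arrays. All values are invented; the only assumption carried over from this memo is the shape of each array.

\begin{minted}{python}
import numpy as np

# Toy shapes: Nblts = 2, Nfreqs = 3, Npols = 1.
integration_time = np.array([10.0, 10.0])  # seconds, shape (Nblts,)
nsamples = np.ones((2, 3, 1))
nsamples[0, 1, 0] = 0.5  # half of this integration was flagged
nsamples[1, 2, 0] = 2.0  # e.g. two integrations were LST-averaged together

# Broadcast the per-baseline-time integration time over the frequency and
# polarization axes, then scale by nsamples.
total_time = integration_time[:, np.newaxis, np.newaxis] * nsamples
print(total_time[:, :, 0])
\end{minted}
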
Although this convention is not adhered to by all data formats serviced by \verb+pyuvdata+, it is recommended to follow it as closely as possible in UVH5 files. What \textit{should} be true is that the product of the integration\_time array and the nsamples array corresponds to the total amount of time included in a visibility.

\section{Version History}
\label{sec:version_history}
The UVH5 specification has been through several minor version updates, and in the interest of maximizing interoperability between different readers and writers external to \verb+pyuvdata+, it is useful to define a version history. This is not a strict semantic versioning scheme, but is instead intended to capture some of the important changes that the specification has gone through. Note that, as much as possible, \verb+pyuvdata+ intends to be fully compatible, and be able to read any valid UVH5 file written. Those interested in writing fully compatible readers/writers may look there for further details. It is strongly encouraged that independent UVH5 writers conform to the latest version (Version 1.1 at time of writing), while readers are encouraged to support backwards compatibility as much as possible. If readers cannot support all revisions, reading more recent versions should be prioritized.

\subsection{\texttt{version} dataset}
When present, the version information is stored in the Header as a string-based dataset with the key \verb+version+. Note that files have not always contained this dataset, but as much as possible, newly written files should contain it so that the version is explicit.

\subsection{Version 0.x/0.1}
Historically, UVH5 files written by \verb+pyuvdata+ and the HERA correlator did not include the \verb+version+ dataset as part of the header. Implicitly, these files are v0.x. More recently, \verb+pyuvdata+ has begun writing the version information to files, and so the \verb+version+ dataset is present in these files.
Below, we discuss some of the changes that occurred within the Version 0.1 generation, to make users aware of the different flavors of UVH5 files they may encounter ``in the wild.'' \subsubsection{\texttt{integration\_time} dataset} Initially, UVH5 files were written with a single value for \verb+integration_time+. It has since been modified to its current length of \verb+Nblts+ to allow for data with varying integration time between time samples or baselines. \subsubsection{Flexible Spectral Windows} \label{sec:flex_spw} A significant update to how the frequency axis was handled in UVData objects was implemented to allow for a more flexible handling of data from different spectral windows. Initially, following the method of handling multiple spectral windows in UVFITS files, the spectral window (\verb+spw+) axis was treated as a separate axis in metadata and data arrays. However, this approach is relatively inflexible, because it requires all spectral windows to have the same number of frequency channels to efficiently store the data (the alternatives being to use ragged-length arrays, which are inefficient for storing or accessing the data, or padded arrays which can contain a large amount of wasted storage to ensure arrays are regularly spaced). \begin{figure}[h!] \begin{center} \includegraphics[width=0.95\textwidth]{uvh5_diagram.pdf} \caption{A summary of the different combinations of the rank of data arrays (reflected by UVH5 version), and flexible spectral windows. The various data and metadata values and ranks are listed in detail in Table~\ref{table:spw}.} \label{fig:spws} \end{center} \end{figure} To overcome these limitations, taking inspiration from how frequency data are stored in MIRIAD, the idea of ``flexible spectral windows'' was adopted to save the frequency information. Analogously to how baselines and times are collapsed to a ``baseline-time axis'', frequencies and spectral windows are collapsed to a ``frequency-spectral window'' axis. 
This allows for more versatility in how data from different spectral windows are stored inside of a single file, but it requires the change of several important components of metadata. We summarize these changes here.
\begin{itemize}
\item The value for \verb+Nfreqs+ is the total number of frequency channels saved in the data across all spectral windows.
\item Where required, the number of spectral windows \verb+Nspws+ is required to be 1.
\item The \verb+channel_width+ dataset was changed from a single number to a 1-d array of length \verb+Nfreqs+.
\item The \verb+flex_spw+ dataset was added to identify whether the file in question supports flexible spectral windows (if \verb+True+) or not (if \verb+False+).
\item The \verb+flex_spw_id_array+ dataset was added to identify which spectral window a given channel belongs to. This is required if \texttt{flex\_spw} is \texttt{True}.
\end{itemize}
It is possible to save files self-consistently without using flexible spectral windows \textit{if and only if there is a single spectral window}. We outline the various (valid) combinations below in Sec.~\ref{sec:version_table}.

\subsection{Version 1.0}
\subsubsection{Rank-3 Array Convention}
\label{sec:rank3_arrays}
Version 1.0 of UVH5 represents a significant change in the way that the data arrays (\verb+visdata+, \verb+flags+, and \verb+nsamples+) and metadata arrays are stored. The previously vestigial spectral-window axis is removed, meaning that data arrays are rank-3 instead of rank-4. Explicitly, these arrays have shape (Nblts, Nfreqs, Npols), where Nfreqs includes the number of channels across all spectral windows. This also affects the \textit{freq\_array} dataset, which went from a rank-2 array to rank-1 of size (Nfreqs). The description of data and metadata in the body of this memo assumes this rank-3 convention.
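
For readers handling arrays written before Version 1.0, the following sketch shows one way to bring them into the rank-3 layout. It assumes that the legacy spectral-window axis is the second axis of the data arrays, with shape (Nblts, 1, Nfreqs, Npols), and the first axis of \textit{freq\_array}, with shape (1, Nfreqs). This axis placement is an assumption of the example and should be checked against the file at hand; the function name is hypothetical.

\begin{minted}{python}
import numpy as np


def drop_spw_axis(array, spw_axis):
    """Remove a length-1 spectral-window axis, raising if it is not length 1."""
    if array.shape[spw_axis] != 1:
        raise ValueError("expected a single spectral window along this axis")
    return np.squeeze(array, axis=spw_axis)


# Toy legacy arrays: Nblts = 4, Nfreqs = 5, Npols = 2 (assumed axis order).
legacy_visdata = np.zeros((4, 1, 5, 2), dtype=np.complex64)
legacy_freq_array = np.linspace(100e6, 200e6, 5).reshape(1, 5)

visdata = drop_spw_axis(legacy_visdata, spw_axis=1)
freq_array = drop_spw_axis(legacy_freq_array, spw_axis=0)
print(visdata.shape, freq_array.shape)
\end{minted}

The explicit length check guards against silently discarding data from a file that really does contain more than one spectral window along that axis.
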
Although \verb+pyuvdata+ plans to indefinitely support files written with the previous convention (i.e., having an explicit spectral-window axis), UVH5 files should be written such that they conform to Version 1.0.

\subsection{Version 1.1}

Historically, only a single phase center was supported, and only the sidereal and unprojected (zenith drift without $w$-projection) phasing types were available. When multiple phase center phasing was added, along with support for more types of phase centers, the following parameters were added (described in Sec.~\ref{sec:req_params}):
\begin{itemize}
\item \textbf{phase\_center\_catalog}
\item \textbf{phase\_center\_id\_array}
\item \textbf{phase\_center\_app\_ra}
\item \textbf{phase\_center\_app\_dec}
\item \textbf{phase\_center\_frame\_pa}
\end{itemize}
and the following header items (found in versions earlier than 1.1) were removed:
\begin{itemize}
\item \textbf{phase\_center\_ra}: \textit{float} The right ascension of the phase center of the observation in radians. Required if phase\_type is ``phased''. (\textit{phase\_center\_ra})
\item \textbf{phase\_center\_dec}: \textit{float} The declination of the phase center of the observation in radians. Required if phase\_type is ``phased''. (\textit{phase\_center\_dec})
\item \textbf{phase\_center\_epoch}: \textit{float} The epoch year of the phase applied to the data (\textit{e.g.}, 2000.). Required if phase\_type is ``phased''. (\textit{phase\_center\_epoch})
\item \textbf{phase\_center\_frame}: \textit{string} The frame the data and uvw\_array are phased to. Options are ``gcrs'' and ``icrs'', with default ``icrs''. These frames are defined as \href{https://docs.astropy.org/en/stable/coordinates/index.html}{coordinate systems in astropy}. (\textit{phase\_center\_frame})
\item \textbf{object\_name}: \textit{string} The name of the object tracked by the telescope. For a drift-scan antenna, this is typically ``zenith''.
(\textit{object\_name})
\item \textbf{phase\_type}: \textit{string} The phase type of the observation. Should be ``phased'' or ``drift''. Note that ``drift'' in this context more accurately means ``unphased'', in that baselines are computed using ENU coordinates, without any $w$-projection. Any other value is treated as an unrecognized type. (\textit{phase\_type})
\end{itemize}

Prior to version 1.1, the new phase attributes were sometimes written to files along with the header items listed above. During this time, the \textbf{phase\_center\_catalog} was written as a python dict converted to a JSON-formatted string. This intermediate file format was undocumented and not widely used, but it is possible some files like this exist ``in the wild''.

\subsection{Table Summarizing Changes}
\label{sec:version_table}

In the interest of summarizing all of the historical changes in a single place, we outline below the changes that have occurred in the UVH5 specification. For each dataset, we note the current convention along with how it was saved previously.
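
As a small illustration of how a reader might bridge one of these historical differences, the sketch below promotes a legacy scalar \verb+integration_time+ to the current length-\verb+Nblts+ array. It uses only \verb+numpy+; the function name \verb+expand_integration_time+ and the toy values are hypothetical and are not part of \verb+pyuvdata+ or the UVH5 specification.

\begin{minted}{python}
import numpy as np


def expand_integration_time(value, nblts):
    """Return integration_time as a float array of shape (nblts,).

    Hypothetical helper: accepts either the legacy single float or the
    current per-baseline-time array.
    """
    arr = np.asarray(value, dtype=np.float64)
    if arr.ndim == 0:
        # legacy convention: one value assumed to apply to all baseline-times
        return np.full(nblts, float(arr))
    if arr.shape != (nblts,):
        raise ValueError("integration_time must be a scalar or have shape (Nblts,)")
    return arr


# toy values standing in for what would be read from the Header group
legacy = expand_integration_time(10.7, nblts=4)
current = expand_integration_time([10.7, 10.7, 5.35, 5.35], nblts=4)
\end{minted}
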
\begin{table}[t]
\begin{center}
\begin{tabular}{ m{10em} | m{10em} | m{10em} | m{4em} }
\textbf{Dataset} & \makecell[cl]{\textbf{Current}\\\textbf{Convention}} & \makecell[cl]{\textbf{Previous}\\\textbf{Convention}} & \textbf{Version Changed} \\\hline\hline
\texttt{Header/version} & String corresponding to version & Not present & v0.1 \\[1em] \hline
\makecell[tl]{\texttt{Header/}\\\texttt{integration\_time}} & Array of float, shape (Nblts) & Single float (assumed to apply to all baseline-times) & v0.1 \\[1.75em] \hline
\makecell[tl]{\texttt{Header/}\\\texttt{phase\_center\_catalog}} & Nested datasets, similar to a dict in python & Not present & v1.1 \\[1.75em] \hline
\makecell[tl]{\texttt{Header/}\\\texttt{phase\_center\_id\_array}} & Array of int, shape (Nblts) & Not present & v1.1 \\[1.75em] \hline
\makecell[tl]{\texttt{Header/}\\\texttt{phase\_center\_app\_ra}} & Array of float, shape (Nblts) & Not present & v1.1 \\[1.75em] \hline
\makecell[tl]{\texttt{Header/}\\\texttt{phase\_center\_app\_dec}} & Array of float, shape (Nblts) & Not present & v1.1 \\[1.75em] \hline
\makecell[tl]{\texttt{Header/}\\\texttt{phase\_center\_frame\_pa}} & Array of float, shape (Nblts) & Not present & v1.1 \\[1.75em]
\end{tabular}
\end{center}
\caption{A table summarizing changes that have occurred in the UVH5 specification.}
\label{table:history}
\end{table}

We also summarize the combination of data and metadata properties for the cases of: (A) rank-3 data arrays, flexible spectral windows; (B) rank-3 data arrays, no flexible spectral windows; (C) rank-4 data arrays, flexible spectral windows; (D) rank-4 data arrays, no flexible spectral windows. See Figure~\ref{fig:spws} for a visual representation. \textbf{Note that we include the following only as a reference!
We encourage UVH5 writers to conform as much as possible to the v1.0 specification (options A or B).} \begin{table}[t] \scriptsize \begin{center} \begin{tabular}{ m{8em} | m{10em} | m{10em} | m{10em} | m{10em}} \textbf{Dataset} & \textbf{Type A} & \textbf{Type B} & \textbf{Type C} & \textbf{Type D} \\ \hline\hline \texttt{Header/Nspws} & Number of spectral windows & 1 & Number of spectral windows & Number of spectral windows \\ \hline \texttt{Header/Nfreqs} & Number of frequencies across all spectral windows & Number of frequencies & Number of frequencies across all spectral windows & Number of frequencies \textit{per} spectral window \\ \hline \makecell[cl]{\texttt{Header/}\\\texttt{channel\_width}} & Shape (Nfreqs) & Shape (Nfreqs) & Shape (Nfreqs) & Scalar (assumed to apply to all frequencies) \\[1.75em] \hline \makecell[cl]{\texttt{Header/}\\\texttt{flex\_spw\_id\_array}} & Shape (Nfreqs) & Not present & Shape (Nfreqs) & Not present \\[1.75em] \hline \texttt{Header/flex\_spw} & \texttt{True} & \texttt{False} & \texttt{True} & \texttt{False} \textbf{OR} not present \\[1.75em] \hline \makecell[cl]{\texttt{Header/}\\\texttt{freq\_array}} & Shape (Nfreqs) & Shape (Nfreqs) & Shape (Nfreqs) & Shape (Nspws, Nfreqs) \\[1.75em] \hline \texttt{Data/visdata} & Shape (Nblts, Nfreqs, Npols) & Shape (Nblts, Nfreqs, Npols) & Shape (Nblts, 1, Nfreqs, Npols) & Shape (Nblts, Nspws, Nfreqs, Npols) \\[1.75em] \hline \texttt{Data/flags} & Shape (Nblts, Nfreqs, Npols) & Shape (Nblts, Nfreqs, Npols) & Shape (Nblts, 1, Nfreqs, Npols) & Shape (Nblts, Nspws, Nfreqs, Npols) \\[1.75em] \hline \texttt{Data/nsamples} & Shape (Nblts, Nfreqs, Npols) & Shape (Nblts, Nfreqs, Npols) & Shape (Nblts, 1, Nfreqs, Npols) & Shape (Nblts, Nspws, Nfreqs, Npols) \\ \end{tabular} \end{center} \caption{A table summarizing the different data and metadata values for different file types. Type A, B, C, and D refer to the combinations of data array rank and flexible spectral windows in Figure~\ref{fig:spws}. 
Note that UVH5 writers are strongly encouraged to write files compatible with Type A or B (i.e., UVH5 v1.0), whereas readers are encouraged to be as flexible as possible (within reason).}
\label{table:spw}
\end{table}

\begin{thebibliography}{9}
\bibitem{tms} A.~Richard~Thompson, James~M.~Moran, and George~W.~Swenson,~Jr., ``Interferometry and Synthesis in Radio Astronomy, 3rd Edition'', 2017.
\end{thebibliography}

\newpage
\begin{appendices}

\section{Strings in HDF5}
\label{appendix:strings}

String datatypes are finicky, and require special handling to ensure that they are compatible with the HDF5 bindings in various languages. This is especially true for files written from \verb+h5py+, which handles strings differently between python2 and python3. Though python2 has reached its end-of-life, UVH5 should be backwards compatible with older versions of \verb+h5py+ as much as possible. To support this, all string-type metadata in UVH5 files \textit{must} be fixed-length ASCII type. Not only does this allow for interoperability between different \verb+h5py+ versions, but it also ensures that strings can be round-tripped through other HDF5 bindings, such as those in C, MATLAB, IDL, Fortran\footnote{Strings in Fortran are not null-terminated, so these require special handling.}, etc. Note that the string should use one byte per character, and be null-terminated. This corresponds to the numpy \verb+S+ datatype in both python2 and python3.

When writing a string-like dataset from \verb+h5py+, scalar data should be written by casting a string to a \verb+numpy.string_+ object. Array data should be written as a \verb+S<N>+ dataset, where \verb+<N>+ represents the length of the strings to be saved. Upon reading, strings can be cast to bytes using the \verb+tobytes()+ method, at which point the data is \verb+str+-type (python2) or can be decoded as UTF-8 to become \verb+str+-type (python3).
Below is an example for how to read and write string scalar and array-type datasets using \verb+h5py+ in python2 and python3.

\subsection{Target String Type}

The following is the output of \texttt{h5dump} for a string-like dataset in a UVH5 file. UVH5 writers are strongly encouraged (though not required) to follow the same convention. Although something like UTF-8 is more flexible, restricting strings to ASCII allows for greater interoperability with other file formats such as MIRIAD and UVFITS.

\begin{verbatim}
$ h5dump -V
h5dump: Version 1.12.0
$ h5dump -d Header/history -A simulated_bda_file.uvh5
HDF5 "simulated_bda_file.uvh5" {
DATASET "Header/history" {
   DATATYPE  H5T_STRING {
      STRSIZE 1035;
      STRPAD H5T_STR_NULLPAD;
      CSET H5T_CSET_ASCII;
      CTYPE H5T_C_S1;
   }
   DATASPACE  SCALAR
}
}
\end{verbatim}

\subsection{Writing strings in python2}

\begin{minted}{python}
import numpy as np
import h5py

# open file and write string datasets
with h5py.File('test_file.uvh5', 'w') as f:
    header = f.create_group('Header')

    # scalar dataset
    header['scalar_string'] = np.string_('Hello world!')

    # array dataset
    str_array = np.array(['hello', 'world'])
    n_words = len(str_array)
    max_len_words = np.amax([len(n) for n in str_array])
    dtype = "S{:d}".format(max_len_words)
    header.create_dataset('array_string', (n_words,), dtype=dtype, data=str_array)

# read the data back in again
with h5py.File('test_file.uvh5', 'r') as f:
    header = f['Header']

    # read scalar dataset
    scalar_string = header['scalar_string'][()].tobytes()
    assert scalar_string == 'Hello world!'

    # read array dataset
    str_array_file = [n.tobytes() for n in header['array_string'][()]]
    assert np.all(str_array_file == str_array)
\end{minted}

\subsection{Writing strings in python3}

\begin{minted}{python}
import numpy as np
import h5py

# open file and write string datasets
with h5py.File('test_file.uvh5', 'w') as f:
    header = f.create_group('Header')

    # scalar dataset
    header['scalar_string'] = np.string_('Hello world!')

    # array dataset
    str_array = ['hello', 'world']
    header['array_string'] = np.string_(str_array)

# read the data back in again
with h5py.File('test_file.uvh5', 'r') as f:
    header = f['Header']

    # read scalar dataset
    scalar_string = header['scalar_string'][()].tobytes().decode('UTF-8')
    assert scalar_string == 'Hello world!'

    # read array dataset
    str_array_file = [n.tobytes().decode('UTF-8') for n in header['array_string'][()]]
    assert np.all(str_array_file == str_array)
\end{minted}

\section{Integer Datatype Support for Visibility Data}
\label{appendix:integers}

The HERA correlator writes datasets which have 32-bit integer real and imaginary components. Due to the self-describing nature of HDF5 datasets, this information is captured by the file format. Nevertheless, special handling must be used to interpret these datasets as complex numbers. The \verb+astype+ context manager in \verb+h5py+ is used to convert the datatype on the fly from integers to complex numbers. Below is an example of how to do this.
\begin{minted}{python} import numpy as np import h5py # define integer datatype int_dtype = np.dtype([('r', ' #define CPTR(VAR,CONST) ((VAR)=(CONST),&(VAR)) typedef enum { FALSE, TRUE } bool_t; int main() { bool_t val; static hid_t boolenumtype; hid_t file_id, dspace_id, flags_id; herr_t status; /* define enum type */ boolenumtype = H5Tcreate(H5T_ENUM, sizeof(bool_t)); H5Tenum_insert(boolenumtype, "FALSE", CPTR(val, FALSE )); H5Tenum_insert(boolenumtype, "TRUE" , CPTR(val, TRUE )); /* open a new file */ file_id = H5Fcreate("test_file.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT); /* define array dimensions */ int Nblts = 10; int Nfreqs = 16; int Npols = 4; hsize_t dims[3] = {Nblts, Nfreqs, Npols}; /* initialize data array with FALSE values */ bool_t data[Nblts][Nfreqs][Npols]; for (int i=0; i