
NetCDF File Structure and Performance

NetCDF is a data abstraction for scientific data access and a software library that provides a concrete implementation of the interfaces that support that abstraction. The implementation provides a machine-independent format for representing scientific data. Although the netCDF file format is completely hidden below the interfaces, some understanding of the implementation and associated file structure may help to make clear which netCDF operations are expensive and why.

A detailed description of the netCDF format is not appropriate for this User's Guide, however, and is not needed to read and write netCDF files or to understand efficiency issues. Programs that use only the documented interfaces and make no other assumptions about the format will continue to work even if the netCDF format changes in the future, because any such change will be made below the documented interfaces and will support earlier versions of netCDF data.

This chapter describes the structure of a netCDF file and some characteristics of the XDR layer that provides network transparency in enough detail to understand netCDF performance issues.

Parts of a NetCDF File

A netCDF data set is stored as a single file comprising three parts:

a header, containing all the information about dimensions, attributes, and variables except for the variable data;
fixed-size data, containing the data for variables that don't have an unlimited dimension; and
record data, containing the data records for variables that have an unlimited dimension.

All the data are represented in XDR form to make them machine-independent.

The descriptive header at the beginning of the netCDF file is an XDR encoding of a high-level data structure that represents information about the dimensions, variables, and attributes in the file. The variable descriptions in this header contain offsets to the beginning of each variable's data or the relative offset of a variable within a record. The descriptions also contain the dimension size and information needed to determine how to map multidimensional indices for each variable to the appropriate offsets.
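
The index-to-offset mapping described above can be illustrated with a minimal sketch. It assumes that a variable's values are stored contiguously with the last dimension varying fastest, as in the C interface, and that every value has the same external size. The function var_offset() and its parameters are hypothetical names used for illustration only; they are not part of the netCDF library.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the netCDF library: map a
 * multidimensional index to a byte offset for a variable whose
 * values are stored contiguously, last dimension varying fastest.
 *
 *   begin      byte offset of the variable's first value
 *   elem_size  external size in bytes of one value
 *   ndims      number of dimensions (0 for a scalar)
 *   shape[]    size of each dimension
 *   index[]    zero-based index along each dimension
 */
long var_offset(long begin, size_t elem_size, int ndims,
                const long shape[], const long index[])
{
    long linear = 0;
    int d;

    for (d = 0; d < ndims; d++) {
        assert(index[d] >= 0 && index[d] < shape[d]);
        /* Accumulate the linear position one dimension at a time. */
        linear = linear * shape[d] + index[d];
    }
    return begin + linear * (long) elem_size;
}
```

For a 3 x 4 variable of 4-byte values, element (1, 2) is the seventh value stored, so it lies 24 bytes past the variable's beginning.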

This header has no usable extra space; it is only as large as it needs to be for the dimensions, variables, and attributes in each netCDF file. This has the advantage that netCDF files are compact, requiring very little overhead to store the ancillary data that make the files self-describing. A potential disadvantage of this organization is that any operation on a netCDF file that requires expanding the header, for example adding a set of new dimensions and new variables to an existing netCDF file, will be as expensive as copying the file. This expense is incurred when ncendef() is called after a call to ncredef(). If you create all necessary dimensions, variables, and attributes before writing variable data, and avoid later additions and renamings of netCDF components that require more space in the header part of the file, you avoid the cost associated with expanding the header.

The fixed-size data part that follows the header contains all the variable data for variables that do not employ the unlimited (record) dimension. The data for each variable is stored contiguously in this part of the file. If there is no unlimited dimension, this is the last part of the netCDF file.

The record-data part that follows the fixed-size data consists of a variable number of records, each of which contains data for all the record variables. The record data for each variable is stored contiguously in each record.

The order in which the data in the fixed-size data part and in each record appears is the same as the order in which the variables were defined, in increasing numerical order by netCDF variable ID. This knowledge can sometimes be used to enhance data access performance, since the best data access is currently achieved by reading or writing the data in sequential order.
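
The record layout described above can be sketched as simple offset arithmetic. This is an illustration under stated assumptions, not the library's code: it takes the per-record size of each record variable as given, in order of variable ID, and ignores any padding or alignment the actual format may apply. The function names and parameters are hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical helper: relative offset of record variable `varid`
 * within one record, given the per-record size in bytes of each
 * record variable in order of definition (increasing variable ID).
 */
long offset_within_record(const long var_sizes[], int nvars, int varid)
{
    long off = 0;
    int v;

    assert(varid >= 0 && varid < nvars);
    for (v = 0; v < varid; v++)
        off += var_sizes[v];    /* earlier variables come first */
    return off;
}

/*
 * Hypothetical helper: byte offset in the file of a record
 * variable's data for record number `rec` (zero-based).
 *
 *   rec_begin   offset of the first record (end of fixed-size data)
 *   rec_size    size in bytes of one complete record
 *   var_in_rec  relative offset of the variable within a record
 */
long record_var_offset(long rec_begin, long rec_size,
                       long var_in_rec, long rec)
{
    assert(rec >= 0 && var_in_rec >= 0 && var_in_rec < rec_size);
    return rec_begin + rec * rec_size + var_in_rec;
}
```

Because consecutive records of one variable are a full record apart, reading a single record variable across many records touches non-adjacent parts of the file, whereas reading all the variables of one record in definition order is sequential.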

The XDR Layer

XDR is a standard for describing and encoding data and a library of functions for external data representation, allowing programmers to encode data structures in a machine-independent way. NetCDF employs XDR for representing all data, in both the header part and the data parts. XDR is used to write portable data that can be read on any other machine for which the XDR library has been implemented.

Many vendors provide an XDR library along with other C run-time libraries. The netCDF software distribution also includes Sun's portable implementation of XDR for platforms that don't already have a vendor-supplied XDR library.

An I/O layer implemented much like the C standard I/O (stdio) library is used by the XDR layer to read and write XDR-encoded data to netCDF files. Hence an understanding of the standard I/O library provides answers to most questions about multiple processes accessing data concurrently, the use of I/O buffers, and the costs of opening and closing netCDF files. In particular, it is possible to have one process writing a netCDF file while other processes read it. Data reads and writes are no more atomic than calls to stdio fread() and fwrite(). An ncsync() call (NCSNC() for FORTRAN) is analogous to the fflush() call in the standard I/O library, writing unwritten buffered data so other processes can read it; ncsync() also brings header changes up-to-date (e.g., changes to attribute values).
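
The buffering behavior that ncsync() addresses can be demonstrated with stdio alone. The following sketch does not use the netCDF library; it only shows the fflush() analogy drawn above. It writes a few values through a fully buffered stream and counts how many a second stream on the same file can read before and after the flush. The function names and the use of a scratch file are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Count how many doubles can currently be read from the file. */
static long count_doubles(const char *path)
{
    FILE *r = fopen(path, "rb");
    double tmp;
    long n = 0;

    if (r == NULL)
        return -1;
    while (fread(&tmp, sizeof tmp, 1, r) == 1)
        n++;
    fclose(r);
    return n;
}

/*
 * Write n doubles to a scratch file through a fully buffered stream.
 * Stores in *before the number of values a reader can see before
 * fflush(), and returns the number visible after fflush().
 * Returns -1 on error.  The scratch file is removed afterwards.
 */
long readable_after_flush(const char *path, const double *vals,
                          size_t n, long *before)
{
    static char wbuf[BUFSIZ];   /* explicit buffer for the writer */
    FILE *w;
    long after;

    assert(path != NULL && vals != NULL && before != NULL);
    w = fopen(path, "wb");
    if (w == NULL)
        return -1;
    setvbuf(w, wbuf, _IOFBF, sizeof wbuf);   /* request full buffering */

    if (fwrite(vals, sizeof vals[0], n, w) != n) {
        fclose(w);
        remove(path);
        return -1;
    }
    *before = count_doubles(path);   /* data may still be in wbuf */
    fflush(w);                       /* the role ncsync() plays */
    after = count_doubles(path);

    fclose(w);
    remove(path);
    return after;
}
```

When the amount written is smaller than the buffer, the reader typically sees nothing until the writer flushes, which is why a writing process should call ncsync() before expecting other processes to read its most recent data.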

As in the stdio library, flushes are also performed when "seeks" occur to a different area of the file. Hence the order of read and write operations can influence I/O performance significantly. Reading data in the same order in which it was written within each record will minimize buffer flushes.

There is one unusual case where the situation is more complex: when a writer enters define mode to add some additional dimensions, variables, or attributes to an existing netCDF file that is also open for reading by other processes. In this case, when the writer leaves define mode, a new copy of the file is created with the new dimensions, attributes, or variables and the old data, but readers that still have the file open will not see the changes. You should not expect netCDF data access to work with multiple writers having the same file open for writing simultaneously.

For VMS systems, the performance penalty for permitting shared access (under the current implementation of stdio in the C run-time library) seemed too great to make shared access the default, so netCDF files on VMS are opened non-shared. This still permits multiple simultaneous readers of the same file, but one writer prevents any readers from accessing the file. Implementors can easily allow shared access for a VMS implementation, if shared access is a more important requirement than access speed.

It is possible to tune an implementation of netCDF for some platforms by replacing the I/O layer beneath XDR with a different platform-specific I/O layer. This may change the similarities between netCDF and standard I/O, and hence characteristics related to data sharing, buffering, and the cost of I/O operations.

The cost of using a canonical representation for data like XDR varies according to the type of data and whether the XDR form is the same as the machine's native form for that type. XDR is especially efficient for byte, character, and short integer data.

For some data types on some machines, the time required to convert data to and from XDR form can be significant. The best case is byte arrays, for which very little conversion expense occurs, since the XDR library has built-in support for them. The netCDF implementation includes similar support added to XDR for arrays of short (16-bit) integers. The worst case is reading or writing large arrays of floating-point data on a machine that does not use IEEE floating-point as its native representation. The XDR library incurs the expense of a function call for each floating-point quantity accessed. On some architectures the cost of a function invocation for each floating-point number can dominate the cost of netCDF access to floating-point fields.
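
To make the per-value conversion cost concrete, here is a minimal sketch of encoding floating-point values into the XDR form, which for single-precision values is 4 bytes of IEEE format in big-endian order. It is not the XDR library's code: the function names are hypothetical, and the sketch assumes the host already uses 32-bit IEEE floating point, so that only byte order needs adjusting. On a machine with a different native format, each call would also have to convert the sign, exponent, and fraction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: encode one float as 4 big-endian bytes.
 * Assumes the host float is a 32-bit IEEE 754 value.
 */
void encode_float_be(float x, unsigned char out[4])
{
    uint32_t bits;

    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&bits, &x, sizeof bits);     /* reinterpret without aliasing */
    out[0] = (unsigned char) (bits >> 24);
    out[1] = (unsigned char) (bits >> 16);
    out[2] = (unsigned char) (bits >> 8);
    out[3] = (unsigned char) bits;
}

/*
 * Encode an array with one function call per value, mirroring the
 * per-value overhead described above.  dst must hold 4 * n bytes.
 */
void encode_float_array_be(const float *src, size_t n, unsigned char *dst)
{
    size_t i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < n; i++)
        encode_float_be(src[i], dst + 4 * i);
}
```

A tuned port might replace the inner call with inline code or an unrolled loop, or skip conversion entirely where the native form already matches the external form.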

The distributed netCDF implementation is meant to be portable. Where higher data access performance is necessary, platform-specific ports are practical and desirable. Such ports may further optimize the implementation for better I/O performance, or unroll the loops in the XDR library to speed XDR conversion of long integer and floating-point arrays.
