The Network Common Data Form, or netCDF, is an interface to a library of data access programs for storing and retrieving scientific data. NetCDF is an abstraction that supports a view of data as a collection of self-describing, network-transparent objects that can be accessed through a simple interface. Collections of named multidimensional variables can be randomly accessed, without knowing details of how the data are stored. Auxiliary information about the data, such as what units are used, can be stored with the data. Generic utilities and application programs can be written that access arbitrary netCDF files and transform, combine, analyze, or display specified fields of the data. The development of such applications may lead to improved accessibility of data and improved reusability of software for scientific data management, analysis, and display.
The netCDF software implements an abstract data type, which means that all operations to access and manipulate data in a netCDF file must use only the set of functions provided by the interface. The actual representation of the data is hidden from applications that use the interface, so that how the data are stored could be changed without affecting such programs. The physical representation of netCDF data is designed to be independent of the computer on which the data were written. Future changes to the netCDF interface will be compatible with the interface described here, so that neither existing netCDF files nor programs accessing them will require modification.
The netCDF interface is supported for both C and FORTRAN, and for UNIX, VMS, OS/2 and MSDOS operating systems. The netCDF software that implements this interface is freely available via FTP and free of licensing restrictions to encourage its wide use.
Why not use an existing database management system (DBMS) for storing scientific data? We looked at available database software, both commercial and research-oriented, and concluded that existing packages are currently inadequate for the kinds of scientific data access supported by the netCDF interface.
First, most existing DBMSs have poor support for multidimensional objects as the basic unit of data access. An alternative data model of comparable power and elegance to the relational model for databases is needed for scientific data. If such a model existed, programs that shared that model could be used in flexible combinations to support effective systems for building applications and visualizing data. Representing multidimensional arrays as relations makes some useful kinds of data access awkward and provides little support for the abstractions of multidimensional data and coordinate systems.
Related to this is a second problem with general-purpose database systems: their poor performance on large scientific data sets. Collections of satellite images, scientific model outputs, climate observations covering decades, high-resolution atmospheric profile data, and other large data sets are beyond the capabilities of most DBMSs to organize and index for efficient retrieval.
Finally, general-purpose database systems provide, at significant cost in terms of both resources and access performance, many facilities that are not needed in the analysis, management, and display of scientific data. For example, elaborate update facilities, concurrency control, audit trails, report writers, and mechanisms designed for transaction-processing are unnecessary for most scientific applications.
To achieve network transparency, netCDF is implemented on top of a layer of software for external data representation (XDR). XDR, developed by Sun Microsystems, Inc., is a nonproprietary standard for describing and encoding data. It supports encoding arbitrary C data structures into machine-independent sequences of bits. The encoding used for floating-point numbers is the Institute of Electrical and Electronics Engineers (IEEE) standard for normalized floating-point numbers. XDR has been implemented on a wide variety of computers, including Suns, VAXs, Apple Macintoshes, IBM RS 6000s, IBM PS/2s, IBM mainframes, and CRAYs. It assumes only that eight-bit bytes can be encoded and decoded in a consistent way.
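As a rough illustration of what a machine-independent encoding involves, the following C sketch stores 32-bit quantities most significant byte first, which is the byte order the XDR standard specifies for its four-byte units. It is not the XDR library's API: the function names encode_u32, decode_u32, and encode_f32 are hypothetical, and encode_f32 assumes the host already represents float in 32-bit IEEE format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value most significant byte first, whatever the
 * byte order of the host (hypothetical helper, not an XDR call). */
void encode_u32(uint32_t v, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)((v >> 24) & 0xFF);
    out[1] = (unsigned char)((v >> 16) & 0xFF);
    out[2] = (unsigned char)((v >> 8) & 0xFF);
    out[3] = (unsigned char)(v & 0xFF);
}

/* Rebuild the 32-bit value from its four encoded bytes. */
uint32_t decode_u32(const unsigned char in[4])
{
    assert(in != NULL);
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}

/* Encode a float by copying its bit pattern into an integer first.
 * Assumption: the host float is a 32-bit IEEE value. */
void encode_f32(float f, unsigned char out[4])
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    encode_u32(bits, out);
}
```

Because the encoder works with shifts rather than by copying memory directly, the same bytes are produced on big-endian and little-endian hosts. The per-value work it performs is a small-scale picture of the conversion overhead that any such layer adds.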
Translating data into and out of XDR form adds overhead to data transfers, but for many applications the extra CPU cycles used to convert data to and from a machine-independent representation are not significant. The amount of XDR overhead depends on many factors, including the data type, the type of computer, the granularity of data access, and how well the implementation has been tuned to the computer on which it is run. For a large set of applications, the overhead of the XDR layer is a reasonable price to pay for portable, network-transparent data access.
Often when an abstraction layer is added to hide the details of an underlying implementation, some computations that can be expressed simply in terms of the abstraction may be computationally expensive. Furthermore, without understanding something about the implementation, it may not be obvious which of several ways of expressing a computation through the abstract interface will make efficient use of computing resources. It is certainly possible to use the netCDF interface to access data in inefficient ways: for example, by requesting a slice of variable data that requires a single value out of each record. See the section NetCDF File Structure and Performance for a discussion of the performance characteristics of the implementation and of how to apply knowledge of the underlying implementation to use the interface effectively when performance is an important concern.
NetCDF can be used as an archive format for storing data, but it may take more space than a special-purpose archive format that exploits knowledge of particular characteristics of a set of data. Compression of data is possible with netCDF (e.g., using arrays of eight-bit bytes to encode low-resolution floating-point numbers instead of arrays of 32-bit numbers), but netCDF was not designed to achieve optimal compression of scientific data.
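The byte-encoding idea mentioned above can be sketched in C as follows. This is only an illustration under stated assumptions, not part of the netCDF interface: the function names pack_bytes and unpack_byte and the scale and offset parameters are hypothetical, and the caller is assumed to pick a scale and offset that cover the range of the data.

```c
#include <assert.h>
#include <stddef.h>

/* Pack floats into eight-bit bytes: byte = round((value - offset) / scale).
 * Values that fall outside the representable range are clamped to 0 or 255.
 * Hypothetical helper; scale and offset are chosen by the caller. */
void pack_bytes(const float *in, unsigned char *out, size_t n,
                float scale, float offset)
{
    size_t i;
    assert(scale > 0.0f);
    for (i = 0; i < n; i++) {
        float q = (in[i] - offset) / scale;
        if (q < 0.0f) q = 0.0f;
        if (q > 255.0f) q = 255.0f;
        out[i] = (unsigned char)(q + 0.5f);   /* round to nearest */
    }
}

/* Recover an approximate float from a packed byte. */
float unpack_byte(unsigned char b, float scale, float offset)
{
    return (float)b * scale + offset;
}
```

With a scale of 0.5 and an offset of -40, for example, each byte can represent values from -40 to 87.5 in steps of 0.5, using one quarter of the space of a 32-bit number. The cost is that any value is recovered only to within half a step.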
The advantages of a special-purpose archive format for small archives should be compared to the benefits of machine-independence and the ability to store ancillary data (data about the data) that the netCDF interface provides. For large archives, only two programs need to be provided for each archive format, one to translate archived data into netCDF form and the other to translate back to the archive format. Tools provided for manipulating netCDF data will then be available without sacrificing the advantages of the archive format and without requiring the wholesale conversion of large existing archives.
The development of the netCDF interface began with a modest goal related to Unidata's needs: to provide a common interface between Unidata applications and ingested real-time meteorological data. Since Unidata software was intended to run on multiple hardware platforms with access from both C and FORTRAN, achieving Unidata's goals had the potential for providing a package that was useful in a broader context. By making the package widely available and collaborating with other organizations with similar needs, we hoped to improve the current situation in which scientific software is only rarely reused by others in the same discipline and almost never reused between disciplines (Fulker, 1988).
Important concepts employed in the netCDF software originated in a paper (Treinish and Gough, 1987) that described data-access software developed at the NASA Goddard National Space Science Data Center (NSSDC). The interface provided by this software was called the Common Data Format (CDF). The NASA CDF was originally developed as a platform-specific FORTRAN library to support an abstraction for storing multidimensional scientific data.
The NASA CDF package had been used for many different kinds of data in an extensive collection of applications. It had the virtues of simplicity (only 13 subroutines), independence from storage format, generality, ability to support logical user views of data, and support for generic applications.
Unidata held a workshop on CDF in Boulder in August 1987. We proposed exploring the possibility of collaborating with NASA to extend the CDF FORTRAN interface, to define a C interface, and to permit the access of data aggregates with a single call, while maintaining compatibility with the existing NASA interface.
Independently, Dave Raymond at the New Mexico Institute of Mining and Technology had developed a package of C software for UNIX that supported self-describing scientific data along with a "pipes and filters" approach to processing, analyzing, and displaying scientific data. This package also used the "Common Data Format" name, later changed to C-Based Analysis and Display System (CANDIS). Unidata learned of Raymond's work (Raymond, 1988), and incorporated some of his ideas, such as the use of named dimensions and variables with differing shapes in a single data object, into the Unidata netCDF interface.
In early 1988, Glenn Davis of Unidata developed a prototype netCDF package in C that was layered on a nonproprietary external data representation standard (XDR) developed by Sun Microsystems. This prototype proved that a single-file, network-transparent implementation of the CDF interface could be achieved at acceptable cost and that the resulting programs could be implemented on both UNIX and VMS systems. However, it also demonstrated that providing a small, portable, and NASA CDF-compatible FORTRAN interface with the desired generality was not practical. (NASA's CDF and Unidata's netCDF have since evolved separately, but recent CDF versions share many characteristics with netCDF.)
In early 1988, Joe Fahle of SeaSpace, Inc. (a commercial software development firm in San Diego, California), a participant in the 1987 Unidata CDF workshop, independently developed a CDF package in C that extended the NASA CDF interface in several important ways (Fahle, 1989). Like Raymond's package, the SeaSpace CDF software permitted variables with unrelated shapes to be included in the same data object and permitted a general "hyperslab" form of access to multidimensional arrays. Fahle's implementation was used at SeaSpace as the intermediate form of storage for a variety of steps in their image-processing system.
After studying Fahle's interface, we concluded that it solved many of the problems we had identified in trying to stretch the NASA interface to our purposes. In August 1988, we convened a small workshop to agree on a Unidata netCDF interface, and to resolve remaining open issues. Attending were Joe Fahle of SeaSpace, Michael Gough of Apple (an author of the NASA CDF software), Angel Li of the University of Miami (who had implemented our prototype netCDF software on VMS and was a potential user), and Unidata systems development staff. Consensus was reached at the workshop after some further simplifications were discovered. A document incorporating the results of the workshop into a proposed Unidata netCDF interface specification was distributed widely for comments before Glenn Davis implemented the software it described. Comparison with other data-access interfaces and recent experience in using netCDF are discussed in (Rew and Davis, 1990a) and (Rew and Davis, 1990b).
In October 1991, we announced version 2.0 of the netCDF software distribution. Slight modifications to the C interface (declaring dimension sizes to be long rather than int) improved the usability of netCDF on inexpensive platforms such as MSDOS computers, without requiring recompilation on other platforms. This change to the interface required no changes to the associated file format.
This Guide documents the March 1993 release of netCDF 2.3, which preserves the same file format but adds a few new functions to the C and FORTRAN interfaces.
The suggested extension for netCDF files has been changed from .cdf to .nc, in order to avoid a clash with the NASA CDF file extension. Although the old extension is still supported in a backward compatible way, new netCDF files should use the new filename extension, where practical.
Subsampling along specified dimensions (using "strides") is one of the commonly-asked-for features supported by the new ncvarputg() and ncvargetg() interfaces for generalized hyperslab access. In addition, these interfaces permit accessing data that is not contiguous in memory. In a generalized hyperslab, an index mapping vector is used to define the mapping between points in the generalized hyperslab and the memory locations of the corresponding values. (See section Write a Generalized Hyperslab of Values.)
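The roles of the stride and index mapping vectors can be sketched with a small stand-alone C function. It does not call ncvargetg() and is not the library's implementation: the name get_generalized_slab is hypothetical, the source variable is simply a two-dimensional row-major array in memory, and the index mapping vector is assumed here to be expressed in array elements.

```c
#include <assert.h>
#include <stddef.h>

/* Copy a strided subsample of a 2-D row-major array into dst.
 * Point (a, b) of the generalized hyperslab is read from
 *   src[start[0] + a*stride[0]][start[1] + b*stride[1]]
 * and written to
 *   dst[a*imap[0] + b*imap[1]].
 * Hypothetical helper; imap is given in elements (an assumption). */
void get_generalized_slab(const int *src, long nrows, long ncols,
                          const long start[2], const long count[2],
                          const long stride[2], const long imap[2],
                          int *dst)
{
    long a, b;
    for (a = 0; a < count[0]; a++) {
        for (b = 0; b < count[1]; b++) {
            long row = start[0] + a * stride[0];
            long col = start[1] + b * stride[1];
            assert(row >= 0 && row < nrows && col >= 0 && col < ncols);
            dst[a * imap[0] + b * imap[1]] = src[row * ncols + col];
        }
    }
}
```

For a 4 by 6 source array, a start of {0, 0}, a count of {2, 3}, and a stride of {2, 2} select every other row and every other column. An index mapping vector of {1, 2} then places neighboring points along the first dimension next to each other in memory, so the 2 by 3 selection arrives in dst transposed, as a 3 by 2 array.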
There are also some new interfaces that can be used to write, read, and inquire about records, where a record may contain multiple variables of different types and shapes. Where before you had to access a record's worth of data using multiple calls, now you may instead use a single call. (See section Put a Record.)
New optimizations for the library have resulted in significant speedups for accessing cross-sections involving non-contiguous data. Many bugs have been fixed, portability has been improved, and the installation has been greatly simplified for most systems.
The ncdump utility now supports several new command-line options including the ability to specify for which variables data values will be output, to provide brief annotations in the form of CDL comments to identify data values for large multidimensional variables, or to provide full annotations in the form of trailing CDL comments for every data value.
We continue to use netCDF for Unidata system software and applications. We have begun to use netCDF interfaces for earth-referenced image-analysis software, as well as for the output of decoders for a wide variety of meteorological and oceanographic data.
We have recently released a preliminary set of netCDF operators that we hope will eventually be completed to provide an algebra of useful operations on generic scientific data stored in netCDF files. The complete set of operators includes selectors to extract subsets of variables or reduce the dimensionality of a netCDF file; combiners to merge netCDF files, combine variables, or increase dimensionality; graphics generators to read netCDF files and produce graphical output; mathematical operators; and specialized data converters to convert units, convert to or from standard archive forms, or to convert to a canonical form for comparison. By composing these fundamental operators, users would have a wide variety of capabilities.
We have submitted a proposal for resources to extend the netCDF library in an upward-compatible way to support netCDF servers on a network. With netCDF servers, clients could access cross-sections of data efficiently, as if they were stored in a local file. A netCDF server would also have the ability to support virtual netCDF objects that provide different views of large or remote data sets. Servers might also be capable of providing a transparent netCDF interface to non-netCDF archives.
Other desirable extensions that we would like to see added to netCDF include an interface for record access by key or coordinate value, transparent data packing by use of three new reserved attributes (_Nbits, _Scale, and _Offset), support for pointers to data cross-sections in other files (what Joe Fahle calls "assemblies"), better string support, and a well-tested C++ interface. Some of these extensions may be included in future releases.