npy array format
Written by Paul Bourke
November 2022
The npy data files are a straightforward way of storing NumPy data arrays. The format is complex enough to support a wide range of data types, and simple enough that it isn't difficult to read and write npy files using other programming languages. An npy file conveys information on the data type (short, int, float, etc.), the dimensions of the data, and (weirdly) whether it is arranged in "fortran" order or not. Npy files contain a header which includes a magic string (so one can check it is an npy data file), a version number (for example, version 2 supports larger data elements) and a "dictionary" consisting of a plain-text string giving the datatype and dimensions. Following the header is the data, running to the end of the file, with the stated dimensions and datatype.
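As a minimal sketch (the file name "example.npy" is arbitrary), the following Python snippet writes a small array with numpy.save() and prints the pieces of the header just described:

   import numpy as np

   np.save("example.npy", np.arange(6, dtype=np.int32).reshape(2, 3))

   with open("example.npy", "rb") as f:
       raw = f.read(128)

   print(raw[:6])           # b'\x93NUMPY' -- the magic string
   print(raw[6], raw[7])    # 1 0 -- major and minor version numbers
   print(raw[10:].split(b"\n")[0])  # the plain-text dictionary, e.g.
                                    # {'descr': '<i4', 'fortran_order': False, 'shape': (2, 3), }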
Abstract

We propose a standard binary file format (NPY) for persisting a single arbitrary NumPy array on disk. The format stores all of the shape and dtype information necessary to reconstruct the array correctly even on another machine with a different architecture. The format is designed to be as simple as possible while achieving its limited goals. The implementation is intended to be pure Python and distributed as part of the main numpy package.

Rationale

A lightweight, omnipresent system for saving NumPy arrays to disk
is a frequent need. Python in general has pickle [1] for saving most Python objects to disk. This often works well enough with NumPy arrays for many purposes, but it has a few drawbacks:

- Dumping or loading a pickle file requires the duplication of the data in memory. For large arrays, this can be a showstopper.
- The array data is not directly accessible through memory-mapping. Now that numpy has that capability, it has proved very useful for loading large amounts of data (or, more to the point, avoiding loading large amounts of data when you only need a small part).

Both of these problems can be addressed by dumping the raw bytes to disk using ndarray.tofile() and numpy.fromfile(). However, these have their own problems:

- The data which is written has no information about the shape or dtype of the array.
- It is incapable of handling object arrays.
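A minimal sketch of the difference (file names here are arbitrary): with tofile() the reader must already know the dtype and shape, whereas numpy.save() and numpy.load() recover both from the npy header.

   import numpy as np

   a = np.arange(12, dtype=np.float64).reshape(3, 4)

   a.tofile("raw.bin")               # raw bytes only: shape and dtype are lost
   b = np.fromfile("raw.bin")        # caller must already know the dtype...
   b = b.reshape(3, 4)               # ...and the shape

   np.save("with_header.npy", a)     # shape and dtype stored in the header
   c = np.load("with_header.npy")    # comes back as a (3, 4) float64 array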
The NPY file format is an evolutionary advance over these two approaches. Its design is mostly limited to solving the problems with pickles and tofile()/fromfile(). It does not intend to solve more complicated problems for which more complicated formats like HDF5 [2] are a better solution.

Use Cases

- Neville Newbie has just started to pick up Python and NumPy. He has not installed many packages yet, nor learned the standard library, but he has been playing with NumPy at the interactive prompt to do small tasks. He gets a result that he wants to save.
- Annie Analyst has been using large nested record arrays to represent her statistical data. She wants to convince her R-using colleague, David Doubter, that Python and NumPy are awesome by sending him her analysis code and data. She needs the data to load at interactive speeds. Since David does not usually use Python, needing to install large packages would turn him off.
- Simon Seismologist is developing new seismic processing tools. One of his algorithms requires large amounts of intermediate data to be written to disk. The data does not really fit into the industry-standard SEG-Y schema, but he already has a nice record-array dtype for using it internally.
- Polly Parallel wants to split up a computation on her multicore machine as simply as possible. Parts of the computation can be split up among different processes without any communication between processes; they just need to fill in the appropriate portion of a large array with their results. Having several child processes memory-mapping a common array is a good way to achieve this, as sketched below.
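A minimal sketch of Polly's pattern, assuming the shared file has been created beforehand: numpy.load() with mmap_mode memory-maps the data instead of reading it all, so each worker can fill in only its own slice.

   import numpy as np

   np.save("shared.npy", np.zeros((4, 1000)))   # created once, up front

   m = np.load("shared.npy", mmap_mode="r+")    # writable memory-map, no full load
   m[2, :] = 42.0                               # one worker fills in its portion
   m.flush()                                    # push the changes back to disk

In a real multi-process version, each child would receive the file name and the index range it is responsible for.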
Requirements

The format MUST be able to:

- Represent all NumPy arrays including nested record arrays and object arrays.
- Represent the data in its native binary form.
- Be contained in a single file.
- Support Fortran-contiguous arrays directly.
- Store all of the necessary information to reconstruct the array, including shape and dtype, on a machine of a different architecture. Both little-endian and big-endian arrays must be supported, and a file with little-endian numbers will yield a little-endian array on any machine reading the file. The types must be described in terms of their actual sizes. For example, if a machine with a 64-bit C “long int” writes out an array with “long ints”, a reading machine with 32-bit C “long ints” will yield an array with 64-bit integers.
- Be reverse engineered. Datasets often live longer than the programs that created them. A competent developer should be able to create a solution in his preferred programming language to read most NPY files that he has been given without much documentation.
- Allow memory-mapping of the data.
- Be read from a filelike stream object instead of an actual file. This allows the implementation to be tested easily and makes the system more flexible. NPY files can be stored in ZIP files and easily read from a ZipFile object.
- Store object arrays. Since general Python objects are complicated and can only be reliably serialized by pickle (if at all), many of the other requirements are waived for files containing object arrays. Files with object arrays do not have to be mmapable since that would be technically impossible. We cannot expect the pickle format to be reverse engineered without knowledge of pickle. However, one should at least be able to read and write object arrays with the same generic interface as other arrays.
- Be read and written using APIs provided in the numpy package itself without any other libraries. The implementation inside numpy may be in C if necessary.

The format explicitly does not need to:

- Support multiple arrays in a file. Since we require filelike
objects to be supported, one could use the API to build an ad
hoc format that supported multiple arrays. However, solving the
general problem and use cases is beyond the scope of the format
and the API for numpy.
- Fully handle arbitrary subclasses of numpy.ndarray. Subclasses will be accepted for writing, but only the array data will be written out. A regular numpy.ndarray object will be created upon reading the file. The API can be used to build a format for a particular subclass, but that is out of scope for the general NPY format.

Format Specification: Version 1.0

- The first 6 bytes are a magic string: exactly “\x93NUMPY”.
- The next 1 byte is an unsigned byte: the major version number of the file format, e.g. \x01.
- The next 1 byte is an unsigned byte: the minor version number of the file format, e.g. \x00. Note: the version of the file format is not tied to the version of the numpy package.
- The next 2 bytes form a little-endian unsigned short int: the length of the header data HEADER_LEN.
- The next HEADER_LEN bytes form the header data describing the array's format. It is an ASCII string which contains a Python literal expression of a dictionary. It is terminated by a newline ('\n') and padded with spaces ('\x20') to make the total length of the magic string + 4 + HEADER_LEN be evenly divisible by 16 for alignment purposes.

The dictionary contains three keys:

- “descr”: an object that can be passed as an argument to the numpy.dtype() constructor to create the array's dtype.
- “fortran_order”: whether the array data is Fortran-contiguous or not. Since Fortran-contiguous arrays are a common form of non-C-contiguity, we allow them to be written directly to disk for efficiency.
- “shape”: the shape of the array.

For repeatability and readability, this dictionary is formatted using pprint.pformat() so the keys are in alphabetic order.
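As a sketch of how little is needed to parse this, the following reads a version 1.0 header using only the Python standard library, with ast.literal_eval() handling the dictionary; it assumes the "example.npy" file from the earlier snippet.

   import ast
   import struct

   with open("example.npy", "rb") as f:
       assert f.read(6) == b"\x93NUMPY"                # magic string
       major, minor = f.read(1)[0], f.read(1)[0]       # version bytes
       (header_len,) = struct.unpack("<H", f.read(2))  # little-endian unsigned short
       header = ast.literal_eval(f.read(header_len).decode("ascii"))

   print(header["descr"], header["fortran_order"], header["shape"])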
Following the header comes the array data. If the dtype contains Python objects (i.e. dtype.hasobject is True), then the data is a Python pickle of the array. Otherwise the data is the contiguous (either C- or Fortran-, depending on fortran_order) bytes of the array. Consumers can figure out the number of bytes by multiplying the number of elements given by the shape (noting that shape=() means there is 1 element) by dtype.itemsize.
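Continuing the sketch, a complete minimal reader for non-object dtypes: the element count comes from the shape (an empty tuple multiplies out to one element) and the byte count from dtype.itemsize. Again this assumes the "example.npy" file and a version 1.0 header.

   import ast
   import struct
   import numpy as np

   with open("example.npy", "rb") as f:
       f.read(8)                                        # skip magic string + version bytes
       (header_len,) = struct.unpack("<H", f.read(2))   # version 1.0 length field
       header = ast.literal_eval(f.read(header_len).decode("ascii"))
       dtype = np.dtype(header["descr"])
       count = int(np.prod(header["shape"]))            # shape == () gives 1
       data = np.frombuffer(f.read(count * dtype.itemsize), dtype=dtype)

   order = "F" if header["fortran_order"] else "C"
   array = data.reshape(header["shape"], order=order)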
Format Specification: Version 2.0

The version 1.0 format only allowed the array header to have a total size of 65535 bytes. This can be exceeded by structured arrays with a large number of columns. The version 2.0 format extends the header size to 4 GiB. numpy.save will automatically save in 2.0 format if the data requires it, else it will always use the more compatible 1.0 format.

The description of the fourth element of the header therefore has become:

- The next 4 bytes form a little-endian unsigned int: the length of the header data HEADER_LEN.
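A version-aware reader therefore differs only in the size of the length field; a minimal sketch:

   import struct

   def read_header_len(f):
       # Read past the magic string and version bytes, then return HEADER_LEN.
       assert f.read(6) == b"\x93NUMPY"
       major, minor = f.read(1)[0], f.read(1)[0]
       if major == 1:
           (header_len,) = struct.unpack("<H", f.read(2))   # 2-byte unsigned short
       else:
           (header_len,) = struct.unpack("<I", f.read(4))   # 4-byte unsigned int, v2.0+
       return header_len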
Conventions

We recommend using the “.npy” extension for files following this format. This is by no means a requirement; applications may wish to use this file format but use an extension specific to the application. In the absence of an obvious alternative, however, we suggest using “.npy”.

For a simple way to combine multiple arrays into a single file, one can use ZipFile to contain multiple “.npy” files. We recommend using the file extension “.npz” for these archives.
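numpy itself follows this convention: numpy.savez() writes exactly such a ZIP archive, with one “.npy” member per keyword argument, and numpy.load() reads it back.

   import numpy as np

   np.savez("bundle.npz", first=np.arange(3), second=np.eye(2))

   with np.load("bundle.npz") as archive:   # an .npz is just a ZIP of .npy files
       print(archive["first"], archive["second"])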