Reading binary files from PCs on "other endian" machines
Written by Paul Bourke
March 1991
This document briefly describes the byte swapping required when a binary
file created on a DOS/Windows machine is to be read on a computer which has
its bytes ordered the other way.
There are various datatypes which may be read. The simplest is characters,
where no byte swapping is required. The next simplest is an unsigned short
integer represented by 2 bytes. If the two bytes are read sequentially then
the integer value, if it was written by a big endian machine, is
256*byte1+byte2. If the integer was written by a little endian machine such
as a DOS/Windows computer then the integer is 256*byte2+byte1.
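As a small illustration of the two formulas above, the bytes can be read one at a time and combined arithmetically. This is only a sketch: the function names are made up for this example and are not part of the routines given later. Because the value is built from individual bytes, the result does not depend on the byte order of the machine doing the reading.

```c
#include <stdio.h>

/* Combine two bytes, in the order they appear in the file, into a 16 bit
   value. Both helper names are illustrative only. */
unsigned int U16FromBigEndian(const unsigned char b[2])
{
   return 256u * b[0] + b[1];      /* 256*byte1 + byte2 */
}

unsigned int U16FromLittleEndian(const unsigned char b[2])
{
   return 256u * b[1] + b[0];      /* 256*byte2 + byte1 */
}

/* Read an unsigned short that was written by a little endian machine.
   Returns 1 on success, 0 if two bytes could not be read. */
int ReadU16LittleEndian(FILE *fptr, unsigned int *n)
{
   unsigned char b[2];

   if (fread(b, 1, 2, fptr) != 2)
      return 0;
   *n = U16FromLittleEndian(b);
   return 1;
}
```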
While this approach can be used for unsigned shorts, ints, and longs and
can be easily modified for signed versions of the same, it is rather
difficult for real numbers (floats and double precision numbers).
Fortunately the standard IEEE numerical format is used almost exclusively
nowadays, so the bytes making up the particular number can be
swapped around appropriately in memory. This does assume that the
particular numerical type is the same size on both machines, that is,
the machine that wrote the file and the machine reading the file. The
usual sizes are 2 bytes for short integers, 4 bytes for long integers,
4 bytes for floats and 8 bytes for doubles.
In summary, to read 2 byte integers (signed or unsigned) one reads the
2 bytes as normal, eg: using fread(), and then swaps the 2 bytes in memory.
It turns out that for long integers, floats and doubles the requirement
is to reverse the bytes as they appear in memory. See the source below
for more details.
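Since every case amounts to reversing the bytes in memory, one possible generalisation is a single routine that takes the size as a parameter. The following is a sketch only. ReverseBytes() and ReadSwapped() are hypothetical names, and like the rest of this discussion it assumes the type has the same size on the writing and reading machines.

```c
#include <stddef.h>
#include <stdio.h>

/* Reverse the n bytes starting at p, in place */
void ReverseBytes(void *p, size_t n)
{
   unsigned char *c = (unsigned char *)p, tmp;
   size_t i;

   for (i = 0; i < n / 2; i++) {
      tmp = c[i];
      c[i] = c[n - 1 - i];
      c[n - 1 - i] = tmp;
   }
}

/* Read one n byte item into p and reverse its bytes.
   Returns 1 on success, 0 if the read fails. */
int ReadSwapped(FILE *fptr, void *p, size_t n)
{
   if (fread(p, n, 1, fptr) != 1)
      return 0;
   ReverseBytes(p, n);
   return 1;
}
```

For example, ReadSwapped(fptr,&x,sizeof(x)) would cover a short, int, float or double x with the same call.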
Source code
Some routines illustrating the methods required to do the byte swapping
for various numerical types.
#include <stdio.h>

#ifndef TRUE
#define TRUE  1
#define FALSE 0
#endif

/*
   Read a short integer, swapping the bytes
   Returns TRUE on success, FALSE if the read fails
*/
int ReadShortInt(FILE *fptr,short int *n)
{
   unsigned char *cptr,tmp;

   if (fread(n,2,1,fptr) != 1)
      return(FALSE);
   cptr = (unsigned char *)n;
   tmp = cptr[0];
   cptr[0] = cptr[1];
   cptr[1] = tmp;
   return(TRUE);
}
/*
   Read an integer, swapping the bytes
*/
int ReadInt(FILE *fptr,int *n)
{
   unsigned char *cptr,tmp;

   if (fread(n,4,1,fptr) != 1)
      return(FALSE);
   cptr = (unsigned char *)n;
   tmp = cptr[0];
   cptr[0] = cptr[3];
   cptr[3] = tmp;
   tmp = cptr[1];
   cptr[1] = cptr[2];
   cptr[2] = tmp;
   return(TRUE);
}
/*
   Read a floating point number
   Assume IEEE format
*/
int ReadFloat(FILE *fptr,float *n)
{
   unsigned char *cptr,tmp;

   if (fread(n,4,1,fptr) != 1)
      return(FALSE);
   cptr = (unsigned char *)n;
   tmp = cptr[0];
   cptr[0] = cptr[3];
   cptr[3] = tmp;
   tmp = cptr[1];
   cptr[1] = cptr[2];
   cptr[2] = tmp;
   return(TRUE);
}
/*
   Read a double precision number
   Assume IEEE
*/
int ReadDouble(FILE *fptr,double *n)
{
   unsigned char *cptr,tmp;

   if (fread(n,8,1,fptr) != 1)
      return(FALSE);
   cptr = (unsigned char *)n;
   tmp = cptr[0];
   cptr[0] = cptr[7];
   cptr[7] = tmp;
   tmp = cptr[1];
   cptr[1] = cptr[6];
   cptr[6] = tmp;
   tmp = cptr[2];
   cptr[2] = cptr[5];
   cptr[5] = tmp;
   tmp = cptr[3];
   cptr[3] = cptr[4];
   cptr[4] = tmp;
   return(TRUE);
}
Macros
An alternative for all but doubles is to use these cute macros, so that
the swapping is done inline.
#define SWAP_2(x) ( (((x) & 0xff) << 8) | ((unsigned short)(x) >> 8) )
#define SWAP_4(x) ( ((x) << 24) | \
(((x) << 8) & 0x00ff0000) | \
(((x) >> 8) & 0x0000ff00) | \
((x) >> 24) )
#define FIX_SHORT(x) (*(unsigned short *)&(x) = SWAP_2(*(unsigned short *)&(x)))
#define FIX_INT(x) (*(unsigned int *)&(x) = SWAP_4(*(unsigned int *)&(x)))
#define FIX_FLOAT(x) FIX_INT(x)
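The macros stop at 4 bytes. If a compiler with the C99 header stdint.h is available, a similar shift and mask approach can be sketched for 8 byte values. This is an addition to the original macros rather than part of them: Swap8() and FixDouble() are hypothetical names, and the code assumes a double is 8 bytes. It copies the bytes with memcpy() instead of casting pointers, which avoids reading a double through an integer pointer.

```c
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 64 bit unsigned integer */
uint64_t Swap8(uint64_t x)
{
   /* Swap the two 32 bit halves, then the 16 bit pairs, then the bytes */
   x = ((x & 0x00000000ffffffffULL) << 32) | (x >> 32);
   x = ((x & 0x0000ffff0000ffffULL) << 16) |
       ((x >> 16) & 0x0000ffff0000ffffULL);
   x = ((x & 0x00ff00ff00ff00ffULL) << 8) |
       ((x >> 8) & 0x00ff00ff00ff00ffULL);
   return x;
}

/* Reverse the bytes of a double in place (assumes sizeof(double) == 8) */
void FixDouble(double *d)
{
   uint64_t u;

   memcpy(&u, d, sizeof(u));
   u = Swap8(u);
   memcpy(d, &u, sizeof(u));
}
```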
Strategies for developers
There are three basic strategies for software developers when choosing how to
create endian independent data files and associated software.
Decide that the file format will be one particular endian. In this case software
running on machines of the same endian does nothing special, while software running
on other machines byte swaps everything on reading and writing. This is common for
file formats and software designed with an implicit endian assumption which get
ported at a future date to other machines.
Store in the file the endian-ness of the file. The software writes the binary
file in the natural endian of the underlying hardware but pays attention to
the endian-ness when reading binary files. Files of both endians need to be handled;
the software has knowledge of its own endian-ness so it can do the right thing.
The poorer cousin of the last approach is not to store the endian-ness and for
software to always write in its natural endian. This leads to two possible file
types, and the user is expected to know which endian a file is and to choose the
appropriate one when specifying which file to read. This is obviously the least
attractive approach.
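The second strategy needs two things: a way for the software to find out its own endian-ness, and a marker stored in the file. One way this might look is sketched below. The single byte 'L' or 'B' tag is an invented convention for this example, not a standard, and the function names are hypothetical.

```c
/* Returns 1 on a little endian host, 0 on a big endian host, by looking
   at which byte of the integer 1 is stored first in memory */
int HostIsLittleEndian(void)
{
   unsigned int one = 1;

   return *(unsigned char *)&one == 1;
}

/* The tag a writer would store at the start of the file
   ('L' or 'B' is an assumed convention for this sketch) */
unsigned char HostEndianTag(void)
{
   return HostIsLittleEndian() ? 'L' : 'B';
}

/* Given the tag read from a file, returns 1 if the values that follow
   need byte swapping on this host, 0 if not, -1 if the tag is invalid */
int EndianTagNeedsSwap(unsigned char tag)
{
   if (tag != 'L' && tag != 'B')
      return -1;
   return (tag == 'L') != HostIsLittleEndian();
}
```

A reader would fetch the tag first and then either use the values as read or pass each one through the byte swapping routines.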
Reading FORTRAN unformatted binary files in C/C++
Or....FORTRAN Weirdness, what were they thinking?
Written by Paul Bourke
April 2003
Problem
Ever wanted to read binary files written by a FORTRAN program
with a C/C++ program? Not such an unusual or unreasonable request,
but FORTRAN does some strange things. Consider the following FORTRAN
code, where "a" is a 3D array of 4 byte floating point values.
open(60,file=filename,status='unknown',form='unformatted')
write(60) nx,ny,nz
do k = 1,nz
do j = 1,ny
write(60) (a(i,j,k),i=1,nx)
enddo
enddo
close(60)
What you will end up with is not a file that is (4 * nx) * ny * nz + 12 bytes
long as it would be for the equivalent in most (if not all) other languages!
Instead it will be nz * ny * (4 * nx + 8) + 20 bytes long. Why?
Reason
Each time the FORTRAN write is issued a "record" is written. The record consists
of a 4 byte header, then the data, then a trailer that matches the header. The
4 byte header and trailer each hold the number of bytes written in the data
section. So the following
write(60) nx,ny,nz
gets written on the disk as follows, where nx,ny,nz are each 4 bytes and the other
numbers below are 2 byte integers written in decimal
0 12 nx ny nz 0 12
The total length written is 20 bytes. Similarly, the line
write(60) (a(i,j,k),i=1,nx)
gets written as follows, assuming nx is 1024 and "a" is real*4 so that the data
section is 4 * 1024 = 4096 bytes
0 4096 a(1,j,k) a(2,j,k) .... a(1024,j,k) 0 4096
The total length is 4104 bytes. Fortunately, once this is understood, it is trivial to
read the correct things in C/C++.
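Reading one record from C might be sketched as follows. This is a minimal example under stated assumptions rather than a complete reader: it assumes 4 byte record markers, an unsigned int that is 4 bytes, and a file written on a machine of the same endian as the one reading it. ReadFortranRecord() is a hypothetical name.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read one FORTRAN unformatted record: header, data, trailer.
   Returns a malloc'ed buffer holding the data and stores its length in
   *nbytes, or returns NULL on a read error or if the trailer does not
   match the header. The caller frees the buffer. */
void *ReadFortranRecord(FILE *fptr, unsigned int *nbytes)
{
   unsigned int head, tail;
   void *data;

   if (fread(&head, 4, 1, fptr) != 1)
      return NULL;
   data = malloc(head > 0 ? head : 1);
   if (data == NULL)
      return NULL;
   if ((head > 0 && fread(data, 1, head, fptr) != head) ||
       fread(&tail, 4, 1, fptr) != 1 || tail != head) {
      free(data);
      return NULL;
   }
   *nbytes = head;
   return data;
}
```

For the file above, the first call would return the 12 bytes holding nx,ny,nz and each later call would return one row of nx floats.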
A consequence that is a bit shocking for many programmers is that the above code
creates a file that is about 1/3 the size of the one created with this code.
open(60,file=filename,status='unknown',form='unformatted')
write(60) nx,ny,nz
do k = 1,nz
do j = 1,ny
do i = 1,nx
write(60) a(i,j,k)
enddo
enddo
enddo
close(60)
In this case each element of "a" is written as its own record and consumes 12 bytes
(4 byte header, 4 bytes of data, 4 byte trailer) for a total file size of
nx * ny * nz * 12 + 20.
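The three file sizes discussed here can be checked with a few lines of arithmetic. The functions below simply restate the formulas from the text, assuming 4 byte values and 4 byte record markers. The function names are invented for this example.

```c
/* Size in bytes with no record markers: three 4 byte dimensions
   followed by the array */
unsigned long RawSize(unsigned long nx, unsigned long ny, unsigned long nz)
{
   return 4 * nx * ny * nz + 12;
}

/* One record per row of nx values: 8 bytes of markers per row,
   plus the 20 byte record holding nx,ny,nz */
unsigned long RowRecordSize(unsigned long nx, unsigned long ny,
                            unsigned long nz)
{
   return nz * ny * (4 * nx + 8) + 20;
}

/* One record per element: 12 bytes per element,
   plus the 20 byte record holding nx,ny,nz */
unsigned long ElementRecordSize(unsigned long nx, unsigned long ny,
                                unsigned long nz)
{
   return nx * ny * nz * 12 + 20;
}
```

With nx = 1024 and ny = nz = 1 these give 4108, 4124 and 12308 bytes respectively, so the one record per element file is roughly three times the size of the other two.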
Note
This doesn't affect FORTRAN programs that might read these files,
because the FORTRAN "read" commands know how to handle these
unformatted files.
The discussion here does not address the transfer of binary files
between machines with a different endian. In that case after a short,
int, float, double is read the bytes must be rearranged. Fortunately
this is relatively straightforward with these macros.
#define SWAP_2(x) ( (((x) & 0xff) << 8) | ((unsigned short)(x) >> 8) )
#define SWAP_4(x) ( ((x) << 24) | (((x) << 8) & 0x00ff0000) | \
(((x) >> 8) & 0x0000ff00) | ((x) >> 24) )
#define FIX_SHORT(x) (*(unsigned short *)&(x) = SWAP_2(*(unsigned short *)&(x)))
#define FIX_LONG(x) (*(unsigned *)&(x) = SWAP_4(*(unsigned *)&(x)))
#define FIX_FLOAT(x) FIX_LONG(x)
It appears that the endianness of the 4 byte header and trailer reflects the
endianness of the machine doing the writing. Of course if you know the format of
the data being written then you can simply skip over the header/trailer bytes, but
if you need to decode the file or do error checking then you need to know the endian
of the machine where the file was written and the endian of the machine where the
file is being read.
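When the length of the first record is known in advance (12 bytes for the nx,ny,nz record in the example above), the header itself can hint at whether swapping is needed. The following is only a sketch of that idea. RecordMarkerNeedsSwap() is a hypothetical name, and it assumes 4 byte markers and a 4 byte unsigned int.

```c
#include <string.h>

/* raw holds the first 4 bytes of the file exactly as read.
   Returns 0 if they equal the expected record length in the host's
   byte order, 1 if they equal it after the bytes are reversed,
   and -1 if neither matches. */
int RecordMarkerNeedsSwap(const unsigned char raw[4], unsigned int expected)
{
   unsigned int v;
   unsigned char rev[4];

   memcpy(&v, raw, 4);
   if (v == expected)
      return 0;
   rev[0] = raw[3];
   rev[1] = raw[2];
   rev[2] = raw[1];
   rev[3] = raw[0];
   memcpy(&v, rev, 4);
   if (v == expected)
      return 1;
   return -1;
}
```

A result of 1 suggests that every header, trailer and data value in the file should be byte swapped after reading. A result of -1 suggests the file is not in the expected layout at all.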
And lastly, the above does not address the possibility (fairly rare these days) that
the files may be transferred between two machines with different internal
representations of floating point numbers. If that is the case then you're
really in trouble and should probably revert to transferring the data in a
readable ASCII format.
Update (Jan 2008): It would appear that on 64 bit machines the 2 header elements are
each written as 4 bytes instead of 2 bytes each.
If the file does not already exist and you control the FORTRAN code that writes it,
you can avoid all of the above by opening the file with the access='stream' option.
This option was introduced in Fortran 2003 explicitly to overcome this issue.