[Numpy-discussion] Fastest way to parse a specific binary file

Robert Kern robert.kern@gmail....
Wed Sep 2 13:58:06 CDT 2009


On Wed, Sep 2, 2009 at 13:28, Gökhan Sever<gokhansever@gmail.com> wrote:
> Put the reference manual in:
>
> http://drop.io/1plh5rt
>
> First few pages describe the data format they use.

Ah. The fields are *not* delimited by a fixed value. Regexes are no
help to you for pulling out the information you need, except perhaps
later to parse the text fields. I think you are also getting spurious
results because your regex matches things inside data fields.

Instead, you have a header containing the length of the data field
followed by the data field. Create a structured dtype that corresponds
to the DataDir struct on page 15. Note that "unsigned int" there is
actually a numpy.uint16, not a uint32.

  dt = np.dtype([('tagNumber', np.uint16),
                 ('dataOffset', np.uint16),
                 ('numberBytes', np.uint16),
                 ('samples', np.uint16),
                 ('bytesPerSample', np.uint16),
                 ('type', np.uint8),
                 ('param1', np.uint8),
                 ('param2', np.uint8),
                 ('param3', np.uint8),
                 ('address', np.uint16)])
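A quick sanity check on the dtype is to confirm that its size matches the struct in the manual. The sketch below spells the same fields with explicit byte-order codes; little-endian ('<') is an assumption on my part, so check it against the manual before relying on it.

```python
import numpy as np

# Same fields as the dtype above, with explicit type codes:
# '<u2' = little-endian 16-bit unsigned, 'u1' = 8-bit unsigned.
# ASSUMPTION: the file is little-endian; the manual is the authority here.
dt = np.dtype([
    ('tagNumber', '<u2'), ('dataOffset', '<u2'), ('numberBytes', '<u2'),
    ('samples', '<u2'), ('bytesPerSample', '<u2'),
    ('type', 'u1'), ('param1', 'u1'), ('param2', 'u1'), ('param3', 'u1'),
    ('address', '<u2'),
])

# Five uint16 + four uint8 + one uint16 = 16 bytes. Structured dtypes are
# packed (no alignment padding) unless align=True is requested.
print(dt.itemsize)  # 16
```

If the size printed here disagrees with the struct size in the manual, a field width is wrong and every later read will be misaligned.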

Now read dt.itemsize bytes from the file and use

  header = np.fromstring(f.read(dt.itemsize), dt)[0]

to get a record object that corresponds to the header. Use the
dataOffset and numberBytes fields to extract the actual data bytes
from the file.

For example, if we go to the second header field:

In [28]: f.seek(dt.itemsize,0)

In [29]: header = np.fromstring(f.read(dt.itemsize), dt)[0]

In [30]: header
Out[30]: (65530, 100, 8, 1, 8, 255, 0, 0, 0, 43605)

In [31]: f.seek(header['dataOffset'], 0)

In [32]: f.read(header['numberBytes'])
Out[32]: 'prj.300\x00'
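The session above can be wrapped into a small helper. This is only a sketch: read_entry is a hypothetical name, the file contents are synthetic bytes built in memory to mimic the values shown above, and it uses np.frombuffer because newer NumPy releases deprecate np.fromstring for binary data.

```python
import io
import numpy as np

# ASSUMPTION: little-endian fields, as in the DataDir layout discussed above.
dt = np.dtype([
    ('tagNumber', '<u2'), ('dataOffset', '<u2'), ('numberBytes', '<u2'),
    ('samples', '<u2'), ('bytesPerSample', '<u2'),
    ('type', 'u1'), ('param1', 'u1'), ('param2', 'u1'), ('param3', 'u1'),
    ('address', '<u2'),
])

def read_entry(f, index):
    """Read directory entry number `index` and the data bytes it points at."""
    f.seek(index * dt.itemsize, 0)
    header = np.frombuffer(f.read(dt.itemsize), dtype=dt)[0]
    # Offsets are treated as file-relative here, which only holds for
    # the first buffer in the file.
    f.seek(int(header['dataOffset']), 0)
    data = f.read(int(header['numberBytes']))
    return header, data

# Synthetic stand-in for the real file: two directory entries, with the
# second one pointing at 8 bytes of text at offset 100.
entries = np.zeros(2, dtype=dt)
entries[1] = (65530, 100, 8, 1, 8, 255, 0, 0, 0, 43605)
raw = bytearray(128)
raw[:entries.nbytes] = entries.tobytes()
raw[100:108] = b'prj.300\x00'
f = io.BytesIO(bytes(raw))

header, data = read_entry(f, 1)
print(header['dataOffset'], header['numberBytes'], data)
```

With a real file, open it in binary mode ('rb') and pass the file object in place of the BytesIO.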


There are still some semantic issues you need to work out. There are
multiple "buffers" per file, and the dataOffsets are relative to the
start of the buffer, not the file.
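One way to handle buffer-relative offsets is to carry the buffer's starting position around and add it to every seek. The sketch below assumes you already know where each buffer starts (buffer_start is a hypothetical parameter; finding the buffer boundaries is a separate problem) and that the directory entries sit at the start of each buffer. The two-buffer file is synthetic.

```python
import io
import numpy as np

# ASSUMPTION: little-endian fields, as in the DataDir layout discussed above.
dt = np.dtype([
    ('tagNumber', '<u2'), ('dataOffset', '<u2'), ('numberBytes', '<u2'),
    ('samples', '<u2'), ('bytesPerSample', '<u2'),
    ('type', 'u1'), ('param1', 'u1'), ('param2', 'u1'), ('param3', 'u1'),
    ('address', '<u2'),
])

def read_entry_in_buffer(f, buffer_start, index):
    """Read entry `index` of the buffer that begins at file offset `buffer_start`."""
    f.seek(buffer_start + index * dt.itemsize, 0)
    header = np.frombuffer(f.read(dt.itemsize), dtype=dt)[0]
    # dataOffset is relative to the buffer, so add the buffer's file position.
    f.seek(buffer_start + int(header['dataOffset']), 0)
    return header, f.read(int(header['numberBytes']))

# Synthetic file: two 64-byte buffers, each with one entry whose data
# lives 32 bytes into its own buffer.
BUF = 64  # hypothetical buffer size, for illustration only
raw = bytearray(2 * BUF)
for i, payload in enumerate([b'AAAA', b'BBBB']):
    entry = np.zeros(1, dtype=dt)
    entry[0] = (i, 32, 4, 1, 4, 0, 0, 0, 0, 0)
    start = i * BUF
    raw[start:start + dt.itemsize] = entry.tobytes()
    raw[start + 32:start + 36] = payload
f = io.BytesIO(bytes(raw))

_, first = read_entry_in_buffer(f, 0, 0)
_, second = read_entry_in_buffer(f, BUF, 0)
print(first, second)
```

Both entries store the same dataOffset (32), yet they resolve to different bytes because each is added to its own buffer's start.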

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco

