[Numpy-discussion] seeking advice on a fast string->array conversion
Tue Nov 16 10:46:19 CST 2010
On 11/16/10 7:31 AM, Darren Dale wrote:
> On Tue, Nov 16, 2010 at 9:55 AM, Pauli Virtanen<email@example.com> wrote:
>> Tue, 16 Nov 2010 09:41:04 -0500, Darren Dale wrote:
>>> That loop takes 0.33 seconds to execute, which is a good start. I need
>>> some help converting this example to return an actual numpy array. Could
>>> anyone please offer a suggestion?
It's interesting that you found fromstring() so slow. I've put some
time into trying to get fromfile() and fromstring() to be a bit more
robust and full-featured, and found it to be some really painful code to
work on, but it didn't dawn on me that it would be slow too! I saw all
the layers of function calls, but I still thought their cost would be
minimal compared to the actual string parsing. I guess not. It shows that
you never know where your bottlenecks are without profiling.
"Slow" is relative, of course, but since the whole point of
fromfile/string is performance (otherwise, we'd just parse with python),
it would be nice to get them as fast as possible.
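A rough way to measure this is sketched below. It is not from the thread: the input string, the two helper names, and the repeat count are all assumptions made for illustration, and the timings will vary with the NumPy version and the machine. It compares text-mode fromstring() against plain Python parsing of the same string.

```python
import timeit

import numpy as np

# Hypothetical test input: 100,000 copies of "100", separated by spaces.
text = " ".join(["100"] * 100_000)


def parse_with_fromstring(s):
    # A non-empty sep selects numpy's text parser.
    return np.fromstring(s, dtype=np.float64, sep=" ")


def parse_with_python(s):
    # Plain Python: split on whitespace and convert each token with float().
    return np.array([float(tok) for tok in s.split()], dtype=np.float64)


a = parse_with_fromstring(text)
b = parse_with_python(text)

# Both routes should produce the same array before timing means anything.
assert np.array_equal(a, b)

repeats = 5
for fn in (parse_with_fromstring, parse_with_python):
    seconds = timeit.timeit(lambda: fn(text), number=repeats)
    print(f"{fn.__name__}: {seconds / repeats:.4f} s per call")
```

Checking that the two results agree first keeps the comparison honest: a faster parser that returns different values is not a useful baseline.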
I had been thinking that the way to make a good fromfile was Cython, so
you've inspired me to think about it some more. Would you be interested
in extending what you're doing to a more general purpose tool?
Anyway, a comment or two:
> cdef extern from 'stdlib.h':
>     double atof(char*)
One thing I found with the current numpy code is that the use of the
ato* functions is a source of a lot of bugs (all of them?). The core
problem is error handling: you have to do a lot of pointer checking to
see if a call was successful, and in the fromfile code, that error
handling is not done in all the layers of calls.
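To make the error-handling problem concrete, here is a small pure-Python emulation, not the numpy code and not a real binding to libc. In C, atof() returns 0.0 when nothing can be converted, which is indistinguishable from parsing "0", while strtod() also reports how far it got through an end pointer that the caller must check. The names atof_like and strtod_like and the simplified regular expression are assumptions made for this sketch; real strtod() accepts more forms (hex floats, inf, nan).

```python
import re

# Simplified leading-float pattern: optional whitespace, optional sign,
# digits with an optional fraction, and an optional exponent.
_FLOAT_PREFIX = re.compile(r"\s*[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def atof_like(s):
    """Mimic atof: return 0.0 when nothing parses, with no error signal."""
    m = _FLOAT_PREFIX.match(s)
    return float(m.group(0)) if m else 0.0


def strtod_like(s):
    """Mimic strtod: return (value, end), where end is how far parsing got."""
    m = _FLOAT_PREFIX.match(s)
    if m is None:
        return 0.0, 0  # end == 0 means no conversion was performed
    return float(m.group(0)), m.end()


for token in ("100", "0", "abc", "12.5xyz"):
    value, end = strtod_like(token)
    if end == 0:
        status = "no number found"
    elif end < len(token):
        status = f"trailing junk after {end} characters"
    else:
        status = "ok"
    print(f"{token!r}: atof_like={atof_like(token)}, "
          f"strtod_like={value} ({status})")
```

The point of the sketch is that every caller has to inspect the end position; if any layer skips that check, a failed parse silently turns into 0.0.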
Anyone know what the advantage of ato* is over scanf()/fscanf()?
Also, why are you doing string parsing rather than parsing the files
directly? Wouldn't that be a bit faster?
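The two routes look like this with numpy's own text readers. This is only a sketch under assumptions: the temporary file and its contents are made up for illustration, and it shows that the results match, not which route is faster.

```python
import os
import tempfile

import numpy as np

values = np.arange(12, dtype=np.float64)

# Write a small space-separated text file (hypothetical sample data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(" ".join(repr(float(v)) for v in values))
    path = f.name

try:
    # Route 1: read the whole file into a string, then parse the string.
    with open(path) as f:
        from_string = np.fromstring(f.read(), dtype=np.float64, sep=" ")

    # Route 2: let numpy parse straight from the file.
    from_file = np.fromfile(path, dtype=np.float64, sep=" ")
finally:
    os.remove(path)

assert np.array_equal(from_string, values)
assert np.array_equal(from_file, values)
```

Parsing from the file avoids holding the full text in memory as a Python string, but whether it is actually faster would need to be measured.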
I've got some C extension code for simple parsing of text files into
arrays of floats or doubles (using fscanf). I'd be curious how the
performance compares to what you've got. Let me know if you're interested.
> def test():
>     py_string = '100'
>     cdef char* c_string = py_string
>     cdef int i, j
>     cdef double val
>     i = 0
>     j = 2048*1200
>     cdef np.ndarray[np.float64_t, ndim=1] ret
>     ret_arr = np.empty((2048*1200,), dtype=np.float64)
>     ret = ret_arr
>     d = time.time()
>     while i<j:
>         c_string = py_string
>         ret[i] = atof(c_string)
>         i += 1
>     ret_arr.shape = (1200, 2048)
>     print ret_arr, ret_arr.shape, time.time()-d
> The loop now takes only 0.11 seconds to execute. Thanks again.
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception