[Numpy-discussion] Possible roadmap addendum: building better text file readers

Wes McKinney wesmckinn@gmail....
Thu Feb 23 15:39:34 CST 2012


On Thu, Feb 23, 2012 at 4:20 PM, Erin Sheldon <erin.sheldon@gmail.com> wrote:
> Excerpts from Wes McKinney's message of Thu Feb 23 16:07:04 -0500 2012:
>> That's pretty good: almost certainly faster than pandas's
>> csv-module+Cython approach (though I haven't run your code to gauge
>> how much my hardware makes a difference), but that's not shocking
>> at all:
>>
>> In [1]: df = DataFrame(np.random.randn(350000, 32))
>>
>> In [2]: df.to_csv('/home/wesm/tmp/foo.csv')
>>
>> In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
>> CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
>> Wall time: 7.04 s
>>
>> I have to think that skipping the process of creating 11.2 million
>> Python string objects and then individually converting each of them
>> to float accounts for most of the difference.
>>
>> Note for reference (I'm skipping the first row, which has the column
>> labels from above):
>>
>> In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv',
>> dtype=None, delimiter=',', skip_header=1)
>> CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
>> Wall time: 24.67 s
>>
>> In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv',
>> delimiter=',', skiprows=1)
>> CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
>> Wall time: 11.32 s
>>
>> In this last case, for example, around 500 MB of RAM is taken up for
>> an array that should only be about 80-90 MB (350,000 rows x 32
>> float64 values x 8 bytes each is roughly 90 MB). If you're a data
>> scientist working in Python, this is _not good_.
>
> It might be good to compare on recarrays, which are a bit more complex.
> Can you try one of these .dat files?
>
>    http://www.cosmo.bnl.gov/www/esheldon/data/lensing/scat/05/
>
> The dtype is
>
> [('ra', 'f8'),
>  ('dec', 'f8'),
>  ('g1', 'f8'),
>  ('g2', 'f8'),
>  ('err', 'f8'),
>  ('scinv', 'f8', 27)]
>
> --
> Erin Scott Sheldon
> Brookhaven National Laboratory

Forgot this one, which is also widely used:

In [28]: %time recs = matplotlib.mlab.csv2rec('/home/wesm/tmp/foo.csv',
skiprows=1)
CPU times: user 65.16 s, sys: 0.30 s, total: 65.46 s
Wall time: 65.55 s
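
As an aside, here's a minimal sketch (a scaled-down toy, not one of the
benchmarks above; timings will vary by machine) illustrating why
materializing one Python string per field and calling float() on each
is so costly, compared with NumPy's C-level bulk parser:

import time
import numpy as np

# scaled-down toy: 1 million comma-separated floats (vs 11.2 million above)
values = np.random.randn(1000000)
text = ','.join('%r' % v for v in values)

# per-token path: one Python string object per field, then float() on each
t0 = time.time()
arr_slow = np.array([float(tok) for tok in text.split(',')])
print('per-token parse: %.2f s' % (time.time() - t0))

# bulk path: np.fromstring's text mode parses in C, with no intermediate
# Python string or float objects
t0 = time.time()
arr_fast = np.fromstring(text, sep=',')
print('bulk parse: %.2f s' % (time.time() - t0))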

OK, with one of those .dat files and the dtype above I get:

In [18]: %time arr = np.genfromtxt('/home/wesm/Downloads/scat-05-000.dat',
dtype=dtype, skip_header=0, delimiter=' ')
CPU times: user 17.52 s, sys: 0.14 s, total: 17.66 s
Wall time: 17.67 s

The difference is not so stark in this case. Note that read_table
doesn't produce structured arrays, though (a conversion sketch follows
the timing below):

In [26]: %time arr = read_table('/home/wesm/Downloads/scat-05-000.dat',
header=None, sep=' ')
CPU times: user 10.15 s, sys: 0.10 s, total: 10.25 s
Wall time: 10.26 s
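
If a structured array is what's needed, the DataFrame from read_table
can be converted after the fast parse. A rough sketch (assuming the 32
parsed float columns map positionally onto Erin's dtype; 'arr' here is
the DataFrame from the timing above):

import numpy as np

# Erin's dtype: five scalar fields plus a 27-element subarray field
dtype = [('ra', 'f8'), ('dec', 'f8'), ('g1', 'f8'), ('g2', 'f8'),
         ('err', 'f8'), ('scinv', 'f8', 27)]

vals = arr.values                       # (nrows, 32) float64 block
out = np.empty(len(vals), dtype=dtype)  # structured array to fill
out['ra'] = vals[:, 0]
out['dec'] = vals[:, 1]
out['g1'] = vals[:, 2]
out['g2'] = vals[:, 3]
out['err'] = vals[:, 4]
out['scinv'] = vals[:, 5:]              # remaining 27 columns

The column copies are vectorized slices of a contiguous float block, so
the conversion cost should be small next to the parse itself.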

- Wes

