[Numpy-discussion] Advice for grouping recarrays

Stephen Simmons mail at stevesimmons.com
Tue Jul 18 06:59:30 CDT 2006


Hi,

Does anyone have any suggestions for summarising data in numpy?

The quick description is that I want to do something like the SQL statement:
    SELECT sum(field1), sum(field2) FROM  table GROUP BY field3;

The more accurate description is that my data is stored in PyTables HDF 
format, with 24 monthly files, each with 4m records describing how 
customers performed that month. Each record looks something like this:
('200604', 651404500000L, '800', 'K', 12L, 162.0, 2000.0, 0.054581, 0.0, 
0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
8.80, 0.86, 7.80 17.46, 0.0, 70.0, 0.0, 70.0, -142.93, 0.0, 2000.0, 
2063.93, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -9.71, 7.75, 
87.46, 77.75, -3.45, 0.22, -0.45, -0.57, 73.95)
The first 5 fields are status fields (month_string, account_number, 
product_code, account_status, months_since_customer_joined). The 
remaining 48 fields represent different aspects of the customer's 
performance during that month. I read 100,000 of these records at a time 
and turn them into a numpy recarray with:
    dat = hdf_table.read(start=pos, stop=pos+block_size)
    dat = numpy.asarray(dat._flatArray, dtype=dat.array_descr)      

I'd like to reduce these 96m records x 53 fields down to monthly 
averages for each tuple (month_string, months_since_customer_joined) 
which in the case above is ('200604', 12L). This will let me compare the 
performance of newly acquired customers at the same point in their 
lifecycle as customers acquired 1 or 2 years ago.

The end result should be a dataset something like
    res[month_index, months_since_customer_joined]
    = array([ num_records, sum_field_5, sum_field_6, sum_field_7, ... 
sum_field_52 ])
with a shape of (24, 24, 49).

I've played around with lexsort(), take(), sum(), etc, but get very 
confused and end up feeling that I'm making things more complicated than 
they need to be. So any advice from numpy veterans on how best to 
proceed would be very welcome!

Cheers

Stephen




More information about the Numpy-discussion mailing list