[Numpy-discussion] NumPy and None (null, NaN, missing)
tchur at bigpond.com
Sat Apr 8 10:07:13 CDT 2000
I'm a new user of NumPy so forgive me if this is a FAQ. I would normally
check the list archives but I'm on holidays at the moment in Manila and
the speed of the Internet connection here does not permit much Web browsing.
I've been experimenting with using Gary Strangman's excellent stats.py
functions. The speed of these functions when operating on NumPy arrays
and the ability of NumPy to swallow very large arrays is remarkable.
However, one deficiency I have noticed is the lack of the ability to
represent nulls (i.e. missing values, None or NaN [Not-a-Number]) in
NumPy arrays. Missing values commonly occur in real-life statistical
data and although they are usually excluded from most statistical
calculations, it is important to be able to keep track of the number of
missing data elements and report this. Because NumPy arrays can't
represent missing data via a special value, it is necessary to exclude
missing data elements from NumPy arrays and keep track of them elsewhere
(in standard Python lists). This is messy. Also, it is quite common to
use various imputation techniques to estimate the values of missing data
elements - the ability to represent missing data in a NumPy array and
then change it to an imputed value would be a real boon.
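
For illustration only, here is a minimal sketch of what such a facility could look like. It assumes present-day NumPy, where np.nan, np.nanmean and the np.ma masked-array module are available; these may not exist in the version discussed in this post. The sample values and variable names are hypothetical.

```python
import numpy as np

# Hypothetical sample: None marks a missing observation.
raw = [2.5, None, 4.0, None, 5.5]

# Option 1: store missing values as NaN in a float array.
values = np.array([np.nan if x is None else x for x in raw], dtype=float)
n_missing = int(np.isnan(values).sum())   # number of missing elements
mean_observed = np.nanmean(values)        # mean over non-missing values only

# Simple mean imputation: overwrite the missing slots in a copy.
imputed = values.copy()
imputed[np.isnan(imputed)] = mean_observed

# Option 2: a masked array keeps the "missing" flags separate from the
# data, so the same idea also works for integer columns (which have no NaN).
masked = np.ma.masked_invalid(values)
n_observed = masked.count()               # number of non-masked elements
```

Mean imputation is used here only because it is the simplest case to show; any other imputation technique would assign into the same positions.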
The speed of these functions is lightning-fast.
The problem is the speed with which data can be extracted from a column
of a MySQL (or any other SQL database) query result set and stuffed into
a NumPy array. This inevitably involves forming a Python list and then
assigning that to a NumPy array. This is both slow and memory-hungry,
especially with large datasets (I have been playing with a few million rows).
I was wondering if it would be feasible to initially add a method to the
_mysql class in the MySQLdb module which iterated through a result set
using a C routine (rather than a Python routine) and stuffed the data
directly into a NumPy array (or arrays - one for each column in the
result set) in one fell swoop (or even iterating row-by-row but in C)? I
suspect that such a facility would be much faster than having to move
the data into NumPy via a standard Python list (or actually via tuples
within a list, which is the way the Python DB-API returns results).
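
To make the two routes concrete, here is a rough sketch, not the C-level routine proposed above. It assumes present-day NumPy and uses the standard-library sqlite3 module as a stand-in for MySQLdb, since both follow the Python DB-API; the table name obs, the column x and the two helper functions are hypothetical. The first helper goes via a list of tuples, the second streams rows from the cursor into the array with np.fromiter.

```python
import sqlite3
import numpy as np

# sqlite3 stands in for MySQLdb here; both follow the Python DB-API.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL)")
conn.executemany("INSERT INTO obs VALUES (?)",
                 [(float(i),) for i in range(1000)])

def column_via_list(cursor):
    """Fetch every row as a tuple into a list, then convert to an array."""
    rows = cursor.fetchall()                       # list of 1-tuples
    return np.array([r[0] for r in rows], dtype=np.float64)

def column_via_fromiter(cursor):
    """Stream rows from the cursor into the array, with no list in between."""
    return np.fromiter((r[0] for r in cursor), dtype=np.float64)

a = column_via_list(conn.execute("SELECT x FROM obs ORDER BY x"))
b = column_via_fromiter(conn.execute("SELECT x FROM obs ORDER BY x"))
```

Note that np.fromiter still pulls each row through the Python interpreter, so it only avoids the intermediate list; a routine inside the database module itself would be needed to skip the per-row Python overhead as well.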
If this direct MySQL-to-NumPy interface worked well, it might be
desirable to add it to the Python DB-API specification for optional
implementation in the other database modules which conform to the API.
There are probably other extensions which would make the DB-API more
useful for statistical applications, which tend to be set
(column)-oriented rather than row-oriented. I will post to the list as
these occur to me.
PS I will be away for the next week so apologies in advance for not
replying immediately to any follow-ups to this posting.