[Numpy-discussion] Insights / lessons learned from NumPy design
Dag Sverre Seljebotn
Fri Jan 4 02:13:06 CST 2013
On 01/04/2013 09:00 AM, Dag Sverre Seljebotn wrote:
> On 01/04/2013 07:29 AM, Mike Anderson wrote:
>> Hello all,
>> In the Clojure community there has been some discussion about creating a
>> common matrix maths library / API. Currently there are a few different
>> fledgling matrix libraries in Clojure, so it seemed like a worthwhile
>> effort to unify them and have a common base to build on.
>> NumPy has been something of an inspiration for this, so I thought I'd ask
>> here to see what lessons have been learned.
>> We're thinking of a matrix library with roughly the following design
>> (subject to change!)
>> - Support for multi-dimensional matrices (but with fast paths for 1D
>> vectors and 2D matrices as the common cases)
> Food for thought: I myself have vectors that are naturally stored in 2D,
> "matrices" that can be naturally stored in 4D, and so on (you can't view
> them that way when doing linear algebra; it's just that the indices can
> have multiple components) -- I like that NumPy calls everything an
> "array"; I think "vector" and "matrix" are higher-level mathematical concepts.
>> - Immutability by default, i.e. matrix operations are pure functions
>> that create new matrices. There could be a "backdoor" option to mutate
>> matrices, but that would be unidiomatic in Clojure
> Sounds very promising (assuming you can reuse the buffer if the input
> matrix had no other references and is not used again?). It's very common
> for NumPy arrays to fill a large chunk of the available memory (think
> 20-100 GB), so for those users this would need to be coupled with buffer
> reuse and good diagnostics that help remove references to old
> generations of a matrix.
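One way to reconcile an immutable API with buffer reuse is to let a pure operation quietly recycle its input's storage when the runtime can see that nothing else references it. A minimal CPython/NumPy sketch of the idea (pure_add and the refcount threshold are illustrative assumptions, not real NumPy machinery):

```python
import sys
import numpy as np

def pure_add(a, b):
    # Hypothetical "pure" addition that reuses a's buffer when a is a
    # temporary nobody else can observe. Inside this function, a dead
    # temporary is referenced only by the local name 'a' and by
    # getrefcount's own argument slot, so its count is 2 (this
    # threshold is CPython-specific and version-dependent).
    if sys.getrefcount(a) <= 2:
        np.add(a, b, out=a)   # recycle the buffer in place
        return a
    return a + b              # a is still visible elsewhere: allocate

result = pure_add(np.ones(4), np.ones(4))   # temporary input: buffer reused
```

Either branch returns the same values; the difference is only whether a fresh multi-gigabyte buffer had to be allocated.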
Oh: Depending on your ambitions, it's worth thinking hard about i)
storage format, and ii) lazy evaluation.
Storage format: The new trend is for more flexible formats than just
column-major/row-major, e.g., storing cache-sized n-dimensional tiles.
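For concreteness, here is how the address computation might look for such a tiled layout (a sketch under assumed conventions: square tile-by-tile blocks, blocks stored row-major, elements row-major within a block, dimensions divisible by the tile size):

```python
def tiled_offset(i, j, ncols, tile=64):
    # Linear offset of element (i, j) in a hypothetical tiled layout.
    tiles_per_row = ncols // tile                      # blocks per block-row
    block = (i // tile) * tiles_per_row + (j // tile)  # which tile (i, j) lands in
    within = (i % tile) * tile + (j % tile)            # position inside that tile
    return block * tile * tile + within
```

Neighbouring elements in both i and j then stay inside one cache-sized tile, unlike plain row-major order, where a step along a column jumps ncols elements.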
Lazy evaluation: The big problem with NumPy is that "a + b + np.sqrt(c)"
will first materialize a temporary result for "a + b", rather than evaluating
the whole expression in a single pass, which is *very* bad for performance.
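The effect is easy to see with plain NumPy: each intermediate below materializes a full-size array, and the only way to avoid that today is to spell out the evaluation order yourself with out= (a manual stand-in for what a lazy evaluator would do automatically):

```python
import numpy as np

a, b, c = (np.ones(1_000_000) for _ in range(3))

# Eager evaluation: (a + b) and np.sqrt(c) each allocate a full-size
# temporary before the final add runs.
r = a + b + np.sqrt(c)

# Same result using a single scratch buffer, reused across steps:
tmp = np.sqrt(c)           # tmp = sqrt(c)
np.add(tmp, a, out=tmp)    # tmp = sqrt(c) + a
np.add(tmp, b, out=tmp)    # tmp = sqrt(c) + a + b
```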
So if you want immutability, I urge you to consider every operation to
build up an expression tree/"program", and then either find out the
smart points where you interpret that program automatically, or make
explicit eval() of an expression tree the default mode.
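A bare-bones sketch of that design in Python (the Lazy class, the lift helper, and the explicit eval() are hypothetical names, meant only to show the shape of the API, not the fused single-pass interpreter a real implementation would want):

```python
import numpy as np

class Lazy:
    # Each operation records an expression node instead of computing.
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, other):
        return Lazy(np.add, self, other)

    def eval(self):
        # Walk the tree; a real system would fuse this into a single
        # loop over the data rather than evaluating node by node.
        vals = [x.eval() if isinstance(x, Lazy) else x for x in self.args]
        return self.op(*vals)

def lift(arr):
    return Lazy(lambda x: x, arr)   # wrap a concrete array as a leaf

a, b, c = np.ones(4), np.full(4, 2.0), np.full(4, 9.0)
expr = lift(a) + lift(b) + Lazy(np.sqrt, c)  # builds a tree; no work yet
result = expr.eval()                         # the one explicit evaluation
```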
Of course, this all depends on how ambitious you are.
It's probably best to have a look at the projects designed to get around
NumPy's shortcomings:
- Blaze (in development, continuum.io)
- HDF5 chunked storage
- To some degree Cython
>> - Support for 64-bit double precision floats only (this is the standard
>> float type in Clojure)
>> - Ability to support multiple different back-end matrix implementations
>> (JBLAS, Colt, EJML, Vectorz, javax.vecmath etc.)
>> - A full range of matrix operations. Operations would be delegated to
>> back end implementations where they are supported, otherwise generic
>> implementations could be used.
>> Any thoughts on this topic based on the NumPy experience? In particular
>> it would be very interesting to know:
>> - Features in NumPy which proved to be redundant / not worth the effort
>> - Features that you wish had been designed in at the start
>> - Design decisions that turned out to be a particularly big mistake
>> Would love to hear your insights, any ideas+advice greatly appreciated!
> Travis Oliphant noted some of his thoughts on this in the recent thread
> "DARPA funding for Blaze and passing the NumPy torch" which is a must-read.
> Dag Sverre