[Numpy-discussion] Thoughts on persistence/object tracking in scientific code
Luis Pedro Coelho
Mon Dec 29 17:25:05 CST 2008
On Monday 29 December 2008 17:40:07 Gael Varoquaux wrote:
> It is interesting to see that you take a slightly different approach than
> the others already discussed. This probably stems from the fact that you
> are mostly interested by parallelism, whereas there are other adjacent
> problems that can be solved by similar abstractions. In particular, I
> have the impression that you do not deal with what I call
> "lazy re-evaluation". In other words, I am not sure if you track results
> enough to know whether an intermediate result should be re-run, or if you
> run a 'clean' between each run to avoid this problem.
I do. As long as the hash (computed from the arguments to the function) is the
same, the code loads objects from disk instead of recomputing them. I don't
track the actual source code, though, only whether the parameters have changed
(but that could be added later).
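The behaviour described above can be sketched as a disk-backed memoizer keyed on a hash of the arguments. This is a minimal illustration, not jug's actual code: the `jugdata` cache directory name and the `cached_call` helper are made up for the example.

```python
import hashlib
import os
import pickle

CACHE_DIR = "jugdata"  # hypothetical cache directory

def cached_call(f, *args, **kwargs):
    # Key on the function name and its arguments, not on the
    # function's source code (matching the behaviour described above).
    key = pickle.dumps((f.__name__, args, sorted(kwargs.items())))
    h = hashlib.sha1(key).hexdigest()
    path = os.path.join(CACHE_DIR, h)
    if os.path.exists(path):
        # Same hash: load the stored object instead of recomputing.
        with open(path, "rb") as fp:
            return pickle.load(fp)
    result = f(*args, **kwargs)
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "wb") as fp:
        pickle.dump(result, fp)
    return result
```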
> I must admit I went away from using hash to store objects to the disk
> because I am very much interested in traceability, and I wanted my
> objects to have meaningful names, and to be stored in convenient formats
> (pickle, numpy .npy, hdf5, or domain-specific). I have now realized that
> explicit naming is convenient, but it should be optional.
But using a hash is not so impenetrable, as long as you can still easily get
at the files you want.
If I want to load the results of a partial computation, all I have to do is
generate the same Task objects as the initial computation and load those: I
can run the jugfile.py inside ipython and call the appropriate load() methods.
: interesting = [t for t in tasks if t.name == 'something.other']
: intermediate = [t.load() for t in interesting]
> I did notice too that using the argument value to hash was bound to
> failure in all but the simplest case. This is the immediate limitation to
> the famous memoize pattern when applied to scientific code. If I
> understand well, what you do is that you track the 'history' of the
> object and use it as a hash to the object, right? I had come to the
> conclusion that the history of objects should be tracked, but I hadn't
> realized that using it as a hash was also a good way to solve the scoping
> problem. Thanks for the trick.
Yes, let's say I have the following:
feats = [Task(features, img) for img in glob('*.png')]
cluster = Task(kmeans, feats, k=10)
then the hash for cluster is computed from its arguments:
* kmeans: the function name
* feats: this is a list of tasks, so I use each task's hash, which is in turn
defined by its own arguments (here, a simple filename string).
* k=10: this is a literal.
I don't need to use the value computed by feats to compute the hash for
cluster.
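That recursive hashing scheme can be sketched roughly as follows. The names here are hypothetical and simplified (jug's real Task class differs; for instance, it qualifies task names with the module), but the point is the same: a Task argument contributes its hash, never its computed value.

```python
import hashlib

class Task(object):
    def __init__(self, f, *args, **kwargs):
        self.name = f.__name__
        self.args = args
        self.kwargs = kwargs

    def hash(self):
        # Hash the function name plus the hash of each argument.
        h = hashlib.sha1(self.name.encode())
        for a in list(self.args) + sorted(self.kwargs.items()):
            h.update(hash_one(a))
        return h.hexdigest()

def hash_one(obj):
    if isinstance(obj, Task):
        # A Task argument contributes its own hash, recursively,
        # so no intermediate value ever needs to be computed.
        return obj.hash().encode()
    if isinstance(obj, (list, tuple)):
        return b"".join(hash_one(x) for x in obj)
    # Literals (strings, numbers) are hashed by their representation.
    return repr(obj).encode()
```

With this, `Task(kmeans, feats, k=10).hash()` depends only on the function names, the filenames, and the literal 10, and is therefore stable across runs.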
> Your task-based approach, and the API you have built around it, reminds
> my a bit of twisted deferred. Have you studied this API?
No. I will look into it. Thanks.