[SciPy-User] format for chunked file save and read ?

josef.pktd@gmai... josef.pktd@gmai...
Wed Sep 22 12:04:24 CDT 2010


On Wed, Sep 22, 2010 at 12:29 PM, Nathaniel Smith <njs@pobox.com> wrote:
> On Wed, Sep 22, 2010 at 7:18 AM,  <josef.pktd@gmail.com> wrote:
>> What is the best file format for storing temporary data, for chunked
>> saving and loading, that only uses numpy and scipy?
>> I would like a file format that could be shared cross-platform and
>> across python/numpy versions if needed.
>
> Why not just use pickle? Mmap isn't giving you any advantages here
> that I can see, and pickles are much easier to handle when you want to
> write things out incrementally.

I don't like pickles much for anything that needs to be stored for
more than 5 minutes, because several times I could no longer read them
after a version or code change.
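
A minimal sketch of one numpy-only alternative for chunked saving and
loading, assuming each chunk is a plain numeric array: write the chunks
back to back into one file with np.save and read them back one at a time
with np.load. The helper names append_chunk and iter_chunks and the file
name samples.npy are made up for illustration.

```python
import os
import tempfile

import numpy as np


def append_chunk(fileobj, chunk):
    """Write one array to an open binary file in .npy format."""
    # allow_pickle=False refuses object arrays, so no pickle data is written.
    np.save(fileobj, np.asarray(chunk), allow_pickle=False)


def iter_chunks(path):
    """Yield the arrays stored back to back in `path`, one at a time."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Each np.load call consumes exactly one saved array.
        while f.tell() < size:
            yield np.load(f, allow_pickle=False)


# Hypothetical usage: three chunks written incrementally, then read back.
path = os.path.join(tempfile.mkdtemp(), "samples.npy")
with open(path, "wb") as f:
    for i in range(3):
        append_chunk(f, np.arange(4) + 10 * i)

chunks = list(iter_chunks(path))
```

Only one chunk has to be in memory at a time on either side, and the
chunks do not need to share a shape or dtype.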

>
>> usecase: Stata is (optionally) saving all Bootstrap samples to a file
>> so that the same samples will be available if a follow-up analysis is
>> desired/required.
>>
>> We could also just save the seed and redo the same samples which might
>> however not be fast for some models
>
> You should save the seed in any case!
>
> For probably most bootstrap purposes, it would work fine to just save
> the samples themselves in the bootstrap object, or have them as an
> extra return value. The 'boot' package for R does this. Most bootstrap
> results don't involve huge amounts of memory.

The bootstrap module is still in the planning stage. In all my current
bootstrap examples, I throw away the original sample and keep only the
summary statistics. But I think Stata's option of saving the samples to
a file is a good idea, even if I won't need it in most cases. There are
bootstrap (I think) and simulation-based estimators where the same
simulated samples have to be used several times.
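
A minimal sketch of the "save the seed and redo the samples" route,
assuming a nonparametric bootstrap that resamples row indices. The
function name bootstrap_indices and the seed value are illustrative.
It uses np.random.RandomState, the legacy generator.

```python
import numpy as np


def bootstrap_indices(seed, nobs, nrep):
    """Return an (nrep, nobs) array of resampling indices for a given seed."""
    rng = np.random.RandomState(seed)
    # Each row holds the indices of one bootstrap sample, drawn with replacement.
    return rng.randint(0, nobs, size=(nrep, nobs))


seed = 12345  # store this with the results
first = bootstrap_indices(seed, nobs=10, nrep=5)
again = bootstrap_indices(seed, nobs=10, nrep=5)  # follow-up analysis
```

Regenerating the indices is cheap. What may not be fast is re-estimating
the model on every sample, which is the case where storing the samples
or the results pays off.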

Right now I'm mainly planning ahead, while going through the Stata, SAS
and SPSS manuals.

>
> On a more general note, I think APIs that take a filename and store
> some of their (logical) return values there are somewhat "smelly"[1].
> Managing temporary files programmatically is a huge pain, esp. when
> I'll just need to read the results back out again or whatever. If
> you're worried about memory use, maybe let the user pass in a callback
> that will be called with each sample in turn, instead of hard-coding
> this temporary file thing?

Using a callback is a good idea because it allows for different storage
backends (I ran into a similar problem when storing datasets downloaded
from the internet), but it doesn't remove the need to pick a default
storage format that I can also use myself.
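
A rough sketch of what the callback variant could look like, under the
assumption that the bootstrap routine resamples rows and calls the
callback once per sample. The function bootstrap and its signature are
hypothetical, not an existing API.

```python
import numpy as np


def bootstrap(data, statistic, nrep, seed, callback=None):
    """Bootstrap `statistic`; hand each resampled array to `callback` if given."""
    data = np.asarray(data)
    rng = np.random.RandomState(seed)
    results = []
    for _ in range(nrep):
        idx = rng.randint(0, len(data), size=len(data))
        sample = data[idx]
        if callback is not None:
            callback(sample)  # the storage backend decides what to do with it
        results.append(statistic(sample))
    return np.array(results)


# Hypothetical usage: the simplest "backend" keeps the samples in a list.
kept = []
data = np.arange(20.0)
means = bootstrap(data, np.mean, nrep=4, seed=0, callback=kept.append)
```

A file-based default would then be just another callback, for example
one that writes each sample to an open file with np.save.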

Josef

>
> [1] http://c2.com/xp/CodeSmell.html
>
> Cheers,
> -- Nathaniel