[IPython-User] embed data in notebooks?

Thomas Breuel tmbdev@gmail....
Tue Jun 5 14:10:18 CDT 2012


I should say that the nb_open and nb_data interfaces I'm suggesting are
independent of the storage format, so adopting them now for a JSON based
storage would be safe even if the storage format changes to something else
in the future.
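
One way to picture that independence is the minimal sketch below. It assumes two hypothetical storage backends that expose the same get/put pair (the class names, method names, and the hex-encoded file naming are illustrative only and are not part of IPython): one keeps base64 text inside a JSON-style dict, the other keeps raw files in a directory. Code written against get/put does not need to know which one is in use.

```python
import base64
import os
import tempfile


class JsonStore:
    """Hypothetical backend: base64 text inside a JSON-serializable dict."""

    def __init__(self):
        self.doc = {"data": {}}

    def get(self, key):
        encoded = self.doc["data"].get(key)
        return None if encoded is None else base64.b64decode(encoded)

    def put(self, key, value):
        self.doc["data"][key] = base64.b64encode(value).decode("ascii")


class DirStore:
    """Hypothetical backend: one raw file per key inside a directory."""

    def __init__(self, root):
        self.root = root

    def _path(self, key):
        # Hex-encode the key so that any key maps to a safe file name.
        return os.path.join(self.root, key.encode("utf-8").hex())

    def get(self, key):
        try:
            with open(self._path(key), "rb") as stream:
                return stream.read()
        except FileNotFoundError:
            return None

    def put(self, key, value):
        with open(self._path(key), "wb") as stream:
            stream.write(value)


def roundtrip(store):
    """Caller code that only relies on get/put, not on the storage format."""
    store.put("file:my.data", b"\x00\x01payload")
    return store.get("file:my.data")


with tempfile.TemporaryDirectory() as root:
    results = [roundtrip(JsonStore()), roundtrip(DirStore(root))]
```

Both entries in results hold the same bytes, so swapping the backend leaves the calling code unchanged.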

Tom

On Tue, Jun 5, 2012 at 9:09 PM, Thomas Breuel <tmbdev@gmail.com> wrote:

> Hi,
>
> thanks for the response.
>
>
>> Our vision is 'folder as a project' - so the code in a notebook sits
>> alongside data files, keeping the notebook file lightweight and
>> suitable for version control.
>>
>
> Well, right now the notebook is a file in JSON format, but it already
> includes binary data in encoded form.  I'm just suggesting a small addition
> that would make this existing storage facility useful for storing modest
> amounts of extra data.
>
> If you do want to change the notebook format in the future, I think that's
> worth considering; see my suggestion at the very end.
>
>
>> I can see there's an argument for having a way to store data as part
>> of a notebook, but I think there are some questions:
>> - How would the user interface work: How would the data be brought in
>> and assigned to a variable? What would be displayed in the notebook?
>> Would we handle different types of files differently, or treat all
>> binary files the same?
>>
>
> A simple interface might be the following:
>
> stream = nb_open("some/path/to/my.data")
>
> data = nb_data("mydata",lambda:randn(100))
>
> nb_open opens a stream to the data cached in the notebook, unless that
> data doesn't exist; in that case, it caches the file in the notebook and
> then opens it.  It always returns just a stream.
>
> nb_data gets data from the notebook if it has been stored there; if not,
> it calls its second argument, stores the result in the notebook, and also
> returns it.
>
> This interface means that as an author, I just write the code, maybe clear
> the notebook cache a few times, but I don't even have to think about
> managing the notebook data; if my notebook works once, it will continue to
> work indefinitely even if the file or data become unavailable (until I
> clear the notebook data explicitly).
>
> See the low-level code below.
>
> (There should be a UI button and another function for clearing the
> cache/data stored in the notebook.)
>
>> - How would performance hold up? We'd have to base64 encode the data
>> to store it in JSON, so loading binary data will inevitably be slower
>> as it has an extra decoding step. It also increases the size of the
>> data on disk.
>>
>
> Obviously, this doesn't solve all data storage problems.  But many
> educational notebooks just need an image or a small dataset, and having an
> easy way of storing such moderate amounts of data in the notebook would be
> nice, in particular since the notebook already stores comparable kinds and
> amounts of data for the outputs.
>
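The size cost raised in the question can be estimated directly. This is a minimal sketch using only the standard library; the function name base64_overhead is hypothetical, and it measures encoded size only, not decoding time.

```python
import base64
import os


def base64_overhead(n_bytes):
    """Return len(base64 text) / len(raw bytes) for n_bytes of random data."""
    raw = os.urandom(n_bytes)
    encoded = base64.b64encode(raw)
    # Decoding must give back exactly the original bytes.
    assert base64.b64decode(encoded) == raw
    return len(encoded) / len(raw)


# Base64 maps every 3 input bytes to 4 output characters.
ratio = base64_overhead(300_000)
```

For an input whose length is a multiple of 3, the ratio is 4/3, so an embedded image or dataset takes roughly a third more space on disk than the raw file.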
>> - Is the cost/benefit trade off worth it? This may involve significant
>> extra complexity in IPython, and it's simple enough to zip up a
>> notebook file + input data.
>
>
> I think this can probably piggy-back on existing mechanisms.  I think you
> need two low-level Python functions: nb_get_binary_data(key) and
> nb_put_binary_data(key,value).  The first one sends a message to the
> notebook asking whether there is binary data for the given key and returns
> it or returns None, and the second one stores binary data under the given
> key.    I think those should be pretty easy to provide, since the same kind
> of thing already needs to happen for output cells.
>
> In terms of those primitives, nb_open looks roughly like:
>
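> import io
>
> def nb_open(path, mode="r"):
>     # Only read access is supported for cached files.
>     assert mode == "r", "cached files need to be read-only for now"
>     key = "file:" + path
>     data = nb_get_binary_data(key)  # proposed primitive; None on a miss
>     if data is None:
>         # First use: read the file as bytes and cache it in the notebook.
>         with open(path, "rb") as stream:
>             data = stream.read()
>         nb_put_binary_data(key, data)
>     return io.BytesIO(data)  # always return an in-memory binary stream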
>
> The nb_data(name,thunk) function would be pretty similar.
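A minimal sketch of what nb_data might look like, following the same pattern as nb_open. Several things here are assumptions and not part of the proposal: the two primitives are replaced by in-memory stand-ins so that the example runs on its own, values are serialized with pickle, and the "data:" key prefix is a hypothetical counterpart to "file:".

```python
import pickle

# Stand-in for the notebook's storage; the real primitives would
# exchange messages with the notebook instead of using a dict.
_store = {}


def nb_get_binary_data(key):
    """Return the bytes stored under key, or None if there are none."""
    return _store.get(key)


def nb_put_binary_data(key, value):
    """Store bytes under key."""
    _store[key] = value


def nb_data(name, thunk):
    """Return the cached value for name; on a miss, call thunk and cache it."""
    key = "data:" + name  # hypothetical prefix, mirroring "file:" in nb_open
    blob = nb_get_binary_data(key)
    if blob is None:
        value = thunk()
        nb_put_binary_data(key, pickle.dumps(value))
        return value
    return pickle.loads(blob)
```

With this sketch, calling nb_data("mydata", thunk) a second time returns the stored value without running thunk again, which is what keeps a notebook working after the original data source disappears.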
>
> Tom
>
> PS: As for the "notebook as directory", having notebooks be single files
> is quite convenient, since zipping/unzipping things and dealing with
> directories can be a nuisance in many settings.  What you could do in the
> future, however, is allow notebooks to be either directories or zip files
> containing a directory tree, making the difference transparent to the
> notebook and the user.  I think that would be a great direction to go in,
> also because you could then store the output images separate from the pure
> text.  Note that OpenOffice and OpenDocument files work that way, and data
> files could also be embedded efficiently. Furthermore, version control
> tools are starting to be able to deal with document formats based on zip
> files.  But, in any case, that's a bigger change than what I suggested
> above.
>
> For ZIP files as document formats, see
> http://www.openoffice.org/xml/faq.html and
> http://en.wikipedia.org/wiki/OpenDocument
>
> Here is Mercurial version control for ZIP-based document formats:
> http://mercurial.selenic.com/wiki/ZipdocExtension
>