[IPython-User] embed data in notebooks?

Thomas Breuel tmbdev@gmail....
Tue Jun 5 14:09:12 CDT 2012


Hi,

thanks for the response.


> Our vision is 'folder as a project' - so the code in a notebook sits
> alongside data files, keeping the notebook file lightweight and
> suitable for version control.
>

Well, right now the notebook is a file in JSON format, but it already
includes binary data in encoded form.  I'm just suggesting a small addition
that would make this existing storage facility useful for storing modest
amounts of extra data.

If you do want to change the notebook format in the future, I think that's
worth considering; see my suggestion at the very end.


> I can see there's an argument for having a way to store data as part
> of a notebook, but I think there are some questions:
> - How would the user interface work: How would the data be brought in
> and assigned to a variable? What would be displayed in the notebook?
> Would we handle different types of files differently, or treat all
> binary files the same?
>

A simple interface might be the following:

stream = nb_open("some/path/to/my.data")

data = nb_data("mydata",lambda:randn(100))

nb_open opens a stream to the data cached in the notebook, unless that data
doesn't exist; in that case, it caches the file in the notebook and then
opens it.  It always returns just a stream.

nb_data gets data from the notebook if it has been stored there; if not, it
calls its second argument, stores the result in the notebook, and also
returns it.

This interface means that as an author, I just write the code, maybe clear
the notebook cache a few times, but I don't even have to think about
managing the notebook data; if my notebook works once, it will continue to
work indefinitely even if the file or data become unavailable (until I
clear the notebook data explicitly).

See the low-level code below.

(There should be a UI button and another function for clearing the
cache/data stored in the notebook.)

> - How would performance hold up? We'd have to base64 encode the data
> to store it in JSON, so loading binary data will inevitably be slower
> as it has an extra decoding step. It also increases the size of the
> data on disk.
>

Obviously, this doesn't solve all data storage problems.  But many
educational notebooks just need an image or a small dataset, and having an
easy way of storing such moderate amounts of data in the notebook would be
nice, in particular since the notebook already stores comparable kinds and
amounts of data for the outputs.
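
To put a rough number on the size overhead raised in the quoted question, here is a small sketch using only the standard library. The 30000-byte payload is an arbitrary stand-in for a small image or dataset, not a figure from this thread. Base64 emits 4 characters for every 3 input bytes, so the stored form is about a third larger than the raw data.

```python
import base64
import os

# Arbitrary stand-in payload, roughly the size of a small image.
raw = os.urandom(30000)

# This is the text form that would have to be embedded in the JSON file.
encoded = base64.b64encode(raw)

# Base64 produces 4 output characters per 3 input bytes.
overhead = len(encoded) / len(raw)
print(len(raw), len(encoded), round(overhead, 3))

# Decoding restores the original bytes exactly.
assert base64.b64decode(encoded) == raw
```

For the modest amounts of data discussed here, that extra third and the decoding step are likely comparable to what the notebook already pays for encoded output images.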

> - Is the cost/benefit trade off worth it? This may involve significant
> extra complexity in IPython, and it's simple enough to zip up a
> notebook file + input data.


I think this can probably piggy-back on existing mechanisms.  You need two
low-level Python functions: nb_get_binary_data(key) and
nb_put_binary_data(key,value).  The first one sends a message to the
notebook asking whether there is binary data for the given key and returns
it, or None if there is none; the second one stores binary data under the
given key.  Those should be pretty easy to provide, since the same kind
of thing already needs to happen for output cells.

In terms of those primitives, nb_open looks roughly like:

import io

def nb_open(path,mode="r"):
    assert mode=="r","cached files need to be read-only for now"
    key = "file:"+path
    data = nb_get_binary_data(key)
    if data is None:
        # not cached yet: read the raw bytes and store them in the notebook
        with open(path,"rb") as stream:
            data = stream.read()
        nb_put_binary_data(key,data)
    # always return a stream over the cached bytes
    return io.BytesIO(data)

The nb_data(name,thunk) function would be pretty similar.
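
A minimal sketch of what nb_data might look like on top of the same primitives. The dict-backed primitives are stand-ins so the example runs on its own, and both the "data:" key prefix (by analogy with "file:") and the use of pickle to turn the value into bytes are assumptions, not part of the proposal.

```python
import pickle

# Hypothetical stand-ins for the two low-level primitives.
_store = {}

def nb_get_binary_data(key):
    return _store.get(key)

def nb_put_binary_data(key, value):
    _store[key] = value

def nb_data(name, thunk):
    """Return the value cached under `name`, computing it once if missing."""
    key = "data:" + name  # assumed prefix, mirroring "file:" in nb_open
    blob = nb_get_binary_data(key)
    if blob is None:
        # Not cached yet: compute the value, store it, and return it.
        value = thunk()
        nb_put_binary_data(key, pickle.dumps(value))
        return value
    # Cached: rebuild the value from the stored bytes without calling thunk.
    return pickle.loads(blob)

# The thunk runs only on the first call; later calls read the cache.
first = nb_data("mydata", lambda: [1.0, 2.0, 3.0])
second = nb_data("mydata", lambda: [])
assert first == second == [1.0, 2.0, 3.0]
```

One caveat with this choice: unpickling data from a notebook received from someone else is unsafe, so a real implementation would probably want a more restricted serialization format.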

Tom

PS: As for the "notebook as directory", having notebooks be single files is
quite convenient, since zipping/unzipping things and dealing with
directories can be a nuisance in many settings.  What you could do in the
future, however, is allow notebooks to be either directories or zip files
containing a directory tree, making the difference transparent to the
notebook and the user.  I think that would be a great direction to go in,
also because you could then store the output images separately from the pure
text.  Note that OpenOffice and OpenDocument files work that way, and data
files could also be embedded efficiently. Furthermore, version control
tools are starting to be able to deal with document formats based on zip
files.  But, in any case, that's a bigger change than what I suggested
above.
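
To make the zip-container idea concrete, here is a small sketch using Python's standard zipfile module. The member names "notebook.json" and "data/my.data" and the empty notebook structure are hypothetical, chosen only for illustration; nothing here reflects an actual IPython format.

```python
import io
import json
import zipfile

# Build the container in memory; a real notebook would be a file on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    # The pure text of the notebook goes in one member...
    archive.writestr("notebook.json", json.dumps({"cells": []}))
    # ...and binary data sits next to it as raw bytes, with no base64 step.
    archive.writestr("data/my.data", b"\x00\x01\x02")

# Reading it back: members are addressed by path, like a directory tree.
with zipfile.ZipFile(buffer) as archive:
    names = sorted(archive.namelist())
    notebook = json.loads(archive.read("notebook.json").decode("utf-8"))
    payload = archive.read("data/my.data")

print(names, notebook, payload)
```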

For ZIP files as document formats, see
http://www.openoffice.org/xml/faq.html and
http://en.wikipedia.org/wiki/OpenDocument

Here is Mercurial version control for ZIP-based document formats:
http://mercurial.selenic.com/wiki/ZipdocExtension