[IPython-User] embed data in notebooks?

Brian Granger ellisonbg@gmail....
Tue Jun 5 14:21:27 CDT 2012


You could also use the %%file magic that is under review here:

https://github.com/ipython/ipython/pull/1855

On Tue, Jun 5, 2012 at 12:10 PM, Thomas Breuel <tmbdev@gmail.com> wrote:
> I should say that the nb_open and nb_data interfaces I'm suggesting are
> independent of the storage format, so adopting them now for JSON-based
> storage would be safe even if the storage format changes to something else
> in the future.
>
> Tom
>
>
> On Tue, Jun 5, 2012 at 9:09 PM, Thomas Breuel <tmbdev@gmail.com> wrote:
>>
>> Hi,
>>
>> thanks for the response.
>>
>>>
>>> Our vision is 'folder as a project' - so the code in a notebook sits
>>> alongside data files, keeping the notebook file lightweight and
>>> suitable for version control.
>>
>>
>> Well, right now the notebook is a file in JSON format, but it already
>> includes binary data in encoded form.  I'm just suggesting a small addition
>> that would make this existing storage facility useful for storing modest
>> amounts of extra data.
>>
>> If you do want to change the notebook format in the future, I think that's
>> worth considering; see my suggestion at the very end.
>>
>>>
>>> I can see there's an argument for having a way to store data as part
>>> of a notebook, but I think there are some questions:
>>> - How would the user interface work: How would the data be brought in
>>> and assigned to a variable? What would be displayed in the notebook?
>>> Would we handle different types of files differently, or treat all
>>> binary files the same?
>>
>>
>> A simple interface might be the following:
>>
>> stream = nb_open("some/path/to/my.data")
>>
>> data = nb_data("mydata", lambda: randn(100))
>>
>> nb_open opens a stream to the data cached in the notebook, unless that
>> data doesn't exist; in that case, it caches the file in the notebook and
>> then opens it.  It always returns just a stream.
>>
>> nb_data gets data from the notebook if it has been stored there; if not,
>> it calls its second argument, stores the result in the notebook, and also
>> returns it.
>>
>> This interface means that as an author, I just write the code, maybe clear
>> the notebook cache a few times, but I don't even have to think about
>> managing the notebook data; if my notebook works once, it will continue to
>> work indefinitely even if the file or data become unavailable (until I clear
>> the notebook data explicitly).
>>
>> See the low-level code below.
>>
>> (There should be a UI button and another function for clearing the
>> cache/data stored in the notebook.)
>>
>>> - How would performance hold up? We'd have to base64 encode the data
>>> to store it in JSON, so loading binary data will inevitably be slower
>>> as it has an extra decoding step. It also increases the size of the
>>> data on disk.
>>
>>
>> Obviously, this doesn't solve all data storage problems.  But many
>> educational notebooks just need an image or a small dataset, and having an
>> easy way of storing such moderate amounts of data in the notebook would be
>> nice, in particular since the notebook already stores comparable kinds and
>> amounts of data for the outputs.
>>
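To put a rough number on the base64 cost raised above, here is a minimal sketch. The 3000-byte random payload and the "data" key are made up for illustration; only the standard library is used.

```python
import base64
import json
import os

# Hypothetical payload: 3000 random bytes standing in for a small image or dataset.
raw = os.urandom(3000)

# JSON cannot hold raw bytes, so the payload is base64-encoded to ASCII text first.
encoded = base64.b64encode(raw).decode("ascii")
document = json.dumps({"data": encoded})

# Reading it back needs the extra decoding step mentioned in the question.
restored = base64.b64decode(json.loads(document)["data"])

# base64 emits 4 characters for every 3 input bytes, so this prints "3000 4000".
print(len(raw), len(encoded))
```

So the overhead on disk is about one third, plus one encode/decode pass, which is probably acceptable for an image or a small dataset and less so for large arrays.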
>>> - Is the cost/benefit trade-off worth it? This may involve significant
>>> extra complexity in IPython, and it's simple enough to zip up a
>>> notebook file + input data.
>>
>>
>> I think this can probably piggy-back on existing mechanisms.  I think you
>> need two low-level Python functions: nb_get_binary_data(key) and
>> nb_put_binary_data(key,value).  The first one sends a message to the
>> notebook asking whether there is binary data for the given key and returns
>> it or returns None, and the second one stores binary data under the given
>> key.  I think those should be pretty easy to provide, since the same kind
>> of thing already needs to happen for output cells.
>>
>> In terms of those primitives, nb_open looks roughly like:
>>
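>> import io
>>
>> def nb_open(path, mode="r"):
>>     # Cached files are read-only for now.
>>     assert mode == "r", "cached files need to be read-only for now"
>>     key = "file:" + path
>>     data = nb_get_binary_data(key)
>>     if data is None:
>>         # Not cached yet: read the raw bytes and store them in the notebook.
>>         with open(path, "rb") as stream:
>>             data = stream.read()
>>         nb_put_binary_data(key, data)
>>     # Wrap the cached bytes in a file-like object.
>>     return io.BytesIO(data)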
>>
>> The nb_data(name,thunk) function would be pretty similar.
>>
>> Tom
>>
>> PS: As for the "notebook as directory", having notebooks be single files
>> is quite convenient, since zipping/unzipping things and dealing with
>> directories can be a nuisance in many settings.  What you could do in the
>> future, however, is allow notebooks to be either directories or zip files
>> containing a directory tree, making the difference transparent to the
>> notebook and the user.  I think that would be a great direction to go in,
>> also because you could then store the output images separately from the
>> pure text.  Note that OpenOffice and OpenDocument files work that way, and
>> data files could also be embedded efficiently.  Furthermore, version
>> control tools are starting to be able to deal with document formats based
>> on zip files.  But, in any case, that's a bigger change than what I
>> suggested above.
>>
>> For ZIP files as document formats, see
>> http://www.openoffice.org/xml/faq.html and http://en.wikipedia.org/wiki/OpenDocument
>>
>> Here is Mercurial version control for ZIP-based document
>> formats: http://mercurial.selenic.com/wiki/ZipdocExtension
>
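To make the zip-based idea in the PS concrete, here is a minimal sketch using Python's zipfile module. The member names ("notebook.json", "data/figure1.png") and the tiny notebook dict are invented for illustration and are not the actual notebook format.

```python
import io
import json
import zipfile

# Hypothetical layout: notebook text in one member, binary data in others.
notebook = {"cells": [{"cell_type": "code", "source": "print('hi')"}]}
image_bytes = b"\x89PNG placeholder bytes"

buffer = io.BytesIO()  # in-memory stand-in for a file on disk
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("notebook.json", json.dumps(notebook))
    # Binary members are stored as raw bytes, so no base64 step is needed.
    archive.writestr("data/figure1.png", image_bytes)

# Reopen the same archive and read both members back.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    restored = json.loads(archive.read("notebook.json").decode("utf-8"))
    restored_image = archive.read("data/figure1.png")

print(names)
```

The notebook text stays a small, diffable member, while images and data sit next to it inside the same single file.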
>
>



-- 
Brian E. Granger
Cal Poly State University, San Luis Obispo
bgranger@calpoly.edu and ellisonbg@gmail.com

