[IPython-User] embed data in notebooks?
Tue Jun 5 15:07:00 CDT 2012
Our idea is not "notebook as dir" but "project as dir/repo", where a
project contains any of: notebooks, data files, Python modules, etc.
I don't think storing binary data in the notebook file itself is worth a
new kernel-side API beyond the existing systems for b64 data in Python
scripts, which will work just as well in the notebook as anywhere else (not
that they are great, of course).
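
For illustration, the "b64 data in a script" approach might look roughly like
the sketch below. The names EMBEDDED_B64 and open_embedded are made up for
this example, and the payload is generated inline only so the snippet runs on
its own; in practice the encoded string would be pasted into the script or a
notebook cell.

```python
import base64
import io

# Stand-in payload. In a real script the base64 text would be a pasted
# literal produced once from the original file.
raw = bytes(range(16))
EMBEDDED_B64 = base64.b64encode(raw).decode("ascii")


def open_embedded(b64_text):
    """Decode base64 text and return the bytes as a read-only stream."""
    return io.BytesIO(base64.b64decode(b64_text))


stream = open_embedded(EMBEDDED_B64)
data = stream.read()
```

This needs nothing from the kernel or the notebook format, which is the point
being made above; the cost is an unwieldy string literal in the source.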
We did seriously consider the idea of archive file-formats while planning
the notebook format, but we decided (largely from the perspective of VCS,
etc) that JSON makes much more sense, and data files belong at the project
level. For instance, what if you want two notebooks to work on the same
data? The data shouldn't live in either notebook, nor in both.
What is unfortunate at this point is that we really haven't developed our
project-level UI/APIs yet; they exist only in the planning stages. I think
once you have project as dir/repo, then the benefits of data in the
notebook file itself vanish, as the project becomes the unit of
sharing/etc. We would certainly have the ability to support *project* as a
zipfile that gets extracted on upload to the server.
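
As a rough sketch of what "project as a zipfile, extracted on upload" could
mean, assuming nothing about the eventual server API: extract_project and the
file names below are hypothetical, and the archive is built in memory only so
the example is self-contained.

```python
import io
import os
import tempfile
import zipfile


def extract_project(zip_bytes, dest_dir):
    """Unpack an uploaded project archive into dest_dir.

    Returns the sorted list of member names found in the archive.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())


# Build a tiny example project: one notebook plus one data file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("analysis.ipynb", "{}")
    archive.writestr("data/values.csv", "x,y\n1,2\n")

project_dir = tempfile.mkdtemp()
members = extract_project(buf.getvalue(), project_dir)
```

A real server would also want to check member paths and sizes before
extracting anything from an untrusted upload.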
On Tue, Jun 5, 2012 at 12:21 PM, Brian Granger <firstname.lastname@example.org> wrote:
> You could also use the %%file magic that is under review here:
> On Tue, Jun 5, 2012 at 12:10 PM, Thomas Breuel <email@example.com> wrote:
> > I should say that the nb_open and nb_data interfaces I'm suggesting are
> > independent of the storage format, so adopting them now for a JSON based
> > storage would be safe even if the storage format changes to something else
> > in the future.
> > Tom
> > On Tue, Jun 5, 2012 at 9:09 PM, Thomas Breuel <firstname.lastname@example.org> wrote:
> >> Hi,
> >> thanks for the response.
> >>> Our vision is 'folder as a project' - so the code in a notebook sits
> >>> alongside data files, keeping the notebook file lightweight and
> >>> suitable for version control.
> >> Well, right now the notebook is a file in JSON format, but it already
> >> includes binary data in encoded form. I'm just suggesting a small change
> >> that would make this existing storage facility useful for storing modest
> >> amounts of extra data.
> >> If you do want to change the notebook format in the future, I think
> >> worth considering; see my suggestion at the very end.
> >>> I can see there's an argument for having a way to store data as part
> >>> of a notebook, but I think there are some questions:
> >>> - How would the user interface work: How would the data be brought in
> >>> and assigned to a variable? What would be displayed in the notebook?
> >>> Would we handle different types of files differently, or treat all
> >>> binary files the same?
> >> A simple interface might be the following:
> >>     stream = nb_open("some/path/to/my.data")
> >>     data = nb_data("mydata", lambda: randn(100))
> >> nb_open opens a stream to the data cached in the notebook, unless that
> >> data doesn't exist; in that case, it caches the file in the notebook and
> >> then opens it. It always returns just a stream.
> >> nb_data gets data from the notebook if it has been stored there; if not,
> >> it calls its second argument, stores the result in the notebook, and
> >> returns it.
> >> This interface means that as an author, I just write the code, maybe clear
> >> the notebook cache a few times, but I don't even have to think about
> >> managing the notebook data; if my notebook works once, it will continue to
> >> work indefinitely even if the file or data become unavailable (until I clear
> >> the notebook data explicitly).
> >> See the low-level code below.
> >> (There should be a UI button and another function for clearing the
> >> cache/data stored in the notebook.)
> >>> - How would performance hold up? We'd have to base64 encode the data
> >>> to store it in JSON, so loading binary data will inevitably be slower
> >>> as it has an extra decoding step. It also increases the size of the
> >>> data on disk.
> >> Obviously, this doesn't solve all data storage problems. But many
> >> educational notebooks just need an image or a small dataset, and having an
> >> easy way of storing such moderate amounts of data in the notebook would be
> >> nice, in particular since the notebook already stores comparable kinds and
> >> amounts of data for the outputs.
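
To put a number on the size question raised above: base64 emits 4 characters
for every 3 input bytes, so the encoded form is about a third larger than the
raw data. A minimal sketch, where the 3000-byte random payload and the "data"
key are just stand-ins and not part of the notebook format:

```python
import base64
import json
import os

raw = os.urandom(3000)  # stand-in for a small binary dataset
encoded = base64.b64encode(raw).decode("ascii")

# 4 output characters per 3 input bytes
overhead = len(encoded) / len(raw)

# Round-trip through JSON text, roughly as a notebook file would store it.
doc = json.dumps({"data": encoded})
restored = base64.b64decode(json.loads(doc)["data"])
```

Loading costs one extra decode pass over the data, which matters little for
an image or a small dataset and more as the payload grows.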
> >>> - Is the cost/benefit trade off worth it? This may involve significant
> >>> extra complexity in IPython, and it's simple enough to zip up a
> >>> notebook file + input data.
> >> I think this can probably piggy-back on existing mechanisms. I think we
> >> need two low-level Python functions: nb_get_binary_data(key) and
> >> nb_put_binary_data(key,value). The first one sends a message to the
> >> notebook asking whether there is binary data for the given key and returns
> >> it or returns None, and the second one stores binary data under the given
> >> key. I think those should be pretty easy to provide, since the same kind
> >> of thing already needs to happen for output cells.
> >> In terms of those primitives, nb_open looks roughly like:
> >>     def nb_open(path, mode="r"):
> >>         assert mode == "r", "cached files need to be read-only for now"
> >>         key = "file:" + path
> >>         data = nb_get_binary_data(key)
> >>         if data is None:
> >>             with open(path) as stream:
> >>                 data = stream.read()
> >>             nb_put_binary_data(key, data)
> >>         return data_as_stream(data, "r")
> >> The nb_data(name,thunk) function would be pretty similar.
> >> Tom
> >> PS: As for the "notebook as directory", having notebooks be single files
> >> is quite convenient, since zipping/unzipping things and dealing with
> >> directories can be a nuisance in many settings. What you could do in the
> >> future, however, is allow notebooks to be either directories or zip files
> >> containing a directory tree, making the difference transparent to the
> >> notebook and user. I think that would be a great direction to go in,
> >> because you could then store the output images separate from the pure
> >> Note that OpenOffice and OpenDocument files work that way, and data
> >> could also be embedded efficiently. Furthermore, version control tools are
> >> starting to be able to deal with document formats based on zip files.
> >> In any case, that's a bigger change than what I suggested above.
> >> For ZIP files as document formats, see
> >> http://www.openoffice.org/xml/faq.html and
> >> Here is Mercurial version control for ZIP-based document
> >> formats: http://mercurial.selenic.com/wiki/ZipdocExtension
> > _______________________________________________
> > IPython-User mailing list
> > IPython-User@scipy.org
> > http://mail.scipy.org/mailman/listinfo/ipython-user
> Brian E. Granger
> Cal Poly State University, San Luis Obispo
> email@example.com and firstname.lastname@example.org