[IPython-dev] Thoughts on the notebook format for version control
Sat Nov 5 21:41:31 CDT 2011
On Sat, Nov 5, 2011 at 18:58, Fernando Perez <email@example.com> wrote:
> Hi folks,
> I wanted to start a discussion on the notebook format regarding its
> suitability for version control. I see the notebook format as the way
> in which I'll likely keep (and hopefully many others too) most of my
> research notes/work, and therefore it's important that it's as easy as
> possible to version control notebooks and use them smoothly in a
> version-controlled workflow. Unfortunately right now, the format we
> have simply doesn't fit that, mainly for two reasons:
> 1. The cell inputs (code and text) are stored as a single line in the
> json format. This means that virtually any edits anywhere in a cell
> will immediately produce VC conflicts. Furthermore, they are nearly
> impossible to resolve by hand because you have to scan very long lines
> by eye, and can only apply wholesale one version or the other.
> 2. The presence of outputs stored inside the file causes two separate
> problems:
> a) The large binary blobs make the files often quite large.
> b) Changes in the binary blobs can't really be inspected by hand, but
> tend to easily cause conflicts.
> To get a sense of the problem, here's the diff from a pull request
> made on a simple (mostly for testing purposes) repo:
> That diff is more or less useless: note the huge horizontal scroll
> bar, and changes in inputs are impossible to understand.
> So I think we need to find a solution. This doesn't have to happen
> necessarily right away, since we're trying to put 0.12 out; I think
> it's OK if for now our format is mostly treated as a binary blob. But
> we do need to come up with a plan for the medium term.
> Here's my proposal, with full credit going to Yarik who suggested the
> idea of splitting outputs into a separate file. There are basically
> two changes against what we have now:
> 1. The notebook would *always* be split into two files, the .ipynb
> containing only inputs, and a companion (say .ipynbo) file with all
> outputs. If an output file is not available or is detected to have
> problems such as cell number mismatch, it is simply ignored (it can
> always be recreated by rerunning the notebook).
There is a *huge* disadvantage in portability to notebooks not being single
files. I think this still makes sense, though. I would treat the output as a
'cache' (along the lines of .pyc / __pycache__), rather than considering the
notebook itself as a multi-file format. And you should be able to embed the
outputs in a single file if you want, for easier portability. Doing it this
way would not require changing the notebook format, because notebooks would
still comply with the spec.
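
To make the 'cache' idea concrete, here is a rough sketch, not a proposal for
the actual implementation. The function name load_cached_outputs, the
"append an o to the filename" rule, and the layout of the outputs file (a
plain JSON list with one entry per cell) are all assumptions made up for
illustration. The only point is that a missing, corrupt, or mismatched
outputs file is silently ignored, the way a stale .pyc would be.

```python
import json
import os
import tempfile


def load_cached_outputs(notebook_path, n_cells, suffix="o"):
    """Return the cached outputs list, or None if the cache is unusable.

    Hypothetical helper: the file naming and layout are assumptions.
    """
    cache_path = notebook_path + suffix  # e.g. demo.ipynb -> demo.ipynbo
    if not os.path.exists(cache_path):
        return None  # no cache yet: outputs come from rerunning the notebook
    try:
        with open(cache_path) as f:
            outputs = json.load(f)
    except ValueError:
        return None  # unreadable JSON: ignore it, like a broken .pyc
    if not isinstance(outputs, list) or len(outputs) != n_cells:
        return None  # cell number mismatch: the cache is stale
    return outputs


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    nb = os.path.join(tmp, "demo.ipynb")
    missing = load_cached_outputs(nb, n_cells=2)   # no companion file yet
    with open(nb + "o", "w") as f:
        json.dump([["out 1"], ["out 2"]], f)
    matched = load_cached_outputs(nb, n_cells=2)   # counts agree
    stale = load_cached_outputs(nb, n_cells=3)     # counts disagree
```

With something like this, only the .ipynb needs to be tracked, and the
companion file can go in the VC ignore list.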
> 2. All inputs would be stored in a json list of strings instead of a
> single string.
I like this - code.splitlines() / '\n'.join(lines) makes it easy. This
change does mean that the format has to be bumped to nbformat v3.
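
A minimal sketch of the round trip, under some assumptions: the "input" key
and the helper names below are illustrative, not the actual nbformat API.
One detail worth noting is that a plain '\n'.join() of the split lines drops
a trailing newline, so this sketch keeps the line endings when splitting and
joins with the empty string instead, which restores the source exactly.

```python
import json


def source_to_lines(source):
    # keepends=True leaves each line's newline attached, so joining the
    # pieces with "" reproduces the original string exactly.
    return source.splitlines(True)


def lines_to_source(lines):
    return "".join(lines)


cell_source = "import numpy as np\nx = np.arange(10)\nprint(x.sum())\n"

lines = source_to_lines(cell_source)

# With indent set, each list element lands on its own line in the file,
# which is what lets a VC tool produce line-by-line diffs of the input.
serialized = json.dumps({"input": lines}, indent=1)

restored = lines_to_source(json.loads(serialized)["input"])
```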
> With #1, one would naturally only commit to VC the ipynb file, leaving
> the output ones to be always ignored. People could obviously choose
> to commit the output as well, at their own risk. #2 would make it much
> easier to get line-by-line diffs of any input (code or text).
> I think together, these two changes mostly solve the problems I've
> encountered in practice so far. I'm trying really hard to eat our own
> dogfood by using these tools in actual, everyday research work, so
> that we see the problems first. And while I think the notebook is
> reaching a point where it's a great working environment (even if we
> have a ton of ideas for improvements already and things we know need
> fixing), it's clear now to me that we fail pretty badly as a
> version-controllable format.
> I realize that implementing something like this will add non-trivial
> complexity to the format read/write code in a number of places, so if
> anyone sees a simpler solution to the problem, we're all ears. But we
> do need to figure out how to make the notebooks first-class citizens
> in a VC world; the (effectively) opaque binary blobs they are now just
> won't cut it in the long run.
Yes, we do need to do better.
> Thoughts, ideas?
I think this sounds like a good start, with the only change that we still
allow (optionally) outputs in a single file via the download button, rather
than the notebook format being canonically multifile, which is just too
I think the key-order issue you mention in the addendum is easily fixed by
specifying `sort_keys=True` in the json dump.
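
A quick illustration with the standard library json module; the cell
dictionaries and their keys are made up for the example. Two dicts with the
same content but different key order serialize to the same text once
`sort_keys=True` is set, so key order alone can no longer produce a diff.

```python
import json

# Same content, different key insertion order (hypothetical cell layout).
cell_a = {"input": ["x = 1\n"], "cell_type": "code", "language": "python"}
cell_b = {"language": "python", "cell_type": "code", "input": ["x = 1\n"]}

# sort_keys=True writes the keys in sorted order regardless of how the
# dict was built; indent=1 puts each key on its own line.
dump_a = json.dumps(cell_a, sort_keys=True, indent=1)
dump_b = json.dumps(cell_b, sort_keys=True, indent=1)

same_text = dump_a == dump_b
```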