[IPython-dev] [sympy] Re: using reST for representing the notebook cells+text

Robert Kern robert.kern@gmail....
Wed Feb 24 17:47:24 CST 2010


On Wed, Feb 24, 2010 at 17:04, Mikhail Terekhov <termim@gmail.com> wrote:
> On Wed, Feb 24, 2010 at 4:04 PM, Robert Kern <robert.kern@gmail.com> wrote:
>>
>> I am almost certain that their use cases and workloads are much
>> different than the notebook's would be. Python's parser isn't exactly
>> a speed demon, either. A general statement like "XML is slow" followed
>> by an unrelated anecdote is not terribly convincing. Show me
>> experiments. I've attached mine. Python ends up being about 3 times
>> slower than the equivalent XML for a variety of file sizes.
>
> Believe it or not, I can't find any example of a Python project that started
> implementing a scientific notebook as an XML document and then switched to
> something else :) I've used a "scientific analogy" principle. Seriously,
> Subversion is a real project and they really suffered from the decision to
> use XML as storage for the workspace metadata and they really switched away
> from XML. No anecdotes.

Yes, that is an anecdote. Anecdotes are true stories, but they are not
convincing data. This comparison is by no means scientific. There are
large differences between the use cases here. The simple text format
that Subversion moved to is not comparable to the Python format being
discussed here. Subversion faced different read/write loads than a
notebook will. I'm happy to consider other examples of projects
finding that their XML parsers were too slow, but you need to make a
more considered argument that the circumstances are similar enough to
the one we are talking about.

XML is not slow. XML is not fast. It cannot be either thing because it
is a file format, not an implementation. Parsers can be slow or fast.
cElementTree is particularly fast and rather faster than the Python
parser on equivalent data.

> IMHO the relation is quite simple - the things like Mathematica's
> notebooks tend to
> multiply and form libraries or collections. In this case XML parsing
> could become a
> problem.

I'm sorry, but this does not follow.

> Your example is not quite correct but it is a good start :} It
> actually illustrates two
> important points. First is that writing a serializer that produces an XML
> representation
> is easy. The second and more important is that after parsing XML you've got
> nothing but internal XML representation and the only thing you can do with it is
> to write back to a file, you still need to implement all the
> functionality as a python
> objects and convert the XML tree into python objects tree. Only after
> that there
> will be any point in benchmarking. On the contrary, the Python
> version is complete
> and if you had the real Notebook implementation then you would be
> ready to use it.
> Also note that there is no need to write, debug and support _any_
> parser/reader,
> python provides that for free.

I intended to measure the performance of the parsing, not the
difficulty of implementing anything else. The cost of constructing
the objects is rather small, at least for XML. It is a
negligible cost on top of the XML parsing, but executing the Python
code actually imposes a much larger burden on the Python
implementation. See the attached updated benchmark. The Python method
now takes about 4x as much time to construct the Notebook instance
as the XML.
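
A minimal sketch of this kind of comparison follows. It is not the
attached notebook.py; the Cell and Notebook classes, the XML layout, and
the helper names are all hypothetical stand-ins, and the timings it
prints will depend on the machine and the Python version.

```python
import timeit
import xml.etree.ElementTree as ET


class Cell:
    """Hypothetical stand-in for a notebook cell."""
    def __init__(self, kind, text):
        self.kind = kind
        self.text = text


class Notebook:
    """Hypothetical stand-in for the notebook container."""
    def __init__(self, cells):
        self.cells = list(cells)


def make_xml(n):
    # Build an XML document with n <cell> elements.
    root = ET.Element("notebook")
    for i in range(n):
        cell = ET.SubElement(root, "cell", kind="code")
        cell.text = "x = %d" % i
    return ET.tostring(root, encoding="unicode")


def make_python(n):
    # Build equivalent Python source that constructs the same objects.
    lines = ["nb = Notebook(["]
    for i in range(n):
        lines.append("    Cell(%r, %r)," % ("code", "x = %d" % i))
    lines.append("])")
    return "\n".join(lines)


def load_xml(source):
    # Parse the XML, then convert the element tree into Notebook/Cell objects.
    root = ET.fromstring(source)
    return Notebook(Cell(el.get("kind"), el.text) for el in root.iter("cell"))


def load_python(source):
    # Compile and execute the Python source to obtain the Notebook object.
    namespace = {"Notebook": Notebook, "Cell": Cell}
    exec(compile(source, "<notebook>", "exec"), namespace)
    return namespace["nb"]


def benchmark(n=2000, repeat=5):
    # Return the best-of-repeat wall time for each loader.
    xml_source, py_source = make_xml(n), make_python(n)
    t_xml = min(timeit.repeat(lambda: load_xml(xml_source), number=1, repeat=repeat))
    t_py = min(timeit.repeat(lambda: load_python(py_source), number=1, repeat=repeat))
    return t_xml, t_py


if __name__ == "__main__":
    t_xml, t_py = benchmark()
    print("XML: %.4fs  Python: %.4fs" % (t_xml, t_py))
```

Both loaders end with the same Notebook instance, so the comparison
covers parsing plus object construction rather than parsing alone.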

>> I'm not talking about other projects adopting anything. I'm talking
>> about basic capabilities of other languages, like JavaScript's builtin
>> support for parsing XML. That enables *us* to build things in
>> JavaScript.
>>
>>> BTW the fact that
>>> everyone can parse XML doesn't mean that every one can _use_ the
>>> data right away.
>>
>> Nor am I saying that. I am saying that it is enormously easier to
>> build the JavaScript parser for the XML representation rather than the
>> Python one.
>
> That is the real question - why does JavaScript need to read the _internal_
> representation of the nb if it is not going to implement all the
> needed functionality
> to use it?

File formats are not internal. They are external. They are the primary
method of interchange.

>>> One has to have an internal logic/library/API specific
>>> to the data represented by some particular XML document. If you take
>>> this into account then the value of the exchange document format
>>> somewhat reduces. It is still not zero though and IMHO it is easy to
>>> teach classes proposed by Brian to produce XML representation just
>>> for the mythical interchange with something :)
>>
>> The need for interchange is not at all mythical. Web front ends are
>> exactly what we are talking about in this thread.
>
> Sure, and it looks like in his very interesting approach the
> JavaScript part is a
> client that queries python server for information about nb it needs and there is
> no need for JS to read nb or even know how it is stored on disk.
>
> More general: internal representation does not have to be tightly coupled to
> interfaces to external systems.

Exactly. The Python code file format is a greater coupling to the
internal implementation, not a looser one.

> Simplicity and reliability of the
> internal representation
> (in this case - just a regular python compiler versus custom XML
> parser) outweighs
> the need to write relatively simple export/interface functions that
> give a view on the nb.

I disagree with that judgement and with the characterization of the
Python format as more simple and reliable.

> As Ondrej's work shows they are needed anyway and of course they can use XML if
> it is easy for the client.
>
>>>> JavaScript being the hugely important player here. Certainly, you are
>>>
>>> Again, it is important to define to what degree the interoperability with
>>> something like JavaScript is needed. If you plan to work on/modify/execute
>>> the same nbs in Python and in JavaScript then you have to implement
>>> compatible engine/API in Python _and_ in JavaScript. Are you sure you
>>> want to do that? If only the representation or "computed" notebook is
>>> needed for display purposes by JavaScript, then it is something different
>>> and could be implemented through specialized repr methods.
>>
>> Or you could use the same mechanism for both instead of duplicating efforts.
>
> Unfortunately one has to duplicate something in either case. nb->XML would
> duplicate nb->repr, but as your example shows the nb->XML is quite
> straightforward.
> In case XML->nb one have to duplicate python compiler which is unnecessary
> in case repr->nb.

Honestly, it's pretty trivial stuff.

>>>> going to have a Python API that will represent that tree of text nodes
>>>> as Python objects, but I just don't see the point of making the repr()
>>>> of that be the lingua franca format of the notebook file. It's just a
>>>> wasted opportunity.
>>>
>>> The point is that nb became a first class python object - just a module,
>>> no need for specialized parser and you can work with it as with regular
>>> Python module - just import and use it. The only difference is that nb is
>>> mutable - if you modified it then you have to save it.
>>
>> I really don't see why having the file format be Python code makes it
>> any more of a first class object. The objects are the first class
>
> You are right - not the first class, just a native python object.

A file with Python code is not a native Python object. From the API
user's perspective, they call a function or execute a statement and
then they have an object. It's exactly the same for any format.
Neither format has an advantage here.

>> objects. As long as loading to those objects is easy, the format just
>
> In a sense I agree, the only difference is that from the programming POV
> loading cost for repr->nb is zero (all is done by the regular python compiler)
> and XML->nb requires a special loader that should be maintained and updated
> when the application changes.

Actually, that raises another objection I have to using Python code as
the file format. The file format is intimately tied to the internal
implementation. What if we want to change the internal implementation?
It *will* happen. With a more neutral representation, you can read
older files as long as you update the reader appropriately. If you are
just executing code, you are stuck with maintaining the classes with
backwards compatible argument specs forever and won't be able to make
some of the changes that you want down the road. Format versioning is
a *huge* issue whenever you design file formats. XML and other generic
representations let you handle it in the reader.
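
As an illustration of that point, here is a minimal sketch of a
version-aware reader. Everything in it is an assumption made for the
example: the "version" attribute, the rename of a "type" attribute to
"kind", and the Cell class are hypothetical and not part of any
proposed notebook format.

```python
import xml.etree.ElementTree as ET


class Cell:
    """Current (hypothetical) internal class; 'kind' replaced an older 'type'."""
    def __init__(self, kind, source):
        self.kind = kind
        self.source = source


def _read_v1(root):
    # Version 1 files stored the cell type in a "type" attribute.
    return [Cell(el.get("type"), el.text or "") for el in root.iter("cell")]


def _read_v2(root):
    # Version 2 files renamed that attribute to "kind".
    return [Cell(el.get("kind"), el.text or "") for el in root.iter("cell")]


# One reader per format version; old files keep working after API changes.
READERS = {"1": _read_v1, "2": _read_v2}


def read_notebook(xml_text):
    root = ET.fromstring(xml_text)
    # Files without a version attribute are treated as version 1.
    version = root.get("version", "1")
    try:
        reader = READERS[version]
    except KeyError:
        raise ValueError("unsupported notebook format version: %r" % version)
    return reader(root)
```

The internal Cell signature changed between versions, yet only the
reader had to learn about it. With executable Python as the format, the
old constructor signature itself would have to stay callable.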

The reason that Mathematica can get away with this is that it is a
Lispish language. Although you correctly point out that being able to
do things like ExpressionCell isn't particularly important, being
able to load the code into a neutral tree structure and manipulate it
before actually instantiating your API classes is.

>> doesn't matter. Loading an object by importing is actually a very
>> inflexible and difficult to work with method compared to a function
>> call.
>
> If one prefers functions one can always use __import__() or imp.load_module()
> functions instead of import statement.

Abusing a Python-internal API as your main file loading API is just
not a good practice.
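
For comparison, a minimal sketch of a plain loading function. The name
load_notebook and the XML layout are hypothetical; the only library
behavior relied on is that ElementTree's parse() accepts either a
filename or a file object.

```python
import io
import xml.etree.ElementTree as ET


def load_notebook(source):
    """Return the cell texts from a filename or any binary file-like object."""
    root = ET.parse(source).getroot()
    return [el.text or "" for el in root.iter("cell")]


# The same call works for a path on disk, an in-memory buffer, a socket
# file, and so on; nothing has to be importable or on sys.path.
buf = io.BytesIO(b"<notebook><cell>a = 1</cell><cell>a + 1</cell></notebook>")
cells = load_notebook(buf)
```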

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
-------------- next part --------------
A non-text attachment was scrubbed...
Name: notebook.py
Type: application/octet-stream
Size: 1619 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/ipython-dev/attachments/20100224/cabf73ec/attachment.obj 

