[Numpy-discussion] Announcing toydist, improving distribution and packaging situation

René Dudfield renesd@gmail....
Wed Dec 30 05:15:45 CST 2009


hello again,

On Tue, Dec 29, 2009 at 2:22 PM, David Cournapeau <cournape@gmail.com> wrote:
> On Tue, Dec 29, 2009 at 10:27 PM, René Dudfield <renesd@gmail.com> wrote:
>> Hi,
>>
>> In the toydist proposal/release notes, I would address 'what does
>> toydist do better' more explicitly.
>>
>>
>>
>> **** A big problem for science users is that numpy does not work with
>> pypi + (easy_install, buildout or pip) and python 2.6. ****
>>
>>
>>
>> Working with the rest of the python community as much as possible is
>> likely a good goal.
>
> Yes, but it is hopeless. Most of what is being discussed on
> distutils-sig is useless for us, and what matters is ignored at best.
> I think most people on distutils-sig are misguided, and I don't think
> the community is representative of people concerned with packaging
> anyway - most of the participants seem to be around web development,
> and are mostly dismissive of others' concerns (OS packagers, etc...).
>

Sitting down with Tarek (one of the current distutils maintainers) in
Berlin, we had a little discussion about packaging over pizza and
beer... and he was quite mindful of OS packagers' problems and issues.
He was also interested to hear about game developers' issues with
packaging (which are different again from scientific users'... but
similar in many ways).

However these systems were developed by the zope/plone/web crowd, so
they are naturally going to be thinking a lot about zope/plone/web
issues.  Debian and Ubuntu packages are mostly useless to them because
of their age.  Waiting a couple of years for your package to be
released is just not an option (waiting even an hour for bug fixes is
sometimes not an option).  Isolation of packages is also needed on
machines that run hundreds of different applications, written by
different people, with dozens of packages used by each application.

Tools like checkinstall and stdeb ( http://pypi.python.org/pypi/stdeb/
) can help with older style packaging systems like deb/rpm.  Perhaps
if toydist included something like stdeb, not as an extension to
distutils but as a standalone tool (like toydist itself), there would
be fewer problems with it.

One thing the various zope related communities do is make sure all the
relevant and needed packages are built/tested by their compile farms.
This makes pypi work for them a lot better than a non-coordinated
effort does.  There are also lots of people trying out new versions
all of the time.


> I want to note that I am not starting this out of thin air - I know
> most of distutils code very well, I have been mostly the sole
> maintainer of numpy.distutils for 2 years now. I have written
> extensive distutils extensions, in particular numscons which is able
> to fully build numpy, scipy and matplotlib on every platform that
> matters.
>
> Simply put, distutils code is horrible (this is an objective fact) and
>  flawed beyond repair (this is more controversial). IMHO, it has
> almost no useful feature, except being standard.
>

yes, I have also battled with distutils over the years.  However it is
simpler than autotools (for me... maybe distutils has perverted my
fragile mind), and works on more platforms for python than any other
current system.  It is much worse for C/C++ modules though.  It needs
dependency and configuration tools to work better (like those many
C/C++ projects hack into distutils themselves).

Monkey patching and extensions are especially a problem... as is the
horrible code quality of distutils by modern standards.  However
distutils has had more tests and testing systems added, so
refactoring and cleaning up of distutils can happen more readily.


> If you want a more detailed explanation of why I think distutils and
> all tools on top are deeply flawed, you can look here:
>
> http://cournape.wordpress.com/2009/04/01/python-packaging-a-few-observations-cabal-for-a-solution/
>

I agree with many things in that post, except your conclusion on
multiple versions of packages in isolation.  Package isolation is like
processes, and package sharing is like threads - and threads are evil!
 Leave my python site-packages directory alone I say... and especially
don't let setuptools infect it :)  Many people currently find that the
multiple-versions-in-isolation approach works well for them - so for
some use cases the tools are working wonderfully.


>> numpy used to work with buildout in python2.5, but not with 2.6.
>> buildout lets other team members get up to speed with a project by
>> running one command.  It installs things in the local directory, not
>> system wide.  So you can have different dependencies per project.
>
> I don't think it is a very useful feature, honestly. It seems to me
> that they created a huge infrastructure to split packages into tiny
> pieces, and then try to get them back together, imagining that
> multiple installed versions is a replacement for backward
> compatibility. Anyone with extensive packaging experience knows that's
> a deeply flawed model in general.
>

Science is supposed to allow repeatability.  Without the same versions
of packages, repeating experiments is harder.  Repeatability is a big
problem in science, and multiple versions of packages in _isolation_
can help solve it.

Just pick some random paper and try to reproduce their results.  It's
generally very hard, unless the software is quite well packaged.
Especially for graphics related papers, there are often many different
types of environments, so setting up the environments to try out their
techniques, and verify results quickly is difficult.

Multiple versions are not a replacement for backwards compatibility,
just a way to sidestep the problem in the short term so you are not
blocked.  If a new package version breaks your app, then you can
either pin it to an old version, fix your app, or fix the package.  It
is also not a replacement for building on stable, high quality
components, but it helps you work with less stable, lower quality
components - at a much faster rate of change, with a much larger
dependency list.
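For example, buildout's [versions] section lets a project pin exact
versions, so re-running the build later fetches the same code (a
minimal sketch; the part name, recipe, and version numbers here are
made up for illustration):

```
[buildout]
parts = app
versions = versions

[versions]
numpy = 1.3.0
scipy = 0.7.1

[app]
recipe = zc.recipe.egg
eggs = numpy
       scipy
```

With the pins in place, "breaks on upgrade" becomes a deliberate
choice to edit the [versions] section rather than a surprise.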


>> Plenty of good work is going on with python packaging.
>
> That's the opposite of my experience. What I care about is:
>  - tools which are hackable and easily extensible
>  - robust install/uninstall
>  - real, DAG-based build system
>  - explicitness and repeatability
>
> None of this is supported by the tools, and the current directions go
> even further away. When I have to explain at length why the
> command-based design of distutils is a nightmare to work with, I don't
> feel very confident that the current maintainers are aware of the
> issues, for example. It shows that they never had to extend distutils
> much.
>

All agreed!  I'd add to the list parallel builds/tests (make -j 16),
and outputting to native build systems, e.g. Xcode and MSVC projects,
and makefiles.

It would be interesting to know your thoughts on buildout recipes (
see creating recipes http://www.buildout.org/docs/recipe.html ).  They
seem to work better from my perspective.  However, that is probably
because of isolation: the recipes are only used by those projects that
require them, so the chance of them interacting is lower, as they are
not installed in the main python.
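For reference, a recipe is just a class buildout instantiates with its
configuration and then drives via install()/update().  A minimal
sketch following the interface described on that recipe page (the
recipe itself is a made-up "hello" example):

```python
# Minimal sketch of a buildout recipe: buildout constructs the class
# with (buildout, name, options) and calls install() on first run,
# update() on later runs with unchanged configuration.
import os


class HelloRecipe(object):
    def __init__(self, buildout, name, options):
        self.name = name
        self.options = options
        # Default the target path into buildout's parts directory.
        options.setdefault(
            'target',
            os.path.join(buildout['buildout']['parts-directory'], name))

    def install(self):
        # Create the part and return the paths buildout should track;
        # buildout removes them when the part is uninstalled.
        path = self.options['target']
        if not os.path.exists(path):
            os.makedirs(path)
        return [path]

    # Re-running with unchanged config can just redo the same work here.
    update = install
```

Because buildout only loads a recipe for the parts that name it, two
projects with conflicting recipes never see each other.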

How will you handle toydist extensions so that multiple extensions do
not have problems with each other?  I don't think this is possible
without isolation, and even then it's still a problem.

Note, the section in the distutils docs on creating command extensions
is only around three paragraphs.  There is also no central place (that
I know of) to go looking for extra commands, or to document and share
each other's command extensions.

Many of the methods for extending distutils are not very well
documented either.  For example, 'how do you change compiler command
line arguments for certain source files?'  Basic things like that are
possible with distutils, but not documented (very well).
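The usual (undocumented) trick for per-file flags is to wrap the
compiler's per-source _compile hook from a build_ext subclass.  Here
is a sketch of just the wrapping pattern, with the compiler hook faked
out so the idea is visible; EXTRA_FLAGS and the file names are made up:

```python
# Hypothetical per-file compiler flags, keyed by source path.
EXTRA_FLAGS = {
    'src/fastmath.c': ['-ffast-math'],
}


def with_per_file_flags(original_compile):
    """Wrap a distutils-style _compile(obj, src, ext, cc_args,
    extra_postargs, pp_opts) hook so selected sources get extra flags."""
    def _compile(obj, src, ext, cc_args, extra_postargs, pp_opts):
        flags = list(extra_postargs) + EXTRA_FLAGS.get(src, [])
        return original_compile(obj, src, ext, cc_args, flags, pp_opts)
    return _compile
```

In a real build_ext subclass you would apply this inside
build_extensions(), roughly
`self.compiler._compile = with_per_file_flags(self.compiler._compile)`
- which works, but you would never learn it from the docs.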




>>
>> There are build farms for windows packages and OSX uploaded to pypi.
>> Start uploading pre releases to pypi, and you get these for free (once
>> you make numpy compile out of the box on those compile farms).  There
>> are compile farms for other OSes too... like ubuntu/debian, macports
>> etc.  Some distributions even automatically download, compile and
>> package new releases once they spot a new file on your ftp/web site.
>
> I am familiar with some of those systems (PPA and opensuse build
> service in particular). One of the goal of my proposal is to make it
> easier to interoperate with those tools.
>

yeah, cool.

> I think Pypi is mostly useless. The lack of enforced metadata is a big
> no-no IMHO. The fact that Pypi is miles behind CRAN, for example, is
> quite significant. I want CRAN for scientific python, and I don't see
> Pypi becoming it in the near future.
>
> The point of having our own Pypi-like server is that we could do the following:
>  - enforcing metadata
>  - making it easy to extend the service to support our needs
>

Yeah, cool.  Many other projects have their own servers too.
pygame.org, plone, etc etc, which meet their own needs.  Patches are
accepted for pypi btw.

What types of metadata enforcement, and how would they help?  I
imagine this could be done in a number of ways with pypi:
- a distutils command extension that people could use.
- changing the pypi source code.
- checking the metadata of certain packages, then emailing their
authors about the issues.
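Whichever route, the check itself is simple.  A toy sketch of the kind
of validation a stricter index could enforce (not real pypi code; the
required-field list is my assumption, and 'UNKNOWN' is the distutils
placeholder for unset fields):

```python
# Fields a stricter index might insist on for every upload.
REQUIRED_FIELDS = ('name', 'version', 'license', 'summary', 'author_email')


def check_metadata(meta):
    """Return a list of problems with a package's metadata dict."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = meta.get(field, '').strip()
        # distutils fills unset metadata with the string 'UNKNOWN'.
        if not value or value.upper() == 'UNKNOWN':
            problems.append('missing or empty field: %s' % field)
    return problems
```

Run at upload time it rejects the package; run over the existing index
it produces exactly the "email the authors" list above.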


>>
>> pypm:  http://pypm.activestate.com/list-n.html#numpy
>
> It is interesting to note that one of the maintainer of pypm has
> recently quit the discussion about Pypi, most likely out of
> frustration with the other participants.
>

yeah, big mailing list discussions hardly ever help I think :)  oops,
this is turning into one.


>> Documentation projects are being worked on to document, give tutorials
>> and make python packaging be easier all round.  As witnessed by 20 or
>> so releases on pypi every day(and growing), lots of people are using
>> the python packaging tools successfully.
>
> This does not mean much IMO. Uploading on Pypi is almost required to
> use virtualenv, buildout, etc.. An interesting metric is not how many
> packages are uploaded, but how much it is used outside developers.
>

Yeah, it only means that there are lots of developers able to use the
packaging system to put their own packages up there.  However there
are over 500 science related packages on there now - which is pretty
cool.

A way to measure packages being used would be by downloads, and by
which packages depend on which other packages.  I think the science
ones would be reused less than average, since a much higher percentage
are C/C++ based, and so are likely to be more fragile packages.


>>
>> I'm not sure making a separate build tool is a good idea.  I think
>> going with the rest of the python community, and improving the tools
>> there is a better idea.
>
> It has been tried, and IMHO has been proved to have failed. You can
> look at the recent discussion (the one started by Guido in
> particular).
>

I don't think 500+ science related packages is a total failure really.


>> pps. some notes on toydist itself.
>> - toydist convert is cool for people converting a setup.py .  This
>> means that most people can try out toydist right away.  but what does
>> it gain these people who convert their setup.py files?
>
> Not much ATM, except that it is easier to write a toysetup.info
> compared to setup.py IMO, and that it supports a simple way to include
> data files (something which is currently *impossible* to do without
> writing your own distutils extensions). It has also the ability to
> build eggs without using setuptools (I consider not using setuptools a
> feature, given the too many failure modes of this package).
>

yeah, I always avoid setuptools in my packages by default.  However I
use command line arguments to enable the setuptools features required
(eggs, bdist_mpkg, etc.).
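The pattern is just to look at which command was requested before
deciding what to import.  A sketch of that decision (the exact command
set here is my assumption, based on the features mentioned):

```python
# Commands that only exist when setuptools is loaded (assumed list).
SETUPTOOLS_COMMANDS = set(['bdist_egg', 'develop', 'easy_install',
                           'bdist_mpkg'])


def needs_setuptools(argv):
    """True if any requested setup.py command requires setuptools."""
    return bool(SETUPTOOLS_COMMANDS.intersection(argv))
```

setup.py would then do
`if needs_setuptools(sys.argv): from setuptools import setup` and
otherwise `from distutils.core import setup`, so plain builds never
touch setuptools at all.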

Having a tool to create eggs without setuptools would be great in
itself.  Definitely list this in the feature list :)


> The main goals though are to make it easier to build your own tools on
> top of if, and to integrate with real build systems.
>

yeah, cool.

>> - a toydist convert that generates a setup.py file might be cool :)
>
> toydist started like this, actually: you would write a setup.py file
> which loads the package from toysetup.info, and can be converted to a
> dict argument to distutils.core.setup. I have not updated it recently,
> but that's definitely on the TODO list for a first alpha, as it would
> enable people to benefit from the format, with 100 % backward
> compatibility with distutils.
>

yeah, cool.  That would let you develop things incrementally too, and
still have toydist be useful for the whole development period, until
it catches up with the distutils features needed.
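To make the bridge concrete: a setup.py could read the metadata out of
toysetup.info and hand it to setup().  This is purely hypothetical -
the real toysetup.info format is defined by toydist and isn't shown in
this thread, so the INI-style layout and field names below are guesses:

```python
# Hypothetical toysetup.info -> distutils bridge.  Assumes an
# INI-style file with a [package] section; the real format may differ.
try:
    from ConfigParser import ConfigParser   # python 2
except ImportError:
    from configparser import ConfigParser   # python 3


def info_to_setup_kwargs(path):
    """Turn a toysetup.info-like file into a setup() keyword dict."""
    cfg = ConfigParser()
    cfg.read(path)
    kwargs = dict(cfg.items('package'))
    # Multi-valued fields become whitespace-separated lists.
    if 'packages' in kwargs:
        kwargs['packages'] = kwargs['packages'].split()
    return kwargs

# setup.py would then end with:
#   from distutils.core import setup
#   setup(**info_to_setup_kwargs('toysetup.info'))
```

That way one file drives both toydist and a 100% distutils-compatible
setup.py during the transition.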

>> - arbitrary code execution happens when building or testing with
>> toydist.
>
> You are right for testing, but wrong for building. As long as the
> build is entirely driven by toysetup.info, you only have to trust
> toydist (which is not safe ATM, but that's an implementation detail),
> and your build tools of course.
>

If you execute build tools on arbitrary code, then arbitrary code
execution is easy for someone who wants to do bad things.  Trust, and
secondarily sandboxing, are the best ways to solve these problems imho.

> Obviously, if you have a package which uses an external build tool on
> top of toysetup.info (as will be required for numpy itself for
> example), all bets are off. But I think that's a tiny fraction of the
> interesting packages for scientific computing.
>

yeah, currently about 1/5th of science packages use
C/C++/Fortran/Cython etc (see
http://pypi.python.org/pypi?:action=browse&c=40 - 110/458 on that
page).  There seem to be a lot more using C/C++ compared to other
types of packages on there (eg zope3 packages list 0 out of 900
packages using C/C++).

So the high number of C/C++ science related packages on pypi
demonstrates that better C/C++ tools for scientific packages are a big
need.  Especially getting compile/testing farms for all these
packages.  Compile farms are a bigger need than for pure python
packages, since C/C++ is MUCH harder to write/test in a portable way.
I would say it is close to impossible to get code working without
errors on multiple platforms unless you have quite good knowledge of
each of them.  There are many times with pygame development that I
make changes on an osx, windows or linux box, commit the change, then
wait for the compile/tests to run on the build farm (
http://thorbrian.com/pygame/builds.php ).  Releasing packages
otherwise makes the process *heaps* longer... and even so I still get
errors on different platforms, despite many years of multi platform
coding.


> Sandboxing is particularly an issue on windows - I don't know a good
> solution for windows sandboxing, outside of full vms, which are
> heavy-weights.
>

yeah, VMs are the way to go, if only to make each build a fresh
install.  However I think automated distributed building, and trust,
are more useful: only build those packages whose authors you trust,
and let anyone download, build, and then post their build/test
results.  MS has given out copies of windows to some members of the
python community in the past, to set up VMs for building.

By automated distributed building, I mean what happens with mailing
lists usually.  Where people post their test results when they have a
problem.  Except in a more automated manner.  Adding a 'Do you want to
upload your build/test results?' at the end of a setup.py for
subversion builds would give you dozens or hundreds of test results
daily from all sorts of machines.  Making it easy for people to set up
package builders which also upload their packages somewhere gives you
distributed package building, in a fairly safe automated manner.
(more details here:
http://renesd.blogspot.com/2009/09/python-build-bots-down-maybe-they-need.html
)


>> - it should be possible to build this toydist functionality as a
>> distutils/distribute/buildout extension.
>
> No, it cannot, at least as far as distutils/distribute are concerned
> (I know nothing about buildout). Extending distutils is horrible, and
> fragile in general. Even autotools with its mix of generated sh
> scripts through m4 and perl is a breeze compared to distutils.
>
>> - extending toydist?  How are extensions made?  there are 175 buildout
>> packages which extend buildout, and many that extend
>> distutils/setuptools - so extension of build tools in a necessary
>> thing.
>
> See my answer earlier about interoperation with build tools.
>

I'm still not clear on how toydist will be extended.  I am however, a
lot clearer about its goals.



cheers,

