[IPython-dev] pyspark and IPython

Nitin Borwankar nborwankar@gmail....
Thu Aug 29 17:49:28 CDT 2013


Brian,

This is awesome - Fernando did the "integration" I was looking for all the
way to the NB integration in the last 15 mins or so!

Bottom line
a) pyspark *is* already a basic Python command line that calls out to the Java
and Scala parts
b) setting the env var IPYTHON=1 gets you IPython instead of Python in (a)
c) our fearless leader just did the IPython NB part.

Advantages of having all the people in one room!  He asked them to open up
port 8888 on the cluster master.
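For the record, the sequence in (a)-(c) looks roughly like this from the shell. The IPYTHON_OPTS variable and the notebook flags are my assumption about how the pyspark launcher might pass options through, not something verified here:

```shell
# (a) plain pyspark is already a Python REPL wired to the JVM backend
./bin/pyspark

# (b) asking the launcher for IPython instead of plain Python
IPYTHON=1 ./bin/pyspark

# (c) assumed: a notebook variant, listening on the port opened
#     on the cluster master
IPYTHON=1 IPYTHON_OPTS="notebook --port=8888" ./bin/pyspark
```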

Nitin


------------------------------------------------------------------
Nitin Borwankar
nborwankar@gmail.com


On Thu, Aug 29, 2013 at 3:15 PM, Nitin Borwankar <nborwankar@gmail.com> wrote:

> Hi Brian,
>
> Yes, ok I wasn't clear either.  The meta thing is that IPython and the NB
> have spoilt me - I want IPy as a command line for everything, and to be
> able to launch all command-line programs from IPy and IPyNB.  So that's
> the meta goal.  With every new command line I encounter, I try `!ls`; if
> that doesn't work, it's no longer a good enough command line for me, and
> I go looking for a cell magic for it :-)
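The pattern behind such a cell magic is small enough to sketch. Everything here (the function name, the psql mention) is illustrative rather than an existing magic; the core move is just feeding the cell body to the tool's stdin:

```python
import subprocess

def run_through_tool(tool_argv, cell_body):
    """What a hypothetical %%sometool cell magic would do under the
    hood: hand the cell's text to an external command-line tool's
    stdin and return whatever the tool prints."""
    result = subprocess.run(
        tool_argv,
        input=cell_body,
        capture_output=True,
        text=True,
    )
    return result.stdout

# A real magic would register this with IPython, e.g. a %%psql magic
# calling run_through_tool(["psql"], cell) on the cell's contents.
```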
>
> In the context of Spark/Shark and family, they are early efforts and I
> want to be able to play with the many moving parts in there fast and
> furiously, without being limited by the earliness of their interface.  So
> if I can plug these into IPy then all the better.
>
> I am not sure if there's any value in integrating the two parallel
> computing models as they seem to serve different audiences.  The IPy
> parallel computing model seems closer to what the scientific community
> needs and at first sight the Spark/Shark model seems to serve the more
> business oriented data demographic.
>
> Where there is an intersection, IMHO, is in the ML area - we will learn
> about that tomorrow in the conference.  In any case in the spirit of "IPy
> over everything" I'd like to hope I can do some integration.
>
> Also Fernando is here too and we chatted at lunch but pretty much about
> everything else except the AMP stuff.  I think it makes more sense to wait
> till the end of day tomorrow to report on the content.
>
> Nitin
>
> P.S. At an IETF meeting decades ago, Vint Cerf wore a t-shirt that said
> "IP over everything", so "IPy over everything" is my homage to that t-shirt.
>
>
>
>
>
>
> On Thu, Aug 29, 2013 at 2:58 PM, Brian Granger <ellisonbg@gmail.com> wrote:
>
>> Sorry I wasn't clear in my question.  I am very aware of how amazing
>> Spark and Shark are.  I do think you are right that they are looking
>> very promising right now.  What I don't see is what IPython can offer
>> in working with them.  Given their architecture, I don't see how for
>> example you could run spark jobs from the IPython Notebook
>> interactively.  Is that the type of thing you are thinking about?  Or
>> are you thinking more about direct integration of Spark and
>> IPython.parallel?  I am mostly wondering what the benefit of
>> IPython+Spark integration would be.  I know that Fernando and Min have
>> talked with some of the AMP lab people and I would love to see what
>> can be done.  It would probably be best to sit down and talk further
>> with the Spark/Shark devs at some point.  But if you can learn more
>> about their architecture, investigate the possibilities, and report
>> back, that would be fantastic.
>>
>> On Thu, Aug 29, 2013 at 2:41 PM, Nitin Borwankar <nborwankar@gmail.com>
>> wrote:
>> > Hi Brian,
>> >
>> > The advantage, IMHO, is that pyspark and the larger UCB AMP effort are
>> > a huge open source effort for distributed parallel computing that
>> > improves upon the Hadoop model. Spark, the underlying layer, plus
>> > Shark, the Hive-compatible query language, add performance gains of
>> > 10x-100x.  The effort has 20+ companies contributing code, including
>> > Yahoo, and 70+ contributors. AMP has a $10M grant from the NSF.  So:
>> > a) it's not going away soon
>> > b) it may be hard to compete with it without that level of resources
>> > c) they do have a Python shell (I have not used it yet) and they appear
>> > committed to having Python as a first-class language in their effort.
>> > d) let's see if we can find ways to integrate with it.
>> >
>> > I think integration at the level of the interactive interface might make
>> > sense.
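For readers who haven't seen it, the programming model at issue is Spark's chain of transformations over a distributed collection. A toy, single-process pure-Python imitation of that shape (no actual pyspark involved, names my own) is:

```python
# A toy stand-in for the map/filter/reduce chaining that pyspark
# exposes; real Spark distributes these steps across a cluster.
from functools import reduce as _reduce

class ToyRDD:
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        return ToyRDD(fn(x) for x in self.data)

    def filter(self, pred):
        return ToyRDD(x for x in self.data if pred(x))

    def reduce(self, fn):
        return _reduce(fn, self.data)

total = (
    ToyRDD(range(5))
    .map(lambda x: x * x)        # 0, 1, 4, 9, 16
    .filter(lambda x: x > 0)     # 1, 4, 9, 16
    .reduce(lambda a, b: a + b)  # 30
)
```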
>> >
>> > Just my 2c, but I think this effort may leapfrog pure Hadoop over the
>> > next 2-3 years.
>> >
>> >
>> > Nitin.
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Thu, Aug 29, 2013 at 1:35 PM, Brian Granger <ellisonbg@gmail.com>
>> wrote:
>> >>
>> >> From a quick glance, it looks like both pyspark and IPython use
>> >> similar parallel computing models in terms of the process model.  You
>> >> might think that would help them to integrate, but in this case I
>> >> think it will get in the way of integration.  Without learning more
>> >> about the low-level details of their architecture it is really
>> >> difficult to know if it is possible or not.  But I think the bigger
>> >> question is what would the motivation for integration be?  Both
>> >> IPython and Spark provide self-contained parallel computing
>> >> capabilities - what use cases are there for using both at the same
>> >> time?  I think the biggest potential show stopper is that pyspark is
>> >> not designed in any way to be interactive, as far as I can tell.
>> >> Pyspark jobs basically run in batch mode, which is going to make it
>> >> really tough to fit into IPython's interactive model.  Worth looking
>> >> into more, though...
>> >>
>> >> Cheers,
>> >>
>> >> Brian
>> >>
>> >> On Thu, Aug 29, 2013 at 11:28 AM, Nitin Borwankar <nborwankar@gmail.com>
>> >> wrote:
>> >> > I'm at AmpCamp3 at UCB and see that there would be huge benefits to
>> >> > integrating pyspark with IPython and IPyNB.
>> >> >
>> >> > Questions:
>> >> >
>> >> > a) has this been attempted/done? If so, pointers please.
>> >> >
>> >> > b) does this overlap the IPyNB parallel computing effort in
>> >> > conflicting/competing ways?
>> >> >
>> >> > c) if this has not been done yet - does anyone have a sense of how
>> >> > much effort this might be? (I've done a small hack integrating
>> >> > postgres psql into ipynb, so I'm not terrified by that level of deep
>> >> > digging, but are there any show-stopper gotchas?)
>> >> >
>> >> > Thanks much,
>> >> >
>> >> > Nitin
>> >> >
>> >> > _______________________________________________
>> >> > IPython-dev mailing list
>> >> > IPython-dev@scipy.org
>> >> > http://mail.scipy.org/mailman/listinfo/ipython-dev
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Brian E. Granger
>> >> Cal Poly State University, San Luis Obispo
>> >> bgranger@calpoly.edu and ellisonbg@gmail.com
>> >
>> >
>> >
>>
>>
>>
>
>