[IPython-User] questions about IPython.parallel

Francesco Montesano franz.bergesund@gmail....
Wed Oct 24 15:07:37 CDT 2012


Hi Min,

thanks for the answer

2012/10/24 MinRK <benjaminrk@gmail.com>:
>
>
> On Wed, Oct 24, 2012 at 3:36 AM, Francesco Montesano
> <franz.bergesund@gmail.com> wrote:
>>
>> Dear list,
>>
>> I have a bunch of code designed to repeat the same operation over a
>> (possibly large) number of files. So after discovering IPython.parallel
>> not long ago, I decided to rewrite it to give me the possibility to use
>> a task scheduler (I use load_balanced_view) in order to make the best
>> possible use of my quad core machines.
>> Here is the typical structure of my code:
>>
>> ###### BEGIN example.py ######
>> #imports
>>
>> def command_line_parsing( ... ):
>>    "in my case argparse"
>>
>> def do_some_operation( ... ):
>>   "executes some mathematical operation"
>>
>> def read_operate_save_file( file, ... ):
>>     """reads the file, does operations and save to an output file"""
>>     input = np.loadtxt( file )
>> [1] do_some_operation(   )
>>     np.savetxt( outfile, ..... )
>>
>> if __name__ == "__main__":
>>
>>     args = command_line_parsing( )
>>
>>     #parallelisation can be chosen or not
>>     if args.parallel :
>>         #checks that IPython is there, that an ipcluster has been started
>>         #initialises a Client and a load_balanced_view. I can pass a string
>>         #or list of strings to be executed on all engines (I use it to
>>         #"import xxx as x" )
>>         lview = IPp.start_load_balanced_view( to_execute )
>>
>>     if not args.parallel:   #for serial computation
>> [2]     for fn in args.ifname:  #file name loop
>>             output = read_operate_save_file(fn, dis, **vars(args) )
>>     else:   #I want parallel computation
>> [3]     runs = [ lview.apply( read_operate_save_file,
>>                               os.path.abspath(fn.name), ... ) for fn in args.ifname ]
>>         results = [r.result for r in runs]
>>
>> ###### END example.py ######
>>
>> I have two questions:
>> [1] In function 'read_operate_save_file', I call 'do_some_operation'.
>> When I work in serial mode, everything works fine, but in parallel mode
>> I get the error
>> "IPython.parallel.error.RemoteError: NameError(global name
>> 'do_some_operation' is not defined)"
>> I'm not surprised by this, as I imagine that each engine knows only what
>> has been executed or defined before and that lview.apply( func, ... )
>> just passes the "func" to the engines. A solution that I see is to run
>> "from example import do_some_operation" on the engines when initialising
>> the load_balanced_view. Is there any easier/safer way?
>
>
>
> This namespace issue is common, and I have explanations scattered about the
> internet:
>
> http://stackoverflow.com/a/12307741/938949
> http://stackoverflow.com/a/10859394/938949
> https://github.com/ipython/ipython/issues/2489
> http://ipython.org/ipython-doc/dev/parallel/index.html
>
> Which I really need to consolidate into a single thorough explanation with
> examples.
>
> But the gist:
>
> - If a function is importable (e.g. in a module available both locally and
> remotely), then it's no problem
> - If it is defined in __main__ (e.g. in a script), then any references will
> be resolved in the *engine* namespace
>
> I recommend conforming to the first case if feasible, because then there
> should be no surprises.
> Everything surprising happens when you depend on references in
> `__main__` or the current working dir (e.g. locally imported modules), since
> `__main__` is not the same on the various machines, nor is the working dir
> (necessarily).
>
> That said, if the names you need to resolve are few, a simple import/push
> step with a DirectView to set up namespaces should be all you need prior to
> submitting tasks (assuming new engines are not arriving in mid-computation).
>
> e.g.:
>
> rc = Client()
> dv = rc[:]
> # push any locally defined functions that your task function uses:
> dv['do_some_operation'] = do_some_operation
I ended up doing the following when initialising the load_balanced_view:
dv.execute( 'import sys' )
dv.execute( 'sys.path.append("path_to_example.py")' )
dv.execute( 'from example import do_some_operation' )
Your suggestion looks much neater; I have just a couple of questions.
With the push that you suggest, do I simply call
'do_some_operation' as in my example, or do I need a different
syntax?
Do you think that one way or the other is more efficient when the
function is called and executed?

> # perform any imports that are needed:
> dv.execute("import numpy as np...")
> # continue as before:
> lview = IPp.start_load_balanced_view( to_execute )
> ...
>
>
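The namespace rule from the gist above can be imitated without a cluster.
The following is only a sketch under stated assumptions: it uses a plain
dict (`engine_ns`) as a stand-in for an engine's namespace and a
hypothetical helper `run_in_namespace`; neither is part of IPython.parallel,
and the real library ships functions to the engines by its own mechanism.
It illustrates why the NameError appears and that, once the name exists in
the namespace, the task function calls `do_some_operation` with no special
syntax.

```python
import types


def do_some_operation(x):
    """Stand-in for the mathematical helper from example.py."""
    return 2 * x


def read_operate(x):
    # 'do_some_operation' is a global name: it is looked up at call time
    # in whatever namespace the function's code is bound to.
    return do_some_operation(x) + 1


def run_in_namespace(func, namespace, *args):
    """Hypothetical helper: rebind func's code to `namespace` and call it.

    This only mimics "references are resolved in the engine namespace";
    it is not how IPython.parallel is implemented.
    """
    rebound = types.FunctionType(func.__code__, namespace, func.__name__)
    return rebound(*args)


# An empty dict plays the role of a freshly started engine.
engine_ns = {}

try:
    run_in_namespace(read_operate, engine_ns, 3)
    error_name = None
except NameError as exc:
    # Same kind of failure as the RemoteError quoted in question [1].
    error_name = type(exc).__name__

# Analogue of: dv['do_some_operation'] = do_some_operation
engine_ns["do_some_operation"] = do_some_operation

# The task function is unchanged; the name now resolves.
pushed_result = run_in_namespace(read_operate, engine_ns, 3)
```

Under these assumptions the first call fails with a NameError and the second
one succeeds, which is the behaviour one would expect from either the push
or the "from example import ..." approach on real engines.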
>>
>>
>> [2] Because of the way I parse my command line arguments, args.ifname is
>> a list of already opened files. In serial mode, this is no problem, but
>> when I assign the function to the scheduler passing the file, I get an
>> error saying that it cannot work on a closed file. If I pass the file
>> name with the absolute path, numpy can read it without problem. Is this
>> behaviour to be expected or a bug?
>
>
> I would expect a PickleError when you try to send an open file.  Definitely
> send filenames, not open file objects.
Just out of curiosity: what is the working directory of the engines? Is it
the one where ipcluster was started or the one where the profile is stored?
(While fixing my code, I ended up passing the filename with the full path.)
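
The point about sending filenames rather than open file objects can be
checked with the standard pickle module alone. This is a small sketch, not
IPython.parallel code: the temporary file and the variable names are made up
for illustration, and the exact exception type raised for an open file is
assumed to be TypeError or pickle.PicklingError depending on the Python
version.

```python
import os
import pickle
import tempfile

# Create a small throwaway input file (illustrative data only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1.0 2.0\n")
    tmp_name = tmp.name

with open(tmp_name) as handle:
    # An open file object cannot be serialised, so it cannot be shipped
    # to another process as a task argument.
    try:
        pickle.dumps(handle)
        handle_picklable = True
    except (TypeError, pickle.PicklingError):
        handle_picklable = False

    # What can be sent instead: the absolute path taken from the handle,
    # as in os.path.abspath(fn.name) in the example script.
    path = os.path.abspath(handle.name)

# A string survives the round trip and can be reopened on the other side,
# provided the file is visible there under the same path.
restored_path = pickle.loads(pickle.dumps(path))
with open(restored_path) as reopened:
    first_line = reopened.readline()

os.remove(tmp_name)
```

Using the absolute path also sidesteps the question of which working
directory the receiving process happens to run in.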

Thanks again,

Francesco

>
>>
>>
>> Thanks for any help,
>>
>> Cheers,
>> Francesco
>> _______________________________________________
>> IPython-User mailing list
>> IPython-User@scipy.org
>> http://mail.scipy.org/mailman/listinfo/ipython-user
>

