[IPython-User] questions about IPython.parallel

Min RK benjaminrk@gmail....
Wed Oct 24 15:51:22 CDT 2012



On Oct 24, 2012, at 13:07, Francesco Montesano <franz.bergesund@gmail.com> wrote:

> Hi Min,
> 
> thanks for the answer
> 
> 2012/10/24 MinRK <benjaminrk@gmail.com>:
>> 
>> 
>> On Wed, Oct 24, 2012 at 3:36 AM, Francesco Montesano
>> <franz.bergesund@gmail.com> wrote:
>>> 
>>> Dear list,
>>> 
>>> I have a bunch of code designed to repeat the same operation over a
>>> (possibly large) number of files. So, after discovering IPython.parallel
>>> not long ago, I decided to rewrite it to give me the possibility to use a
>>> task scheduler (I use load_balance_view) in order to make the best
>>> possible use of my quad-core machines.
>>> Here is the typical structure of my code
>>> 
>>> ###### BEGIN example.py ######
>>> #imports
>>> 
>>> def command_line_parsing( ... ):
>>>   "in my case argparse"
>>> 
>>> def do_some_operation( ... ):
>>>  "executes some mathematical operation"
>>> 
>>> def read_operate_save_file( file, ... ):
>>>    """reads the file, does operations and saves to an output file"""
>>>    input = np.loadtxt( file )
>>> [1] do_some_operation(   )
>>>    np.savetxt( outfile, ..... )
>>> 
>>> if __name__ == "__main__":
>>> 
>>>    args = command_line_parsing( )
>>> 
>>>    #parallelisation can be chosen or not
>>>    if args.parallel :
>>>        #checks that IPython is there and that an ipcluster has been started;
>>>        #initialises a Client and a load_balance_view. I can pass a string or
>>>        #a list of strings to be executed on all engines (I use it to
>>>        #"import xxx as x")
>>>        lview = IPp.start_load_balanced_view( to_execute )
>>> 
>>>    if( args.parallel == False ):   #for serial computation
>>> [2]    for fn in args.ifname:  #file name loop
>>>            output = read_operate_save_file(fn, dis, **vars(args) )
>>>    else:   #I want parallel computation
>>> [3]    runs = [ lview.apply( read_operate_save_file,
>>>                 os.path.abspath(fn.name), ... ) for fn in args.ifname ]
>>>        results = [r.result for r in runs]
>>> 
>>> ###### END example.py ######
>>> 
>>> I have two questions:
>>> [1] In function 'read_operate_save_file', I call 'do_some_operation'.
>>> When I work in serial mode, everything works fine, but in parallel mode I
>>> get the error
>>> "IPython.parallel.error.RemoteError: NameError(global name
>>> 'do_some_operation' is not defined)"
>>> I'm not surprised by this, as I imagine that each engine knows only what
>>> has been executed or defined on it before, and that lview.apply( func, ... )
>>> just passes "func" to the engines. A solution that I see is to run
>>> "from example import do_some_operation" on the engines when initialising
>>> the load_balance_view. Is there any easier/safer way?
>> 
>> 
>> 
>> This namespace issue is common, and I have explanations scattered about the
>> internet:
>> 
>> http://stackoverflow.com/a/12307741/938949
>> http://stackoverflow.com/a/10859394/938949
>> https://github.com/ipython/ipython/issues/2489
>> http://ipython.org/ipython-doc/dev/parallel/index.html
>> 
>> Which I really need to consolidate into a single thorough explanation with
>> examples.
>> 
>> But the gist:
>> 
>> - If a function is importable (e.g. in a module available both locally and
>> remotely), then it's no problem
>> - If it is defined in __main__ (e.g. in a script), then any references will
>> be resolved in the *engine* namespace
>> 
>> I recommend conforming to the first case if feasible, because then there
>> should be no surprises.
>> Everything surprising happens when you depend on references in
>> `__main__` or the current working dir (e.g. locally imported modules), since
>> `__main__` is not the same on the various machines, nor is the working dir
>> (necessarily).
>> 
>> That said, if the names you need to resolve are few, a simple import/push
>> step with a DirectView to set up namespaces should be all you need prior to
>> submitting tasks (assuming new engines are not arriving in mid-computation).
>> 
>> e.g.:
>> 
>> rc = Client()
>> dv = rc[:]
>> # push any locally defined functions that your task function uses:
>> dv['do_some_operation'] = do_some_operation
> I ended up doing the following when initialising the load_balance_view:
> dv.execute( 'import sys' )
> dv.execute( 'sys.path.append("path_to_example.py")' )
> dv.execute( 'from example import do_some_operation' )
> Your suggestion looks much neater; just a couple of questions.
> With the push that you suggest, do I simply call 'do_some_operation'
> as in my example, or do I need some different syntax?
> Do you think one way or the other is more efficient when the
> function is called and executed?
> 
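Short answer: no different syntax is needed. dv['do_some_operation'] =
do_some_operation binds the name in each engine's namespace, which is exactly
where references from a __main__-defined task function get resolved, so the
call inside read_operate_save_file works as written. A minimal sketch of the
push-based setup (assuming `filenames` is your list of input paths, and that
both functions are defined in your script):

from IPython.parallel import Client

rc = Client()
dv = rc[:]

# push the helper so the name resolves on every engine
dv['do_some_operation'] = do_some_operation

# imports the task function needs on the engines
dv.execute("import numpy as np")

lview = rc.load_balanced_view()

# the task function still calls do_some_operation(...) by name, unchanged
runs = [lview.apply(read_operate_save_file, fname) for fname in filenames]
results = [r.result for r in runs]
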
>> # perform any imports that are needed:
>> dv.execute("import numpy as np...")
>> # continue as before:
>> lview = IPp.start_load_balanced_view( to_execute )
>> ...
>> 
>> 
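To illustrate the first case above (the importable-module route, which is the
one I'd aim for): if do_some_operation and read_operate_save_file live in a
module that every engine can import -- e.g. your example.py, provided it is on
the engines' sys.path -- then nothing has to be pushed at all, because the
function is looked up in its own module on the engines. A rough sketch (the
file layout and the '.out' suffix are just placeholders):

# example.py -- importable both locally and on every engine
import numpy as np

def do_some_operation(data):
    """stand-in for the real computation"""
    return data

def read_operate_save_file(fname, outfname):
    data = np.loadtxt(fname)
    result = do_some_operation(data)   # resolves inside the module, no push needed
    np.savetxt(outfname, result)
    return outfname

# driver script
from IPython.parallel import Client
from example import read_operate_save_file

rc = Client()
lview = rc.load_balanced_view()
runs = [lview.apply(read_operate_save_file, fn, fn + '.out') for fn in filenames]
results = [r.result for r in runs]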
>>> 
>>> 
>>> [2] Because of the way I parse my command line arguments, args.ifname is a
>>> list of already-opened files. In serial mode this is no problem, but when I
>>> assign the function to the scheduler passing the file, I get an error saying
>>> that it cannot work on a closed file. If I pass the file name with the
>>> absolute path, numpy can read it without problem. Is this a behaviour to be
>>> expected or a bug?
>> 
>> 
>> I would expect a PickleError when you try to send an open file.  Definitely
>> send filenames, not open file objects.
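
Concretely, since argparse has already opened the files for you, one option
(a sketch reusing the os.path.abspath(fn.name) call already in your script) is
to grab the paths and close the local handles before submitting:

import os

# args.ifname is a list of open file objects created by argparse;
# the engines only need the (absolute) paths
filenames = [os.path.abspath(f.name) for f in args.ifname]
for f in args.ifname:
    f.close()

runs = [lview.apply(read_operate_save_file, fname) for fname in filenames]
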
> Just out of curiosity: what is the working directory of the engines? Is it
> the one where the ipcluster is started, or the one where the profile is stored?
> (While fixing my code, I ended up passing the filename with the full path.)

It depends on configuration and how you start the engines.  You can set this with your config files (look for work_dir), and you can view the current working dir with:

rc[:].apply_sync(os.getcwdu)
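
If you need it to be deterministic regardless of how the engines were started,
you can also just set it from the client at runtime, and for a permanent
setting there is the work_dir option in the engine config. A small sketch
('/path/to/data' is a placeholder, and the exact config attribute spelling
below is from memory, so check the generated ipengine_config.py in your
profile):

import os
from IPython.parallel import Client

rc = Client()
dv = rc[:]

dv.apply_sync(os.getcwdu)                  # where each engine currently is
dv.apply_sync(os.chdir, '/path/to/data')   # move them all at runtime

# or, permanently, in the profile's ipengine_config.py:
# c.IPEngineApp.work_dir = u'/path/to/data'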

> 
> Thanks again,
> 
> Francesco
> 
>> 
>>> 
>>> 
>>> Thanks for any help,
>>> 
>>> Cheers,
>>> Francesco

