[IPython-dev] Ipython parallel and PBS
Fri Sep 13 13:14:17 CDT 2013
Can you inspect the pbs_engines template, and see if anything looks wrong?
Can you submit it manually, with qsub ./pbs_engines?
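If it helps with the first check, you can list just the `#PBS` directive lines of the generated script instead of reading the whole file by eye. This is a minimal sketch and not part of IPython; the function name `pbs_directives` and the sample script text are made up for illustration, and your real pbs_engines file will differ.

```python
def pbs_directives(script_text):
    """Return the option part of every '#PBS' directive line in a batch script."""
    directives = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#PBS"):
            # Drop the '#PBS' prefix and keep the option text, e.g. '-q long'.
            directives.append(stripped[len("#PBS"):].strip())
    return directives


# Hypothetical script contents, for illustration only.
sample = """#!/bin/sh
#PBS -q long
#PBS -N ipengine
ipengine --profile-dir=/path/to/profile
"""

for directive in pbs_directives(sample):
    print(directive)
```

To run it on the real file, pass `open("pbs_engines").read()` instead of `sample` and compare the queue name and resource requests against what your cluster accepts.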
On Fri, Sep 13, 2013 at 3:38 AM, James <email@example.com> wrote:
> Dear all,
> I'm having a lot of trouble setting up IPython parallel on a PBS cluster,
> and I would really appreciate any help.
> The architecture is a standard PBS cluster - a head node with slave nodes.
> I connect to the head node from my laptop over ssh.
> The client (laptop) -> Head node connection seems simple enough. The
> problem is the engines.
> Ignoring the laptop for a moment, I'll just focus on running ipython on
> the head node, with the engines on a slave node. I assume this is a correct
> method of working?
> I did the following on the head node, following instructions at
> $ ipython profile create --parallel --profile=pbs
> Files are as follows:
> $ cat ipcluster_config.py
> c = get_config()
> c.IPClusterStart.controller_launcher_class = 'PBSControllerLauncher'
> c.IPClusterEngines.engine_launcher_class = 'PBSEngineSetLauncher'
> c.PBSLauncher.queue = 'long'
> # Run 2 cores on 1 node, or 2 nodes with all cores? Not sure.
> c.IPClusterEngines.n = 2
> $ cat ipengine_config.py
> c = get_config()
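For context on how these settings reach the scheduler: as far as I understand it, the PBS launchers fill a batch template with values such as the engine count and the queue, write the result to a script, and submit that script. The sketch below imitates that step with plain `str.format`. The template text and the `render` helper are assumptions for illustration, not IPython's actual default template or code.

```python
# Hypothetical template, loosely modelled on a PBS array job.
TEMPLATE = """#!/bin/sh
#PBS -q {queue}
#PBS -t 1-{n}
#PBS -N ipengine
ipengine --profile-dir="{profile_dir}"
"""


def render(template, **context):
    """Fill the {name} placeholders in a batch template."""
    return template.format(**context)


# Values mirror the config above; the profile path is a placeholder.
script = render(TEMPLATE, queue="long", n=2, profile_dir="/path/to/profile_pbs")
print(script)
```

With `n=2` the array line in this sketch becomes `#PBS -t 1-2`, which would be consistent with the `job_array_request = 1-2` that qstat reports for the engine job.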
> Then execute on the head node:
> $ ipcluster start --profile=pbs -n 2
> 2013-09-10 15:02:46,771.771 [IPClusterStart] Using existing profile dir:
> 2013-09-10 15:02:46.777 [IPClusterStart] Starting ipcluster with
> 2013-09-10 15:02:46.778 [IPClusterStart] Creating pid file:
> 2013-09-10 15:02:46.778 [IPClusterStart] Starting Controller with
> 2013-09-10 15:02:46.792 [IPClusterStart] Job submitted with job id: '2830'
> 2013-09-10 15:02:47.793 [IPClusterStart] Starting 2 Engines with
> 2013-09-10 15:02:47.808 [IPClusterStart] Job submitted with job id: '2831'
> Then the queue shows
> $ qstat
> Job id                    Name             User            Time Use S Queue
> ------------------------- ---------------- --------------- -------- - -----
> 2830.master               ipcontroller     username               0 Q
> 2831.master               ipengine         username               0 Q
> And they just hang there, queued forever. I assume the engines at least
> should be running? The full output of "qstat -f" doesn't give a reason
> for the queuing, although it normally would. There are more than 4 nodes
> $ qstat -f
> Job Id: 2831.master.domain
> Job_Name = ipengine
> Job_Owner = firstname.lastname@example.org
> job_state = Q
> queue = long
> server = [head node's domain address]
> Checkpoint = u
> ctime = Tue Sep 10 15:02:47 2013
> Error_Path = master.domain:/home/username/
> Hold_Types = n
> Join_Path = n
> Keep_Files = n
> Mail_Points = a
> mtime = Tue Sep 10 15:02:47 2013
> Output_Path = master.domain:/home/username/ipengine.o2831
> Priority = 0
> qtime = Tue Sep 10 15:02:47 2013
> Rerunable = True
> etime = Tue Sep 10 15:02:47 2013
> submit_args = ./pbs_engines
> job_array_request = 1-2
> fault_tolerant = False
> submit_host = master.domain
> init_work_dir = /home/username
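If you want to check records like this one programmatically, for instance across many queued jobs, the `key = value` lines can be read into a dictionary. This is a rough sketch under assumptions: `parse_qstat_record` is a made-up helper, it ignores the continuation lines that real `qstat -f` output uses for long values, and whether your scheduler sets a `comment` field to explain a waiting job depends on the cluster.

```python
def parse_qstat_record(text):
    """Parse 'key = value' lines from one 'qstat -f' job record into a dict."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Job Id:"):
            # The first line uses a colon rather than ' = '.
            record["Job Id"] = line.split(":", 1)[1].strip()
        elif " = " in line:
            key, value = line.split(" = ", 1)
            record[key.strip()] = value.strip()
    return record


# A few lines copied from the output above.
sample = """Job Id: 2831.master.domain
Job_Name = ipengine
job_state = Q
queue = long
job_array_request = 1-2
"""

job = parse_qstat_record(sample)
print(job["job_state"], job["queue"], job.get("comment", "(no comment field)"))
```

A record with no `comment` entry, as here, means the reason for the wait has to be found elsewhere, for example in the scheduler's own logs or by asking the cluster administrators.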
> It also seems strange that the ipcontroller is launched through PBS. I
> thought this should be on the head node, so I changed
> 'PBSControllerLauncher' to 'LocalControllerLauncher'. Then it doesn't
> queue, but I don't know if what I'm doing is correct.
> Any help would be really greatly appreciated.
> Thank you.