MPI Wisdom

This is RUST Ticket #4081. It contains MPI Wisdom from Derek, as well as a few other tidbits.

From: Martin Siegert
To: condor-admin@cs.wisc.edu
Subject: condor on a beowulf

Hi there,

I am in the process of setting up HTCondor as a batch queueing/scheduling
system for our Beowulf cluster. But before I can get started I already
ran into a few problems:

1) I downloaded condor-6.4.0-linux-x86-glibc22-dynamic.tar.gz.
   I was expecting that this would contain shared libraries (lib*.so).
   However, the release.tar contains only what looks like static
   libraries (lib*.a). Am I missing something?

2) I am installing HTCondor on the master node of the cluster and want to
   use it on the private network 172.16.0.0. The hostname structure is

   172.16.0.1    b001                (master node)
   172.16.0.2    b002
   172.16.0.3    b003
   ...

   Following section 3.11.9 of the manual I set

   CONDOR_HOST = b001

   but to what should I set UID_DOMAIN and FILESYSTEM_DOMAIN? I do use
   a common uid space and I NFS export /home and /usr/local from the
   master node. However, I do not have a domain name (which would
   unnecessarily complicate management of the cluster). Thus, do I set
   UID_DOMAIN and FILESYSTEM_DOMAIN to nothing (empty string)?
   Also, should I set DEFAULT_DOMAIN_NAME = '' ?

3) All nodes have two interfaces: one (on the 172.16.0.0 network) is
   used for MPI communication; the second (on the 172.17.0.0 network)
   is used for NFS. Thus I set NETWORK_INTERFACE = 172.16.0.# in each
   of the condor_config.local files.
   Does it create a problem with HTCondor that NFS traffic (including
   exporting of the HTCondor home directory and the HTCondor binaries and
   libraries) is actually on a different interface?

Thanks a lot for your help in advance!
Also, thanks for making HTCondor available.

Regards,
Martin

===========================================================================
Date of creation: Tue Jul 16 12:41:32 2002 (1026841296)


From RUST Tue, 16 Jul 2002 14:15:20  -0600 (CST)
Subject: Actions

Assigned to ned by ned
===========================================================================
Date of actions: Tue Jul 16 14:15:20 2002 (1026846921)

Date: Tue, 16 Jul 2002 16:45:27 -0500
From: James Kirby
To: ned
Subject: Re: [condor-admin #4081] condor on a beowulf

Hello,


> I am in the process of setting up HTCondor as a batch queueing/scheduling
> system for our Beowulf cluster. But before I can get started I already
> ran into a few problems:
>
> 1) I downloaded condor-6.4.0-linux-x86-glibc22-dynamic.tar.gz.
>    I was expecting that this would contain shared libraries (lib*.so).
>    However, the release.tar contains only what looks like static
>    libraries (lib*.a). Am I missing something?

No.  We statically link in all of our own libraries, hence the size of the
executables.

> 2) I am installing HTCondor on the master node of the cluster and want to
>    use it on the private network 172.16.0.0. The hostname structure is
>
>    172.16.0.1    b001                (master node)
>    172.16.0.2    b002
>    172.16.0.3    b003
>    ...
>
>    Following section 3.11.9 of the manual I set
>
>    CONDOR_HOST = b001
>
>    but to what should I set UID_DOMAIN and FILESYSTEM_DOMAIN?

You need to "make up" a domain for your machines, like so:


172.16.0.1    b001.qux.foo.bar.com b001                (master node)
172.16.0.2    b002.qux.foo.bar.com b002
172.16.0.3    b003.qux.foo.bar.com b003

Then just set UID_DOMAIN and FILESYSTEM_DOMAIN to qux.foo.bar.com.  If this
is set up correctly, you won't need the DEFAULT_DOMAIN_NAME config file
entry.
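To make the matching concrete, here is a small Python sketch of the idea (an illustrative model of my own, not HTCondor's actual code): each daemon takes its domain to be everything after the first dot of the fully qualified hostname, and two hosts are treated as sharing a uid space when both domains agree with the configured UID_DOMAIN.

```python
# Illustrative sketch (not HTCondor source): why a made-up domain in
# /etc/hosts lets UID_DOMAIN / FILESYSTEM_DOMAIN matching succeed.

def domain_of(full_hostname: str) -> str:
    """Return the domain portion: everything after the first dot."""
    _, _, domain = full_hostname.partition(".")
    return domain

def same_uid_domain(host_a: str, host_b: str, uid_domain: str) -> bool:
    """Model: hosts share a uid space when both domains match UID_DOMAIN."""
    return domain_of(host_a) == uid_domain and domain_of(host_b) == uid_domain

# With the /etc/hosts entries above, every node resolves to *.qux.foo.bar.com:
print(domain_of("b001.qux.foo.bar.com"))   # qux.foo.bar.com
print(same_uid_domain("b001.qux.foo.bar.com",
                      "b002.qux.foo.bar.com",
                      "qux.foo.bar.com"))  # True
```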

> 3) All nodes have two interfaces: one (on the 172.16.0.0 network) is
>    used for MPI communication; the second (on the 172.17.0.0 network)
>    is used for NFS. Thus I set NETWORK_INTERFACE = 172.16.0.# in each
>    of the condor_config.local files.
>    Does it create a problem with HTCondor that NFS traffic (including
>    exporting of the HTCondor home directory and the HTCondor binaries and
>    libraries) is actually on a different interface?

Consensus amongst the development team is no, it shouldn't create any
problems, but neither has anyone run across this before.  If you do run
into problems, please let us know.

--condor-admin

===========================================================================
Date mail was appended: Tue Jul 16 16:45:27 2002 (1026855928)


Date: Tue, 23 Jul 2002 15:00:51 -0700
From: Martin Siegert
To: condor-admin response tracking system
Subject: Re: [condor-admin #4081] condor on a beowulf

Hello again,

On Tue, Jul 16, 2002 at 04:45:27PM -0600, condor-admin response
tracking system wrote:

> Hello,
>
> > I am in the process of setting up HTCondor as a batch queueing/scheduling
> > system for our Beowulf cluster. But before I can get started I already
> > ran into a few problems:
> >
> > 1) I downloaded condor-6.4.0-linux-x86-glibc22-dynamic.tar.gz.
> >    I was expecting that this would contain shared libraries (lib*.so).
> >    However, the release.tar contains only what looks like static
> >    libraries (lib*.a). Am I missing something?
>
> No.  We statically link in all of our own libraries, hence the size of the
> executables.

Doesn't that mean that every time a job is checkpointed, all the HTCondor
libraries are dumped into the checkpoint directory for each and every job,
over and over, leading to an unnecessarily high demand for disk space?
I would prefer adding /usr/local/condor/lib to /etc/ld.so.conf (just a
thought - I have not actually estimated the typical disk space needs on
my cluster in the checkpoint directory).

> > 2) I am installing HTCondor on the master node of the cluster and want to
> >    use it on the private network 172.16.0.0. The hostname structure is
> >
> >    172.16.0.1    b001                (master node)
> >    172.16.0.2    b002
> >    172.16.0.3    b003
> >    ...
> >
> >    Following section 3.11.9 of the manual I set
> >
> >    CONDOR_HOST = b001
> >
> >    but to what should I set UID_DOMAIN and FILESYSTEM_DOMAIN?
>
> You need to "make up" a domain for your machines, like so:
>
>
> 172.16.0.1    b001.qux.foo.bar.com b001                (master node)
> 172.16.0.2    b002.qux.foo.bar.com b002
> 172.16.0.3    b003.qux.foo.bar.com b003
>
> Then just set UID_DOMAIN and FILESYSTEM_DOMAIN to qux.foo.bar.com.  If this
> is set up correctly, you won't need the DEFAULT_DOMAIN_NAME config file
> entry.

Hmm. That isn't really an option. Too many scripts that rely on the short
hostnames would break. Is there a way to change the FULL_HOSTNAME
macro (I guess that is what determines how hostnames get interpreted)?

As far as I understand I can set FILESYSTEM_DOMAIN to any string and as
long as it is the same on all hosts it will work.
I could set UID_DOMAIN to *. Then I could firewall all HTCondor ports on the
master node so that only machines on the private network can talk to the
HTCondor daemons. This seems to be desirable anyway.
Which are the HTCondor ports that I would have to block in that case?
Anything else that I must consider (with respect to security), if I set
UID_DOMAIN = *
?

> > 3) All nodes have two interfaces: one (on the 172.16.0.0 network) is
> >    used for MPI communication; the second (on the 172.17.0.0 network)
> >    is used for NFS. Thus I set NETWORK_INTERFACE = 172.16.0.# in each
> >    of the condor_config.local files.
> >    Does it create a problem with HTCondor that NFS traffic (including
> >    exporting of the HTCondor home directory and the HTCondor binaries and
> >    libraries) is actually on a different interface?
>
> Consensus amongst the development team is no, it shouldn't create any
> problems, but neither has anyone run across this before.  If you do run
> into problems, please let us know.
>
> --condor-admin

I'll definitely let you know. But for now I ran into another problem:
SMP configuration.
All of the machines in my cluster are identical dual processor boxes.
I understand that by default HTCondor would split those boxes up into
two virtual machines with equal amounts of memory. This is not really
what I would like to do. I'd rather make all of memory available to
the first job and then make [total_memory - (memory taken by first job)]
available to the second machine. Is that possible?

Also: can you give me some hints what would be reasonable policies for
a cluster?
This is what I'd like to do: the cluster has 192 CPUs. I'd like to split
it into two parts: 96 CPUs dedicated to MPI (and possibly PVM) jobs and
96 CPUs for everything else. Right now I do not have enough MPI jobs to fill
the MPI processors, but I have (at least sometimes) more than 192 serial
jobs. I'd like to configure HTCondor such that a serial job preferably goes
onto those 96 processors that are not dedicated to MPI jobs. If those
are already busy, serial jobs go onto the MPI processors as well.
When an MPI job claims those processors, the serial jobs vacate immediately
and go back to the top of the serial queue.
Preferably, the user priority is determined by the number of processors
a user's jobs currently occupy, i.e., the "history" should not play a role.
Quite honestly: I am kind of lost after having read through various
chapters of the manual. I do not have an "Owner" - thus I understand that
I should somehow guarantee that Owner always evaluates to UNDEFINED. Is that
the default? Also, I guess that I should set KeyboardIdle to True and
KeyboardBusy and ConsoleBusy to False, correct? What else should I set?
According to chapter 3.11.11, half of the machines will be defined as
dedicated resources in their local config files. In order to let serial
jobs onto those processors, do I just set "START = True" as mentioned under
3.11.11.5? Or can I actually use "START = True" everywhere, and thus just
define it in the global config file?

Also, is there a way of forcing users to submit their jobs to HTCondor
instead of starting them directly in the background? I.e., can HTCondor
stop jobs that are not started via HTCondor?

Any pointers are appreciated. Thanks a lot for your help!

Cheers,
Martin

Assigned to wright by ned
===========================================================================
Date of actions: Mon Jul 29 13:47:30 2002 (1027968451)


To: condor-admin@cs.wisc.edu
Subject: Re: [condor-admin #4081] HTCondor on a beowulf
Date: Tue, 30 Jul 2002 16:39:28 -0500
From: Derek Wright

> Doesn't that mean that every time a job is checkpointed, all the HTCondor
> libraries are dumped into the checkpoint directory for each and every job,
> over and over, leading to an unnecessarily high demand for disk space?

basically, yes.  however, there is state in the HTCondor libraries
linked in with your code, specific to each job, that can't be shared
across the different jobs.

> I would prefer adding /usr/local/condor/lib to /etc/ld.so.conf (just a
> thought - I have not actually estimated the typical disk space needs on
> my cluster in the checkpoint directory).

if we put in a lot of effort, we might be able to make a shareable
section of the HTCondor libraries and put that in a dynamic library.
however, the problems associated with checkpointing dynamically linked
jobs (particularly on linux) were so awful that we decided it wasn't
worth our effort to continue to support it.  80 gig drives are so
cheap now, we're not losing any sleep over the slightly inefficient
use of space...

> Hmm. That isn't really an option. Too many scripts that rely on the short
> hostnames would break. Is there a way to change the FULL_HOSTNAME
> macro (I guess that is what determines how hostnames get interpreted)?

sort of.  that's not really what you want, though.

as luck would have it, there's active work on the HTCondor team to
support pools without fully qualified hostnames.  however, i'm not
involved with that, so i put that question of yours into a new message
in our tracking system.  i'll assign it to someone else, and let them
reply to all of the domain-related stuff.

> But for now I ran into another problem:

for future reference, it's a lot easier for us if you just send a new
message to condor-admin with completely new questions.  most of the
time, different people are the best to answer any given question.  if
there's a bunch of background information in common between the two
problems, you can always reference the tracking number of your old
ticket (in this case [condor-admin #4081]) in the new one, and any of
us can read the other ticket if we need to...

i'll reply to your SMP and MPI questions in a little while.

-derek

===========================================================================
Date mail was appended: Tue Jul 30 16:39:28 2002 (1028065169)


To: condor-admin@cs.wisc.edu
Subject: Re: [condor-admin #4081] HTCondor on a beowulf
Date: Tue, 30 Jul 2002 19:10:49 -0500
From: Derek Wright

> SMP configuration.
> All of the machines in my cluster are identical dual processor boxes.
> I understand that by default HTCondor would split those boxes up into
> two virtual machines with equal amount of memory. This is not really
> what I would like to do. I'd rather make all of memory available to
> the first job and then make [total_memory - (memory taken by first job)]
> available to the second machine. Is that possible?

no.  i agree with you, that's how i think it should work, too.
however, when i was implementing the smp support in HTCondor, i was told
it had to work the way it does, so that's how it is.  here's the deal:
you can partition the memory however you want between the two cpus
(50/50, 73/27, whatever you want), but you have to do it ahead of
time.  if there's no job running on the machine, you can re-partition
the memory if you want.  but, there's no way, while a job is running,
to change it.

the "party line" is that if you really need to dynamically partition
the memory on a per-job basis, you should write your own script/daemon
to monitor your job queue, and decide when and how to repartition the
memory on your machines.  i think this is highly unsatisfactory, but
that's what i'm supposed to tell you.  i wish it could be otherwise,
but that's just how it is.  i'm sorry.

> Also: can you give me some hints what would be reasonable policies for
> a cluster?

sure.

> This is what I'd like to do: the cluster has 192 cpus. I'd like to
> split it into two parts: 96 CPUs dedicated to MPI (and possibly PVM)
> jobs and 96 CPUs for everything else.

sounds good (assuming you never need more than 96 cpus for MPI)

> Right now I do not have enough MPI jobs to fill the MPI processors,
> but I have (at least sometimes) more than 192 serial jobs.

no problem.  this mix of dedicated parallel jobs and "opportunistic"
serial jobs is exactly the kind of environment HTCondor is setup to
handle.

> I'd like to configure HTCondor such that a serial job goes preferably
> on those 96 processors that are not dedicated to MPI jobs.

you would do this with the "Rank" expression in the job's submit
file.  you can do this one of two ways:

1) educate your users to put it in their own Rank expression
2) have condor_submit automatically append it to their Rank for them

either way, you want the serial jobs to prefer to run on machines with
the "DedicatedScheduler" attribute undefined (therefore, unable to run
MPI jobs).  like this:

Rank = DedicatedScheduler =?= UNDEFINED

i don't know if you're already using the job rank for anything.  if
so, you may want to give this preference more weight.  for example,
you might want to run on machines with more memory, but you'd much
rather run on a small-memory non-mpi machine than a high memory mpi
machine.  you'd do that like this:

Rank = (Memory) + ((DedicatedScheduler =?= UNDEFINED) * 10000)

assuming you don't have any machines with 10 gigs of RAM in them :),
if DedicatedScheduler is undefined, that whole side of it will
evaluate to (1 * 10000), and if you add that to, say, 512 megs of ram on a
"small memory" non-mpi machine, you'll always get more than the 2048
the rank would evaluate to for a high memory mpi node.  make sense?
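the arithmetic here can be sketched in a few lines of python (a toy model of the evaluation, not real ClassAd semantics; the machine ads below are made up for illustration):

```python
# Toy model of the combined Rank expression.  Machine ClassAds are plain
# dicts; "DedicatedScheduler =?= UNDEFINED" is modeled as "attribute
# missing from the dict".

def rank(machine: dict) -> int:
    # Rank = (Memory) + ((DedicatedScheduler =?= UNDEFINED) * 10000)
    non_mpi = "DedicatedScheduler" not in machine   # =?= UNDEFINED
    return machine["Memory"] + (int(non_mpi) * 10000)

small_non_mpi = {"Memory": 512}   # no DedicatedScheduler attribute
big_mpi = {"Memory": 2048, "DedicatedScheduler": "DedicatedScheduler@b001"}

print(rank(small_non_mpi))  # 10512
print(rank(big_mpi))        # 2048
# the small-memory non-mpi machine always outranks the high-memory mpi node.
```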

it's probably best to just have submit add this for you, so you don't
have to worry about users forgetting.  the "append_rank" expression in
your config file will do this for you, like this:

APPEND_RANK = (DedicatedScheduler =?= UNDEFINED) * 10000

condor_submit is smart enough that if you specify something to append,
but the user doesn't define anything, the expression to append is just
used as the rank.  if the user defines anything, it's put in ()'s, and
your expression is appended like this:

Rank = (their expression) + (your expression)

or, in this case:

Rank = (Memory) + ((DedicatedScheduler =?= UNDEFINED) * 10000)

> If those are already busy, serial jobs go onto the MPI processors as
> well.

job rank is just a preference.  it's not a requirement (since you put
this restriction in "rank", not "requirements").  so, if the job
doesn't have anywhere to go that it really prefers, it'll run anywhere
that satisfies its requirements.  so, you don't have to do anything
special to get this behavior.

> When an MPI job claims those processors, the serial jobs vacate
> immediately

that's exactly what happens when the machine's Rank expression is used
to specify what kinds of jobs a machine prefers to run.  that's why
you're supposed to put:

RANK = Scheduler =?= $(DedicatedScheduler)

in your config file for resources that are going to be dedicated.
this way, if an MPI job comes along, dedicated resources will always
evict non-mpi jobs to run the mpi job instead.

> and get back to the top of the serial queue.

yeah, basically.  it's not really separate queues, but that doesn't
matter.  the behavior is what you expect...

> Preferably the user priority is determined by the number of
> processors jobs of the user currently occupy, i.e., the "history"
> should not play a role.

that's a different issue.  this is the responsibility of the HTCondor
"accountant", which lives inside the condor_negotiator daemon.  the
knob you want to turn is called "PRIORITY_HALFLIFE".  think of your
user priority as a radioactive substance. :) consider a priority that
exactly matches your current resource usage the "stable state", and a
priority "contaminated" with past usage "radioactive."  if it's got a
long halflife, it takes a long time for your priority to decay back to
"normal".  if the halflife is very short, it'll decay very quickly,
and will remain very close to your current usage.  so, just set
PRIORITY_HALFLIFE to a small floating point value (like 0.0001), and
your user priority should always match your current usage.  if you're
not using any resources, your priority will go back to the baseline
value instantly.
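the decay can be sketched like this (a simplified model of the half-life idea, not the negotiator's actual accounting code):

```python
# The "radioactive" contribution of past usage decays by a factor of 0.5
# every half-life; current usage is the stable baseline.

def decayed_priority(current_usage: float, past_usage: float,
                     elapsed: float, halflife: float) -> float:
    """Simplified model: priority = current + past * 0.5^(elapsed/halflife)."""
    return current_usage + past_usage * 0.5 ** (elapsed / halflife)

# long half-life (1 day): an hour later, old usage has barely decayed.
print(decayed_priority(10, 100, elapsed=3600, halflife=86400))  # ~107.15

# tiny half-life (like PRIORITY_HALFLIFE = 0.0001): past usage vanishes
# almost instantly, so priority tracks current usage.
print(decayed_priority(10, 100, elapsed=1, halflife=0.0001))    # 10.0
```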

> Quite honestly: I am kind of lost after having read through various
> chapters of the manual.

yeah, HTCondor is incredibly flexible, therefore, incredibly complicated
to configure and document.  i'm sorry.  we get this stuff wrong in our
own pool on occasion, even with the developers editing the config
files. :)

that said, specific feedback on things in the manual that are
particularly confusing, or suggestions for improvement are always
welcome.  (again, just send a new message to condor-admin, so the
person who does the editing of the manual can reply to it).

> I do not have an "Owner" - thus I understand that I somehow should
> guarantee that Owner evaluates to UNDEFINED always. Is that the
> default?

you don't really need to worry about that (read on)

> Also I guess that I should set KeyboardIdle to True and KeyboardBusy
> and ConsoleBusy to False, correct?

KeyboardIdle and ConsoleIdle are computed for you by HTCondor.  they're
just a measure of how long it's been since someone logged into a tty,
or touched the physical console.  you can't set them to anything.
however, all that really matters are the "policy expressions" (start,
suspend, preempt, etc), and if you don't refer to these attributes in
any of those expressions, changes in their values won't impact the
policy on the machine.

> What else should I set?
>
> According to chapter 3.11.11 half of the machines will be defined as
> dedicated resources in their local config files. In order to let serial
> jobs onto those processors, do I just set "START = True" as mentioned under
> 3.11.11.5?

yes.  if you want them to always run jobs, you want:

START = True

if you want them to also be able to run MPI jobs, you want to set:

DedicatedScheduler = "DedicatedScheduler@b001"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
RANK         = Scheduler =?= $(DedicatedScheduler)

to tell them about MPI stuff.

> Or can I actually use "START = True" everywhere, thus just define
> it in the global config file?

it depends.  if you want all of your nodes to run jobs all the time
(assuming they're not already running a job), then, yes, you can say
"START = True" in your global config file and that'll be the behavior
on all the machines in the pool.

> Also, is there a way of forcing users to submit their jobs to HTCondor
> instead of starting them directly in the background? I.e., can
> HTCondor stop jobs that are not started via HTCondor?

sort of, but it's quite convoluted at this point.  your best bet for
now is to just have a cron job run every minute or so, check for
processes on your system that are using lots of cpu that aren't
children of the condor_master, and take appropriate action (kill
-TERM, kill -KILL, whatever).
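such a watchdog could look roughly like this (an illustrative python sketch against a made-up process table; on a real node you'd fill the table from something like `ps -eo pid,ppid,pcpu,comm` and actually signal the offenders):

```python
# Sketch of the watchdog: flag processes using lots of CPU that are not
# descendants of condor_master.  The process table here is a fake
# stand-in for real ps output.

def descends_from(pid, procs, ancestor_name):
    """Walk the parent chain; True if pid (or any ancestor) has the name."""
    seen = set()
    while pid in procs and pid not in seen:
        seen.add(pid)
        ppid, _, name = procs[pid]
        if name == ancestor_name:
            return True
        pid = ppid
    return False

def rogue_pids(procs, cpu_threshold=50.0):
    """PIDs over the CPU threshold that are not under condor_master."""
    return sorted(
        pid for pid, (_, cpu, _) in procs.items()
        if cpu > cpu_threshold and not descends_from(pid, procs, "condor_master")
    )

# pid: (ppid, %cpu, command) -- entirely made-up example data
procs = {
    1:   (0,   0.0,  "init"),
    100: (1,   0.1,  "condor_master"),
    101: (100, 99.0, "condor_startd"),
    102: (101, 98.0, "users_condor_job"),       # busy, but under the master: fine
    200: (1,   97.0, "sneaky_background_job"),  # busy, not under the master: rogue
}
print(rogue_pids(procs))  # [200]
```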

i hope this helps.  if you have further questions about any of this,
reply to this message.  if you have other questions or comments about
HTCondor, just send a new note to condor-admin.

thanks,
-derek




===========================================================================
Date mail was appended: Tue Jul 30 19:10:50 2002 (1028074250)


From RUST Tue, 30 Jul 2002 19:10:50  -0600 (CST)
Subject: Actions

Ticket resolved by wright
===========================================================================
Date of actions: Tue Jul 30 19:10:50 2002 (1028074251)