computing_htcondor

HTCondor use on the d0 cluster

For basic use, see the manual entry from the command line:

man condor_submit

A more detailed explanation can be found here: https://research.cs.wisc.edu/htcondor/manual/v7.6/2_5Submitting_Job.html

An example .sub file:

# Generic job which will run under the local CentOS7 container, on modern servers only

# What to run and its arguments; can be (and probably always should be) a script
executable            = ./job.sh
arguments             = 

# Safe option, so files are not transferred when not required (our servers have
# access to the same storage: /auto/work, /auto/data, /home).
should_transfer_files = IF_NEEDED

# Without a variable part, the output files would be the same for all jobs.
# That can be confusing.
output                = output/job.$(ClusterId).$(ProcId).out
error                 = output/job.$(ClusterId).$(ProcId).err
log                   = output/job.$(ClusterId).log

# It is possible to pass the current environment variables to jobs with the
# following flag. But in general, when a particular environment is required,
# the safer option is to call a script which sets up the correct environment
# directly (lb-run, etc.). In the LHCb case that can also select the optimal platform.
#getenv                = True


# The following flag by itself sets the two options below. It is set by default
# when submitting from the CentOS7 container, but setting it explicitly produces
# the desired result from Ubuntu as well (when submitted from hosts that support it).
+FromVIRTC = "CentOS7"

# Ask to run inside the local "CentOS7" container on the worker node
#  +WantVIRTC = "CentOS7"
# Execute on modern servers only
#  +WantCPUCap = 2020

# To run the job under Ubuntu, set
#  +FromVIRTC = ""
# or
#  +WantVIRTC = ""


# For any job (Ubuntu/CentOS) it is possible to limit the servers used with
#  +WantCPUCap = 2020
# Or run on any server with
#  +WantCPUCap = 0

# The number after "queue" is the number of subjobs to start
queue 2
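
If each subjob should receive its own input, the $(ProcId) macro (0, 1, ...) can be passed through the otherwise empty arguments line. A minimal sketch of that variant (how the argument is interpreted is up to your job.sh):

# Give every subjob its index as the first argument, so job.sh can
# pick its own input file or parameter set.
arguments             = $(ProcId)
queue 2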

A simple job.sh file to check whether ROOT is available and which version is picked up:

#!/bin/bash

hostname
pwd
/work/software/os_version

# set up a simple LHCb environment and check that ROOT is the correct version
source /cvmfs/lhcb.cern.ch/lib/LbEnv
source /cvmfs/sft.cern.ch/lcg/views/LCG_98/x86_64-centos7-gcc9-opt/setup.sh
env # print the current environment to stdout
which root
root-config --version

Then submit the job via the command line:

mkdir -p output   # set up a condor output directory to keep things clean
condor_submit job.sub
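
Once submitted, the standard HTCondor tools can be used to follow or remove the jobs (the cluster ID below is only an example):

condor_q                 # summary of your jobs in the queue
condor_q -nobatch        # one line per subjob instead of a summary
condor_rm 1234           # remove all subjobs of cluster 1234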

You can test your executable or script interactively via:

condor_submit -interactive

which will open a shell on the machine where you can run your executable by hand.
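
Because the worker nodes see the same storage as the interactive machines (/auto/work, /auto/data, /home), a typical interactive test looks like the following sketch (the directory is a placeholder for your own working directory):

cd /auto/work/<user>/myproject   # the directory you would submit from
./job.sh                         # run the script exactly as the batch job would
exit                             # release the slot when finished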

Emails from Alexey:

Email 3 from Alexey:

Dear colleagues,

I have achieved what I had planned as the default behavior of our batch system (HTCondor) when submission is done from the CentOS7 container (currently only from lhcba1 port 30).

ssh -p 30 lhcba1.physi.uni-heidelberg.de

Without any extra flags in the configuration, jobs should run under CentOS7 (local container), with login scripts applied and in the current (at the time of submission) directory. Jobs will also run on "LHCb software compatible" servers only.

There are currently 3 servers with 90 slots in total which support that model. There will be one more with 16 slots. 4 other servers are interactive at the moment: lhcba1, d0new, lhcbi1 and the not-yet-updated d0bar-new. They could be added for batch processing (possibly time limited, e.g. at nights and weekends), but there are no such plans at the moment.

All other servers (many...) are "old". They will be updated to support the submission described above, but they may fail to run particular versions of LHCb software.

A simple test can be started from lhcba1 port 30 with the command

condor_submit -interactive

An example submission file is in /auto/work/zhelezov/singularity/batch_centos7. Do not forget to start job submission from a directory into which you can write; otherwise log files cannot be written and your jobs will stay in the "on hold" state forever.

I still propose to use the Singularity-based approach when possible, demonstrated in /auto/work/zhelezov/singularity/FCNCfitter. That allows the use of SLC6 / CentOS8 / etc. without a local installation on all servers.

While not thoroughly checked, I believe the environment closely mimics the current CERN/DESY HTCondor. Note that the defaults are conservative (everywhere) in reserved resources (1 core, 512 MB RAM). It is better to specify the required resources explicitly (as documented in the general HTCondor manual).

For the moment there are no multi-core slots, and up to 8 GB RAM per slot. Jobs with higher requirements will find no worker nodes. Please let me know if you hit this problem.

Regards, Alexey.
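
For reference, the explicit resource requests mentioned above go into the submit file. A minimal sketch (the values are examples only and must stay within the limits given in the email, i.e. single-core slots and up to 8 GB RAM per slot):

request_cpus   = 1
request_memory = 2 GB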

Email 2 from Alexey:

Dear colleagues,

lhcba1 was updated and rebooted. The process itself was trouble free, but after a while the server lost the network and had to be restarted again. The reason is unknown. Please let me know if it gets stuck again, preferably via WhatsApp: +4917648253440. For the same reason, please do NOT try to use the GPU until Monday.

I have decided to change the container schema already now, so port 30 is a new-style container. The only expected change for users: Singularity can be used from the CentOS7 container.

The change "simplifies" the proposed tests: just check that things are still working fine on lhcba1 port 30. Any response about the batch system behavior is still more than welcome.

Note that using --fakeroot with singularity requires a local configuration change (per user); let me know if you want to create images.

While the following can be deduced from the proposed test and the PS, I forgot to mention the change explicitly: condor submission from CentOS7 on lhcba1 no longer starts jobs under Ubuntu, it starts them under CentOS7 (at the moment on one server only). To submit jobs to Ubuntu (e.g. with Singularity), please use Ubuntu (port 24) on lhcba1 or d0new.

Regards, Alexey.

Email 1 from Alexey:

Dear colleagues,

  1. I am going to reboot lhcba1 later today (~17:00) to apply updates and enable the GPU driver. Please let me know if that time is inconvenient for you.

  2. I am trying to prepare a "user friendly" environment on our cluster, and there is some progress in that respect. But before I apply significant changes, please check that you are able to work in the new environment.

To test:

  a) ssh to lhcba1 port 30 (e.g. ssh -p 30 <user name>@lhcba1).
  b) Run the condor_submit -interactive command. After a short time you should be "logged in".
  c) Check that you can do everything you usually do in CentOS7 there, including running your own programs, editor, etc. It may be that some software requested for interactive nodes only is not installed there, but the rest is expected to work.
  c*) If you plan to use Singularity from CentOS7, check it as well (except the --fakeroot functionality).
  d) "exit" from that environment. There are just 30 slots at the moment.

Please report if you observe any problems during this exercise. If you do, try ssh -p 30 <user name>@lhcb-raid03 and run the failing part there.

Note that lhcba1 CentOS7 is going to be switched to this environment soon (probably next week).

Regards, Alexey

PS. Technical background, just for information in case someone is interested.

Our cluster has been running Ubuntu as the primary OS for more than 10 years, with SLC4, SLC5 and SLC6 DIY containers. Using the same approach, I hope we can have a reasonable environment for the next 10 years. That would definitely not be the case if CentOS7 were the only system (as in most other centers).

My DIY containers had some benefits, but they show significant weaknesses with respect to other container technologies. Most importantly, it is not possible to use Singularity from my current DIY solution. The current approach is to use the systemd-nspawn container technology. If for some reason that fails (and maybe in parallel), I can try Docker.

One of the features of our old farm was the SGE batch system. Most centers (including CERN, GridKa and DESY) have switched to HTCondor, and so do we. SGE was specially modified to work on our cluster almost transparently across different OSes, so I need to modify HTCondor for the same (or better) transparency.

HTCondor natively supports some containers. Most interesting are the Singularity and Docker support, which can be enabled in the future. For the moment, Singularity can be started from job scripts explicitly (e.g. see /work/zhelezov/singularity/FCNCfitter/).
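
A sketch of that idea, assuming a job script submitted under Ubuntu (the image path and the program name are placeholders, not the actual FCNCfitter setup):

#!/bin/bash
# Run the actual workload inside a Singularity container that provides
# the required OS, while the batch job itself runs under Ubuntu.
singularity exec /auto/work/<user>/images/centos7.sif ./run_analysis.sh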

It seems that people prefer transparent running in "default" containers. That mimics other centers (e.g. CERN), where the batch OS and the interactive OS are the same (CentOS7 at the moment). So a job started from CentOS7 will automatically run in CentOS7. Also, only CPUs compatible with the current LHCb software are considered by default. Jobs submitted from Ubuntu will start under Ubuntu, without automatic CPU type preferences. Manual steering exists and will be documented.

HTCondor supports "interactive" jobs as well as connecting to running jobs. These features are supposed to work on our cluster as well.

PPS. A well-defined and supported unrolled CentOS7 image for Docker/Singularity is still wanted. While there are some for CMS and ATLAS, I have found that the "standard" CERN one has no graphics libraries, and so most programs (e.g. ROOT) can't work correctly there, even in a proper "CERN view" environment.

Email 0 from Alexey:

Dear colleagues,

/auto/data should again be accessible from all LHCb servers. /work backups should work again (starting tonight).

The d0new server was upgraded; it now has Ubuntu 20.04, SLC6 and CentOS7 environments (ports 24, 28, 30).

The documentation for the d0 cluster has been updated: http://d0.physi.uni-heidelberg.de/d0.html A short summary of what we have:

  • 4 nodes with a CentOS7 environment (lhcba1, d0new, delta, lhcbi1)
  • Singularity support under Ubuntu (lhcba1, d0new, delta, batch system)
  • a 140-slot HTCondor batch system (currently PAT oriented)

Regards, Alexey.