Update 'computing_htcondor'
parent
7c14429048
commit
40b31d3a8f
@ -1,4 +1,8 @@
|
|||||||
*Email from Alexey:*
|
# HTCondor use on the d0 cluster
|
||||||
|
|
||||||
|
## Emails from Alexey:
|
||||||
|
|
||||||
|
*Email 3 from Alexey:*
|
||||||
|
|
||||||
Dear colleagues,
|
Dear colleagues,
|
||||||
|
|
||||||
@ -34,3 +38,92 @@ Jobs with higher requirements will find no working nodes. Please let me know if
|
|||||||
|
|
||||||
Regards,
|
Regards,
|
||||||
Alexey.
|
Alexey.
|
||||||
|
|
||||||
|
*Email 2 from Alexey:*
|
||||||
|
|
||||||
|
Dear colleagues,
|
||||||
|
|
||||||
|
lhcba1 was updated and rebooted. The process itself was trouble-free, but after a while the server lost its network connection and had to be restarted again.
|
||||||
|
The reason is unknown. Please let me know if it gets stuck again, preferably via WhatsApp +4917648253440.
|
||||||
|
For the same reason, please do NOT try to use the GPU until Monday.
|
||||||
|
|
||||||
|
I have decided to change the container scheme right away, so port 30 is now a new-style container.
|
||||||
|
The only expected change for users: Singularity can be used from the CentOS7 container.
|
||||||
|
|
||||||
|
The change "simplifies" the proposed tests: just check that things still work fine on lhcba1 port 30. Any feedback about the batch system behavior is still more than welcome.
|
||||||
|
|
||||||
|
Note that using --fakeroot with Singularity requires a local configuration change (per user); let me know if you want to create images.
|
||||||
|
|
||||||
|
While the following can be deduced from the proposed test and the PS, I forgot to mention the change explicitly:
|
||||||
|
Condor submission from CentOS7 on lhcba1 no longer starts jobs under Ubuntu; it starts them under CentOS7 (at the moment on one server only).
|
||||||
|
To submit jobs to Ubuntu (e.g. with Singularity), please use Ubuntu (port 24) on lhcba1 for d0new.
|
||||||
|
|
||||||
|
|
||||||
|
Regards,
|
||||||
|
Alexey.
|
||||||
|
|
||||||
|
|
||||||
|
*Email 1 from Alexey:*
|
||||||
|
|
||||||
|
Dear colleagues,
|
||||||
|
|
||||||
|
1) I am going to reboot lhcba1 later today (~17:00) to apply updates and enable the GPU driver. Please let me know if that time is inconvenient for you.
|
||||||
|
|
||||||
|
2) I am trying to prepare a "user-friendly" environment on our cluster, and there is some progress in that respect. But before I apply significant changes, please check that you are able to work in the new environment.
|
||||||
|
|
||||||
|
To test:
|
||||||
|
|
||||||
|
a) ssh to lhcba1 port 30 (e.g. 'ssh -p 30 <user name>@lhcba1').
|
||||||
|
b) run the 'condor_submit -interactive' command. After a short time you should be "logged in".
|
||||||
|
c) check that you can do everything you usually do in CentOS7 there, including running your own programs, editor, etc. Some software that was requested for interactive nodes only may not be installed there, but the rest is expected to work.
|
||||||
|
c*) if you plan to use Singularity from CentOS7, check it as well (except the --fakeroot functionality).
|
||||||
|
d) "exit" from that environment. There are just 30 slots at the moment.
|
||||||
|
|
||||||
|
Please report any problems you observe during this exercise. If you do see problems, try "ssh -p 30 <user name>@lhcb-raid03" and run the failing part there.
|
||||||
|
|
||||||
|
Note that lhcba1 CentOS7 is going to be switched to this environment soon (probably next week).
|
||||||
|
|
||||||
|
Regards,
|
||||||
|
Alexey
|
||||||
|
|
||||||
|
PS. Technical background, just for information in case someone is interested.
|
||||||
|
|
||||||
|
Our cluster has been running Ubuntu as the primary OS for more than 10 years, with SLC4, SLC5 and SLC6 DIY containers. Using the same approach, I hope we can have a reasonable environment for the next 10 years. That is definitely not the case if CentOS7 is the only system (as in most other centers).
|
||||||
|
|
||||||
|
My DIY containers had some benefits, but they show significant weaknesses compared to other container technologies. Most importantly, it is not possible to use Singularity from my current DIY solution.
|
||||||
|
The current approach is to use systemd-nspawn container technology. If that fails for some reason (and maybe in parallel), I can try Docker.
|
||||||
|
|
||||||
|
One of the features of our old farm was the SGE batch system. Most centers (including CERN, GridKa and DESY) have switched to HTCondor, and so do we. SGE was specially modified to work on our cluster almost transparently across different OSes, so I need to modify HTCondor to achieve the same (or better) transparency.
|
||||||
|
|
||||||
|
HTCondor natively supports some containers. The most interesting are Singularity and Docker support, which can be enabled in the future. For the moment, Singularity can be started explicitly from job scripts (e.g. see /work/zhelezov/singularity/FCNCfitter/).
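As a minimal sketch of starting Singularity explicitly from a job, the helper below only builds a `singularity exec` command line. The image path and the payload command are hypothetical placeholders and are not taken from the example directory mentioned above.

```python
import shlex


def singularity_exec_argv(image, payload):
    """Build the argument list for `singularity exec <image> <payload...>`."""
    if not payload:
        raise ValueError("payload command must not be empty")
    return ["singularity", "exec", image, *payload]


# Hypothetical image and payload, for illustration only.
argv = singularity_exec_argv("/path/to/image.sif", ["python3", "fit.py"])
print(shlex.join(argv))
```

Inside a real job script the list could be handed to `subprocess.run(argv, check=True)`, so that a failing payload also fails the job.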
|
||||||
|
|
||||||
|
It seems like people prefer transparent running in "default" containers.
|
||||||
|
That mimics other centers (e.g. CERN) where the batch OS and the interactive OS are the same (CentOS7 at the moment). So, a job started from CentOS7 will automatically run in CentOS7. Also, only CPUs compatible with current LHCb software are considered by default. Jobs submitted from Ubuntu will start under Ubuntu, without automatic CPU type preferences.
|
||||||
|
Manual steering exists and will be documented.
|
||||||
|
|
||||||
|
HTCondor supports "interactive" jobs as well as connecting to running jobs. These features are supposed to work on our cluster as well.
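The two features can be sketched as command lines, assuming the standard HTCondor tools `condor_submit -interactive` and `condor_ssh_to_job` are what is meant here. The helper names and the job id below are hypothetical, and the code only builds the commands without running them.

```python
import re


def interactive_job_argv():
    """Command that asks HTCondor for an interactive job."""
    return ["condor_submit", "-interactive"]


def attach_to_job_argv(job_id):
    """Command that opens a shell next to a running job.

    job_id is the HTCondor job id, given as "cluster" or "cluster.proc".
    """
    if not re.fullmatch(r"\d+(\.\d+)?", job_id):
        raise ValueError(f"unexpected job id: {job_id!r}")
    return ["condor_ssh_to_job", job_id]


# Hypothetical job id, for illustration only.
print(" ".join(interactive_job_argv()))
print(" ".join(attach_to_job_argv("1234.0")))
```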
|
||||||
|
|
||||||
|
PSPS. A well-defined and supported unrolled CentOS7 image for Docker/Singularity is still wanted. While there are some for CMS and ATLAS, I have found that the "standard" CERN one has no graphics libraries.
|
||||||
|
And so most programs (e.g. ROOT) can't work correctly there, even in a proper "CERN view" environment.
|
||||||
|
|
||||||
|
*Email 0 from Alexey:*
|
||||||
|
|
||||||
|
Dear colleagues,
|
||||||
|
|
||||||
|
/auto/data should again be accessible from all LHCb servers. /work backups should work again (starting tonight).
|
||||||
|
|
||||||
|
The d0new server was upgraded; it now has Ubuntu 20.04, SLC6 and CentOS7 environments (ports 24, 28, 30).
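A small sketch of how the port selects the environment, assuming the three ports are listed in the same order as the three environments (Ubuntu 20.04 on 24, SLC6 on 28, CentOS7 on 30). The dictionary keys, the function name and the user name are hypothetical.

```python
# Assumed mapping: ports listed in the same order as the environments.
D0NEW_PORTS = {"ubuntu20.04": 24, "slc6": 28, "centos7": 30}


def ssh_argv(user, environment, host="d0new"):
    """Build the ssh command that logs in to the requested environment."""
    port = D0NEW_PORTS[environment.lower()]
    return ["ssh", "-p", str(port), f"{user}@{host}"]


# Hypothetical user name, for illustration only.
print(" ".join(ssh_argv("alice", "CentOS7")))
```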
|
||||||
|
|
||||||
|
The documentation for the d0 cluster has been updated:
|
||||||
|
http://d0.physi.uni-heidelberg.de/d0.html
|
||||||
|
Short summary. We have:
|
||||||
|
* 4 nodes with CentOS7 environment (lhcba1, d0new, delta, lhcbi1)
|
||||||
|
* singularity support under Ubuntu (lhcba1, d0new, delta, batch system)
|
||||||
|
* a 140-slot HTCondor batch system (currently PAT oriented).
|
||||||
|
|
||||||
|
Regards,
|
||||||
|
Alexey.
|
||||||
|
|
||||||
|