5 computing_cluster
Blake Leverington edited this page 2023-10-17 09:25:16 +02:00



The D0 Cluster

(this is modified text copied from http://d0.physi.uni-heidelberg.de/d0.html )

The d0 cluster got its name from its first node, "d0", which was deployed in 2006. The original hardware has died but was replaced with equivalent hardware. The node still hosts login scripts, cluster monitoring, the batch system and GuestNet. Since it is less powerful than a modern phone and still runs SLC4, it makes no sense to use it interactively.

We plan to restructure the whole cluster. Work is in progress.

Storage

"/work" storage.

Relatively high-performance 22 TB storage with daily backups. Physically it is inside the "sigma0" node, so all data-related operations (copy, backup, delete, etc.) are best done interactively on "sigma0". Please do not use this space for data sets larger than O(100 GB): dividing the total size by the number of users makes it clear that we cannot provide several TB per user on this storage. Note that sigma0 is an old server; with 32 GB RAM it is not a good option for intensive parallel compilation, analysis, etc.
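
The size guidance above can be turned into a quick check before copying a data set. The following is a minimal sketch, assuming a hard cut at 100 GB (the text only says "O(100 GB)"); the function names are hypothetical and not part of any cluster tooling. Larger data sets are pointed to the "/auto/data" storage described in this section.

```shell
# Threshold in GB above which data should go to /auto/data (taken from the
# "O(100 GB)" guidance above; the exact cut-off is an assumption).
WORK_LIMIT_GB=100

# Print the suggested storage area for a data set of the given size in whole GB.
suggest_storage() {
    if [ "$1" -gt "$WORK_LIMIT_GB" ]; then
        echo "/auto/data"
    else
        echo "/work"
    fi
}

# Measure a directory with du (in kB), convert to whole GB (rounded down),
# and suggest a location for it.
suggest_storage_for_dir() {
    size_kb=$(du -sk "$1" | cut -f1)
    suggest_storage $(( size_kb / 1024 / 1024 ))
}
```

For example, "suggest_storage_for_dir ./ntuples" (a placeholder directory) prints either /work or /auto/data.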

"/auto/data" storage.

Slow distributed 260 TB storage for big files. No backups. Physically it is on several servers, so there is no particular node from which the access is faster.

If you need space for huge data sets, backups, etc., this is the correct place. Note that working with many small files is slow, so keeping your GANGA directory there is not the best option (better to put it on /work but configure it to put data files on /auto/data, or at least clean GANGA regularly, moving results when required).

CVMFS.

Login

  • To start the LHCb login scripts from CVMFS automatically, the file ".cvmfs" should be created in the home directory. If you need a particular resource (other than lhcb.cern.ch), please contact me.

  • To disable the automatic start: on a supported OS with a working /cvmfs mount (e.g. lxplus.cern.ch or lxplus7.cern.ch), disable the default login environment by creating the file ~/.nogrouplogin:

   touch ~/.nogrouplogin

and log in again to the machine.

Then set the environment up:

   source /cvmfs/lhcb.cern.ch/lib/LbEnv

or, to test the upcoming new features:

   source /cvmfs/lhcb.cern.ch/lib/LbEnv-dev

Note: the LbEnv script works for both bash and tcsh.
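
When the environment is set up by hand, it can help to fail with a clear message if /cvmfs is not mounted. A minimal sketch for bash/sh (not tcsh), assuming the script path given above; the function name setup_lbenv and the override variable LBENV_SCRIPT are hypothetical:

```shell
# Location of the LHCb environment script as given above; can be overridden
# by setting LBENV_SCRIPT beforehand (e.g. to the LbEnv-dev variant).
LBENV_SCRIPT=${LBENV_SCRIPT:-/cvmfs/lhcb.cern.ch/lib/LbEnv}

# Source the LbEnv script if it is readable; otherwise report the problem
# and return a non-zero status.
setup_lbenv() {
    if [ -r "$LBENV_SCRIPT" ]; then
        . "$LBENV_SCRIPT"
    else
        echo "cannot read $LBENV_SCRIPT (is /cvmfs mounted?)" >&2
        return 1
    fi
}
```

After creating ~/.nogrouplogin and logging in again, calling setup_lbenv in the shell has the same effect as the source line above.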

Computing resources

Interactive nodes.

    lhcba1 (Ubuntu 20.04 / CentOS7) - general purpose modern AMD based interactive server
    lhcba2 (Ubuntu 22.04 / CentOS7) - newer general purpose interactive server, see the notes below
    d0new (Ubuntu 20.04 / SLC6 / CentOS7) - general purpose older AMD based interactive server
    delta (Ubuntu 18.04 / CentOS7) - general purpose very old AMD based interactive server
    lhcbi1 (CentOS7) - special purpose modern Intel based interactive server. Please avoid using it if you can (unless you are sure it is there for you).
    d0bar-new (SLC6) - general purpose interactive node, also the SciFi Web server at the moment. "Killing" it (running jobs which eat all RAM, etc.) can have administrative consequences...
    sigma0 (SLC6) - hosts the local "/work", relatively low on memory, so not for running long tasks (the admin can kill them at any time if he sees them disturbing other people...)

A new server (lhcba2) is now available for the same use cases as lhcba1.

Please note this server is NOT a clone of lhcba1, in particular:

  • it has more modern hardware and 4 times more RAM
  • the base system is Ubuntu 22.04
  • extra software is NOT installed

I ask every (potential) user of this server to let me know:

  1. which extra software (under Ubuntu and/or CentOS7) you need at the system level (don't forget to specify the OS). If the required version is not the default for that OS, please specify the exact version. If there is a specific installation procedure, include a link to the installation documentation.
  2. whether you prefer to use the server interactively, through the batch system, or mixed (e.g. batch system during weekends)
  3. whether you need an extra OS on that server (with an explanation of why you can't use singularity).

Standard services which should already work:

  • interactive login into Ubuntu (port 24) and CentOS7 (port 30)
  • access to /work, /auto/data and /cvmfs (in both OS)
  • /scratch (in Ubuntu)
  • possibility to run singularity (in both OS)
  • our batch system client (in both OS, with the same rules as on lhcba1)
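
The services listed above can be checked quickly after the first login. A minimal sketch, assuming the storage areas appear as ordinary directories; the function name check_dirs is hypothetical:

```shell
# Print "ok <path>" or "missing <path>" for each directory given as argument.
# The return status is non-zero if at least one directory is missing.
check_dirs() {
    status=0
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            echo "ok $dir"
        else
            echo "missing $dir"
            status=1
        fi
    done
    return $status
}

# Usage on the Ubuntu side of lhcba2 (/scratch is listed for Ubuntu only):
#   check_dirs /work /auto/data /cvmfs /scratch
#   command -v singularity || echo "singularity not in PATH"
```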

Please note that any questions should be directed to me; the central EDV is NOT managing LHCb servers.

OS.

To log in to a particular OS, use the corresponding port (not all OSes exist on all nodes):

    22 (default) - SLC4
    24 - host Ubuntu
    26 - SLC5
    28 - SLC6
    30 - CentOS7
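
The port table above maps directly onto the -p option of ssh. A minimal sketch; the function name port_for_os and the lowercase OS labels are hypothetical, and the user name and host in the usage line are placeholders:

```shell
# Map an OS name to the SSH port listed above; unknown names give an error.
port_for_os() {
    case "$1" in
        slc4)    echo 22 ;;
        ubuntu)  echo 24 ;;
        slc5)    echo 26 ;;
        slc6)    echo 28 ;;
        centos7) echo 30 ;;
        *)       echo "unknown OS: $1" >&2; return 1 ;;
    esac
}

# Usage (remember that not every OS exists on every node):
#   ssh -p "$(port_for_os centos7)" username@lhcba1
```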

Note that the default "CMTCONFIG" is historical; each user should specify what he/she is using (e.g. the gcc version). CentOS7 supports the CVMFS based LHCb environment only.

Batch system.

SGE is deprecated.

A default HTCondor configuration is deployed, currently with ~200 slots. Submit hosts are lhcba1, d0new and delta (Ubuntu). Note that jobs run under Ubuntu (18.04 / 20.04) on hosts WITHOUT a local CentOS7 environment, so at the moment the batch system is usable with singularity containers only (sufficient for the PAT group).
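
Since jobs only have the Ubuntu host environment, a job typically wraps its payload in a container call. The sketch below writes a wrapper script and a basic HTCondor submit description. It is an illustration under assumptions, not the cluster's official recipe: the function name, the file names, the image and the payload paths are all hypothetical, and it assumes singularity is found in PATH on the worker nodes and that the paths are on shared storage.

```shell
# Write a wrapper script and a submit description into the given directory.
# $1 = job directory, $2 = container image, $3 = payload script
# (image and payload are placeholders, ideally absolute paths on shared storage).
write_condor_job() {
    job_dir=$1
    image=$2
    payload=$3

    # Wrapper executed on the worker node: runs the payload inside the container.
    cat > "$job_dir/run_in_container.sh" <<EOF
#!/bin/sh
exec singularity exec "$image" "$payload"
EOF
    chmod +x "$job_dir/run_in_container.sh"

    # Submit description; the ClusterId and ProcId macros are expanded by
    # HTCondor, not by the shell (the heredoc delimiter is quoted).
    cat > "$job_dir/job.sub" <<'EOF'
executable = run_in_container.sh
output     = job.$(ClusterId).$(ProcId).out
error      = job.$(ClusterId).$(ProcId).err
log        = job.$(ClusterId).log
queue
EOF
}
```

From a submit host, the job would then be submitted with "condor_submit job.sub" inside the job directory and followed with "condor_q".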

Containers.

    Singularity is supported from the Ubuntu environment. You can add it to your path with:

    export PATH=/work/software/singularity/latest/`/work/software/os_version`/bin:$PATH
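
If the export line is placed in a shell startup file, PATH grows every time the file is sourced again. A minimal sketch of an idempotent variant; the function name prepend_path is hypothetical:

```shell
# Prepend a directory to PATH unless it is already listed there.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Usage, with the directory from the export line above:
#   prepend_path "/work/software/singularity/latest/$(/work/software/os_version)/bin"
#   singularity --version
```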

Best Practices:

1. Compile: store the code on /work, compile on lhcba1, d0new or delta (also d0bar-new; only non-intensive compilations on sigma0; special cases on lhcbi1)
2. Run tests: lhcba1, d0new, delta, d0bar-new
3. Run jobs: batch system, lhcba1, d0new, d0bar-new
4. GANGA: lhcba1, d0new, d0bar-new, delta (depends on what it really does and where it runs jobs)
5. Store big files: /auto/data (only); in general that also includes GRID produced files from GANGA. But keeping the Ganga job directory on /auto/data is not a good idea.
6. As with everything on "common" resources, it is OK to do things which are required and do not disturb others more than necessary. But first and most important, the consequences of any operation should be clear before starting it.
7. So "do not do anything you do not understand". True for computers, conference systems, touching high voltage cables and working with radioactive materials...

Environment Setup

See what is already defined in your environment:

env

The basic environment, which gives you access to the Python-based lb-XXXX environment tools:

source /cvmfs/lhcb.cern.ch/lib/LbEnv 

Check which platforms are available on the server you logged in to:

lb-describe-platform

Most recent available platforms (last checked 26.02.2021):

  • sigma0: x86_64-slc6-gcc49-opt
  • delta: x86_64-centos7-gcc49-opt, x86_64-slc6-gcc49-opt
  • d0new: x86_64-slc6-gcc8-opt
  • d0bar-new: x86_64-slc6-gcc8-opt
  • lhcbi1: x86_64-centos7-gcc9+py3-opt, x86_64-centos7-clang10-opt
  • *new server to be delivered soon (Intel). A lack of AMD support is part of the reason for the lack of recent platform availability.

You can produce a list of the installed Projects and their version names and compatible platforms.

lb-export-project-info  out.txt

You can run an executable with the Project environment:

lb-run -c <platform> <Project>/<version> executable

You can create a shell in the terminal with the Project environment by using bash as the executable. For example, on sigma0:

lb-run -c x86_64-slc6-gcc49-opt Urania/v7r0 bash

will provide a nominal environment for compiling and running C++ code (gcc) with ROOT.
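
To avoid looking up the platform string each time, the per-host list above can be wrapped in a small helper. A minimal sketch; the function name default_platform is hypothetical, and the table is only as current as the list it was copied from (last checked 26.02.2021). The platform that actually works also depends on which OS you logged in to, so "lb-describe-platform" remains the authoritative check.

```shell
# Platform per host, copied from the list above. Where several platforms
# are listed for a host, the first one is returned.
default_platform() {
    case "$1" in
        sigma0)          echo x86_64-slc6-gcc49-opt ;;
        delta)           echo x86_64-centos7-gcc49-opt ;;
        d0new|d0bar-new) echo x86_64-slc6-gcc8-opt ;;
        lhcbi1)          echo x86_64-centos7-gcc9+py3-opt ;;
        *)               echo "no platform recorded for $1" >&2; return 1 ;;
    esac
}

# Usage: open a shell with a Project environment on the current host
# (whether a given Project version exists for that platform is not guaranteed).
#   lb-run -c "$(default_platform "$(hostname -s)")" Urania/v7r0 bash
```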