LQCD Run-time Environment

This document presents the common Run Time Environment for the national LQCD facilities.  It is divided into five sections:

File System – defines the logical view of the file systems accessible to each part of the environment, as well as certain broad policy aspects of those file systems

Interactive – defines the environment a user sees by logging into the system

Shared Batch + Interactive – defines a number of utilities & capabilities which can be accessed both from the interactive shell, and from the batch script.

Batch Script – defines the environment seen by the batch script

Parallel Execution – defines the environment seen by the multi-node parallel executable.

Sections in blue below are not yet completely specified and will evolve in the near future.

 

File System

1.     Each user has a normal home directory

       /home/<user-name>  also $HOME

       backed up “frequently”

       small quota (typically a few GB, more on request if justified)

2.     There is a large shared file system with large quotas

       /cache/... is the name of the root of this large file system, also

       /cache/projectA is a sub-tree for projectA writeable by people in that project (defined by a unix group)

       /cache/users/<user-name> will be an area writeable by the named user

       this file system is not NFS mounted to the parallel machine

       probably not backed up, but maybe RAID

       possibly backed by tertiary storage

       may be oversubscribed, but there will always be space to write the output of active jobs; for this year, the system will attempt to maintain at least 100 GBytes of free space

       future goal: jobs can declare how much they will need, and the system will make that much available (up to 100 GBytes total)

       possibly with auto-migrate active (to tape or another site), where migration may depend upon the location in the file system (e.g. /cache/migrate/...) or may be controlled by policy or file attributes specified by special files within the file system (user must refer to local site documentation until this is standardized)

       commands will be provided to move large files (even greater than 2 GB) between the parallel nodes and the large file system (see batch, below)

       large quotas may be active

3.     There is a small file system shared by interactive, batch script, and compute node 0

       /qdata/… is the name of the root of this file system (also $QDATA)

       Files are persistent for the life of the job (no backup system)

       Can be used to hold small data files, log files, etc. to be staged in for initialization or staged out at end of job

4.     there is a high performance file system (or area) accessible to, and unique to, each compute node

       path name is accessible to the job script as $QSCRATCH (same path on all nodes)

       ??? is this ONLY visible to the compute node (as is true today for the cluster nodes’ local /scratch area) ???

       ??? are the files inaccessible to the batch script, with special commands available to copy/split/unsplit files to/from this area ???

       the aggregate size (across all nodes in a job) is modest: at least one terabyte per teraflops sustained

       the area is not guaranteed to be persistent after the batch job completes (see local policy)

5.     there is no maximum file size, but for a little while longer, files over 2 GBytes may not work with all utilities (even Linux has troubles still)
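
The free-space target in item 2 and the 2 GByte caveat in item 5 can both be checked by a script before a job writes its output. The following Python sketch is illustrative only: the helper names has_room and files_over_limit are hypothetical, and it assumes that "2 GBytes" refers to the largest file offset a signed 32-bit integer can hold.

```python
import os
import shutil

# ASSUMPTION: "2 GBytes" means the largest offset a signed 32-bit integer holds.
MAX_32BIT_OFFSET = 2**31 - 1


def has_room(path, needed_bytes):
    """Return True if the file system holding `path` has `needed_bytes` free now."""
    return shutil.disk_usage(path).free >= needed_bytes


def files_over_limit(root, limit=MAX_32BIT_OFFSET):
    """Return sorted (path, size) pairs for files under `root` larger than `limit`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symbolic links so each file is reported once, under its real name.
            if os.path.islink(full):
                continue
            size = os.path.getsize(full)
            if size > limit:
                found.append((full, size))
    return sorted(found)
```

A call such as has_room("/cache", 50 * 1024**3) would report whether 50 GBytes are free at that moment on the file system holding /cache; it only inspects the current state and reserves nothing.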

 

Interactive

1.     user home directory (/home), large file system (/cache), and small shared data area (/qdata) are mounted (/qstage)

2.     users can ssh into the interactive node of the facility (may require 2 hops)

3.     standard Unix shell environment, both bash & tcsh

4.     other Unix / scientific tools, libraries: TBD

       (cross) compilers,

       perl, python, …

       blas? lapack?

       parallel debugging tools?

5.     scp to/from remote nodes

       single hop transfers (ASAP)

       scp will be kerberized to talk to FNAL:

1.     kinit to get kerberos 24 hour ticket

2.     kinit -r to renew up to 7 days

6.     environment variables standardized (list to be determined)

       must use a “setup” script for target architecture

a.      setup <target>, target = qcdoc, myrinet, gige, infiniband, … as well as flags that specify special versions of libraries, e.g. QDP without SSE code; flags will default to the site’s normal production environment

+        example for old version number

b.     setup script may manipulate PATH as well as define environment variables

       to include locations for MPI, QMP, QDP, etc. (QMP_DIR, …)

       to include name of cross compiler, cross linker ( $QCC, $QLD, $QAR, $QAS, …)

       local host compiler, etc. accessed via standard commands

       open issue: do we standardize SciDAC packages to have …/bin/conf files?

7.     manage storage system for work flow:

       pin / unpin files

       mark files as permanent / volatile (to be saved on tertiary storage or not)

8.     data grid commands

       query / browse meta data catalog (mapping between physics parameters and Global File Name(s), GFN)

       fetch by global file name (GFN) from remote site or local silo (uses Replica Catalog to find a copy)

       translate GFN to local file name (mounted path, typically in /cache)

       push / publish a new file, creating entries in meta data and replica catalogs

       request a copy of file to move to (this / remote) site (updates Replica Catalog)
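
Item 6 above says the setup script exports the name of the cross compiler ($QCC) and package locations such as QMP_DIR. As a rough illustration of how a build helper might consume those variables, the Python sketch below assembles a compile command without running it. The function name cross_compile_command, the cc-style flags (-o, -I, -L, -l), the include/ and lib/ sub-directories, and the library name qmp are all assumptions made for this example; none of them is specified in this document.

```python
import os


def cross_compile_command(source, output, env=None):
    """Build (but do not run) a compile command from the setup script's variables."""
    env = os.environ if env is None else env
    try:
        compiler = env["QCC"]  # cross compiler name, exported by "setup <target>"
    except KeyError:
        raise RuntimeError("QCC is not set; run setup <target> first") from None

    command = [compiler, "-o", output, source]

    qmp_dir = env.get("QMP_DIR")
    if qmp_dir:
        # ASSUMPTION: QMP_DIR contains include/ and lib/, and the library is "qmp".
        command += [
            "-I" + os.path.join(qmp_dir, "include"),
            "-L" + os.path.join(qmp_dir, "lib"),
            "-lqmp",
        ]
    return command
```

Passing the environment as a dictionary keeps the helper easy to exercise outside a real session; in normal use the default picks up whatever the setup script exported.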

 

Batch: Interactive + Batch Execution Script commands

1.     standard command to submit a batch job

       initial: PBS / LSF-like capabilities: qsub, qstat, qdel, qalter, qhold, qrls

       queue names will be used to select a piece of the machine, documented on the site’s web page

2.     $QSCRATCH points to a path which will be unique for each compute node and points to the highest performance disk accessible to the compute nodes

3.     batch time limits will be set per machine (or partition)

4.     data grid commands & interacting with local storage manager (above) allowed

5.     capturing stdout, stderr from N nodes:

       node 0 spools to host, optionally capture output from all nodes

       option: other nodes have short buffer (can be fetched)

       all of this can be seen from interactive node (how: TBD)

       can be directed to a file visible (live) to interactive user

       stdout, stderr may or may not be treated separately

       can be redirected on qcdrun command line (a la mpirun)

6.     exit status of batch jobs

       from interactive node, ability to query exit status of a completed batch script; i.e. given the job number, return exit status “$ jobstat <jobid>” where 0 is good

7.     batch jobs can submit batch jobs (auto dependent on current job)

8.     when submitting, have a way of specifying the “charge code” (system checks that user is authorized to use this code)

9.     can get “account balance” for specified “charge code”
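
Items 1, 7 and 8 can be combined in a single submission command. The sketch below assumes the site's qsub accepts the usual PBS options -q (queue), -A (account string, used here for the charge code) and -W depend=afterok:<jobid> (run only after the named job succeeds); this document only promises "PBS / LSF-like" capabilities, so the actual option names may differ per site. The helper name qsub_command is hypothetical, and the function only builds the argument list.

```python
def qsub_command(script, queue, charge_code=None, after_job=None):
    """Return the argument list for submitting `script`; nothing is executed."""
    command = ["qsub", "-q", queue]
    if charge_code is not None:
        # ASSUMPTION: the charge code maps onto the PBS account string (-A).
        command += ["-A", str(charge_code)]
    if after_job is not None:
        # ASSUMPTION: PBS-style dependency; start only if `after_job` exits with 0.
        command += ["-W", "depend=afterok:" + str(after_job)]
    command.append(script)
    return command
```

A batch script that submits a follow-on job could pass its own job identifier as after_job (under PBS this is available in the PBS_JOBID environment variable) and hand the resulting list to subprocess.run.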

 

Batch Script

1.     batch script may be in any interactive script language (on clusters)

       restriction: initially, only the C shell is guaranteed on the qcdoc

2.     /cache, /home and /qdata are accessible on the machine executing the user’s batch script

       moderate performance (aggregate < typical net bandwidth, today <100MB/sec)

       to be verified for correct operation for dCache; may need to restrict the set of commands which operate on /cache

3.     the file systems in item 2 are NOT accessible to the executable (on compute nodes other than node 0), on the presumption that NFS doesn’t scale well; the user’s batch script is therefore responsible for moving files between the compute nodes and the large file system /cache using one or more of the following:

       command to replicate a file to all nodes

       command to split a lattice onto all nodes (to $QSCRATCH)

       command to unsplit a lattice from all nodes (from $QSCRATCH)

       command to gather a set of files from $QSCRATCH into a single tree (remain individual files, with node files in separate directories named either by logical node number or node coordinates: …/1.1.3.2/…)    (may perform poorly)

       qcp: command to move files between the interactive file systems and the compute node file systems (/qdata or $QSCRATCH); takes a source and a destination, and has cp-like semantics, including recursion

       tbd: name by which the batch script (qcp) can find $QSCRATCH for node 0, e.g. for an initialization file other than stdin

4.     batch script defines needed / expected machine size:

       qcdsetdimensions x y z t s w …         

       arbitrary number of dimensions in syntax, 3-4 in practice

       returns 0 if success, negative number otherwise with message to stderr

       modifies environment (env variables or special file) to be used by qcdrun in initializing the user’s parallel machine
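
The gather command in item 3 names per-node directories by logical node number or by node coordinates, and item 4 declares the machine dimensions. The relation between the two can be sketched as follows. This is only an illustration: the document does not define how logical node numbers map to coordinates, so the sketch assumes zero-based numbering with the first dimension varying fastest, and the helper names are hypothetical.

```python
def qcdsetdimensions_command(dims):
    """Return the argument list for qcdsetdimensions; nothing is executed."""
    if not dims or any(not isinstance(d, int) or d < 1 for d in dims):
        raise ValueError("dimensions must be positive integers")
    return ["qcdsetdimensions"] + [str(d) for d in dims]


def node_coordinates(rank, dims):
    """Map a logical node number to coordinates on a machine of shape `dims`.

    ASSUMPTION: ranks start at 0 and the first dimension varies fastest.
    """
    total = 1
    for d in dims:
        total *= d
    if not 0 <= rank < total:
        raise ValueError("rank %d is outside a machine of %d nodes" % (rank, total))
    coords = []
    for d in dims:
        coords.append(rank % d)  # position along this dimension
        rank //= d               # remaining index over the slower dimensions
    return tuple(coords)


def node_directory(coords):
    """Format coordinates as a dotted directory name such as 1.1.3.2."""
    return ".".join(str(c) for c in coords)
```

Under these assumptions, logical node 47 on a 2 x 2 x 4 x 4 machine has coordinates (1, 1, 3, 2), so its files would be gathered under a directory named 1.1.3.2.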

Parallel Job Execution

1.     parallel executables are launched from the batch script by special command

       qcdrun (options tbd) <executable> <args>

       executable may be on any file system accessible to the batch script (/home, /cache, /qdata)

2.     /user is not guaranteed to be accessible

3.     /qdata is accessible via open() on node 0 (not guaranteed to be accessible on other nodes)

4.     large file system (/cache) is not accessible via open() on the execution nodes

5.     each compute node has a transient private directory (will be deleted at end of job script execution) which is accessible via $QSCRATCH in the batch script environment and may be passed to the job via qcdrun; this is the highest performance file I/O system

6.     stdout, stderr are live from all nodes; using this excessively from lots of nodes is a bad idea

7.     stdin is not supported for the parallel job; instead, if text input is desired within the script file, it should be streamed out to a temporary file, and that file should be given to the parallel job as an input file
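
The stdin workaround in item 7 can be sketched in a few lines: write the text to a temporary file, then pass the file's path to the parallel job. In this Python illustration the helper names are hypothetical, the choice of $QDATA as the default location is an assumption (it is the area item 3 makes visible to node 0), and passing the path as an ordinary command-line argument is also an assumption, since the qcdrun options are still to be defined.

```python
import os
import tempfile


def stage_input(text, directory=None):
    """Write `text` to a new temporary file and return its path."""
    # ASSUMPTION: prefer $QDATA, which node 0 can open; otherwise the system default.
    directory = directory or os.environ.get("QDATA") or tempfile.gettempdir()
    fd, path = tempfile.mkstemp(prefix="qcdin.", suffix=".txt", dir=directory, text=True)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    return path


def qcdrun_command(executable, *args):
    """Return the argument list for launching the parallel job; nothing is executed."""
    # qcdrun options are "tbd" in this document, so none are emitted here.
    return ["qcdrun", executable] + list(args)
```

A batch script would call stage_input with the text it would otherwise have piped to stdin, append the returned path to the executable's arguments, and remove the file once the job has finished.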