LQCD Run-time Environment
This document presents the common Run Time Environment for the national LQCD facilities. It is divided into five sections:
File System – defines the logical view of the file systems accessible to each part of the environment, as well as certain broad policy aspects of those file systems
Interactive – defines the environment a user sees by logging into the system
Shared Batch + Interactive – defines a number of utilities & capabilities which can be accessed both from the interactive shell, and from the batch script.
Batch Script – defines the environment seen by the batch script
Parallel Execution – defines the environment seen by the multi-node parallel executable.
Sections in blue below are not yet completely specified and will evolve in the near future.
1. Each user has a normal home directory
□ /home/<user-name> also $HOME
□ backed up “frequently”
□ small quota (typically a few GB, more on request if justified)
2. There is a large shared file system with large quotas
□ /cache/... is the name of the root of this large file system, also
□ /cache/projectA is a sub-tree for projectA writeable by people in that project (defined by a unix group)
□ /cache/users/<user-name> will be an area writeable by the named user
□ this file system is not NFS mounted to the parallel machine
□ probably not backed up, but maybe RAID
□ possibly backed by tertiary storage
□ may be oversubscribed, but there is always space to write the output of active jobs; for this year, the system will attempt to maintain at least 100 GBytes of free space
□ future goal: jobs can declare how much they will need, and the system will make that much available (up to 100 GBytes total)
□ possibly with auto-migrate active (to tape or another site), where migration may depend upon the location in the file system (e.g. /cache/migrate/...) or may be controlled by policy or file attributes specified by special files within the file system (user must refer to local site documentation until this is standardized)
□ commands will be provided to move large files (even greater than 2 GB) between the parallel nodes and the large file system (see batch, below)
□ large quotas may be active
3. There is a small file system shared by interactive, batch script, and compute node 0
□ /qdata/… is the name of the root of this file system (also $QDATA)
□ Files are persistent for the life of the job (no backup system)
□ Can be used to hold small data files, log files, etc. to be staged in for initialization or staged out at end of job
4. there is a high performance file system (or area) accessible to, and unique to, each compute node
□ path name is accessible to the job script as $QSCRATCH (same path on all nodes)
□ ??? is this ONLY visible to the compute node (as is true today for the cluster nodes’ local /scratch area) ???
□ The files are not accessible to the batch script? Special commands will be available to copy/split/unsplit files to/from this area ?
□ the aggregate size (across all nodes in a job) is modest: at least one terabyte per teraflops sustained
□ the area is not guaranteed to be persistent after the batch job completes (see local policy)
5. there is no maximum file size, but for a little while longer, files over 2 GBytes may not work with all utilities (even Linux has troubles still)
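As an illustration of item 2 above, a batch script that is about to write a large output file to /cache could first check the free space. This is a minimal sketch, not a site-provided tool: the function name `has_room` is hypothetical, and treating 1 GByte as 10^9 bytes is an assumption. It relies only on `shutil.disk_usage` from the Python standard library.

```python
import shutil

GB = 10**9  # assumption: decimal gigabytes; the document does not say which unit is meant

# Free space the system attempts to maintain on /cache this year (from the text above).
TARGET_FREE_BYTES = 100 * GB


def has_room(path, needed_bytes):
    """Return True if the file system holding `path` has at least `needed_bytes` free."""
    return shutil.disk_usage(path).free >= needed_bytes
```

A script might call `has_room("/cache/projectA", 20 * GB)` before staging out results; the path and size here are only examples.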
1. user home directory (/home), large file system (/cache), and small shared data area (/qdata) are mounted (/qstage)
2. users can ssh into the interactive node of the facility (may require 2 hops)
3. standard Unix shell environment, both bash & tcsh
4. other Unix / scientific tools, libraries: TBD
□ (cross) compilers,
□ perl, python, …
□ blas? lapack?
□ parallel debugging tools?
5. scp to/from remote nodes
□ single hop transfers (ASAP)
□ scp will be kerberized to talk to FNAL:
1. kinit to get kerberos 24 hour ticket
2. kinit -r to renew up to 7 days
6. environment variables standardized (list to be determined)
□ must use a “setup” script for target architecture
a. setup <target>, where target = qcdoc, myrinet, gige, infiniband, …, as well as flags that specify special versions of libraries, e.g. QDP without SSE code; flags will default to the site’s normal production environment
+ example for old version number
b. setup script may manipulate PATH as well as define environment variables
□ to include locations for MPI, QMP, QDP, etc. (QMP_DIR, …)
□ to include name of cross compiler, cross linker ( $QCC, $QLD, $QAR, $QAS, …)
□ local host compiler, etc. accessed via standard commands
□ open issue: do we standardize SciDAC packages to have …/bin/conf files?
7. manage storage system for work flow:
□ pin / unpin files
□ mark files as permanent / volatile (to be saved on tertiary storage or not)
8. data grid commands
□ query / browse meta data catalog (mapping between physics parameters and Global File Name(s), GFN)
□ fetch by global file name (GFN) from remote site or local silo (uses Replica Catalog to find a copy)
□ translate GFN to local file name (mounted path, typically in /cache)
□ push / publish a new file, creating entries in meta data and replica catalogs
□ request a copy of file to move to (this / remote) site (updates Replica Catalog)
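Item 6 above says the setup script publishes the cross-compiler name and package locations through environment variables. The sketch below shows how a build helper might consume them. Only `QCC` and `QMP_DIR` come from the text; the helper name `cross_compile_command`, the `include` subdirectory under `QMP_DIR`, and the `-I`/`-c`/`-o` flags (conventional for C compilers, not confirmed for the cross compiler) are assumptions.

```python
import os


def cross_compile_command(source, output, env=None):
    """Build an argument list that compiles `source` with the cross compiler.

    Assumes `setup <target>` has already exported QCC and QMP_DIR.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("QCC", "QMP_DIR") if name not in env]
    if missing:
        raise RuntimeError("run the setup script first; missing: " + ", ".join(missing))
    include_dir = os.path.join(env["QMP_DIR"], "include")  # assumed directory layout
    return [env["QCC"], "-I", include_dir, "-c", source, "-o", output]
```

Failing early when the variables are absent makes a forgotten `setup <target>` obvious, rather than silently falling back to the local host compiler.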
1. standard command to submit a batch job
□ initial: PBS / LSF-like capabilities: qsub, qstat, qdel, qalter, qhold, qrls
□ queue names will be used to select a piece of the machine, documented on the site’s web page
2. $QSCRATCH points to a path which will be unique for each compute node and points to the highest performance disk accessible to the compute nodes
3. batch time limits will be set per machine (or partition)
4. data grid commands & interacting with local storage manager (above) allowed
5. capturing stdout, stderr from N nodes:
□ node 0 spools to host, optionally capture output from all nodes
□ option: other nodes have short buffer (can be fetched)
□ all of this can be seen from interactive node (how: TBD)
□ can be directed to a file visible (live) to interactive user
□ stdout, stderr may or may not be treated separately
□ can be redirected on qcdrun command line (a la mpirun)
6. exit status of batch jobs
□ from interactive node, ability to query exit status of a completed batch script; i.e. given the job number, return exit status “$ jobstat <jobid>” where 0 is good
7. batch jobs can submit batch jobs (auto dependent on current job)
8. when submitting, have a way of specifying the “charge code” (system checks that user is authorized to use this code)
9. can get “account balance” for specified “charge code”
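The exit-status query in item 6 could be wrapped for use in a workflow script roughly as follows. This is a sketch under a stated assumption: that `jobstat <jobid>` prints the status as a bare integer on stdout. The document names the command but does not define its output format, and the wrapper names below are hypothetical. The `run` parameter exists only so the wrapper can be exercised without a real batch system.

```python
import subprocess


def job_exit_status(jobid, run=subprocess.run):
    """Return the exit status of a completed batch job as an int.

    Assumption: `jobstat <jobid>` prints the status as a bare integer on
    stdout; the document does not define the output format.
    """
    result = run(["jobstat", str(jobid)], capture_output=True, text=True, check=True)
    return int(result.stdout.strip())


def job_succeeded(jobid, run=subprocess.run):
    """True when the job's exit status is 0, which the document calls good."""
    return job_exit_status(jobid, run) == 0
```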
1. batch script may be in any interactive script language (on clusters)
□ initially, only c shell is guaranteed (qcdoc restriction)
2. /cache, /home and /qdata are accessible on the machine executing the user’s batch script
□ moderate performance (aggregate < typical net bandwidth, today <100MB/sec)
□ to be verified for correct operation for dCache; may need to restrict the set of commands which operate on /cache
3. these file systems are NOT accessible to the executable (compute nodes other than node 0), on the presumption that NFS doesn’t scale well; the user’s batch script is therefore responsible for moving files between the compute nodes and the large file system /cache using one or more of the following:
□ command to replicate a file to all nodes
□ command to split a lattice onto all nodes (to $QSCRATCH)
□ command to unsplit a lattice from all nodes (from $QSCRATCH)
□ command to gather a set of files from $QSCRATCH into a single tree (remain individual files, with node files in separate directories named either by logical node number or node coordinates: …/1.1.3.2/…) (may perform poorly)
□ qcp: command to move file from interactive file systems to/from compute node file systems (/qdata or $QSCRATCH); takes a source and destination, and has cp like semantics, including recursion
□ tbd: name by which the batch script (qcp) can find $QSCRATCH for node 0, e.g. for an initialization file other than stdin
4. batch script defines needed / expected machine size:
□ qcdsetdimensions x y z t s w …
□ arbitrary number of dimensions in syntax, 3-4 in practice
□ returns 0 if success, negative number otherwise with message to stderr
□ modifies environment (env variables or special file) to be used by qcdrun in initializing the user’s parallel machine
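The gather command in item 3 may name per-node directories by node coordinates (…/1.1.3.2/…), and qcdsetdimensions fixes the machine extents. One way to relate a logical node number to such a directory name is sketched below. The function names are hypothetical, and the ordering (first coordinate varies fastest) is an assumption; the document does not specify how logical node numbers map to coordinates.

```python
def node_coordinates(node, dims):
    """Convert a logical node number to coordinates for machine extents `dims`.

    Assumption: the first coordinate varies fastest.
    """
    total = 1
    for extent in dims:
        if extent <= 0:
            raise ValueError("every dimension must be positive")
        total *= extent
    if not 0 <= node < total:
        raise ValueError("node number out of range for this machine size")
    coords = []
    for extent in dims:
        coords.append(node % extent)
        node //= extent
    return tuple(coords)


def node_dir_name(node, dims):
    """Directory name built from the coordinates, e.g. '1.1.3.2'."""
    return ".".join(str(c) for c in node_coordinates(node, dims))
```

For a machine declared with `qcdsetdimensions 2 2 4 4`, `node_dir_name(47, (2, 2, 4, 4))` gives `1.1.3.2` under this ordering.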
1. parallel executables are launched from the batch script by special command
□ qcdrun (options tbd) <executable> <args>
□ executable may be on any file system accessible to the batch script (/home, /cache, /qdata)
2. /user is not guaranteed to be accessible
3. /qdata is accessible via open() on node 0 (not guaranteed to be accessible on other nodes)
4. large file system (/cache) is not accessible via open() on execution node
5. each compute node has a transient private directory (will be deleted at end of job script execution) which is accessible via $QSCRATCH in the batch script environment and may be passed to the job via qcdrun; this is the highest performance file I/O system
6. stdout, stderr are live from all nodes; using this excessively from lots of nodes is a bad idea
7. stdin is not supported for the parallel job; instead, if text input is desired within the script file, it should be streamed out to a temporary file, and that file should be given to the parallel job as an input file
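Item 7 can be sketched as follows for a batch script written in Python (on clusters, where any interactive script language is allowed). The helper names are hypothetical. Two things are assumed rather than specified: that the executable accepts the input file path as its first argument, and that the file is written under $QDATA so node 0 can open() it, falling back to the system temporary directory when QDATA is unset. qcdrun options are still tbd, so none are shown.

```python
import os
import tempfile


def stage_text_input(text, directory=None):
    """Write `text` to a new temporary file and return its path."""
    directory = directory or os.environ.get("QDATA") or tempfile.gettempdir()
    fd, path = tempfile.mkstemp(prefix="job_input_", suffix=".txt", dir=directory, text=True)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    return path


def qcdrun_command(executable, input_path, extra_args=()):
    """Argument list for launching the parallel job with the staged input file.

    Assumption: the executable takes the input file path as its first argument.
    """
    return ["qcdrun", executable, input_path, *extra_args]
```

The returned list can be handed to `subprocess.run` from the batch script; the input file replaces what would otherwise have been piped to stdin.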