LQCD News

(This list is purged as info ages and becomes less relevant.)
04-Nov-09 6n mostly operational -- 213/426 nodes/cores of 6n are available immediately. These nodes have been reconfigured to have 2:1 infiniband oversubscription, down from 3:1. Another rack of 6n nodes (~30) will be added to the batch system shortly.
08-Oct-09 LQCD CacheManager downtime -- The CacheManager will be down from 9:00am - 12:00pm for maintenance.
08-Oct-09 LQCD webpages changes -- The LQCD webpages are undergoing some changes that will result in periodic outages from 9:00am - 12:00pm.
23-Oct-10 6n cluster down -- The 6n cluster will be down for about a week starting Tuesday as re-arrangements are made to accommodate the new ARRA funded cluster arriving the first week of November. While 6n is down, the Infiniband fabric will be modified slightly to improve performance by lowering the over-subscription from 3:1 to 2:1.
07-Jul-09 Number of 6n nodes decreasing and all of 6n down tomorrow -- Tomorrow (Wed, Jul 8th) all of 6n will be offline so that we can reconfigure a rack for testing. After this is done, another rack's worth of 6n will be unavailable for LQCD compute time.
06-Jul-09 New Allocation Year -- The new allocations and fairshare have been configured for the 09-10 USQCD project year.
29-Apr-09 Service Restoration -- The cooling has been temporarily restored, and the repair part is on the way for HPCDATA9 (/cache). All services are expected to be restored by the end of the day. An additional scheduled outage will be needed for full repair of the chilled water line -- length and time to be determined by JLab Facilities Management.
28-Apr-09 Compute clusters down -- There was a chilled water line break and cooling in the machine room is degraded. We do not know the length of the downtime.
25-Mar-09 number of 6n nodes available to LQCD decreasing -- Some of these compute nodes will be repurposed in the near future.
11-June-09 LQCD Queues stopped -- There is a scheduled outage of the Lattice clusters. The clusters are expected up again at the end of the day.
28-Apr-09 hpcdata9 down -- This fileserver was damaged due to loss of power. We are currently working with the Vendor to correct this situation. As a result, this machine will be down for the duration of today. We hope to get it up again on 04/29/2009. Several cache areas are affected: users,chiQCD,PROJS and NPLQCD.
07-April-09 PBS batch server hardware problems -- The machine was rebooted this morning. It's not reporting the same errors.
03-April-09 All /work areas and services available -- The move to a new fileserver is complete. If you encounter any problems please let us know. Thanks for your patience.
30-Mar-09 Scheduled /work outage on Thursday, April 2 -- while the majority of this work has been completed, /work areas are visible on the interactive machines, copies to these areas using cache copy are restricted. Please hold jobs that require /work access until all services are fully operational. Please use this time to ensure that all your files have been moved correctly.
17-Mar-09 srmGet/srmPut unavailable until later Friday, March 20th -- The requests will be queued, but they will not be executed until the silo system upgrade is done and the lattice 2008 data are copied to the new tape silo.
29-Jan-09 HPCDATA7 and HPCDATA9 outages -- The /cache fileservers crashed and are now rebooted after a heavy load took them down.
07-Jan-09 4g Cluster Decommissioned -- The LQCD 4g cluster has been decommissioned.
24-Nov-08 New cache_cp installed -- This version has a daemon on each of the cache/work servers that will limit the number of concurrent connections per server. If you see:
cache_cp: Could not connect to hpcdata8: Connection refused -- will try again
That is normal. If your copies fail, send in a CCPR.
16-Sep-08 Interactive machine QCD7NI02 -- A second 7n interactive machine qcd7ni02 has been added, identical to qcd7ni01.
14-Jul-08 New pbs client jobstat is available now -- A new pbs client, jobstat, is installed under /usr/local/bin on all interactive nodes. It prints out the important job information in a nice format. Type jobstat -h for the usage and options.
20-Jan-09 7n down for IB switch reboot on Thursday morning (updated time) -- There appears to be an issue with one of the line cards in the main IB switch. We will do a quick reboot of it to see if this fixes the problem.
09-Jan-09 Hpcdata7 down for maintenance -- Hpcdata7 will be down for maintenance for the first half of the day on January 9th, 2009. This is a continuation of the work started late last year. /cache/LHPC will be unavailable.
29-Dec-08 lqcd nodes down for A/C maintenance -- The nodes of 6n and 7n need to be powered down while the air conditioner units are being maintained. They should be down for a day or two. The interactive machines and file servers will still be up.
17-Dec-08 Hpcdata7 down for maintenance -- Hpcdata7 will be unavailable from 8am to sometime past midday due to maintenance work. Correspondingly, the /cache/LHPC file system will be unavailable. This maintenance is necessary as we must replace a faulty component and upgrade the OS of the machine. Please plan accordingly.
23-Nov-08 HPCDATA7 overload -- HPCDATA7 became overloaded overnight Saturday due to a problem with cache_cp, and has returned to normal service.
14-Nov-08 7n and HPCDATA7/8/9 Recovery from Power Loss -- 7n and HPCDATA8 suffered an unexpected power loss at ~9:30am; recovery included troubleshooting the breaker and required a complete outage to the PDU, including HPCDATA7 and HPCDATA9. Services are now being restored and should be complete by noon.
22-Oct-08 7n outage -- On Wednesday, Oct 22, 7n will be unavailable in the morning. There is an issue with the main IB switch, and hopefully a power cycle will fix the problem.
14-Oct-08 Known rcp/cache_cp problems -- Many users' jobs have changed recently so that they require fairly large input files copied from the file servers to the compute nodes. This is causing the file servers to be overloaded and the copies are failing much of the time. These failures are due to the number of concurrent copies, which is made even worse by the slow network connections between the compute nodes and the file servers.

We are working on a solution or workaround.

16-Sep-08 Unexpected Power Outage -- The Computer Center experienced an unexpected complete power outage this evening ~6pm during the scheduled maintenance period. Some scientific computing services will not be available until Wednesday.
15-Sep-08 Outage of 4g and 8n -- We are having a planned power outage on 9/16 @ 5pm that will affect the 8n and 4g clusters. We will bring them up ASAP on Wednesday 9/17 in the AM.
11-Aug-08 LQCD Outage Complete -- Full service is restored to all LQCD HPC systems, including clusters, interactive, and service systems.
11-Sept-08 Hpcdata7 outage -- Hpcdata7 will be unavailable starting at 8am on Thursday, 11th September 2008 due to needed maintenance work. The machine will be available again sometime later that day. The following file systems will be offline for the duration of the outage: LHPC/MISC, LHPC/NF0, LHPC/NF2, LHPC/NF3, LHPC/Spectrum and LHPC/Polar. Please do not submit jobs that require these file systems for 24 hours leading up to this outage. Also please quit all editing and archiving processes that access the /cache/LHPC directory tree.
7-Jul-08 LQCD Outage Rescheduled for Friday August 8 -- The LQCD clusters, fileservers, and interactive nodes will be unavailable during the day on Fri 8/8 for maintenance, including patches, an upgrade to PBS, and other items as needed. The change from 7/22 coincides with required Facilities Management work in the F-wing data center on Saturday August 9. The clusters will remain down during the weekend; other services will be available by the end of the day Friday.
23-Jun-08 LQCD BBCP transfer services outage scheduled for Tuesday July 8 -- The LQCD BBCP transfer services will be unavailable from 12am-4pm on Tuesday July 8th for a machine upgrade. This will only affect offsite transfers using bbcp.
16-Jun-08 14 new nodes -- 8n available UPDATE -- Actually, there are only 13 nodes, because one has hardware issues. Also, due to the PBS bug listed below that will be fixed in July, one cannot submit a job using more than 8 cpu cores with the :c8n tag. As documented in the wiki pages below, your best bet for running on the opteron class of nodes is to use the :opteron tag for your jobs, and they will land on either 7n or 8n.

We have 14 new nodes that are very similar to the 7n nodes that are in the IB queue. The two differences between 8n and 7n are that the new nodes are a little faster (2.1 GHz vs 1.9 GHz) and have slower IB cards; the amount of memory is the same. The documented changes can be seen at the New users wiki page as well as the running a batch job wiki page. New tags have been added to the 6n and 7n nodes to better differentiate between the architectures, and this is documented as well on the wiki pages.

08-May-08 Hpcdata9 is functioning but experiencing problems periodically -- We have an ongoing investigation with Sun Microsystems into the unscheduled crashes of hpcdata9. The issue is still unresolved.
15-Jun-08 6n down @ 9 AM Monday -- There seem to be issues with the IB network on that cluster. Hopefully a power cycling of the switches will help.
30-May-08 4g Cluster Down - A/C Failure -- The 4g cluster is down due to an air conditioning failure and may remain down for the weekend.
13-May-08 Hpcdata9 will be down for maintenance from 8am-12pm -- The cache file systems users, NPLQCD, disco, emc, and chiQCD will be directly affected.

  • All users with queued jobs requiring the above areas please advise us via email. These jobs will be held so that they will not be killed.

  • All users NOT requiring the above file system please submit additional jobs.
  • Users who plan to submit jobs that will require these areas, please refrain.
  • Users who can temporarily use work areas instead of the above areas, please do so.
30-Apr-08 LQCD Web Server Patches -- LQCD web servers will be unavailable beginning at 9am on Wednesday, April 30 for required patches.
24-Apr-08 PBS bug UPDATE -- We noticed today that there is a bug in the PBS server that will return:

qsub: Job exceeds queue resource limits

when you submit a job using the qsub option -l nodes=512:c7n. It does not complain if you leave off the :c7n part. We have implemented a workaround for this bug. All larger jobs (> 256 processors) will go to 7n, and the rest of the jobs will be routed to either 6n or 7n unless otherwise requested by the user. We have a patched PBS server that fixes this bug, and this will be installed at a later date because it will require a complete queue drain and downtime.
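
The routing rule of this workaround can be sketched as follows. This is only an illustration in Python of the rule as worded above, not the actual PBS or scheduler configuration; the function name route_job and its arguments are hypothetical.

```python
def route_job(processors, requested=None):
    """Return the clusters a job may land on under the workaround.

    Hypothetical helper: 'processors' is the job's processor count and
    'requested' is an optional cluster asked for by the user ("6n" or "7n").
    """
    if processors > 256:
        # Larger jobs (> 256 processors) all go to 7n.
        return ["7n"]
    if requested is not None:
        # Smaller jobs honour an explicit user request.
        return [requested]
    # Otherwise the job may be routed to either cluster.
    return ["6n", "7n"]
```

For example, route_job(512) gives ["7n"], while route_job(64) gives ["6n", "7n"].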

15-Apr-08 HPCDATA9 back up for now -- We were able to boot the machine with help from Sun, but the original problems are unresolved. We are still working with Sun to diagnose this machine.
14-Mar-08 7n Memory Upgrade to 8GB Completed -- All of the 7n nodes now have 8GB RAM per node.
15-Apr-08 HPCDATA9 down -- We continue to work with SUN on the continuing HPCDATA9 problems. The cache areas for NPLQCD, chiQCD, PROJS (disco, emc), and users are currently unavailable.
1-Apr-08 HPCDATA9 outage Thursday morning April 3 -- Patches will be applied starting at 9am to fix the problem that has caused recent crashes. Users should hold jobs that access filesystems on HPCDATA9, including cache for NPLQCD, chiQCD, PROJS (disco, emc), and users, as the file services will be unavailable. We will attempt to unmount these filesystems to prevent NFS problems, but some interactive nodes may need to be rebooted.
19-Mar-08 7n Ethernet Troubleshooting -- We are currently troubleshooting Ethernet problems in the 7n cluster.
14-Mar-08 7n Downtime on Monday 17-Mar-08 @ 1PM -- We will be patching the kernel to work around the Barcelona CPU bug, and installing a patched pbs_mom to get rid of jobs being killed with messages like "<< PBS: job killed: node 57 (qcd7n1219) requested job terminate"
12-Mar-08 4g down for 256 way testing -- At 10AM, the 4g cluster will be removed from the job queue, and we will be testing running it as one 256-CPU cluster vs two 128-CPU clusters.
15-Feb-08 7n Memory Upgrade to 8GB -- Rolling outages on 7n nodes will occur over the next few weeks, as memory in each node is increased from 4 to 8GB.
06-Feb-08 IB queues draining for some IB maintenance -- 7n has some IB issues that require some power cycling of switches.
25-Jan-08 qcd7ni01 Scheduled Outage -- On Monday Jan 28 at 11am qcd7ni01 will be shut down to increase memory from 8 to 16GB. qcd6ni01 will remain available.
22-Jan-08 Runtime Environment Change -- For all of the users who are still using mpirun_rsh, please give mpiexec a try. At some time, mpirun_rsh will no longer be available. For more info on mpiexec see this local documentation, plus man pages are on the 6n and 7n interactive nodes, and many other users are already using it.
10-Jan-08 PBS bug -- We noticed today that there is a bug in the PBS server that will return:

qsub: Job exceeds queue resource limits

when you submit a job using the qsub option -l nodes=512:c7n. It does not complain if you leave off the :c7n part. We are looking into this. This bug does not come up if you use fewer nodes.

04-Jan-08 6n Cluster Upgraded -- The 6n cluster is now upgraded to be the same FC7 64-bit version of the OS and OFED tools as 7n. In tests, the 7n builds of codes have run fine on 6n. Old 6n builds will NOT run on 6n nodes anymore.

The interactive 6n node QCD6NI01 is the same hardware as other 6n nodes; otherwise it is configured similar to QCD7NI01.

If you have scripts or configuration files (~/.cshrc or ~/.bashrc) that key off of 6n or 7n, please check them. The paths and libraries are the same on both nodes.

NOTE: The PBS upgrade is postponed until several issues are resolved.

03-Jan-08 Recovery from Holiday Shutdown -- The PBS server upgrade is in progress. The 7n cluster is operational; 4g still has some failed nodes that need repair. 6n is down through the end of the week for upgrade to 64bit. Use the ib queue for running on 7n. Web pages are a work-in-progress and will be updated soon.
18-Dec-07 PBS Upgrade on 1/2/08
  • With this upgrade, jobs submitted with -l nodes=X:ppn=Y will no longer work. You must just say -l nodes=X without the ppn.
  • It's very likely that any queued jobs left on the old server will NOT be queued on the new server.
  • There will be no ib64 queue, only ib, and it will run on both 6n and 7n nodes. If you need a certain type of node you must use -l nodes=X:c7n for a 7n node or -l nodes=X:c6n for a 6n node.
  • More info will follow.
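
As a rough illustration of the new request format, the Python sketch below builds the -l value and a qsub argument list without running anything. The helper names nodes_spec and qsub_args, the script name, and the use of -q to select the ib queue are assumptions for the example; only the nodes=X, :c6n and :c7n forms come from the announcement.

```python
def nodes_spec(count, node_type=None):
    """Build the value for qsub's -l option in the post-upgrade form.

    Hypothetical helper. 'node_type' is None for any node, or one of the
    tags named in the announcement: "c6n" or "c7n". No ppn is added,
    because ppn requests stop working with the upgrade.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    if node_type is None:
        return f"nodes={count}"
    if node_type not in ("c6n", "c7n"):
        raise ValueError(f"unknown node type: {node_type}")
    return f"nodes={count}:{node_type}"


def qsub_args(script, count, node_type=None, queue="ib"):
    """Return the argument list for a qsub call (not executed here)."""
    return ["qsub", "-q", queue, "-l", nodes_spec(count, node_type), script]
```

For instance, qsub_args("job.sh", 16, "c7n") corresponds to the command line qsub -q ib -l nodes=16:c7n job.sh.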
06-Dec-07 6n Upgrade Starting 1/2/08 -- The 6n cluster will be upgraded to 64bit during the first week of January. The 'ib' and 'ib64' queues will be configured to use both the 6n and 7n clusters. A 6n interactive node will also be provided.
26-Nov-07 Cluster Outages 12/21/07 - 1/2/08 -- The LQCD clusters will be unavailable due to JLab's holiday electrical shutdown. Power will remain available for Core Computing systems (e-mail, web servers and telecommunications). See holiday electrical shutdown for more information on the electrical outage.
14-Nov-07 7n cluster is down for testing and benchmarking today. -- Some existing jobs will run to completion. There are a few things we want to look into with the cluster quiet.
12-Nov-07 The small file limit policy changed and will be effective on Nov. 20 -- Please read this document for detail.
12-Nov-07 The first 4g panel has been decommissioned. -- The other 2 panels are still operational. Some of the nodes have hardware problems, so we are using these nodes for other purposes.
26-Oct-07 7n is progressing -- 7n has been upgraded to have 2 1.9GHz quad core Barcelona Opteron chips (8 cores/node). Each node is pretty fast. We saw something like 42 GFLOPS on one node running the HPL benchmark and slightly more than 13 TFLOPS running on 380 nodes, which will put this machine pretty high on the 11/07 top500 list. The cluster will be ready for general use, but beware that all of the software and hardware has been changed, so expect failed jobs.
25-Oct-07 qcd7ni01 is down until further notice -- This machine has crashed repeatedly over the past couple of days. We will have to verify the hardware, and then update the software to see if this fixes the problem.
23-Oct-07 7n upgrade to be completed this week -- 7n Rack 1 machines are being rebuilt and tested today (Tue 10/23), with the rest of the cluster following on Wednesday. Following testing and benchmarking, the nodes will be returned to normal operation before the end of the week.
22-Oct-07 7n upgraded to 8 CPU cores/node and being updated with new software -- 7n jobs are running, but failures are common. These machines already have the new quad cores installed and they are being upgraded to RedHat Fedora Core 7, kernel 2.6.22, new OFED IB support, etc. Currently we are seeing strange node crashes and segfaults. The crashes could be software or hardware. There does seem to be an issue with the kernel causing these segfaults. More info will follow as we know more.
17-Oct-07 JLab 2007 Cluster 7n Back (for now) -- 7n jobs are running from now until Monday Oct 22nd. On the 22nd, we will be taking the cluster down to upgrade the kernel and other software on the nodes. Please see the 7n web page for more information (this is still being written).
17-July-07 JLab 2007 Cluster 7n Operational -- The new JLab cluster 7n ("2007 Infiniband") is now operational. This 64-bit cluster, 396 dual-core dual processor AMD nodes with DDR Infiniband and OFED 1.2, is available through the 'ib64' queue. The new node 'qcd7ni01' provides a 64-bit interactive environment. Version rollouts of Chroma and QDP++ will continue to be placed in /dist/scidac. All but ~10% of the nodes are in service; the remaining nodes will be included as outstanding power and hardware issues are resolved. Please see the 7n web page for more information.
16-Oct-07 7n remains down for upgrade -- The 7n cluster has all processors upgraded to quad core AMD 1.9GHz cpus, and will be upgraded to a newer kernel over the coming week.
16-Oct-07 QCDI01 reboot 10/17 8am; /cache/users,emc,disco available by 9am -- QCDI01 will be rebooted at 8am on Wednesday 10/17 to remove the HPCDATA6 mount. The /cache/users, /cache/emc, and /cache/disco will then be made available around 9am from the new fileserver HPCDATA9, after the final filesystem sync is completed from HPCDATA6.
03-Oct-07 7n reserved for benchmarking and preparation for quad-core upgrade UPDATE -- The cluster will be down until early 04-Oct-07.
02-Oct-07 7n reserved for benchmarking and preparation for quad-core upgrade -- At 1:30 on Oct 03, we will have the cluster down while we do these things.
21-Sept-07 HPC down for power outage Sept 28 - Oct 1 -- The HPC environment will be shut down Friday afternoon, 9/28 to prepare for the lab's scheduled outage for power maintenance. Systems will be returned to service on Monday morning, Oct 1.
20-Sept-07 7n Quad CPU installation -- The first batch of new AMD Barcelona 1.9GHz quad core cpus have arrived; the first rack of 7n nodes will be upgraded on Friday, Sept 21. Throughout the next few weeks, racks will be taken offline one by one as the CPUs are upgraded.
18-Sept-07 6n down to diagnose IB switch problem -- There is a switch that is acting up on 6n. The queue is being drained of active jobs and we are going to see if a power cycle of all of the switches fixes the problem. We will know its status tomorrow, Sept 19th.
17-August-07 LQCD down for scheduled power outage Tuesday morning -- The 6n and 7n clusters and HPCDATA7/8 fileservers will be shut down Tuesday 8/21 at 5am to accommodate a required F-Wing power outage. Services should be restored by midmorning.
14-Aug-07 PBS is slow/unreliable -- We are seeing many errors like: No Permission. qstat: cannot connect to server qcdpbs (errno=15007) We are looking into it.
3-Aug-07 Downtime scheduled for 7n cluster on Tuesday, August 7th 10 AM local time -- We are going to address some hardware issues at this time. The cluster should be running jobs within a few hours.
27-July-07 CacheManager off for Saturday mass storage outage -- The CacheManager is turned off through the scheduled JLab central systems outage.
25-July-07 Downtime scheduled for Friday, July 27th 12 noon local time -- We are having electrical work done which affects the /cache and /work file servers and the 6n and 7n clusters, so the jobs will not run until the electrical work is complete. This will probably require a reboot of the interactive nodes to remount the /work and /cache disks.
17-July-07 1 million small file limit in /cache disk will take effect September 1 -- Please read this document for detail.
10-July-07 All 7n nodes will be rebuilt on Tuesday July 10th. -- We are upgrading to OFED 1.2
10-July-07 Reboot qcdi02 at 8 am to make 18 TB /work disk pool available
9-July-07 An 18 TB /work disk pool is ready to use -- This is a project managed disk space used to store small configuration data sets. The data files under /work will not be backed up. Please read this document for detail.
3-July-07 2007-08 allocations start now
29-Jun-07 24 nodes/96 CPU cores of 7n available -- Instructions for access available here.
26-Jun-07 QCD back up with the exception of some of 4g -- hopefully the rest of 4g will be up tomorrow
25-Jun-07 QCD Downtime for Tuesday, June 26th -- qcd will be down starting at 8AM. We will be installing a new home NFS server, adding some of the 7n nodes for testing, and doing some other general software and hardware maintenance.
22-May-07 New Disk Servers -- The LQCD project has now procured two new Sun x4500 file servers (thumpers). The first one, which was the evaluation machine, is now in production use, divided into two virtual disks. The second machine will be deployed next month, ahead of 7n going into full production. The older servers have been retired or moved into other roles. Once the second machine is deployed, we will have increased our user space from 15 TBytes to over 30 TBytes. A third machine is planned, to take us to 50 TBytes.
31-May-07 mesh downtime complete, but not 100% up -- The air conditioner maintenance is done, but some of the nodes don't want to boot. We are looking into this...
24-May-07 3g nodes decommissioned -- The last 3g panel has been shut down.
24-May-07 mesh downtime on Thursday May 31 in the AM -- The air conditioner for the computer room is undergoing maintenance on that day, so we will have to shut down all of the mesh nodes.
18-May-07 3g cluster to be decommissioned next week -- This cluster has been in use for 4 years, and is starting to age. The 7n cluster should be online soon.
17-May-07 Important changes in CacheManager version 4 -- The software used to manage the /cache disk pool has been upgraded to version 4. There are some changes (mainly for small files) in this new version. Please read this document for details.
30-Apr-07 Temporary Down Completed -- We are still temporarily constraining usage on the new fileserver in case we ultimately need to move back off of it.
30-Apr-07 qcdi02 rebooted at 3:30 pm -- Rebooted to maintain file system consistency.
25-Apr-07 /cache/users is available now -- We have moved /cache/users from hpcdata5 to hpcdata6. This work was completed at ~9:25am. This disk pool can now be accessed from all interactive nodes and via rcp from all computing nodes.
24-Apr-07 Issues with 4g panel 45-- Today, a number of nodes on this panel died at about the same time. One has had repeated issues, and the status of the others is unclear. It may take some time to get this panel active again.
20-Apr-07 /cache/users unavailable from 8am-12pm. Has been cancelled!! -- We are moving /cache/users to another machine. A reboot of the interactive nodes (qcdi01 and qcdi02) will take place at 8am as well.
10-Apr-07 HPC infrastructure down LONGER than planned due to extended power outage! -- File system re-syncs will be done after the power returns. The systems will be available at midday.
10-Apr-07 HPC infrastructure down from 6-8am due to power outage and upgrades. -- There will be a power outage for the datacenter. At this time, we will introduce a new cache server in the environment. These modifications make it necessary to reboot the compute and interactive nodes.
29-Mar-07 Possible 6N outage due to a water leak -- A water leak has been discovered that is affecting HVAC conditions in the datacenter. We may have to shut down the 6N cluster. Specifics will be posted here as soon as we are updated!
28-Mar-07 4g not fully functional -- One of the 4g nodes is not working, so one panel of 4g nodes is offline until that node is repaired or replaced.
27-Mar-07 LQCD is DOWN -- There is a networking problem that has caused all of LQCD to be down. We are working on fixing it...
1-Mar-07 New debug queues created -- High priority 30 minute queues are available for the mesh and ib clusters. To use them, read the instructions here.
14-Feb-07 mesh 3g01 panel decommissioned -- Due to a lack of spares and repeated hardware problems with the nodes, the 3g01 panel is no longer available. 7n is coming...
02-Mar-07 PBS server is dying!!! -- The PBS server is having hardware problems and is in the process of being replaced. Some jobs may die in the process, but this should be better than it is now...
27-Feb-07 CacheManager will be down from 11:00am - 10:30pm today -- Due to the Scientific Computing outage (including JASMine) on February 27, noon to 1 pm, the CacheManager running on all file servers will be down until 10:30pm tonight.
26-Jan-07 New problem reporting ticket system available-- A new ticket system to report problems has been implemented. The 'Report a problem' link above makes this system available.
27-Dec-06 A/C Outage for 6n cluster -- December 27th is a scheduled maintenance day for the air conditioners, and the 6n cluster will be unavailable to run jobs from 8AM to about 12 noon.
22-Dec-06 6n test queue has been removed -- The test queue is not available. The nodes that were a part of that queue are in the process of being added to the general ib queue. There are infiniband problems at this time on those nodes, and they will be resolved as soon as possible.
26-Nov-06 Email outage from 8am - 2 pm -- Email services will be unavailable during the above time due to maintenance work in the JLAB computing environment. All incoming email will be queued and so will be available again when this work is completed.
19-Oct-06 Upgrade of www.usqcd(lqcd).org-- We have upgraded to RHEL4, apache-2.2.3, tomcat-5.5.17 and php-5.1.6. This upgrade aimed to meet security requirements on our web services. If there are any issues please contact Lawrence(sorrillo@jlab.org).
03-Oct-06 Upgrade of interactive node, qcdi01 -- We are upgrading from a 2.6GHz node with 900MB RAM running RedHat 9.0 to 4GB with dual 2.6GHz processors running Fedora Core 3 or beyond. Please plan to log in to this node again on Thursday 5th, October 2006.
29-Sep-06 Downtime from 9:30 am - 12 pm due to memory upgrade for NFS server (qcdhome) -- Over the last two weeks there have been two outages due to the NFS server, qcdhome, crashing. We believe adding more memory will further stabilize our environment. We plan to upgrade from 1GB to 4GB.
12-Sep-06 6n cluster upgrade Thursday morning-- The 6n cluster will be offline to allow upgrade of the kernel and infiniband software. Please plan to re-link and test your code. Nodes qcd6n321,322,323,324 are available for interactive testing of your applications. Please let us know immediately if you find any problems. List of changes.
22-Aug-06 6n Outage and Upgrade -- Thursday, August 24, the 6n cluster will be down to add additional local resources for JLabLQCD. This will increase the total machine size from 280 to 322, allowing 5 concurrent jobs of 64 nodes (128 processes).
22-Aug-06 MAUI DEFAULT Account Change --It turns out that the name DEFAULT is special for MAUI, so we will no longer use it for background tasks and for projects with no scheduled allocation. Instead there will be multiple small allocation accounts. For LQCD users, please switch from using DEFAULT to using lqcd. All others please contact Chip Watson.
22-Aug-06 Fairshare Update -- When the machine is underutilized, and all active accounts have reached their fairshare targets, MAUI currently equally divides the remaining time without taking into account relative fairshare. E.g. if only two active accounts are running, one with FS target of 40 and the other of 10, MAUI will divide the remaining 50% equally, yielding a split of 65:35. We are looking into changing MAUI to fix this so that the end result would be 75:25.
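
The 65:35 figure in the fairshare example can be checked with a few lines of Python. This is only an arithmetic sketch of the current equal-split behaviour as described; it is not MAUI code, it does not model the proposed change, and the function and account names are made up for the example.

```python
def split_equal(targets):
    """Share the unclaimed time equally among the active accounts.

    'targets' maps a (hypothetical) account name to its fairshare
    target in percent of the machine.
    """
    leftover = 100 - sum(targets.values())   # time no target claims
    bonus = leftover / len(targets)          # equal slice per account
    return {name: target + bonus for name, target in targets.items()}


shares = split_equal({"A": 40, "B": 10})
print(shares)  # {'A': 65.0, 'B': 35.0}
```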
11-May-06 4t cluster decommissioned -- The remaining 4t nodes have been decommissioned to be used to upgrade various other servers.
27-Apr-06 Infiniband cluster operational May 1 -- Monday, May 1, the new 280 node cluster called 6n (2006 infiniband) went into production operation (queue name ib). The version of MPI (and thus QMP) has one optimization turned off until a bug is fixed, but does perform reasonably well. Until the new allocations go into effect July 1, this machine will be in "friendly user" mode, with allocations for the national resources following those of the previous year.
18-Apr-06 Infiniband cluster news -- Jefferson Lab will soon put into operation a new 280 node cluster called 6n (2006 infiniband). This machine was funded 50% by the new LQCD Computing Project, 25% by the LQCD SciDAC project (2005 funds), and 25% by JLab for the local theory group. We are now trying several versions of operating systems and infiniband libraries prior to releasing this system to general use.

Until the new allocations go into effect July 1, this machine will be in "friendly user" mode, with allocations of the national resources following proportionally the allocations of the previous JLab resources. Thirty five nodes are already accessible in the ib queue, but we are experiencing difficulties with file I/O with the newest version of IB software installed on this partition (so please don't file problem reports until we have cleared this up).

10-Mar-06 Myrinet 2m cluster de-commissioning -- The myrinet cluster will be slowly decommissioned over the coming month or two. Very soon, the first 48 nodes will be offline to move one of the ethernet switches to support the new Infiniband cluster. We will try to have 80 nodes back online as soon as possible.
17-Jan-06 Using qcdpbs / MAUI scheduler (updated)-- For gigE mesh jobs:
  • Use qcdi02 or qcdi03 to submit jobs, or the new system will hang (incompatibility between two versions of PBS). Hence, do not submit to the new server from qcdi01.
  • Two queues are available:
    • test -- supports jobs of 1 or 8 nodes (2x2x2 mesh)
    • mesh -- supports jobs of ONLY 128 nodes (4x4x8 mesh)
  • Use the -A flag on qsub to specify the account (see old News) or your job will run at the lowest priority.
  • Do NOT include the old qmp-f and qmp-l configuration options on your QMP_run.sh command. The new 2.6 systems will default to the correct files. You will NOT know which node your job will be landing on, as a single queue will control multiple clusters under MAUI.
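
The queue rules above can be captured in a small pre-submission check. The Python sketch below is illustrative only: the helper name, the script name, and the -q and -l nodes=N argument forms are assumptions, while the allowed node counts and the -A account flag come from the list above. Nothing is actually submitted.

```python
# Allowed node counts per queue, as listed in the announcement above.
ALLOWED_NODES = {"test": {1, 8}, "mesh": {128}}


def mesh_qsub_args(script, queue, nodes, account):
    """Return a qsub argument list for a gigE mesh job (not executed here).

    Hypothetical helper; 'account' is passed with -A so the job does not
    fall back to the lowest priority.
    """
    if queue not in ALLOWED_NODES:
        raise ValueError(f"unknown queue: {queue}")
    if nodes not in ALLOWED_NODES[queue]:
        raise ValueError(f"queue {queue} does not accept {nodes} nodes")
    return ["qsub", "-q", queue, "-A", account, "-l", f"nodes={nodes}", script]
```

A request such as mesh_qsub_args("run.sh", "test", 4, "lqcd") raises an error before anything reaches the scheduler, since the test queue only takes 1 or 8 nodes.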
14-Feb-06 LQCD Computing in the CERN Courier -- There's a nice, short article on the US LQCD Computing program in the CERN Courier.
17-Jan-06 qcdi02 upgraded -- The interactive machine qcdi02 has been upgraded to the 2.6 kernel. Both qcdi02 and qcdi03 can now be used for work on the gigE mesh machines, all of which are now running 2.6. Node qcdi01 remains at 2.4 to support the myrinet cluster, which will be de-commissioned next month.
14-Dec-05 4g migrated to qcdpbs / MAUI scheduler -- The 3rd panel of the 4g cluster is now being moved to the qcdpbs server. The qcd4gadm server is now retired. Jobs will (by the end of today) run on the next available panel of 128 nodes. Details:

  • Use ONLY qcdi03 to submit jobs, or the new system will hang (incompatibility between two versions of PBS). Hence, do not submit to the new server from qcdi01 or qcdi02.
  • Two queues are available:
    • test -- supports jobs of 1 or 8 nodes (2x2x2 mesh)
    • mesh -- supports jobs of ONLY 128 nodes (4x4x8 mesh)
  • Use the -A flag on qsub to specify the account (see old News) or your job will run at the lowest priority.
  • Do NOT include the old qmp-f and qmp-l configuration options on your QMP_run.sh command. The new 2.6 systems will default to the correct files. You will NOT know which node your job will be landing on, as a single queue will control multiple clusters under MAUI.
23-Nov-05 Linux 2.6, Accounts & the MAUI scheduler -- Panel01 of the 4g cluster has now been moved to the new PBS server (qcdpbs). Several minor problems have been revealed, and these are being addressed as quickly as possible. (See old News for additional information.)

  • Use ONLY qcdi03 to submit jobs, or the new system will hang (incompatibility between two versions of PBS). Hence, do not submit to the new server from qcdi01 or qcdi02.
  • Two queues are available:
    • test -- supports jobs of 1 or 8 nodes (2x2x2 mesh)
    • mesh -- supports jobs of ONLY 128 nodes (4x4x8 mesh)
  • Use the -A flag on qsub to specify the account (see old News) or your job will run at the lowest priority.
  • Do NOT include the old qmp-f and qmp-l configuration options on your QMP_run.sh command. The new 2.6 systems will default to the correct files. You will NOT know which node your job will be landing on, as a single queue will control multiple clusters under MAUI.
4-Nov-05 Accounts & the MAUI scheduler -- Starting with the new 2.6 system described below, and migrating out to the production systems, JLab is switching to the MAUI scheduler and account based operations. To use the system, you will need to have a valid account, corresponding to an approved SciDAC project. Contact your project P.I. to get added to an account if you are not already a user on that account, or to discover the name of your project (account). Valid project names are:
  • HASTE (Negele)
  • Spectrum (Richards)
  • DynChiral (Edwards)
  • chiQCD (Liu)
  • Hyperons (Orginos)
  • NPLQCD (Savage)
  • polar (Wilcox)
The scheduler will allocate time on a fair share basis to projects (not individuals), with target percentages based upon the SciDAC allocations. To use an account, specify it on the qsub command line as follows:
> qsub -A HASTE ...
7-Oct-05 Silo File Size Increased to 20GB -- The latest version of JASMine silo software supports 20GB file sizes, a tenfold increase.
29-Jul-05 HPC user quota -- Every HPC user has a 2GB disk quota. Anyone who needs more disk space than the quota limit must send a request to the HPC group.
21-Jun-05 2m end of life -- The myrinet cluster is experiencing an increasing error rate. Because of higher priority work, we will not attempt to keep all of these 3-year-old nodes up. We are currently marking the nodes that are causing problems as "Offline", allowing the cluster to degrade in capacity. When we catch up on other tasks, we will go back and see if we can recover more of these nodes.
14-Jan-05 New Interactive Nodes (beta) -- Two new dual-processor nodes are being deployed as qcdi01 and qcdi02. The old nodes of these names have been decommissioned. Not all software is ready on these nodes yet. If you encounter trouble, please report the problem to the sysadmin and then resume work on the older qcd3g-i01.
4-Jan-05 Happy New Year! -- The 3g and 4g clusters are operational. Please report any problems.
21-Dec-04 Offsite Login -- There is a new external login server available for JLab computing users. Please use login.jlab.org when remotely accessing on-site computers. From this machine you can then reach the LQCD interactive nodes.
Cache disk web page -- A new web page is now available to view the state of the cache disks and of requests to move files to/from the silo. It is available in the menu (left) as Cache Disks.
8-Dec-04 Cluster Monitoring Developments for 4g -- Work is underway to commission the IPMI based node monitoring on the 4g cluster. This is being debugged on 128 nodes, and once completed all 384 nodes will be made available to users.
4g Cluster Open for Beta Testing -- Two of the 128 node partitions on the 4g cluster are open for beta testing. Submit 128 node jobs to queue Panel01@qcd4gadm or Panel45@qcd4gadm. Partition config and list files for qmp-mvia are located in subdirectories of /etc/qmp (additional details in the "JLab Specific" environment menu item at left).
29-Nov-04 4g Cluster Soon -- The 4g (2004 gigE) cluster is now being debugged, and will be made available for early testers the first week of December.
24-Jan-04 Batch prologue scripts -- As part of our efforts to improve cluster reliability, we will be adding a PBS based prologue script to the system during the coming week. This script will perform the following tasks before your job is started:
1) Terminate any non-privileged executables that are running on each node.
2) Remove all data from the /scratch directory that is not owned by root.
Additional steps will be added in the future to ensure that each of the compute nodes is in a 'ready' state before your job is deployed.
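
As an illustration of the /scratch cleanup step, here is a minimal dry-run sketch. It is not the actual prologue script: the function name entries_to_purge and its keep_uid parameter are hypothetical, it only inspects the top level of the directory, and it lists candidates without deleting anything. Terminating non-privileged processes is not shown.

```python
import os


def entries_to_purge(scratch_dir, keep_uid=0):
    """List top-level entries in scratch_dir not owned by keep_uid (root by default)."""
    doomed = []
    for entry in sorted(os.listdir(scratch_dir)):
        path = os.path.join(scratch_dir, entry)
        # lstat judges a symlink by its own owner rather than its target's.
        if os.lstat(path).st_uid != keep_uid:
            doomed.append(path)
    return doomed


if __name__ == "__main__":
    # Dry run only: report what a cleanup would remove, delete nothing.
    if os.path.isdir("/scratch"):
        for path in entries_to_purge("/scratch"):
            print("would remove", path)
```

Keeping the selection separate from any deletion makes it possible to review the list on a test node before a real cleanup is wired into the batch system.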