LQCD News |
(This list is purged as info ages and becomes less relevant.) |
|
| 04-Nov-09 | 6n mostly operational -- 213/426 nodes/cores of 6n are available immediately. These nodes have been reconfigured to have 2:1 Infiniband oversubscription, down from 3:1. Another rack of 6n nodes (~30) will be added to the batch system shortly.
| 08-Oct-09 | LQCD CacheManager downtime -- The CacheManager will be down from 9:00am - 12:00pm for maintenance.
| 08-Oct-09 | LQCD webpages changes -- The LQCD webpages are undergoing some changes that will result in periodic outages from 9:00am - 12:00pm.
| 23-Oct-10 | 6n cluster down -- The 6n cluster will be down for about a week starting Tuesday as re-arrangements are made to accommodate the new ARRA funded cluster arriving the first week of November. While 6n is down, the Infiniband fabric will be modified slightly to improve performance by lowering the over-subscription from 3:1 to 2:1.
| 07-Jul-09 | Number of 6n nodes decreasing and all of 6n down tomorrow -- Tomorrow (Wed, Jul 8th) all of 6n will be offline so that we can reconfigure and test a rack. After this is done, another rack's worth of 6n will be unavailable for LQCD compute time.
| 06-Jul-09 | New Allocation Year -- The new allocations and fairshare have been configured for the 09-10 USQCD project year.
| 29-Apr-09 | Service Restoration -- The cooling has been temporarily restored, and the repair part is on the way for HPCDATA9 (/cache). All services are expected to be restored by the end of the day. An additional scheduled outage will be needed for full repair of the chilled water line -- length and time to be determined by JLab Facilities Management.
| 28-Apr-09 | Compute clusters down -- There was a chilled water line break and cooling in the machine room is degraded. We do not know the length of the downtime.
| 25-Mar-09 | Number of 6n nodes available to LQCD decreasing -- Some of these compute nodes will be repurposed in the near future.
| 11-June-09 | LQCD Queues stopped -- There is a scheduled outage of the Lattice clusters. The clusters are expected up again at the end of the day.
| 28-Apr-09 | hpcdata9 down -- This fileserver was damaged due to loss of power. We are currently working with the Vendor to correct this situation. As a result, this machine will be down for the duration of today. We hope to get it up again on 04/29/2009.
Several cache areas are affected: users, chiQCD, PROJS, and NPLQCD.
| 07-April-09
| PBS batch server hardware problems -- The machine was rebooted this morning. It is no longer reporting the same errors.
| 03-April-09
| All /work areas and services available -- The move to a new fileserver is complete. If you encounter any problems please let us know. Thanks for your patience.
| 30-Mar-09
| Scheduled /work outage on Thursday, April 2 -- While the majority of this work has been completed
and the /work areas are visible on the interactive machines, copies to these areas using cache copy are still restricted.
Please hold jobs that require /work access until all services are fully operational. Please use this time to ensure that all your files have been moved correctly.
| 17-Mar-09
| srmGet/srmPut unavailable until later Friday, March 20th -- The requests
will be queued, but they will not be executed until the silo system upgrade is done
and the lattice 2008 data are copied to the new tape silo.
| 29-Jan-09
| HPCDATA7 and HPCDATA9 outages --
The /cache fileservers crashed and are now rebooted after a heavy load took them down.
| 07-Jan-09
| 4g Cluster Decommissioned --
The LQCD 4g cluster has been decommissioned.
| 24-Nov-08
| New cache_cp installed --
This version has a daemon on each of the cache/work servers that will limit
the number of concurrent connections per server. If you see:
cache_cp: Could not connect to hpcdata8: Connection refused -- will try again
That is normal. If your copies fail, send in a CCPR.
| 16-Sep-08
| Interactive machine QCD7NI02 --
A second 7n interactive machine qcd7ni02 has been added, identical to qcd7ni01.
| 14-Jul-08
| New pbs client jobstat is available now -- A new pbs client, jobstat, is installed under /usr/local/bin on all interactive nodes. It prints the important job information in a readable format. Type jobstat -h for usage and options.
| 20-Jan-09
| 7n down for IB switch reboot on Thursday morning (updated time) --
There appears to be an issue with one of the line cards in the
main IB switch. We will do a quick reboot of it to see if this
fixes the problem.
| 09-Jan-09
| Hpcdata7 down for maintenance --
Hpcdata7 will be down for maintenance for the first half of the day on January 9th, 2009. This
is a continuation of the work started late last year. /cache/LHPC will be unavailable.
| 29-Dec-08
| lqcd nodes down for A/C maintenance --
The nodes of 6n and 7n need to be powered down while the air conditioner units
are being maintained. They should be down for a day or two. The interactive
machines and file servers will still be up.
| 17-Dec-08
| Hpcdata7 down for maintenance --
Hpcdata7 will be unavailable from 8am to sometime past midday due to maintenance work. Correspondingly, the /cache/LHPC file system will be unavailable. This maintenance is necessary as we must replace a faulty component and upgrade the OS of the machine. Please plan accordingly.
| 23-Nov-08
| HPCDATA7 overload --
HPCDATA7 became overloaded overnight Saturday due to a problem with cache_cp,
and has returned to normal service.
| 14-Nov-08
| 7n and HPCDATA7/8/9 Recovery from Power Loss --
7n and HPCDATA8 suffered an unexpected power loss at ~9:30am; recovery
included troubleshooting the breaker and required a complete outage to the PDU, including
HPCDATA7 and HPCDATA9. Services are now being restored and should be complete by noon.
| 22-Oct-08
| 7n outage --
On Wednesday, Oct 22, 7n will be unavailable in the morning. There is an
issue with the main IB switch, and hopefully a power cycle will fix the
problem.
| 14-Oct-08
| Known rcp/cache_cp problems --
Many users' jobs have changed recently so that they require fairly large
input files copied from the file servers to the compute nodes. This is
causing the file servers to be overloaded and the copies are failing much
of the time. These failures are due to the number of concurrent copies
happening at the same time, which is made even worse since we have slow
network connections between the compute nodes and the file servers.
We are working on a solution or workaround.
| 16-Sep-08
| Unexpected Power Outage --
The Computer Center experienced an unexpected complete power outage this evening
~6pm during the scheduled maintenance period.
Some scientific computing services will not be available until Wednesday.
| 15-Sep-08
| Outage of 4g and 8n -- We are having a planned power outage on 9/16 @ 5pm that will affect the 8n and 4g
clusters. We will bring them up ASAP on Wednesday 9/17 in the AM.
| 11-Aug-08
| LQCD Outage Complete --
Full service is restored to all LQCD HPC systems, including clusters, interactive, and service systems.
| 11-Sept-08
| Hpcdata7 outage --
Hpcdata7 will be unavailable starting at 8am on Thursday, 11th September 2008 due to needed maintenance work. The machine will be available again sometime later that day. The following file systems will be offline for the duration: LHPC/MISC, LHPC/NF0, LHPC/NF2, LHPC/NF3, LHPC/Spectrum and LHPC/Polar.
Please do not submit jobs that require these file systems for 24 hours leading up to this outage. Also please quit all editing and archiving processes that access the /cache/LHPC directory tree.
| 7-Jul-08
| LQCD Outage Rescheduled for Friday August 8 -- The LQCD clusters, fileservers, and interactive nodes will be
unavailable during the day on Fri 8/8 for maintenance, including patches, an upgrade to PBS, and other items
as needed. The change from 7/22 coincides with required Facilities Management work in the F-wing data center
on Saturday August 9. The clusters will remain down during the weekend; other services will be available by
the end of the day Friday.
| 23-Jun-08
| LQCD BBCP transfer services outage scheduled for Tuesday July 8 --
The LQCD BBCP transfer services will be unavailable from 12am-4pm on Tuesday July 8th for
a machine upgrade. This will only affect offsite transfers using bbcp.
| 16-Jun-08
| 14 new nodes -- 8n available UPDATE -- Actually, there are only 13 nodes, because
one has hardware issues. Also, due to the PBS bug listed below that will be fixed in July,
one cannot submit a job using more than 8 cpu cores with the :c8n tag. As documented in the
wiki pages below, your best bet for running on the opteron class of nodes is to use the :opteron
tag for your jobs, and they will land on either 7n or 8n.
| We have 14 new nodes that are very similar to the 7n nodes that are in the IB queue. There are two differences between 8n and 7n: the new nodes are a little faster than 7n (2.1 GHz vs 1.9 GHz), and they have slower IB cards than 7n. They have the same amount of memory. The documented changes can be seen at the New users wiki page as well as the running a batch job wiki page. New tags have been added to the 6n and 7n nodes to better differentiate between the architectures, and this is documented as well on the wiki pages.
| 08-May-08
| Hpcdata9 is functioning but experiencing problems periodically -- We have an ongoing investigation with Sun Microsystems into the unscheduled crashes of hpcdata9. The issue is still unresolved.
| 15-Jun-08
| 6n down @ 9 AM Monday -- There seem to be issues with the IB network on that cluster. Hopefully a power cycling of the switches will help.
| 30-May-08
| 4g Cluster Down - A/C Failure -- The 4g cluster is down due to an air conditioning failure and may remain down for the weekend.
| 13-May-08
| Hpcdata9 will be down for maintenance from 8am-12pm -- The cache file systems users, NPLQCD, disco, emc, and chiQCD will be directly affected.
| 30-Apr-08
| LQCD Web Server Patches -- LQCD web servers will be unavailable beginning at 9am on
Wednesday, April 30 for required patches.
| 24-Apr-08
| PBS bug UPDATE -- We noticed today that there is a bug in the PBS server that will return:
qsub: Job exceeds queue resource limits
when you submit a job using the qsub option -l nodes=512:c7n. It does not complain if you leave off the :c7n part. We have implemented a workaround for this bug. All larger jobs (> 256 processors) will go to 7n, and the rest of the jobs will be routed to either 6n or 7n unless otherwise requested by the user. We have a patched PBS server that fixes this bug, and this will be installed at a later date because it will require a complete queue drain and downtime.
| 15-Apr-08
| HPCDATA9 back up for now -- We were able to boot the machine with help from Sun, but
the original problems are unresolved. We are still working with Sun to diagnose this machine.
| 14-Mar-08
| 7n Memory Upgrade to 8GB Completed -- All of the 7n nodes now have 8GB RAM per node.
| 15-Apr-08
| HPCDATA9 down -- We continue to work with Sun on the ongoing HPCDATA9 problems.
The cache areas for NPLQCD, chiQCD, PROJS (disco, emc), and users are currently unavailable.
| 1-Apr-08
| HPCDATA9 outage Thursday morning April 3 -- Patches will be applied starting at 9am
to fix the problem that has caused recent crashes. Users should hold jobs that access
filesystems on HPCDATA9, including cache for NPLQCD, chiQCD, PROJS (disco, emc), and users,
as the file services will be unavailable. We will attempt to unmount these filesystems to prevent
NFS problems, but some interactive nodes may need to be rebooted.
| 19-Mar-08
| 7n Ethernet Troubleshooting -- We are currently troubleshooting Ethernet problems in the 7n cluster.
| 14-Mar-08
| 7n Downtime on Monday 17-Mar-08 @ 1PM -- We will be patching the kernel to work around
the Barcelona CPU bug, and installing a patched pbs_mom to get rid of jobs being killed with messages
like "<< PBS: job killed: node 57 (qcd7n1219) requested job terminate"
| 12-Mar-08
| 4g down for 256 way testing -- At 10AM, the 4g cluster will be removed
from the job queue, and we will test running it as one 256-CPU cluster instead of two 128-CPU
clusters.
| 15-Feb-08
| 7n Memory Upgrade to 8GB -- Rolling outages on 7n nodes will occur over the next few weeks, as memory in each node is increased from 4 to 8GB.
| 06-Feb-08
| IB queues draining for some IB maintenance -- 7n has some IB issues that require some power cycling of switches.
| 25-Jan-08
| qcd7ni01 Scheduled Outage -- On Monday Jan 28 at 11am qcd7ni01 will be shut down to increase memory from 8 to 16GB. qcd6ni01 will remain available.
| 22-Jan-08
| Runtime Environment Change -- For all of the users who are still using mpirun_rsh, please give mpiexec a try. At some point, mpirun_rsh will no longer be available. For more info on mpiexec see this local documentation; man pages are also on the 6n and 7n interactive nodes, and many other users are already using it.
| 10-Jan-08
| PBS bug -- We noticed today that there is a bug in the PBS server that will return:
qsub: Job exceeds queue resource limits
when you submit a job using the qsub option -l nodes=512:c7n. It does not complain if you leave off the :c7n part. We are looking into this. This bug does not come up if you use fewer nodes.
| 04-Jan-08
| 6n Cluster Upgraded
-- The 6n cluster is now upgraded to be the same FC7 64-bit version of the OS and OFED tools as 7n.
In tests, the 7n builds of codes have run fine on 6n. Old 6n builds will NOT run on 6n nodes anymore.
| The interactive 6n node QCD6NI01 is the same hardware as other 6n nodes; otherwise it is configured similarly to QCD7NI01. The paths and libraries are the same on both nodes. If you have scripts or configuration files (~/.cshrc or ~/.bashrc) that key off of 6n or 7n, please check them. NOTE: The PBS upgrade is postponed until several issues are resolved.
| 03-Jan-08
| Recovery from Holiday Shutdown
-- The PBS server upgrade is in progress. The 7n cluster is operational; 4g still has some failed nodes that need repair.
6n is down through the end of the week for upgrade to 64bit. Use the ib queue for running
on 7n. Web pages are a work-in-progress and will be updated soon.
| 18-Dec-07
| PBS Upgrade on 1/2/08
| 06-Dec-07
| 6n Upgrade Starting 1/2/08
-- The 6n cluster will be upgraded to 64bit during the first week of January. The 'ib' and 'ib64' queues
will be configured to use both the 6n and 7n clusters. A 6n interactive node will also be provided.
| 26-Nov-07
| Cluster Outages 12/21/07 - 1/2/08
-- The LQCD clusters will be unavailable during JLab's holiday electrical shutdown.
Power will remain available for Core Computing systems (e-mail, web servers and telecommunications).
See holiday electrical shutdown
for more information on the electrical outage.
| 14-Nov-07
| 7n cluster is down for testing and benchmarking today.
-- Some existing jobs will run to completion. There are a few things
we want to look into with the cluster quiet.
| 12-Nov-07
| The small file limit policy has changed; the new policy takes effect
on Nov. 20
-- Please read
this document for details.
| 12-Nov-07
| The first 4g panel has been decommissioned.
-- The other 2 panels are still operational. Some of the nodes have hardware problems, so
we are using these nodes for other purposes.
| 26-Oct-07
| 7n is progressing -- 7n has been upgraded to have 2 1.9GHz quad core Barcelona Opteron
chips (8 cores/node). Each node is pretty fast. We saw something like 42 GFLOPS on one node running
the HPL benchmark and slightly more than 13 TFLOPS running on 380 nodes, which will put this machine
pretty high on the 11/07 top500 list. The cluster will be ready for general use, but beware that all
of the software and hardware has been changed, so expect failed jobs.
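
As a rough illustration of how the two HPL figures quoted above relate, the sketch below compares the 380-node result with perfect linear scaling of the single-node result. The function name is hypothetical, and the inputs are the approximate values from this entry ("something like 42 GFLOPS" and "slightly more than 13 TFLOPS"), so the result is only an estimate.

```python
def scaling_efficiency(per_node_gflops, nodes, measured_tflops):
    """Return (ideal aggregate TFLOPS, measured / ideal).

    "Ideal" here means the single-node HPL result multiplied by the
    node count, i.e. perfect linear scaling with no network overhead.
    """
    ideal_tflops = per_node_gflops * nodes / 1000.0
    return ideal_tflops, measured_tflops / ideal_tflops


# Approximate figures quoted in the news entry (not exact measurements).
ideal, efficiency = scaling_efficiency(42.0, 380, 13.0)
print(f"ideal: {ideal:.2f} TFLOPS, efficiency: {efficiency:.0%}")
```

With these rounded inputs the multi-node run comes out at roughly four fifths of perfect linear scaling.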
| 25-Oct-07
| qcd7ni01 is down until further notice -- This machine has crashed
repeatedly over the past couple of days. We will have to verify the hardware,
and then update the software to see if this fixes the problem.
| 23-Oct-07
| 7n upgrade to be completed this week --
7n Rack 1 machines are being rebuilt and tested today (Tue 10/23), with the rest of the
cluster following on Wednesday. Following testing and benchmarking, the nodes will be
returned to normal operation before the end of the week.
| 22-Oct-07
| 7n upgraded to 8 CPU cores/node and being updated with new software -- 7n jobs are running,
but failures are common. These machines already have the new quad cores installed and they are being
upgraded to RedHat Fedora Core 7, kernel 2.6.22, new OFED IB support, etc. Currently we are seeing
strange node crashes and segfaults. The crashes could be software or hardware. There does seem to
be an issue with the kernel causing these segfaults. More info will follow as we know more.
| 17-Oct-07
| JLab 2007 Cluster 7n Back (for now) -- 7n jobs are running
from now until Monday Oct 22nd. On the 22nd, we will be taking the
cluster down to upgrade the kernel and other software on the nodes.
Please see the 7n web page
for more information (this is still being written).
| 17-July-07
| JLab 2007 Cluster 7n Operational
-- The new JLab cluster 7n ("2007 Infiniband") is now operational. This
64-bit cluster, 396 dual-core dual processor AMD nodes with DDR
Infiniband and OFED 1.2, is available through the 'ib64' queue. The
new node 'qcd7ni01' provides a 64-bit interactive environment. Version
rollouts of Chroma and QDP++ will continue to be placed in /dist/scidac.
All but ~10% of the nodes are in service; the remaining nodes will be
included as outstanding power and hardware issues are resolved.
Please see the 7n web page for more information.
| 16-Oct-07
| 7n remains down for upgrade
-- The 7n cluster has all processors upgraded to quad core AMD 1.9GHz cpus,
and will be upgraded to a newer kernel over the coming week.
| 16-Oct-07
| QCDI01 reboot 10/17 8am; /cache/users,emc,disco available by 9am
-- QCDI01 will be rebooted at 8am on Wednesday 10/17 to remove the HPCDATA6
mount. The /cache/users, /cache/emc, and /cache/disco will then be made
available around 9am from the new fileserver HPCDATA9, after the final
filesystem sync is completed from HPCDATA6.
| 03-Oct-07
| 7n reserved for benchmarking and preparation for quad-core upgrade UPDATE
-- The cluster will be down until early 04-Oct-07.
| 02-Oct-07
| 7n reserved for benchmarking and preparation for quad-core upgrade
At 1:30 on Oct 03, we will have the cluster down while we do these things.
| 21-Sept-07
| HPC down for power outage Sept 28 - Oct 1
The HPC environment will be shut down Friday afternoon, 9/28 to prepare for the
lab's scheduled outage for power maintenance. Systems will be returned to service
on Monday morning, Oct 1.
| 20-Sept-07
| 7n Quad CPU installation
The first batch of new AMD Barcelona 1.9GHz quad core cpus have arrived; the first rack
of 7n nodes will be upgraded on Friday, Sept 21. Throughout the next
few weeks, racks will be taken offline one by one as the CPUs are upgraded.
| 18-Sept-07
| 6n down to diagnose IB switch problem
There is a switch that is acting up on 6n. The queue is being drained
of active jobs and we are going to see if a power cycle of all of the
switches fixes the problem. We will know its status tomorrow -- Sept 19th.
| 17-August-07
| LQCD down for scheduled power outage Tuesday morning
The 6n and 7n clusters and HPCDATA7/8 fileservers will be shut down Tuesday 8/21 at 5am to
accommodate a required F-Wing power outage. Services should be restored by midmorning.
| 14-Aug-07
| PBS is slow/unreliable -- We are seeing many errors like:
No Permission.
qstat: cannot connect to server qcdpbs (errno=15007)
We are looking into it.
| 3-Aug-07
| Downtime scheduled for 7n cluster on Tuesday, August 7th 10 AM local time --
We are going to address some hardware issues at this time. The cluster should
be running jobs within a few hours.
| 27-July-07
| CacheManager off for Saturday mass storage outage
The cachemanager is turned off through the scheduled JLab central
systems outage.
| 25-July-07
| Downtime scheduled for Friday, July 27th 12 noon local time
-- We are having electrical work done which affects the /cache and /work
file servers and the 6n and 7n clusters, so the jobs will not run until
the electrical work is complete. This will probably require a reboot of
the interactive nodes to remount the /work and /cache disks.
| 17-July-07
| 1 million small file limit in /cache disk will take effect starting September 1
-- Please read
this document for details.
| 10-July-07
| All 7n nodes will be rebuilt on Tuesday July 10th.
--We are upgrading to OFED 1.2
| 10-July-07
| Reboot qcdi02 at 8 am to make 18 TB /work disk pool available
| 9-July-07
| An 18 TB /work disk pool is ready to use
-- This is a project managed disk space used to store small configuration
data sets. The data files under /work will not be backed up. Please read
this document for details.
| 3-July-07
| 2007-08 allocations start now
| 29-Jun-07
| 24 nodes/96 CPU cores of 7n available
-- Instructions for access available here.
| 26-Jun-07
| QCD back up with the exception of some of 4g
-- hopefully the rest of 4g will be up tomorrow
| 25-Jun-07
| QCD Downtime for Tuesday, June 26th
-- qcd will be down starting at 8AM. We will be installing a new home NFS
server, adding some of the 7n nodes for testing, and doing some other
general software and hardware maintenance.
| 22-May-07
| New Disk Servers
The LQCD project has now procured two new Sun x4500 file servers
(thumpers). The first one, which was the evaluation machine,
is now in production use, divided into two virtual disks.
The second machine will be deployed next month, ahead
of 7n going into full production. The older servers have
been retired or moved into other roles. Once the second
machine is deployed, we will have increased our user space
from 15 TBytes to over 30 TBytes. A third machine is planned,
to take us to 50 TBytes.
| 31-May-07
| mesh downtime complete, but not 100% up
-- The air conditioner maintenance is done, but some of the nodes
don't want to boot. We are looking into this...
| 24-May-07
| 3g nodes decommissioned
-- The last 3g panel has been shut down.
| 24-May-07
| mesh downtime on Thursday May 31 in the AM
-- The air conditioner for the computer room is undergoing maintenance
on that day, so we will have to shutdown all of the mesh nodes.
| 18-May-07
| 3g cluster to be decommissioned next week
-- This cluster has been in use for 4 years, and is starting to age. The 7n cluster
should be online soon.
| 17-May-07
| Important changes in CacheManager version 4
-- The software used to manage the /cache disk pool has been upgraded to version 4.
There are some changes (mainly for small files) in this new version. Please read
this document for details.
| 30-Apr-07
| Temporary Downtime Completed --
For now we are still constraining usage on the new
fileserver in case we ultimately need to move back
off of it.
| 30-Apr-07
| qcdi02 rebooted at 3:30 pm --
Rebooted to maintain file system consistency.
| 25-Apr-07
| /cache/users is available now --
We have moved /cache/users from hpcdata5 to hpcdata6. This work was completed at ~9:25am.
Now this disk pool can be accessed from all interactive nodes and rcp from all
computing nodes.
| 24-Apr-07
| Issues with 4g panel 45--
Today, a number of nodes on this panel died at about the same time. One has
had repeated issues, and the status of the others is unclear. It may take
some time to get this panel active again.
| 20-Apr-07
| /cache/users unavailable from 8am-12pm. Has been cancelled!!
-- We are moving /cache/users to another machine. A reboot of the interactive nodes (qcdi01 and qcdi02) will take place at 8am as well.
| 10-Apr-07
| HPC infrastructure down LONGER than planned due to extended power outage!
-- File system re-syncs will be done after the power returns. The systems will be available at midday.
| 10-Apr-07
| HPC infrastructure down from 6-8am due to power outage and upgrades.
-- There will be a power outage for the datacenter. At this time, we will introduce a new cache server in the environment. These modifications make it necessary to reboot the compute and interactive nodes.
| 29-Mar-07
| Possible 6N outage due to a water leak
-- A water leak has been discovered that is affecting HVAC conditions in the datacenter. We may have to shutdown the 6N cluster. Specifics will be posted here as soon as we are updated!
| 28-Mar-07
| 4g not fully functional
-- One of the 4g nodes is not working, so one panel of 4g nodes is offline until
that node is repaired or replaced.
| 27-Mar-07
| LQCD is DOWN
-- There is a networking problem that has caused all of lqcd to be down.
We are working on fixing it...
| 1-Mar-07
| New debug queues created
-- High priority 30 minute queues are available for the mesh and ib clusters.
To use them, read the instructions here.
| 14-Feb-07
| mesh 3g01 panel decommissioned --
Due to a lack of spares and repeated hardware problems with the nodes, the 3g01
panel is no longer available. 7n is coming...
| 02-Mar-07
| PBS server is dying!!!
-- The PBS server is having hardware problems and is in the process of being replaced.
Some jobs may die in the process, but this should be better than it is now...
| 27-Feb-07
| CacheManager will be down from 11:00am - 10:30pm today
-- Due to the Scientific Computing outage (including JASMine) on February 27, noon to 1 pm,
the CacheManager running on all file servers will be down until 10:30pm tonight.
| 26-Jan-07
| New problem reporting ticket system available--
A new ticket system to report problems has been implemented. The 'Report a
problem' link above makes this system available.
| 27-Dec-06
| A/C Outage for 6n cluster
December 27th is a scheduled maintenance day for the air conditioners, and the
6n cluster will be unavailable to run jobs from 8AM to about 12 noon.
| 22-Dec-06
| 6n test queue has been removed
The test queue is not available. The nodes that were a part of that queue are
in the process of being added to the general ib queue. There are infiniband
problems at this time on those nodes, and they will be resolved as soon as
possible.
| 26-Nov-06
| Email outage from 8am - 2 pm --
Email services will be unavailable during the above time due to maintenance work in the JLAB computing environment. All incoming email will be queued and so will be available again when this work is completed.
| 19-Oct-06
| Upgrade of www.usqcd(lqcd).org--
We have upgraded to RHEL4, apache-2.2.3, tomcat-5.5.17 and php-5.1.6. This upgrade aimed to meet security requirements on our web services. If there are any issues please contact Lawrence(sorrillo@jlab.org).
| 03-Oct-06
| Upgrade of interactive node, qcdi01--
We are upgrading from a 2.6GHz node with 900MB RAM running RedHat 9.0 to a node with dual 2.6GHz processors and 4GB RAM running Fedora Core 3 or later. Please plan to log in to this node again on Thursday, October 5th, 2006.
| 29-Sep-06
| Downtime from 9:30 am - 12 pm due to memory upgrade for NFS server (qcdhome)--
Over the last two weeks there have been two outages due to the NFS server,
qcdhome, crashing. We believe adding more memory will further stabilize our
environment. We plan to upgrade from 1GB to 4GB.
| 12-Sep-06
| 6n cluster upgrade Thursday morning--
The 6n cluster will be offline to allow upgrade of the kernel
and infiniband software.
Please plan to re-link and test your code.
Nodes qcd6n321,322,323,324 are available for interactive
testing of your applications. Please let us know
immediately if you find any problems.
List of changes.
| 22-Aug-06
| 6n Outage and Upgrade --
Thursday, August 24, the 6n cluster will be down
to add additional local resources for JLabLQCD.
This will increase the total machine size from
280 to 322, allowing 5 concurrent jobs of 64
nodes (128 processes).
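
The job-count arithmetic in this entry can be checked with a short sketch. The function name is hypothetical; the node counts come from the entry, and the figure of 2 processes per node is inferred from "64 nodes (128 processes)".

```python
def concurrent_jobs(total_nodes, nodes_per_job):
    """Number of equally sized jobs that fit on the machine at once."""
    return total_nodes // nodes_per_job


NODES_PER_JOB = 64
PROCESSES_PER_NODE = 2  # inferred: 64 nodes run 128 processes

before = concurrent_jobs(280, NODES_PER_JOB)  # machine size before the upgrade
after = concurrent_jobs(322, NODES_PER_JOB)   # machine size after the upgrade
print(before, after, NODES_PER_JOB * PROCESSES_PER_NODE)
```

At 280 nodes only four 64-node jobs fit; the fifth job needs 320 nodes, which the enlarged machine provides with two nodes to spare.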
| 22-Aug-06
| MAUI DEFAULT Account Change
--It turns out that the name DEFAULT is special for
MAUI, so we will no longer use it for background tasks
and for projects with no scheduled allocation. Instead
there will be multiple small allocation accounts. For
LQCD users, please switch from using DEFAULT to using lqcd.
All others please contact Chip Watson.
| 22-Aug-06
| Fairshare Update --
When the machine is underutilized, and all active accounts
have reached their fairshare targets, MAUI currently
equally divides the remaining time without taking
into account relative fairshare. E.g. if only two
active accounts are running, one with FS target of 40
and the other of 10, MAUI will divide the remaining
50% equally, yielding a split of 65:35. We are
looking into changing MAUI to fix this so that the
end result would be 75:25.
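
The current behaviour described in this entry (targets met first, remainder divided equally) can be reproduced with a minimal sketch. This is not MAUI code; the function name and account labels are hypothetical, and only the equal-division rule stated above is modelled.

```python
def equal_split_shares(targets):
    """Model the behaviour described above: each active account first
    receives its fairshare target (in percent of the machine), and the
    unallocated remainder is then divided equally among the accounts.
    """
    remainder = 100 - sum(targets.values())
    bonus = remainder / len(targets)
    return {name: target + bonus for name, target in targets.items()}


# Two active accounts with fairshare targets of 40 and 10.
shares = equal_split_shares({"A": 40, "B": 10})
print(shares)  # {'A': 65.0, 'B': 35.0}
```

The remaining 50% is split into two portions of 25, which is how targets of 40 and 10 end up as the 65:35 split quoted in the entry.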
| 11-May-06
| 4t cluster decommissioned --
The remaining 4t nodes have been decommissioned
to be used to upgrade various other servers.
| 27-Apr-06
| Infiniband cluster operational May 1 --
Monday, May 1, the new 280 node cluster called 6n (2006 infiniband)
went into production operation (queue name ib).
The version of MPI (and thus QMP) has one optimization
turned off until a bug is fixed, but does perform
reasonably well.
Until the new allocations go into effect July 1,
this machine will be in "friendly user" mode, with
allocations for the national resources
following those of the previous year.
| 18-Apr-06
| Infiniband cluster news --
Jefferson Lab will soon put into operation a
new 280 node cluster called 6n (2006 infiniband).
This machine was funded
50% by the new LQCD Computing Project,
25% by the LQCD SciDAC project (2005 funds), and
25% by JLab for the local theory group.
We are now trying several versions of operating systems
and infiniband libraries prior to releasing this system
to general use.
| Until the new allocations go into effect July 1, this machine will be in "friendly user" mode, with allocations of the national resources following proportionally the allocations of the previous JLab resources. Thirty-five nodes are already accessible in the ib queue, but we are experiencing difficulties with file I/O with the newest version of IB software installed on this partition (so please don't file problem reports until we have cleared this up).
| 10-Mar-06
| Myrinet 2m cluster de-commissioning --
The myrinet cluster will be slowly decommissioned over
the coming month or two. Very soon, the first 48 nodes
will be offline to move one of the ethernet switches
to support the new Infiniband cluster. We will try
to have 80 nodes back online as soon as possible.
| 17-Jan-06
| Using qcdpbs / MAUI scheduler (updated)--
For gigE mesh jobs:
|
|