The news items below address various technical issues regarding our servers, code development, and research.
Nov 30, 2007
DK, our primary site developer, is recovering from surgery and is out of contact. When the troubles arose, DK's status wasn't known to the IT group, which caused confusion. Rhiju stepped in and provided the fix. We apologize for the outage...
Nov 29, 2007
The feeder is failing repeatedly, so no new work is being sent out. The feeder application needs to be rebuilt to handle the large number of applications that R@H has accumulated. We know of the problem and will get this done as soon as possible.
Sep 18, 2007
The RAID is rebuilt and now we're chasing down the next bottleneck....
Sep 17, 2007
We're experiencing another hardware problem that is affecting the overall project, causing the validator and assimilators to fall behind. The temporary fileserver to which we've switched has dropped one of its RAID disks. We've found a replacement and the RAID is rebuilding now. Once it has finished, the project should start catching up with itself...
Sep 08, 2007
Rosetta@home has experienced a horrendous hardware/firmware failure. We essentially lost the SAN partition upon which the project was running! The newest edition of our SAN hardware was shipped with a firmware revision that contained an insidious bug - one which caused the new SAN disks to vanish after roughly 45 days of service. We - or rather I (KEL) - apologize for the inconvenience, lost time and lost effort that you have endured during our outage. We know full well that your contribution hinges on the understanding that we make maximum use of your valuable resources - that we not waste your time, CPU cycles or good humor. We are planning to express our disappointment to our vendors in clear terms, specifically citing the importance of this project to our research effort. We'll keep you abreast of the outcome.
Aug 30, 2007
There was an unannounced service outage this morning at 8AM PST. The outage was planned - I just forgot to announce it to the R@H community. I (KEL) apologize. We had to upgrade the firmware on our SAN, which required taking the hardware offline for an hour. We don't anticipate any further issues along this line, but we'll work harder to ensure prior notification to our R@H partners - you.
July 31, 2007
We moved the backend of the BOINC system from our older fileserver (a dual 2.8 GHz Xeon w/ 2 GB RAM, running a 64-bit kernel and serving up 655 GB of RAID5 disk (6 X 146 GB 10K SCSI)) to our fiber SAN. The SAN is a clustered fileserver, running Polyserve's Matrix Server and serving up 16 450 GB SATA disks as a single RAID5 LUN via two Dell 2950s from within our group's fiber SAN. I'll post more information on and pictures of the SAN setup in the near future....
Apr 10, 2007
The UW campus experienced a large power service failure this morning around 9:15 PST that lasted for roughly 45 minutes. 14 buildings in the South Campus area were affected, including the datacenter in which Rosetta@home is hosted. The battery backup for this facility provides only 10 minutes or so of power to the datacenter floor; there is no backup generator at this site. As a result, we lost all power to all RAH hardware and had to bring everything back from a cold start. We are reviewing the situation.
Jan 11, 2007
Regarding the recent increase in graphics-related errors, we suspect a thread synchronization problem. The Rosetta worker thread runs the simulation, which changes all the atom coordinates (saved in shared memory), while the graphics thread tries to read data from the same place to draw the graphics or screensaver. Currently there is no locking mechanism to ensure that the shared memory is accessed by only one thread at a time, and this could produce conflicts or memory corruption that then trigger an error.
On one of our local computers, errors occurred at a rate of at least one per day on average when the screensaver or graphics were turned on, while without any graphics it ran flawlessly. The errors observed include crashing (0xc0000005), hanging (0x40010004) and getting stuck (watchdog ending). None of the errors were reproducible with the same random number seeds, which we think is due to randomness in the graphics process. As further indirect evidence, showing sidechains requires accessing shared memory more often and more intensively, and after turning off sidechains and rotation the graphics error rate drops, although the problem is not solved completely. There seems to be a correlation between the two.
We are currently working on adding the new locking mechanism and will post an update when it is ready to be tested. Meanwhile, if you have experienced frequent client errors, please temporarily disable the boinc_graphics/screensaver to reduce the problem. Thanks for everybody's support.
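The kind of locking described above can be illustrated with a minimal Python sketch. This is not the Rosetta or BOINC code; the class and function names (SharedCoordinates, run_demo) are made up for illustration, and Python threads stand in for the worker and graphics threads. The point is only that the writer and the reader take the same lock, so the reader never sees a half-updated set of coordinates.

```python
import threading


class SharedCoordinates:
    """Hypothetical stand-in for the shared atom-coordinate buffer."""

    def __init__(self, n_atoms):
        self._lock = threading.Lock()
        self._coords = [(0.0, 0.0, 0.0)] * n_atoms

    def update(self, new_coords):
        # Writer (simulation thread): overwrite the buffer in place
        # while holding the lock, so readers cannot see a partial frame.
        with self._lock:
            for i, xyz in enumerate(new_coords):
                self._coords[i] = xyz

    def snapshot(self):
        # Reader (graphics thread): copy the whole buffer under the same lock.
        with self._lock:
            return list(self._coords)


def run_demo(n_atoms=100, n_steps=200):
    shared = SharedCoordinates(n_atoms)
    torn_frames = []

    def worker():
        # Each step writes the same value to every atom.
        for step in range(1, n_steps + 1):
            value = float(step)
            shared.update([(value, value, value)] * n_atoms)

    def graphics():
        for _ in range(n_steps):
            frame = shared.snapshot()
            # A consistent frame contains coordinates from a single step only.
            if len(set(frame)) != 1:
                torn_frames.append(frame)

    threads = [threading.Thread(target=worker), threading.Thread(target=graphics)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared.snapshot(), torn_frames
```

Copying the frame under the lock keeps the critical section short: the graphics thread can then draw from its private copy without blocking the simulation.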
Oct 11, 2006
A batch of work units, the first protein-protein docking Rosetta@home experiments, were sent out yesterday that, unfortunately, were destined to fail.
The actual work units were completely fine as they were tested on Ralph thoroughly but a rather unfortunate circumstance caused them to fail on clients.
The work units had input files whose names ended with "map", and our downloads web server was configured to handle such files as image map files
(the default Apache setting), so the clients were not getting the actual content of the files and the jobs were failing with download errors.
The web server's configuration has been updated, so the errors should no longer occur. However, the batch was cancelled, so if you have any work units queued
on your client whose names start with "DOC_" AND contain the text "pert_bench_1263_", please abort them. For example, "DOC_4HTC_U_pert_bench_1263_1000_0".
We are very sorry this happened. It's unfortunate since there have been some recent exciting developments, which include an updated
application and screensaver, and docking.
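For anyone with a long queue, the naming rule above can be checked mechanically. The following is a small Python sketch of that rule only; the function name is made up, and apart from the example name quoted above, the names in the list are hypothetical.

```python
def is_cancelled_docking_unit(name):
    """True if the name starts with "DOC_" AND contains "pert_bench_1263_"."""
    return name.startswith("DOC_") and "pert_bench_1263_" in name


queued = [
    "DOC_4HTC_U_pert_bench_1263_1000_0",  # example quoted in the news item
    "DOC_4HTC_U_other_batch_1_0",         # hypothetical: right prefix, other batch
    "HYPOTHETICAL_pert_bench_1263_5_0",   # hypothetical: right batch text, wrong prefix
]

to_abort = [name for name in queued if is_cancelled_docking_unit(name)]
```

Both conditions must hold, so only the first name ends up in to_abort.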
August 23, 2006
We have officially switched over to a new crediting system that grants credit based on the number of structures produced by your computer. Under the new system,
the amount of credit awarded per structure for a particular work unit is determined by the average amount of credit claimed per structure using the standard BOINC
credit metric over all Rosetta@home runs of that work unit to date. For each work unit type, we keep track of the total amount of claimed credits and structures from valid results
returned by hosts, and we use these running totals to determine the amount of credit to award per structure. So if your computer returns 2 structures, the
amount of credit awarded would be 2 * total_claimed_credit / total_structures where total_claimed_credit and total_structures are the sum of the claimed credits
and structures from valid results returned by all hosts prior to your returned result for that particular work unit type, respectively.
The first returned result will be awarded the claimed credit, the second returned result will get the average claimed credit per structure between the two multiplied by the number of structures returned by the result, the third returned result will get the average claimed credit per structure between the three multiplied by the number of structures returned by the result, and so forth.
In the same time frame, a faster computer will produce more structures than
a slower computer and thus will be awarded more credit per unit of CPU time.
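The running-average rule can be sketched in a few lines of Python. This is an illustration, not the project's server code: the class name is made up, and it assumes, following the "first, second, third result" description above, that the running totals include the result currently being credited.

```python
class CreditTracker:
    """Running totals for one work unit type (hypothetical helper)."""

    def __init__(self):
        self.total_claimed_credit = 0.0
        self.total_structures = 0

    def award(self, claimed_credit, structures):
        """Return the credit granted for one valid returned result."""
        if structures <= 0:
            raise ValueError("a valid result must contain at least one structure")
        # Assumption: the current result is added to the totals before averaging,
        # so the first result is granted exactly what it claimed.
        self.total_claimed_credit += claimed_credit
        self.total_structures += structures
        credit_per_structure = self.total_claimed_credit / self.total_structures
        return structures * credit_per_structure


tracker = CreditTracker()
first = tracker.award(claimed_credit=10.0, structures=2)   # 2 * 10/2  = 10.0
second = tracker.award(claimed_credit=30.0, structures=2)  # 2 * 40/4  = 20.0
third = tracker.award(claimed_credit=20.0, structures=4)   # 4 * 60/8  = 30.0
```

In this made-up example the second host claimed 30 credits but is granted 20, because the average claim per structure across both results is 10.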
April 14, 2006
Our backbone switch at one of our datacenters failed this evening, dragging down the connection to a good deal of our clusters, analysis gear and the backbone of Rosetta@home. The switch has been replaced.
March 31, 2006
All the routers on the UW campus were upgraded sometime between 0530 PST and 0600 PST. There was a 5-minute outage sometime during that half hour while our router rebooted.
March 28, 2006
With Rom Walton's help, we've made good progress debugging rosetta. For starters,
Rom has fixed the annoying "leave in memory" bug by updating the BOINC API to use TerminateProcess instead of exit
to shut down the application. TerminateProcess halts any executing threads and then cleans up after the application
instead of relying on the application to clean up after itself. Additional information for debugging has also been
added to help track down remaining bugs like the "1% bug." For example, there are now fractional percent complete
values (.01,.02, etc.) that will help us determine where jobs are getting stuck. Windows users can also help us track down bugs further
by downloading the program database file
(http://boinc.bakerlab.org/rosetta/download/rosetta_X.XX_windows_intelx86.pdb where X.XX is the latest version number)
and placing it in the "BOINC/projects/boinc.bakerlab.org_rosetta" directory (the same directory where the rosetta executable resides).
Currently, this file is not automatically packaged with the executable to reduce bandwidth usage.
The program database file provides additional debug information that gets written to stderr. We are optimistic that with Rom's help
and feedback from participants, we will soon be able to track down and fix the remaining issues with rosetta and continue to reduce the
error rate for the project.
March 22, 2006
Recently our database server has been crushed repeatedly, resulting in webserver slowdowns. Tracking all of this down has been a real challenge. This morning it was noticed that during every slowdown someone was trying to 'merge' two client hosts. After eliminating everything else as a possibility, we were left to wonder if this merging of hosts could really be a problem. After a disastrously wasteful search through many unrelated issues, it was discovered that SETI@home had this very problem last month [see the entry for February 1, 2006 on the SETI@home Technical News Page]. We have disabled host merges until we can upgrade the database software.
February 22, 2006
Starting at around 8:20 (PST) this morning the University of Washington network began to experience widespread connectivity problems. It has been resolved.
February 17, 2006
Today we backed up our database and upgraded our production database server
which now uses mysql-max and has improved I/O performance by using a SCSI controller serving a RAID10 of 14 drives.
We will be delaying the release of the updated rosetta application due to some remaining issues with the cpu run time
preference that are apparent on our RALPH test project. A fix will be made and tested immediately. You should still
be able to set the new project specific preferences but they will not take effect
until the application gets updated.
February 14, 2006
We've modified the webserver to address the problems connecting to the server. This should improve matters for all.
January 17, 2006
The project will be down for maintenance starting today at 3pm PST. Today's down time
is expected to be a bit longer than usual because, in addition to backing up our database and optimizing tables,
we are also going to move our project files over to the file server.
January 13, 2006
The University of Washington experienced a campus-wide network slowdown today related to the Windows WMF vulnerability. See more here
January 12, 2006
We stated below that we will grant credit to users who have run and aborted bad work units that were
initially released on December 20th. This has finally been done for aborted and failed results from work units in batch 205 and
work units that were issued bad random number seeds. The claimed_credit from these results was added to the total_credit
in the user, host, and team database tables. A total of 274609.56 credits were granted.
A tab delimited list of userid, hostid, teamid, and granted credit is
available online (4.2M) for
anyone curious.
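As a rough illustration of how such a list could be tallied, the Python sketch below sums the granted credit per user, host, and team from tab-delimited rows. It assumes the column order given above (userid, hostid, teamid, granted credit); the function name and the sample rows are made up and are not taken from the actual file.

```python
import csv
import io
from collections import defaultdict


def sum_granted_credit(tsv_text):
    """Sum granted credit per user, host, and team from tab-delimited rows."""
    totals = {
        "user": defaultdict(float),
        "host": defaultdict(float),
        "team": defaultdict(float),
    }
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        userid, hostid, teamid, credit = row
        amount = float(credit)
        totals["user"][userid] += amount
        totals["host"][hostid] += amount
        totals["team"][teamid] += amount
    return totals


# Hypothetical sample rows: userid, hostid, teamid, granted credit.
sample = "\n".join([
    "1\t10\t7\t1.5",
    "1\t11\t7\t2.25",
    "2\t12\t7\t4.0",
])
totals = sum_granted_credit(sample)
```

Here user 1 has two hosts, so the user total is the sum of both rows while each host keeps its own amount.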
January 6, 2006
Today, we are going to back up the database and optimize tables for general maintenance starting
at 3pm PST.
We are also going to replace the data fileserver with one that is more robust.
Our initial fileserver used a logical volume consisting of 5 146GB
Ultra3 SCSI drives, w/o redundancy. One of the disks has developed a
problem, putting the logical volume in peril.
As a replacement we've built a new fileserver from a dual 2.8GHz XEON
w/ 2GB RAM running a 6 X 146 GB RAID-5 from a LSI MegaRAID controller,
providing redundancy.
December 20, 2005
Last evening we released updated versions of the rosetta application for all three platforms. The
updates include changes to, again, increase diversity in the searches. For those familiar with Rosetta,
the protocol can now use larger protein fragment libraries and run more cycles. There were also minor changes to
the graphics to allow rotation of the native structure.
Additionally, a bug was found and fixed by Bin, a postdoc
in our lab, that may have been causing the "1%" continual loop. This bug would occur very infrequently in specific circumstances.
We do not know for sure yet if this is the only "1%" bug.
We also put our new work unit batch submission system into production.
Unfortunately, a batch of work units using this system was not set up correctly. Work units from this batch have names
starting with "DEFAULT_xxxxx_205_" where xxxxx are the protein code and chain id. 205 is the batch id.
IF YOU ARE RUNNING ONE OF THESE WORK UNITS, PLEASE ABORT IT. Batch 206 and greater are okay, and should not be
aborted.
The work units in batch 205 were set up to predict 1000 structures instead of 10, so they will all reach the run time
limit of 12-16 hours before finishing and will eventually error out. WE WILL GRANT CREDIT
TO PEOPLE WHO HAVE RUN AND ABORTED THESE WORK UNITS.
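The batch id can be read straight out of the work unit name. The Python sketch below does this under one assumption not stated above: that "xxxxx" is always exactly five characters with no underscore. The function names and the example work unit names are hypothetical.

```python
import re

# Assumed name layout: DEFAULT_<5-character protein code and chain id>_<batch id>_...
BATCH_NAME = re.compile(r"^DEFAULT_(?P<target>[^_]{5})_(?P<batch>\d+)_")


def batch_id(name):
    """Return the batch id as an int, or None if the name does not match."""
    match = BATCH_NAME.match(name)
    return int(match.group("batch")) if match else None


def should_abort(name):
    """Only batch 205 was set up incorrectly; 206 and greater are okay."""
    return batch_id(name) == 205


# Hypothetical work unit names for illustration.
bad_unit = "DEFAULT_1abcA_205_42_0"
good_unit = "DEFAULT_1abcA_206_42_0"
```

Names that do not follow the DEFAULT_ pattern return None from batch_id and are never flagged for aborting.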
Another problem has been identified with some new work units which is causing a 0xc0000005 UNHANDLED EXCEPTION
error. This is a weird bug that appears to be dependent on the random number seed and we are currently looking
into its cause. A short-term fix of using the computer clock to generate the seed (as has been done in previous
runs) is in place.
In an effort to prevent errors like this in the future, we will set up a local test boinc server and do quality
control after the holidays.
December 12, 2005
Our work unit feeder is having a tough time keeping up with all the client requests for work.
A short-term fix (as has been done before) is to optimize the database tables. We will be doing this later today at
3pm and also backing up the database. As stated before, we are going to expand our servers soon to deal with this
issue.
November 27, 2005
Welcome to our new technical news bulletin.
Today, we backed up our database and reconfigured the database server to match SETI@home's configuration.
We'd like to thank Bob Bankay, SETI@home's database administrator, and David Hammer at Einstein@home for providing
useful advice and copies of their my.cnf files. Soon, we will be testing database replication on two test servers
(64 bit dual Opterons w/ 8 GB RAM) set up by Keith, and if the tests look good, they will be used for production. The
benefits of using replication (as stated in the MySQL documentation) are 1) server robustness (if the master server goes down another
can be used as a backup), 2) load balancing for non-updating queries, and 3) server maintenance (such as database backups)
without disruptions.