Rosetta@home

Technical news


The news items below address various technical issues regarding our servers, code development, and research.

Aug 14, 2014
In response to user feedback, and in an effort to reduce the load on our servers, we increased the default target run time from 3 to 6 hours, increased job deadlines to 14 days, and added a target CPU run time option of 2 days. Any input on this is appreciated, good or bad. Please post to this thread. We can revert to the previous values if necessary after assessing how these changes affect the project. Remember, you can always set the run time preference to your liking, from as low as 1 hour to as long as 2 days; it is the "Target CPU run time" option in the Rosetta@home specific preferences. Also keep in mind that it is a target run time: if the job is a large protein, it may take longer than 1 hour to generate a single model, so the actual run time can exceed the target (at least 1 model is always generated).
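
For illustration, here is a minimal sketch of how such a target run time typically behaves, using hypothetical names (this is not Rosetta's actual code): the elapsed CPU time is only checked between models, so at least one model is always produced, and a single long-running model can push the run past the target.

```cpp
// Hypothetical sketch, not Rosetta's actual code.
#include <cmath>
#include <ctime>
#include <iostream>

// Stand-in for one Rosetta trajectory; for a large protein a single call like
// this could exceed the whole target run time on its own.
static void generate_one_model() {
    volatile double x = 0;
    for (int i = 0; i < 10000000; ++i) x += std::sin(i);
}

int main() {
    // The real default target is 6 hours (21600 s); a tiny value keeps this demo short.
    const double target_cpu_seconds = 2.0;
    const std::clock_t start = std::clock();
    int models = 0;
    do {
        generate_one_model();
        ++models;   // elapsed time is only checked *between* models
    } while (double(std::clock() - start) / CLOCKS_PER_SEC < target_cpu_seconds);
    std::cout << "produced " << models << " model(s)\n";
}
```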

Jan 29, 2014
We will be moving the hardware that supports Rosetta@home on Tuesday, Feb 4th, within the datacenter here at the UW. This will require that the entire project be taken offline for the duration. While we will try to minimize the down-time, we are planning for the system to be offline from 0800 PST until 1500 PST on the 4th. -KEL & DOVA

Jan 6, 2009
Bloody Hell! What a day! First I killed the database server by inadvertently yanking it out of the rack and away from its plugs while removing the old fileserver from the rack - DK saved the day there. Next the backbone switch for the entire project went bonkers after the server upon which it had been resting - the same old fileserver - was removed (see above), resulting in a small torque on the switch chassis. I propped the switch up to relieve the torque, rebooted it, and all was well. The group would be better off paying me NOT to come to work! -KEL

Nov 30, 2008
Today the fileserver crashed yet again. I'd had enough of this and decided to take the opportunity to move things over to the new SAN we've been prepping. Moving all the necessary files from the old server to the new one took ~6 hours. We've still got to migrate the older persistent data - something like 3 days of solid data transfer - but we can run without that being immediately available. This should be much more stable and reduce the number of sudden outages. -KEL

Nov 30, 2007
DK, our primary site developer, is recovering from surgery and is out of contact. When the troubles arose, DK's status wasn't known to the IT group, resulting in confusion. Rhiju stepped in and provided the fix. We apologize for the outage...

Nov 29, 2007
The feeder is failing over and over again, so no new work is being sent out. The feeder application needs to be rebuilt to handle the large number of applications that R@H now has. We are aware of the problem and will get this done as soon as possible.

Sep 18, 2007
The RAID is rebuilt and now we're chasing down the next bottleneck....

Sep 17, 2007
We're experiencing another hardware problem that is affecting the overall project, causing the validator and assimilators to fall behind. The temporary fileserver to which we've switched has dropped one of its RAID disks. We've found a replacement and the RAID is rebuilding now. Once it finishes, the project should start catching up with itself...

Sep 08, 2007
Rosetta@home has experienced a horrendous hardware/firmware failure. We essentially lost the SAN partition upon which the project was running! The newest addition to our SAN hardware was shipped with a firmware revision that contained an insidious bug - one which caused the new SAN disks to vanish after roughly 45 days of service. We - or rather I (KEL) - apologize for the inconvenience, lost time and lost effort that you have endured during our outage. We know full well that your contribution hinges on the understanding that we make maximum use of your valuable resources - that we not waste your time, CPU cycles or good humor. We are planning to express our disappointment to our vendors in clear terms, specifically citing the importance of this project to our research effort. We'll keep you abreast of the outcome.

Aug 30, 2007
There was an unannounced service outage this morning at 8 AM PST. The outage was planned - I just forgot to announce it to the R@H community. I (KEL) apologize. We had to upgrade the firmware on our SAN, which required the hardware to be offline for an hour. We don't anticipate any further issues along this line, but we'll work harder to ensure prior notification to our R@H partners - you.

July 31, 2007
We moved the backend of the BOINC system from our older fileserver (a dual 2.8 GHz Xeon w/ 2 GB RAM, running a 64-bit kernel and serving up a 655 GB RAID5 volume (6 × 146 GB 10K SCSI disks)) to our fiber SAN. The SAN is a clustered fileserver, running Polyserve's Matrix Server and serving up 16 × 450 GB SATA disks as a single RAID5 LUN via two Dell 2950s from within our group's fiber SAN. I'll post more information on and pictures of the SAN setup in the near future....

Apr 10, 2007
The UW campus experienced a large power service failure this morning around 9:15 PST that lasted for roughly 45 minutes. 14 buildings in the South Campus area were affected, including the datacenter in which Rosetta@home is hosted. The battery backup for this facility provides only 10 minutes or so of power to the datacenter floor; there is no backup generator at this site. As a result, we lost power to all R@H hardware and had to bring everything back from a cold start. We are reviewing the situation.

Jan 11, 2007
Regarding the recent increase in graphics-related errors, we suspect it is a thread synchronization problem. Basically, the Rosetta worker thread runs the simulation, which changes all of the atom coordinates (which are saved in shared memory), while the graphics thread reads data from that same memory to draw the graphic or screensaver. Currently there is no locking mechanism to ensure that the shared memory is accessed by only one thread at a time, and this can cause conflicts or memory corruption and then trigger an error. On one of our local computers, with the screensaver or graphics turned on, we caught errors at a rate of at least one per day on average; without any graphics, it ran flawlessly. The errors we have observed include crashes (0xc0000005), hangs (0x40010004), and being stuck (watchdog ending). None of the errors were reproducible with the same random number seeds, and we think that is due to the randomness in the graphics process. Further evidence: showing sidechains requires accessing the shared memory more often and more intensively, and after turning off sidechains and rotation, the graphics error rate drops, though the problem is not solved completely, so there seems to be a correlation between the two. We are currently working on adding a new locking mechanism and will post an update when it is ready to be tested. Meanwhile, if you have experienced frequent client errors, please temporarily disable the BOINC graphics/screensaver to reduce the problem. Thanks for everybody's support.
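
For illustration, here is a minimal sketch of the kind of locking mechanism described above, with hypothetical names (this is not the actual Rosetta/BOINC graphics code): both threads take the same mutex around the shared coordinates, so neither one can observe a half-written buffer.

```cpp
// Hypothetical sketch, not the actual Rosetta/BOINC graphics code.
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

struct SharedCoords {
    std::mutex lock;              // guards xyz
    std::vector<double> xyz;      // atom coordinates shared between the two threads
};

// Worker (simulation) thread: publish a whole new set of coordinates under the lock.
void publish_coords(SharedCoords& s, const std::vector<double>& new_xyz) {
    std::lock_guard<std::mutex> guard(s.lock);
    s.xyz = new_xyz;              // the full update happens while the lock is held
}

// Graphics/screensaver thread: copy out a consistent snapshot to draw from.
std::vector<double> snapshot_coords(SharedCoords& s) {
    std::lock_guard<std::mutex> guard(s.lock);
    return s.xyz;
}

int main() {
    SharedCoords shared;
    std::thread worker([&] {
        for (int step = 0; step < 1000; ++step)
            publish_coords(shared, std::vector<double>(300, double(step)));
    });
    std::thread graphics([&] {
        for (int frame = 0; frame < 1000; ++frame)
            snapshot_coords(shared);   // never sees a partially updated buffer
    });
    worker.join();
    graphics.join();
    std::cout << "final coordinate count: " << snapshot_coords(shared).size() << "\n";
}
```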

Oct 11, 2006
A batch of work units, the first protein-protein docking Rosetta@home experiments, was sent out yesterday that, unfortunately, was destined to fail. The actual work units were completely fine, as they had been tested thoroughly on Ralph, but a rather unfortunate circumstance caused them to fail on clients. The work units had input files whose names ended with "map", and our downloads web server was configured to handle such files as image map files (the default Apache setting), so the clients were not getting the actual content of the files and the jobs were failing with download errors. The web server's configuration has been updated, so the errors should no longer occur; however, the batch was cancelled, so if you have any work units queued on your client whose names start with "DOC_" AND contain the text "pert_bench_1263_", please abort them. For example, "DOC_4HTC_U_pert_bench_1263_1000_0". We are very sorry this happened. It's unfortunate, since there have been some recent exciting developments, including an updated application and screensaver, and docking.
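
As an illustration of the sort of configuration change involved, here is a minimal httpd.conf sketch with an assumed directory path (the actual directives and paths on our server may differ): stop Apache's imagemap handler from intercepting files whose names end in "map" in the download area and serve them as plain data instead.

```apache
# Hypothetical snippet, not our actual configuration. An "AddHandler imap-file .map"
# line (the default setting mentioned above) makes Apache treat such files as
# server-side imagemaps; RemoveHandler undoes that for the download area so the
# raw file contents are sent to clients.
<Directory "/path/to/rosetta/download">
    RemoveHandler .map
    AddType application/octet-stream .map
</Directory>
```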

August 23, 2006
We have officially switched over to a new crediting system that grants credit based on the number of structures produced by your computer. Under the new system, the amount of credit awarded per structure for a particular work unit is determined by the average credit claimed per structure, using the standard BOINC credit metric, over all Rosetta@home runs of that work unit to date. For each work unit type, we keep track of the total claimed credit and the total number of structures from valid results returned by hosts, and we use these running totals to determine the amount of credit to award per structure. So if your computer returns 2 structures, the amount of credit awarded would be 2 * total_claimed_credit / total_structures, where total_claimed_credit and total_structures are the sums of the claimed credits and structures from valid results returned by all hosts prior to your returned result for that particular work unit type. The first returned result is awarded its claimed credit, the second returned result gets the average claimed credit per structure between the two multiplied by the number of structures it returned, the third returned result gets the average claimed credit per structure among the three multiplied by the number of structures it returned, and so forth. Over the same time frame, a faster computer will produce more structures than a slower computer and will thus be awarded more credit per unit of CPU time.
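
A minimal sketch of this per-structure crediting rule, with hypothetical names (not the production server code), following the enumerated behaviour above: each result's claim is folded into the running totals for its work unit type, and the result is then granted its structure count times the running average claimed credit per structure, so the first result simply receives what it claimed.

```cpp
// Hypothetical sketch, not the actual Rosetta@home server code.
#include <iostream>

struct WorkUnitTotals {
    double total_claimed_credit = 0;   // sum of claimed credit from valid results so far
    long   total_structures     = 0;   // sum of structures from valid results so far
};

double grant_credit(WorkUnitTotals& t, double claimed_credit, long structures_returned) {
    // Fold this result's claim into the running totals kept for its work unit type...
    t.total_claimed_credit += claimed_credit;
    t.total_structures     += structures_returned;
    // ...then grant: structures returned * running average claimed credit per structure.
    return structures_returned * t.total_claimed_credit / t.total_structures;
}

int main() {
    WorkUnitTotals t;
    std::cout << grant_credit(t, 30.0, 3) << "\n";  // 3 * (30/3) = 30, i.e. its own claim
    std::cout << grant_credit(t, 18.0, 2) << "\n";  // 2 * (48/5) = 19.2
    std::cout << grant_credit(t, 12.0, 1) << "\n";  // 1 * (60/6) = 10
}
```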

April 14, 2006
Our backbone switch at one of our datacenters failed this evening, dragging down the connection to a good deal of our clusters, analysis gear and the backbone of Rosetta@home. The switch has been replaced.

March 31, 2006
All the routers on the UW campus were upgraded sometime between 0530 PST and 0600 PST. There was a 5 minute outage sometime during that half hour while our router rebooted.

March 28, 2006
With Rom Walton's help, we've made good progress debugging rosetta. For starters, Rom has fixed the annoying "leave in memory" bug by updating the BOINC API to use TerminateProcess instead of exit to shut down the application. TerminateProcess halts any executing threads and then cleans up after the application instead of relying on the application to clean up after itself. Additional information for debugging has also been added to help track down remaining bugs like the "1% bug." For example, there are now fractional percent complete values (.01,.02, etc.) that will help us determine where jobs are getting stuck. Windows users can also help us track down bugs further by downloading the program database file (http://boinc.bakerlab.org/rosetta/download/rosetta_X.XX_windows_intelx86.pdb where X.XX is the latest version number) and placing it in the "BOINC/projects/boinc.bakerlab.org_rosetta" directory (the same directory where the rosetta executable resides). Currently, this file is not automatically packaged with the executable to reduce bandwidth usage. The program database file provides additional debug information that gets written to stderr. We are optimistic that with Rom's help and feedback from participants, we will soon be able to track down and fix the remaining issues with rosetta and continue to reduce the error rate for the project.
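
For illustration, here is a minimal sketch of the idea behind that fix, using a hypothetical wrapper (this is not the actual BOINC API code): on Windows, ending the process with TerminateProcess halts any still-running worker threads and lets the OS clean up, instead of calling exit and relying on the application to unwind on its own.

```cpp
// Hypothetical wrapper, not BOINC's actual shutdown code.
#include <windows.h>

void shutdown_application(int status) {
    // TerminateProcess stops every thread in the process immediately and has the OS
    // reclaim its resources; exit() would run atexit handlers and static destructors
    // and can hang if a worker thread is still busy (the "leave in memory" symptom).
    TerminateProcess(GetCurrentProcess(), static_cast<UINT>(status));
}

int main() {
    // ... application work ...
    shutdown_application(0);   // never returns; the OS tears the process down
}
```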

March 22, 2006
Recently our database server has been getting crushed repeatedly, resulting in webserver slowdowns. Tracking all of this down has been a real challenge. This AM it was noticed that during every slowdown someone was trying to 'merge' two client hosts. After eliminating everything else as a possibility, we were left to wonder if this merging of hosts could really be a problem. After disastrously wasteful searching of many unrelated issues, it was discovered that SETI@home had this very problem last month [see the entry for February 1, 2006 on the SETI@home Technical News Page]. We have disabled host merges until we can upgrade the database software.

February 22, 2006
Starting at around 8:20 AM (PST) this morning, the University of Washington network began to experience widespread connectivity problems. The problems have since been resolved.

February 17, 2006
Today we backed up our database and upgraded our production database server which now uses mysql-max and has improved I/O performance by using a SCSI controller serving a RAID10 of 14 drives. We will be delaying the release of the updated rosetta application due to some remaining issues with the cpu run time preference that are apparent on our RALPH test project. A fix will be made and tested immediately. You should still be able to set the new project specific preferences but they will not take effect until the application gets updated.

February 14, 2006
We've modified the webserver to address the problems connecting to the server. This should improve matters for all.

January 17, 2006
The project will be down for maintenance starting today at 3pm PST. Today's down time is expected to be a bit longer than usual because, in addition to backing up our database and optimizing tables, we are also going to move our project files over to the file server.

January 13, 2006
The University of Washington experienced a campus-wide network slowdown today related to the Windows WMF vulnerability. See more here.

January 12, 2006
We stated below that we would grant credit to users who have run and aborted bad work units that were initially released on December 20th. This has finally been done for aborted and failed results from work units in batch 205 and from work units that were issued bad random number seeds. The claimed_credit from these results was added to the total_credit in the user, host, and team database tables. A total of 274,609.56 credits were granted. A tab-delimited list of userid, hostid, teamid, and granted credit is available online (4.2 MB) for anyone curious.

January 6, 2006
Today, we are going to back up the database and optimize tables for general maintenance starting at 3pm PST. We are also going to replace the data fileserver with one that is more robust. Our initial fileserver used a logical volume consisting of 5 × 146 GB Ultra3 SCSI drives, without redundancy. One of the disks has developed a problem, putting the logical volume in peril. As a replacement, we've built a new fileserver from a dual 2.8 GHz Xeon w/ 2 GB RAM running a 6 × 146 GB RAID-5 on an LSI MegaRAID controller, providing redundancy.

December 20, 2005
Last evening we released updated versions of the rosetta application for all three platforms. The updates include changes to, again, increase diversity in the searches. For those familiar with Rosetta, the protocol can now use larger protein fragment libraries and run more cycles. There were also minor changes to the graphics to allow rotation of the native structure.

Additionally, a bug was found and fixed by Bin, a post doc in our lab, that may have been causing the "1%" continual loop. This bug would occur very infrequently in specific circumstances. We do not know for sure yet if this is the only "1%" bug.

We also put our new work unit batch submission system into production. Unfortunately, a batch of work units using this system was not set up correctly. Work units from this batch have names starting with "DEFAULT_xxxxx_205_" where xxxxx are the protein code and chain id. 205 is the batch id.

IF YOU ARE RUNNING ONE OF THESE WORK UNITS, PLEASE ABORT IT. Batch 206 and greater are okay, and should not be aborted.

The work units in batch 205 were set up to predict 1000 structures instead of 10, so they will all reach the run time limit of 12-16 hours before finishing and will eventually error out. WE WILL GRANT CREDIT TO PEOPLE WHO HAVE RUN AND ABORTED THESE WORK UNITS.

Another problem has been identified with some new work units which is causing a 0xc0000005 UNHANDLED EXCEPTION error. This is a weird bug that appears to be dependent on the random number seed and we are currently looking into its cause. A short-term fix of using the computer clock to generate the seed (as has been done in previous runs) is in place.
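
For illustration, a minimal sketch of that short-term fix, with hypothetical names (not the actual Rosetta code): seed the random number generator from the computer clock rather than from the seed shipped with the work unit, so a seed that happens to trigger the bug is not repeated on every host.

```cpp
// Hypothetical sketch; Rosetta uses its own generator, std::srand merely stands in here.
#include <cstdlib>
#include <ctime>
#include <iostream>

int main() {
    const unsigned int seed = static_cast<unsigned int>(std::time(nullptr)); // seconds since epoch
    std::srand(seed);
    std::cout << "seeded from clock: " << seed << "\n";
    return 0;
}
```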

In an effort to prevent errors like this in the future, we will set up a local test boinc server and do quality control after the holidays.

December 12, 2005
Our work unit feeder is having a tough time keeping up with all the client requests for work. A short-term fix (as has been done before) is to optimize the database tables. We will be doing this later today at 3pm, and we will also be backing up the database. As stated before, we are going to expand our servers soon to deal with this issue.

November 27, 2005
Welcome to our new technical news bulletin.

Today, we backed up our database and reconfigured the database server to match Seti@home's configuration. We'd like to thank Bob Bankay, Seti@home's database administrator, and David Hammer at Einstein@home for providing useful advice and copies of their my.cnf files. Soon, we will be testing database replication on two test servers (64 bit dual Opterons w/ 8 GB RAM) set up by Keith, and if the tests look good, they will be used for production. The benefits of using replication (as stated in the MySQL documentation) are 1) server robustness (if the master server goes down another can be used as a backup), 2) load balancing for non-updating queries, and 3) server maintenance (such as database backups) without disruptions.

