Message boards : Number crunching : Welcome Back!
Author | Message |
---|---|
Mike Francis Send message Joined: 24 Nov 05 Posts: 8 Credit: 623,519 RAC: 0 |
Sep 08, 2007 Rosetta@home has experienced a horrendous hardware/firmware failure. We essentially lost the SAN partition upon which the project was running! The newest edition of our SAN hardware was shipped with a firmware revision that contained an insidious bug - one which caused the new SAN disks to vanish after roughly 45 days of service. We - or rather I (KEL) - apologize for the inconvenience, lost time and lost effort that you have endured during our outage. We know full well that your contribution hinges on the understanding that we make maximum use of your valuable resources - that we not waste your time, CPU cycles or good humor. We are planning to express our disappointment to our vendors in clear terms, specifically citing the importance of this project to our research effort. We'll keep you abreast of the outcome. You People at the Project have been doing one heck of a GREAT JOB! We will see you when we see you. Mike F, |
Keith E. Laidig Volunteer moderator Project developer Send message Joined: 1 Jul 05 Posts: 154 Credit: 117,189,961 RAC: 0 |
You People at the Project have been doing one heck of a GREAT JOB! We will see you when we see you. I appreciate your patience but we're embarrassed.... I plan to pass along my discomfort to a couple of OEM vice presidents next week! Post if anything doesn't work as you expect. -KEL |
BitSpit Send message Joined: 5 Nov 05 Posts: 33 Credit: 4,147,344 RAC: 0 |
Can't do any file transfers yet. I spotted this in one of the message logs here: aurora rosetta@home 9/8/2007 8:09:40 PM Message from server: Server can't open log file (../log_boinc/cgi.log) |
michaelgwynn Send message Joined: 10 Apr 06 Posts: 8 Credit: 1,055,837 RAC: 0 |
same error that i've seen since yesterday, on all 8 pcs |
googloo Send message Joined: 15 Sep 06 Posts: 133 Credit: 22,813,645 RAC: 2,622 |
KEL, thanks for your hard work. Give those OEM executives hell! Since you asked for it, here are two message sets. The first is an attempt to get new work, and the second is an attempt to report a completed work unit. 9/8/2007 9:54:19 PM|rosetta@home|Sending scheduler request: Requested by user 9/8/2007 9:54:19 PM|rosetta@home|Requesting 28580 seconds of new work 9/8/2007 9:54:24 PM|rosetta@home|Scheduler RPC succeeded 9/8/2007 9:54:24 PM|rosetta@home|Message from server: Server can't open log file (../log_boinc/cgi.log) 9/8/2007 9:54:24 PM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec 9/8/2007 9:54:24 PM|rosetta@home|Reason: project is down 9/8/2007 9:56:47 PM|rosetta@home|[file_xfer] Started upload of file Ly49A_BOINC_MFR_ABRELAX_PICKED_2065_9296_0_0 9/8/2007 9:56:48 PM|rosetta@home|[error] Error on file upload: can't open log file 9/8/2007 9:56:48 PM|rosetta@home|[file_xfer] Temporarily failed upload of Ly49A_BOINC_MFR_ABRELAX_PICKED_2065_9296_0_0: transient upload error 9/8/2007 9:56:48 PM|rosetta@home|Backing off 2 hr 49 min 7 sec on upload of file Ly49A_BOINC_MFR_ABRELAX_PICKED_2065_9296_0_0 |
Polian Send message Joined: 21 Sep 05 Posts: 152 Credit: 10,141,266 RAC: 0 |
|
googloo Send message Joined: 15 Sep 06 Posts: 133 Credit: 22,813,645 RAC: 2,622 |
Sorry, my terminology was incorrect. The second set of messages was an attempt to upload a completed work unit (not report). |
TA_JC Send message Joined: 7 Nov 05 Posts: 13 Credit: 7,105,670 RAC: 3,389 |
9/8/2007 18.55.18|rosetta@home|Started download of boinc_mfr_aaPROF_09_05.200_v1_3.gz 9/8/2007 18.55.19||Network error: Transferred a partial file 9/8/2007 18.55.20|rosetta@home|Temporarily failed download of boinc_mfr_aaPROF_09_05.200_v1_3.gz: http error 9/8/2007 18.55.21|rosetta@home|Started download of boinc_mfr_aaPROF_09_05.200_v1_3.gz 9/8/2007 18.55.23||Network error: couldn't connect to server 9/8/2007 18.55.23|rosetta@home|Temporarily failed download of boinc_mfr_aaPROF_09_05.200_v1_3.gz: http error 9/8/2007 18.55.24|rosetta@home|Started download of boinc_mfr_aaPROF_09_05.200_v1_3.gz 9/8/2007 18.55.26||Network error: couldn't connect to server 9/8/2007 18.55.26|rosetta@home|Temporarily failed download of boinc_mfr_aaPROF_09_05.200_v1_3.gz: http error 9/8/2007 18.55.26|rosetta@home|Backing off 8 minutes and 38 seconds on download of file boinc_mfr_aaPROF_09_05.200_v1_3.gz 9/8/2007 18.55.32|rosetta@home|Started upload of truncbeat__BOINC_JUMPRELAX_BARCODE3_CONSTRAINT_DISULF-beat_-_2056_36555_0_0 9/8/2007 18.55.35|rosetta@home|Error on file upload: can't open log file This is what I'm getting. The 'partial file' errors started on 9/3 for me. |
Mike Francis Send message Joined: 24 Nov 05 Posts: 8 Credit: 623,519 RAC: 0 |
Am also receiving log file errors on transfers. 9/8/2007 10:35:42 PM|rosetta@home|[error] Error on file upload: can't open log file 9/8/2007 10:35:42 PM|rosetta@home|[file_xfer] Temporarily failed upload of CNTRL_01ABRELAX_SAVE_ALL_OUT_-1opd_-_filters_1782_486037_0_0: transient upload error 9/8/2007 10:35:42 PM|rosetta@home|Backing off 1 hr 44 min 48 sec on upload of file CNTRL_01ABRELAX_SAVE_ALL_OUT_-1opd_-_filters_1782_486037_0_0 9/8/2007 10:35:51 PM|rosetta@home|[file_xfer] Started upload of file CNTRL_01ABRELAX_SAVE_ALL_OUT_-1opd_-_filters_1782_485795_0_0 9/8/2007 10:35:52 PM|rosetta@home|[error] Error on file upload: can't open log file 9/8/2007 10:35:52 PM|rosetta@home|[file_xfer] Temporarily failed upload of CNTRL_01ABRELAX_SAVE_ALL_OUT_-1opd_-_filters_1782_485795_0_0: transient upload error 9/8/2007 10:35:52 PM|rosetta@home|Backing off 3 hr 20 min 4 sec on upload of file CNTRL_01ABRELAX_SAVE_ALL_OUT_-1opd_-_filters_1782_485795_0_0 9/8/2007 10:35:59 PM|rosetta@home|[file_xfer] Started upload of file Ly49A_BOINC_MFR_ABRELAX_PICKED_2065_26154_0_0 9/8/2007 10:36:00 PM|rosetta@home|[error] Error on file upload: can't open log file 9/8/2007 10:36:00 PM|rosetta@home|[file_xfer] Temporarily failed upload of Ly49A_BOINC_MFR_ABRELAX_PICKED_2065_26154_0_0: transient upload error 9/8/2007 10:36:00 PM|rosetta@home|Backing off 2 hr 39 min 28 sec on upload of file Ly49A_BOINC_MFR_ABRELAX_PICKED_2065_26154_0_0 |
Yank Send message Joined: 18 Apr 06 Posts: 71 Credit: 1,752,514 RAC: 0 |
Same errors here. Thanks for your hard work on what I'm sure was a nightmare. Same errors. Nightmare for all of us especially for the working staff at Rosetta at Home. |
Scottatron Send message Joined: 20 Sep 05 Posts: 23 Credit: 591,959 RAC: 0 |
Give the project time, and uploads etc will all work again. |
MattDavis Send message Joined: 22 Sep 05 Posts: 206 Credit: 1,377,748 RAC: 0 |
Does this mean the project lost a lot of the information that we have crunched? |
Scottatron Send message Joined: 20 Sep 05 Posts: 23 Credit: 591,959 RAC: 0 |
I doubt it, the scientific results would be pumped into a database on a regular basis - and there would be backups done (Hopefully!) |
BarryAZ Send message Joined: 27 Dec 05 Posts: 153 Credit: 30,843,285 RAC: 0 |
Agreed, regarding giving the project time to recover -- I'm assuming that work unit deadlines are going to be pushed out so we don't have a bunch of over dues. Same errors here. Thanks for your hard work on what I'm sure was a nightmare. |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Just restarted my oldgirls after some T.L.C. (Total Little Cleanout). Getting the same problem, guess it'll take some time. 9/9/2007 3:32:22 PM|rosetta@home|Message from server: Server can't open log file (../log_boinc/cgi.log) Pete. |
larry1186 Send message Joined: 18 Apr 06 Posts: 7 Credit: 329,257 RAC: 0 |
|
Rohan Send message Joined: 19 Aug 07 Posts: 6 Credit: 75,560 RAC: 0 |
Just restarted my oldgirls after some T.L.C. (Total Little Cleanout). Same errors (can't open log file). Any ideas on how long until up and running again? Thanks Rohan |
Emigdio Lopez Laburu Send message Joined: 25 Feb 06 Posts: 61 Credit: 40,240,061 RAC: 0 |
Hi. Still not possible to send/receive work. |
Stevea Send message Joined: 19 Dec 05 Posts: 50 Credit: 738,655 RAC: 0 |
Welcome back? Still not uploading any wu's, giving a can't find file error. I have 4 rigs that have not uploaded a single wu yet. Last contact was on Sept. 4th. I can see us not getting credit for the work that was completed before the servers went down, if the file in question was on the server and cannot be recovered. I can see a lot of people not returning after finding how much credit other projects are giving out compared to Rosetta. I can say for sure one of my machines will not be returning as it's getting over 100 ppd more on another project. Seems like the dreaded fair credit question will be brought back up after this fiasco has been resolved. So much for the industry standard 99.9% uptime....for critical systems. BETA = Bahhh Way too many errors, killing both the credit & RAC. And I still think the (New and Improved) credit system is not ready for prime time... |
hugothehermit Send message Joined: 26 Sep 05 Posts: 238 Credit: 314,893 RAC: 0 |
Stevea It's frustrating, isn't it. I have been frequently checking R@H and I noticed that when they first came back online they had 4-Sep ~60 TF (I believe) on the main page, so hopefully it means that everything was backed up. At a guess I would say that there should be no problems with the credit, but then again we are talking about computers :) |
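
For anyone who wants to total up the retry delays in client logs like the ones posted above, here is a minimal Python sketch. It assumes each log line has the pipe-delimited shape seen in this thread (timestamp|project|message) and that backoff messages read "Backing off H hr M min S sec on upload of file NAME". The names `parse_line`, `backoffs` and `BACKOFF_RE` are made up for illustration and are not part of BOINC. The older "8 minutes and 38 seconds" wording from one of the posts is not handled.

```python
import re
from datetime import timedelta

# Sample lines copied from the client logs quoted in this thread.
SAMPLE = """\
9/8/2007 9:54:24 PM|rosetta@home|Message from server: Server can't open log file (../log_boinc/cgi.log)
9/8/2007 9:54:24 PM|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
9/8/2007 9:56:48 PM|rosetta@home|[error] Error on file upload: can't open log file
9/8/2007 9:56:48 PM|rosetta@home|Backing off 2 hr 49 min 7 sec on upload of file Ly49A_BOINC_MFR_ABRELAX_PICKED_2065_9296_0_0
"""

# Assumed message shape; hours and minutes are treated as optional.
BACKOFF_RE = re.compile(
    r"Backing off (?:(\d+) hr )?(?:(\d+) min )?(\d+) sec "
    r"on (upload|download) of file (\S+)"
)


def parse_line(line):
    """Split one log line into (timestamp, project, message)."""
    timestamp, project, message = line.split("|", 2)
    return timestamp, project, message


def backoffs(text):
    """Return (direction, file name, delay) for every backoff message."""
    found = []
    for line in text.splitlines():
        if not line.strip():
            continue
        _, _, message = parse_line(line)
        match = BACKOFF_RE.search(message)
        if match:
            hours, minutes, seconds, direction, name = match.groups()
            delay = timedelta(
                hours=int(hours or 0),
                minutes=int(minutes or 0),
                seconds=int(seconds),
            )
            found.append((direction, name, delay))
    return found


if __name__ == "__main__":
    for direction, name, delay in backoffs(SAMPLE):
        print(direction, name, delay)
```

Pasting your own message log into `SAMPLE` should list each file that is waiting and how long the client said it would wait before retrying.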