Posts by Cobra

1) Message boards : Number crunching : Server error: can't attach shared memory (Message 57472)
Posted 2 Dec 2008 by Cobra
Post:
Now getting a different error from the server, but it's still preventing reporting results and downloading new work:

12/2/2008 8:14:43 AM Sending scheduler request: To fetch work. Requesting 1521479 seconds of work, reporting 21 completed tasks
12/2/2008 8:14:48 AM Scheduler request succeeded: got 0 new tasks
12/2/2008 8:14:48 AM Message from server: Server error: can't attach shared memory

2) Message boards : Number crunching : WUs stuck at "Uploading" (Message 57394)
Posted 1 Dec 2008 by Cobra
Post:
I recognize that there was a Rosetta fileserver crash 11/30. However, both the home page and the technical news page seem to imply that the fileservers are back online, which makes me think that client/server communication should have been restored.

However, my machines all have a number of WUs stuck at the "Uploading" phase, and there are a number of "temporarily failed upload" lines under the BOINC Mgr Messages tab.

It's not clear to me from what's on the home page and tech news page whether I should expect client/server comm to be back to normal now, or if I should expect a few more days of difficulty before things get back on track.

Should I hang tight, or do I need to reset my project to clear out the stuck WUs?

(An extra sentence on the tech news page saying something like, "Clients may experience delays communicating with servers for the next few days, but backlogged WUs should eventually get delivered," or "Clients still showing WUs in 'Uploading' state should be reset," would be greatly appreciated.)

3) Message boards : Number crunching : Minirosetta v1.40 bug thread (Message 57054)
Posted 19 Nov 2008 by Cobra
Post:
Add me to the list of folks seeing WUs seemingly hang at around 9ish minutes to go to completion. I've seen WUs run as long as 11 hrs without completing before manually aborting them. Behavior seen on multiple hardware platforms (at least an AMD 9950BE and Opteron 180 and an Intel Core2 Duo dual core laptop with installed memory ranging from 1GB to 3+ GB), but all running WinXP.

I really don't understand why people keep going on about this. It seems quite obvious to me that once the counter gets to around 10 minutes it stops counting altogether. Every WU does this, Mini or Beta. Always has, likely always will.

I have not shared your experience. I've run Rosetta@Home for a couple of years, and have happened to catch a few WUs counting down their final couple of minutes, so I disagree with your "always has" comment.

I have also become accustomed over the years to workunits wrapping up in ~2.5-3 hrs nearly 100% of the time. The combination of a "stuck" countdown timer and WUs going ~3-4 times longer than I'm used to was behavior outside of my experience and seemed to indicate a problem, so I posted.

4) Message boards : Number crunching : Minirosetta v1.40 bug thread (Message 57018)
Posted 17 Nov 2008 by Cobra
Post:
Add me to the list of folks seeing WUs seemingly hang at around 9ish minutes to go to completion. I've seen WUs run as long as 11 hrs without completing before manually aborting them. Behavior seen on multiple hardware platforms (at least an AMD 9950BE and Opteron 180 and an Intel Core2 Duo dual core laptop with installed memory ranging from 1GB to 3+ GB), but all running WinXP.

Seems to be happening on 5-20% of my Rosetta Mini 1.40 WUs.

5) Message boards : Number crunching : Minirosetta v1.28 bug thread (Message 54603)
Posted 22 Jul 2008 by Cobra
Post:
I have multiple WinXP Pro computers attached to Rosetta; three AMD-based boxes work fine (one Opteron 180, one Athlon64 3700+, one ancient AthlonXP 1700+), but my Intel-based laptop (Core2 Duo T9300) consistently fails ALL Rosetta Mini 1.28 workunits, usually after 20-35 seconds with a computation error. Occasionally, the laptop will simply hang on a RM 1.28 WU, and I'll have to abort it manually once I notice it. Once or twice I've gotten Windows error dialog boxes telling me that the RM 1.28 process terminated abnormally.

The laptop is new--I got it on July 3--but it has failed on hundreds of RM 1.28 WUs in the three weeks I've had it. As far as I know, it has not successfully processed a single RM 1.28 WU.

For dozens of examples, check my 22 Jul 2008 12:26:04 UTC and 20 Jul 2008 19:43:25 UTC report times.

6) Message boards : Number crunching : Report stuck & aborted WU here please - II (Message 13433)
Posted 11 Apr 2006 by Cobra
Post:
I have had a work unit stuck at ~32.9% for what I think is several days (I did not note the name of the work unit at first, so I cannot be 100% sure it's the same one). CPU clock cycles are being consumed as normal (95-99%), and in the BOINC Manager, CPU time is incrementing. The problem is that "To completion" is incrementing just as fast, and the Progress is not incrementing (though it sometimes seems to fluctuate between 32.90 and 32.94%).

I have seen this work unit (if it's the same one) showing CPU time ~21:00:00 and time to completion as ~19:00:00. However, if I suspend calculation on that work unit, then resume, the times reset to 39:39 CPU time and ~1:45:00 To completion, then both proceed to count up from there again. (The same thing happens if I kill all the BOINC processes and restart them--CPU time resets to ~39:39, and To completion resets to ~1:45:00.)

The workunit in question is FA_RLXpt_hom002_1ptq__361_178_1 (workunit ID 11695526).

I will give the WU one more night before I abort it.





