Posts by genes

21) Message boards : Number crunching : Problems with Rosetta version 5.41 (Message 32139)
Posted 6 Dec 2006 by genes
Post:
Had this one crash while running graphics: resultid=50653149

The usual 0xC0000005 error. I thought I had maybe gotten the problem under control last night, when I installed an ATI graphics card in place of the NVidia one that I had been running with for the longest time. It did run one WU with screensaver graphics enabled, but then this morning I found this one frozen. With the NVidia card, the graphics don't last more than a few minutes before freezing.

System: Balrog

Current VGA card: ATI Radeon X800 XL AGP, driver 6-11_xp-2k_dd_37616
Previous VGA card: NVidia GeForce FX5950 AGP, driver 91.31.

I have tried many other drivers for this and other NVidia cards, and Rosetta/Ralph has had problems with the graphics on all of them so far. I do have another ATI card I can try; I think it's an X850XT. I could also try their CCC driver, though it's bloatware and I don't like it.

If anybody has any suggestions for good NVidia or ATI driver versions, I'm willing to try them out.
22) Message boards : Number crunching : Problems with Rosetta version 5.40 (Message 31549)
Posted 22 Nov 2006 by genes
Post:
I had to abort this wu: resultid=47806539

It had gotten stuck for hours, and was not using any CPU time, even though the Boinc CC said it was running. I suppose if I let it run a few more hours the watchdog would have stopped it, but I didn't want to waste any more time on it.


The above WU was running under Boinc CC 5.7.2 on Windows XP.
23) Message boards : Number crunching : Problems with Rosetta version 5.40 (Message 31445)
Posted 20 Nov 2006 by genes
Post:
I had to abort this wu: resultid=47806539

It had gotten stuck for hours, and was not using any CPU time, even though the Boinc CC said it was running. I suppose if I let it run a few more hours the watchdog would have stopped it, but I didn't want to waste any more time on it.
24) Message boards : Number crunching : Report problems with Rosetta version 5.36 (Message 30829)
Posted 9 Nov 2006 by genes
Post:

Don't do the GUIRPC one; otherwise you'll have a never-ending list that updates very, very fast.
I would stick with the screensaver one.

Just in case, the cc_config.xml file for the screensaver should be:
<cc_config>
<log_flags>
<scrsave_debug>1</scrsave_debug>
</log_flags>
</cc_config>
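
If you'd rather generate the file than type it by hand, here is a minimal Python sketch that builds the same XML as above. The function name build_cc_config is made up for illustration, and the only flag used is the scrsave_debug one from this post. Where the file has to be saved (your BOINC directory) is left to you.

```python
import xml.etree.ElementTree as ET

def build_cc_config(flags):
    """Return cc_config XML text with each named log flag set to 1.

    Hypothetical helper: it only mirrors the structure shown in the post.
    """
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")

xml_text = build_cc_config(["scrsave_debug"])
print(xml_text)
```

The output is all on one line, which should not matter to an XML parser, but reformat it if you prefer the layout above.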

Also attach to Ralph@Home as we are up to R@H 5.40 there trying to fix some error code. http://ralph.bakerlab.org

You will know if the logging is working as it'll be logging from the beginning
08/11/2006 15:22:51||[scrsave_debug] ACTIVE_TASK::check_graphics_mode_ack(): got graphics ack <mode_hide_graphics/> for 1dcj__ETABLE_TEST_ABRELAX_rhh13sm6__1470_193_0, previous mode <mode_unsupported/>

Here's also an example of the mem_usage_debug output, which is updated every 10 seconds, though:
08/11/2006 15:22:59|ralph@home|[mem_usage_debug] 1dcj__ETABLE_TEST_ABRELAX_rhh13sm6__1470_193_0: RAM 28.30MB, page 54.39MB, 710.91 page faults/sec, user CPU 8.903, kernel CPU 0.110
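
If you want to pull the numbers out of lines like that one, a rough Python sketch follows. The regular expression is written only against the single sample line above, so treat the field layout as an assumption; other client versions may format the line differently. The names MEM_RE and parse_mem_usage are made up.

```python
import re

# Pattern based solely on the sample line in this post (an assumption).
MEM_RE = re.compile(
    r"RAM (?P<ram_mb>[\d.]+)MB, page (?P<page_mb>[\d.]+)MB, "
    r"(?P<faults_per_sec>[\d.]+) page faults/sec, "
    r"user CPU (?P<user_cpu>[\d.]+), kernel CPU (?P<kernel_cpu>[\d.]+)"
)

def parse_mem_usage(line):
    """Return the numeric fields as floats, or None if the line doesn't match."""
    m = MEM_RE.search(line)
    if m is None:
        return None
    return {key: float(value) for key, value in m.groupdict().items()}

sample = (
    "08/11/2006 15:22:59|ralph@home|[mem_usage_debug] "
    "1dcj__ETABLE_TEST_ABRELAX_rhh13sm6__1470_193_0: RAM 28.30MB, page 54.39MB, "
    "710.91 page faults/sec, user CPU 8.903, kernel CPU 0.110"
)
stats = parse_mem_usage(sample)
print(stats)
```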


Thanks FluffyChicken, I have set this up, currently I have a Ralph 5.40 WU so we'll see how it goes.
25) Message boards : Number crunching : Report problems with Rosetta version 5.36 (Message 30803)
Posted 8 Nov 2006 by genes
Post:
Here's one I just had fail due to the screensaver:

resultid=46126708

The machine is a dual Xeon with HT, so 4 processors, and BOINC is running 4 projects at a time. I just switched to Boinc CC version 5.7.2, but that had no effect on the behavior of Rosetta; it did the same things under 5.4.11.

Here's how it went, because I was exercising at the time and I saw it happen:
Boinc went into screensaver mode, and Seti was displayed. After 10 minutes the CC changed the screensaver to Rosetta. Rosetta was initially running, and the graphics were changing. Sometime during its 10 minute slice, it froze (the cpu time counter on the graphics stopped updating) while in the "relax" phase. At the end of the slice, the graphics changed to QMC, no problem (but Rosetta was already dead). Then CPDN, and Seti again, then it was Rosetta's turn. The Seti graphics just stopped updating but remained on the screen, and the taskbar appeared. I could see the Rosetta app on the taskbar, and I could move the mouse onto the taskbar, but no programs responded. Ctrl-alt-del got me the task manager, and I killed the Rosetta app. The screen came back to life, and everything worked normally after that. The Rosetta WU showed as a Computation error in the Boinc manager. I manually reported it a few minutes ago.

I'll look at the debugging options mentioned in the last post to see what I can do to help.

26) Message boards : Number crunching : Report problems with Rosetta version 5.36 (Message 30748)
Posted 7 Nov 2006 by genes
Post:
I'm running both Ralph and Rosetta with the screensaver ON. I agree that if we take the easy way out and just turn it off, we will have no problems, but they will never fix it. I'm reporting errors both in Ralph and Rosetta.

Here's some more errors, BTW, but I can't say if they are due to the screensaver:
resultid=45561965
resultid=45523378
resultid=45492781

Crunch on!
27) Message boards : Number crunching : Report problems with Rosetta version 5.36 (Message 30592)
Posted 4 Nov 2006 by genes
Post:
I just had this result hang, and had to terminate it:

result

I walked over to the machine and the screensaver was on, but it wasn't Rosetta, it was Seti. The Seti graphics had frozen, and I couldn't wake up the machine. I hit ctrl-alt-del to get the task manager, and found Rosetta using two cpu's worth of time (out of 4), where normally it would use only 1. The task manager, BTW, was the only thing I could get to come up, and it was sitting on top of the frozen Seti graphics. I killed the Rosetta WU, and immediately the Seti graphics disappeared, and the machine woke up. The Rosetta WU reported "computation error" in the status column.

I suspect this happened when the BOINC manager did its usual 10 minute graphic switchover, but I'm not sure if it was Rosetta trying to take over from Seti, or not letting go when Seti wanted to take over.
28) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 13247)
Posted 8 Apr 2006 by genes
Post:
I've gotten 8 errors with 4.97 over the last 2 days on several machines, and that's just with Rosetta! There's also Ralph, which is currently using 4.97, and I'm having errors there as well.

They are ALL 0xC0000005 errors (access violation). I could list them here, but there are already plenty to look at. Just checking in.

I think I have had only one finish without errors.
29) Message boards : Number crunching : Help us solve the 1% bug! (Message 12746)
Posted 28 Mar 2006 by genes
Post:
Rom-

I tried this: First I suspended network activity and work. I made a backup copy of my BOINC directory, then I restarted BOINC in its original directory. I aborted everything but the stuck Rosetta. I let the Rosetta go, and it passed the stuck point.

I killed BOINC, then deleted everything from Program Files\BOINC. Copied back the contents of BOINC_backup, started up again. Unsuspended the stuck Rosetta, it got stuck again.

So, I have this backup copy of my BOINC directory where this Rosetta WU will stick, but it seems to require the other processes to be running. I can burn this backup to a DVD-R and send it to you, how about that?

[edit] BTW, 4 at a time, dual Xeon with HT. [/edit]

[edit] Going to sleep now, will check again in the AM...[/edit]
30) Message boards : Number crunching : Help us solve the 1% bug! (Message 12742)
Posted 28 Mar 2006 by genes
Post:
What is the size of your BOINC directory?

How many days' worth of workunits do you have? Which projects are attached?

Would you be willing to make a copy of the directory and in the copy abort all of the other workunits except the one that is stalling and zip everything up and send it to me?


Hi Rom,

My BOINC directory is 1.3GB. I am attached to CPDN (regular and seasonal), Rosetta, Ralph, Einstein, Seti, and Seti Beta. I currently have a CPDN seasonal and a CPDN sulphur WU, a ready-to-report Rosetta and the suspended Rosetta, a Seti Beta and an Einstein. I've set everything to "no new tasks" for now.

Running BOINC CC 5.3.28.

I keep a 0.1 day cache, so I don't have a lot of WU's around. I would not be happy to abort the CPDN WU's. I don't mind suspending everything for the time it takes to zip, etc., or aborting the other WU's.

[edit]
Wait a minute. Did I misunderstand you -- you mean abort the other WU's *after making a copy*, then send you that copy, then go about my merry way... sure, I'll do that.

Please advise on where and how to send...
[/edit]
31) Message boards : Number crunching : Help us solve the 1% bug! (Message 12740)
Posted 28 Mar 2006 by genes
Post:
OK, I ran the standalone test on the WU in my previous post, HB_BARCODE_30_2ci2I_351_30593_0.

As expected, in standalone mode it blew right past the spot it stopped at under BOINC. Interestingly, it already had the argument -constant_seed -jran xxxx on the "command executed" line. I killed the standalone process which had gotten much farther along by then, restarted BOINC, and unsuspended the WU. It started from the beginning, and hung at exactly the same spot.

It is now sitting there suspended. I await any suggestions as to what to do with it. (I know, stick it where the sun don't shine...)

This machine is also running Ralph, but hasn't had any problems there as yet.
32) Message boards : Number crunching : Help us solve the 1% bug! (Message 12734)
Posted 28 Mar 2006 by genes
Post:
I've got one now, HB_BARCODE_30_2ci2I_351_30593_0.
This result:
http://boinc.bakerlab.org/rosetta/result.php?resultid=15136445

It is currently suspended so other units can run. I just found it when I came home, stuck for ~5 hours. I stopped/restarted BOINC, it ran for about a minute, then got stuck at step 21292, Acc. RMSD 9.045, Acc. Energy 0.6126684. I stopped/restarted BOINC again (2 more times total) and it keeps getting stuck in the exact same spot at 1 minute, 14 seconds.

I'll try running it outside of BOINC later tonight when I get a chance. First stuck WU on this machine (but it IS a new machine).

Machine: Dual Xeon 3.06GHz, 2GB ram, WinXP SP2. HT is on, running 4 BOINC processes, leave in memory = YES (not that it matters for this WU).
33) Message boards : Number crunching : Issues with 4.82 (Message 11287)
Posted 24 Feb 2006 by genes
Post:
Just had another WU fail on this machine:

http://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=13228

but this time it was with a Ralph WU. On Ralph the link to the machine is this:

http://ralph.bakerlab.org/show_host_detail.php?hostid=953

The failure was exactly the same as with the three 4.82 WU's that failed earlier. BTW the last one I got completed successfully.

A little bit of log around the error:

2/23/2006 8:09:41 PM|ralph@home|Resuming result BARCODE_30_1a68__215_17_0 using rosetta_beta version 486
2/23/2006 8:09:41 PM|rosetta@home|Pausing result PRODUCTION_ABINITIO_DBFLAGS_1tit__307_607_1 (left in memory)
2/23/2006 8:40:14 PM|ralph@home|Unrecoverable error for result BARCODE_30_1a68__215_17_0 ( - exit code -1073741811 (0xc000000d))
2/23/2006 8:40:17 PM||request_reschedule_cpus: process exited
2/23/2006 8:40:17 PM|ralph@home|Computation for result BARCODE_30_1a68__215_17_0 finished
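
For anyone wondering how the negative exit code and the hex value in that log line relate: the client prints the same 32-bit value twice, once as a signed integer and once in hex. A small Python sketch of the conversion is below. The helper names are made up, and the lookup table only covers the two codes I know well, which are standard Windows NTSTATUS values.

```python
# Standard Windows NTSTATUS names for two common crash codes.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}

def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF

def describe(exit_code):
    """Format an exit code as hex plus a name, if the name is in the table."""
    status = to_ntstatus(exit_code)
    name = KNOWN_STATUS.get(status, "unknown")
    return f"0x{status:08X} ({name})"

print(describe(-1073741811))  # 0xC000000D (STATUS_INVALID_PARAMETER)
```

So this is not the access violation (0xC0000005) that the graphics crashes usually produce; it is a different status code.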


The machine is a dual P3 1GHz with 1GB of ram, running WinXP SP2. Running with "Leave in Memory" = YES, several other BOINC projects, ... did I forget anything? Oh yes, BOINC CC 5.2.15.

Failed Rosetta WU's:
http://boinc.bakerlab.org/rosetta/result.php?resultid=11823719
http://boinc.bakerlab.org/rosetta/result.php?resultid=11805479
http://boinc.bakerlab.org/rosetta/result.php?resultid=11796212
Failed Ralph WU:
http://ralph.bakerlab.org/result.php?resultid=6153

34) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 11101)
Posted 21 Feb 2006 by genes
Post:
Yet another 4.82 crash. Same as the others.

http://boinc.bakerlab.org/rosetta/result.php?resultid=11823719

I'm setting Rosetta to No New Work on that machine. It didn't have any problems with 4.81.

Any tests I could do here? Seems 4.82 fails pretty reliably (100%) on this machine.

Currently also running CPDN (Sulphur), Einstein, Seti, Seti Beta, and an occasional Pirates.



Can you attach this host to the Ralph project if you haven't already?


Will do.
[edit]
OK, it's this one:
http://ralph.bakerlab.org/show_host_detail.php?hostid=953
[/edit]
35) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 11059)
Posted 21 Feb 2006 by genes
Post:
Yet another 4.82 crash. Same as the others.

http://boinc.bakerlab.org/rosetta/result.php?resultid=11823719

I'm setting Rosetta to No New Work on that machine. It didn't have any problems with 4.81.

Any tests I could do here? Seems 4.82 fails pretty reliably (100%) on this machine.

Currently also running CPDN (Sulphur), Einstein, Seti, Seti Beta, and an occasional Pirates.
36) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 11006)
Posted 20 Feb 2006 by genes
Post:
Got another 4.82 crash. This one brought up a Microsoft Dialog "Please report this error..."

Looks like a carbon copy of the previous one. Same machine. Same settings.

http://boinc.bakerlab.org/rosetta/result.php?resultid=11805479

Here's the goings-on around the time of the error:
2/20/2006 10:02:25 AM|rosetta@home|Resuming result HBLR_1.0_2reb_314_890_1 using rosetta version 482
2/20/2006 10:02:25 AM|SETI@home|Pausing result 05ap00aa.5327.11904.572166.1.187_1 (left in memory)
2/20/2006 10:08:18 AM|Pirates@Home|Sending scheduler request to http://pirates.spy-hill.net/cgi-bin/scheduler
2/20/2006 10:08:18 AM|Pirates@Home|Reason: To fetch work
2/20/2006 10:08:18 AM|Pirates@Home|Requesting 17280 seconds of new work
2/20/2006 10:08:23 AM|Pirates@Home|Scheduler request to http://pirates.spy-hill.net/cgi-bin/scheduler succeeded
2/20/2006 10:08:23 AM|Pirates@Home|Message from server: No work sent
2/20/2006 10:08:23 AM|Pirates@Home|Message from server: (there was work for other platforms)
2/20/2006 10:08:23 AM|Pirates@Home|No work from project
2/20/2006 10:33:57 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_2reb_314_890_1 ( - exit code -1073741811 (0xc000000d))
2/20/2006 10:34:00 AM||request_reschedule_cpus: process exited
2/20/2006 10:34:00 AM|rosetta@home|Computation for result HBLR_1.0_2reb_314_890_1 finished
2/20/2006 10:34:00 AM|Einstein@Home|Resuming result r1_0992.0__526_S4R2a_2 using albert version 437
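
If you have a long message log and only want the failures, something like this Python sketch would pick them out. It assumes every line has the timestamp|project|message layout seen above and that errors are worded exactly like the one in this log; find_errors and ERROR_RE are made-up names.

```python
import re

# Wording taken from the error line in this post (an assumption for other versions).
ERROR_RE = re.compile(
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\( - exit code (?P<code>-?\d+) \((?P<hex>0x[0-9a-fA-F]+)\)\)"
)

def find_errors(lines):
    """Return (timestamp, project, result name, exit code) for each error line."""
    errors = []
    for line in lines:
        parts = line.split("|", 2)  # timestamp | project | message
        if len(parts) != 3:
            continue
        timestamp, project, message = parts
        m = ERROR_RE.search(message)
        if m:
            errors.append((timestamp, project, m.group("result"), int(m.group("code"))))
    return errors

log = [
    "2/20/2006 10:02:25 AM|rosetta@home|Resuming result HBLR_1.0_2reb_314_890_1 using rosetta version 482",
    "2/20/2006 10:33:57 AM|rosetta@home|Unrecoverable error for result HBLR_1.0_2reb_314_890_1 ( - exit code -1073741811 (0xc000000d))",
    "2/20/2006 10:34:00 AM||request_reschedule_cpus: process exited",
]
errors = find_errors(log)
print(errors)
```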



37) Message boards : Number crunching : Miscellaneous Work Unit Errors (Message 10984)
Posted 20 Feb 2006 by genes
Post:
I've had a 4.82 WU crash today:

2/19/2006 7:36:41 PM|rosetta@home|Resuming result HBLR_1.0_1di2_314_135_1 using rosetta version 482
2/19/2006 8:01:05 PM|rosetta@home|Unrecoverable error for result HBLR_1.0_1di2_314_135_1 ( - exit code -1073741811 (0xc000000d))
2/19/2006 8:01:07 PM||request_reschedule_cpus: process exited
2/19/2006 8:01:07 PM|rosetta@home|Computation for result HBLR_1.0_1di2_314_135_1 finished


This WU:

http://boinc.bakerlab.org/rosetta/result.php?resultid=11796212

Nothing unusual was going on, "Leave in Memory" is set to YES. (It wasn't being swapped anyway.)
38) Message boards : Number crunching : Help us solve the 1% bug! (Message 9041)
Posted 14 Jan 2006 by genes
Post:

Thanks. this again suggests it is not an internal rosetta problem. we are going to see if the BOINC developers have any ideas on what might be going on.



Might there be the possibility that it has to do with stopping/restarting BOINC or with BOINC pausing/resuming the app? I do have "leave in memory" set -- is that still necessary?
39) Message boards : Number crunching : Help us solve the 1% bug! (Message 9018)
Posted 14 Jan 2006 by genes
Post:
Well assuming my standard routine is 'routine behavior' (and I take that assumption as fact) I can no longer say that every WU is encountering the 1 percent issue.


You'll know that you've hit the "1% issue" if you look at the graphics and the line "Step: xxxx" is not increasing. (a good reason to have graphics to look at, BTW)
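
The same check can be written down as a rule: take a few readings of the step counter some minutes apart, and if the last several are identical, the WU is probably stuck. Here is a minimal Python sketch of that idea. How you collect the readings is not covered (I just read them off the graphics), and the function name and the choice of three readings are my own assumptions.

```python
def is_stalled(step_samples, min_samples=3):
    """Return True if the last min_samples step readings are all the same.

    step_samples: step counter values noted at regular intervals, oldest first.
    """
    if len(step_samples) < min_samples:
        return False  # not enough readings to judge
    recent = step_samples[-min_samples:]
    return all(step == recent[0] for step in recent)

print(is_stalled([100, 5000, 5000, 5000]))  # True
print(is_stalled([100, 2000, 3500, 5000]))  # False
```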
40) Message boards : Number crunching : Help us solve the 1% bug! (Message 9017)
Posted 14 Jan 2006 by genes
Post:
OK, I've got one: stuck at 1%, 20+ hours of CPU on a P3 1GHz dual, running WinXP SP2, BOINC 5.2.15 client and 4 other BOINC projects (S@H, S@H Enhanced, E@H, and CPDN). I've suspended the WU, stopped BOINC, and I'll run the tests.
----


I neglected to mention that I have "Leave applications in memory" set to "yes". In the early days of Rosetta, that was the only way to get it to work at all on a multi-processor setup. I also upped the memory from 768MB to 1GB on my machines, and they typically run at 60% or less memory usage now with 5 projects.





©2024 University of Washington
https://www.bakerlab.org