Minirosetta v1.32 bug thread

Message boards : Number crunching : Minirosetta v1.32 bug thread

mitrichr
Joined: 23 May 07
Posts: 44
Credit: 1,005,660
RAC: 0
Message 55382 - Posted: 29 Aug 2008, 13:40:27 UTC - in response to Message 55226.  

The graphic is freezing, meaning, I assume, that the WU is a dead fish. {...}


Not necessarily. Try this workaround.


David-

Yes, I know; that was what I had been doing, and I did not want to keep doing it.

>>RSM

http://sciencespringe.wordpress.com
http://facebook.com/sciencesprings


(_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 55402 - Posted: 30 Aug 2008, 21:30:06 UTC

too many exit(0)s
https://boinc.bakerlab.org/rosetta/result.php?resultid=188137329
https://boinc.bakerlab.org/rosetta/result.php?resultid=187374554
https://boinc.bakerlab.org/rosetta/result.php?resultid=187233250
Watchdog shutting down...
https://boinc.bakerlab.org/rosetta/result.php?resultid=187377874
- exit code -1073741819 (0xc0000005)
https://boinc.bakerlab.org/rosetta/result.php?resultid=185996255
mitrichr
Joined: 23 May 07
Posts: 44
Credit: 1,005,660
RAC: 0
Message 55404 - Posted: 30 Aug 2008, 23:07:49 UTC
Last modified: 30 Aug 2008, 23:09:59 UTC

So, here is the deal:

People are detaching from Rosetta, even though it may be the single most important project on BOINC. We have no choice but to endure these problems or detach, because mini runs alongside everything else that is going on.

Separate this buggy application from the rest of Rosetta and run it on Ralph@home. Get some volunteers to run it there if you cannot find the problems in the lab, and let us get back to crunching for Rosetta.

If in fact you get it to run properly, put the news in the RSS feed, so we know it is safe to go back in the water.

>>RSM
http://sciencespringe.wordpress.com
http://facebook.com/sciencesprings


jasonwishart
Joined: 7 Nov 05
Posts: 14
Credit: 936,419
RAC: 0
Message 55406 - Posted: 31 Aug 2008, 3:51:58 UTC

It seems that since minirosetta 1.32 I have not had a single completed job on my AMD 3000+ machine, yet it is fine on my other two machines. All jobs now end, either slowly or quickly, with an "Output file absent" error; see the output messages below:

8/31/2008 12:58:36 PM|rosetta@home|Computation for task abinitio_homfrag_71_A_1l6pA_4443_21507_0 finished
8/31/2008 12:58:36 PM|rosetta@home|Output file abinitio_homfrag_71_A_1l6pA_4443_21507_0_0 for task abinitio_homfrag_71_A_1l6pA_4443_21507_0 absent
8/31/2008 1:00:37 PM|rosetta@home|Computation for task abinitio_homfrag_71_A_1zd0A_4443_21518_0 finished
8/31/2008 1:00:37 PM|rosetta@home|Output file abinitio_homfrag_71_A_1zd0A_4443_21518_0_0 for task abinitio_homfrag_71_A_1zd0A_4443_21518_0 absent

This has been going on for a few weeks. I thought it was an irregularity and put up with it, but now it is really getting to me. Is there any way to complete these jobs (or future jobs) on this machine, or might I just as well abandon R@H on it?
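For anyone who wants to count how often this happens on a given host, the messages log can be scanned for the "Output file ... absent" lines. A minimal Python sketch, assuming the log lines look exactly like the ones quoted above (`date time|project|message`); the function name is hypothetical and not part of BOINC:

```python
import re

# Pattern based only on the "Output file ... for task ... absent" lines
# quoted in this post; other BOINC versions may word the message differently.
ABSENT_RE = re.compile(r"Output file (\S+) for task (\S+) absent")


def tasks_with_absent_output(log_lines):
    """Return the task names whose output file was reported absent."""
    tasks = []
    for line in log_lines:
        # Each line is assumed to look like "date time|project|message".
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue
        match = ABSENT_RE.search(parts[2])
        if match:
            tasks.append(match.group(2))
    return tasks


log = [
    "8/31/2008 12:58:36 PM|rosetta@home|Computation for task abinitio_homfrag_71_A_1l6pA_4443_21507_0 finished",
    "8/31/2008 12:58:36 PM|rosetta@home|Output file abinitio_homfrag_71_A_1l6pA_4443_21507_0_0 for task abinitio_homfrag_71_A_1l6pA_4443_21507_0 absent",
    "8/31/2008 1:00:37 PM|rosetta@home|Computation for task abinitio_homfrag_71_A_1zd0A_4443_21518_0 finished",
    "8/31/2008 1:00:37 PM|rosetta@home|Output file abinitio_homfrag_71_A_1zd0A_4443_21518_0_0 for task abinitio_homfrag_71_A_1zd0A_4443_21518_0 absent",
]

print(tasks_with_absent_output(log))
```

Comparing that list against the "Computation for task ... finished" lines shows whether every finished task on the machine is affected, or only some.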
Will work for bandwidth!!
Terrasapiens
Joined: 25 Apr 08
Posts: 15
Credit: 368,919
RAC: 0
Message 55407 - Posted: 31 Aug 2008, 5:07:49 UTC - in response to Message 55301.  
Last modified: 31 Aug 2008, 5:10:48 UTC

Terrasapiens, I see you are running BOINC 6.2.18. Do you have any history running mini on older versions of BOINC? Are you using BOINC as your screensaver?


I think the mini WUs ran fine about two versions prior to the current one but I'm not sure exactly. I do know that the mini rosetta WUs didn't start failing until v1.28. I did turn off the BOINC screen saver a couple of weeks ago but that made no difference in terms of failed WUs.
Speedy
Joined: 25 Sep 05
Posts: 163
Credit: 800,690
RAC: 20
Message 55408 - Posted: 31 Aug 2008, 5:39:51 UTC
Last modified: 31 Aug 2008, 5:43:07 UTC

For people having trouble, this may help you out. Hope it helps.
Cheers
Speedy
Have a crunching good day!!
(_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 55409 - Posted: 31 Aug 2008, 6:32:45 UTC

too many exit(0)s
https://boinc.bakerlab.org/rosetta/result.php?resultid=188137329
Sid Celery
Joined: 11 Feb 08
Posts: 1982
Credit: 38,463,172
RAC: 15,101
Message 55436 - Posted: 31 Aug 2008, 23:59:44 UTC - in response to Message 55402.  

What you don't show is my "Can't acquire lockfile - exiting" error. This seems to be the clue that screws up my WUs altogether then.

I also get a lot of those ss2 messages, but on an AMD processor using Vista SP1 and BOINC 5.10.45. I haven't seen any of the lockfile messages. I wonder if some of the current workunits are missing the ss2 file since they don't need filtering, but 1.32 doesn't have a way built in to just turn off any attempts to use this file.

Thanks Robert. It seems you run an AMD Dual Core with Vista 32-bit.

too many exit(0)s
https://boinc.bakerlab.org/rosetta/result.php?resultid=188137329
https://boinc.bakerlab.org/rosetta/result.php?resultid=187374554
https://boinc.bakerlab.org/rosetta/result.php?resultid=187233250
Watchdog shutting down...
https://boinc.bakerlab.org/rosetta/result.php?resultid=187377874
- exit code -1073741819 (0xc0000005)
https://boinc.bakerlab.org/rosetta/result.php?resultid=185996255

Thanks (_KoDAk_). One of these is a 5.98 WU and one I can't see, but the other three (and the later one) all show the same "Can't acquire lockfile - exiting" error that I get.

I note you run an Intel Quad core with the 64-bit version of Windows Server Enterprise.

Based on poor evidence of just one example I'd be inclined to look at an incompatibility between MiniRosetta 1.32 and any Windows 64-bit OS as a matter of urgency.

Speedy: I appreciate the suggestion, but a bug thread on v1.32 isn't served by avoiding running it altogether (though it may well be better sorted out on Ralph, I agree).

It's made worse at the moment by the fact nearly all WUs are 1.32s and hardly any 5.98s. Though, for what little it's worth, my failure rate has reduced from 79/151 (52%) to 50/115 (43%) since my last posts.
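The percentages above follow directly from the failed/total counts. As a quick check (the helper name is hypothetical, just for illustration):

```python
def failure_rate(failed, total):
    """Percentage of tasks that failed, rounded to the nearest whole percent."""
    return round(100 * failed / total)


print(failure_rate(79, 151))  # 52
print(failure_rate(50, 115))  # 43
```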
Heidi1
Joined: 11 Aug 07
Posts: 49
Credit: 1,786,248
RAC: 0
Message 55437 - Posted: 1 Sep 2008, 1:02:46 UTC

I've been having the same screensaver/blackout problem that others have had. I don't know which WU it is, but it definitely is a Mini 1.32, as that's all my machine has right now (it's one of these two: abinitio_homfrag_71_A_2uzrA_4443_23491_0 ; abinitio_homfrag_71_A_2ib0A_4443_23507_0). I used to have BOINC be the Windows screensaver, then I changed it to an actual Windows one when the Mini started doing the black screen thing, then I later changed the screensaver to None when Mini was continuing to produce black screens. Sometimes using Ctrl+Alt+Del got me to where the taskbar would show, sometimes not; another trick I found out about is pressing the key for the Start button (my keyboard has it between the left Ctrl and Alt keys) would also allow me to see the taskbar. One of the open programs on the taskbar at that time is MiniRosetta, and I don't know if that is the screensaver that is trying to kick in. Sometimes the Mini screensaver kicks in way before the Windows screensaver is supposed to start (once the Mini one started after the computer was idle for a whopping 30 seconds!).

I have not paid attention to what the WU is doing, whether it's still crunching during the black screen or has frozen up.
Heidi1
Joined: 11 Aug 07
Posts: 49
Credit: 1,786,248
RAC: 0
Message 55438 - Posted: 1 Sep 2008, 5:32:21 UTC

Update: It's also blacking out on this WU: abinitio_homfrag_71_A_2hh6A_4443_24307_0. It seems to be doing it at some of the checkpoints, but not all of them. What's more, the other WU that is running does show graphics okay, but not this particular WU (it's just a black screen if it's able to show a separate window at all). And the main black screen issue doesn't care if you're doing anything else: even if you're watching a movie or moving your mouse or typing, it will supersede anything else on the screen. The Windows button has worked each time I've pressed it, a better success rate than Ctrl+Alt+Del. One additional thing of note: when exiting from the black screen and I see the taskbar, the icon for the Mini WU that's on the taskbar is that of "minirosetta_graphics_1.30_windows_intelx86.exe", not a BOINC or generic icon (it's a miniature version of Rosetta's logo).

I hope this helps you fix this bug. It's sure bugging me! :)
Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,703,886
RAC: 2,191
Message 55443 - Posted: 1 Sep 2008, 11:29:54 UTC

To Mod,

Why has no one from the team or for that matter anyone from RAH in general said anything about these 'needs psipred_ss2 to run filters' errors?

This is what annoys a lot of people and causes them to leave: no news, no replies to posts about a common issue, no communication at all.

Could you give us an update on this issue, or see if the team has something they can say in general about it?

Thanks
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 55444 - Posted: 1 Sep 2008, 13:51:57 UTC

Update: It's also blacking out on this WU: abinitio_homfrag

Snap! With a variation on a theme: it's blacking out on abinitio_homfrag_71_A_2flsA_4443_19927_0, and the graphics window has to be shut via Task Manager.
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 55449 - Posted: 1 Sep 2008, 16:01:03 UTC

The graphics have now come back online. It would appear that they were being upset by a USB drive that I had just installed. The old hard drive I had installed was still set up as a slave drive, which was causing the problems.
mitrichr
Joined: 23 May 07
Posts: 44
Credit: 1,005,660
RAC: 0
Message 55461 - Posted: 1 Sep 2008, 18:06:51 UTC

I just re-attached a PIII and a Core 2 Duo, two machines which I had previously used to crunch for Rosetta. They both got WUs within minutes. After the WU started on each machine, I let it run as long as it stayed active. Within about seven minutes, each machine froze. So, I detached both again.

>>RSM
http://sciencespringe.wordpress.com
http://facebook.com/sciencesprings


Keith T.
Joined: 1 Mar 07
Posts: 58
Credit: 34,135
RAC: 0
Message 55477 - Posted: 2 Sep 2008, 11:50:12 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=187758720 has been running on my AMD Athlon 2200+ since 28/08/08 08:45 (UTC+1).

So far it has run for 11:33:44 and is currently "waiting to run".

boinc_checkpoint_count.txt shows 97
boinc_init_count shows 1 87

The task appears to still be on the first decoy or model.

I have changed my runtime prefs while this task has been running, to try to get it to finish and grant some credit.

stderr.txt
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 21600
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 7200
# cpu_run_time_pref: 7200
# cpu_run_time_pref: 7200
# cpu_run_time_pref: 7200
# cpu_run_time_pref: 7200
# cpu_run_time_pref: 7200
# cpu_run_time_pref: 3600
# cpu_run_time_pref: 3600

I think it may have got stuck in a loop and is repeating the first model, as when I look at the graphics, the task does seem to be making progress.

How far past the required runtime does a task need to go before it gets stopped by the watchdog?
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 55482 - Posted: 2 Sep 2008, 14:30:11 UTC - in response to Message 55477.  

How far past the required runtime does a task need to go before it gets stopped by the watchdog?

The messages in stderr indicate that this task has been removed from memory and resumed many times. If 5 such restarts occur with no progress being made (i.e. a checkpoint saved) the task will be ended.

Otherwise, if the task uses more than 4 times the runtime preference, the watchdog will end it. Since it is waiting to run, it probably has to begin running again for the watchdog to get any time. The watchdog checks on it every 15 minutes.

So I would have expected it to end on the first restart after your runtime preference was lowered to 2 hours. Since it did not, I would abort that one. 11.5 hours without completing a single model, and not responding to the watchdog, sounds like something may be wrong there.
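The two rules described above can be sketched as a simple check. This only illustrates the thresholds as stated in this post (5 restarts without a checkpoint, or more than 4x the runtime preference); the class, function and constant names are hypothetical and not taken from the minirosetta source. It also ignores the 15-minute polling: the real watchdog only gets to apply the check while the task is actually running.

```python
from dataclasses import dataclass

# Thresholds as described in this thread; names are hypothetical.
MAX_RESTARTS_WITHOUT_CHECKPOINT = 5  # restarts with no new checkpoint saved
RUNTIME_MULTIPLIER = 4               # end task past 4x the runtime preference


@dataclass
class TaskState:
    cpu_time_s: int                 # CPU time used so far, in seconds
    runtime_pref_s: int             # current target CPU run time preference
    restarts_since_checkpoint: int  # restarts without a checkpoint being saved


def hms_to_seconds(hms: str) -> int:
    """Convert an "H:MM:SS" CPU time, as shown in BOINC Manager, to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def should_end(task: TaskState) -> tuple[bool, str]:
    """Apply the two rules quoted in the thread; return (end_task, reason)."""
    if task.restarts_since_checkpoint >= MAX_RESTARTS_WITHOUT_CHECKPOINT:
        return True, "restarted too often without saving a checkpoint"
    if task.cpu_time_s > RUNTIME_MULTIPLIER * task.runtime_pref_s:
        return True, "used more than 4x the runtime preference"
    return False, "keep running"


# Keith's task: 11:33:44 of CPU time, checked against each preference
# value that appears in its stderr.txt.
used = hms_to_seconds("11:33:44")  # 41624 seconds
for pref in (21600, 28800, 7200, 3600):
    print(pref, should_end(TaskState(used, pref, 0)))
```

At 21600 or 28800 s the 4x limit (86,400 or 115,200 s) is not reached, but at 7200 s the limit is 28,800 s, which 41,624 s clearly exceeds. That matches the expectation that the task should have ended once the preference was lowered to 2 hours, provided the watchdog got a chance to run.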
Rosetta Moderator: Mod.Sense
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 55483 - Posted: 2 Sep 2008, 14:42:18 UTC - in response to Message 55443.  

To Mod,

Why has no one from the team or for that matter anyone from RAH in general said anything about these 'needs psipred_ss2 to run filters' errors?

This is what annoys a lot of people and causes them to leave: no news, no replies to posts about a common issue, no communication at all.

Could you give us an update on this issue, or see if the team has something they can say in general about it?

Thanks


Greg, I can't post information I do not have. The message does not seem to adversely affect the running of the tasks. I'm sure it will be addressed in a future release.

Rest assured that the Project Team does review and react to the posts in these "problems with..." threads.
Rosetta Moderator: Mod.Sense
JordanWeber
Joined: 24 Apr 08
Posts: 4
Credit: 716,009
RAC: 0
Message 55484 - Posted: 2 Sep 2008, 15:02:33 UTC

Still can't run mini on this one computer, but at least I now get computation errors on all the tasks, with a message. Getting warmer :)
189123860

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x7C91152A write attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...


</stderr_txt>
]]>
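The exit code in these reports appears twice, as -1073741819 and as 0xc0000005. They are the same 32-bit value, once read as a signed integer and once as unsigned hex, and 0xC0000005 is the standard Windows access-violation status. That is consistent with the "write attempt to address 0x00000000" line, which looks like a write through a null pointer. A minimal Python sketch of the conversion; the helper names are hypothetical and the lookup table holds only this one code:

```python
# Windows status codes are 32-bit values; a negative exit code is the same
# bit pattern read as a signed integer.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def exit_code_to_hex(code: int) -> str:
    """Reinterpret a (possibly negative) exit code as unsigned 32-bit hex."""
    return f"0x{code & 0xFFFFFFFF:08x}"


def describe(code: int) -> str:
    """Show the signed code, its hex form, and a name if we know it."""
    name = KNOWN_STATUS.get(code & 0xFFFFFFFF, "unknown status")
    return f"{code} = {exit_code_to_hex(code)} ({name})"


print(describe(-1073741819))
# -1073741819 = 0xc0000005 (STATUS_ACCESS_VIOLATION)
```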
Phil Hirons, Jr.
Joined: 5 Jan 06
Posts: 1
Credit: 44,233
RAC: 0
Message 55504 - Posted: 3 Sep 2008, 17:50:52 UTC

I've had 3 Mini WUs that go along fine until about 10 minutes are estimated to be left. They then take hours of CPU time to complete (>15).



©2024 University of Washington
https://www.bakerlab.org