Minirosetta v1.47 bug thread.

yose-ue

Joined: 30 Dec 05
Posts: 3
Credit: 228,710
RAC: 0
Message 58654 - Posted: 7 Jan 2009, 21:17:04 UTC

This job (wuid=198707114) appears to have finished twice, and after using 71456 CPU seconds in total I was granted only 2 points.

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
======================================================
DONE :: 1 starting structures 47173.5 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
# cpu_run_time_pref: 28800
======================================================
DONE :: 1 starting structures 71456 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish

</stderr_txt>
]]>


Validate state Valid
Claimed credit 156.119054549462
Granted credit 2
application version 1.47
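
For anyone comparing runs like this one, here is a minimal Python sketch for pulling the `DONE ::` summaries out of a saved stderr text. The function name `parse_done_lines` and the regular expression are illustrative assumptions based only on the output pasted above; they are not part of BOINC or minirosetta.

```python
import re

# Pattern modelled on the "DONE ::" lines in the stderr output above.
# This is an assumption about the format, not a documented one.
DONE_RE = re.compile(r"DONE ::\s+(\d+) starting structures\s+([\d.]+) cpu seconds")

def parse_done_lines(stderr_text):
    """Return (starting_structures, cpu_seconds) for each DONE line found."""
    return [(int(n), float(secs)) for n, secs in DONE_RE.findall(stderr_text)]

sample = """\
DONE :: 1 starting structures 47173.5 cpu seconds
This process generated 1 decoys from 1 attempts
BOINC :: Watchdog shutting down...
DONE :: 1 starting structures 71456 cpu seconds
"""

runs = parse_done_lines(sample)
print(runs)
# More than one DONE record suggests the task reported finishing more than once.
print(len(runs) > 1)
```

More than one record for a single task is the "finished twice" pattern described in this post.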

ID: 58654
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58655 - Posted: 7 Jan 2009, 21:21:30 UTC

DK has now corrected the problem where results are always granted 2 credits per model. See his post.
Rosetta Moderator: Mod.Sense
ID: 58655
Greg_BE
Joined: 30 May 06
Posts: 5661
Credit: 5,698,483
RAC: 2,016
Message 58673 - Posted: 8 Jan 2009, 11:57:56 UTC

some bizarre behavior for these tasks

https://boinc.bakerlab.org/rosetta/result.php?resultid=218440422
lr5_score12_rlbd_2o7k_IGNORE_THE_REST_DECOY_5559_1165_0

Exit Status -1073741819 (0xc0000005)
CPU time 8809.906
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0049162C read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...

Validate state Invalid
Claimed credit 58.9690388361006
Granted credit 58.9690388361006

But according to the tasks for user page the granted credit never happened.

---------

https://boinc.bakerlab.org/rosetta/result.php?resultid=218547095
lr5_score12_rlbd_1ubi_IGNORE_THE_REST_DECOY_5559_1100_1

Exit status -1073741819 (0xc0000005)
CPU time 1089.156
stderr out

<core_client_version>6.4.5</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0049162C read attempt to address 0x00000000

Engaging BOINC Windows Runtime Debugger...


Claimed credit 7.29025740598957
Granted credit 7.29025740598957

but again, no credit in the tasks for user page
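
The two numbers in "Exit status -1073741819 (0xc0000005)" are the same 32-bit value: the status is printed once as a signed integer and once in unsigned hex. 0xc0000005 is the Windows access-violation status, which matches the "Access Violation" line in the stderr. A quick Python check, where the helper name `to_unsigned32` is just for illustration:

```python
def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF

exit_status = -1073741819
print(hex(to_unsigned32(exit_status)))  # 0xc0000005
```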
ID: 58673
slre
Joined: 6 Dec 08
Posts: 2
Credit: 1,908,468
RAC: 0
Message 58738 - Posted: 11 Jan 2009, 21:38:08 UTC

I'm seeing the diminishing returns problem regularly. The progress for tasks is good for the first 70-80%, then progress gets slower and slower. Yesterday I aborted a task that had taken 30 hours to go from 97 to 99.5% after taking under 12 hours to get to 97%.
The following task is going the same way:
abinitio_norelax_homfrag_129_B_1o7uA_SAVE_ALL_OUT_4626_11775_0
After 3 hours it was reporting 70% complete; it is now at 98.8% after 13.5 hours.

My main complaint is not that the tasks can overrun (though that is clearly a problem, and it has been reported previously) but that I thought the target CPU time included a threshold (3 * target CPU time?) that terminated an overrunning task. Minirosetta is clearly ignoring this threshold, if it is set, as my target time is 4 hours.

Is minirosetta supposed to act on target cpu time? If it is, why isn't it?

ID: 58738
robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,841,472
RAC: 1,593
Message 58739 - Posted: 11 Jan 2009, 23:50:51 UTC - in response to Message 58738.  
Last modified: 12 Jan 2009, 0:23:24 UTC

Minirosetta is supposed to act on the target CPU time, but it doesn't check continuously for an overrun. If you have BOINC set to give each workunit a two-hour timeslice before deciding which workunit gets the next timeslice, as I do, it only checks for an overrun every two hours.

In other words, your actual limit should be (3*target cpu time) + 1 timeslice at present.

Also, the diminishing returns you see are at least partly an artifact. Minirosetta doesn't have a good way of measuring what percentage of the work has been done, so it estimates the percentage done from the share of the target CPU time it has already used. Once it gets within about 10 minutes of the target CPU time, the reported percentage almost stops changing until the task actually finishes.
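
To put numbers on that, here is a rough Python sketch of the two estimates described in this post. Both function names, the 600-second tail and the exponential shape of the slow tail are illustrative assumptions; none of it is taken from the minirosetta source, and the timeslice term only applies if the overrun check really is tied to task switching.

```python
import math

def overrun_limit_hours(target_hours, timeslice_hours, factor=3):
    """Worst-case CPU hours before termination, per the estimate in this post."""
    return factor * target_hours + timeslice_hours

def estimated_progress(cpu_seconds, target_seconds, tail_seconds=600.0):
    """Fraction done as it might be reported: linear at first, then a slow creep."""
    knee_time = max(target_seconds - tail_seconds, 0.0)
    if cpu_seconds <= knee_time:
        # Linear phase: progress is just the share of target CPU time used.
        return cpu_seconds / target_seconds
    knee = knee_time / target_seconds
    overshoot = cpu_seconds - knee_time
    # Assumed shape: approach 100% exponentially without ever reaching it.
    return knee + (1.0 - knee) * (1.0 - math.exp(-overshoot / target_seconds))

print(overrun_limit_hours(4, 2))               # 14
print(estimated_progress(3 * 3600, 4 * 3600))  # 0.75
```

With a 4-hour target, 3 hours of CPU time gives 75% in this model, which is in the same range as the 70% reported after 3 hours. Past the target the reported figure keeps creeping upward even though it says nothing about how much work remains.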
ID: 58739
Greg_BE
Joined: 30 May 06
Posts: 5661
Credit: 5,698,483
RAC: 2,016
Message 58740 - Posted: 11 Jan 2009, 23:52:37 UTC - in response to Message 58738.  

Be sure to post links to the tasks that ran over in the long-running models thread. Apparently the team reads that thread to find out what is going on and to make corrections in the next batch of similar tasks.
ID: 58740
Greg_BE
Joined: 30 May 06
Posts: 5661
Credit: 5,698,483
RAC: 2,016
Message 58742 - Posted: 12 Jan 2009, 0:53:44 UTC
Last modified: 12 Jan 2009, 0:56:32 UTC

just a heads up:

1/12/2009 1:23:10 AM|rosetta@home|Task abinitio_norelax_homfrag_129_B_1a19A_SAVE_ALL_OUT_4626_9187_0 exited with zero status but no 'finished' file
1/12/2009 1:23:10 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
1/12/2009 1:23:10 AM|rosetta@home|Task abinitio_norelax_homfrag_129_B_4ubpA_SAVE_ALL_OUT_4626_9186_0 exited with zero status but no 'finished' file
1/12/2009 1:23:10 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
1/12/2009 1:23:10 AM|rosetta@home|Restarting task abinitio_norelax_homfrag_129_B_4ubpA_SAVE_ALL_OUT_4626_9186_0 using minirosetta version 147

the 9187 task: https://boinc.bakerlab.org/rosetta/result.php?resultid=219581418
the 9186 task: https://boinc.bakerlab.org/rosetta/result.php?resultid=219581394

Both tasks got credit OK, so I don't know what that message was all about.
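
Since the client says a reset may be needed only "if this happens repeatedly", it can help to count how often each task hits the message. Below is a small Python sketch that does this for message-log lines in the `timestamp|project|message` layout pasted above. The function name `tasks_without_finished_file` and the regular expression are illustrative assumptions, not a BOINC tool.

```python
import re
from collections import Counter

# Message text copied from the log lines above.
NO_FINISH_RE = re.compile(
    r"^Task (\S+) exited with zero status but no 'finished' file$"
)

def tasks_without_finished_file(log_text):
    """Count, per task name, how often the 'no finished file' message appears."""
    counts = Counter()
    for line in log_text.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # not a timestamp|project|message line
        _timestamp, _project, message = parts
        match = NO_FINISH_RE.match(message.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

log = """\
1/12/2009 1:23:10 AM|rosetta@home|Task abinitio_norelax_homfrag_129_B_1a19A_SAVE_ALL_OUT_4626_9187_0 exited with zero status but no 'finished' file
1/12/2009 1:23:10 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
1/12/2009 1:23:10 AM|rosetta@home|Task abinitio_norelax_homfrag_129_B_4ubpA_SAVE_ALL_OUT_4626_9186_0 exited with zero status but no 'finished' file
"""

for task, n in tasks_without_finished_file(log).items():
    print(n, task)
```

A count of 1 per task, as here, fits the "both tasks got credit" outcome; a task that keeps reappearing would be the one worth reporting.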
ID: 58742
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58743 - Posted: 12 Jan 2009, 0:57:09 UTC

A link to slre's task, it ran for over 40 hours! So, yes, clearly the watchdog should have ended it.

Robert, I don't believe the watchdog is dependent upon the BOINC task switching. On the other hand, it's not constantly checking either.
Rosetta Moderator: Mod.Sense
ID: 58743
slre
Joined: 6 Dec 08
Posts: 2
Credit: 1,908,468
RAC: 0
Message 58748 - Posted: 12 Jan 2009, 1:50:36 UTC - in response to Message 58743.  

Thanks for that: a) I didn't know you could link to aborted tasks; b) it made my case better than I did; and c) thanks for confirming there's a genuine problem.

S
ID: 58748
HA-SOFT, s.r.o.
Joined: 27 Jan 07
Posts: 10
Credit: 94,518,643
RAC: 0
Message 58754 - Posted: 12 Jan 2009, 10:54:27 UTC - in response to Message 58144.  

StdErr is empty or contains a message about an access violation (0xc0000005). The application hangs using 3 MB of RAM and does nothing. For example, I have about 10 minirosetta apps in memory that do nothing. When I kill them, there is no stderr or any other file in the slots directory.

Greg_BE and all,

When there is a new minirosetta release, we usually put a Windows debug symbol image in a downloadable location, so when a WU crashes it should provide a backtrace showing how the error was caused (this does not work every time, which makes our debugging very hard). If the error comes from the Minirosetta program itself or from a bad command line or input file setup, stdout or stderr will usually print a message as a hint, as with the hbond NaN problem in previous versions. In that case we should also see a significantly higher error rate across either all WUs or certain batches. If the error is caused by an interaction with the host's hardware or software, we will usually see certain client hosts repeatedly encountering errors or failures. We wish we could tell what went wrong in every scenario, but most of us Rosetta developers are far from being experts in computer software and hardware, and we can only hope to trap errors locally on our test machines to continue debugging.

Thank you all for volunteering to help us with this project, and sorry about any inconvenience or trouble caused on your computers. Please continue to report problems and any possible fixes you have found, as every bit of such information will help us improve R@H stability and resolve hidden bugs sooner or later. Happy holidays to everyone and happy crunching!



ID: 58754
LizzieBarry
Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 58777 - Posted: 13 Jan 2009, 2:15:46 UTC - in response to Message 58499.  

Thanks Paul and everyone else. I'll give these suggestions a try sometime this week (besides Rosie is out of work for the time being).

In lieu of any direct reply, I note that every recent job for sslickerson has completed successfully.

Looks like Boinc 6.4.5 answers at least one person's problems with MiniRosetta WUs. Worth thinking about for anyone with otherwise persistent problems, it seems.
ID: 58777
Mike Tyka
Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 58786 - Posted: 13 Jan 2009, 19:46:55 UTC

Hi all! Hope you all had a fabulous Christmas break. Despite being quiet on the message boards, we've been busy working on making mini more stable. This is the top priority right now and I think we've made some progress. Your comments, feedback and error reports have been invaluable in this process! We've also set up a Windows test-bed here locally, which identified a number of hidden issues that the Linux machines we typically use didn't catch.

The next release 1.48 is about to go on RALPH and I am intending to test it very thoroughly before moving it onto BOINC. Since you guys posting here are already familiar with spotting problems I think it would be awesome if some of you experienced users could move over to RALPH@Home just for a few weeks while we test the new release. You've already seen the problems that used to occur and we need your feedback (and the extra processing power and variety of machines) to make sure we've fixed the issues we think we have fixed. I'll announce again here when the new version is actually out.

Here's a preview of the features that have been put into mini 1.48:

1.48 Release CHANGELOG

Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs.

Bug fix concerning intermittent crashes in _rlbd_ jobs.

Bug fix for a potential instability in handling text files (affects all types of WUs).

Bug fix in checkpointing machinery: states were not being correctly restored, probably contributing to long runtimes (affects cc_* cc2_* cs_* WUs).

Increased the density of checkpoints to lose less time on restarts and address the weird "backjumping" of the time reported in this thread.

Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs)

Added checkpointing to Looprelax.

The Watchdog has been checked and improved, now returning information on the aborted jobs to help us figure out how the remaining long running models come about.
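
As a toy illustration of why denser checkpoints lose less time on restarts, here is a short Python sketch. It is not how minirosetta checkpoints; the functions `run` and `redone_steps`, the JSON state file and the step counts are all made up for the example. The point is only that work done since the last checkpoint is repeated after a restart, so a shorter interval caps that loss lower.

```python
import json
import os
import tempfile

def run(total_steps, checkpoint_every, path, stop_at=None):
    """Resume from the checkpoint file if present; return steps executed now."""
    step = 0
    if os.path.exists(path):
        with open(path) as f:
            step = json.load(f)["step"]
    executed = 0
    while step < total_steps:
        if stop_at is not None and step == stop_at:
            return executed  # simulated kill before the next checkpoint
        step += 1
        executed += 1
        if step % checkpoint_every == 0:
            with open(path, "w") as f:
                json.dump({"step": step}, f)
    return executed

def redone_steps(total_steps, checkpoint_every, stop_at):
    """Steps that had to be repeated because of one interruption."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.json")
        first = run(total_steps, checkpoint_every, path, stop_at=stop_at)
        second = run(total_steps, checkpoint_every, path)
        return first + second - total_steps

print(redone_steps(100, 50, stop_at=97))  # 47
print(redone_steps(100, 5, stop_at=97))   # 2
```

If the restored state were wrong, as in the checkpointing bug listed above, the second run would start from the wrong place, which is one way a task could end up running much longer than expected.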



http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
ID: 58786
LizzieBarry
Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 58790 - Posted: 13 Jan 2009, 22:55:10 UTC - in response to Message 58786.  

Despite being quiet on the message boards we've been busy working on making mini more stable. This is the top priority right now and I think we've made some progress. Your comments and feedback and error reports have been invaluable in this process! We've also set up a windows test-bed here locally which identified a number of hidden issues that the Linux machines we typically use didn't catch.

That's the way I like it - that you're getting busy behind the scenes rather than getting bogged down here. But it's worth a quick progress report once a week to prevent the natives getting too restless.

Good to hear you're set up with a Windows machine to pick up problems on the majority platform and it's earned its corn already. I look forward to the results and a much quieter bug thread. The work on over-running WUs, intermittent crashes and extra check-pointing should make a big difference if they're successful.
ID: 58790
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 58796 - Posted: 14 Jan 2009, 5:14:36 UTC

Well, no work yet in RALPH ...

But I did sign up, for what it is worth ... I will watch and see if I get any work on one system ...
ID: 58796
Mike Tyka
Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 58797 - Posted: 14 Jan 2009, 7:11:41 UTC

Yeah - hold yer horses .. we've not done the update yet. I'll announce it here.

http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
ID: 58797
sslickerson
Joined: 14 Oct 05
Posts: 101
Credit: 578,497
RAC: 0
Message 58812 - Posted: 14 Jan 2009, 17:09:48 UTC - in response to Message 58777.  

Hey there, sorry about not replying. Actually, the Rosetta WUs you are looking at are on my desktop (BOINC 6.4.5), which *typically* does not have issues with minirosetta. I have not allowed work on my laptop (BOINC 6.5.0) since the last batch of errors, so I am uncertain whether the update would have fixed the issue.

I am going to reattach to RALPH for awhile and hopefully if there are errors we can get them fixed over there.

Timothy



ID: 58812
Krata
Joined: 25 Oct 05
Posts: 2
Credit: 17,084
RAC: 0
Message 58829 - Posted: 15 Jan 2009, 7:57:59 UTC - in response to Message 58812.  

Hi,

I still have the same problem with the Minirosetta application (at least the last 4 versions).

Symptoms: the application starts (shown as running in BOINC) but CPU usage is zero. There is no progress, and eventually (e.g. after 2 hours) I am forced to abort it. Some tasks still finish without any problem.

Example of a successful result:
https://boinc.bakerlab.org/rosetta/result.php?resultid=220577616

Examples that needed to be aborted:
https://boinc.bakerlab.org/rosetta/result.php?resultid=220578787
https://boinc.bakerlab.org/rosetta/result.php?resultid=220578788

Because of this (no error, and so no work performed at all) I have switched to a different project. Thanks for any advice...

PS I tried detaching from the project, resetting and so on...

15/01/2009 08:53:44||Starting BOINC client version 6.4.5 for windows_intelx86
15/01/2009 08:53:44||log flags: task, file_xfer, sched_ops
15/01/2009 08:53:44||Libraries: libcurl/7.19.0 OpenSSL/0.9.8i zlib/1.2.3
15/01/2009 08:53:44||Data directory: C:Documents and SettingskratochvilDesktopboincnewCommonAppDataBOINC
15/01/2009 08:53:44||Running under account kratochvil
15/01/2009 08:53:44||Processor: 1 GenuineIntel Intel(R) Pentium(R) M processor 1.73GHz [x86 Family 6 Model 13 Stepping 8]
15/01/2009 08:53:44||Processor features: fpu tsc sse sse2 mmx
15/01/2009 08:53:44||OS: Microsoft Windows XP: Professional x86 Editon, Service Pack 2, (05.01.2600.00)
15/01/2009 08:53:44||Memory: 1.99 GB physical, 4.82 GB virtual
15/01/2009 08:53:44||Disk: 74.53 GB total, 9.79 GB free
15/01/2009 08:53:44||Local time is UTC +1 hours
15/01/2009 08:53:44||Using HTTP proxy CZproxy.de.eurw.ey.net:8080
15/01/2009 08:53:44||No CUDA devices found
15/01/2009 08:53:44||No coprocessors
15/01/2009 08:53:44|rosetta@home|URL: https://boinc.bakerlab.org/rosetta/; Computer ID: 984920; location: home; project prefs: default
15/01/2009 08:53:44|QMC@HOME|URL: http://qah.uni-muenster.de/; Computer ID: 114583; location: (none); project prefs: default
15/01/2009 08:53:44||General prefs: from rosetta@home (last modified 14-Jun-2008 11:07:07)
15/01/2009 08:53:44||Computer location: home
15/01/2009 08:53:44||General prefs: using separate prefs for home
15/01/2009 08:53:44||Reading preferences override file
15/01/2009 08:53:44||Preferences limit memory usage when active to 1426.87MB
15/01/2009 08:53:44||Preferences limit memory usage when idle to 1834.55MB
15/01/2009 08:53:45||Preferences limit disk usage to 2.00GB
15/01/2009 08:53:45|QMC@HOME|Restarting task one_bench12_s22-ecp2-TZmf.13431_0 using Amolqc-preRC1 version 501

ID: 58829
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58836 - Posted: 15 Jan 2009, 14:36:50 UTC
Last modified: 15 Jan 2009, 14:39:29 UTC

Krata, I do not have any specific advice to offer you to resolve the problem you describe. I only see a few tasks from that host, and only one completed normally and only two were aborted. So, perhaps greater numbers will help reveal more symptoms.

Could I ask that you keep an eye on the news portion of the home page and come back when the new Mini version is available? It will correct the majority of problems people have been reporting.

If you are willing, you might also consider attaching to Ralph to help test the new version. They need machines like yours that were having problems before, to be certain they have corrected them. The new release is not yet ready for testing, so you won't see many (or any) tasks available on Ralph right now. But it should be soon.
Rosetta Moderator: Mod.Sense
ID: 58836
Mike Tyka
Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 58916 - Posted: 19 Jan 2009, 9:08:54 UTC

** 1.48 released over on RALPH@Home **

Good evening all. For those who've been following this thread and are interested in helping us get the minirosetta app stable, I've just released a new application version over on RALPH with a whole slew of stuff in it to make it more stable, or at least give us more feedback on where it breaks. It's a first step.
Since you've already been giving us invaluable feedback over the last weeks and months, I'd really appreciate your feedback on this new app over on RALPH. Does it run more stably? Do any of the familiar problems crop up? Overrunning WUs? Weird crashes, etc.?

thanks !

mike


http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
ID: 58916
robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,841,472
RAC: 1,593
Message 58921 - Posted: 19 Jan 2009, 13:09:39 UTC - in response to Message 58916.  
Last modified: 19 Jan 2009, 13:23:12 UTC

I've been over on ralph. Looks like you may have made the 1.48 program available over there, but so far I've seen no sign of any new workunits in the queue over there for testing it. I'll need to run at least 10 workunits using it to tell if it's better or not, unless it's worse than 1.47.
ID: 58921



©2024 University of Washington
https://www.bakerlab.org