Message boards : Number crunching : Minirosetta v1.47 bug thread.
Author | Message |
---|---|
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
some bizarre behavior for these tasks https://boinc.bakerlab.org/rosetta/result.php?resultid=218440422 lr5_score12_rlbd_2o7k_IGNORE_THE_REST_DECOY_5559_1165_0 Exit Status -1073741819 (0xc0000005) CPU time 8809.906 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 14400 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0049162C read attempt to address 0x00000000 Engaging BOINC Windows Runtime Debugger... Validate state Invalid Claimed credit 58.9690388361006 Granted credit 58.9690388361006 But according to the tasks for user page the granted credit never happened. --------- https://boinc.bakerlab.org/rosetta/result.php?resultid=218547095 lr5_score12_rlbd_1ubi_IGNORE_THE_REST_DECOY_5559_1100_1 Exit status -1073741819 (0xc0000005) CPU time 1089.156 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 14400 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0049162C read attempt to address 0x00000000 Engaging BOINC Windows Runtime Debugger... Claimed credit 7.29025740598957 Granted credit 7.29025740598957 but again, no credit in the tasks for user page |
slre Joined: 6 Dec 08 Posts: 2 Credit: 1,908,468 RAC: 0 |
I'm seeing the diminishing returns problem regularly. The progress for tasks is good for the first 70-80%, then progress gets slower and slower. Yesterday I aborted a task that had taken 30 hours to go from 97 to 99.5% after taking under 12 hours to get to 97%. The following task is going the same way: abinitio_norelax_homfrag_129_B_1o7uA_SAVE_ALL_OUT_4626_11775_0 After 3 hours it was reporting 70% complete; it is now at 98.8% after 13.5 hours. My main complaint is not that the tasks can overrun (that is clearly a problem, and it has been reported previously) but that I thought the target cpu time included a threshold (3*target cpu time?) that terminated an overrunning task. Minirosetta is clearly ignoring this if it's set, as my target time is set to 4 hours. Is minirosetta supposed to act on target cpu time? If it is, why isn't it? |
robertmiles Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 677 |
I'm seeing the diminishing returns problem regularly. The progress for tasks is good for the first 70-80%, then progress gets slower and slower. Yesterday I aborted a task that had taken 30 hours to go from 97 to 99.5% after taking under 12 hours to get to 97%. It is supposed to, but it doesn't check continuously for an overrun. If you have BOINC set to give each workunit a two-hour timeslice before deciding which workunit gets the next timeslice, as I do, it only checks for an overrun every two hours. In other words, your actual limit should be (3*target cpu time) + 1 timeslice at present. Also, the diminishing returns you see are at least partly fake; minirosetta doesn't have a good way of measuring what percentage of the work has been done, so it estimates the percentage done from the fraction of the target CPU time it has already used. Once it gets within about 10 minutes of the target CPU time, the reported percentage almost stops changing until the task actually finishes. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I'm seeing the diminishing returns problem regularly. The progress for tasks is good for the first 70-80%, then progress gets slower and slower. Yesterday I aborted a task that had taken 30 hours to go from 97 to 99.5% after taking under 12 hours to get to 97%. be sure to post links to the tasks that ran over in the long running models thread. apparently the team reads this thread to find out what is going on and make corrections in the next batch of tasks that are similar in nature. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
just a heads up: 1/12/2009 1:23:10 AM|rosetta@home|Task abinitio_norelax_homfrag_129_B_1a19A_SAVE_ALL_OUT_4626_9187_0 exited with zero status but no 'finished' file 1/12/2009 1:23:10 AM|rosetta@home|If this happens repeatedly you may need to reset the project. 1/12/2009 1:23:10 AM|rosetta@home|Task abinitio_norelax_homfrag_129_B_4ubpA_SAVE_ALL_OUT_4626_9186_0 exited with zero status but no 'finished' file 1/12/2009 1:23:10 AM|rosetta@home|If this happens repeatedly you may need to reset the project. 1/12/2009 1:23:10 AM|rosetta@home|Restarting task abinitio_norelax_homfrag_129_B_4ubpA_SAVE_ALL_OUT_4626_9186_0 using minirosetta version 147 the 87 task: https://boinc.bakerlab.org/rosetta/result.php?resultid=219581418 the 86 task https://boinc.bakerlab.org/rosetta/result.php?resultid=219581394 both tasks got credit ok. so don't know what that message was all about. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
A link to slre's task, it ran for over 40 hours! So, yes, clearly the watchdog should have ended it. Robert, I don't believe the watchdog is dependent upon the BOINC task switching. On the other hand, it's not constantly checking either. Rosetta Moderator: Mod.Sense |
slre Joined: 6 Dec 08 Posts: 2 Credit: 1,908,468 RAC: 0 |
A link to slre's task, it ran for over 40 hours! So, yes, clearly the watchdog should have ended it. Thanks for that; a) I didn't know you could link to aborted tasks; b) it made my case better than I did and c) thanks for confirming there's a genuine problem. S |
HA-SOFT, s.r.o. Joined: 27 Jan 07 Posts: 10 Credit: 94,518,643 RAC: 0 |
StdErr is empty or contains a message about an access violation (0xc0000005). The application hangs with 3 MB of RAM and does nothing. For example, I have about 10 minirosetta apps in memory that do nothing. When I kill them, there is no stderr or any other file in the slots directory. Greg_BE and all, |
LizzieBarry Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0 |
Thanks Paul and everyone else. I'll give these suggestions a try sometime this week (besides Rosie is out of work for the time being). In lieu of any direct reply, I note that every recent job for sslickerson has completed successfully. Looks like Boinc 6.4.5 answers at least one person's problems with MiniRosetta WUs. Worth thinking about for anyone with otherwise persistent problems, it seems. |
Mike Tyka Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
Hi all! Hope you all had a fabulous Christmas break. Despite being quiet on the message boards we've been busy working on making mini more stable. This is the top priority right now and I think we've made some progress. Your comments and feedback and error reports have been invaluable in this process! We've also set up a Windows test-bed here locally which identified a number of hidden issues that the Linux machines we typically use didn't catch. The next release 1.48 is about to go on RALPH and I am intending to test it very thoroughly before moving it onto BOINC. Since you guys posting here are already familiar with spotting problems I think it would be awesome if some of you experienced users could move over to RALPH@Home just for a few weeks while we test the new release. You've already seen the problems that used to occur and we need your feedback (and the extra processing power and variety of machines) to make sure we've fixed the issues we think we have fixed. I'll announce again here when the new version is actually out. Here's a preview of the features that have been put into mini 1.48: 1.48 Release CHANGELOG Faster loop closing in FoldCST/Abinitio (affects cc_* cc2_* cs_* WUs), should help with overrunning WUs. Bug fix concerning intermittent crashes in _rlbd_ jobs. Bug fix for a potential instability in handling text files (affects all types of WUs). Bug fix in checkpointing machinery, states were not being correctly restored, probably contributing to long runtimes. (affects cc_* cc2_* cs_* WUs) Increased the density of checkpoints to lose less time on restarts and address the weird "backjumping" of the time reported in this thread. Added checkpointing to Loopclosing part of FoldCST. (affects cc_* cc2_* cs_* WUs) Added checkpointing to Looprelax. The Watchdog has been checked and improved, now returning information on the aborted jobs to help us figure out how the remaining long running models come about. 
http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |
LizzieBarry Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0 |
Despite being quiet on the message boards we've been busy working on making mini more stable. This is the top priority right now and I think we've made some progress. Your comments and feedback and error reports have been invaluable in this process! We've also set up a windows test-bed here locally which identified a number of hidden issues that the Linux machines we typically use didn't catch. That's the way I like it - that you're getting busy behind the scenes rather than getting bogged down here. But it's worth a quick progress report once a week to prevent the natives getting too restless. Good to hear you're set up with a Windows machine to pick up problems on the majority platform and it's earned its corn already. I look forward to the results and a much quieter bug thread. The work on over-running WUs, intermittent crashes and extra check-pointing should make a big difference if it's successful. |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Well, no work yet in RALPH ... But I did sign up, for what it is worth ... I will watch and see if I get any work on one system ... |
Mike Tyka Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
Yeah - hold yer horses .. we've not done the update yet. I'll announce it here. http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |
sslickerson Joined: 14 Oct 05 Posts: 101 Credit: 578,497 RAC: 0 |
Thanks Paul and everyone else. I'll give these suggestions a try sometime this week (besides Rosie is out of work for the time being). Hey there, sorry about not replying. Actually, the Rosetta WUs you are looking at are on my desktop (BOINC 6.4.5), which *typically* does not have issues with minirosetta. I have not allowed work on my laptop (BOINC 6.5.0) since the last batch of errors, so I am uncertain whether the update would have fixed the issue. I am going to reattach to RALPH for a while and hopefully if there are errors we can get them fixed over there. Timothy |
Krata Joined: 25 Oct 05 Posts: 2 Credit: 17,084 RAC: 0 |
Hi, I still have the same problem with the Minirosetta application (at least the last 4 versions). Symptoms: the application starts (shown as running in BOINC) but CPU usage is zero... there is no progress and finally (e.g. after 2 hours) I am forced to abort it. There are still some tasks that finish without any problem... Successful result example: https://boinc.bakerlab.org/rosetta/result.php?resultid=220577616 Examples that needed to be aborted: https://boinc.bakerlab.org/rosetta/result.php?resultid=220578787 https://boinc.bakerlab.org/rosetta/result.php?resultid=220578788 Due to these facts (no error, and so no work performed at all) I have switched to a different project. Thanks for any advice... PS: I tried detaching from the project, resetting and so on... 15/01/2009 08:53:44||Starting BOINC client version 6.4.5 for windows_intelx86 15/01/2009 08:53:44||log flags: task, file_xfer, sched_ops 15/01/2009 08:53:44||Libraries: libcurl/7.19.0 OpenSSL/0.9.8i zlib/1.2.3 15/01/2009 08:53:44||Data directory: C:Documents and SettingskratochvilDesktopboincnewCommonAppDataBOINC 15/01/2009 08:53:44||Running under account kratochvil 15/01/2009 08:53:44||Processor: 1 GenuineIntel Intel(R) Pentium(R) M processor 1.73GHz [x86 Family 6 Model 13 Stepping 8] 15/01/2009 08:53:44||Processor features: fpu tsc sse sse2 mmx 15/01/2009 08:53:44||OS: Microsoft Windows XP: Professional x86 Editon, Service Pack 2, (05.01.2600.00) 15/01/2009 08:53:44||Memory: 1.99 GB physical, 4.82 GB virtual 15/01/2009 08:53:44||Disk: 74.53 GB total, 9.79 GB free 15/01/2009 08:53:44||Local time is UTC +1 hours 15/01/2009 08:53:44||Using HTTP proxy CZproxy.de.eurw.ey.net:8080 15/01/2009 08:53:44||No CUDA devices found 15/01/2009 08:53:44||No coprocessors 15/01/2009 08:53:44|rosetta@home|URL: https://boinc.bakerlab.org/rosetta/; Computer ID: 984920; location: home; project prefs: default 15/01/2009 08:53:44|QMC@HOME|URL: http://qah.uni-muenster.de/; Computer ID: 114583; location: (none); project prefs: default 15/01/2009 08:53:44||General prefs: from rosetta@home (last modified 14-Jun-2008 11:07:07) 15/01/2009 08:53:44||Computer location: home 15/01/2009 08:53:44||General prefs: using separate prefs for home 15/01/2009 08:53:44||Reading preferences override file 15/01/2009 08:53:44||Preferences limit memory usage when active to 1426.87MB 15/01/2009 08:53:44||Preferences limit memory usage when idle to 1834.55MB 15/01/2009 08:53:45||Preferences limit disk usage to 2.00GB 15/01/2009 08:53:45|QMC@HOME|Restarting task one_bench12_s22-ecp2-TZmf.13431_0 using Amolqc-preRC1 version 501 |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Krata, I do not have any specific advice to offer you to resolve the problem you describe. I only see a few tasks from that host, and only one completed normally and only two were aborted. So, perhaps greater numbers will help reveal more symptoms. Could I ask that you keep an eye on the news portion of the home page and come back when the new Mini version is available? It will correct the majority of problems people have been reporting. If you are willing, you might also consider attaching to Ralph to help test the new version. They need machines like yours that were having problems before, to be certain they have corrected them. The new release is not yet ready for testing, so you won't see many (or any) tasks available on Ralph right now. But there should be some soon. Rosetta Moderator: Mod.Sense |
Mike Tyka Joined: 20 Oct 05 Posts: 96 Credit: 2,190 RAC: 0 |
** 1.48 released over on RALPH@Home ** Good evening all. For those who've been following this thread and are interested in helping us get the minirosetta app stable, I've just released a new application version over on RALPH with a whole slew of stuff in it to make it more stable, or at least give us more feedback on where it breaks. It's a first step. Since you've already been giving us incredibly invaluable feedback over the last weeks and months, I'd really appreciate your feedback on this new app over on RALPH. Does it run more stably? Do any of the familiar problems crop up? Overrunning WUs? Weird crashes etc.? Thanks! Mike http://beautifulproteins.blogspot.com/ http://www.miketyka.com/ |
robertmiles Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 677 |
** 1.48 released over on RALPH@Home ** I've been over on RALPH. Looks like you may have made the 1.48 program available over there, but so far I've seen no sign of any new workunits in the queue over there for testing it. I'll need to run at least 10 workunits using it to tell if it's better or not, unless it's worse than 1.47. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
** 1.48 released over on RALPH@Home ** project seems to be disabled at the moment for "maintenance" |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
back to 1.47 errors this one crashed and burned: jump-neg-1aiu___6220_9692_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=221608803 state Compute error Exit status -1073741819 (0xc0000005) CPU time 10506.81 stderr out <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 14400 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0059C5C0 write attempt to address 0x00978D98 Engaging BOINC Windows Runtime Debugger... |
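
Two bits of arithmetic recur in the posts above and may be easier to follow written out. The sketch below is illustrative only and is not part of BOINC or minirosetta; every function name in it is hypothetical. It shows (a) why the negative exit status -1073741819 and the hex value 0xc0000005 in the stderr dumps are the same 32-bit number, which Windows names STATUS_ACCESS_VIOLATION, and (b) the overrun ceiling as robertmiles describes it, (3 * target cpu time) + 1 timeslice. Mod.Sense doubts that the watchdog depends on task switching, so the second function reflects one poster's description rather than confirmed behavior.

```python
# Illustrative helpers only; none of these names exist in BOINC or minirosetta.

# Deliberately tiny lookup of Windows status codes seen in this thread.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def to_unsigned_status(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as its unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


def describe_exit(exit_code: int) -> str:
    """Format an exit code the way the task pages show it, plus a name if known."""
    status = to_unsigned_status(exit_code)
    name = KNOWN_STATUS.get(status, "unknown status")
    return f"{exit_code} = 0x{status:08x} ({name})"


def overrun_ceiling_hours(target_hours: float, timeslice_hours: float) -> float:
    """Ceiling as described by robertmiles: (3 * target cpu time) + 1 timeslice.

    Assumption: this models his description only, not verified watchdog code.
    """
    return 3 * target_hours + timeslice_hours


if __name__ == "__main__":
    print(describe_exit(-1073741819))
    # slre's 4-hour target with a 2-hour timeslice, in hours.
    print(overrun_ceiling_hours(4, 2))
```

Under that description, slre's 4-hour target with a two-hour timeslice would give a ceiling of 14 hours, well short of the 40-plus hours the linked task actually ran.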
©2025 University of Washington
https://www.bakerlab.org