Problems with version 5.96

Message boards : Number crunching : Problems with version 5.96

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 52097 - Posted: 23 Mar 2008, 23:47:13 UTC

This version of Rosetta is pretty similar to 5.95, though we'll be running a few RNA workunits -- please post if there are any issues.
ID: 52097 · Rating: 0

alpha
Joined: 4 Nov 06
Posts: 27
Credit: 1,550,107
RAC: 0
Message 52099 - Posted: 24 Mar 2008, 8:35:13 UTC

Waking up this morning I've got two compute errors from one computer:

Task 150487485
Task 150487488

Exit status -1073741819 (0xc0000005) on both. Zero credit.
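
For anyone decoding the status: the exit code is reported as a signed 32-bit integer, and masking it to 32 bits gives the hex form shown in parentheses. 0xC0000005 is the Windows NTSTATUS value for an access violation. A minimal sketch of the conversion (the helper name is made up for illustration):

```python
def exit_status_to_hex(status: int) -> str:
    """Reinterpret a signed 32-bit exit status as unsigned and format it as hex."""
    return f"0x{status & 0xFFFFFFFF:08x}"


print(exit_status_to_hex(-1073741819))  # 0xc0000005
```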
ID: 52099 · Rating: 0

skildude
Joined: 13 Dec 05
Posts: 7
Credit: 1,167,139
RAC: 0
Message 52109 - Posted: 25 Mar 2008, 1:41:27 UTC
Last modified: 25 Mar 2008, 1:44:20 UTC

1ehzA_BOINC_SEQSEP_TITRATION_TEMP4_RNA_ABINITIO_RNA_CONTACT-1ehzA-_299

I've had at least 8 WUs fail at various times while in process.
The WUs all get a computation error the moment I attempt to "show graphics". Half of them failed before I attempted that, or before I even noticed I had the WUs. Perhaps the graphics need to be turned off on these WUs.

Typical error, even for WUs that failed immediately:

stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 3404258
# cpu_run_time_pref: 14400
# random seed: 3404258
# cpu_run_time_pref: 14400
# random seed: 3404258
# cpu_run_time_pref: 14400
# random seed: 3404258
# cpu_run_time_pref: 14400
# random seed: 3404258
# cpu_run_time_pref: 14400
# random seed: 3404258
# cpu_run_time_pref: 14400
# random seed: 3404258
# cpu_run_time_pref: 14400
# random seed: 3404258
# cpu_run_time_pref: 14400
# random seed: 3404258
# cpu_run_time_pref: 14400
# random seed: 3404258
# cpu_run_time_pref: 14400
# random seed: 3404258
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 2420.03 cpu seconds
This process generated 0 decoys from 0 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>1ehzA_BOINC_SEQSEP_TITRATION_TEMP4_RNA_ABINITIO_RNA_CONTACT-1ehzA-_2996_9923_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>
]]>
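
The repeated seed lines suggest the task started over from the same point each time before the "Too many restarts with no progress" guard stopped it. A rough way to count those starts in a saved stderr dump is sketched below. It assumes each `# random seed:` line corresponds to one (re)start, which is a guess from the log and not documented behaviour; the function name is hypothetical.

```python
import re


def count_restarts(stderr_text: str) -> int:
    """Count '# random seed:' lines in a stderr dump.

    Assumption: the application prints this line once per (re)start.
    """
    return len(re.findall(r"^# random seed: \d+$", stderr_text, flags=re.MULTILINE))


# Rebuild the repeated part of the log above: 11 identical pairs of lines.
sample = "\n".join(["# cpu_run_time_pref: 14400", "# random seed: 3404258"] * 11)
print(count_restarts(sample))  # 11
```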
Space is a vast empty space. Let us hope that it does not occupy the region between your ears.

Come visit Team Starfire at www.TSWB.org

ID: 52109 · Rating: 0

vicel
Joined: 28 Mar 06
Posts: 5
Credit: 957,142
RAC: 0
Message 52124 - Posted: 26 Mar 2008, 10:11:00 UTC

Calculation froze on WU 1wrp__BOINC_SYMM_FOLD_AND_DOCK_RELAX-1wrp_-native__3004_6 (https://boinc.bakerlab.org/rosetta/workunit.php?wuid=137744826).
Today I restarted my computer. After that, this WU started with status "Running", but the calculation is not actually progressing (CPU time does not change, and System Monitor shows 0% for every "rosetta_beta_5.96..." process).
The message log shows only "resuming task 1wrp_..." with no errors.
I tried restarting the WU with suspend and resume, but progress stays stopped. (Other WUs did run.)

Linux Ubuntu 7.10 (32bit), BOINC 5.10.45. Rosetta Beta 5.96 (for this WU).

Intel Core 2 Duo E4500 (2.2 GHz). RAM: 4Gb (really used 3 Gb). HDD: 33 Gb free.
MotherBoard: Intel DG33BU (internal video).

SETI@Home/Rosetta@Home - 50/50.

ID: 52124 · Rating: 0

Andrii Muliar
Joined: 10 Nov 05
Posts: 12
Credit: 7,655,243
RAC: 0
Message 52125 - Posted: 26 Mar 2008, 10:38:18 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=137758041

Weird search process:



During the search, Accepted RMSD and Energy remain constant:


ID: 52125 · Rating: 0

Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 52213 - Posted: 1 Apr 2008, 20:33:58 UTC
Last modified: 1 Apr 2008, 20:39:20 UTC

I got a grey Windows box saying 5.96 encountered an error and is shutting down.
The computation was around 66.36% finished when it crashed. *Read the error log here, as it is quite lengthy.* If it's a Windows thing, then at the moment I am trying to finish the batch of files I got after the last crash, and I will load a new version of XP on my drive when they are done.
Here is what was in BOINC manager:
4/1/2008 6:18:55 PM|rosetta@home|Starting task bench80_rozilla_abrelax_natfrag_1gu3A_2986_41161_0 using rosetta_beta version 596
4/1/2008 6:18:57 PM|rosetta@home|Started upload of bench80_rozilla_abrelax_natfrag_2hnfA_2986_38472_0_0
4/1/2008 6:19:02 PM|rosetta@home|Finished upload of bench80_rozilla_abrelax_natfrag_2hnfA_2986_38472_0_0
4/1/2008 7:31:08 PM|rosetta@home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 4 completed tasks
4/1/2008 7:31:13 PM|rosetta@home|Scheduler request succeeded: got 0 new tasks
4/1/2008 7:31:31 PM||Project communication failed: attempting access to reference site
4/1/2008 7:31:32 PM||Access to reference site succeeded - project servers may be temporarily down.
4/1/2008 10:29:53 PM|rosetta@home|Computation for task bench80_rozilla_abrelax_natfrag_1gu3A_2986_41161_0 finished
4/1/2008 10:29:54 PM|rosetta@home|Output file bench80_rozilla_abrelax_natfrag_1gu3A_2986_41161_0_0 for task bench80_rozilla_abrelax_natfrag_1gu3A_2986_41161_0 absent
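
The log lines above follow a `timestamp|project|message` layout, so the wall-clock time between "Starting task" and "Computation ... finished" can be worked out directly. The sketch below assumes US-style month/day ordering and a 12-hour clock, as the timestamps suggest; `parse_line` is a hypothetical helper, not part of BOINC.

```python
from datetime import datetime


def parse_line(line: str):
    """Split 'timestamp|project|message' into (datetime, project, message)."""
    stamp, project, message = line.split("|", 2)
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message


log = [
    "4/1/2008 6:18:55 PM|rosetta@home|Starting task bench80_rozilla_abrelax_natfrag_1gu3A_2986_41161_0 using rosetta_beta version 596",
    "4/1/2008 10:29:53 PM|rosetta@home|Computation for task bench80_rozilla_abrelax_natfrag_1gu3A_2986_41161_0 finished",
]

started, _, _ = parse_line(log[0])
finished, _, _ = parse_line(log[1])
elapsed = finished - started
print(elapsed)  # 4:10:58
```

Lines with an empty project field (the `||` entries) parse the same way and simply yield an empty project string.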
ID: 52213 · Rating: 0

Christian Diepold
Joined: 23 Sep 05
Posts: 37
Credit: 300,225
RAC: 0
Message 52220 - Posted: 3 Apr 2008, 15:57:01 UTC

I also experienced a WU crash with a Windows XP box popping up and asking for a click on OK (Rosetta experienced a critical error ...). As it happened on my main PC, no big deal, I caught it the second it happened. But I hope it doesn't happen on any of the unmonitored crunchers out there.
ID: 52220 · Rating: 0

M.L.
Joined: 21 Nov 06
Posts: 182
Credit: 180,462
RAC: 0
Message 52250 - Posted: 5 Apr 2008, 0:37:06 UTC

05-Apr-2008 00:40:59 [rosetta@home] Task FRA_t035_CAPRI15_hom1dyoB_2_IGNORE_THE_RESTt035_2_t035.1dyo.second_round_starting_pdb.pdb_3020_54511_0 exited with a DLL initialization error.
05-Apr-2008 00:40:59 [rosetta@home] If this happens repeatedly you may need to reboot your computer.
05-Apr-2008 00:40:59 [rosetta@home] Task FRA_t848_hom1jnx_1_IGNORE_THE_RESTt848_1_t848_1jnx.template_0002.pdb_3021_2209_0 exited with a DLL initialization error.
05-Apr-2008 00:40:59 [rosetta@home] If this happens repeatedly you may need to reboot your computer.
05-Apr-2008 00:41:01 [rosetta@home] Restarting task FRA_t035_CAPRI15_hom1dyoB_2_IGNORE_THE_RESTt035_2_t035.1dyo.second_round_starting_pdb.pdb_3020_54511_0 using rosetta_beta version 596
05-Apr-2008 00:41:01 [rosetta@home] Restarting task FRA_t848_hom1jnx_1_IGNORE_THE_RESTt848_1_t848_1jnx.template_0002.pdb_3021_2209_0 using rosetta_beta version 596

Both seem good now but will check in a while. (did not reboot {see the time}).
ID: 52250 · Rating: 0

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 52260 - Posted: 5 Apr 2008, 3:22:02 UTC

The stdout.txt file for some WUs seems to be growing to over 40 Megabytes. The following WUs are doing this:

BAK_1auq_Nterm_loop_3031_10951
BAK_1auq_Nterm_loop_3031_11495
BAK_1auq_Nterm_loop_3031_2944

At the moment, the WUs are still crunching.

Looking at one of the stdout.txt files shows a lot of lines like:

...
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0320764
1 55 -4.99472 -4.99646
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0552378
2 55 -6.28509 -6.28283
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0535297
1 56 -6.78289 -6.78004
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0397191
2 56 -7.03684 -7.03334
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0533867
3 56 -36.1671 -36.1738
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0479667
2 58 -5.27631 -5.27237
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0940199
1 59 -6.51416 -6.5067
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0393486
3 59 -32.5412 -32.5466
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0327196
1 61 -4.3825 -4.38517
STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.054975
1 62 -5.24462 -5.25139
...

ad nauseam
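
To get a feel for how large the reported deviations are without scrolling through a 40 MB file, the `dev=` values can be pulled out and summarized. This is only a sketch: the regular expression is fitted to the lines quoted above, and the function name is made up.

```python
import re

# Pattern fitted to the STOP:: lines quoted above (an assumption about the format).
DEV_RE = re.compile(r"^STOP:: Pose::set_coords\(\): mismatch .* dev= ([0-9.]+)$")


def deviations(lines):
    """Yield the 'dev=' value from each STOP:: Pose::set_coords() line."""
    for line in lines:
        match = DEV_RE.match(line.strip())
        if match:
            yield float(match.group(1))


sample = [
    "STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0320764",
    "1 55 -4.99472 -4.99646",
    "STOP:: Pose::set_coords(): mismatch between N,CA,C,O coords between full-coord and Eposition. dev= 0.0940199",
    "1 59 -6.51416 -6.5067",
]
devs = list(deviations(sample))
print(len(devs), max(devs))  # 2 0.0940199
```

For a real stdout.txt, passing the open file object to `deviations` reads it line by line, so the whole file never has to sit in memory.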
ID: 52260 · Rating: 0

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 52267 - Posted: 5 Apr 2008, 14:14:30 UTC

As a followup to my previous post, the three WUs completed normally:

https://boinc.bakerlab.org/rosetta/result.php?resultid=153052812
https://boinc.bakerlab.org/rosetta/result.php?resultid=153058613
https://boinc.bakerlab.org/rosetta/result.php?resultid=152966018

When I first noticed these, all of them had a stdout.txt file a bit over 40MB. But the stdout.txt file didn't grow much bigger, even with another 8 hours or so of crunching.
ID: 52267 · Rating: 0

Larry256
Joined: 11 Nov 05
Posts: 2
Credit: 2,986,767
RAC: 840
Message 52301 - Posted: 6 Apr 2008, 16:51:09 UTC - in response to Message 52097.  
Last modified: 6 Apr 2008, 17:05:27 UTC

I experienced 2 WU crashes with a Windows XP box popping up and asking for a click on OK:
https://boinc.bakerlab.org/rosetta/result.php?resultid=153734357
https://boinc.bakerlab.org/rosetta/result.php?resultid=153734536
ID: 52301 · Rating: 0

Brian Kidd
Joined: 9 Dec 06
Posts: 5
Credit: 327
RAC: 0
Message 52306 - Posted: 7 Apr 2008, 1:28:53 UTC

Thanks for letting us know about the large stdout files that you're getting. I'm currently looking into this issue and it looks like the error is basically harmless. We'll correct the large stdout files on the next release.
ID: 52306 · Rating: 0

DJStarfox
Joined: 19 Jul 07
Posts: 145
Credit: 1,235,310
RAC: 70
Message 52391 - Posted: 11 Apr 2008, 22:43:12 UTC

When a WU finishes, three (out of four) processes/executables are still left in memory as orphaned processes for each WU. The results do show as valid. Here's an example:
https://boinc.bakerlab.org/rosetta/result.php?resultid=155168683

System is Linux Fedora 7 x86_64
2 x AMD Opteron 248 HE
2 x 1GB RAM
BOINC 5.8.16 x86
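
One way to spot leftovers like these on Linux is to scan `/proc` for processes whose name starts with the application name and whose parent is PID 1. The sketch below makes two assumptions: the process name begins with `rosetta`, and orphans are re-parented to PID 1 (on some newer systems they go to a different reaper process). `find_orphans` is a hypothetical helper.

```python
import os


def find_orphans(name_prefix="rosetta", proc_root="/proc"):
    """Return sorted (pid, name) pairs for processes whose name starts with
    name_prefix and whose parent PID is 1."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style /proc; nothing to scan
    orphans = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        fields = {}
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                for line in fh:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
        except OSError:
            continue  # process exited while we were scanning
        if fields.get("Name", "").startswith(name_prefix) and fields.get("PPid") == "1":
            orphans.append((int(entry), fields["Name"]))
    return sorted(orphans)


if __name__ == "__main__":
    for pid, name in find_orphans():
        print(pid, name)
```

The kernel truncates the `Name` field in `/proc/<pid>/status` to 15 characters, which is why the match is on a prefix and not on the full executable name.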
ID: 52391 · Rating: 0

msirois
Joined: 12 Mar 08
Posts: 1
Credit: 112,554
RAC: 0
Message 52531 - Posted: 16 Apr 2008, 19:45:48 UTC

On about half of the jobs, when I reach around 95% completed, progress simply crawls. The "To completion" time stops, but the percentage increments extremely slowly. I assume the job is progressing, but I don't know.
ID: 52531 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 52532 - Posted: 16 Apr 2008, 20:02:20 UTC - in response to Message 52531.  

On about half of the jobs, when I reach around 95% completed, progress simply crawls. The "To completion" time stops, but the percentage increments extremely slowly. I assume the job is progressing, but I don't know.


That would be normal. Especially if you are still on the first model, and/or have a short preferred runtime specified in your Rosetta preferences.
Rosetta Moderator: Mod.Sense
ID: 52532 · Rating: 0

hedera
Joined: 15 Jul 06
Posts: 76
Credit: 5,139,863
RAC: 905
Message 52540 - Posted: 16 Apr 2008, 23:08:48 UTC

I don't know if this is an issue with 5.96 (nothing actually seems to be crashing) or just with the RNA work units, but I'm posting this from my laptop because my desktop machine is completely non-responsive - I have 3 WU's "active" and among them they're eating over half a GB of RAM. And I only have a GB on a WinXP box. I literally can't get a response from keyboard or mouse.

I've ordered a memory upgrade to 4 GB after which you can analyze RNA if you want to, but in the meantime, have pity!!
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

ID: 52540 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 52542 - Posted: 16 Apr 2008, 23:48:10 UTC
Last modified: 17 Apr 2008, 13:44:41 UTC

Hedera, you can tell BOINC to "snooze", or limit it to fewer than all of your CPUs, or put a memory constraint on it to help keep it responsive for you. Most likely, as you already seem aware, you are bottlenecking on memory. Yes, those RNAs do take a lot. So if you limit the amount of memory BOINC is allowed to use, it will basically scale back on how many it's trying to do at the same time.
Rosetta Moderator: Mod.Sense
ID: 52542 · Rating: 0

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 52546 - Posted: 17 Apr 2008, 5:30:17 UTC - in response to Message 52542.  

Hedera, can you post a link to one of your results, so that I can figure out what the problem is? I actually did not expect any of these RNA jobs to take a lot of memory, so I definitely want to track this down.

Hedera, you can tell BOINC to "snooze", or limit it to fewer than all of your CPUs, or put a memory constraint on it to help keep it responsive for you. Most likely, as you already seem aware, you are bottlenecking on memory. Yes, those RNAs do that a lot. So if you limit the amount of memory BOINC is allowed to use, it will basically scale back on how many it's trying to do at the same time.


ID: 52546 · Rating: 0

hedera
Joined: 15 Jul 06
Posts: 76
Credit: 5,139,863
RAC: 905
Message 52569 - Posted: 17 Apr 2008, 17:45:39 UTC

At the moment this task is 54% done and "waiting for memory":

https://boinc.bakerlab.org/rosetta/result.php?resultid=156211157
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=142644377

And these tasks are running:

https://boinc.bakerlab.org/rosetta/result.php?resultid=156203235 (31% done)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=142636856

https://boinc.bakerlab.org/rosetta/result.php?resultid=156409361 (4% done)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=142826368

I may have screwed up its scheduling last night when I suspended the project so I could do some actual work on the box. I don't usually have 3 at a time.

One of the 2 running tasks has a working memory set of just under 192K, and the other set size is just under 116K; I don't know which is which.

I actually have restricted memory use on my system, to 50% of memory when the box is in use and 85% when not, and until this last round of WUs that was enough. It isn't as bad this morning as it was last night. (I got the memory upgrade today! But it isn't in yet.)
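
To put rough numbers on those limits: on the 1 GB box described earlier in the thread, a 50% cap leaves about 512 MB for BOINC while the machine is in use and an 85% cap about 870 MB when idle. The sketch below does that arithmetic and shows how a cap could reduce the number of tasks running at once. The admission rule is a deliberate simplification for illustration, not BOINC's actual scheduler, and the per-task sizes are made up.

```python
def memory_budget_mb(total_ram_mb: float, percent: float) -> float:
    """RAM available to the client under a percentage cap."""
    return total_ram_mb * percent / 100.0


def tasks_that_fit(working_sets_mb, budget_mb):
    """Greedy illustration: admit tasks in order while their working sets fit.

    Simplified on purpose; this is not how BOINC itself decides.
    """
    admitted, used = [], 0.0
    for size in working_sets_mb:
        if used + size <= budget_mb:
            admitted.append(size)
            used += size
    return admitted


total_mb = 1024                         # the 1 GB WinXP box
busy = memory_budget_mb(total_mb, 50)   # cap while the computer is in use
idle = memory_budget_mb(total_mb, 85)   # cap while the computer is idle
print(busy, idle)  # 512.0 870.4

hypothetical_tasks = [200, 200, 200]    # made-up working-set sizes in MB
print(len(tasks_that_fit(hypothetical_tasks, busy)))  # 2
print(len(tasks_that_fit(hypothetical_tasks, idle)))  # 3
```

Under these made-up sizes the third task would have to wait while the box is in use, which matches the "waiting for memory" status in spirit.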


--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

ID: 52569 · Rating: 0

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 52573 - Posted: 17 Apr 2008, 22:52:07 UTC

I've got a bunch of these CAPRI15 tasks. First one has been running 4hrs and shows peak memory usage of 470M and it still hasn't completed the first model. This is the one running now:
https://boinc.bakerlab.org/rosetta/result.php?resultid=156183910
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 52573 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org