Report long-running models here

Author	Message
Hubington Joined: 3 Feb 06 Posts: 24 Credit: 127,236 RAC: 0	Message 56195 - Posted: 3 Oct 2008, 14:11:40 UTC - in response to Message 56194. You think you have it bad? What a waste of time and energy this one was:
194952967 26 Sep 2008 18:51:19 UTC Over Client error Compute error 61,849.00 562.03 ---
It's all these hombench tasks. It's like I said in the thread where they announced it (https://boinc.bakerlab.org/forum_thread.php?id=4388): this sort of thing should really be part of RALPH, not Rosetta. As I understand it, Rosetta was set up so all us grunts can do the monkey work with the tested and proven applications, while RALPH was for R&D for Rosetta, so they could test new ideas and get them working right before we all grind away at processing it all. If you go to the RALPH home page, the first thing on the page says: "RALPH@home is the official alpha test project for Rosetta@home. New application versions, work units, and updates in general will be tested here before being used for production. The goal for RALPH@home is to improve Rosetta@home."

sswilson Joined: 9 May 08 Posts: 6 Credit: 1,519,259 RAC: 0	Message 56787 - Posted: 9 Nov 2008, 16:38:20 UTC Rosetta mini 1.40, BOINC 5.10.45, Win XP (fully updated)
1hzh_1s69_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_113_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=206158706
1hzh_2jc7_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_73_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=206146511
1hzh_1o4r_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_74_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=206098744
1hzh_1r6n_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_21_0 https://boinc.bakerlab.org/rosetta/result.php?resultid=206025047
All of these took much longer than normal, and returned very poor granted credit versus claimed credit.

sswilson Joined: 9 May 08 Posts: 6 Credit: 1,519,259 RAC: 0	Message 56789 - Posted: 9 Nov 2008, 16:58:58 UTC BTW... If this is an ongoing issue, this thread should be stickied so that folks know it exists. I only came across it accidentally through a link from another thread.

googloo Joined: 15 Sep 06 Posts: 133 Credit: 23,650,756 RAC: 6,014	Message 56790 - Posted: 9 Nov 2008, 18:00:11 UTC 205632563 ran long. I suspect it was because of a number of these:
recovering checkpoint of tag S_U9X3X_00000001 with id abrelax_rg_state
recovering checkpoint of tag S_U9X3X_00000001 with id stage_1
recovering checkpoint of tag S_U9X3X_00000001 with id stage_2
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_1
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_2
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_3
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_4
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_5
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_6
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_7
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_8
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_9
recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_10
recovering checkpoint of tag S_U9X3X_00000001 with id stage4_kk_1
recovering checkpoint of tag S_U9X3X_00000001 with id stage4_kk_2
recovering checkpoint of tag S_U9X3X_00000001 with id stage4_kk_3
recovering checkpoint of tag S_U9X3X_00000001 with id abrelax_relax
Eight of them, actually, which equals the number of decoys it ran.
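For anyone who wants to tally these lines rather than count them by eye, here is a minimal Python sketch. It assumes the stderr text has been copied from the result page into a string; the function names and the idea of treating each `abrelax_rg_state` line as the start of one recovery pass are illustrative assumptions, not anything the Rosetta application documents.

```python
import re
from collections import Counter

# Matches the stderr lines quoted in the post above; the wording is copied
# from that output, everything else in this sketch is hypothetical.
CHECKPOINT_RE = re.compile(r"recovering checkpoint of tag (\S+) with id (\S+)")


def count_recoveries(stderr_text: str) -> Counter:
    """Count how many times each checkpoint id was recovered."""
    return Counter(m.group(2) for m in CHECKPOINT_RE.finditer(stderr_text))


def count_recovery_passes(stderr_text: str, first_id: str = "abrelax_rg_state") -> int:
    """Estimate the number of recovery passes.

    Assumption: every pass begins by recovering `first_id`, so the number
    of times that id appears equals the number of passes.
    """
    return count_recoveries(stderr_text)[first_id]


if __name__ == "__main__":
    sample = (
        "recovering checkpoint of tag S_U9X3X_00000001 with id abrelax_rg_state\n"
        "recovering checkpoint of tag S_U9X3X_00000001 with id stage_1\n"
    ) * 2
    print(count_recovery_passes(sample))
    print(count_recoveries(sample))
```

Because the pattern only relies on whitespace between fields, it also works when the log has been pasted as one long line.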

Inikurmoma Joined: 12 Oct 08 Posts: 1 Credit: 606,772 RAC: 0	Message 56796 - Posted: 9 Nov 2008, 22:20:44 UTC - in response to Message 56790. I have to report a long task:
09/11/2008 12:01:27 PM|rosetta@home|Starting 1hzh_1wrm_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_172_0
It has been running for 5 hours now and is stuck at 96.779%.

Hubington Joined: 3 Feb 06 Posts: 24 Credit: 127,236 RAC: 0	Message 56823 - Posted: 11 Nov 2008, 11:26:27 UTC
11/11/2008 06:11:44|rosetta@home|Computation for task 1hzh_2fi9_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_20_0 finished
Run time is supposed to be 3 hours or thereabouts; it took 8 hours 11 mins. Given the similarity in name to the tasks reported above in this thread, I assume it too went to a snail's pace in the last few percent.

Pepo Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0	Message 56827 - Posted: 11 Nov 2008, 13:07:59 UTC On a slow Duron CPU, the Mini 1.39 task 1g73A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1g73A-_4653_6486_0 was interrupted after nearly 10 hours because of "going too long". The preference was increased from 1 to 2 hours during the run, as an attempt to see how the model would cope with such a slow machine. (It is probably not able to finish a decoy at all on such a slow host.) It was checkpointing. During the run, the progress went very fast to some 80-90% and then advanced in 0.1% steps over hours... Peter

Path7 Joined: 25 Aug 07 Posts: 128 Credit: 61,751 RAC: 0	Message 56835 - Posted: 11 Nov 2008, 15:00:29 UTC Hello all, Not sure where to post, the Minirosetta v1.40 bug thread or this thread: With a runtime preference of 6 hours, this WU: 1hzh_1mve_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_147_0 has already been running for over 13 hours. From the graphics I can see it is at Model: 1 Step: 52500 and running. I'm glad it is running on a (2-core) machine with RAM: 2813.69 MB and swap space: 5849.91 MB, since the WU uses 400 MB of RAM (peak 437 MB) and 393 MB (peak: 429 MB) of virtual memory. Have a nice day, Path7.

robertmiles Joined: 16 Jun 08 Posts: 1240 Credit: 14,421,737 RAC: 5	Message 56844 - Posted: 11 Nov 2008, 18:04:37 UTC I already reported this one a few times in the Minirosetta v1.40 bug thread, but that thread may be getting overloaded, since it lost my last message. 1hzh_1o9g_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_155_1 It finally completed after about 19.25 CPU hours, due to that being over three times the preference of 6 CPU hours. Also, it was very memory hungry compared to the workunits I run for other projects: a peak of perhaps 296 MB, and I haven't found a way to check how much virtual memory it used. A poor ratio of requested to granted credit, 200/80, but that seems common among workunits with 4704 in their names. I suspect a problem in its debt calculations as well, which could explain why it won't yield the CPU core to a workunit from another project at the end of a 2 hour timeslot, even with "leave in memory" set.
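The arithmetic behind "over three times the preference" can be checked with a few lines of Python. This is only a sketch: the factor of 3 is taken from the post above and is not a confirmed rule of the watchdog, and the function names are made up for illustration.

```python
def overrun_ratio(cpu_hours: float, preference_hours: float) -> float:
    """Return CPU time as a multiple of the run time preference."""
    if preference_hours <= 0:
        raise ValueError("preference_hours must be positive")
    return cpu_hours / preference_hours


def exceeds_limit(cpu_hours: float, preference_hours: float, factor: float = 3.0) -> bool:
    """True if the task ran longer than `factor` times the preference.

    Assumption: factor=3.0 mirrors the figure quoted in the post; the real
    cutoff used by the application may differ.
    """
    return overrun_ratio(cpu_hours, preference_hours) > factor


if __name__ == "__main__":
    # Figures from the post above: 19.25 CPU hours against a 6 hour preference.
    print(round(overrun_ratio(19.25, 6), 2))
    print(exceeds_limit(19.25, 6))
```

With the reported figures the ratio comes out slightly above 3, which is consistent with the task having been stopped shortly after crossing that mark.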

adrianxw Joined: 18 Sep 05 Posts: 655 Credit: 12,105,178 RAC: 1,395	Message 56846 - Posted: 11 Nov 2008, 19:12:48 UTC I put mine in the Mini Rosetta 1.30 thread. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

adrianxw Joined: 18 Sep 05 Posts: 655 Credit: 12,105,178 RAC: 1,395	Message 56847 - Posted: 11 Nov 2008, 19:13:21 UTC I put mine in the Mini Rosetta 1.30 thread. Was called... 1hzh_2he4_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_262 Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

sarha1 Joined: 23 Sep 05 Posts: 5 Credit: 6,339,735 RAC: 0	Message 56853 - Posted: 11 Nov 2008, 21:03:56 UTC robertmiles: Granted credit of 80 means you were awarded a flat-rate credit after the watchdog shut down the task due to the time consumed.

Greg_BE Joined: 30 May 06 Posts: 5756 Credit: 6,051,728 RAC: 675	Message 57073 - Posted: 19 Nov 2008, 19:11:16 UTC Default CPU time is 21600; this ran 3146.078.
https://boinc.bakerlab.org/rosetta/result.php?resultid=207892330
h001b_BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-11-S3-8--h001b-_4769_556_0
Client state: Compute error
Exit status: 1 (0x1)
Computer ID: 871217
Report deadline: 26 Nov 2008 22:35:22 UTC
CPU time: 3146.078
stderr out:
<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
recovering checkpoint of tag S_U11X8X_00000001 with id abrelax_rg_state
recovering checkpoint of tag S_U11X8X_00000001 with id stage_1
recovering checkpoint of tag S_U11X8X_00000001 with id stage_2
# cpu_run_time_pref: 21600
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_1
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_2
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_3
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_4
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_5
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_6
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_7
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_8
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_9
recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_10
and this repeats, then this stderr:
ERROR: NANs occured in hbonding!
ERROR:: Exit from: ....srccorescoringhbondshbonds_geom.cc line: 763
called boinc_finish
</stderr_txt>
]]>
Validate state: Invalid
Claimed credit: 21.0970375448934

Pepo Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0	Message 57092 - Posted: 20 Nov 2008, 8:17:27 UTC I've got notifications for 7 more messages between 19:14 and 21:43 UTC, which are missing here. Was the thread spammed, or did a server hiccup happen? Peter

Greg_BE Joined: 30 May 06 Posts: 5756 Credit: 6,051,728 RAC: 675	Message 57095 - Posted: 20 Nov 2008, 11:02:20 UTC - in response to Message 57092. "I've got notifications for 7 more messages between 19:14 and 21:43 UTC, which are missing here. Was the thread spammed, or did a server hiccup happen? Peter" A mod hid some double posts by me and some rhetorical fighting as well. You didn't miss anything. The last current post is the one below.

Pepo Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0	Message 57097 - Posted: 20 Nov 2008, 12:39:37 UTC - in response to Message 57095. "...and some rhetorical fighting..." :-) Peter

Greg_BE Joined: 30 May 06 Posts: 5756 Credit: 6,051,728 RAC: 675	Message 57177 - Posted: 23 Nov 2008, 8:49:38 UTC Last modified: 23 Nov 2008, 8:51:53 UTC Sorry, wrong area. No actual overrun.

paulcsteiner Joined: 15 Oct 05 Posts: 19 Credit: 3,144,322 RAC: 0	Message 57239 - Posted: 26 Nov 2008, 2:58:48 UTC This one went 263,622.40, which is a bit longer than the 24 hour RT I have set. Also no credit... https://boinc.bakerlab.org/rosetta/workunit.php?wuid=188279776

Dennis Joined: 30 May 06 Posts: 2 Credit: 3,619,161 RAC: 0	Message 57535 - Posted: 3 Dec 2008, 5:56:17 UTC I believe my long-running tasks started when I loaded v6.2.19, which I am running on 5 machines with XP and Vista. I have single-CPU and 2-CPU machines. I have seen the overrun on all of them and have aborted the run. After 50-plus hours I gave up. I am running 24 hr models[?]. So now I will wait only to the 30 hr mark of CPU time. There were 5 or 6 occasions of overrun. On two occasions, after aborting, the value of CPU time shown changed from 30-plus hrs and 40-plus hrs to the mid 20 hrs, which would be normal. I don't know if anyone else has seen this.

Greg_BE Joined: 30 May 06 Posts: 5756 Credit: 6,051,728 RAC: 675	Message 57536 - Posted: 3 Dec 2008, 8:44:32 UTC - in response to Message 57535. "I believe my long-running tasks started when I loaded v6.2.19, which I am running on 5 machines with XP and Vista. I have single-CPU and 2-CPU machines. I have seen the overrun on all of them and have aborted the run. After 50-plus hours I gave up. I am running 24 hr models[?]. So now I will wait only to the 30 hr mark of CPU time. There were 5 or 6 occasions of overrun. On two occasions, after aborting, the value of CPU time shown changed from 30-plus hrs and 40-plus hrs to the mid 20 hrs, which would be normal. I don't know if anyone else has seen this." I had a look at your computers and their tasks. I see only 2 instances where the tasks went over their time: 120,000+ seconds when your preference is for 87,000+ secs. There were some tasks that looked like memory access errors (similar to what I get when my OC speed is set too high), and then you had some tasks that had too many restarts for whatever reason. There were some tasks where the message says to keep the program in memory. Have you set your preferences in BOINC Manager, on the memory tab, to 'leave tasks in memory while suspended'? That will help with the error message about keeping tasks in memory. Be sure to use the BOINC Manager Activity > Suspend function if you are going to have multiple reboots and BOINC Manager is set to start automatically on boot-up. This should help you get steadier run times.
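Since the task pages report CPU time in seconds while the run time preference is usually discussed in hours, a small conversion sketch may help when comparing the two. The helper names are hypothetical, and 120,000 is the rounded figure from the post above rather than an exact task value.

```python
SECONDS_PER_HOUR = 3600


def seconds_to_hours(seconds: float) -> float:
    """Convert CPU seconds, as shown on a result page, to hours."""
    return seconds / SECONDS_PER_HOUR


def hours_to_seconds(hours: float) -> float:
    """Convert a run time preference in hours to seconds."""
    return hours * SECONDS_PER_HOUR


if __name__ == "__main__":
    # A 24 hour preference expressed in seconds.
    print(hours_to_seconds(24))
    # The rounded overrun figure from the post, expressed in hours.
    print(round(seconds_to_hours(120_000), 1))
```

A 24 hour preference works out to 86,400 seconds, so a task at 120,000 seconds has run roughly nine hours past it.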