Message boards : Number crunching : Report long-running models here
Author | Message |
---|---|
Hubington Send message Joined: 3 Feb 06 Posts: 24 Credit: 127,236 RAC: 0 |
You think you have it bad? What a waste of time and energy this one was. It's all these hombench tasks. It's like I said in the thread where they announced it (https://boinc.bakerlab.org/forum_thread.php?id=4388): this sort of thing should really be part of RALPH, not Rosetta. As I understand it, Rosetta was set up so all us grunts can do the monkey work with the tested and proven applications, while RALPH was for R&D for Rosetta, so they could test new ideas and get them working right before we all grind away at processing it all. If you go to the RALPH home page, the first thing on the page says "RALPH@home is the official alpha test project for Rosetta@home. New application versions, work units, and updates in general will be tested here before being used for production. The goal for RALPH@home is to improve Rosetta@home." |
sswilson Send message Joined: 9 May 08 Posts: 6 Credit: 1,519,259 RAC: 0 |
Rosetta mini 1.40, BOINC 5.10.45, Win XP (fully updated). 1hzh_1s69_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_113_0 (https://boinc.bakerlab.org/rosetta/result.php?resultid=206158706); 1hzh_2jc7_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_73_0 (https://boinc.bakerlab.org/rosetta/result.php?resultid=206146511); 1hzh_1o4r_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_74_0 (https://boinc.bakerlab.org/rosetta/result.php?resultid=206098744); 1hzh_1r6n_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_21_0 (https://boinc.bakerlab.org/rosetta/result.php?resultid=206025047). All of these took much longer than normal and returned very poor granted credit vs. claimed credit. |
sswilson Send message Joined: 9 May 08 Posts: 6 Credit: 1,519,259 RAC: 0 |
BTW... If this is an ongoing issue, this thread should be stickied so that folks know it exists. I only came across it accidentally through a link from another thread. |
googloo Send message Joined: 15 Sep 06 Posts: 133 Credit: 22,677,186 RAC: 4,532 |
205632563 ran long. I suspect it was because of a number of these: recovering checkpoint of tag S_U9X3X_00000001 with id abrelax_rg_state recovering checkpoint of tag S_U9X3X_00000001 with id stage_1 recovering checkpoint of tag S_U9X3X_00000001 with id stage_2 recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_1 recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_2 recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_3 recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_4 recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_5 recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_6 recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_7 recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_8 recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_9 recovering checkpoint of tag S_U9X3X_00000001 with id stage_3_iter1_10 recovering checkpoint of tag S_U9X3X_00000001 with id stage4_kk_1 recovering checkpoint of tag S_U9X3X_00000001 with id stage4_kk_2 recovering checkpoint of tag S_U9X3X_00000001 with id stage4_kk_3 recovering checkpoint of tag S_U9X3X_00000001 with id abrelax_relax Eight of them, actually, which equals the number of decoys it ran. |
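The kind of count googloo describes can be done mechanically rather than by eye. What follows is only a minimal sketch in Python, assuming the task's stderr text has been copied out of the result page and that every recovery shows up as a line of the form `recovering checkpoint of tag <tag> with id <stage>`, as in the post above. The function name `count_checkpoint_recoveries` and the shortened sample text are made up for illustration; they are not part of BOINC or Rosetta.

```python
import re
from collections import Counter

# Matches entries such as:
#   recovering checkpoint of tag S_U9X3X_00000001 with id stage_1
CHECKPOINT_RE = re.compile(r"recovering checkpoint of tag (\S+) with id (\S+)")


def count_checkpoint_recoveries(stderr_text):
    """Return a Counter mapping (tag, stage id) to the number of recoveries seen."""
    # findall with two groups yields (tag, stage) tuples, which Counter tallies.
    return Counter(CHECKPOINT_RE.findall(stderr_text))


# Shortened, hypothetical sample: two recoveries of the same two stages.
sample = (
    "recovering checkpoint of tag S_U9X3X_00000001 with id abrelax_rg_state "
    "recovering checkpoint of tag S_U9X3X_00000001 with id stage_1 "
    "recovering checkpoint of tag S_U9X3X_00000001 with id abrelax_rg_state "
    "recovering checkpoint of tag S_U9X3X_00000001 with id stage_1"
)

counts = count_checkpoint_recoveries(sample)
for (tag, stage), n in sorted(counts.items()):
    print(tag, stage, n)
```

A stage id that appears many times for the same tag suggests the model was restarted from its checkpoints that many times, which would fit a task that runs long.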
Inikurmoma Send message Joined: 12 Oct 08 Posts: 1 Credit: 606,772 RAC: 0 |
I have to report a long task: 09/11/2008 12:01:27 PM|rosetta@home|Starting 1hzh_1wrm_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_172_0 It has been running for 5 hours now and is stuck at 96.779%. |
Hubington Send message Joined: 3 Feb 06 Posts: 24 Credit: 127,236 RAC: 0 |
11/11/2008 06:11:44|rosetta@home|Computation for task 1hzh_2fi9_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_20_0 finished. Run time is supposed to be 3 hours or thereabouts; it took 8 hours 11 mins. Given the similarity in name to the one above, I assume it too went to a snail's pace in the last few percent. |
Pepo Send message Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
On a slow Duron CPU, Mini 1.39 task 1g73A_ZNMP_ABRELAX_tetraR_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1g73A-_4653_6486_0 was interrupted after nearly 10 hours because of "going too long". The preference was increased from 1 to 2 hours during the run, as an attempt to see how the model would cope with such a slow machine. (It is probably not able to finish a decoy at all on such a slow host.) It was checkpointing. During the run, the progress went very fast to some 80-90% and then progressed in 0.1% steps over hours... Peter |
Path7 Send message Joined: 25 Aug 07 Posts: 128 Credit: 61,751 RAC: 0 |
Hello all, Not sure where to post, the Minirosetta v1.40 bug thread or this thread. With a runtime preference of 6 hours, this WU: 1hzh_1mve_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_147_0 has already been running for over 13 hours. From the graphics I can see it is at Model: 1 Step: 52500 and running. I'm glad it is running on a (2-core) machine with RAM: 2813.69 MB and swap space: 5849.91 MB, since the WU uses 400 MB of RAM (peak 437 MB) and 393 MB (peak: 429 MB) of virtual memory. Have a nice day, Path7. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,269,631 RAC: 2,588 |
I already reported this one a few times in the Minirosetta v1.40 bug thread, but that thread may be getting overloaded since it lost my last message. 1hzh_1o9g_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_155_1 It finally ended after about 19.25 CPU hours, since that is over three times the preference of 6 CPU hours. Also, it was very memory hungry compared to the other workunits I run for other projects: a peak of perhaps 296 MB, and I haven't found a way to check how much virtual memory it used. A poor ratio of requested to granted credit, 200/80, but that seems common among workunits with 4704 in their names. I suspect a problem in its debt calculations also, which could explain why it won't yield the CPU core to a workunit from another project at the end of a 2 hour timeslot, even with leave in memory set. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 137 |
I put mine in the Mini Rosetta 1.30 thread. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 137 |
I put mine in the Mini Rosetta 1.30 thread. Was called... 1hzh_2he4_fchbonds_20_30sarel_SAVE_ALL_OUT_4704_262 Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
sarha1 Send message Joined: 23 Sep 05 Posts: 5 Credit: 6,339,735 RAC: 0 |
robertmiles: Granted credit 80 means you were awarded a flat-rate credit after the watchdog shut down the task due to the time consumed. |
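The arithmetic behind the watchdog cutoff mentioned in this exchange can be written down as a quick check. This is only a sketch in Python: the factor of 3 is an assumption taken from robertmiles's description ("over three times the preference"), not a documented Rosetta@home setting, and the function name `exceeds_watchdog` is hypothetical.

```python
def exceeds_watchdog(cpu_hours, pref_hours, factor=3.0):
    """Return True if CPU time is past factor * runtime preference.

    factor=3.0 is an assumption based on robertmiles's post above,
    not a confirmed value from the Rosetta@home application.
    """
    return cpu_hours > factor * pref_hours


# robertmiles: 19.25 CPU hours against a 6 hour preference (limit 18.0 hours).
print(exceeds_watchdog(19.25, 6))

# Path7: 13 hours so far against a 6 hour preference, still under 18.0 hours.
print(exceeds_watchdog(13.0, 6))
```

Under that assumed factor, robertmiles's task is past the limit while Path7's task, at 13 hours, would not have been cut off yet.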
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Default CPU time is 21600 seconds; this ran 3146.078 seconds. https://boinc.bakerlab.org/rosetta/result.php?resultid=207892330 h001b_BOINC_ABRELAX_RANGE_yebf_IGNORE_THE_REST-S25-11-S3-8--h001b-_4769_556_0 Client state Compute error Exit status 1 (0x1) Computer ID 871217 Report deadline 26 Nov 2008 22:35:22 UTC CPU time 3146.078 stderr out <core_client_version>6.2.19</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> recovering checkpoint of tag S_U11X8X_00000001 with id abrelax_rg_state recovering checkpoint of tag S_U11X8X_00000001 with id stage_1 recovering checkpoint of tag S_U11X8X_00000001 with id stage_2 # cpu_run_time_pref: 21600 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_1 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_2 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_3 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_4 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_5 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_6 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_7 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_8 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_9 recovering checkpoint of tag S_U11X8X_00000001 with id stage_3_iter1_10 and this repeats, then this stderr: ERROR: NANs occured in hbonding! ERROR:: Exit from: ..\..\src\core\scoring\hbonds\hbonds_geom.cc line: 763 called boinc_finish </stderr_txt> ]]> Validate state Invalid Claimed credit 21.0970375448934 |
Pepo Send message Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
I've got notifications for 7 more messages between 19:14 and 21:43 UTC, which are missing here. Was the thread spammed, or did a server hiccup happen? Peter |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"I've got notifications for 7 more messages between 19:14 and 21:43 UTC, which are missing here. Was the thread spammed, or did a server hiccup happen?" A mod hid some double posts by me and some rhetorical fighting as well. You didn't miss anything. The last current post is the one below. |
Pepo Send message Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
...and some rhetorical fighting... :-) Peter |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
sorry wrong area - no actual over run |
paulcsteiner Send message Joined: 15 Oct 05 Posts: 19 Credit: 3,144,322 RAC: 0 |
This one went 263,622.40 seconds, which is a bit longer than the 24 hour RT I have set. Also no credit... https://boinc.bakerlab.org/rosetta/workunit.php?wuid=188279776 |
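Run times on the result pages are easier to compare with a runtime preference once they are converted to hours. A small sketch in Python, assuming the 263,622.40 figure above is CPU time in seconds (the usual unit on BOINC result pages) and that the preference is entered in hours; the helper names `seconds_to_hours` and `overrun_ratio` are made up for this example.

```python
def seconds_to_hours(seconds):
    """Convert a CPU time in seconds to hours."""
    return seconds / 3600.0


def overrun_ratio(cpu_seconds, pref_hours):
    """Return how many times the runtime preference the task actually ran."""
    return seconds_to_hours(cpu_seconds) / pref_hours


cpu_seconds = 263622.40   # figure from the post above, assumed to be seconds
pref_hours = 24           # runtime preference stated in the post

run_hours = seconds_to_hours(cpu_seconds)
ratio = overrun_ratio(cpu_seconds, pref_hours)
print(f"{run_hours:.1f} hours, {ratio:.2f}x the preference")
```

The same conversion applies to any other CPU time quoted in seconds in this thread.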
Dennis Send message Joined: 30 May 06 Posts: 2 Credit: 3,619,161 RAC: 0 |
I believe my long-running tasks started when I loaded v6.2.19, which I am running on 5 machines with XP and Vista. I have single-CPU and 2-CPU machines. I have seen the overrun on all of them and have aborted the runs. After 50-plus hours I gave up. I am running 24 hr models[?]. So now I will wait only to the 30 hr mark of CPU time. There were 5 or 6 occasions of overrun. On two occasions, after aborting, the value of CPU time shown changed from the 30-plus hrs and 40-plus hrs to the mid-20 hrs, which would be normal. Don't know if anyone has seen this. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"I believe my long-running tasks started when I loaded v6.2.19, which I am running on 5 machines with XP and Vista. I have single-CPU and 2-CPU machines." I had a look at your computers and their tasks. I see only 2 instances where the tasks went over their time: 120,000+ seconds when your preference is for 87,000+ secs. There were some tasks that looked like memory access errors (similar to what I get when my OC speed is set too high), and then you had some tasks that had too many restarts for whatever reason. For some tasks the message says to keep the program in memory. Have you set your preferences in BOINC Manager on the memory tab to 'leave tasks in memory while suspended'? That will help with the error message about keeping tasks in memory. Be sure to use the BOINC Manager Activity, Suspend function if you are going to have multiple reboots and BOINC Manager is set to start automatically on boot-up. This should help you get more steady run times. |
©2024 University of Washington
https://www.bakerlab.org