Rosetta 4.0+



Jim1348

Joined: 19 Jan 06
Posts: 299
Credit: 8,919,650
RAC: 17,219
Message 88534 - Posted: 26 Mar 2018, 5:43:01 UTC - in response to Message 88533.  

I just found Rosetta 4.07 used 2,111,242,240 bytes (1.97 GIGAbytes) before my system crashed (i7-4770K, 8GB).

That is a bit high. The maximum I see for the last two weeks is 1179 MB, and usually less than 700 MB. However, I have 32 GB, so they might as well use it. My other projects (on LHC and GPUGrid Quantum Chemistry) often use more.
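
For reference, the byte count quoted at the top of this post can be converted to binary gigabytes as below. This is only a worked conversion of that one number; the helper name `bytes_to_gib` is made up for illustration.

```python
def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


reported = 2_111_242_240  # peak memory figure quoted in the post
print(f"{bytes_to_gib(reported):.2f} GiB")  # prints "1.97 GiB"
```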
ID: 88534 · Rating: 0
Darrell

Joined: 28 Sep 06
Posts: 25
Credit: 43,130,615
RAC: 39,220
Message 88563 - Posted: 27 Mar 2018, 10:33:41 UTC - in response to Message 88534.  

@ Jim1348

That is a bit high. The maximum I see for the last two weeks is 1179 MB, and usually less than 700 MB. However, I have 32 GB, so they might as well use it. My other projects (on LHC and GPUGrid Quantum Chemistry) often use more.


And on my 32GB computers, I don't mind. I wasn't expecting the 4.07 version to take so much, though. I would like to restrict them to the "big boys" but there doesn't seem to be a way to deselect or select them. Perhaps just limit the tasks to a single CPU on the computers that have only 8GB.
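
One possible way to do that is BOINC's per-project app_config.xml, which can cap how many tasks of one application run at the same time. The sketch below only builds the XML text. The app name `rosetta` is taken from the "using rosetta version 407" lines in the event log further down this thread; the function name is hypothetical, and it is an assumption that the file is placed in the Rosetta project directory on the 8GB machines and the client is told to re-read its config files.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, max_concurrent: int) -> str:
    """Return app_config.xml text limiting one app to max_concurrent tasks."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    if hasattr(ET, "indent"):  # pretty-printing helper, Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Assumed setting: one Rosetta 4.07 task at a time on an 8GB machine.
print(build_app_config("rosetta", 1))
```

This limits only the named application, so tasks from other apps in the same project would still be scheduled as usual.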

LHC often takes more, but they run in a VM on my 32GB machines and so I can manage the load.
ID: 88563 · Rating: 0
Jesse Viviano

Joined: 14 Jan 10
Posts: 41
Credit: 1,458,237
RAC: 2,496
Message 88584 - Posted: 29 Mar 2018, 2:45:07 UTC

Rosetta v4.07 crashed on work unit 886791424. The error message is below:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x756C3EF2
ID: 88584 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 977
Credit: 21,298,556
RAC: 14,994
Message 88618 - Posted: 3 Apr 2018, 17:56:31 UTC

Err... wut?

Just had 27 Rosetta 4.07 tasks cancelled by the server - some that were already running. What happened?!
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Scheduler request completed
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h23_l3_h29_l3_01826_1_loop_100_0001_one_fragments_relax_SAVE_ALL_OUT_565183_3_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h21_l2_h23_l3_09795_1_2_loop_15_0001_one_fragments_fold_SAVE_ALL_OUT_566349_3_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h29_l4_h19_l4_13021_3_loop_29_0001_one_fragments_fold_SAVE_ALL_OUT_567879_1_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h26_l3_h26_l3_18358_1_loop_9_0001_one_fragments_relax_SAVE_ALL_OUT_567511_5_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h27_l2_h30_l2_00615_1_2_loop_256_0001_one_fragments_fold_SAVE_ALL_OUT_553384_111_1 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h24_l4_h25_l4_10118_1_2_loop_6_0001_one_fragments_relax_SAVE_ALL_OUT_566452_9_1 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h24_l2_h19_l3_01986_1_loop_26_0001_one_fragments_relax_SAVE_ALL_OUT_567803_10_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h19_l3_h13_l3_04084_1_2_loop_11_0001_one_fragments_relax_SAVE_ALL_OUT_568253_11_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h17_l4_h17_l2_08702_1_loop_37_0001_one_fragments_relax_SAVE_ALL_OUT_564082_12_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h30_l4_h18_l3_13696_3_loop_20_0001_one_fragments_fold_SAVE_ALL_OUT_564079_13_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h24_l4_h30_l4_17340_1_loop_44_0001_one_fragments_relax_SAVE_ALL_OUT_564389_14_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h26_l4_h16_l2_13839_3_loop_34_0001_one_fragments_fold_SAVE_ALL_OUT_568114_15_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h28_l2_h30_l4_17467_1_loop_50_0001_one_fragments_relax_SAVE_ALL_OUT_566621_18_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h29_l4_h26_l3_06486_3_loop_31_0001_one_fragments_relax_SAVE_ALL_OUT_565395_18_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h28_l4_h23_l4_06512_3_loop_10_0001_one_fragments_fold_SAVE_ALL_OUT_565610_18_1 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h30_l3_h18_l3_08183_4_loop_48_0001_one_fragments_fold_SAVE_ALL_OUT_565853_11_1 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h27_l4_h17_l2_13671_1_loop_1_0001_one_fragments_fold_SAVE_ALL_OUT_568591_21_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h22_l4_h29_l4_11202_2_loop_11_0001_one_fragments_fold_SAVE_ALL_OUT_564102_22_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h19_l3_h26_l3_07687_4_2_loop_27_0001_one_fragments_fold_SAVE_ALL_OUT_554018_210_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h25_l4_h23_l3_18334_3_loop_11_0001_one_fragments_fold_SAVE_ALL_OUT_553950_226_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h23_l4_h30_l4_03328_1_2_loop_21_0001_one_fragments_fold_SAVE_ALL_OUT_554635_175_1 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h17_l4_h13_l4_00511_4_loop_2_0001_one_fragments_relax_SAVE_ALL_OUT_568590_19_1 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h30_l2_h15_l4_10050_1_2_loop_38_0001_one_fragments_fold_SAVE_ALL_OUT_554762_240_1 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h12_l4_h13_l4_00011_2_loop_8_0001_one_fragments_fold_SAVE_ALL_OUT_564155_28_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h27_l2_h19_l3_10893_1_loop_23_0001_one_fragments_fold_SAVE_ALL_OUT_555096_245_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h26_l3_h19_l3_02051_1_loop_37_0001_one_fragments_fold_SAVE_ALL_OUT_554049_246_0 is no longer usable
Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h12_l4_h14_l4_03193_3_2_loop_49_0001_one_fragments_fold_SAVE_ALL_OUT_554560_247_0 is no longer usable
Tuesday 03/04/2018 18:48:39 | Rosetta@home | Computation for task DRH_curve_X_h23_l3_h29_l3_01826_1_loop_100_0001_one_fragments_relax_SAVE_ALL_OUT_565183_3_0 finished
Tuesday 03/04/2018 18:48:39 | Rosetta@home | Computation for task DRH_curve_X_h21_l2_h23_l3_09795_1_2_loop_15_0001_one_fragments_fold_SAVE_ALL_OUT_566349_3_0 finished
Tuesday 03/04/2018 18:48:39 | Rosetta@home | Computation for task DRH_curve_X_h29_l4_h19_l4_13021_3_loop_29_0001_one_fragments_fold_SAVE_ALL_OUT_567879_1_0 finished
Tuesday 03/04/2018 18:48:41 | Rosetta@home | Starting task NTF2chip_2375_relax_SAVE_ALL_OUT_557694_4_0
Tuesday 03/04/2018 18:48:41 | Rosetta@home | [cpu_sched] Starting task NTF2chip_2375_relax_SAVE_ALL_OUT_557694_4_0 using rosetta version 407 in slot 1
Tuesday 03/04/2018 18:48:59 | Rosetta@home | Starting task foldit_2004880_1014_fold_SAVE_ALL_OUT_552030_1010_1
Tuesday 03/04/2018 18:48:59 | Rosetta@home | [cpu_sched] Starting task foldit_2004880_1014_fold_SAVE_ALL_OUT_552030_1010_1 using minirosetta version 378 in slot 2
Tuesday 03/04/2018 18:48:59 | Rosetta@home | Sending scheduler request: To report completed tasks.
Tuesday 03/04/2018 18:48:59 | Rosetta@home | Reporting 27 completed tasks
Tuesday 03/04/2018 18:48:59 | Rosetta@home | Requesting new tasks for CPU
Tuesday 03/04/2018 18:49:02 | Rosetta@home | Starting task NTF2chip_4175_fold_SAVE_ALL_OUT_559494_6_0
Tuesday 03/04/2018 18:49:02 | Rosetta@home | [cpu_sched] Starting task NTF2chip_4175_fold_SAVE_ALL_OUT_559494_6_0 using rosetta version 407 in slot 3
Tuesday 03/04/2018 18:49:04 | Rosetta@home | Scheduler request completed: got 0 new tasks
Tuesday 03/04/2018 18:49:04 | Rosetta@home | No tasks sent
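
A log like the one above can be tallied with a few lines of Python. This is a rough sketch that assumes the event log has been saved as plain text in exactly the line format shown; the function name and the fold/relax grouping are illustrative choices, not part of BOINC.

```python
import re
from collections import Counter

# Matches lines such as:
#   <timestamp> | Rosetta@home | Result <task name> is no longer usable
CANCELLED = re.compile(r"\| Result (?P<name>\S+) is no longer usable\s*$")


def count_cancelled(log_lines):
    """Count server-cancelled results, grouped by fold/relax step."""
    counts = Counter()
    for line in log_lines:
        match = CANCELLED.search(line)
        if not match:
            continue  # ignore "Starting task", "finished", etc.
        name = match.group("name")
        if "_relax_" in name:
            counts["relax"] += 1
        elif "_fold_" in name:
            counts["fold"] += 1
        else:
            counts["other"] += 1
    return counts


sample = [
    "Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h23_l3_h29_l3_01826_1_loop_100_0001_one_fragments_relax_SAVE_ALL_OUT_565183_3_0 is no longer usable",
    "Tuesday 03/04/2018 18:48:13 | Rosetta@home | Result DRH_curve_X_h21_l2_h23_l3_09795_1_2_loop_15_0001_one_fragments_fold_SAVE_ALL_OUT_566349_3_0 is no longer usable",
    "Tuesday 03/04/2018 18:48:41 | Rosetta@home | Starting task NTF2chip_2375_relax_SAVE_ALL_OUT_557694_4_0",
]
print(sum(count_cancelled(sample).values()), "cancelled results in sample")
```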

ID: 88618 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 977
Credit: 21,298,556
RAC: 14,994
Message 88619 - Posted: 3 Apr 2018, 18:04:12 UTC - in response to Message 88618.  

And now 9 and 5 more on 2 other machines.

Looks like all the DRH_curve_X jobs have been aborted... <sigh>
ID: 88619 · Rating: 0
Mod.Sense
Volunteer moderator
Project administrator

Joined: 22 Aug 06
Posts: 3539
Credit: 0
RAC: 0
Message 88620 - Posted: 3 Apr 2018, 19:07:26 UTC

One of the features of the updated server code.

On my Windows machine, they were taking excessive memory. So, that may be reason enough to cancel them.
Rosetta Moderator: Mod.Sense
ID: 88620 · Rating: 0
robertmiles

Joined: 16 Jun 08
Posts: 688
Credit: 9,229,478
RAC: 3,796
Message 88621 - Posted: 3 Apr 2018, 19:13:17 UTC
Last modified: 3 Apr 2018, 19:21:35 UTC

I've had 8 tasks cancelled by server today, when they were, on the average, about half finished. Are you planning to issue any credit for the CPU time they used, or should I think of reducing the share of CPU time I offer to Rosetta@Home?

Most of them used the 32-bit version of 4.07, even though they were running under 64-bit Windows and BOINC. The computer has 32 GB of memory, so the 64-bit version of 4.07 should have been able to give them all enough memory.
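
The 32-bit detail may matter regardless of installed RAM: a 32-bit Windows process can address at most 4 GiB, and by default only 2 GiB of that is available to user-mode code unless the executable is built large-address-aware. Whether the 32-bit 4.07 build is large-address-aware is not stated in this thread, and it is also only an assumption that the 2,111,242,240-byte figure reported earlier came from a 32-bit task. With those caveats, the sketch below compares that figure against both limits.

```python
GIB = 2**30
MIB = 2**20

# User-mode address space available to a 32-bit Windows process.
LIMITS = {
    "default (2 GiB)": 2 * GIB,
    "large-address-aware on 64-bit Windows (4 GiB)": 4 * GIB,
}


def headroom_mib(used_bytes: int, limit_bytes: int) -> float:
    """Remaining address space in MiB before the limit is hit."""
    return (limit_bytes - used_bytes) / MIB


used = 2_111_242_240  # figure reported earlier in the thread
for label, limit in LIMITS.items():
    print(f"{label}: {headroom_mib(used, limit):.1f} MiB left")
```

Under the default limit that leaves roughly 35 MiB, which would be consistent with an out-of-memory failure no matter how much physical memory the machine has.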
ID: 88621 · Rating: 0
aad

Joined: 5 Jan 06
Posts: 5
Credit: 49,746,123
RAC: 133,633
Message 88622 - Posted: 3 Apr 2018, 20:43:13 UTC - in response to Message 88621.  

I've had 8 tasks cancelled by server today, when they were, on the average, about half finished. Are you planning to issue any credit for the CPU time they used, or should I think of reducing the share of CPU time I offer to Rosetta@Home?


Yeah.
I had that too, just a few minutes ago....
I sure hope this is not the new standard...
ID: 88622 · Rating: 0
Jim1348

Joined: 19 Jan 06
Posts: 299
Credit: 8,919,650
RAC: 17,219
Message 88634 - Posted: 4 Apr 2018, 20:37:40 UTC - in response to Message 88619.  

Yes, I had four yesterday on an Ubuntu machine and one today on a Win7 machine that were aborted, a couple after 23+ hours.

But it is better that they kill them if they know they are defective and save what time they can. Some more quality control would be better still.
ID: 88634 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 977
Credit: 21,298,556
RAC: 14,994
Message 88651 - Posted: 8 Apr 2018, 0:28:40 UTC - in response to Message 88620.  
Last modified: 8 Apr 2018, 0:30:47 UTC

One of the features of the updated server code.

On my Windows machine, they were taking excessive memory. So, that may be reason enough to cancel them.

Sure, I've seen tasks cancelled at the server end before - just never so many of all one task-type.

My question was more to ask what went wrong with the batch that they all had to be withdrawn. I hadn't noticed the memory issue, but I've got plenty to spare

My guess is the Rosetta guys were in such a rush to re-supply us with tasks after the recent outage that something major got missed in quality control, and it was only realised when tasks started coming back.

My concern at the time was we might've had another shortage as new tasks were brought down to replace them in our buffers, as no new tasks came down, but that didn't happen. And I can see new DRH_curve_X tasks in my current buffer, so I'm guessing they got fixed and are now fed back through to us.

All's well that ends well.
ID: 88651 · Rating: 0
San-Fernando-Valley

Joined: 16 Mar 16
Posts: 6
Credit: 117,897
RAC: 0
Message 88672 - Posted: 9 Apr 2018, 14:15:49 UTC

... just want to add my 2 cents worth:

Started crunching today after a very long pause (many months).

I noticed that after about 4:00 to 4:30 hours (approx.) elapsed time (4.07 and mini 3.78) there is suddenly an increase of remaining time from 4 hours to approx. 20 hours!

This is happening on at least 3 of my rigs (Win7 64-bit).

Anybody any ideas if this is OK or not?
ID: 88672 · Rating: 0
Mod.Sense
Volunteer moderator
Project administrator

Joined: 22 Aug 06
Posts: 3539
Credit: 0
RAC: 0
Message 88675 - Posted: 9 Apr 2018, 20:46:59 UTC - in response to Message 88672.  

... just want to add my 2 cents worth:

Started crunching today after a very long pause (many months).

I noticed that after about 4:00 to 4:30 hours (approx.) elapsed time (4.07 and mini 3.78) there is suddenly an increase of remaining time from 4 hours to approx. 20 hours!

This is happening on at least 3 of my rigs (Win7 64-bit).

Anybody any ideas if this is OK or not?


It sounds like you may have changed your runtime preference on the R@h website. 24hrs is the highest value allowed. Beyond that, estimated time remaining is not a very reliable indicator. I would not presume any problem based solely on that.
Rosetta Moderator: Mod.Sense
ID: 88675 · Rating: 0
James W

Joined: 25 Nov 12
Posts: 59
Credit: 541,803
RAC: 341
Message 88678 - Posted: 10 Apr 2018, 5:37:24 UTC - in response to Message 88675.  

I agree that estimated time remaining per BOINC is definitely an estimate, in general. On my Windows 7 machine, 4.07 will show approx. 5 hrs. estimated crunch time and mini 3.78 will show approx. 8 hrs. estimated crunch time. However, the 4.07 WUs and 3.78 WUs both end up taking approx. 8 hrs. to complete.
ID: 88678 · Rating: 0
San-Fernando-Valley

Joined: 16 Mar 16
Posts: 6
Credit: 117,897
RAC: 0
Message 88682 - Posted: 10 Apr 2018, 13:48:58 UTC - in response to Message 88675.  


It sounds like you may have changed your runtime preference on the R@h website. 24hrs is the highest value allowed. Beyond that, estimated time remaining is not a very reliable indicator. I would not presume any problem based solely on that.


... haven't changed anything ...
WUs have all but one finished without error.

I sort of find it inappropriate to show an approx. runtime of 4 to 5 hours and then suddenly the darn things increase up to just under 24 hours!!!

I am sure you have cited somewhere how long these WUs run?
ID: 88682 · Rating: 0
rjs5

Joined: 22 Nov 10
Posts: 250
Credit: 8,037,564
RAC: 0
Message 88683 - Posted: 10 Apr 2018, 20:50:42 UTC - in response to Message 88682.  
Last modified: 10 Apr 2018, 20:52:05 UTC


It sounds like you may have changed your runtime preference on the R@h website. 24hrs is the highest value allowed. Beyond that, estimated time remaining is not a very reliable indicator. I would not presume any problem based solely on that.


... haven't changed anything ...
WUs have all but one finished without error.

I sort of find it inappropriate to show an approx. runtime of 4 to 5 hours and then suddenly the darn things increase up to just under 24 hours!!!

I am sure you have cited somewhere how long these WUs run?



Judging by the roughly 86k CPU seconds your WUs are taking, they run as if your preference were set to 24 hours, even though the Rosetta command line says "-cpu_run_time 28800", i.e. 8 hours.
Rosetta seems to be ignoring your 8-hour preference and running for the maximum 24 hours.

Rosetta loops over multiple attempts until the time preference is reached, and then terminates once the attempt in progress has finished. It looks like you are getting credit for the 24 hours, but something appears broken.
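
The looping behaviour described here can be pictured with a small simulation. This is not Rosetta's code, just a sketch of a "keep starting attempts until the time budget is used up" loop; the function and class names are invented, and the 451 seconds per attempt is roughly the average implied by the task output below (86140.1 CPU seconds for 191 decoys).

```python
import time


def run_until_budget(make_decoy, cpu_run_time_s, clock=time.process_time):
    """Start new attempts until the budget is spent.

    The budget is checked only between attempts, so the last attempt
    always runs to completion and the total can overshoot slightly.
    """
    start = clock()
    decoys = 0
    while clock() - start < cpu_run_time_s:
        make_decoy()
        decoys += 1
    return decoys, clock() - start


class FakeClock:
    """Simulated CPU clock so the example finishes instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
decoys, elapsed = run_until_budget(lambda: clock.advance(451.0), 28800, clock)
print(decoys, elapsed)  # prints "64 28864.0"
```

With an 8-hour budget honoured, a loop like this would stop after about 64 attempts; the task below produced 191, which is what suggests the 8-hour value was not the one in effect.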

Task 987505757
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.07_windows_x86_64.exe @P49334_PF04281_0.6_domain1.bnd15.flags -in:file:boinc_wu_zip P49334_PF04281_0.6_domain1.bnd15.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3903853
Starting watchdog...
Watchdog active.
======================================================
DONE :: 1 starting structures 86140.1 cpu seconds
This process generated 191 decoys from 191 attempts
======================================================
BOINC :: WS_max 3.94056e+08
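
To make the mismatch concrete, the numbers in this output can be pulled out and compared. The sketch below assumes the task's stdout is available as text in the format shown above; the function name and dictionary keys are made up for illustration.

```python
import re


def summarize(stdout_text: str) -> dict:
    """Compare the requested run time with the CPU time actually used."""
    budget_s = int(re.search(r"-cpu_run_time (\d+)", stdout_text).group(1))
    cpu_s = float(re.search(r"([\d.]+) cpu seconds", stdout_text).group(1))
    decoys = int(re.search(r"generated (\d+) decoys", stdout_text).group(1))
    return {
        "requested_hours": budget_s / 3600,
        "actual_hours": cpu_s / 3600,
        "actual_over_requested": cpu_s / budget_s,
        "seconds_per_decoy": cpu_s / decoys,
    }


sample_output = """\
command: ... -nstruct 10000 -cpu_run_time 28800 -watchdog ...
DONE :: 1 starting structures 86140.1 cpu seconds
This process generated 191 decoys from 191 attempts
"""
for key, value in summarize(sample_output).items():
    print(f"{key}: {value:.2f}")
```

For this task that works out to 8 requested hours against roughly 23.9 actual hours, about three times the requested run time.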
ID: 88683 · Rating: 0
Conan

Joined: 11 Oct 05
Posts: 141
Credit: 3,132,431
RAC: 0
Message 88694 - Posted: 13 Apr 2018, 7:13:29 UTC

All 32-bit work units, on both this main project and the Ralph test project, fail with the "Can't Create Process" error.

I have checked my anti-virus and it does not appear to be blocking.
Only the "Rosetta" tasks are failing; the "Rosetta Mini" tasks are running fine.

Linux is OK.

Conan
ID: 88694 · Rating: 0
Conan

Joined: 11 Oct 05
Posts: 141
Credit: 3,132,431
RAC: 0
Message 88739 - Posted: 23 Apr 2018, 9:05:33 UTC - in response to Message 88694.  

All 32-bit work units, on both this main project and the Ralph test project, fail with the "Can't Create Process" error.

I have checked my anti-virus and it does not appear to be blocking.
Only the "Rosetta" tasks are failing; the "Rosetta Mini" tasks are running fine.

Linux is OK.

Conan


Any headway on this 32 bit issue?

64 bit on Linux runs fine for both work unit types.

Both Rosetta and Ralph have the same issue with the Rosetta work units.
Rosetta Mini works fine on both projects.

Could it be that the Rosetta work units are not valid Win32 applications, and need to be recompiled as 32-bit because they appear to be 64-bit instead?
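
This guess can be checked directly, because every Windows executable records its target architecture in its PE/COFF header. The sketch below reads that field; the function name is made up, and the path in the usage note is a placeholder for whichever Rosetta executable BOINC downloaded into the project directory.

```python
import struct

# Machine field values defined by the PE/COFF format.
MACHINE_NAMES = {0x014C: "x86 (32-bit)", 0x8664: "x86-64 (64-bit)"}


def pe_machine(data: bytes) -> str:
    """Return the target architecture recorded in a PE file's header."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE executable")
    # Offset 0x3C of the DOS header holds the offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The 2-byte Machine field immediately follows the signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, f"unknown (0x{machine:04X})")


# Usage (path is a placeholder, not a real file name from this thread):
# with open(path_to_rosetta_exe, "rb") as f:
#     print(pe_machine(f.read()))
```

If the file served to a 32-bit Windows XP host reports x86-64, that would support the theory; if it reports x86, the cause lies elsewhere.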

I am running Windows XP 32 Bit on the computer that can't run the Rosetta work unit type.

An answer would be nice as it has been over 10 days now.

thanks
Conan
ID: 88739 · Rating: 0
m

Joined: 2 May 09
Posts: 10
Credit: 1,221,559
RAC: 1,246
Message 88740 - Posted: 23 Apr 2018, 9:32:39 UTC - in response to Message 88739.  

Could be this is your problem (and mine...) but don't hold your breath for a fix.
ID: 88740 · Rating: 0
Conan

Joined: 11 Oct 05
Posts: 141
Credit: 3,132,431
RAC: 0
Message 88743 - Posted: 23 Apr 2018, 12:23:01 UTC - in response to Message 88740.  

Could be this is your problem (and mine...) but don't hold your breath for a fix.


No, I won't be holding my breath, as that report, with the same error I have, is from 1 Feb 2018, so there has been no fix for this 32-bit issue for almost 3 months now.

And all the Ralph work units I have been given that have also failed with the same error have also not been fixed. What is the point of a test project when things are not tested?
No point at all.

Conan
ID: 88743 · Rating: 0
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 912
Credit: 3,464,138
RAC: 4,076
Message 88754 - Posted: 26 Apr 2018, 7:40:42 UTC - in response to Message 88743.  

The screensaver crashes on all "Xy_00" WUs.
ID: 88754 · Rating: 0



©2019 University of Washington
http://www.bakerlab.org