Report stuck & aborted WU here please

[B^S] Paul@home
Joined: 18 Sep 05
Posts: 34
Credit: 393,096
RAC: 0
Message 9444 - Posted: 20 Jan 2006, 10:57:50 UTC - in response to Message 9424.  
Last modified: 20 Jan 2006, 11:00:29 UTC

Hi Phil,


It would be really great if it worked the way it is documented. The fact is that, while not linear in the sense that adding 1 to the DCF will give you 1 additional minute of processing time, adjusting the DCF does in fact increase the time. Depending on the system benchmarks, this adjustment provides varying amounts of increase.


You have been looking at this in more detail than I have and, admittedly, it was 2am when I was trawling code last night, so I do accept what you are saying. I just can't find it in the code (yet!). The only place I can see DCF used is in calculating the client's estimated to-completion time (what you see in BOINC Manager). This figure does not appear to have any relationship to the max allowed CPU time for a work unit.


I think that the sum of the earlier discussion was that the bound value should be increased. I suspect that that is what David (Kim) was looking at. It would seem that this value may have been static while the WU size increased, and that may be the problem.
Regards
Phil


It certainly would! I believe it is quite difficult for them to get a reasonable estimate for the number of fpops in a given WU type, but if they could manage that somehow, they may fix the problem. Perhaps, as David B. suggests, they may be able to run a few of each WU type through a test server to determine an accurate run time. If this was a public server they would not even need to do the work themselves: just set a high fpops_bound in the WU and let them out. The bound value could then be increased or reduced accordingly.


Cheers and have a good weekend (I'm not back at a computer till Monday!)

Paul.


Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 9456 - Posted: 20 Jan 2006, 15:21:24 UTC - in response to Message 9444.  

Hi Phil,
...
You have been looking at this in more detail than I have and, admittedly, it was 2am when I was trawling code last night, so I do accept what you are saying. I just can't find it in the code (yet!). The only place I can see DCF used is in calculating the client's estimated to-completion time (what you see in BOINC Manager). This figure does not appear to have any relationship to the max allowed CPU time for a work unit.


...Cheers and have a good weekend (I'm not back at a computer till Monday!)

Paul.


After looking at this a little further I think I know what is happening. There are really two clocks running down toward the completion of a WU. There is the flops counter, which internally determines when a WU has exceeded the maximum time it will be allowed to run, and there is the time-to-completion clock that is presented to the user through the interface. I am assuming here that everyone would agree that if you have a certain number of flops available, and the system runs at a certain speed, combining the two yields, in effect, a clock. As the WU progresses both of these clocks count down. While I have not verified what happens to BOINC if the completion clock runs out, I can imagine that if it goes to zero or less than zero, this might cause problems unless the condition is trapped and handled.

Clearly the DCF directly adjusts the completion clock. In my testing I have been able to extend the projected run time of a WU by manually adjusting the DCF. This also appears to actually provide additional run time for the WU. But this will only go so far. I have determined that there is an absolute maximum time beyond which you cannot force the system to continue working on a particular WU. Interestingly, this limit is almost exactly the same across WU types. Obviously that limit is the flops clock timing out.

It now looks as though what is happening is that the flops clock can be set to a value that is higher than the calculated completion clock. This makes sense as they are completely separate. It may be that when this completion clock hits zero or drops below zero, BOINC stops the process. By adjusting the DCF, in effect resetting the completion clock to a higher value, you can increase the length of time that it takes for BOINC to count down the completion clock to zero, thus allowing the process to run longer. But the absolute maximum time is still the flops clock. So once the DCF is set to a value higher than the time the flops counter will allow (based on system speed), the WU will fail on max time when it hits the flops limit.

While this theory matches the observed behavior of the system, it would take some looking at the BOINC and R@H code to determine what actually happens if the completion time becomes zero or less than zero.

Regards
Phil
Mistral

Joined: 29 Sep 05
Posts: 1
Credit: 3,568
RAC: 0
Message 9459 - Posted: 20 Jan 2006, 16:16:27 UTC - in response to Message 9456.  

While I have not verified what happens to BOINC if the completion clock runs out, I can imagine that if it goes to zero or less than zero, this might cause problems unless the condition is trapped and handled.


Hi Phil,

Just to add some complexity to your thoughts. As far as Predictor@Home is concerned, the WU's percentage of completion will climb up to 103%, then it goes back to 97% and the WU completes in a few seconds. While the percentage of completion is between 100% and 103%, the "Remaining time" column will only show "---" (i.e. the completion clock has run out), and it will again show a few minutes when this percentage goes back to 97%. But this is normal behaviour for P@H.

Hope this helps you translate the Rosetta stone :-)

Regards
Pierre



Darren
Joined: 6 Oct 05
Posts: 27
Credit: 43,535
RAC: 0
Message 9462 - Posted: 20 Jan 2006, 17:19:01 UTC - in response to Message 9459.  

While I have not verified what happens to BOINC if the completion clock runs out, I can imagine that if it goes to zero or less than zero, this might cause problems unless the condition is trapped and handled.


Hi Phil,

Just to add some complexity to your thoughts. As far as Predictor@Home is concerned, the WU's percentage of completion will climb up to 103%, then it goes back to 97% and the WU completes in a few seconds. While the percentage of completion is between 100% and 103%, the "Remaining time" column will only show "---" (i.e. the completion clock has run out), and it will again show a few minutes when this percentage goes back to 97%. But this is normal behaviour for P@H.

Hope this helps you translate the Rosetta stone :-)

Regards
Pierre





The complexity can be made even greater when you look at how a seti wu handles the completion clock. Seti units run for anywhere from a few minutes (normal seti app) to up to 45 minutes for the new enhanced beta seti app after the completion clock reaches 100%. For seti units, the clock does not go over 100% like it does for the predictor units. Instead, it simply stops at 100% and the remaining time stays at ---, while the work unit continues to run to actual completion.


River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 9466 - Posted: 20 Jan 2006, 18:16:46 UTC - in response to Message 9402.  
Last modified: 20 Jan 2006, 18:22:12 UTC



As I understand, the rsc_fpops_bound sets the absolute maximum time (depending also on the benchmark). See http://boinc.berkeley.edu/work.php. I don't see how DCF affects the max run time.


As I understand it

The DCF is a scaling factor that is used (effectively) to tweak the benchmarks, for example when estimating the run times of WUs to test for EDF mode, etc.

If it is the tweaked benchmark that is used to set the max run time from the max number of ops, then on any one machine the actual max applied will be proportional to the DCF.

If it is the raw benchmark that is applied, then it won't.

Hope that helps. Hope it's right ;-)
kevint

Joined: 8 Oct 05
Posts: 84
Credit: 2,530,451
RAC: 0
Message 9480 - Posted: 20 Jan 2006, 21:02:34 UTC - in response to Message 9426.  

This work unit processed for about 5 hours. When BOINC did its automatic switch to work on another project, this WU aborted.
https://boinc.bakerlab.org/rosetta/result.php?resultid=7095632
I have had this problem with other WU's on other machines.


Hi Kevint,
It's my turn to help people in the same way others help me for the same problem (I think it's the same one...)
This is apparently a known bug. You just have to set "leave applications in memory when preempted" to YES when you're in the section "View or edit general preferences", under "Preferences". You can go there by clicking "Your account" on the Rosetta home page.
Hope this helps!
Godpiou


Ok, this seems to work for now - but now I have a different problem, appears to be near the same - When Boinc does its automatic benchmark I get this problem on a couple of my machines. It does not happen all the time - but it does happen.

1/20/2006 1:56:22 PM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
1/20/2006 1:56:51 PM||Suspending computation and network activity - running CPU benchmarks
1/20/2006 1:56:51 PM|rosetta@home|Pausing result NO_SIM_ANNEAL_BARCODE_30_1mky_251_5400_0 (removed from memory)
1/20/2006 1:56:51 PM|rosetta@home|Pausing result NO_SIM_ANNEAL_BARCODE_30_1r69_251_5402_0 (removed from memory)
1/20/2006 1:56:52 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1mky_251_5400_0 ( - exit code -1073741819 (0xc0000005))
1/20/2006 1:56:52 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1r69_251_5402_0 ( - exit code -1073741819 (0xc0000005))
1/20/2006 1:56:52 PM||request_reschedule_cpus: process exited
1/20/2006 1:56:52 PM|rosetta@home|Computation for result NO_SIM_ANNEAL_BARCODE_30_1mky_251_5400_0 finished
1/20/2006 1:56:52 PM||Running CPU benchmarks
1/20/2006 1:56:53 PM|rosetta@home|Computation for result NO_SIM_ANNEAL_BARCODE_30_1r69_251_5402_0 finished


SETI.USA


Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 9481 - Posted: 20 Jan 2006, 21:43:33 UTC - in response to Message 9480.  

This work unit processed for about 5 hours. When BOINC did its automatic switch to work on another project, this WU aborted.
https://boinc.bakerlab.org/rosetta/result.php?resultid=7095632
I have had this problem with other WU's on other machines.


Hi Kevint,
It's my turn to help people in the same way others help me for the same problem (I think it's the same one...)
This is apparently a known bug. You just have to set "leave applications in memory when preempted" to YES when you're in the section "View or edit general preferences", under "Preferences". You can go there by clicking "Your account" on the Rosetta home page.
Hope this helps!
Godpiou


Ok, this seems to work for now - but now I have a different problem, appears to be near the same - When Boinc does its automatic benchmark I get this problem on a couple of my machines. It does not happen all the time - but it does happen.

1/20/2006 1:56:22 PM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
1/20/2006 1:56:51 PM||Suspending computation and network activity - running CPU benchmarks
1/20/2006 1:56:51 PM|rosetta@home|Pausing result NO_SIM_ANNEAL_BARCODE_30_1mky_251_5400_0 (removed from memory)
1/20/2006 1:56:51 PM|rosetta@home|Pausing result NO_SIM_ANNEAL_BARCODE_30_1r69_251_5402_0 (removed from memory)
1/20/2006 1:56:52 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1mky_251_5400_0 ( - exit code -1073741819 (0xc0000005))
1/20/2006 1:56:52 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1r69_251_5402_0 ( - exit code -1073741819 (0xc0000005))
1/20/2006 1:56:52 PM||request_reschedule_cpus: process exited
1/20/2006 1:56:52 PM|rosetta@home|Computation for result NO_SIM_ANNEAL_BARCODE_30_1mky_251_5400_0 finished
1/20/2006 1:56:52 PM||Running CPU benchmarks
1/20/2006 1:56:53 PM|rosetta@home|Computation for result NO_SIM_ANNEAL_BARCODE_30_1r69_251_5402_0 finished



This is related to the keep-in-memory issue. When the system runs benchmarks, it removes the application from memory (as shown in your messages). The fact that the system is benchmarking does not matter. What does matter is that the app was removed from memory. This has the same effect as an application switch with keep in memory set to no. So of course the WUs abort.


Not a good thing but that is how it happens.

Regards
Phil



We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 9483 - Posted: 20 Jan 2006, 21:58:42 UTC - in response to Message 9462.  

Hi Phil,

Just to add some complexity to your thoughts. As far as Predictor@Home is concerned, the WU's percentage of completion will climb up to 103%, then it goes back to 97% and the WU completes in a few seconds. While the percentage of completion is between 100% and 103%, the "Remaining time" column will only show "---" (i.e. the completion clock has run out), and it will again show a few minutes when this percentage goes back to 97%. But this is normal behavior for P@H.

Hope this helps you translate the Rosetta stone :-)

Regards
Pierre





The complexity can be made even greater when you look at how a seti wu handles the completion clock. Seti units run for anywhere from a few minutes (normal seti app) to up to 45 minutes for the new enhanced beta seti app after the completion clock reaches 100%. For seti units, the clock does not go over 100% like it does for the predictor units. Instead, it simply stops at 100% and the remaining time stays at ---, while the work unit continues to run to actual completion.


Pierre & Darren,

I see the same behavior you describe on those applications. However, the percent complete is used to calculate the running completion clock, and if the percent exceeds 100% the clock will be nulled out. This is actually different from the case where the completion time runs out before the percent reaches 100.

It would be possible to have a lot of nasty things like zero divides going on if the completion clock runs down to a value below zero. Now I have to assume that the BOINC programmers are smart enough to prevent problems like that under normal conditions, but I can imagine a situation where they might figure the completion time would never hit zero because it is calculated from the percent complete. Once the WU hits 100% the completion clock is no longer incremented. If you watch closely on P@H, the clock never hits zero until the precise moment that the WU hits or exceeds 100% complete.

With R@H we have a situation where all kinds of screwy things are going on with the completion clock. The time is actually rising between percent complete changes, and then it jumps all at once to a new value based in part on percent complete. It is possible for an R@H WU to run out of time on the completion clock before the WU actually completes.

So the question is, what happens if the time to completion runs out at say 95% complete? Could this abort a WU? Since I have never seen this kind of clock behavior on any other projects I have nothing to go on. But based on the behavior of all the clocks and percent counters (which are kept by BOINC) we can generally assume that BOINC was not designed to handle whatever R@H is doing as it runs.

Good thoughts though. Keep thinking, we will eventually figure this thing out.

Regards
Phil


AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 9487 - Posted: 20 Jan 2006, 22:35:32 UTC

BOINC isn't doing a count-down on the time. It does increase the current-time variable. It periodically compares the current time to the max time, and if the current time is greater, the WU is aborted. The max time is calculated from the benchmark score and the limit specified by the WU. The DCF is not used in this calculation in the official version of BOINC.

The time-remaining variable is never incremented or decremented. It is periodically recalculated using the original estimated time, the current time, and the corrected speed of the machine. The DCF and the benchmark score are combined to get the corrected speed.
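To make the two calculations described above concrete, here is a rough Python sketch. It is not taken from the BOINC source. The names are made up for illustration (fpops_bound stands for the WU's rsc_fpops_bound mentioned earlier in the thread, fpops_est for the original estimate), and dividing the benchmark by the DCF to get the "corrected speed" is an assumption about how the two are combined.

```python
def max_cpu_seconds(fpops_bound, benchmark_flops):
    """Hard limit: the WU's operation bound divided by the raw benchmark speed.
    The DCF does not appear here, as described in the post above."""
    return fpops_bound / benchmark_flops


def should_abort(cpu_seconds, fpops_bound, benchmark_flops):
    """Periodic check: abort once the accumulated CPU time passes the limit."""
    return cpu_seconds > max_cpu_seconds(fpops_bound, benchmark_flops)


def remaining_seconds(fpops_est, benchmark_flops, dcf, cpu_seconds):
    """Displayed estimate, recalculated from scratch on each call.
    ASSUMPTION: corrected speed = benchmark / DCF (a larger DCF means a slower
    effective machine and therefore a longer estimate)."""
    corrected_flops = benchmark_flops / dcf
    total_estimate = fpops_est / corrected_flops
    return max(0.0, total_estimate - cpu_seconds)
```

Under these assumptions, changing the DCF moves the displayed estimate but leaves the abort threshold where it was, which is the distinction the post is drawing.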
KaptainBlazzed

Joined: 30 Dec 05
Posts: 3
Credit: 969,393
RAC: 0
Message 9491 - Posted: 20 Jan 2006, 23:37:44 UTC

I got this error.


Unrecoverable error for result PRODUCTION_ABINITIO_2chf__250_1035_0 (Maximum CPU time exceeded)

the same goes for PRODUCTION_ABINITIO_2acy__250_1035_0

In total I lost 12 hours of CPU time on these 2 WUs.
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 9492 - Posted: 20 Jan 2006, 23:56:35 UTC - in response to Message 9487.  

BOINC isn't doing a count-down on the time. It does increase the current-time variable. It periodically compares the current time to the max time, and if the current time is greater, the WU is aborted. The max time is calculated from the benchmark score and the limit specified by the WU. The DCF is not used in this calculation in the official version of BOINC.

The time-remaining variable is never incremented or decremented. It is periodically recalculated using the original estimated time, the current time, and the corrected speed of the machine. The DCF and the benchmark score are combined to get the corrected speed.


What you have described makes no sense. Clearly the time to completion DOES decrement. It does this on all of the projects. The CPU time rises as processing moves along. Moreover the completion time decrements in proportion to the percent complete. That is why you can see it rising on R@H WU as they process. The CPU time is rising, but the percent complete is not, so the "To completion" time also rises, until the percent complete finally changes. While I would agree that the DCF is used to determine the "to completion" time, I would disagree that BOINC is not making use of these numbers.
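The rising "to completion" figure described in the paragraph above can be reproduced with a very simple model. The Python sketch below is only an illustration of that observation, assuming the remaining time is extrapolated purely from CPU time and fraction done. The function name is hypothetical and this is not claimed to be what the BOINC client actually computes.

```python
def remaining_from_progress(cpu_seconds, fraction_done):
    """Extrapolate remaining time from progress so far.
    ASSUMPTION: total time = cpu_seconds / fraction_done, so the estimate grows
    with CPU time whenever fraction_done stays put."""
    if fraction_done <= 0.0:
        return None  # nothing to extrapolate from yet
    if fraction_done >= 1.0:
        return 0.0
    return cpu_seconds / fraction_done - cpu_seconds


# Percent complete stuck at 50% while CPU time doubles: the estimate rises.
stuck_early = remaining_from_progress(3600, 0.5)
stuck_later = remaining_from_progress(7200, 0.5)
# Percent complete finally jumps to 75%: the estimate drops all at once.
after_jump = remaining_from_progress(7200, 0.75)
```

In this model the estimate only falls when the reported percentage moves, which matches the jumpy behavior reported for R@H WUs whatever the real cause turns out to be.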

The absolute time for a WU to complete is set by a variable value stored in the WU. But that is an absolute value. Since the project could not possibly have any idea what amount of time the slowest machine might take, there must be a system to make adjustments.

Regards
Phil


KaptainBlazzed

Joined: 30 Dec 05
Posts: 3
Credit: 969,393
RAC: 0
Message 9501 - Posted: 21 Jan 2006, 4:09:40 UTC

Now this one too. I am probably going to abort ALL PRODUCTION_ABINITIO_xxxxxxx WUs.
I cannot waste my CPU time like this!!


Unrecoverable error for result PRODUCTION_ABINITIO_1acf__250_338_0 (Maximum CPU time exceeded)

KaptainBlazzed

Joined: 30 Dec 05
Posts: 3
Credit: 969,393
RAC: 0
Message 9524 - Posted: 21 Jan 2006, 14:20:41 UTC

I aborted this one after 4 1/2 hours, with it only 1% done.


Unrecoverable error for result NO_VARY_OMEGA_2reb_253_1552_0 (aborted via GUI RPC)

Viking69
Joined: 3 Oct 05
Posts: 20
Credit: 5,615,437
RAC: 495
Message 9569 - Posted: 21 Jan 2006, 23:38:11 UTC

1/21/2006 2:42:03 PM|rosetta@home|Unrecoverable error for result DEFAULT_2reb_219_913_1 ( - exit code -1073741819 (0xc0000005))
1/21/2006 2:42:03 PM||request_reschedule_cpus: process exited
1/21/2006 2:42:03 PM|rosetta@home|Computation for result DEFAULT_2reb_219_913_1 finished

This one stopped after 50 minutes.
Hi all you enthusiastic crunchers.....
SarahCorreia

Joined: 11 Dec 05
Posts: 1
Credit: 3,351
RAC: 0
Message 9583 - Posted: 22 Jan 2006, 13:33:54 UTC

1/22/2006 6:43:06 AM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1ogw_251_13924_0 ( - exit code -1073741819 (0xc0000005))
1/22/2006 6:43:06 AM||request_reschedule_cpus: process exited
1/22/2006 6:43:06 AM|rosetta@home|Computation for result NO_SIM_ANNEAL_BARCODE_30_1ogw_251_13924_0 finished

1/21/2006 2:47:29 PM|rosetta@home|Pausing result NO_MORE_RELAX_CYCLES_1n0u_249_8874_0 (removed from memory)
1/21/2006 2:47:31 PM|rosetta@home|Unrecoverable error for result NO_MORE_RELAX_CYCLES_1n0u_249_8874_0 ( - exit code -1073741819 (0xc0000005))
1/21/2006 2:47:33 PM||request_reschedule_cpus: process exited

1/21/2006 7:24:04 AM|rosetta@home|Pausing result NEW_SOFT_CENTROID_PACKING_1n0u_249_8877_0 (removed from memory)
1/21/2006 7:24:07 AM|rosetta@home|Unrecoverable error for result NEW_SOFT_CENTROID_PACKING_1n0u_249_8877_0 ( - exit code -1073741819 (0xc0000005))
1/21/2006 7:24:08 AM||request_reschedule_cpus: process exited
1/21/2006 7:24:08 AM|rosetta@home|Computation for result NEW_SOFT_CENTROID_PACKING_1n0u_249_8877_0 finished

1/21/2006 12:53:22 AM|rosetta@home|Pausing result NO_MORE_RELAX_CYCLES_1n0u_249_6302_0 (removed from memory)
1/21/2006 12:53:23 AM|rosetta@home|Unrecoverable error for result NO_MORE_RELAX_CYCLES_1n0u_249_6302_0 ( - exit code -1073741819 (0xc0000005))
1/21/2006 12:53:24 AM||request_reschedule_cpus: process exited
1/21/2006 12:53:24 AM|rosetta@home|Computation for result NO_MORE_RELAX_CYCLES_1n0u_249_6302_0 finished

1/20/2006 2:32:51 AM|rosetta@home|Pausing result NO_SIM_ANNEAL_BARCODE_30_1ogw_251_4062_0 (removed from memory)
1/20/2006 2:32:53 AM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1ogw_251_4062_0 ( - exit code -1073741819 (0xc0000005))
1/20/2006 2:32:53 AM||request_reschedule_cpus: process exited
1/20/2006 2:32:53 AM|rosetta@home|Computation for result NO_SIM_ANNEAL_BARCODE_30_1ogw_251_4062_0 finished

1/19/2006 12:40:55 PM|rosetta@home|Pausing result PRODUCTION_ABINITIO_1a32__250_1520_0 (removed from memory)
1/19/2006 12:40:56 PM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1a32__250_1520_0 ( - exit code -1073741819 (0xc0000005))
1/19/2006 12:41:00 PM||request_reschedule_cpus: process exited
1/19/2006 12:41:01 PM|rosetta@home|Computation for result PRODUCTION_ABINITIO_1a32__250_1520_0 finished

1/19/2006 4:57:06 AM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1a19A_250_1520_0 ( - exit code -1073741819 (0xc0000005))
1/19/2006 4:57:06 AM||request_reschedule_cpus: process exited
1/19/2006 4:57:06 AM|rosetta@home|Computation for result PRODUCTION_ABINITIO_1a19A_250_1520_0 finished
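For anyone with a long message log, a small script can pick out which failed results had been removed from memory beforehand. The Python sketch below assumes the pipe-separated "time|project|message" layout and the message wording seen in the logs pasted in this thread. The function name is made up, and other client versions may word these messages differently.

```python
import re

# Patterns inferred from the log lines pasted in this thread (an assumption).
PAUSED = re.compile(r"Pausing result (\S+) \(removed from memory\)")
FAILED = re.compile(r"Unrecoverable error for result (\S+) \( - exit code (-?\d+) ")


def failures_after_unload(log_text):
    """Return (result_name, exit_code, was_removed_from_memory) for each failure.
    Expects the log in chronological order."""
    unloaded = set()
    hits = []
    for line in log_text.splitlines():
        parts = line.split("|", 2)  # timestamp | project | message
        if len(parts) != 3:
            continue
        message = parts[2]
        paused = PAUSED.search(message)
        if paused:
            unloaded.add(paused.group(1))
            continue
        failed = FAILED.search(message)
        if failed:
            name, code = failed.group(1), int(failed.group(2))
            hits.append((name, code, name in unloaded))
    return hits
```

Failures flagged True are candidates for the keep-in-memory problem discussed above; failures flagged False crashed without a preceding unload and may have a different cause.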


Darren
Joined: 6 Oct 05
Posts: 27
Credit: 43,535
RAC: 0
Message 9594 - Posted: 22 Jan 2006, 17:11:21 UTC - in response to Message 9583.  

1/21/2006 2:47:29 PM|rosetta@home|Pausing result NO_MORE_RELAX_CYCLES_1n0u_249_8874_0 (removed from memory)
1/21/2006 2:47:31 PM|rosetta@home|Unrecoverable error for result NO_MORE_RELAX_CYCLES_1n0u_249_8874_0 ( - exit code -1073741819 (0xc0000005))
1/21/2006 2:47:33 PM||request_reschedule_cpus: process exited


You need to change your preferences to leave the app in memory. If you go into your online account, you'll find the option under your general preferences. Just change the "leave applications in memory while preempted" setting to "yes".


Viking69
Joined: 3 Oct 05
Posts: 20
Credit: 5,615,437
RAC: 495
Message 9623 - Posted: 23 Jan 2006, 9:16:03 UTC - in response to Message 9594.  

1/21/2006 2:47:29 PM|rosetta@home|Pausing result NO_MORE_RELAX_CYCLES_1n0u_249_8874_0 (removed from memory)
1/21/2006 2:47:31 PM|rosetta@home|Unrecoverable error for result NO_MORE_RELAX_CYCLES_1n0u_249_8874_0 ( - exit code -1073741819 (0xc0000005))
1/21/2006 2:47:33 PM||request_reschedule_cpus: process exited


You need to change your preferences to leave the app in memory. If you go into your online account, you'll find the option under your general preferences. Just change the "leave applications in memory while preempted" setting to "yes".



Yes, I used to do that, but then my PCs would be using 200% of the available RAM (512MB). Even when I increased the swap file to allow this, the performance of the PC suffered. I believe that Rosetta is the only BOINC system that requires this to be enabled, but it affects all the BOINC systems I am running.
Trog Dog
Joined: 25 Nov 05
Posts: 129
Credit: 57,345
RAC: 0
Message 9630 - Posted: 23 Jan 2006, 13:17:27 UTC - in response to Message 9623.  

I believe that Rosetta is the only BOINC system that requires this to be enabled, but it affects all the BOINC systems I am running.


As far as I can work out, World Community Grid also requires this setting on Windows machines, since it uses an earlier version of the Rosetta app.

I'm pretty sure that Climate Prediction wants you to leave the results in memory too.

I'm not prepared to do this on my systems so I don't run CPDN and only run Rosetta and WCG on my Linux boxes.
gpcola

Joined: 31 Dec 05
Posts: 8
Credit: 361,118
RAC: 0
Message 9638 - Posted: 23 Jan 2006, 16:50:23 UTC

Hi, I've had a couple of weird WUs recently:

https://boinc.bakerlab.org/rosetta/result.php?resultid=7113121

This failed with a 'Maximum CPU time exceeded' and it certainly wasted enough CPU cycles in the process, having run for 6.5hrs.

https://boinc.bakerlab.org/rosetta/result.php?resultid=7442399

This one is really strange. It had been sitting at 70% complete for over an hour with the CPU time reading about 5.5 hrs. I was beginning to worry that it was stuck but thought I'd leave it and hope for the best. Soon after, I needed to reboot the machine, and when I next checked its progress the CPU time had dropped to half an hour but it was still at 70% progress! I decided at this point to abort the WU.

Several others have failed with an exit status of '-1073741819 (0xc0000005)' and they all happen to be similar types (PRODUCTION_ABINITIO_xxxxxxx):

https://boinc.bakerlab.org/rosetta/result.php?resultid=7449596
https://boinc.bakerlab.org/rosetta/result.php?resultid=7113161
https://boinc.bakerlab.org/rosetta/result.php?resultid=7113121
https://boinc.bakerlab.org/rosetta/result.php?resultid=7113092
https://boinc.bakerlab.org/rosetta/result.php?resultid=7113091
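A note on the exit status quoted above: the decimal value and the hex value are the same 32-bit number, printed once as a signed integer and once as unsigned. On Windows, 0xC0000005 is the status code for an access violation. The Python sketch below shows the conversion both ways; the helper names are made up for illustration.

```python
def as_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as unsigned (for hex display)."""
    return code & 0xFFFFFFFF


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    return value - 0x100000000 if value & 0x80000000 else value


hex_form = hex(as_unsigned32(-1073741819))
signed_form = as_signed32(0xC0000005)
```

This is handy when searching for the error, since some tools report the negative decimal form and others the hex form.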

XS_Duc
Joined: 30 Dec 05
Posts: 17
Credit: 310,471
RAC: 0
Message 9651 - Posted: 23 Jan 2006, 20:34:21 UTC

I have two to report for the moment:

I just aborted this one, stuck at 1% after more than 7 hours...
https://boinc.bakerlab.org/rosetta/result.php?resultid=7670806

The other one gave a 'Max CPU time exceeded' error; it was crunching for more than 14 hours...
https://boinc.bakerlab.org/rosetta/result.php?resultid=7165518
The weak shall perish...

©2022 University of Washington
https://www.bakerlab.org