Report stuck & aborted WU here please

Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 9038 - Posted: 14 Jan 2006, 20:29:17 UTC
Last modified: 14 Jan 2006, 20:31:42 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=6869420

Never had it before.
Just a computer which couldn't flush (installed R@H last week) because of ZA and I've changed that so it could flush.
Now all of the WU's are like the above.
And couldn't upload any WU's this afternoon.
Reason can be ????
ID: 9038 · Rating: 0
Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 9043 - Posted: 14 Jan 2006, 22:26:27 UTC - in response to Message 9038.  

http://boinc.bakerlab.org/rosetta/result.php?resultid=6869420

Never had it before.
Just a computer which couldn't flush (installed R@H last week) because of ZA and I've changed that so it could flush.
Now all of the WU's are like the above.
And couldn't upload any WU's this afternoon.
Reason can be ????


Update :
Just checked and the downloading seems to have been OK now.
Don't know about the other problem though.

ID: 9043 · Rating: 0
Tibor Futo
Joined: 13 Jan 06
Posts: 1
Credit: 472
RAC: 0
Message 9045 - Posted: 14 Jan 2006, 22:31:47 UTC

1/14/2006 10:08:04 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_BARCODE_30_1dcj_240_4586_0 ( - exit code -1073741819 (0xc0000005))

1/14/2006 11:24:37 PM|rosetta@home|Unrecoverable error for result NO_SIM_ANNEAL_MORE_FRAGS_1hz6_241_4827_0 ( - exit code -1073741819 (0xc0000005))
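
A side note on the exit code in these log lines: -1073741819 is the signed 32-bit reading of 0xC0000005, which Windows defines as STATUS_ACCESS_VIOLATION. A minimal Python sketch of that conversion (the helper name `to_unsigned32` is only for illustration, not part of BOINC):

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


exit_code = -1073741819  # value copied from the log lines above
print(hex(to_unsigned32(exit_code)))  # 0xc0000005
```

This only decodes the number; it says nothing about why the access violation happened.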

Check also: http://boinc.bakerlab.org/rosetta/results.php?userid=50111

And I think these errors are related to project switching. I noticed one time that Rosetta was working fine, then BOINC switched projects, and when it reloaded Rosetta next time, it tried to start then gave the error.

Tibor
ID: 9045 · Rating: 0
Bill Michael
Joined: 25 Oct 05
Posts: 574
Credit: 3,560,485
RAC: 1,970
Message 9047 - Posted: 14 Jan 2006, 23:04:24 UTC - in response to Message 9045.  

And I think these errors are related to project switching. I noticed one time that Rosetta was working fine, then BOINC switched projects, and when it reloaded Rosetta next time, it tried to start then gave the error.


These look like the "application not left in memory" bug; if you have "leave applications in memory when preempted" set to "no", you'll need to set it to "yes" until this bug is exterminated...

ID: 9047 · Rating: 0
godpiou
Joined: 22 Dec 05
Posts: 7
Credit: 1,373
RAC: 0
Message 9068 - Posted: 15 Jan 2006, 4:29:34 UTC

Hi !

Sorry but another error...

|rosetta@home|Unrecoverable error for result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 ( - exit code -1073741819 (0xc0000005))

And I support the hypothesis for project switching causing this type of error. Look at this part of my log:

14-01-06 22:39:34|SETI@home|Restarting result 05oc03ab.24335.11026.429814.1.28_1 using setiathome version 418
14-01-06 22:39:34|rosetta@home|Pausing result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 (removed from memory)
14-01-06 22:39:35|rosetta@home|Unrecoverable error for result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 ( - exit code -1073741819 (0xc0000005))
14-01-06 22:39:35||request_reschedule_cpus: process exited
14-01-06 22:39:35|rosetta@home|Computation for result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 finished
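
For anyone who wants to check their own log for the same pattern, here is a rough Python sketch. It assumes the `timestamp|project|message` layout seen in the lines above; the function name and regular expressions are my own and not anything BOINC provides.

```python
import re

# Patterns based on the message wording in the log excerpt above.
PAUSED = re.compile(r"Pausing result (\S+) \(removed from memory\)")
FAILED = re.compile(r"Unrecoverable error for result (\S+) ")


def results_failed_after_removal(log_lines):
    """Return results that errored out after being removed from memory."""
    removed = set()
    failed = []
    for line in log_lines:
        # Keep only the message part of "timestamp|project|message".
        message = line.split("|", 2)[-1]
        match = PAUSED.search(message)
        if match:
            removed.add(match.group(1))
            continue
        match = FAILED.search(message)
        if match and match.group(1) in removed:
            failed.append(match.group(1))
    return failed


log = [
    "14-01-06 22:39:34|rosetta@home|Pausing result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 (removed from memory)",
    "14-01-06 22:39:35|rosetta@home|Unrecoverable error for result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 ( - exit code -1073741819 (0xc0000005))",
    "14-01-06 22:39:35||request_reschedule_cpus: process exited",
]
print(results_failed_after_removal(log))
```

A result that shows up in the returned list was paused with "removed from memory" and then failed, which is the sequence reported here.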


And...again...hope this helps!

Godpiou
ID: 9068 · Rating: 0
Bill Michael
Joined: 25 Oct 05
Posts: 574
Credit: 3,560,485
RAC: 1,970
Message 9070 - Posted: 15 Jan 2006, 5:31:22 UTC - in response to Message 9068.  

14-01-06 22:39:34|rosetta@home|Pausing result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 (removed from memory)


This is DEFINITELY the known bug. You _MUST_ set "leave applications in memory when preempted" to _YES_!!!

ID: 9070 · Rating: 0
godpiou
Joined: 22 Dec 05
Posts: 7
Credit: 1,373
RAC: 0
Message 9071 - Posted: 15 Jan 2006, 6:26:19 UTC - in response to Message 9070.  

14-01-06 22:39:34|rosetta@home|Pausing result NEW_SOFT_CENTROID_PACKING_1di2_225_9449_1 (removed from memory)


This is DEFINITELY the known bug. You _MUST_ set "leave applications in memory when preempted" to _YES_!!!


Hi !

Sorry Bill,

The correction has been made.

Thanks a lot for this information, which I should have seen...sorry again.

Godpiou
ID: 9071 · Rating: 0
DonutDon
Joined: 23 Sep 05
Posts: 2
Credit: 545,377
RAC: 0
Message 9105 - Posted: 15 Jan 2006, 21:13:37 UTC

01/15/2006 11:37:19|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1who__239_654_0 (app_version download error: couldn't get input files:<file_xfer_error> <file_name>rosetta_4.81_windows_intelx86.exe</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>)
01/15/2006 11:37:19|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1tul__239_2411_0 (app_version download error: couldn't get input files:<file_xfer_error> <file_name>rosetta_4.81_windows_intelx86.exe</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>)
01/15/2006 11:37:19|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1ubi__239_2340_0 (app_version download error: couldn't get input files:<file_xfer_error> <file_name>rosetta_4.81_windows_intelx86.exe</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>)
01/15/2006 11:37:19|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1who__239_646_0 (app_version download error: couldn't get input files:<file_xfer_error> <file_name>rosetta_4.81_windows_intelx86.exe</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>)
01/15/2006 11:37:19|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1gvp__239_2415_0 (app_version download error: couldn't get input files:<file_xfer_error> <file_name>rosetta_4.81_windows_intelx86.exe</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>)
01/15/2006 11:37:19|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_2vik__239_650_0 (app_version download error: couldn't get input files:<file_xfer_error> <file_name>rosetta_4.81_windows_intelx86.exe</file_name> <error_code>-200</error_code> <error_message></error_message></file_xfer_error>)
01/15/2006 11:37:20|rosetta@home|Deferring communication with project for 3 minutes and 44 seconds

It had temporarily backed-off downloading the .exe, but then when the WU files finished downloading, Boinc tried to run them before it finished downloading the .exe.
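
If it helps anyone triage similar lines, the `<file_xfer_error>` fragment in those messages can be pulled apart with a few lines of Python. This is only a sketch based on the log text above; `parse_xfer_error` is a made-up helper, and it does not interpret what the error code means.

```python
import re
import xml.etree.ElementTree as ET


def parse_xfer_error(message):
    """Extract (file_name, error_code) from a <file_xfer_error> fragment, or None."""
    match = re.search(r"<file_xfer_error>.*?</file_xfer_error>", message, re.DOTALL)
    if match is None:
        return None
    node = ET.fromstring(match.group(0))
    return node.findtext("file_name"), int(node.findtext("error_code"))


message = (
    "app_version download error: couldn't get input files:"
    "<file_xfer_error> <file_name>rosetta_4.81_windows_intelx86.exe</file_name> "
    "<error_code>-200</error_code> <error_message></error_message></file_xfer_error>"
)
print(parse_xfer_error(message))
```

Every failed result above names the same file, which fits the description that the .exe itself had not finished downloading.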
ID: 9105 · Rating: 0
Bill Michael
Joined: 25 Oct 05
Posts: 574
Credit: 3,560,485
RAC: 1,970
Message 9106 - Posted: 15 Jan 2006, 22:10:17 UTC - in response to Message 9105.  

It had temporarily backed-off downloading the .exe, but then when the WU files finished downloading, Boinc tried to run them before it finished downloading the .exe.


Yes, this has happened to me before, and it's been reported. It's an annoying BOINC bug. I _thought_ it had been fixed somewhere in the 5.2.x series though...

ID: 9106 · Rating: 0
DonutDon
Joined: 23 Sep 05
Posts: 2
Credit: 545,377
RAC: 0
Message 9116 - Posted: 16 Jan 2006, 0:46:42 UTC - in response to Message 9106.  

Yes, this has happened to me before, and it's been reported. It's an annoying BOINC bug. I _thought_ it had been fixed somewhere in the 5.2.x series though...


It may well have been fixed: I'm still running Boinc 4.45.

ID: 9116 · Rating: 0
Marky-UK
Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 9215 - Posted: 17 Jan 2006, 18:05:42 UTC

This work unit has just hit 12+ hours CPU time and aborted itself.

NO_SIM_ANNEAL_BARCODE_30_2reb_246_2003

WU 5628369, Result ID 7059337
ID: 9215 · Rating: 0
Divide Overflow
Joined: 17 Sep 05
Posts: 82
Credit: 921,382
RAC: 0
Message 9238 - Posted: 18 Jan 2006, 2:19:53 UTC

I just noticed two WU's that ran for just over 9 hours before aborting with the Maximum CPU time exceeded:

1/17/2006 12:20:28 PM|rosetta@home|Aborting result PRODUCTION_ABINITIO_2chf__250_242_0: exceeded CPU time limit 32474.092756
1/17/2006 12:20:28 PM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_2chf__250_242_0 (Maximum CPU time exceeded)

1/17/2006 3:58:41 PM|rosetta@home|Aborting result PRODUCTION_ABINITIO_2vik__250_261_0: exceeded CPU time limit 32474.092756
1/17/2006 3:58:41 PM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_2vik__250_261_0 (Maximum CPU time exceeded)
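
The limit in those lines is given in seconds. A quick Python conversion (the helper name is just illustrative) shows where the "just over 9 hours" comes from:

```python
def format_cpu_time(seconds: float) -> str:
    """Format a CPU time in seconds as hours, minutes and whole seconds."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


cpu_limit = 32474.092756  # value copied from the log lines above
print(format_cpu_time(cpu_limit))  # 9h 01m 14s
```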

Are there more bad batches of WU's out there again?
ID: 9238 · Rating: 0
Darren
Joined: 6 Oct 05
Posts: 27
Credit: 43,535
RAC: 0
Message 9239 - Posted: 18 Jan 2006, 2:26:57 UTC
Last modified: 18 Jan 2006, 2:33:50 UTC

Here's another ABINITIO WU that exceeded the maximum CPU time.


ID: 9239 · Rating: 0
STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 2,776,611
RAC: 874
Message 9267 - Posted: 18 Jan 2006, 12:17:52 UTC - in response to Message 9239.  

Here's another ABINITIO WU that exceeded the maximum CPU time.



Yup, I just had one do the same thing with over 10 hours on it, I was watching it because I didn't think it would make it. The first ABINITIO took under 3 hours to do & it reset the time to completion to under 5 hours.

Then the second ABINITIO stumbled its way to over 10 hours & was only showing 75% completion. I had a feeling it wouldn't make it ...
ID: 9267 · Rating: 0
jpashton
Joined: 4 Oct 05
Posts: 1
Credit: 559,238
RAC: 0
Message 9316 - Posted: 19 Jan 2006, 1:05:42 UTC

Have been getting a lot of these the past few days:

1/18/2006 11:25:43 AM|rosetta@home|Unrecoverable error for result BARCODE_FRAG_30_1ogw_234_9512_0 ( - exit code -1073741819 (0xc0000005))
1/18/2006 11:25:43 AM|rosetta@home|Unrecoverable error for result BARCODE_FRAG_30_2reb_234_9512_0 ( - exit code -1073741819 (0xc0000005))

Usual CPU time is between 1.5 - 2 hours.

I haven't run into any that sit at 1% for hours though, just a lot of computation errors.

My two cents for those that want/need to know...
ID: 9316 · Rating: 0
Divide Overflow
Joined: 17 Sep 05
Posts: 82
Credit: 921,382
RAC: 0
Message 9321 - Posted: 19 Jan 2006, 5:11:13 UTC

Yet another ABINITIO that exceeded the maximum CPU time...

1/18/2006 3:16:11 PM|rosetta@home|Aborting result PRODUCTION_ABINITIO_1fkb__250_452_0: exceeded CPU time limit 32474.092756
1/18/2006 3:16:11 PM|rosetta@home|Unrecoverable error for result PRODUCTION_ABINITIO_1fkb__250_452_0 (Maximum CPU time exceeded)

ID: 9321 · Rating: 0
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,400,906
RAC: 0
Message 9325 - Posted: 19 Jan 2006, 7:12:55 UTC

The fact is a lot of folks are seeing a lot of "Max Time" errors. I get at least 6-7 a day between both my machines. So many that reporting them seems like a waste of time. Since they all fail at 80% to 90% complete, this represents nominally 35 to 40 hours of time lost to the project every day for just my two machines.

This is NOT a BOINC problem, this is a R@H WU problem. I have not once EVER seen this error on any of the other projects I am running. Even SETI WUs used to take longer than the R@H Max time failed WUs I am seeing. The old E@H application used to take 7 hours and 20 min almost to the second and not one failed WU for Max time. Only a few months ago I got a few R@H WUs that ran over 30 hours and completed ok.

NOW if a WU runs longer than 5 or 6 hours on R@H it fails. All that has changed on my systems is the BOINC version and the updated R@H Application with the 1% stall patch. I have not seen one single computation error of any kind on any of the other projects I am working on, for Max time or anything else, so forgive me if I don't see this as a BOINC problem.

The BOINC system is not designed to accommodate a 900%-1000% variation in WU size. It is as simple as that. I have NEVER seen the R@H DCF corrected to allow more time, always less. Eventually this leads to longer WUs failing.

Also there seems to be an absolute limit to the range of CPU times the system will allow for a particular machine. In other words, there is an absolute maximum for the CPU time difference between the shortest and the longest WUs allowed by the system and anything outside the top of that range will fail. The system simply cannot be forced to process beyond that limit, I have tried.

In practical terms this means that there is an absolute limit to the longest WU any particular system can complete successfully, based on the shortest one it has seen. This is why these errors occur on a particular WU on one system but not another. The limit of this range is unique to each system set up. If all you see is long or mid length WUs and the DCF is set to allow that, then the system will work ok. If all you see is short ones and then you get a long one, forget it.

There seems to be a very limited window of systems that can handle almost all of the WUs that they get somewhere in the middle of processing speed. The only solution I can see is for the project to finally recognize that the fix is to limit the maximum difference between the largest WU and the smallest to something more like 100% to 200%. If that means larger WUs for everyone fine, if it means smaller WUs for everyone fine, but that is the short term fix.

Anything else is going to require recoding the application. Now perhaps if you take out the fix for the 1% stalls this limit will go away. It seems to me that when that 1% stall fix was installed the problems on Max time began. As an impact to the progress of the project in terms of lost time the Max time failures far and away exceed the 1% hang issue, and the 20 second WU failures pale by comparison to either of these.

Regards
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
ID: 9325 · Rating: 0
River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 9331 - Posted: 19 Jan 2006, 10:36:09 UTC - in response to Message 9325.  

This is NOT a BOINC problem, this is a R@H WU problem. I have not once EVER seen this error on any of the other projects I am running.


It is an unexpected interaction between BOINC and R@h. The R@h software runs fine on its own; as you correctly say the BOINC software runs fine with every other project (in this regard at least)


The BOINC system is not designed to accommodate a 900%-1000% variation in WU size. It is as simple as that.

Yes, various parts of the BOINC system assume a repeatable mapping from the estimated time to the actual time for a result. These Rosetta WUs break the expected repeatability.

There are actually two different issues here. One is the fact that some R@h WUs skip over some of the work when they predict it will be useless. This is also done on LHC, but it does not happen as often.

The second is that while the current BOINC system allows for a different correction factor for different projects, it does not allow for differing correction factors between different categories of WU within a project. At present Baker et al. are trying out more than a dozen different strategies and this is stretching the BOINC code further than it will go.

Perhaps later BOINC versions will build in different correction factors for different categories of WU - there is in principle a possible demand for similarly wide variation like this from future projects.

That is why I am not so sure as you that it is right to call it solely a R@h issue. We agree however that the initial fix must come from R@h simply as this is the project that first needs such a wide variation.

I have NEVER seen the R@H DCF corrected to allow more time, always less.


I have on LHC. The problem here is that the variation is more severe. When the WU overruns on LHC it is a small enough overrun that the result still completes OK, and the DCF is boosted. On this project the overrun is larger, the WU aborts, and an aborted WU does not adjust the DCF as it is regarded as an error.
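
To make that asymmetry concrete, here is a toy Python model of the behaviour described in this post. It is NOT BOINC's actual algorithm: the function, the "jump up, drift down" rule and the 10% step are assumptions chosen only to illustrate why an aborted result never raises the DCF (duration correction factor).

```python
def update_dcf(dcf, raw_estimate, actual, cpu_limit):
    """Toy DCF update. Returns (new_dcf, status). All rules are hypothetical."""
    if actual > cpu_limit:
        # Aborted result: counted as an error, so the factor is left unchanged.
        return dcf, "aborted"
    # Factor that would have made the raw estimate exactly right.
    target = actual / raw_estimate
    if target > dcf:
        # Overrun that still completed: raise the factor straight away.
        return target, "completed"
    # Result ran short: drift down gradually (assumed 10% step).
    return dcf + 0.1 * (target - dcf), "completed"


CPU_LIMIT = 32474.092756  # limit seen in the logs earlier in this thread

print(update_dcf(1.0, 10000, 12000, CPU_LIMIT))  # small overrun, factor goes up
print(update_dcf(1.0, 10000, 40000, CPU_LIMIT))  # large overrun, aborted, no change
```

In this model a run of short results keeps nudging the factor down, while the long results that would push it back up are exactly the ones that get aborted.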

The only solution I can see is for the project to finally recognize that the fix is to limit the maximum difference between the largest WU and the smallest to something more like 100% to 200%. If that means larger WUs for everyone fine, if it means smaller WUs for everyone fine, but that is the short term fix.

Anything else is going to require recoding the application.


As I understand it, so is getting a more stable run length.

My suggestion is that instead of aiming for a pre-planned number of structures in a run (currently ten structures) the app should "cheat" by aiming for a pre-planned run time +/- say 20%. It would do this by seeing how the time is going at the end of each struct.
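
A rough Python sketch of that idea, using simulated per-structure times instead of a real clock. The function name, the stopping rule and the numbers are all my own assumptions; I haven't seen the Rosetta code, so this is only meant to show the shape of the loop.

```python
def run_until_target(structure_times, target_seconds, tolerance=0.2):
    """Process structures until one more would likely overshoot the time window.

    structure_times: simulated CPU seconds for each structure (hypothetical data).
    Returns (structures_completed, elapsed_seconds).
    """
    elapsed = 0.0
    done = 0
    for seconds in structure_times:
        elapsed += seconds
        done += 1
        average = elapsed / done
        # Check at the end of each structure: would another one push us
        # past the upper edge of the target window?
        if elapsed + average > target_seconds * (1 + tolerance):
            break
    return done, elapsed


# Fast host: many short structures fit into a one-hour target.
print(run_until_target([1000] * 10, 3600))
# Slow host: the same WU stops after a single long structure.
print(run_until_target([3000] * 10, 3600))
```

Either way the run time stays near the target, so the number of structures varies between hosts instead of the CPU time.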


Now perhaps if you take out the fix for the 1% stalls this limit will go away. It seems to me that when that 1% stall fix was installed the problems on Max time began. As an impact to the progress of the project in terms of lost time the Max time failures far and away exceed the 1% hang issue, and the 20 second WU failures pale by comparison to either of these.

I think this is an acute observation but a wrong diagnosis in my opinion.

The 1% fix came at around the same time as the explosion in kinds of work unit. It is the latter that I believe has triggered this problem, combined with the already existing problem of some WU ending early - but I haven't seen the code so I can't say for sure.

River~~
ID: 9331 · Rating: 0
Lee Carre
Joined: 6 Oct 05
Posts: 96
Credit: 79,331
RAC: 0
Message 9339 - Posted: 19 Jan 2006, 11:51:59 UTC

I have a result that hasn't failed or anything yet, but has been going for about 7 hours at 0%.
Normally Rosetta results finish sooner than 7 hours on that host. I'll leave it and see what it does though, because it's a "PRODUCTION" WU, a type I haven't seen before.

the WU name is "PRODUCTION_ABINITIO_1urnA_250_1147" if that helps
ID: 9339 · Rating: 0
Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,400,906
RAC: 0
Message 9344 - Posted: 19 Jan 2006, 13:14:59 UTC - in response to Message 9331.  

Now perhaps if you take out the fix for the 1% stalls this limit will go away. It seems to me that when that 1% stall fix was installed the problems on Max time began. As an impact to the progress of the project in terms of lost time the Max time failures far and away exceed the 1% hang issue, and the 20 second WU failures pale by comparison to either of these.

I think this is an acute observation but a wrong diagnosis in my opinion.

The 1% fix came at around the same time as the explosion in kinds of work unit. It is the latter that I believe has triggered this problem, combined with the already existing problem of some WU ending early - but I haven't seen the code so I can't say for sure.

River~~


River-

You are correct that the two issues cloud one another, but you are wrong that one is not the cause of the other. If WUs were not forced to abort because they take too long (the 1% fix), then they would NOT be aborting because of longer run times (max time exceeded). Take out the abort that stops hung WUs, and you fix the Max time errors. It really is just that simple.

As for rewriting the code: the suggestions for changing the WU run length are all done on the server, not in the client software. Removing the 1% hang solution IS a client-side application fix that would require client app programming. But the size of the WUs is all determined on the server side, so that fix is not as big a deal as you claim; it requires altering some scripts.

The fix for the 1% solution that was implemented was not very elegant, and it is in fact a club where a scalpel was needed. The Max time errors are the result of applying a heavy-handed quick fix to a subtle problem.

The post just ahead of this one is another example of a stuck WU, but it did not happen at 1%, it happened at 0%. These hangs occur all the time on R@H, so something is going on. While aborting the WU stopped the hang, it does not fix the root problem.

Regards
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.
ID: 9344 · Rating: 0



©2020 University of Washington
http://www.bakerlab.org