pd1_graftsheet_41limit tasks keep crashing

Ananas
Joined: 1 Jan 06
Posts: 232
Credit: 752,471
RAC: 0
Message 76965 - Posted: 7 Jul 2014, 6:24:45 UTC
Last modified: 7 Jul 2014, 6:26:07 UTC

Error code 0xc0000005 (protection fault / access violation)

Not only for me; my wingmen don't seem to have any more luck with those.

p.s.: already reported here, sorry, I had not seen this before I posted.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1663
Credit: 6,521,035
RAC: 840
Message 76966 - Posted: 7 Jul 2014, 7:44:16 UTC - in response to Message 76965.  

Error code 0xc0000005 (protection fault / access violation)

Not only for me; my wingmen don't seem to have any more luck with those.

p.s.: already reported here, sorry, I had not seen this before I posted.


+1.
All WUs of this kind crash after a few seconds.
Please, stop this batch

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1663
Credit: 6,521,035
RAC: 840
Message 76971 - Posted: 8 Jul 2014, 5:47:55 UTC - in response to Message 76966.  

Please, stop this batch


Again, a lot of pd2_grafsheet errors (and I kill the downloads of these WUs).
Please, stop this batch, don't waste our time


Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 76972 - Posted: 8 Jul 2014, 8:32:35 UTC - in response to Message 76971.  

Please, stop this batch


Again, a lot of pd2_grafsheet errors (and I kill the downloads of these WUs).
Please, stop this batch, don't waste our time


You are assuming that the scientists can't get any useful data from these failures?

From other reports on the forums, some of the pd1 tasks are failing while others are succeeding. The results may have useful lessons on why some fail and others don't. Also, for most participants the errors are occurring after a matter of seconds, so there is little "wasted" time (and the time you do use is granted credit by the overnight script).


It would be useful though if one of the scientists gave us a reply to say whether the results are useful or just a case of someone missing a decimal point when setting up the batch...

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1663
Credit: 6,521,035
RAC: 840
Message 76974 - Posted: 8 Jul 2014, 9:41:23 UTC - in response to Message 76972.  

It would be useful though if one of the scientists gave us a reply to say whether the results are useful or just a case of someone missing a decimal point when setting up the batch...


If the scientists only want to debug this particular batch, they can use Ralph@home... :-)

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1663
Credit: 6,521,035
RAC: 840
Message 76976 - Posted: 8 Jul 2014, 9:51:22 UTC - in response to Message 76972.  

Also, for most participants the errors are occurring after a matter of seconds, so there is little "wasted" time (and the time you do use is granted credit by the overnight script).


Not only wasted time, but also wasted ADSL bandwidth (every pd1_graftsheet task is 80 MB)
:-(


sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 76977 - Posted: 8 Jul 2014, 12:07:13 UTC
Last modified: 8 Jul 2014, 12:09:11 UTC

Hi all,
I'm running a Linux host and I get some errors too (see the Minirosetta 3.52 thread).
Fortunately for me, even the tasks that errored out ran to completion. I did not get credit at first, but later I found that credit had been allocated to the task itself, which probably means that the job ran completely.

It apparently has to do with some null pointer errors that seem to affect this particular job.

However, I do see cases on the Windows platform, for the same tasks reallocated to me, where the job terminates, sometimes almost as soon as it starts.

For Windows users: have you tried resetting the project so that the Rosetta apps and database are downloaded again? Perhaps we could collect that kind of feedback in this thread; if resetting solves the issue, it might just be the solution.

Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 76978 - Posted: 8 Jul 2014, 15:08:27 UTC - in response to Message 76977.  

I did not get credit at first, but later I found that credit had been allocated to the task itself, which probably means that the job ran completely.


There is an automatic script that runs each night to award credit to tasks that ended in an error. Because the script is a modification to the normal BOINC process, the granted credit only shows up on the task page.


For Windows users: have you tried resetting the project so that the Rosetta apps and database are downloaded again? Perhaps we could collect that kind of feedback in this thread; if resetting solves the issue, it might just be the solution.


It is just a problem with this batch. Other tasks are running fine, so it is unlikely to be a problem with the app or database files. Also the fact that so many Windows users are affected suggests that something is wrong with the task design - these things usually turn out to be that one of the scientists missed a decimal place or left a stray reference in one of the task calculations.



If the scientists only want to debug this particular batch, they can use Ralph@home... :-)


Sorry, I was a little unclear in my earlier comment. I don't think the scientists deliberately released a bad batch. I was trying to point out that the limited results from this batch could still be useful despite the problems.

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1663
Credit: 6,521,035
RAC: 840
Message 76979 - Posted: 8 Jul 2014, 15:58:28 UTC - in response to Message 76978.  

Sorry, I was a little unclear in my earlier comment. I don't think the scientists deliberately released a bad batch. I was trying to point out that the limited results from this batch could still be useful despite the problems.


I have over 300 messages and 400k points on Ralph, so I know what a beta test is. I think, like you, that they didn't release a bad batch deliberately. But I also think that a "stop" is the best solution. After that, pass the code to Ralph and test it extensively.


sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 76980 - Posted: 8 Jul 2014, 16:41:20 UTC
Last modified: 8 Jul 2014, 17:41:17 UTC

I'm making some guesses as to whether things might have improved.

I have an instance of pd1_graftsheet_41limit that ran without errors:
https://boinc.bakerlab.org/rosetta/result.php?resultid=673049941

Apparently the same(?) job errored out when someone else ran it:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=610410753

I guess it is hit and miss for now.

I agree with [VENETO] boboviz: at least reduce the number of pd1_graftsheet_41limit tasks being pushed out, and perhaps fix the issues and beta test the batch.

It's pushing out 80 MB per task, and when a task fails it gets reassigned. If there are hundreds of jobs, that may add up to (tens of) gigabytes of wasted bandwidth.
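
A rough back-of-the-envelope check of that estimate, assuming every failed task costs one full 80 MB download and using decimal gigabytes (1 GB = 1000 MB). The function name and the sample counts are made up for illustration; the real number of failed downloads is not known here.

```python
def wasted_bandwidth_gb(failed_downloads: int, mb_per_task: float = 80.0) -> float:
    """Bandwidth spent on failed tasks, in gigabytes (1 GB = 1000 MB)."""
    return failed_downloads * mb_per_task / 1000.0


# Hypothetical counts of failed downloads, just to show the scale.
for count in (100, 500, 1000):
    print(f"{count} failed downloads -> {wasted_bandwidth_gb(count):.0f} GB")
```

So a few hundred failed downloads would already land in the tens of gigabytes, which is the order of magnitude guessed above.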

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1663
Credit: 6,521,035
RAC: 840
Message 76983 - Posted: 9 Jul 2014, 6:28:33 UTC - in response to Message 76980.  

I agree with [VENETO] boboviz: at least reduce the number of pd1_graftsheet_41limit tasks being pushed out, and perhaps fix the issues and beta test the batch.


I keep receiving a lot of pd1_graftsheet WUs (all errors).
Do the admins read the forum??


krypton
Volunteer moderator
Project developer
Project scientist
Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 76987 - Posted: 9 Jul 2014, 17:45:08 UTC - in response to Message 76983.  

I agree with [VENETO] boboviz: at least reduce the number of pd1_graftsheet_41limit tasks being pushed out, and perhaps fix the issues and beta test the batch.


I keep receiving a lot of pd1_graftsheet WUs (all errors).
Do the admins read the forum??



I contacted the scientist responsible for this particular batch of jobs. He is looking at it now. Thank you for reporting the errors!

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1663
Credit: 6,521,035
RAC: 840
Message 76990 - Posted: 9 Jul 2014, 20:06:57 UTC - in response to Message 76987.  

I contacted the scientist responsible for this particular batch of jobs. He is looking at it now. Thank you for reporting the errors!


Thank you!!


[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1663
Credit: 6,521,035
RAC: 840
Message 76998 - Posted: 12 Jul 2014, 19:11:27 UTC - in response to Message 76987.  

I contacted the scientist responsible for this particular batch of jobs. He is looking at it now. Thank you for reporting the errors!


I keep receiving this kind of WU (with errors). Please, stop it


krypton
Volunteer moderator
Project developer
Project scientist
Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 77002 - Posted: 12 Jul 2014, 21:24:40 UTC

Hi VENETO,
I talked to the researcher who started these jobs... Apparently the failure is "normal" in that the job is trying different parameters, and if a run doesn't pass the filters it doesn't return a result. The issue is that BOINC reports an error when no result is seen after the run, even though the fact that it didn't pass the filter is a result!

Future jobs will behave more nicely and not appear to crash. This batch is almost complete.

Indigo
Joined: 5 Dec 07
Posts: 1
Credit: 133,409
RAC: 0
Message 77013 - Posted: 14 Jul 2014, 18:56:32 UTC

Hi all. This is my code and it is doing exactly what it is supposed to do.
It turns out that BOINC reports an "error" when no protein structure is returned.

The reason many jobs are not returning structures is that I'm using a strategy called "dead-end elimination."
The reason we need massively parallel computing/simulations is that it's impossible to know ahead of time if a single simulation will return the results we need (i.e. "stochastic sampling"). However, after a simulation has run for a while, we can sometimes tell that it's not going anywhere, and it's a waste of everyone's resources to continue it and save the output. In that case the run is stopped so that a new job can be spawned with a different starting point.
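
To picture the idea, here is a toy sketch of abandoning a stochastic run early. It is not the actual Rosetta protocol: the random-walk "score", the checkpoint step, the cutoff, and every name in it are assumptions for illustration only (lower score is treated as better).

```python
import random
from typing import Optional


def run_trajectory(seed: int, steps: int = 200, check_at: int = 50,
                   cutoff: float = 0.0) -> Optional[float]:
    """Toy random-walk 'simulation'.

    Returns the final score, or None if the run was abandoned at the
    checkpoint because its score was worse (higher) than the cutoff.
    """
    rng = random.Random(seed)
    score = 0.0
    for step in range(1, steps + 1):
        score += rng.uniform(-1.0, 1.0)  # stand-in for one sampling move
        if step == check_at and score > cutoff:
            return None  # dead end: stop early and keep no output
    return score


# Each seed plays the role of one job with a different starting point.
results = [run_trajectory(seed) for seed in range(20)]
kept = [score for score in results if score is not None]
print(f"{len(kept)} of {len(results)} trajectories ran to completion")
```

A run that returns None here did its job; it just has no structure to hand back, which is the situation that gets reported as an "error".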

I've been a Rosetta developer and scientist for years, but this is my first time using R@H instead of our own supercomputing clusters. I'm gonna reconfigure my job submission strategy to play nicer with BOINC's point system.


Thanks everybody!
Chris Indigo King
Bakerlab

sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77022 - Posted: 16 Jul 2014, 15:11:54 UTC

hi Indigo,

Thanks for coming out into the open and sharing this info. I'd guess many participants (including me) appreciate all this feedback very much :)

Yup, I think there may be ways to improve the credit system or even minirosetta (especially so that volunteers on the Windows platform get some credit in these 'special case' failures). After all, those participants may have downloaded the 80 MB start files several times only to have the task 'crash' (sometimes almost on startup) :)



Trotador
Joined: 30 May 09
Posts: 108
Credit: 214,424,832
RAC: 19
Message 77023 - Posted: 16 Jul 2014, 20:11:37 UTC

Yes, thank you for posting the explanation, Indigo. It is really appreciated.


sgaboinc
Joined: 2 Apr 14
Posts: 282
Credit: 208,966
RAC: 0
Message 77034 - Posted: 19 Jul 2014, 0:33:19 UTC
Last modified: 19 Jul 2014, 0:41:31 UTC

hi Indigo,

Take a look at this workunit

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=612393502

when a different (Windows) PC runs it, apparently it 'crashed'
https://boinc.bakerlab.org/rosetta/result.php?resultid=675364476

the task gets reassigned to my (Linux) PC
https://boinc.bakerlab.org/rosetta/result.php?resultid=675380858
apparently it generates 108 decoys/models - no errors

just like to bring up some details, just in case it might be interesting or perhaps suggest some problems more than simply not finding structures

LumenDan
Joined: 26 Apr 07
Posts: 3
Credit: 4,949,978
RAC: 558
Message 77045 - Posted: 20 Jul 2014, 9:15:55 UTC
Last modified: 20 Jul 2014, 9:17:59 UTC

It turns out that BOINC reports an "error" when no protein structure is returned.


It is a relief to know that the failed units were not required to continue computation and have in fact contributed to the scientific process in a meaningful way.

From an application programmer's point of view, reporting an error is one thing; generating an access violation (Windows) is another.
"Reason: Access Violation (0xc0000005) at address 0x00757DEB write attempt to address 0x00000000"

The Rosetta application (or boinc core) has definitely crashed when this error occurs and left the operating system to clean up the mess. I don't think a write to a null pointer should ever be considered as normal behaviour and I hope that the fault can be avoided in new batches or future releases of minirosetta.

Please add a null pointer check to the application or create a place holder structure to return when a dead-end calculation terminates to avoid fatal exceptions.
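
As a rough illustration of the placeholder idea (a sketch only; this is not minirosetta's or BOINC's actual output format, and the function name, file names, and JSON fields are all hypothetical), a job could always write an output file and mark a dead-end run explicitly instead of leaving nothing behind:

```python
import json
import os
import tempfile
from typing import Optional


def write_result(path: str, structure: Optional[str]) -> str:
    """Always write an output file; dead-end runs get an explicit status."""
    if structure is None:
        # Nothing passed the filters: record that fact as the result.
        payload = {"status": "filtered_out", "structure": None}
    else:
        payload = {"status": "ok", "structure": structure}
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    return payload["status"]


out_dir = tempfile.mkdtemp()
dead_end_path = os.path.join(out_dir, "dead_end.json")
print(write_result(dead_end_path, None))  # prints "filtered_out"
```

With something along these lines, "no structure found" and "the program crashed" would look different from the outside, because only the crash leaves no output file.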

My personal reaction when I see batches with ongoing failures is to question whether my computer configuration is at fault and whether there is something I need to change to avoid returning bad results. Your responses have certainly put me at ease in that respect :).

Thanks for considering credit allocation for dead-end units. All of mine seem to have failed in the first 30 seconds, so I didn't expect any, but in cases where more substantial computing time has been invested, lack of due credit could steer people away from certain batches.

Best Regards,
LumenDan