I'm getting errors occasionally and need to know where to report them for my version of Rosetta. The error messages have been as follows:
(type is in red:)
rosetta@home 5/26/2006 9:19:51 PM rosetta not responding to screensaver, exiting
rosetta@home 5/26/2006 9:19:51 PM Unrecoverable error.....,etc... (-exit code_ 1 (0xcffffffff)
.....after more dialog, the end result states that the application is terminated.
I have had this same error happen a number of times since I started running Rosetta less than two weeks ago. I think I read on the site somewhere that credit is earned for all work done, so I am not concerned in that regard. I see that there are links provided to report errors for specific versions of Rosetta, but I don't see a link for reporting bugs in the newer clients, which is why I'm using this thread. If you move this to another thread, please let me know where it has been moved for future reference. Thank you.
____________
ID: 17219
Moderator9 Forum moderator Project administrator Joined: Jan 22 06 Posts: 1014 ID: 53254 Credit: 0 RAC: 0
I have started this new thread for reporting version 5.16 issues and problems. The first thread was getting rather long and was taking too long to load.
The original thread is located here, but please start using this thread for your reports.
____________
Moderator9
ROSETTA@home FAQ
Moderator Contact
See the last post in "Report Problems with Rosetta Version 5.16 I"; this is my second error since turning the screensaver back on. It's wuid=18193778.
Result ID 21719438
Name T0283_CONTACTS_MAP_FROM_hom006_535_21537_0
Workunit 18193778
Created 27 May 2006 4:39:53 UTC
Sent 27 May 2006 6:37:21 UTC
Received 27 May 2006 19:14:59 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -1073741811 (0xc000000d)
Computer ID 212252
Report deadline 3 Jun 2006 6:37:21 UTC
CPU time 1213.703125
stderr out <core_client_version>5.4.9</core_client_version>
<message>
- exit code -1073741811 (0xc000000d)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 3172964
</stderr_txt>
Validate state Invalid
Claimed credit 4.80583329460571
Granted credit 0
application version 5.16
ID: 17252
Jose Joined: Mar 28 06 Posts: 820 ID: 69098 Credit: 48,297 RAC: 0
The computing error came after 13,500+ seconds of processing and 6 models (it was working on number 7).
____________
“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato
OK, now there have been three fatal Windows errors since turning the screensaver back on. wuid=18250663. I'm now going to turn the screensaver back off.
tony
Result ID 21780793
Name T0283_CONTACTS_CONSERVATIVE_MAP_FROM_hom006_547_8549_0
Workunit 18250663
Created 27 May 2006 17:01:15 UTC
Sent 27 May 2006 19:15:00 UTC
Received 27 May 2006 22:16:39 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -1073741811 (0xc000000d)
Computer ID 212252
Report deadline 3 Jun 2006 19:15:00 UTC
CPU time 9206.765625
stderr out <core_client_version>5.4.9</core_client_version>
<message>
- exit code -1073741811 (0xc000000d)
</message>
<stderr_txt>
# random seed: 2943352
# cpu_run_time_pref: 28800
</stderr_txt>
Validate state Invalid
Claimed credit 36.4555218363274
Granted credit 0
application version 5.16
ID: 17257
David Baker Forum moderator Project administrator Project developer Project scientist Joined: Sep 17 05 Posts: 637 ID: 122 Credit: 214,854 RAC: 0
Yes. Rom, in analyzing the current error breakdown, thinks that most are associated with the graphics failing. He is testing a solution in which Rosetta keeps going and results get returned even if there is a problem with the graphics.
If you don't mind, I'll email Rom directly to see if there's anything I can do. I'm one of his Alpha testers anyway.
1/8/2005 3:36:18 PM||request_reschedule_cpus: process exited
1/8/2005 3:36:18 PM|rosetta@home|Computation for result T0283_CONTACTS_MAP_FROM_hom006_535_22929_0 finished
1/8/2005 3:36:19 PM|rosetta@home|Started upload of T0283_CONTACTS_MAP_FROM_hom006_535_22929_0_0
1/8/2005 3:36:24 PM|rosetta@home|Finished upload of T0283_CONTACTS_MAP_FROM_hom006_535_22929_0_0
1/8/2005 3:36:24 PM|rosetta@home|Throughput 31466 bytes/sec
1/8/2005 3:54:13 PM|rosetta@home|Deferring communication with project for 1 days, 19 hours, 59 minutes, and 57 seconds
1/8/2005 4:01:56 PM||Insufficient work; requesting more
1/8/2005 4:01:56 PM|LHC@home|Deferring communication with project for 71 weeks, 5 days, 7 hours, 29 minutes, and 28 seconds
1/8/2005 4:54:14 PM|rosetta@home|Deferring communication with project for 1 days, 18 hours, 59 minutes, and 56 seconds
1/8/2005 11:02:00 PM||Insufficient work; requesting more
1/8/2005 11:02:00 PM|LHC@home|Deferring communication with project for 71 weeks, 5 days, 0 hours, 29 minutes, and 24 seconds
1/8/2005 11:54:18 PM|rosetta@home|Deferring communication with project for 1 days, 11 hours, 59 minutes, and 53 seconds
1/9/2005 12:02:00 AM||Insufficient work; requesting more
1/9/2005 12:02:00 AM|LHC@home|Deferring communication with project for 71 weeks, 4 days, 23 hours, 29 minutes, and 24 seconds
1/9/2005 12:30:39 AM||request_reschedule_cpus: project op
1/9/2005 12:30:40 AM|rosetta@home|Sending scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi
1/9/2005 12:30:40 AM|rosetta@home|Requesting 0 seconds of work, returning 1 results
1/9/2005 12:30:42 AM|rosetta@home|Scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
1/9/2005 12:31:17 AM||request_reschedule_cpus: project op
1/9/2005 12:31:19 AM|rosetta@home|Sending scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi
1/9/2005 12:31:19 AM|rosetta@home|Requesting 8640 seconds of work, returning 0 results
1/9/2005 12:31:20 AM|rosetta@home|Scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
1/9/2005 12:31:20 AM|rosetta@home|Message from server: Not sending work - last RPC too recent: 38 sec
1/9/2005 12:31:20 AM|rosetta@home|No work from project
1/9/2005 12:31:21 AM|rosetta@home|Deferring communication with project for 4 minutes and 1 seconds
It says on the homepage that there are 19,000 workunits in the queue yet I cannot get any workunits. 6 hours comp time wasted....argh. Anyone else having this problem???????????
****Edit****
Hmm, I just got work now? This is interesting. I noticed that for some reason workunits get stuck in the status "ready to report" under the Work tab but never actually get reported, even though BOINC has contacted the Rosetta servers. Only after I manually press the Update button will the workunit go through. I am running version 4.45. Any ideas????
____________
ID: 17265
Moderator9 Forum moderator Project administrator Joined: Jan 22 06 Posts: 1014 ID: 53254 Credit: 0 RAC: 0
....Hmm, I just got work now? This is interesting. I noticed that for some reason workunits get stuck in the status "ready to report" under the worktab but never actually get uploaded even though BOINC has contacted rosetta servers. Only after I manually press the update button will the workunit go through. I am running version 4.45. Any ideas????
You might want to upgrade to version 5.4.9 which is the current version for BOINC.
This is interesting. I noticed that for some reason workunits get stuck in the status "ready to report" under the worktab but never actually get uploaded even though BOINC has contacted rosetta servers. Only after I manually press the update button will the workunit go through. I am running version 4.45. Any ideas????
Reporting is done separately from uploading to reduce network traffic on the server side. JM7 (the man who wrote the scheduler) says this:
Results are reported any time the project is contacted for an update. Updates occur at the first of:
1) A result report is due within 24 hours.
2) It has been at least the connect interval since the result completed.
3) (5.4) It is less than the connect interval till the report deadline.
4) Work is needed.
5) A manual update.
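The five conditions above can be read as a simple decision function. What follows is only a rough sketch in Python, not the actual BOINC scheduler code; the names `FinishedResult` and `reason_to_contact` are made up for illustration, and the order of the checks is an assumption.

```python
from dataclasses import dataclass

DAY = 24 * 3600  # seconds


@dataclass
class FinishedResult:
    completed_at: float     # time the result finished (seconds)
    report_deadline: float  # report deadline (seconds)


def reason_to_contact(now, results, connect_interval,
                      work_needed=False, manual_update=False):
    """Return the first rule that triggers a project contact, or None.

    Hypothetical illustration of the five rules quoted above.
    """
    if manual_update:
        return "manual update"
    if work_needed:
        return "work needed"
    for r in results:
        time_left = r.report_deadline - now
        if time_left <= DAY:
            return "report due within 24 hours"
        if now - r.completed_at >= connect_interval:
            return "connect interval elapsed since completion"
        if time_left < connect_interval:
            return "deadline closer than connect interval"
    return None
```

If none of the rules fire, a finished result simply waits as "ready to report" until the next contact, which matches what was described earlier in the thread.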
I am crunching on 21970571 right now. It stops after reaching 1.210% at time step 2833. If I stop boinc and let it restart from the last checkpoint (here: from the beginning), it stops at same step 2833 in model 1. For me this seems reproducible.
Any suggestions, what I can do to produce a reasonable error report, except abort the workunit or wait for the watchdog?
I'd wait at least an hour. In theory the watchdog should terminate it after an hour. This is a good opportunity to see if it really works as it should. In any case, let it run a few hours, and if it really stays stuck at step 2833, abort it; you will still get all the credit for the time crunched.
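For anyone wondering what "stuck for an hour" means in practice, here is a minimal sketch of the idea in Python. It is not the Rosetta watchdog itself; the class name `Watchdog` and the method `observe` are hypothetical, and the one-hour timeout is taken from the post above.

```python
class Watchdog:
    """Flags a task whose step counter has not changed for timeout_s seconds."""

    def __init__(self, timeout_s=3600.0):
        self.timeout_s = timeout_s
        self.last_step = None
        self.last_change = None

    def observe(self, step, now):
        """Record the current step; return True if the task looks stuck."""
        if step != self.last_step:
            # Progress was made, so restart the clock.
            self.last_step = step
            self.last_change = now
            return False
        return now - self.last_change >= self.timeout_s


wd = Watchdog()
wd.observe(2833, now=0)             # first sighting of step 2833
stuck = wd.observe(2833, now=3600)  # same step one hour later
```

A workunit that sits at step 2833 for the whole hour would be flagged, while one that advances even a single step resets the timer.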
____________
I was also wondering why this happens only on this particular computer. It is the only one of my computers that has a localized version of Windows (Slovak language version). Do you think that could be the reason for the screensaver crash?
I am currently crunching JUMP_RELAX_LONGRANGEPAIR_PARALLEL_t285__SAVE_ALL_OUT_548_11268_0 using Rosetta version 5.16
It is at 1% after 2.5 hrs. Boincview tells me that it has 5.25 hrs to complete.
Normally it takes around 2.8 hrs per WU.
It is running on a Dell P3 1G #225837
Wait....
That was weird...
It just went straight to 100% At 2:43
Any Ideas ??
Edit...
I think this is it...
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=18401795
http://boinc.bakerlab.org/rosetta/result.php?resultid=21942605
End Edit...
Thanks
Ian
Quoting from the FAQ:
Depending on how the WU is configured, some may have over 1,500,000 steps in the first model and still not reach 1%. This can take over 5 hours of CPU time. There are a few even larger ones.
I am crunching on 21970571 right now. It stops after reaching 1.210% at time step 2833. If I stop boinc and let it restart from the last checkpoint (here: from the beginning), it stops at same step 2833 in model 1. For me this seems reproducible.
Any suggestions, what I can do to produce a reasonable error report, except abort the workunit or wait for the watchdog?
I'd wait at least an hour. In theory the watchdog should terminate it after an hour. This is a good opportunity to see if it really works as it should. In any case let it run a few hours and if it really keeps stuck at step 2833 abort and you get all the credits for the time crunched.
The watchdog killed the workunit. I have made a backup, so I can rerun it, if that is of any interest.
My problem: the progress percentage jumps; it doesn't advance smoothly
(1%-24%-48%...).
I can live with it, but it would still be cool to have this solved.
This is normal and is described in the FAQ about the runtime preference. The % complete for Rosetta is not as definite or as easy to compute as it is for some other projects. A given WU will run through as many complete models as possible. Given the percentages in your example, once it completed the first model (see the model number on the graphic), it estimated it would get about 3 more models completed before reaching your runtime preference. Each completed model is what the scientists need for their work; what happens within a model is not as important. The additional updates to the % complete were basically added to help diagnose any problems with a given set of WUs.
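To make the jumps concrete, here is a rough sketch in Python of how a per-model estimate like this could be computed. It is only a guess at the general idea, not Rosetta's actual formula; the function name `percent_complete` and the sample numbers are assumptions.

```python
def percent_complete(models_done, cpu_seconds_used, runtime_pref_seconds):
    """Estimate progress from completed models and the runtime preference.

    Hypothetical: assumes later models take about as long as the average so far.
    """
    if models_done == 0:
        return 0.0
    avg_model_seconds = cpu_seconds_used / models_done
    # How many whole models fit into the runtime preference at this pace.
    estimated_total = max(models_done, int(runtime_pref_seconds // avg_model_seconds))
    return min(100.0, 100.0 * models_done / estimated_total)


# With a 4-hour preference and roughly 3500 s per model, about 4 models fit,
# so the figure moves in steps of about a quarter each time a model finishes.
first = percent_complete(1, 3500, 14400)
second = percent_complete(2, 7000, 14400)
```

Because the estimate only changes when a model finishes, the percentage moves in large steps instead of advancing smoothly.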
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com
ID: 17375
Sam Miorelli Joined: Feb 16 06 Posts: 7 ID: 59176 Credit: 654,533 RAC: 170
I have a Prescott-based machine that I run Rosetta, Einstein, LHC, and SETI on. None of the other projects have any problems, but Rosetta is about 50% errors. Today alone I had two:
Unrecoverable error for result HOMOLOG_ABRELAX_hom007_t283__505_33607_1 ( - exit code-1073741811 (0xc000000d))
Unrecoverable error for result JUMP_RELAX_LONGRANGEPAIR_FORCESTRAND_t285__SAVE_ALL_OUT_550_36800_0 ( - exit code-1073741811 (0xc000000d))
It only seems that the errors come up when the screensaver is running, not when it's just running in the background as I do other things. My machine runs two units at a time and has 512MB ram. Anyone have any idea why I'm getting these problems in Rosetta?
____________
Anyone have any idea why I'm getting these problems in Rosetta?
Yup, I have an AMD64 3700 which has the same problems with Ralph 5.12 and 5.16, and Rosetta 5.16. When I run the screensaver and leave the machine alone: Windows fatal error. If I keep working with it so that the screensaver never comes on: successful results. When I turn OFF the screensaver: successful results.
Turn OFF the screensaver. They are aware of this error and have asked for Rom Walton's assistance. I offered Rom my help, which he has not accepted.
tony
PS, I have two other puters which work just fine with the screensaver on.
ID: 17395
Bob Guy Joined: Oct 7 05 Posts: 39 ID: 3119 Credit: 24,895 RAC: 0
Two recent errors because I used the Boinc view graphics button. The WUs complete successfully if I never view the graphics - I have the Boinc screensaver turned off.
My linux machine got its first error: result 22071081:
Wed 31 May 2006 02:42:39 PM CEST|rosetta@home|Unrecoverable error for result JUMP_RELAX_LONGRANGEPAIR_ANTIPARALLEL_t285__SAVE_ALL_OUT_548_29690_0 (process exited with code 131 (0x83))
I'm using boinc version 5.4.9 on x86_64 linux 2.6.15.
Result ID 22071081
Name JUMP_RELAX_LONGRANGEPAIR_ANTIPARALLEL_t285__SAVE_ALL_OUT_548_29690_0
Workunit 18520787
Created 30 May 2006 3:35:03 UTC
Sent 30 May 2006 5:48:44 UTC
Received 31 May 2006 12:47:19 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status 131 (0x83)
Computer ID 70238
Report deadline 6 Jun 2006 5:48:44 UTC
CPU time 6945.15
stderr out
<core_client_version>5.4.9</core_client_version>
<message>
process exited with code 131 (0x83)
</message>
<stderr_txt>
# random seed: 2797211
# cpu_run_time_pref: 10800
No heartbeat from core client for 31 sec - exiting
SIGSEGV: segmentation violation
Stack trace (19 frames):
[0x8836a6b]
[0x884f74c]
[0xffffe500]
[0x860e7a9]
[0x85ff1f8]
[0x809364c]
[0x860ff95]
[0x8610bb0]
[0x87dca0f]
[0x8728e50]
[0x872a6bb]
[0x80a3a75]
[0x85c3a13]
[0x842093e]
[0x85f1ffb]
[0x8496132]
[0x8498c8f]
[0x88aec34]
[0x8048111]
Exiting...
</stderr_txt>
Validate state Invalid
Claimed credit 12.3736072419708
Granted credit 0
application version 5.16
It means that the watchdog function, put in place to catch WUs that get stuck, IS WORKING as it is supposed to. I imagine that this unit, and the others where the watchdog is terminating the processing (there have been a few recently), will be analyzed to see what is causing the watchdog to trigger.
ID: 17425
David Baker Forum moderator Project administrator Project developer Project scientist Joined: Sep 17 05 Posts: 637 ID: 122 Credit: 214,854 RAC: 0
I also had watchdog knock down one of those pdbblast guys: resultid
like XS DUC's.
It seems there is a problem with these WUs:
FRA_t297
Thanks for the heads up--we'll look into this right away
____________
ID: 17440
Bin Qian Forum moderator Project administrator Project developer Project scientist Joined: Jul 13 05 Posts: 33 ID: 18 Credit: 36,897 RAC: 0
Thanks to your helpful messages, we've tracked down the rare bug that's causing this in the code and fixed it. The fix will be included in the next release. Great job all!
Luckily we only sent out 5000 of these bad WUs (a very small number compared to the 120,000 done every day) and about a third of them were affected. You will still get credit for those jobs killed by the watchdog when our credit-grantor runs nightly!
It's been ages since I had another error to report, but this morning I noticed one... never seen that one before.
It means that the watchdog function placed to prevent Wu's stuck in time IS WORKING as it is supposed to do. I imagine that that unit and the ones where the watchdog is terminating the processing ( and there are some few recently) will be analyzed to see what is happening that is causing the watchdog to work.
I had many errors with FRA_t301_hom028_1_LOOPRLX_IGNORE_THE_REST recently. Probably Rosetta needed to use more than 300 MB of RAM and the system ran out of memory. I've got two different error codes: -1073741819 (0xc0000005) and -1073741571 (0xc00000fd).
Results: 22632143 22629987 22628762 22626932 22625240 22624321 22623307 22593482 22585478
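A side note on reading these numbers: the negative exit status and the hex value in parentheses are the same 32-bit code, shown once as a signed integer and once as unsigned hex. Below is a small Python sketch of the conversion. The status names come from the standard Windows NTSTATUS list, not from this thread, and a name only says how the process died, not why (so it neither confirms nor rules out the memory theory).

```python
# Names taken from the standard Windows NTSTATUS list (not from this thread).
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC000000D: "STATUS_INVALID_PARAMETER",
}


def exit_code_to_hex(code):
    """Reinterpret a signed 32-bit exit code as unsigned hex."""
    return f"0x{code & 0xFFFFFFFF:08x}"


def describe(code):
    """Return the signed code, its hex form, and a name if one is known."""
    name = KNOWN_NTSTATUS.get(code & 0xFFFFFFFF, "unknown")
    return f"{code} ({exit_code_to_hex(code)}) {name}"


for code in (-1073741819, -1073741571, -1073741811):
    print(describe(code))
```

The third value in the loop is the code from the screensaver-related errors earlier in the thread.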
Looks like it might be prudent to take this computer off of Rosetta. It has well below the recommended minimum of 500 MB, and it is probably not fair to the project to have all these units abort because of that... Just a thought!
____________
Regards,
Bob P.
Yes, I took it off Rosetta immediately, along with another computer with 256 MB of RAM. New workunits are taking A LOT of RAM. I've seen Rosetta using 375 MB recently. So I recommend that everyone with less than 512 MB be careful.
If you're getting repeated errors like that, perhaps it'd be a good idea to run Ralph on that machine. Let them track down the source of the errors and either correct the code, or have Rosetta instantly fail such WUs and state, "not enough RAM for this WU." Someday, we'll hopefully have a client that will tell the server how much RAM we have on our machine, and get WUs that will run with that amount of RAM.
____________
My air-cooled dual core AMD is slowly dying. Is the data still good in this unit? It bothers me that a bad machine might poison the result. How did the following WU validate with those errors? Was it restarting from an earlier checkpoint?
Result ID 23165576
Name t304__CASP7_ABRELAX_SAVE_ALL_OUT_cterm2_hom001__654_16007_0
Workunit 19503478
Created 7 Jun 2006 11:57:47 UTC
Sent 7 Jun 2006 13:27:18 UTC
Received 7 Jun 2006 19:29:58 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 241725
Report deadline 14 Jun 2006 13:27:18 UTC
CPU time 14133.575291
stderr out
Exiting...
# cpu_run_time_pref: 14400
# DONE :: 1 starting structures built 20 (nstruct) times
# This process generated 20 decoys from 20 attempts
</stderr_txt>
Validate state Valid
ID: 17993
Moderator9 Forum moderator Project administrator Joined: Jan 22 06 Posts: 1014 ID: 53254 Credit: 0 RAC: 0
My air-cooled dual core AMD is slowly dying. Is the data still good in this unit? It bothers me that a bad machine might poison the result. How did the following WU validate with those errors? Was it restarting from an earlier checkpoint?
...
Validation is not the same problem here as it might be on other BOINC projects. Any model you create successfully is valuable. That model need not match (in most cases it won't) any other models from any other computer. The validation will be made against the known structure of the protein. Eventually when the success rate is sufficiently high for matching known structures, then rosetta could be used for predictions. The prediction capability is what is being tested in CASP.
Since I started running 5.16, my RAC on one of my computers has dropped from 250 to 225. Can anyone shed any light on what to look for as to why? This computer is a 2500+ Barton with 1 GB of memory, and I have a 1.6 Duron with 512 MB of memory outproducing it. Thanks in advance.
____________ Visit us at Christianboards.org
I noticed my RAC fall from 256 to 225 when I ran Ralph. (Ralph scores are separate from Rosetta.)
____________
ID: 18060
Moderator9 Forum moderator Project administrator Joined: Jan 22 06 Posts: 1014 ID: 53254 Credit: 0 RAC: 0
I noticed my RAC fall from 256 to 225 when I ran Ralph. (Ralph scores are separate from Rosetta.)
Moreover, RALPH scores can be zeroed out at any time. RALPH is a test project and scores have no meaning there. If you are running RALPH, it will reduce the RAC for any other projects run on that machine. But running RALPH is VERY helpful to the project. If you are interested in helping create the next version of the science application and are not concerned about credits, please join us over on RALPH.
Current testing includes a new Mac Intel version of Rosetta.
ID: 18138
Jose Joined: Mar 28 06 Posts: 820 ID: 69098 Credit: 48,297 RAC: 0
t299__CASP7_ABRELAX_SAVE_ALL_OUT_nterm_nohelix3_hom008__643_2630_0
Exit status -2147483645 (0x80000003)
CPU time 8925.28125
stderr out <core_client_version>5.5.0</core_client_version>
<message>
One or more arguments are invalid (0x80000003) - exit code -2147483645 (0x80000003)
</message>
<stderr_txt>
# random seed: 2367371
# cpu_run_time_pref: 14400
ID: 18227
Moderator9 Forum moderator Project administrator Joined: Jan 22 06 Posts: 1014 ID: 53254 Credit: 0 RAC: 0
With the release of the latest version 5.22 of Rosetta, a new thread for reporting problems has been opened here. Please continue to report version 5.16 problems in this thread. But if the issue is with version 5.22 please report in the new thread.
It got stuck somehow yesterday; I didn't see it until this morning...
The watchdog didn't kick in either, so I had to abort it myself. I lost a lot of time on this one.