Rosetta@home

Report Problems with Rosetta Version 5.16 II

rdickjune

Joined: May 15 06
Posts: 5
ID: 82710
Credit: 5,529
RAC: 0
Message 17219 - Posted 27 May 2006 5:26:27 UTC

I'm getting errors occasionally and need to know where to report them for my version of Rosetta. The error messages have been as follows:

(type is in red:)
rosetta@home 5/26/2006 9:19:51 PM rosetta not responding to screensaver, exiting
rosetta@home 5/26/2006 9:19:51 PM Unrecoverable error.....,etc... (-exit code_ 1 (0xcffffffff)

.....after more dialog, the end result states that the application is terminated.

I have had this same error happen a number of times since I started running Rosetta (less than two weeks ago). I think I read on the site somewhere that credit is earned for all work done. I am not concerned in this regard. I see that there are links provided to report errors for specific versions of Rosetta. I don't see a link for the newer clients to report bugs, which is why I'm using this thread. If you move this to another thread, please let me know where it has been moved for future reference. Thank you.

____________

Moderator9
Forum moderator
Project administrator

Joined: Jan 22 06
Posts: 1014
ID: 53254
Credit: 0
RAC: 0
Message 17221 - Posted 27 May 2006 5:45:08 UTC
Last modified: 27 May 2006 5:52:49 UTC

I have started this new thread for reporting version 5.16 issues and problems. The first thread was getting rather long and was taking too long to load.

The original thread is located here, but please start using this thread for your reports.

____________
Moderator9
ROSETTA@home FAQ
Moderator Contact

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 17252 - Posted 27 May 2006 19:19:11 UTC

See the last post in "Report Problems with Rosetta Version 5.16 I"; this is my second error since turning the screensaver back on. It's wuid=18193778

Result ID 21719438
Name T0283_CONTACTS_MAP_FROM_hom006_535_21537_0
Workunit 18193778
Created 27 May 2006 4:39:53 UTC
Sent 27 May 2006 6:37:21 UTC
Received 27 May 2006 19:14:59 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -1073741811 (0xc000000d)
Computer ID 212252
Report deadline 3 Jun 2006 6:37:21 UTC
CPU time 1213.703125
stderr out <core_client_version>5.4.9</core_client_version>
<message>
- exit code -1073741811 (0xc000000d)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800
# random seed: 3172964

</stderr_txt>


Validate state Invalid
Claimed credit 4.80583329460571
Granted credit 0
application version 5.16
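
The exit statuses in these reports are signed 32-bit integers, and the hex value in parentheses is the same number read as unsigned. A minimal Python sketch of that conversion follows. The lookup table covers only three Windows status codes quoted in this thread, and the function names are illustrative, not part of BOINC or Rosetta.

```python
# Standard Windows NTSTATUS names for three codes reported in this thread.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def to_unsigned32(exit_status: int) -> int:
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return exit_status & 0xFFFFFFFF


def describe(exit_status: int) -> str:
    """Format an exit status the way the result pages show it, plus a name."""
    code = to_unsigned32(exit_status)
    name = KNOWN_STATUS.get(code, "unknown")
    return f"{exit_status} (0x{code:08x}) {name}"


if __name__ == "__main__":
    for status in (-1073741811, -1073741819, -1073741571):
        print(describe(status))
```

The name only says how the process died, not why, so it is a starting point for a report rather than a diagnosis.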

Jose

Joined: Mar 28 06
Posts: 820
ID: 69098
Credit: 48,297
RAC: 0
Message 17255 - Posted 27 May 2006 20:35:54 UTC
Last modified: 27 May 2006 20:39:22 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=18032663

The computing error came after 13,500+ seconds of processing and 6 models (it was working on number 7).
____________
“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 17257 - Posted 27 May 2006 22:19:40 UTC

OK, now there are three fatal Windows errors since turning the screensaver back on. wuid=18250663. I'm now going to turn the screensaver back off.

tony


Result ID 21780793
Name T0283_CONTACTS_CONSERVATIVE_MAP_FROM_hom006_547_8549_0
Workunit 18250663
Created 27 May 2006 17:01:15 UTC
Sent 27 May 2006 19:15:00 UTC
Received 27 May 2006 22:16:39 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -1073741811 (0xc000000d)
Computer ID 212252
Report deadline 3 Jun 2006 19:15:00 UTC
CPU time 9206.765625
stderr out <core_client_version>5.4.9</core_client_version>
<message>
- exit code -1073741811 (0xc000000d)
</message>
<stderr_txt>
# random seed: 2943352
# cpu_run_time_pref: 28800

</stderr_txt>


Validate state Invalid
Claimed credit 36.4555218363274
Granted credit 0
application version 5.16

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 637
ID: 122
Credit: 214,854
RAC: 0
Message 17260 - Posted 28 May 2006 0:07:32 UTC - in response to Message ID 17257.

OK, now there are three fatal Windows errors since turning the screensaver back on. wuid=18250663. I'm now going to turn the screensaver back off.

tony


Result ID 21780793
Name T0283_CONTACTS_CONSERVATIVE_MAP_FROM_hom006_547_8549_0
Workunit 18250663
Created 27 May 2006 17:01:15 UTC
Sent 27 May 2006 19:15:00 UTC
Received 27 May 2006 22:16:39 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -1073741811 (0xc000000d)
Computer ID 212252
Report deadline 3 Jun 2006 19:15:00 UTC
CPU time 9206.765625
stderr out <core_client_version>5.4.9</core_client_version>
<message>
- exit code -1073741811 (0xc000000d)
</message>
<stderr_txt>
# random seed: 2943352
# cpu_run_time_pref: 28800

</stderr_txt>


Validate state Invalid
Claimed credit 36.4555218363274
Granted credit 0
application version 5.16


Yes. Rom, in analyzing the current error breakdown, thinks that most are associated with the graphics failing. He is testing a solution in which Rosetta keeps going and results get returned even if there is a problem with the graphics.

____________

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 17262 - Posted 28 May 2006 0:56:03 UTC - in response to Message ID 17260.

Yes. Rom, in analyzing the current error breakdown, thinks that most are associated with the graphics failing. He is testing a solution in which Rosetta keeps going and results get returned even if there is a problem with the graphics.

If you don't mind, I'll email Rom directly to see if there's anything I can do. I'm one of his Alpha testers anyway.

tony

senatoralex85

Joined: Sep 27 05
Posts: 66
ID: 1329
Credit: 169,644
RAC: 0
Message 17265 - Posted 28 May 2006 5:34:53 UTC
Last modified: 28 May 2006 5:38:55 UTC

1/8/2005 3:36:18 PM||request_reschedule_cpus: process exited
1/8/2005 3:36:18 PM|rosetta@home|Computation for result T0283_CONTACTS_MAP_FROM_hom006_535_22929_0 finished
1/8/2005 3:36:19 PM|rosetta@home|Started upload of T0283_CONTACTS_MAP_FROM_hom006_535_22929_0_0
1/8/2005 3:36:24 PM|rosetta@home|Finished upload of T0283_CONTACTS_MAP_FROM_hom006_535_22929_0_0
1/8/2005 3:36:24 PM|rosetta@home|Throughput 31466 bytes/sec
1/8/2005 3:54:13 PM|rosetta@home|Deferring communication with project for 1 days, 19 hours, 59 minutes, and 57 seconds
1/8/2005 4:01:56 PM||Insufficient work; requesting more
1/8/2005 4:01:56 PM|LHC@home|Deferring communication with project for 71 weeks, 5 days, 7 hours, 29 minutes, and 28 seconds
1/8/2005 4:54:14 PM|rosetta@home|Deferring communication with project for 1 days, 18 hours, 59 minutes, and 56 seconds
1/8/2005 11:02:00 PM||Insufficient work; requesting more
1/8/2005 11:02:00 PM|LHC@home|Deferring communication with project for 71 weeks, 5 days, 0 hours, 29 minutes, and 24 seconds
1/8/2005 11:54:18 PM|rosetta@home|Deferring communication with project for 1 days, 11 hours, 59 minutes, and 53 seconds
1/9/2005 12:02:00 AM||Insufficient work; requesting more
1/9/2005 12:02:00 AM|LHC@home|Deferring communication with project for 71 weeks, 4 days, 23 hours, 29 minutes, and 24 seconds
1/9/2005 12:30:39 AM||request_reschedule_cpus: project op
1/9/2005 12:30:40 AM|rosetta@home|Sending scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi
1/9/2005 12:30:40 AM|rosetta@home|Requesting 0 seconds of work, returning 1 results
1/9/2005 12:30:42 AM|rosetta@home|Scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
1/9/2005 12:31:17 AM||request_reschedule_cpus: project op
1/9/2005 12:31:19 AM|rosetta@home|Sending scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi
1/9/2005 12:31:19 AM|rosetta@home|Requesting 8640 seconds of work, returning 0 results
1/9/2005 12:31:20 AM|rosetta@home|Scheduler request to http://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
1/9/2005 12:31:20 AM|rosetta@home|Message from server: Not sending work - last RPC too recent: 38 sec
1/9/2005 12:31:20 AM|rosetta@home|No work from project
1/9/2005 12:31:21 AM|rosetta@home|Deferring communication with project for 4 minutes and 1 seconds
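
The log lines above are pipe-delimited (timestamp, project, message), so the deferral periods can be pulled out mechanically. The following is a rough Python sketch under that assumption. The function names are made up for illustration, and the timestamp format is inferred from the lines in this post only.

```python
import re
from datetime import datetime, timedelta

# Seconds per unit, for the units that appear in "Deferring communication" messages.
UNIT_SECONDS = {"week": 604800, "day": 86400, "hour": 3600, "minute": 60, "second": 1}


def parse_line(line: str):
    """Split 'timestamp|project|message' into a datetime and two strings."""
    stamp, project, message = line.split("|", 2)
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message


def deferral(message: str):
    """Return the deferral as a timedelta, or None for any other message."""
    if not message.startswith("Deferring communication"):
        return None
    total = 0
    for amount, unit in re.findall(r"(\d+) (week|day|hour|minute|second)s?", message):
        total += int(amount) * UNIT_SECONDS[unit]
    return timedelta(seconds=total)


if __name__ == "__main__":
    line = ("1/9/2005 12:31:21 AM|rosetta@home|"
            "Deferring communication with project for 4 minutes and 1 seconds")
    when, project, message = parse_line(line)
    print(when, project, deferral(message))
```

Lines with an empty project field (two pipes in a row) parse the same way and simply yield an empty project string.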

It says on the homepage that there are 19,000 workunits in the queue yet I cannot get any workunits. 6 hours comp time wasted....argh. Anyone else having this problem???????????

****Edit****

Hmm, I just got work now? This is interesting. I noticed that for some reason workunits get stuck in the status "ready to report" under the worktab but never actually get uploaded even though BOINC has contacted rosetta servers. Only after I manually press the update button will the workunit go through. I am running version 4.45. Any ideas????
____________

Moderator9
Forum moderator
Project administrator

Joined: Jan 22 06
Posts: 1014
ID: 53254
Credit: 0
RAC: 0
Message 17266 - Posted 28 May 2006 6:16:36 UTC - in response to Message ID 17265.

....Hmm, I just got work now? This is interesting. I noticed that for some reason workunits get stuck in the status "ready to report" under the worktab but never actually get uploaded even though BOINC has contacted rosetta servers. Only after I manually press the update button will the workunit go through. I am running version 4.45. Any ideas????

You might want to upgrade to version 5.4.9 which is the current version for BOINC.
____________
Moderator9
ROSETTA@home FAQ
Moderator Contact

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 17270 - Posted 28 May 2006 10:04:18 UTC - in response to Message ID 17265.

This is interesting. I noticed that for some reason workunits get stuck in the status "ready to report" under the worktab but never actually get uploaded even though BOINC has contacted rosetta servers. Only after I manually press the update button will the workunit go through. I am running version 4.45. Any ideas????

Reporting is done separately from uploading to reduce network comms on the server side. JM7 (the man who wrote the scheduler) says this:

Results are reported any time the project is contacted for an update. Updates occur at the first of:

1) A result report is due within 24 hours.
2) It has been at least the connect interval since the result completed.
3) (5.4) It is less than the connect interval till the report deadline.
4) Work is needed.
5) A manual update.
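
The five rules quoted above can be written down as a small check. This is only a sketch of the list as worded here, not the actual BOINC scheduler code. `PendingResult`, `update_reasons`, and the parameter names are hypothetical, and all times are plain seconds.

```python
from dataclasses import dataclass

DAY = 24 * 3600  # seconds


@dataclass
class PendingResult:
    completed_at: float  # when the result finished, in seconds
    deadline: float      # when the report is due, in seconds


def update_reasons(results, now, connect_interval, work_needed=False, manual=False):
    """Return the set of rule numbers (1-5) that would trigger a project contact."""
    reasons = set()
    for r in results:
        time_left = r.deadline - now
        if time_left <= DAY:
            reasons.add(1)  # a result report is due within 24 hours
        if now - r.completed_at >= connect_interval:
            reasons.add(2)  # at least the connect interval since completion
        if time_left < connect_interval:
            reasons.add(3)  # (5.4) less than the connect interval until the deadline
    if work_needed:
        reasons.add(4)
    if manual:
        reasons.add(5)
    return reasons
```

An empty set means a finished result just sits at "ready to report", which is consistent with what was described above: nothing happens until one of the rules fires or the update button is pressed.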

Robinski

Joined: Mar 7 06
Posts: 51
ID: 64155
Credit: 45,396
RAC: 0
Message 17280 - Posted 28 May 2006 20:52:35 UTC

I just saw I had an error today with

r287__CONTACTEIGHT_SHORTRELAX_SAVE_ALL_OUT_hom001__563_711
see: Result

Possibly this is due to the fact that I manually stopped the BOINC service, but I am not sure if this was around the same time.

Otherwise it is just an error which occurred.


It was an Invalid Function error:
<core_client_version>5.5.0</core_client_version>
<message>
Onjuiste functie. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# random seed: 2482790
# cpu_run_time_pref: 3600
ERROR:: Exit at: .\dock_structure.cc line:401

</stderr_txt>
____________
Member of the Dutch Power Cows

Trying to get the world on IPv6, do you have it? check here: IPv6.RHarmsen.nl

anders n

Joined: Sep 19 05
Posts: 403
ID: 578
Credit: 537,904
RAC: 0
Message 17301 - Posted 29 May 2006 14:04:44 UTC

Hi

I'm crunching this WU now: http://boinc.bakerlab.org/rosetta/result.php?resultid=21976659

When I select print to screen in my accounting program, the WU halts.

No movement whatsoever on the graphics afterwards.

Anders n
____________

Rollo

Joined: Jan 2 06
Posts: 21
ID: 45900
Credit: 52,065
RAC: 0
Message 17317 - Posted 29 May 2006 18:39:25 UTC
Last modified: 29 May 2006 18:41:56 UTC

I am crunching on 21970571 right now. It stops after reaching 1.210% at time step 2833. If I stop boinc and let it restart from the last checkpoint (here: from the beginning), it stops at same step 2833 in model 1. For me this seems reproducible.
Any suggestions, what I can do to produce a reasonable error report, except abort the workunit or wait for the watchdog?

tralala

Joined: Apr 8 06
Posts: 376
ID: 73828
Credit: 581,806
RAC: 1
Message 17321 - Posted 29 May 2006 20:29:19 UTC - in response to Message ID 17317.

I am crunching on 21970571 right now. It stops after reaching 1.210% at time step 2833. If I stop boinc and let it restart from the last checkpoint (here: from the beginning), it stops at same step 2833 in model 1. For me this seems reproducible.
Any suggestions, what I can do to produce a reasonable error report, except abort the workunit or wait for the watchdog?


I'd wait at least an hour. In theory the watchdog should terminate it after an hour. This is a good opportunity to see if it really works as it should. In any case let it run a few hours and if it really keeps stuck at step 2833 abort and you get all the credits for the time crunched.
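
For readers unfamiliar with the idea, a watchdog of this kind can be pictured as a timer that resets whenever the step counter moves. The sketch below is an assumption-laden illustration, not Rosetta's actual watchdog. The class name and interface are invented, and the one-hour default is taken from the post above.

```python
class Watchdog:
    """Flag a job as stuck when its step counter has not changed for `timeout` seconds."""

    def __init__(self, timeout_seconds: float = 3600):
        self.timeout = timeout_seconds
        self.last_step = None
        self.last_change = None

    def observe(self, step: int, now: float) -> bool:
        """Record the current step; return True if the job looks stuck."""
        if step != self.last_step:
            # Progress was made (or this is the first observation): restart the timer.
            self.last_step = step
            self.last_change = now
            return False
        return now - self.last_change >= self.timeout


if __name__ == "__main__":
    dog = Watchdog()
    for now in (0, 1800, 3600):
        print(now, dog.observe(2833, now))
```

A job that sits at step 2833 would be flagged on the first check made an hour or more after it last advanced.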
____________

Aglarond

Joined: Jan 29 06
Posts: 26
ID: 55168
Credit: 444,461
RAC: 0
Message 17325 - Posted 29 May 2006 23:29:58 UTC

I had another one of those nasty R@H screensaver crashes. It was result T0283_CONTACTS_CONSERVATIVE_HALFHB_MAP_FROM_hom006_575_8907_0. However, today I zipped the memory dump that Windows was going to send to Microsoft. If you think it will help you, you can download WERa78d.dir00.zip (16.1 MB). (I will leave it there for download for at least a month.)

I was also wondering why this happens only on this particular computer. It is the only one of my computers that has a localized version of Windows (Slovak language version). Do you think that could be the reason for the screensaver crash?

Winkle

Joined: May 22 06
Posts: 88
ID: 83983
Credit: 1,093,044
RAC: 499
Message 17337 - Posted 30 May 2006 7:11:39 UTC
Last modified: 30 May 2006 7:23:30 UTC

I am currently crunching JUMP_RELAX_LONGRANGEPAIR_PARALLEL_t285__SAVE_ALL_OUT_548_11268_0 using rosetta version 516
It is at 1% after 2.5 hrs. Boincview tells me that it has 5.25 hrs to complete.
Normally it takes around 2.8 hrs per WU.
It is running on a Dell P3 1G #225837

Wait....

That was weird...
It just went straight to 100% At 2:43
Any Ideas ??

Edit...
I think this is it...
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=18401795
http://boinc.bakerlab.org/rosetta/result.php?resultid=21942605
End Edit...
Thanks
Ian

tralala

Joined: Apr 8 06
Posts: 376
ID: 73828
Credit: 581,806
RAC: 1
Message 17340 - Posted 30 May 2006 7:38:35 UTC - in response to Message ID 17337.

I am currently crunching JUMP_RELAX_LONGRANGEPAIR_PARALLEL_t285__SAVE_ALL_OUT_548_11268_0 using rosetta version 516
It is at 1% after 2.5 hrs. Boincview tells me that it has 5.25 hrs to complete.
Normally it takes around 2.8 hrs per WU.
It is running on a Dell P3 1G #225837

Wait....

That was weird...
It just went straight to 100% At 2:43
Any Ideas ??

Edit...
I think this is it...
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=18401795
http://boinc.bakerlab.org/rosetta/result.php?resultid=21942605
End Edit...
Thanks
Ian


Quoting from the FAQ:


Depending on how the Wu is configured, some may have over 1,500,000 steps in the first model and still not reach 1%. This can take over 5 hours of CPU time. There are a few even larger ones.

____________

Winkle

Joined: May 22 06
Posts: 88
ID: 83983
Credit: 1,093,044
RAC: 499
Message 17346 - Posted 30 May 2006 9:10:00 UTC - in response to Message ID 17340.

Thanks
I will have a read

Rollo

Joined: Jan 2 06
Posts: 21
ID: 45900
Credit: 52,065
RAC: 0
Message 17350 - Posted 30 May 2006 11:25:14 UTC - in response to Message ID 17321.

I am crunching on 21970571 right now. It stops after reaching 1.210% at time step 2833. If I stop boinc and let it restart from the last checkpoint (here: from the beginning), it stops at same step 2833 in model 1. For me this seems reproducible.
Any suggestions, what I can do to produce a reasonable error report, except abort the workunit or wait for the watchdog?


I'd wait at least an hour. In theory the watchdog should terminate it after an hour. This is a good opportunity to see if it really works as it should. In any case let it run a few hours and if it really keeps stuck at step 2833 abort and you get all the credits for the time crunched.


The watchdog killed the workunit. I have made a backup, so I can rerun it, if that is of any interest.

Augustine

Joined: Sep 17 05
Posts: 28
ID: 299
Credit: 116,341
RAC: 0
Message 17367 - Posted 30 May 2006 14:55:07 UTC
Last modified: 30 May 2006 14:55:34 UTC

I have a runaway WU (here). It reports 100% done, but even after over 11h it keeps on running, even though I limited WU time to 1h.

HTH

____________

Tom Philippart

Joined: May 29 06
Posts: 182
ID: 85247
Credit: 503,628
RAC: 8
Message 17372 - Posted 30 May 2006 15:37:06 UTC

my problem: the progress percentage jumps, it's not advancing fluently
(1%-24%-48%...)

i can live with it, but still it would be cool to have this solved
____________

Feet1st

Joined: Dec 30 05
Posts: 1725
ID: 44890
Credit: 843,377
RAC: 108
Message 17375 - Posted 30 May 2006 16:07:04 UTC - in response to Message ID 17372.

my problem: the progress percentage jumps, it's not advancing fluently
(1%-24%-48%...)

i can live with it, but still it would be cool to have this solved


This is normal and described in this FAQ about the runtime preference. The % complete for Rosetta is not as definite and easy to compute as for some other projects. A given WU will run through as many complete models as possible. Given the percentages in your example, once it completed the first model (see the model number on the graphic) it estimated it would get about 3 more models completed before it reaches your runtime preference. Each completed model is what the scientists need for their work. What happens within a model is not as important. But the additional updates to the % complete were basically added to help diagnose any problems with a given set of WUs.
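
To make the jumps concrete, here is one way such an estimate could work. This is a guess at the general idea only, not Rosetta's real formula. The function names, the 1% placeholder before the first model, and the 870-second model time are all assumptions chosen for illustration.

```python
def estimated_total_models(runtime_pref: float, cpu_time: float, models_done: int):
    """Estimate how many models fit in the runtime preference, given the pace so far."""
    if models_done == 0:
        return None  # nothing to extrapolate from yet
    avg_model_time = cpu_time / models_done
    remaining = max(0.0, runtime_pref - cpu_time)
    return models_done + int(remaining // avg_model_time)


def percent_complete(runtime_pref: float, cpu_time: float, models_done: int) -> float:
    """Progress counted in whole models, so it moves in steps rather than smoothly."""
    total = estimated_total_models(runtime_pref, cpu_time, models_done)
    if total is None:
        return 1.0  # placeholder value shown before the first model finishes
    return 100.0 * models_done / total


if __name__ == "__main__":
    # One-hour runtime preference, models taking about 870 s of CPU time each.
    for done in range(0, 5):
        print(done, percent_complete(3600, 870 * done, done))
```

With roughly four models fitting in the hour, progress moves in steps of about a quarter, which is close to the 1%-24%-48% pattern reported.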
____________
If having a DC project with BOINC is of interest to you, with volunteer or cloud computing resources, but have no time for the BOINC learning curve,
use a hosting service that understands BOINC projects: http://DeepSci.com

Sam Miorelli

Joined: Feb 16 06
Posts: 7
ID: 59176
Credit: 654,533
RAC: 170
Message 17390 - Posted 30 May 2006 19:41:25 UTC

I have a Prescott-based machine that I run Rosetta, Einstein, LHC, and SETI on. None of the other projects have any problems, but Rosetta is about 50% errors. Today alone I had two:

Unrecoverable error for result HOMOLOG_ABRELAX_hom007_t283__505_33607_1 ( - exit code-1073741811 (0xc000000d))

Unrecoverable error for result JUMP_RELAX_LONGRANGEPAIR_FORCESTRAND_t285__SAVE_ALL_OUT_550_36800_0 ( - exit code-1073741811 (0xc000000d))

It only seems that the errors come up when the screensaver is running, not when it's just running in the background as I do other things. My machine runs two units at a time and has 512MB ram. Anyone have any idea why I'm getting these problems in Rosetta?
____________

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 17395 - Posted 30 May 2006 21:16:03 UTC - in response to Message ID 17390.
Last modified: 30 May 2006 21:17:13 UTC

Anyone have any idea why I'm getting these problems in Rosetta?

Yup, I have an AMD64 3700 which has the same problems with Ralph 5.12 and 5.16, and Rosetta 5.16. When I run the screensaver and leave the machine alone: Windows fatal error. If I keep working with it, so that the screensaver never comes on: successful results. When I turn OFF the screensaver: successful results.

Turn OFF the screensaver. They are aware of this error and have asked for Rom Walton's assistance. I offered Rom my help, which he has not accepted.

tony

PS, I have two other puters which work just fine with the screensaver on.

Bob Guy

Joined: Oct 7 05
Posts: 39
ID: 3119
Credit: 24,895
RAC: 0
Message 17396 - Posted 30 May 2006 21:21:28 UTC

Two recent errors because I used the Boinc view graphics button. The WUs complete successfully if I never view the graphics - I have the Boinc screensaver turned off.

21418496
21418531




XS_Duc

Joined: Dec 30 05
Posts: 17
ID: 45030
Credit: 310,471
RAC: 0
Message 17413 - Posted 31 May 2006 9:55:05 UTC
Last modified: 31 May 2006 9:56:46 UTC

It's been ages since I had another error to report, but this morning I noticed one... never seen that one before.

Result ID 21972244 (Workunit 18429646)

pieface

Joined: Sep 20 05
Posts: 15
ID: 600
Credit: 528,244
RAC: 3
Message 17419 - Posted 31 May 2006 13:01:03 UTC

I also had watchdog knock down one of those pdbblast guys: resultid
like XS DUC's.

____________

Enno Ruijters

Joined: Sep 23 05
Posts: 2
ID: 859
Credit: 98,631
RAC: 186
Message 17420 - Posted 31 May 2006 13:04:39 UTC

My linux machine got its first error: result 22071081:

Wed 31 May 2006 02:42:39 PM CEST|rosetta@home|Unrecoverable error for result JUMP_RELAX_LONGRANGEPAIR_ANTIPARALLEL_t285__SAVE_ALL_OUT_548_29690_0 (process exited with code 131 (0x83))

I'm using boinc version 5.4.9 on x86_64 linux 2.6.15.


Result ID 22071081
Name JUMP_RELAX_LONGRANGEPAIR_ANTIPARALLEL_t285__SAVE_ALL_OUT_548_29690_0
Workunit 18520787
Created 30 May 2006 3:35:03 UTC
Sent 30 May 2006 5:48:44 UTC
Received 31 May 2006 12:47:19 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status 131 (0x83)
Computer ID 70238
Report deadline 6 Jun 2006 5:48:44 UTC
CPU time 6945.15
stderr out

<core_client_version>5.4.9</core_client_version>
<message>
process exited with code 131 (0x83)
</message>
<stderr_txt>
# random seed: 2797211
# cpu_run_time_pref: 10800
No heartbeat from core client for 31 sec - exiting
SIGSEGV: segmentation violationStack trace (19 frames):
[0x8836a6b]
[0x884f74c]
[0xffffe500]
[0x860e7a9]
[0x85ff1f8]
[0x809364c]
[0x860ff95]
[0x8610bb0]
[0x87dca0f]
[0x8728e50]
[0x872a6bb]
[0x80a3a75]
[0x85c3a13]
[0x842093e]
[0x85f1ffb]
[0x8496132]
[0x8498c8f]
[0x88aec34]
[0x8048111]

Exiting...

</stderr_txt>

Validate state Invalid
Claimed credit 12.3736072419708
Granted credit 0
application version 5.16

tralala

Joined: Apr 8 06
Posts: 376
ID: 73828
Credit: 581,806
RAC: 1
Message 17422 - Posted 31 May 2006 13:14:38 UTC - in response to Message ID 17419.

I also had watchdog knock down one of those pdbblast guys: resultid
like XS DUC's.


It seems there is a problem with these WUs:

FRA_t297
____________

Jose

Joined: Mar 28 06
Posts: 820
ID: 69098
Credit: 48,297
RAC: 0
Message 17425 - Posted 31 May 2006 14:04:04 UTC - in response to Message ID 17413.

It's been ages since I had another error to report, but this morning I noticed one... never seen that one before.

Result ID 21972244 (Workunit 18429646)



It means that the watchdog function placed to prevent Wu's stuck in time IS WORKING as it is supposed to do. I imagine that that unit and the ones where the watchdog is terminating the processing ( and there are some few recently) will be analyzed to see what is happening that is causing the watchdog to work.
____________
“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato

David Baker
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Sep 17 05
Posts: 637
ID: 122
Credit: 214,854
RAC: 0
Message 17440 - Posted 31 May 2006 16:17:56 UTC - in response to Message ID 17422.

I also had watchdog knock down one of those pdbblast guys: resultid
like XS DUC's.


It seems there is a problem with these WUs:

FRA_t297


Thanks for the heads up--we'll look into this right away
____________

Bin Qian
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 13 05
Posts: 33
ID: 18
Credit: 36,897
RAC: 0
Message 17481 - Posted 31 May 2006 23:05:51 UTC - in response to Message ID 17425.


Thanks to your helpful messages, we've tracked down the rare bug that's causing this in the code and fixed it. The fix will be included in the next release. Great job all!

Luckily we only sent out 5000 of these bad WUs (a very small number compared to the 120,000 done every day) and about a third of them were affected. You will still get credits for those jobs killed by the watchdog when our credit-grantor runs nightly!
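
To put those figures in proportion, the arithmetic works out to roughly 4% of a day's work sent and about 1.4% actually affected. A quick Python check, using only the numbers quoted above and reading "about a third" as exactly one third (an assumption):

```python
bad_wus_sent = 5000        # bad WUs sent out, from the post
daily_wus = 120_000        # WUs done per day, from the post
affected_fraction = 1 / 3  # "about a third", taken as exactly 1/3 here

affected = bad_wus_sent * affected_fraction
share_sent = bad_wus_sent / daily_wus
share_affected = affected / daily_wus

print(f"bad WUs sent: {share_sent:.2%} of one day's work")
print(f"WUs actually affected: about {affected:.0f} ({share_affected:.2%})")
```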

It's been ages since I had another error to report, but this morning I noticed one... never seen that one before.

Result ID 21972244 (Workunit 18429646)



It means that the watchdog function placed to prevent Wu's stuck in time IS WORKING as it is supposed to do. I imagine that that unit and the ones where the watchdog is terminating the processing ( and there are some few recently) will be analyzed to see what is happening that is causing the watchdog to work.


____________

dag

Joined: Dec 16 05
Posts: 106
ID: 38674
Credit: 1,000,020
RAC: 0
Message 17561 - Posted 2 Jun 2006 21:30:17 UTC

Hit 100% at around 12 hours (normal). Then stayed there using a slot and not running for 6 hours -- still 100%, no more cpu time being used.

http://boinc.bakerlab.org/rosetta/result.php?resultid=22292990
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=18724964
T0283_CONTACTS_CONSERVATIVE_CALPHA_HALFHB_MAP_FROM_hom024_593_11206
____________
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.

Ian

Joined: Apr 14 06
Posts: 29
ID: 76277
Credit: 25,245
RAC: 0
Message 17562 - Posted 3 Jun 2006 0:55:51 UTC
Last modified: 3 Jun 2006 0:57:00 UTC

Errors from a day or two ago that I only just spotted:

http://boinc.bakerlab.org/rosetta/result.php?resultid=22302203

http://boinc.bakerlab.org/rosetta/result.php?resultid=22240155

Touch wood (or at least wood veneer), very few errors lately.
____________
Ian Cundell, St Albans, UK

Aglarond

Joined: Jan 29 06
Posts: 26
ID: 55168
Credit: 444,461
RAC: 0
Message 17584 - Posted 3 Jun 2006 19:54:21 UTC
Last modified: 3 Jun 2006 19:59:27 UTC

I had many errors with FRA_t301_hom028_1_LOOPRLX_IGNORE_THE_REST recently. Probably Rosetta needed to use more than 300MB RAM and system was out of memory. I've got two different error codes: -1073741819 (0xc0000005) and -1073741571 (0xc00000fd) .
Results:
22632143
22629987
22628762
22626932
22625240
22624321
22623307
22593482
22585478

rbpeake

Joined: Sep 25 05
Posts: 168
ID: 1036
Credit: 246,593
RAC: 0
Message 17595 - Posted 4 Jun 2006 5:12:54 UTC - in response to Message ID 17584.

I had many errors with FRA_t301_hom028_1_LOOPRLX_IGNORE_THE_REST recently. Probably Rosetta needed to use more than 300MB RAM and system was out of memory. I've got two different error codes: -1073741819 (0xc0000005) and -1073741571 (0xc00000fd) .
Results:
22632143
22629987
22628762
22626932
22625240
22624321
22623307
22593482
22585478

Looks like it might be prudent to take this computer off of Rosetta. It has well below the minimum 500mb recommended, and it is probably not fair to the project to have all these units abort because of that... Just a thought!
____________
Regards,
Bob P.

Vester

Joined: Nov 2 05
Posts: 242
ID: 8211
Credit: 258,633
RAC: 0
Message 17646 - Posted 5 Jun 2006 4:36:10 UTC

This one is running well, but it is using more memory than any that I have observed: A peak of 318 MB.

[img=http://img213.imageshack.us/img213/7604/capture050620060029159zm.th.png]
Thumbnail. You may have to click on the larger image to see it clearly.

Aglarond

Joined: Jan 29 06
Posts: 26
ID: 55168
Credit: 444,461
RAC: 0
Message 17719 - Posted 6 Jun 2006 0:51:56 UTC - in response to Message ID 17595.
Last modified: 6 Jun 2006 0:53:54 UTC

Looks like it might be prudent to take this computer off of Rosetta. It has well below the minimum 500mb recommended, and it is probably not fair to the project to have all these units abort because of that... Just a thought!


Yes, I took it off Rosetta immediately. And also another computer with 256MB RAM. New workunits are taking A LOT of RAM. I've seen Rosetta using 375MB recently. So I recommend everyone with less than 512MB to be careful.

BennyRop

Joined: Dec 17 05
Posts: 555
ID: 38837
Credit: 140,800
RAC: 0
Message 17760 - Posted 6 Jun 2006 7:31:42 UTC

If you're getting repeated errors like that, perhaps it'd be a good idea to run Ralph on that machine. Let them track down the source of the errors - and either correct the code, or have Rosetta instantly fail such WUs and state, "not enough RAM for this WU." Someday, we'll hopefully have a client that will tell the server how much RAM we have on our machine, and get WUs that will run with that amount of RAM.
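
Both ideas (fail fast on the client, or match work to memory on the server) come down to comparing a per-WU memory requirement against the host's RAM. The sketch below is purely hypothetical: neither the `min_ram_mb` field nor these functions exist in BOINC or Rosetta as far as this thread shows, and the 375 MB figure is just the peak usage reported earlier.

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    name: str
    min_ram_mb: int  # hypothetical per-WU memory requirement


def check_or_fail(wu: WorkUnit, host_ram_mb: int) -> None:
    """Client side: refuse to start a WU the host cannot hold in memory."""
    if host_ram_mb < wu.min_ram_mb:
        raise MemoryError(f"not enough RAM for this WU: {wu.name}")


def runnable(work_units, host_ram_mb: int):
    """Server side: keep only the WUs that fit in the RAM the client reported."""
    return [wu for wu in work_units if wu.min_ram_mb <= host_ram_mb]


if __name__ == "__main__":
    queue = [WorkUnit("small_example", 200), WorkUnit("large_example", 375)]
    print([wu.name for wu in runnable(queue, 256)])
    print([wu.name for wu in runnable(queue, 512)])
```

Failing immediately costs a few seconds, whereas the crashes described above cost hours of CPU time before the error appears.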
____________

Jimi@0wned.org.uk

Joined: Mar 10 06
Posts: 29
ID: 64757
Credit: 335,252
RAC: 0
Message 17993 - Posted 7 Jun 2006 19:36:40 UTC

My air-cooled dual core AMD is slowly dying. Is the data still good in this unit? It bothers me that a bad machine might poison the result. How did the following WU validate with those errors? Was it restarting from an earlier checkpoint?

Result ID 23165576
Name t304__CASP7_ABRELAX_SAVE_ALL_OUT_cterm2_hom001__654_16007_0
Workunit 19503478
Created 7 Jun 2006 11:57:47 UTC
Sent 7 Jun 2006 13:27:18 UTC
Received 7 Jun 2006 19:29:58 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 241725
Report deadline 14 Jun 2006 13:27:18 UTC
CPU time 14133.575291
stderr out

<core_client_version>5.3.6</core_client_version>
<stderr_txt>
# random seed: 1986744
SIGSEGV: segmentation violationStack trace (16 frames):
[0x8836a6b]
[0x884f74c]
[0xffffe500]
[0x88d0170]
[0x88d1a29]
[0x88a0767]
[0x88a2b51]
[0x81eb08b]
[0x87298fc]
[0x87d2f38]
[0x8313d95]
[0x80e49ed]
[0x849682f]
[0x8498c8f]
[0x88aec34]
[0x8048111]

Exiting...
# random seed: 1986744
# cpu_run_time_pref: 14400
SIGSEGV: segmentation violationStack trace (21 frames):
[0x8836a6b]
[0x884f74c]
[0xffffe500]
[0x882e6bc]
[0x8625638]
[0x83671a9]
[0x8361a84]
[0x8729051]
[0x84cea28]
[0x84cedc4]
[0x84cfb67]
[0x84de8b1]
[0x84e06f1]
[0x87d42c3]
[0x86afa6b]
[0x86b2089]
[0x80e5111]
[0x849682f]
[0x8498c8f]
[0x88aec34]
[0x8048111]

Exiting...
# cpu_run_time_pref: 14400
# DONE :: 1 starting structures built 20 (nstruct) times
# This process generated 20 decoys from 20 attempts

</stderr_txt>

Validate state Valid

Moderator9
Forum moderator
Project administrator

Joined: Jan 22 06
Posts: 1014
ID: 53254
Credit: 0
RAC: 0
Message 18047 - Posted 8 Jun 2006 2:30:15 UTC - in response to Message ID 17993.
Last modified: 8 Jun 2006 2:33:42 UTC

My air-cooled dual core AMD is slowly dying. Is the data still good in this unit? It bothers me that a bad machine might poison the result. How did the following WU validate with those errors? Was it restarting from an earlier checkpoint?

...

Validation is not the same problem here as it might be on other BOINC projects. Any model you create successfully is valuable. That model need not match (in most cases it won't) any other models from any other computer. The validation will be made against the known structure of the protein. Eventually when the success rate is sufficiently high for matching known structures, then rosetta could be used for predictions. The prediction capability is what is being tested in CASP.
____________
Moderator9
ROSETTA@home FAQ
Moderator Contact

truckpuller

Joined: Nov 5 05
Posts: 40
ID: 9446
Credit: 229,134
RAC: 0
Message 18059 - Posted 8 Jun 2006 4:16:57 UTC

Since running 5.16, my RAC on one of my computers has dropped from 250 down to 225. Can anyone shed any light on what to look for as to why? This computer is a 2500+ Barton with 1 GB of memory, and I have a 1.6 Duron with 512 MB of memory outproducing it. Thanks in advance.
____________
Visit us at Christianboards.org

BennyRop

Joined: Dec 17 05
Posts: 555
ID: 38837
Credit: 140,800
RAC: 0
Message 18060 - Posted 8 Jun 2006 4:29:07 UTC

I noticed my RAC fall from 256 to 225 when I ran Ralph. (Ralph scores are separate from Rosetta.)
____________

Moderator9
Forum moderator
Project administrator

Joined: Jan 22 06
Posts: 1014
ID: 53254
Credit: 0
RAC: 0
Message 18138 - Posted 8 Jun 2006 15:37:22 UTC - in response to Message ID 18060.

I noticed my RAC fall from 256 to 225 when I ran Ralph. (Ralph scores are separate from Rosetta.)

Moreover, RALPH scores can be zeroed out at any time. RALPH is a test project and scores have no meaning there. If you are running RALPH it will reduce the RAC for any other projects run on that machine. But running RALPH is VERY helpful to the project. If you are interested in helping create the next version of the science application and are not concerned about credits, please join us over on RALPH.

Current testing includes a new Mac Intel version of Rosetta.

____________
Moderator9
ROSETTA@home FAQ
Moderator Contact

Jose

Joined: Mar 28 06
Posts: 820
ID: 69098
Credit: 48,297
RAC: 0
Message 18227 - Posted 9 Jun 2006 4:10:11 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=23001602

t299__CASP7_ABRELAX_SAVE_ALL_OUT_nterm_nohelix3_hom008__643_2630_0
Exit status -2147483645 (0x80000003)
CPU time 8925.28125
stderr out <core_client_version>5.5.0</core_client_version>
<message>
One or more arguments are invalid (0x80000003) - exit code -2147483645 (0x80000003)
</message>
<stderr_txt>
# random seed: 2367371
# cpu_run_time_pref: 14400





____________
“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato

Moderator9
Forum moderator
Project administrator

Joined: Jan 22 06
Posts: 1014
ID: 53254
Credit: 0
RAC: 0
Message 18229 - Posted 9 Jun 2006 4:44:01 UTC

With the release of the latest version 5.22 of Rosetta, a new thread for reporting problems has been opened here. Please continue to report version 5.16 problems in this thread. But if the issue is with version 5.22 please report in the new thread.
____________
Moderator9
ROSETTA@home FAQ
Moderator Contact

Jimi@0wned.org.uk

Joined: Mar 10 06
Posts: 29
ID: 64757
Credit: 335,252
RAC: 0
Message 18249 - Posted 9 Jun 2006 8:48:24 UTC

My air-cooled x2 3800 is now kaput. Thrashed to death, 78,582.75 Cobblestones since 31st March. A fallen soldier indeed.

I'll have to make up lost ground with Conroe... :D

XS_Duc

Joined: Dec 30 05
Posts: 17
ID: 45030
Credit: 310,471
RAC: 0
Message 18260 - Posted 9 Jun 2006 12:36:33 UTC
Last modified: 9 Jun 2006 12:49:00 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=22893068

Got stuck somehow yesterday, didn't see it till this morning...
The watchdog didn't kick in either; I had to abort it myself and lost a lot of time on this one.

Copyright © 2010 University of Washington

Last Modified: 3 Dec 2007 20:36:19 UTC