Report Problems with Rosetta Version 5.16 II

Tom Philippart
Joined: 29 May 06
Posts: 183
Credit: 834,667
RAC: 0
Message 17372 - Posted: 30 May 2006, 15:37:06 UTC

My problem: the progress percentage jumps; it doesn't advance smoothly
(1%-24%-48%...).

I can live with it, but it would still be cool to have this solved.

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 17375 - Posted: 30 May 2006, 16:07:04 UTC - in response to Message 17372.  

My problem: the progress percentage jumps; it doesn't advance smoothly
(1%-24%-48%...).

I can live with it, but it would still be cool to have this solved.


This is normal and is described in the FAQ about the runtime preference. The % complete for Rosetta is not as definite or as easy to compute as it is for some other projects. A given WU will run through as many complete models as possible. Given the percentages in your example, once it completed the first model (see the model number on the graphic), it estimated that it would complete about 3 more models before reaching your runtime preference. Each completed model is what the scientists need for their work; what happens within a model is not as important. The additional updates to the % complete were added mainly to help diagnose problems with a given set of WUs.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
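
The stepwise progress described in this reply can be illustrated with a small sketch. This is not Rosetta's actual code: the function name `estimate_progress`, the rule "total models = whole models that fit in the runtime preference", and the timing numbers are all assumptions chosen only to show why the percentage moves in jumps of roughly one model each.

```python
import math


def estimate_progress(models_done: int, seconds_per_model: float,
                      runtime_pref_seconds: float) -> float:
    """Return an estimated fraction complete after `models_done` models.

    Assumption (not taken from the Rosetta source): the total number of
    models is estimated as how many whole models fit in the runtime
    preference, with a minimum of one.
    """
    if seconds_per_model <= 0:
        raise ValueError("seconds_per_model must be positive")
    total_models = max(1, math.floor(runtime_pref_seconds / seconds_per_model))
    return min(1.0, models_done / total_models)


# Hypothetical numbers: a 3-hour preference and 45 minutes per model give
# 4 models in total, so each finished model adds about 25%.
steps = [estimate_progress(n, 45 * 60, 3 * 3600) for n in range(5)]
print(steps)
```

With these made-up timings the estimate moves in quarter steps, which is close to the 1%-24%-48% pattern reported above; the real client also adds small updates within a model.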

Sam Miorelli
Joined: 16 Feb 06
Posts: 7
Credit: 1,303,044
RAC: 0
Message 17390 - Posted: 30 May 2006, 19:41:25 UTC

I have a Prescott-based machine that I run Rosetta, Einstein, LHC, and SETI on. None of the other projects have any problems, but Rosetta is about 50% errors. Today alone I had two:

Unrecoverable error for result HOMOLOG_ABRELAX_hom007_t283__505_33607_1 ( - exit code-1073741811 (0xc000000d))

Unrecoverable error for result JUMP_RELAX_LONGRANGEPAIR_FORCESTRAND_t285__SAVE_ALL_OUT_550_36800_0 ( - exit code-1073741811 (0xc000000d))

The errors only seem to come up when the screensaver is running, not when Rosetta is just running in the background while I do other things. My machine runs two units at a time and has 512 MB of RAM. Anyone have any idea why I'm getting these problems in Rosetta?

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 17395 - Posted: 30 May 2006, 21:16:03 UTC - in response to Message 17390.  
Last modified: 30 May 2006, 21:17:13 UTC

Anyone have any idea why I'm getting these problems in Rosetta?

Yup, I have an AMD64 3700 which has the same problems with Ralph 5.12 and 5.16, and Rosetta 5.16. When I run the screensaver and leave the machine alone, I get a Windows fatal error. If I keep working with it, so that the screensaver never comes on, I get successful results. When I turn OFF the screensaver, I also get successful results.

Turn OFF the screensaver. They are aware of this error and have asked for Rom Walton's assistance. I offered Rom my help, which he hasn't accepted.

tony

PS, I have two other computers which work just fine with the screensaver on.

Bob Guy
Joined: 7 Oct 05
Posts: 39
Credit: 24,895
RAC: 0
Message 17396 - Posted: 30 May 2006, 21:21:28 UTC

Two recent errors occurred because I used the BOINC "view graphics" button. The WUs complete successfully if I never view the graphics. I have the BOINC screensaver turned off.

21418496
21418531





XS_Duc
Joined: 30 Dec 05
Posts: 17
Credit: 310,471
RAC: 0
Message 17413 - Posted: 31 May 2006, 9:55:05 UTC
Last modified: 31 May 2006, 9:56:46 UTC

It's been ages since I had another error to report, but this morning I noticed one... never seen that one before.

Resultid21972244 (Workunit18429646)

pieface
Joined: 20 Sep 05
Posts: 17
Credit: 797,661
RAC: 0
Message 17419 - Posted: 31 May 2006, 13:01:03 UTC

I also had watchdog knock down one of those pdbblast guys: resultid
like XS DUC's.


Enno Ruijters
Joined: 23 Sep 05
Posts: 2
Credit: 3,194,827
RAC: 0
Message 17420 - Posted: 31 May 2006, 13:04:39 UTC

My linux machine got its first error: result 22071081:

Wed 31 May 2006 02:42:39 PM CEST|rosetta@home|Unrecoverable error for result JUMP_RELAX_LONGRANGEPAIR_ANTIPARALLEL_t285__SAVE_ALL_OUT_548_29690_0 (process exited with code 131 (0x83))

I'm using BOINC version 5.4.9 on x86_64 Linux 2.6.15.


Result ID 22071081
Name JUMP_RELAX_LONGRANGEPAIR_ANTIPARALLEL_t285__SAVE_ALL_OUT_548_29690_0
Workunit 18520787
Created 30 May 2006 3:35:03 UTC
Sent 30 May 2006 5:48:44 UTC
Received 31 May 2006 12:47:19 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status 131 (0x83)
Computer ID 70238
Report deadline 6 Jun 2006 5:48:44 UTC
CPU time 6945.15
stderr out

<core_client_version>5.4.9</core_client_version>
<message>
process exited with code 131 (0x83)
</message>
<stderr_txt>
# random seed: 2797211
# cpu_run_time_pref: 10800
No heartbeat from core client for 31 sec - exiting
SIGSEGV: segmentation violationStack trace (19 frames):
[0x8836a6b]
[0x884f74c]
[0xffffe500]
[0x860e7a9]
[0x85ff1f8]
[0x809364c]
[0x860ff95]
[0x8610bb0]
[0x87dca0f]
[0x8728e50]
[0x872a6bb]
[0x80a3a75]
[0x85c3a13]
[0x842093e]
[0x85f1ffb]
[0x8496132]
[0x8498c8f]
[0x88aec34]
[0x8048111]

Exiting...

</stderr_txt>

Validate state Invalid
Claimed credit 12.3736072419708
Granted credit 0
application version 5.16

tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 17422 - Posted: 31 May 2006, 13:14:38 UTC - in response to Message 17419.  

I also had watchdog knock down one of those pdbblast guys: resultid
like XS DUC's.


It seems there is a problem with these WUs:

FRA_t297

Jose
Joined: 28 Mar 06
Posts: 820
Credit: 48,297
RAC: 0
Message 17425 - Posted: 31 May 2006, 14:04:04 UTC - in response to Message 17413.  

It's been ages since I had another error to report, but this morning I noticed one... never seen that one before.

Resultid21972244 (Workunit18429646)



It means that the watchdog function, put in place to prevent WUs from getting stuck, IS WORKING as it is supposed to. I imagine that this unit, and the others where the watchdog is terminating the processing (there have been a few recently), will be analyzed to see what is causing the watchdog to trigger.
“This and no other is the root from which a Tyrant springs; when he first appears he is a protector.”
Plato
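
For readers unfamiliar with the idea, a watchdog of the kind mentioned in this post can be sketched as follows. This is a generic illustration, not Rosetta's implementation: the class name `Watchdog`, its methods, and the timeout value are hypothetical. The worker reports progress from time to time, and a supervisor checks whether the last report is older than the allowed timeout.

```python
import time


class Watchdog:
    """Flags a job as stuck when no progress is reported for too long."""

    def __init__(self, timeout_seconds: float, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock              # injectable so it can be tested
        self._last_progress = clock()

    def report_progress(self) -> None:
        """Called by the worker whenever it finishes a step."""
        self._last_progress = self._clock()

    def is_stuck(self) -> bool:
        """True when the time since the last report exceeds the timeout."""
        return self._clock() - self._last_progress > self.timeout_seconds


# Usage with a fake clock so the example runs instantly.
now = [0.0]
watchdog = Watchdog(timeout_seconds=10, clock=lambda: now[0])
now[0] = 5.0
print(watchdog.is_stuck())   # still within the timeout
watchdog.report_progress()
now[0] = 16.0
print(watchdog.is_stuck())   # 11 seconds since the last report
```

A supervisor that sees `is_stuck()` return true would end the job and report it, which matches the behaviour described in this post at a very high level.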

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 17440 - Posted: 31 May 2006, 16:17:56 UTC - in response to Message 17422.  

I also had watchdog knock down one of those pdbblast guys: resultid
like XS DUC's.


It seems there is a problem with these WUs:

FRA_t297


Thanks for the heads-up. We'll look into this right away.

Bin Qian
Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 17481 - Posted: 31 May 2006, 23:05:51 UTC - in response to Message 17425.  


Thanks to your helpful messages, we've tracked down the rare bug that's causing this in the code and fixed it. The fix will be included in the next release. Great job all!

Luckily we only sent out 5000 of these bad WUs (a very small number compared to the 120,000 done every day), and about a third of them were affected. You will still get credit for the jobs killed by the watchdog when our credit-grantor runs nightly!
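
To put those figures in proportion, here is the arithmetic as a short sketch. The three input numbers come from the post above; treating "about a third" as exactly 1/3 is an assumption made only for the calculation.

```python
bad_wus_sent = 5_000        # bad work units sent out (from the post)
daily_wus = 120_000         # work units completed per day (from the post)
affected_fraction = 1 / 3   # "about a third", taken as exactly 1/3 here

affected = bad_wus_sent * affected_fraction
share_sent = bad_wus_sent / daily_wus
share_affected = affected / daily_wus

print(f"affected WUs: about {affected:.0f}")
print(f"bad batch as a share of one day's work: {share_sent:.1%}")
print(f"affected WUs as a share of one day's work: {share_affected:.1%}")
```

Under that assumption, roughly 1,667 work units were affected, which is about 1.4% of one day's throughput.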

It's been ages since I had another error to report, but this morning I noticed one... never seen that one before.

Resultid21972244 (Workunit18429646)



It means that the watchdog function, put in place to prevent WUs from getting stuck, IS WORKING as it is supposed to. I imagine that this unit, and the others where the watchdog is terminating the processing (there have been a few recently), will be analyzed to see what is causing the watchdog to trigger.



dag
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 17561 - Posted: 2 Jun 2006, 21:30:17 UTC

Hit 100% at around 12 hours (normal). Then it stayed there for 6 hours, occupying a slot but not running: still at 100%, with no more CPU time being used.

https://boinc.bakerlab.org/rosetta/result.php?resultid=22292990
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=18724964
T0283_CONTACTS_CONSERVATIVE_CALPHA_HALFHB_MAP_FROM_hom024_593_11206
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.

Ian
Joined: 14 Apr 06
Posts: 29
Credit: 183,497
RAC: 1,235
Message 17562 - Posted: 3 Jun 2006, 0:55:51 UTC
Last modified: 3 Jun 2006, 0:57:00 UTC

Errors from a day or two ago that I only just spotted:

https://boinc.bakerlab.org/rosetta/result.php?resultid=22302203

https://boinc.bakerlab.org/rosetta/result.php?resultid=22240155

Touch wood (or at least wood veneer), very few errors lately.
Ian Cundell, St Albans, UK

Aglarond
Joined: 29 Jan 06
Posts: 26
Credit: 446,212
RAC: 0
Message 17584 - Posted: 3 Jun 2006, 19:54:21 UTC
Last modified: 3 Jun 2006, 19:59:27 UTC

I had many errors with FRA_t301_hom028_1_LOOPRLX_IGNORE_THE_REST recently. Rosetta probably needed more than 300 MB of RAM and the system ran out of memory. I got two different error codes: -1073741819 (0xc0000005) and -1073741571 (0xc00000fd).
Results:
22632143
22629987
22628762
22626932
22625240
22624321
22623307
22593482
22585478
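
The negative numbers and the hexadecimal values in these reports are the same 32-bit exit code shown two ways. The sketch below converts one form to the other and looks up the name for the three codes that appear in this thread. The names are the standard Windows NTSTATUS names for those values; the helper `decode_exit_code` is hypothetical, and what each code means for Rosetta specifically is not established here.

```python
from typing import Tuple

# Standard Windows NTSTATUS names for the three codes seen in this thread.
NTSTATUS_NAMES = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000000D: "STATUS_INVALID_PARAMETER",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
}


def decode_exit_code(signed_code: int) -> Tuple[str, str]:
    """Convert a signed 32-bit exit code to (hex string, NTSTATUS name)."""
    unsigned = signed_code & 0xFFFFFFFF  # reinterpret as unsigned 32-bit
    name = NTSTATUS_NAMES.get(unsigned, "unknown")
    return f"0x{unsigned:08x}", name


for code in (-1073741811, -1073741819, -1073741571):
    print(code, *decode_exit_code(code))
```

An access violation or a stack overflow is consistent with the out-of-memory guess above, but it does not prove it; other bugs can produce the same codes.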

rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 17595 - Posted: 4 Jun 2006, 5:12:54 UTC - in response to Message 17584.  

I had many errors with FRA_t301_hom028_1_LOOPRLX_IGNORE_THE_REST recently. Rosetta probably needed more than 300 MB of RAM and the system ran out of memory. I got two different error codes: -1073741819 (0xc0000005) and -1073741571 (0xc00000fd).
Results:
22632143
22629987
22628762
22626932
22625240
22624321
22623307
22593482
22585478

Looks like it might be prudent to take this computer off of Rosetta. It has well below the minimum 500mb recommended, and it is probably not fair to the project to have all these units abort because of that... Just a thought!
Regards,
Bob P.

Vester
Joined: 2 Nov 05
Posts: 257
Credit: 3,283,885
RAC: 14,798
Message 17646 - Posted: 5 Jun 2006, 4:36:10 UTC

This one is running well, but it is using more memory than any that I have observed: A peak of 318 MB.

[img=http://img213.imageshack.us/img213/7604/capture050620060029159zm.th.png]
Thumbnail. You may have to click on the larger image to see it clearly.

Aglarond
Joined: 29 Jan 06
Posts: 26
Credit: 446,212
RAC: 0
Message 17719 - Posted: 6 Jun 2006, 0:51:56 UTC - in response to Message 17595.  
Last modified: 6 Jun 2006, 0:53:54 UTC

Looks like it might be prudent to take this computer off of Rosetta. It has well below the minimum 500mb recommended, and it is probably not fair to the project to have all these units abort because of that... Just a thought!


Yes, I took it off Rosetta immediately, along with another computer with 256 MB of RAM. New workunits are taking A LOT of RAM. I've seen Rosetta using 375 MB recently. So I recommend that everyone with less than 512 MB be careful.

BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 17760 - Posted: 6 Jun 2006, 7:31:42 UTC

If you're getting repeated errors like that, perhaps it'd be a good idea to run Ralph on that machine. Let them track down the source of the errors and either correct the code or have Rosetta instantly fail such WUs and state, "not enough RAM for this WU." Someday, hopefully, we'll have a client that tells the server how much RAM our machine has, and we'll get WUs that will run with that amount of RAM.
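
The "fail fast" idea suggested here could look something like the sketch below. Everything in it is hypothetical: neither BOINC nor Rosetta is known from this thread to expose such a check, the function names are invented, and the 375 MB requirement is simply the peak usage reported earlier in the thread. RAM detection uses `os.sysconf`, which is not available on every platform, so the helper returns `None` when it cannot tell.

```python
import os
from typing import Optional

MIB = 1024 * 1024


def detect_total_ram_bytes() -> Optional[int]:
    """Best-effort RAM detection; None where sysconf cannot provide it."""
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None


def check_work_unit(total_ram_bytes: int, required_mib: int) -> str:
    """Return "ok", or the failure message suggested in the post."""
    if total_ram_bytes >= required_mib * MIB:
        return "ok"
    return "not enough RAM for this WU"


# Hypothetical requirement of 375 MiB, checked against two machine sizes
# mentioned in the thread.
print(check_work_unit(256 * MIB, required_mib=375))
print(check_work_unit(512 * MIB, required_mib=375))
```

A real check would also need to account for memory used by other running work units and programs, which this sketch ignores.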

Jimi@0wned.org.uk
Joined: 10 Mar 06
Posts: 29
Credit: 335,252
RAC: 0
Message 17993 - Posted: 7 Jun 2006, 19:36:40 UTC

My air-cooled dual core AMD is slowly dying. Is the data still good in this unit? It bothers me that a bad machine might poison the result. How did the following WU validate with those errors? Was it restarting from an earlier checkpoint?

Result ID 23165576
Name t304__CASP7_ABRELAX_SAVE_ALL_OUT_cterm2_hom001__654_16007_0
Workunit 19503478
Created 7 Jun 2006 11:57:47 UTC
Sent 7 Jun 2006 13:27:18 UTC
Received 7 Jun 2006 19:29:58 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 241725
Report deadline 14 Jun 2006 13:27:18 UTC
CPU time 14133.575291
stderr out

<core_client_version>5.3.6</core_client_version>
<stderr_txt>
# random seed: 1986744
SIGSEGV: segmentation violationStack trace (16 frames):
[0x8836a6b]
[0x884f74c]
[0xffffe500]
[0x88d0170]
[0x88d1a29]
[0x88a0767]
[0x88a2b51]
[0x81eb08b]
[0x87298fc]
[0x87d2f38]
[0x8313d95]
[0x80e49ed]
[0x849682f]
[0x8498c8f]
[0x88aec34]
[0x8048111]

Exiting...
# random seed: 1986744
# cpu_run_time_pref: 14400
SIGSEGV: segmentation violationStack trace (21 frames):
[0x8836a6b]
[0x884f74c]
[0xffffe500]
[0x882e6bc]
[0x8625638]
[0x83671a9]
[0x8361a84]
[0x8729051]
[0x84cea28]
[0x84cedc4]
[0x84cfb67]
[0x84de8b1]
[0x84e06f1]
[0x87d42c3]
[0x86afa6b]
[0x86b2089]
[0x80e5111]
[0x849682f]
[0x8498c8f]
[0x88aec34]
[0x8048111]

Exiting...
# cpu_run_time_pref: 14400
# DONE :: 1 starting structures built 20 (nstruct) times
# This process generated 20 decoys from 20 attempts

</stderr_txt>

Validate state Valid
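
One way to read a log like the one above is to count how many times the application crashed and whether it eventually reported completion. The sketch below does that for an abbreviated copy of this stderr text (most stack frames are left out). The line prefixes it looks for are taken from this one log only, so treating them as a stable format is an assumption, and the helper `summarize_stderr` is hypothetical. It says nothing about whether the resulting models are scientifically sound.

```python
import re

# Abbreviated copy of the stderr text above; most stack frames omitted.
SAMPLE = """\
# random seed: 1986744
SIGSEGV: segmentation violationStack trace (16 frames):
[0x8836a6b]
Exiting...
# random seed: 1986744
# cpu_run_time_pref: 14400
SIGSEGV: segmentation violationStack trace (21 frames):
[0x8836a6b]
Exiting...
# cpu_run_time_pref: 14400
# DONE :: 1 starting structures built 20 (nstruct) times
# This process generated 20 decoys from 20 attempts
"""


def summarize_stderr(stderr_text: str) -> dict:
    """Count crashes and extract the final decoy totals, if present."""
    lines = stderr_text.splitlines()
    crashes = sum(1 for line in lines if line.startswith("SIGSEGV"))
    finished = any(line.startswith("# DONE") for line in lines)
    match = re.search(r"generated (\d+) decoys from (\d+) attempts", stderr_text)
    decoys = int(match.group(1)) if match else None
    attempts = int(match.group(2)) if match else None
    return {"crashes": crashes, "finished": finished,
            "decoys": decoys, "attempts": attempts}


summary = summarize_stderr(SAMPLE)
print(summary)
```

For this log the summary shows two crashes followed by a completed run, which fits the guess that the application was restarted after each crash and then finished normally.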



©2024 University of Washington
https://www.bakerlab.org