Report Problems with Rosetta Version 5.25

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 24326 - Posted: 23 Aug 2006, 3:13:52 UTC - in response to Message 24323.  

Sigh...it's not my PC. Look, every project runs fine, I stress my PC 24/7. Yes I've tried diagnostic tools but they always turn out ok. For the past few weeks a lot of people have complained about this "stuck" unit issue, so I *know* I'm not alone. Something is broken in the Linux version for sure.


Hmm... Well, my next wild guess is that the bug involves an interaction between Rosetta and some other project. (I'm trying to figure out the key difference between the Linux machines that have this problem and my Linux machines, which don't. The machines I have on Rosetta don't crunch any other project.)

svincent
Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 24767 - Posted: 24 Aug 2006, 20:49:28 UTC

Work unit FRA_t370_CASPR_hom001_7_t370_7_dec18IGNORE_THE_REST_3_1213_181_0 is stuck at 40.591%. The Activity Monitor says it's still running, but the CPU time doesn't increment.

Using Rosetta 5.25 on a 1 GHz eMac with OS X 10.3.9.

BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 24921 - Posted: 26 Aug 2006, 1:55:23 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=33414273
5croA_BOINC_BACKBONE_HN_PENALTY_ABRELAX_SAVE_ALL_OUT__1175_14_0
ERROR:: Exit at: .dock_structure.cc line:401

https://boinc.bakerlab.org/rosetta/result.php?resultid=33427512
1utg__BOINC_BACKBONE_HN_PENALTY_ABRELAX_SAVE_ALL_OUT__1175_81_0
ERROR:: Exit at: .dock_structure.cc line:401

https://boinc.bakerlab.org/rosetta/result.php?resultid=33427522
2tif__BOINC_BACKBONE_HN_PENALTY_ABRELAX_SAVE_ALL_OUT__1175_81_0
ERROR:: Exit at: .dock_structure.cc line:401

https://boinc.bakerlab.org/rosetta/result.php?resultid=33431583
1louA_BOINC_BACKBONE_HN_PENALTY_ABRELAX_SAVE_ALL_OUT__1175_102_0
ERROR:: Exit at: .dock_structure.cc line:401

https://boinc.bakerlab.org/rosetta/result.php?resultid=33507616
FRA_t367_CASPR_hom001_6_t367_4_1wolA_IGNORE_THE_REST_28_1076_62_1
ERROR:: Exit at: .pack.cc line:1860

https://boinc.bakerlab.org/rosetta/result.php?resultid=33445584
1opd__BOINC_BACKBONE_HN_PENALTY_ABRELAX_SAVE_ALL_OUT__1175_174_0
ERROR:: Exit at: .dock_structure.cc line:401
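
When a host has pages of results like these, it can help to tally the exit sites instead of reading them one by one. A minimal sketch, assuming the "ERROR:: Exit at:" lines have been pasted into one string; the function name is made up for illustration, and dropping the leading dot from the file name is my assumption about how these lines are formatted:

```python
import re
from collections import Counter

# Matches lines such as "ERROR:: Exit at: .dock_structure.cc line:401".
# The optional leading dot is dropped (an assumption about the log format).
EXIT_RE = re.compile(r"ERROR:: Exit at: \.?(?P<file>[\w.]+) line:(?P<line>\d+)")


def tally_exit_sites(text: str) -> Counter:
    """Count how often each (source file, line number) pair appears in the text."""
    return Counter((m["file"], int(m["line"])) for m in EXIT_RE.finditer(text))


if __name__ == "__main__":
    sample = """
    ERROR:: Exit at: .dock_structure.cc line:401
    ERROR:: Exit at: .dock_structure.cc line:401
    ERROR:: Exit at: .pack.cc line:1860
    """
    for (source_file, line_no), count in tally_exit_sites(sample).most_common():
        print(f"{count} x {source_file}:{line_no}")
```

A single dominant exit site, as in the list above, suggests one common cause rather than many unrelated failures.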

Tino Ruiz
Joined: 12 Oct 05
Posts: 13
Credit: 397,392
RAC: 0
Message 24986 - Posted: 26 Aug 2006, 14:21:11 UTC
Last modified: 26 Aug 2006, 14:21:47 UTC

Now this is odd. I've been able to complete the first work unit since this whole "stuck" issue started a few weeks ago.

Fri 25 Aug 2006 09:14:39 PM AST|rosetta@home|Computation for task FRA_t370_CASPR_hom001_7_t370_7_dec83IGNORE_THE_REST_1_1213_653_0 finished

I didn't change anything. Perhaps the new work units have been fixed?

R.L. Casey
Joined: 7 Jun 06
Posts: 91
Credit: 2,728,885
RAC: 0
Message 24997 - Posted: 26 Aug 2006, 15:28:00 UTC - in response to Message 24986.  
Last modified: 26 Aug 2006, 15:29:28 UTC

Hi MonsterTruck,
The Work Units that I can see for you show that all of your Rosetta runs have ended with a "segmentation violation", also called a "segmentation fault". This happens when, for example, the Rosetta application tries to write data to a memory location that the processor treats as unavailable, because the memory is marked "read-only" or is allocated to another process or to the operating system. However, two of the WU results show that some Rosetta decoys were built before the segmentation fault, so you ended up receiving some credit.

Perhaps you could check some WU results from your other projects to see if some of them are also experiencing errors. I do not know whether an application from another project could cause a segmentation fault in the Rosetta application. That's the limit of my knowledge on this, so others hopefully will continue to chime in. Good luck, and keep crunching! The Rosetta people really can use all the support we can give!
Edit: typos.

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 25009 - Posted: 26 Aug 2006, 18:20:30 UTC

MonsterTruck, here's something simple you might try. I assume you're using the default 60 minutes to switch between projects. If you increase that to 240 minutes or more, then BOINC should be able to complete most Rosetta WUs without a switch (as you seem to be using the default 3 hours for these WUs). If the switch is causing the corruption, then this should greatly reduce the number of WUs that end up corrupted.
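
The arithmetic behind that suggestion, as a rough sketch. It assumes BOINC switches away at every interval boundary, which the real scheduler does not guarantee, and the function name is made up for illustration:

```python
import math


def expected_switches(wu_cpu_minutes: float, switch_interval_minutes: float) -> int:
    """Upper bound on how often a work unit is switched out before it finishes.

    Simplified model: one switch at every interval boundary that falls
    strictly before the work unit completes.
    """
    if wu_cpu_minutes <= 0 or switch_interval_minutes <= 0:
        raise ValueError("both durations must be positive")
    # A WU that finishes exactly on a boundary is not interrupted at that boundary.
    return math.ceil(wu_cpu_minutes / switch_interval_minutes) - 1


if __name__ == "__main__":
    for interval in (60, 120, 180, 240):
        print(f"{interval:>3} min interval: {expected_switches(180, interval)} switch(es)")
```

Under this model, a 3-hour WU is switched out twice with a 60-minute interval and not at all with 240 minutes.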

tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 25010 - Posted: 26 Aug 2006, 18:36:22 UTC

I think the trouble for MonsterTruck has another source: All but one WU were aborted by him, since he _assumed_ they were stuck. A restart of BOINC resumes all "stuck" WUs, so there is no need to abort. Furthermore, his WUs all completed fine on other machines. Maybe he was troubled because the progress bar did not move for a long time.

Tino Ruiz
Joined: 12 Oct 05
Posts: 13
Credit: 397,392
RAC: 0
Message 25048 - Posted: 26 Aug 2006, 22:34:58 UTC
Last modified: 26 Aug 2006, 22:36:03 UTC

R.L. Casey: thanks for your insight. Yes, vtu@home immediately ends with an error (aborted), but the author is aware of this issue on Linux and told me it's a bug in his app. So far he hasn't been able to fix it yet. I doubt his app interferes with Rosetta, but it's an interesting point of view.

AMD_is_logical: since I've switched to 120 minutes, the number of errors seems to have gone down. I've upped it to 180 minutes now. Hopefully this will eliminate most, if not all, of my problems. I don't want to set it too high since I'm attached to a lot of projects and might miss a few deadlines.

tralala: no, no, no. After a while, the Rosetta work units quit computing altogether. They have "running" next to them, but the CPU % falls to 0, and stays there until BOINC switches to another project. When BOINC returns to Rosetta, the CPU will stay at 0%, doing absolutely nothing. That's why I have to abort it. Apparently with the increased time to work on the units, this seems to be less of an issue.

I'll keep y'all posted if anything changes.

Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,797,156
RAC: 2,434
Message 25257 - Posted: 28 Aug 2006, 11:57:29 UTC

I have dropped my switch time from 90 seconds back to 60 seconds and extended the work time from 4 hours to 6 hours.
This seems to have stopped the hung WUs, but I still get a few computational errors (process exited:exit code 1).
Nothing like I was getting before.
I also left the project in memory.
At least I can now process some WUs rather than all of them erroring out; I have not had to abort any WUs for a couple of days now.

tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 25258 - Posted: 28 Aug 2006, 12:52:36 UTC - in response to Message 25048.  
Last modified: 28 Aug 2006, 12:57:29 UTC

Okay, if the CPU load goes to 0% then it is really stuck. It seems we now have a new stuck problem: not the 1% stuck but the 0% stuck. ;-) It is not as widespread as the first one, though. It seems to affect only Linux, and only if you crunch several projects. Perhaps you can let one of your boxes crunch Rosetta exclusively for a few days in order to check whether it has to do with BOINC switches.

The project staff has posted that they will soon test a new app on Ralph, which will eventually be rolled out here, and there is a faint hope that this will be solved then.

TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 25392 - Posted: 29 Aug 2006, 3:32:45 UTC - in response to Message 22414.  

Yet another stuck work unit:

NMR_1i27_CASPR_1_1i27__1_id_model_10IGNORE_THE_REST_idl_1218_1949
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=30244351

It has been running for 27 hours but it is stuck at 2 hours accumulated CPU time. BOINC says Rosetta is running, but the CPU for the process is at 0% and the load average is 0.

As with the previous stuck WU, I rebooted the machine and the WU immediately terminated with the error

ERROR:: Exit at: initialize.cc line:1618

This machine is running CentOS 4.3
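
For Linux users who want to confirm this kind of stall without watching top, here is a minimal sketch that samples the accumulated CPU time of a process from /proc/<pid>/stat (utime and stime are fields 14 and 15, in clock ticks). The function names and the 60-second default are made up for illustration, and a process that is legitimately waiting on I/O would also show no increase, so treat a single positive result with caution:

```python
import time


def cpu_ticks_from_stat(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before splitting the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 14 (utime) is rest[11]
    # and field 15 (stime) is rest[12].
    return int(rest[11]) + int(rest[12])


def is_stalled(ticks_before: int, ticks_after: int) -> bool:
    """A process is considered stalled if its CPU time did not increase."""
    return ticks_after <= ticks_before


def watch(pid: int, interval_seconds: float = 60.0) -> bool:
    """Sample a live process twice and report whether it looks stalled (Linux only)."""
    def read() -> int:
        with open(f"/proc/{pid}/stat") as stat_file:
            return cpu_ticks_from_stat(stat_file.read())

    before = read()
    time.sleep(interval_seconds)
    return is_stalled(before, read())
```

Calling watch() with the PID of the Rosetta process returns True when no CPU time was accumulated during the interval.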

Ethan
Volunteer moderator
Joined: 22 Aug 05
Posts: 286
Credit: 9,304,700
RAC: 0
Message 25394 - Posted: 29 Aug 2006, 3:36:21 UTC - in response to Message 25392.  

Before I pass it along (I apologize for not having a clue), what is CentOS?



BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 25398 - Posted: 29 Aug 2006, 3:49:33 UTC

CentOS is a form of Penguinware?

http://distrowatch.com/table.php?distribution=centos
http://en.wikipedia.org/wiki/CentOS





TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 25402 - Posted: 29 Aug 2006, 4:18:07 UTC - in response to Message 25394.  


Before I pass it along (I apologize for not having a clue), what is CentOS?


It is a free version of a "prominent North American Enterprise Linux vendor" product.

BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 25453 - Posted: 29 Aug 2006, 17:45:26 UTC

hostid=292556
has a couple of pages of errored out results; all with the following error:


ERROR:: Exit at: .fullatom_setup.cc line:550

While most refer to ABRELAX, a handful were of different types - here's a sample:
1di2__CONTROL_ABRELAX_SAVE_ALL_OUT__1182_4124_0
BENCH_ABRELAX_SAVE_ALL_OUT_1opd__BARCODE_CONTROL_filters_1215_8066_0
BENCH_ABRELAX_SAVE_ALL_OUT_ (16? with different endings..)
2reb__CHEAT_ABRELAX_SAVE_ALL_OUT_BARCODE__1193_4427_0
NMR_1i27_CASPR_1_1i27__1_id_model_12IGNORE_THE_REST_idl_1218_2381_0

What is this error? If it's a missing file on an end user's machine, what's the easiest way to force Rosetta to download the missing file? And if it's because someone uploaded a string of WUs that weren't complete - please take their Starbucks card and give them a can of jolt for use just prior to uploading new WUs for us. :)

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 25458 - Posted: 29 Aug 2006, 18:40:37 UTC

It appears to be when the app tries to read the following file:

bbdep02.May.sortlib.gz

This is a required database file that gets downloaded once when you get your very first work unit and then stays on your computer. I do not know the exact cause of the error, but it is not a universal error and is only happening to a small number of users. If the file no longer exists for some reason, I suggest manually downloading it and placing it in the R@h project directory in your BOINC installation. Alternatively, reset the project.

https://boinc.bakerlab.org/rosetta/download/15a/bbdep02.May.sortlib.gz
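
Before downloading anything, it may be worth checking whether the file is actually missing or damaged. A minimal sketch, assuming you pass in the path of your own R@h project directory (it varies by installation, so none is hard-coded here); the function name is made up, and reading the archive to the end only checks that it decompresses, not that its contents are what Rosetta expects:

```python
import gzip
import zlib
from pathlib import Path

DB_NAME = "bbdep02.May.sortlib.gz"
DB_URL = "https://boinc.bakerlab.org/rosetta/download/15a/bbdep02.May.sortlib.gz"


def database_file_ok(project_dir: Path, name: str = DB_NAME) -> bool:
    """Return True if the file exists, is non-empty, and decompresses cleanly."""
    path = project_dir / name
    if not path.is_file() or path.stat().st_size == 0:
        return False
    try:
        with gzip.open(path, "rb") as archive:
            # Read to the end in 1 MiB chunks; a truncated or corrupted
            # archive raises an exception along the way.
            while archive.read(1 << 20):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

If the check returns False, fetch the file again from DB_URL, or reset the project as suggested above.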

R.L. Casey
Joined: 7 Jun 06
Posts: 91
Credit: 2,728,885
RAC: 0
Message 25465 - Posted: 29 Aug 2006, 19:19:45 UTC - in response to Message 25402.  


Before I pass it along (I apologize for not having a clue), what is CentOS?


It is a free version of a "prominent North American Enterprise Linux vendor" product.

That would be a recompilation of the open-source code of Red Hat Enterprise Linux. BennyRop's Wikipedia link is super.

Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 25515 - Posted: 29 Aug 2006, 22:31:23 UTC

I've just noticed that on my host 290356 (Linux x86), of the last 7 results, the 6 oldest validated, but 4 of them contained "SIGSEGV: segmentation violation" in the output, one of them also with "*** glibc detected *** malloc(): memory corruption: 0x096a2509 ***".

The newest result SIGSEGVed too, but additionally errored out (hence I noticed it) with code 131 (0x83).

Other projects have no problem (except one bad malaria result).

Peter

kmanley57
Joined: 10 Nov 05
Posts: 1
Credit: 128,709
RAC: 0
Message 25524 - Posted: 29 Aug 2006, 23:06:40 UTC

I have been trying to run version 5.25 for over a day now, and both my Pentium dual-core and single-core Linux machines are stopping in the middle of WUs. So I put the dual-core into 'Rosetta only mode', and it has so far processed both WUs given to it. The single-core is doing better, with only 2 out of about 7 WUs hanging. On the dual-core, this is the first time in over a day that a WU has gotten past about 68%, so only 2 out of about 8 have worked on it. I just wanted to pass this information on to the Rosetta team. I will let the two I have in the queue finish, then disconnect from the project and check back in a couple of months.

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 25538 - Posted: 30 Aug 2006, 1:50:31 UTC

kmanley57, can you describe a little more what you are seeing? Are you looking at the graphic? If so, does it show the step number increasing? If not, are you looking at the Windows task manager? Is Rosetta using CPU? In the BOINC manager, is the CPU time shown in the tasks tab increasing?

I'm thinking there is a BOINC problem with properly dispatching CPU.

For future readers, he's got an Exit status -1073741819 on one Windows machine running BOINC 5.4.9, and an Exit status -164 on another running BOINC 5.2.13.
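
A large negative exit status on Windows is usually easier to look up in hex. A minimal sketch of the conversion (the function name is made up for illustration):

```python
def as_unsigned32(exit_status: int) -> str:
    """Format a signed 32-bit exit status as an 8-digit unsigned hex string."""
    return f"0x{exit_status & 0xFFFFFFFF:08X}"


if __name__ == "__main__":
    print(as_unsigned32(-1073741819))  # 0xC0000005
```

0xC0000005 is the value Windows documents as STATUS_ACCESS_VIOLATION, roughly the Windows counterpart of a segmentation fault on Linux. A small value such as -164 is not a Windows status code of that kind; it presumably comes from BOINC itself, which is an assumption on my part.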
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/