Report Problems with Rosetta Version 5.07

Message boards : Number crunching : Report Problems with Rosetta Version 5.07

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 15374 - Posted: 3 May 2006, 1:26:45 UTC - in response to Message 15373.  
Last modified: 3 May 2006, 1:31:11 UTC

I still keep getting units that say they are running when they are not. No Error messages in the log.

Go to the Projects tab, look at Rosetta, then follow that line over to the "Status" column. Does it say "suspended" there? You should also check the work/tasks tab to see if that particular WU is suspended.
Is Rosetta your only project? If not, are the other projects working OK?
If Rosetta is your only project, right-click on the B in the systray and see if BOINC is suspended or set to do work based on prefs.
If it is set to run based on prefs, check your "general preferences" under "your account" and see if you have asked it to stop work while the computer is in use, or at specific times.

tony
ID: 15374 · Rating: 0
Philip Hood
Joined: 11 Feb 06
Posts: 3
Credit: 35,986
RAC: 0
Message 15376 - Posted: 3 May 2006, 1:38:36 UTC

I suspended the work unit after I noticed it wasn't consuming any CPU time; I don't have time to babysit it right now. SETI and Predictor are also running on this machine and have no problems. Rosetta seems to have this problem every few work units. It used to be worse before 5.07.
ID: 15376 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 15377 - Posted: 3 May 2006, 1:39:05 UTC - in response to Message 15373.  
Last modified: 3 May 2006, 1:42:14 UTC

I still keep getting units that say they are running when they are not. No Error messages in the log.

I just reread your post. Do you mean it says "running" in the status column of the work/tasks tab? If yes, have you viewed the graphics to see if they're running? Are you a Win98/ME user?

[edit] I see two Linux computers and one Win2000 computer. Which one is it?
ID: 15377 · Rating: 0
Philip Hood
Joined: 11 Feb 06
Posts: 3
Credit: 35,986
RAC: 0
Message 15378 - Posted: 3 May 2006, 1:45:30 UTC

This is a Linux machine; I don't run the graphics on it, so I have no idea what they would look like. The situation was definitely that the status of the work unit was "running" while no CPU was being consumed. When work units get into this state, they hog all the computer time without accomplishing anything.
ID: 15378 · Rating: 0
Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 15379 - Posted: 3 May 2006, 1:50:00 UTC

Sorry, Philip, I'm linux stupid and can't help you further, though I'd like to.

tony
ID: 15379 · Rating: 0
tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 15394 - Posted: 3 May 2006, 8:06:52 UTC - in response to Message 15378.  

This is a Linux machine; I don't run the graphics on it, so I have no idea what they would look like. The situation was definitely that the status of the work unit was "running" while no CPU was being consumed. When work units get into this state, they hog all the computer time without accomplishing anything.

If you can, restart BOINC. If there is still no CPU usage, abort the WU. You get credit, and the WU will be sent out to someone else.
ID: 15394 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 15400 - Posted: 3 May 2006, 12:48:12 UTC - in response to Message 15378.  
Last modified: 3 May 2006, 12:51:06 UTC

This is a Linux machine; I don't run the graphics on it, so I have no idea what they would look like. The situation was definitely that the status of the work unit was "running" while no CPU was being consumed. When work units get into this state, they hog all the computer time without accomplishing anything.


Can you do a "ps" to see the status of the BOINC and Rosetta processes? Or use "top" to see if it consumes CPU time?

E.g. on my Linux box (note the STAT column: RN = running, niced):
ps u -U boinc
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
boinc     2120  0.0  0.4  7396 3684 ?        S    Apr27   0:06 ./boinc_client
boinc     8605 21.8  8.5 158868 63416 ?      RN   May02 404:22 rosetta_5.07_i686
boinc     8606  0.0  8.5 158868 63416 ?      SN   May02   0:00 rosetta_5.07_i686
boinc     8607  0.0  8.5 158868 63416 ?      SN   May02   0:00 rosetta_5.07_i686
boinc     8608  0.0  8.5 158868 63416 ?      SN   May02   0:00 rosetta_5.07_i686

I had a problem similar to yours 3+ months ago on an under-spec'ed Linux PC with just 256MB RAM, where I was running 6 different BOINC projects with leave-preempted-in-mem=Yes. BOINC would think Rosetta was running when it wasn't, so it wouldn't switch between projects, effectively "hanging".

I never looked into it, I just reduced # of BOINC projects to 3 and I've never had the problem again.
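
The "ps" check above can also be scripted. This is a minimal sketch, assuming the `ps u` column layout shown in the listing (11 columns, COMMAND last). The function names `parse_ps` and `running_commands` are made up for illustration, and a single sample with no process in the R state is only a hint, not proof that a work unit is hung.

```python
def parse_ps(text):
    """Parse `ps u` output into a list of dicts keyed by the header names."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        # COMMAND is the last column and may contain spaces, so cap the split.
        parts = ln.split(None, len(header) - 1)
        rows.append(dict(zip(header, parts)))
    return rows


def running_commands(rows, prefix="rosetta"):
    """Return rows whose COMMAND starts with `prefix` and whose STAT starts with R."""
    return [r for r in rows
            if r["COMMAND"].startswith(prefix) and r["STAT"].startswith("R")]


# Sample taken from the listing in this post (shortened to three processes).
SAMPLE = """\
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
boinc     2120  0.0  0.4  7396 3684 ?        S    Apr27   0:06 ./boinc_client
boinc     8605 21.8  8.5 158868 63416 ?      RN   May02 404:22 rosetta_5.07_i686
boinc     8606  0.0  8.5 158868 63416 ?      SN   May02   0:00 rosetta_5.07_i686
"""

if __name__ == "__main__":
    active = running_commands(parse_ps(SAMPLE))
    for row in active:
        print(row["PID"], row["STAT"], row["TIME"], row["COMMAND"])
    if not active:
        print("no rosetta process in R state; sample again before concluding anything")
```

Taking two samples a minute apart and comparing the TIME column would be a stronger check than looking at STAT once.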
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 15400 · Rating: 0
Rebel Alliance
Joined: 4 Nov 05
Posts: 50
Credit: 3,579,531
RAC: 0
Message 15416 - Posted: 3 May 2006, 17:03:59 UTC


ID: 15416 · Rating: 1
David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 15429 - Posted: 3 May 2006, 20:27:25 UTC

OS = Linux 2.6.10
CPU = AMD Sempron 3000+
Memory = 1024M (64M shared video)

Failure Rate: approximately 70%

With v5.01 of the Rosetta app, this rig ran clean. Near 100% completion.
With v5.07, I'm lucky to get 1 result in 3 successfully completed.

Any suggestions?
ID: 15429 · Rating: 0
tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 15443 - Posted: 3 May 2006, 21:40:11 UTC - in response to Message 15429.  
Last modified: 3 May 2006, 21:40:21 UTC

OS = Linux 2.6.10
CPU = AMD Sempron 3000+
Memory = 1024M (64M shared video)

Failure Rate: approximately 70%

With v5.01 of the Rosetta app, this rig ran clean. Near 100% completion.
With v5.07, I'm lucky to get 1 result in 3 successfully completed.

Any suggestions?


I looked at your host and I see as many errors for 5.01 as for 5.07. All the WUs that failed on your host completed successfully on another host. Almost all your errors have exit code 131; this may help the team figure out what's going on on your machine.
ID: 15443 · Rating: 1
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 15444 - Posted: 3 May 2006, 21:42:11 UTC - in response to Message 15429.  
Last modified: 3 May 2006, 21:44:42 UTC

OS = Linux 2.6.10
CPU = AMD Sempron 3000+
Memory = 1024M (64M shared video)

Failure Rate: approximately 70%

With v5.01 of the Rosetta app, this rig ran clean. Near 100% completion.
With v5.07, I'm lucky to get 1 result in 3 successfully completed.

Any suggestions?


Looking at your host's log, you seem to get SIGSEGV errors.

Btw, do you have Leave-in-mem-when-preempted=YES? (I would try this first.) It looks as if WUs are restarted several times. Which Linux distro are you using (FC5?)

Also see here for others having a similar problem.
ID: 15444 · Rating: 1
charmed
Joined: 2 Nov 05
Posts: 11
Credit: 1,780,440
RAC: 0
Message 15445 - Posted: 3 May 2006, 21:43:25 UTC
Last modified: 3 May 2006, 21:44:42 UTC

This work unit failed as I was watching it: https://boinc.bakerlab.org/rosetta/result.php?resultid=19041999
Running Windows XP on an Athlon64 3200+ with 1 GB of memory.

ID: 15445 · Rating: 0
David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 15472 - Posted: 4 May 2006, 2:51:57 UTC - in response to Message 15443.  
Last modified: 4 May 2006, 3:03:33 UTC

I looked at your host and I see as many errors for 5.01 as for 5.07. All the WUs that failed on your host completed successfully on another host. Almost all your errors have exit code 131; this may help the team figure out what's going on on your machine.


Many thanks for your quick response =)

I am especially grateful for the detective work that produced the exit code. I will be sure to include this in further posts regarding this host.
ID: 15472 · Rating: 0
David Emigh
Joined: 13 Mar 06
Posts: 158
Credit: 417,178
RAC: 0
Message 15473 - Posted: 4 May 2006, 3:01:49 UTC - in response to Message 15444.  
Last modified: 4 May 2006, 3:02:30 UTC

Looking at your host's log, you seem to get SIGSEGV errors.

Btw, do you have Leave-in-mem-when-preempted=YES? (I would try this first.) It looks as if WUs are restarted several times. Which Linux distro are you using (FC5?)

Also see here for others having a similar problem.


Awesome response!!!

I set the leave-in-mem-when-preempted to NO quite a while ago when I saw a message in technical news that said to do so. I will return that variable to YES immediately.

The Linux distribution I am using is LinSpire 5.0 (build 5.0.59).

Thank you very much for your response =)
ID: 15473 · Rating: 0
tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 15483 - Posted: 4 May 2006, 6:10:49 UTC

Not a failure but a "suspicious" WU:

https://boinc.bakerlab.org/rosetta/result.php?resultid=19037537

This one generated 1123 decoys in 8 hours. Each model started in Full Atom Relax mode in a somewhat "unfolded" state (only part of the amino acid chain was visible) and always had a high RMSD (about 50). After a few steps it quit and started a new model.
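
To put that decoy count in perspective, here is the arithmetic as a tiny sketch. It assumes the 8 hours are the full run time of the result; the helper name `seconds_per_decoy` is made up for illustration.

```python
def seconds_per_decoy(hours, decoys):
    """Average wall time spent on each decoy, in seconds."""
    return hours * 3600 / decoys


if __name__ == "__main__":
    # 8 hours = 28800 seconds, spread over 1123 decoys
    print(round(seconds_per_decoy(8, 1123), 1))  # about 25.6 seconds per decoy
```

Roughly 26 seconds per model fits the description of each model quitting after only a few steps.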
ID: 15483 · Rating: 0
Rebirther
Joined: 17 Sep 05
Posts: 116
Credit: 41,315
RAC: 0
Message 15488 - Posted: 4 May 2006, 10:08:55 UTC

I have suspended the following WU: FA_CASP6_t198__470_5745_0
After 2:13h it is at only 1.04%. The step count is increasing very slowly.
Last entries in stdout.txt:
CYCLES::number is 1 x total_residue: 69
initializing full atom coordinates
BOINC :: [2006-05-04 11:46:11] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 7 :: num_decoys: 7 :: farlx_stage: 10
dump_fullatom_pdb: farlxcheck
starting score 357.328156 rms 4.70180273
starting full atom minimization
[T/F OPT]Default FALSE value for [-infinite_loop]

Should I let it run further or abort it? I don't know how long it will take. Normally one WU takes 3h. RAM usage is 200MB now.
ID: 15488 · Rating: 0
tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 15491 - Posted: 4 May 2006, 11:17:47 UTC - in response to Message 15488.  

I have suspended the following WU: FA_CASP6_t198__470_5745_0
After 2:13h it is at only 1.04%. The step count is increasing very slowly.
Last entries in stdout.txt:
CYCLES::number is 1 x total_residue: 69
initializing full atom coordinates
BOINC :: [2006-05-04 11:46:11] :: checkpoint_decoys() :: saved decoy info :: attempted_decoys: 7 :: num_decoys: 7 :: farlx_stage: 10
dump_fullatom_pdb: farlxcheck
starting score 357.328156 rms 4.70180273
starting full atom minimization
[T/F OPT]Default FALSE value for [-infinite_loop]

Should I let it run further or abort it? I don't know how long it will take. Normally one WU takes 3h. RAM usage is 200MB now.

t198 is one of the bigger proteins: 235 amino acids. I'd let it run at least 4 hours before aborting. Better to abort only if it reaches 24 hours and the 300-credit claim barrier for failed WUs.
ID: 15491 · Rating: 0
rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 15500 - Posted: 4 May 2006, 12:12:32 UTC - in response to Message 15491.  

I have suspended the following WU: FA_CASP6_t198__470_5745_0
After 2:13h it is at only 1.04%. The step count is increasing very slowly....
I don't know how long it will take. Normally one WU takes 3h. RAM usage is 200MB now.

t198 is one of the bigger proteins: 235 amino acids. I'd let it run at least 4 hours before aborting. Better to abort only if it reaches 24 hours and the 300-credit claim barrier for failed WUs.

I had one run over my 8 hour preference time, but then it completed. Just a huge protein it seems! Now I have set my preference time to 12 hours. And I have learned to be patient! :)

Regards,
Bob P.
ID: 15500 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,217,773
RAC: 2,465
Message 15512 - Posted: 4 May 2006, 15:49:26 UTC
Last modified: 4 May 2006, 15:52:58 UTC

Keep in mind that the 1.04% number is REALLY just telling you that it is still on model 1. Once it completes model 1, it will recompute the % completed and may determine that you're 60% done, or even 100% and end it.
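
As an illustration only, here is one way a recomputation like that could work. Everything in this sketch is an assumption: the function `progress_after_model` and its rule (compare elapsed time against the run-time preference, and finish if another average-length model would not fit) are hypothetical and are not taken from the Rosetta source.

```python
def progress_after_model(elapsed_s, models_done, run_time_pref_s):
    """Hypothetical estimate of the fraction done after a model completes.

    Returns None before the first model finishes (nothing to extrapolate from),
    1.0 when another model of average length would exceed the run-time
    preference, and otherwise the share of the preference already used.
    """
    if models_done == 0:
        return None
    avg_model_s = elapsed_s / models_done
    if elapsed_s + avg_model_s > run_time_pref_s:
        return 1.0
    return elapsed_s / run_time_pref_s


if __name__ == "__main__":
    pref = 14400  # seconds; an example run-time preference of 4 hours
    print(progress_after_model(0, 0, pref))     # None: still on model 1
    print(progress_after_model(3600, 1, pref))  # 0.25
    print(progress_after_model(9000, 1, pref))  # 1.0: a second model would not fit
```

Under these assumptions the displayed percentage can jump from a small placeholder straight to a large value once the first model is done, which matches the behaviour described above.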
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 15512 · Rating: 0
kb7rzf
Joined: 7 Oct 05
Posts: 16
Credit: 35,427
RAC: 0
Message 15521 - Posted: 4 May 2006, 17:24:40 UTC

Got this error on this WU; here's the info:

Result ID 18984230
Name HBLR_1.0_2tif_ROT_TRIALS_TRIE_CHECKPOINTS_482_214_0
Workunit 15712387
Created 3 May 2006 0:08:00 UTC
Sent 3 May 2006 4:07:40 UTC
Received 4 May 2006 16:13:06 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status 1 (0x1)
Computer ID 12719
Report deadline 17 May 2006 4:07:40 UTC
CPU time 8127.546875
stderr out <core_client_version>5.4.2</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# random seed: 1065667
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 14400
# cpu_run_time_pref: 14400
ERROR:: Exit at: .hbonds.cc line:293

</stderr_txt>


Validate state Invalid
Claimed credit 15.0702212672248
Granted credit 0
application version 5.07


Thanks.

Jeremy
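
When collecting reports like the one above, the useful fields can be pulled out of the stderr text with a few regular expressions. This is a minimal sketch: the function name `summarize` and the returned keys are made up for illustration, and the patterns assume the line formats seen in this particular report.

```python
import re

# Excerpt of the "stderr out" field from the report above.
REPORT = """\
<core_client_version>5.4.2</core_client_version>
<message>Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# random seed: 1065667
# cpu_run_time_pref: 14400
ERROR:: Exit at: .hbonds.cc line:293

</stderr_txt>
"""


def summarize(report):
    """Extract exit code, failing source location and run-time preference."""
    exit_code = re.search(r"exit code (-?\d+)", report)
    error = re.search(r"^ERROR:: Exit at: (\S+) line:(\d+)", report, re.MULTILINE)
    pref = re.search(r"^# cpu_run_time_pref: (\d+)", report, re.MULTILINE)
    return {
        "exit_code": int(exit_code.group(1)) if exit_code else None,
        "source_file": error.group(1) if error else None,
        "source_line": int(error.group(2)) if error else None,
        "cpu_run_time_pref": int(pref.group(1)) if pref else None,
    }


if __name__ == "__main__":
    print(summarize(REPORT))
```

Fields that are missing from a report come back as None, so the same function can be run over results that failed in other ways.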

ID: 15521 · Rating: 0

©2021 University of Washington
https://www.bakerlab.org