Report Problems with Rosetta Version 5.25

Ethan
Volunteer moderator
Joined: 22 Aug 05
Posts: 286
Credit: 9,304,700
RAC: 0
Message 24154 - Posted: 21 Aug 2006, 18:48:19 UTC

The scientists are looking into these errors. Let's wait and see what they're able to do.

Saenger
Joined: 19 Sep 05
Posts: 271
Credit: 824,883
RAC: 0
Message 24156 - Posted: 21 Aug 2006, 18:57:27 UTC - in response to Message 24151.  

So I think the labelling should be changed, as it's also possible that a result is really invalid, for example when the hardware is faulty and delivers no useful results.
It would be difficult to decide: is the result invalid because the computer failed? Or is it invalid because the "routines / parameter combination" used doesn't work? The second is a very useful result for Rosetta.

That's right.
I don't know how that could be determined, or whether it can be at all.
If it can't, I'd prefer the solution with somewhat fewer credits: not necessarily half, but fewer.
But if it is possible, the "useful errors" should definitely get credit, while broken hardware should not.
But I will wait and see; it's nothing important.

Tino Ruiz
Joined: 12 Oct 05
Posts: 13
Credit: 397,392
RAC: 0
Message 24159 - Posted: 21 Aug 2006, 19:27:44 UTC

Sigh...looks like I spoke too soon. :-/ I had to abort this unit because it's stuck, again. And leaving the app in memory is not an option for me as I'm attached to 14 projects.

Mon 21 Aug 2006 03:24:21 PM AST|rosetta@home|Unrecoverable error for result 1dhn__BOINC_BACKBONE_O_PENALTY_ABRELAX_SAVE_ALL_OUT__1176_735_0 (aborted by user)
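
For anyone who wants to pull lines like this out of a saved message log, here is a minimal Python sketch. It assumes the three-field `timestamp|project|message` layout seen in the line above, an English locale for the day and month names, and that the trailing timezone abbreviation can be dropped. The names `LogEntry`, `parse_log_line` and `aborted_result` are made up for illustration and are not part of BOINC.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional, Tuple


class LogEntry(NamedTuple):
    timestamp: datetime
    project: str
    message: str


# Matches the "Unrecoverable error" wording quoted in the post above.
ABORT_RE = re.compile(r"Unrecoverable error for result (\S+) \((.+)\)$")


def parse_log_line(line: str) -> LogEntry:
    """Split one 'timestamp|project|message' line into its three fields."""
    stamp, project, message = line.strip().split("|", 2)
    # Drop the trailing timezone abbreviation (e.g. "AST"); strptime's %Z
    # only recognises a handful of zone names, so it is ignored here.
    stamp_no_tz = stamp.rsplit(" ", 1)[0]
    timestamp = datetime.strptime(stamp_no_tz, "%a %d %b %Y %I:%M:%S %p")
    return LogEntry(timestamp, project, message)


def aborted_result(entry: LogEntry) -> Optional[Tuple[str, str]]:
    """Return (result_name, reason) for an 'Unrecoverable error' message."""
    match = ABORT_RE.match(entry.message)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Feeding it the line above gives the project name, a naive local timestamp, and the result name with the reason in parentheses, which is handy when listing stuck units in a report.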


AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 24163 - Posted: 21 Aug 2006, 19:55:42 UTC - in response to Message 24150.  

My opinion is that if the "decoys" are OK you should get credit for them.

Anders n

[edit] I assume that if the computer has done 5 decoys and fails on no. 6, it reports the 5 that were OK?! [/edit]


That sounds like a good idea.

At the moment they seem to be using claimed credit, which can be many times higher than what the WU would have gotten if it had been valid. Obviously that needs work.

Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,818,279
RAC: 888
Message 24205 - Posted: 21 Aug 2006, 23:42:19 UTC

I'm still getting lots of these errors, where the WU either hangs or just errors out. The ones that hang claim to be running (for hours) while the CPU is idle, so I have to abort them.
The following have error "process exited with code 131" "SIGSEGV:segmentation violation". The times are where the counters stopped.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28811448 (1 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28811405 (1 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28811373 (3.34 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28130018 (1.5 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28130012 (1.5 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28129987 (2 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28129980 (1 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28129979 (2 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28129978 (2 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28129967 (1 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606986 (1 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606985 (2.64 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606960 (1 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606959 (0.5 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606946 (1.87 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606945 (2.74 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606904 (1.85 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606894 (1 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606893 (1 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606884 (1.67 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606869 (1 hr)

I had to abort these next ones because they just hung with a SIGSEGV error; the timers stopped and CPU usage dropped to zero.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28811399 (1 hr, error : glibc detected : corrupted double-linked list 0x0aa89e38 : SIGSEGV)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28197646 (1.5 hr, error : glibc detected : corrupted double-linked list 0x09f18228 : SIGSEGV)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606951 (1 hr, error : glibc detected : corrupted double-linked list 0x0b3e0950 : SIGSEGV)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28811398 (1 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28130073 (1.5 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28130028 (1.5 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28129992 (1 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606949 (1 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606947 (1 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606937 (2.66 hr)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28606898 (1.89 hr)

Also had "process got signal 11 : SIGSEGV"
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28811388 (1 hr)
And process exited with code 1
ERROR::Exit at:fragments.cc line:459
FILE_LOCK::unlock():close failed.:Bad file descriptor

All of the above are from my 2 Linux machines. There are 2 more from my Windows machine:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=28228077 (91 sec)
"unhandled exception record"
Reason : Access Violation (0xc0000005) at address 0x004A4529 read attempt to address 0x00000024
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=29009370 (1014 sec)
"Incorrect function." exit code 1
ERROR::Exit at: .dock_structure.cc line:401

A lot of the "code 131" errors seem to happen when the BOINC Manager switches from one project to another: the WU errors out at the switch.
I hope this helps the developers, as it is becoming a nuisance to me, and I might have to stop using the Linux machines for Rosetta so that they keep doing something useful rather than sitting stuck on a WU and doing nothing.
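
A report like this is easier to read as counts per error type. Below is a rough Python sketch that buckets stderr snippets by the signature strings quoted in this post. The labels, the order of the checks and the function names are assumptions made for illustration; this is not how the project itself sorts errors.

```python
from collections import Counter
from typing import Iterable

# (substring, label) pairs taken from the error text quoted in the post.
# Order matters: the glibc lines also contain "SIGSEGV", so the more
# specific signature is checked first and the first match wins.
SIGNATURES = [
    ("glibc detected", "heap corruption (glibc)"),
    ("exited with code 131", "exit code 131"),
    ("signal 11", "signal 11"),
    ("Access Violation", "Windows access violation"),
    ("ERROR::Exit at:", "Rosetta internal exit"),
    ("SIGSEGV", "segmentation violation"),
]


def classify(stderr_text: str) -> str:
    """Return the label of the first signature found in the text."""
    for needle, label in SIGNATURES:
        if needle in stderr_text:
            return label
    return "unclassified"


def tally(reports: Iterable[str]) -> Counter:
    """Count how many reports fall into each bucket."""
    return Counter(classify(text) for text in reports)
```

Running `tally` over the stderr text of each failed WU gives a short table (for example, so many "exit code 131", so many glibc corruptions) that can be posted instead of a long list of links.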



Ethan
Volunteer moderator
Joined: 22 Aug 05
Posts: 286
Credit: 9,304,700
RAC: 0
Message 24218 - Posted: 22 Aug 2006, 1:48:48 UTC - in response to Message 24216.  
Last modified: 22 Aug 2006, 1:56:13 UTC

From Fuzzy:

I hope the practice continues. If the WU itself is what is wrong, and it has nothing to do with your system, and you have spent, say, 23 hours of a 24-hour unit working, why should you not get credits?

That's right, and that's why they grant something over @LHC.
But how is it determined that it was the software, and not the hardware?


We report them here, and then one of the devs looks at them to check the stderr output. If they recognize it as something related to the software, we get credit for them.



AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 24227 - Posted: 22 Aug 2006, 3:21:45 UTC

Conan, I've been trying to figure out what might be causing your problem, but I haven't really been able to. One idea, though, is that the version of BOINC you're using might have a bug when used on a multiprocessor machine.

If you're interested, here's something you could try. Download the latest recommended version of BOINC to one of your Linux machines. http://boinc.berkeley.edu/download.php?all_platforms=1
Then stop BOINC, install the downloaded BOINC into a NEW directory, start BOINC in that new directory, and attach to Rosetta. This should test for a bad or corrupted BOINC client as well as corruption in the Rosetta directory.

Whl.
Joined: 29 Dec 05
Posts: 203
Credit: 275,802
RAC: 0
Message 24235 - Posted: 22 Aug 2006, 4:13:51 UTC

Carrying on from:

https://boinc.bakerlab.org/rosetta/forum_thread.php?id=2162#24232

Whl Wrote :
Here.

Ethan Wrote :
Note the 2nd half of my message :)

I don't think they have time to figure out who your teammate is to resolve the problem.

Please move this conversation to this forum, which is more on topic with your issue:

https://boinc.bakerlab.org/forum_thread.php?id=1891


Sorry Ethan. Had to go do some stuff there. It's nearly 5:15 am here in Scotland.

My teammate is Nite Owl and I don't have all the information on his WU right now. But it probably doesn't matter now anyway, as he moved all of his machines to WCG yesterday.



Ethan
Volunteer moderator
Joined: 22 Aug 05
Posts: 286
Credit: 9,304,700
RAC: 0
Message 24237 - Posted: 22 Aug 2006, 4:30:39 UTC - in response to Message 24235.  

he has moved all of his machines to WCG Yesterday.


Sorry we couldn't have helped sooner. It's good to know another project that uses Rosetta will benefit from the extra work.

Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,818,279
RAC: 888
Message 24238 - Posted: 22 Aug 2006, 5:01:51 UTC

Thanks AMD_is_logical,
What you say has merit, but I don't run only Rosetta on the Linux machines: I also run 4 other projects on one and 5 other projects on the other, and have no trouble with them. I would have to copy all the files for the other projects into the new folder so I can keep working, would I not? Or possibly just rename the folders?
The 2 machines in question are:
AMD Opteron Dual 848 (2 CPUs) with 2 GB RAM, 2 x 250 GB HD, Linux Fedora Core 3
AMD Opteron Dual 275 (2 CPUs) with 4 GB RAM, 2 x 250 GB HD, Linux Fedora Core 3.
Chips are standard and not overclocked.
Would not a corrupted Boinc programme affect the other projects as well?

tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 24262 - Posted: 22 Aug 2006, 11:41:15 UTC - in response to Message 24238.  
Last modified: 22 Aug 2006, 11:42:43 UTC

Thanks AMD_is_logical,
What you say has merit but I don't only run Rosetta on the Linux machines but 4 other projects on one and 5 other projects on the other as well and have no trouble with them. I would have to copy over all the files for the other projects into the new folder so I can keep working, would I not? Possibly just have to rename the folders maybe?
The 2 machines in question are :-
AMD Opteron Dual 848 (2 CPUs) with 2 GB RAM, 2 x 250 GB HD, Linux Fedora Core 3
AMD Opteron Dual 275 (2 CPUs) with 4 GB RAM, 2 x 250 GB HD, Linux Fedora Core 3.
Chips are standard and not overclocked.
Would not a corrupted Boinc programme affect the other projects as well?


Conan, you can try simply upgrading to 5.5.13 without uninstalling your current version:

http://boinc.berkeley.edu/download_all.php?platform=linux&version=5.5.13&type=sea

That would not reset any of your projects or abort current WUs. Alternatively, you could uninstall BOINC and reinstall 5.5.13. I _assume_ that does not reset your projects and WUs either, but as a safety measure you can set all projects to "no new work", crunch your cache empty, uninstall BOINC and then reinstall it.

If you are attached to many projects you might also try out BAM, an account manager that lets you manage your different projects on a single web page:

http://www.boincstats.com/bam/

Btw, I recently had a hanging WU with 0% processor usage on my Windows box as well (first time). It happened after another task had been using 100% of the CPU; when I killed that task via Task Manager, the Rosetta task did not kick in properly. I had to restart BOINC to get the process running again.

Jack Shaftoe
Joined: 30 Apr 06
Posts: 115
Credit: 1,307,916
RAC: 0
Message 24264 - Posted: 22 Aug 2006, 11:46:16 UTC - in response to Message 24237.  

he has moved all of his machines to WCG Yesterday.


Sorry we couldn't have helped sooner. It's good to know another project that uses Rosetta will benefit from the extra work.


Ethan - could you please elaborate on this? Does WCG crunch Rosetta too? If so, cool!
Team Starfire World BOINC

Saenger
Joined: 19 Sep 05
Posts: 271
Credit: 824,883
RAC: 0
Message 24266 - Posted: 22 Aug 2006, 12:37:09 UTC - in response to Message 24264.  

he has moved all of his machines to WCG Yesterday.


Sorry we couldn't have helped sooner. It's good to know another project that uses Rosetta will benefit from the extra work.


Ethan - could you please elaborate on this? Does WCG crunch Rosetta too? If so, cool!

WCG uses the Rosetta algorithm for Human Proteome Folding. Fight AIDS and Cancer use different applications, afaik.

tralala
Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 24267 - Posted: 22 Aug 2006, 12:45:04 UTC - in response to Message 24264.  
Last modified: 22 Aug 2006, 12:45:31 UTC

he has moved all of his machines to WCG Yesterday.


Sorry we couldn't have helped sooner. It's good to know another project that uses Rosetta will benefit from the extra work.


Ethan - could you please elaborate on this? Does WCG crunch Rosetta too? If so, cool!


You may want to read the latest journal entry from David Baker. I quote:

I'm working tonight on a manuscript with my former graduate student Rich Bonneau on some of the results from HPF1 done on the world community grid. We predicted structures for all the proteins in one of the best studied eukaryotic organisms--the yeast used to make bread and beer, and then integrated these predictions with other experimental data to assign 500 proteins of previously unknown structure to protein structural families. After this is done, we will start working on the report on the structures of human proteins also done in HPF1. These efforts used the low resolution version of rosetta (which is all we had several years ago when the HPF project started); I am of course excited about HPF2 which is using the protocol we have been improving on rosetta@home (I sent Rich and the collaborators at IBM the code last March) and should produce much more accurate models.


So yes, they use the Rosetta application as well, although not the latest one. Their goal is different, though: they study a limited set of specific proteins, whereas Rosetta@home tries to improve the overall prediction capabilities of Rosetta, shows what can be achieved in competitions (CASP) and will soon start the HIV research. My understanding is that WCG focuses more on a smooth crunching experience and avoids risks such as updating the application often or running without redundancy (WCG uses a quorum of 3), whereas Rosetta@home does science at the very front in different directions and so takes more risks (WU errors, bugs in new versions, a quorum of 1, etc.).

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 24268 - Posted: 22 Aug 2006, 13:06:20 UTC - in response to Message 24238.  

What you say has merit but I don't only run Rosetta on the Linux machines but 4 other projects on one and 5 other projects on the other as well and have no trouble with them. I would have to copy over all the files for the other projects into the new folder so I can keep working, would I not? Possibly just have to rename the folders maybe?


What I had in mind was a temporary test, just to see whether Rosetta worked when running by itself with a fresh start and the latest recommended BOINC client. If it did, I would then have suggested upgrading the BOINC client in your current BOINC directory. If that failed to help, I would then have suggested suspending the other projects to see whether there was some sort of interaction between the various projects.

Or you could skip the test and just upgrade the BOINC client, as tralala suggests. I was a little hesitant about suggesting changes to your main BOINC directory without some evidence that it would help.

Would not a corrupted Boinc programme affect the other projects as well?


If this bug always showed itself it would have been found long ago, so it must be something subtle. Perhaps it's only seen with a particular version of the BOINC client and a particular version of Rosetta when running on a dual processor machine.

Tino Ruiz
Joined: 12 Oct 05
Posts: 13
Credit: 397,392
RAC: 0
Message 24271 - Posted: 22 Aug 2006, 14:15:12 UTC

Tue 22 Aug 2006 10:13:26 AM AST|rosetta@home|Unrecoverable error for result FRA_t368_CASPR_hom001_7_t368_7_dec146IGNORE_THE_REST_1_1179_407_0 (aborted by user)

Same deal, it keeps getting stuck. I'm on a single core, single processor CPU. :-/

Jack Shaftoe
Joined: 30 Apr 06
Posts: 115
Credit: 1,307,916
RAC: 0
Message 24294 - Posted: 22 Aug 2006, 17:46:51 UTC - in response to Message 24267.  

Thank you both for the info Saenger and Tralala!

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 24312 - Posted: 22 Aug 2006, 21:47:09 UTC - in response to Message 24271.  

Tue 22 Aug 2006 10:13:26 AM AST|rosetta@home|Unrecoverable error for result FRA_t368_CASPR_hom001_7_t368_7_dec146IGNORE_THE_REST_1_1179_407_0 (aborted by user)

Same deal, it keeps getting stuck. I'm on a single core, single processor CPU. :-/


Have you run any diagnostics such as Memtest86 or SuperPi?


Tino Ruiz
Joined: 12 Oct 05
Posts: 13
Credit: 397,392
RAC: 0
Message 24323 - Posted: 23 Aug 2006, 2:25:28 UTC

Sigh...it's not my PC. Look, every project runs fine, I stress my PC 24/7. Yes I've tried diagnostic tools but they always turn out ok. For the past few weeks a lot of people have complained about this "stuck" unit issue, so I *know* I'm not alone. Something is broken in the Linux version for sure.

BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 24325 - Posted: 23 Aug 2006, 3:04:57 UTC

Is it possible to set up a second profile (i.e. Home and Work instead of just default) here on BOINC, and run one of the two Linux machines on 100% Rosetta with WUs set to a 1-hour time limit? Run it for a day to prove that Rosetta is fine on your system as the only app.

Point A. If it passes, add 1 more BOINC project to the mix (2-hour switch, don't leave in memory). Run for a day.
If adding a project fails, turn on "leave in memory" and try again. If it still fails with "leave in memory" on, report your findings.
If adding a project passes, add a couple more BOINC projects to the mix and go back to Point A.


Or add Ralph, and see if Ralph will pass back enough information to track down the problem.
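
The step-by-step test above can be sketched as a small loop. This is only an illustration in Python. `run_trial` is a hypothetical callback standing in for "run this set of projects for a day with this leave-in-memory setting and tell me whether Rosetta stayed healthy", and the sketch adds projects one at a time rather than a couple at once.

```python
from typing import Callable, List, Sequence

# Hypothetical trial: takes (active projects, leave_in_memory) and returns
# True if the day-long run finished without a stuck or crashed WU.
Trial = Callable[[List[str], bool], bool]


def isolate(base: str, others: Sequence[str], run_trial: Trial) -> str:
    """Grow the project mix one project at a time and report where it breaks."""
    active = [base]
    # Step 0: Rosetta as the only app.
    if not run_trial(active, False):
        return f"{base} fails even when run alone"
    # Point A: add one more project and run again.
    for project in others:
        active.append(project)
        if run_trial(active, False):
            continue  # passed, so go back to Point A with one more project
        # Failed: retry the same mix with "leave in memory" switched on.
        if run_trial(active, True):
            return f"adding {project} fails unless 'leave in memory' is on"
        return f"adding {project} fails even with 'leave in memory' on"
    return "no failure reproduced"
```

The returned string is the finding to report back here: which added project first triggers the problem, and whether "leave in memory" makes a difference.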