Minirosetta v1.40 bug thread

Message boards : Number crunching : Minirosetta v1.40 bug thread

A Few Good Men
Joined: 25 Mar 07
Posts: 14
Credit: 2,031,382
RAC: 0
Message 57303 - Posted: 28 Nov 2008, 7:08:48 UTC

Please send email to my account when an alternate to mini 1.40 test is available. Thanks in Advance.

FalconFly
Joined: 11 Jan 08
Posts: 23
Credit: 2,163,056
RAC: 0
Message 57304 - Posted: 28 Nov 2008, 7:34:35 UTC - in response to Message 57282.  
Last modified: 28 Nov 2008, 7:35:29 UTC

FalconFly, i noticed that you are crunching for LHC@home as well.
It might be that LHC@home is causing your crashes. I've had some crashes too this week. Next time it happens check your boinc.log file, the last message there, before SIGSEGV and the stack trace, is probably: [lhcathome] Scheduler request
A few weeks ago this has also been mentioned by several people in the LHC@home message boards.

AdeB


Darn, it seems you could be right on the spot with that. Nice catch!
I haven't seen any anomalies for >24hrs now, as the most recent batch of LHC WorkUnits have been processed.

Given the somewhat shaky state of LHC@Home, I'd say Rosetta is off the hook concerning my recent problems :)
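
A minimal sketch of the log check AdeB describes above, assuming the BOINC log is plain text and that the crash shows up as a line containing the string SIGSEGV. The function name `last_line_before` and the default marker are illustrative choices, not part of BOINC.

```python
from typing import Iterable, Optional


def last_line_before(lines: Iterable[str], marker: str = "SIGSEGV") -> Optional[str]:
    """Return the last non-empty line seen before the first line containing marker.

    Returns None if the marker never appears or nothing precedes it.
    """
    previous = None
    for raw in lines:
        line = raw.rstrip("\n")
        if marker in line:
            return previous
        if line.strip():
            # Remember the most recent non-blank line as the candidate culprit.
            previous = line
    return None
```

To try it, open the log file (its name and location depend on the installation) and pass the file object to `last_line_before`; if the result mentions another project, that project's scheduler request is the last thing logged before the crash.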

sarha1
Joined: 23 Sep 05
Posts: 5
Credit: 6,339,735
RAC: 0
Message 57305 - Posted: 28 Nov 2008, 8:58:05 UTC

Validate error. WTH?
Extremely high claimed credit (100x more than expected).

https://boinc.bakerlab.org/rosetta/result.php?resultid=210214915
https://boinc.bakerlab.org/rosetta/result.php?resultid=210214913

Athlon 64 3200+ 1GB RAM WIN XP prof. SP3

Alec Rosa
Joined: 11 Nov 08
Posts: 18
Credit: 2,635
RAC: 0
Message 57307 - Posted: 28 Nov 2008, 11:24:06 UTC - in response to Message 57303.  

Please send email to my account when an alternate to mini 1.40 test is available. Thanks in Advance.

I second that!

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57308 - Posted: 28 Nov 2008, 12:00:04 UTC - in response to Message 57307.  

Please send email to my account when an alternate to mini 1.40 test is available. Thanks in Advance.

I second that!


i'll stay as my error rate is low, but i have to agree, the team needs to take and revamp all these tasks with stupid errors, such as NaNs and recovering checkpoints and lock file errors along with all the other stupid problems that could be taken care of if they were tested on Ralph properly before being released to here.

the idea of Rosetta is research of proteins and not research of bad programming.

robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,338,560
RAC: 2,014
Message 57309 - Posted: 28 Nov 2008, 13:55:35 UTC

A workunit where my computer completed some models successfully without getting any credit:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=191865519

robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,338,560
RAC: 2,014
Message 57310 - Posted: 28 Nov 2008, 13:59:18 UTC - in response to Message 57307.  

Please send email to my account when an alternate to mini 1.40 test is available. Thanks in Advance.

I second that!


The problems seem to be mainly in workunits that use the new features, so an option to avoid getting any of the workunits using those features would be useful.


rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 57312 - Posted: 28 Nov 2008, 14:14:18 UTC

am i doing something wrong here or what? https://boinc.bakerlab.org/rosetta/results.php?hostid=267483

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57315 - Posted: 28 Nov 2008, 14:51:37 UTC - in response to Message 57312.  

am i doing something wrong here or what? https://boinc.bakerlab.org/rosetta/results.php?hostid=267483


nothing is wrong, other than you need to try out the stuff i pointed out about lockfiles in a previous message to you. if you give that a try it should clear up the problem.

the others, as i pointed out last time, seem to time out (10 days no processing or reporting) due to some unknown reason. too much work, not enough on-time or cpu time being dedicated to rosetta, or just a rash of bad luck.

try solving the lockfile issue and then don't accept any new work until you have completed what you have in queue and when that is done then accept new work and see what results you have.

rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 57317 - Posted: 28 Nov 2008, 15:37:26 UTC - in response to Message 57315.  




i looked and can't find the procedure. what do i do?







Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57320 - Posted: 28 Nov 2008, 16:13:48 UTC - in response to Message 57317.  

go here for my original post and go here for the boinc wiki description. here is where you will find the files you need to remove after you shut all boinc processes down: if you are going to delete it, the lockfile is actually called boinc_lockfile, and it is in the boinc folder, then subfolder projects, and then subfolder slots.
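
A minimal sketch for locating the lockfile before deleting anything, assuming only the file name boinc_lockfile given above. The exact folder layout varies between installations, so this walks the whole BOINC folder rather than a fixed subfolder; the function name `find_lockfiles` is illustrative, and the sketch only lists files, it does not remove them.

```python
import os
from typing import List

LOCKFILE_NAME = "boinc_lockfile"  # file name taken from the post above


def find_lockfiles(boinc_dir: str) -> List[str]:
    """Return the sorted paths of every boinc_lockfile found under boinc_dir."""
    matches = []
    for folder, _subdirs, files in os.walk(boinc_dir):
        if LOCKFILE_NAME in files:
            matches.append(os.path.join(folder, LOCKFILE_NAME))
    return sorted(matches)
```

Call `find_lockfiles` with the path to your own BOINC folder, and only after all BOINC processes are shut down; review the printed paths before removing any of them by hand.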







robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,338,560
RAC: 2,014
Message 57328 - Posted: 28 Nov 2008, 16:59:18 UTC
Last modified: 28 Nov 2008, 16:59:59 UTC

I just suspended BOINC entirely for my weekly antiviral and antispyware checks, then noticed that a rosetta@home workunit was still using CPU time on my computer:

https://boinc.bakerlab.org/rosetta/results.php?userid=264600

I then also suspended the rosetta@home project and that specific task; this didn't stop it from using CPU time. Since this is using only one core of my dual core PC, I'm going to try running the antiviral and antispyware programs as usual, even with that workunit still running.

11/28/2008 8:40:30 AM|rosetta@home|Starting 1shfA_BOINC_ABRELAX_SPLIT_SPLIT_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1shfA-_4844_644_1
11/28/2008 8:40:31 AM|rosetta@home|Starting task 1shfA_BOINC_ABRELAX_SPLIT_SPLIT_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1shfA-_4844_644_1 using minirosetta version 140

Alec Rosa
Joined: 11 Nov 08
Posts: 18
Credit: 2,635
RAC: 0
Message 57330 - Posted: 28 Nov 2008, 17:24:36 UTC - in response to Message 57308.  

Please send email to my account when an alternate to mini 1.40 test is available. Thanks in Advance.

I second that!


i'll stay as my error rate is low, but i have to agree, the team needs to take and revamp all these tasks with stupid errors, such as NaNs and recovering checkpoints and lock file errors along with all the other stupid problems that could be taken care of if they were tested on Ralph properly before being released to here.

the idea of Rosetta is research of proteins and not research of bad programming.

Very well put.

So why do the project developers say nothing about this here?

robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,338,560
RAC: 2,014
Message 57334 - Posted: 28 Nov 2008, 20:25:53 UTC - in response to Message 57330.  



I suspect it's because they're too busy reading all the problem reports.

Do you think it would be enough to move just the workunits using the new features introduced in 1.39 and 1.40 back to Ralph, so they'd still have something for the rest of the participants to do until they fix the new problems?


Alec Rosa
Joined: 11 Nov 08
Posts: 18
Credit: 2,635
RAC: 0
Message 57337 - Posted: 28 Nov 2008, 22:42:06 UTC - in response to Message 57334.  



Wouldn't that be best?

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57338 - Posted: 28 Nov 2008, 23:05:27 UTC - in response to Message 57337.  



I agree with you guys on this.

rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 57341 - Posted: 29 Nov 2008, 1:50:11 UTC - in response to Message 57338.  

this is getting too crazy. i'll give it 2 more days, disconnect, and then i'll be back in a couple of weeks to see if this is back to working




robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,338,560
RAC: 2,014
Message 57342 - Posted: 29 Nov 2008, 2:24:08 UTC - in response to Message 57328.  

I just suspended BOINC entirely for my weekly antiviral and antispyware checks, then noticed that a rosetta@home workunit was still using CPU time on my computer:

https://boinc.bakerlab.org/rosetta/results.php?userid=264600

I then also suspended the rosetta@home project and that specific task; this didn't stop it from using CPU time. Since this is using only one core of my dual core PC, I'm going to try running the antiviral and antispyware programs as usual, even with that workunit still running.

11/28/2008 8:40:30 AM|rosetta@home|Starting 1shfA_BOINC_ABRELAX_SPLIT_SPLIT_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1shfA-_4844_644_1
11/28/2008 8:40:31 AM|rosetta@home|Starting task 1shfA_BOINC_ABRELAX_SPLIT_SPLIT_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1shfA-_4844_644_1 using minirosetta version 140


The Ad-Aware 2008 program apparently ran correctly even with that workunit still running, without taking longer than usual. It found about twice as many cookies as usual, which makes me suspect that I forgot to run it last week. It was unable to remove all these cookies without restarting Vista - something which happens about half the time even when all workunits respond correctly to a suspend - so I let it restart Vista. Since I have to restart BOINC manually every time Vista restarts, I was then able to run the remaining antispyware programs and the antivirus program before restarting BOINC.

What filename should I expect for the cookie from Rosetta@home, so I can tell that program not to delete it?

When that workunit got a CPU core again, it repeated the same problem of continuing to run even after BOINC tries to give another workunit a turn on that CPU core.

I'm going to tell BOINC not to download any more Rosetta@home workunits until I have more time to watch for such behavior.

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 57344 - Posted: 29 Nov 2008, 3:19:41 UTC

Very sorry about all the problems, we are working to fix them as fast as possible. One source of the problems is that we are now running a broader range of applications on rosetta@home so there are more sources of error. I do apologize for the problems; we have an absolute rule to check all work units first on ralph, but there are some errors which don't get caught this way. Our top priority now is to find the source of the problems and to fix them.



©2024 University of Washington
https://www.bakerlab.org