Minirosetta v1.40 bug thread

Message boards : Number crunching : Minirosetta v1.40 bug thread

David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 57344 - Posted: 29 Nov 2008, 3:19:41 UTC

Very sorry about all the problems; we are working to fix them as fast as possible. One source of the problems is that we are now running a broader range of applications on Rosetta@home, so there are more sources of error. I do apologize for the problems; we have an absolute rule to check all work units first on RALPH, but there are some errors which don't get caught this way. Our top priority now is to find the source of the problems and to fix them.
Chu

Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 57345 - Posted: 29 Nov 2008, 3:19:47 UTC

Hi everyone,

The fix for the NaN hbonding problem will be included in the next update (probably after this weekend), and we are still investigating the lockfile problem and the issue that some WUs cannot be suspended. Sorry for the trouble and inconvenience; we will try our best to keep such problems from happening on such a large scale in future.

Please continue to report other errors and problems that are not mentioned above.
Greg_BE

Joined: 30 May 06
Posts: 5528
Credit: 5,539,771
RAC: 1,212
Message 57347 - Posted: 29 Nov 2008, 8:14:22 UTC

overlapping loop regions error
https://boinc.bakerlab.org/rosetta/result.php?resultid=210296907
cc_0_6_nocst_homo_bench_foldcst_chunk_general_t286__olange_IGNORE_THE_REST_1FXWF_7_4848_20_1
died at 13 secs
<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
recovering checkpoint of tag S_1FXWF_7_00000001 with id abrelax_rg_state
Loops::add_loop error -- overlapping loop regions
existing loop begin/end: 123/182
new loop begin/end: 182/202
ERROR:: Exit from: ..\..\src\protocols\loops\LoopClass.cc line: 233
called boinc_finish

</stderr_txt>
]]>
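
The log above rejects a new loop 182/202 because the existing loop 123/182 already ends on residue 182, so the two share one residue. A minimal Python sketch of that kind of check, assuming loops are inclusive residue ranges; the function name `loops_overlap` is hypothetical and this is not the actual Rosetta `Loops::add_loop` code:

```python
def loops_overlap(existing, new):
    """Return True if two inclusive (begin, end) residue ranges share a residue."""
    a_begin, a_end = existing
    b_begin, b_end = new
    # Two closed intervals intersect when each one starts before the other ends.
    return a_begin <= b_end and b_begin <= a_end


# Values taken from the stderr output above.
print(loops_overlap((123, 182), (182, 202)))  # True: both contain residue 182
# Hypothetical corrected input where the first loop ends one residue earlier.
print(loops_overlap((123, 181), (182, 202)))  # False
```

Under this assumption, shifting either boundary by one residue would be enough to make the two loops disjoint.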
Evan

Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 57348 - Posted: 29 Nov 2008, 9:44:36 UTC

message 4366

This message from James describes, among other matters, some of the problems that are being solved on RALPH:

Conan - those loop boundary errors were input errors by the person who submitted those workunits. The validate errors are the result of a new format added that's not yet supported by the BOINC server, and we'll have to update our server code to deal with it over the weekend. That slow workunit bug looks like something that we fixed several months ago; we've alerted the person who submitted those jobs and he's looking into them.





rochester new york

Joined: 2 Jul 06
Posts: 2838
Credit: 2,020,043
RAC: 0
Message 57353 - Posted: 29 Nov 2008, 14:58:47 UTC

Sid Celery

Joined: 11 Feb 08
Posts: 1852
Credit: 34,079,443
RAC: 6,952
Message 57392 - Posted: 1 Dec 2008, 12:46:37 UTC - in response to Message 57345.  

Hi everyone,

The fix for the NaN hbonding problem will be included in the next update (probably after this weekend), and we are still investigating the lockfile problem and the issue that some WUs cannot be suspended. Sorry for the trouble and inconvenience; we will try our best to keep such problems from happening on such a large scale in future.

Please continue to report other errors and problems that are not mentioned above.

Thanks to you and David for the above comments.

As I'm out of work (just with results to upload) I'll take the opportunity to delete all the lockfiles again, as previously advised, and reset the project. Seems to me like the perfect opportunity.

I suggest others with similar problems do the same.
Evan

Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 57393 - Posted: 1 Dec 2008, 12:55:30 UTC

I'll take the opportunity to delete all the lockfiles again, as previously advised, and reset the project. Seems to me like the perfect opportunity.

Make sure you upload your results before you reset, or you will lose everything.
Alec Rosa

Joined: 11 Nov 08
Posts: 18
Credit: 2,635
RAC: 0
Message 57398 - Posted: 1 Dec 2008, 14:27:27 UTC

So, once again, the lock file thingy!

https://boinc.bakerlab.org/rosetta/result.php?resultid=211319613
Sid Celery

Joined: 11 Feb 08
Posts: 1852
Credit: 34,079,443
RAC: 6,952
Message 57410 - Posted: 1 Dec 2008, 16:58:47 UTC - in response to Message 57393.  
Last modified: 1 Dec 2008, 16:59:26 UTC

I'll take the opportunity to delete all the lockfiles again, as previously advised, and reset the project. Seems to me like the perfect opportunity.

Make sure you upload your results before you reset, or you will lose everything.

Good point. I realised that just in time. I've set Boinc Manager not to get new WUs just yet and waiting for the upload to go through successfully before I reset.

Just noticed only 43k successes in the last 24 hours. Some are obviously going through, but I don't know if there's a bottleneck or a problem receiving them on the Rosetta side. Nothing's going through for me yet.
Mike Tyka

Joined: 20 Oct 05
Posts: 96
Credit: 2,190
RAC: 0
Message 57411 - Posted: 1 Dec 2008, 16:59:23 UTC - in response to Message 57392.  


As I'm out of work (just with results to upload) I'll take the opportunity to delete all the lockfiles again, as previously advised, and reset the project. Seems to me like the perfect opportunity.



*All* the lock files? Where do they accumulate? Do they accumulate after every job, or only after failed ones? This might be a lead towards solving this silly lockfile problem!

Cheers, Mike


http://beautifulproteins.blogspot.com/
http://www.miketyka.com/
Sid Celery

Joined: 11 Feb 08
Posts: 1852
Credit: 34,079,443
RAC: 6,952
Message 57415 - Posted: 1 Dec 2008, 17:14:36 UTC - in response to Message 57411.  

As I'm out of work (just with results to upload) I'll take the opportunity to delete all the lockfiles again, as previously advised, and reset the project. Seems to me like the perfect opportunity.


*All* the lock files? Where do they accumulate? Do they accumulate after every job, or only after failed ones? This might be a lead towards solving this silly lockfile problem!

In fact I spoke too soon before checking. The last time this was mentioned there were numerous 0-byte boinc_lockfiles in C:\ProgramData\BOINC\slots (and folders 1, 2, 3, 4 etc) - under Vista64 btw.

This time the slots folder was empty, so no lockfiles, even though I got many WUs with too many errors after repeated "Can't acquire lockfile" messages. I'd been away from home 11/27 to 11/30
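
For anyone who wants to check their own machine for leftover lockfiles, here is a minimal Python sketch that lists zero-byte files named `boinc_lockfile` under a slots folder. The function name `find_empty_lockfiles` is hypothetical, and the default path in the example is only the Vista location mentioned above; other systems will keep the BOINC data directory elsewhere.

```python
from pathlib import Path


def find_empty_lockfiles(slots_dir, name="boinc_lockfile"):
    """Return sorted paths of zero-byte lockfiles in slots_dir and its subfolders."""
    root = Path(slots_dir)
    if not root.is_dir():
        # Nothing to report if the slots folder does not exist on this machine.
        return []
    return sorted(
        p for p in root.rglob(name)
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    # Assumed location, taken from the post above; adjust for your own setup.
    for path in find_empty_lockfiles(r"C:\ProgramData\BOINC\slots"):
        print(path)
```

The sketch only reports files; it deliberately does not delete anything, since removing a lockfile while a task is running could interfere with that task.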

See my results

Also, note this Validate error here

Server state Over
Outcome Validate error
Client state Done
Exit status 0 (0x0)

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 7200
======================================================
DONE :: 1 starting structures 4470.93 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
Can't set up shared mem: -1
Will run in standalone mode.
# cpu_run_time_pref: 7200
Can't acquire lockfile - exiting

</stderr_txt>
]]>

I don't think I've noticed this particular one before.
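
What stands out in the output above is that more lines appear after "called boinc_finish", as if the application was started a second time after the task had already finished. A minimal Python sketch for spotting this in a saved stderr text follows; the signature table and the function names `classify_stderr` and `output_after_finish` are hypothetical and only cover messages quoted in this thread.

```python
# Failure messages quoted in this thread, keyed by a short hypothetical label.
SIGNATURES = {
    "overlapping_loops": "Loops::add_loop error -- overlapping loop regions",
    "lockfile": "Can't acquire lockfile",
    "shared_mem": "Can't set up shared mem",
}


def classify_stderr(stderr_text):
    """Return the sorted labels of the known signatures found in the text."""
    return sorted(
        label for label, needle in SIGNATURES.items() if needle in stderr_text
    )


def output_after_finish(stderr_text):
    """Return the non-empty lines that follow the first 'called boinc_finish'."""
    lines = [ln.strip() for ln in stderr_text.splitlines() if ln.strip()]
    if "called boinc_finish" not in lines:
        return []
    return lines[lines.index("called boinc_finish") + 1:]


sample = """\
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
called boinc_finish
Can't set up shared mem: -1
Will run in standalone mode.
# cpu_run_time_pref: 7200
Can't acquire lockfile - exiting
"""

print(classify_stderr(sample))      # ['lockfile', 'shared_mem']
print(output_after_finish(sample))  # the four lines logged after the finish call
```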
Alec Rosa

Joined: 11 Nov 08
Posts: 18
Credit: 2,635
RAC: 0
Message 57417 - Posted: 1 Dec 2008, 17:40:44 UTC - in response to Message 57410.  
Last modified: 1 Dec 2008, 17:41:45 UTC

I'll take the opportunity to delete all the lockfiles again, as previously advised, and reset the project. Seems to me like the perfect opportunity.

Make sure you upload your results before you reset, or you will lose everything.

Good point. I realised that just in time. I've set Boinc Manager not to get new WUs just yet and waiting for the upload to go through successfully before I reset.

That was wise. What I've been doing is also setting Boinc Manager not to get new tasks. I then click 'Update', so that the client communicates with the Rosetta server(s). Finally, to avoid doing something wrong, I reboot the computer. That makes the Rosetta slots disappear (and with them the lock file(s)). Like magic.

Of course, when I allow a new WU to be downloaded, the process is fraked up. Again.
robertmiles

Joined: 16 Jun 08
Posts: 1193
Credit: 13,203,213
RAC: 555
Message 57418 - Posted: 1 Dec 2008, 17:46:01 UTC - in response to Message 57410.  

Just noticed only 43k successes in the last 24 hours. Some are obviously going through, but I don't know if there's a bottleneck or a problem receiving them on the Rosetta side. Nothing's going through for me yet.


The upload server hasn't caught up with receiving all the results from the workunits that completed during the recent fileserver problem. If you have enough free disk space to hold the results, and have told BOINC it can use enough of it that Rosetta@home's share will hold a few days' worth of results, all you should really have to do is wait for the upload server to catch up.

P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 57453 - Posted: 2 Dec 2008, 6:43:23 UTC

Hi.

I've got two more of these; they don't want to stop when preempted.


1tig__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1tig_-_4845_1488_0

1c9oA_BOINC_ABRELAX_SPLIT_SPLIT_IGNORE_THE_REST-S25-9-S3-3--1c9oA-_4678_404_1

pete.




ramostol

Joined: 6 Feb 07
Posts: 64
Credit: 584,052
RAC: 0
Message 57460 - Posted: 2 Dec 2008, 9:50:03 UTC

My (MacBook) "abinitio_nohomfrag_70_A_1unpA_4466" tasks show a failure rate of three out of four; all the failures terminate after some hours of computing with the 'finished' file absent.

Cannot link to a result, as I am unable to report to the project at the moment.
[B^S] HenryHunter

Joined: 28 May 08
Posts: 1
Credit: 72,915
RAC: 0
Message 57464 - Posted: 2 Dec 2008, 11:35:41 UTC - in response to Message 56741.  

Please report any bugs in this version here.

Sarel.


02.12.2008 04:58:10|rosetta@home|Message from server: Server error: can't attach shared memory
any solution?
CU
Greg_BE

Joined: 30 May 06
Posts: 5528
Credit: 5,539,771
RAC: 1,212
Message 57465 - Posted: 2 Dec 2008, 11:38:24 UTC - in response to Message 57464.  

Please report any bugs in this version here.

Sarel.


02.12.2008 04:58:10|rosetta@home|Message from server: Server error: can't attach shared memory
any solution?
CU


See here if things do not resolve themselves automatically.
The team created a new server for task processing as the main server was getting overloaded. The address has changed, but it should correct itself automatically. If not, see the link.
mikylinux

Joined: 25 Jul 07
Posts: 3
Credit: 73,155
RAC: 0
Message 57466 - Posted: 2 Dec 2008, 11:46:07 UTC
Last modified: 2 Dec 2008, 11:51:35 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=211014218
https://boinc.bakerlab.org/rosetta/result.php?resultid=208363538
https://boinc.bakerlab.org/rosetta/result.php?resultid=208319555
https://boinc.bakerlab.org/rosetta/result.php?resultid=206052369



And workunits:
https://boinc.bakerlab.org/rosetta/result.php?resultid=209971190
and
https://boinc.bakerlab.org/rosetta/result.php?resultid=210257656

have been running for 37 and 19 hours.... I'll wait a bit and then stop the tasks....
upstatelabs

Joined: 22 Jun 06
Posts: 10
Credit: 516,767
RAC: 0
Message 57469 - Posted: 2 Dec 2008, 13:14:46 UTC

I have a pair of errors to report:

12/1/2008 11:07:42 PM|rosetta@home|Task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1494_0 exited with zero status but no 'finished' file
12/1/2008 11:07:42 PM|rosetta@home|If this happens repeatedly you may need to reset the project.
12/1/2008 11:07:43 PM|rosetta@home|Restarting task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1494_0 using minirosetta version 140
12/1/2008 11:08:24 PM|rosetta@home|Task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1494_0 exited with zero status but no 'finished' file
12/1/2008 11:08:24 PM|rosetta@home|If this happens repeatedly you may need to reset the project.
12/1/2008 11:08:24 PM|rosetta@home|Restarting task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1494_0 using minirosetta version 140
12/1/2008 11:09:05 PM|rosetta@home|Task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1494_0 exited with zero status but no 'finished' file

Above repeating ~50 times.
And this:

12/2/2008 5:19:27 AM|rosetta@home|Task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1471_0 exited with zero status but no 'finished' file
12/2/2008 5:19:27 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
12/2/2008 5:19:27 AM|rosetta@home|Restarting task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1471_0 using minirosetta version 140
12/2/2008 5:20:08 AM|rosetta@home|Task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1471_0 exited with zero status but no 'finished' file
12/2/2008 5:20:08 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
12/2/2008 5:20:08 AM|rosetta@home|Restarting task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1471_0 using minirosetta version 140

Again repeating many times.

Could someone look into this?

Thanks!
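
To put a number on "repeating many times", the restart loops can be counted per task from a saved copy of the message log. A minimal Python sketch, assuming each log line has the `timestamp|project|message` layout shown above; the function name `count_no_finished` is hypothetical:

```python
import re
from collections import Counter

# Matches the client message quoted above and captures the task name.
NO_FINISHED = re.compile(
    r"Task (\S+) exited with zero status but no 'finished' file"
)


def count_no_finished(log_lines):
    """Count 'no finished file' exits per task name in BOINC message log lines."""
    counts = Counter()
    for line in log_lines:
        parts = line.split("|", 2)  # timestamp, project, message
        if len(parts) != 3:
            continue  # skip lines that do not follow the assumed layout
        match = NO_FINISHED.search(parts[2])
        if match:
            counts[match.group(1)] += 1
    return counts


task = ("1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST"
        "-S25-9-S3-3--1vie_-_4845_1494_0")
log = [
    f"12/1/2008 11:07:42 PM|rosetta@home|Task {task} exited with zero status but no 'finished' file",
    "12/1/2008 11:07:42 PM|rosetta@home|If this happens repeatedly you may need to reset the project.",
    f"12/1/2008 11:08:24 PM|rosetta@home|Task {task} exited with zero status but no 'finished' file",
]
for name, n in count_no_finished(log).items():
    print(n, name)
```

The task names that come out of this can then be matched against the result pages when reporting the problem.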

Greg_BE

Joined: 30 May 06
Posts: 5528
Credit: 5,539,771
RAC: 1,212
Message 57471 - Posted: 2 Dec 2008, 14:29:53 UTC - in response to Message 57469.  
Last modified: 2 Dec 2008, 14:31:01 UTC

I have a pair of errors to report:

12/1/2008 11:07:42 PM|rosetta@home|Task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1494_0 exited with zero status but no 'finished' file
12/1/2008 11:07:42 PM|rosetta@home|If this happens repeatedly you may need to reset the project.
12/1/2008 11:07:43 PM|rosetta@home|Restarting task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1494_0 using minirosetta version 140
12/1/2008 11:08:24 PM|rosetta@home|Task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1494_0 exited with zero status but no 'finished' file
12/1/2008 11:08:24 PM|rosetta@home|If this happens repeatedly you may need to reset the project.
12/1/2008 11:08:24 PM|rosetta@home|Restarting task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1494_0 using minirosetta version 140
12/1/2008 11:09:05 PM|rosetta@home|Task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1494_0 exited with zero status but no 'finished' file

Above repeating ~50 times.
And this:

12/2/2008 5:19:27 AM|rosetta@home|Task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1471_0 exited with zero status but no 'finished' file
12/2/2008 5:19:27 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
12/2/2008 5:19:27 AM|rosetta@home|Restarting task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1471_0 using minirosetta version 140
12/2/2008 5:20:08 AM|rosetta@home|Task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1471_0 exited with zero status but no 'finished' file
12/2/2008 5:20:08 AM|rosetta@home|If this happens repeatedly you may need to reset the project.
12/2/2008 5:20:08 AM|rosetta@home|Restarting task 1vie__BOINC_ABRELAX_SPLIT_SPLIT2_NOHATR_IGNORE_THE_REST-S25-9-S3-3--1vie_-_4845_1471_0 using minirosetta version 140

Again repeating many times.

Could someone look into this?

Thanks!


Could you post the links, either in plain text or as a link, so people can look directly at the tasks you're talking about? Because you have two systems on Rosetta, it would take quite a long time to isolate the tasks you are talking about.



©2022 University of Washington
https://www.bakerlab.org