Minirosetta v1.40 bug thread

Message boards : Number crunching : Minirosetta v1.40 bug thread

Alec Rosa

Joined: 11 Nov 08
Posts: 18
Credit: 2,635
RAC: 0
Message 57213 - Posted: 24 Nov 2008, 15:57:28 UTC - in response to Message 57200.  

Read this. It was posted in another new thread by peter leman.

Within that wiki article is the link to "lockfile", and it mentions: Where this becomes problematical is when a process dies (crashes) and the Lock File is never closed. This is usually corrected with a reboot action, but not always.

If you are going to delete it, the lockfile is actually called boinc_lockfile; it is in the boinc folder, then subfolder projects, then subfolder slots.

See if the reboot of boinc helps, and if not then follow the directions in the wiki article.


Thank you!

It worked out, booting the computer. The 'slots' disappeared and, with them, the lock file. Now to see if the error won't happen again after I resume Rosetta's tasks.
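
The lockfile location described above could be checked with a short script. This is a minimal sketch in Python, assuming only that the file is named boinc_lockfile as stated above. The function name find_lockfiles is illustrative, and the directory to search has to be supplied by the user, since the BOINC folder differs between installations. The script only lists files; it deletes nothing.

```python
from pathlib import Path


def find_lockfiles(boinc_dir):
    """Return every file named boinc_lockfile below boinc_dir, sorted by path.

    boinc_dir is whatever folder BOINC keeps its data in on this machine;
    no default location is assumed here.
    """
    root = Path(boinc_dir)
    return sorted(p for p in root.rglob("boinc_lockfile") if p.is_file())


# Example use (the path is a placeholder, not a real default):
# for path in find_lockfiles("/path/to/boinc"):
#     print(path)
```

Listing first and deleting by hand afterwards keeps the risky step manual, which seems sensible while BOINC or a minirosetta process may still be running.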
ID: 57213 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57214 - Posted: 24 Nov 2008, 16:15:46 UTC - in response to Message 57213.  

glad to help, but also thanks to peter leman for creating the original thread with the lockfile topic.
ID: 57214 · Rating: 0
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 57221 - Posted: 24 Nov 2008, 22:30:28 UTC

ID: 57221 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57225 - Posted: 24 Nov 2008, 22:46:03 UTC - in response to Message 57221.  

4 lockfiles <--- see discussion below

208603902
208601704
208601702
208596319

and

1 NAN

208596316

ID: 57225 · Rating: 0
Evan
Joined: 23 Dec 05
Posts: 268
Credit: 402,585
RAC: 0
Message 57228 - Posted: 25 Nov 2008, 9:57:36 UTC

4 lockfiles <--- see discussion below

Yes sorry, I didn't see that until later. Strange that all the errors appear to be on the loopbuild models and yet some of these are not affected. In the past few days I have had BOINC stop completely, which in the past has meant that a model has jammed up the works. Restart BOINC and everything starts working again with no apparent model failure. Strange.
ID: 57228 · Rating: 0
Rob Lilley
Joined: 11 Jan 06
Posts: 11
Credit: 133,120
RAC: 0
Message 57229 - Posted: 25 Nov 2008, 11:12:58 UTC - in response to Message 57212.  
Last modified: 25 Nov 2008, 11:13:47 UTC

This Minirosetta v1.40 Work Unit is another one that won't suspend and continues running when pre-empted by a QMC WU.

Should I abort the Minirosetta Work Unit or suspend all the other projects that play nice until the Minirosetta WU finishes?



Try what I did: exit BOINC Manager and then reopen it. However, I did a complete reboot of the system after that, so I have no idea if just closing and reopening BOINC Manager will solve that problem.


I tried that, and it doesn't seem to work. I am running BOINC as a service, so there's a different method for stopping it anyway, according to a thread I found somewhere on the BOINC message boards, but that doesn't work either. If I do stop both the BOINC Manager and the BOINC Core Client, Minirosetta continues to run. The Windows Task Manager shows the minirosetta task hasn't unloaded, and the CPU usage stays at 100%. I could end the Minirosetta process, but I am reluctant to do that.


Try exiting BOINC; do not delete anything from your folders. Go to Add/Remove Software and uninstall BOINC, then reinstall. Kind of drastic, but that's all I can think of. Maybe someone else has a different idea.



Didn't do that, just suspended all other projects then restarted the computer. Turns out it was probably a bad WU anyway, as it worked for a while then came to a dead stop and wouldn't restart. After I tried restarting BOINC, it then errored out and came up with the lockfile error others are experiencing, as you will see here.

Ah well, some other poor sap will get that nasty WU now :(
ID: 57229 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57230 - Posted: 25 Nov 2008, 11:38:03 UTC

I think the "rule" for loopbuild is to not do anything to it or it crashes and burns badly.
ID: 57230 · Rating: 0
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 57232 - Posted: 25 Nov 2008, 18:51:30 UTC

ID: 57232 · Rating: 0
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 57233 - Posted: 25 Nov 2008, 20:48:03 UTC - in response to Message 57232.  

ID: 57233 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57235 - Posted: 25 Nov 2008, 22:09:43 UTC - in response to Message 57233.  

rochester, have a look at this morning's discussion down below on lockfile issues.
It will save you more errors and loss of credit.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=189333988



ID: 57235 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 57240 - Posted: 26 Nov 2008, 3:30:48 UTC

Hi.

I have another task that doesn't want to stop when preempted; time & percentage are ticking up, and it is currently running.

1lis__BOINC_ABRELAX_SPLIT_SPLIT2_IGNORE_THE_REST-S25-9-S3-3--1lis_-_4768_2176_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=191579804

pete.

ID: 57240 · Rating: 0
Stacey Baird
Joined: 11 Apr 06
Posts: 19
Credit: 74,745
RAC: 0
Message 57247 - Posted: 26 Nov 2008, 13:27:41 UTC

Probable Problem
11/26/2008 9:15:00 PM|rosetta@home|Restarting task 1acf__BOINC_ABRELAX_SPLIT_SPLIT2_IGNORE_THE_REST-S25-9-S3-3--1acf_-_4768_1359_0 using minirosetta version 140

The above is stuck on 00.9:59.00, nine minutes 59 seconds remaining.
CPU time of more than five hours increases but time remaining never decreases.

Should I abort? Hmmm, as I read farther below, others are having the same problem.

Good Luck
ID: 57247 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 653
Credit: 11,840,739
RAC: 274
Message 57249 - Posted: 26 Nov 2008, 15:44:31 UTC

210279108 NAN in HBonding.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 57249 · Rating: 0
FalconFly
Joined: 11 Jan 08
Posts: 23
Credit: 2,163,056
RAC: 0
Message 57250 - Posted: 26 Nov 2008, 16:09:41 UTC - in response to Message 57249.  
Last modified: 26 Nov 2008, 16:12:56 UTC

I'm seeing a significantly above-average number of failures, which result in the shutdown/crash of BOINC (MiniRosetta 1.40).

This happens across all my Linux systems with no determinable pattern (64-bit BOINC V5.10.45) and naturally results in loss of computing power (I need to restart BOINC or, for ease, the whole system).

Otherwise, I repeatedly see above-average numbers of WorkUnits stuck at a certain percentage, with the MiniRosetta task either failed or using 0% CPU power, each effectively blocking a CPU core. This also requires a BOINC restart to get the affected WorkUnits to kick into action again.
ID: 57250 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 57253 - Posted: 26 Nov 2008, 18:41:49 UTC - in response to Message 57250.  

For the team to know what is going on, please post links to your affected work units in your next message.
ID: 57253 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2113
Credit: 41,065,024
RAC: 21,613
Message 57257 - Posted: 26 Nov 2008, 19:36:57 UTC - in response to Message 57200.  

Thanks for highlighting Peter's message on this subject, Greg.

I've closed all apps, ended the MiniRosetta processes, deleted the files and am about to do a re-boot. Fingers crossed. I promise to report back soon.
ID: 57257 · Rating: 0
FalconFly
Joined: 11 Jan 08
Posts: 23
Credit: 2,163,056
RAC: 0
Message 57258 - Posted: 26 Nov 2008, 21:09:02 UTC - in response to Message 57253.  
Last modified: 26 Nov 2008, 21:22:26 UTC

For the team to know what is going on, please post links to your affected work units in your next message.


This is going to be a tedious task, as the WorkUnits (most of them) complete normally after the deadlock is solved.
And after BOINC has crashed, I have no way of telling which WorkUnit may have caused it, since I'm looking at up to 8 WorkUnits per host, which all restart normally when BOINC is relaunched.

For now I'm afraid I'm best off with just solving the deadlocks; I had to do that ~8 times today already.

(The only real solution I see is to run BOINC in debug mode to find out why it is crashing or why the MiniRosetta client is failing, which I'm very hesitant to do on 24 active production systems running 24/7 at full speed - sounds like loads of work :p )

Anyway, for now I haven't seen any such behaviour on my 32bit Win32 Systems so far, only my Linux Systems seem randomly affected.

-- edit --

Oh, forgot :
How does Rosetta react to undervolting of CPUs ?

Most of my systems run with reduced Vcore, tested stable with Prime95 and given a small safety buffer, and have 100% validation on other projects (Einstein, MalariaControl, SETI, LHC).

I'm very careful before I blame anything on a Project Client when I'm not running hardware 100% to its specifications.
ID: 57258 · Rating: 0
Alec Rosa
Joined: 11 Nov 08
Posts: 18
Credit: 2,635
RAC: 0
Message 57259 - Posted: 26 Nov 2008, 21:30:16 UTC - in response to Message 57214.  

Now the update:

The boot was no more than a (short-lived) temporary solution. It all happened again. I believe the error occurs when Rosetta tasks are paused (for the BOINC client to switch to other projects) and, when they start again, it all goes to crap (and this is the technical term).

These were the tasks. I let them be processed till the end:

https://boinc.bakerlab.org/rosetta/result.php?resultid=209770604
https://boinc.bakerlab.org/rosetta/result.php?resultid=209817742

More came by; I aborted them when they started showing the usual (previously quoted) 'you may need to reset the project' message.

The Rosetta project is now suspended again until a solution to this is 'Revealed' to me.
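
The pause/restart failure described here is consistent with a lock file that outlives the process that created it. The sketch below is a generic exclusive-create lock in Python, not BOINC's actual locking code (which may rely on OS-level file locks instead). The names acquire_lock, release_lock, and demo are hypothetical and used only for illustration.

```python
import os
import tempfile


def acquire_lock(path):
    """Try to create the lock file exclusively; return True on success."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # The file is already there, possibly left by a process that died.
        return False
    os.close(fd)
    return True


def release_lock(path):
    """Remove the lock file; a crashed process never reaches this call."""
    os.remove(path)


def demo():
    """Simulate a task that dies while holding the lock (illustration only)."""
    with tempfile.TemporaryDirectory() as slot:
        lock = os.path.join(slot, "boinc_lockfile")
        first = acquire_lock(lock)   # task starts and takes the lock
        # ... the task is killed here without calling release_lock() ...
        second = acquire_lock(lock)  # the restarted task cannot get the lock
        release_lock(lock)           # manual cleanup, like deleting the file
        third = acquire_lock(lock)   # now a restart succeeds
        return first, second, third
```

Under these assumptions demo() returns (True, False, True): the second attempt fails only because the stale file is still present, which is why removing it (or a reboot that clears the slot) can help temporarily without fixing the underlying cause.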
ID: 57259 · Rating: 0
(_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 57261 - Posted: 26 Nov 2008, 21:56:54 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=210070351
https://boinc.bakerlab.org/rosetta/result.php?resultid=210070348
https://boinc.bakerlab.org/rosetta/result.php?resultid=209966564
https://boinc.bakerlab.org/rosetta/result.php?resultid=208462198
https://boinc.bakerlab.org/rosetta/result.php?resultid=209224858
https://boinc.bakerlab.org/rosetta/result.php?resultid=209224828

ID: 57261 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2113
Credit: 41,065,024
RAC: 21,613
Message 57264 - Posted: 26 Nov 2008, 23:16:41 UTC - in response to Message 57257.  

Sorry, no good whatsoever - possibly worse. 1 success, 6 failures. Of the 4 that errored out before I aborted them:


210309372
210290406
Can't acquire lockfile - exiting
Outcome Client error
Client state Compute error
Exit status -226 (0xffffff1e)

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>too many exit(0)s</message>

And

Outcome Client error
Client state Compute error
Exit status 1 (0x1)

<core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>Incorrect function. (0x1) - exit code 1 (0x1)</message>
210317441
<stderr_txt>
# cpu_run_time_pref: 7200
recovering checkpoint of tag S_1VYHA_5_00000001 with id abrelax_rg_state
Loops::add_loop error -- overlapping loop regions
existing loop begin/end: 92/124
new loop begin/end: 124/191
ERROR:: Exit from: ..\..\src\protocols\loops\LoopClass.cc line: 233
called boinc_finish
</stderr_txt>

210318343
<stderr_txt>
recovering checkpoint of tag S_1BE9A_3_00000001 with id abrelax_rg_state
Loops::add_loop error -- overlapping loop regions
existing loop begin/end: 1/20
new loop begin/end: 20/31
ERROR:: Exit from: ..\..\src\protocols\loops\LoopClass.cc line: 233
called boinc_finish
</stderr_txt>

Not sure what those last two were about tbh, but they fell over quick enough.
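
One possible reading of the "overlapping loop regions" messages above: in both cases the new loop begins on the same residue where the existing loop ends (124 and 20). The sketch below shows how an inclusive-endpoint overlap test would flag exactly that. It is an assumption about what such a check might look like, not the code in LoopClass.cc, and loops_overlap is a hypothetical name.

```python
def loops_overlap(existing, new):
    """Inclusive-interval overlap test on (begin, end) residue ranges.

    Assumption: both endpoints belong to the loop, so a shared endpoint
    counts as an overlap.
    """
    (a_begin, a_end), (b_begin, b_end) = existing, new
    return a_begin <= b_end and b_begin <= a_end


print(loops_overlap((92, 124), (124, 191)))  # True: both contain residue 124
print(loops_overlap((1, 20), (20, 31)))      # True: both contain residue 20
print(loops_overlap((92, 123), (124, 191)))  # False: adjacent, nothing shared
```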

Any more ideas, anyone?
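
On the exit statuses quoted above: the decimal and hexadecimal forms are the same number, with the hex value being the signed code viewed as an unsigned 32-bit integer. A small Python check, assuming 32-bit two's complement (as_unsigned32 is an illustrative helper, not part of BOINC):

```python
def as_unsigned32(code):
    """Reinterpret a signed exit status as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


print(hex(as_unsigned32(-226)))  # 0xffffff1e
print(hex(as_unsigned32(1)))     # 0x1
```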
ID: 57264 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org