Minirosetta v1.40 bug thread

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 1
Message 57265 - Posted: 26 Nov 2008, 23:46:16 UTC
Last modified: 26 Nov 2008, 23:51:27 UTC

sid, that's pretty odd, as the first 2 tasks have the same output errors as rochester and the others, even with the -226. So something else happened on your system.

Looks like it's my turn again for compute errors.
ID: 57265
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 1
Message 57266 - Posted: 26 Nov 2008, 23:54:03 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=209650572
loopbuild_boinc4_tex_cst_hombench_loopbuild_tex_cst_t326__IGNORE_THE_REST_1ZH8A_6_4790_9_0

https://boinc.bakerlab.org/rosetta/result.php?resultid=209650574
loopbuild_boinc4_tex_cst_hombench_loopbuild_tex_cst_t326__IGNORE_THE_REST_1ZH8A_6_4790_10

2X's - <core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 21600

ERROR: NANs occured in hbonding!
ERROR:: Exit from: ..\..\src\core\scoring\hbonds\hbonds_geom.cc line: 763
called boinc_finish

</stderr_txt>
]]>

2956 seconds for the last one and 6905 seconds for the first.
ID: 57266
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 57271 - Posted: 27 Nov 2008, 6:39:04 UTC
Last modified: 27 Nov 2008, 6:39:48 UTC

Hi.

Here's another, after 3hrs, 38mins.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=191817904

1g73A_ZNMP_ABRELAX_tetraL_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1g73A-_4652_57735_0

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>

ERROR: NANs occured in hbonding!
ERROR:: Exit from: src/core/scoring/hbonds/hbonds_geom.cc line: 763
called boinc_finish

pete.
ID: 57271
HA-SOFT, s.r.o.
Joined: 27 Jan 07
Posts: 10
Credit: 94,518,643
RAC: 0
Message 57272 - Posted: 27 Nov 2008, 7:27:11 UTC

I have a problem on Windows Server 2008 64-bit, where all Minirosetta tasks hang at 0.00 progress. Rosetta beta works OK. BOINC 6.2.19.

Zdenek
ID: 57272
BF
Joined: 1 Dec 05
Posts: 1
Credit: 3,854,531
RAC: 0
Message 57274 - Posted: 27 Nov 2008, 10:00:02 UTC

I have the same problem. Rosetta beta works well, but Rosetta mini gives a compute error within seconds.
Most of the time, I get an access violation:
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00030003

Engaging BOINC Windows Runtime Debugger...


(I can provide the complete file if needed).
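
For reference, the negative exit code and the hex value in that message are the same 32-bit number: the code is printed as a signed integer, while 0xc0000005 is its unsigned form, which the debugger output above labels as an access violation. A minimal Python sketch of the conversion (the function name is only for illustration):

```python
def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{code & 0xFFFFFFFF:08x}"

print(exit_code_to_hex(-1073741819))  # 0xc0000005
print(exit_code_to_hex(1))            # 0x00000001
```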


This computer has WinXP SP2 and a Core 2 Duo processor (E6600).

Another PC with the same configuration but with a Pentium 4 works well.

BF
ID: 57274
David Ball
Joined: 25 Nov 05
Posts: 25
Credit: 1,439,333
RAC: 0
Message 57275 - Posted: 27 Nov 2008, 11:21:42 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=210193423

2vik__BOINC_ABRELAX_SPLIT_SPLIT2_IGNORE_THE_REST-S25-9-S3-3--2vik_-_4768_1689_1

Vista home premium 64 bit system with 5 GB of ram. C2 Quad Q6600. Only running BOINC. 2 rosetta tasks were running along with 2 tasks from other projects. Lots of free memory and disk space. BOINC is set to leave tasks in memory. BOINC is not used as a screensaver. BOINC client version is 6.2.19.

The WU above was running but the CPU time (3 hours 50 minutes 2 seconds) and percent complete (about 69%) weren't increasing. I checked with task manager and it WAS using 25% cpu (1 of 4 cores in a C2Q Q6600). I suspended the WU and the status in the BOINC manager changed from running to waiting to run. However, windows task manager showed that it was still running. I had another rosetta task running so I suspended the second WU as well to make sure I had the right one. The second rosetta WU stopped using CPU when it was suspended but remained in memory as it should. BOINC manager now showed NO rosetta tasks running, but windows task manager showed the problem WU was still using all the cpu time it could get. I killed it in task manager and aborted the WU. When looking at the result, I found that I was the second person to get the WU and it had died on the other computer after about 3 minutes.

IIRC, the WU was on the 5th model when this happened.

Hope this helps.
Have you read a good Science Fiction book lately?
ID: 57275
AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,428,086
RAC: 0
Message 57282 - Posted: 27 Nov 2008, 14:01:23 UTC - in response to Message 57258.  

for the team to know what is going on, please post your affected work units links in your next message.


This is going to be a tedious task, as the WorkUnits (most of them) complete normally after the deadlock is solved.
And after BOINC has crashed, I have no way of telling which WorkUnit may have caused it, since I'm looking at up to 8 WorkUnits per Host which will all restart normally when re-launching BOINC.

For now I'm afraid I'm best off with just solving the deadlocks, had to do that ~8 times today already.

(the only real solution I'd see is to run BOINC in debug mode to get behind it crashing or the MiniRosetta Client failing, which I'm very hesitant to do on 24 active production Systems running 24/7 at full speed - sounds like loads of work :p )

Anyway, for now I haven't seen any such behaviour on my 32bit Win32 Systems so far, only my Linux Systems seem randomly affected.

-- edit --

Oh, forgot :
How does Rosetta react to undervolting of CPUs ?

Most of my Systems run with reduced Vcore tested stable with Prime95, given a small safety buffer and have 100% validation on other Projects (Einstein, MalariaControl, SETI, LHC).

I'm very careful before I blame anything on a Project Client when I'm not running hardware 100% to its specifications.


FalconFly, I noticed that you are crunching for LHC@home as well.
It might be that LHC@home is causing your crashes. I've had some crashes too this week. Next time it happens, check your boinc.log file; the last message there, before SIGSEGV and the stack trace, is probably: [lhcathome] Scheduler request
A few weeks ago this was also mentioned by several people on the LHC@home message boards.
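
A rough way to do that check, sketched in Python. It assumes the log is plain text with one message per line and that the crash shows up as a line containing SIGSEGV. The function name and the sample lines are made up for illustration and are not actual BOINC output:

```python
from typing import Iterable, Optional

def last_line_before_sigsegv(lines: Iterable[str]) -> Optional[str]:
    """Return the last non-empty line seen before the first line mentioning SIGSEGV."""
    previous = None
    for line in lines:
        text = line.strip()
        if "SIGSEGV" in text:
            return previous
        if text:
            previous = text
    return None  # no SIGSEGV found in the log

# Hypothetical sample; real log lines will look different.
sample = [
    "[rosetta] Starting task",
    "[lhcathome] Scheduler request",
    "SIGSEGV: segmentation violation",
    "Stack trace (3 frames):",
]
print(last_line_before_sigsegv(sample))  # [lhcathome] Scheduler request
```

On a real host the same function could be fed an open file, for example `with open("boinc.log") as f: print(last_line_before_sigsegv(f))`.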

AdeB
ID: 57282
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 57284 - Posted: 27 Nov 2008, 15:37:13 UTC

ID: 57284
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 1
Message 57287 - Posted: 27 Nov 2008, 17:35:30 UTC - in response to Message 57284.  

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=189759587


This is not really an issue with the task; rather, the 10-day deadline to crunch the task and report back has expired. You may have too much work on your system, or it is not active enough to complete the work assigned to it. I see no CPU time on this task, so it appears it never got crunched to begin with. There are no error codes either.

This was also the case of another task you reported earlier. It never got crunched in 10 days.
ID: 57287
(_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 57289 - Posted: 27 Nov 2008, 20:49:36 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=208906339

ID: 57289
Rifleman
Joined: 19 Nov 08
Posts: 17
Credit: 139,408
RAC: 0
Message 57290 - Posted: 27 Nov 2008, 22:07:00 UTC

Can someone take a quick look at my results and see if they know why I am getting massive numbers of errors and wasted time? The ones I terminated myself were still running in task manager after restarting BOINC, so I'd end up with 8 WUs vying for CPU time while only 4 showed in BOINC.
Here is my results page and thanks. https://boinc.bakerlab.org/rosetta/results.php?userid=288725
ID: 57290
Mike.Gibson
Joined: 3 Nov 07
Posts: 19
Credit: 311,844
RAC: 0
Message 57291 - Posted: 27 Nov 2008, 22:39:50 UTC

I am using a dual-core 3800+ with Vista Premium and Boinc 6.2.19.

If I have a mini 1.40 & Beta 5.98 running and suspend the project, both tasks are shown as suspended by user. However, the mini 1.40 keeps on running, albeit slowly.

Two other tasks start to run, one at normal speed and the other slowly.

Obviously, one of the new tasks is running on its own in one core and the other new task is sharing the second core with mini 1.40.

I have never seen core sharing before. Is this OK, or is this a problem? None of my other projects show any signs of this phenomenon.

Any ideas?
ID: 57291
Rifleman
Joined: 19 Nov 08
Posts: 17
Credit: 139,408
RAC: 0
Message 57292 - Posted: 27 Nov 2008, 23:15:33 UTC

Check your processes running in task manager by pressing control, alt, delete. Do you show more than the normal number of tasks running?
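
The same check can be scripted instead of eyeballing Task Manager. The Python sketch below counts matching rows in the CSV output of the Windows `tasklist /FO CSV /NH` command. The `minirosetta` name prefix and the sample rows are assumptions for illustration, so adjust the prefix to whatever executable name actually shows up on your system:

```python
import csv
import io

def count_processes(tasklist_csv: str, name_prefix: str = "minirosetta") -> int:
    """Count rows of `tasklist /FO CSV /NH` output whose image name starts with name_prefix."""
    rows = csv.reader(io.StringIO(tasklist_csv))
    return sum(1 for row in rows if row and row[0].lower().startswith(name_prefix))

# Hypothetical output; the executable names are assumed, not copied from a real system.
sample = (
    '"boinc.exe","1200","Console","1","12,340 K"\n'
    '"minirosetta_1.40_windows_intelx86.exe","2312","Console","1","201,556 K"\n'
    '"minirosetta_1.40_windows_intelx86.exe","2316","Console","1","198,004 K"\n'
)
print(count_processes(sample))  # 2
```

On a Windows machine the input text could come from `subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True).stdout`. If the count is higher than the number of tasks BOINC shows as running, the extra processes are probably leftovers.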
ID: 57292
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 1
Message 57293 - Posted: 27 Nov 2008, 23:45:35 UTC - in response to Message 57290.  
Last modified: 27 Nov 2008, 23:46:46 UTC

Can someone take a quick look at my results and see if they know why I am getting massive numbers of errors and wasted time? The ones I terminated myself were still running in task manager after restarting BOINC, so I'd end up with 8 WUs vying for CPU time while only 4 showed in BOINC.
Here is my results page and thanks. https://boinc.bakerlab.org/rosetta/results.php?userid=288725


The link to your results page is not correct; that is your own internal link, I think. Here is the public one: https://boinc.bakerlab.org/rosetta/results.php?hostid=948562

Read this message and then go up the board a bit and see what others did when it comes to lockfile issues.

I see that other results have NaN issues. No one has explained what this is, or whether they are working on a fix for it.
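
For what it's worth, NaN stands for "not a number": it is the floating-point result of an undefined operation such as infinity minus infinity, and once it appears it propagates through every later calculation. Why it shows up in the hbond scoring is not explained anywhere in this thread. The Python sketch below only illustrates the general behaviour and the kind of check that presumably produces the error message; the `checked_energy` helper is hypothetical and not taken from the Rosetta source:

```python
import math

inf = float("inf")
nan = inf - inf             # an undefined operation yields NaN

print(math.isnan(nan))      # True
print(nan == nan)           # False: NaN never compares equal, even to itself
print(math.isnan(nan + 1))  # True: NaN propagates through arithmetic

def checked_energy(value: float) -> float:
    """Hypothetical guard: refuse to continue once a score has become NaN."""
    if math.isnan(value):
        raise ValueError("NaN encountered in score")
    return value
```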

Your results that errored out for reasons other than NaN or lockfile issues are possibly due to an out-of-date graphics card device driver. Read this message for an explanation.

Hope this helps get you back on track.
ID: 57293
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 1
Message 57294 - Posted: 27 Nov 2008, 23:50:25 UTC - in response to Message 57291.  

I am using a dual-core 3800+ with Vista Premium and Boinc 6.2.19.

If I have a mini 1.40 & Beta 5.98 running and suspend the project, both tasks are shown as suspended by user. However, the mini 1.40 keeps on running, albeit slowly.

Two other tasks start to run, one at normal speed and the other slowly.

Obviously, one of the new tasks is running on its own in one core and the other new task is sharing the second core with mini 1.40.

I have never seen core sharing before. Is this OK, or is this a problem? None of my other projects show any signs of this phenomenon.

Any ideas?



There have been a few problems, which I and others have experienced, with 1.4 tasks not suspending, mostly in the loopbuild tasks. I have found that you just have to exit BOINC and restart it. You may also have to reboot your system, but that is probably a last-ditch measure. After one or both of these steps, BOINC Manager will act properly again.
ID: 57294
Mike.Gibson
Joined: 3 Nov 07
Posts: 19
Credit: 311,844
RAC: 0
Message 57295 - Posted: 28 Nov 2008, 1:40:49 UTC - in response to Message 57292.  

Check your processes running in task manager by pressing control, alt, delete. Do you show more than the normal number of tasks running?


Already checked - all 3 registered at variable amounts around 44%, 22% & 22%.
ID: 57295
Mike.Gibson
Joined: 3 Nov 07
Posts: 19
Credit: 311,844
RAC: 0
Message 57296 - Posted: 28 Nov 2008, 1:45:37 UTC - in response to Message 57294.  

I am using a dual-core 3800+ with Vista Premium and Boinc 6.2.19.

If I have a mini 1.40 & Beta 5.98 running and suspend the project, both tasks are shown as suspended by user. However, the mini 1.40 keeps on running, albeit slowly.

Two other tasks start to run, one at normal speed and the other slowly.

Obviously, one of the new tasks is running on its own in one core and the other new task is sharing the second core with mini 1.40.

I have never seen core sharing before. Is this OK, or is this a problem? None of my other projects show any signs of this phenomenon.

Any ideas?



There have been a few problems, which I and others have experienced, with 1.4 tasks not suspending, mostly in the loopbuild tasks. I have found that you just have to exit BOINC and restart it. You may also have to reboot your system, but that is probably a last-ditch measure. After one or both of these steps, BOINC Manager will act properly again.



I have tried all sorts of combinations, including reboots, but it recurs the next time. It seems to happen with either suspending the project or suspending the task. However, suspending both can clear the problem until the next time.
ID: 57296
(_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 57300 - Posted: 28 Nov 2008, 5:56:42 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=210335636

ERROR: NANs occured in hbonding!
ERROR:: Exit from: ..\..\src\core\scoring\hbonds\hbonds_geom.cc line: 763
called boinc_finish

CPU time 39732.38 ((((((
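
For scale, here is that CPU time converted to hours, minutes and seconds, as a quick Python sketch (the helper name is illustrative). The second call uses 21600, the cpu_run_time_pref value seen in a stderr excerpt earlier in the thread:

```python
def format_cpu_time(seconds: float) -> str:
    """Convert a CPU time in seconds to H:MM:SS, dropping the fractional part."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"

print(format_cpu_time(39732.38))  # 11:02:12
print(format_cpu_time(21600))     # 6:00:00
```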
ID: 57300
Alec Rosa
Joined: 11 Nov 08
Posts: 18
Credit: 2,635
RAC: 0
Message 57301 - Posted: 28 Nov 2008, 6:50:27 UTC

People said it -- I said it -- I insist:

Rosetta mini should be worked as a Beta project. It seems SO obvious!

We want to crunch Rosetta again. Start sending Rosetta Beta 5.xx WU again, please!
ID: 57301
rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 57302 - Posted: 28 Nov 2008, 6:58:34 UTC - in response to Message 57301.  


https://boinc.bakerlab.org/rosetta/results.php?hostid=267483



People said it -- I said it -- I insist:

Rosetta mini should be worked as a Beta project. It seems SO obvious!

We want to crunch Rosetta again. Start sending Rosetta Beta 5.xx WU again, please!

ID: 57302