Report problems with Rosetta version 5.36

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 30824 - Posted: 8 Nov 2006, 22:14:25 UTC
Last modified: 8 Nov 2006, 22:32:12 UTC

These two EDIT three computers A + B + C have produced these seven EDIT ten hung tasks in the last few days, the majority today: A1 + A2 + A3 + B1 + B2 + B3 + B4 + C1 + C2 + C3

Interestingly, it is only these two EDIT three hosts out of the 10 currently running Rosetta, but I have come to expect a yellow stripe across my BoincView display showing me that these boxes have stopped again.

The most recent two to stop, A3 and B4, were both on the same protein, but that may well be coincidence as the other 5 are not for that protein. It was weird to see two tasks failing at the same time and with the same protein under investigation.

This is not a screensaver issue as none of my BOINC clients run graphics.

The boxes A and B are two of my three slowest boxes, but interestingly these two boxes have 368Mb RAM, whereas the other that is equally slow has only 256Mb and has not had this issue (yet).

I had wondered if all the failed tasks are in the larger than 256Mb category - EDIT: until I spotted the same problem had occurred on box C, which has a faster (ahem, not quite so slow) cpu but only 256Mb RAM, so it does not seem to be a memory issue either.

R~~

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 30825 - Posted: 8 Nov 2006, 22:32:30 UTC - in response to Message 30824.  

[quote]I had wondered if all the failed tasks are in the larger than 256Mb category[/quote]


Yes, I wonder. What exactly is the requirement on the "large" WUs? >256MB? Or >256MB per core? or is it <= 512MB?


Rosetta Moderator: Mod.Sense

Conan
Joined: 11 Oct 05
Posts: 150
Credit: 4,124,428
RAC: 2,489
Message 30828 - Posted: 9 Nov 2006, 0:23:23 UTC

>> Result https://boinc.bakerlab.org/rosetta/result.php?resultid=45844369
has a validate error; it got stuck and so was killed by the watchdog.
The preference time was 21600 but it was killed at 26708.58.
The cc_config.xml did not trap anything and I wonder if it even works on my version 5.2.13?
This is only the second workunit run since turning the screensaver back on and it has failed.
I have double checked the syntax of 'cc_config.xml' and it appears correct as per FluffyChicken's and the BOINC site's instructions. Will continue to monitor.
Thanks for the reply Rhiju, haven't given up yet.
I have not had any Ralph work units on this machine for a few days so can't check if still a problem or not with Ralph as well.
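
For anyone wanting to rule out a syntax slip in cc_config.xml, here is a minimal sketch in Python. It only checks that the text is well-formed XML with the cc_config / log_flags layout posted in this thread; it says nothing about whether a 5.2.13 client actually reads the file. The function name enabled_log_flags is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The screensaver-logging config as posted in this thread.
CC_CONFIG = """<cc_config>
  <log_flags>
    <scrsave_debug>1</scrsave_debug>
  </log_flags>
</cc_config>"""


def enabled_log_flags(xml_text):
    """Return the set of log flags set to 1; raise ValueError if malformed.

    Hypothetical helper: it checks syntax only, not client behaviour.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        raise ValueError(f"not well-formed XML: {exc}") from exc
    if root.tag != "cc_config":
        raise ValueError(f"unexpected root element <{root.tag}>")
    flags = root.find("log_flags")
    if flags is None:
        return set()
    return {el.tag for el in flags if (el.text or "").strip() == "1"}


if __name__ == "__main__":
    print(enabled_log_flags(CC_CONFIG))
```

If this raises ValueError on your own file, the problem is the syntax; if it passes and nothing is logged, the client version is the more likely suspect.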

genes
Joined: 8 Oct 05
Posts: 60
Credit: 652,097
RAC: 417
Message 30829 - Posted: 9 Nov 2006, 0:45:12 UTC - in response to Message 30806.  
Last modified: 9 Nov 2006, 0:45:54 UTC


[quote]Don't do the GUIRPC one, otherwise you'll have a never-ending list updated very, very fast.
I would stick with the screensaver one.

Just in case, the cc_config.xml file for the screensaver should be
<cc_config>
  <log_flags>
    <scrsave_debug>1</scrsave_debug>
  </log_flags>
</cc_config>

Also attach to Ralph@Home as we are up to R@H 5.40 there trying to fix some error code. http://ralph.bakerlab.org

You will know if the logging is working as it'll be logging from the beginning
08/11/2006 15:22:51||[scrsave_debug] ACTIVE_TASK::check_graphics_mode_ack(): got graphics ack <mode_hide_graphics/> for 1dcj__ETABLE_TEST_ABRELAX_rhh13sm6__1470_193_0, previous mode <mode_unsupported/>

Also an example of mem_usage_debug, updated every 10 seconds, though
08/11/2006 15:22:59|ralph@home|[mem_usage_debug] 1dcj__ETABLE_TEST_ABRELAX_rhh13sm6__1470_193_0: RAM 28.30MB, page 54.39MB, 710.91 page faults/sec, user CPU 8.903, kernel CPU 0.110[/quote]

Thanks FluffyChicken, I have set this up, currently I have a Ralph 5.40 WU so we'll see how it goes.


Mike Gelvin
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 30837 - Posted: 9 Nov 2006, 5:13:16 UTC

I have a 5.36 work unit on a remote system that appears to have hung for 8 days with no additional CPU time past the first 2 hours 31 minutes. This system does not run a screensaver, and has BOINC installed as a service.

https://boinc.bakerlab.org/rosetta/result.php?resultid=44919383

Why hasn't the watchdog killed this work unit?


Conan
Joined: 11 Oct 05
Posts: 150
Credit: 4,124,428
RAC: 2,489
Message 30840 - Posted: 9 Nov 2006, 8:11:16 UTC

> This workunit failed with debug data
https://boinc.bakerlab.org/rosetta/result.php?resultid=46039396

I don't think this one is related to the screensaver as the debug data does not mention it?

Conan
Joined: 11 Oct 05
Posts: 150
Credit: 4,124,428
RAC: 2,489
Message 30841 - Posted: 9 Nov 2006, 9:31:20 UTC - in response to Message 30840.  

[quote]> This workunit failed with debug data
https://boinc.bakerlab.org/rosetta/result.php?resultid=46039396

I don't think this one is related to the screensaver as the debug data does not mention it?[/quote]


Also this one
https://boinc.bakerlab.org/rosetta/result.php?resultid=46133984

CPU time 4755.046875
stderr out

<core_client_version>5.2.13</core_client_version>
<message>The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# random seed: 3450426
# cpu_run_time_pref: 21600

This one did stop with the screensaver on. The screen was not updating and the processor had dropped to idle; I did not get any debug information.
3 workunits processed so far today on this machine and 2 have failed. Without the BOINC screensaver I had no failures; will keep debugging.

FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 30843 - Posted: 9 Nov 2006, 12:38:54 UTC

I think they used a different cc_config type setup for the older clients.

You should be using 5.4.9/11 anyway if you are having problems with earlier configurations. They fixed some screensaver code among other parts.

Though I do not know how well the logging works in 5.4.9/11 as none of my setups use it (that I know of ;-)). All have 5.6.4/5 or 5.7.2 installed, where the logging works well.
Unfortunately the BOINC developers have a habit of updating the website to reflect the current test versions of the client, so if the logging gets altered (I do remember a change in file name/convention some time back, but when I don't know), older client users are stuffed! But then they don't continue to develop the client for nothing ;-)

Side note: according to BOINCStats, 5.4.9/11 are the most commonly used clients.
Team mauisun.org

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 30864 - Posted: 9 Nov 2006, 18:07:09 UTC
Last modified: 9 Nov 2006, 18:07:45 UTC

45723693 & 45755534 were both running on my hyperthreaded machine. BOINC seemed to lose contact with the running threads (title bar did not show localhost, tasks tab empty), retry communications failed. Exited and restarted BOINC. Both WUs ended prematurely.

24hr time preference. But they only ran for 13 and 10.5 hrs.
Both show No heartbeat from core client for 31 sec - exiting.
I'm running BOINC 5.4.9 on Windows.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

scsimodo
Joined: 17 Sep 05
Posts: 93
Credit: 946,359
RAC: 0
Message 30867 - Posted: 9 Nov 2006, 20:02:06 UTC
Last modified: 9 Nov 2006, 20:02:57 UTC

Had a few crashed WUs the last days:

Result
Result
Result

This is my Host.


scsimodo
Joined: 17 Sep 05
Posts: 93
Credit: 946,359
RAC: 0
Message 30868 - Posted: 9 Nov 2006, 20:35:32 UTC

Next One:

Result

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 30869 - Posted: 9 Nov 2006, 20:49:00 UTC - in response to Message 30837.  
Last modified: 9 Nov 2006, 20:56:42 UTC

[quote]I have a 5.36 work unit on a remote system that appears to have hung for 8 days with no additional CPU time past the first 2 hours 31 minutes. This system does not run a screensaver, and has BOINC installed as a service.

https://boinc.bakerlab.org/rosetta/result.php?resultid=44919383

Why hasn't the watchdog killed this work unit?[/quote]


The watchdog can't kill an app that has already died for some other reason.

We call these "stopped clock" errors, or "cpu frozen", etc. What has really happened is that the app has gone to meet its maker but has been nailed to its perch by the client which has not noticed its death, early demise, etc. Perhaps we should call this the Norwegian Blue app ;-)

Here you have two bugs at once. First, whatever bug in the app caused the access violation that made Windows stop the app running. You can see that something did this if you look at your result on the website, now it has been reported.

Second, the client does not notice when one of its daughter processes has exited. This counts as a bug in the client, imo, and is not down to Rosetta but to the BOINC people to sort out.

Failure to accrue cpu time is possible for the app if a user task grabs 100% cpu for a prolonged time, so the client cannot assume from a stopped clock that there is definitely something wrong. It should however, imho, ask the operating system at this point if the app is still alive.

In the meantime, we need to intervene when we notice this. A clock that is stopped for more than a couple of minutes is usually a sign that the task has ended abruptly. Suspend/resume (of the task, not of the project) is usually enough to get the app to either pick up again, or more likely to begin to upload (whether as an error or success).

If it freezes a second time round, give it at least two minutes to get going again; after that, suspending, aborting and resuming the task should force it to go into upload as an error.

Rosetta usually grants credit for errored results in both these cases, for the work the app did before it froze. What we don't get is credit for the time the box was idle -- but getting anything at all for an errored app is one up on the other projects.
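
To make the "ask the operating system" idea concrete, here is a small Python sketch of the check described above. It is not how the BOINC client works, just an illustration: Sample, classify and the is_alive callback are all invented names, and the 120 second limit is simply the "couple of minutes" rule of thumb from this post.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    wall: float  # wall-clock seconds when the sample was taken
    cpu: float   # CPU seconds the task had accrued at that moment


def classify(samples, is_alive, stall_limit=120.0):
    """Classify a task as 'running', 'starved' or 'dead'.

    samples  -- list of Sample, oldest first
    is_alive -- hypothetical callback that asks the OS if the process exists
    """
    if len(samples) < 2:
        return "running"
    last = samples[-1]
    # Walk back to find since when the CPU clock has not advanced.
    stalled_since = last.wall
    for s in reversed(samples[:-1]):
        if s.cpu < last.cpu:
            break
        stalled_since = s.wall
    if last.wall - stalled_since < stall_limit:
        return "running"
    # A stopped clock alone proves nothing: another program may be hogging
    # the CPU. Only the liveness check separates the two cases.
    return "starved" if is_alive() else "dead"


if __name__ == "__main__":
    frozen = [Sample(0, 100.0), Sample(60, 100.0), Sample(180, 100.0)]
    print(classify(frozen, is_alive=lambda: False))
```

The point of passing is_alive in as a callback is that the stopped clock and the liveness question stay separate, which is exactly the distinction the client fails to make.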

R~~

Conan
Joined: 11 Oct 05
Posts: 150
Credit: 4,124,428
RAC: 2,489
Message 30885 - Posted: 10 Nov 2006, 5:08:33 UTC

>> Ok, all the following workunits have failed with the screensaver on and in fact were on the screen when I noticed a couple of them had failed. The others had gone into a 'not responding' mode according to Task Manager and the processor had dropped to idle.
Only about 2 workunits have worked since I turned the screensaver back on.

https://boinc.bakerlab.org/rosetta/result.php?resultid=46243254
https://boinc.bakerlab.org/rosetta/result.php?resultid=46243277
these 2 had error code "exit code 1073807364"

https://boinc.bakerlab.org/rosetta/result.php?resultid=46243290
https://boinc.bakerlab.org/rosetta/result.php?resultid=46243291
https://boinc.bakerlab.org/rosetta/result.php?resultid=46243301 (failed at 34 sec)
https://boinc.bakerlab.org/rosetta/result.php?resultid=46243303 (fail at 610 sec)
https://boinc.bakerlab.org/rosetta/result.php?resultid=46243310

These last 5 all errored out as "stuck" or running too long and were killed by the watchdog, or had access violations.

> I have now updated to client version 5.4.11 and will leave the screensaver on to see if I can trap some debug information or see if the problem happens any more.
I can not confirm if Ralph is having the same problem as I have had no workunits for a few days now.
Will keep you posted.

Jim
Joined: 15 Oct 06
Posts: 22
Credit: 5,410,546
RAC: 0
Message 30886 - Posted: 10 Nov 2006, 5:40:11 UTC

I am also running version 5.36 with client version 5.4.11 on an AMD 3000+ and Windows XP.
I opened the Show Graphics window and the machine locked up when I went to close the graphics window. I had to exit it using Task Manager.

The workunit ended at that point with the exit code of 1073807364 (0x40010004).
<core_client_version>5.4.11</core_client_version>
<message>
- exit code 1073807364 (0x40010004)
</message>
<stderr_txt>
# random seed: 3382412
# cpu_run_time_pref: 10800

</stderr_txt>
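
The decimal and hex forms of the exit code are the same number, which is easy to confirm. A quick Python sketch is below; the symbolic name in the table is my recollection of the Windows status-code headers and is not something the BOINC output states, so treat it as an assumption. describe_exit_code is a made-up helper.

```python
# Name recalled from the Windows status-code headers; an assumption,
# not something reported by BOINC or Rosetta.
KNOWN_CODES = {
    0x40010004: "DBG_TERMINATE_PROCESS",
}


def describe_exit_code(code):
    """Format an exit code as 'decimal (0xHEX) NAME'."""
    code &= 0xFFFFFFFF  # fold negative signed values into unsigned 32-bit
    name = KNOWN_CODES.get(code, "unknown")
    return f"{code} (0x{code:08X}) {name}"


if __name__ == "__main__":
    print(describe_exit_code(1073807364))
```

If that name is right, it would fit a process that was ended from outside (here via Task Manager) rather than one that crashed on its own.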

Before I opened the Show Graphics window everything appeared to be processing normally.

Jim


RuDiablo
Joined: 11 Nov 05
Posts: 2
Credit: 463,636
RAC: 0
Message 30889 - Posted: 10 Nov 2006, 7:59:50 UTC
Last modified: 10 Nov 2006, 8:00:47 UTC

My teammate has a stdout.txt of ~30MB.
The file contains lines like:
DANGER:: 0-overlap chainbreak score does not match the derivative!!!!!!!!!!!!!!
Talk to Phil about fixing this.

or

DANGER:: AI chainbreak score does not match the derivative!!!!!!!!!!!!!!
Talk to Phil about fixing this.

Who is Phil? :)

FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 30896 - Posted: 10 Nov 2006, 9:28:37 UTC

Conan, is 5.4.11 showing the debug info ?

Any time you open and close the Graphics it will make a log entry.



It would have been nice if BOINC had brought out a 5.6.6 version and made it gold, since the logging certainly works properly with 5.6.x+.

Though it would be even better if Rosetta@home uploaded the *.pdb debug file like they did during the initial stages, so people having problems could grab it and give real debug information back to Rosetta@home.
(Mods, Admin ???? Have they thought of doing this again?)

Another added bonus of the 5.6.x+ series is that it can log which graphics cards are actually being used, which means they may see a trend if a particular graphics card causes a particular crash (currently suspected to be ATI and sometimes integrated Intels; no one has mentioned Nvidia cards having screensaver troubles... of course not all of these are screensaver problems, though).


Guess we will not get most of this till 5.8.x comes out which will probably be a while as there are a lot of changes to test.
Team mauisun.org

Conan
Joined: 11 Oct 05
Posts: 150
Credit: 4,124,428
RAC: 2,489
Message 30898 - Posted: 10 Nov 2006, 10:34:00 UTC - in response to Message 30896.  
Last modified: 10 Nov 2006, 10:38:23 UTC

[quote]Conan, is 5.4.11 showing the debug info ?

Any time you open and close the Graphics it will make a log entry.

It would have been nice if BOINC had brought out a 5.6.6 version and made it gold, since the logging certainly works properly with 5.6.x+.

Though it would be even better if Rosetta@home uploaded the *.pdb debug file like they did during the initial stages, so people having problems could grab it and give real debug information back to Rosetta@home.
(Mods, Admin ???? Have they thought of doing this again?)

Another added bonus of the 5.6.x+ series is that it can log which graphics cards are actually being used, which means they may see a trend if a particular graphics card causes a particular crash (currently suspected to be ATI and sometimes integrated Intels; no one has mentioned Nvidia cards having screensaver troubles... of course not all of these are screensaver problems, though).

Guess we will not get most of this till 5.8.x comes out which will probably be a while as there are a lot of changes to test.[/quote]


Thanks FluffyChicken,
no, I can't see anything in the 'stdoutdae' file (where BOINC says it is going) indicating any debug information.
I may consider going the whole hog and putting on the latest version, but my experience with 5.4.11 was not good, as it keeps trying to do a completely fresh install and wipe out the current version's data files (it does not detect the BOINC version 5.2.13 that is already there). This has already happened once and added another computer to my list, so I just added the main components (boinc manager, boinc client, boinc command, boinc.dll). This may be why the debug is not working.
The only thing I have found is that an error message now tells me Seti has the wrong URL since changing to 5.4.11. So something is happening; I will have to detach and reattach to see if that problem fixes itself.
A big plus is that since changing to 5.4.11 I have had no lock-ups yet, with the current WU at over 3 hours now; the last ones were failing anywhere from a minute to 2 hours in, so I might just wait a while and see if any more fail before upgrading again.
Just checked and 5.4.11 is the latest recommended version. So why does it wipe previous versions off the map? Not a good update if you ask me.

FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 30899 - Posted: 10 Nov 2006, 12:09:53 UTC - in response to Message 30898.  

[quote]...I may consider going the whole hog and putting on the latest version, but my experience with 5.4.11 was not good, as it keeps trying to do a completely fresh install and wipe out the current version's data files (it does not detect the BOINC version 5.2.13 that is already there). This has already happened once and added another computer to my list, so I just added the main components (boinc manager, boinc client, boinc command, boinc.dll). This may be why the debug is not working.
The only thing I have found is that an error message now tells me Seti has the wrong URL since changing to 5.4.11. So something is happening; I will have to detach and reattach to see if that problem fixes itself.
A big plus is that since changing to 5.4.11 I have had no lock-ups yet, with the current WU at over 3 hours now; the last ones were failing anywhere from a minute to 2 hours in, so I might just wait a while and see if any more fail before upgrading again.
Just checked and 5.4.11 is the latest recommended version. So why does it wipe previous versions off the map? Not a good update if you ask me.[/quote]

It didn't for me. Strange, though it is a long time since I tested it.
You can always 'merge' the computers in your profile if you want to keep them together and neat.

They did change some URL project code (for security) but I don't remember Seti changing their URL either.

Team mauisun.org

Conan
Joined: 11 Oct 05
Posts: 150
Credit: 4,124,428
RAC: 2,489
Message 30904 - Posted: 10 Nov 2006, 15:44:20 UTC

> Everything was working fine, graphics not locking up and WU up to 5 hours 28 minutes and nearing completion.
I moved the mouse which stopped the screensaver and whilst looking at my task list on the manager I saw the Rosetta WU die.

https://boinc.bakerlab.org/result.php?resultid=46405086

This is the new error I received (at least it is a different one for me)

CPU time 19728.5
stderr out

<core_client_version>5.4.11</core_client_version>
<message>
Maximum disk usage exceeded
</message>
<stderr_txt>
# random seed: 3295364
# cpu_run_time_pref: 21600


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x7C901230

Engaging BOINC Windows Runtime Debugger...

Have not seen this one for a long time. I have a 250 GB HD and 2 GB of RAM, so disk or memory should not be a problem. The BOINC message said that the WU was aborted. This was done by either Rosetta, which I doubt, or by the BOINC client/manager. It could also be a bit of both.

Hoelder1in
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 30906 - Posted: 10 Nov 2006, 16:10:09 UTC - in response to Message 30904.  

[quote]<core_client_version>5.4.11</core_client_version>
<message>
Maximum disk usage exceeded
</message>
<stderr_txt>
# random seed: 3295364
# cpu_run_time_pref: 21600

Unhandled Exception Detected...[/quote]

46421301
<core_client_version>5.4.9</core_client_version>
<message>
Maximum disk usage exceeded
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# random seed: 3277006
# cpu_run_time_pref: 28800
SIGSEGV: segmentation violation

...another "disk space exceeded" error and it's the same WU type as the one Conan reported (my first error in ages, btw ;-).
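
When comparing reports like these two side by side, it can be handy to reduce each pasted snippet to a few fields. A minimal Python sketch, assuming the snippet layout seen in this thread (the pastes are truncated and lack a closing stderr_txt tag, so a regex is used rather than an XML parser); summarize is a made-up name.

```python
import re

# Snippet copied from the report above.
REPORT = """<core_client_version>5.4.9</core_client_version>
<message>
Maximum disk usage exceeded
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# random seed: 3277006
# cpu_run_time_pref: 28800
SIGSEGV: segmentation violation
"""


def summarize(report):
    """Pull client version, message and run header values from a snippet."""

    def tag(name):
        # Text between <name> and </name>, surrounding whitespace trimmed.
        m = re.search(rf"<{name}>\s*(.*?)\s*</{name}>", report, re.S)
        return m.group(1) if m else None

    def header(name):
        # Integer value of a '# name: value' line in the stderr text.
        m = re.search(rf"^# {name}: (\d+)\s*$", report, re.M)
        return int(m.group(1)) if m else None

    return {
        "client": tag("core_client_version"),
        "message": tag("message"),
        "seed": header("random seed"),
        "run_time_pref": header("cpu_run_time_pref"),
    }


if __name__ == "__main__":
    print(summarize(REPORT))
```

Run over both snippets, this makes it plain that the message is identical while the client versions (5.4.11 and 5.4.9) differ.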
Team betterhumans.com - discuss and celebrate the future - hoelder1in.org

©2024 University of Washington
https://www.bakerlab.org