Common Denominator?: compute errors and zero cpu usage



Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 49968 - Posted: 23 Dec 2007, 13:01:02 UTC
Last modified: 23 Dec 2007, 13:14:40 UTC

I awoke this morning to a prompt saying "boinc needs to connect to the internet". Strangely, I'm on an ADSL line, so the internet should always be available. I have had this several times since moving to ADSL years ago and know that I usually just need to reset the modem. What this means is that my internet connection went down during the night. I think this is the common denominator for what I describe next and for many of the problems sparsely reported at Rosetta.

I then noticed two things before resetting the modem. First, many of my hosts had consecutive "computation errors" (listed below by host). Second, two of the 5 machines running Linux showed ZERO CPU usage although the task page showed WUs as "running". Running killall -9 boinc and restarting BOINC fixed this (after resetting the modem). I have seen many reports of the "zero CPU use" bug and have even seen it once myself. It's tough to track down freakish occurrences.
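For anyone who wants to check for the "running but zero CPU" state without staring at top, here is a minimal sketch, assuming a Linux host with /proc available. It samples a task's cumulative CPU ticks twice and turns the difference into a percentage. The function names are made up for illustration, and you have to supply the PID of the science app yourself.

```python
import os


def cpu_ticks_from_stat(stat_line):
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces,
    # so split on the last ')' before splitting the remaining fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is the process state (field 3 of the file), so utime
    # (field 14) is fields[11] and stime (field 15) is fields[12].
    return int(fields[11]) + int(fields[12])


def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s=None):
    """CPU use over the sampling interval, as a percentage of one core."""
    if ticks_per_s is None:
        # Kernel clock ticks per second (commonly 100 on Linux).
        ticks_per_s = os.sysconf("SC_CLK_TCK")
    cpu_seconds = (ticks_after - ticks_before) / ticks_per_s
    return 100.0 * cpu_seconds / interval_s


def read_stat(pid):
    """Read the raw stat line for a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return f.read()
```

To use it, call cpu_ticks_from_stat(read_stat(pid)) twice a few seconds apart and pass both values to cpu_percent. A task that BOINC lists as "running" but that reports about 0% over several samples would match the symptom described above.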

This report mainly applies to Linux users, as my one Windows machine seemed to work flawlessly last night. The "zero CPU usage" bug might be affecting Windows users too, but I'm not sure of that.

I wonder what process is calling for the internet, not getting it, and causing "computation errors". Or perhaps, once that call fails, it yields computation errors on successive WUs until the internet returns. This has to have something to do with the loss of communications. I don't think it is solely a 5.91 issue, as I have seen it before with earlier versions of the Rosetta app, and others reported it prior to 5.91. This may well be a BOINC problem as well. The reason I'm certain this is an app and/or BOINC problem is that 5.91 ran well up until last night, failed last night, and has continued to run well since I reset the modem.

For my AMD64 3700 hostid=692481 these are the computation errors:

resultid=128342834
resultid=128342219

For my AMD64 X2 4800 hostid=692483 this is the computation error:

resultid=128614143

For my AMD64 X2 6000 hostid=699377 these are the computation errors:

resultid=128626741
resultid=128632201
resultid=128647762
resultid=128660399

For my host AMD64 X2 5200 hostid=586640 these are the computation errors:

resultid=128344775
resultid=128344731

Now the AMD64 2800 didn't have any errors, and neither did my Windows Mobile AMD64 3700. However, this many consecutive errors, spread across this many hosts, is truly unique in my memory.

If any of you also think your computation errors and/or zero CPU usage problems might have to do with internet failure, please keep an eye out for this and report cases as they come.

thanks
tony
BitSpit
Joined: 5 Nov 05
Posts: 33
Credit: 4,147,344
RAC: 0
Message 49970 - Posted: 23 Dec 2007, 14:49:30 UTC
Last modified: 23 Dec 2007, 14:57:48 UTC

I remember reading about this a couple of years ago. This is a known BOINC problem. From what I remember, its net code uses blocking calls. The problem is that those blocking calls prevent BOINC from communicating with any running app. If I'm remembering correctly, the BOINC devs have said that it's broken by design and that they don't want to put in the effort required to fix it.
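To illustrate the general idea (this is not BOINC's actual code), here is a toy simulation of a single-threaded client loop that does its network work and then messages the running app on each pass. The simulated clock, the 30-second hang, and the 1-second loop period are all assumptions chosen for the example, and every name in it is hypothetical.

```python
class FakeClock:
    """Simulated time, so the example runs instantly and deterministically."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def blocking_call(clock):
    # Link is down: the call hangs until an assumed 30 s timeout expires.
    clock.advance(30.0)


def nonblocking_call(clock):
    # Returns immediately; a real client would poll for the result later.
    pass


def run_loop(clock, passes, network_call):
    """One thread does network work, then messages the app, once per pass."""
    message_times = []
    for _ in range(passes):
        network_call(clock)              # may stall the whole loop
        message_times.append(clock.now)  # the app hears from the client here
        clock.advance(1.0)               # assumed 1 s loop period
    return message_times


def longest_silence(message_times):
    """Longest gap between consecutive messages to the app."""
    return max(b - a for a, b in zip(message_times, message_times[1:]))
```

With blocking_call, longest_silence(run_loop(FakeClock(), 3, blocking_call)) comes out to 31 simulated seconds, against 1 second with nonblocking_call. Whether a silence like that is what actually makes a science app stall or error out is not something this sketch shows.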

Edit: Here's a good explanation of the problem.
Luuklag

Joined: 13 Sep 07
Posts: 262
Credit: 4,171
RAC: 0
Message 50031 - Posted: 25 Dec 2007, 12:22:26 UTC

Well, from what I read in your post, I think the problem is this: the Rosetta app errored out, not BOINC. That made the WUs fail. And when Rosetta errors out, the Rosetta app itself (not BOINC, which has already been given the OK to use the internet) wants to connect to the project servers and upload a bug report, so it asks for permission to use your internet connection.




©2024 University of Washington
https://www.bakerlab.org