RAC dropping, BOINC dropping comms

larry1186

Joined: 18 Apr 06
Posts: 7
Credit: 329,257
RAC: 0
Message 31907 - Posted: 1 Dec 2006, 17:54:51 UTC - in response to Message 31840.  

...one's computer may be asleep for hours or days (3-day weekend?).


...or a four day weekend... :(
I have my computer at work set up and running a CPDN-SA model which was almost done, and I expected my model to be finished when I got back from Thanksgiving break on Monday. It lost the connection to localhost at about 7:30 pm, Wednesday, a couple of hours after I left for the long weekend. The model finally finished this morning tho... Four days of a 3.06 GHz dual processor sitting idle is painful.

Earlier this week I found out that one of my projects had the hostid set to zero. I changed it to what it should be, and the manager hasn't dropped its connection since. We shall see what the weekend brings.
Don't get distracted by shiny objects.

zombie67 [MM]
Joined: 11 Feb 06
Posts: 316
Credit: 6,589,590
RAC: 198
Message 31909 - Posted: 1 Dec 2006, 19:06:25 UTC

Wow. Good thing I read this thread. Yep, the same thing is happening to me (boinc dead while boincmgr still running). It started (for me) just after upgrading to Crunch3r's 5.7.5 client (affinity turned on) a couple of weeks ago. Since then, it has happened 15-20 times across 10 machines.

I was just about to switch back to 5.4.11, but now I see there is no point. It is happening with 5.4.11 too.

Very frustrating! Lots of crunching time down the drain!
Reno, NV
Team: SETI.USA

Blainer
Joined: 14 Nov 06
Posts: 1
Credit: 1,814,334
RAC: 0
Message 31912 - Posted: 1 Dec 2006, 20:19:02 UTC
Last modified: 1 Dec 2006, 20:19:55 UTC

BOINC is crashing for me as well, but only when Rosetta is downloading. SETI and Einstein do not have any problems.

It crashes with the same memory address as the others have reported as well.

Running on a Core2Duo, XP Pro SP2, BOINC 5.4.11, and happened with both Rosetta 5.40 and 5.41.

Here's the latest dump:

*** UNHANDLED EXCEPTION ****
Reason: Access Violation (0xc0000005) at address 0x0033B014 read attempt to address 0x00000008

*** Dump of the (offending) thread: ***
eax=013dfc98 ebx=00f24118 ecx=00000000 edx=00f241c0 esi=01251b48 edi=00f241c0
eip=0033b014 esp=01defee0 ebp=00fbd4b0
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010202

ChildEBP RetAddr  Args to Child
01defee0 0033adcd 00f241c0 00000000 01251b48 00000015 libcurl!Curl_llist_insert_next+0x5 (c:boincsrcsdkscurllibllist.c:78) FPO: [3,0,0] 
01deff00 0032f7b3 00f24118 00f241c0 00000015 00fbd4b0 libcurl!Curl_hash_add+0xb (c:boincsrcsdkscurllibhash.c:165) FPO: [4,0,0] 
01deff24 0032fae5 012e2880 01248520 00f93dc8 00000050 libcurl!Curl_cache_addr+0x19 (c:boincsrcsdkscurllibhostip.c:361) FPO: [4,1,0] 
01deff48 0032fb52 003c7170 0032fd7c 0122afe8 00000000 libcurl!addrinfo_callback+0x15 (c:boincsrcsdkscurllibhostasyn.c:131) FPO: [0,1,0] 
01deff50 0032fd7c 0122afe8 00000000 003c7170 00000000 libcurl!Curl_addrinfo4_callback+0x12 (c:boincsrcsdkscurllibhostasyn.c:161) FPO: [3,0,0] 
01deff80 7c349565 00000000 00000000 0012ed08 00b873d8 libcurl!gethostbyname_thread+0x0 (c:boincsrcsdkscurllibhostthre.c:335) FPO: [1,4,0] 
01deffb4 7c80b683 00b873d8 00000000 0012ed08 00b873d8 MSVCR71!__endthreadex+0x0 (c:boincsrcsdkscurllibhostthre.c:335) 
01deffec 00000000 7c3494f6 00b873d8 00000000 000000c8 kernel32!_BaseThreadStart@8+0x0 (c:boincsrcsdkscurllibhostthre.c:335) 

Exiting...


I hope this is figured out soon. I'm leaving for the weekend, and I'd love to leave the system on for 72 hours of straight processing time, but there's no point if BOINC is probably going to die 3 hours after I leave. :D
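
For anyone comparing dumps, here is a minimal Python sketch that pulls the `module!function+offset` symbols out of a trace like the one above, so the top frame can be compared across reports. It is not part of BOINC; the names `FRAME_RE` and `parse_frames` are made up for illustration, and the sample lines are copied from this dump as posted.

```python
import re

# Matches symbols such as "libcurl!Curl_llist_insert_next+0x5".
# Hypothetical helper for illustration; not a BOINC or debugger API.
FRAME_RE = re.compile(
    r"(?P<module>\w+)!(?P<func>[^+\s]+)\+0x(?P<offset>[0-9a-fA-F]+)"
)

def parse_frames(dump_text):
    """Return (module, function, offset) for each stack frame line found."""
    frames = []
    for line in dump_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append(
                (match.group("module"),
                 match.group("func"),
                 int(match.group("offset"), 16))
            )
    return frames

# Three frames copied verbatim from the dump above (paths left as posted).
SAMPLE = """\
01defee0 0033adcd 00f241c0 00000000 01251b48 00000015 libcurl!Curl_llist_insert_next+0x5 (c:boincsrcsdkscurllibllist.c:78) FPO: [3,0,0]
01deff80 7c349565 00000000 00000000 0012ed08 00b873d8 libcurl!gethostbyname_thread+0x0 (c:boincsrcsdkscurllibhostthre.c:335) FPO: [1,4,0]
01deffec 00000000 7c3494f6 00b873d8 00000000 000000c8 kernel32!_BaseThreadStart@8+0x0 (c:boincsrcsdkscurllibhostthre.c:335)
"""

frames = parse_frames(SAMPLE)
for module, func, offset in frames:
    print(f"{module:10s} {func} (+{offset:#x})")
```

The first tuple is the faulting frame. If other reports show the same top frame, that would support the idea that everyone is hitting the same fault in the host name lookup code rather than several unrelated crashes.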

Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 31914 - Posted: 1 Dec 2006, 21:27:59 UTC - in response to Message 31912.  

...I hope this is figured out soon. ...

We sit here and discuss the problem amongst ourselves; the issue was reported on a boinc client forum. But is anyone trying to resolve it? Does anyone capable of resolving it consider this a problem at all? Are the right people looking into it -- Boinc? Rosetta? Or are we just hoping for action that isn't coming?

Marky-UK
Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 31915 - Posted: 1 Dec 2006, 21:31:27 UTC

I've sent details and one crash log to Rom and should be able to send some more next week.

Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 31917 - Posted: 1 Dec 2006, 22:48:10 UTC - in response to Message 31915.  

I've sent details and one crash log to Rom and should be able to send some more next week.
You are proactive and I hope it bears fruit.

EdMulock
Joined: 14 Mar 06
Posts: 30
Credit: 2,347,485
RAC: 0
Message 31943 - Posted: 2 Dec 2006, 16:02:42 UTC

This is definitely frustrating since more than half my farm is across town, and it keeps happening.

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 31948 - Posted: 2 Dec 2006, 16:43:13 UTC
Last modified: 2 Dec 2006, 16:44:28 UTC

Ed, you mentioned in your other post that you are running 4 projects. I've noticed that whenever I bring BOINC back up after such a drop, it is ALWAYS downloading files. That suggests to me that something during the download causes a hiccup in the TCP stack on the PC, and the manager then loses contact with the boinc.exe that controls everything (which eventually causes everything to shut down). Do you know if there is any pattern to what is being downloaded at the time? Since I always run Rosetta, I always see Rosetta being downloaded. But I was wondering whether you are just as likely to catch the other projects downloading when you restart.

It is sometimes hard to catch the transfers, since the screen only refreshes every 5 seconds, and your BOINC Manager generally doesn't open to the transfers tab. If you have a fast connection you can miss them, and sometimes the file I've got left is only a few bytes long anyway. I happen to see them because I set BOINC to only use the network at night when I'm away from the machine. So, anytime I am AT the machine to restart BOINC, it is during the hours BOINC does not use the network... so the transfers that were in progress during the drop are suspended until the next night.

I tried the suggestion in this thread to limit BOINC to one connection at a time. As best I can tell, BOINC is ignoring that setting and still using two at a time. So, I still see the BOINC manager dropping the localhost, and setting to one file at a time did not seem to help the problem.

I should also note that now that I've had more occurrences of the problem, I've seen cases where more than one file is in my transfers list as well. In fact, that is how I saw that my setting requesting only one connection at a time was apparently ignored.
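
Since the symptom is the manager losing its localhost connection to the client, a quick way to tell "client gone" from "manager confused" is to check whether anything is still listening on the client's GUI RPC port. A minimal Python sketch, assuming the port is 31416 (the usual default; older clients or custom setups may use a different one). `client_is_listening` is a made-up helper name, not a BOINC tool.

```python
import socket

# Assumption: the core client listens for GUI RPCs on this port.
BOINC_GUI_RPC_PORT = 31416

def client_is_listening(host="127.0.0.1", port=BOINC_GUI_RPC_PORT, timeout=2.0):
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: nothing usable is listening.
        return False

if __name__ == "__main__":
    if client_is_listening():
        print("Something is accepting connections on the GUI RPC port.")
    else:
        print("No listener found; the core client may have exited.")
```

Run from a scheduled task every few minutes, this would at least log when the client went away, which could then be matched against what was downloading at the time. It only tests that a TCP connection is accepted, not that the listener is actually BOINC.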
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

zombie67 [MM]
Joined: 11 Feb 06
Posts: 316
Credit: 6,589,590
RAC: 198
Message 31988 - Posted: 3 Dec 2006, 5:11:42 UTC

Some observations:

1) Seems to be happening across several versions of BOINC (5.4.9, 5.4.11, 5.7.5 Crunch3r). But what about OS? I have 6 Macs, none of which have had this problem *at all*.

2) Perhaps it is project related? My Macs crunch SETI only. Perhaps it is a problem with Rosetta? Maybe it started with 5.40 or 5.41? Just a stab in the dark. Can anyone tie the start of the problem to the release of either of those?

3) I recently increased my resource allocation for Rosetta, and noticed the problems happening more frequently. Maybe just coincidence.

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 32018 - Posted: 3 Dec 2006, 18:30:14 UTC

Zombie, I'm running three machines on Windows. All have had the problem at one time or another. I ALWAYS see files being transferred when I fire BOINC back up again, and for this reason, I have been of the camp that feels this is either a BOINC or Windows TCP stack problem, not Rosetta. ...But I only crunch Rosetta and Ralph now, so that's why I am hoping Ed can share some of his experience from a setup with 4 projects going.

Colin Smith
Joined: 29 Sep 06
Posts: 1
Credit: 894,657
RAC: 0
Message 32045 - Posted: 4 Dec 2006, 3:10:46 UTC

I run 3 projects: Einstein, Rosetta, and Climate Prediction. I have been having problems since November 7 or 8, about the same time as everybody else. After trying to figure out why BOINC wouldn't run anymore, and hearing that all the people who were having difficulties were running Rosetta, I suspended Rosetta to see if it would make a difference. It ran for 2 days without quitting, when before I was lucky if it lasted 4 hours. Just to make sure, I resumed Rosetta again today, and within a couple of hours I found BOINC down again.

Based on this, I am pretty sure that something in Rosetta is causing it.

zombie67 [MM]
Joined: 11 Feb 06
Posts: 316
Credit: 6,589,590
RAC: 198
Message 32097 - Posted: 5 Dec 2006, 5:27:36 UTC

Hmmmmmm.....

Maybe this is a coincidence, but it's been 36 hours since any of my Windows machines have had another problem. Knock wood. For the last several weeks, I haven't gone more than 8 hours without at least one crapping out.

EdMulock
Joined: 14 Mar 06
Posts: 30
Credit: 2,347,485
RAC: 0
Message 32115 - Posted: 5 Dec 2006, 16:57:54 UTC - in response to Message 31948.  

Ed, you mentioned in your other post that you are running 4 projects. I've noticed that it seems when I bring BOINC back up after such a drop, that it is ALWAYS downloading files....

I only run Rosetta. The new theory is that this is related to hyperthreaded Intel processors, so I am trying limiting the max number of processors to 1.


zombie67 [MM]
Joined: 11 Feb 06
Posts: 316
Credit: 6,589,590
RAC: 198
Message 32130 - Posted: 5 Dec 2006, 22:23:34 UTC - in response to Message 32097.  

Maybe this is a coincidence, but it's been 36 hours since any of my Windows machines have had another problem. Knock wood. For the last several weeks, I haven't gone more than 8 hours without at least one crapping out.


Spoke too soon. Just had one fail. It's a HT machine, if that helps.

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=318945



Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 32135 - Posted: 6 Dec 2006, 0:59:09 UTC

All of mine are HT machines. Ed may be on to something there. Keep us posted.

Also, Ed, sorry that I misread the post of yours that I'd quoted. I see now that you were quoting another user's post, and that user had 4 projects. Life is full of details. :(

Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 32141 - Posted: 6 Dec 2006, 3:00:41 UTC

Just lost connection to the client a few minutes ago. I selected the Transfers tab before exiting the mgr. When I restarted the mgr I saw 2 Rosetta files in a download status and 10 more Rosetta files in download pending status. The associated task appears to be 2tif_ETABLE_TEST_ABRELAX_nov19_1411_6818_2

I'm using BOINC 5.4.9, Windows XP, hyperthreaded, 2 active projects with Rosetta getting 60% resource share. But my error rate isn't as high as some people seem to experience.
Dump Timestamp : 12/05/06 21:14:25
Dump Timestamp : 11/23/06 23:46:30
Dump Timestamp : 11/22/06 15:47:48
Dump Timestamp : 11/10/06 00:16:06
Dump Timestamp : 11/09/06 16:09:28
Dump Timestamp : 11/03/06 15:26:57
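
To put numbers on "error rate", here is a small Python sketch that sorts the dump timestamps listed above and prints the time between consecutive crashes. It assumes the stamps are month/day/year in local time; `parse_dump_times` is just an illustrative helper name.

```python
from datetime import datetime

# Timestamps copied from the list above.
DUMP_LINES = """\
Dump Timestamp : 12/05/06 21:14:25
Dump Timestamp : 11/23/06 23:46:30
Dump Timestamp : 11/22/06 15:47:48
Dump Timestamp : 11/10/06 00:16:06
Dump Timestamp : 11/09/06 16:09:28
Dump Timestamp : 11/03/06 15:26:57
"""

def parse_dump_times(text):
    """Return the dump timestamps in chronological order."""
    times = []
    for line in text.splitlines():
        label, sep, stamp = line.partition(":")
        if sep and label.strip() == "Dump Timestamp":
            # Assumption: MM/DD/YY, as the posting dates suggest.
            times.append(datetime.strptime(stamp.strip(), "%m/%d/%y %H:%M:%S"))
    return sorted(times)

times = parse_dump_times(DUMP_LINES)
gaps = [later - earlier for earlier, later in zip(times, times[1:])]

for start, gap in zip(times, gaps):
    print(f"{start:%Y-%m-%d %H:%M:%S} -> next crash after {gap}")
```

On this list the gaps range from about 8 hours to more than 12 days, so on this machine the crashes are irregular rather than periodic.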

dcdc
Joined: 3 Nov 05
Posts: 1829
Credit: 115,879,649
RAC: 60,477
Message 32158 - Posted: 6 Dec 2006, 9:27:46 UTC
Last modified: 6 Dec 2006, 9:28:39 UTC

I've had this happen on a machine twice now:

Win XP
Athlon XP-M
Non-service install
Running only Rosetta

All my other machines run a service install, so I've not seen this before. This machine is running BOINC from a network drive, but it's been running like this for about a year now and I've never had a problem with it dropping comms unless the hub or server was powered down (or changes were made to the onboard NIC), neither of which has happened recently. The first time I didn't think anything of it and just restarted BOINC. This morning I started BOINC before I had a good look under all the tabs. If it happens again I'll have a look through the logs...

Danny

Marky-UK
Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 32262 - Posted: 8 Dec 2006, 9:35:23 UTC

I had another crash overnight, which is my first since 5.41 was released, which pretty much rules out the long command-line as the culprit.

I've had the crash on various PC configurations; the only common factor is Rosetta.

NT4, Win2000 and WinXP.
PIII, P4 and Athlon XP.
Single CPU, dual CPU (not dual core) and hyperthreaded P4.

Nothing But Idle Time
Joined: 28 Sep 05
Posts: 209
Credit: 139,545
RAC: 0
Message 32268 - Posted: 8 Dec 2006, 12:34:05 UTC - in response to Message 32262.  

...I've had the crash on various PC configurations; the only common factor is Rosetta...

If only Rosetta alone were affected by this phenomenon; but when the client stops, so do all other attached projects. I've resorted to filling up my cache and setting Rosetta to "No new tasks": after all the queued Rosetta tasks are done, I report them and get new tasks under my supervision. If the downloads cause a problem, I can remedy it immediately.

zombie67 [MM]
Joined: 11 Feb 06
Posts: 316
Credit: 6,589,590
RAC: 198
Message 32271 - Posted: 8 Dec 2006, 13:56:40 UTC

Went three whole days without it happening, then bang, 3 machines failed overnight. Two were P4 w/ HT, one was an X2 (no HT obviously). These three machines are all Windows, all running 20+ projects, none running as a service. Let me know if there is any further information that would be useful.


©2024 University of Washington
https://www.bakerlab.org