Loads and loads of computing errors today

Message boards : Number crunching : Loads and loads of computing errors today

Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 1700 - Posted: 25 Oct 2005, 1:52:41 UTC

There must be something wrong on the server side. All of my machines are Macs. One is a Dual G4, one is a Laptop G4, and the third is a Dual G5. The Dual G4 is producing a very high error rate, the G5 has a few errors but not as many, and the Laptop is having no problems. I have changed nothing on my end. All that has changed is the type of WU (if the name means anything). The Random_length and Random_Gauss WUs seem to be a problem. Since the BOINC client is nothing more than a scheduler for the application, it would not be the problem here. It is either the application or the WU. Since the application has changed, I would think that is the place to start (hello David).

Regards
phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 1702 - Posted: 25 Oct 2005, 3:29:30 UTC - in response to Message 1700.  

[quote]There must be something wrong on the server side. All of my machines are Macs. One is a Dual G4, one is a Laptop G4, and the third is a Dual G5. The Dual G4 is producing a very high error rate, the G5 has a few errors but not as many, and the Laptop is having no problems. I have changed nothing on my end. All that has changed is the type of WU (if the name means anything). The Random_length and Random_Gauss WUs seem to be a problem. Since the BOINC client is nothing more than a scheduler for the application, it would not be the problem here. It is either the application or the WU. Since the application has changed, I would think that is the place to start (hello David).

Regards
phil[/quote]


Can you restart your client on the Dual G4 and see what happens?

mscharmack
Joined: 29 Sep 05
Posts: 2
Credit: 11,323
RAC: 0
Message 1703 - Posted: 25 Oct 2005, 3:40:20 UTC

About 99% of the WUs downloaded since 21 Oct 2005 ~19:15 UTC have ended with an unrecoverable error. Nothing has changed on my computers, and with so many people having the same problem, it has to be a problem with Rosetta@home work units.

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1709 - Posted: 25 Oct 2005, 7:07:17 UTC

Well, just to be contrary, but, I have not had an error since the 22nd ...

Running on windows XP Pro and OS-X (G5 - Tiger) ...

Windows machines are AMD Athlon 64, Xeon (32-bit and 64-bit), P4 (HT and non-HT) ...

So, I don't get it ...

As I said, I had one that "stuck", but restarting the BOINC Client Software looks like it "cured" that one. For others that I suspected of taking too long, a suspend/resume seemed to work ...

Obviously our mileage is varying ...

AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 1713 - Posted: 25 Oct 2005, 7:24:27 UTC - in response to Message 1703.  
Last modified: 25 Oct 2005, 8:19:18 UTC

[quote]About 99% of the WUs downloaded since 21 Oct 2005 ~19:15 UTC have ended with an unrecoverable error. Nothing has changed on my computers, and with so many people having the same problem, it has to be a problem with Rosetta@home work units.[/quote]
>I had a look at your computer benchmarks, as I have similar AMD units, and you appear to be using BOINC ver. 4.19 or are overclocked. If you are still using 4.19 you should upgrade to BOINC ver. 5.2.2 and that should solve the problem. (I had the same problems on 4 boxes.) Hope this helps....Cheers, Rog.

AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 1714 - Posted: 25 Oct 2005, 7:29:01 UTC - in response to Message 1698.  

[quote]I had 3 out of 4 WUs end in computation errors.....[/quote]

>If you are still using BOINC ver. 4.19 an upgrade to ver 5.2.2 should solve your problems.....Cheers, Rog.

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1715 - Posted: 25 Oct 2005, 7:43:11 UTC
Last modified: 25 Oct 2005, 7:43:42 UTC

David,

We need to put a restart capability into the science application. I just had another couple of work units spend 2-5 hours at 1% completion... restarting the client seems to fix them (well, I have one that may still be hung: after a restart it is at 8 minutes and still 1%).

I think if it has spent an hour (ha! There it went up!) at 1%, it should be halted, completely unloaded, restarted, and if it hangs 3 times like that ... well ... something is bad. But, this is a pretty hefty waste of resources as it can "sneak up" on you if you don't watch it. Heck, if I had not been unable to sleep, these might have tried to run all night doing nothing ...

As most of them seem to start within 10 minutes, we might try a lower limit of 20 minutes ... but, your call ...
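
A minimal sketch of the restart rule proposed above, just to make the idea concrete. It is not how the Rosetta application or the BOINC client actually works; the class name, the method name, and the way progress is sampled are all assumptions made for illustration. The one-hour limit and the three-hang cutoff are the numbers suggested in this post.

```python
from dataclasses import dataclass

STALL_LIMIT_S = 60 * 60  # one hour with no progress (value suggested in this post)
MAX_HANGS = 3            # on the third hang, stop restarting and flag the work unit


@dataclass
class StallWatchdog:
    """Hypothetical watchdog: decides what to do with one work unit."""
    stall_limit_s: float = STALL_LIMIT_S
    max_hangs: int = MAX_HANGS
    last_progress: float = 0.0
    last_change_at: float = 0.0
    hangs: int = 0

    def observe(self, now_s: float, progress: float) -> str:
        """Take one progress sample and return 'ok', 'restart', or 'abort'."""
        if progress > self.last_progress:
            # Progress moved, so the stall timer starts over.
            self.last_progress = progress
            self.last_change_at = now_s
            return "ok"
        if now_s - self.last_change_at < self.stall_limit_s:
            return "ok"
        # No progress for the whole limit: count it as a hang.
        self.hangs += 1
        self.last_change_at = now_s
        return "abort" if self.hangs >= self.max_hangs else "restart"
```

The limit is a parameter on purpose, since the right value depends on how slow the host is.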

[BAT] tutta55
Joined: 16 Sep 05
Posts: 59
Credit: 99,832
RAC: 0
Message 1716 - Posted: 25 Oct 2005, 7:43:37 UTC

Sorry to contradict you, Roger. But some people running 4.45 also have the problem. Just take a look at the WU I refer to in the message I started this thread with. There are many similar cases where both 4.19 and 4.45 result in an error, albeit with a different error message.

And, the problem is indeed with the new Rosetta app, since I never had it with their 4.77 version. If they now require version 5 of the BOINC client software, that is fine with me. But then it should be clearly stated, and if possible imposed by the server. If not, well I think the problem should be fixed. People running older versions of the BOINC client may have good reasons to do so.

BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning
Tutta55's Lair

[BAT] tutta55
Joined: 16 Sep 05
Posts: 59
Credit: 99,832
RAC: 0
Message 1717 - Posted: 25 Oct 2005, 7:49:46 UTC - in response to Message 1715.  
Last modified: 25 Oct 2005, 7:54:11 UTC

[quote]As most of them seem to start within 10 minutes, we might try a lower limit of 20 minutes ... but, your call ...[/quote]


@Paul: 20 minutes would be a bit too low. Good ol' me has a PIII 800MHz and the WUs named sim_aneal take about 50 minutes to get past the 1% barrier :-)

Additionally, if this auto restart is implemented, it would be nice if the CPU time already spent was not reset, but added to the total processing time.
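
A rough sketch of what "not reset, but added" could look like. The class and its fields are hypothetical, not anything taken from the BOINC or Rosetta code; the point is only that a restart banks the CPU seconds already spent instead of throwing them away.

```python
class CpuTimeLedger:
    """Hypothetical bookkeeping for one work unit across automatic restarts."""

    def __init__(self) -> None:
        self.banked_s = 0.0   # CPU seconds from attempts that were halted
        self.current_s = 0.0  # CPU seconds of the attempt now running

    def add(self, seconds: float) -> None:
        """Record CPU time used by the running attempt."""
        self.current_s += seconds

    def restart(self) -> None:
        """Bank the time spent so far, then begin a fresh attempt."""
        self.banked_s += self.current_s
        self.current_s = 0.0

    @property
    def total_s(self) -> float:
        """Total processing time to report for the work unit."""
        return self.banked_s + self.current_s
```

With this, a WU that ran 50 minutes, was restarted, and then ran 10 more would report 60 minutes in total rather than 10.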

BOINC.BE: For Belgians who love the smell of glowing red cpu's in the morning
Tutta55's Lair

AnRM
Joined: 18 Sep 05
Posts: 123
Credit: 1,355,486
RAC: 0
Message 1719 - Posted: 25 Oct 2005, 8:17:28 UTC - in response to Message 1716.  
Last modified: 25 Oct 2005, 8:21:47 UTC

[quote]Sorry to contradict you, Roger. But some people running 4.45 also have the problem. Just take a look at the WU I refer to in the message I started this thread with. There are many similar cases where both 4.19 and 4.45 result in an error, albeit with a different error message.

And, the problem is indeed with the new Rosetta app, since I never had it with their 4.77 version. If they now require version 5 of the BOINC client software, that is fine with me. But then it should be clearly stated, and if possible imposed by the server. If not, well I think the problem should be fixed. People running older versions of the BOINC client may have good reasons to do so.[/quote]

>I can't argue with your logic, as my problems obviously started with R@H 4.78 as well. I edited out my comment.... I guess I should have been more precise when I said it wasn't the Rosetta app. In my experience it isn't the Rosetta app if you upgrade to BOINC 5.x. Your point is well taken, though, as some people may not want to upgrade.....Cheers, Rog.

Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 1723 - Posted: 25 Oct 2005, 10:17:16 UTC - in response to Message 1719.  
Last modified: 25 Oct 2005, 10:19:10 UTC

[quote]In my experience it isn't the Rosetta app if you upgrade to BOINC 5.x. Your point is well taken, though, as some people may not want to upgrade.....Cheers, Rog.[/quote]


My Athlon XP 3000+ has the 5.2.2 BOINC client and trashes about 2 out of 10 work units. Most of them are of the 0xC0000005 type, but there have been a few others (error 1 and error -164).

No real pattern to it - sometimes it's every second WU, sometimes it's three in a row, sometimes all is well for half a dozen. It has 1GB RAM and runs Rosetta exclusively (apart from limited normal use of the computer).

It had a hardware problem (disk drive cable), which showed up in Windows event logs (Windows XP Pro SP2), but that's been fixed. I still get problems with Rosetta WUs crashing out, and there are no messages in the event logs indicating any hardware/software problem at the time of these crashes.

Maybe I should let it run dry, then reset the project or even detach, uninstall BOINC, and start from scratch?

The only other thing that ~may~ be an issue, which I have seen on another PC, but not this one, was that when the clock (time) was adjusted, a WU crashed. Could it be related?
*** Join BOINC@Australia today ***

[AF>Belgique]Mamouth
Joined: 18 Sep 05
Posts: 4
Credit: 580,683
RAC: 0
Message 1725 - Posted: 25 Oct 2005, 10:53:52 UTC

My 50 cents:

On my P4 1.5 GHz Win2K machine at work I have never had any error.

At home, on a P4 3.0 GHz HT with WinXP, I get a lot of errors.

Both are using CC 5.x.



Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 1726 - Posted: 25 Oct 2005, 11:31:15 UTC - in response to Message 1702.  

[quote][quote]There must be something wrong on the server side. All of my machines are Macs. One is a Dual G4, one is a Laptop G4, and the third is a Dual G5. The Dual G4 is producing a very high error rate, the G5 has a few errors but not as many, and the Laptop is having no problems. I have changed nothing on my end. All that has changed is the type of WU (if the name means anything). The Random_length and Random_Gauss WUs seem to be a problem. Since the BOINC client is nothing more than a scheduler for the application, it would not be the problem here. It is either the application or the WU. Since the application has changed, I would think that is the place to start (hello David).

Regards
phil[/quote]

Can you restart your client on the Dual G4 and see what happens?[/quote]


David,

I have already tried that. All of the errors seem to be on WUs of the type "1hz6A_abrelaxmode_random_length05_16882" and only on the G4 Dual.

The only error on the Dual G5 was of type "1hz6A_abrelaxmode_random_gauss_sim_aneal_00047".

I will restart the entire system again and see if it makes a diff, but so far restarting BOINC has not changed a thing. These errors started on the 24th. Before that all was well on all three systems.

Regards
Phil




We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 1729 - Posted: 25 Oct 2005, 12:16:03 UTC

I haven't had any problems the past many days, since I've had the WU's left in memory, for what it's worth.

The WU's are fast as lightning, but they come and go without any problems.

Only one WU vanished into thin air after a restore of my system. :-(




[b]"I'm trying to maintain a shred of dignity in this world." - Me[/b]


kb7rzf
Joined: 7 Oct 05
Posts: 16
Credit: 35,427
RAC: 0
Message 1732 - Posted: 25 Oct 2005, 14:35:45 UTC

Since I started with this project on Oct 15th, I have only had 2 WUs that errored; all others have been just fine. I'm running a Dell Dimension computer with an Intel Celeron 2.6 GHz, 512 MB RAM, WinXP Home, BOINC 5.2.2, and I have my preferences set to leave the project in memory. Dunno how helpful that info is but there ya have it. :-)

Jeremy


Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 1765 - Posted: 26 Oct 2005, 2:58:16 UTC

Ok, so just to be clear. I have three systems working on R@H. All three are Macs. One is a 2GHz G5 dual CPU, one is a 1.4 GHz G4 Dual CPU, and one is a Powerbook 1GHz G4. All three are running Mac OS 10.4.2. All three computers are running the same version of BOINC, MacNN 4.44 Superbench, and they are all using the R@H 4.77 client app. The G5 Dual is running E@H, R@H, P@H, and S@H. The G4 Dual is running P@H, R@H, S@H, and CP@H (it is also attached to E@H and XtremLab but they are suspended and have been for some time now). The Powerbook is running S@H, P@H and R@H. All three systems were running fine until the WUs distributed after 23 Oct 2005 23:09:31 UTC. That's when the errors started. None of the other apps are having any problems on any of the systems.

The G5 and the Powerbook are not having any problems except the occasional client error but they are not common. The G4 Dual gets a client error on every WU from R@H.

I have tried restarting the BOINC client, I have tried restarting the computer (which of course restarts the BOINC client), I have tried resetting the R@H project. I am still getting client errors on every R@H WU. Sometimes they error in just a few seconds, and sometimes they error after an hour or so. Most of these WUs are "1hz6A_abrelaxmode_random_length05_xxxx" type, and some are "1hz6A_abrelaxmode_random_gauss_cntrlx_xxxx", but all the errors are one or the other of these two types.

After the reset the system downloaded two more WUs (1pvaA_abrelax_68232, 1pvaA_abrelax_66394). It would seem to me that if these two WUs process ok, that should tell us all something. It would seem to me that the BOINC client is not the problem (all three are the same); it would also seem to me that there is some compatibility problem between a Dual G4 Mac and your application where the "Length" and "Gauss" type WUs are concerned. Perhaps some element of dual CPUs is not compatible in all cases with all WU types.

All other system conditions have remained constant on these systems. Now if the G4 Dual processes the two WUs it now has with no problem, then it is most likely the WUs that are causing the problem. If they fail it would seem to me that there is something wrong with the application compile. Perhaps a DEV flag not set right for the G4 Dual system during compile. In any case the BOINC client does nothing more than scheduling and keeping score, and in my case I do not believe the BOINC client is the problem.
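
For anyone who wants to check the same thing on their own results, here is a small sketch of the comparison described above: group failures by host and by WU type. It assumes, purely from the names quoted in this thread, that a WU name is a protein prefix, then a type, then a trailing number; the function names and the input format are made up for illustration and are not part of BOINC.

```python
from collections import Counter


def wu_type(name: str) -> str:
    """Drop the leading protein id and the trailing number from a WU name.

    Assumed layout: <protein>_<type ...>_<number>.
    """
    parts = name.split("_")
    return "_".join(parts[1:-1])


def error_counts(results):
    """Count failed results per (host, WU type).

    `results` is an iterable of (host, wu_name, succeeded) tuples
    that you would fill in from your own results pages.
    """
    counts = Counter()
    for host, name, succeeded in results:
        if not succeeded:
            counts[(host, wu_type(name))] += 1
    return counts
```

If the failures pile up under one host for every WU type, that points at the machine or the build; if they pile up under one WU type across hosts, that points at the WUs.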

Regards
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 1773 - Posted: 26 Oct 2005, 5:18:41 UTC
Last modified: 26 Oct 2005, 5:24:07 UTC

It could very well be a compiler issue for the G4 dual CPU. In order to support 10.3.9 we had to recompile the gcc4 compiler (Rosetta has issues with gcc3.3) on a 10.3.9 machine, since the gcc4 compiler that comes with Xcode2 limits apps to OSX10.4 (unless the cross-dev SDK is used, but it didn't work when I tried it). This has helped overall, since we are now getting results from people with 10.3.9 and the success rates have increased dramatically. The drawback is that we can't take advantage of the Mac-specific optimizations. The Rosetta BOINC code will be available soon if anyone is interested in helping us debug and optimize. I'll be sure to post it on the news when it is available. Snake_doctor, I would stop R@h on your dual G4 and dedicate it to the other projects until we get a fix (if you haven't already).

Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 1787 - Posted: 26 Oct 2005, 12:06:55 UTC - in response to Message 1773.  

[quote]It could very well be a compiler issue for the G4 dual CPU. In order to support 10.3.9 we had to recompile the gcc4 compiler (Rosetta has issues with gcc3.3) on a 10.3.9 machine, since the gcc4 compiler that comes with Xcode2 limits apps to OSX10.4 (unless the cross-dev SDK is used, but it didn't work when I tried it). This has helped overall, since we are now getting results from people with 10.3.9 and the success rates have increased dramatically. The drawback is that we can't take advantage of the Mac-specific optimizations. The Rosetta BOINC code will be available soon if anyone is interested in helping us debug and optimize. I'll be sure to post it on the news when it is available. Snake_doctor, I would stop R@h on your dual G4 and dedicate it to the other projects until we get a fix (if you haven't already).[/quote]


David,

You know, one of the reasons I upgraded this machine to 10.4 was to do Rosetta. I will let it finish the two it is working on now just so we can see if the problem is WU related.

Perhaps two versions of the App would solve the problem. Seems it was working fine before 4.77 on all but the 10.3.9 systems. So maybe the previous version for those with 10.4.x and the new one for 10.3.9?

Regards
Phil

We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 1791 - Posted: 26 Oct 2005, 13:01:24 UTC
Last modified: 26 Oct 2005, 13:11:21 UTC

It's uncanny. I decided to run PPAH for a while on my one PC that has regular problems with Rosetta (Athlon XP 3000+, Win XP Pro SP2, 1GB RAM, BOINC 5.2.2) to see if it's the PC, BOINC or Rosetta app causing the errors.

https://boinc.bakerlab.org/rosetta/result.php?resultid=429221

Perhaps too early to tell, but it did 6 PPAH work units in a row, no problem. It went back to Rosetta and BANG - unrecoverable error, trashed the WU (over an hour's crunching wasted). And yes, I keep the WU in memory. It's started on the next Rosetta WU - will see how that goes.

I'll keep an eye on it but if this keeps up, I may have to get this PC running something else, even though on paper it is more than qualified to run Rosetta. No problems on my other systems (3*P4 and 1*Athlon 64)

EDIT: nothing in the Windows event logs at the time of the WU crashing and the computer did not crash or get rebooted.

*** Join BOINC@Australia today ***

rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 1795 - Posted: 26 Oct 2005, 13:46:33 UTC - in response to Message 1791.  

[quote]It's uncanny. I decided to run PPAH for a while on my one PC that has regular problems with Rosetta (Athlon XP 3000+, Win XP Pro SP2, 1GB RAM, BOINC 5.2.2) to see if it's the PC, BOINC or Rosetta app causing the errors.[/quote]



For what it is worth, and it probably has already been thought of, the problem seems so spotty and seemingly random (and I know that it is being looked into to see if it is not random), that I wonder if R@H causes some boxes to run hot, and that trashes the unit? Just a thought...
Regards,
Bob P.

©2024 University of Washington
https://www.bakerlab.org