Problems with Rosetta version 5.43

Message boards : Number crunching : Problems with Rosetta version 5.43

Chu

Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 35423 - Posted: 23 Jan 2007, 22:20:48 UTC - in response to Message 35360.  

Thanks, feet1st and Thomas Bates. I don't remember seeing this kind of error before. Apparently the worker thread hit an error (incorrect function) but did not end itself correctly, and later on the watchdog kicked in to end the run. But Thomas reported that he had to kill the run manually. It seems that the BOINC manager lost track of that WU and did not respond to the threads at all. Puzzled...
Chu, RE: Thomas Bates, looks like this WU. It is called 1ail__BOINC_NOFILTERS_ABRELAX_SAVE_ALL_OUT_NEWRELAXFLAGS_frags83__1505_1221_0

It SAYS the watchdog ended it, but apparently not so.
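
For readers unfamiliar with the watchdog idea discussed above, here is a minimal sketch of the general pattern: a worker records a heartbeat, and a separate check flags the run as stalled when the heartbeat stops advancing. This is not Rosetta's actual watchdog code; the class name `Watchdog`, the `beat`/`stalled` methods, and the timeout value are all hypothetical and chosen only for illustration.

```python
import threading
import time


class Watchdog:
    """Hypothetical heartbeat watchdog (illustration only, not Rosetta's code)."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()

    def beat(self) -> None:
        # Called by the worker each time it makes progress.
        with self._lock:
            self._last_beat = time.monotonic()

    def stalled(self) -> bool:
        # True when no heartbeat has arrived within the timeout window.
        with self._lock:
            return time.monotonic() - self._last_beat > self.timeout_s


if __name__ == "__main__":
    wd = Watchdog(timeout_s=0.05)
    wd.beat()
    time.sleep(0.1)       # simulate a worker that has stopped reporting
    if wd.stalled():
        print("worker looks stalled; the run would be ended here")
```

Note that a check like this can only report that the worker stopped making progress. Whether the process actually exits afterwards is a separate step, which may be where things went wrong in the case above.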


zombie67 [MM]
Joined: 11 Feb 06
Posts: 316
Credit: 6,621,003
RAC: 0
Message 35427 - Posted: 23 Jan 2007, 23:05:27 UTC - in response to Message 35421.  
Last modified: 23 Jan 2007, 23:06:21 UTC

That is the error code for not transferring result files correctly, either because the result files are not generated or because the client is unable to send the result files back to the server correctly. If you have only experienced such a problem recently, I would suggest resetting the project on your hosts, as the current application has not been changed since last December and the specific WUs are returning valid results from other hosts. It seems like some communication issue between your host and the server, but I am not exactly sure what is causing that.

Thanks. I'll give that a shot. Also, I noticed that these jobs did not run the full length of time (3 hours). So it's not like these are completing properly, and then failing to send back to the server correctly.

Two changes on my end about the same time these failures started:

Upgraded BOINC from 5.8.3 to 5.8.4
Attached SZTAKI project

Is it possible for another project to cause these problems? No one is reporting similar problems on the BOINC alpha alias, so I am thinking it is not being caused by the upgrade to 5.8.4.

Edit: Resetting did not solve the problem. I am going to try downgrading to 5.8.3 to see if that changes anything.
Reno, NV
Team: SETI.USA

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 35445 - Posted: 24 Jan 2007, 14:32:18 UTC
Last modified: 24 Jan 2007, 14:34:55 UTC

A new iMac participant has WUs failing after 5 min, though not with the WU name of the known problem. Their report is here. Exit status 131 and some odd messages in the results.
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

arminius
Joined: 23 Sep 05
Posts: 8
Credit: 805,403
RAC: 0
Message 35451 - Posted: 24 Jan 2007, 16:52:07 UTC

Client error on all PSH_0144_looprlx_... WUs; date: 22.01.2007; Linux 10.2
More: https://boinc.bakerlab.org/rosetta/results.php?hostid=6399
a.

zombie67 [MM]
Joined: 11 Feb 06
Posts: 316
Credit: 6,621,003
RAC: 0
Message 35453 - Posted: 24 Jan 2007, 17:23:52 UTC - in response to Message 35427.  

Edit: Resetting did not solve the problem. I am going to try downgrading to 5.8.3 to see if that changes anything.


FYI: Downgrading further to 5.8.2 solved the problem. Obviously I was mistaken about when the problem started (with 5.8.3, not 5.8.4). In any case, I saw this problem only with the Mac version, not the Windows version.
Reno, NV
Team: SETI.USA

finch
Joined: 23 Nov 05
Posts: 8
Credit: 4,548
RAC: 0
Message 35629 - Posted: 27 Jan 2007, 19:01:10 UTC - in response to Message 35359.  

Are you all using Windows or Linux? I've had no failures on 3 Windows XP Pro systems I am running, but I finally had to detach from Rosetta on *ALL* of my Linux systems (I have two running Ubuntu 6.06 and one running Ubuntu 6.10) as I was seeing nearly 90% failure and these systems were offering rosetta 50% of their time. I didn't mind the tasks that ran 10-15 seconds before failing ...

- Lynn

Lynn:
While I don't use Linux for crunching - do these machines run Rosetta fine if they're running Rosetta 100% of the time? We've had better luck with the windows machines if we set the keep in memory setting to yes. Perhaps someone can confirm or deny if that's been a problem with the Linux systems as well. ...



I wanted to respond to Lynn and bennyrop. I am running Ubuntu 6.06. I only have 512 MB of RAM.

My situation wasn't exactly the same as lynn's as the work units didn't end after 10-15 seconds. They would run for considerable time.

I have run a half dozen or so units and have had failures with all units that were preempted and removed from memory while running in tandem with worldgrid. At some point I would catch the work units showing no progress, and CPU time would no longer increment. I would finally have to abort the units. The two times that I suspended worldgrid and ran Rosetta without interruption, I had no problems. This is a small sample, but I am fairly confident that it isn't just a coincidence and that there is a connection. I'm not going to run both projects simultaneously anymore. I'll alternate between them.

Michael.L
Joined: 12 Nov 06
Posts: 67
Credit: 31,295
RAC: 0
Message 35763 - Posted: 30 Jan 2007, 19:19:03 UTC
Last modified: 30 Jan 2007, 19:23:05 UTC

30/01/2007 19:06:32|rosetta@home|Unrecoverable error for result 1eyvA_BOINC_ABRELAX_NEWRELAXFLAGS_frags83__1521_2499_0 ( - exit code 1073807364 (0x40010004))


I think that WU froze for some time at about 75%.

Windows XP Home, AMD64 3200+
Maybe there is a connection, in that BOINC manager froze around the same time.
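
As a side note on reading the log line above: the decimal exit code and the hex value in parentheses are the same number. The sketch below converts one to the other and, assuming the value follows the Windows NTSTATUS bit layout (an assumption, not something the log confirms), splits it into severity, facility, and code fields. The helper name `split_status` is hypothetical.

```python
def split_status(code: int) -> dict:
    """Split a 32-bit status value into fields.

    Assumes the NTSTATUS layout: severity in bits 30-31,
    facility in bits 16-27, and the code in the low 16 bits.
    """
    return {
        "hex": f"0x{code:08X}",
        "severity": (code >> 30) & 0x3,
        "facility": (code >> 16) & 0xFFF,
        "code": code & 0xFFFF,
    }


fields = split_status(1073807364)
print(fields["hex"])  # 0x40010004, matching the value in the log line
```

Under that assumption the severity field is 1 (informational) rather than an error severity, which might fit a process that was terminated from outside rather than one that crashed on its own, but that is only a guess from the number.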
©2024 University of Washington
https://www.bakerlab.org