Problems with Rosetta version 5.40

Chu
Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 31335 - Posted: 18 Nov 2006, 1:24:21 UTC - in response to Message 31329.  
Last modified: 18 Nov 2006, 1:24:47 UTC

I looked through your recent results and most of the errors were caused by the backward compatibility problem which happened earlier this week (scroll down to the first few posts). Happy crunching!
i got more errors in last 5 days than i did in last 100 days


alexpoon
Joined: 28 Dec 05
Posts: 6
Credit: 1,846
RAC: 0
Message 31416 - Posted: 19 Nov 2006, 11:13:04 UTC

19/11/2006 19:12:11|rosetta@home|Unrecoverable error for result PSH_0051_looprlx_GP120_OD1_138_148_5484_1404_20_0 ( - exit code -529697949 (0xe06d7363))
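
For anyone puzzled by the two numbers in that log line: they are the same 32-bit value, once as a signed decimal and once as unsigned hex. As far as I know, 0xE06D7363 is the code the Microsoft Visual C++ runtime raises for an unhandled C++ exception, and its low three bytes spell "msc" in ASCII. A small sketch of the conversion (the helper name `to_unsigned32` is made up for illustration):

```python
def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


exit_code = -529697949  # decimal value copied from the log line above
unsigned = to_unsigned32(exit_code)
print(hex(unsigned))

# The three low bytes of the code are printable ASCII characters.
tag = unsigned.to_bytes(4, "big")[1:].decode("ascii")
print(tag)
```

This only decodes the number; it says nothing about which part of the app threw the exception.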

anders n
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 31434 - Posted: 19 Nov 2006, 19:29:38 UTC

The graphics on this type of WU don't work very well.

Most of the time there is nothing in the Accepted and Low Energy boxes.

https://boinc.bakerlab.org/rosetta/result.php?resultid=47869770

Anders n


genes
Joined: 8 Oct 05
Posts: 60
Credit: 460,257
RAC: 0
Message 31445 - Posted: 20 Nov 2006, 1:37:46 UTC

I had to abort this wu: resultid=47806539

It had gotten stuck for hours, and was not using any CPU time, even though the Boinc CC said it was running. I suppose if I let it run a few more hours the watchdog would have stopped it, but I didn't want to waste any more time on it.

Mats Petersson
Joined: 29 Sep 05
Posts: 225
Credit: 951,788
RAC: 0
Message 31463 - Posted: 20 Nov 2006, 14:34:44 UTC

I aborted resultid 47184521; like the one in the previous post, it got stuck...

--
Mats

SOAN
Joined: 27 Sep 05
Posts: 252
Credit: 63,160
RAC: 0
Message 31475 - Posted: 20 Nov 2006, 19:31:27 UTC

This one ran for an hour at 98% of the CPU but never registered any CPU time: resultid=48156890

Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 31479 - Posted: 20 Nov 2006, 21:20:55 UTC

If others are seeing the problem reported in the last three updates, could you please just try ending and restarting BOINC? By end, I mean the File -> Exit.

I've been seeing problems for several months where BOINC seems to lose contact and/or control of the Rosetta threads that do the crunching. It is BOINC that is supposed to tell the Rosetta thread when to be active. So, by the description that the Rosetta thread isn't getting CPU, it points more to a BOINC problem.

Also, please report your BOINC version, and your platform (Windows, Linux, Mac).
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 31508 - Posted: 21 Nov 2006, 12:43:24 UTC - in response to Message 31479.  

If others are seeing the problem reported in the last three updates, could you please just try ending and restarting BOINC? By end, I mean the File -> Exit.


File -> Exit does not end BOINC if it is running as a system service, or if it is running under another user's login.

One way to end boinc when running as a service is

Control Panel -> Admin Tools -> Services

Right-click on BOINC, stop.

Another is Start -> Run and enter net stop boinc

The only way to exit BOINC that works across *all* Windows configurations is to get a command window, cd to the BOINC folder, and type

boinccmd --quit

but my suggestion is to find which of the easy ways works for your set up, and use that.

In Linux get a shell window (terminal window), cd to the BOINC folder, and type

./boinc_cmd --quit

note the extra underscore in the name of the command!

By the way, the boinccmd or boinc_cmd command is quite powerful. It can also control an instance of boinc running on another machine on the network and this is the way you would place commands in a .BAT file or shell script.

To see the entire range of its abilities, on Windows use

boinccmd --help|more

or on Linux

./boinc_cmd --help|less

NB: The Windows version's help output shows all its examples in the Linux form (boinc_cmd), so you do need to remember to leave out the underscore when taking its advice!!
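
If you would rather script the shutdown than type it, the same choice of command name can be wrapped up as below. This is only a sketch under the assumptions in this post: the executable is boinccmd on Windows and boinc_cmd on Linux, and it lives in the BOINC folder. The function names and the folder argument are made up for illustration, and newer BOINC releases, as far as I know, use the name boinccmd on every platform, so adjust to whatever your install actually contains.

```python
import os
import subprocess
import sys


def boinc_quit_command(boinc_dir, platform=sys.platform):
    """Build the argv list for asking the BOINC core client to quit."""
    # Executable names as given in this thread (an assumption for your install).
    name = "boinccmd" if platform.startswith("win") else "boinc_cmd"
    return [os.path.join(boinc_dir, name), "--quit"]


def quit_boinc(boinc_dir):
    """Run the quit command and return its exit status."""
    return subprocess.run(boinc_quit_command(boinc_dir)).returncode
```

Calling quit_boinc with the path to your BOINC folder does the same thing as the commands above, just from a script.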

R~~

Steve Shedroff
Joined: 7 Nov 05
Posts: 11
Credit: 250,657
RAC: 0
Message 31509 - Posted: 21 Nov 2006, 13:05:52 UTC

Ever since about October 20-something my computer was crashing more than normal. Windows XP SP2. I am not in control of the maintenance of this box, so I always wait a few weeks to see if the auto-updates I get fix things. I noticed that Rosetta was up-versioned at about the same time as my crashing problem, so I stopped receiving work units for a short time. The problem went away. I noticed a new version of Rosetta today, so I started back up. I'll let you know if I have further problems. It could all have been a coincidence. The crash always started with slow mouse/keyboard response, then total mouse and keyboard lockout. Powering down was the only way to reboot. Just thought you should know.
Steve

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 31512 - Posted: 21 Nov 2006, 14:09:54 UTC - in response to Message 31509.  

Ever since about October 20-something my computer was crashing more than normal. ... Just thought you should know.
Steve


Thanks Steve,

In this case it fits into what we already know about problems with the changeover between versions.

However it is always worth mentioning this kind of thing in case it has not been picked up before.

Also, Rosetta is a dev project, and not everyone has time to weather out a bugstorm when one occurs. Taking shelter in another project for a month is a good strategy! Glad you came back.

R~~

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 31530 - Posted: 21 Nov 2006, 20:03:18 UTC
Last modified: 21 Nov 2006, 20:14:13 UTC

How long do we wait without an update to the %complete figure before we get worried?

This task ran OK and did 18 decoys in about 3hrs on this machine, in the PSH_0058_looprlx series of WU.

I have since put the target time up to 10hrs, and now another PSH_0058_looprlx task updated the %complete at around 3hrs (31%), and has now got to almost 10hrs without the %complete changing again. This seems a genuine effect, as boinccmd shows

checkpoint CPU time: 11234.859375
current CPU time: 35711.500000
fraction done: 0.31510

i.e. about 24k sec (over 6 hrs) without a checkpoint. This long-running task will be here once it is reported so you can see what stderr makes of this.
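
The arithmetic behind that figure, as a small sketch: it parses the three lines quoted above as simple "name: value" pairs and subtracts the checkpoint CPU time from the current CPU time. The parsing helper is made up for illustration and assumes every line has that shape, which may not hold for the rest of the boinccmd output.

```python
sample = """checkpoint CPU time: 11234.859375
current CPU time: 35711.500000
fraction done: 0.31510"""


def parse_fields(text):
    """Turn 'name: value' lines into a dict of floats."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = float(value)
    return fields


fields = parse_fields(sample)
# CPU seconds spent since the last checkpoint was written.
gap_seconds = fields["current CPU time"] - fields["checkpoint CPU time"]
print(f"{gap_seconds:.0f} s = {gap_seconds / 3600:.1f} h since last checkpoint")
```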

Leaving it running for now, but wondered if others have seen this effect.

By the way, it occurs to me that others may wonder how to see the checkpoint info, so I have just started the how to use boinccmd thread.

R~~

Christian Diepold
Joined: 23 Sep 05
Posts: 37
Credit: 300,225
RAC: 0
Message 31539 - Posted: 21 Nov 2006, 23:06:07 UTC
Last modified: 21 Nov 2006, 23:08:21 UTC

I had a crashed WU today. From time to time I look at the gfx of the current Rosetta WU and I never had problems with that before. But today, when I opened the gfx for a WU, my firewall popped up and told me that Rosetta wanted to contact "msdl.microsoft.com". I was like, WTF. I said no, and the second I hit the "no" button of my firewall, the WU crashed with exit code 1 (0x1).

That's the WU.

What gives? Why would Rosey call M-$oft? And why would a no in my firewall - just the same as if the internet was off - crash the whole WU? All coincidence?

Killersocke@rosetta
Joined: 13 Nov 06
Posts: 29
Credit: 2,579,125
RAC: 154
Message 31542 - Posted: 22 Nov 2006, 0:53:47 UTC

So many problems with the WUs, and they are crashing my PC.
So I will stop my work here.
Please let me know when the problems are solved.

genes
Joined: 8 Oct 05
Posts: 60
Credit: 460,257
RAC: 0
Message 31549 - Posted: 22 Nov 2006, 3:01:34 UTC - in response to Message 31445.  

I had to abort this wu: resultid=47806539

It had gotten stuck for hours, and was not using any CPU time, even though the Boinc CC said it was running. I suppose if I let it run a few more hours the watchdog would have stopped it, but I didn't want to waste any more time on it.


The above WU was running under Boinc CC 5.7.2 on Windows XP.

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 31556 - Posted: 22 Nov 2006, 8:54:49 UTC - in response to Message 31539.  
Last modified: 22 Nov 2006, 9:25:25 UTC

I had a crashed WU today. From time to time I look at the gfx of the current Rosetta WU and I never had problems with that before. But today, when I opened the gfx for a WU, my firewall popped up and told me that Rosetta wanted to contact "msdl.microsoft.com". I was like, WTF. I said no, and the second I hit the "no" button of my firewall, the WU crashed with exit code 1 (0x1).

That's the WU.

What gives? Why would Rosey call M-$oft? And why would a no in my firewall - just the same as if the internet was off - crash the whole WU? All coincidence?


Several people have already reported problems with the graphics on the current app, so it makes sense that this happened only when you opened the graphics.

My guess is that the WU was already dead when your firewall popped up and asked the question, and was running the Microsoft (M$) debugger before quitting totally.

msdl is possibly a debugging site within M$. If you try to open it with a browser you get redirected to their main site, to specific pages about using the debugger.

There are two reasons I can think of why the debugger could try to talk to the msdl site - to ask for the translation of the error message into German (you have your machine set to use German wherever possible, I guess?), or to report the error automatically. Note - these are both guesses, I don't actually know.

EDIT: Equally, it could be that the call comes, not from the debugger, but directly from the M$ code in the interface between Rosey and the graphics driver. Both the above reasons would apply there, too.

Then, after you hit "no", the debugger completed immediately and the error was reported.

I can understand why the whole thing felt very suspicious, but in fact I think it may simply be M$'s usual dodgy practice of using the net when you don't expect it to.

R~~

Christian Diepold
Joined: 23 Sep 05
Posts: 37
Credit: 300,225
RAC: 0
Message 31557 - Posted: 22 Nov 2006, 9:09:48 UTC

Ah, thx for the ideas River. Didn't know that msdl was a debugger. I always thought it meant "msdownload" or something like that. Yes, my BOINC version is set to German, so that theory of yours makes pretty much sense.

:-)

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 31558 - Posted: 22 Nov 2006, 9:24:32 UTC - in response to Message 31557.  
Last modified: 22 Nov 2006, 9:24:53 UTC

Ah, thx for the ideas River. Didn't know that msdl was a debugger. I always thought it meant "msdownload" or something like that. Yes, my BOINC version is set to German, so that theory of yours makes pretty much sense.

:-)


I am only guessing from the fact that the browser re-directs to debug info. But my guess is Ms debug logging or suchlike. Who really knows?

R~~

ps, note also the extra paragraph marked EDIT in my previous post

Conan
Joined: 11 Oct 05
Posts: 149
Credit: 3,514,290
RAC: 445
Message 31564 - Posted: 22 Nov 2006, 11:49:12 UTC

> Had this one fail. Each time I looked at it, I did not notice that it was not doing anything and the CPU had dropped back to idle. Boinc Manager said it was running but nothing was going on.
It had hung for about 20 hours before I aborted it.

https://boinc.bakerlab.org/rosetta/result.php?resultid=47931779

Conan
Joined: 11 Oct 05
Posts: 149
Credit: 3,514,290
RAC: 445
Message 31784 - Posted: 28 Nov 2006, 23:35:12 UTC

> Just another addition about the Graphics problem with Rosetta/Ralph (they both do it). Sometimes in the debug data it shows a different project was the screensaver and not Rosetta at the time of failure.
This may be true, but what may have been forgotten is that on a dual-core or hyperthreaded processor, two jobs are running at the same time but only one screensaver at a time can be on the screen.
A number of the lockups I had showed Seti or Einstein as being the screensaver when it froze. I then went to Task Manager and it told me that the Rosetta WU was "Not Responding"; the other task was actually still running without any problems. So I am guessing here that the screensaver may have been about to change to the other project (Rosetta) and 'hung' in the process (although the shortest WU I had fail ran less than 30 minutes with the screensaver on before dying).
Anyway it could be something to look at.
I have for a couple of weeks now not had the screensaver on for both Ralph and Rosetta due to the 70% + failure rate, I have had only the isolated failure of a WU since I did this.
There are another 1 or 2 threads about this problem now, as people are not sure if it is a 5.40 problem, a Rosetta problem or a BOINC problem. Without graphics on I don't have a problem.

Buffalo Bill
Joined: 25 Mar 06
Posts: 71
Credit: 1,630,458
RAC: 0
Message 31793 - Posted: 29 Nov 2006, 2:04:39 UTC

This one tried to upload several times and I tried to get it to upload with "retry now". I kept getting a message about the file being locked by the scheduler. After a reboot and a few more tries I aborted the upload and it caused the WU to error out.

49343410

