Report Problems with Rosetta Version 5.25

tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 25571 - Posted: 30 Aug 2006, 9:19:41 UTC - in response to Message 25524.  

I have been 'trying' to run version 5.25 for over a day now, and have seen that both my Pentium dual core and single core Linux machines are stopping in the middle of the WU. So I put the dual into 'Rosetta only mode' and it has so far processed both WUs given to it. The single core is doing better, with only 2 out of about 7 WUs that hung up. On the dual core, this is the first time in over a day that one has got past about 68%, so only 2 out of about 8 have worked on it. I just wanted to pass this information on to the 'Rosetta team'. I will let the two I have in the queue finish, then disconnect from the project and check back in a couple of months.


Hi kmanley,

that is the new "0%-stuck", which seems to affect only Linux machines that crunch multiple projects. It's known and well described by many Linux users (it happens when Rosetta gets swapped out by another project; when it gets swapped in again the CPU is not used, although BOINC reports it as running). So far there are only workarounds: 1. Put your host in Rosetta-only mode. 2. Restart BOINC often.

Neither option is convenient; however, you could decide to crunch Rosetta exclusively for a week, then another project, and so on. There should be a new application in the coming weeks; whether it solves the problem is another question.
ID: 25571 · Rating: 1
Tino Ruiz

Joined: 12 Oct 05
Posts: 13
Credit: 397,392
RAC: 0
Message 25635 - Posted: 30 Aug 2006, 20:03:39 UTC - in response to Message 25458.  

It appears to be when the app tries to read the following file:

bbdep02.May.sortlib.gz

This is a required database file that gets downloaded once when you get your very first work unit and then stays on your computer. I do not know the exact cause of the error but it is not a universal error and is only happening to a small number of users. I suggest manually downloading the file and then placing it in the R@h project directory in your boinc installation if for some reason it no longer exists. Or I would reset the project.

https://boinc.bakerlab.org/rosetta/download/15a/bbdep02.May.sortlib.gz

I'm sorry, but where exactly do I put that file? I've looked *everywhere* for a BOINC install directory but couldn't find one. Does anyone know the default directory for a Debian-based distro (Xubuntu)?
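
A note on locating the directory: BOINC keeps each project's files in a "projects" subdirectory of its data directory, in a folder named after the project URL. The sketch below searches a few candidate data directories for a Rosetta project folder. The candidate paths are assumptions, not something confirmed in this thread (packaged Debian/Ubuntu clients commonly use /var/lib/boinc-client, while a manual install usually lives wherever the archive was unpacked), and find_project_dirs is a hypothetical helper, not part of BOINC.

```python
from pathlib import Path

# Candidate BOINC data directories. These are guesses, not confirmed in this
# thread: adjust them to match how BOINC was installed on your machine.
CANDIDATE_ROOTS = [
    Path("/var/lib/boinc-client"),  # common for Debian/Ubuntu packages
    Path.home() / "BOINC",          # common for a manual install
]


def find_project_dirs(roots, pattern="*rosetta*"):
    """Return directories under <root>/projects whose name matches pattern."""
    found = []
    for root in roots:
        projects = Path(root) / "projects"
        try:
            if projects.is_dir():
                found.extend(
                    sorted(p for p in projects.glob(pattern) if p.is_dir())
                )
        except PermissionError:
            # The data directory exists but is not readable by this user.
            continue
    return found
```

Calling find_project_dirs(CANDIDATE_ROOTS) returns the matching folders, if any; the downloaded bbdep02.May.sortlib.gz would go into the one belonging to Rosetta. If the data directory is owned by a separate boinc user, the search may need to be run with sufficient permissions.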
ID: 25635 · Rating: 0
Pappa

Joined: 4 Aug 06
Posts: 3
Credit: 302,149
RAC: 0
Message 25658 - Posted: 31 Aug 2006, 0:17:38 UTC - in response to Message 24218.  

Ethan

Saenger noted (not Seti specifically) that even in Seti a -9 "Noisy Workunit" receives credit for the time it ran... This is calculated from the benchmark.
Of the machines I am rotating through various projects, one had an error... https://boinc.bakerlab.org/rosetta/result.php?resultid=32537818... I noted the error so that I can remove it from the Cross Project Stats that I am collecting... You will have to look in your database for that specific error, as it is no longer viewable... That is a single error out of what I presume are over 100 returned results https://boinc.bakerlab.org/rosetta/results.php?hostid=284093.

So in most cases, unless a machine goes "rogue" and then just starts mangling results, I would hope that you have a mechanism for reducing the number of workunits to less than one/day. I would presume that you would hope it was a software glitch... Then giving the user partial credit.

Regards Pappa

From Fuzzy:

I hope the practice continues. If the WU is what is wrong and it has nothing to do with your system, and you have spent, say, 23 hours of a 24-hour unit working, why should you not get credits?

That's right, and that's why they grant something over @LHC.
But how is it determined that it was the software, and not the hardware?


We report them here, and then one of the devs looks at them to see the stderr. And if they recognize it as something related to the software, we get credit for them.



ID: 25658 · Rating: 0
adrianxw

Joined: 18 Sep 05
Posts: 652
Credit: 11,662,550
RAC: 1,151
Message 25812 - Posted: 1 Sep 2006, 7:57:43 UTC
Last modified: 1 Sep 2006, 8:00:18 UTC

This WU would seem to be stuck on my Evesham node: a 2.533GHz P-IV Northwood (no hyperthreading, not overclocked), Windows NT4 SP6a, BOINC 5.2.13.

Showing CPU time 08:59:27, Progress 48.74%, To completion 13:47:12. This machine is set to 20-hour WUs.

When this WU is "running", nothing changes in BOINC Manager, the System Idle Process is 99% active, and my CPU temperature is a refreshing 42C. Clicking "Show Graphics" does nothing. When I suspend it, another project pops into life; with my current STD, it happens to be MCDN. When I suspend that so Rosetta is on top again, it enters the same state: stuck, no processing.

By judicious "Suspend" fiddling, I've established that swapping between Rosetta/Einstein and Rosetta/SIMAP does not alter anything; I do not believe, therefore, that it is a Rosetta/MCDN interaction.

The message log looks totally normal, Rosetta suspending (left in memory), and another project resuming, then that project suspending (left in memory) and Rosetta resuming.

In the Rosetta projects directory, there are no files flagged as being modified 1st September, (it is 10:00:00 1st September my time as I write), so it is possible it has been in this state since yesterday, there are 5 files showing yesterday as their last changed date.
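
The check described above (looking for files in the project directory that changed after a given time) can be sketched as follows. This is only an illustration; modified_since is a hypothetical helper, and the directory to pass in depends on the local BOINC installation.

```python
from datetime import datetime
from pathlib import Path


def modified_since(directory, since):
    """Return (name, mtime) pairs for files changed at or after `since`,
    newest first."""
    cutoff = since.timestamp()
    hits = []
    for path in Path(directory).iterdir():
        if not path.is_file():
            continue
        mtime = path.stat().st_mtime
        if mtime >= cutoff:
            hits.append((path.name, datetime.fromtimestamp(mtime)))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

An empty result for a cutoff of midnight today would match the observation that nothing has been written since yesterday, although a task can also run for some time between checkpoints without touching any file.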

Currently, I have Rosetta suspended so the other projects can get on with their productive work. If there is anything I can do with this one, or any diagnostic info I can obtain, please advise. I will leave it in this state until 18:00:00 my time, (roughly 8 hours), if nothing, then I will abort it as we are going away for the weekend.

*** EDIT ***

Grammar.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 25812 · Rating: 0
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 25822 - Posted: 1 Sep 2006, 10:35:16 UTC
Last modified: 1 Sep 2006, 10:39:53 UTC

@adrianxw

Restart BOINC and the WU will resume.

This is the new "0%-Bug", which appears to happen when switching projects. So far only Linux computers have reported it, but it seems Windows machines are affected as well in rare cases. :-(
ID: 25822 · Rating: 0
adrianxw

Joined: 18 Sep 05
Posts: 652
Credit: 11,662,550
RAC: 1,151
Message 25825 - Posted: 1 Sep 2006, 11:26:13 UTC
Last modified: 1 Sep 2006, 11:37:59 UTC

I stopped/started BOINC. Once restarted, I removed Rosetta from suspension and forced a scheduling event. Rosetta dropped back to the last checkpoint at 08:47:18 48.70% and started running.

I can't say if this is the same fault Linux is having. If it was a general problem, I'd expect it to be present in roughly the Linux/Windows ratio rather than rare on one system. I have fiddled a lot switching it in and out, and the % complete has never dropped to zero, as it appears to for most who have reported this issue. Maybe it is the same problem but it manifests slightly differently across OSes? That might give a handle on the root cause.

Whatever, I hope that adds a clue to the hunt. I'll keep watching it.

*** EDIT ***

It has now reached 49.92% complete, so it has passed the point where it stopped before. Of course, it was not pre-empted this time.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 25825 · Rating: 0
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 25826 - Posted: 1 Sep 2006, 12:06:09 UTC - in response to Message 25825.  

I stopped/started BOINC. Once restarted, I removed Rosetta from suspension and forced a scheduling event. Rosetta dropped back to the last checkpoint at 08:47:18 48.70% and started running.

I can't say if this is the same fault Linux is having. If it was a general problem, I'd expect it to be present in roughly the Linux/Windows ratio rather than rare on one system. I have fiddled a lot switching it in and out, and the % complete has never dropped to zero, as it appears to for most who have reported this issue. Maybe it is the same problem but it manifests slightly differently across OSes? That might give a handle on the root cause.

Whatever, I hope that adds a clue to the hunt. I'll keep watching it.

*** EDIT ***

It has now reached 49.92% complete, so it has passed the point where it stopped before. Of course, it was not pre-empted this time.


All the reports I have read so far describe the CPU load going to 0% while the WU is marked as running in BOINC but is not actually advancing. The progress of the WU does not go back to 0% (unless you restart before the first checkpoint has been written). By naming it "0%-Bug" I was referring to the CPU load, not the progress bar.
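
The symptom described here (task marked as running, but no CPU being used) can be checked by sampling a process's accumulated CPU time twice. Below is a minimal sketch, assuming Linux and the /proc filesystem; cpu_ticks, is_stalled and watch are hypothetical names, not part of BOINC or Rosetta.

```python
import time


def cpu_ticks(pid):
    """Return user + system CPU time of a process, in clock ticks (Linux)."""
    with open(f"/proc/{pid}/stat") as handle:
        data = handle.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before reading the numeric fields.
    fields = data[data.rindex(")") + 2:].split()
    # fields[0] is the process state (field 3 of the stat line), so
    # utime (field 14) and stime (field 15) are at indexes 11 and 12.
    return int(fields[11]) + int(fields[12])


def is_stalled(ticks_before, ticks_after, min_ticks=1):
    """True if fewer than min_ticks of CPU time were used between samples."""
    return (ticks_after - ticks_before) < min_ticks


def watch(pid, interval=60.0):
    """Sample a process twice, `interval` seconds apart, and report a stall."""
    before = cpu_ticks(pid)
    time.sleep(interval)
    return is_stalled(before, cpu_ticks(pid))
```

A task that BOINC has deliberately suspended or pre-empted will also show no CPU use, so the check is only meaningful for a task the manager lists as running.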
ID: 25826 · Rating: 0
adrianxw

Joined: 18 Sep 05
Posts: 652
Credit: 11,662,550
RAC: 1,151
Message 25837 - Posted: 1 Sep 2006, 15:22:04 UTC
Last modified: 1 Sep 2006, 15:22:57 UTC

Fair enough, that does sound like what I was seeing. In fact, it was not the stationary BOINC Manager that first caught my eye, it was the surprisingly low CPU temperature on the MoBo monitor.

I hope they fix that soon, the machine that was showing this is BOINC only, (most of the time, certainly at present, - it is, in fact, a backup web server), and I only look at it from time to time. With Rosetta set to 50% CPU quota on it, that is potentially a lot of lost CPU time.

*** EDIT ***

The wu is now Pre-empted, but is showing 58.34% complete, so is clearly doing something.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 25837 · Rating: 0
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 25838 - Posted: 1 Sep 2006, 15:26:53 UTC - in response to Message 25837.  

Fair enough, that does sound like what I was seeing. In fact, it was not the stationary BOINC Manager that first caught my eye, it was the surprisingly low CPU temperature on the MoBo monitor.

I hope they fix that soon, the machine that was showing this is BOINC only, (most of the time, certainly at present, - it is, in fact, a backup web server), and I only look at it from time to time. With Rosetta set to 50% CPU quota on it, that is potentially a lot of lost CPU time.

*** EDIT ***

The wu is now Pre-empted, but is showing 58.34% complete, so is clearly doing something.


I hope they find the problem. As it is only a sporadic failure, it might not be that easy. As a workaround you can let project A crunch for two weeks at 100%, then project B, and so on. Quite inconvenient, but it should prevent further such instances.
ID: 25838 · Rating: 0
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 25841 - Posted: 1 Sep 2006, 17:13:50 UTC - in response to Message 25838.  

As a workaround you can let project A crunch for two weeks at 100%, then project B, and so on. Quite inconvenient, but it should prevent further such instances.


You could set the "Switch between applications every" time in general preferences to be large enough so that Rosetta WUs complete without being switched.
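
As rough arithmetic for this suggestion: the preference is entered in minutes, so the value needs to cover a whole work unit's runtime. The sketch below is only illustrative; switch_interval_minutes and its safety margin are hypothetical, and whether the preferences page accepts values this large is not confirmed here.

```python
import math


def switch_interval_minutes(target_runtime_hours, margin=1.25):
    """Suggest a 'Switch between applications every' value, in minutes,
    that covers one work unit plus a safety margin (margin is a guess)."""
    if target_runtime_hours <= 0 or margin < 1:
        raise ValueError("runtime must be positive and margin at least 1")
    return math.ceil(target_runtime_hours * 60 * margin)
```

For the 20-hour work units mentioned earlier in the thread, switch_interval_minutes(20) gives 1500 minutes. The target runtime counts CPU time, so on a machine where BOINC is throttled or often paused the elapsed time, and therefore the interval needed, may be longer.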
ID: 25841 · Rating: 0
terry

Joined: 7 Aug 06
Posts: 1
Credit: 22,721
RAC: 0
Message 25922 - Posted: 3 Sep 2006, 4:36:04 UTC

I've got two files that have stalled; that is, though the manager shows them as running, the CPU usage timer doesn't increase. I let them both run for a while to see if they would start to move again, but there was no change. When I suspended them, the manager moved on to the next file and it's been working well since. How do I return them unfinished, if indeed that's what is needed, or do I just abort them?
ID: 25922 · Rating: 0
R.L. Casey

Joined: 7 Jun 06
Posts: 91
Credit: 2,728,885
RAC: 0
Message 25924 - Posted: 3 Sep 2006, 5:21:09 UTC - in response to Message 25922.  
Last modified: 3 Sep 2006, 5:22:11 UTC

I've got two files that have stalled; that is, though the manager shows them as running, the CPU usage timer doesn't increase. I let them both run for a while to see if they would start to move again, but there was no change. When I suspended them, the manager moved on to the next file and it's been working well since. How do I return them unfinished, if indeed that's what is needed, or do I just abort them?

You can abort them; each will be noted as an invalid result and the Work Units will be sent out for someone else to crunch. You may want to post the WU numbers to the Report Problems with Rosetta Version 5.25 thread if that's the version you are using.
Keep crunching!
ID: 25924 · Rating: 0
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 25937 - Posted: 3 Sep 2006, 9:37:00 UTC - in response to Message 25922.  

I've got two files that have stalled; that is, though the manager shows them as running, the CPU usage timer doesn't increase. I let them both run for a while to see if they would start to move again, but there was no change. When I suspended them, the manager moved on to the next file and it's been working well since. How do I return them unfinished, if indeed that's what is needed, or do I just abort them?


This error has been reported repeatedly. The workaround is to restart BOINC.
ID: 25937 · Rating: 0
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 25966 - Posted: 3 Sep 2006, 21:30:42 UTC

Hi all

Can someone shed some light on these errors? I have never had a problem with any project before.

Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00730D5C read attempt to address 0xFFEAFF62

Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00730DE9 read attempt to address 0xEFECA35C

Windows also put up an error box the first time, but not the second.
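
For anyone collecting these reports, the exception records above can be pulled apart mechanically. The sketch below parses the "Reason:" line into its parts; the regular expression is written against the two lines quoted in this post only, and parse_access_violation is a hypothetical helper.

```python
import re

# Matches lines such as:
# Reason: Access Violation (0xc0000005) at address 0x00730D5C read attempt
# to address 0xFFEAFF62
ACCESS_VIOLATION = re.compile(
    r"Access Violation \((0x[0-9A-Fa-f]+)\) at address (0x[0-9A-Fa-f]+) "
    r"(read|write) attempt to address (0x[0-9A-Fa-f]+)"
)


def parse_access_violation(line):
    """Return the exception code, faulting instruction address, access type
    and target address from a 'Reason:' line, or None if it does not match."""
    match = ACCESS_VIOLATION.search(line)
    if match is None:
        return None
    code, instruction, access, target = match.groups()
    return {
        "code": int(code, 16),
        "instruction": int(instruction, 16),
        "access": access,
        "target": int(target, 16),
    }
```

Applied to the two records above, both carry code 0xC0000005 (the Windows access violation status), and the two faulting instruction addresses are 0x8D bytes apart. That is consistent with both crashes happening in the same region of code, though it does not prove a common cause.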

ID: 25966 · Rating: 0
Andrew

Joined: 7 Mar 06
Posts: 1
Credit: 28,863
RAC: 0
Message 26099 - Posted: 5 Sep 2006, 12:38:38 UTC

Hi Rhiju,

This is Andrew from the Kuhlman lab. I've got BOINC running on a Mac here, and over the weekend it didn't fetch any jobs. Here are the last several messages from BOINC Manager:

Fri Sep 1 12:09:50 2006|rosetta@home|Pausing result NMR_1mzl_CASPR_1_1mzl_1_id_model_13IGNORE_THE_REST_idl_1221_1677_0 (removed from memory)
Fri Sep 1 12:11:10 2006||Resuming computation and network activity
Fri Sep 1 12:11:10 2006||request_reschedule_cpus: Resuming activities
Sat Sep 2 14:26:18 2006||Suspending computation and network activity - running CPU benchmarks
Sat Sep 2 14:26:18 2006|rosetta@home|Pausing result NMR_1mzl_CASPR_1_1mzl_1_id_model_13IGNORE_THE_REST_idl_1221_1677_0 (removed from memory)
Sat Sep 2 14:26:20 2006||Running CPU benchmarks
Sat Sep 2 14:26:28 2006||Failed to stop applications; aborting CPU benchmarks
Sat Sep 2 14:26:29 2006||Resuming computation and network activity
Sat Sep 2 14:26:29 2006||request_reschedule_cpus: Resuming activities
Sat Sep 2 14:26:29 2006||ACTIVE_TASK_SET::check_app_exited(): pid 19632 not found
Mon Sep 4 07:26:53 2006||Suspending work fetch because computer is overcommitted.
Mon Sep 4 07:26:53 2006||Using earliest-deadline-first scheduling because computer is overcommitted.
Tue Sep 5 08:17:26 2006||Suspending computation and network activity - user is active
Tue Sep 5 08:17:26 2006|rosetta@home|Pausing result NMR_1mzl_CASPR_1_1mzl_1_id_model_13IGNORE_THE_REST_idl_1221_1677_0 (removed from memory)
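
One way to read a log like this is to look for long silences between consecutive messages. The sketch below assumes the "timestamp|project|message" layout shown above and English day and month names; parse_line and find_gaps are hypothetical helpers, not BOINC tools.

```python
from datetime import datetime, timedelta


def parse_line(line):
    """Split one message-log line into (timestamp, project, message)."""
    stamp, project, message = line.rstrip("\n").split("|", 2)
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"), project, message


def find_gaps(lines, threshold=timedelta(hours=12)):
    """Return (gap, last_message_before_gap) for every pair of consecutive
    log entries separated by at least `threshold`."""
    entries = [parse_line(line) for line in lines if line.strip()]
    gaps = []
    for (earlier, _, message), (later, _, _) in zip(entries, entries[1:]):
        if later - earlier >= threshold:
            gaps.append((later - earlier, message))
    return gaps
```

In the log above, the client goes quiet for more than a day after "request_reschedule_cpus: Resuming activities" on Sep 1. A quiet log does not by itself prove a stalled task, but it is a cheap first hint before checking the CPU load.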
ID: 26099 · Rating: 0
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 26101 - Posted: 5 Sep 2006, 12:43:48 UTC - in response to Message 26099.  

Hi Rhiju,

This is Andrew from the Kuhlman lab. I've got BOINC running on a Mac here, and over the weekend it didn't fetch any jobs. Here are the last several messages from BOINC Manager:

Fri Sep 1 12:09:50 2006|rosetta@home|Pausing result NMR_1mzl_CASPR_1_1mzl_1_id_model_13IGNORE_THE_REST_idl_1221_1677_0 (removed from memory)
Fri Sep 1 12:11:10 2006||Resuming computation and network activity
Fri Sep 1 12:11:10 2006||request_reschedule_cpus: Resuming activities
Sat Sep 2 14:26:18 2006||Suspending computation and network activity - running CPU benchmarks
Sat Sep 2 14:26:18 2006|rosetta@home|Pausing result NMR_1mzl_CASPR_1_1mzl_1_id_model_13IGNORE_THE_REST_idl_1221_1677_0 (removed from memory)
Sat Sep 2 14:26:20 2006||Running CPU benchmarks
Sat Sep 2 14:26:28 2006||Failed to stop applications; aborting CPU benchmarks
Sat Sep 2 14:26:29 2006||Resuming computation and network activity
Sat Sep 2 14:26:29 2006||request_reschedule_cpus: Resuming activities
Sat Sep 2 14:26:29 2006||ACTIVE_TASK_SET::check_app_exited(): pid 19632 not found
Mon Sep 4 07:26:53 2006||Suspending work fetch because computer is overcommitted.
Mon Sep 4 07:26:53 2006||Using earliest-deadline-first scheduling because computer is overcommitted.
Tue Sep 5 08:17:26 2006||Suspending computation and network activity - user is active
Tue Sep 5 08:17:26 2006|rosetta@home|Pausing result NMR_1mzl_CASPR_1_1mzl_1_id_model_13IGNORE_THE_REST_idl_1221_1677_0 (removed from memory)



You have a stuck WU which does not advance; your CPU load is probably 0%. Check whether this is the case, then restart BOINC and the WU will finish.
ID: 26101 · Rating: 0
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 26142 - Posted: 5 Sep 2006, 23:57:58 UTC - in response to Message 26101.  
Last modified: 5 Sep 2006, 23:58:57 UTC

You have a stuck WU which does not advance; your CPU load is probably 0%. Check whether this is the case, then restart BOINC and the WU will finish.

A few days ago it happened on my host 290356 (Linux x86) that after restarting BOINC, the stuck app was left in memory and I had to kill it by hand. (Possibly BOINC lost track of it? A new Rosetta WU was started and ran continuously for 3:59 hours; then BOINC made an attempt to start Seti Beta, but nothing happened and the host was idle, so 3 hours later I restarted the whole of BOINC.)

Peter
ID: 26142 · Rating: 0
Pepo
Joined: 28 Sep 05
Posts: 115
Credit: 101,358
RAC: 0
Message 26146 - Posted: 6 Sep 2006, 0:39:18 UTC

Just now exactly the same thing happened. Rosetta had been running for 89:39.34 from start, then BOINC (5.5.15; it could also be a problem with this alpha version) made an attempt to start Seti Beta. Seti Beta is nowhere to be seen, Rosetta is there but sleeping, and the machine is idle.

After stopping BOINC... 4 rosetta_5.25_i6 processes (probably threads) are still there. And after starting BOINC it launched Seti Beta, while the old Rosetta processes are still sleeping there.

Possibly a coincidence (because of the STD, LTD and shares, thus running Seti Beta most often), but if...

Peter
ID: 26146 · Rating: 0
BennyRop

Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 26149 - Posted: 6 Sep 2006, 2:08:39 UTC - in response to Message 25966.  


Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00730D5C read attempt to address 0xFFEAFF62

Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00730DE9 read attempt to address 0xEFECA35C


You might want to ask about the status of the Access Violation errors that were reported on Ralph. In this thread.

ID: 26149 · Rating: 0
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 26155 - Posted: 6 Sep 2006, 7:33:59 UTC

@Pepo

Is the rosetta process still present if you exit BOINC (not just stop it)?
If there is no BOINC.exe process present but there is a rosetta process, then there is something wrong with BOINC, I think, since it should kill all child processes when it exits.
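
The check suggested here can be sketched for Linux as follows: list the running processes and report any Rosetta process whose parent is not a BOINC client. The client process names ("boinc", "boinc_client") and the "rosetta" prefix are assumptions, and snapshot and find_orphans are hypothetical helpers. Matching on a prefix avoids depending on the full executable name, which Linux truncates to 15 characters in this field.

```python
from pathlib import Path


def snapshot():
    """Return (pid, ppid, name) for every process visible in /proc (Linux)."""
    procs = []
    for stat in Path("/proc").glob("[0-9]*/stat"):
        try:
            data = stat.read_text()
        except OSError:
            continue  # the process exited while we were scanning
        pid = int(data.split(" ", 1)[0])
        name = data[data.index("(") + 1:data.rindex(")")]
        # After the closing ')' come the state and then the parent pid.
        ppid = int(data[data.rindex(")") + 2:].split()[1])
        procs.append((pid, ppid, name))
    return procs


def find_orphans(procs, app_prefix="rosetta",
                 client_names=("boinc", "boinc_client")):
    """Return pids of science-app processes whose parent is not a client."""
    names = {pid: name for pid, _, name in procs}
    return [
        pid
        for pid, ppid, name in procs
        if name.startswith(app_prefix) and names.get(ppid) not in client_names
    ]
```

Running find_orphans(snapshot()) after exiting BOINC should return an empty list if the client cleaned up its children; any pids it does return are candidates for killing by hand.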
ID: 26155 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org