Problems with Rosetta stable version 5.69 and beta version 5.77

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 45725 - Posted: 3 Sep 2007, 17:45:28 UTC - in response to Message 45720.  
Last modified: 3 Sep 2007, 20:43:52 UTC

Something goes wrong with 5.77 on my machine. It gets down to saying 00:09:57 and then stays there. So for the second time I am about to abort a task. Suspect this old PC just isn't capable or something. Been chugging along with RAH for over a year, I guess, but maybe it's time to quit.


Jerry, have you ever tried just letting them run? They are probably doing just fine. Rosetta has no way to know ahead of time exactly how long it will take to crunch a given model, so the estimate is... well... just an estimate. Once it gets to within 10 minutes of your target runtime, it starts to shrink the time remaining exponentially, in smaller and smaller steps, with the idea being that the time to completion will generally still be going down.
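
To illustrate why the countdown can appear to sit near the 10 minute mark, here is a rough sketch of that kind of estimate. It is not Rosetta's actual code: the function name, the 600 second window, and the `decay_s` constant are all assumptions made up for the example.

```python
import math


def displayed_remaining(elapsed_s, target_s, window_s=600.0, decay_s=3600.0):
    """Hypothetical 'time remaining' display, in seconds.

    window_s and decay_s are assumed values, not Rosetta settings.
    """
    remaining = target_s - elapsed_s
    if remaining > window_s:
        # Far from the target runtime: plain countdown.
        return remaining
    # Inside the last 10 minutes: decay exponentially, so the display
    # keeps going down but never actually reaches zero.
    overtime = elapsed_s - (target_s - window_s)
    return window_s * math.exp(-overtime / decay_s)


target = 3 * 3600  # a 3 hour target runtime
for elapsed in (0, target - 600, target, target + 3 * 3600):
    print(elapsed, round(displayed_remaining(elapsed, target), 1))
```

With a slow decay like this, a task that runs well past its target keeps showing a few minutes remaining for hours, which matches what people see on long-running models.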

The watchdog is always there looking over your tasks, and is prepared to abort them if it deems necessary.
Rosetta Moderator: Mod.Sense
ID: 45725 · Rating: 0
Warren B. Rogers
Joined: 3 Oct 05
Posts: 5
Credit: 1,127,824
RAC: 0
Message 45726 - Posted: 3 Sep 2007, 18:11:08 UTC - in response to Message 45725.  
Last modified: 3 Sep 2007, 18:16:43 UTC

Something goes wrong with 5.77 on my machine. It gets down to saying 00:09:57 and then stays there. So for the second time I am about to abort a task. Suspect this old PC just isn't capable or something. Been chugging along with RAH for over a year, I guess, but maybe it's time to quit.


Jerry, ever tried just letting them run? They are probably doing just fine. Rosetta has no way to know ahead of time exactly how long it will take to crunch a given model, so the estimate is... well... just an estimate. Once it gets down to within 10 minutes of your target runtime, it starts to exponentially reduce the time remaining less and less, with the idea being that the time to completion will generally still be going down.

The watchdog is always there looking over your tasks, and is prepared to abort them if it deems necessary.


Good day all. I've also noticed this problem with version 5.77, but it doesn't happen all of the time, only occasionally. When I notice that this is happening I usually suspend that WU and let something else run. The one time I just let it run, it sat at 09:57 for about 3 hours and the countdown timer was not moving. Also, I saw the rate of progress drop to a thousandth of a percent per second, which was considerably slower than it was for the first 96% of the WU. I didn't want to abort the WU and just hoped that if the machine did something else for a while and worked its way back to the WU, it would finish it properly. Well, it did take about 1 hour to finish, but that was much better than the rate it was moving at. I hope my experience is helpful.

Warren Rogers
ID: 45726 · Rating: 0
Winkle
Joined: 22 May 06
Posts: 88
Credit: 1,354,930
RAC: 0
Message 45739 - Posted: 4 Sep 2007, 11:18:11 UTC

Hi
On Beta 5.77, I suspended the Rosetta project through BOINC Manager, and the project came up as suspended, but Task Manager still says Rosetta is running at approx. 90% CPU time. Is this a known bug?
Ian
ID: 45739 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 45745 - Posted: 4 Sep 2007, 15:00:41 UTC

Winkle, I hear such things about Linux off and on, but I see all of your machines are Windows. Which one is seeing this occur? What BOINC version is installed there?
Rosetta Moderator: Mod.Sense
ID: 45745 · Rating: 0
Warren B. Rogers
Joined: 3 Oct 05
Posts: 5
Credit: 1,127,824
RAC: 0
Message 45748 - Posted: 4 Sep 2007, 16:08:51 UTC - in response to Message 45731.  


Good day all. I've also noticed this problem with version 5.77, but it doesn't happen all of the time, only occasionally. When I notice that this is happening I usually suspend that WU and let something else run. The one time I just let it run, it sat at 09:57 for about 3 hours and the countdown timer was not moving. Also, I saw the rate of progress drop to a thousandth of a percent per second, which was considerably slower than it was for the first 96% of the WU. I didn't want to abort the WU and just hoped that if the machine did something else for a while and worked its way back to the WU, it would finish it properly. Well, it did take about 1 hour to finish, but that was much better than the rate it was moving at. I hope my experience is helpful.

Warren Rogers


Was it that WU which took your computer about 19K seconds? In that case there was nothing wrong, in my opinion. It is just one of those work units which need a large amount of time to generate even one model. And since at least one model needs to be generated, this one model may exceed your pre-set runtime. Maybe the low model count also causes the inaccuracies in the estimated time remaining.
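
The "at least one model" rule can be sketched like this. It is only an illustration under assumed names; the source says nothing about how Rosetta really decides when to stop, beyond needing one finished model.

```python
def run_task(model_times_s, target_s):
    """Return (models_completed, cpu_seconds) for a hypothetical task.

    model_times_s: CPU seconds each successive model would take (assumed known
    here only so the example can be run; the real client cannot know this).
    target_s: the user's preferred runtime in seconds.
    """
    elapsed = 0.0
    done = 0
    for t in model_times_s:
        # The first model always runs to completion; after that,
        # stop as soon as the target runtime has been reached.
        if done >= 1 and elapsed >= target_s:
            break
        elapsed += t
        done += 1
    return done, elapsed


# One very slow model overshoots a 3 hour (10800 s) target:
print(run_task([19000, 19000], 10800))   # (1, 19000.0)
# Many quick models stop shortly after the target:
print(run_task([1000] * 20, 10800))      # (11, 11000.0)
```

So a WU whose single model needs 19K seconds will report 19K seconds no matter what runtime was requested.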


Yes, it did take about 19K seconds to complete, and I had another one that took 20K to complete as well. The problem is that it takes about 1 1/2 to 2 hours to get to the 10 minute mark, then it sort of hangs there for about 4 hours unless I suspend the project, let something else run, and let the BOINC manager work its way back to the WU.

Thanks,

Warren

ID: 45748 · Rating: 0
DerAndreas
Joined: 21 Jan 07
Posts: 2
Credit: 110,247
RAC: 0
Message 45812 - Posted: 9 Sep 2007, 12:22:37 UTC

Hello to all,

On both of my machines there are download problems.
With 5.69 this message is shown:
|Sending scheduler request: Requested by user
|Reporting 1 tasks
|Scheduler RPC succeeded
|Message from server: Server can't open log file (../log_boinc/cgi.log)
|Deferring communication for 1 hr 0 min 0 sec
|Reason: project is down

On the other machine, which is running 5.77, there is this message:
|[file_xfer] Started upload of file CNTRL_01ABRELAX_SAVE_ALL_OUT_-1iibA-_filters_1782_486910_0_0
|[error] Error on file upload: can't open log file
|[file_xfer] Temporarily failed upload of CNTRL_01ABRELAX_SAVE_ALL_OUT_-1iibA-_filters_1782_486910_0_0: transient upload error
|Backing off 5 min 38 sec on upload of file CNTRL_01ABRELAX_SAVE_ALL_OUT_-1iibA-_filters_1782_486910_0_0

Sometimes the second message appears on the first machine too.
Both machines are running BOINC 5.10.20; the jobs and the error were the same with my previous BOINC version, 5.8.15.

What can I do?

--
Greetings from Germany
ID: 45812 · Rating: 0
anders n
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 45815 - Posted: 9 Sep 2007, 12:38:51 UTC - in response to Message 45812.  

Hello to all,

What can I do?

--
Greetings from Germany


Nothing but wait for now :)

ID: 45815 · Rating: 0
Winkle
Joined: 22 May 06
Posts: 88
Credit: 1,354,930
RAC: 0
Message 46896 - Posted: 24 Sep 2007, 11:41:57 UTC - in response to Message 45745.  

Sorry for the late reply. The BOINC version is 5.4.9. I just got back from Fiji, and left it running while I was away. All the other machines were fine, but on this one BOINC had locked up. I rebooted and had to kill the tasks, and now it operates normally. The machine number is 225833.

Regards

Ian

Winkle, I hear such things about Linux off and on, but I see all of your machines are Windows. Which one is seeing this occur? What BOINC version is installed there?

ID: 46896 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46907 - Posted: 24 Sep 2007, 16:29:24 UTC

Winkle, looks like you are running BOINC version 5.4.9 on that machine. Suggest you start by downloading and installing a more current BOINC version.
Rosetta Moderator: Mod.Sense
ID: 46907 · Rating: 0
Winkle
Joined: 22 May 06
Posts: 88
Credit: 1,354,930
RAC: 0
Message 46941 - Posted: 24 Sep 2007, 22:38:56 UTC - in response to Message 46907.  

Thanks, will do. I hadn't realised it all had changed so much. The computers simply sit there and crunch away.

Winkle, looks like you are running BOINC version 5.4.9 on that machine. Suggest you start by downloading and installing a more current BOINC version.

ID: 46941 · Rating: 0
Luuklag
Joined: 13 Sep 07
Posts: 262
Credit: 4,171
RAC: 0
Message 47279 - Posted: 1 Oct 2007, 15:33:58 UTC

I don't know if this is a good place to post, but I see some admins and others posting here, so I hope I get some attention.

Please read my post about the abrelax WUs; the topic title is abbrelax, btw ;)

I have 6 failed WUs, most with an error about a wrong sin/cos value, so it would be good if someone could have a look at it.

Luuklag
ID: 47279 · Rating: 0
Mike Francis
Joined: 24 Nov 05
Posts: 8
Credit: 623,519
RAC: 0
Message 47361 - Posted: 3 Oct 2007, 22:02:17 UTC

For quite a while I have had no problems with any of the work units I have been sent.
Today, when I got home, there was one unit that had already been sent in and had a compute error; I don't know what caused it.
While I was looking at it, one unit finished prematurely. The error message I got was:

10/3/2007 5:49:23 PM|rosetta@home|Reason: Unrecoverable error for result CNTRL_01ABRELAX_SAVE_ALL_OUT_-1a19A-_filters_1782_524938_0 ( - exit code -1073741819 (0xc0000005))
10/3/2007 5:49:23 PM|rosetta@home|Computation for task CNTRL_01ABRELAX_SAVE_ALL_OUT_-1a19A-_filters_1782_524938_0 finished
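
For anyone reading that exit code: the negative decimal number and the hex value in parentheses are the same 32-bit value. A small sketch of the conversion (the helper name is just made up for this example):

```python
def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


# Known Windows status code; 0xC0000005 is STATUS_ACCESS_VIOLATION.
KNOWN_STATUS = {0xC0000005: "STATUS_ACCESS_VIOLATION"}

code = to_ntstatus(-1073741819)
print(hex(code))                         # 0xc0000005
print(KNOWN_STATUS.get(code, "unknown"))  # STATUS_ACCESS_VIOLATION
```

On Windows, 0xC0000005 is the access violation status, meaning the application touched memory it was not allowed to. The log line alone does not say why that happened.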

Hope this is of use.

Mike F,


ID: 47361 · Rating: 0
Jmarks
Joined: 16 Jul 07
Posts: 132
Credit: 98,025
RAC: 0
Message 47376 - Posted: 4 Oct 2007, 12:44:31 UTC
Last modified: 4 Oct 2007, 12:44:56 UTC

I do not know if this post is needed, because I have a separate Number crunching post about this WU validating but not showing up on boincstats along with the 2 other Rosetta WUs that my PC completed yesterday. Those were 5.80 WUs.
Since it was a 5.69 WU, maybe you need to look at it.
109142983

If you need to look at my full post, here is the link.
Msg3638
Jmarks
ID: 47376 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 47471 - Posted: 7 Oct 2007, 0:10:26 UTC

Problem with this one, got stuck. app 5.69.

CNTRL_01ABRELAX_SAVE_ALL_OUT_-1ubi_-_filters_1782_490977_0_0

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=100428228

Pete.


ID: 47471 · Rating: 0
[XTBA>XTC] ZeuZ
Joined: 4 Jun 06
Posts: 2
Credit: 19,725
RAC: 0
Message 47491 - Posted: 7 Oct 2007, 17:24:11 UTC
Last modified: 7 Oct 2007, 17:26:11 UTC

Hello everybody

Some crunchers seem to have a problem with the granted credit: the claimed credit is higher than the granted credit on a lot of WUs.

We don't know what is happening; it's a bit annoying.

For example:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=99953073

https://boinc.bakerlab.org/rosetta/results.php?hostid=587047&offset=40

https://boinc.bakerlab.org/rosetta/results.php?hostid=603973

Why do these computers have that problem while the others don't?

Thank you very much

ZeuZ @ L'Alliance Francophone - XTC Mini TEAM
ID: 47491 · Rating: 0
[AF>EDLS>Physique] Pas93
Joined: 28 Sep 05
Posts: 3
Credit: 1,436,260
RAC: 0
Message 47497 - Posted: 7 Oct 2007, 20:06:47 UTC

Yes, I have the same problem.


https://boinc.bakerlab.org/rosetta/results.php?hostid=591113

:(


ID: 47497 · Rating: 0
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 47516 - Posted: 8 Oct 2007, 15:38:22 UTC

I thought you guys would be happy with more credit.
Rosie is rewarding your computer for finishing faster than the average of the other computers that have run this model.
ID: 47516 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 47518 - Posted: 8 Oct 2007, 16:27:51 UTC - in response to Message 47516.  

I thought you guys would be happy with more credit.
Rosie is rewarding your computer for finishing faster than the average of the other computers that have run this model.


Actually, ZeuZ's point was that they are awarded LESS than claimed. But Greg is on to the right cause: your machine took longer to complete the task than the BOINC benchmarks would have predicted. The benchmarks don't require much memory or much L2 cache, and so they don't test enough to give a good prediction of how long a given machine will take to complete a given amount of work.

The credit claimed is just based on your machine's benchmark rating for the time period it worked on the task. The credit granted is based on an average of the claims of others that have also worked on models from the same batch of work.
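
As a rough worked example of claimed versus granted credit (a sketch only; the function names, the credits-per-hour figure, and the per-model averaging are assumptions for illustration, not the project's real formula):

```python
def claimed_credit(benchmark_credit_per_hour, cpu_seconds):
    """Claim = the host's benchmark rating times the time spent on the task."""
    return benchmark_credit_per_hour * cpu_seconds / 3600.0


def granted_credit(models_done, prior_claims_per_model):
    """Grant = models completed times the average per-model claim reported
    by other hosts in the same batch (assumed per-model averaging)."""
    average = sum(prior_claims_per_model) / len(prior_claims_per_model)
    return models_done * average


# A host whose benchmarks say 12 credits/hour runs for 2 hours...
print(claimed_credit(12.0, 7200))            # 24.0
# ...but completes only 4 models, where other hosts claimed ~5 per model.
print(granted_credit(4, [4.0, 5.0, 6.0]))    # 20.0
```

Here the host is granted less than it claimed because it produced fewer models per benchmark-hour than the other hosts did, which is the situation described above.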
Rosetta Moderator: Mod.Sense
ID: 47518 · Rating: 0
[XTBA>XTC] ZeuZ
Joined: 4 Jun 06
Posts: 2
Credit: 19,725
RAC: 0
Message 47521 - Posted: 8 Oct 2007, 19:42:29 UTC
Last modified: 8 Oct 2007, 19:49:15 UTC

Thank you for your replies guys

So the benchmark is a bit falsified on some of our machines, ok but here is a problem all the same https://boinc.bakerlab.org/rosetta/workunit.php?wuid=99953073

The benchmark seems to be right; a Core2 Quad is a fast CPU, and 2 points for 11,000 seconds of calculation is very strange. I think it's a problem with some WUs, because the other WUs calculated are right: https://boinc.bakerlab.org/rosetta/results.php?hostid=573215&offset=40


ID: 47521 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 47522 - Posted: 8 Oct 2007, 21:02:24 UTC - in response to Message 47521.  

Thank you for your replies guys

So the benchmark is a bit falsified on some of our machines, ok but here is a problem all the same https://boinc.bakerlab.org/rosetta/workunit.php?wuid=99953073

The benchmark seems to be right; a Core2 Quad is a fast CPU, and 2 points for 11,000 seconds of calculation is very strange. I think it's a problem with some WUs, because the other WUs calculated are right: https://boinc.bakerlab.org/rosetta/results.php?hostid=573215&offset=40


David, Rhiju, check this out. The result shows two (!) completion sections. One shows 29 decoys and 10494 seconds of CPU, the other shows 30 decoys and 10861.7 seconds of CPU... and either one would normally have been granted more than 2 credits. It's almost like it completed the task once and then later ran another model on it.

This is Windows XP Pro on an Intel Core2 Quad. With only 1GB of memory for 4 CPUs, I would say this machine is probably memory constrained. From what I can tell, more of the machines reporting Linux problems are memory constrained as well.

ZeuZ, I didn't mean to say any of the benchmarks were falsified (although some people do that, and that is a big part of why Rosetta made a more independent credit system). I only meant that the work measured in the benchmark is trivial (simple) when compared to running Rosetta. So it is possible for one machine to show benchmarks twice as high as another, and yet not get twice as much work done.
Rosetta Moderator: Mod.Sense
ID: 47522 · Rating: 0


