Rosetta@home

Problems with Rosetta version 5.93

Ingemar

Joined: Feb 28 06
Posts: 20
ID: 61985
Credit: 1,680
RAC: 0
Message 50335 - Posted 5 Jan 2008 1:39:56 UTC

Please post problems and/or bugs with rosetta 5.93. Thanks for your
support!
____________

Thomas Leibold

Joined: Jul 30 06
Posts: 55
ID: 102494
Credit: 19,256,322
RAC: 7,733
Message 50345 - Posted 5 Jan 2008 7:21:01 UTC - in response to Message ID 50335.

Please post problems and/or bugs with rosetta 5.93.


My problem with 5.93 is that once again an insufficiently tested client has been released for the Rosetta project. Was the trouble that the 5.90 client caused for Linux users (and the 1zpy workunits for everyone) not severe enough to make the project developers think about what they are doing (e.g. learning from their mistakes)?
Do you really have such an excess of contributors that you can afford to drive a significant portion of them away to other projects?

There were less than 20 hours between the 5.93 announcement on Ralph and the same one on Rosetta. During that time my test machine got 0 workunits from Ralph. In fact it didn't get any 5.92 work either, and as of this post it still has not received any work from Ralph (it has already received 5.93 workunits from Rosetta).

If Rhiju hadn't said "We'll certainly pay closer attention to this in the future, and do tests for, say, at least two days." I would have left Rosetta right then (I was already testing the Folding@Home SMP Linux client). Not that I think even 2 full days are really sufficient; it should probably be two weeks. To say that I'm disappointed about how quickly this turned into an empty promise is an understatement.

____________
Team Helix

Luuklag

Joined: Sep 13 07
Posts: 262
ID: 205058
Credit: 4,171
RAC: 0
Message 50354 - Posted 5 Jan 2008 13:27:44 UTC - in response to Message ID 50345.

If you have existing code that has proven to run well, and you know what your changes will cause, I don't think there is much more left to test than whether it runs. That could be done within, let's say, 8 hours or so: as long as you get 10 to 50 results, you know whether it works or not. If the majority of those WUs error out, you know you're wrong.

dcdc Profile

Joined: Nov 3 05
Posts: 1488
ID: 8948
Credit: 19,185,928
RAC: 13,098
Message 50366 - Posted 5 Jan 2008 16:44:05 UTC - in response to Message ID 50354.

Luuklag, that's just not true. Anyone with any programming experience will tell you the same.

Rosetta is losing people that have invested a lot of time, money and faith into the project because of Rosetta's recent instability. Most of them would have stayed if the problems were accidental and measures were in place to prevent this, but the testing on Ralph is still apparently minimal...

To the Devs: Do you need more computers on Ralph? If so, just ask and I'm sure you'll get them.
____________

Luuklag

Joined: Sep 13 07
Posts: 262
ID: 205058
Credit: 4,171
RAC: 0
Message 50369 - Posted 5 Jan 2008 16:50:35 UTC

Well, on the other hand, take 5.90 for Linux: people reported that it went wrong, so they quickly updated it to 5.91. That this takes 1 or 2 days, because the code needs to be looked through and repaired, annoys people, but hey, nothing is perfect.

j2satx

Joined: Sep 17 05
Posts: 97
ID: 253
Credit: 3,371,456
RAC: 839
Message 50371 - Posted 5 Jan 2008 17:09:17 UTC - in response to Message ID 50366.

It doesn't do any good to add Linux computers to Ralph, since the WUs go to any OS.

I for one have detached all my computers from Ralph and Rosetta. Ralph doesn't do any good when the majority of testing is done on Rosetta.

dcdc Profile

Joined: Nov 3 05
Posts: 1488
ID: 8948
Credit: 19,185,928
RAC: 13,098
Message 50376 - Posted 5 Jan 2008 17:39:32 UTC - in response to Message ID 50369.

It shouldn't get to that point - the one to two days of fixing should be on Ralph because that is installed on computers whose owners understand what's going on and that there might be problems.

It needs testing BEFORE it gets released. People donate their computers, generally because they believe in the project, but releasing code that has had minimal testing (especially when there's a platform in place for the testing) is... poor, and it's causing people to leave.
____________

Luuklag

Joined: Sep 13 07
Posts: 262
ID: 205058
Credit: 4,171
RAC: 0
Message 50377 - Posted 5 Jan 2008 18:21:18 UTC - in response to Message ID 50376.
Last modified: 5 Jan 2008 18:29:25 UTC

OK, now I've heard enough from people. I'm going to add Ralph, sharing with Rosetta at rosetta/ralph = (7/8)/(1/8), with Rosie running 4 hour WUs and Ralph 2 hour WUs.

[EDIT] It seems Ralph does have a purpose:

Jan. 3, 2008
The Ralph executable has been updated to 5.93. The recently added 5.92 contained a bug which is removed in this version. Please report bugs in this thread.

because that version never came to Rosie.
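
For readers unfamiliar with how such a split works out, here is a minimal sketch of the arithmetic, assuming the split in the post means a 7/8 share for Rosetta and a 1/8 share for Ralph. The function name and the share values 700 and 100 are hypothetical, picked only to reproduce that ratio.

```python
def share_fractions(shares):
    """Return each project's fraction of the total resource share."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("total resource share must be positive")
    return {name: value / total for name, value in shares.items()}


# Hypothetical share values chosen to give the 7/8 : 1/8 split from the post.
fractions = share_fractions({"rosetta": 700, "ralph": 100})
for project, fraction in fractions.items():
    print(f"{project}: {fraction:.3f} of the available computing time")
```

Only the ratio between the numbers matters in this sketch; 7 and 1 would give the same fractions.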

Thomas Leibold

Joined: Jul 30 06
Posts: 55
ID: 102494
Credit: 19,256,322
RAC: 7,733
Message 50385 - Posted 5 Jan 2008 22:23:41 UTC - in response to Message ID 50354.
Last modified: 5 Jan 2008 22:29:35 UTC


"if you have an existing code which has proven to be running good,"

The Rosetta Linux client has known issues in the interaction between the main computation thread, the watchdog thread and the Boinc client. These have been there for a very long time and they have still not been resolved (I have no way of knowing if anybody is even attempting to resolve them). However this clearly means that the premise of starting with known good code is false.

"and you know what your changes will cause"

I'm an experienced software developer and I can assure you that no matter how well you think you understand the code you are changing and all the consequences of making that change, there is always the possibility of overlooking something. It is especially challenging when you change code that needs to run not only in one particular well controlled environment of your own, but at many different customer sites over which you have no control whatsoever. Following good practices while developing software is important, but no substitute for testing.

"i dont think there is much more to test left then test if it runs, so that could be done within lets say 8 hours or so as long as you get like 10 to 50 results you know if it works or not, if the majority of those WU's error out you know your wrong."


It seems that you don't understand the nature of the problem with the 5.90 Linux client or some of the 1zpy workunits: they never finish.

In the 5.90 Linux case they run forever while the accumulated CPU time stays near 0. If you restart Boinc, some but not all of the workunits processed with the 5.90 client will show the correct amount of CPU time and finish, while others will restart and again run forever.

In the 1zpy workunit case (even with the 5.91 client) they will get to the point of showing 100% completed, but remain in state "running". When restarting Boinc the amount of cpu time accumulated resets to a low number and the workunit starts again.

In both of these recent cases:
- preferred runtime is ignored
- the 4 times runtime safeguard is not working
- the workunits even continue beyond the project deadline for returning the result
- even restarting Boinc does not resolve the problem and the workunit continues to be stuck

Any unattended Linux server running R@H may very well continue to run these stuck 5.90/1zpy workunits for another year or longer. It certainly doesn't help that nobody from the Rosetta Team of Developers has made any attempt to communicate the nature of the problem to the user community (especially the requirement that those workunits have to be manually cleaned up).

An 8 hour test cannot detect that some workunits get stuck and don't complete ever! The reason I believe a 2 week test period is most sensible is that there is enough time to get problem reports from folks who perhaps check their servers only once a week.
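
The symptoms listed above can be turned into a simple check that a volunteer might run against their own records of a task. This is only a sketch under stated assumptions: the `TaskSample` type, the `stuck_reasons` function and the thresholds are hypothetical and are not part of BOINC or Rosetta. It deliberately measures wall-clock time, since the posts describe CPU time staying near zero, which would defeat a check based on CPU time alone.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    """One observation of a running task (hypothetical record type)."""
    wall_seconds: float  # wall-clock time since the task started
    cpu_seconds: float   # CPU time the task reported at that moment


def stuck_reasons(samples, preferred_runtime, seconds_to_deadline, min_cpu_gain=1.0):
    """Return a list of reasons why a task looks stuck (empty if it looks healthy)."""
    reasons = []
    if len(samples) >= 2:
        first, last = samples[0], samples[-1]
        elapsed = last.wall_seconds - first.wall_seconds
        gained = last.cpu_seconds - first.cpu_seconds
        # Wall-clock time passes but almost no CPU time accumulates.
        if elapsed > 0 and gained < min_cpu_gain:
            reasons.append("cpu time not advancing")
    # Assumed safeguard: give up after four times the preferred runtime.
    if samples and samples[-1].wall_seconds > 4 * preferred_runtime:
        reasons.append("exceeded 4x preferred runtime")
    if seconds_to_deadline < 0:
        reasons.append("past deadline")
    return reasons


# A task observed twice, 25 hours apart, with a 6 hour runtime preference.
observed = [TaskSample(0, 5.0), TaskSample(90000, 5.2)]
print(stuck_reasons(observed, preferred_runtime=21600, seconds_to_deadline=-10))
```

A check like this only helps on machines where someone or something actually looks at the result, which is the point made above about unattended servers.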
____________
Team Helix

dcdc Profile

Joined: Nov 3 05
Posts: 1488
ID: 8948
Credit: 19,185,928
RAC: 13,098
Message 50387 - Posted 6 Jan 2008 0:44:41 UTC - in response to Message ID 50385.


An 8 hour test cannot detect that some workunits get stuck and don't complete ever! The reason I believe a 2 week test period is most sensible is that there is enough time to get problem reports from folks who perhaps check their servers only once a week.

I think two weeks is probably too long for the team to wait before their code is tested on a large scale (Rosetta) but five or six days on Ralph would give plenty of results. If Ralph needs more monitored PCs then we can do that between us, but we need to be confident that Rosetta is stable as it isn't always running on monitored PCs.
____________

Azurrio Profile

Joined: Feb 20 06
Posts: 8
ID: 60240
Credit: 124,604
RAC: 556
Message 50407 - Posted 6 Jan 2008 19:32:46 UTC
Last modified: 6 Jan 2008 19:33:11 UTC

Validate Error on this (whatever that means :D)

rochester new york Profile
Avatar

Joined: Jul 2 06
Posts: 2337
ID: 98229
Credit: 756,356
RAC: 316
Message 50425 - Posted 7 Jan 2008 12:12:48 UTC
Last modified: 7 Jan 2008 12:17:57 UTC

error......... http://boinc.bakerlab.org/rosetta/workunit.php?wuid=118727989

Mike.Gibson

Joined: Nov 3 07
Posts: 19
ID: 217599
Credit: 189,254
RAC: 0
Message 50440 - Posted 7 Jan 2008 21:18:03 UTC

Hi, folks

I thought that the same problem I was having on 5.90 was recurring, that is, it kept adding time but never finished once it reached 10 minutes to go. But no, this time it was just the countdown that stuck. After 10 minutes stuck on 10 minutes to go it suddenly finished.

Regards

Mike

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50446 - Posted 7 Jan 2008 22:58:47 UTC

Mike, it's normal for the time to basically stand still (time remaining actually is reduced very slowly) once you get below about 12 minutes. This occurs most frequently when you have a runtime preference that is less than the time it takes to compute a single model of a complex protein.
____________
Rosetta Moderator: Mod.Sense

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50448 - Posted 8 Jan 2008 2:23:39 UTC
Last modified: 8 Jan 2008 2:26:07 UTC

Perhaps if this thread were called "NO Problems with Rosetta version 5.93" then there'd be more posts (knock on wood). I'm not seeing many reports of errors. Looking at my own hosts, I've not had one error either with Windows or Linux. Wonder what would happen if they fed us some 1zpy boinc twist rings wus???.

While this updated chart only shows a small number of 5.93's, I'd have thought I'd have seen some sign of trouble by now, especially if you consider the numbers for 5.90 (in Windows) and 5.91 (in Linux). I've highlighted 5.93 in red.

(my puters are dual boot using both windows and linux as designated by the "l" or "w" following the number. The number is the AMD designation, so put AMD64 in front of the 2800 and 3700, and put AMD64 X2 in front of the 4800, 5200, and 6000.)

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 4,820,543
RAC: 2,349
Message 50451 - Posted 8 Jan 2008 5:46:22 UTC

Well, if you're looking for TWISTed RINGS troubles with 5.93, here's one. It was only one, so I didn't bother reporting it. Watchdog ended the run.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=117443652

This one was a bit weird too, though no error.

http://boinc.bakerlab.org/rosetta/result.php?resultid=131065424
____________

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,061,841
RAC: 1,331
Message 50453 - Posted 8 Jan 2008 12:38:17 UTC

This one was ended by the watchdog for being stuck for 900 seconds:

http://boinc.bakerlab.org/rosetta/result.php?resultid=131544223

RoSi

Joined: Nov 9 07
Posts: 1
ID: 219516
Credit: 310,476
RAC: 160
Message 50462 - Posted 8 Jan 2008 19:09:28 UTC

I'm facing the same problem.

Sometimes when Boinc switches back to Rosetta, the Progress jumps immediately to 100% while the Status still shows running, yet a hardware tool shows that the CPU is idling.

Today I therefore stopped all other projects so that Rosetta was working alone. Finally, after 1h57min out of 20h (or so), it did the same: suddenly the Progress jumped to 100% while the Status still showed running and the CPU was idling...

This occurs only under Linux. On my Windows client everything runs smoothly. The Linux client uses Boinc 5.10.28, Rosetta Beta 5.93 and Ubuntu 7.10.

Luuklag

Joined: Sep 13 07
Posts: 262
ID: 205058
Credit: 4,171
RAC: 0
Message 50464 - Posted 8 Jan 2008 20:36:06 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=131390192

rochester new york Profile
Avatar

Joined: Jul 2 06
Posts: 2337
ID: 98229
Credit: 756,356
RAC: 316
Message 50469 - Posted 8 Jan 2008 21:15:25 UTC
Last modified: 8 Jan 2008 21:16:27 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=131390192

Ingemar

Joined: Feb 28 06
Posts: 20
ID: 61985
Credit: 1,680
RAC: 0
Message 50473 - Posted 8 Jan 2008 22:53:27 UTC

In this case it was only 2 days between the boinc and ralph update, which I agree is on the short side. However, I just want to alert you to one issue: how long one tests the code on ralph before submitting to boinc depends on the nature of the update. In this case the latest two updates on ralph concerned only code in one scientific protocol. The rest of rosetta stayed the same, and the bulk of the jobs sent out were running on the same code base as the previous boinc version. As for the new code, we had tested it on ralph and locally without problems, so we were confident it would not mess things up.

Still, I generally agree that more time between updates is better. I will leave more time the next time I make an update. But as I alluded to above, how much testing is necessary on ralph must be determined on a case-by-case basis. Also, compared to many other projects we are doing more code development and probably need to update more often.

As for the latest problems: they were mainly caused by high-memory jobs run with higher priority over the holiday season, when the queue was not completely full. We have noted these problems and are working on solutions to avoid them in the future.

We hope for your continued support, thanks!
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50474 - Posted 8 Jan 2008 23:14:42 UTC - in response to Message ID 50473.

As for the latest problems: ...We have noted these problems and are working on solutions to avoid these problems in the future.


...Ingemar is referring to the "no new work" messages that people were seeing the last 2 weeks or so of December and early January. This was observed mostly on systems with only the project minimum 256MB of memory.

____________
Rosetta Moderator: Mod.Sense

cnick6

Joined: May 30 06
Posts: 24
ID: 85398
Credit: 1,081,843
RAC: 370
Message 50475 - Posted 9 Jan 2008 0:30:13 UTC
Last modified: 9 Jan 2008 0:31:08 UTC

Had a compute failure on Linux 5.93

http://boinc.bakerlab.org/rosetta/result.php?resultid=132057576
____________

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50478 - Posted 9 Jan 2008 2:32:55 UTC
Last modified: 9 Jan 2008 2:34:07 UTC

Yup, looks like the 1zpy's are getting their normal "watchdog" endings (but in this case valid and credited).

resultid=131958175

(this is the only one for me so far. I.E no other errors of any kind except this one, so not so bad)

Angus Profile

Joined: Sep 17 05
Posts: 412
ID: 83
Credit: 321,053
RAC: 0
Message 50481 - Posted 9 Jan 2008 3:31:22 UTC - in response to Message ID 50473.
Last modified: 9 Jan 2008 3:33:18 UTC

Perhaps you need to stop new development for a bit, and concentrate on FIXING the broken crap you have now. (See Thomas Leibold's post earlier in this thread.) Every release is accompanied by an endless litany of failed WUs. In no case, for a project of this size, is a few hours of testing an application even remotely adequate.

Fix the problems, and perhaps those who have left this project might consider coming back.

____________
Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :)



"You can't fix stupid" (Ron White)

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 4,820,543
RAC: 2,349
Message 50484 - Posted 9 Jan 2008 8:17:08 UTC - in response to Message ID 50478.

And this is a 1zpy job my box received, ended by watchdog, also validated and credited.

http://boinc.bakerlab.org/rosetta/result.php?resultid=132099437
____________

dcdc Profile

Joined: Nov 3 05
Posts: 1488
ID: 8948
Credit: 19,185,928
RAC: 13,098
Message 50485 - Posted 9 Jan 2008 10:26:55 UTC - in response to Message ID 50473.

Thanks for the response Ingemar - it makes a big difference.

____________

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50487 - Posted 9 Jan 2008 13:11:48 UTC
Last modified: 9 Jan 2008 13:12:08 UTC

My 4800 using linux had a watchdog ended WU, but it too was valid and credited. Here's the current scoreboard of 5.93 for my hosts. It seems consistent that it's the 1zpy wus that I have issues with. Perhaps, since I get credit and the wu is considered valid, that this is more of an "informational message", rather than an "error"???

Luuklag

Joined: Sep 13 07
Posts: 262
ID: 205058
Credit: 4,171
RAC: 0
Message 50496 - Posted 9 Jan 2008 20:09:16 UTC
Last modified: 9 Jan 2008 20:12:00 UTC

And two more errors.

1
2

Something is wrong with sin and cos.

Robby1959

Joined: May 10 07
Posts: 17
ID: 175372
Credit: 3,518,250
RAC: 2,378
Message 50518 - Posted 10 Jan 2008 4:02:53 UTC

Can anyone tell me why my wife's laptop (a Toshiba Sat.) keeps running work while it is in use? The machine is set not to run and the server is set not to run. I am stumped. It has a 1.6 Intel dual core with 1 gig of RAM. I think another laptop I set up is having the same problem, and it's bogging down the system. Any ideas? BTW, it will snooze if told to. Also, how long is the snooze timer?

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50528 - Posted 10 Jan 2008 12:35:45 UTC
Last modified: 10 Jan 2008 12:40:31 UTC

My AMD64 X2 4800 using Windows had TWO "1zpy" wus ended by the watchdog yesterday. At this point, I don't see the need to link to the results as they must have PLENTY of samples to work with.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50532 - Posted 10 Jan 2008 16:42:40 UTC - in response to Message ID 50518.

Robby, you want to review the "preferences" on that PC, which are a local override of your web-based preferences, that pertain only to that specific machine. And just make sure that the checkbox for customized preferences is checked (if that's what you wish to do) and then what it says in the "do work after idle for..." (the "advanced" pulldown, then preferences, then processor usage tab in the advanced view). If the customized preferences box is not checked, then it's just going to use the web-based preferences, which are defined for up to 4 different venues. You can see which venue this machine is considered to be at in the messages as BOINC starts.

I believe the snooze is for 30min.
____________
Rosetta Moderator: Mod.Sense

Robby1959

Joined: May 10 07
Posts: 17
ID: 175372
Credit: 3,518,250
RAC: 2,378
Message 50533 - Posted 10 Jan 2008 17:26:29 UTC - in response to Message ID 50532.

can anyone tell me why my wifes laptop toshiba sat. keeps running work while it is in use? the machine is set not to run and the server is set not to run I am stumped it has a 1.6 intel dual core w/ 1 gig of ram, I think another laptop I set up is having the same problems and its bogging down the system any ideas btw it will snooze if told to also how long is the snooze timer


Robby, you want to review the "preferences" on that PC, which are a local override of your web-based preferences, that pertain only to that specific machine. And just make sure that the checkbox for customized preferences is checked (if that's what you wish to do) and then what it says in the "do work after idle for..." (the "advanced" pulldown, then preferences, then processor usage tab in the advanced view). If the customized preferences box is not checked, then it's just going to use the web-based preferences, which are defined for up to 4 different venues. You can see which venue this machine is considered to be at in the messages as BOINC starts.

I believe the snooze is for 30min.

Luuklag

Joined: Sep 13 07
Posts: 262
ID: 205058
Credit: 4,171
RAC: 0
Message 50550 - Posted 10 Jan 2008 20:56:14 UTC
Last modified: 10 Jan 2008 21:37:21 UTC

I'm running a 5.93 task (BOINC TWIST RINGS etc.), but now when I look at the graphics there is no image in the Searching box, nor in the Accepted box, nor in Low Energy; the only image shown is that of Native.

For the rest it is running normally.
Dragging around in the boxes doesn't bring the view back, and restarting the graphics also makes no difference.

For screenshots, mail me.
[edit]

OK, it triggered the debugger now, after more than 2 hours:


error

Karl

Joined: May 12 06
Posts: 11
ID: 82369
Credit: 188,211
RAC: 0
Message 50562 - Posted 11 Jan 2008 11:54:41 UTC

What is happening with my work unit accounting? On Jan 6, my average was 462 and some decimal. Today it is 408.75. This is an enormous drop in work units. I haven't changed any of my preferences at all over the last week.
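
One way to reason about a drop like this is sketched below, assuming that the "average" is the recent average credit (RAC) and that it decays with a half-life of about one week when no new credit is granted. Both points are assumptions made for illustration and are not confirmed anywhere in this thread; the function name and the constant are hypothetical.

```python
HALF_LIFE_DAYS = 7.0  # assumed half-life of the credit average


def decayed_average(average, days_idle, half_life_days=HALF_LIFE_DAYS):
    """Average that would remain after days_idle days with no new credit granted."""
    return average * 0.5 ** (days_idle / half_life_days)


# Jan 6 to Jan 11 is five days; start from the 462 reported in the post.
print(round(decayed_average(462.0, 5.0), 2))
```

Under these assumptions, five days with no credit at all would pull the average well below the 408.75 reported, so a drop of this size would be consistent with credit still arriving, only more slowly than before (for example because some results errored out or had not yet been granted credit).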
____________

Conan Profile
Avatar

Joined: Oct 11 05
Posts: 134
ID: 4053
Credit: 1,599,032
RAC: 24
Message 50565 - Posted 11 Jan 2008 12:29:28 UTC

Getting a number of errors on my Windows machine, no problems on my Linux machines,

Get the following error message

<core_client_version>5.8.15</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 21600
# random seed: 1309647
No heartbeat from core client for 31 sec - exiting
# cpu_run_time_pref: 21600
No heartbeat from core client for 31 sec - exiting
# cpu_run_time_pref: 21600
# random seed: 1309647
No heartbeat from core client for 31 sec - exiting
# cpu_run_time_pref: 21600
# random seed: 1309647
No heartbeat from core client for 31 sec - exiting
# cpu_run_time_pref: 21600
# random seed: 1309647
No heartbeat from core client for 31 sec - exiting
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 1 starting structures 0 cpu seconds
This process generated 0 decoys from 0 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
<message>
<file_xfer_error>
<file_name>n003_1_NMRREF_n003_1_id_model_13_idl_Structural_Genomics_Target_2486_6284_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>

</message>

On WU 131203333
WU 131203968
WU 132195281
WU 132270738
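
When several results fail the same way, it can help to summarize the stderr text instead of reading it line by line. The following is a minimal sketch: it only counts the two messages visible in the output above, and the function name and dictionary keys are hypothetical.

```python
def summarize_stderr(stderr_text):
    """Count heartbeat exits and detect the restart-limit message in a stderr dump."""
    lines = [line.strip() for line in stderr_text.splitlines()]
    return {
        # Each of these lines marks one exit caused by a missing heartbeat.
        "heartbeat_exits": sum("No heartbeat from core client" in line for line in lines),
        # This line marks the point where the application gave up restarting.
        "too_many_restarts": any(
            line.startswith("Too many restarts with no progress") for line in lines
        ),
    }


# Shortened sample in the same format as the output quoted above.
sample = """# cpu_run_time_pref: 21600
# random seed: 1309647
No heartbeat from core client for 31 sec - exiting
# cpu_run_time_pref: 21600
No heartbeat from core client for 31 sec - exiting
Too many restarts with no progress. Keep application in memory while preempted.
"""
print(summarize_stderr(sample))
```

The restart-limit message carries its own hint, namely to keep the application in memory while it is preempted.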

____________

Luuklag

Joined: Sep 13 07
Posts: 262
ID: 205058
Credit: 4,171
RAC: 0
Message 50576 - Posted 11 Jan 2008 17:58:50 UTC

And another 2 errors.

The second one didn't even start; it said maximum disk usage exceeded, but the next task started just fine, so I'm like WTF :O ?!?!?!
The first errored out because of sin / cos out of range.

1
2

In the last few days I had 6 errors and only 1 success, almost all on tasks like "mlt__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK_RELAX-2mlt_-crystal_foldanddock".
Does anyone have an idea what's going on? I would like someone to explain this.

Michael Matthews

Joined: Dec 12 05
Posts: 3
ID: 36095
Credit: 37,852
RAC: 0
Message 50586 - Posted 11 Jan 2008 20:37:28 UTC

I am running BOINC Manager 5.10.30 on a Windows XP SP2 operating system (with 1 GB of RAM). The entire computer has crashed and switched off whenever Rosetta 5.93 Beta has been running. This only happens when the BOINC Manager runs Rosetta 5.93 Beta (why is a beta version being sent out?). When the BOINC Manager runs the SETI@home application there are never any crashes.

I believe that the Rosetta work units that crashed are:

120779979 <http://boinc.bakerlab.org/rosetta/workunit.php?wuid=120779979>

120604480 <http://boinc.bakerlab.org/rosetta/workunit.php?wuid=120604480>

I looked in the <C:\Program Files\BOINC\> and <C:\Program Files\BOINC\projects\boinc.bakerlab.org_rosetta\> directories for some kind of log file that might explain what the error was that caused the crash but I did not find anything conclusive. I did see this part of a log in the <C:\Program Files\BOINC\stdoutdae.txt> log file which corresponds to around the last time my system crashed and powered off last night (timestamps are in the PST timezone):


10-Jan-2008 10:52:04 [SETI@home] Sending scheduler request: To fetch work. Requesting 94 seconds of work, reporting 0 completed tasks
10-Jan-2008 10:52:39 [SETI@home] Scheduler request succeeded: got 0 new tasks
10-Jan-2008 10:52:44 [rosetta@home] Sending scheduler request: To fetch work. Requesting 2890 seconds of work, reporting 0 completed tasks
10-Jan-2008 10:52:49 [rosetta@home] Scheduler request succeeded: got 1 new tasks
10-Jan-2008 10:52:51 [rosetta@home] Started download of vf_1ail_.fasta.gz
10-Jan-2008 10:52:51 [rosetta@home] Started download of vf_1ail_.psipred_ss2.gz
10-Jan-2008 10:52:52 [rosetta@home] Finished download of vf_1ail_.fasta.gz
10-Jan-2008 10:52:52 [rosetta@home] Finished download of vf_1ail_.psipred_ss2.gz
10-Jan-2008 10:52:52 [rosetta@home] Started download of paths.vf2.17.5.txt.gz
10-Jan-2008 10:52:52 [rosetta@home] Started download of boinc_vf_aa1ail_03_05.200_v1_3.gz
10-Jan-2008 10:52:54 [rosetta@home] Finished download of paths.vf2.17.5.txt.gz
10-Jan-2008 10:52:54 [rosetta@home] Finished download of boinc_vf_aa1ail_03_05.200_v1_3.gz
10-Jan-2008 10:52:54 [rosetta@home] Started download of boinc_vf_aa1ail_09_05.200_v1_3.gz
10-Jan-2008 10:52:54 [rosetta@home] Started download of boinc_vf_aa1ail_05_05.200_v1_3.gz
10-Jan-2008 10:52:55 [rosetta@home] Finished download of boinc_vf_aa1ail_09_05.200_v1_3.gz
10-Jan-2008 10:52:55 [rosetta@home] Finished download of boinc_vf_aa1ail_05_05.200_v1_3.gz
10-Jan-2008 10:52:55 [rosetta@home] Started download of boinc_vf_aa1ail_17_05.200_v1_3.gz
10-Jan-2008 10:52:55 [rosetta@home] Started download of vf_1ail.pdb.gz
10-Jan-2008 10:52:56 [rosetta@home] Finished download of vf_1ail.pdb.gz
10-Jan-2008 10:52:56 [rosetta@home] Started download of abrelax_description.txt
10-Jan-2008 10:52:58 [rosetta@home] Finished download of boinc_vf_aa1ail_17_05.200_v1_3.gz
10-Jan-2008 10:52:58 [rosetta@home] Finished download of abrelax_description.txt
10-Jan-2008 10:53:28 [rosetta@home] Starting 1ail__BOINC_ABRELAX_VF_IGNORE_THE_REST-S25-17-S3-5--1ail_-vf__2534_625_0
10-Jan-2008 10:53:28 [rosetta@home] Starting task 1ail__BOINC_ABRELAX_VF_IGNORE_THE_REST-S25-17-S3-5--1ail_-vf__2534_625_0 using rosetta_beta version 593
10-Jan-2008 11:51:17 [---] Suspending computation - user is active
10-Jan-2008 11:51:17 [---] Suspending network activity - user is active
10-Jan-2008 13:18:22 [---] Resuming computation
10-Jan-2008 13:18:22 [---] Resuming network activity
10-Jan-2008 13:22:39 [---] Suspending computation - user is active
10-Jan-2008 13:22:39 [---] Suspending network activity - user is active
10-Jan-2008 14:23:53 [---] Resuming computation
10-Jan-2008 14:23:53 [---] Resuming network activity
10-Jan-2008 14:34:18 [---] Suspending computation - user is active
10-Jan-2008 14:34:18 [---] Suspending network activity - user is active
10-Jan-2008 14:58:29 [---] Resuming computation
10-Jan-2008 14:58:29 [---] Resuming network activity
10-Jan-2008 15:20:35 [SETI@home] Restarting task 01mr07ag.12577.7025.16.6.28_2 using setiathome_enhanced version 527
10-Jan-2008 15:25:19 [---] Suspending computation - user is active
10-Jan-2008 15:25:19 [---] Suspending network activity - user is active
10-Jan-2008 15:45:19 [---] Resuming computation
10-Jan-2008 15:45:19 [---] Resuming network activity
10-Jan-2008 16:33:22 [SETI@home] Sending scheduler request: To fetch work. Requesting 82 seconds of work, reporting 0 completed tasks
10-Jan-2008 16:33:27 [SETI@home] Scheduler request succeeded: got 1 new tasks
10-Jan-2008 16:33:29 [SETI@home] Started download of 22fe07ah.17355.19704.14.6.137
10-Jan-2008 16:33:31 [SETI@home] Finished download of 22fe07ah.17355.19704.14.6.137
10-Jan-2008 17:59:38 [SETI@home] Computation for task 01mr07ag.12577.7025.16.6.28_2 finished
10-Jan-2008 17:59:38 [SETI@home] Starting 29no06ae.28122.22955.13.6.124_0
10-Jan-2008 17:59:38 [SETI@home] Starting task 29no06ae.28122.22955.13.6.124_0 using setiathome_enhanced version 527
10-Jan-2008 17:59:41 [SETI@home] Started upload of 01mr07ag.12577.7025.16.6.28_2_0
10-Jan-2008 17:59:43 [SETI@home] Finished upload of 01mr07ag.12577.7025.16.6.28_2_0
10-Jan-2008 18:21:41 [SETI@home] Sending scheduler request: To fetch work. Requesting 64 seconds of work, reporting 1 completed tasks
10-Jan-2008 18:21:46 [SETI@home] Scheduler request succeeded: got 1 new tasks
10-Jan-2008 18:21:48 [SETI@home] Started download of 02mr07ad.17282.483.5.6.18
10-Jan-2008 18:21:49 [SETI@home] Finished download of 02mr07ad.17282.483.5.6.18
11-Jan-2008 06:47:59 [---] Starting BOINC client version 5.10.30 for windows_intelx86
11-Jan-2008 06:47:59 [---] log flags: task, file_xfer, sched_ops
11-Jan-2008 06:47:59 [---] Libraries: libcurl/7.17.1 OpenSSL/0.9.8e zlib/1.2.3
11-Jan-2008 06:47:59 [---] Data directory: C:\Program Files\BOINC
11-Jan-2008 06:47:59 [---] Processor: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.20GHz [x86 Family 15 Model 2 Stepping 4]
11-Jan-2008 06:47:59 [---] Processor features: fpu tsc sse sse2 mmx
11-Jan-2008 06:47:59 [---] OS: Microsoft Windows XP: Professional Edition, Service Pack 2, (05.01.2600.00)
11-Jan-2008 06:47:59 [---] Memory: 1023.48 MB physical, 2.40 GB virtual
11-Jan-2008 06:47:59 [---] Disk: 74.52 GB total, 27.12 GB free
11-Jan-2008 06:47:59 [---] Local time is UTC -8 hours
11-Jan-2008 06:47:59 [rosetta@home] URL: http://boinc.bakerlab.org/rosetta/; Computer ID: 99273; location: home; project prefs: default
11-Jan-2008 06:47:59 [SETI@home] URL: http://setiathome.berkeley.edu/; Computer ID: 33200; location: (none); project prefs: default
11-Jan-2008 06:47:59 [---] General prefs: from rosetta@home (last modified 12-Dec-2005 11:43:10)
11-Jan-2008 06:47:59 [---] Host location: home
11-Jan-2008 06:47:59 [---] General prefs: no separate prefs for home; using your defaults
11-Jan-2008 06:47:59 [---] Preferences limit memory usage when active to 511.74MB
11-Jan-2008 06:47:59 [---] Preferences limit memory usage when idle to 921.14MB
11-Jan-2008 06:47:59 [---] Preferences limit disk usage to 0.93GB


This same behavior also happened some time ago (I don't remember when). I am seriously considering dropping Rosetta because they release BOINC applications that crash my system, and I cannot afford that.


-Michael
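For anyone wanting to narrow down when a crash like this happened, here is a minimal Python sketch that finds the longest silence between consecutive timestamped lines in a stdoutdae.txt excerpt like the one above. The function names are made up for illustration, and the timestamp format is assumed from the pasted log. The gap only bounds the time window in which the machine went down; it says nothing about the cause.

```python
from datetime import datetime

# Timestamp layout assumed from the pasted log, e.g. "10-Jan-2008 18:21:49"
LOG_TIME_FORMAT = "%d-%b-%Y %H:%M:%S"

def parse_time(line):
    """Return the timestamp at the start of a log line, or None if there is none."""
    try:
        return datetime.strptime(line[:20], LOG_TIME_FORMAT)
    except ValueError:
        return None

def largest_gap(lines):
    """Return (gap, before, after) for the longest silence between timestamped lines."""
    times = [t for t in map(parse_time, lines) if t is not None]
    best = None
    for before, after in zip(times, times[1:]):
        gap = after - before
        if best is None or gap > best[0]:
            best = (gap, before, after)
    return best

# Three lines copied from the log in the post above
sample = [
    "10-Jan-2008 18:21:48 [SETI@home] Started download of 02mr07ad.17282.483.5.6.18",
    "10-Jan-2008 18:21:49 [SETI@home] Finished download of 02mr07ad.17282.483.5.6.18",
    "11-Jan-2008 06:47:59 [---] Starting BOINC client version 5.10.30 for windows_intelx86",
]
gap, before, after = largest_gap(sample)
print(f"Longest silence: {gap} (from {before} to {after})")
```

On the excerpt above, the longest silence runs from the last download line on 10 Jan to the client restart on 11 Jan, so the shutdown happened somewhere in that window.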

Greg_BE Profile

Joined: May 30 06
Posts: 4818
ID: 85645
Credit: 1,852,264
RAC: 1,628
Message 50587 - Posted 11 Jan 2008 20:59:44 UTC

The suspensions because the user is active can be stopped by disabling this feature in your Rosetta@home profile. Go to computing preferences, find the second line from the top, 'suspend work while computer is in use', and change that to NO.

That should get rid of the "Suspending computation - user is active" / "Suspending network activity - user is active" messages.

Since your two tasks have not reported back to the project yet, there is nothing to see online about why they may or may not have crashed.
____________

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50592 - Posted 11 Jan 2008 21:21:53 UTC - in response to Message ID 50586.

The entire computer has crashed and switched off whenever the Rosetta 5.93 Beta has been running.

This same behavior had also happened sometime ago (I don't remember when).
-Michael

Michael, if your puter is shutting down, then some basic computer parameter is being triggered. Have you checked for dust buildup on your CPU heatsink, RAM, power supply unit, etc. lately? I believe what you are experiencing is heat related, or at least that's my "best guess". I can't think of anything in the SETI or Rosetta (or any other BOINC science app) that would cause a system shutdown.

When any application runs, it creates heat. Depending upon the application, some apps work your processor harder than others, but all try to get 100% use out of the processor. The more efficient an app becomes, the hotter the processor should get, as it's doing more in less time.

Anyway, I'd check the temps and/or dust accumulation inside your puter.

tony

Michael Matthews

Joined: Dec 12 05
Posts: 3
ID: 36095
Credit: 37,852
RAC: 0
Message 50593 - Posted 11 Jan 2008 21:34:56 UTC - in response to Message ID 50592.

The entire computer has crashed and switched off whenever the Rosetta 5.93 Beta has been running.

This same behavior had also happened sometime ago (I don't remember when).
-Michael

Michael, if your puter is shutting down, then some basic computer parameter is being triggered. Have you checked for dust buildup on your CPU heatsink, RAM, power supply unit, etc. lately? I believe what you are experiencing is heat related, or at least that's my "best guess". I can't think of anything in the SETI or Rosetta (or any other BOINC science app) that would cause a system shutdown.

When any application runs, it creates heat. Depending upon the application, some apps work your processor harder than others, but all try to get 100% use out of the processor. The more efficient an app becomes, the hotter the processor should get, as it's doing more in less time.

Anyway, I'd check the temps and/or dust accumulation inside your puter.

tony


The computer does not have any dust build up or fan problems. The shutting down problem only occurs with the Rosetta Beta 5.93 application and no other software (even ones with high CPU usage). The computer did not shutdown until the Rosetta Beta 5.93 was sent to my computer to run. As I stated before, the SETI@home application (version Enhanced 5.27) never causes this problem (it runs 80% of the time BOINC runs). The computer crashes only with Rosetta Beta 5.93.

-Michael

dcdc Profile

Joined: Nov 3 05
Posts: 1488
ID: 8948
Credit: 19,185,928
RAC: 13,098
Message 50594 - Posted 11 Jan 2008 21:43:31 UTC - in response to Message ID 50593.
Last modified: 11 Jan 2008 21:44:00 UTC


The computer does not have any dust build up or fan problems. The shutting down problem only occurs with the Rosetta Beta 5.93 application and no other software (even ones with high CPU usage). The computer did not shutdown until the Rosetta Beta 5.93 was sent to my computer to run. As I stated before, the SETI@home application (version Enhanced 5.27) never causes this problem (it runs 80% of the time BOINC runs). The computer crashes only with Rosetta Beta 5.93.

-Michael


do you have graphics/screensaver enabled?
____________

Luuklag

Joined: Sep 13 07
Posts: 262
ID: 205058
Credit: 4,171
RAC: 0
Message 50598 - Posted 12 Jan 2008 0:03:41 UTC

Argh, I'm getting angry.

Of the last 9 WUs, 7 errored out. 7!!!!
That's 77.78%.
Today 2 more WUs crashed, but I don't feel like posting links anymore. It's always the same stuff, sin and cos out of range. When are you guys going to fix this, or at least give me a reply?

Michael Matthews

Joined: Dec 12 05
Posts: 3
ID: 36095
Credit: 37,852
RAC: 0
Message 50599 - Posted 12 Jan 2008 0:15:45 UTC - in response to Message ID 50594.


The computer does not have any dust build up or fan problems. The shutting down problem only occurs with the Rosetta Beta 5.93 application and no other software (even ones with high CPU usage). The computer did not shutdown until the Rosetta Beta 5.93 was sent to my computer to run. As I stated before, the SETI@home application (version Enhanced 5.27) never causes this problem (it runs 80% of the time BOINC runs). The computer crashes only with Rosetta Beta 5.93.

-Michael


do you have graphics/screensaver enabled?


I only have the minimal graphics enabled for BOINC. All that is displayed is a graphic of the BOINC logo, the application that is running (Rosetta@home or SETI@home), the work unit name, and the percentage of the work unit completed so far. None of the 3D graphics is being used.

Rosetta@home Beta 5.93 crashed again this afternoon. I'm getting rid of it.


-Michael

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50611 - Posted 12 Jan 2008 12:47:39 UTC

It's been three days without any new watchdog errors.

Here's my scoreboard for 5.93.



Personally, I wonder what's different between my hosts and those of users like Luuklag who also has an AMD64 host but IS getting computation errors. I haven't seen one computation error yet, so something must be different.

Luuklag

Joined: Sep 13 07
Posts: 262
ID: 205058
Credit: 4,171
RAC: 0
Message 50618 - Posted 12 Jan 2008 18:15:06 UTC - in response to Message ID 50611.

Well, I guess it's the type of WU I ran: one type, but I only finished 1 successfully out of 7 of them or so. So I guess it's in the type of WU.


It's been three days without any new watchdog errors.

Here's my scoreboard for 5.93.



Personally, I wonder what's different between my hosts and those of users like Luuklag who also has an AMD64 host but IS getting computation errors. I haven't seen one computation error yet, so something must be different.

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50621 - Posted 12 Jan 2008 20:38:17 UTC - in response to Message ID 50618.
Last modified: 12 Jan 2008 20:58:03 UTC

Well, I guess it's the type of WU I ran: one type, but I only finished 1 successfully out of 7 of them or so. So I guess it's in the type of WU.


I took the liberty of running your host with my "Rosetta-Pal". Then I copied and color coded all the work from yours combined with all the work from my "windows" hosts. Then I sorted by WU name and weeded out work not of the same "Job type", so we'd be comparing apples with apples. You had windows xp, I had winxp. You had AMD64, I had AMD64. Etc, Etc.

Anyway, I found 4 instances where we did the same "job type", and you can see them below. I see that for the first job type you had many computation errors, but your host also did one of them successfully.

Your hosts are "Blue" when you had an error, and "Green" when you successfully completed one. Mine are various colors, so I added descriptions to the first column. My hosts can be discerned from the previous chart, with the exception of my wife's laptop "M3700", which is a "Mobile AMD64 3700" using Win XP (can't put Linux on that one....lol).

So, from what I see, it's probably NOT the job type/wus, or at least my hosts aren't having trouble with them.

I wonder what else it could be??



[edit] On the second set of WUs I noticed a very early return date on your WU, so I rechecked, and that computation error was with 5.90, whereas my hosts were using 5.93. Also, that one was not a computation error, but Invalid.

Also, look at the 'good' WU you returned (green text). It's the very next consecutive "task ID" and "Work unit ID" number from the previous one, which failed, so your own host managed to do one type that it had previously failed to do.[/edit]

Path7

Joined: Aug 25 07
Posts: 128
ID: 201002
Credit: 61,751
RAC: 0
Message 50622 - Posted 12 Jan 2008 21:43:10 UTC - in response to Message ID 50598.
Last modified: 12 Jan 2008 21:44:39 UTC

Argh, I'm getting angry.

Of the last 9 WUs, 7 errored out. 7!!!!
That's 77.78%.
Today 2 more WUs crashed, but I don't feel like posting links anymore. It's always the same stuff, sin and cos out of range. When are you guys going to fix this, or at least give me a reply?


Hi Luuklag,

I looked into your tasks and opened the task details of WU 132125634.
The Windows Runtime Debugger also shows:
ModLoad: 07280000 0000f000 C:\WINDOWS\system32\ATKOGL32.dll (6.14.10.138) (-exported- Symbols Loaded)
File Version : 6, 14, 10, 138
Company Name : ASUSTeK COMPUTER INC.
Product Name : ASUSTeK Computer Inc. AsusOGL
Product Version: 6, 14, 10, 138

ModLoad: 69500000 00574000 C:\WINDOWS\system32\nvoglnt.dll (6.14.10.9147) (-exported- Symbols Loaded)
File Version : 6.14.10.9147
Company Name : NVIDIA Corporation
Product Name : NVIDIA Compatible OpenGL ICD
Product Version: 6.14.10.9147
Those two, ATKOGL32.dll and nvoglnt.dll, both look like graphics card drivers. However, one of the drivers might as well be from some add-on software.
Perhaps you changed your graphics card and left an old driver?

I hope this information is useful to you.
Path7.

dcdc Profile

Joined: Nov 3 05
Posts: 1488
ID: 8948
Credit: 19,185,928
RAC: 13,098
Message 50623 - Posted 12 Jan 2008 21:46:34 UTC

i also had a quick look and this:

[01/08/08 21:42:17] TRACE [3172]: Retrieved the required window station

[01/08/08 21:42:17] TRACE [3172]: Retrieved the required desktop

[01/08/08 21:47:11] TRACE [3172]: Retrieved the required window station

[01/08/08 21:47:11] TRACE [3172]: Retrieved the required desktop

i would presume is a graphics issue, which would support Path7's detective work ;)
____________

Luuklag

Joined: Sep 13 07
Posts: 262
ID: 205058
Credit: 4,171
RAC: 0
Message 50634 - Posted 13 Jan 2008 13:29:54 UTC - in response to Message ID 50622.

Argh, I'm getting angry.

Of the last 9 WUs, 7 errored out. 7!!!!
That's 77.78%.
Today 2 more WUs crashed, but I don't feel like posting links anymore. It's always the same stuff, sin and cos out of range. When are you guys going to fix this, or at least give me a reply?


Hi Luuklag,

I looked into your tasks and opened the task details of WU 132125634.
The Windows Runtime Debugger also shows:
ModLoad: 07280000 0000f000 C:\WINDOWS\system32\ATKOGL32.dll (6.14.10.138) (-exported- Symbols Loaded)
File Version : 6, 14, 10, 138
Company Name : ASUSTeK COMPUTER INC.
Product Name : ASUSTeK Computer Inc. AsusOGL
Product Version: 6, 14, 10, 138

ModLoad: 69500000 00574000 C:\WINDOWS\system32\nvoglnt.dll (6.14.10.9147) (-exported- Symbols Loaded)
File Version : 6.14.10.9147
Company Name : NVIDIA Corporation
Product Name : NVIDIA Compatible OpenGL ICD
Product Version: 6.14.10.9147
Those two, ATKOGL32.dll and nvoglnt.dll, both look like graphics card drivers. However, one of the drivers might as well be from some add-on software.
Perhaps you changed your graphics card and left an old driver?

I hope this information is useful to you.
Path7.


Yes, I got a new card about 2 months ago, same manufacturer, because my card was recalled over cooling issues: it made enormous noise because the fan bearings broke down. I just installed the new drivers, IMHO just an update of the drivers, so I don't think there is a problem with that, because I can do everything, like play UT3 on high.

Barraud Denis Profile

Joined: May 8 06
Posts: 6
ID: 81508
Credit: 1,258,677
RAC: 0
Message 50643 - Posted 13 Jan 2008 15:13:57 UTC

Rosetta failed and completely blocked BOINC on my Q6600, so I have stopped this project to protect my other WUs running on BOINC. The BOINC Manager stays in memory but is not running, and no WU can work. Even with BOINC and all projects completely reinstalled after a reboot, Rosetta fails again and blocks BOINC.

The only way I found to recover BOINC was to kill the BOINC Manager, restart it, and quickly remove the Rosetta project before it loaded a new WU.

I think Rosetta must be changed to detach more cleanly from BOINC when it fails with an error, to prevent BOINC from freezing.

The information I have from the Event Viewer:

Type de l'événement : Erreur
Source de l'événement : Application Error
Catégorie de l'événement : Aucun
ID de l'événement : 1000
Date : 13/01/2008
Heure : 15:05:54
Utilisateur : N/A
Ordinateur : C2Q1
Description :
Application défaillante minirosetta_1.03_windows_intelx86.exe, version 0.0.0.0, module défaillant minirosetta_1.03_windows_intelx86.exe, version 0.0.0.0, adresse de défaillance 0x0027e8c2.

Pour plus d'informations, consultez le centre Aide et support à l'adresse http://go.microsoft.com/fwlink/events.asp.
Données :
0000: 41 70 70 6c 69 63 61 74 Applicat
0008: 69 6f 6e 20 46 61 69 6c ion Fail
0010: 75 72 65 20 20 6d 69 6e ure min
0018: 69 72 6f 73 65 74 74 61 irosetta
0020: 5f 31 2e 30 33 5f 77 69 _1.03_wi
0028: 6e 64 6f 77 73 5f 69 6e ndows_in
0030: 74 65 6c 78 38 36 2e 65 telx86.e
0038: 78 65 20 30 2e 30 2e 30 xe 0.0.0
0040: 2e 30 20 69 6e 20 6d 69 .0 in mi
0048: 6e 69 72 6f 73 65 74 74 nirosett
0050: 61 5f 31 2e 30 33 5f 77 a_1.03_w
0058: 69 6e 64 6f 77 73 5f 69 indows_i
0060: 6e 74 65 6c 78 38 36 2e ntelx86.
0068: 65 78 65 20 30 2e 30 2e exe 0.0.
0070: 30 2e 30 20 61 74 20 6f 0.0 at o
0078: 66 66 73 65 74 20 30 30 ffset 00
0080: 32 37 65 38 63 32 0d 0a 27e8c2..

____________

Luuklag

Joined: Sep 13 07
Posts: 262
ID: 205058
Credit: 4,171
RAC: 0
Message 50651 - Posted 13 Jan 2008 16:55:17 UTC - in response to Message ID 50643.

Can anyone please translate it into English?


Rosetta failed and completely blocked BOINC on my Q6600, so I have stopped this project to protect my other WUs running on BOINC. The BOINC Manager stays in memory but is not running, and no WU can work. Even with BOINC and all projects completely reinstalled after a reboot, Rosetta fails again and blocks BOINC.

The only way I found to recover BOINC was to kill the BOINC Manager, restart it, and quickly remove the Rosetta project before it loaded a new WU.

I think Rosetta must be changed to detach more cleanly from BOINC when it fails with an error, to prevent BOINC from freezing.

The information I have from the Event Viewer:

Type de l'événement : Erreur
Source de l'événement : Application Error
Catégorie de l'événement : Aucun
ID de l'événement : 1000
Date : 13/01/2008
Heure : 15:05:54
Utilisateur : N/A
Ordinateur : C2Q1
Description :
Application défaillante minirosetta_1.03_windows_intelx86.exe, version 0.0.0.0, module défaillant minirosetta_1.03_windows_intelx86.exe, version 0.0.0.0, adresse de défaillance 0x0027e8c2.

Pour plus d'informations, consultez le centre Aide et support à l'adresse http://go.microsoft.com/fwlink/events.asp.
Données :
0000: 41 70 70 6c 69 63 61 74 Applicat
0008: 69 6f 6e 20 46 61 69 6c ion Fail
0010: 75 72 65 20 20 6d 69 6e ure min
0018: 69 72 6f 73 65 74 74 61 irosetta
0020: 5f 31 2e 30 33 5f 77 69 _1.03_wi
0028: 6e 64 6f 77 73 5f 69 6e ndows_in
0030: 74 65 6c 78 38 36 2e 65 telx86.e
0038: 78 65 20 30 2e 30 2e 30 xe 0.0.0
0040: 2e 30 20 69 6e 20 6d 69 .0 in mi
0048: 6e 69 72 6f 73 65 74 74 nirosett
0050: 61 5f 31 2e 30 33 5f 77 a_1.03_w
0058: 69 6e 64 6f 77 73 5f 69 indows_i
0060: 6e 74 65 6c 78 38 36 2e ntelx86.
0068: 65 78 65 20 30 2e 30 2e exe 0.0.
0070: 30 2e 30 20 61 74 20 6f 0.0 at o
0078: 66 66 73 65 74 20 30 30 ffset 00
0080: 32 37 65 38 63 32 0d 0a 27e8c2..

Greg_BE Profile

Joined: May 30 06
Posts: 4818
ID: 85645
Credit: 1,852,264
RAC: 1,628
Message 50661 - Posted 13 Jan 2008 18:26:21 UTC - in response to Message ID 50651.
Last modified: 13 Jan 2008 18:27:04 UTC

See the English stuff in ( ).

anyone please translate it into english...


Rosetta failed and completely blocked BOINC on my Q6600, so I have stopped this project to protect my other WUs running on BOINC. The BOINC Manager stays in memory but is not running, and no WU can work. Even with BOINC and all projects completely reinstalled after a reboot, Rosetta fails again and blocks BOINC.

The only way I found to recover BOINC was to kill the BOINC Manager, restart it, and quickly remove the Rosetta project before it loaded a new WU.

I think Rosetta must be changed to detach more cleanly from BOINC when it fails with an error, to prevent BOINC from freezing.

The information I have from the Event Viewer:

Type de l'événement : Erreur - type of event: error
Source de l'événement : Application Error - source of event
Catégorie de l'événement : Aucun - category of event: none
ID de l'événement : 1000 - ID of the event
Date : 13/01/2008
Heure : 15:05:54
Utilisateur : N/A - user is N/A
Ordinateur : C2Q1 - computer (id or name?)
Description :
Application défaillante (failing application) minirosetta_1.03_windows_intelx86.exe, version 0.0.0.0, module défaillant (failing module) minirosetta_1.03_windows_intelx86.exe, version 0.0.0.0, adresse de défaillance (address of failure) 0x0027e8c2.

Pour plus d'informations, consultez le centre Aide et support à l'adresse http://go.microsoft.com/fwlink/events.asp. (the usual sentence about where to find more information: visit .....)
Données (data):
0000: 41 70 70 6c 69 63 61 74 Applicat
0008: 69 6f 6e 20 46 61 69 6c ion Fail
0010: 75 72 65 20 20 6d 69 6e ure min
0018: 69 72 6f 73 65 74 74 61 irosetta
0020: 5f 31 2e 30 33 5f 77 69 _1.03_wi
0028: 6e 64 6f 77 73 5f 69 6e ndows_in
0030: 74 65 6c 78 38 36 2e 65 telx86.e
0038: 78 65 20 30 2e 30 2e 30 xe 0.0.0
0040: 2e 30 20 69 6e 20 6d 69 .0 in mi
0048: 6e 69 72 6f 73 65 74 74 nirosett
0050: 61 5f 31 2e 30 33 5f 77 a_1.03_w
0058: 69 6e 64 6f 77 73 5f 69 indows_i
0060: 6e 74 65 6c 78 38 36 2e ntelx86.
0068: 65 78 65 20 30 2e 30 2e exe 0.0.
0070: 30 2e 30 20 61 74 20 6f 0.0 at o
0078: 66 66 73 65 74 20 30 30 ffset 00
0080: 32 37 65 38 63 32 0d 0a 27e8c2..

Application failure: minirosetta_1.03_windows_intelx86.exe 0.0.0.0 in minirosetta_1.03_windows_intelx86.exe 0.0.0.0 at offset 0027e8c2

(used http://babelfish.altavista.com/tr for the translation of the text)
I am not a French expert.
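The "Données (data)" rows above are a plain hex dump, so the message can be recovered without any translation. A minimal Python sketch follows; the function name is made up for illustration, and it assumes each row holds an offset, a colon, and then exactly 8 hex bytes before the ASCII column, as in the dump above.

```python
def decode_event_data(dump, bytes_per_row=8):
    """Decode an event-log hex dump (offset: 8 hex bytes, then ASCII) to text."""
    out = bytearray()
    for line in dump.splitlines():
        _, sep, rest = line.partition(":")
        if not sep:
            continue  # not a dump row
        # Only the first 8 tokens are bytes; the trailing ASCII column
        # can itself look like hex (e.g. "00"), so it must be ignored.
        for token in rest.split()[:bytes_per_row]:
            out.append(int(token, 16))
    return out.decode("ascii", errors="replace")

# Three rows copied from the dump above
rows = """0000: 41 70 70 6c 69 63 61 74 Applicat
0008: 69 6f 6e 20 46 61 69 6c ion Fail
0078: 66 66 73 65 74 20 30 30 ffset 00"""
print(decode_event_data(rows))
```

Run over the full dump, this gives the same "Application Failure ... at offset 0027e8c2" line, followed by a carriage return and line feed (the final 0d 0a bytes).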


____________

Ingemar

Joined: Feb 28 06
Posts: 20
ID: 61985
Credit: 1,680
RAC: 0
Message 50671 - Posted 14 Jan 2008 0:32:17 UTC - in response to Message ID 50598.

Argh, I'm getting angry.

Of the last 9 WUs, 7 errored out. 7!!!!
That's 77.78%.
Today 2 more WUs crashed, but I don't feel like posting links anymore. It's always the same stuff, sin and cos out of range. When are you guys going to fix this, or at least give me a reply?


Hi Luuklag,

The overall error rates of the WUs that are crashing for you are much lower than what you observe (around 2-5%). You may be unlucky; on the other hand, the crashes are caused by the same problem (the cosine error) and not only for one type of WU, so we need to fix that. We are looking into this problem to find the bug.
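
As a rough check on the "unlucky" explanation, here is a small Python sketch of the binomial tail probability. It assumes every WU fails independently at the project-wide rate, which is an assumption made for illustration and not something stated in this thread; the function name is also made up.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(at least k failures in n independent trials, each failing with probability p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 7 or more failures out of 9 WUs, at the 2% and 5% overall error rates quoted above
for rate in (0.02, 0.05):
    print(f"error rate {rate:.0%}: P(>= 7 of 9 fail) = {prob_at_least(7, 9, rate):.2e}")
```

Under that independence assumption, 7 failures in 9 would be extremely rare even at the 5% rate, which suggests something specific to the host or to that batch of WUs rather than bad luck alone.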
____________

P . P . L .

Joined: Aug 20 06
Posts: 551
ID: 105843
Credit: 3,089,054
RAC: 2,001
Message 50678 - Posted 14 Jan 2008 7:09:04 UTC

Just returned this task. It is marked as valid, but has this in the result file.

fyi

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=121084932

5croA_BOINC_ABRELAX_VF_IGNORE_THE_REST-S25-18-S3-11--5croA-vf__2597_848_0

sin_cos_range ERROR: 1.2851869 is outside of [-1,+1] sin and cos value legal range
sin_cos_range ERROR: 1.2833332 is outside of [-1,+1] sin and cos value legal range

pete.
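
For readers wondering what a check like sin_cos_range guards against: inverse trig functions are only defined on [-1, +1], so a value such as 1.2851869 cannot be a valid sine or cosine. The Python sketch below is not Rosetta's code; the function name and tolerance are hypothetical. It shows one common pattern: clamp tiny floating-point overshoot, but refuse values that are far out of range.

```python
import math

def checked_acos(x, tol=1e-9):
    """acos that forgives rounding overshoot but rejects clearly invalid input.

    tol is a hypothetical tolerance chosen for illustration.
    """
    if abs(x) > 1.0 + tol:
        raise ValueError(f"{x} is outside of [-1,+1] sin and cos value legal range")
    # Clamp values like 1.0000000000001 that come from rounding error
    return math.acos(max(-1.0, min(1.0, x)))

print(checked_acos(1.0 + 1e-12))  # rounding overshoot is clamped to 1.0

try:
    checked_acos(1.2851869)  # value taken from the result file above
except ValueError as err:
    print("rejected:", err)
```

Values as large as 1.28 are well beyond rounding error, so silently clamping them would most likely hide a real upstream bug rather than fix it.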

____________


Yeti

Joined: Nov 2 05
Posts: 45
ID: 8304
Credit: 456,148
RAC: 0
Message 50679 - Posted 14 Jan 2008 11:24:13 UTC

Here is a 5.93er WU that errored with Exit status -1073741819 (0xc0000005)

http://boinc.bakerlab.org/rosetta/result.php?resultid=133082830

The box is a Double-Quad-Xeon, running 2003 Server 64 Bit with 8 GB memory
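
The two numbers in that exit status are the same 32-bit value shown as signed and unsigned. A short Python sketch of the conversion (helper names are made up for illustration):

```python
def to_unsigned32(status):
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return status & 0xFFFFFFFF

def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed exit status."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value

status = -1073741819  # exit status reported for the failed WU
print(hex(to_unsigned32(status)))   # prints 0xc0000005
print(to_signed32(0xC0000005))      # prints -1073741819
```

0xC0000005 is the Windows status code for an access violation, roughly the Windows counterpart of the SIGSEGV reported in Linux results.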


____________


Supporting BOINC, a great concept !

Yeti

Joined: Nov 2 05
Posts: 45
ID: 8304
Credit: 456,148
RAC: 0
Message 50680 - Posted 14 Jan 2008 11:28:03 UTC

And one word from me:

Please discuss things like Rosetta versus Ralph in a different thread. I restarted crunching Rosetta with 5.93 and was looking for anything relevant about errors with 5.93, but I had to read through all of your discussion.

Yes, the content of this discussion is okay, but for me this thread is definitely the wrong place for it.
____________


Supporting BOINC, a great concept !

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50681 - Posted 14 Jan 2008 13:16:53 UTC
Last modified: 14 Jan 2008 13:19:21 UTC

I finally got a computation error, and strangely enough, I woke to find one WU stuck at 100%, with gkrellm showing 0% CPU use for that core. I have suspended and resumed that WU and now wait for it to run again. The "stuck one" is 1zpy__BOINC_DEFAULT_SYMM_FOLD_AND_DOCK-1zpy_native_2_2519_22709_0. The one which has already reported as a computation error is resultid=133308819 1zpy__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK_RELAX-1zpy_-native__2477_294683_0 and shows:

<core_client_version>5.10.21</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
# random seed: 3191248
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -66.1132 for 900 seconds
**********************************************************************
GZIP SILENT FILE: ./xx1zpy.out
SIGSEGV: segmentation violation
Stack trace (22 frames):
[0x8da3037]
[0x8d9de2c]
[0xffffe500]
[0x89a1824]
[0x804c828]
[0x8a8ae99]
[0x8a8babf]
[0x8d0c170]
[0x8c12abe]
[0x8c14e33]
[0x804c7c2]
[0x8a835ed]
[0x8a8586f]
[0x89363de]
[0x89380e3]
[0x893ba27]
[0x898ad7a]
[0x85e96d6]
[0x87289d2]
[0x8728af2]
[0x8e07384]
[0x8048111]

Exiting...

so, it looks like I'm going to have two computation errors for my AMD64 X2 5200 under Linux
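
The "Stuck at score ... for 900 seconds" message in that output suggests a watchdog that ends the run when the score stops changing. The Python sketch below is only a guess at the general idea, not the actual Rosetta watchdog; the class and method names are hypothetical, and the 900-second limit is taken from the message.

```python
class ScoreWatchdog:
    """Flag a run as stuck when its score has not changed for limit_seconds."""

    def __init__(self, limit_seconds=900.0):
        self.limit = limit_seconds
        self.last_score = None
        self.last_change = None

    def update(self, score, now):
        """Record a score sampled at time `now` (seconds); return True if stuck."""
        if self.last_score is None or score != self.last_score:
            # Score moved (or first sample): restart the clock
            self.last_score = score
            self.last_change = now
            return False
        return now - self.last_change >= self.limit

wd = ScoreWatchdog()
for t in (0, 300, 600, 900):
    if wd.update(-66.1132, t):
        print(f"Stuck at score {wd.last_score} for {t - wd.last_change:.0f} seconds")
```

In the output above the trouble starts after the watchdog fires: the segmentation violation appears during shutdown, right after the silent file is written.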

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50684 - Posted 14 Jan 2008 15:21:00 UTC
Last modified: 14 Jan 2008 15:28:20 UTC

too late to edit.

The second one, which was stuck, remained stuck after the work scheduler got back around to it. I ended up exiting the manager, opening Konsole, and killing BOINC. I then restarted and opened the manager. The result showed "ready to report", so it must have uploaded before the manager displayed it.

Anyway, it was considered "Valid" and was granted credit like this never even happened. It's resultid=133326615
which shows:

<core_client_version>5.10.21</core_client_version>
<![CDATA[
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
# random seed: 3623102
======================================================
DONE :: 1 starting structures 9911.7 cpu seconds
This process generated 6 decoys from 6 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
]]>

Which seems completely uneventful to me, but I know it stuck. Leaving my host only using one core for who knows how long.

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50687 - Posted 14 Jan 2008 20:17:38 UTC

oops. linked to the wrong work unit for the stuck one. It was really, resultid=133258619 which showed this.

<core_client_version>5.10.21</core_client_version>
<![CDATA[
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
# random seed: 3630287
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -84.1725 for 900 seconds
**********************************************************************
GZIP SILENT FILE: ./xx1zpy.out
SIGSEGV: segmentation violation
Stack trace (21 frames):
[0x8da3037]
[0x8d9de2c]
[0xffffe500]
[0x8e2a1b9]
[0x8df8727]
[0x8dfaba1]
[0x8cb4a2c]
[0x8c1179b]
[0x8c14e33]
[0x804c7c2]
[0x8a835ed]
[0x8a8586f]
[0x89363de]
[0x893822e]
[0x893ba27]
[0x898ad7a]
[0x85e96d6]
[0x87289d2]
[0x8728af2]
[0x8e07384]
[0x8048111]

Exiting...
No heartbeat from core client for 31 sec - exiting
FILE_LOCK::unlock(): close failed.: Bad file descriptor
Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -82.6613 for 900 seconds
**********************************************************************
GZIP SILENT FILE: ./xx1zpy.out
SIGSEGV: segmentation violation
Stack trace (22 frames):
[0x8da3037]
[0x8d9de2c]
[0xffffe500]
[0x89a1824]
[0x804c828]
[0x8a8ae99]
[0x8a8babf]
[0x8d0c170]
[0x8c12abe]
[0x8c14e33]
[0x804c7c2]
[0x8a835ed]
[0x8a8586f]
[0x89363de]
[0x893822e]
[0x893ba27]
[0x898ad7a]
[0x85e96d6]
[0x87289d2]
[0x8728af2]
[0x8e07384]
[0x8048111]

Exiting...
SIGSEGV: segmentation violation
SIGABRT: abort called
[insert] about 200 more of the "abort called", but I snipped it for brevity
SIGABRT: abort called

</stderr_txt>
]]>

AdeB Profile

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 1,301,330
RAC: 676
Message 50689 - Posted 14 Jan 2008 20:56:52 UTC

resultid 133097235 had some problems, but is valid after all - strange.

<core_client_version>5.8.15</core_client_version>
<![CDATA[
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
# random seed: 3031158
SIGSEGV: segmentation violation
Stack trace (12 frames):
[0x8da3037]
[0x8d9de2c]
[0xffffe420]
[0x8e28653]
[0x8df90a1]
[0x8dfaac9]
[0x83e8c0f]
[0x8e0e98f]
[0x8d9fab7]
[0x8da10d5]
[0x8d9a0c5]
[0x8e3aa1a]

Exiting...
SIGSEGV: segmentation violation
Stack trace (17 frames):
[0x8da3037]
[0x8d9de2c]
[0xffffe420]
[0x881d8ba]
[0x881f90a]
[0x88263b5]
[0x8827d6d]
[0x84fcf7a]
[0x84fd442]
[0x8b3e9c0]
[0x8b4134b]
[0x80d8efd]
[0x85eaa7e]
[0x8728a47]
[0x8728af2]
[0x8e07384]
[0x8048111]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
SIGSEGV: segmentation violation
Stack trace (19 frames):
[0x8da3037]
[0x8d9de2c]
[0xffffe420]
[0x850ea02]
[0x8c12f90]
[0x876ba6c]
[0x876c3fe]
[0x87703bb]
[0x878176f]
[0x8787179]
[0x8cf4461]
[0x8b3e9dc]
[0x8b4134b]
[0x80d8efd]
[0x85eaa7e]
[0x8728a47]
[0x8728af2]
[0x8e07384]
[0x8048111]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
======================================================
DONE :: 1 starting structures 10809.5 cpu seconds
This process generated 8 decoys from 8 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
]]>
____________

hedera

Joined: Jul 15 06
Posts: 66
ID: 100139
Credit: 1,109,835
RAC: 1,164
Message 50691 - Posted 14 Jan 2008 23:48:23 UTC

5.93 is eating my Windows machine alive. I tried to do something this afternoon and the box was so hung it was barely responding. Here's my system, from the opening log:

01/14/2008 8:11:54 AM||Starting BOINC client version 5.10.20 for windows_intelx86
01/14/2008 8:11:54 AM||log flags: task, file_xfer, sched_ops
01/14/2008 8:11:54 AM||Libraries: libcurl/7.16.4 OpenSSL/0.9.8e zlib/1.2.3
01/14/2008 8:11:54 AM||Data directory: C:\Program Files\BOINC
01/14/2008 8:11:56 AM||Processor: 2 GenuineIntel Intel(R) Pentium(R) 4 CPU 3.20GHz [x86 Family 15 Model 4 Stepping 1]
01/14/2008 8:11:56 AM||Processor features: fpu tsc pae nx sse sse2 mmx
01/14/2008 8:11:57 AM||OS: Microsoft Windows XP: Professional Edition, Service Pack 2, (05.01.2600.00)
01/14/2008 8:11:57 AM||Memory: 1022.09 MB physical, 2.40 GB virtual
01/14/2008 8:11:57 AM||Disk: 145.27 GB total, 106.66 GB free
01/14/2008 8:11:57 AM||Local time is UTC -8 hours

In mid-afternoon (around 3:30 PM local), first of all I had three WUs running at once; and when I looked at the task manager I saw that they were using a whole lot of memory:

319,896K
258,352K
34,636K

That's 612,884K, just for Rosetta! Add to this the fact that ZoneAlarm Internet Security (which I recently installed to replace Norton) was running some kind of update, and I could barely get the mouse to respond. I suspended Rosetta temporarily so I could post this and let ZA finish whatever it was doing. (I'll be discussing this with them.)

I've been running on the assumption that my computing preferences, which are pretty standard, would give me 2 WUs using, between them, 98-100% of CPU but NOT this much memory! Is there some tweak I should do to my settings? Should I expect to be running 3 or even 4 WUs at a time?


____________
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50693 - Posted 15 Jan 2008 3:56:55 UTC

hedera, you are correct to expect only 2 tasks running at a time to be normal on that machine.

You can control the amount of memory you wish to allow BOINC to use for the WUs it is currently running. This is in the General Preferences, or the local preferences for each machine.
____________
Rosetta Moderator: Mod.Sense

hedera

Joined: Jul 15 06
Posts: 66
ID: 100139
Credit: 1,109,835
RAC: 1,164
Message 50695 - Posted 15 Jan 2008 5:07:51 UTC

OK, my current memory preferences are:

50% when computer is in use
90% when computer isn't in use

How would you advise me to trim that to keep 2 and only 2 WUs running? As far as I could tell from the BOINC manager console, when one of the WUs got above 90% (maybe above 95%), it began using enough less memory that Rosetta could launch another WU... I didn't see 3 WUs working unless at least one of them was in the high 90% completed range.
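As a rough check on the numbers in this exchange, here is a small Python sketch that adds up the three Task Manager figures from the earlier post and compares the total with the 50% "in use" memory preference. It assumes the limit is simply a percentage of the 1022.09 MB of physical memory shown in the startup log, and that Task Manager's K means 1024 bytes; how BOINC actually measures and enforces the limit may well differ.

```python
# Figures are taken from the posts above; the comparison rule is an assumption.
TASK_MEMORY_KB = [319_896, 258_352, 34_636]  # Task Manager readings for the 3 WUs
PHYSICAL_MB = 1022.09                         # from the BOINC startup log
IN_USE_LIMIT_PCT = 50                         # "when computer is in use" preference


def total_mb(sizes_kb):
    """Sum per-task memory in KB and convert to MB (1 MB = 1024 KB)."""
    return sum(sizes_kb) / 1024


def limit_mb(physical_mb, percent):
    """Memory allowed under a simple percentage-of-RAM rule (assumed)."""
    return physical_mb * percent / 100


used = total_mb(TASK_MEMORY_KB)
allowed = limit_mb(PHYSICAL_MB, IN_USE_LIMIT_PCT)
print(f"Rosetta tasks: {used:.1f} MB, allowed while in use: {allowed:.1f} MB")
print("over the limit" if used > allowed else "within the limit")
```

Under that assumption the three tasks together come to roughly 598 MB against about 511 MB allowed while the computer is in use.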
____________
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

Ananas

Joined: Jan 1 06
Posts: 224
ID: 45336
Credit: 357,493
RAC: 438
Message 50700 - Posted 15 Jan 2008 7:43:00 UTC
Last modified: 15 Jan 2008 8:15:33 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=133324121

<message>Maximum disk usage exceeded
</message>

NTFS partition with 4GB (not compressed)
2.9GB free
800MB used by BOINC (total directory size, this includes 2 paused climate models)

BOINC is allowed 8GB or 100% and asked to leave 0.01GB free

which means that the Rosetta WU must have used ~2.8 GB when it crashed?

Or do Rosetta WUs come with a builtin disk usage limit different from the BOINC limit?

p.s.: The root path name is just d:\BOINC so MAX_FNAME should not play a role, even though it is weird to include all that information in the filename.
Afaik. MAX_FNAME (and PATH_MAX) is 256/256 on NTFS and 128/143 on DOS, so the Rosetta filename (~150 characters including the path) would have violated the DOS pathname length but should still work under Win32

Viking69

Joined: Oct 3 05
Posts: 17
ID: 2393
Credit: 1,067,734
RAC: 232
Message 50704 - Posted 15 Jan 2008 9:09:08 UTC
Last modified: 15 Jan 2008 9:17:05 UTC

Windows Vista: I couldn't get my BOINC manager to come up after I was away for 3 days. The PC is on 24/7.


I tried restarting the service (I always run BOINC as a service) and had no luck. I logged off with no change. I downloaded and installed 5.10.35 (I was on 5.10.30) and still no luck. I looked into the slots folder and saw that I had 4 that were Rosetta, but the folder said 'mini'. I deleted the slots folder with the service stopped (it prevented me from doing that with the service running) and I was then able to see the tasks board. The service is currently stopped so I can write down what I had in the queue: (3) 1zpy files and (1) BAKavsc3 file.

These are the only WUs that I have for Rosetta on my Vista box.
I will be starting the service as soon as I post this to see what happens.

**update**
After starting the service for BOINC again, 3 of the Rosettas uploaded and a 4th is currently processing. It is a 1zpy file. I seem to have gotten credit for the reported WU's, so they did finish without error.

M.L.

Joined: Nov 21 06
Posts: 182
ID: 130574
Credit: 180,462
RAC: 0
Message 50710 - Posted 15 Jan 2008 16:10:13 UTC

Task ID 133439620
Name 1zpy__BOINC_DEFAULT_SYMM_FOLD_AND_DOCK-1zpy_-native__2519_34438_0
Workunit 121403622
Created 14 Jan 2008 11:42:55 UTC
Sent 14 Jan 2008 11:43:40 UTC
Received 15 Jan 2008 14:07:22 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 717897
Report deadline 24 Jan 2008 11:43:40 UTC
CPU time 6261.875
stderr out <core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 3628558
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -96.4799 for 900 seconds
**********************************************************************
GZIP SILENT FILE: .\xx1zpy.out

</stderr_txt>
]]>


Validate state Valid
Claimed credit 25.9837414273405
Granted credit 20
application version 5.93






transient

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 4,820,543
RAC: 2,349
Message 50711 - Posted 15 Jan 2008 16:42:17 UTC - in response to Message ID 50710.

Task ID 133439620
Name 1zpy__BOINC_DEFAULT_SYMM_FOLD_AND_DOCK-1zpy_-native__2519_34438_0
Workunit 121403622


Ditto

http://boinc.bakerlab.org/rosetta/result.php?resultid=133231124
____________

KWSN THE Holy Hand Grenade!

Joined: May 3 07
Posts: 5
ID: 172695
Credit: 532,162
RAC: 1,005
Message 50716 - Posted 15 Jan 2008 18:46:37 UTC

Is anyone else getting compute errors like this? (5.93, on Win XP Pro x64 and, on a different machine, Win XP Home)

Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 3348287
# cpu_run_time_pref: 14400
ERROR:: Exit from: .\fullatom_energy.cc line: 2128


I've had about 8 WU's fail for this reason...
____________

NickHan

Joined: Jul 2 07
Posts: 4
ID: 187731
Credit: 108,170
RAC: 0
Message 50718 - Posted 15 Jan 2008 20:16:32 UTC

Version 5.93 reached 96% plus on a WU showing 10 mins to go. An hour later: 97% and 10 mins to go. I stopped BOINC and restarted; the WU came up at 97%, and when computation restarted it reset to zero with 6 hours 20 remaining! Sigh. Any ideas?

Knorr

Joined: Feb 18 06
Posts: 21
ID: 59849
Credit: 50,464
RAC: 0
Message 50719 - Posted 15 Jan 2008 20:19:33 UTC

Had an invalid result

http://boinc.bakerlab.org/result.php?resultid=133554789

The watchdog didn't end the run at first.
It ran for more than 4 hrs, with a setting of 2 hrs.
I suspended the task, and then resumed it a bit later, and the task ended itself.

- Knorr
____________

Luuklag

Joined: Sep 13 07
Posts: 262
ID: 205058
Credit: 4,171
RAC: 0
Message 50721 - Posted 15 Jan 2008 20:27:49 UTC

I don't have much time to post these days; school is asking too much from me at the moment, with too many things to finish. But I'm still having errors, a great many of them: 1 or 2 days ago, 4 or 5 WUs in a row, and some triggered the watchdog. Thanks for letting me know that the sin/cosine thing is fairly common and that you're looking into it. A few more of these small posts will really boost morale.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50727 - Posted 15 Jan 2008 22:16:32 UTC - in response to Message ID 50718.
Last modified: 15 Jan 2008 22:20:25 UTC

Version 5.93 reached 96% plus on a WU showing 10 mins to go. An hour later: 97% and 10 mins to go. I stopped BOINC and restarted; the WU came up at 97%, and when computation restarted it reset to zero with 6 hours 20 remaining! Sigh. Any ideas?


Ideas? Yes, don't stop BOINC. Seriously.

The fact that your % complete reset to zero implies that no checkpoint was reached during the calculations. Some types of work are able to checkpoint very frequently, some are not.

The time to completion is an estimate, and not always a very accurate estimate. Some of the work they are sending out can take 5 or 6 hours to complete a single model (longer on a slower machine). This is especially true for the 1zpy's. If your preferred runtime is less than this, you will see an estimated time to completion of something under 10 minutes for any time over your preference. So if your preference is the default 3hrs for example, it will show 10min to complete, with exponentially small reductions in that time for the last 2 or 3 hours of the model.
____________
Rosetta Moderator: Mod.Sense

Ananas

Joined: Jan 1 06
Posts: 224
ID: 45336
Credit: 357,493
RAC: 438
Message 50730 - Posted 15 Jan 2008 23:34:15 UTC
Last modified: 15 Jan 2008 23:40:29 UTC

No watchdog thing yet but a candidate (mgth-3-1sg9_a_w012_MolecularReplacement_2482_77037) :

file "farlxcheck" last touched 2.5 hours ago (96.60%), the BOF looks like this :


286 LEU 67.29 165.85 0.00 0.00 chi_offsets
287 THR 58.79 60.00 0.00 0.00 chi_offsets
288 LEU 177.42 66.34 0.00 0.00 chi_offsets

the fraction of chi1 correct 133 246 0.54
the fraction of chi12 correct 41 200 0.20
the fraction of chi123 correct 3 74 0.04

Maybe this helps somehow.

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50774 - Posted 17 Jan 2008 17:51:05 UTC
Last modified: 17 Jan 2008 17:52:41 UTC

got another stuck one. See details in this post, except this time it restarted at 10 minutes instead of uploading immediately. Looks like I'm in "Babysitter mode" until this one finishes.

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50777 - Posted 17 Jan 2008 20:10:39 UTC
Last modified: 17 Jan 2008 20:11:12 UTC

That stuck WU which restarted is resultid=133551161, which ended itself on this go-around. It was valid and credited (but not for the first wasted 2 hours spent on it, plus however long it was stuck for).

The stderr says:

Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
# random seed: 3171268
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -113.019 for 900 seconds
**********************************************************************
GZIP SILENT FILE: ./xx1zpy.out
*** glibc detected *** corrupted double-linked list: 0x092683c0 ***
SIGABRT: abort called
Stack trace (18 frames):
[0x8da3037]
[0x8d9de2c]
[0xffffe500]
[0x8e0e444]
[0x8e2330f]
[0x8e27d01]
[0x8e28176]
[0x8e28653]
[0x8df90a1]
[0x8dfaac9]
[0x83c4cc5]
[0x8e0e98f]
[0x8d9fab7]
[0x8d9ff27]
[0x8d2023d]
[0x8d20f35]
[0x8d9a0c5]
[0x8e3aa1a]

Exiting...
No heartbeat from core client for 31 sec - exiting
FILE_LOCK::unlock(): close failed.: Bad file descriptor
Graphics are disabled due to configuration...
# cpu_run_time_pref: 10800
# random seed: 3171268
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -89.0742 for 900 seconds
**********************************************************************
GZIP SILENT FILE: ./xx1zpy.out
SIGSEGV: segmentation violation
Stack trace (22 frames):
[0x8da3037]
[0x8d9de2c]
[0xffffe500]
[0x89a1824]
[0x804c828]
[0x8a8ae99]
[0x8a8babf]
[0x8d0c170]
[0x8c12abe]
[0x8c14e33]
[0x804c7c2]
[0x8a835ed]
[0x8a8586f]
[0x89363de]
[0x89380e3]
[0x893ba27]
[0x898ad7a]
[0x85e96d6]
[0x87289d2]
[0x8728af2]
[0x8e07384]
[0x8048111]

Exiting...

</stderr_txt>
]]>

Hope something in all this ends with a fix at some point.

hedera

Joined: Jul 15 06
Posts: 66
ID: 100139
Credit: 1,109,835
RAC: 1,164
Message 50786 - Posted 17 Jan 2008 22:11:16 UTC

I notice 2 things today, which may simply mean I notice things slowly:

1. My system is running MUCH faster today. Yesterday I waited minutes for the screen to change.

2. BOINC is running Rosetta Beta 5.93.

I don't recall noticing that I had Rosetta Beta 5.93 before, am I just slow at noticing? Because it feels like something has changed. Was I simply running some very intensive WUs yesterday?? Today's memory usage is noticeably lower. Yesterday I was running these tasks:

http://boinc.bakerlab.org/rosetta/result.php?resultid=133780391
http://boinc.bakerlab.org/rosetta/result.php?resultid=133748830

I'm STILL running this task (it's about done), which has been going since sometime on the 15th:

http://boinc.bakerlab.org/rosetta/result.php?resultid=133728745

Are these tasks unusually complex or large??

____________
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50789 - Posted 17 Jan 2008 22:58:34 UTC

Your result 133748830 is a 1zpy. Yes, they take a long time to complete a single model. V5.93 has been out for some time. But depending on which WUs your machine is assigned, and how large a cache of work you keep, you may not have seen much work under v5.93 until now. But more likely you just hadn't noticed.

____________
Rosetta Moderator: Mod.Sense

Mike.Gibson

Joined: Nov 3 07
Posts: 19
ID: 217599
Credit: 189,254
RAC: 0
Message 50790 - Posted 18 Jan 2008 0:40:51 UTC - in response to Message ID 50727.

Thanks for this explanation. I had been dumping "stuck" 5.90s and was about to dump a "stuck" 5.93. As a result of your explanation, repeated below with the original question, I set a time of 10 hours in place of the default and lo & behold, after a while, the time to go shot up from 10 minutes to 5 hours meaning a total time of over 8 hours on a 3800+ dual-core with 1MB RAM! Also the progress dropped from 95% to about 35%. It is now going well.

Would it not be better to put out a message about the possible time increase and also to change the default from 3 hours to something more realistic? Presumably, this is only a few minutes work to do and it would solve all these problems.

Apart from anything else, BOINC Manager needs to know how long these units can take in order to assess what units to obtain and also for assessing priorities. If something is going to take 3 times the expected time, it could cause other units/projects to default on time limits.

Regards

Mike

Version 5.93 reached 96% plus on a WU showing 10 mins to go. An hour later: 97% and 10 mins to go. I stopped BOINC and restarted; the WU came up at 97%, and when computation restarted it reset to zero with 6 hours 20 remaining! Sigh. Any ideas?


Ideas? Yes, don't stop BOINC. Seriously.

The fact that your % complete reset to zero implies that no checkpoint was reached during the calculations. Some types of work are able to checkpoint very frequently, some are not.

The time to completion is an estimate, and not always a very accurate estimate. Some of the work they are sending out can take 5 or 6 hours to complete a single model (longer on a slower machine). This is especially true for the 1zpy's. If your preferred runtime is less than this, you will see an estimated time to completion of something under 10 minutes for any time over your preference. So if your preference is the default 3hrs for example, it will show 10min to complete, with exponentially small reductions in that time for the last 2 or 3 hours of the model.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50792 - Posted 18 Jan 2008 3:02:08 UTC

Mike, if everyone had the same time preference, and if all tasks had roughly the same time per model, what you say would certainly be done. But neither is the case. Some people want shorter times (and, yes, it would be nice if they never received a task that took longer than that, but it's not a perfect world). The mixture of work varies over time. The ratio of long to short model tasks varies. ...and you are correct, this can (and does) throw off the estimates and confuse BOINC about how much work to get.

The best way to get a fairly consistent and predictable completion time is to go to the 24hr maximum runtime preference. But, if your machine is only on 2 hours a day, it would take you more than 10 days to complete a task and it would never get returned before the deadline. ...there's always something. But if BOINC is running 24hrs a day anyway, then this will offer the most predictability for human, and BOINC.
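
The arithmetic behind the 2-hours-a-day example can be sketched in Python as follows. This is only an illustration: the 10-day deadline is an assumption taken from the report deadlines posted earlier in this thread (sent 14 Jan, due 24 Jan), and the function names are made up for the example. It also assumes every hour the machine is on goes to this one task.

```python
def days_to_finish(runtime_pref_hours, hours_on_per_day):
    """Calendar days needed to accumulate the preferred CPU time."""
    if hours_on_per_day <= 0:
        raise ValueError("machine must run for some time each day")
    return runtime_pref_hours / hours_on_per_day


def misses_deadline(runtime_pref_hours, hours_on_per_day, deadline_days=10):
    """True if the task cannot be returned within the (assumed) deadline."""
    return days_to_finish(runtime_pref_hours, hours_on_per_day) > deadline_days


# A 24hr preference on a machine that is on 2 hours a day needs 12 days.
print(days_to_finish(24, 2), misses_deadline(24, 2))
# The same preference on a machine running around the clock needs 1 day.
print(days_to_finish(24, 24), misses_deadline(24, 24))
```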
____________
Rosetta Moderator: Mod.Sense

Mike.Gibson

Joined: Nov 3 07
Posts: 19
ID: 217599
Credit: 189,254
RAC: 0
Message 50796 - Posted 18 Jan 2008 10:30:09 UTC - in response to Message ID 50792.

I see where you are coming from, but, if you take the 2 hours a day machine as an example, it will start the unit thinking it will finish within the deadline but when the 3 hours is up, a couple of days later, it then sticks on the 3 hours and no progress seems to be happening and the time will be wasted when the unit is eventually aborted or the deadline passes. It is far better for the true time to appear and then the unit can be aborted before it starts if the deadline cannot be met. That way another shorter unit can be run in its place, successfully.

Cheers

Mike

Mike, if everyone had the same time preference, and if all tasks had roughly the same time per model, what you say would certainly be done. But neither is the case. Some people want shorter times (and, yes, it would be nice if they never received a task that took longer than that, but it's not a perfect world). The mixture of work varies over time. The ratio of long to short model tasks varies. ...and you are correct, this can (and does) throw off the estimates and confuse BOINC about how much work to get.

The best way to get a fairly consistent and predictable completion time is to go to the 24hr maximum runtime preference. But, if your machine is only on 2 hours a day, it would take you more than 10 days to complete a task and it would never get returned before the deadline. ...there's always something. But if BOINC is running 24hrs a day anyway, then this will offer the most predictability for human, and BOINC.

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,061,841
RAC: 1,331
Message 50805 - Posted 18 Jan 2008 21:54:42 UTC

This WU (on one of my Linux machines): http://boinc.bakerlab.org/rosetta/result.php?resultid=133853424
was ended by the watchdog for 900 seconds of no progress.

Then it bombed out giving a stack trace.

Then it bombed again with another stack trace.

Then it hung, showing 100% done and about an hour of CPU in the manager. The time in the manager wasn't changing and no CPU was being used.

It's clear that Rosetta still has the bug where the watchdog can't terminate a WU on a Linux machine without crashing.

So I decided to kill -9 the Rosetta process.

Boinc showed a message saying the WU exited with zero status but no "finished" file. Boinc restarted the WU.

Then the WU completed normally, with a "successful" and "valid" result.

:p

Mike.Gibson

Joined: Nov 3 07
Posts: 19
ID: 217599
Credit: 189,254
RAC: 0
Message 50812 - Posted 19 Jan 2008 0:39:21 UTC

As a follow up to message 50796 etc, when my 1c26 unit approached the 10-hour preferred runtime, I increased the runtime to the maximum of 24 hours. As soon as it took effect, the progress % fell to 38. (i.e. CPU time /24)

Another 7 hours have gone by and the progress % is still based on CPU time/24.

Another consequence of increasing the runtime was that BOINC Manager woke up to the fact that I had 6 Rosetta units that were liable to miss their deadline and consequently commandeered both cores of my 3800+ dual-core machine for Rosetta at the expense of everything else. This brought a second Rosetta into play, an s099 unit, which now seems to be going along the same lines with 7 hours CPU time and 29% progress.
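
The "CPU time / 24" observation can be written as a one-line Python sketch. It only reproduces the behaviour described in this post; it is not taken from the Rosetta source, and the cap at 100% is an added assumption.

```python
def progress_percent(cpu_hours, runtime_pref_hours):
    """Progress shown as CPU time over the preferred runtime (observed behaviour)."""
    if runtime_pref_hours <= 0:
        raise ValueError("runtime preference must be positive")
    # Capping at 100 is an assumption for the sketch.
    return min(100.0, 100.0 * cpu_hours / runtime_pref_hours)


# 7 hours of CPU time against a 24-hour preference rounds to 29%.
print(round(progress_percent(7, 24)))
```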

Heaven help anyone with a PIII machine! They will never finish. Even I am wondering how many, if any, of my units will finish before the deadline of 23/1/08. I am not expecting them to finish within the 24 hours.

Does anyone know how long these will take, please?

Regards

Mike

P . P . L .

Joined: Aug 20 06
Posts: 551
ID: 105843
Credit: 3,089,054
RAC: 2,001
Message 50813 - Posted 19 Jan 2008 0:59:45 UTC - in response to Message ID 50812.

As a follow up to message 50796 etc, when my 1c26 unit approached the 10-hour preferred runtime, I increased the runtime to the maximum of 24 hours. As soon as it took effect, the progress % fell to 38. (i.e. CPU time /24)

Another 7 hours have gone by and the progress % is still based on CPU time/24.

Another consequence of increasing the runtime was that BOINC Manager woke up to the fact that I had 6 Rosetta units that were liable to miss their deadline and consequently commandeered both cores of my 3800+ dual-core machine for Rosetta at the expense of everything else. This brought a second Rosetta into play, an s099 unit, which now seems to be going along the same lines with 7 hours CPU time and 29% progress.

Heaven help anyone with a PIII machine! They will never finish. Even I am wondering how many, if any, of my units will finish before the deadline of 23/1/08. I am not expecting them to finish within the 24 hours.

Does anyone know how long these will take, please?

Regards

Mike


Hi Mike, since you changed your runtime to 24hrs, that's how long the tasks will take, give or take a few minutes depending on how many models your computer can do.

Pete.




____________


dcdc

Joined: Nov 3 05
Posts: 1488
ID: 8948
Credit: 19,185,928
RAC: 13,098
Message 50815 - Posted 19 Jan 2008 1:48:04 UTC - in response to Message ID 50812.

As a follow up to message 50796 etc, when my 1c26 unit approached the 10-hour preferred runtime, I increased the runtime to the maximum of 24 hours. As soon as it took effect, the progress % fell to 38. (i.e. CPU time /24)

Another 7 hours have gone by and the progress % is still based on CPU time/24.

Another consequence of increasing the runtime was that BOINC Manager woke up to the fact that I had 6 Rosetta units that were liable to miss their deadline and consequently commandeered both cores of my 3800+ dual-core machine for Rosetta at the expense of everything else. This brought a second Rosetta into play, an s099 unit, which now seems to be going along the same lines with 7 hours CPU time and 29% progress.

Heaven help anyone with a PIII machine! They will never finish. Even I am wondering how many, if any, of my units will finish before the deadline of 23/1/08. I am not expecting them to finish within the 24 hours.

Does anyone know how long these will take, please?

Regards

Mike

Mike - I think you misunderstand the run-time (or I misunderstand your post!). The runtime is not a time-out - it's the preferred run-time for each task. Each task consists of a number of decoys (models) and Rosetta will run as many as it can within the run-time you set. If you change this from 10hrs to 24 hrs then Rosetta will continue running models for 24hrs before calling the task complete and letting BOINC submit it.

If the task has run for over 10hrs and you change the preference back to 10hrs now Rosetta will finish the task once it finishes the next decoy. Users with slower computers will still fall within the run-time preference - they just fit fewer decoys into each task in that time.

HTH
Danny
____________

Astro

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50817 - Posted 19 Jan 2008 3:36:56 UTC
Last modified: 19 Jan 2008 3:42:21 UTC

Here's a "scoreboard" update. It shows all the errors for all my systems as they pertain to 5.93, and their percentages. Any error is annoying, but from my perspective, there's not a large percentage of them.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50818 - Posted 19 Jan 2008 5:14:56 UTC

Mike, my apologies, I generally dig up a link to info. warning you that changing the runtime impacts all of your existing work, and that it is possible to end up scheduled to miss deadlines. I generally recommend changing runtime gradually over time, so BOINC can react to the change. The good news is that if you change the preference back down, the pending work gets adjusted down as well (but it may not reflect that on work that hasn't been started until BOINC completes a couple of tasks under the new preference).

A PIII takes longer to complete a single model, but a 24hr preference is still just 24hrs. So, if a P4 takes 5 hours to complete the recent long running tasks, the PIII might take 10. A PIII would then complete a second model at around 20hrs, and then it would mark it completed (because to begin a third model would be so far over the 24hr preference). So, it still only takes a day to do a 24hr work unit, but the PIII will only do (for example) 2 of the hard models, and a P4 might do 4 of the models of the same level of difficulty.

Where a PIII really is hurting is when it is asked to do a 1-3hr runtime preference. It must do at least one model, and for tasks that take a PIII longer than the runtime preference, it just keeps chugging, showing the 10min. time to completion, which very gradually decreases over time.
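
The P4 versus PIII example can be sketched in Python. This is a simplified reading of the description above, not Rosetta's actual scheduling code: it assumes every model takes the same time, that at least one model is always run, and that a further model is started only if it is expected to finish within the runtime preference. The function name is hypothetical.

```python
def models_completed(hours_per_model, runtime_pref_hours):
    """Return (models, total_hours) under the simplified rule described above."""
    if hours_per_model <= 0:
        raise ValueError("hours_per_model must be positive")
    models = 1                   # at least one model is always completed
    elapsed = hours_per_model
    # Start another model only if it should finish within the preference.
    while elapsed + hours_per_model <= runtime_pref_hours:
        models += 1
        elapsed += hours_per_model
    return models, elapsed


print(models_completed(5, 24))   # P4 at 5 hours per model
print(models_completed(10, 24))  # PIII at 10 hours per model
print(models_completed(10, 3))   # PIII with a 3hr preference overruns it
```

With these inputs the P4 finishes 4 models in 20 hours, the PIII finishes 2 models in 20 hours, and the PIII on a 3hr preference still runs its single model for the full 10 hours.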
____________
Rosetta Moderator: Mod.Sense

AdeB

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 1,301,330
RAC: 676
Message 50838 - Posted 20 Jan 2008 17:21:03 UTC

workunit 134230483 had several sin_cos_range errors.
____________

eric

Joined: Jan 2 07
Posts: 23
ID: 139003
Credit: 815,696
RAC: 0
Message 50853 - Posted 21 Jan 2008 0:14:00 UTC

Once again I am having major problems with a new version of Rosetta. On one of my XP boxes the computer is locking up. That computer only has 512 MB of RAM. On one of my Linux boxes I am getting a ton of compute errors.

http://boinc.bakerlab.org/rosetta/results.php?hostid=702448

I am stopping Rosetta on that box and if this keeps up I am going to have to move my resources to different projects. That is a shame because I really feel that Rosetta is a great project to support. But on the other hand I can't keep wasting all this electricity on failed work units.
____________

Greg_BE

Joined: May 30 06
Posts: 4818
ID: 85645
Credit: 1,852,264
RAC: 1,628
Message 50870 - Posted 21 Jan 2008 18:50:43 UTC

Validate Error yet again
Task ID 133449376
Name 1g2z__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK_RELAX-1g2z_-crystal_foldanddock__2599_17309_0

Just wasted another 4 hours of CPU time

Validate error The task was reported but could not be validated, typically because the output files were lost on the server. <-- lost on the server? oh give me a break
____________

Dave Mickey

Joined: Dec 29 07
Posts: 33
ID: 231007
Credit: 3,897,360
RAC: 423
Message 50882 - Posted 22 Jan 2008 1:47:57 UTC

I too have bumped into the "10 minutes to go" thing,
and not understood, for a couple of reasons. First time,
I shut down BOINC and restarted it, and eventually, that
unit started again, and went to 10 minutes for a really long
time again.

I say it eventually restarted, because in the episode where it
went to 10 minutes, it somehow monopolized the CPU, and rang
up huge STD and LTD, by staying on Rosetta exclusively.
Thus when BOINC restarted, it went to s@h for many hours due to debt.
This machine is set to switch every 60 minutes, but something
in this scenario managed to override that and give Rosetta
something like 12 or 15 hours of uninterrupted CPU (should be 50/50).
No hints in the BOINC console output log, and BV has not (that I've
seen) reported that any deadline problem is the culprit.

What is it about this 10 minute to go anomaly that convinces BOINC
that Rosetta deserves large chunks of cpu time? (altho, the big debt
accumulation started well before it got to the 10 minute thing....)

(just trying to understand)

Dave

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50886 - Posted 22 Jan 2008 6:25:19 UTC
Last modified: 22 Jan 2008 6:26:01 UTC

Dave, I am not certain of the current state of affairs with BOINC. I know at one time they were talking about adding function to try to make task switches just after checkpoints to preserve more work for all projects. And it would make sense as well to try and let a task run another 10min to complete, even if it does not checkpoint, so perhaps BOINC allowed it to run, assuming its estimated time was correct, and that it would soon finish.

As you say, debt balanced everything out in the end.

If anyone knows for certain if the short estimated time to completion is disturbing the BOINC Manager's decision, please let me know, or post a link.
____________
Rosetta Moderator: Mod.Sense

Greg_BE

Joined: May 30 06
Posts: 4818
ID: 85645
Credit: 1,852,264
RAC: 1,628
Message 50898 - Posted 22 Jan 2008 20:53:56 UTC
Last modified: 22 Jan 2008 20:55:16 UTC

ANOTHER validate error, the second in 24 hours - 18 hrs to be precise between errors.
Task ID 133556076
Name s099_1_homologymodel_strictosidine_synthase_2472_63483_0

You're killing my average with these errors, and I am not sure if the results are making it into your system with this.

why do your servers keep losing files? refer to the explanation quoted from the website in my previous post.

someone want to answer this?

Seems like it's time for a bit of system maintenance before yet another crash happens.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50915 - Posted 23 Jan 2008 18:40:54 UTC

Greg, I am not in a position to know for certain, but I suspect that the DNS attack on the servers may have resulted in some odd things occurring.
____________
Rosetta Moderator: Mod.Sense

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4818
ID: 85645
Credit: 1,852,264
RAC: 1,628
Message 50923 - Posted 23 Jan 2008 19:26:09 UTC - in response to Message ID 50915.

That's possible. Everything is OK now: 24 hrs with no problems reporting or validating.

Greg, I am not in a position to know for certain, but I suspect that the DNS attack on the servers may have resulted in some odd things occurring.


____________

csbyrosetta

Joined: Dec 24 05
Posts: 4
ID: 42892
Credit: 751,204
RAC: 0
Message 50936 - Posted 24 Jan 2008 8:28:47 UTC

2h4o... seems to have a problem. I got 3 of them with the same problem.

http://boinc.bakerlab.org/rosetta/result.php?resultid=135428621

'<core_client_version>5.3.12.tx36</core_client_version>
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 1755374
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 46787.2 seconds. Greater than 4X preferred time: 10800 seconds
**********************************************************************
GZIP SILENT FILE: .\xx2h4o.out

</stderr_txt>'

Shutdown by watchdog because of long run time.
Should all of the 2h4o WUs be deleted?
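
For anyone comparing several of these results, here is a minimal Python sketch of how the watchdog line in a stderr excerpt like the one above could be pulled apart. The pattern is based only on the message text quoted in this thread; the function name `parse_watchdog` is hypothetical and is not part of Rosetta or BOINC.

```python
import re

# Pattern for the watchdog line as it appears in the stderr excerpt above.
WATCHDOG_RE = re.compile(
    r"CPU time: (?P<cpu>[\d.]+) seconds\. "
    r"Greater than 4X preferred time: (?P<pref>\d+) seconds"
)

def parse_watchdog(stderr_text):
    """Return (cpu_seconds, preferred_seconds), or None if no watchdog line is found."""
    match = WATCHDOG_RE.search(stderr_text)
    if match is None:
        return None
    return float(match.group("cpu")), int(match.group("pref"))

sample = """# cpu_run_time_pref: 10800
Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 46787.2 seconds. Greater than 4X preferred time: 10800 seconds
"""

cpu, pref = parse_watchdog(sample)
# Ratio of CPU time actually used to the runtime preference.
print(f"{cpu / pref:.2f}x the preferred time")
```

A ratio just above 4 is what one would expect if the watchdog only steps in once the 4x mark has been passed.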

____________

Dr Who Fan
Avatar

Joined: May 28 06
Posts: 29
ID: 85050
Credit: 39,288
RAC: 0
Message 50937 - Posted 24 Jan 2008 8:44:21 UTC

This Task ID 135491299 failed validation.

Name 2tif__LOGREG_ABRELAX_PILOT2_FRAG_CORRECTION_SAVE_ALL_OUT-2tif_-_BARCODE__2670_6464_0
Workunit 123308703
Created 23 Jan 2008 14:44:05 UTC
Sent 23 Jan 2008 14:45:03 UTC
Received 24 Jan 2008 5:57:28 UTC
Server state Over
Outcome Validate error
Client state Done
Exit status 0 (0x0)
Computer ID 230539
Report deadline 2 Feb 2008 14:45:03 UTC
CPU time 4275.497864
stderr out

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 7200
# random seed: 1671937
==
</stderr_txt>
]]>

Validate state Invalid
Claimed credit 5.53358788153339
Granted credit 0
application version 5.93
____________

FalconFly Profile
Avatar

Joined: Jan 11 08
Posts: 23
ID: 234757
Credit: 2,162,896
RAC: 1
Message 50945 - Posted 24 Jan 2008 19:15:47 UTC - in response to Message ID 50937.
Last modified: 24 Jan 2008 19:21:14 UTC

Noted a couple of 2H4O_BOINC_TWIST_RINGS WorkUnits stuck at ~10min remaining as well, all well beyond their target runtime. CPU time counts upwards but no progress is made.

Oddball :
Restarting BOINC on a System beyond runtime causes CPU time to drop from beyond target runtime to some point inside target runtime (e.g. 6h16m to 2h16m with a 6h preferences set), progress bar moved back accordingly from 99%.

The same happens on a couple of Systems tested (CPU time dropped from 23h back to a seemingly random point within target runtime)

Based on granted Credits and Decoys tested, the affected 2H4O_BOINC_TWIST_RINGS will stall at some point, but still cause full CPU utilization. WorkUnit will be ended by Watchdog after hitting 4x expected runtime.

------
All occurred with BOINC V5.10.28 and various Linux Systems.
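
The 4x rule mentioned above can be sketched as follows. This is only an illustration, assuming the cutoff is exactly four times the runtime preference, as the "Greater than 4X preferred time" message suggests; the function names are hypothetical and the real watchdog may check other conditions too.

```python
# Multiplier taken from the "Greater than 4X preferred time" watchdog message.
WATCHDOG_MULTIPLIER = 4

def watchdog_limit_seconds(pref_hours):
    """CPU time (in seconds) after which the watchdog would end the task."""
    return WATCHDOG_MULTIPLIER * pref_hours * 3600

def watchdog_would_end(cpu_seconds, pref_hours):
    """True if the CPU time has passed the assumed 4x limit."""
    return cpu_seconds > watchdog_limit_seconds(pref_hours)

# Limits for a few common runtime preferences.
for pref in (3, 4, 6, 8, 12):
    limit_hours = watchdog_limit_seconds(pref) / 3600
    print(f"{pref} h preference -> watchdog at {limit_hours:.0f} h of CPU time")
```

Under this assumption a stalled task costs four times the preference in CPU time before it is ended, which is why lowering the preference shortens the loss.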
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50946 - Posted 24 Jan 2008 19:38:03 UTC

Falcon, what is your Rosetta Preference for target runtime?
Please see related info. in this thread.
____________
Rosetta Moderator: Mod.Sense

csbyrosetta

Joined: Dec 24 05
Posts: 4
ID: 42892
Credit: 751,204
RAC: 0
Message 50948 - Posted 24 Jan 2008 19:42:11 UTC

See my post above. It's not a problem with the target runtime: I've had 3 cut off by the watchdog, and the fourth is currently running (only a picture in the native window, nothing else).
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50954 - Posted 24 Jan 2008 20:53:48 UTC

Ended by watchdog, and running beyond their runtime target are two rather different things.
____________
Rosetta Moderator: Mod.Sense

FalconFly Profile
Avatar

Joined: Jan 11 08
Posts: 23
ID: 234757
Credit: 2,162,896
RAC: 1
Message 50955 - Posted 24 Jan 2008 21:32:45 UTC - in response to Message ID 50946.
Last modified: 24 Jan 2008 21:37:54 UTC

Falcon, what is your Rosetta Preference for target runtime?
Please see related info. in this thread.


Was set at 6 hours until this evening, when I reduced it to 4 (4x4h no progress is at least better than 4x6h no progress)

Typical WorkUnits that finished already :
Watchdog Terminated
Watchdog Terminated + Segmentation Violation (still valid though)
Watchdog Terminated
Watchdog Terminated

----------
If the WorkUnit just takes that long (and can't finish within 4 or 6 hours on a modern Athlon64 X2), I don't mind the increased runtime. I don't expect that to take 24 hours though (unless the Models are really much more complex than expected, which could be in theory for all I know)

Looking at Claimed vs. Granted Credit however, it seems that approx. 50-70% of the runtime is simply lost due to Watchdog not cutting in until 4x the set runtime (not sure what the Client actually does in that time).
____________

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,061,841
RAC: 1,331
Message 50957 - Posted 25 Jan 2008 3:17:55 UTC

I think there is something seriously wrong with the 2h4o_ WUs. They just seem to sit there using CPU, but not writing anything to the output files. They never end until the watchdog says they've used 4x the CPU time preference.

csbyrosetta

Joined: Dec 24 05
Posts: 4
ID: 42892
Credit: 751,204
RAC: 0
Message 50959 - Posted 25 Jan 2008 8:20:08 UTC - in response to Message ID 50954.

Ended by watchdog, and running beyond their runtime target are two rather different things.


I think I mean the same thing as FalconFly.
The 2h4o WUs have a problem. I restarted one WU on the quad; the CPU time jumped down to 1h:xx (last working checkpoint?) and it seemed to be running. It finished with a wrong CPU time of 11337 sec (3h:8), but with a heartbeat error.

http://boinc.bakerlab.org/rosetta/result.php?resultid=135428650

On the X2 the CPU time jumped down to 0h:0x after the restart (from 6h:59). It seems to run but doesn't do any work until the watchdog stops it. This WU would consume 4x3h + 6h:59 = 19h of CPU time.
If such a WU is stopped and restarted by the scheduler and its CPU time is reset, it will be a never-ending loop.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50962 - Posted 25 Jan 2008 14:32:19 UTC

If such a WU is stopped and restarted by the scheduler and its CPU time is reset, it will be a never-ending loop.


The watchdog will catch such a thing and abort it for you. In this case, it would notice that the task was restarted 5 times from the exact same point. In other words, "I've started this thing 5 times and never reached a checkpoint, so I'm going to abort it".

The basic idea being that whatever it is about that task is not well suited to how you are using your computer, and so the watchdog ends it, reports it back and gets another task, which will tend to have different behavior.
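
The restart check described above might look roughly like this. It is a sketch only: the limit of 5 comes from the description in this post, while the class, its method and the `checkpoint_id` argument are hypothetical names, not the actual Rosetta watchdog code.

```python
# Limit quoted in the post: 5 restarts from the exact same point.
MAX_RESTARTS_WITHOUT_PROGRESS = 5

class RestartTracker:
    """Hypothetical tracker counting restarts that resume from the same checkpoint."""

    def __init__(self):
        self.last_checkpoint = None
        self.restarts_at_checkpoint = 0

    def on_restart(self, checkpoint_id):
        """Record a restart; return True if the task should be aborted."""
        if checkpoint_id == self.last_checkpoint:
            # No new checkpoint was reached since the previous restart.
            self.restarts_at_checkpoint += 1
        else:
            # Progress was made, so start counting again from this checkpoint.
            self.last_checkpoint = checkpoint_id
            self.restarts_at_checkpoint = 1
        return self.restarts_at_checkpoint >= MAX_RESTARTS_WITHOUT_PROGRESS

tracker = RestartTracker()
for attempt in range(1, 6):
    abort = tracker.on_restart("model_1_start")
    print(f"restart {attempt}: abort={abort}")
```

Any restart that resumes from a newer checkpoint resets the count, so only a task that never gets past the same point is ended.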
____________
Rosetta Moderator: Mod.Sense

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50965 - Posted 25 Jan 2008 17:36:45 UTC
Last modified: 25 Jan 2008 18:25:05 UTC

Here's a snapshot of what others might be describing about the 2h4o WUs. This is on my wife's laptop, which was set to a 1 hour run time pref, but I changed it at some point last night to 6 hours (note: I changed it before I knew about this one; her laptop is WAAAAY out in the dining room, which never sees meals on the table, so I'm seldom there). Either way, we're way past that. Its longest recorded decoy (out of 912 recorded) was a 1gida which lasted 16627 seconds (4.61 hrs). I suspended the other projects to see what happens with it.



[edit] after 1 hour run time the cpu time has progressed one hour, and the "% complete" has progressed from 98.558 to 98.664, but the "to comp" has remained unchanged.

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50974 - Posted 25 Jan 2008 20:31:20 UTC

I can't edit after 60 min.

After 3 hours the cpu time seems right, % comp has progressed up to 98.846, and "to comp" has gone up one second to 00:09:54. Hmmm, at .1%/hour there's just 11 more hours to go making it 25 hours/decoy...gotta be a record. I'll not post again until it's nearly over. (yes...I know 98.848% seems like it'd be nearly over...LOL)

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 50975 - Posted 25 Jan 2008 21:14:45 UTC

Astro, the estimates are based on the time to completion as compared to the target runtime... except for the final 10min to completion. They make increasingly fine adjustments to show things are still moving forward, but the client really doesn't know how long that model is going to take.

Once the model completes, your time to completion will zip from wherever it was near 10min to zero. You can't take a .1% adjustment and extrapolate that into a prediction on final time to completion. The last 10-12min of the time to completion do not work that way. And the time prior to that is just based on the time spent, as compared to target runtime. So, until you've completed a model on that task, there really isn't a great method to arrive at a true predicted time to completion. For most tasks, which take less than an hour per model, this method works fairly well. These 6+hr per model tasks are basically the worst case for the time estimate calculations.
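
To make the description above concrete, here is a rough sketch of such an estimate. It assumes the remaining time is simply the target runtime minus the CPU time spent, held at about 10 minutes once the target is near or passed, and that the percentage is CPU time divided by CPU time plus remaining time. Both formulas and the 600 second constant are assumptions for illustration; the real application makes finer adjustments that are not modeled here.

```python
# "Final 10min" from the post; the exact value used by the application is an assumption.
FINAL_WINDOW_S = 600

def estimate_remaining(cpu_s, target_s):
    """Remaining time: target minus time spent, never below the final window."""
    return max(target_s - cpu_s, FINAL_WINDOW_S)

def estimate_fraction_done(cpu_s, target_s):
    """Assumed fraction done: time spent over time spent plus estimated remaining."""
    remaining = estimate_remaining(cpu_s, target_s)
    return cpu_s / (cpu_s + remaining)

target = 6 * 3600  # a 6 hour runtime preference
for hours_spent in (1, 3, 6, 15, 30):
    cpu = hours_spent * 3600
    pct = 100 * estimate_fraction_done(cpu, target)
    mins = estimate_remaining(cpu, target) / 60
    print(f"{hours_spent:>2} h spent: {pct:6.2f}% done, about {mins:.0f} min remaining")
```

In this simplified model the percentage creeps toward 100% ever more slowly once the target runtime has passed, while the remaining time stays near 10 minutes, so the rate of change in the percentage says little about when the model will actually finish.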
____________
Rosetta Moderator: Mod.Sense

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50976 - Posted 25 Jan 2008 21:28:14 UTC - in response to Message ID 50975.

These 6+hr per model tasks are basically the worst case for the time estimate calculations.

I do think something's wrong.

For the last 3 hours, it's progressed .1%/hour. If that were true for the full length of the wu, and given I'm 15 hours into it, then I should only be at 1.5% complete. At some point the "% comp" had to have progressed faster, and then at some point went into slow motion mode. I'm aware of how the "to comp" works and have NO issue with that. Also, if I'd had a 1 hr, 2 hr, 3 hr, and shortly a 4 hour run time preference, then this one would have been ended by the watchdog. Of course, I'm assuming it'll finish at all. If the .1%/hour holds, then a 6 hour pref would have been ended by the watchdog (have to wait and see total run time before I can say that definitively).

I guess, if these are really that long, then admin should change the % comp mechanism, and say something about having some "unusually LARGE" wus in the system ATM. Otherwise, you're going to get a lot of questions, and who knows how many users will "abort" just because they don't know it might be "normal".

Heck, I feel that I'm doing them a favor even running it as my gut feeling (without admin acknowledgement that this is normal) is I'm going to get nada for a days work.

hedera Profile
Avatar

Joined: Jul 15 06
Posts: 66
ID: 100139
Credit: 1,109,835
RAC: 1,164
Message 50977 - Posted 25 Jan 2008 21:49:18 UTC

Another 2h4o oddity - I had this one, work unit 123329393:

2h4o__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK-2h4o_-native__2668_12846

that ran 12.53 hours of CPU time! My outcome says "Success" and "Done" (and I got plenty of credit for it) - but when I look at the details, I see:

<core_client_version>5.10.20</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 1745555
# cpu_run_time_pref: 10800
# random seed: 1745555
# cpu_run_time_pref: 10800
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 45123.3 seconds. Greater than 4X preferred time: 10800 seconds
**********************************************************************
GZIP SILENT FILE: .\xx2h4o.out

But it shows a "Validate state" of VALID. I certainly am not complaining about the credit, but how can it be done and valid if Watchdog shut it down??
____________
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50981 - Posted 25 Jan 2008 22:48:34 UTC
Last modified: 25 Jan 2008 22:54:53 UTC

That's kind of my point. It was ended by the watchdog at 4X his/her 3 hour run pref, at 12 hours + a bit. The Task ID shows NO decoy info at all. Was any scientifically worthwhile work performed? Or is it just credit for time served?? This is going to be very typical of all participants except those with a "cpu run time pref" exceeding 8-12 hours (depending on processor, etc).

and now that I think about my one wu and my 6 hour run time. It looks like I should bump that up to the next step above 6hrs or suffer the same fate as everyone else. It's at 99.005 percent after 16:38:00, so is holding to the .1%/hour.

[edit] moved up to 8 hour pref

hedera Profile
Avatar

Joined: Jul 15 06
Posts: 66
ID: 100139
Credit: 1,109,835
RAC: 1,164
Message 50982 - Posted 25 Jan 2008 23:17:18 UTC

What's the default on CPU runtime limits? I looked at my settings and target CPU runtime is "not selected", so I have whatever you get when you don't specify. Do we have a consensus on what it should be?? I'm a little confused.
____________
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

rochester new york Profile
Avatar

Joined: Jul 2 06
Posts: 2337
ID: 98229
Credit: 756,356
RAC: 316
Message 50983 - Posted 25 Jan 2008 23:35:41 UTC - in response to Message ID 50982.

What's the default on CPU runtime limits? I looked at my settings and target CPU runtime is "not selected", so I have whatever you get when you don't specify. Do we have a consensus on what it should be?? I'm a little confused.

i got mine set for 1 day however i think it defaults to 3 hours

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50984 - Posted 25 Jan 2008 23:39:27 UTC
Last modified: 25 Jan 2008 23:41:44 UTC

yes, the "not selected" is the default of 3 hours.

also, that's set on the "web" so your client must "call home" in order to see and apply the change. This happens when it gets/reports work. But, if you want it to change in the middle of a run, you must do a "project update". You can manually update the projects from the "projects" tab on the manager. Highlight the project name in the right hand box by clicking on it. Then click the "update" button to the left.

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50986 - Posted 26 Jan 2008 11:16:21 UTC
Last modified: 26 Jan 2008 11:32:07 UTC

Here's an updated pic taken 17+ hours later (from the first pic posted) of the same wu. I changed my runtime pref to 8 last night, but that only gets me up to 32 hours before the watchdog kicks in. Perhaps I'll go to a 12 hour pref so it'll be able to finish normally as long as it doesn't take more than "48 HOURS to do ONE decoy" on a Mobile AMD64 3700 w/1 G ram. Boy, if the others I have in cache take anywhere near this long....all my Boincsimap, Einstein, and the rest of my rosetta will be past the deadline.

Also notice the rate of completion seems to be continually slowing (at least I assume so), since it's only progressed .3% overnight while I slept, instead of the .1%/hour I was seeing. At 28+ hours, this decoy has already taken more than 6 times its previous "longest decoy".




[Edit] I went to 12 hours "run time pref", so hopefully it'll finish in the next 19 hours.

AdeB Profile
Avatar

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 1,301,330
RAC: 676
Message 50987 - Posted 26 Jan 2008 12:37:22 UTC - in response to Message ID 50986.



[Edit] I went to 12 hours "run time pref", so hopefully it'll finish in the next 19 hours.


You really are on a mission to find out how long it will take.
Go for it, Astro!
____________

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50988 - Posted 26 Jan 2008 12:48:43 UTC - in response to Message ID 50987.
Last modified: 26 Jan 2008 12:55:36 UTC



[Edit] I went to 12 hours "run time pref", so hopefully it'll finish in the next 19 hours.


You really are on a mission to find out how long it will take.
Go for it, Astro!

Someone's got to be the guinea pig. "show me the little plastic wheel, and I'll take her for a spin".

I'll take this posting opportunity for an update:

CPU time 30:21:01, 99.453% complete, 00:09:56 remaining, using the benchmark claiming method, this wu is worth 429.75 credits so far. I wonder....

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 4,820,543
RAC: 2,349
Message 50989 - Posted 26 Jan 2008 13:25:01 UTC - in response to Message ID 50988.
Last modified: 26 Jan 2008 13:26:32 UTC


Someone's got to be the guinea pig. "show me the little plastic wheel, and I'll take her for a spin".

I'll take this posting opportunity for an update:

CPU time 30:21:01, 99.453% complete, 00:09:56 remaining, using the benchmark claiming method, this wu is worth 429.75 credits so far. I wonder....


My box crunched one in the same category as yours (TWIST RINGS TWIST ANGLE). It ran a bit over 2.5 times the normal runtime, which is 6 hrs for me. A personal record! ;)
____________

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 4,820,543
RAC: 2,349
Message 50990 - Posted 26 Jan 2008 13:44:01 UTC - in response to Message ID 50988.

CPU time 30:21:01, 99.453% complete, 00:09:56 remaining, using the benchmark claiming method, this wu is worth 429.75 credits so far. I wonder....

Mine claimed 329 credits, but only got the standard 20 credits for a watchdog-ended task.

http://boinc.bakerlab.org/rosetta/result.php?resultid=135624272

____________

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50991 - Posted 26 Jan 2008 13:48:16 UTC
Last modified: 26 Jan 2008 13:48:37 UTC

Strange. hedera received 88 of his 98 claimed for his watchdog ended task resultid=135513724. I wonder what the difference was?

AdeB Profile
Avatar

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 1,301,330
RAC: 676
Message 50992 - Posted 26 Jan 2008 15:19:40 UTC - in response to Message ID 50991.

Strange. hedera received 88 of his 98 claimed for his watchdog ended task resultid=135513724. I wonder what the difference was?


And i received 92 of 94 claimed for resultid 135481414.
I hope Astro gets more than 20 credits for his job, but it probably won't be 400+.
____________

AdeB Profile
Avatar

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 1,301,330
RAC: 676
Message 50993 - Posted 26 Jan 2008 15:20:27 UTC - in response to Message ID 50991.

Strange. hedera received 88 of his 98 claimed for his watchdog ended task resultid=135513724. I wonder what the difference was?


And i received 92 of 94 claimed for resultid 135481414.
I hope Astro gets more than 20 credits for his job, but it probably won't be 400+.
____________

AdeB Profile
Avatar

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 1,301,330
RAC: 676
Message 50994 - Posted 26 Jan 2008 15:21:11 UTC - in response to Message ID 50991.

Strange. hedera received 88 of his 98 claimed for his watchdog ended task resultid=135513724. I wonder what the difference was?


And i received 92 of 94 claimed for resultid 135481414.
I hope Astro gets more than 20 credits for his job, but it probably won't be 400+.
____________

AdeB Profile
Avatar

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 1,301,330
RAC: 676
Message 50995 - Posted 26 Jan 2008 15:24:42 UTC

sorry for the triple-post. I had some problems with my connection.
____________

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 50997 - Posted 26 Jan 2008 15:47:14 UTC - in response to Message ID 50995.
Last modified: 26 Jan 2008 16:32:51 UTC

sorry for the triple-post. I had some problems with my connection.

up to 461 credits now. LOL

Say, you do know that you can "edit" your posted messages as long as you do so within 60 min of the original post. You should see an "edit" box on each of your previous posts. You could (only if you wanna) delete everything and just put "deleted" or some other message into all but the intended one. At that point a nice moderator might come along and hide those extra posts. Anyway, just wanted you to know. Hope you enjoy the rest of the weekend

32:49:08 cpu time, 99.494% complete with 00:09:57 remaining.

[edit] made a progress chart. Given the curve, I doubt it'll ever finish.

transient
Avatar

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 4,820,543
RAC: 2,349
Message 50998 - Posted 26 Jan 2008 17:15:04 UTC - in response to Message ID 50992.

Strange. hedera received 88 of his 98 claimed for his watchdog ended task resultid=135513724. I wonder what the difference was?


And i received 92 of 94 claimed for resultid 135481414.
I hope Astro gets more than 20 credits for his job, but it probably won't be 400+.


In my case maybe not even one decoy was finished. I'm just guessing
____________

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 51000 - Posted 26 Jan 2008 18:49:20 UTC

resultid=135831728

CPU time 127601.71875 (35.44 HOURS)
Claimed credit 501.989998586804
Granted credit 20

Mod Sense. I'm pretty sure there's something wrong here. Anyone else spot the problem???? It's not like this issue wasn't posted about early enough on Friday for someone at the project to comment upon it.
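
For scale, the figures in this result can be turned into an hourly rate with a few lines of Python. The numbers are copied from the result above; the helper names are just for illustration.

```python
def hours(cpu_seconds):
    """Convert CPU seconds to hours."""
    return cpu_seconds / 3600.0

def credit_per_hour(credit, cpu_seconds):
    """Credit earned per hour of CPU time."""
    return credit / hours(cpu_seconds)

# Values reported for resultid=135831728.
cpu_s = 127601.71875
claimed = 501.989998586804
granted = 20

print(f"CPU time: {hours(cpu_s):.2f} h")
print(f"claimed:  {credit_per_hour(claimed, cpu_s):.2f} credits/h")
print(f"granted:  {credit_per_hour(granted, cpu_s):.2f} credits/h")
```

The gap between the claimed and granted hourly rates is the size of the discrepancy being asked about.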

JEklund

Joined: Sep 24 06
Posts: 7
ID: 114425
Credit: 105,447
RAC: 0
Message 51001 - Posted 26 Jan 2008 19:37:32 UTC - in response to Message ID 51000.

resultid=135831728

CPU time 127601.71875 (35.44 HOURS)
Claimed credit 501.989998586804
Granted credit 20

Mod Sense. I'm pretty sure there's something wrong here. Anyone else spot the problem???? It's not like this issue wasn't posted about early enough on Friday for someone at the project to comment upon it.


Based on the info in the log it seems that it was stuck and the watchdog killed it ( and appreciated your work as 20 credits .. which is not fair for 35 hours work IMHO )

No clue what is wrong with that work unit though ..

-- Lundi --

____________

mhhal

Joined: Mar 28 06
Posts: 7
ID: 68866
Credit: 1,463,604
RAC: 2,561
Message 51007 - Posted 26 Jan 2008 22:03:39 UTC - in response to Message ID 50335.

Please post problems and/or bugs with rosetta 5.93. Thanks for your
support!

My slower computer (ID #187636 -- older Linspire Linux box) is set to accept
jobs of approx 14 hours. I have a job on the machine at this time which says it
is 99.67% completed with 50:16:19 of CPU time. For the time being, I've suspended
the job. Name starts "2h4o_BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK"
(Work unit 123162090).

Don't know if this is a Rosetta issue or a problem w/ this specific job.
I know that I have another of same name in my queue (135883853).

Just wondering if someone else has seen similar issue/problem.

Hope this helps!!

____________

AdeB Profile
Avatar

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 1,301,330
RAC: 676
Message 51008 - Posted 26 Jan 2008 22:58:53 UTC - in response to Message ID 51000.

resultid=135831728

CPU time 127601.71875 (35.44 HOURS)
Claimed credit 501.989998586804
Granted credit 20

Mod Sense. I'm pretty sure there's something wrong here. Anyone else spot the problem???? It's not like this issue wasn't posted about early enough on Friday for someone at the project to comment upon it.


Oh no, you did get 20. You should have got at least an extra 100 for all the effort you put into it.
____________

dcdc Profile

Joined: Nov 3 05
Posts: 1488
ID: 8948
Credit: 19,185,928
RAC: 13,098
Message 51009 - Posted 26 Jan 2008 23:51:20 UTC

I've got one here:

http://boinc.bakerlab.org/rosetta/result.php?resultid=135314464

Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 58569.2 seconds. Greater than 4X preferred time: 14400 seconds

Claimed credit 211.010587329225
Granted credit 80
____________

[AF>France>TDM>Centre]Jeannot Le Tazon Profile

Joined: Dec 8 05
Posts: 6
ID: 32842
Credit: 55,596
RAC: 0
Message 51013 - Posted 27 Jan 2008 7:51:31 UTC

I've aborted this one http://boinc.bakerlab.org/rosetta/result.php?resultid=135287253 after 11h. (prefs set to 12h)
11 h crunching, then cpu benchmark, and then back to 10% complete. :(
it seemed to do nothing interesting after, maybe, 1h and 1 decoy
(Model 1, Step 27091, Accepted RMSD 9124, Accepted energy 6.65805)
Nothing displayed on "Searching", "Accepted", nothing moving after 1 decoy on "RMSD" & "Accepted Energy".

Paul

Joined: Oct 29 05
Posts: 154
ID: 7397
Credit: 11,613,750
RAC: 426
Message 51019 - Posted 27 Jan 2008 20:25:18 UTC

I started getting lots of computation errors today. I did make 1 change to the system but it should not have caused this problem. Most of the time the CPU cranks on the WU for 50+ min. before the error.

Is there a problem with some of the WUs in the 5.93 beta? I just installed the newest BOINC Client (5.10.30) and I guess it could be at fault as well.

Any insight is greatly appreciated.

Paul
____________
Thx!

Paul

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4818
ID: 85645
Credit: 1,852,264
RAC: 1,628
Message 51028 - Posted 27 Jan 2008 21:19:40 UTC - in response to Message ID 51019.

Paul - do the group a favor and tell us which one of your many computers is having fits and which work units, as you have a lot of different computers and lots of workunits in queue. It's not the BOINC program that has the errors, rather the project work units themselves. You probably notice that you have errors on RAH vs the other projects you are working on. If it was a BOINC program error you would have errors on all your projects.

I started getting lots of computation errors today. I did make 1 change to the system but it should not have caused this problem. Most of the time the CPU cranks on the WU for 50+ min. before the error.

Is there a problem with some of the WUs in the 5.93 beta? I just installed the newest BOINC Client (5.10.30) and I guess it could be at fault as well.

Any insight is greatly appreciated.

Paul


____________

PieBandit
Avatar

Joined: Apr 17 07
Posts: 6
ID: 165650
Credit: 228,220
RAC: 0
Message 51029 - Posted 28 Jan 2008 0:08:43 UTC

several of my WU are also failing with compute errors:

Result ID 136334535
Result ID 136319412
Result ID 136308989
Result ID 136258153
Result ID 135343580
Result ID 135260720
Result ID 134993972

since January 21st, I've had about a 50% success rate
____________

Paul

Joined: Oct 29 05
Posts: 154
ID: 7397
Credit: 11,613,750
RAC: 426
Message 51030 - Posted 28 Jan 2008 0:31:32 UTC - in response to Message ID 51028.

Paul - do the group a favor and tell us which one of your many computers is having fits and which work units, as you have a lot of different computers and lots of workunits in queue. It's not the BOINC program that has the errors, rather the project work units themselves. You probably notice that you have errors on RAH vs the other projects you are working on. If it was a BOINC program error you would have errors on all your projects.

I started getting lots of computation errors today. I did make 1 change to the system but it should not have caused this problem. Most of the time the CPU cranks on the WU for 50+ min. before the error.

Is there a problem with some of the WUs in the 5.93 beta? I just installed the newest BOINC Client (5.10.30) and I guess it could be at fault as well.

Any insight is greatly appreciated.

Paul



Greg:

Thanks for the note. I do have lots of WUs checked out and it takes a long time to find the issues.

The computer is 591177 and it has more compute errors than successes. I will keep fighting with the hardware but I think it is OK now. All of my temps are well in spec and I don't have any other issues.

I run 100% R@H so I can not compare these WUs to anything else. I did notice that none of my other systems have the same issues so a BIOS upgrade later, I think we may have some stability.

Thx

Paul

____________
Thx!

Paul

Conan Profile
Avatar

Joined: Oct 11 05
Posts: 134
ID: 4053
Credit: 1,599,032
RAC: 24
Message 51039 - Posted 28 Jan 2008 11:06:18 UTC

The problems I was getting over at Ralph appear to have carried over to Rosetta.

The WUs starting with "2h4o" were causing problems on Ralph so I was surprised to see them over here on Rosetta.

They have a habit of running well past your preference time (up to 21 hours with preference time of 6 hours),
All seem to get to just over 97% completed with 9 minutes 59 seconds to go and just sit there for hours,
Says 100% completed but still shows "Waiting to Run" in Boinc Manager,
Often giving computation errors after the extra long run time (this was mainly on Ralph),
If it does complete after the extra long run time will only give a very poor amount of credit because usually only 1 decoy has been produced in all this time.

I have just aborted two of these WU's
WU 135437069 ran for over 3 1/2 hours got to 100% but still waiting to run in BM, after aborting results show Zero (0) time taken on job.
WU 135437323 was already over an hour past my preference time of 6 hours and still grinding away with 9 minutes 59 seconds to go at 97% completed; it had been this way for quite some time.
WU 135372094 completed after more than 21 hours, returning just 2.5 cr/h.

If I see any more of these WU type then I will be aborting them.
____________

Thomas Leibold

Joined: Jul 30 06
Posts: 55
ID: 102494
Credit: 19,256,322
RAC: 7,733
Message 51046 - Posted 28 Jan 2008 18:22:36 UTC - in response to Message ID 51039.



The WUs starting with "2h4o" were causing problems on Ralph so I was surprised to see them over here on Rosetta.



I'm seeing the same problems as Conan on a number of my servers. The trouble workunits are 2h4o and 1zpy, and all require a manual abort. Restarting Boinc will just reset the amount of time already spent on them and start them again.

The 2h4o units in particular tend to stay at 100% Completed but state "Running" with no increase in the amount of cpu time spent. Looking at the stdout.txt/stderr.txt files shows that there was an attempt by the watchdog to shut down the client (and as far as I know that has never worked properly for Rosetta on Linux).
____________
Team Helix

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 51052 - Posted 28 Jan 2008 21:06:46 UTC

I aborted them all as well, Still waiting on my 480 missing credits too...

I wonder when the staff gets in to work? These have really got to be affecting the total rate of return (i.e work done).

FalconFly Profile
Avatar

Joined: Jan 11 08
Posts: 23
ID: 234757
Credit: 2,162,896
RAC: 1
Message 51070 - Posted 29 Jan 2008 8:16:33 UTC - in response to Message ID 51052.

Same here, had to abort the last 2h4o Model.
One of my faster Hosts effectively stopped working, as the hourly rotation of the last 2h4o__BOINC_TWIST_RINGS WorkUnit apparently reset CPU time over and over, while making zero progress.

As a side-effect, the Rosetta Long Term Debt of the affected Clients rocketed up to -90000s (lots of work but almost no progress done)
____________

MerePeer

Joined: Nov 6 05
Posts: 3
ID: 9733
Credit: 1,787,446
RAC: 0
Message 51086 - Posted 29 Jan 2008 23:41:11 UTC - in response to Message ID 51070.

Same here. Same problem with 2h4o__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK* just hanging. Restarting boinc results in same problem 8 hours later. Linux box.

____________

Astro
Avatar

Joined: Oct 2 05
Posts: 987
ID: 2322
Credit: 500,253
RAC: 0
Message 51087 - Posted 30 Jan 2008 1:11:08 UTC
Last modified: 30 Jan 2008 1:36:10 UTC

I'm not sure what to think. Complaints about the 2h4o WUs started at least 5 days ago. I ran a test on one of mine starting 5 days ago, which leaves 3 full business days and two weekend days for management to make a statement. I've seen or heard nothing. How often do they monitor these boards? Are they of any importance? I'm feeling a bit like any "beta" tests or any other tests are really a waste of our man hours and CPU seconds. Perhaps I'll be considered impatient...hmmmm....How long must one wait before one isn't considered as such???

I don't know. I know I've stopped ALL Rosetta work. It really isn't what I wanted, but I don't wanna "Pi**" away my CPU time for nothing when it might be spent more wisely. (I.e. if my machines are just going to use electricity without scientific benefit, what's the point of leaving them on?)

tony

I started at 200K and was shooting for 600K before stopping, but I guess 350K is OK. If that's what they want. (Well, it would stay at 350K, but I loaned out a machine before I knew the score, so I have to await its return before I remove it.)

j2satx

Joined: Sep 17 05
Posts: 97
ID: 253
Credit: 3,371,456
RAC: 839
Message 51089 - Posted 30 Jan 2008 2:24:58 UTC - in response to Message ID 51039.

The problems I was getting over at Ralph appear to have carried over to Rosetta.

The Wu's starting with "2h4o" were causing problems on Ralph so I was surprised to see them over here on Rosetta.



Were you "really" surprised?

Conan Profile
Avatar

Joined: Oct 11 05
Posts: 134
ID: 4053
Credit: 1,599,032
RAC: 24
Message 51090 - Posted 30 Jan 2008 8:08:24 UTC - in response to Message ID 51089.

The problems I was getting over at Ralph appear to have carried over to Rosetta.

The Wu's starting with "2h4o" were causing problems on Ralph so I was surprised to see them over here on Rosetta.



Were you "really" surprised?


G'Day j2satx,
No I guess I was not, considering no response over on Ralph either. A lot of wasted time when these things run to over 21 hours and then often error out.
It is a shame, I do like the project and its goals, it was one of the best monitored and responsive projects for a good while.
____________

j2satx

Joined: Sep 17 05
Posts: 97
ID: 253
Credit: 3,371,456
RAC: 839
Message 51094 - Posted 30 Jan 2008 15:00:39 UTC - in response to Message ID 51090.

The problems I was getting over at Ralph appear to have carried over to Rosetta.

The Wu's starting with "2h4o" were causing problems on Ralph so I was surprised to see them over here on Rosetta.



Were you "really" surprised?


G'Day j2satx,
No I guess I was not, considering no response over on Ralph either. A lot of wasted time when these things run to over 21 hours and then often error out.
It is a shame, I do like the project and its goals, it was one of the best monitored and responsive projects for a good while.


I know....I started crunching Ralph again when it looked like they were making a change with the "minis", but seems that was short lived also.

hedera Profile

Joined: Jul 15 06
Posts: 66
ID: 100139
Credit: 1,109,835
RAC: 1,164
Message 51096 - Posted 30 Jan 2008 18:30:29 UTC

The interesting thing with all this is that, after that one bad day a couple of weeks ago, I made a minor adjustment to the amount of memory (from 90% to 85% when computer is not in use) and CPU (from 100% to 90%) allowed, and since that time my WUs have been cranking happily away, finishing in the normal 2-4 hours of CPU time, and not overwhelming my Pentium IV. And no errors. Maybe I'm just lucky.
____________
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

Steve Dodd Profile

Joined: Dec 13 05
Posts: 6
ID: 36900
Credit: 1,371,424
RAC: 0
Message 51098 - Posted 30 Jan 2008 20:29:51 UTC
Last modified: 30 Jan 2008 20:31:29 UTC

I've had a problem recently with wus going way past the allotment time (8 hrs for my preferences). I've had 2 get stuck in the 90% complete range and no further. Looking at the graphics showed the step for the model being tested as not incrementing. WU numbers are: 123352364 and 123338380. Crunch time was ~19 hrs. each.
____________

Steve Dodd Profile

Joined: Dec 13 05
Posts: 6
ID: 36900
Credit: 1,371,424
RAC: 0
Message 51108 - Posted 1 Feb 2008 5:07:28 UTC

Add wu 121455059
____________

Ingemar

Joined: Feb 28 06
Posts: 20
ID: 61985
Credit: 1,680
RAC: 0
Message 51147 - Posted 3 Feb 2008 2:56:27 UTC
Last modified: 3 Feb 2008 2:57:11 UTC

The 2h4o**** jobs were of a very large protein with a very complicated architecture, so Rosetta gets stuck a lot during model generation. No more jobs of this variety will be sent out due to the problems you report.
____________

EdMulock Profile
Avatar

Joined: Mar 14 06
Posts: 30
ID: 65391
Credit: 2,347,485
RAC: 0
Message 51185 - Posted 5 Feb 2008 18:37:13 UTC


Any clue? This happens on 8 different tasks. Reboots, BOINC upgrade to 5.10.30, reset project, abort task: nothing helps.


2/5/2008 1:30:49 PM|rosetta@home|Task 1bm8__BOINC_CONTROLABRELAX_VF_IGNORE_THE_REST-S25-9-S3-3--1bm8_-vf__2547_10874_0 exited with a DLL initialization error.
2/5/2008 1:30:49 PM|rosetta@home|If this happens repeatedly you may need to reboot your computer.
2/5/2008 1:30:49 PM|rosetta@home|Restarting task 1bm8__BOINC_CONTROLABRELAX_VF_IGNORE_THE_REST-S25-9-S3-3--1bm8_-vf__2547_10874_0 using rosetta_beta version 593
2/5/2008 1:30:55 PM|rosetta@home|Task 1bm8__BOINC_CONTROLABRELAX_VF_IGNORE_THE_REST-S25-9-S3-3--1bm8_-vf__2547_10874_0 exited with a DLL initialization error.

____________

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4818
ID: 85645
Credit: 1,852,264
RAC: 1,628
Message 51221 - Posted 7 Feb 2008 14:35:04 UTC

1tit__BOINC_ABRELAX_VF_IGNORE_THE_REST-S25-11-S3-9--1tit_-vf__2731_81_0 died with client error and this message:

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 3223320

no credit granted

this happened on feb 2
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 51222 - Posted 7 Feb 2008 16:18:57 UTC - in response to Message ID 51185.


Any clue? This happens on 8 different tasks. Reboots, BOINC upgrade to 5.10.30, reset project, abort task: nothing helps.


Ed, is it just the one task? Or are you now having a similar problem with other tasks as well?

If just the one task, obviously an abort of that one should clear up its problems.

If it's happening on all of your tasks, I can only suggest doing a detach of the project, and then attach again. This will download a fresh copy of all of the dlls.
____________
Rosetta Moderator: Mod.Sense

EdMulock Profile
Avatar

Joined: Mar 14 06
Posts: 30
ID: 65391
Credit: 2,347,485
RAC: 0
Message 51241 - Posted 8 Feb 2008 15:36:20 UTC - in response to Message ID 51222.


Any clue? This happens on 8 different tasks. Reboots, BOINC upgrade to 5.10.30, reset project, abort task: nothing helps.


Ed, is it just the one task? Or are you now having a similar problem with other tasks as well?

If just the one task, obviously an abort of that one should clear up its problems.

If it's happening on all of your tasks, I can only suggest doing a detach of the project, and then attach again. This will download a fresh copy of all of the dlls.



Now about 120 different tasks. I've done that ( reset project about 5 times ). ( As stated in the first post )

All finish with "compute error" as reported status; and all restart ( over and over ) after about 4 seconds.
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 51242 - Posted 8 Feb 2008 16:06:01 UTC

Ed, I was not, and am not, clear on exactly what you've done. Did you "reset" the project? Or did you "detach", then "attach" again? I am suggesting a complete detach.

Is it possible a virus scanner is consistently corrupting one of the files as they reload? You might try reinstalling BOINC to a new directory, and see if that triggers a message from an antivirus product that you may have overlooked originally.
____________
Rosetta Moderator: Mod.Sense

stoneysilence

Joined: May 4 07
Posts: 13
ID: 173036
Credit: 401,055
RAC: 0
Message 51324 - Posted 11 Feb 2008 7:29:31 UTC

Got my first Failed Task to my knowledge tonight. Been having problems with the MiniRosettas so at first I thought it was one of them. But after I researched it found it was a 5.93 task. Only ran for a bit over an hour before it apparently crashed. Most units run for 1.5/2.9 hours at least. Something obviously went haywire.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=127347713
http://boinc.bakerlab.org/rosetta/result.php?resultid=139825592

aeryise

Joined: Nov 5 07
Posts: 1
ID: 218078
Credit: 47,149
RAC: 0
Message 51412 - Posted 15 Feb 2008 8:15:27 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=140509732
http://boinc.bakerlab.org/rosetta/result.php?resultid=138228394

I've also had the strange problem of tasks restarting from zero although when I stopped BOINC and shut down my computer the day before, they were at 30+% or 70+% i.e. nonzero completion. Not sure if this is related to 5.93 in any way, but this restart has only been happening in the past 3 days.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 51416 - Posted 15 Feb 2008 17:07:34 UTC

aeryise, anytime you exit BOINC, you will lose some amount of work. The program would burn up your disk drive (and a lot of valuable computer time) if it was constantly storing everything it has done so far. So, periodically, it does a "checkpoint" where it preserves the work done so far. Some types of tasks are able to checkpoint more frequently than others.

The % completed is relative to your configured setting for your preferred runtime, so it doesn't tell you definitively how much work has been saved. In general, the project tries to checkpoint about every 15 minutes, but there are some types of tasks that cannot do so, and may go for an hour or more without taking a checkpoint.

So if a checkpoint has not been reached when you exit BOINC, it will restart at 0% complete. It should then proceed normally. Don't worry, if you do several restarts like this without reaching a checkpoint, the Rosetta "watch dog" will figure out that this particular task is not a good fit for your machine, and purge it and get another task which may be able to checkpoint more frequently.
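
To illustrate the checkpointing idea described above, here is a minimal Python sketch. It is not Rosetta's or BOINC's actual code: the `Checkpointer` class, the JSON file format, and the `step` field are all hypothetical, and the 900-second default just mirrors the "about every 15 minutes" figure mentioned above.

```python
import json
import os
import tempfile
import time


class Checkpointer:
    """Save progress at most once per `interval` seconds (hypothetical helper)."""

    def __init__(self, path, interval=900.0, clock=time.monotonic):
        self.path = path
        self.interval = interval
        self.clock = clock
        self.last_saved = clock()

    def load(self):
        """Return the last saved step, or 0 if no checkpoint was ever written."""
        try:
            with open(self.path) as fh:
                return json.load(fh)["step"]
        except (FileNotFoundError, ValueError, KeyError):
            return 0

    def maybe_save(self, step):
        """Write a checkpoint only if `interval` seconds have passed since the last one."""
        now = self.clock()
        if now - self.last_saved < self.interval:
            return False
        # Write to a temp file, then rename, so a crash never leaves a half-written file.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as fh:
            json.dump({"step": step}, fh)
        os.replace(tmp, self.path)
        self.last_saved = now
        return True
```

If the program exits before the first `maybe_save` succeeds, `load()` returns 0 on the next start, which corresponds to the "restart at 0% complete" behaviour described in this post.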
____________
Rosetta Moderator: Mod.Sense

Weasel

Joined: Nov 20 06
Posts: 1
ID: 130480
Credit: 334,404
RAC: 0
Message 51551 - Posted 22 Feb 2008 0:35:22 UTC

Well, I don’t have the time to read all the posts here (especially since even at 1024 X 768 I have to scroll sideways to read them), so I’ll just state my problems.
Even with the “Leave applications in memory while suspended” set to NO, R@H still hangs around after suspending the project, which with 350 MB for a WU, I have to do to get any work done.
So – memory hog WUs require suspension, which refuses to give up memory.

____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3097
ID: 106194
Credit: 0
RAC: 0
Message 51559 - Posted 22 Feb 2008 9:43:36 UTC
Last modified: 22 Feb 2008 9:44:21 UTC

Weasel, do you run Linux? Windows? Or Mac?

Edit, I see now that all of your hosts are Windows.
____________
Rosetta Moderator: Mod.Sense

KWSN THE Holy Hand Grenade! Profile

Joined: May 3 07
Posts: 5
ID: 172695
Credit: 532,162
RAC: 1,005
Message 51660 - Posted 26 Feb 2008 18:03:55 UTC - in response to Message ID 51221.

1tit__BOINC_ABRELAX_VF_IGNORE_THE_REST-S25-11-S3-9--1tit_-vf__2731_81_0 died with client error and this message:

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 3223320

no credit granted

this happened on feb 2


Greg, see my Message 50716, in this thread (on Jan 15) - I'm glad that I'm not the only one with the problem! (Note that it must be R@H 5.93, as this happened on two different OS's and two different builds of BOINC)

Greg_BE Profile
Avatar

Joined: May 30 06
Posts: 4818
ID: 85645
Credit: 1,852,264
RAC: 1,628
Message 51663 - Posted 26 Feb 2008 18:39:20 UTC

KWSN - when I click on the link it says no such task.
I went looking at your number 2 computer and noticed all the compute errors for all the FRA_t847__2 work. You should post those errors so the team knows that a lot of that work crashes, unless you're running on RALPH.

If that stuff crashes on your system, got to wonder if it's going to die on mine.
____________

Thomas Leibold

Joined: Jul 30 06
Posts: 55
ID: 102494
Credit: 19,256,322
RAC: 7,733
Message 51700 - Posted 27 Feb 2008 23:11:57 UTC

Just checked on one of the servers whose performance was below par and found that it was still "running" on a 1zpy workunit. The workunit deadline expired over 1 month ago, confirming that short of manually aborting misbehaving workunits they will never stop on their own.

OS: SuSE Linux 10.1
Boinc: 5.10.21
Rosetta: 5.93
Workunit: ? no idea which number, long gone from the server!

stderr.txt:
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score -95.2845 for 900 seconds

This is (as usual!!!) followed by a SIGSEGV with the watchdog crashing and the client failing to terminate properly (and since the client process remains alive Boinc never finds out that there is anything wrong).
I'm well aware that this is not specific to the 5.93 client since that issue has been around for a long time, just reporting that it is still an issue.
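
The stderr lines above suggest a watchdog that ends a run once the score has stayed unchanged for a fixed period. A minimal Python sketch of that check follows. It is an illustration only, not the Rosetta watchdog: the `StuckScoreWatchdog` class and its `observe` method are hypothetical names, and only the 900-second limit is taken from the log.

```python
import time


class StuckScoreWatchdog:
    """Flag a run whose score has not changed for `limit` seconds (illustrative only)."""

    def __init__(self, limit=900.0, clock=time.monotonic):
        self.limit = limit
        self.clock = clock
        self.last_score = None
        self.last_change = clock()

    def observe(self, score):
        """Record the latest score; return True if the run looks stuck."""
        now = self.clock()
        if score != self.last_score:
            # Score moved: remember it and restart the timer.
            self.last_score = score
            self.last_change = now
            return False
        return now - self.last_change >= self.limit
```

Detecting the stuck state is the easy half. The sketch does not cover terminating the process cleanly afterwards, which is the step reported as failing in this post.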
____________
Team Helix

P . P . L .

Joined: Aug 20 06
Posts: 551
ID: 105843
Credit: 3,089,054
RAC: 2,001
Message 51728 - Posted 29 Feb 2008 21:15:02 UTC

I've got a validate error on this one, first time ever. Don't know what happened; it ran normally and finished.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=131395160

pete.

____________


Sid Celery

Joined: Feb 11 08
Posts: 550
ID: 241409
Credit: 5,181,515
RAC: 4,586
Message 51780 - Posted 3 Mar 2008 16:01:10 UTC

Hi,
Possibly not a Rosetta 5.93 error but with the Boinc Manager - apologies in advance if this question should go elsewhere.

I have added BOINC to a friend's HP Vista laptop and registered with Rosetta as LizzieBarry. I noticed that the BOINC Manager isn't given permission to run at bootup on the machine by Windows Defender. I'm able to give it permission to run after the computer hits the desktop, but the computer owner is a complete novice and would not be able to do so herself. I'm not a greatly technical person and only use XP at home, so I'm not familiar with Vista, but can someone advise how I can ensure BOINC Manager starts on bootup without any user intervention? It's been suggested I go into BOINC Manager properties and choose 'Run as administrator', but this hasn't been successful either.

Is it an issue with BM, Defender or the Vista OS? Has anyone else seen this and found a solution I can use, or can anyone provide a link?

Also, if this question gets moved to a better topic, can someone mail me with its new location? Any help much appreciated.

transient

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 4,820,543
RAC: 2,349
Message 51782 - Posted 3 Mar 2008 16:35:42 UTC

I'm not running Vista myself, but maybe it will work if BOINC is not installed in the 'Program Files' folder, but in the root of the C:\ partition, for example.
____________

Paul

Joined: Oct 29 05
Posts: 154
ID: 7397
Credit: 11,613,750
RAC: 426
Message 51861 - Posted 9 Mar 2008 11:59:40 UTC - in response to Message ID 51782.

I'm not running Vista myself, but maybe it will work if BOINC is not installed in the 'Program Files' folder, but in the root of the C:\ partition, for example.


Client Download Error. This could just be a bad WU

Task ID 146467347
WU ID 133460753

I gotta learn how to use links on this message board.
____________
Thx!

Paul

Paul

Joined: Oct 29 05
Posts: 154
ID: 7397
Credit: 11,613,750
RAC: 426
Message 51862 - Posted 9 Mar 2008 12:05:06 UTC

More 5.93 errors

Here are a couple more errors with 5.93

Task ID WU ID
146382856 133443746 Client error Downloading 0.00 0.00
146381856 133443796 Client error Compute error 0.00 0.00
146381324 133443322 Client error Downloading 0.00 0.00
146380385 133402979 Client error Compute error 0.00 0.00


____________
Thx!

Paul

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,061,841
RAC: 1,331
Message 51922 - Posted 13 Mar 2008 13:43:10 UTC

I have 13 WUs in my Pending Credit list, which is normally empty. Sure enough, when I checked the Server Status page, I see that the rah_validator_beta program isn't running.

Paratima

Joined: Mar 9 08
Posts: 1
ID: 246388
Credit: 274,862
RAC: 0
Message 51927 - Posted 13 Mar 2008 19:24:12 UTC

I have six WU's pending the beta validator's running again. Will run more WU's after these are credited.

David E K Profile
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 761
ID: 14
Credit: 1,716,867
RAC: 323
Message 51928 - Posted 13 Mar 2008 19:38:32 UTC

One of the validators went down because of a random issue with one of the work units. I had to remove the bad work unit and result entries in the database. This has happened maybe 3 times during the whole lifetime of the project. It just had to happen again when I am at the hospital expecting our 2nd son! Good thing my wife is being induced, which may take hours to develop, which means a lot of sitting around.

hedera Profile
Avatar

Joined: Jul 15 06
Posts: 66
ID: 100139
Credit: 1,109,835
RAC: 1,164
Message 51930 - Posted 13 Mar 2008 19:49:25 UTC

Squiddy, on the Vista issue: I bought a Vista laptop last fall and was going to install BOINC, but when I looked into it, the message boards indicated that the only way to avoid Vista's "nanny prompts" for BOINC at startup was to install BOINC to run as a service. This isn't my usual practice, and I decided that was too much of a pain. I'm waiting for the BOINC team to add Vista compatibility to BOINC, but you might think about it for your novice user.

Just remember, in the land of the blind, the one-eyed man is king...
____________
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.

adrianxw Profile
Avatar

Joined: Sep 18 05
Posts: 520
ID: 402
Credit: 834,588
RAC: 19
Message 51932 - Posted 13 Mar 2008 20:03:05 UTC
Last modified: 13 Mar 2008 20:31:13 UTC

When the newest version of BOINC, 5.10.45, was announced on the boinc_dev list 2 days ago, it said, amongst other things...

This release also contains fixes for startup/shutdown issues when
running on Windows Vista.

... although I suspect this is more to do with the serious issues which required special shutdown hoops to jump through on Vista, rather than the Defender problem.

I spent some months last year converting one of my company's products to run on Vista. I won't be "upgrading" any of my machines any time soon.

With the validator issue, my "pending" list is dropping now.
____________
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

alpha Profile

Joined: Nov 4 06
Posts: 27
ID: 127202
Credit: 715,254
RAC: 96
Message 51956 - Posted 15 Mar 2008 6:42:41 UTC

One of my work units errored out. This doesn't tend to happen very often, so I'm guessing it is probably the application at fault. Plenty of stderr info though.

I have since updated to BOINC 5.10.45 and am running Rosetta 5.95.
____________

Sid Celery

Joined: Feb 11 08
Posts: 550
ID: 241409
Credit: 5,181,515
RAC: 4,586
Message 51972 - Posted 16 Mar 2008 3:53:41 UTC - in response to Message ID 51930.

Squiddy, on the Vista issue: I bought a Vista laptop last fall and was going to install BOINC, but when I looked into it, the message boards indicated that the only way to avoid Vista's "nanny prompts" for BOINC at startup was to install BOINC to run as a service. This isn't my usual practice, and I decided that was too much of a pain. I'm waiting for the BOINC team to add Vista compatibility to BOINC, but you might think about it for your novice user.

Appreciate the reply, hedera. Unfortunately I'm not clever enough to follow it. How would I go about installing BOINC as a service? Is there a link to some advice? I'm keen to follow this up but don't know where to start.

KSMarksPsych Profile

Joined: Oct 15 05
Posts: 199
ID: 4774
Credit: 21,970
RAC: 0
Message 51978 - Posted 16 Mar 2008 10:01:47 UTC - in response to Message ID 51972.

Appreciate the reply, hedera. Unfortunately I'm not clever enough to follow it. How would I go about installing BOINC as a service? Is there a link to some advice? I'm keen to follow this up but don't know where to start.


Download the latest installer from http://boinc.berkeley.edu/download.php. Stop BOINC (not necessary but safer). Double click the installer to start it. Choose service when given the choice. Click OK through the rest of the installer.

Note you'll have to have a password on the account for this to work.

____________
Kathryn :o)
The BOINC FAQ Service
The Unofficial BOINC Wiki
The Trac System
More BOINC information than you can shake a stick of RAM at.

M.L.

Joined: Nov 21 06
Posts: 182
ID: 130574
Credit: 180,462
RAC: 0
Message 51981 - Posted 16 Mar 2008 12:46:32 UTC

Task ID 147927265
Name t028_1_NMRREF_1_t028_1_id_model_14_idlIGNORE_THE_REST_core_2979_7573_0
Workunit 134873464
Created 13 Mar 2008 22:50:16 UTC
Sent 13 Mar 2008 22:50:33 UTC
Received 16 Mar 2008 12:43:37 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 735230
Report deadline 23 Mar 2008 22:50:33 UTC
CPU time 12113.984375
stderr out

<core_client_version>5.10.45</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 14400
# random seed: 3952428
No heartbeat from core client for 31 sec - exiting
# cpu_run_time_pref: 14400
======================================================
DONE :: 1 starting structures 12113.3 cpu seconds
This process generated 4 decoys from 4 attempts
0 starting pdbs were skipped
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...

</stderr_txt>
]]>

Validate state Valid
Claimed credit 50.0505518733559
Granted credit 43.4203813594983
application version 5.93


Copyright © 2014 University of Washington

Last Modified: 10 Nov 2010 1:51:38 UTC