loss of credit post crash

Message boards : Number crunching : loss of credit post crash


Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 45899 - Posted: 10 Sep 2007, 5:40:52 UTC

I had 34 work units report after the crash.
Of these, 27 came up as client errors and only 7 returned with an OK status.

What's up with this?
The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 45909 - Posted: 10 Sep 2007, 10:20:19 UTC - in response to Message 45899.  

Don't cry too hard, lol, I "lost" ~40 WUs. But I did see that 2 or 3 of them had "compute errors".

Trying to overtake me in Rosie credits again, Belgian, lol?!
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 45918 - Posted: 10 Sep 2007, 12:22:47 UTC

Not picking on you here Greg... but a cursory review of your failed tasks shows the following message:

<error_message>user requested transfer abort</error_message>

...if you abort the upload of your results (as appears to have occurred), your results cannot be useful to the project.
Rosetta Moderator: Mod.Sense
agge
Joined: 14 Nov 06
Posts: 63
Credit: 432,341
RAC: 0
Message 45920 - Posted: 10 Sep 2007, 12:27:46 UTC - in response to Message 45899.  

I doubt this is related, but yesterday, on one computer, I got 'compute error' on all of the WUs for all projects (SETI, Einstein & WCG) except Rosetta. It seems to be fine now after I reset the projects and restarted the computer. Any idea what this was about?
Greg_BE
Joined: 30 May 06
Posts: 5658
Credit: 5,670,291
RAC: 2,328
Message 45928 - Posted: 10 Sep 2007, 13:40:12 UTC - in response to Message 45918.  

Doh! So perhaps I should have just let them keep trying to communicate with the project last night during the communications troubles. I thought I was just stopping them from trying to communicate, not doing a total abort.
So how do I go about just making them pause if they are already queued in the transfer section? Suspend network activity, or what?

Beezlebub
Joined: 18 Oct 05
Posts: 40
Credit: 260,375
RAC: 0
Message 45929 - Posted: 10 Sep 2007, 13:41:00 UTC - in response to Message 45920.  

A graphics glitch on one of my computers will crash any WU running at the time with a "client error" message: Rosetta, CPDN, anything with graphics. When I track down the problem I'll post back (might be a while, though).

e6600 quad @ 2.5ghz
2418 floating point
5227 integer

e6750 dual @ 3.71ghz
3598 floating point
7918 integer


FluffyChicken
Joined: 1 Nov 05
Posts: 1260
Credit: 369,635
RAC: 0
Message 45941 - Posted: 10 Sep 2007, 15:52:34 UTC - in response to Message 45928.  

I did the same with one of mine... then I remembered it deletes it ;-) I was half asleep.
Yes, suspend network activity.

I may open a Trac ticket at BOINC asking for 'suspend' to be added to individual uploads, as suspend network activity suspends ALL network activity.



Team mauisun.org
(_KoDAk_)
Joined: 18 Jul 06
Posts: 109
Credit: 1,859,263
RAC: 0
Message 45983 - Posted: 11 Sep 2007, 6:23:55 UTC
Last modified: 11 Sep 2007, 6:30:58 UTC

Task ID    Work unit ID  Sent                     Reported                  Server state  Outcome  Client state  CPU time (sec)  Claimed credit  Granted credit
104537423  94863610      10 Sep 2007 4:39:11 UTC  11 Sep 2007 5:52:37 UTC   Over          Success  Done          6,936.59        42.07           20.00
104510027  94838135      10 Sep 2007 2:41:23 UTC  10 Sep 2007 14:56:26 UTC  Over          Success  Done          42,022.69       254.87          20.00
104510025  94838133      10 Sep 2007 2:41:23 UTC  10 Sep 2007 15:52:50 UTC  Over          Success  Done          27,928.73       169.39          20.00
104510023  94838131      10 Sep 2007 2:41:23 UTC  10 Sep 2007 23:58:21 UTC  Over          Success  Done          15,540.84       94.26           20.00

20.00??? What is it?
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 45994 - Posted: 11 Sep 2007, 12:15:08 UTC
Last modified: 11 Sep 2007, 12:17:21 UTC

KoDAk, your work units were ended prematurely by the watchdog. You are having the same issue described by several others in the "Problems with..." thread where the Rosetta score is stuck for 900 seconds.

So the 20 credits is basically a thank you for trying to crunch the task. These were probably issued by the nightly run to award credit for failed tasks. The project is working both on preserving any useful work done on the task (which is probably why it didn't show as a failure in the list), and on resolving the problem with some of the CAPRI tasks that causes many of them to end in this way.
Rosetta Moderator: Mod.Sense
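
For readers wondering what "stuck for 900 seconds" means in practice, here is a minimal sketch of that kind of stall check. It is not the actual Rosetta watchdog code: the class name `ScoreWatchdog`, the `report` method and the injectable `clock` are made up for illustration, and only the 900-second limit comes from the post above.

```python
import time

STALL_LIMIT_SECONDS = 900  # the limit quoted in the thread


class ScoreWatchdog:
    """Hypothetical stall detector: flags a task whose score stops changing."""

    def __init__(self, stall_limit=STALL_LIMIT_SECONDS, clock=time.monotonic):
        self.stall_limit = stall_limit
        self.clock = clock              # injectable so the example can fake time
        self.last_score = None
        self.last_change = clock()

    def report(self, score):
        """Record the latest score; return True if the task should be ended."""
        now = self.clock()
        if score != self.last_score:
            # The score moved, so restart the stall timer.
            self.last_score = score
            self.last_change = now
        return (now - self.last_change) >= self.stall_limit


# Drive the watchdog with a fake clock instead of waiting 15 minutes.
fake_now = [0.0]
dog = ScoreWatchdog(clock=lambda: fake_now[0])
dog.report(-120.5)           # first score seen at t = 0
fake_now[0] = 600.0
print(dog.report(-120.5))    # False: unchanged for only 600 s
fake_now[0] = 900.0
print(dog.report(-120.5))    # True: unchanged for 900 s, task would be ended
```

A task ended this way has not necessarily crashed, which would fit with it showing up as "Success" in the task list while still getting only the 20-credit award.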
No longer involved
Joined: 19 Mar 06
Posts: 22
Credit: 327,220
RAC: 0
Message 46018 - Posted: 11 Sep 2007, 19:07:38 UTC

The credit I have been getting since the crash shows me going backwards by the hour. This is a really good system. The more work we do now the less we get credit for doing. Guess it is time to move on.
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46020 - Posted: 11 Sep 2007, 19:15:03 UTC - in response to Message 46018.  

Phinehas, please describe how you are seeing fewer credits issued for work completed than you were prior to the system outage, because the credit system is the same as it has always been. Are you looking at RAC? Or credit for specific tasks?
Rosetta Moderator: Mod.Sense
Zxian
Joined: 17 May 07
Posts: 18
Credit: 1,173,075
RAC: 0
Message 46025 - Posted: 11 Sep 2007, 20:13:03 UTC

Since the system outage, I'm getting far, far more WUs with the 20-credit "thank you" than before. I actually think that I never saw this before the outage. I've tried to "fix" this by making my machines run for only 3 hours per WU, but this isn't an ideal solution.
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46026 - Posted: 11 Sep 2007, 20:31:46 UTC - in response to Message 46025.  

This seems to be due to the new type of tasks that are presently being sent out. You will note they have "CAPRI" in the name. I'm sure Rhiju is working hard on resolving these issues. They are working to predict structures for a CAPRI challenge.


Rosetta Moderator: Mod.Sense
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 46035 - Posted: 11 Sep 2007, 22:38:39 UTC

Valid results returned past the deadline have been granted the claimed credit. The maximum value possible is 300, so if you claimed over 300 you get 300.
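
Spelled out as code, the rule above is just a cap on the claimed value. This is only an illustration of the arithmetic, not the project's server code; the function name `granted_for_late_result` is invented here, and the 300 limit is the one stated in the post.

```python
MAX_GRANTED_CREDIT = 300.0  # cap stated in the post above


def granted_for_late_result(claimed):
    """Grant the claimed credit for a valid late result, limited to the cap."""
    return min(claimed, MAX_GRANTED_CREDIT)


print(granted_for_late_result(42.07))  # 42.07 (below the cap, granted in full)
print(granted_for_late_result(512.0))  # 300.0 (above the cap, clipped)
```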
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 46040 - Posted: 12 Sep 2007, 4:37:52 UTC

David, I don't believe anyone in this thread has tasks where they claimed that much credit. I think the issue is the CAPRI tasks that are ended by the watchdog due to the Rosetta score not moving for 900 seconds.
Rosetta Moderator: Mod.Sense
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 46075 - Posted: 12 Sep 2007, 17:30:25 UTC

Oh, I posted to the wrong thread. The watchdog errors suggest that there may be an issue with the application or the specific work units for CAPRI. I'll alert Rhiju and the others involved in CAPRI. Since the CAPRI experiment/competition is time sensitive, they may not be able to address the issue soon.
No longer involved
Joined: 19 Mar 06
Posts: 22
Credit: 327,220
RAC: 0
Message 46565 - Posted: 19 Sep 2007, 4:08:45 UTC - in response to Message 46020.  

I am looking at Average Work Done, which is now down to 1286 and dropping fast. This seems to relate to the increasing delays from the server. It is now putting out 'communication deferred' delays that run into hours each day. The ranking of computers has dropped me from being around 6 or 7 to somewhere around 39 now. Why do we bother with these kinds of stats when the host site determines the outcomes? I have watched hours and hours of work units sitting here, unable to be returned because the server was delaying communications.
No longer involved
Joined: 19 Mar 06
Posts: 22
Credit: 327,220
RAC: 0
Message 46569 - Posted: 19 Sep 2007, 4:31:06 UTC - in response to Message 46020.  

I have been trying to respond to your request but the server does not take the update. The messages below show what the server is doing to running jobs. The message boards say the server is up and running, yet I keep getting this type of message, sometimes with multiple hours of delay. That delay turns into reduced results and standings in the Teams and Computer ratings.

Tue 18 Sep 22:26:26 2007|rosetta@home|Message from server: Project is temporarily shut down for maintenance
Tue 18 Sep 22:26:26 2007|rosetta@home|Deferring communication for 1 hr 0 min 0 sec
Tue 18 Sep 22:26:26 2007|rosetta@home|Reason: project is down


BarryAZ
Joined: 27 Dec 05
Posts: 153
Credit: 30,841,260
RAC: 0
Message 46577 - Posted: 19 Sep 2007, 4:34:40 UTC - in response to Message 46568.  

Look instead at the total work completed credits. The average is based on something like the past 2 weeks, so you would expect it to drop and 'stay dropped' until the outage timeframe begins to fall outside that two-week window. Daily credits for me still haven't quite recovered to the pre-crash levels -- today's hiccups didn't help with that of course, nor did the release into the wild of some 'bad boy' work units which CPUs would chew on but not yield credit. Take a look at the message board topic regarding the 5.80 application and look through it -- work units with 'Capri' in the title have been mentioned as work units you want to abort.
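
To make the "drop and stay dropped" behaviour concrete, here is a rough simulation. It assumes the average is an exponentially decaying one with a one-week half-life; the post only says "something like the past 2 weeks", so the half-life, the daily update step and the names `update_rac` and `simulate` are all assumptions for illustration rather than the server's real code.

```python
import math

HALF_LIFE_DAYS = 7.0  # assumption: one-week half-life for the average


def update_rac(rac, credit_earned, days=1.0):
    """Decay the old average over `days` and blend in the new daily rate."""
    decay = math.exp(-math.log(2) * days / HALF_LIFE_DAYS)
    return rac * decay + (credit_earned / days) * (1.0 - decay)


def simulate(daily_credit):
    """Return the average after each day, starting from a steady state."""
    rac = daily_credit[0]
    history = []
    for credit in daily_credit:
        rac = update_rac(rac, credit)
        history.append(rac)
    return history


# 14 normal days at 1000 credits/day, a 3-day outage, then 14 normal days.
schedule = [1000.0] * 14 + [0.0] * 3 + [1000.0] * 14
history = simulate(schedule)
print(f"before outage:   {history[13]:.0f}")
print(f"end of outage:   {history[16]:.0f}")
print(f"one week later:  {history[23]:.0f}")
print(f"two weeks later: {history[30]:.0f}")
```

Under these assumptions the average falls during the outage and only creeps back over the following weeks, even with the host producing at its old rate the whole time. Total credit never goes down, which is why it is the better number to watch after an outage.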





Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 46636 - Posted: 19 Sep 2007, 18:38:41 UTC - in response to Message 46577.  
Last modified: 19 Sep 2007, 18:40:23 UTC

Hi everybody: Thanks for your posts and for your patience over the last week. Quite a few things have been crazy. We have been testing all our workunits on the RALPH test server and they went through fine -- so your feedback over here at Rosetta@home has been critical to identifying and (in some cases) fixing new problems.

The issue with the CAPRI workunits appears to be the large number of generated models and the size of output files; this was hammering our already frazzled fileservers. We are no longer sending out those jobs -- if we do, we'll fix this issue first. We're very sorry for this problem; it was totally unanticipated.

There was also a separate issue with some workunits sent out before the crash not being accepted as valid; we had a problem with the database, and I think DK has fixed this.

Then of course there was the massive outage; as BarryAZ has explained, this is causing some craziness with the credits that should hopefully be gone in a week or so.

If you can, bear with us here. The results we're getting back are exciting on a number of scientific fronts. The CAPRI data on predicting protein-protein interactions is very interesting and we're analyzing it now. The work with NMR-constrained protein structural inference has the potential to revolutionize how structures are solved. And there's more exciting stuff coming soon -- we'll try to be as careful as possible!





©2024 University of Washington
https://www.bakerlab.org