Lots of Validate Errors???

Message boards : Number crunching : Lots of Validate Errors???

Ace Casino
Joined: 16 Jul 07
Posts: 17
Credit: 11,592,948
RAC: 11,465
Message 69579 - Posted: 2 Feb 2011, 10:47:16 UTC

Why?
My partner on the WUs also errored.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 69583 - Posted: 2 Feb 2011, 16:21:32 UTC

Which machine? Which work units?
Rosetta Moderator: Mod.Sense

Ace Casino
Joined: 16 Jul 07
Posts: 17
Credit: 11,592,948
RAC: 11,465
Message 69585 - Posted: 2 Feb 2011, 17:32:48 UTC

On my computer called Rockyquad4 there are about 40 validate errors. There are also a few validate errors on my other machines.

Link
Joined: 4 May 07
Posts: 352
Credit: 382,349
RAC: 0
Message 69629 - Posted: 13 Feb 2011, 12:50:42 UTC

WU ID=365224239: both tasks ended in "validate error"

WU ID=363473507: just my task... wingman's was OK.


The explanation of validate errors is "The task was reported but could not be validated, typically because the output files were lost on the server". So is it just the server's fault?

Sid Celery
Joined: 11 Feb 08
Posts: 1982
Credit: 38,462,565
RAC: 15,158
Message 69634 - Posted: 14 Feb 2011, 7:18:37 UTC


mikey
Joined: 5 Jan 06
Posts: 1894
Credit: 8,767,285
RAC: 12,464
Message 69642 - Posted: 15 Feb 2011, 10:47:17 UTC - in response to Message 69629.  
Last modified: 15 Feb 2011, 10:49:35 UTC

> WU ID=365224239: both tasks ended in "validate error"
>
> WU ID=363473507: just my task... wingman's was OK.
>
> The explanation of validate errors is "The task was reported but could not be validated, typically because the output files were lost on the server". So is it just the server's fault?

@Link yours COULD be that you are still using an older version, 6.10.18, of BOINC. I am NOT saying it is, but the current version is 6.10.58 and that is a lot of steps in between.

Sid Celery IS using the 6.10.58 version of BOINC and is having similar errors, although both he AND his wingman got errors, while your wingman finished his just fine.

Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 69648 - Posted: 15 Feb 2011, 21:01:03 UTC - in response to Message 69642.  
Last modified: 15 Feb 2011, 21:02:41 UTC

> > WU ID=365224239: both tasks ended in "validate error"
> >
> > WU ID=363473507: just my task... wingman's was OK.
> >
> > The explanation of validate errors is "The task was reported but could not be validated, typically because the output files were lost on the server". So is it just the server's fault?
>
> @Link yours COULD be that you are still using an older version, 6.10.18, of BOINC. I am NOT saying it is, but the current version is 6.10.58 and that is a lot of steps in between.
>
> Sid Celery IS using the 6.10.58 version of BOINC and is having similar errors, although both he AND his wingman got errors, while your wingman finished his just fine.


It is interesting that the task that failed ran about 9 times longer than the successful task and produced four times as many decoys. There could have been a problem with both tasks but the shorter run time allowed one task to finish before encountering the anomaly.

I have also had validate errors for these types of tasks on 12th February:

T0602_boinc_5_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22906_816_1
T0635_boinc_1_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22905_90_0

The tasks failed on BOINC versions 6.4.5, 6.10.17 and 6.10.58. I would say it is safe to assume it is just a bad batch of work units and nothing to do with the client computers.
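
To make the version argument concrete, here is a minimal Python sketch. The helper name `parse_version` is made up for illustration and is not part of BOINC. It compares versions as integer tuples, because plain string comparison would wrongly rank "6.4.5" above "6.10.17", and then checks whether 6.10.18 (the version mikey asked about) sits between the oldest and newest versions that failed.

```python
def parse_version(text):
    """Turn '6.10.18' into (6, 10, 18) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

failing = ["6.4.5", "6.10.17", "6.10.58"]  # versions reported in this post
suspect = "6.10.18"                         # version mikey asked about

oldest = min(failing, key=parse_version)
newest = max(failing, key=parse_version)

# True when the suspect version lies strictly between the failing extremes.
suspect_in_range = (
    parse_version(oldest) < parse_version(suspect) < parse_version(newest)
)

print(oldest, newest, suspect_in_range)
```

Since the failures show up on versions both older and newer than 6.10.18, the client version alone does not look like a good explanation.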

Sid Celery
Joined: 11 Feb 08
Posts: 1982
Credit: 38,462,565
RAC: 15,158
Message 69650 - Posted: 16 Feb 2011, 3:15:03 UTC - in response to Message 69634.  
Last modified: 16 Feb 2011, 3:18:04 UTC


Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,703,329
RAC: 2,182
Message 69651 - Posted: 16 Feb 2011, 8:27:47 UTC
Last modified: 16 Feb 2011, 8:29:17 UTC


Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,812,147
RAC: 777
Message 69656 - Posted: 16 Feb 2011, 18:21:09 UTC

**WARNING: Scattered observations and wild speculation contained herein**

I am wondering if there isn't another "end computation now" rule (in addition to the preferred CPU time limit and the 100 model limit) that is tripping up the validator. This thought has occurred to me before when tasks have ended well within those parameters but without obvious error. Currently I have crunched two tasks that fit this description:

T0579_boinc_5_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22906_24_0
T0523_boinc_10_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22912_178_0

Both of these validated successfully but one ended after completing 5 decoys in 4248 seconds, the other after completing 9 decoys in 3347 seconds. My cpu run time preference is 43200 seconds.
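
The check I am doing by eye can be sketched in Python. This is only an illustration: `stop_reason` is a made-up helper, the two rules are simplified to plain threshold tests (the real app may well stop a bit before the preference is reached), and "task A" and "task B" are just labels for the two results above.

```python
RUNTIME_PREF_S = 43200  # my CPU run time preference, in seconds
MODEL_LIMIT = 100       # the 100 model limit mentioned above

def stop_reason(decoys, cpu_seconds):
    """Name the known rule a finished task hit, or 'unexplained' if neither."""
    if cpu_seconds >= RUNTIME_PREF_S:
        return "runtime preference"
    if decoys >= MODEL_LIMIT:
        return "model limit"
    return "unexplained"

# (decoys completed, CPU seconds) for the two tasks described above
tasks = {"task A": (5, 4248), "task B": (9, 3347)}

reasons = {name: stop_reason(d, s) for name, (d, s) in tasks.items()}
# Share of the runtime preference each task actually used
usage = {name: round(s / RUNTIME_PREF_S, 3) for name, (d, s) in tasks.items()}

print(reasons)
print(usage)
```

Both tasks come out as "unexplained" and used under 10% of the allowed run time, which is what makes me suspect a third rule.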

Looking through the similarly named tasks reported in this thread I notice many recorded the exact same details in their stderr output: completion of 5 decoys in 1201 seconds. The CPU time recorded elsewhere on the task details page varies considerably; I've seen 507, 693, 843 and 1109 seconds.

Lots of similarly named tasks have completed and validated successfully with varying amounts of cpu time and number of models completed and on the same hosts that are receiving validate errors.

I speculate that these tasks are reaching (achieving?) some point after which it is futile (unnecessary?) to continue and so the app code says it's time to stop working on this one and send it back. If this happens before a single model has been completed the validator code would need a new set of instructions for this. As would the credit granting code. A script is run (once a day, I think) that grants credit to tasks which ended with a validate error. I wonder if the 5 decoys/1201 CPU seconds in the stderr output is the clue left by the app that the work done by this host on this task is actually fine and should receive credit, and that for credit purposes the server should assume 5 models completed. In which case it's less a matter of the validator being tripped up than a workaround that's confusing us uninformed crunchers.

This doesn't answer why some tasks validate successfully after ending before the runtime preference or the 100 model limit is met. There must be some other limiting factor but I haven't spotted any pattern among the successfully validated tasks. Or more likely it's the same limiting factor, but as long as it occurs in a second or later model rather than the first, the validator doesn't need a special set of instructions for dealing with it.

**Ending speculation and entering a plea for an admin or Mod.Sense to let me know if any of this is even remotely close to reality.**

Best,
Snags

Link
Joined: 4 May 07
Posts: 352
Credit: 382,349
RAC: 0
Message 69658 - Posted: 16 Feb 2011, 18:43:22 UTC - in response to Message 69642.  

> @Link yours COULD be that you are still using an older version, 6.10.18, of BOINC. I am NOT saying it is, but the current version is 6.10.58 and that is a lot of steps in between.
>
> Sid Celery IS using the 6.10.58 version of BOINC and is having similar errors, although both he AND his wingman got errors, while your wingman finished his just fine.

I was testing 6.10.58 on my laptop and it was generating a new host-CPID on almost every reboot, especially if the IP changed. I posted about this problem here. I was not the only one with this problem, as you can see here. So this version is messing up stats pages and that's why I'm back to 6.10.18.

Link
Joined: 4 May 07
Posts: 352
Credit: 382,349
RAC: 0
Message 69659 - Posted: 16 Feb 2011, 18:51:10 UTC - in response to Message 69656.  

> **WARNING: Scattered observations and wild speculation contained herein**
>
> I am wondering if there isn't another "end computation now" rule (...)

From my observations, the rule for all T????_boinc_#_templates* tasks is:

max. number of decoys = #

but sometimes they can also end before that, like your "10" ended with 9.
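
If anyone wants to check this against their own task lists, a rough Python sketch follows. The regular expression is inferred only from the task names quoted in this thread, and `decoy_cap` is a made-up helper name, so treat both as assumptions rather than anything official from the project.

```python
import re

# Pattern guessed from names like T0523_boinc_10_templates_...;
# the captured number is the "#" in the rule above.
TASK_RE = re.compile(r"^T\d{4}_boinc_(\d+)_templates")

def decoy_cap(task_name):
    """Return the number after 'boinc_' as an int, or None if no match."""
    match = TASK_RE.match(task_name)
    return int(match.group(1)) if match else None

names = [
    "T0579_boinc_5_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22906_24_0",
    "T0523_boinc_10_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22912_178_0",
]
caps = [decoy_cap(name) for name in names]
print(caps)
```

A task that reports fewer decoys than its cap, like the "10" that ended with 9, would then be one that stopped early for some other reason.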





Snags
Joined: 22 Feb 07
Posts: 198
Credit: 2,812,147
RAC: 777
Message 69661 - Posted: 16 Feb 2011, 19:42:27 UTC - in response to Message 69659.  

> > **WARNING: Scattered observations and wild speculation contained herein**
> >
> > I am wondering if there isn't another "end computation now" rule (...)
>
> From my observations, the rule for all T????_boinc_#_templates* tasks is:
>
> max. number of decoys = #
>
> but sometimes they can also end before that, like your "10" ended with 9.

Ah, thanks Link, I really should have caught that.

For my examples, then, my guess is that while working on the 10th decoy the app encountered this new limiting factor, ended the crunching and reported back the 9 completed models. In the other instance it completed the 5 models as assigned and without incident.

And of course now I wonder about the model limit. Does it have anything to do with what they are trying to find out by running these tasks or is it a coping mechanism for tasks that require a lot of memory and/or produce large output files?

Best,
Snags

Greg_BE
Joined: 30 May 06
Posts: 5662
Credit: 5,703,329
RAC: 2,182
Message 69668 - Posted: 17 Feb 2011, 23:18:20 UTC - in response to Message 69661.  

> > > **WARNING: Scattered observations and wild speculation contained herein**
> > >
> > > I am wondering if there isn't another "end computation now" rule (...)
> >
> > From my observations, the rule for all T????_boinc_#_templates* tasks is:
> >
> > max. number of decoys = #
> >
> > but sometimes they can also end before that, like your "10" ended with 9.
>
> Ah, thanks Link, I really should have caught that.
>
> For my examples, then, my guess is that while working on the 10th decoy the app encountered this new limiting factor, ended the crunching and reported back the 9 completed models. In the other instance it completed the 5 models as assigned and without incident.
>
> And of course now I wonder about the model limit. Does it have anything to do with what they are trying to find out by running these tasks or is it a coping mechanism for tasks that require a lot of memory and/or produce large output files?
>
> Best,
> Snags

Wasn't that earlier, when tasks were generating over 100 decoys or something along that line? I thought they put limiter code in to shut the task down at 100 vs 1000 or whatever.

Jesse Viviano
Joined: 14 Jan 10
Posts: 42
Credit: 2,700,472
RAC: 0
Message 69786 - Posted: 10 Mar 2011, 21:00:18 UTC

Work unit 370728909 is another work unit that failed due to validate errors. I don't have a problem with a watchdog shutting down a work unit due to having found too many decoys, but I do have a problem when the validator cannot validate such results and therefore my results get wasted. I don't care about credits, but I do care about wasted science.



©2024 University of Washington
https://www.bakerlab.org