Lots of Validate Errors???

Author	Message
Ace Casino Joined: 16 Jul 07 Posts: 18 Credit: 16,197,811 RAC: 0	Message 69579 - Posted: 2 Feb 2011, 10:47:16 UTC
Why? My partner on the WUs errored too.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 69583 - Posted: 2 Feb 2011, 16:21:32 UTC
Which machine? Which work units?
Rosetta Moderator: Mod.Sense

Ace Casino Joined: 16 Jul 07 Posts: 18 Credit: 16,197,811 RAC: 0	Message 69585 - Posted: 2 Feb 2011, 17:32:48 UTC
On my computer called Rockyquad4 there are about 40 validate errors. There are a few validate errors on other machines as well.

Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0	Message 69629 - Posted: 13 Feb 2011, 12:50:42 UTC
WU ID=365224239: both tasks ended in "validate error"
WU ID=363473507: just my task... wingman's was OK.
The explanation of validate errors is "The task was reported but could not be validated, typically because the output files were lost on the server". So is it just the server's fault?

Sid Celery Joined: 11 Feb 08 Posts: 2487 Credit: 46,548,320 RAC: 3,365	Message 69634 - Posted: 14 Feb 2011, 7:18:37 UTC
Ditto here overnight. Only a few hundred seconds of run time when set up for 8 hours:
T0545_boinc_1_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22905_72_1
T0602_boinc_5_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22906_269_1

mikey Joined: 5 Jan 06 Posts: 1898 Credit: 12,751,025 RAC: 1,454	Message 69642 - Posted: 15 Feb 2011, 10:47:17 UTC - in response to Message 69629. Last modified: 15 Feb 2011, 10:49:35 UTC
@Link, yours COULD be because you are still using an older version of BOINC, 6.10.18. I am NOT saying it is, but the current version is 6.10.58, and that is a lot of steps in between. Sid Celery IS using BOINC 6.10.58 and is having similar errors, although both he AND his wingman got errors, while your wingman finished his just fine.

Murasaki Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0	Message 69648 - Posted: 15 Feb 2011, 21:01:03 UTC - in response to Message 69642. Last modified: 15 Feb 2011, 21:02:41 UTC
It is interesting that the task that failed ran about 9 times longer than the successful task and produced four times as many decoys. There could have been a problem with both tasks but the shorter run time allowed one task to finish before encountering the anomaly.
I have also had validate errors for these types of tasks on 12th February:
T0602_boinc_5_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22906_816_1
T0635_boinc_1_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22905_90_0
The tasks failed on BOINC versions 6.4.5, 6.10.17 and 6.10.58. I would say it is safe to assume it is just a bad batch of work units and nothing to do with the client computers.

Sid Celery Joined: 11 Feb 08 Posts: 2487 Credit: 46,548,320 RAC: 3,365	Message 69650 - Posted: 16 Feb 2011, 3:15:03 UTC - in response to Message 69634. Last modified: 16 Feb 2011, 3:18:04 UTC
Three more for me, but only on my Vista AMD desktop, not my W7 Intel laptop:
T0634_boinc_5_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22911_986_0
T0619_boinc_5_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22911_959_1
T0560_boinc_10_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22912_406_1
Also note: of the 5 tasks (the two from Message 69634 plus these three), mine was the second attempt at the first 2 and the last 2. On the 3rd of the 5, my run was the first attempt.

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 69651 - Posted: 16 Feb 2011, 8:27:47 UTC Last modified: 16 Feb 2011, 8:29:17 UTC
I just checked my tasks and found a bunch in the T0 series as well, also with double validate errors:
T0611_boinc_1_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22905_1
T0567_boinc_5_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22906_33
T0632_boinc_1_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22905_34
T0522_boinc_1_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22905_49
T0580_boinc_5_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22911_50
T0530_boinc_5_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22911_530
T0580_boinc_5_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22911_595
*Note: Wingman has this in his queue still
T0632_boinc_5_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22911_661
The above tasks were granted claimed credit. But it is still a pretty huge hiccup for the validator.

Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0	Message 69656 - Posted: 16 Feb 2011, 18:21:09 UTC
WARNING: Scattered observations and wild speculation contained herein

I am wondering if there isn't another "end computation now" rule (in addition to the preferred CPU time limit and the 100 model limit) that is tripping up the validator. This thought has occurred to me before when tasks have ended well within those parameters but without obvious error. Currently I have crunched two tasks that fit this description:
T0579_boinc_5_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22906_24_0
T0523_boinc_10_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22912_178_0
Both of these validated successfully, but one ended after completing 5 decoys in 4248 seconds, the other after completing 9 decoys in 3347 seconds. My CPU run time preference is 43200 seconds.

Looking through the similarly named tasks reported in this thread, I notice many recorded the exact same details in their stderr out: completion of 5 decoys in 1201 seconds. The CPU time recorded elsewhere on the task details page varies considerably; I've seen 507, 693, 843 and 1109 seconds. Lots of similarly named tasks have completed and validated successfully, with varying amounts of CPU time and numbers of models completed, on the same hosts that are receiving validate errors.

I speculate that these tasks are reaching (achieving?) some point after which it is futile (unnecessary?) to continue, and so the app code says it's time to stop working on this one and send it back. If this happens before a single model has been completed, the validator code would need a new set of instructions for this. As would the credit granting code. A script is run (once a day, I think) that grants credit to tasks which ended with a validate error. I wonder if the 5 decoys/1201 CPU seconds in the stderr out is the clue left by the app that the work done by this host on this task is actually fine and should receive credit, and that for credit purposes the server should assume 5 models completed. In which case it's less a matter of the validator being tripped up than a workaround that's confusing us uninformed crunchers.

This doesn't answer why some tasks validate successfully after ending prior to the runtime preference or the 100 model limit being met. There must be some other limiting factor, but I haven't spotted any pattern among the successfully validated tasks. Or, more likely, it's the same limiting factor, but as long as it occurs in a second or later model rather than the first, the validator doesn't need a special set of instructions for dealing with it.

Ending speculation and entering a plea for an admin or Mod.Sense to let me know if any of this is even remotely close to reality.
Best, Snags

Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0	Message 69658 - Posted: 16 Feb 2011, 18:43:22 UTC - in response to Message 69642.
I was testing 6.10.58 on my laptop and it was generating a new host-CPID on almost every reboot, especially if the IP changed. I posted about this problem here. I was not the only one with this problem, as you can see here. So this version is messing up stats pages, and that's why I'm back to 6.10.18.

Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0	Message 69659 - Posted: 16 Feb 2011, 18:51:10 UTC - in response to Message 69656.
Quoting Message 69656: "I am wondering if there isn't another "end computation now" rule (...)"
From my observations, the rule for all T????_boinc_#_templates* tasks is: max. number of decoys = #, but sometimes they can also end before that, like your "10" ended with 9.
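Link's naming rule can be sketched in a few lines of Python. This is only an illustration of the pattern as observed in this thread, not anything confirmed by the project; the function name decoy_cap and the regular expression are made up for the example.

```python
import re
from typing import Optional

# Pattern observed in this thread: T????_boinc_#_templates...
# The number after "boinc_" is assumed (per Link's observation, not a
# documented rule) to be the maximum number of decoys for the task.
TASK_NAME = re.compile(r"^T\w{4}_boinc_(\d+)_templates")


def decoy_cap(task_name: str) -> Optional[int]:
    """Return the assumed decoy cap encoded in the task name, or None."""
    match = TASK_NAME.match(task_name)
    return int(match.group(1)) if match else None
```

For the two tasks Snags listed, this gives a cap of 5 for the T0579 task and 10 for the T0523 task, which matches "5 decoys" for the first and "ended with 9" (one short of the cap) for the second.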

Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0	Message 69661 - Posted: 16 Feb 2011, 19:42:27 UTC - in response to Message 69659.
Ah, thanks Link, I really should have caught that. For my examples, then, my guess is that while working on the 10th decoy the app encountered this new limiting factor, ended the crunching and reported back the 9 completed models. In the other instance it completed the 5 models as assigned and without incident.
And of course now I wonder about the model limit. Does it have anything to do with what they are trying to find out by running these tasks, or is it a coping mechanism for tasks that require a lot of memory and/or produce large output files?
Best, Snags
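To make the reasoning above concrete, here is a rough Python sketch that checks which of the limits mentioned in this thread would explain a task ending. It is a guess at the logic based only on what posters have described (the cap in the task name, the 100 model limit, the run time preference); the function name end_reason and the returned labels are hypothetical, and the real app may well decide differently.

```python
from typing import Optional

MODEL_LIMIT = 100  # the "100 model limit" mentioned in this thread


def end_reason(decoys: int, cpu_seconds: float,
               runtime_pref_seconds: float,
               name_cap: Optional[int] = None) -> str:
    """Name the first known limit that explains why a task stopped."""
    # Cap taken from the task name (Link's observation), if known.
    if name_cap is not None and decoys >= name_cap:
        return "name cap reached"
    if decoys >= MODEL_LIMIT:
        return "100-model limit reached"
    if cpu_seconds >= runtime_pref_seconds:
        return "run-time preference reached"
    # None of the known limits applies: the open question in this thread.
    return "unexplained early end"
```

With a 43200 second preference, the T0579 task (5 decoys, 4248 seconds, cap 5) comes out as "name cap reached", while the T0523 task (9 decoys, 3347 seconds, cap 10) comes out as "unexplained early end", which is the case still looking for an explanation.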

Greg_BE Joined: 30 May 06 Posts: 5770 Credit: 6,139,760 RAC: 0	Message 69668 - Posted: 17 Feb 2011, 23:18:20 UTC - in response to Message 69661.
Wasn't that earlier, when tasks were generating over 100 decoys or something along that line? I thought they put limiter code in to shut the task down at 100 vs 1000 or whatever.

Jesse Viviano Joined: 14 Jan 10 Posts: 42 Credit: 2,700,472 RAC: 0	Message 69786 - Posted: 10 Mar 2011, 21:00:18 UTC
Work unit 370728909 is another work unit that failed due to validate errors. I don't have a problem with a watchdog shutting down a work unit due to having found too many decoys, but I do have a problem when the validator cannot validate such results and therefore my results get wasted. I don't care about credits, but I do care about wasted science.