Message boards : Number crunching : Lots of Validate Errors???
Author | Message |
---|---|
Ace Casino Joined: 16 Jul 07 Posts: 18 Credit: 14,062,294 RAC: 11,221 |
Why? My partner on the WUs errored also. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Which machine? Which work units? Rosetta Moderator: Mod.Sense |
Ace Casino Joined: 16 Jul 07 Posts: 18 Credit: 14,062,294 RAC: 11,221 |
On my computer called Rockyquad4 there are about 40 validate errors. There are also a few validate errors on my other machines. |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
WU ID=365224239: both tasks ended in "validate error". WU ID=363473507: just my task; my wingman's was OK. The explanation given for validate errors is "The task was reported but could not be validated, typically because the output files were lost on the server". So is it just the server's fault? |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
Ditto here overnight. Only a few hundred seconds of run time when set up for 8 hours: T0545_boinc_1_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22905_72_1 T0602_boinc_5_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22906_269_1 |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,178,442 RAC: 3,202 |
Quote: "WU ID=365224239: both tasks ended in 'validate error'" @Link, yours COULD be because you are still using an older version of BOINC, 6.10.18. I am NOT saying it is, but the current version is 6.10.58 and that is a lot of steps in between. Sid Celery IS using the 6.10.58 version of BOINC and is having similar errors, although both he AND his wingman got errors, while your wingman finished his just fine. |
Murasaki Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
Quote: "WU ID=365224239: both tasks ended in 'validate error'" It is interesting that the task that failed ran about 9 times longer than the successful task and produced four times as many decoys. There could have been a problem with both tasks, but the shorter run time allowed one task to finish before encountering the anomaly. I have also had validate errors for these types of tasks on 12th February: T0602_boinc_5_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22906_816_1 T0635_boinc_1_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22905_90_0 The tasks failed on BOINC versions 6.4.5, 6.10.17 and 6.10.58. I would say it is safe to assume it is just a bad batch of work units and nothing to do with the client computers. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 9,368 |
Quote: "Ditto here overnight. Only a few 100 seconds running when set up for 8 hours:" Three more for me, but only on my Vista AMD desktop, not my W7 Intel laptop: T0634_boinc_5_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22911_986_0 T0619_boinc_5_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22911_959_1 T0560_boinc_10_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22912_406_1 Also note: mine was the second attempt at the first 2 and the last 2. On the 3rd of the 5, my run was the first attempt. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I just checked my tasks and found a bunch as well in the T0 series, also with double validate errors: T0611_boinc_1_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22905_1 T0567_boinc_5_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22906_33 T0632_boinc_1_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22905_34 T0522_boinc_1_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22905_49 T0580_boinc_5_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22911_50 T0530_boinc_5_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22911_530 T0580_boinc_5_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22911_595 *Note: Wingman has this in his queue still T0632_boinc_5_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22911_661 The above tasks were granted claimed credit, but this is still a pretty huge hiccup for the validator. |
Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
**WARNING: Scattered observations and wild speculation contained herein** I am wondering if there isn't another "end computation now" rule (in addition to the preferred CPU time limit and the 100 model limit) that is tripping up the validator. This thought has occurred to me before, when tasks have ended well within those parameters but without obvious error. Currently I have crunched two tasks that fit this description: T0579_boinc_5_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22906_24_0 T0523_boinc_10_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22912_178_0 Both of these validated successfully, but one ended after completing 5 decoys in 4248 seconds, the other after completing 9 decoys in 3347 seconds. My CPU run time preference is 43200 seconds. Looking through the similarly named tasks reported in this thread, I notice many recorded the exact same details in their stderr output: completion of 5 decoys in 1201 seconds. The CPU time recorded elsewhere on the task details page varies considerably; I've seen 507, 693, 843 and 1109 seconds. Lots of similarly named tasks have completed and validated successfully, with varying amounts of CPU time and numbers of models completed, on the same hosts that are receiving validate errors. I speculate that these tasks are reaching (achieving?) some point after which it is futile (unnecessary?) to continue, and so the app code says it's time to stop working on this one and send it back. If this happens before a single model has been completed, the validator code would need a new set of instructions for this, as would the credit granting code. A script is run (once a day, I think) that grants credit to tasks which ended with a validate error. I wonder if the 5 decoys/1201 CPU seconds in the stderr output is the clue left by the app that the work done by this host on this task is actually fine and should receive credit, and that for credit purposes the server should assume 5 models completed.
In which case it's less a matter of the validator being tripped up than a workaround that's confusing us uninformed crunchers. This doesn't answer why some tasks validate successfully after ending before the run time preference or the 100 model limit is met. There must be some other limiting factor, but I haven't spotted any pattern among the successfully validated tasks. Or, more likely, it's the same limiting factor, but as long as it occurs in a second or later model rather than the first, the validator doesn't need a special set of instructions for dealing with it. **Ending speculation and entering a plea for an admin or Mod.Sense to let me know if any of this is even remotely close to reality.** Best, Snags |
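To make the reasoning in the post above concrete, here is a minimal Python sketch of the two stop rules it names (preferred CPU time and the 100 model limit). The function `stop_reason` and its parameter names are hypothetical and only illustrate the argument; this is not the Rosetta application's actual logic, which is not shown anywhere in this thread.

```python
def stop_reason(cpu_seconds, models_done, preferred_cpu_seconds, model_limit=100):
    """Return which of the two known rules would end a task, or None.

    Hypothetical simplification: only the two rules named in the post are
    modelled (the model limit and the preferred CPU run time).
    """
    if models_done >= model_limit:
        return "model limit"
    if cpu_seconds >= preferred_cpu_seconds:
        return "cpu time"
    return None


# The two tasks described in the post, with a 43200 second run time preference.
for cpu_seconds, decoys in [(4248, 5), (3347, 9)]:
    print(cpu_seconds, decoys, stop_reason(cpu_seconds, decoys, 43200))
```

Under these assumptions neither known rule applies to either task, which is the gap the post is pointing at: something else must have ended them.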
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
Quote: "@Link, yours COULD be because you are still using an older version of BOINC, 6.10.18. I am NOT saying it is, but the current version is 6.10.58 and that is a lot of steps in between." I was testing 6.10.58 on my laptop and it was generating a new host CPID on almost every reboot, especially if the IP changed. I posted about this problem here. I was not the only one with this problem, as you can see here. So this version is messing up the stats pages, and that's why I'm back on 6.10.18. |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
Quote: "**WARNING: Scattered observations and wild speculation contained herein**" From what I have observed, the rule for all T????_boinc_#_templates* tasks is: max. number of decoys = #, but sometimes they can also end before that, like your "10" task, which ended with 9. |
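As a rough illustration of the naming rule observed in the post above, the number after `boinc_` can be pulled out of a task name with a regular expression. The pattern and the helper `max_decoys_from_name` are assumptions built only from the task names quoted in this thread, not a documented naming scheme.

```python
import re

# Assumed pattern, based on the names quoted in this thread:
# T????_boinc_#_templates*, where # is the observed maximum number of decoys.
TASK_NAME = re.compile(r"^(?P<target>T\d{4})_boinc_(?P<max_decoys>\d+)_templates")


def max_decoys_from_name(task_name):
    """Return the number after 'boinc_' as an int, or None if the name does not match."""
    match = TASK_NAME.match(task_name)
    return int(match.group("max_decoys")) if match else None


names = [
    "T0579_boinc_5_templates_loopbuild_threading_cst_relax_wt10_tex_IGNORE_THE_REST_22906_24_0",
    "T0523_boinc_10_templates_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_22912_178_0",
]
for name in names:
    print(name[:5], max_decoys_from_name(name))
```

A task can still finish with fewer decoys than this number, as with the "10" task that ended with 9, so the value is best read as an upper bound.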
Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
Quote: "**WARNING: Scattered observations and wild speculation contained herein**" Ah, thanks Link, I really should have caught that. For my examples, then, my guess is that while working on the 10th decoy the app encountered this new limiting factor, ended the crunching and reported back the 9 completed models. In the other instance it completed the 5 models as assigned and without incident. And of course now I wonder about the model limit. Does it have anything to do with what they are trying to find out by running these tasks, or is it a coping mechanism for tasks that require a lot of memory and/or produce large output files? Best, Snags |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Quote: "**WARNING: Scattered observations and wild speculation contained herein**" Wasn't that earlier, when tasks were generating over 100 decoys or something along those lines? I thought they put limiter code in to shut the task down at 100 vs 1000 or whatever. |
Jesse Viviano Joined: 14 Jan 10 Posts: 42 Credit: 2,700,472 RAC: 0 |
Work unit 370728909 is another work unit that failed due to validate errors. I don't have a problem with a watchdog shutting down a work unit due to having found too many decoys, but I do have a problem when the validator cannot validate such results and therefore my results get wasted. I don't care about credits, but I do care about wasted science. |