Message boards : Number crunching : lots of validation errors
Author | Message |
---|---|
trick@planet3dnow Joined: 21 Feb 09 Posts: 8 Credit: 53,370 RAC: 0 |
hello everyone! my pc produces many validation errors and a few client errors: https://boinc.bakerlab.org/rosetta/results.php?userid=302635 it's not overclocked, i tested the ram (2 kits of kingston valueram kvr800d2e5k2/4g, 4 modules of 2 gb each --> 8gb ddr2-800 ecc ram) with memtest without errors (default test run, 5 cycles/runs, you know what i mean). cpu is phenom2 x4 940 on asus m3a78-t running 64 bit gentoo linux. system is up to date. i don't use boinc from portage but downloaded it manually from berkeley's server (version 6.4.5). hw error, system issue or a rosetta bug? thanks for help in advance! |
Sid Celery Joined: 11 Feb 08 Posts: 2130 Credit: 41,424,155 RAC: 14,205 |
"My pc produces many validation errors and a few client errors:" All those WUs came up with the error "hbond tripped". This has come up before a few times for some people. No solution that I know of. |
trick@planet3dnow Joined: 21 Feb 09 Posts: 8 Credit: 53,370 RAC: 0 |
thanks for your reply! 15% of all work units run longer than they should and burst up in flames so another pc has to "recrunch" them. not really a nice thing. :-/ |
LizzieBarry Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0 |
I reported similar errors myself here. A lot of your longer-running tasks seem to go wrong. Do you have a long default runtime setting or is it a clue that something has already gone wrong? Try reducing the run-time by an hour and see if you get any better success. At worst, a failing task will end an hour earlier and you won't waste so much time on a 'bad' task. Just an idea... |
trick@planet3dnow Joined: 21 Feb 09 Posts: 8 Credit: 53,370 RAC: 0 |
runtime is set to 8 hours. with 3 hours it was the same problem |
trick@planet3dnow Joined: 21 Feb 09 Posts: 8 Credit: 53,370 RAC: 0 |
i switched runtime back to default. surprisingly i get lower error rates than before! O_o rac looks significantly higher than in the beginning when i also used default runtime settings. so for now i'm quite satisfied. :) |
trick@planet3dnow Joined: 21 Feb 09 Posts: 8 Credit: 53,370 RAC: 0 |
argh again 4 work units with validation errors. :( |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Looks like at least the two I looked at were "hbond tripped" errors (see last line of output). Please report them in the problems with 1.54 thread. It is helpful if you can post links to the specific tasks you are reporting problems with. Rosetta Moderator: Mod.Sense |
LizzieBarry Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0 |
"Looks like at least the two I looked at were "hbond tripped" errors (see last line of output). Please report them in the problems with 1.54 thread. It is helpful if you can post links to the specific tasks you are reporting problems with." I was about to say that the issue with these errors isn't so much the "hbond tripped" issue (of course it is), but that the WU runs for 7 hours before crashing out (default + 4 hours, but watchdog not mentioned in stderr.out). I was about to say it, but just noticed that 2 more WUs errored out quickly for trick@planet3dnow, reporting Client Errors. At least they didn't waste processing time on another good WU. |
trick@planet3dnow Joined: 21 Feb 09 Posts: 8 Credit: 53,370 RAC: 0 |
@ mod.sense: i just had a look at my last 164 work units and i saw 49 validate errors and 11 client errors. does it make sense to link every failed work unit? atm i'm thinking about stopping rah. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
No, who has the time to do that? And then the time to study each one? ...but if you could post the link to the host in the "problems with..." thread that would be helpful. Rosetta Moderator: Mod.Sense |
trick@planet3dnow Joined: 21 Feb 09 Posts: 8 Credit: 53,370 RAC: 0 |
i have the theory that stopping boinc and starting it again helps. it did several hours ago and the last 20 work units did not have any error. before that boinc ran for several days without break. will keep an eye on that... |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
I suggested a change and it has been implemented so that ONLY the failed tasks are shown on the task page. See the SaH site. Now, if we can get RaH to implement that server side change it would help ... |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,208,737 RAC: 2,882 |
"i have the theory that stopping boinc and starting it again helps. it did several hours ago and the last 20 work units did not have any error. before that boinc ran for several days without break." I do that too and it works for me too! When I have a machine that goes wacky (I have 16 crunching right now), I shut it down for a day and then bring it back up and it works just fine again. I run mostly Windows machines so I wonder if it is a Windows thing. I know that Windows has 'memory leaks' but usually just a reboot will fix that. The problem I am having is not fixed by a quick reboot but instead by a long, relatively speaking, shutdown. I guess that could mean hardware issues but the hardware is soooo varied across my ranch that it would be impossible to track down. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
With 16 machines, you could easily have heat problems too. Clogged vents or failing fans can cause some pretty odd symptoms. And being off for a few hours would cool it back down. You might try limiting the number of CPUs or cutting down on the % of CPU. That will let the machine run cooler, and you can see if that helps. Rosetta Moderator: Mod.Sense |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,208,737 RAC: 2,882 |
"With 16 machines, you could easily have heat problems too. Clogged vents or failing fans can cause some pretty odd symptoms. And being off for a few hours would cool it back down. You might try limiting the number of CPUs or cutting down on the % of CPU. That will let the machine run cooler, and you can see if that helps." Oh no, the machines are spread out all over my home. Most are in my basement on a set of metal shelves. Very few have monitors, or even keyboards or mice attached. I use a remote access program, and a kvm in one case, to access each one from a single location. Air flow and heat are not the problem. Although all those pc's do add to my electric bill each month!! I think a lot of the problem is that a lot of the machines are not new. They are hand me downs I got in payment for doing computer work for friends. When I do work they either give me money or their old stuff. Usually the stuff! I have a basement full of STUFF! I donated 10 pc's just last year to the Foster Kids in the County where I work! |
Sid Celery Joined: 11 Feb 08 Posts: 2130 Credit: 41,424,155 RAC: 14,205 |
Another validation error with this WU, but no 'hbond tripped' message - no errors reported at all. stef__BOINC_ABRELAX_ONE_CRYSTALLIN_SAVE_ALL_OUT_IGNORE_THE_REST-S25-9-S3-3--stef_-_8935_94269_2 I notice it shows a claimed and granted credit in the above link, but in my task list it shows neither claim nor granted figure. Just a blip in the validation process? Hopefully. |
Sid Celery Joined: 11 Feb 08 Posts: 2130 Credit: 41,424,155 RAC: 14,205 |
And another: |
Murasaki Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
"And another:" You should be reporting validation errors on version 1.54 tasks in the Minirosetta v1.54 bug report thread. The project team have set up central threads for error reporting to avoid wasting time hunting through several different threads. It is less likely that your error message will be read if you leave it here. ^_~ |
Sid Celery Joined: 11 Feb 08 Posts: 2130 Credit: 41,424,155 RAC: 14,205 |
"And another:" Ok. I wasn't sure if this was a problem with the WU (no errors reported at all) or just the validation process. |
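
For anyone who wants to put a number on the failure counts trick@planet3dnow quotes in the thread (49 validate errors and 11 client errors in the last 164 work units), here is a minimal sketch of the arithmetic. The counts come from that post; the variable names are only illustrative and nothing here reads data from the project server.

```python
# Counts quoted in the thread for one host's most recent tasks.
# Variable names are illustrative, not part of any BOINC tool.
total_tasks = 164
validate_errors = 49
client_errors = 11

# A task counts as failed if it ended in either kind of error.
failed = validate_errors + client_errors
failure_rate = failed / total_tasks

print(f"failed tasks: {failed} of {total_tasks}")
print(f"validate error rate: {validate_errors / total_tasks:.1%}")
print(f"client error rate: {client_errors / total_tasks:.1%}")
print(f"overall failure rate: {failure_rate:.1%}")
```

On those figures more than a third of the recent tasks failed, noticeably worse than the 15% estimate given earlier in the thread, although the two numbers may cover different sets of tasks.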
©2024 University of Washington
https://www.bakerlab.org