lots of validation errors

Message boards : Number crunching : lots of validation errors

trick@planet3dnow
Joined: 21 Feb 09
Posts: 8
Credit: 53,370
RAC: 0
Message 60058 - Posted: 10 Mar 2009, 13:01:28 UTC
Last modified: 10 Mar 2009, 13:04:03 UTC

hello everyone!

my pc produces many validation errors and a few client errors:
https://boinc.bakerlab.org/rosetta/results.php?userid=302635
it's not overclocked, i tested the ram (2 kits of kingston valueram kvr800d2e5k2/4g, 4 modules of 2 gb each --> 8gb ddr2-800 ecc ram) with memtest without errors (default test run 5 cycles/runs you know what i mean). cpu is phenom2 x4 940 on asus m3a78-t running 64 bit gentoo linux. system is up to date. i don't use boinc from portage but downloaded it manually from berkeley's server (version 6.4.5).

hw error, system issue or a rosetta bug?

thanks for help in advance!

Sid Celery
Joined: 11 Feb 08
Posts: 2130
Credit: 41,424,155
RAC: 14,205
Message 60059 - Posted: 10 Mar 2009, 13:42:33 UTC - in response to Message 60058.  

My pc produces many validation errors and a few client errors:
https://boinc.bakerlab.org/rosetta/results.php?userid=302635
It's not overclocked, I tested the ram (2 kits of kingston valueram kvr800d2e5k2/4g, 4 modules of 2 gb each --> 8gb ddr2-800 ecc ram) with memtest without errors (default test run 5 cycles/runs you know what I mean). Cpu is Phenom2 x4 940 on Asus m3a78-t running 64 bit gentoo linux. System is up to date. I don't use boinc from portage but downloaded it manually from berkeley's server (version 6.4.5).

hw error, system issue or a rosetta bug?

thanks for help in advance!

All those WUs came up with the error "hbond tripped".

This has come up before a few times for some people. No solution that I know of.

trick@planet3dnow
Joined: 21 Feb 09
Posts: 8
Credit: 53,370
RAC: 0
Message 60060 - Posted: 10 Mar 2009, 14:13:15 UTC

thanks for your reply!
15% of all work units run longer than they should and then go up in flames, so
another pc has to "recrunch" them. not really a nice thing. :-/

LizzieBarry
Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 60062 - Posted: 10 Mar 2009, 16:32:28 UTC
Last modified: 10 Mar 2009, 16:41:28 UTC

I reported similar errors myself here.

A lot of your longer-running tasks seem to go wrong. Do you have a long default runtime setting or is it a clue that something has already gone wrong?

Try reducing the run-time by an hour and see if you get any better success. At worst, a failing task will end an hour earlier and you won't waste so much time on a 'bad' task.

Just an idea...

trick@planet3dnow
Joined: 21 Feb 09
Posts: 8
Credit: 53,370
RAC: 0
Message 60063 - Posted: 10 Mar 2009, 16:53:25 UTC

runtime is set to 8 hours. with 3 hours it was the same problem

trick@planet3dnow
Joined: 21 Feb 09
Posts: 8
Credit: 53,370
RAC: 0
Message 60136 - Posted: 13 Mar 2009, 15:47:38 UTC
Last modified: 13 Mar 2009, 15:48:15 UTC

i switched runtime back to default. surprisingly i get lower error rates than before! O_o rac looks significantly higher than in the beginning when i also used default runtime settings.
so for now i'm quite satisfied. :)

trick@planet3dnow
Joined: 21 Feb 09
Posts: 8
Credit: 53,370
RAC: 0
Message 60158 - Posted: 14 Mar 2009, 12:44:29 UTC

argh
again 4 work units with validation errors. :(

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60161 - Posted: 14 Mar 2009, 18:05:06 UTC

Looks like at least the two I looked at were "hbond tripped" errors (see last line of output). Please report them in the problems with 1.54 thread. It is helpful if you can post links to the specific tasks you are reporting problems with.
Rosetta Moderator: Mod.Sense

LizzieBarry
Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 60167 - Posted: 15 Mar 2009, 22:46:12 UTC - in response to Message 60161.  

Looks like at least the two I looked at were "hbond tripped" errors (see last line of output). Please report them in the problems with 1.54 thread. It is helpful if you can post links to the specific tasks you are reporting problems with.

I was about to say that the issue with these errors isn't so much the "hbond tripped" issue (of course it is), but that the WU runs for 7 hours before crashing out (default + 4 hours, but watchdog not mentioned in stderr.out).

I was about to say it, but just noticed that 2 more WUs errored out quickly for trick@planet3dnow, reporting Client Errors. At least they didn't waste processing time on another good WU.

trick@planet3dnow
Joined: 21 Feb 09
Posts: 8
Credit: 53,370
RAC: 0
Message 60246 - Posted: 20 Mar 2009, 21:41:36 UTC

@ mod.sense:
i just had a look at my last 164 work units and i saw 49 validation errors and 11 client errors. does it make sense to link every failed work unit?

atm i'm thinking about stopping rah.
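
For scale, the counts quoted above can be turned into rates. This is only a rough sketch of the arithmetic in Python; the function name `failure_summary` is made up for illustration, and the three numbers are simply the ones given in the post.

```python
# Counts quoted in the post above (last 164 work units).
total_tasks = 164
validate_errors = 49
client_errors = 11


def failure_summary(total, validate, client):
    """Return the share of tasks lost to each error type, in percent."""
    failed = validate + client
    return {
        "validate_pct": 100.0 * validate / total,
        "client_pct": 100.0 * client / total,
        "failed_pct": 100.0 * failed / total,
        "ok_tasks": total - failed,
    }


summary = failure_summary(total_tasks, validate_errors, client_errors)
for key, value in summary.items():
    # Percentages are floats; the count of good tasks is an int.
    if isinstance(value, float):
        print(f"{key}: {value:.1f}")
    else:
        print(f"{key}: {value}")
```

On those figures more than a third of the tasks fail one way or another, which is well above the 15% mentioned earlier in the thread.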

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60247 - Posted: 20 Mar 2009, 23:00:30 UTC

No, who has the time to do that? And then the time to study each one? ...but if you could post the link to the host in the "problems with..." thread that would be helpful.
Rosetta Moderator: Mod.Sense

trick@planet3dnow
Joined: 21 Feb 09
Posts: 8
Credit: 53,370
RAC: 0
Message 60273 - Posted: 22 Mar 2009, 17:23:19 UTC

i have the theory that stopping boinc and starting it again helps. it did several hours ago and the last 20 work units did not have any error. before that boinc ran for several days without a break.
will keep an eye on that...

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 60286 - Posted: 23 Mar 2009, 17:20:25 UTC

I suggested a change and it has been implemented so that ONLY the failed tasks are shown on the task page. See the SaH site. Now, if we can get RaH to implement that server side change it would help ...

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,208,737
RAC: 2,882
Message 60296 - Posted: 24 Mar 2009, 9:52:49 UTC - in response to Message 60273.  
Last modified: 24 Mar 2009, 9:53:13 UTC

i have the theory that stopping boinc and starting it again helps. it did several hours ago and the last 20 work units did not have any error. before that boinc ran for several days without a break.
will keep an eye on that...


I do that too and it works for me too! When I have a machine that goes wacky (I have 16 crunching right now), I shut it down for a day, then bring it back up, and it works just fine again. I run mostly Windows machines, so I wonder if it is a Windows thing. I know that Windows has 'memory leaks', but usually just a reboot will fix that. The problem I am having is not fixed by a quick reboot but by a relatively long shutdown. I guess that could mean hardware issues, but the hardware is soooo varied across my ranch that it would be impossible to track down.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 60299 - Posted: 24 Mar 2009, 13:07:12 UTC

With 16 machines, you could easily have heat problems too. Clogged vents or failing fans can cause some pretty odd symptoms. And being off for a few hours would cool a machine back down. You might try limiting the number of CPUs or cutting down on the % of CPU time. That will let the machine run cooler, and you can see if that helps.
Rosetta Moderator: Mod.Sense
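
One way to try the CPU limits suggested above is through the BOINC client's local preferences override file. The following is a hedged sketch, not something from the thread: it assumes a client version that reads `global_prefs_override.xml` with the `max_ncpus_pct` and `cpu_usage_limit` elements, the helper name `build_override` is hypothetical, and the values 50 and 80 are arbitrary starting points. The same settings can normally be changed in the web or manager preferences instead.

```python
import xml.etree.ElementTree as ET


def build_override(max_ncpus_pct, cpu_usage_limit):
    """Build the text of a preferences override file.

    Element names are assumed to match the BOINC client's
    global_prefs_override.xml; check the docs for your client version.
    """
    root = ET.Element("global_preferences")
    # Share of the available processors BOINC may use, in percent.
    ET.SubElement(root, "max_ncpus_pct").text = f"{max_ncpus_pct:.1f}"
    # Share of CPU time BOINC may use on those processors, in percent.
    ET.SubElement(root, "cpu_usage_limit").text = f"{cpu_usage_limit:.1f}"
    return ET.tostring(root, encoding="unicode")


# Arbitrary example values: half the cores, throttled to 80% CPU time.
override_xml = build_override(max_ncpus_pct=50.0, cpu_usage_limit=80.0)
print(override_xml)
```

The printed text would go into the override file in the BOINC data directory, and the client has to re-read its preferences before the change takes effect. If the error rate drops with the limits in place, heat becomes a more likely suspect.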

mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,208,737
RAC: 2,882
Message 60311 - Posted: 25 Mar 2009, 9:30:32 UTC - in response to Message 60299.  

With 16 machines, you could easily have heat problems too. Clogged vents or failing fans can cause some pretty odd symptoms. And being off for a few hours would cool a machine back down. You might try limiting the number of CPUs or cutting down on the % of CPU time. That will let the machine run cooler, and you can see if that helps.


Oh no, the machines are spread out all over my home. Most are in my basement on a set of metal shelves. Very few have monitors, or even keyboards or mice attached. I use a remote access program, and a KVM in one case, to access each one from a single location. Air flow and heat are not the problem, although all those PCs do add to my electric bill each month!! I think a lot of the problem is that a lot of the machines are not new. They are hand-me-downs I got in payment for doing computer work for friends. When I do work they either give me money or their old stuff. Usually the stuff! I have a basement full of STUFF! I donated 10 PCs just last year to the Foster Kids in the County where I work!

Sid Celery
Joined: 11 Feb 08
Posts: 2130
Credit: 41,424,155
RAC: 14,205
Message 60515 - Posted: 6 Apr 2009, 12:57:31 UTC

Another validation error with this WU, but no 'hbond tripped' message - no errors reported at all.

stef__BOINC_ABRELAX_ONE_CRYSTALLIN_SAVE_ALL_OUT_IGNORE_THE_REST-S25-9-S3-3--stef_-_8935_94269_2

I notice it shows a claimed and a granted credit in the above link, but in my task list it shows neither a claimed nor a granted figure.

Just a blip in the validation process? Hopefully.

Sid Celery
Joined: 11 Feb 08
Posts: 2130
Credit: 41,424,155
RAC: 14,205
Message 60550 - Posted: 8 Apr 2009, 11:39:54 UTC

And another:

1V33A_BOINC_MPZN_vanilla_loop_modeling_9559_622_2

Murasaki
Joined: 20 Apr 06
Posts: 303
Credit: 511,418
RAC: 0
Message 60575 - Posted: 9 Apr 2009, 14:45:04 UTC - in response to Message 60550.  

And another:

1V33A_BOINC_MPZN_vanilla_loop_modeling_9559_622_2


You should be reporting validation errors on version 1.54 tasks in the Minirosetta v1.54 bug report thread.

The project team have set up central threads for error reporting to avoid wasting time hunting through several different threads. It is less likely that your error message will be read if you leave it here. ^_~

Sid Celery
Joined: 11 Feb 08
Posts: 2130
Credit: 41,424,155
RAC: 14,205
Message 60611 - Posted: 12 Apr 2009, 21:25:45 UTC - in response to Message 60575.  

And another:

1V33A_BOINC_MPZN_vanilla_loop_modeling_9559_622_2


You should be reporting validation errors on version 1.54 tasks in the Minirosetta v1.54 bug report thread.

The project team have set up central threads for error reporting to avoid wasting time hunting through several different threads. It is less likely that your error message will be read if you leave it here. ^_~

Ok. I wasn't sure if this was a problem with the WU (no errors reported at all) or just the validation process.
