Message boards : Number crunching : Lots of jobs in error
Author | Message |
---|---|
Cureseekers~Kristof Joined: 5 Nov 05 Posts: 80 Credit: 689,603 RAC: 0 |
Hello, One of our members of DPC has got some jobs in error. (6 jobs in error, out of 15) See https://boinc.bakerlab.org/rosetta/results.php?hostid=373642 This is an example of a job in error: See https://boinc.bakerlab.org/rosetta/result.php?resultid=51874373 Can someone explain how this error can happen? Member of Dutch Power Cows |
Mats Petersson Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
The error code shown in your example of job in error is a 0xC0000005, which is the code for "Access Violation", which essentially means that the process in question was trying to access memory that it wasn't supposed to access. From the call-stack, it seems like you're in an NVidia graphics driver that has gone into some sort of recursive call - but that could just be me misunderstanding the crash-dump... Or that the crash dump isn't very clever with certain types of stack-patterns. -- Mats |
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
The -107 error code, and the fact that you have lots of them sounds to me like the screensaver problems. I see another WU that was ended by the watchdog. This is another sign to me of screensaver problems. The suggestion is to set Windows screensaver to none. There is a new version under test presently on Ralph which should be available here on Rosetta in just a few days, which seems to have resolved the screensaver problems. So, hang in there! Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Don Joined: 28 Oct 06 Posts: 2 Credit: 294,270 RAC: 0 |
I run R@H on four machines. I get lots of errors on two of them that lock up the computer. Sometimes they are cleared by Ctrl-Alt-Del and halting the Rosetta task; other times it requires a reboot. Only the two machines with hyperthreading get errors; the others (an Athlon 64 and a PIII) have never had an error. All four machines also run other clients (Seti and Einstein) with no problems. I have tried keeping the job in memory and will see if that works; next I will try disabling the screensaver, but these do not explain why the other CPUs are not affected. Maybe it is the size of the job. |
FluffyChicken Joined: 1 Nov 05 Posts: 1260 Credit: 369,635 RAC: 0 |
Don wrote: "I run R@H on four machines. I get lots of errors on two of them that lock up the computer. Sometimes they are cleared by Ctrl-Alt-Del and halting the Rosetta task; other times it requires a reboot. Only the two machines with hyperthreading get errors; the others (an Athlon 64 and a PIII) have never had an error. All four machines also run other clients (Seti and Einstein) with no problems." It is believed to be a synchronisation error, and it happens (or seems to happen) more often on a computer running more than one BOINC project at the same time (Hyperthreading technology or multicore processors do this). Hence the PIII and A64 generally would not see this. Team mauisun.org |
Mats Petersson Joined: 29 Sep 05 Posts: 225 Credit: 951,788 RAC: 0 |
Don wrote: "I run R@H on four machines. I get lots of errors on two of them that lock up the computer. Sometimes they are cleared by Ctrl-Alt-Del and halting the Rosetta task; other times it requires a reboot. Only the two machines with hyperthreading get errors; the others (an Athlon 64 and a PIII) have never had an error. All four machines also run other clients (Seti and Einstein) with no problems." Synchronisation issues are far more likely to be a problem on systems that have multiple execution units running different threads at the same time (so SMP and HT/multicore systems), as those can get things into an "unsynched" state much more easily by accessing the data in parallel (with the data being in an inconsistent state because one thread is half-way through some update while the other one reads the "half-baked" data). It's of course possible for this to happen on a single-processor system as well, but the likelihood of actually hitting the failure point is lower. -- Mats |
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
The problems are ALSO actually more likely on a computer that you are not actively using. If you were using it to do something, then one processor thread would often be working on what YOU are doing rather than what Rosetta is doing. Thus the two Rosetta threads are less likely to run at the same time or to be preempted at a key point in time. Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
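
For readers unfamiliar with the "half-baked data" problem described in the replies above, here is a minimal sketch in Python. It is not Rosetta or BOINC code; the `Pair` class and every function name are hypothetical and exist only for illustration. The first demo uses a generator to stand in for a thread that is preempted half-way through an update, so the inconsistent read is reproducible rather than a matter of luck. The second demo shows the usual remedy, a lock held around both the update and the read.

```python
import threading


class Pair:
    """Hypothetical shared record: it is only consistent when a == b."""

    def __init__(self):
        self.a = 0
        self.b = 0
        self.lock = threading.Lock()


def unsynchronised_update(pair, value):
    """Update both fields, pausing in between.

    The yield stands in for the writer thread being preempted
    (or another core running) at exactly the wrong moment.
    """
    pair.a = value
    yield  # another thread may run here
    pair.b = value


def torn_read_demo():
    """Return what a reader sees mid-update, and the final state."""
    pair = Pair()
    step = unsynchronised_update(pair, 7)
    next(step)                    # writer has set a, but not yet b
    snapshot = (pair.a, pair.b)   # reader sees the half-updated data
    next(step, None)              # writer finishes the update
    return snapshot, (pair.a, pair.b)


def locked_update(pair, value):
    """Update both fields while holding the lock."""
    with pair.lock:
        pair.a = value
        pair.b = value


def locked_read(pair):
    """Read both fields while holding the lock."""
    with pair.lock:
        return pair.a, pair.b


def locked_demo(rounds=2000):
    """Run a real writer and reader thread; count inconsistent reads."""
    pair = Pair()
    inconsistent = 0

    def writer():
        for i in range(1, rounds + 1):
            locked_update(pair, i)

    def reader():
        nonlocal inconsistent
        for _ in range(rounds):
            a, b = locked_read(pair)
            if a != b:
                inconsistent += 1

    threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return inconsistent


if __name__ == "__main__":
    print(torn_read_demo())
    print(locked_demo())
```

With the lock in place the reader can never observe the state between the two assignments, no matter how the threads are scheduled. Without it, whether the bad interleaving actually occurs depends on timing, which fits the observation in the thread that machines running two tasks at once are the ones reporting errors.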
©2025 University of Washington
https://www.bakerlab.org