Message boards : Number crunching : Why does this still happen.
Author | Message |
---|---|
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
I guess I'm not the only one this happens to. Why can't the project cancel tasks that haven't been started yet? Other projects do this, and it saves wasted time. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=181582962 ===================================================== DONE :: 1 starting structures 21135.6 cpu seconds This process generated 42 decoys from 42 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> Validate state: Workunit error - check skipped. Claimed credit: 148.221875622837. Granted credit: 0. Application version: 1.34. pete |
Feet1st Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Two words Peter: BOINC Bug http://boinc.berkeley.edu/trac/ticket/276 Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
My thanks to whoever fixed this one up. pete. |
FoldingSolutions Joined: 2 Apr 06 Posts: 129 Credit: 3,506,690 RAC: 0 |
Task ID: 202592794; Work unit ID: 185058479; Sent: 27 Oct 2008 20:17:33 UTC; Time reported or deadline: 29 Oct 2008 19:33:36 UTC; Server state: Over; Outcome: Client error; Client state: Compute error; CPU time (sec): 70,590.59; Claimed credit: 329.12; Granted credit: ---. Shouldn't there be some kind of credit compensation for 20 hours of wasted CPU time? |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"Task ID - 202592794" You should post this info in the 1.34 thread in case the team didn't see it. Be sure to tell them you had an exit code 255, as that will help them narrow down the issue. |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi. Looks like this is not fixed yet; I wasted 6 hrs on it. Why are tasks getting sent out when others are still not past their deadlines? I could have been doing something else. Workunit error - check skipped Over_Success_Done_21,377.59_154.48_0.00 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=214522002 pete. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
The timestamps are a bit misleading. The deadline is always 10 days. If you look at it again, the 10 day deadline was indeed crossed and this caused the task to be reissued. Then, after that, a result came in. Rosetta Moderator: Mod.Sense |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Hi there Mod Sense. Well, that doesn't make me feel all warm & fuzzy. If they can't return the work on time, then I see that as a waste of time. I'm just going to have to abort all tasks that were sent out because the earlier copy is overdue, then; I don't like wasting the time. More work for me, but so be it. I'm guessing my result for that one won't be used at all, even though it might be a better answer! pete. |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,208,737 RAC: 2,882 |
"Hi there Mod Sense." It just means you need to lower the cache for this Project. If you have special reasons why you can't do that, then as you suggested this may not be the Project for you. A 10 day cache is pretty long and, unless you have a very slow PC, would result in a ton of workunits. My computer is taking about 2 to 2 1/2 hours per workunit, roughly. That is, say, 9 units per day times 10 days, which is 90 workunits! Just for this Project alone. I just looked at your 2 PCs and both seem to have a very short cache already. One PC has one workunit and the other has 2 workunits that haven't been returned yet. I wonder if BOINC is having problems? It should be able to tell that a unit is near its deadline and switch to high priority crunching for that unit, so that it gets returned on time. Do you crunch 24/7? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
mikey, it wasn't peter that was late with the results, so a smaller cache doesn't help what he's talking about. You could also increase your Rosetta preference for runtime and have a cleaner task list if you are crunching all the time anyway. The default runtime is 3 hours, but you can set it as high as 24hrs. If you make changes to target runtime, make them gradually. BOINC will still request enough work units for the time based on the old preference before it sees they begin running longer. So, best to make changes when you are requesting only a small cache, and to make changes of just a notch or two per day. peter, I hear ya. I would just point out that it is not every time a task is late that results in a credit problem. It only seems to be if one fails, a second is late and then a third is issued and then the second is reported back. So what I'm saying is, don't just go by the last digit on the WU name to judge. Also keep in mind that chances are that the late result will not come back in time to conflict with you. Although with a larger cache, you would have time to go look and see if it came in. Perhaps you could add to the trac item, and post about this issue on other project boards as well. It's gone unfixed for a long time. Rosetta Moderator: Mod.Sense |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,208,737 RAC: 2,882 |
"mikey, it wasn't peter that was late with the results, so a smaller cache doesn't help what he's talking about." Whoops, sorry. "peter, I hear ya. I would just point out that it is not every time a task is late that results in a credit problem. It only seems to be if one fails, a second is late and then a third is issued and then the second is reported back." This is a long-time BOINC thing, if I understand it this time: if person A gets a unit but doesn't return it before the deadline, the project reissues the unit, sending it to person B. But then if person A returns the unit before person B, person A does get credit and person B gets the "too many results" error message. Dr. A, and others, knew about this long, long ago and decided it was not a big deal since it only happened rarely. The way I see solving the problem is to not allow person A to return the unit once it has been reissued, giving them an error message if they do try to return it. If person B returns the unit before person A, then person A does get an error message. That is why at Seti they toyed with the idea of only sending units out as reissues to computers that could return the unit within 24 hours or less. This would also clear the database of 'hanging' units quicker. |
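
The reissue race described in this thread can be illustrated with a toy model. This is a minimal sketch in Python and not BOINC's scheduler code: `WorkUnit`, `issue`, `report`, and `cancellable` are hypothetical names, and the assumption that a single successful result satisfies the work unit is made purely for illustration. It shows person A returning late after the unit was reissued to person B, and which outstanding copies could be cancelled if they have not been started, as pete suggests.

```python
from dataclasses import dataclass, field


@dataclass
class WorkUnit:
    """Toy work unit; `quorum` successful results are enough (assumed 1)."""
    quorum: int = 1
    successes: list = field(default_factory=list)
    outstanding: set = field(default_factory=set)

    def issue(self, host):
        """Send a copy of the task to `host`."""
        self.outstanding.add(host)

    def report(self, host):
        """Record a result from `host` and return the credit outcome."""
        self.outstanding.discard(host)
        if len(self.successes) >= self.quorum:
            # The work unit is already satisfied, so this result is surplus.
            return "no credit"
        self.successes.append(host)
        return "credit"


def cancellable(wu, started):
    """Hosts holding unstarted copies of an already satisfied work unit."""
    if len(wu.successes) < wu.quorum:
        return set()
    return {host for host in wu.outstanding if host not in started}


if __name__ == "__main__":
    wu = WorkUnit()
    wu.issue("A")            # first copy goes to person A
    wu.issue("B")            # A misses the deadline, so the unit is reissued to B
    print("A:", wu.report("A"))                        # A's late result arrives first
    print("can cancel:", cancellable(wu, started=set()))
    print("B:", wu.report("B"))                        # B finishes anyway
```

In this model, cancelling B's copy is only possible while B has not started it; once B is crunching, the time is spent either way, which matches the complaint in the first post.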