Message boards : Number crunching : Expired deadline
Author | Message |
---|---|
pixie Send message Joined: 30 Aug 08 Posts: 1 Credit: 3,666,539 RAC: 0 |
I just noticed that I missed the deadline by a whole day. Do I abort them so the rest of the tasks get submitted on time, or do I just let them crunch, so it doesn't ruin the project? TIA |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
They have been reassigned to another user from the looks of it. You could let them run but then the other user crunches but gets no credit. Check each one, if the task has been reassigned I personally would abort them so as not to hurt the new guy. Not sure if you will get hit in credit or not, would think not since they never ran. Someone else might have a different idea on this. I just noticed that I missed the deadline by a whole day. Do I abort them so the rest of the tasks get submitted on time, or do I just let them crunch, so it doesn't ruin the project? |
R.L. Casey Send message Joined: 7 Jun 06 Posts: 91 Credit: 2,728,885 RAC: 0 |
I just noticed that I missed the deadline by a whole day. Do I abort them so the rest of the tasks get submitted on time, or do I just let them crunch, so it doesn't ruin the project? pixie, You should abort tasks that have passed the deadline. The WU has been reassigned because your computer did not finish before the deadline, which seems to be ten days. Actually, once the WU has been reassigned, there is a race: the first of the two computers to report will get credit, and the second will get a Validation Error and no credit. In order to alleviate this problem, you should greatly reduce, in your BOINC network preferences, the "Computer is connected to the Internet about every ___ days" value and the related one about keeping ___ additional days of work. Many of your computers have 400 to 500 WUs waiting to start -- too many, since you will see that many of them are approaching the deadline and have not even started crunching yet. I saw one computer with an average turnaround time of 9.94 days, so results are barely getting in before the deadline. No doubt some are not getting done in time. Try cutting the numbers way down; high numbers of WUs waiting to run are not an advantage. And... VERY nice computer farm!! Thanks for crunching Rosetta! |
Pepo Send message Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
You should abort tasks that have passed the deadline. A rule of thumb: you can immediately abort the task if it has not yet started crunching. If it is already being crunched... then it depends. It is simpler to decide when the tasks take days and you need only the last few hours to finish. Then you can be sure that the reassigned task will surely finish later. Rosetta's tasks are usually much shorter, a reassigned task can be finished in any moment (like a hidden thread :-) So yes - abort it. (Your nice farm will not notice it ;-) You could let them run but then the other user crunches but gets no credit. The WU has been reassigned because your computer did not finish before the deadline which seems to be ten days. Actually, once the WU has been reassigned, there is a race: the first of the two computers to report will get credit, and the second will get a Validation Error and no credit. This should definitely not happen until the second guy's deadline has passed!! (If it does, it is a BOINC server-side error, which should get reported and repaired. The second guy is given a promise to be able to crunch the reassigned WU until his new deadline.) Peter |
Pepo Send message Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
You could let them run but then the other user crunches but gets no credit. I'm taking my words back. It has nothing to do with BOINC server-side software, it is just Rosetta's tight and intolerant settings: minimum quorum 1 You are right: "poor second guy" ;-) Peter |
R.L. Casey Send message Joined: 7 Jun 06 Posts: 91 Credit: 2,728,885 RAC: 0 |
You could let them run but then the other user crunches but gets no credit. I mentioned the 'anomaly' on the Rosetta 5.98 'problems' thread at https://boinc.bakerlab.org/rosetta/forum_thread.php?id=4213&nowrap=true#55918 . It would be nice, I think, if the original, over-deadline results were discarded in favor of the second (as yet unfinished) task, but it's probably a somewhat rare occurrence, not a top priority... |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
R.L. Casey Send message Joined: 7 Jun 06 Posts: 91 Credit: 2,728,885 RAC: 0 |
This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified Thanks for the info, Feet1st! I've been on a crunching sabbatical for a year or so... plus I hadn't followed BOINC, anyway. They have an impressive list of items to assess. Thanks again; glad you're still around. |
Pepo Send message Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified Changeset 276 describes a different case, where a third task is being erroneously reissued, although the total=2. Would the problem mentioned in this thread be solved using 2-2-2 limit settings? minimum quorum 1 (Surely it could take longer to discard the WU from server.) Peter |
ramostol Send message Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0 |
This discussion seems to be going a bit too fast. I should like to see some documentation that Boinc/Rosetta gives credits merely to one computer in cases where two computers have returned results for the same task. I have used a slow but reliable computer for one and a half years. I see this situation quite often, receiving tasks expired or crashed on other computers (and occasionally fighting time limits myself), and I have tried to follow what happens to these tasks. I have never - repeat: never - observed that Rosetta has refused credits to one of two computers delivering valid results for the same task. Rosetta will use the first incoming result as a canonical result, but credits are delivered to everyone. And I think this is proper behaviour. Boinc is designed to run unattended, and participants should not need to worry unnecessarily about deadlines or task duplications. There is of course one limitation. When Rosetta has received one valid result the project is satisfied. The task will stay as statistics for the successful cruncher for a limited period and then disappear from the server. And at that point no one will be able to return results and get credits. As for aborting tasks passing deadlines I am in two minds. If you see that another computer has delivered a valid result then by all means abort your replication. But aborting a task weakens its overall chances of success. I have observed quite a few perfectly sound tasks being cancelled on the server because the first cruncher aborted for time reasons (which the server registers as a compute error) and the next receiver crashed or gave up (because some models are too lengthy, what do I know?). Anyhow, I now wonder if it is better for the project to register two successes for the same task than none at all. |
ramostol Send message Joined: 6 Feb 07 Posts: 64 Credit: 584,052 RAC: 0 |
This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified My very personal theory is that the text "max # of error/total/success tasks" is misleading and should read "max # of error/total/success results". This situation may be created because Boinc somehow does not consider tasks exceeding the deadline ("Server state: Over ; Outcome: No reply ; Client state: New") as results. Then if a result is returned to the server after the deadline in a situation where three computers have received this task for computing, this increases the number of total results to the disadvantage of the third participant. By the way, I doubt that we may conclude from "max # of error/total/success tasks" = 1, 2, 1 that Rosetta should not send out more than 2 replications of the same task. By the same interpretation we should have to conclude that Rosetta terminates a task upon receiving one result with "Client error/Compute error". It doesn't, it terminates upon receiving more than 1, that is 2, error results. And Rosetta is perfectly capable of accepting more than 1 (again = 2) success results under ordinary circumstances, giving proper credits to everyone. |
Pepo Send message Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
My very personal theory is that the text "max # of error/total/success tasks" is misleading and should read "max # of error/total/success results". That's just a different wording. The term "task" was introduced at a later time, to describe what is assigned to and running on a host. Previously there were just WUs, consisting of results (which were to be returned to server after being crunched). But it sounded weird if "a result was running on my host"... Once the files are returned to the server, they are just plain "results of computation". But the official wording might indeed be "tasks" now. By the way, I doubt that we may conclude from "max # of error/total/success tasks" = 1, 2, 1 that Rosetta should not send out more than 2 replications of the same task. But we have to. The scientists use these values to set up how the server should behave during the WU's lifetime. By the same interpretation we should have to conclude that Rosetta terminates a task upon receiving one result with "Client error/Compute error". It doesn't, Sure, it does not. Usually at that moment, there are two results: one failed (1 error is fulfilled) and one just resent. (There should be no second additional resent task, because max is 2.) it terminates upon receiving more than 1, that is 2, error results. Exactly. If either 2 successful or 2 error results are back, suddenly it does not fit in the (1,2,1) form and the WU is declared as failed. And Rosetta is perfectly capable of accepting more than 1 (again = 2) success results under ordinary circumstances, giving proper credits to everyone. That is up to the devs to comment on. Anyway, they are still able to grant (semi-manually?) credit to any successful result, regardless of the WU state. This is often done on beta projects. Peter |
AdeB Send message Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified I guess this hasn't been resolved yet. In this workunit it is clear that at the moment it was sent to one of my computers there were too many total results. And though my computer crunched it for more than 11 hours, and it finished without any errors, it could never validate. Bad luck for me, and a complete waste of CPU-time. AdeB |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified I think it was still valid to get sent to your computer, as the other 2 systems did not reply for whatever reason, which does not count as an error since no results were ever returned. It is too bad you got stuck on a validate error and wasted CPU time. That is a real problem here sometimes. |
AdeB Send message Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified I think that you are right that 'no reply' doesn't count as an error. But it should not be sent to a third computer, because then there will be a validate error as the number of tasks exceeds the maximum number of tasks: max # of error/total/success tasks 1, **2**, 1 |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified yeah i see the human logic vs the computer logic do not match. the boinc ticket 276 explains things pretty good. surprised they haven't fixed this bug. must be super low priority. |
AdeB Send message Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
This issue of reissuing a task to someone that may not receive credit for it has been on the BOINC "to do" list for over a year. Here is a link to the trac item: Server reissues task more then "total" specified Looks like someone stepped in and granted credit for the task. I hope it was also possible to save the results, because that's what it's all about. |
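
The "max # of error/total/success tasks = 1, 2, 1" reading argued over in this thread can be sketched as a toy model. This is only an illustration of the interpretation Pepo and AdeB describe (which ramostol disputes), not BOINC's actual server code: the `Limits` class, the `check_workunit` function, and the outcome strings are all hypothetical names invented here, and treating "no reply" as counting toward the total but not toward errors is an assumption taken from the posts above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Hypothetical holder for the three limits shown on a workunit page."""
    max_error: int = 1
    max_total: int = 2
    max_success: int = 1


def check_workunit(outcomes, limits=Limits()):
    """Return the limits a workunit has exceeded (empty list if none).

    outcomes: one string per task copy sent out, each one of
    "success", "error", "no_reply" or "in_progress".
    Assumption: every copy sent out counts toward the total,
    but only explicit errors count toward the error limit.
    """
    counts = Counter(outcomes)
    problems = []
    if counts["error"] > limits.max_error:
        problems.append("too many error results")
    if len(outcomes) > limits.max_total:
        problems.append("too many total results")
    if counts["success"] > limits.max_success:
        problems.append("too many success results")
    return problems


# One failed copy plus one resend still fits the (1, 2, 1) limits.
print(check_workunit(["error", "in_progress"]))
# Two copies never reported and a third was sent out: total exceeded.
print(check_workunit(["no_reply", "no_reply", "success"]))
# A late result and the reissued result both succeed: success exceeded.
print(check_workunit(["success", "success"]))
```

Under these assumptions, a third copy sent after two "no reply" outcomes already breaks the total limit, which is the case AdeB describes. Whether the real server counts results this way is exactly what the thread leaves open.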
©2024 University of Washington
https://www.bakerlab.org