Should I abort already finished workunits?

Message boards : Number crunching : Should I abort already finished workunits?

TSD

Joined: 10 Oct 08
Posts: 7
Credit: 2,189,714
RAC: 0
Message 100713 - Posted: 12 Mar 2021, 1:00:18 UTC

Sometimes I get a workunit tagged "Timed out - no response": someone else got the workunit before me and did not finish it within the 3-day deadline.

And sometimes such a workunit shows as "Completed and validated" - after the deadline - while my system is still working on the same workunit.


Would it be a good idea to abort such a workunit and continue with another one? Or should I let it finish? I don't care about getting credits.

I hope this makes sense. And I apologize if this has been answered before. I have tried to make some searches.
ID: 100713
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1679
Credit: 17,816,373
RAC: 22,802
Message 100716 - Posted: 12 Mar 2021, 8:32:54 UTC - in response to Message 100713.  

Would it be a good idea to abort such a workunit and continue with another one? Or should I let it finish? I don't care about getting credits.
I'd let it finish as it will still be returning a useful result.
Grant
Darwin NT
ID: 100716
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 100719 - Posted: 12 Mar 2021, 22:35:17 UTC - in response to Message 100713.  

If you notice that the _0 task for the workunit has completed and validated before your _1 task has finished, you might as well abort it and start something new because your results will be identical to the set already submitted and (unless your machine is substantially faster than the other, or your run time is set higher) won’t add anything.

[So that’s two conflicting answers you’ve got, which leaves you in no better place than before you asked…]
ID: 100719
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1679
Credit: 17,816,373
RAC: 22,802
Message 100720 - Posted: 13 Mar 2021, 0:11:37 UTC - in response to Message 100719.  

If you notice that the _0 task for the workunit has completed and validated before your _1 task has finished, you might as well abort it and start something new because your results will be identical to the set already submitted
No, they won't.
That's the thing with Rosetta work: when a Task is processed it uses a random seed value, so you could process the exact same Task 100 times on the same system and get 100 different results, all of them valid.
That's why it's worth completing a Task, even if its original issue has already been returned and validated.
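
As a rough illustration (a toy Python sketch, not Rosetta's actual code - the "energy" function and trial count are made up), the same piece of work run with different seeds follows different trajectories and lands on different, equally usable answers:

    import random

    def run_task(seed, n_trials=10_000):
        # Toy stand-in for one Rosetta-style trajectory: sample candidate
        # values at random and keep the best "energy" found.
        rng = random.Random(seed)
        best = float("inf")
        for _ in range(n_trials):
            candidate = rng.uniform(-10, 10)
            energy = (candidate - 1.234) ** 2  # pretend scoring function
            best = min(best, energy)
        return best

    # Same work, three different seeds -> three different (all usable) results.
    for seed in (1, 2, 3):
        print(seed, run_task(seed))
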
Grant
Darwin NT
ID: 100720
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 100721 - Posted: 13 Mar 2021, 1:01:08 UTC - in response to Message 100720.  

It seems exceptionally unlikely that scientific experiments are being conducted by random chance, or that the tasks sent out are anything but entirely deterministic. Far more likely is that the seeds are determined by the server in advance, per workunit, to ensure a known distribution over the range of starting points. Observe that every task is sent with the -constant_seed option set, and a specific -jran value.
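
If the server does assign them that way, the idea would look something like this hypothetical Python sketch (the function, workunit names and numeric range are invented for illustration; this is not the project's server code):

    import random

    def assign_seeds(workunit_ids, jran_range=(1, 9_999_999)):
        # Hypothetical server-side step: pick one -jran value per workunit,
        # so every replication (_0, _1, ...) of that workunit reruns the
        # same deterministic trajectory.
        rng = random.Random(12345)  # the server's own seed, made up here
        return {wu: rng.randint(*jran_range) for wu in workunit_ids}

    seeds = assign_seeds(["wu_example_0001", "wu_example_0002"])
    print(seeds)
    # Every task cut from wu_example_0001 would then be sent with
    # -constant_seed and the same -jran value.
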
ID: 100721
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1679
Credit: 17,816,373
RAC: 22,802
Message 100722 - Posted: 13 Mar 2021, 3:04:12 UTC - in response to Message 100721.  

Observe that every task is sent with the -constant_seed option set, and a specific -jran value.
And those values (or at least the -jran one) are different for each Task, so for a given work unit each replication that is sent out starts with a different value.
Hence there is no comparison of returned Tasks for validation as occurs on other projects.
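
To make the distinction concrete, here is a toy Python contrast (purely illustrative - the tolerance and score range are invented, and the standalone check is only a guess at the kind of test involved): redundancy-based validation needs two copies of a result to agree, whereas a standalone scheme judges each returned Task on its own.

    def validate_by_comparison(result_a, result_b, tolerance=1e-6):
        # Redundancy-style validation (as on some other projects): two
        # replications of the same workunit must agree to be accepted.
        return abs(result_a - result_b) <= tolerance

    def validate_standalone(result, lowest=-1000.0, highest=0.0):
        # Standalone idea: each returned result is checked on its own,
        # e.g. "is the reported score within a sane range?"
        return lowest <= result <= highest

    print(validate_by_comparison(-123.4, -123.4))  # True: identical results
    print(validate_by_comparison(-123.4, -98.7))   # False: they differ
    print(validate_standalone(-123.4))             # True on its own merits
    print(validate_standalone(-98.7))              # also True on its own
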


It seems exceptionally unlikely that scientific experiments are being conducted by random chance, or that the tasks sent out are anything but entirely deterministic.
Yet that is partly what is happening: look at some of the std_err outputs for WUs where one Task errors out but another doesn't.
There are a huge number of possible combinations to try, and you need to try as many as possible to discard those that aren't of use, and find those that are. Using random variables (within a given range of values) as a seed value helps achieve that.
Grant
Darwin NT
ID: 100722
TSD

Joined: 10 Oct 08
Posts: 7
Credit: 2,189,714
RAC: 0
Message 100723 - Posted: 13 Mar 2021, 3:30:46 UTC

I am a little confused. I don't know much about what the -constant_seed or -jran options mean or do. It is a little too technical for me.

What I have described as a problem does not happen very often. And so far I think I will just do nothing - and let workunits finish.


@Grant (SSSF) & Brian Nixon

Thanks for your replies. Appreciated.
ID: 100723
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 100726 - Posted: 13 Mar 2021, 11:59:28 UTC - in response to Message 100722.  
Last modified: 13 Mar 2021, 12:08:01 UTC

-jran is set per workunit, not per task. (Evidence: _0 · _1) What would be the point of resending a failed workunit to a second machine if it wasn’t going to retry the exact thing that failed?

You’re right that there are some nondeterministic-looking outcomes where a workunit will fail on one machine but succeed on another. As you’ve pointed out to people many times, those cases are more likely down to hardware failures (which, over the installed base of users’ computers, genuinely are random) or latent platform bugs (such as where we see workunits fail on Windows but not on Linux, and the randomness is in the assignment of tasks to machines) than to unpredictable progression of tasks themselves.

Using random numbers will lead to a random outcome, not a useful one. That is gambling, not science.
ID: 100726
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 100727 - Posted: 13 Mar 2021, 12:02:13 UTC - in response to Message 100723.  
Last modified: 13 Mar 2021, 12:14:39 UTC

Your question has sparked a discussion between Grant and me, based on our different understandings of how the project works. The truth is that neither of us really knows, and unless we get a definitive answer from an administrator, we never will.

As you say: the situation you’re asking about happens so rarely that it will make no significant difference in the grand scheme of the project whether you abort the workunits or let them continue.
ID: 100727
Falconet

Joined: 9 Mar 09
Posts: 353
Credit: 1,227,479
RAC: 1,506
Message 100731 - Posted: 13 Mar 2021, 14:44:53 UTC
Last modified: 13 Mar 2021, 14:55:57 UTC

I would probably run it, especially if it already has a significant amount of CPU time.
An alternative to aborting a task is to change the target CPU run time in your Rosetta@home preferences so that the WU finishes earlier (say the WU already has 5 hours of CPU time: set the target to 4 hours if you want it to finish ASAP, or 6 hours so it finishes soon). After updating the preferences, do a project update in the BOINC Manager. Then I usually either wait a bit, or disable LAIM (leave applications in memory) in the BOINC settings, suspend the task and resume it immediately - it should end soon after.
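
If you find yourself doing that often, the update / suspend / resume steps (not the LAIM toggle) can be scripted with boinccmd, which ships with the BOINC client. A rough Python sketch - the project URL and task name below are placeholders, so substitute your own (both are shown in the BOINC Manager):

    import subprocess

    PROJECT_URL = "https://boinc.bakerlab.org/rosetta/"  # check the exact URL in your Manager
    TASK_NAME = "example_task_name_0"                    # replace with your task's name

    def boinccmd(*args):
        # boinccmd talks to the running client; on Windows you may need its full path.
        subprocess.run(["boinccmd", *args], check=True)

    # Pick up the new runtime preference from the project.
    boinccmd("--project", PROJECT_URL, "update")

    # Suspend and immediately resume the task so the change takes effect sooner.
    boinccmd("--task", PROJECT_URL, TASK_NAME, "suspend")
    boinccmd("--task", PROJECT_URL, TASK_NAME, "resume")
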


Rarely, I miss the deadline by a couple of hours and a replacement task gets sent to another device and returned before I finish my own task, because that replacement host is set with a very low CPU runtime target (1 hour, come on lol). Annoying, but I still deliver my task with its 8 hours of CPU runtime, since an 8-hour task is probably better than a mere 1-hour task.
ID: 100731
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1679
Credit: 17,816,373
RAC: 22,802
Message 100733 - Posted: 13 Mar 2021, 21:44:51 UTC - in response to Message 100726.  

Using random numbers will lead to a random outcome, not a useful one. That is gambling, not science.
Not so.
Many great discoveries have come about by chance. Statistics is a good example of a field where randomness is essential to producing valid results.

Science is about repeatability: if you get a particular result using a particular set of variables, then you should always get the same result under identical conditions.


When it comes to proteins there are billions upon billions of possibilities, most of which aren't viable.
But that still leaves a mind-boggling number that are viable. So many that the models the researchers release are just a punt - an educated punt based on past experience, but a punt nonetheless. And using a random seed value, within limits set by the researcher, will produce a range of valid results that will show whether their model is on the right track, and if so in which direction they should head.
Grant
Darwin NT
ID: 100733
mikey

Joined: 5 Jan 06
Posts: 1895
Credit: 9,164,050
RAC: 4,004
Message 100734 - Posted: 13 Mar 2021, 23:44:16 UTC - in response to Message 100733.  

Using random numbers will lead to a random outcome, not a useful one. That is gambling, not science.
Not so.
Many great discoveries have come about by chance. Statistics is a good example of a field where randomness is essential to producing valid results.

Science is about repeatability: if you get a particular result using a particular set of variables, then you should always get the same result under identical conditions.


When it comes to proteins there are billions upon billions of possibilities, most of which aren't viable.
But that still leaves a mind-boggling number that are viable. So many that the models the researchers release are just a punt - an educated punt based on past experience, but a punt nonetheless. And using a random seed value, within limits set by the researcher, will produce a range of valid results that will show whether their model is on the right track, and if so in which direction they should head.


AGREED - it's a lot like the projects looking for prime numbers. Sure, they go one by one as they look for them, but the project scientists really do have a very good idea of whether a prime will be found in the current batch, the next batch, or the batch after that - they are fairly predictable, not exactly but pretty close. Yes, eventually Rosetta, or some other project, could circle back and pick up all the ranges of things they are currently not doing, but predictability is how most BOINC projects work, and most seem to work pretty well.
ID: 100734
KWSN_Sir_Frank_of_the_Wood

Joined: 17 Mar 21
Posts: 1
Credit: 23,589
RAC: 0
Message 100826 - Posted: 24 Mar 2021, 17:47:46 UTC

New to Rosetta - less than 10 days...

Noticed this morning that 12 work units (out of 30 or so in the last batch) had been flushed/discarded by the server before my machine started on them.

Each had been re-sent to me shortly after the 72-hour deadline had passed, and then the previous cruncher had completed the unit a few hours later.

Seems to me that this is better than a lot of wheel-spinning on processing units that have already been completed successfully - and apparently the server thinks that a re-sent unit is exactly the same as the original unit.


frank
ID: 100826
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1679
Credit: 17,816,373
RAC: 22,802
Message 100832 - Posted: 25 Mar 2021, 8:15:59 UTC - in response to Message 100826.  

New to Rosetta - less than 10 days...

Noticed this morning that 12 work units (out of 30 or so in the last batch) had been flushed/discarded by the server before my machine started on them.
Having a smaller cache fixes that. In the BOINC computing preferences, for example:
    Store at least 0.1 days of work
    Store up to an additional 0.01 days of work
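
If you prefer to set those locally rather than on the website, they map to the work buffer values in BOINC's global_prefs_override.xml. A small Python sketch - the file path is an example, so point it at your own BOINC data directory:

    import subprocess
    from pathlib import Path

    # Example location; on Windows the data directory is usually C:\ProgramData\BOINC.
    override = Path("/var/lib/boinc-client/global_prefs_override.xml")

    override.write_text(
        "<global_preferences>\n"
        "  <work_buf_min_days>0.1</work_buf_min_days>\n"
        "  <work_buf_additional_days>0.01</work_buf_additional_days>\n"
        "</global_preferences>\n"
    )

    # Ask the running client to re-read the override file.
    subprocess.run(["boinccmd", "--read_global_prefs_override"], check=True)
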

Grant
Darwin NT
ID: 100832



