Message boards : Number crunching : Workunits getting stuck and aborting
Author | Message |
---|---|
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
=================== From Thomas, posted 11 Feb 2007 7:07:24 UTC. The original post was deleted because it was throwing out the thread's margins. =================== Please post any relevant observations on your side. Thank you for your help. Rosetta Moderator: Mod.Sense |
Thomas Leibold Joined: 30 Jul 06 Posts: 55 Credit: 19,627,164 RAC: 0 |
"I will share your findings and thoughts with other project developers tomorrow to see what this can bring to us." Has there been any news on this issue? I know that there are now DOC_* workunits that no longer cause problems, but what about the watchdog timer hang on Linux? Does either the new BOINC 5.8.11 client or the new Rosetta 5.46 client address that issue, or are watchdog hangs still possible on Linux? Team Helix |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Yes, Thomas, the Project Team has been working on these watchdog terminations. The watchdog is not part of BOINC; it is part of Rosetta. It was created to improve ease of use by catching things that don't look right to the watchdog (the "trained eye", if you will) and terminating a task when it doesn't seem to be progressing normally. This lets your computer get on with other work units that may be more fruitful, especially when problems specific to a class of work units are found and the work is pulled off the server to correct it. The short story is that with the DOC work units, what passes for "normal" is not as simple to assess as it used to be. These tasks often spend considerable time in calculations without specific visible signs of progress. You've probably read posts from concerned users who have aborted tasks because they were "hung". It is difficult to assess without specific details of each case, but do keep in mind that the watchdog was created to make that determination FOR you, and to abort the task when the watchdog feels it is appropriate. So, if the watchdog is functioning correctly, aborting a "hung" task should not be necessary. The watchdog has been "in training" and is learning that he really does not need to bark at the mailman every day (the mailman being a normal event that does not require special alarm). The next edition of Rosetta should include some changes for a smarter watchdog. Having the watchdog end a run will always be possible, but recently it has been ending runs that are not in fact hung, and the changes to correct this should be rolled out soon. I believe we're also seeing some reports of the watchdog NOT ending runs that it should have. That issue is under review as well. Rosetta Moderator: Mod.Sense |
j2satx Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0 |
"Yes Thomas, the Project Team has been working on these watchdog terminations. The watchdog is not a BOINC thing, so that is part of Rosetta. It was created to help improve the ease of use by catching things that don't look right to the watchdog (the "trained eye" if you will) and terminating when things don't seem to be progressing normally. This lets your computer get on with other work units which can be more fruitful. Especially when problems specific to a class of work units are found and the work is pulled off the server to correct it." Why is this experimenting being done in Rosetta instead of Ralph? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"Why is this experimenting being done in Rosetta, instead of Ralph?" By "experimenting" I presume you are referring to the fact that the watchdog is not working perfectly in all situations. The purpose of Ralph is to test new tasks and new Rosetta releases prior to their release on Rosetta. This includes all the new science being worked on and all the changes made to the screensaver, the watchdog, etc. As further changes are made to the watchdog, they will be tested first on Ralph, as was the last round of changes. It is not uncommon for a few software problems to go unnoticed during testing. The idea is to catch as many as you can, but there are only so many ways you can test something. When changes are then released to 70,000 machines on Rosetta, there are certainly user environments that present unique situations. The other factor to consider is whether your changes improved things. If your testing shows that your changes improve things, do you wait another couple of weeks to release them because you know they are still not perfect? The last round of watchdog changes was an improvement over what was running on Rosetta prior to the release. It is still not perfect, but testing on Ralph found it to be better than its predecessor. Rosetta Moderator: Mod.Sense |
©2025 University of Washington
https://www.bakerlab.org