Message boards : Number crunching : how do you keep from losing work when windows wants to reboot your system after a update?
Author | Message |
---|---|
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
since mod disagreed with me on this issue in the 1.47 thread on how to save yourself from losing work when windows needs to reboot, here's a thread for him or others to explain the right way to keep from losing a task due to a reboot. here is what i suggested, and have tried and tested, to preserve a task before rebooting (perhaps my explanation is not correct): i go to activity in boinc manager and suspend work. after the drive is done writing i reboot, and i can come back to windows and reboot a few more times. once things have settled down i just have boinc manager unsuspend the tasks and everything picks up from where it left off. so mod, what is not correct about this statement, even though it has worked numerous times on my machine? is there a different way, or a different explanation, of how to keep from losing a task when windows reboots too many times and rosetta is restarted too many times because of it? one reboot should not do anything to rosetta, but multiple reboots and restarts of tasks will lead to a loss similar to what rochester experienced. i know that first hand. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
For reference, Greg's original post
As Rosetta starts up a task, it reviews and updates a counter. If the task has been started from the same point five times in a row, the task is basically aborted for you. Five tries seemed like a reasonable balance between normal operation of a machine and a task that's not running well in your environment.
So, what does "this same point" mean? The task always restarts from a checkpoint (or from scratch if the task has not previously reached a checkpoint). So, whenever a checkpoint is saved, the counter is reset to zero. Checkpoints are saved during a model when possible, and also at the end of each model.
How can I force Rosetta to take a checkpoint? You can't. Keeping tasks in memory is recommended, but if you are about to reboot the machine, it's not going to help. Suspending a task does not preserve any data.
So, if you know you could be about to reboot several times in a row, suspending either BOINC or the project or the specific tasks first would be a good way to ensure they don't try to start up on each reboot. Sometimes BOINC gets confused into thinking it needs more work to do, so here is what I do. To avoid counting all of the restarts, I would do the following:
1. Suspend network activity (Advanced view, Activity pulldown)
2. Suspend CPU (also in Advanced view, Activity pulldown)
3. Exit BOINC (with file -> exit, not the red "X")
4. Apply your fixes, doing all required reboots
5. Start BOINC (or perhaps you set it to start at machine start)
6. Allow CPU (Activity pulldown, "run based on preferences")
7. Allow network access (Activity pulldown, "network activity based on preferences")
And what we've achieved here is any number of reboots, with Rosetta only being aware of one. So it will just add one to the restart count. This avoids needlessly aborting the tasks due to 5 restarts.
So the main detail in your description that I feel is flawed is that suspend is saving anything.
Also, you referred to keeping tasks in memory. I highly recommend that as well, but in this case we're about to power off the machine, so it is not going to help. Rosetta Moderator: Mod.Sense |
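Editor's sketch: the restart counter described in the post above can be modelled in a few lines of Python. This is only an illustration of the rule as stated (reset on checkpoint, abort after five starts from the same point). The names `Task`, `MAX_RESTARTS` and `simulate` are hypothetical, and whether Rosetta aborts on exactly the fifth start or just after it is an assumption, not something the post specifies.

```python
MAX_RESTARTS = 5  # from the post: five starts from the same point aborts the task


class Task:
    """Toy model of a task's restart counter (not Rosetta's actual code)."""

    def __init__(self):
        self.restarts_since_checkpoint = 0
        self.aborted = False

    def start(self):
        """Called each time the task is (re)started from its last checkpoint."""
        if self.aborted:
            return
        self.restarts_since_checkpoint += 1
        # Assumption: the fifth consecutive start from the same point aborts.
        if self.restarts_since_checkpoint >= MAX_RESTARTS:
            self.aborted = True

    def checkpoint(self):
        """Saving a checkpoint resets the counter to zero."""
        self.restarts_since_checkpoint = 0


def simulate(reboots, boinc_exited_first):
    """Run a series of reboots with or without exiting BOINC beforehand."""
    task = Task()
    if boinc_exited_first:
        # BOINC is not running during the reboots, so the task starts once.
        task.start()
    else:
        # The task tries to start after every reboot and never checkpoints.
        for _ in range(reboots):
            task.start()
    return task


if __name__ == "__main__":
    print(simulate(6, boinc_exited_first=False).aborted)
    print(simulate(6, boinc_exited_first=True).aborted)
```

In this model, six reboots with BOINC left running exhaust the counter, while the same six reboots with BOINC exited first count as a single restart.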
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Mod, very good explanation as always...mine was not that good. though i hit on some of the same points as you. true, that when you shut down memory is lost. didn't think about that. thanks for clarifying this point, perhaps you can add it to your q&a section. |
rochester new york Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
ok thanks i had to make a copy of all that For reference, Greg's original post |
rochester new york Joined: 2 Jul 06 Posts: 2842 Credit: 2,020,043 RAC: 0 |
yes good idea Mod, |
jay Joined: 12 Jan 08 Posts: 20 Credit: 195,801 RAC: 0 |
Hello Feet1st, Re: Paging.. I kept on reading Microsoft FAQ, The Task Manager Help and others.. Found the best answer in a forum on another BOINC project: Clean Air Project. see http://www.WorldCommunityGrid.org/forums/wcg/viewthread?thread=22602 This is an entire thread on the page faults. I had missed the distinction between hard and soft page faults. The hard fault (when going to disk) was not what I was having. (There was over one GB free - and - no excessive disk activity light.) What I can understand of the 'soft' fault is when an application does a malloc, or some system call for memory, and the OS goes and gets memory from the free space. Perhaps C++ does the same thing when creating and freeing instances... On the Rosetta task, the xp task manager (process tab) showed mem usage and VM size increase for a couple of minutes and then drop back. This behavior continued throughout the processing of the WU. The WCG thread talked about programs that managed memory in the user heap - rather than using the expensive OS calls - as more efficient. It doesn't look like adding memory or user tweaking can help this. Perhaps that would apply to the rosetta task as well. Thanks for your supporting data!!!! Jay |
dcdc Joined: 3 Nov 05 Posts: 1832 Credit: 119,688,048 RAC: 9,222 |
i don't shut down very often, but when I do on my single core machine i use sacp - see this thread: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=3061&nowrap=true#39117 It runs a file when rosetta next checkpoints, and that file can be a batch file which either shuts down or reboots. If you don't need to be there for the reboot, and so can have it delayed by a minute or thirty, then this means you'll not lose work and won't have to worry about the number of reboots, as the task will always have progressed from the last checkpoint and therefore boinc won't discard any work. |
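Editor's sketch: the general idea of "wait for the next checkpoint, then run something" can be illustrated in Python as below. This is not how sacp is implemented. It assumes, purely for illustration, that a checkpoint shows up as a change in some file's modification time; the function name `wait_for_file_change` and the file you point it at are hypothetical.

```python
import os
import time


def wait_for_file_change(path, poll_seconds=5.0, timeout=None, sleep=time.sleep):
    """Poll a file until its modification time changes.

    Returns True once the mtime differs from the value seen at the start,
    or False if `timeout` seconds of polling pass without a change.
    The `sleep` argument exists so the wait can be replaced in tests.
    """
    start_mtime = os.path.getmtime(path)
    waited = 0.0
    while timeout is None or waited < timeout:
        sleep(poll_seconds)
        waited += poll_seconds
        if os.path.getmtime(path) != start_mtime:
            return True
    return False


# Hypothetical usage: once the watched file changes, hand off to whatever
# should happen next, for example a batch file that reboots the machine.
# if wait_for_file_change(r"path\to\checkpoint_file", timeout=1800):
#     run_reboot_script()  # placeholder, not a real function
```

The point of waiting is the one made in the post: if the reboot happens right after a checkpoint, the task restarts from fresh progress and nothing needs to be redone.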
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
mod, just a question on your next-to-last statement. "So the main detail in your description I feel is flawed, was that suspend is saving anything." if it is not 'saving' anything, then why is the HDD grinding away upon suspension of activity, and why is it grinding away again when you resume? there must be some sort of data being saved to the HDD, otherwise rosetta would not have any information to refer to upon restart. "And what we've achieved here, is any number of reboots, and Rosetta only being aware of one. So it will just add one to the number of restarts count. This avoids needlessly aborting the tasks due to 5 restarts." where is the count data stored if not on the HDD in the project folder? there is some significant amount of data being stored on suspend, otherwise the drive would not take x number of seconds to write information and then again x number of seconds to read this data and pick up where you left off, so to speak. i never studied the graphics on suspend or restart to see if it picks up where it left off or starts a new model. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Greg, yes, I thought you might ask about that, and that was my reasoning for a new thread: to have room to fully discuss it. I am uncertain what you are observing with your hard drive. Let me state what I know. I know that suspend doesn't save anything. I know that startup does save this counter (in the slot directory of the task, I believe), so yes, it is saved on your hard drive. But that is at startup, not at shutdown/suspend time. I know that when many tasks are running on a machine, any of them could cause the hard drive to be used.
Some possible explanations as to why you see hard drive activity upon suspend:
If you didn't keep tasks in memory, a suspend would cause the system to clean up all the memory pages of the task. And if some of those pages are in the swap file, it may physically be revising the swap file to indicate they no longer belong to the Rosetta task.
Since you do keep tasks in memory, suspend puts the tasks into a wait state. This means they have no need for the physical memory they presently occupy. And so perhaps Windows is pushing some pages out to the swap file in order to allow more physical memory for other tasks. Unfortunately, this action doesn't preserve anything for Rosetta to use once the machine is powered off.
You can prove to yourself what is preserved and what is not, but it takes some patience. You can check file revision dates, or find a model end, or activate the checkpoint debug messages, to determine when a checkpoint occurs for a task. Note the CPU time for the task at the time of the checkpoint. Then let it crunch for a period of time into the next model. Note the CPU time used for the task so far. Then use your suspend-first approach and exit (not close) BOINC (or shut down the whole machine). Then restart BOINC and watch as the task starts up. Once it gets initialized and running, the CPU time will be reduced to what it was at the time of the checkpoint. This is often what people mean when they refer to "lost work": that 30 min or whatever from the checkpoint into the next model is lost and must be done again to continue.
So, yes, I was curious too as to why you see the disk activity. The above are some possibilities. Perhaps others have some other ideas on what causes it? Rosetta Moderator: Mod.Sense |
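Editor's sketch: the "lost work" check described above comes down to simple arithmetic on two CPU-time readings, plus looking at file modification dates. The Python below is a minimal illustration; the function names are hypothetical, and the directory you pass to `newest_file` (for example a task's slot directory) is something you would have to locate yourself.

```python
import os
from pathlib import Path


def lost_cpu_seconds(cpu_at_checkpoint, cpu_at_exit):
    """CPU time that must be redone after restarting from the checkpoint."""
    return max(0, cpu_at_exit - cpu_at_checkpoint)


def newest_file(directory):
    """Return the most recently modified regular file in `directory`, or None.

    Comparing modification times before and after a suspend is one way to
    see whether anything was actually written at that moment.
    """
    files = [p for p in Path(directory).iterdir() if p.is_file()]
    return max(files, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    # Checkpoint noted at 1 h of CPU time, BOINC exited at 1 h 30 min:
    # the difference is the half hour that has to be crunched again.
    print(lost_cpu_seconds(3600, 5400) / 60, "minutes to redo")
```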
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I moved jay's post from the problems with 1.47 thread. I thought it better fit with the discussion here. He's referring to this post from Feet1st. Rosetta Moderator: Mod.Sense |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
mod, perhaps it is not rosetta that is writing to the drive. i am running einstein and maybe their project operates differently? no idea. but something does write to the drive on suspend and restart. swapfile space sounds logical, but since i have only 2 cores i only run 2 projects, rosetta and einstein. there would be no other boinc tasks that would need more physical memory. there would be only windows processes and other web processes that were already running before the suspension of activity. would windows be doing something to increase their usage on the drive? in any case...hope someone else can explain the writing to the drive. i am curious about this. as for things saving to the hdd, besides the counter what else is saved? the model information would not be saved on suspend? but is saved at each checkpoint? so rosetta would back up to the last checkpoint that contained a complete model and start from there upon restart? i think this was discussed somewhere a long time ago. **note: i just did an upgrade of boinc and when i suspended all work there was no writing to the hdd.** perhaps it was some windows thing that was writing to the hdd at the time. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"...besides the counter what else is saved? the model information would not be saved on suspend? but is saved at each checkpoint? so rosetta would back up to the last checkpoint that contained a complete model and start from there upon restart?" A checkpoint is a save of everything needed to pick up from that point and go forward. At the end of a model, the only thing you really need to go forward is to know what your next model is. The completed models are saved in the .out file. But yes, the counter, the completed models and the checkpoints would be the things Rosetta writes to disk. A restart occurs from the last checkpoint. This may have been in the middle of a model. Some types of work checkpoint more frequently than others. Rosetta Moderator: Mod.Sense |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"...besides the counter what else is saved? the model information would not be saved on suspend? but is saved at each checkpoint? so rosetta would back up to the last checkpoint that contained a complete model and start from there upon restart?" great, thanks for the info. so it must have been windows writing. |
robertmiles Joined: 16 Jun 08 Posts: 1233 Credit: 14,284,221 RAC: 995 |
"And what we've achieved here, is any number of reboots, and Rosetta only being aware of one. So it will just add one to the number of restarts count. This avoids needlessly aborting the tasks due to 5 restarts." I've found a similar circumstance when the leave in memory option actually helps: when the idea that a reboot is required turns out to be mistaken, as it is for some Vista updates. Another similar circumstance: a non-BOINC program is to be run that requires nearly full control of the hard drive, such as the antivirus program and the anti-spyware programs I use. In this case, it isn't even necessary to fully shut down the BOINC program. It is enough to fully suspend it as you describe and shut down its user interface, with the following already set: leave in memory enabled; BOINC allowed to use enough hard drive space that just the R@h share is enough to store the memory for all the R@h workunits already in progress; BOINC allowed to use a high enough fraction of swap space to hold this as well; and a sufficiently high upper limit on the amount of swap space used. When you later tell BOINC to resume, it is usually able to recover what was left in memory, as long as a reboot wasn't required. It's unlikely that all of these settings are actually required, but they don't cause enough problems on my machine to make it urgent that I test which of them are. These settings seem to have a good side effect: they now allow both of my CPU cores to run minirosetta workunits at the same time, at some cost in page faults, something I seldom saw happen before. |
©2024 University of Washington
https://www.bakerlab.org