How do you keep from losing work when Windows wants to reboot your system after an update?

Message boards : Number crunching : How do you keep from losing work when Windows wants to reboot your system after an update?

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 58038 - Posted: 19 Dec 2008, 16:32:34 UTC
Last modified: 19 Dec 2008, 16:33:13 UTC

Since Mod disagreed with me in the 1.47 thread about how to avoid losing work when Windows needs to reboot, here's a thread where he or others can explain their preferred way to keep from losing a task due to a reboot.

I suggested a way to preserve a task before rebooting, which I have tried and tested (though perhaps my explanation of it is not correct).

I go to Activity in BOINC Manager and suspend work.
After the drive finishes writing, I reboot. I can come back to Windows and reboot a few more times, and once things have settled down I just have BOINC Manager unsuspend the tasks, and everything picks up from where it left off.

So, Mod, what is not correct about this statement, even though it has worked numerous times on my machine? Is there a different way, or a different explanation, of how to keep from losing a task when Windows reboots too many times and Rosetta is restarted too many times as a result?

One reboot should not do anything to Rosetta, but multiple reboots and task restarts will lead to a loss similar to what rochester experienced. I know that first hand.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58039 - Posted: 19 Dec 2008, 17:07:46 UTC

For reference, Greg's original post

As Rosetta starts up a task, it reviews and updates a counter. If this task has been started from this same point five times in a row, the task is basically aborted for you. Five tries seemed like a reasonable balance between normal operations of a machine, and a task that's not running well for your environment.

So, what does "this same point" mean?
The task always restarts from a checkpoint (or from scratch if the task has not previously reached a checkpoint). So, whenever a checkpoint is saved, the counter is reset to zero. Checkpoints are saved during a model when possible, and also at the end of each model.
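
The counter behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not Rosetta's actual code: the class name `RestartCounter`, its method names, and the constant `MAX_RESTARTS` are made up for the example, and the exact boundary (whether the fifth start itself is the one that gets aborted) is an assumption. Only the limit of five and the reset on checkpoint come from the description.

```python
MAX_RESTARTS = 5  # limit mentioned in the post; the constant name is hypothetical


class RestartCounter:
    """Toy model of the per-task restart counter (not Rosetta's real code)."""

    def __init__(self):
        self.restarts = 0

    def on_startup(self):
        """Count one start from the current point.

        Returns True if the task may keep running, False if it has now been
        started from the same point MAX_RESTARTS times and should be aborted.
        """
        self.restarts += 1
        return self.restarts < MAX_RESTARTS

    def on_checkpoint(self):
        """A saved checkpoint means progress was made, so the count starts over."""
        self.restarts = 0


counter = RestartCounter()
outcomes = [counter.on_startup() for _ in range(5)]
print(outcomes)
```

With no checkpoint in between, the fifth call to `on_startup()` in this sketch is the one that reports the task as aborted; any call to `on_checkpoint()` along the way would reset the count.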

How can I force Rosetta to take a checkpoint?
You can't. Keeping tasks in memory is recommended, but if you are about to reboot the machine, it's not going to help. Suspending a task does not preserve any data.

So, if you know you may be about to reboot several times in a row, suspending BOINC, the project, or the specific tasks first would be a good way to ensure they don't try to start up on each reboot. Sometimes BOINC gets confused into thinking it needs more work to do. So here is what I do...

To avoid counting all of the restarts, I would do the following:

Suspend network activity (Advanced view, Activity pulldown)
Suspend CPU (also in Advanced view, Activity pulldown)
Exit BOINC (with file -> exit, not the red "X")
...Apply your fixes, doing all required reboots.
Then start BOINC (or perhaps you set it to start at machine start)
Allow CPU (Activity pulldown, "run based on preferences")
Allow network access (Activity pulldown, "network activity based on preferences")

And what we've achieved here, is any number of reboots, and Rosetta only being aware of one. So it will just add one to the number of restarts count. This avoids needlessly aborting the tasks due to 5 restarts.
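
For anyone who prefers a script to the BOINC Manager menus, the same sequence could perhaps be driven through BOINC's command-line tool. The sketch below is an assumption-laden illustration, not something from this thread: it assumes the tool is called `boinccmd` and is on the PATH (older clients may name or locate it differently), and that the `auto` mode corresponds to the "based on preferences" menu entries. The names `BEFORE_REBOOTS`, `AFTER_REBOOTS` and `run_steps` are made up for the example.

```python
import subprocess

# Steps to run before the reboots: stop network use, stop computing, exit the client.
BEFORE_REBOOTS = [
    ["boinccmd", "--set_network_mode", "never"],
    ["boinccmd", "--set_run_mode", "never"],
    ["boinccmd", "--quit"],
]

# Steps to run once the BOINC client has been started again after the last reboot.
AFTER_REBOOTS = [
    ["boinccmd", "--set_run_mode", "auto"],
    ["boinccmd", "--set_network_mode", "auto"],
]


def run_steps(steps, runner=subprocess.run):
    """Run each command in order, stopping at the first one that fails."""
    for cmd in steps:
        runner(cmd, check=True)
```

Calling `run_steps(BEFORE_REBOOTS)` before applying the updates, and `run_steps(AFTER_REBOOTS)` after the client is running again, would mirror the manual steps. The `runner` parameter exists only so the sequence can be checked without a BOINC client installed.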

So the main detail in your description I feel is flawed, was that suspend is saving anything. Also you referred to keeping tasks in memory, I highly recommend that as well, but in this case, we're about to power off the machine, so it is not going to help.
Rosetta Moderator: Mod.Sense

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 58040 - Posted: 19 Dec 2008, 18:45:42 UTC

Mod,

Very good explanation, as always. Mine was not that good, though I hit on some of the same points as you.

True, when you shut down, memory is lost. I didn't think about that.

Thanks for clarifying this point. Perhaps you can add it to your Q&A section.

rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 58041 - Posted: 19 Dec 2008, 19:26:37 UTC - in response to Message 58039.  

ok thanks i had to make a copy of all that





rochester new york
Joined: 2 Jul 06
Posts: 2842
Credit: 2,020,043
RAC: 0
Message 58042 - Posted: 19 Dec 2008, 19:42:24 UTC - in response to Message 58040.  


yes good idea




jay
Joined: 12 Jan 08
Posts: 20
Credit: 195,801
RAC: 0
Message 58051 - Posted: 20 Dec 2008, 4:55:50 UTC - in response to Message 58013.  

Hello Feet1st, Re: Paging..

I kept on reading Microsoft FAQ, The Task Manager Help and others..

Found the best answer in a forum on another BOINC project: Clean Air Project.

see
http://www.WorldCommunityGrid.org/forums/wcg/viewthread?thread=22602

This is an entire thread on the page faults.

I had missed the distinction between hard and soft page faults.
The hard fault (when going to disk) was not what I was having.
(There was over one GB free - and - no excessive disk activity light.)

As I understand it, a 'soft' fault occurs when an application does a malloc, or makes some other system call for memory, and the OS goes and gets memory from the free space. Perhaps C++ does the same thing when creating and freeing instances...

On the Rosetta task, the XP Task Manager (Processes tab) showed Mem Usage and VM Size increase for a couple of minutes and then drop back. This behavior continued throughout the processing of the WU.

The WCG thread talked about programs that managed memory in the user heap - rather than using the expensive OS calls - as more efficient.

It doesn't look like adding memory or user tweaking can help this.

Perhaps that would apply to the rosetta task as well.

Thanks for your supporting data!!!!

Jay

dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,688,048
RAC: 9,222
Message 58053 - Posted: 20 Dec 2008, 8:56:26 UTC

I don't shut down very often, but when I do on my single-core machine I use sacp - see this thread:
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=3061&nowrap=true#39117

It runs a file when Rosetta next checkpoints, and that file can be a batch file which either shuts down or reboots. If you don't need to be there for the reboot, and so can have it delayed by a minute or thirty, you won't lose work and won't have to worry about the number of reboots: the task will always have progressed from the last checkpoint, and therefore BOINC won't discard any work.
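
The general idea of "wait for the next checkpoint, then run something" might look like the sketch below. How sacp actually detects a checkpoint is not described here, so this is not its implementation: it simply assumes, for illustration, that a checkpoint shows up as a change in some file's modification time. The function name `run_after_next_checkpoint` and all of its parameters are hypothetical.

```python
import os
import time


def run_after_next_checkpoint(checkpoint_file, action, poll_seconds=5.0,
                              get_mtime=os.path.getmtime, sleep=time.sleep):
    """Wait until checkpoint_file's modification time changes, then call action().

    Assumption: a new checkpoint rewrites checkpoint_file, so a changed
    modification time means a checkpoint has just been saved.
    """
    baseline = get_mtime(checkpoint_file)
    while get_mtime(checkpoint_file) == baseline:
        sleep(poll_seconds)  # nothing new yet; check again later
    return action()
```

In practice `action` would be something that launches the shutdown or reboot batch file. The `get_mtime` and `sleep` parameters are there so the waiting logic can be exercised without a real task running.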

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 58056 - Posted: 20 Dec 2008, 12:42:13 UTC
Last modified: 20 Dec 2008, 12:45:00 UTC

mod,

just a question on next to last statement.
"So the main detail in your description I feel is flawed, was that suspend is saving anything."

If it is not 'saving' anything, then why is the HDD grinding away upon suspension of activity, and why is it grinding away again when you resume?

There must be some sort of data being saved to the HDD; otherwise Rosetta would not have any information to refer to upon restart.

"And what we've achieved here, is any number of reboots, and Rosetta only being aware of one. So it will just add one to the number of restarts count. This avoids needlessly aborting the tasks due to 5 restarts."
Where is the count data stored, if not on the HDD in the project folder?

Some significant amount of data must be stored on suspend; otherwise the drive would not take x seconds to write information, and then x seconds again to read it back and pick up where you left off, so to speak. I never studied the graphics on suspend or restart to see whether it picks up where it left off or starts a new model.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58059 - Posted: 20 Dec 2008, 15:10:58 UTC

Greg, yes, I thought you might ask about that, and that was my reasoning for a new thread. To have room to fully discuss it.

I am uncertain what you are observing with your hard drive. Let me state what I know.

I know that suspend doesn't save anything.

I know that startup does save this counter (in the slot directory of the task I believe), so yes, saves it on your hard drive. But that is at start up, not shutdown/suspend time.

I know that when many tasks are running on a machine, any of them could cause the hard drive to be used.

Some possible explanations as to why you see your hard drive activity upon suspend:

If you didn't keep tasks in memory, a suspend would cause the system to clean up all the memory pages of the task. And if some of those pages are in the swap file, it may physically be revising the swap file to indicate they no longer belong to the Rosetta task.

Since you do keep tasks in memory, suspend puts the tasks into a wait state. This means a task has no immediate need for the physical memory it presently occupies, so perhaps Windows is pushing some pages out to the swap file in order to free up physical memory for other tasks. Unfortunately, this action doesn't preserve anything for Rosetta to use once the machine is powered off.

You can prove to yourself what is preserved and what is not, but it takes some patience. You can check file revision dates, or find a model end, or activate the checkpoint debug messages to determine when a checkpoint occurs for a task. Note the CPU time for the task at the time of the checkpoint. Then let it crunch for a while into the next model, and note the CPU time used by the task so far. Then use your suspend-first approach and exit (not close) BOINC (or shut down the whole machine). Then restart BOINC and watch as the task starts up. Once it is initialized and running, the CPU time will drop back to what it was at the time of the checkpoint. This is often what people mean when they refer to "lost work": the 30 minutes or so of work between the checkpoint and the point in the next model where you stopped is lost and must be redone.
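
Two parts of that experiment lend themselves to a small helper: checking which files were revised recently, and working out how much CPU time was lost. The sketch below is only one possible way to do it; the function names `files_modified_since` and `lost_cpu_seconds` are made up for the example, and which directory to inspect (for instance a task's slot directory) is left to the reader.

```python
import os


def files_modified_since(directory, since_epoch):
    """Return names of files in directory modified at or after since_epoch, newest first."""
    hits = []
    for entry in os.scandir(directory):
        if entry.is_file():
            mtime = entry.stat().st_mtime
            if mtime >= since_epoch:
                hits.append((mtime, entry.name))
    return [name for _mtime, name in sorted(hits, reverse=True)]


def lost_cpu_seconds(cpu_at_checkpoint, cpu_at_exit):
    """CPU seconds that must be redone after restarting from the checkpoint."""
    return max(0.0, cpu_at_exit - cpu_at_checkpoint)
```

For example, a task that had 3600 seconds of CPU time at its last checkpoint and 5400 seconds when BOINC was exited would, by this arithmetic, repeat 1800 seconds of work after the restart.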

So, yes, I was curious too as to why you see the disk activity. The above are some possibilities. Perhaps others have some other ideas on what causes it?
Rosetta Moderator: Mod.Sense

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58060 - Posted: 20 Dec 2008, 15:25:40 UTC

I moved jay's post from the problems with 1.47 thread. I thought it better fit with the discussion here.

He's referring to this post from Feet1st.
Rosetta Moderator: Mod.Sense

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 58063 - Posted: 20 Dec 2008, 18:17:42 UTC
Last modified: 20 Dec 2008, 19:14:07 UTC

mod,

Perhaps it is not Rosetta that is writing to the drive.
I am running Einstein; maybe their project operates differently?
No idea.
But something does write to the drive on suspend and restart.
Swap file activity sounds logical, but since I have only 2 cores I only run 2 projects, Rosetta and Einstein, so there would be no other BOINC tasks that need more physical memory. There would only be Windows processes and other web processes that were already running before the suspension of activity. Would Windows be doing something to increase their usage of the drive?
In any case, I hope someone else can explain the writing to the drive. I am curious about this.

As for things saved to the HDD, what else is saved besides the counter?
The model information would not be saved on suspend, but is saved at each checkpoint? So Rosetta would back up to the last checkpoint that contained a complete model and start from there upon restart? I think this was discussed somewhere a long time ago.

**Note: I just did an upgrade of BOINC, and when I suspended all work there was no writing to the HDD.** Perhaps it was some Windows thing that was writing to the HDD at the time.

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 58069 - Posted: 20 Dec 2008, 21:13:40 UTC - in response to Message 58063.  

...besides the counter what else is saved? the model information would not be saved on suspend? but is saved at each checkpoint? so rosetta would back up to the last checkpoint that contained a complete model and start from there upon restart?


A checkpoint is a save of everything needed to pick up from that point and go forward. At the end of a model, the only thing you really need to go forward is to know what your next model is. The completed models are saved in the .out file. But yes, the counter, the completed models and the checkpoints would be the things Rosetta writes to disk.

A restart occurs from the last checkpoint. This may have been in the middle of a model. Some types of work checkpoint more frequently than others.
Rosetta Moderator: Mod.Sense

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 58070 - Posted: 20 Dec 2008, 21:16:04 UTC - in response to Message 58069.  



great thanks for the info.
so it must have been windows writing.

robertmiles
Joined: 16 Jun 08
Posts: 1233
Credit: 14,284,221
RAC: 995
Message 58153 - Posted: 24 Dec 2008, 3:27:41 UTC - in response to Message 58039.  
Last modified: 24 Dec 2008, 3:54:47 UTC

And what we've achieved here, is any number of reboots, and Rosetta only being aware of one. So it will just add one to the number of restarts count. This avoids needlessly aborting the tasks due to 5 restarts.

So the main detail in your description I feel is flawed, was that suspend is saving anything. Also you referred to keeping tasks in memory, I highly recommend that as well, but in this case, we're about to power off the machine, so it is not going to help.


I've found a similar circumstance in which the leave-in-memory option actually helps: when the idea that a reboot is required turns out to be mistaken, as it is for some Vista updates.

Another similar circumstance: a non-BOINC program is to be run that requires nearly full control of the hard drive, such as the antivirus and anti-spyware programs I use. In this case, it isn't even necessary to fully shut down the BOINC program. It is enough to fully suspend it as you describe and shut down its user interface, provided that leave in memory is already enabled; BOINC is already allowed to use enough hard drive space that the R@h share alone can store the memory for all the R@h workunits in progress; BOINC is already allowed to use a high enough fraction of swap space to hold this as well; and the upper limit on the amount of swap space used is set sufficiently high. When you later tell BOINC to resume, it is usually able to recover what was left in memory, as long as a reboot wasn't required. It's unlikely that all of these settings are actually required, but they don't cause enough problems on my machine to make it urgent that I test which of them are. These settings seem to have a good side effect: they now allow both of my CPU cores to run minirosetta workunits at the same time, at some cost in page faults, something I seldom saw happen before.



©2024 University of Washington
https://www.bakerlab.org