WU Checkpoint Issue

Message boards : Number crunching : WU Checkpoint Issue


Heidi1
Joined: 11 Aug 07
Posts: 49
Credit: 1,786,248
RAC: 0
Message 52215 - Posted: 3 Apr 2008, 5:09:12 UTC

Greetings!

I thought I read somewhere in one of these forums that individual WU checkpoints were about every 1/2 hour, so that when you shut down your computer, the WU will start again from the last checkpoint. I'm not sure where I read the 1/2 hour part, because I can't find it now on the boards.

Maybe someone can help me: my WUs are going for longer than 30 minutes before checkpointing. Sometimes my computer, and therefore BOINC, will be on for only about 45 minutes, and when I turn the computer on again, one WU will completely restart from scratch; the other WU will restart from a checkpoint, but I haven't paid as much attention to that one. The computer is on quite a bit longer just frequently enough that none of the WUs are returned to the server with an error, but this really is driving me nuts. My BOINC prefs are set for the program to write to disk every 10 seconds, so it can't be that. Nothing else in the prefs is screaming this issue at me, so I'm baffled.

I just changed my network prefs so that I'll have longer-running WUs, because my dial-up modem is just taking too long downloading WUs. So it would be nice to have this issue resolved, in one way or another, before I really get into the new lengths. This checkpointing issue has been going on for a while now, while more nuts keep getting added to my pile! :)

Any help will be appreciated. Thanks much!
ID: 52215 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 52219 - Posted: 3 Apr 2008, 15:12:15 UTC

Heidi, it sounds like you have a pretty good handle on the issues here. Unfortunately, there isn't much you can do to improve the situation, but here are a couple of points to hopefully further improve your understanding...

When checkpoints are taken varies significantly by the type of task you happen to be working on. Some of the tasks currently going out are RNA work, and these are not able to checkpoint as frequently. The goal of the Project Team is to build in the ability to checkpoint about every 15 min. or so. But, especially with new types of work being done, this goal is not always met (and obviously the actual time will vary with processor speed).

When you specify the time for write to disk in the preferences, you are not specifying when the application WILL write. You are specifying how frequently it is ALLOWED to write. As you point out, Rosetta can sometimes crunch for over an hour without being ready to write to disk, and so nothing is written. So, changing the setting did not help Rosetta to checkpoint any more frequently. In fact, if you are running other projects as well, I'd suggest changing it back.
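
The distinction between "allowed to write" and "will write" can be sketched as a small simulation. This is only an illustration under stated assumptions, not the BOINC client's or Rosetta's actual code: the function names `should_checkpoint` and `simulate`, and the idea of a set of "ready" times, are hypothetical.

```python
def should_checkpoint(now, last_write, min_interval, app_ready):
    """Write only if the app is at a savable point AND the preference allows it."""
    return app_ready and (now - last_write) >= min_interval


def simulate(ready_times, min_interval, duration):
    """Return the times (in seconds) at which a checkpoint would be written.

    ready_times:  hypothetical moments when the application can save its state
    min_interval: the "write to disk at most every N seconds" preference
    duration:     how long the task runs, in seconds
    """
    written = []
    last_write = 0
    for t in range(duration + 1):
        if should_checkpoint(t, last_write, min_interval, t in ready_times):
            written.append(t)
            last_write = t
    return written


# A task that can only save its state after 60 and 120 minutes:
slow_task = {3600, 7200}
print(simulate(slow_task, 10, 7200))  # preference: every 10 seconds
print(simulate(slow_task, 60, 7200))  # preference: every 60 seconds

# A task that could save every 5 seconds is held back by the preference:
fast_task = set(range(0, 31, 5))
print(simulate(fast_task, 10, 30))
```

For the slow task, both preference values give the same two checkpoints, which is the point made above: lowering the setting cannot make a task checkpoint sooner than it is ready to.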

The whole system is designed to be able to recover from being interrupted by you turning off your computer, even when there is 45 min of work that has not yet been preserved. And Rosetta is set up so that if this happens 5 times (5 restarts) with no progress being saved, it will abort the task for you and try another. Perhaps the other will be of a type that can checkpoint more frequently, and therefore will be more appropriate for the way you are using your computer.
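
As a rough sketch of the restart rule just described: the limit of 5 comes from the paragraph above, but the function name `run_sessions` and the assumption that the counter resets whenever a checkpoint is saved are hypothetical, not confirmed Rosetta behaviour.

```python
MAX_RESTARTS_WITHOUT_PROGRESS = 5  # figure quoted in the post above


def run_sessions(sessions):
    """Decide whether a task survives a series of interrupted sessions.

    sessions: list of booleans, one per run of the computer; True means a
              checkpoint was saved before the machine was switched off.
    Returns "aborted" or "running".
    """
    restarts_without_progress = 0
    for checkpoint_saved in sessions:
        if checkpoint_saved:
            # Assumption: any saved progress resets the counter.
            restarts_without_progress = 0
        else:
            restarts_without_progress += 1
            if restarts_without_progress >= MAX_RESTARTS_WITHOUT_PROGRESS:
                return "aborted"
    return "running"


# Five short sessions in a row with nothing saved:
print(run_sessions([False, False, False, False, False]))
# One longer session in the middle that reached a checkpoint:
print(run_sessions([False, False, True, False, False]))
```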

Really the only means you have to preserve the work is to leave BOINC running for longer periods of time. And, when the issue comes up in the forums, you may want to express your support for more frequent checkpointing. This is something they have been working on over time: incorporating more checkpoints into more types of work.

Perhaps others can suggest some other BOINC projects that already checkpoint more frequently.
Rosetta Moderator: Mod.Sense
ID: 52219 · Rating: 0
Heidi1
Joined: 11 Aug 07
Posts: 49
Credit: 1,786,248
RAC: 0
Message 52242 - Posted: 4 Apr 2008, 22:07:24 UTC

Yes, I knew that the time of CPU writing to disk was not the same as checkpointing, but I thought it may help with any answers. I haven't started the new WU lengths yet, so I don't know yet how that will affect checkpointing. Until that happens, while I still have the older ones, I guess that when I know the computer will be on a short amount of time, I just won't run BOINC. This scenario won't last too much longer, so it won't be a big problem.

Again, thanks!
ID: 52242 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 52245 - Posted: 4 Apr 2008, 22:29:14 UTC - in response to Message 52242.  

...when I know the computer will be on a short amount of time, I just won't run BOINC.


...or just understand that the shorter the time it is on, the more likely it is that the work cannot be preserved. Rosetta can send out tasks that complete a model every 10 minutes. And when a model completes, the results are always stored permanently.

The preferred runtime setting will not alter the frequency of the checkpoints.

Rosetta Moderator: Mod.Sense
ID: 52245 · Rating: 0
Heidi1
Joined: 11 Aug 07
Posts: 49
Credit: 1,786,248
RAC: 0
Message 52289 - Posted: 6 Apr 2008, 0:59:53 UTC

This last time when I turned off the computer and turned it back on (& starting BOINC), I was paying attention to the timestamp of a few files located in the individual slots directories. Is <farlxcheck> the file that keeps the checkpointing? If that's it, then I can at least check that to get an idea of how much work I'll be losing when I shut down. Obviously, I can't change its data, and I don't want to, but I could monitor it at least.

Now I'm probably way off base . . . :)

Is there a way to request WUs with more frequent checkpoints, or is that automatically assigned?
ID: 52289 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 52291 - Posted: 6 Apr 2008, 1:40:50 UTC

You can TRY to tell the science application how seldom you want it to checkpoint. But, ultimately, the Science Application (read project) controls that ...
ID: 52291 · Rating: 0
Ingleside

Joined: 25 Sep 05
Posts: 107
Credit: 1,514,472
RAC: 0
Message 52296 - Posted: 6 Apr 2008, 12:29:56 UTC - in response to Message 52289.  

This last time when I turned off the computer and turned it back on (& starting BOINC), I was paying attention to the timestamp of a few files located in the individual slots directories. Is <farlxcheck> the file that keeps the checkpointing? If that's it, then I can at least check that to get an idea of how much work I'll be losing when I shut down. Obviously, I can't change its data, and I don't want to, but I could monitor it at least.

To see when a Task last checkpointed, it's easier to use BOINC's built-in logging options. Use Notepad or similar, and add a file called cc_config.xml (or edit the current one if already present) to your BOINC directory. In this file, include at least:

<cc_config>
  <log_flags>
    <checkpoint_debug>1</checkpoint_debug>
  </log_flags>
</cc_config>

After saving the file, in BOINC-Manager select "Advanced / Read config file", and in the Messages-tab you'll now get a message each time a Task has checkpointed. Example:

06.04.2008 14:21:54|rosetta@home|[checkpoint_debug] result bench80_rozilla_abrelax_natfrag_2ccvA_2986_44253_0 checkpointed
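
If you save the Messages output to a text file, lines like the one above could be scanned with a short script to find the last checkpoint of each task. This is a minimal sketch that assumes the exact line layout shown in the example, including its day.month.year timestamp; the timestamp format may differ with other regional settings, and the names `last_checkpoints` and `minutes_since` are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines shaped like the example above (layout is an assumption).
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2})\|"
    r"(?P<project>[^|]*)\|"
    r"\[checkpoint_debug\] result (?P<task>\S+) checkpointed\s*$"
)


def last_checkpoints(lines):
    """Return {task name: datetime of its most recent checkpoint message}."""
    latest = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a checkpoint message
        when = datetime.strptime(match.group("stamp"), "%d.%m.%Y %H:%M:%S")
        task = match.group("task")
        if task not in latest or when > latest[task]:
            latest[task] = when
    return latest


def minutes_since(when, now):
    """Minutes of work that would be lost if the task stopped at 'now'."""
    return (now - when).total_seconds() / 60


sample = [
    "06.04.2008 14:21:54|rosetta@home|[checkpoint_debug] result "
    "bench80_rozilla_abrelax_natfrag_2ccvA_2986_44253_0 checkpointed",
]
for task, when in last_checkpoints(sample).items():
    print(task, "last checkpointed at", when)
```

Comparing the reported time with the current clock gives an estimate of how much work would be lost by shutting down right now.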


If you want to disable checkpoint-logging, my recommendation is to just edit cc_config.xml again so it's a zero instead of 1 in this line:

<checkpoint_debug>0</checkpoint_debug>

Afterwards, re-read config-file.
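
For anyone who prefers not to edit the file by hand, the same switch could be flipped with a few lines of Python. This is a sketch only: the function name `set_checkpoint_debug` is hypothetical, `some_other_flag` in the usage example is a placeholder rather than a real BOINC option, and reading and writing cc_config.xml in your own BOINC directory is left to you, since its location varies by installation.

```python
import xml.etree.ElementTree as ET


def set_checkpoint_debug(xml_text, enabled):
    """Return cc_config XML text with <checkpoint_debug> set to 1 or 0.

    xml_text may be an empty string if no cc_config.xml exists yet.
    Any other settings already present in the text are kept.
    """
    if xml_text.strip():
        root = ET.fromstring(xml_text)
    else:
        root = ET.Element("cc_config")

    log_flags = root.find("log_flags")
    if log_flags is None:
        log_flags = ET.SubElement(root, "log_flags")

    flag = log_flags.find("checkpoint_debug")
    if flag is None:
        flag = ET.SubElement(log_flags, "checkpoint_debug")

    flag.text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


# Starting without a file, turn logging on:
print(set_checkpoint_debug("", True))

# Turning it off again while keeping another (placeholder) flag:
existing = (
    "<cc_config><log_flags>"
    "<some_other_flag>1</some_other_flag>"
    "<checkpoint_debug>1</checkpoint_debug>"
    "</log_flags></cc_config>"
)
print(set_checkpoint_debug(existing, False))
```

After saving the result as cc_config.xml, the config file still has to be re-read in BOINC Manager as described above.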

Is there a way to request WUs with more frequent checkpoints, or is that automatically assigned?

No.

"I make so many mistakes. But then just think of all the mistakes I don't make, although I might."
ID: 52296 · Rating: 0
FoldingSolutions
Joined: 2 Apr 06
Posts: 129
Credit: 3,506,690
RAC: 0
Message 52297 - Posted: 6 Apr 2008, 12:55:58 UTC - in response to Message 52296.  

An easy solution to the checkpoint problem: just put your computer into hibernate rather than shutting down. This way the work is preserved in the state it is in when you press the hibernate button, and there is no issue of possible data loss or corruption, or of continued power usage, as there is with standby. And if you need to shut your computer down, for whatever reason, then just try to make sure the workunit has checkpointed fairly recently by using Ingleside's checkpoint-logging method :)
HTH

ID: 52297 · Rating: 0
Heidi1
Joined: 11 Aug 07
Posts: 49
Credit: 1,786,248
RAC: 0
Message 52307 - Posted: 7 Apr 2008, 4:03:37 UTC

Thanks, everyone, for the responses. I did the checkpoint debug file, and it works! Even if I can't actually force it to checkpoint (or wash my car :] ) I can at least look at the debug info to see how frequently it is checkpointing to see if I need to wait a bit longer before shutting down.

Regarding hibernating, I wish that were an option. My computer that runs Rosetta doesn't seem to do hibernating. I'm going to attempt this manually, but it's not an option on my shut down menu. In looking at Windows Help, there are some computer manufacturers and components that don't support hibernation. Go figure . . .

Between the longer file size that I'm trying out and the new debug file, I think that's about as much as I can get, and I thank you all for your help. You're supposed to learn something new every day, and I just learned mine.
ID: 52307 · Rating: 0
dcdc
Joined: 3 Nov 05
Posts: 1832
Credit: 119,860,059
RAC: 1,696
Message 52309 - Posted: 7 Apr 2008, 10:49:43 UTC

There is a program called sacp (shutdown at check-point) by Christoph, which I use to switch off some of my computers after a checkpoint rather than just hitting shutdown...

The original thread is here: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=3061&nowrap=true#39117
although the link doesn't appear to be working, so let me know if you want it emailed to you and I'll send it over.
ID: 52309 · Rating: 0
Heidi1
Joined: 11 Aug 07
Posts: 49
Credit: 1,786,248
RAC: 0
Message 52331 - Posted: 8 Apr 2008, 20:42:44 UTC

Thanks anyway, but the checkpoint debug seems to be my best option. I'm crunching 2 WUs, which tend to checkpoint at different times (so far), so I'd rather just look that info up and shut down accordingly. Maybe someone else wants that program.
ID: 52331 · Rating: 0
FoldingSolutions
Joined: 2 Apr 06
Posts: 129
Credit: 3,506,690
RAC: 0
Message 52346 - Posted: 9 Apr 2008, 18:46:00 UTC - in response to Message 52331.  

To hibernate, click start, then control panel, then performance and maintenance, then power options, then on the top tabs click "hibernate", then enable hibernation. Now a file which exactly corresponds to the size of your RAM will be created on C: drive. Though it is hidden therefore not easily found. But more importantly, when you click the shutdown tab, hold shift and the stand by option will turn into hibernate :)
ID: 52346 · Rating: 0
Heidi1
Joined: 11 Aug 07
Posts: 49
Credit: 1,786,248
RAC: 0
Message 52671 - Posted: 23 Apr 2008, 21:34:38 UTC - in response to Message 52346.  

To hibernate, click start, then control panel, then performance and maintenance, then power options, then on the top tabs click "hibernate", then enable hibernation. Now a file which exactly corresponds to the size of your RAM will be created on C: drive. Though it is hidden therefore not easily found. But more importantly, when you click the shutdown tab, hold shift and the stand by option will turn into hibernate :)


Thanks for the tip! Everything ran smoothly up to the point where I clicked the button to hibernate, and then my computer wouldn't do it. I get the logoff-type screen saying "Preparing for hibernation", and then it goes straight back to my Windows desktop. The file was created on my C: drive. I read online that some graphics drivers don't allow hibernation, and mine must be one of them (Nvidia), as I can't get it to go. I've tried this twice, and it has done the same thing both times.

I did try putting the WU on standby and then shutting down my computer. Nope. The WU still restarted at the last checkpoint. Oh well.

On a side note, I have noticed that some WUs checkpoint as infrequently as every 2 hours! That is one model for that whole stretch of time! I think it was a FRA_ WU.
ID: 52671 · Rating: 0




©2025 University of Washington
https://www.bakerlab.org