Message boards : Number crunching : WU Checkpoint Issue
Author | Message |
---|---|
Heidi1 Send message Joined: 11 Aug 07 Posts: 49 Credit: 1,786,248 RAC: 0 |
Greetings! I thought I read somewhere in one of these forums that individual WU checkpoints were about every 1/2 hour, so that when you shut down your computer, the WU will start again from the last checkpoint. I'm not sure where I read the 1/2 hour part, because I can't find it now on the boards. Maybe someone can help me: My WUs are going for longer than 30 minutes before checkpointing. Sometimes my computer, and therefore BOINC, will be on for only about 45 minutes, and when I turn the computer on again, one WU will completely restart from scratch; the other WU will restart from a checkpoint, but I haven't paid as much attention to that one. Now, the computer is on quite a bit longer just frequently enough that none of the WUs are returned to the server with an error, but this really is driving me nuts. My BOINC prefs are set for the program to write to the disk every 10 seconds, so it can't be that. Nothing else in the prefs is screaming this issue at me, so I'm baffled. I just changed my network prefs so that I'll have longer-running WUs, because my dial-up modem is just taking too long downloading WUs. So it would be nice to have this issue resolved, in one way or another, before I really get into the new lengths. This checkpointing issue has been going on for a while now, while more nuts keep getting added to my pile! :) Any help will be appreciated. Thanks much! |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Heidi, it sounds like you have a pretty good handle on the issues here. Unfortunately, there isn't much you can do to improve the situation, but here are a couple of points to hopefully further improve your understanding... When checkpoints are taken varies significantly by the type of task you happen to be working on. Some of the tasks currently going out are RNA work, and these are not able to checkpoint as frequently. The goal of the Project Team is to build in the ability to checkpoint about every 15 min. or so. But, especially with new types of work being done, this goal is not always met (and obviously the actual time varies with processor speed). When you specify the time for write to disk in the preferences, you are not specifying when the application WILL write. You are specifying how frequently it is ALLOWED to write. As you point out, Rosetta can sometimes crunch for over an hour without being ready to write to disk, and so nothing is written. So, changing the setting did not help Rosetta to checkpoint any more frequently. In fact, if you are running other projects as well, I'd suggest changing it back. The whole system is designed to be able to recover from being interrupted by you turning off your computer, even when there is 45 min. of work that has not yet been preserved. And Rosetta is set up so that if this happens 5 times (5 restarts) with no progress being saved, it will abort the task for you and try another. Perhaps the other will be of a type that can checkpoint more frequently, and therefore will be more appropriate for the way you are using your computer. Really the only means you have to preserve the work is to leave BOINC running for longer periods of time. And, when the issue comes up in the forums, you may want to express your support for more frequent checkpointing. This is something they have been working on over time: incorporating more checkpoints into more types of work.
Perhaps others can suggest some other BOINC projects that already checkpoint more frequently. Rosetta Moderator: Mod.Sense |
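The distinction above between when the application WILL write and when it is ALLOWED to write can be illustrated with a small toy model. This is only a sketch of the idea as described in the post, not BOINC's or Rosetta's actual code; the names `CheckpointGate`, `try_checkpoint` and `simulate` are made up for illustration, and the task is assumed to start at time 0.

```python
from dataclasses import dataclass


@dataclass
class CheckpointGate:
    """Hypothetical model of the 'write to disk at most every N seconds' preference."""
    min_interval_s: float
    last_write_s: float = 0.0  # assumption: the task starts at time 0

    def try_checkpoint(self, now_s: float, app_ready: bool) -> bool:
        # The application must first reach a point where its state can be saved.
        if not app_ready:
            return False
        # The preference can only block a write that comes too soon after the
        # previous one; it can never force a write to happen.
        if now_s - self.last_write_s < self.min_interval_s:
            return False
        self.last_write_s = now_s
        return True


def simulate(min_interval_s, ready_times_s, run_length_s):
    """Return the times (in seconds) at which a checkpoint is actually written.

    ready_times_s: sorted times at which the application is able to checkpoint.
    run_length_s:  how long the computer stays on.
    """
    gate = CheckpointGate(min_interval_s)
    written = []
    for t in ready_times_s:
        if t <= run_length_s and gate.try_checkpoint(t, app_ready=True):
            written.append(t)
    return written


# A task that can only checkpoint after 1 hour and 2 hours, run for 45 minutes:
# nothing is saved, no matter how small the preference is.
print(simulate(10, [3600, 7200], 45 * 60))
# A task that could checkpoint every 5 seconds is throttled by a 10-second preference.
print(simulate(10, [5, 10, 15, 20], 100))
```

In this model, lowering the preference only matters for tasks that are ready to checkpoint more often than the preference allows, which matches the point that setting it to 10 seconds did not make Rosetta checkpoint more often.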
Heidi1 Send message Joined: 11 Aug 07 Posts: 49 Credit: 1,786,248 RAC: 0 |
Yes, I knew that the time of CPU writing to disk was not the same as checkpointing, but I thought it may help with any answers. I haven't started the new WU lengths yet, so I don't know yet how that will affect checkpointing. Until that happens, while I still have the older ones, I guess that when I know the computer will be on a short amount of time, I just won't run BOINC. This scenario won't last too much longer, so it won't be a big problem. Again, thanks! |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"...when I know the computer will be on a short amount of time, I just won't run BOINC." ...or just understand that the shorter the time it is on, the more likely it is that the work cannot be preserved. Rosetta can send out tasks that complete a model every 10 minutes. And when a model completes, the results are always stored permanently. The preferred runtime setting will not alter the frequency of the checkpoints. Rosetta Moderator: Mod.Sense |
Heidi1 Send message Joined: 11 Aug 07 Posts: 49 Credit: 1,786,248 RAC: 0 |
This last time when I turned off the computer and turned it back on (& starting BOINC), I was paying attention to the timestamp of a few files located in the individual slots directories. Is <farlxcheck> the file that keeps the checkpointing? If that's it, then I can at least check that to get an idea of how much work I'll be losing when I shut down. Obviously, I can't change its data, and I don't want to, but I could monitor it at least. Now I'm probably way off base . . . :) Is there a way to request WUs with more frequent checkpoints, or is that automatically assigned? |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
You can TRY to tell the science application how seldom you want it to checkpoint. But, ultimately, the Science Application (read project) controls that ... |
Ingleside Send message Joined: 25 Sep 05 Posts: 107 Credit: 1,514,472 RAC: 0 |
"This last time when I turned off the computer and turned it back on (& starting BOINC), I was paying attention to the timestamp of a few files located in the individual slots directories. Is <farlxcheck> the file that keeps the checkpointing? If that's it, then I can at least check that to get an idea of how much work I'll be losing when I shut down. Obviously, I can't change its data, and I don't want to, but I could monitor it at least." To see when a task last checkpointed, it's easier to use BOINC's built-in logging options. Use Notepad or similar, and add a file called cc_config.xml (or edit the current one if already present) to your BOINC directory. In this file, include at least: <cc_config> <log_flags> <checkpoint_debug>1</checkpoint_debug> </log_flags> </cc_config> After saving the file, in BOINC Manager select "Advanced / Read config file", and in the Messages tab you'll now get a message each time a task has checkpointed. Example: 06.04.2008 14:21:54|rosetta@home|[checkpoint_debug] result bench80_rozilla_abrelax_natfrag_2ccvA_2986_44253_0 checkpointed If you want to disable checkpoint logging, my recommendation is to just edit cc_config.xml again so there is a 0 instead of a 1 in this line: <checkpoint_debug>0</checkpoint_debug> Afterwards, re-read the config file. "Is there a way to request WUs with more frequent checkpoints, or is that automatically assigned?" No. "I make so many mistakes. But then just think of all the mistakes I don't make, although I might." |
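If you copy the Messages tab into a text file, the checkpoint_debug lines can also be checked with a short script instead of by eye. The following is a minimal sketch that assumes every line looks exactly like the example message in the post above, with a day.month.year timestamp; the timestamp format may differ on other systems, and the function names `last_checkpoints` and `seconds_at_risk` are made up for illustration.

```python
import re
from datetime import datetime

# Assumed layout, taken from the single example line in the post:
# "DD.MM.YYYY HH:MM:SS|project|[checkpoint_debug] result <task name> checkpointed"
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2})\|(?P<project>[^|]*)\|"
    r"\[checkpoint_debug\] result (?P<task>\S+) checkpointed\s*$"
)


def last_checkpoints(lines):
    """Map each task name to the time of its most recent checkpoint message."""
    latest = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a checkpoint message, or a different log format
        when = datetime.strptime(match.group("stamp"), "%d.%m.%Y %H:%M:%S")
        task = match.group("task")
        if task not in latest or when > latest[task]:
            latest[task] = when
    return latest


def seconds_at_risk(lines, now):
    """Wall-clock seconds since each task's last logged checkpoint."""
    return {
        task: (now - when).total_seconds()
        for task, when in last_checkpoints(lines).items()
    }


log = [
    "06.04.2008 14:21:54|rosetta@home|[checkpoint_debug] result "
    "bench80_rozilla_abrelax_natfrag_2ccvA_2986_44253_0 checkpointed",
    "06.04.2008 14:25:00|rosetta@home|Some unrelated message",
]
for task, secs in seconds_at_risk(log, datetime(2008, 4, 6, 14, 31, 54)).items():
    print(f"{task}: {secs / 60:.0f} min since last checkpoint")
```

The figure is wall-clock time, so it is only an upper bound on the work that would be lost: if the task was suspended or sharing the CPU for part of that time, less work is actually at risk.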
FoldingSolutions Send message Joined: 2 Apr 06 Posts: 129 Credit: 3,506,690 RAC: 0 |
An easy solution to the checkpoint problem: just put your computer into hibernate rather than shutting it down. This way the work is preserved in the state it is in when you press the hibernate button, and there is no issue of possible data loss or corruption or continued power usage, as there is with standby. And if you need to shut your computer down, for whatever reason, then just try to make sure the workunit has checkpointed fairly recently by using Ingleside's method of logging checkpoints in the Messages tab :) HTH |
Heidi1 Send message Joined: 11 Aug 07 Posts: 49 Credit: 1,786,248 RAC: 0 |
Thanks, everyone, for the responses. I did the checkpoint debug file, and it works! Even if I can't actually force it to checkpoint (or wash my car :] ) I can at least look at the debug info to see how frequently it is checkpointing to see if I need to wait a bit longer before shutting down. Regarding hibernating, I wish that were an option. My computer that runs Rosetta doesn't seem to do hibernating. I'm going to attempt this manually, but it's not an option on my shut down menu. In looking at Windows Help, there are some computer manufacturers and components that don't support hibernation. Go figure . . . Between the longer file size that I'm trying out and the new debug file, I think that's about as much as I can get, and I thank you all for your help. You're supposed to learn something new every day, and I just learned mine. |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 1,696 |
There is a program called sacp by Christoph (shutdown at check-point) which I use to switch off some of my computers after a checkpoint rather than just hitting shutdown... The original thread is here: https://boinc.bakerlab.org/rosetta/forum_thread.php?id=3061&nowrap=true#39117 although the link doesn't appear to be working, so let me know if you want it emailing to you and I'll send it over. |
Heidi1 Send message Joined: 11 Aug 07 Posts: 49 Credit: 1,786,248 RAC: 0 |
Thanks anyway, but the checkpoint debug seems to be my best option. I'm crunching 2 WUs, which tend to checkpoint at different times (so far), so I'd rather just look that info up and shut down accordingly. Maybe someone else wants that program. |
FoldingSolutions Send message Joined: 2 Apr 06 Posts: 129 Credit: 3,506,690 RAC: 0 |
To hibernate, click start, then control panel, then performance and maintenance, then power options, then on the top tabs click "hibernate", then enable hibernation. Now a file which exactly corresponds to the size of your RAM will be created on C: drive. Though it is hidden therefore not easily found. But more importantly, when you click the shutdown tab, hold shift and the stand by option will turn into hibernate :) |
Heidi1 Send message Joined: 11 Aug 07 Posts: 49 Credit: 1,786,248 RAC: 0 |
"To hibernate, click start, then control panel, then performance and maintenance, then power options, then on the top tabs click "hibernate", then enable hibernation. Now a file which exactly corresponds to the size of your RAM will be created on C: drive. Though it is hidden therefore not easily found. But more importantly, when you click the shutdown tab, hold shift and the stand by option will turn into hibernate :)" Thanks for the tip! However, everything ran smoothly up to the point where I clicked on the button to hibernate, and then my computer wouldn't do it. I get the logoff-type screen saying "Preparing for hibernation", and then it goes straight back to my Windows desktop. The file was created on my C: drive. I read online that some graphics drivers don't allow hibernation, and mine must be one of them (Nvidia), as I can't get it to go. I've tried this two different times, and it's done this both times. I did try putting the WU on standby and then shutting down my computer. Nope. The WU still restarted at the last checkpoint. Oh well. On a side note, I have noticed that some WUs checkpoint as infrequently as every 2 hours! That is one model for that whole stretch of time! I think it was a FRA_ WU. |
©2025 University of Washington
https://www.bakerlab.org