Message boards : Number crunching : No checkpoint in more than 1 hour - Largescale_large_fullatom...
Author | Message |
---|---|
Aglarond Joined: 29 Jan 06 Posts: 26 Credit: 446,212 RAC: 0 |
Hi, I just had to reboot my computer (Centrino 1.5 GHz, 1 GB RAM). R@H was showing around 1.5% progress after running a little more than 1 hour, and it was still on its first model. After the reboot it started from the beginning. That means there was no checkpoint in more than 1 hour. The standard settings in BOINC are to switch projects every 60 minutes and not to leave them in memory. I think people with several projects and standard settings will never finish a single model. Is that so? I think there should at least be a warning on the front page that this project checkpoints rarely. (Something similar to what CPDN Seasonal has.) Regards, Aglarond |
[DPC]Charley Joined: 18 Mar 06 Posts: 9 Credit: 295,915 RAC: 0 |
Yes, you're right. Rosetta only saves after each completed model. With these large models, you have to complete them in one go or start over from the beginning whenever the project is unloaded (switching to another project for a little while, rebooting your computer, you get the idea). |
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
"Yes, you're right." OK, I'm not sure I get it. Is the answer to leave these in memory? Change settings to switch projects less frequently? Only do Rosey on dedicated (one-project) machines? Abort these WUs? Any advice would be appreciated. -Sid Proudly crunching with TeAm Anandtech |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
There was mention of the new large WUs taking over 4 hours to finish a model on some machines. The application switch time needs to be set for more than the time it takes to finish a model/decoy on your machine. Or the keep-in-memory flag needs to be turned on. |
nairb Joined: 8 Dec 05 Posts: 17 Credit: 990,147 RAC: 0 |
I have one dedicated machine for Rosy. I have seen it take nearly 9 hrs to complete one of the largescale WUs. Mostly they take about 4 hrs, but some take more or less. It's a bit of a pain if you run more than one project on a machine and don't change the switch-project setting to be longer than the WU completion time. |
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
Thanks for the quick replies! So do I need to have my switch time > the time for the entire WU to complete? Isn't there any checkpointing? -Sid Proudly crunching with TeAm Anandtech |
BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
Rosetta doesn't checkpoint until after it's done with a decoy/model. And if it takes up to 9 hours to complete a single decoy/model, then you won't get checkpointed until that's done. Perhaps the keep in memory flag is the way to go, in case we get more of these really large WUs in the future. (Nairb - what are the math specs on that system that took 9 hours?) |
DigiK-oz Joined: 8 Nov 05 Posts: 13 Credit: 333,730 RAC: 0 |
This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely! |
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
"This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!" It does sound like a rather stringent requirement. Is this going to be the norm from here on out? Proudly crunching with TeAm Anandtech |
nairb Joined: 8 Dec 05 Posts: 17 Credit: 990,147 RAC: 0 |
"(Nairb - what are the math specs on that system that took 9 hours?)" The WU was this one: https://boinc.bakerlab.org/rosetta/result.php?resultid=16830240 The machine is a dual-CPU 1 GHz Coppermine with 700+ MB of RAM. It spent 9.3 hrs at 1.6% or so. It very nearly got the abort option... except I forgot about it and it worked OK. Not the fastest bit of kit, but it's very stable (Win NT4). |
DigiK-oz Joined: 8 Nov 05 Posts: 13 Credit: 333,730 RAC: 0 |
"This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!" Well, NOT having checkpoints, memory images, or whatever will adversely affect the entire project, as the small home-crunchers are getting fed up with not getting any results in because the same WU restarts from scratch over and over again. Maybe there is a way to hand out these large WUs only to computers with a high RAC? Or only to people who have their run-time preferences set to 8 hours or something similar? |
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
"This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!" I'm in a fortunate position in that I can flex to requirements that impact how my PCs are operated, but I am concerned that the majority of your participants simply cannot ensure that a computer will never be rebooted, or that it will be available to crunch nothing but Rosetta 24/7/365. I understand the perspective of the project developers, but I think the project simply MUST acknowledge that the vast majority of its participants consider DC to be a side endeavor. If your project impacts the use the PC was really bought to serve... your membership will suffer. A thread asking for suggestions to increase project participation was posted some time ago. I believe the single most important answer is... your project must be hands-off, with utter transparency to the work the PC was procured to do in the first place. I think I speak for many when I say I like Rosetta and the work it hopes to accomplish, but it is incorrect to expect PC users to arrange their PCs to crunch your project... it must be the other way around. Rosetta must arrange itself to fit within the resources that are available. Respectfully, -Sid Proudly crunching with TeAm Anandtech |
Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0 |
I totally agree with you, Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this was a mistake and have had those largescale jobs canceled from the BOINC server. If your computer still has largescale jobs running or queued, please abort them - the new short jobs are waiting to be sent! In the future we will be extremely cautious not to let this happen again - we are coding and testing some solutions to make the program checkpoint more often. It is our highest priority at this moment to make sure that every minute of your precious computer-on time spent on the Rosetta@home project can contribute to the scientific goals we are trying to achieve together. I should add that users who crunch Rosetta 24/7 or have "leave in memory" on can choose to let the largescale jobs currently on their computers keep running. These results are still of great interest to us! "I'm in a fortunate position in that I can flex to requirements that impact how my PCs are operated, but I am concerned that the majority of your participants simply cannot ensure that a computer will never be rebooted, or that it will be available to crunch nothing but Rosetta 24/7/365." |
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
"I totally agree with you, Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this was a mistake and have had those largescale jobs canceled from the BOINC server. If your computer still has largescale jobs running or queued, please abort them - the new short jobs are waiting to be sent!" Thanks, Bin! Knowing the project IS sensitive to this kind of concern and IS making an effort to shape this project in a productive way is plenty to keep me supporting such an interesting and valuable research endeavor! -Sid Proudly crunching with TeAm Anandtech |
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
"I totally agree with you, Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this was a mistake and have had those largescale jobs canceled from the BOINC server. If your computer still has largescale jobs running or queued, please abort them - the new short jobs are waiting to be sent!" I didn't think you were dismissing the concerns. Actually, I saw your post as an attempt to help shed light. It was appreciated, as was Bin's. Please believe me when I tell you, I am much more the fan than the critic. -Sid Proudly crunching with TeAm Anandtech |
Buffalo Bill Joined: 25 Mar 06 Posts: 71 Credit: 1,630,458 RAC: 0 |
"This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!" You already have a separate test project (Ralph), so if there's no good solution to interrupting a big model, maybe you could start a new "RosettaExtreme" project just for those of us who would be happy to take care of those big proteins for you. A big-WUs-only project. Bill |
Fuzzy Hollynoodles Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
"... maybe you could start a new 'RosettaExtreme' project just for those of us who would be happy to take care of those big proteins for you. Big WUs only." RosettaExtreme?! Hmmmm.... sounds interesting. :-D "I'm trying to maintain a shred of dignity in this world." - Me |
Shoikan Joined: 4 Apr 06 Posts: 14 Credit: 180,211 RAC: 0 |
"Thanks for the quick replies!" This issue has to be addressed ASAP. Many cycles go directly to the trash can because of this. An improved checkpointing system should be the #1 priority on the to-do list of Rosie's development team. Regards. |
Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0 |
"Thanks for the quick replies!" Bin Qian already addressed this above (we all agree on this!). Proudly crunching with TeAm Anandtech |
River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0 |
"You already have a separate test project (Ralph), so if there's no good solution to interrupting a big model, maybe you could start a new 'RosettaExtreme' project just for those of us who would be happy to take care of those big proteins for you." If you go for the separate project, then at times when there are no extreme WUs to run, you could always deliver ordinary Rosetta work to the Rex users to keep the project share in use. Ralph was a shortening of Rosetta alpha, a nickname suggested by another Bill, in fact. Maybe Rosetta extreme could be Rex? Another way to do this would be to implement some form of user preference flag, off by default, which, when set manually by the user, would make them liable to receive work that is Extreme in some way. The project team could ask for volunteers to set the FullOn flag if they wanted to opt in. One method needs more coding; the other needs a separate URL and some work at sysadmin level on one or more servers. The Bakerlab folk will know which of these is easier to deliver. I'd suggest either would be a good solution from the user's point of view. Your users already have experience of choosing the development project (Ralph) and of customising the run length of their work, and my impression is that both forms of user control have been well received. I am running down my CPDN participation due to their refusal to implement any form of user-specified control over the size of work that is issued. That is despite the fact that I personally feel their science to be very important. It is good that you acted to protect some users, but sad that you lose out on some interesting work by doing so. I feel sure that increased user control is the way forward. It won't attract any more people, but it will help you to keep the ones you've already got. River~~ |
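
For anyone who wants to check the arithmetic from earlier in the thread (the switch time has to exceed the model time unless the task stays in memory), here is a rough sketch in Python. It is not BOINC or Rosetta code. The function name `slices_to_finish_model` and all of its parameters are made up for illustration, and it assumes a very simple model: the task runs for one switch interval at a time, and at each switch it is either kept in memory, rolled back to its last checkpoint, or restarted from zero.

```python
from typing import Optional


def slices_to_finish_model(model_minutes: float,
                           switch_minutes: float,
                           leave_in_memory: bool = False,
                           checkpoint_minutes: Optional[float] = None) -> Optional[int]:
    """Count the run slices needed to finish one model.

    Hypothetical toy model, not BOINC's scheduler. Returns None when the
    model can never finish because no progress survives a project switch.
    """
    if switch_minutes <= 0 or model_minutes <= 0:
        raise ValueError("times must be positive")

    saved = 0.0   # minutes of progress that survive a switch
    slices = 0
    while True:
        slices += 1
        progress = saved + switch_minutes
        if progress >= model_minutes:
            return slices  # the model completed within this slice

        if leave_in_memory:
            # Task is only suspended, so nothing is lost at the switch.
            new_saved = progress
        elif checkpoint_minutes:
            # Task is unloaded: roll back to the last checkpoint reached.
            new_saved = (progress // checkpoint_minutes) * checkpoint_minutes
        else:
            # No checkpoint inside a model: everything is lost.
            new_saved = 0.0

        if new_saved <= saved:
            return None  # stuck: every slice ends where it started
        saved = new_saved


# A 4-hour model with the 60-minute switch setting described above.
print(slices_to_finish_model(240, 60))                          # None
print(slices_to_finish_model(240, 60, leave_in_memory=True))    # 4
print(slices_to_finish_model(240, 60, checkpoint_minutes=25))   # 5
```

Under these assumptions the default case never completes, which matches what Aglarond saw. Note that "leave in memory" only helps across project switches; a reboot still throws away everything since the last checkpoint, which is why more frequent checkpoints are the more robust fix.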
©2025 University of Washington
https://www.bakerlab.org