No checkpoint in more than 1 hour - Largescale_large

Author	Message
Aglarond Joined: 29 Jan 06 Posts: 26 Credit: 446,212 RAC: 0	Message 13834 - Posted: 15 Apr 2006, 13:57:49 UTC Hi, I just had to reboot my computer (Centrino 1.5, 1G RAM). R@H was showing around 1.5% progress and had been running for a little more than 1 hour. It was working on the first model. After the reboot it started from the beginning, which means there was no checkpoint in more than 1 hour. The standard settings in BOINC are to switch projects after 60 minutes and not to leave them in memory. I think people with several projects and standard settings will never finish one model. Is that so? I think there should at least be a warning on the front page that this project checkpoints rarely. (Something similar to what CPDN Seasonal has.) Regards, Aglarond ID: 13834 · Rating: 0

[DPC]Charley Joined: 18 Mar 06 Posts: 9 Credit: 295,915 RAC: 0	Message 13836 - Posted: 15 Apr 2006, 15:37:39 UTC Yes, you're right. Rosetta will only save after every completed model. With these large models, you have to complete them in one go or start over from the beginning when you unload the project (switch to another project for a little time, reboot your computer, you get the idea). ID: 13836 · Rating: 0

Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0	Message 13867 - Posted: 16 Apr 2006, 1:18:40 UTC - in response to Message 13836. "Yes, you're right. Rosetta will only save after every completed model. With these large models, you have to complete them in one go or start over from the beginning when you unload the project (switch to another project for a little time, reboot your computer, you get the idea)." OK, I'm not sure I get it. Is the answer to leave these in memory? Change settings to switch projects less frequently? Only do Rosey on dedicated (one-project) machines? Abort these WUs? Any advice would be appreciated. -Sid Proudly crunching with TeAm Anandtech ID: 13867 · Rating: 0

BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0	Message 13869 - Posted: 16 Apr 2006, 1:27:56 UTC There was mention of the new large WUs taking over 4 hours to finish a model on some machines. The application switch time needs to be set for more than the time it takes to finish a model/decoy on your machine. Or the keep-in-memory flag needs to be turned on. ID: 13869 · Rating: 0

nairb Joined: 8 Dec 05 Posts: 17 Credit: 990,147 RAC: 0	Message 13871 - Posted: 16 Apr 2006, 1:35:30 UTC I have one dedicated machine for rosy. I have seen it take nearly 9 hrs to complete one of the largescale WUs. Mostly they take about 4 hrs, but some take more or less. It is a bit of a pain if you run more than one project on a machine and don't change the switch-project setting to be longer than the WU completion time. ID: 13871 · Rating: 0

Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0	Message 13873 - Posted: 16 Apr 2006, 1:58:37 UTC Thanks for the quick replies! So do I need to have my switch time > time for the entire WU to complete? Isn't there any checkpointing? -Sid Proudly crunching with TeAm Anandtech ID: 13873 · Rating: 0

BennyRop Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0	Message 13878 - Posted: 16 Apr 2006, 2:48:19 UTC Rosetta doesn't checkpoint until after it's done with a decoy/model. And if it takes up to 9 hours to complete a single decoy/model, then you won't get checkpointed until that's done. Perhaps the keep in memory flag is the way to go, in case we get more of these really large WUs in the future. (Nairb - what are the math specs on that system that took 9 hours?) ID: 13878 · Rating: 0
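
The rule described in the posts above can be put into a small sketch. It is only an illustration: the function name and its parameters are made up here and are not part of BOINC or Rosetta, and the model assumed is simply that progress on a model survives only if the model completes or if the application stays in memory between project switches.

```python
def model_can_finish(model_hours, switch_hours, leave_in_memory,
                     uptime_hours, rosetta_share=1.0):
    """Return True if one model can reach its end-of-model checkpoint.

    Illustrative assumptions, not actual BOINC behaviour:
    - the only checkpoint is written when a model completes;
    - a project switch without "leave in memory" discards unsaved progress;
    - a reboot or BOINC shutdown always discards unsaved progress, so
      uptime_hours is the longest stretch the machine stays on;
    - rosetta_share is the fraction of CPU time given to Rosetta
      (1.0 means a dedicated, single-project machine).
    """
    if leave_in_memory or rosetta_share >= 1.0:
        # Progress accumulates across switches (or there are no switches),
        # so all of Rosetta's share of the uptime counts.
        usable_hours = uptime_hours * rosetta_share
    else:
        # Every switch throws the model away, so only one uninterrupted
        # time slice can be used.
        usable_hours = min(switch_hours, uptime_hours)
    return usable_hours >= model_hours


# Two projects, 60-minute switching, not left in memory, 4-hour model:
print(model_can_finish(4, 1, False, uptime_hours=24, rosetta_share=0.5))
# Same machine with "leave in memory" turned on:
print(model_can_finish(4, 1, True, uptime_hours=24, rosetta_share=0.5))
```

Under these assumptions the first call reports that the model never finishes and the second reports that it does, which matches the advice given in the thread: either raise the switch time above the model time or keep the application in memory.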

DigiK-oz Joined: 8 Nov 05 Posts: 13 Credit: 333,730 RAC: 0	Message 13882 - Posted: 16 Apr 2006, 8:17:59 UTC This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely! ID: 13882 · Rating: 0

Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0	Message 13887 - Posted: 16 Apr 2006, 12:05:29 UTC - in response to Message 13882. "This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!" It does sound like a rather stringent requirement. Is this going to be the norm from here on out? Proudly crunching with TeAm Anandtech ID: 13887 · Rating: 0

nairb Joined: 8 Dec 05 Posts: 17 Credit: 990,147 RAC: 0	Message 13891 - Posted: 16 Apr 2006, 12:50:37 UTC - in response to Message 13878. Last modified: 16 Apr 2006, 12:51:14 UTC "(Nairb - what are the math specs on that system that took 9 hours?)" The WU was this one: https://boinc.bakerlab.org/rosetta/result.php?resultid=16830240 The machine is a dual-CPU 1 GHz Coppermine with 700+ RAM. It spent 9.3 hrs at 1.6% or so. It very nearly got the abort option... except I forgot about it and it worked OK. Not the fastest bit of kit, but it's very stable (Win NT4). ID: 13891 · Rating: 0

DigiK-oz Joined: 8 Nov 05 Posts: 13 Credit: 333,730 RAC: 0	Message 13895 - Posted: 16 Apr 2006, 15:15:30 UTC - in response to Message 13894. "This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!" "It does sound like a rather stringent requirement. Is this going to be the norm from here on out?" "The plan is to try to have it checkpoint more often, or at least try to dump everything to a memory image at program swaps. But there is a lot of data involved, and interrupting the model affects the model outcome adversely." Well, NOT having checkpoints, memory images or whatever will adversely affect the entire project, as the small home-crunchers are getting fed up with not getting any results in because the same WU restarts from scratch over and over again. Maybe there is a way to hand out these large WUs only to computers having a high RAC? Or only to people who have their run-time preferences set to 8 hours or something similar? ID: 13895 · Rating: 0

Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0	Message 13896 - Posted: 16 Apr 2006, 15:17:12 UTC - in response to Message 13894. "This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!" "It does sound like a rather stringent requirement. Is this going to be the norm from here on out?" "The plan is to try to have it checkpoint more often, or at least try to dump everything to a memory image at program swaps. But there is a lot of data involved, and interrupting the model affects the model outcome adversely." I'm in the fortunate position that I can flex to requirements that impact how my PCs are operated, but I am concerned that the majority of your participants simply cannot guarantee that a computer will not be rebooted, or that it will be available to crunch nothing but Rosetta 24/7/365. I understand the perspective of the project developers, but I think the project simply MUST acknowledge that the vast majority of its participants consider DC to be a side endeavor. If your project impacts the use the PC was really bought to serve... your membership will suffer. A thread asking for suggestions to increase project participation was posted some time ago. I believe the single most important answer is... your project must be hands-off, with utter transparency to the work the PC was procured to do in the first place. I think I speak for many when I say I like Rosetta and the work it hopes to accomplish, but it is incorrect to expect PC users to arrange their PC to crunch your project... it must be the other way around. Rosetta must arrange itself to fit within the resources that are available. Respectfully, -Sid Proudly crunching with TeAm Anandtech ID: 13896 · Rating: 0

Bin Qian Joined: 13 Jul 05 Posts: 33 Credit: 36,897 RAC: 0	Message 13897 - Posted: 16 Apr 2006, 16:25:30 UTC - in response to Message 13896. I totally agree with you, Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this is a mistake and have had those largescale jobs canceled from the BOINC server. If your computer still has the largescale jobs running or queued, please abort them - the new short jobs are waiting to be sent! In the future we will be extremely cautious not to let this happen again - we are coding and testing some solutions to make the program checkpoint more often. It is our highest priority at this moment to make sure that every minute of your precious computer-on time spent on the Rosetta@home project can contribute to the scientific goals we are trying to achieve together. I should add that users who crunch Rosetta 24/7 or have "leave in memory" on can choose to let the largescale jobs currently in their computers keep running. These results are still of great interest to us! "I'm in the fortunate position that I can flex to requirements that impact how my PCs are operated, but I am concerned that the majority of your participants simply cannot guarantee that a computer will not be rebooted, or that it will be available to crunch nothing but Rosetta 24/7/365. I understand the perspective of the project developers, but I think the project simply MUST acknowledge that the vast majority of its participants consider DC to be a side endeavor. If your project impacts the use the PC was really bought to serve... your membership will suffer. A thread asking for suggestions to increase project participation was posted some time ago. I believe the single most important answer is... your project must be hands-off, with utter transparency to the work the PC was procured to do in the first place.
I think I speak for many when I say I like Rosetta and the work it hopes to accomplish, but it is incorrect to expect PC users to arrange their PC to crunch your project... it must be the other way around. Rosetta must arrange itself to fit within the resources that are available. Respectfully, -Sid" ID: 13897 · Rating: 1
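
As a rough illustration of what "checkpoint more often" can mean in general, here is a minimal sketch of periodic checkpointing inside one long computation. It is not Rosetta's code and says nothing about how the project actually implemented it; the function names, the pickle file format, and the `interrupt_at` parameter (used only to simulate a reboot) are all assumptions made for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # a crash never leaves a half-written file


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return None


def run_model(total_steps, path, checkpoint_every=100, interrupt_at=None):
    """Run a stand-in 'model', saving state every checkpoint_every steps."""
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            return None  # simulated reboot: anything unsaved is lost
        state["total"] += state["step"]  # placeholder for the real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]


workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "model.ckpt")
run_model(1000, ckpt, interrupt_at=250)   # "reboot" partway through
print(load_checkpoint(ckpt)["step"])      # last saved step
print(run_model(1000, ckpt))              # resumes instead of restarting
```

With a checkpoint every 100 steps, the simulated reboot at step 250 costs only the 50 steps since the last save, rather than the whole model.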

Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0	Message 13902 - Posted: 16 Apr 2006, 17:42:08 UTC - in response to Message 13897. "I totally agree with you, Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this is a mistake and have had those largescale jobs canceled from the BOINC server. If your computer still has the largescale jobs running or queued, please abort them - the new short jobs are waiting to be sent! In the future we will be extremely cautious not to let this happen again - we are coding and testing some solutions to make the program checkpoint more often. It is our highest priority at this moment to make sure that every minute of your precious computer-on time spent on the Rosetta@home project can contribute to the scientific goals we are trying to achieve together. I should add that users who crunch Rosetta 24/7 or have 'leave in memory' on can choose to let the largescale jobs currently in their computers keep running. These results are still of great interest to us!" Thanks Bin! Knowing the project IS sensitive to this kind of concern and IS making an effort to shape this project in a productive way is plenty to keep me supporting such an interesting and valuable research endeavor! -Sid Proudly crunching with TeAm Anandtech ID: 13902 · Rating: 0

Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0	Message 13906 - Posted: 16 Apr 2006, 20:35:39 UTC - in response to Message 13904. "I totally agree with you, Sid. It is not acceptable to have a job running for hours without a checkpoint. We realized that this is a mistake and have had those largescale jobs canceled from the BOINC server. If your computer still has the largescale jobs running or queued, please abort them - the new short jobs are waiting to be sent! In the future we will be extremely cautious not to let this happen again - we are coding and testing some solutions to make the program checkpoint more often. It is our highest priority at this moment to make sure that every minute of your precious computer-on time spent on the Rosetta@home project can contribute to the scientific goals we are trying to achieve together. I should add that users who crunch Rosetta 24/7 or have 'leave in memory' on can choose to let the largescale jobs currently in their computers keep running. These results are still of great interest to us!" "Thanks Bin! Knowing the project IS sensitive to this kind of concern and IS making an effort to shape this project in a productive way is plenty to keep me supporting such an interesting and valuable research endeavor! -Sid" "I get the impression from your response that you have misinterpreted the information I tried to provide, to imply that I had in some way said the project does not think this is important. Nothing could be further from the truth. The original question was why the Work Unit ran so long without a checkpoint (see thread title). I never said this was an issue the project was ignoring, or not trying to fix. I just tried to explain WHY it was working the way it was." I didn't think you were dismissing the concerns. Actually, I saw your post as an attempt to help shed light. It was appreciated, as was Bin's. Please believe me when I tell you, I am much more the fan than the critic.
-Sid Proudly crunching with TeAm Anandtech ID: 13906 · Rating: 0

Buffalo Bill Joined: 25 Mar 06 Posts: 71 Credit: 1,630,458 RAC: 0	Message 13912 - Posted: 16 Apr 2006, 21:31:40 UTC - in response to Message 13894. Last modified: 16 Apr 2006, 21:38:02 UTC "This is ridiculous. A simple home-cruncher, leaving his/her PC on for only a few hours per day, could get stuck on one of those WUs indefinitely!" "It does sound like a rather stringent requirement. Is this going to be the norm from here on out?" "The plan is to try to have it checkpoint more often, or at least try to dump everything to a memory image at program swaps. But there is a lot of data involved, and interrupting the model affects the model outcome adversely." You already have a separate test project (Ralph), so if there's no good solution to interrupting a big model, maybe you could start a new "RosettaExtreme" project just for those of us who would be happy to take care of those big proteins for you. A big-WUs-only project. Bill ID: 13912 · Rating: 0

Fuzzy Hollynoodles Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0	Message 13914 - Posted: 16 Apr 2006, 21:38:02 UTC - in response to Message 13912. "... maybe you could start a new 'RosettaExtreme' project just for those of us who would be happy to take care of those big proteins for you. Big WU's only. Bill" RosettaExtreme?! Hmmmm.... sounds interesting. :-D "I'm trying to maintain a shred of dignity in this world." - Me ID: 13914 · Rating: 0

Shoikan Joined: 4 Apr 06 Posts: 14 Credit: 180,211 RAC: 0	Message 13944 - Posted: 17 Apr 2006, 9:47:18 UTC - in response to Message 13877. "Thanks for the quick replies! So do I need to have my switch time > time for the entire WU to complete? Isn't there any checkpointing? -Sid" "ALL Work Units will checkpoint at the completion of a model. For some Work Units this means every 5 minutes; for larger ones this could mean five or six hours. Also, ALL Work Units will complete AT LEAST one model no matter how you set your user-selectable time setting. The BEST answer, if you can do it, is to set your preferences to keep the application in memory during a swap. You could try to set the swap time to 4+ hours, but there is no guarantee that that will make it to a checkpoint. It depends on the size of the protein. Also keep in mind that 'keep in memory' only works if you do not turn your machine off, or stop BOINC for some reason, as these actions would also remove the application from memory." This issue has to be addressed ASAP. Many cycles go directly to the trash can because of this. An improved checkpointing system should be the #1 priority on the to-do list of the Rosie development team. Regards. ID: 13944 · Rating: -1

Insidious Joined: 10 Nov 05 Posts: 49 Credit: 604,937 RAC: 0	Message 13950 - Posted: 17 Apr 2006, 14:20:19 UTC - in response to Message 13944. "Thanks for the quick replies! So do I need to have my switch time > time for the entire WU to complete? Isn't there any checkpointing? -Sid" "ALL Work Units will checkpoint at the completion of a model. For some Work Units this means every 5 minutes; for larger ones this could mean five or six hours. Also, ALL Work Units will complete AT LEAST one model no matter how you set your user-selectable time setting. The BEST answer, if you can do it, is to set your preferences to keep the application in memory during a swap. You could try to set the swap time to 4+ hours, but there is no guarantee that that will make it to a checkpoint. It depends on the size of the protein. Also keep in mind that 'keep in memory' only works if you do not turn your machine off, or stop BOINC for some reason, as these actions would also remove the application from memory." "This issue has to be addressed ASAP. Many cycles go directly to the trash can because of this. An improved checkpointing system should be the #1 priority on the to-do list of the Rosie development team. Regards." Bin Qian addressed this already above (we all agree on this!) Proudly crunching with TeAm Anandtech ID: 13950 · Rating: 0

River~~ Joined: 15 Dec 05 Posts: 761 Credit: 285,578 RAC: 0	Message 14022 - Posted: 18 Apr 2006, 9:10:31 UTC - in response to Message 13912. Last modified: 18 Apr 2006, 9:27:24 UTC "You already have a separate test project (Ralph), so if there's no good solution to interrupting a big model, maybe you could start a new 'RosettaExtreme' project just for those of us who would be happy to take care of those big proteins for you." If you go for the separate project, then in the times when there are no extreme WUs to run, you could always deliver ordinary Rosetta work to the Rex users to keep the project share in use. Ralph was a shortening of Rosetta alpha, a nickname suggested by another Bill, in fact. Maybe Rosetta extreme could be Rex? Another way to do this would be to implement some form of user preference flag, off by default, which, when set manually by the user, would make them liable to receive work that is Extreme in some way. The course team could ask for volunteers to set the FullOn flag if they wanted to opt in. One method needs more coding; the other needs a separate URL and some work at sysadmin level on one or more servers. The Bakerlab folk will know which of these is easier to deliver. I'd suggest either would be a good solution from the user's point of view. Your users already have experience of choosing the development project (Ralph) and of customising the run length of their work, and my impression is that both forms of user control have been well received. I am running down my CPDN participation due to their refusal to implement any form of user-specified control over the size of work that is issued. That is despite the fact that I personally feel their science to be very important. It is good that you acted to protect some users, but sad that you lose out on some interesting work by doing so. I feel sure that increased user control is the way forward. It won't attract any more people, but it will help you to keep the ones you've already got.
River~~ ID: 14022 · Rating: 0
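
The opt-in flag idea from the post above could look roughly like the sketch below. Everything in it is hypothetical: `HostPrefs`, `WorkUnit`, `accepts_extreme` and `pick_work` are names invented for this example and do not correspond to any real BOINC scheduler setting or API. It only shows the behaviour described: the flag is off by default, opted-in hosts get extreme work first, and they fall back to ordinary work when no extreme work is queued.

```python
from dataclasses import dataclass


@dataclass
class HostPrefs:
    # Hypothetical user preference; off unless the user sets it manually.
    accepts_extreme: bool = False


@dataclass
class WorkUnit:
    name: str
    extreme: bool  # True for work with very long gaps between checkpoints


def pick_work(prefs, queue):
    """Return the next work unit this host may receive, or None."""
    if prefs.accepts_extreme:
        # Opted-in hosts are offered extreme work first.
        for wu in queue:
            if wu.extreme:
                return wu
    # Everyone (including opted-in hosts with no extreme work queued)
    # can receive ordinary work.
    for wu in queue:
        if not wu.extreme:
            return wu
    return None


queue = [WorkUnit("short_job", False), WorkUnit("largescale_job", True)]
print(pick_work(HostPrefs(), queue).name)                      # short_job
print(pick_work(HostPrefs(accepts_extreme=True), queue).name)  # largescale_job
```

A host that has not opted in is never handed the long-running work, which is the protection the thread asks for, while volunteers with dedicated or always-on machines can still take it.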
