Discussion on increasing the default run time

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0
Message 94753 - Posted: 18 Apr 2020, 10:39:00 UTC - in response to Message 94738.

Tasks finally completed. Target CPU Run time 8hrs

rb_03_31_20049_19874__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_904837_1472_0
Run time 15 hours 36 min 43 sec
CPU time 15 hours 29 min 42 sec
Validate state Valid
Credit 605.87

<core_client_version>7.6.33</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_x86_64.exe -run:protocol jd2_scripting @flags_rb_03_31_20049_19874__t000__3_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_03_31_20049_19874__t000__3_C1_robetta.zip -nstruct 10000 -cpu_run_time 57600 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3920361
Starting watchdog...
Watchdog active.
======================================================
DONE :: 13 starting structures 55782.3 cpu seconds
This process generated 13 decoys from 13 attempts
======================================================
BOINC :: WS_max 1.75866e+09
BOINC :: Watchdog shutting down...
19:06:24 (8204): called boinc_finish(0)
</stderr_txt>
]]>

rb_03_31_20031_19865__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_904757_832_0
Run time 15 hours 45 min 21 sec
CPU time 15 hours 40 min 23 sec
Validate state Valid
Credit 1,215.60

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_intelx86.exe -run:protocol jd2_scripting @flags_rb_03_31_20031_19865__t000__0_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_03_31_20031_19865__t000__0_C1_robetta.zip -nstruct 10000 -cpu_run_time 57600 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1163139
Starting watchdog...
Watchdog active.
======================================================
DONE :: 31 starting structures 56423.5 cpu seconds
This process generated 31 decoys from 31 attempts
======================================================
BOINC :: WS_max 9.19749e+08
BOINC :: Watchdog shutting down...
19:36:44 (4984): called boinc_finish(0)
</stderr_txt>
]]>

Grant
Darwin NT
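
A note on reading the logs above: the `-cpu_run_time` value appears to be given in seconds, so 57600 would correspond to 16 hours rather than the 8 hour target. The sketch below is only an illustration of that arithmetic; the helper names are made up for this example and the command string is a shortened excerpt of the one in the log.

```python
import re


def seconds_to_hms(seconds):
    """Convert a number of seconds to an (hours, minutes, seconds) tuple."""
    total = int(round(seconds))
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return hours, minutes, secs


def cpu_run_time_from_command(command):
    """Return the integer after -cpu_run_time, or None if the flag is absent."""
    match = re.search(r"-cpu_run_time\s+(\d+)", command)
    return int(match.group(1)) if match else None


# Shortened excerpt of the command line from the first task's stderr.
command = "rosetta_4.15_windows_x86_64.exe -nstruct 10000 -cpu_run_time 57600 -watchdog"

target_seconds = cpu_run_time_from_command(command)
print(seconds_to_hms(target_seconds))  # (16, 0, 0)
print(seconds_to_hms(55782.3))         # (15, 29, 42)
```

The second result lines up with the reported CPU time of 15 hours 29 min 42 sec for the first task, which suggests the task ran to roughly its 16 hour target instead of being cut short.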

rzlatic Joined: 20 Nov 07 Posts: 3 Credit: 327,897 RAC: 0
Message 94765 - Posted: 18 Apr 2020, 13:53:04 UTC - in response to Message 94697. Last modified: 18 Apr 2020, 13:56:55 UTC

> rzlatic, it appears you are seeing tasks running more than 4 hours past the runtime preference, then ended by the watchdog on a Linux system running the i686 application. Please see the discussion here.

indeed, the tasks generating the problem started with "12v1n" and were run by i686-pc-linux (as seen here: https://imgur.com/WaDeO14).

created a "cc_config.xml" config file in /var/lib/boinc/ with the suggested settings, restarted the boinc client, and there seem to be no i686-pc-linux (32-bit) tasks now. we'll see how it goes.

thanks, great community and support.

MeeeK Joined: 7 Feb 16 Posts: 31 Credit: 19,737,304 RAC: 0
Message 94766 - Posted: 18 Apr 2020, 14:20:17 UTC Last modified: 18 Apr 2020, 14:22:36 UTC

Hi,
Did something change in the last few days or weeks? I lost 7,000 points per day on average. Dropped from 34,000 a day to 27,000 and below. That's a huge gap. I changed nothing on my systems. They are running 24/7.

Bryn Mawr Joined: 26 Dec 18 Posts: 442 Credit: 15,697,820 RAC: 59
Message 94771 - Posted: 18 Apr 2020, 14:53:48 UTC - in response to Message 94766.

> Hi,
> Did something change in the last few days or weeks? I lost 7,000 points per day on average. Dropped from 34,000 a day to 27,000 and below. That's a huge gap. I changed nothing on my systems. They are running 24/7.

I can’t see that far back in your results log but you’re losing quite a few credits through missed deadlines - you might find it better to reduce your buffers to maybe 0.5 days + 0.5 days or even less.

Sid Celery Joined: 11 Feb 08 Posts: 2590 Credit: 47,220,881 RAC: 6
Message 94785 - Posted: 18 Apr 2020, 17:39:14 UTC - in response to Message 94771. Last modified: 18 Apr 2020, 17:40:53 UTC

> > Did something change in the last few days or weeks? I lost 7,000 points per day on average. Dropped from 34,000 a day to 27,000 and below. That's a huge gap. I changed nothing on my systems. They are running 24/7.
>
> I can’t see that far back in your results log but you’re losing quite a few credits through missed deadlines - you might find it better to reduce your buffers to maybe 0.5 days + 0.5 days or even less.

To be fair, cancelled tasks that have passed deadline are only deleted if they're unstarted - running tasks are allowed to continue, so no credit loss.

But it's certainly true that the buffer is too large to meet deadline so should be significantly cut back. Because even if the next host who receives them runs them within deadline, they'll be so long after the batch was released they'll be no good to the project. They're effectively being made useless at the point of download. And the runtime is unnecessarily low so the project isn't getting full value for them either tbh.

But I'd rather point to the credits awarded here. Since the new version, credits have been a mess. Anyone seeking credits here should know they're going to be disappointed. My point being, I don't think lower credits are indicative of a problem with the tasks themselves.

Bryn Mawr Joined: 26 Dec 18 Posts: 442 Credit: 15,697,820 RAC: 59
Message 94797 - Posted: 18 Apr 2020, 20:22:46 UTC - in response to Message 94785.

> > > Did something change in the last few days or weeks? I lost 7,000 points per day on average. Dropped from 34,000 a day to 27,000 and below. That's a huge gap. I changed nothing on my systems. They are running 24/7.
> >
> > I can’t see that far back in your results log but you’re losing quite a few credits through missed deadlines - you might find it better to reduce your buffers to maybe 0.5 days + 0.5 days or even less.
>
> To be fair, cancelled tasks that have passed deadline are only deleted if they're unstarted - running tasks are allowed to continue, so no credit loss.
>
> But it's certainly true that the buffer is too large to meet deadline so should be significantly cut back. Because even if the next host who receives them runs them within deadline, they'll be so long after the batch was released they'll be no good to the project. They're effectively being made useless at the point of download. And the runtime is unnecessarily low so the project isn't getting full value for them either tbh.
>
> But I'd rather point to the credits awarded here. Since the new version, credits have been a mess. Anyone seeking credits here should know they're going to be disappointed. My point being, I don't think lower credits are indicative of a problem with the tasks themselves.

As well as the hundreds of error tasks, there were quite a few invalid tasks that had run but been discarded and given no credits because they were received too late.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0
Message 94799 - Posted: 18 Apr 2020, 23:37:59 UTC

From the "Thank you" thread:

> To accommodate this, the "watchdog" timeout has been extended from the normal 4 hours to 10 hours.

A big change.

And given the gaps between checkpoints even on my fairly powerful system (40min and more) and reports from others of even longer periods of no checkpointing (on Windows systems which don't have the Linux i686 application issue), I would hope the programmers are going to look very hard into increasing the number of points where a Task can checkpoint.

Otherwise even powerful systems that run more than just Rosetta will struggle to complete a Task due to resource share settings switching between projects, and less powerful Rosetta-only systems will struggle to reach a checkpoint before there is a need for BOINC to suspend computation. And those that aren't on for long periods of time, or have heavy non-BOINC use while crunching, will have no chance of completing a Task if it has to start from the last checkpoint after an interruption.

With an 8 hour (now up to 18 hours) Runtime, losing 5min here or there isn't a big issue (annoying, but not a big issue). But to lose 40min, in many cases to lose 2hrs and more, if a Rosetta Task gets interrupted? (And unless you have a massive amount of RAM relative to the number of cores/threads, "Leave non-GPU tasks in memory while suspended" isn't an option. What is the default "Page/swap file: use at most %"?)

Yes, most Tasks won't run for the Target time + 10hrs. But those that do will drive people away if they spend hours doing work, only to lose it all & have to start again. Over & over again.

Grant
Darwin NT

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0
Message 94800 - Posted: 18 Apr 2020, 23:43:47 UTC - in response to Message 94738. Last modified: 18 Apr 2020, 23:44:46 UTC

@Grant, you appear to be one of the lucky few that received some of those WUs from two weeks ago when the default runtime was bumped to 16 hours. So the WU thought it was running normally. This explains why the watchdog did not step in.

As posted today by admin, the timeout used by the watchdog has been extended from the normal 4 hours to 10 hours. So, on new WUs, where the watchdog is set to 10 hours, a given WU may run 10 CPU hours beyond the runtime preference before the watchdog will step in and end the WU. This is because of the extremely challenging protein models under study now.

Rosetta Moderator: Mod.Sense
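
To put numbers on the watchdog change described above, here is a minimal sketch, assuming the watchdog simply ends a WU once its CPU time exceeds the runtime preference plus the watchdog timeout. The function name is hypothetical and this is not Rosetta's actual watchdog code.

```python
def latest_cpu_hours(runtime_preference_hours, watchdog_timeout_hours):
    """Upper bound on CPU hours before the watchdog would end the WU.

    Assumes the simple rule described in the thread:
    cutoff = runtime preference + watchdog timeout.
    """
    return runtime_preference_hours + watchdog_timeout_hours


# Compare the old 4 hour timeout with the new 10 hour timeout
# for the default 8 hour runtime preference.
for timeout in (4, 10):
    print(f"timeout {timeout}h -> cutoff at {latest_cpu_hours(8, timeout)}h")
```

With the default 8 hour preference this gives a cutoff of 12 hours under the old timeout and 18 hours under the new one, which matches the "now up to 18 hours" figure mentioned earlier in the thread.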

Admin Project administrator Joined: 1 Jul 05 Posts: 5146 Credit: 0 RAC: 0
Message 94803 - Posted: 18 Apr 2020, 23:54:18 UTC - in response to Message 94799.

Extending the watchdog has no impact on the checkpointing issue. These are separate issues. Extending the watchdog will prevent errors for jobs that have a longer than usual run time per model, such as a 2000 residue protein, for protocols that do not have checkpoints at appropriate intervals. Although somewhat rare, these types of jobs do exist from Robetta.

We should definitely try to address the checkpointing issue by adding more checkpoints to the various protocols but this will take development time.

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0
Message 94806 - Posted: 19 Apr 2020, 0:20:27 UTC - in response to Message 94803.

> Extending the watchdog has no impact on the checkpointing issue. These are separate issues. Extending the watchdog will prevent errors for jobs that have a longer than usual run time per model, such as a 2000 residue protein, for protocols that do not have checkpoints at appropriate intervals. Although somewhat rare, these types of jobs do exist from Robetta.
>
> We should definitely try to address the checkpointing issue by adding more checkpoints to the various protocols but this will take development time.

Thank you for the response.

Given the increase in the Watchdog time, I think that addressing the checkpoint issue should get bumped up the list of things that need to be done. As I said, losing a few minutes' work is annoying, but it's not a major issue. But to lose 2, 4, 8, 12, 16 hours of processing? That will drive people away.

If Rosetta is using a grace period on work not returned by the deadline, it may be worth considering increasing it to allow for the increased Watchdog time. (And please, if possible, see that Tasks aren't resent till the Grace period extension has passed, i.e. don't send them till they're needed. Reduce people's error count, reduce the network bandwidth used, reduce the Scheduler load. Not a big reduction, but every little bit counts as the number of systems crunching grows.)

Grant
Darwin NT

Sid Celery Joined: 11 Feb 08 Posts: 2590 Credit: 47,220,881 RAC: 6
Message 94814 - Posted: 19 Apr 2020, 1:54:42 UTC - in response to Message 94799.

> From the "Thank you" thread:
>
> > To accommodate this, the "watchdog" timeout has been extended from the normal 4 hours to 10 hours.
>
> A big change.
>
> And given the gaps between checkpoints even on my fairly powerful system (40min and more) and reports from others of even longer periods of no checkpointing (on Windows systems which don't have the Linux i686 application issue), I would hope the programmers are going to look very hard into increasing the number of points where a Task can checkpoint.
>
> Otherwise even powerful systems that run more than just Rosetta will struggle to complete a Task due to resource share settings switching between projects, and less powerful Rosetta-only systems will struggle to reach a checkpoint before there is a need for BOINC to suspend computation. And those that aren't on for long periods of time, or have heavy non-BOINC use while crunching, will have no chance of completing a Task if it has to start from the last checkpoint after an interruption.
>
> With an 8 hour (now up to 18 hours) Runtime, losing 5min here or there isn't a big issue (annoying, but not a big issue). But to lose 40min, in many cases to lose 2hrs and more, if a Rosetta Task gets interrupted? (And unless you have a massive amount of RAM relative to the number of cores/threads, "Leave non-GPU tasks in memory while suspended" isn't an option. What is the default "Page/swap file: use at most %"?)
>
> Yes, most Tasks won't run for the Target time + 10hrs. But those that do will drive people away if they spend hours doing work, only to lose it all & have to start again. Over & over again.

I think this is a strong point so +1. People who aren't running 24/7, or who are running multiple projects with project switching every default amount of time and don't hold tasks in memory, are going to have significant issues.

I do run 24/7 and I'd already changed my "switch between tasks every xx minutes" from the default 60 to 999, but even this might not be enough. Might be worth adding another 9.

MeeeK Joined: 7 Feb 16 Posts: 31 Credit: 19,737,304 RAC: 0
Message 94819 - Posted: 19 Apr 2020, 2:50:32 UTC - in response to Message 94797.

I have been running the same settings since I bought these CPUs: store a day plus an additional 2 days. Never had that many problems.

I keep 3 days of work as a reserve in case of problems with my ISP, so the computers can do their work offline until the internet connection is back.

There are 150 deadline tasks in my stats right now. Never have seen that many before. I think there was something wrong with too-short deadlines, but none of them had been started, so no lost work.

My two Ryzen 3600s dropped from 17,000 points each to approximately 12,000 and 13,000. That's not normal.

Will check the tasks later. Maybe I can find a problem.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0
Message 94822 - Posted: 19 Apr 2020, 3:03:20 UTC - in response to Message 94814.

You will correct me if I'm mistaken, but doesn't the BOINC Manager wait until a WU reaches a checkpoint before suspending it to work on another project? This change was made about a decade ago, because a task that hasn't reached a checkpoint, especially if tasks are not kept "in memory", will lose what it has been working on. True for all BOINC projects.

Picture a machine with 4 CPUs, running 3 BOINC projects: it could easily have 8 different WUs that have been worked on during the day. And even if WUs are kept "in memory" when suspended, you are going to lose progress on all 8 of those WUs when you turn off the machine.

So, I don't believe that the "switch between tasks every xx minutes" setting really has much effect anymore. In fact I thought it used to include the phrase "at most" every xx minutes.

Note: I tend to place "in memory" in quotes, because memory used by inactive threads is always swapped out if there is an active thread that needs memory. So long as you have a swap file of sufficient size, I highly recommend checking the box for "Leave non-GPU tasks in memory while suspended".

Rosetta Moderator: Mod.Sense

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0
Message 94823 - Posted: 19 Apr 2020, 3:07:06 UTC - in response to Message 94819.

> I have been running the same settings since I bought these CPUs: store a day plus an additional 2 days. Never had that many problems.

You have as many Tasks that are an Error because they missed the deadline as are in your Valid list. That is a problem.

Other
Store at least 1 days of work
Store up to an additional 0.02 days of work

Will result in more than a day's work until deadlines start to settle down. Then you could bump up "Store at least xx days of work" to 1.5 if you feel you need more, without having most of the work you get being Errors due to missed deadlines, which is what is happening at present.

Grant
Darwin NT
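
The buffer advice above can be checked with some rough arithmetic. The sketch below is a deliberately simplified model, not how the BOINC client actually schedules work: it assumes the last task in the buffer starts once the whole buffer has drained, that all tasks run for the same time, and that the deadline is the 3 days mentioned elsewhere in this thread. The function names are made up for this example.

```python
def last_task_finish_days(buffer_days, task_runtime_hours):
    """Days after download at which the last buffered task would finish."""
    return buffer_days + task_runtime_hours / 24.0


def buffer_meets_deadline(buffer_days, task_runtime_hours, deadline_days):
    """True if the last buffered task would still finish before the deadline."""
    return last_task_finish_days(buffer_days, task_runtime_hours) <= deadline_days


# 1 + 2 days of buffer, 8 hour tasks, 3 day deadline (assumed values).
print(buffer_meets_deadline(3.0, 8, 3))    # False
# Suggested 1 + 0.02 days of buffer with the same tasks and deadline.
print(buffer_meets_deadline(1.02, 8, 3))   # True
```

Under these assumptions a 3 day buffer can never be cleared inside a 3 day deadline, while the suggested 1.02 day buffer leaves well over a day of slack.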

MeeeK Joined: 7 Feb 16 Posts: 31 Credit: 19,737,304 RAC: 0
Message 94825 - Posted: 19 Apr 2020, 4:57:21 UTC - in response to Message 94823.

> > I have been running the same settings since I bought these CPUs: store a day plus an additional 2 days. Never had that many problems.
>
> You have as many Tasks that are an Error because they missed the deadline as are in your Valid list. That is a problem.
>
> Other
> Store at least 1 days of work
> Store up to an additional 0.02 days of work
>
> Will result in more than a day's work until deadlines start to settle down. Then you could bump up "Store at least xx days of work" to 1.5 if you feel you need more, without having most of the work you get being Errors due to missed deadlines, which is what is happening at present.

That isn't the problem at all. All the WUs canceled because of the deadline had not been started, so my CPUs didn't waste a second of work on them.

There have been 6 or 7 WUs that had an error while being worked on and were aborted. But these 6 or 7 tasks don't make me lose so many points.

But I just had an idea right this moment. I guess it's because of the way average points are calculated. I have 150 jobs finished with points, which would be my 34,000 points, BUT I also have 150 deadline tasks with 0 points. So it's 300 WUs with XXX thousand points on average. Don't know the exact numbers atm.

Did somebody else have problems with deadlines? I didn't read all the posts.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0
Message 94826 - Posted: 19 Apr 2020, 5:30:54 UTC - in response to Message 94825. Last modified: 19 Apr 2020, 5:31:32 UTC

> Did somebody else have problems with deadlines? I didn't read all the posts.

There have been several issues lately that have caused machines to load more work than can be processed within the 3 day deadlines that now rule the day.
- Default runtime was briefly changed from 8 hours to 16 hours.
- Some WUs failed quickly, causing BOINC Manager to believe new WUs might "complete" quickly as well.
- Deadlines used to be a mixture of 3 days and 8 days, but now all WUs seem to be getting 3 day deadlines.
- Some WUs are using more memory than was previously typical; some machine environments handle this better than others.
- Some WUs were causing Linux i686 to run for the runtime preference plus 4 hours before ending with no models produced. This may be the primary reason behind your RAC drop, but it would seem the WUs that might have caused that have already been purged.

Rosetta Moderator: Mod.Sense
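
The over-fetching effect described above (the client believing WUs will complete quickly, then requesting too many) can be illustrated with a small calculation. This is a rough sketch under stated assumptions, not BOINC's actual work-fetch estimator: it assumes the client sizes its request purely from an estimated per-task runtime, and the 3 hour estimate, 8 hour actual runtime and 3 day buffer are illustrative values only.

```python
def effective_buffer_days(requested_days, estimated_hours, actual_hours):
    """Days of real work downloaded when tasks run longer than estimated.

    The client fetches enough tasks to fill `requested_days` at
    `estimated_hours` each; if each task really takes `actual_hours`,
    the same tasks represent proportionally more work.
    """
    return requested_days * actual_hours / estimated_hours


# Illustrative values: a 3 day buffer, tasks estimated at 3 hours
# that actually run for 8 hours.
real_days = effective_buffer_days(3, 3, 8)
print(real_days)       # 8.0
print(real_days > 3)   # True: more work than fits in a 3 day deadline
```

In this simplified picture, a buffer that looks like 3 days to the client turns into about 8 days of real work, far more than a 3 day deadline allows.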

Grant (SSSF) Joined: 28 Mar 20 Posts: 1939 Credit: 18,534,891 RAC: 0
Message 94828 - Posted: 19 Apr 2020, 6:07:19 UTC - in response to Message 94825.

> That isn't the problem at all. All the WUs canceled because of the deadline had not been started, so my CPUs didn't waste a second of work on them.

But it is a problem. You did waste Rosetta's bandwidth & server resources in downloading them, having them sit there doing nothing for days, then having them time out & having to send them out again to another system that will process them.

The reason for the shorter deadlines is that the Project wants (needs) the results back sooner. If you are not going to process them, then why download them? If you set your cache to a more realistic value then everyone benefits.

> Did somebody else have problems with deadlines? I didn't read all the posts.

Many other people have problems with deadlines, but none as bad as yourself.

Grant
Darwin NT

MeeeK Joined: 7 Feb 16 Posts: 31 Credit: 19,737,304 RAC: 0
Message 94829 - Posted: 19 Apr 2020, 6:49:56 UTC - in response to Message 94826.

> > Did somebody else have problems with deadlines? I didn't read all the posts.
>
> There have been several issues lately that have caused machines to load more work than can be processed within the 3 day deadlines that now rule the day.
> ...
> Some WUs failed quickly, causing BOINC Manager to believe new WUs might "complete" quickly as well.
> Deadlines used to be a mixture of 3 days and 8 days, but now all WUs seem to be getting 3 day deadlines.
> ...
> Some WUs were causing Linux i686 to run for the runtime preference plus 4 hours before ending with no models produced. This may be the primary reason behind your RAC drop, but it would seem the WUs that might have caused that have already been purged.

I think that might have caused my "problems". I have now changed the settings to 2 days and the WU runtime to 6 hours. Will check over the next days if there is a change.

Do you think I should upgrade RAM to 32 because of higher usage? I have two Ryzen 5 3600 12 Core with 16GB each. That has always been enough so far.

Admin Project administrator Joined: 1 Jul 05 Posts: 5146 Credit: 0 RAC: 0
Message 94830 - Posted: 19 Apr 2020, 6:55:07 UTC

Another recent issue has been that a batch of jobs has been finishing earlier than expected, i.e. ~3 hours. We've hopefully addressed this issue for future batches of similar runs (cyclic peptide jobs).

MeeeK Joined: 7 Feb 16 Posts: 31 Credit: 19,737,304 RAC: 0
Message 94831 - Posted: 19 Apr 2020, 7:00:39 UTC - in response to Message 94830.

I think because of that, my clients downloaded too many WUs. The clients are downstairs in the basement. I don't watch them every day, so I didn't notice that.