Discussion on increasing the default run time

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,760,053
RAC: 22,881
Message 94753 - Posted: 18 Apr 2020, 10:39:00 UTC - in response to Message 94738.  

Tasks finally completed.
Target CPU Run time 8hrs
rb_03_31_20049_19874__t000__3_C1_SAVE_ALL_OUT_IGNORE_THE_REST_904837_1472_0
      Run time 15 hours 36 min 43 sec
      CPU time 15 hours 29 min 42 sec
Validate state Valid
        Credit 605.87


<core_client_version>7.6.33</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_x86_64.exe -run:protocol jd2_scripting @flags_rb_03_31_20049_19874__t000__3_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_03_31_20049_19874__t000__3_C1_robetta.zip -nstruct 10000 -cpu_run_time 57600 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3920361
Starting watchdog...
Watchdog active.
======================================================
DONE ::    13 starting structures  55782.3 cpu seconds
This process generated     13 decoys from      13 attempts
======================================================
BOINC :: WS_max 1.75866e+09

BOINC :: Watchdog shutting down...
19:06:24 (8204): called boinc_finish(0)

</stderr_txt>
]]>





rb_03_31_20031_19865__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_904757_832_0
      Run time 15 hours 45 min 21 sec
      CPU time 15 hours 40 min 23 sec
Validate state Valid
        Credit 1,215.60


<core_client_version>7.6.22</core_client_version>
<![CDATA[
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.15_windows_intelx86.exe -run:protocol jd2_scripting @flags_rb_03_31_20031_19865__t000__0_C1_robetta -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip input_rb_03_31_20031_19865__t000__0_C1_robetta.zip -nstruct 10000 -cpu_run_time 57600 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 1163139
Starting watchdog...
Watchdog active.
======================================================
DONE ::    31 starting structures  56423.5 cpu seconds
This process generated     31 decoys from      31 attempts
======================================================
BOINC :: WS_max 9.19749e+08

BOINC :: Watchdog shutting down...
19:36:44 (4984): called boinc_finish(0)

</stderr_txt>
]]>
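
The numbers in the first stderr excerpt can be checked with a little arithmetic. A minimal sketch, assuming only the two lines copied from that log (the helper names are made up for illustration and are not part of Rosetta or BOINC): the `-cpu_run_time 57600` flag corresponds to 16 hours rather than the 8 hour target shown on the task page, and the CPU time in the `DONE` line sits just under that.

```python
import re
from typing import Tuple

# Fragments copied from the first stderr excerpt above.
COMMAND_FRAGMENT = "-nstruct 10000 -cpu_run_time 57600 -watchdog -boinc:max_nstruct 600"
DONE_LINE = "DONE ::    13 starting structures  55782.3 cpu seconds"


def parse_target_seconds(command: str) -> int:
    """Return the value passed to -cpu_run_time, in seconds."""
    match = re.search(r"-cpu_run_time\s+(\d+)", command)
    if match is None:
        raise ValueError("no -cpu_run_time flag found")
    return int(match.group(1))


def parse_done_line(line: str) -> Tuple[int, float]:
    """Return (structure count, CPU seconds) from a 'DONE ::' summary line."""
    match = re.search(
        r"DONE ::\s+(\d+) starting structures\s+([\d.]+) cpu seconds", line
    )
    if match is None:
        raise ValueError("not a DONE summary line")
    return int(match.group(1)), float(match.group(2))


target_s = parse_target_seconds(COMMAND_FRAGMENT)
structures, cpu_s = parse_done_line(DONE_LINE)

print(f"target run time: {target_s / 3600:.1f} h")
print(f"CPU time used:   {cpu_s / 3600:.2f} h")
print(f"per structure:   {cpu_s / structures / 60:.1f} min")
```

The second task's log carries the same `-cpu_run_time 57600` flag, so the same check applies there.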

Grant
Darwin NT
ID: 94753

rzlatic
Joined: 20 Nov 07
Posts: 3
Credit: 327,897
RAC: 0
Message 94765 - Posted: 18 Apr 2020, 13:53:04 UTC - in response to Message 94697.  
Last modified: 18 Apr 2020, 13:56:55 UTC

rzlatic, it appears you are seeing tasks running more than 4 hours past the runtime preference, then ended by the watchdog, on a Linux system running the i686 application. Please see the discussion here.


Indeed, the tasks causing the problem had names starting with "12v1n" and were run by the i686-pc-linux application (as seen here: https://imgur.com/WaDeO14).

I created a "cc_config.xml" config file in /var/lib/boinc/ with the suggested settings and restarted the BOINC client, and there seem to be no i686-pc-linux (32-bit) tasks now.
We'll see how it goes.

Thanks, great community and support.
ID: 94765

MeeeK
Joined: 7 Feb 16
Posts: 31
Credit: 19,737,304
RAC: 0
Message 94766 - Posted: 18 Apr 2020, 14:20:17 UTC
Last modified: 18 Apr 2020, 14:22:36 UTC

Hi,

Did something change in the last few days or weeks?

I lost 7,000 points per day on average. My daily credit dropped from 34,000 to 27,000 and below.

That's a huge gap.
I changed nothing on my systems. They are running 24/7.
ID: 94766

Bryn Mawr
Joined: 26 Dec 18
Posts: 390
Credit: 12,073,013
RAC: 4,827
Message 94771 - Posted: 18 Apr 2020, 14:53:48 UTC - in response to Message 94766.  

Hi,

Did something change in the last few days or weeks?

I lost 7,000 points per day on average. My daily credit dropped from 34,000 to 27,000 and below.

That's a huge gap.
I changed nothing on my systems. They are running 24/7.


I can’t see that far back in your results log, but you’re losing quite a few credits through missed deadlines. You might find it better to reduce your buffers to maybe 0.5 days + 0.5 days, or even less.
ID: 94771

Sid Celery
Joined: 11 Feb 08
Posts: 2121
Credit: 41,179,074
RAC: 11,480
Message 94785 - Posted: 18 Apr 2020, 17:39:14 UTC - in response to Message 94771.  
Last modified: 18 Apr 2020, 17:40:53 UTC

Did something change in the last few days or weeks?

I lost 7,000 points per day on average. My daily credit dropped from 34,000 to 27,000 and below.

That's a huge gap.
I changed nothing on my systems. They are running 24/7.

I can’t see that far back in your results log, but you’re losing quite a few credits through missed deadlines. You might find it better to reduce your buffers to maybe 0.5 days + 0.5 days, or even less.

To be fair, cancelled tasks that have passed deadline are only deleted if they're unstarted - running tasks are allowed to continue, so no credit loss.

But it's certainly true that the buffer is too large to meet the deadline, so it should be cut back significantly.
Even if the next host that receives them runs them within deadline, they'll arrive so long after the batch was released that they'll be no good to the project. They're effectively being made useless at the point of download.
And the runtime is unnecessarily low so the project isn't getting full value for them either tbh

But I'd rather point to the credits awarded here. Since the new version, credits have been a mess. Anyone seeking credits here should know they're going to be disappointed.
My point being, I don't think lower credits are indicative of a problem with the tasks themselves
ID: 94785

Bryn Mawr
Joined: 26 Dec 18
Posts: 390
Credit: 12,073,013
RAC: 4,827
Message 94797 - Posted: 18 Apr 2020, 20:22:46 UTC - in response to Message 94785.  

Did something change in the last few days or weeks?

I lost 7,000 points per day on average. My daily credit dropped from 34,000 to 27,000 and below.

That's a huge gap.
I changed nothing on my systems. They are running 24/7.

I can’t see that far back in your results log, but you’re losing quite a few credits through missed deadlines. You might find it better to reduce your buffers to maybe 0.5 days + 0.5 days, or even less.

To be fair, cancelled tasks that have passed deadline are only deleted if they're unstarted - running tasks are allowed to continue, so no credit loss.

But it's certainly true that the buffer is too large to meet deadline so should be significantly cut back.
Because even if the next host who receives them runs them within deadline they'll be so long after the batch was released they'll be no good to the project. They're effectively being made useless at the point of download.
And the runtime is unnecessarily low so the project isn't getting full value for them either tbh

But I'd rather point to the credits awarded here. Since the new version, credits have been a mess. Anyone seeking credits here should know they're going to be disappointed.
My point being, I don't think lower credits are indicative of a problem with the tasks themselves


As well as the hundreds of error tasks there were quite a few invalid tasks that had run but been discarded and given no credits because they were received too late.
ID: 94797

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,760,053
RAC: 22,881
Message 94799 - Posted: 18 Apr 2020, 23:37:59 UTC

From the "Thank you" thread
To accommodate this, the "watchdog" timeout has been extended from the normal 4 hours to 10 hours.
A big change.
And given the gaps between checkpoints even on my fairly powerful system (40min and more) and reports from others of even longer periods of no checkpointing (on Windows systems, which don't have the Linux i686 application issue), I would hope the programmers are going to look very hard into increasing the number of points where a Task can checkpoint.
Otherwise even powerful systems that run more than just Rosetta will struggle to complete a Task due to resource share settings switching between projects, and less powerful Rosetta-only systems will struggle to reach a checkpoint before there is a need for BOINC to suspend computation. And those that aren't on for long periods of time, or have heavy non-BOINC use while crunching, will have no chance of completing a Task if it has to start from the last checkpoint after an interruption.

With an 8 hour (now up to 18 hours) Runtime, losing 5min here or there isn't a big issue (annoying, but not a big issue). But losing 40min, and in many cases 2hrs or more, when a Rosetta Task gets interrupted is. And unless you have a massive amount of RAM relative to the number of cores/threads, "Leave non-GPU tasks in memory while suspended" isn't an option (what is the default "Page/swap file: use at most %"?).

Yes, most Tasks won't run for the Target time + 10hrs. But those that do will drive people away if they spend hours doing work, only to lose it all & have to start again. Over & over again.
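
To put rough numbers on the checkpoint concern, here is a minimal sketch. It assumes a task that is not kept in memory and therefore restarts from its last checkpoint after an interruption; the function name and the checkpoint schedules are hypothetical, chosen only to contrast 40 minute gaps with 5 minute gaps.

```python
from typing import Sequence


def minutes_lost_on_restart(stopped_at_min: float,
                            checkpoints_min: Sequence[float]) -> float:
    """Minutes of computation redone when a task stopped at `stopped_at_min`
    has to resume from the most recent checkpoint it reached."""
    reached = [t for t in checkpoints_min if t <= stopped_at_min]
    last_checkpoint = max(reached, default=0.0)  # no checkpoint yet: restart from zero
    return stopped_at_min - last_checkpoint


# Hypothetical schedules: a checkpoint every 40 minutes vs. every 5 minutes.
sparse = [40, 80, 120]
dense = list(range(5, 125, 5))

lost_sparse = minutes_lost_on_restart(118, sparse)  # resumes from minute 80
lost_dense = minutes_lost_on_restart(118, dense)    # resumes from minute 115

print(f"sparse checkpoints: {lost_sparse} min lost")
print(f"dense checkpoints:  {lost_dense} min lost")
```

The loss per interruption is bounded by the longest gap between checkpoints, which is why long gaps hurt machines that are interrupted often.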
Grant
Darwin NT
ID: 94799

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94800 - Posted: 18 Apr 2020, 23:43:47 UTC - in response to Message 94738.  
Last modified: 18 Apr 2020, 23:44:46 UTC

@Grant, you appear to be one of the lucky few that received some of those WUs from two weeks ago when the default runtime was bumped to 16 hours. So the WU thought it was running normally. This explains why the watchdog did not step in.

As posted today by admin, the timeout used by the watchdog has been extended from the normal 4 hours to 10 hours. So, on new WUs, where the watchdog is set to 10 hours, that means a given WU may run 10 CPU hours beyond the runtime preference before the watchdog will step in and end the WU. This is because of the extremely challenging protein models under study now.
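
The rule described here can be written out as simple arithmetic. This is only a sketch of the rule as stated in this post, not Rosetta's actual watchdog code, and the function names are invented for illustration.

```python
def watchdog_cutoff_hours(runtime_preference_h: float,
                          watchdog_timeout_h: float) -> float:
    """CPU hours after which the watchdog would end a task:
    the runtime preference plus the watchdog timeout."""
    return runtime_preference_h + watchdog_timeout_h


def watchdog_would_end(cpu_hours: float,
                       runtime_preference_h: float,
                       watchdog_timeout_h: float) -> bool:
    """True once the task's CPU time has reached the cutoff."""
    return cpu_hours >= watchdog_cutoff_hours(runtime_preference_h,
                                              watchdog_timeout_h)


old_cutoff = watchdog_cutoff_hours(8, 4)    # 8 hour preference, old 4 hour timeout
new_cutoff = watchdog_cutoff_hours(8, 10)   # 8 hour preference, new 10 hour timeout
print(old_cutoff, new_cutoff)

# A WU that itself carried a 16 hour target and used about 15.5 CPU hours
# stays under its cutoff even with the old 4 hour timeout.
print(watchdog_would_end(15.5, 16, 4))
```

With an 8 hour preference the worst case moves from 12 to 18 CPU hours, which matches the "now up to 18 hours" figure mentioned earlier in the thread.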
Rosetta Moderator: Mod.Sense
ID: 94800

Admin
Project administrator
Joined: 1 Jul 05
Posts: 4805
Credit: 0
RAC: 0
Message 94803 - Posted: 18 Apr 2020, 23:54:18 UTC - in response to Message 94799.  

Extending the watchdog has no impact on the checkpointing issue. These are separate issues. Extending the watchdog will prevent errors for jobs that have a longer than usual run time per model, such as a 2000 residue protein, for protocols that do not have checkpoints at appropriate intervals. Although somewhat rare, these types of jobs do exist from Robetta. We should definitely try to address the checkpointing issue by adding more checkpoints to the various protocols, but this will take development time.
ID: 94803

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,760,053
RAC: 22,881
Message 94806 - Posted: 19 Apr 2020, 0:20:27 UTC - in response to Message 94803.  

Extending the watchdog has no impact on the checkpointing issue. These are separate issues. Extending the watchdog will prevent errors for jobs that have a longer than usual run time per model, such as a 2000 residue protein, for protocols that do not have checkpoints at appropriate intervals. Although somewhat rare, these types of jobs do exist from Robetta. We should definitely try to address the checkpointing issue by adding more checkpoints to the various protocols, but this will take development time.
Thank you for the response.
Given the increase in the Watchdog time, I think that addressing the checkpoint issue should get bumped up the list of things that need to be done. As I said, losing a few minutes' work is annoying, but it's not a major issue. But to lose 2, 4, 8, 12, 16 hours of processing? That will drive people away.
If Rosetta is using a grace period on work not returned by the deadline, it may be worth considering increasing it to allow for the increased Watchdog time. And please, if possible, see that Tasks aren't resent till the Grace period extension has passed, i.e. don't send them till they're needed. That would reduce people's error count, the network bandwidth used, and the Scheduler load. Not a big reduction, but every little bit counts as the number of systems crunching grows.
Grant
Darwin NT
ID: 94806

Sid Celery
Joined: 11 Feb 08
Posts: 2121
Credit: 41,179,074
RAC: 11,480
Message 94814 - Posted: 19 Apr 2020, 1:54:42 UTC - in response to Message 94799.  

From the "Thank you" thread
To accommodate this, the "watchdog" timeout has been extended from the normal 4 hours to 10 hours.
A big change.
And given the gaps between checkpoints even on my fairly powerful system (40min and more) and reports from others of even longer periods of no checkpointing (on Windows systems, which don't have the Linux i686 application issue), I would hope the programmers are going to look very hard into increasing the number of points where a Task can checkpoint.
Otherwise even powerful systems that run more than just Rosetta will struggle to complete a Task due to resource share settings switching between projects, and less powerful Rosetta-only systems will struggle to reach a checkpoint before there is a need for BOINC to suspend computation. And those that aren't on for long periods of time, or have heavy non-BOINC use while crunching, will have no chance of completing a Task if it has to start from the last checkpoint after an interruption.

With an 8 hour (now up to 18 hours) Runtime, losing 5min here or there isn't a big issue (annoying, but not a big issue). But losing 40min, and in many cases 2hrs or more, when a Rosetta Task gets interrupted is. And unless you have a massive amount of RAM relative to the number of cores/threads, "Leave non-GPU tasks in memory while suspended" isn't an option (what is the default "Page/swap file: use at most %"?).

Yes, most Tasks won't run for the Target time + 10hrs. But those that do will drive people away if they spend hours doing work, only to lose it all & have to start again. Over & over again.

I think this is a strong point so +1. People who aren't running 24/7, or who run multiple projects with project switching at the default interval and don't hold tasks in memory, are going to have significant issues.
I do run 24/7 and I'd already changed my "switch between tasks every xx minutes" from a default 60 to 999 but even this might not be enough. Might be worth adding another 9
ID: 94814

MeeeK
Joined: 7 Feb 16
Posts: 31
Credit: 19,737,304
RAC: 0
Message 94819 - Posted: 19 Apr 2020, 2:50:32 UTC - in response to Message 94797.  

I have been running the same settings since I bought these CPUs.

Store a day of work plus an additional 2 days.
Never had this many problems.

I store 3 days of work to have some reserve in case of problems with my ISP,
so the computers can do their work offline until the internet connection is back.

There are 150 missed-deadline tasks in my stats right now. I have never seen that many before.
I think something was wrong with deadlines being too short, but none of them had been started, so no work was lost.

My two Ryzen 3600s dropped from 17,000 points each to approximately 12,000 and 13,000.

That's not normal.
I will check the tasks later. Maybe I can find a problem.
ID: 94819

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94822 - Posted: 19 Apr 2020, 3:03:20 UTC - in response to Message 94814.  

You will correct me if I'm mistaken, but doesn't the BOINC Manager wait until a WU reaches a checkpoint before suspending it to work on another project? This change was made about a decade ago, because a task that hasn't reached a checkpoint, especially if tasks are not kept "in memory", will lose what it has been working on. True for all BOINC projects.

Picture a machine with 4 CPUs, running 3 BOINC projects, it could easily have 8 different WUs that have been worked on during the day. And even if WUs are kept "in memory" when suspended, you are going to lose progress on all 8 of those WUs when you turn off the machine.

So, I don't believe that "switch between tasks every xx minutes" setting really has much effect anymore. In fact I thought it used to include the phrase "at most" every xx minutes.

Note: I tend to place "in memory" in quotes, because memory used by inactive threads is always swapped out if there is an active thread that needs memory. So long as you have a swap file of sufficient size, I highly recommend checking the box for "Leave non-GPU tasks in memory while suspended".
Rosetta Moderator: Mod.Sense
ID: 94822

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,760,053
RAC: 22,881
Message 94823 - Posted: 19 Apr 2020, 3:07:06 UTC - in response to Message 94819.  

I have been running the same settings since I bought these CPUs.

Store a day of work plus an additional 2 days.
Never had this many problems.
You have as many Tasks that are an Error because they miss the deadline as are in your Valid list. That is a problem.
   Other	
                                Store at least 1 days of work
                     Store up to an additional 0.02 days of work
This will result in more than a day's work until deadlines start to settle down. Then you could bump up "Store at least xx days of work" to 1.5 if you feel you need more, without most of the work you get ending up as Errors due to missed deadlines, which is what is happening at present.
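
Why a 1 + 2 day buffer collides with a 3 day deadline can be sketched with a deliberately simple model. The assumptions are mine, not BOINC's scheduler logic: the machine runs 24/7, runtime estimates are accurate, work is processed in download order, and a task fetched when the buffer is full starts only after everything ahead of it is done.

```python
def last_task_finish_days(store_at_least_days: float,
                          store_additional_days: float,
                          task_runtime_hours: float) -> float:
    """Days after download at which the task at the back of a full buffer
    finishes, under the simplifying assumptions stated above."""
    return store_at_least_days + store_additional_days + task_runtime_hours / 24.0


def misses_deadline(store_at_least_days: float,
                    store_additional_days: float,
                    task_runtime_hours: float,
                    deadline_days: float) -> bool:
    finish = last_task_finish_days(store_at_least_days,
                                   store_additional_days,
                                   task_runtime_hours)
    return finish > deadline_days


# 1 + 2 days of buffer, 8 hour tasks, 3 day deadline.
big_buffer_misses = misses_deadline(1, 2, 8, 3)
# The settings suggested above: 1 + 0.02 days.
small_buffer_misses = misses_deadline(1, 0.02, 8, 3)

print(big_buffer_misses, small_buffer_misses)
```

Under this model the 1 + 2 day buffer finishes its last task after roughly 3.3 days, past the deadline, while the 1 + 0.02 day buffer leaves well over a day of slack.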
Grant
Darwin NT
ID: 94823

MeeeK
Joined: 7 Feb 16
Posts: 31
Credit: 19,737,304
RAC: 0
Message 94825 - Posted: 19 Apr 2020, 4:57:21 UTC - in response to Message 94823.  

I have been running the same settings since I bought these CPUs.

Store a day of work plus an additional 2 days.
Never had this many problems.
You have as many Tasks that are an Error because they miss the deadline as are in your Valid list. That is a problem.
   Other	
                                Store at least 1 days of work
                     Store up to an additional 0.02 days of work
This will result in more than a day's work until deadlines start to settle down. Then you could bump up "Store at least xx days of work" to 1.5 if you feel you need more, without most of the work you get ending up as Errors due to missed deadlines, which is what is happening at present.



That isn't the problem at all.
All the WUs cancelled because of the deadline had not been started, so my CPUs didn't waste a second of work on them. There were 6 or 7 WUs that hit an error while running and were aborted.
But these 6 or 7 tasks don't make me lose so many points.

But I just had an idea.

I guess it's because of the way average points are calculated.
I have 150 finished jobs with points that would make up my 34,000 points, BUT I also have 150 missed-deadline tasks with 0 points. So it's 300 WUs with XXX thousand points on average. I don't know the exact numbers at the moment.

Did anybody else have problems with deadlines? I didn't read all the posts.
ID: 94825

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94826 - Posted: 19 Apr 2020, 5:30:54 UTC - in response to Message 94825.  
Last modified: 19 Apr 2020, 5:31:32 UTC

Did anybody else have problems with deadlines? I didn't read all the posts.


There have been several issues lately that have caused machines to load more work than can be processed within the 3 day deadlines that now rule the day.


  • Default runtime was briefly changed from 8 hours to 16 hours.
  • Some WUs failed quickly, causing BOINC Manager to believe new WUs might "complete" quickly as well.
  • Deadlines used to be a mixture of 3 days and 8 days, but now all WUs seem to be getting 3 day deadlines.
  • Some WUs are using more memory than was previously typical, some machine environments handle this better than others.
  • Some WUs were causing Linux i686 to run for the runtime preference plus 4 hours before ending with no models produced. This may be the primary reason behind your RAC drop, but it would seem the WUs that might have caused that have already been purged.
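
How a low runtime estimate leads to over-fetching can be illustrated with rough arithmetic. This is a simplified sketch, not BOINC's actual work-fetch algorithm; the function names, the 12 thread host, and the 3 hour and 8 hour figures are illustrative assumptions.

```python
import math


def tasks_to_fill_buffer(buffer_days: float, threads: int,
                         estimated_runtime_hours: float) -> int:
    """Number of tasks needed to keep `threads` busy for `buffer_days`
    if each task takes `estimated_runtime_hours`."""
    return math.ceil(buffer_days * 24 * threads / estimated_runtime_hours)


def days_to_drain(task_count: int, threads: int,
                  actual_runtime_hours: float) -> float:
    """Days needed to finish `task_count` tasks at the actual runtime."""
    return task_count * actual_runtime_hours / threads / 24.0


# Hypothetical host: 12 threads, 3 day buffer.
fetched_at_8h = tasks_to_fill_buffer(3, 12, 8)   # estimate matches an 8 hour target
fetched_at_3h = tasks_to_fill_buffer(3, 12, 3)   # estimate dragged down to 3 hours

# If those tasks then really take 8 hours each:
drain_days = days_to_drain(fetched_at_3h, 12, 8)

print(fetched_at_8h, fetched_at_3h, drain_days)
```

In this example the client fetches 288 tasks instead of 108, and working through them at 8 hours each would take about 8 days, far beyond a 3 day deadline.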


Rosetta Moderator: Mod.Sense
ID: 94826

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,760,053
RAC: 22,881
Message 94828 - Posted: 19 Apr 2020, 6:07:19 UTC - in response to Message 94825.  

That isn't the problem at all.
All the WUs cancelled because of the deadline had not been started, so my CPUs didn't waste a second of work on them.
But it is a problem.
You did waste Rosetta's bandwidth & server resources in downloading them, having them sit there doing nothing for days, then having them timeout & having to send them out again to another system that will process them.
The reason for the shorter deadlines is because the Project wants (needs) the results back sooner. If you are not going to process them, then why download them? If you set your cache to a more realistic value then everyone benefits.



Did anybody else have problems with deadlines? I didn't read all the posts.
Many other people have problems with deadlines, but none as bad as yourself.
Grant
Darwin NT
ID: 94828

MeeeK
Joined: 7 Feb 16
Posts: 31
Credit: 19,737,304
RAC: 0
Message 94829 - Posted: 19 Apr 2020, 6:49:56 UTC - in response to Message 94826.  

Did anybody else have problems with deadlines? I didn't read all the posts.


There have been several issues lately that have caused machines to load more work than can be processed within the 3 day deadlines that now rule the day.


    ...
  • Some WUs failed quickly, causing BOINC Manager to believe new WUs might "complete" quickly as well.
  • Deadlines used to be a mixture of 3 days and 8 days, but now all WUs seem to be getting 3 day deadlines.
    ...
  • Some WUs were causing Linux i686 to run for the runtime preference plus 4 hours before ending with no models produced. This may be the primary reason behind your RAC drop, but it would seem the WUs that might have caused that have already been purged.



I think that might have caused my "problems".
I have now changed the settings to 2 days and the WU runtime to 6 hours.
I will check over the next few days whether there is a change.

Do you think I should upgrade the RAM to 32GB because of the higher usage? I have two Ryzen 5 3600 systems (6 cores, 12 threads) with 16GB each. That has always been enough so far.
ID: 94829

Admin
Project administrator
Joined: 1 Jul 05
Posts: 4805
Credit: 0
RAC: 0
Message 94830 - Posted: 19 Apr 2020, 6:55:07 UTC

Another recent issue has been that a batch of jobs has been finishing earlier than expected. i.e. ~3 hours. We've hopefully addressed this issue for future batches of similar runs (cyclic peptide jobs).
ID: 94830

MeeeK
Joined: 7 Feb 16
Posts: 31
Credit: 19,737,304
RAC: 0
Message 94831 - Posted: 19 Apr 2020, 7:00:39 UTC - in response to Message 94830.  

I think because of that, my clients downloaded too many WUs.

The clients are downstairs in the basement. I don't watch them every day, so I didn't notice that.
ID: 94831



©2024 University of Washington
https://www.bakerlab.org