Message boards : Number crunching : Discussion on increasing the default run time
Author | Message |
---|---|
robertmiles Send message Joined: 16 Jun 08 Posts: 1232 Credit: 14,270,985 RAC: 1,405 |
I'd happily change my run-time prefs so that computers that are on lots have a high run-time and the others have a low run-time but I find this really difficult as they're tied to the BOINC work/home/school settings (which I think are poor, but not the project's fault ;) ). I've noticed that the World Community Grid project lets you make some machine-specific settings through their web site, but then several other settings do not propagate through to that machine if changed in other ways. I don't use BAM, so I don't know if this is compatible with BAM. However, it looks like I may soon need to switch managers so that I can control BOINC on my two desktops from my laptop, which appears to be short on power for running longer workunits well, so could you tell me if BAM seems suitable for that purpose? |
Warped Send message Joined: 15 Jan 06 Posts: 48 Credit: 1,788,185 RAC: 0 |
I live in a bandwidth-impoverished part of the world, with high prices and low speed. Consequently, I have selected a 16-hour run time. The workunits ending before the selected run time stop at 100 decoys, whereas the one recent workunit which made it to the selected 16 hours stopped at 88 decoys. Is there anything I can do to adjust this or is it a lucky-dip? |
dgnuff Send message Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
I've noticed this too. As far as I can tell, it's a "lucky-dip" as you so accurately describe it. Also known as a crap-shoot in other parts of the world. ;) In another thread, I suggested increasing the maximum number of decoys from 100 to something higher, but that idea was rejected. I still find the reason for staying with the 100 decoy max totally counter-intuitive, and in fact I'm not at all sure the reasoning given is correct. That said, I'll make the suggestion again to increase the max decoys to 200 (or even higher), and see where the suggestion goes. For those of us with fast machines who are willing to do long run times, it will reduce the load on the servers. I admit it will change the "shape" of the uploaded data, but it will not change the amount - this last point is the one where I think people haven't thought the problem through correctly. |
cenit Send message Joined: 1 Apr 07 Posts: 13 Credit: 1,630,287 RAC: 0 |
"Maximum number of decoys" at 99 was introduced some months ago when Rosetta@home was in "debug mode" (I think around v1.50, no new features, only bugs solved). It was used as the easy way to solve some bugs that arose with large uploads (if I remember correctly, they didn't even investigate whether the problem was in BOINC or somewhere in their code, because this trick easily solved the bug). I don't think that, atm, it's so important to solve this problem drastically; anyway, it would be interesting to know if they have any problem with server load now... |
dgnuff Send message Joined: 1 Nov 05 Posts: 350 Credit: 24,773,605 RAC: 0 |
Interesting. If anyone is looking for fairly reliable repro steps on getting uploads to fail, try the following. Set up a machine, and adjust the maximum upload rate to 2 kbytes/sec on the advanced preferences page. Grab yourself a task like this one: 276073593, let it complete, and then try to upload it. The key section of the name appears to be the "ddg_predictions" string. I've seen a few of these guys going by, and they seem to produce very large result files. I've had two that are in excess of 7 MB and one that was over 11 MB. It's worth noting that if I temporarily adjust the upload speed to something over my connection's max (384 kbits/sec, i.e. ~ 40 kbytes/sec), the transfer will then go through without problems. However, it's a bit of a pain doing this. I'm about at the point that if I find another of these WUs that's stuck uploading, I'm going to force that upload through, and then abort any of these jobs that I see in the queue. |
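To put rough numbers on the repro steps above, here is a minimal sketch of the ideal transfer times involved. It assumes the result files are about 7 and 11 megabytes (decimal units are a guess, since the post does not specify) and ignores protocol overhead and retries. Note that 384 kbit/s is 48 kbytes/s before overhead, so the poster's figure of about 40 kbytes/s is plausible as a real-world rate. The function name is made up for illustration.

```python
def upload_seconds(size_bytes: float, rate_bytes_per_sec: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and retries."""
    if rate_bytes_per_sec <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_sec


MB = 1_000_000  # decimal megabyte (assumption; the post does not say which unit)
KB = 1_000      # decimal kilobyte (same assumption)

# A 384 kbit/s line carries 384_000 / 8 = 48_000 bytes/s before overhead.
line_rate = 384_000 / 8

for size_mb in (7, 11):
    throttled = upload_seconds(size_mb * MB, 2 * KB)
    full = upload_seconds(size_mb * MB, line_rate)
    print(f"{size_mb} MB: {throttled / 60:.0f} min at 2 kB/s, "
          f"{full / 60:.1f} min at the full line rate")
```

Under these assumptions, a throttled upload takes on the order of an hour or more, while the same file goes through in a few minutes at the full line rate.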
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
So yes, Warped, you are seeing what I would expect. When your preference of 16hrs is near, the tasks end. And if 99 models is reached prior to that, the task will end at that time (at least in the "mini" application). The amount of data reported back on the uploads varies by the type of protein and type of study being done. But the primary factor or multiplier on the size is the number of models. At one point there were batches of WUs that were running 20 models an hour. The upload size and potential for hitting the maximum outfile size were very large for long runtime preferences. 99 models was just a way to strike a compromise between giving the desired runtime, and having a predictable and reasonable upload size. dgnuff, from what you are describing, it sounds like the only issue with any given type of work unit is the resulting size of the result file. Any time you have a large file that must move and a very limited bandwidth, there is a conflict to be resolved. The BOINC client can do partial file transfers and continue where it has left off. But I believe it also times out on connections that are actually moving data as well. I've seen connections end after 5 minutes and then restart, at least on downloads. I presume uploads are similar. I am not sure why Berkeley made the client work that way. Seems to me that an active connection that is still successfully moving data should be left alone. So when you say you can get an upload to "fail", do you mean a retry occurs? Or do you mean that so many retries occur that... well, the WU you linked looks like it arrived intact. So, eventually the upload was completed. I am unclear what you mean about the upload being "stuck". I think what you are seeing sounds normal for a connection with very limited bandwidth. And the client will keep working on it and get it sent all by itself. This is part of why they decided to limit to 99 models too. 
The uploads on tasks that produced many, many models were approaching 100MB, which is large enough to cause difficulty in many environments. Rosetta Moderator: Mod.Sense |
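The stopping rule described in this post can be sketched as follows. This is only an illustration, not the project's code: it assumes a constant model rate, ignores the model still in progress when the runtime preference is reached, and uses hypothetical function names. The cap of 99 is the figure quoted by the moderator.

```python
def models_completed(runtime_pref_hours: float, models_per_hour: float,
                     max_models: int = 99) -> int:
    """Models finished when the task ends: the model cap or the runtime
    preference, whichever is reached first (constant rate assumed)."""
    by_time = int(runtime_pref_hours * models_per_hour)
    return min(by_time, max_models)


def hours_run(runtime_pref_hours: float, models_per_hour: float,
              max_models: int = 99) -> float:
    """Wall time until the task ends under the same simplified rule."""
    hours_to_cap = max_models / models_per_hour
    return min(runtime_pref_hours, hours_to_cap)


# A fast batch (20 models/hour) hits the cap long before a 16-hour preference.
print(models_completed(16, 20), hours_run(16, 20))

# A slower batch (5.5 models/hour, a made-up rate) runs the full 16 hours.
print(models_completed(16, 5.5), hours_run(16, 5.5))
```

Because upload size scales mainly with the number of models, capping the model count also bounds the result file size regardless of the runtime preference.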
thatoneguy Send message Joined: 8 Jun 06 Posts: 3 Credit: 2,636,731 RAC: 0 |
Back to the main issue... what would be the best way to transition to an increased run time? If it is possible to do so, temporarily decrease the amount of work that can be downloaded. I think it is possible to fudge the report deadline so that computers don't ask for more work, but still receive credit for past due WUs. Following the change, simply increasing the deadline would ease almost all problems stemming from long run-times. The problem remains of course that WUs may take a long time to return to the server. As long as credit is given for the late work, I think most people won't care about the change (except for the few people who have their computer on so seldom that they won't be able to complete any work on time). |
S_Koss Send message Joined: 7 Jan 10 Posts: 4 Credit: 37,252 RAC: 0 |
I have a serious problem with changing the default times. I shut 2 of my 3 crunching computers off at night because they are in my bedroom. Last night I had a 3 hour WU that was 99% done. But I was tired and did not want to wait 5 - 10 or 15 minutes for it to finish so I exited BOINC and went to bed. This morning the said WU restarted, but from 0%. I lost 3 hours of work on just this WU, not taking into consideration the other WU that also restarted. I find that unacceptable and turned the default time down to 1 hour. If you are going to change the default time to a minimum of 3 hours then I will be changing projects because I will not continue to lose uncountable hours of work. |
S_Koss Send message Joined: 7 Jan 10 Posts: 4 Credit: 37,252 RAC: 0 |
On second thought, you can do whatever you want. I am outa here............. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Steve, what you are describing is a very new issue that has turned up with a new type of work unit that seems to be having some checkpointing issues. Transient, I don't think an overclock would be needed to cause the symptoms he's reporting. I've asked Sarel to look into it. Steve, I'm curious how the runtime of a task is affecting your user experience (other than loss of work, which I clearly already understand). You appear to have racked up 25,000 credits in just 4 days, so clearly you have machines running 24x7. So how does running one task for 3 hours have a disadvantage over running 3 tasks for an hour each? Rosetta Moderator: Mod.Sense |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I have a serious problem with changing the default times. I shut 2 of my 3 crunching computers off at night because they are in my bedroom. Last night I had a 3 hour WU that was 99% done. But I was tired and did not want to wait 5 - 10 or 15 minutes for it to finish so I exited BOINC and went to bed. This morning the said WU restarted, but from 0%. I lost 3 hours of work on just this WU, not taking into consideration the other WU that also restarted. I find that unacceptable and turned the default time down to 1 hour. If you are going to change the default time to a minimum of 3 hours then I will be changing projects because I will not continue to lose uncountable hours of work. Can you give us more information? What was the job ID? Can you link us to your job information? DK |
S_Koss Send message Joined: 7 Jan 10 Posts: 4 Credit: 37,252 RAC: 0 |
Hi, so let me try to explain this better. If you have 4 or 8 or 12 WUs in varying degrees of completion and you shut down for the night (because I shut 2 computers of 3 down at night), the average loss will be higher than with 1-hour WUs. When you restart the next morning and you lose everything that you did the night before, it gets frustrating, and it has been so for the past 4 days. That is why I am not really interested in your project. I have since detached from your project so I cannot give you WU numbers. Thank you. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Steve, most Rosetta work units will save a checkpoint every 15 minutes or so. This gives a balance between losing CPU effort, and keeping checkpoint overhead and disk writes low (even in your case where you power off each day, 90% of the checkpoints are never actually needed). So, on average you should see that less than 7.5 min. of CPU time is lost when powering off. Sarel is making the needed changes (posted here) so this will be true for his new type of work units as well. I just don't want to see you leave a very worthwhile project for the wrong reasons. Mad Max's post on Saturday the 9th was one of the first posts that was specific enough to identify the problem, and yours then confirmed the issue. And here we are Monday the 11th, and the problem is being addressed. Rosetta Moderator: Mod.Sense |
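The 7.5-minute figure follows from a simple argument: if a shutdown is equally likely at any moment between two checkpoints, the work lost is on average half the checkpoint interval. A minimal sketch of that arithmetic, with hypothetical function names and a crude minute-by-minute check, is given here. It also shows why a task that never checkpoints (effectively one interval as long as the whole task) loses far more.

```python
def expected_loss_minutes(checkpoint_interval_min: float) -> float:
    """Mean work lost if the shutdown moment is uniformly distributed
    within a checkpoint interval."""
    return checkpoint_interval_min / 2


def average_loss_sampled(checkpoint_interval_min: int, task_minutes: int) -> float:
    """Average loss over a shutdown at each whole minute of the task.
    Work since the last checkpoint is t modulo the interval."""
    losses = [t % checkpoint_interval_min for t in range(task_minutes)]
    return sum(losses) / len(losses)


# Checkpoint every 15 minutes during a 3-hour (180-minute) task.
print(expected_loss_minutes(15), average_loss_sampled(15, 180))

# No usable checkpoints: the "interval" is the whole 180-minute task.
print(expected_loss_minutes(180), average_loss_sampled(180, 180))
```

The sampled average comes out slightly below the continuous figure because it only considers whole minutes, which is consistent with the "less than 7.5 min." wording.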
DJStarfox Send message Joined: 19 Jul 07 Posts: 145 Credit: 1,250,162 RAC: 0 |
Even if you ignore Steve's experience with this project, I hope you recognize that one point has been made clear repeatedly. Checkpoints are a critical feature of BOINC applications. If you need to make checkpoints work within a single decoy's generation, then make it happen. Given that, there's nothing wrong with doubling the default/minimum run times. |
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 25,887,280 RAC: 10,511 |
Steve, most Rosetta work units will save a checkpoint every 15 minutes or so. This gives a balance between losing CPU effort, and keeping checkpoint overhead and disk writes low (even in your case where you power off each day, 90% of the checkpoints are never actually needed). So, on average you should see that less than 7.5 min. of CPU time is lost when powering off. From my observations (after I faced a similar problem, I spent some time watching the disk writes of the Rosetta application), the majority of WUs wrote checkpoints much more often, about once every 1-2 minutes (I think according to the setting in BOINC, which defaults to 60 seconds). The exceptions were two types of WUs: one did not write checkpoints at all (as you noted, this problem has already been located and a fix for it should be included in the new version, Rosetta mini 2.05), and another wrote checkpoints as usual, but after restarting for any reason could not use them (or did not try at all). If a job of the 2nd type comes to me again I will try to catch it. I think an indirect sign of such tasks should be a bad ratio between "claimed credit" and "granted credit" (relative to the particular computer). As in this case: https://boinc.bakerlab.org/rosetta/result.php?resultid=309578283 I think that with the complete server statistics it should be possible to sort tasks by this ratio and see which types of tasks have bad ratios more often. By this criterion, tasks with one (or both) of the following disadvantages should "emerge": 1. Problems with the checkpoint mechanism 2. Bad optimisation (they execute more slowly than the others) But so far I do not have any ideas how to separate one from the other... P.S. I am impressed by the speed of response (only a few days between the "bug report" and the fix for it); compared with many other projects, that is very fast feedback. |
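The ranking idea in this post could look roughly like the sketch below. It is only an illustration: the record layout, the task type names, and the credit numbers are all made up, and a real analysis would need to normalise per host first, as the post points out, since the ratio is only meaningful relative to one computer's usual results.

```python
from collections import defaultdict


def rank_task_types(results):
    """results: iterable of (task_type, claimed_credit, granted_credit).
    Returns [(task_type, mean granted/claimed ratio)], worst ratio first."""
    ratios = defaultdict(list)
    for task_type, claimed, granted in results:
        if claimed > 0:  # skip records with no claimed credit
            ratios[task_type].append(granted / claimed)
    means = {t: sum(r) / len(r) for t, r in ratios.items()}
    return sorted(means.items(), key=lambda item: item[1])


# Made-up sample data for illustration only.
sample = [
    ("type_a", 40.0, 38.0),
    ("type_a", 50.0, 52.0),
    ("type_b", 60.0, 15.0),
    ("type_b", 30.0, 9.0),
]

for task_type, mean_ratio in rank_task_types(sample):
    print(f"{task_type}: mean granted/claimed = {mean_ratio:.3f}")
```

A low ratio flags a task type as suspect but, as noted in the post, does not by itself distinguish checkpoint problems from slow execution.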
Link Send message Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
@S_Koss: why don't you just hibernate the systems instead of shutting them down? Works perfectly for me and I don't lose even one second of work. BTT: no problem for me if the default run times are increased, I run WUs for 12-24 hours. |
Rabinovitch Send message Joined: 28 Apr 07 Posts: 28 Credit: 5,439,728 RAC: 0 |
We are planning to increase the default run time from 3 hours to 6 hours and the minimum from 1 to 3 hours to reduce the load on our servers. Nice idea. And what about increasing the maximum crunching time? I am ready to crunch for even several days if necessary, or crunch until all models have been processed. What about a checkbox like "Work till the end"? :-) |
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 25,887,280 RAC: 10,511 |
I did not have to wait long: it seems I got one of these tasks just now. I will post the "report" in an appropriate topic a bit later: minirosetta 2.03 |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
And what about increasing the maximum crunching time? I am ready to crunch for even several days if necessary, or crunch until all models have been processed. What about a checkbox like "Work till the end"? :-) There is no end. There are literally trillions of trillions of possible models. The current 24hr maximum attempts to strike a balance between getting results back to the project with a fast turnaround time, and minimizing burden on servers and bandwidth for downloads. Originally the maximum was 4 days, but just think if a problem arose and you ran for 4 days before the watchdog realized it and kicked in to end the task. Rosetta Moderator: Mod.Sense |
Nuadormrac Send message Joined: 27 Sep 05 Posts: 37 Credit: 202,469 RAC: 0 |
This also brings up another issue with such a possible increase; though it's a credit related one, so might not take the same precedence as... And yet depending how the units are treated, it might affect the science also. If the processing time is increased, and the unit deadlocks, hangs, or in some way crashes after the initial model(s) had been successfully processed, it will, after whatever time is spent hanging, error out. And yet not everything in the WU was bad. Now because the units don't have the time involved of a CPDN unit, it's unlikely that trickles would be introduced. However, an effect of lengthening the runtime can also be that a unit that does error later on will have a higher chance to error out; and if this occurs then any science which was accumulated prior to the model within the WU that did error could be lost, and the credits for those models which were completed without error most assuredly would be, unless something along the line of trickles or partial validation/crediting could be implemented to allow the successfully processed models within the unit to be validated and counted as such. I understand completely the motivation behind increasing the default run time and if I only received Rosetta Beta 5.98 WUs I'm sure I'd hold to that default successfully. |
©2024 University of Washington
https://www.bakerlab.org