Message boards : Number crunching : File transfers.
Author | Message |
---|---|
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 51 |
I noticed yesterday a Rosetta task on my list in the "downloading" state. Some time later it was still in the downloading state, so I went to Transfers and poked and prodded it. The download starts, but stops at 46.22%. Retry does the same. It is still like that today. Server status looks normal. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
LarryMajor Send message Joined: 1 Apr 16 Posts: 22 Credit: 31,533,212 RAC: 0 |
I'm having the same problem with two machines. It happens occasionally, but it's been bad the past 24 hours. |
bfromcolo Send message Joined: 25 Apr 13 Posts: 2 Credit: 1,294,095 RAC: 0 |
I have had 3 tasks on 2 machines hung like this for hours, and these are very small downloads. To make matters worse, it stops other work from being downloaded, at least sometimes; it's not consistent here. Retrying the transfer didn't help with any of them. Aborting the transfer did help: it caused the associated work unit to fail, but on the next update everything was back in order. Sat 08 Feb 2020 08:26:01 AM MST | Rosetta@home | Not requesting tasks: some download is stalled |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 51 |
Still like that today. I aborted the transfer. Other jobs downloaded and started quickly. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Timo Send message Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
Just a note to help others avoid having to 'abort transfer' (and thus inadvertently abort tasks that may then never get completed, which impacts research): I've found that closing the BOINC client, including checking the checkbox that says 'Stop running tasks when exiting the BOINC manager', and restarting it force-retries the downloads, and they usually succeed. Still, this is definitely a networking issue on the UW side. Hopefully someone reads this forum post. **38 cores crunching for R@H on behalf of cancercomputer.org - a non-profit supporting High Performance Computing in Cancer Research |
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 25,960,596 RAC: 13,955 |
I have also had a few stuck files in the last few days. BOINC also stopped getting new work from R@H completely until I noticed it today and aborted the stuck file transfers. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2124 Credit: 41,206,907 RAC: 10,305 |
I've had this over the last few weeks - not entirely sure it's fixed even now. The biggest issue is machines left unattended for longer than my overall buffer size - in my case 24-34 hrs. New tasks are prevented from coming down while a download is stalled (always a very small zip file) until all Rosetta tasks in my buffer are complete, so tasks are drawn from my backup project to completely fill the buffer instead. Once the stalled file/task is manually aborted/cleared, my priorities between Rosetta and backup project mean backup tasks are all ignored unless they're manually forced to run, so there's a further day or two of clearing them out before the machine becomes unattended again with the prospect of another failed Rosetta download and everything repeats itself. This has been a constant job almost every single day of the last two weeks over 4 machines in 3 different locations, so if anyone can find a way of preventing this recurring I'd really appreciate it. It's not been funny. |
Jim1348 Send message Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
"Once the stalled file/task is manually aborted/cleared, my priorities between Rosetta and backup project mean backup tasks are all ignored unless they're manually forced to run, so there's a further day or two of clearing them out before the machine becomes unattended again with the prospect of another failed Rosetta download and everything repeats itself." That is annoying, I know. But if you have set the backup as a zero resource share, it will eventually clear itself out in order to meet its expiration date. It will just sit around for a while. |
Mad_Max Send message Joined: 31 Dec 09 Posts: 209 Credit: 25,960,596 RAC: 13,955 |
Yes, it will clear itself, but not in a good way. BOINC will just ignore such tasks from a project with "zero" resource share until they almost hit their deadlines; that triggers "panic mode", and BOINC reallocates all resources to them so it can finish them before the deadline. But sometimes it still misses some deadlines, as task duration estimates are far from perfect and some WUs can take way longer than BOINC thinks. It also does some other stupid things while in "panic mode", like ignoring the CPU core reservation setting (I set it to use 90% of CPUs at most = 7 of 8 cores, but BOINC in "panic mode" will use all 8) or pausing GPU work to free more CPU cores for CPU WUs, risking missed deadlines, and other things it was never allowed to do. |
Om Send message Joined: 18 Feb 20 Posts: 16 Credit: 777,076 RAC: 0 |
. |
Om Send message Joined: 18 Feb 20 Posts: 16 Credit: 777,076 RAC: 0 |
March 14th and the issue continues. I have one stuck at 82.22%. Aborting seems to be the only option... |
Dr Who Fan Send message Joined: 28 May 06 Posts: 70 Credit: 266,414 RAC: 398 |
This thread/topic is a duplicate of Message boards : Number crunching : Stalled downloads. Let's not make multiple topics on the SAME issue! |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2124 Credit: 41,206,907 RAC: 10,305 |
"Once the stalled file/task is manually aborted/cleared, my priorities between Rosetta and backup project mean backup tasks are all ignored unless they're manually forced to run, so there's a further day or two of clearing them out before the machine becomes unattended again with the prospect of another failed Rosetta download and everything repeats itself." I set it to 96.67% Rosetta to 3.33% WCG, but that's not the issue I'm seeing. Once all Rosetta tasks are complete, barring the stalled download Rosetta task, my entire buffer fills with the backup project, so I get 2.0 or 2.4 days of WCG tasks. When I resolve the Rosetta issue, I can manually force the WCG tasks to run (4 or 8 tasks at a time, depending on the cores for that machine), but as soon as they finish, Rosetta starts again and I have to manually start more WCG tasks. It's very boring as well as annoying. And when I'm at that location, I'm in one of two places for half a day at a time, so it can take 2 or 3 days to clear them or, as has just been the case, I don't get to clear them all in 3 days and have to leave for my other location for 3-4 days. I could just abort all the WCG tasks, I suppose, but I don't like to do that. If they run, then I'm sure of a long unattended run on Rosetta to catch up the debt. Which is great unless another Rosetta download fails and then I'm back to square one, resolving a task that's failed while unattended there. This has been going on for nearly a month. To say I'm thoroughly sick and tired of it all would be an understatement. |
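
For anyone who wants to spot this on unattended machines, here is a minimal Python sketch that scans saved event log lines for the "some download is stalled" message quoted earlier in the thread. It assumes the log lines use the same `timestamp | project | message` layout as that quoted line; the helper names (`parse_event`, `stalled_projects`) and the second sample line are hypothetical and only for illustration. It does not talk to the BOINC client at all.

```python
from typing import Iterable, List, NamedTuple, Optional

# Text taken from the event log line quoted in this thread.
STALL_TEXT = "some download is stalled"


class LogEvent(NamedTuple):
    timestamp: str
    project: str
    message: str


def parse_event(line: str) -> Optional[LogEvent]:
    """Split one 'timestamp | project | message' line; return None if it does not fit."""
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3:
        return None
    return LogEvent(*parts)


def stalled_projects(lines: Iterable[str]) -> List[str]:
    """Return the projects (in first-seen order) that reported a stalled download."""
    projects: List[str] = []
    for line in lines:
        event = parse_event(line)
        if event and STALL_TEXT in event.message and event.project not in projects:
            projects.append(event.project)
    return projects


sample = [
    # Line quoted in this thread:
    "Sat 08 Feb 2020 08:26:01 AM MST | Rosetta@home | Not requesting tasks: some download is stalled",
    # Hypothetical line, made up for the example:
    "Sat 08 Feb 2020 08:27:10 AM MST | World Community Grid | Requesting new tasks for CPU",
]
print(stalled_projects(sample))
```

Run against a copy of the event log, a non-empty result would be the cue to go and retry or clear the stuck transfer by hand; what to do about it is still the manual step described in the posts above.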
©2024 University of Washington
https://www.bakerlab.org