Message boards : Number crunching : Stalled WU
Author | Message |
---|---|
Ed Machak Joined: 10 Nov 16 Posts: 7 Credit: 17,339,411 RAC: 0 |
Hello, I have run at least half a dozen WUs down to more than 99% completed, and then they stall. Time remaining drops to a few minutes and stays there. I've had to abort all 6 WUs because they ran past their report deadline. Is this a common thing? It's been happening over the last month. I hate to waste all that CPU time that might be put to better use on another project. I've been with R@H since 2011 and would like to continue to do useful work if possible. Thank you, Ed Machak |
.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
Some Python work units quickly turn into zombies after less than 60 seconds of CPU time. If you see a work unit overrunning, click on it and select "Properties"; if its "elapsed" time is a lot more than its CPU time, it has gone zombie. Abort them. There are far too many of them. |
Ed Machak Joined: 10 Nov 16 Posts: 7 Credit: 17,339,411 RAC: 0 |
Thanks for the tip. Ed |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
If the task does not complete within 12 hours of elapsed time, then it is stuck. No need to run it for 1 or 2 days. You ran 2.5 days before killing it. After 1 day elapsed, kill it. Before you kill it, go to the slot it is stored in and look at the stderr text. Look at checkpoint times and elapsed times, like this: Status Report: Elapsed Time: '6000.564621' Status Report: CPU Time: '6877.687500' Look at each status report from the bottom up and see how much it advanced, or whether there are any error messages in the text. This will give you a better idea of what's going on. If everything looks normal up to the 95% (or whatever) mark and then it stalls, then it's something in the data itself and all you can do is kill the task. You can also download and use the eFMer BoincTasks program and set up the columns so that CPU% is one of them; then you can tell whether it's stalled or not. If it uses only a fraction of a percent of the CPU, then it's stuck and you can kill it. BT is a very useful program for monitoring. |
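The bottom-up comparison of status reports described above can be automated. This is only a minimal sketch: it assumes the stderr text contains lines in exactly the format of the two quoted samples, the function names are hypothetical, and the second pair of values in `sample` is invented for illustration.

```python
import re

# Assumed line format, copied from the two sample lines quoted in the post.
STATUS_RE = re.compile(r"Status Report: (Elapsed|CPU) Time: '([0-9.]+)'")


def parse_status_reports(stderr_text):
    """Return (elapsed_seconds, cpu_seconds) pairs in the order they appear."""
    pairs = []
    elapsed = None
    for kind, value in STATUS_RE.findall(stderr_text):
        if kind == "Elapsed":
            elapsed = float(value)
        elif elapsed is not None:
            # A CPU line closes the report opened by the last Elapsed line.
            pairs.append((elapsed, float(value)))
            elapsed = None
    return pairs


def cpu_advance(pairs):
    """CPU seconds gained between each pair of consecutive status reports."""
    return [later[1] - earlier[1] for earlier, later in zip(pairs, pairs[1:])]


# First report is the one quoted in the post; the second is made up.
sample = """Status Report: Elapsed Time: '6000.564621'
Status Report: CPU Time: '6877.687500'
Status Report: Elapsed Time: '12000.100000'
Status Report: CPU Time: '6877.900000'"""

reports = parse_status_reports(sample)
print(cpu_advance(reports))
```

If elapsed time keeps climbing between reports while the CPU advance stays close to zero, the task matches the stalled pattern described in this thread.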
UBT - Timbo Joined: 25 Sep 05 Posts: 20 Credit: 2,291,266 RAC: 0 |
Hi all, I don't think I have "stalled" tasks - the %age of work done is still increasing - but they are taking AGES to complete... Task #1: Application - rosetta python projects 1.03 (vbox64); Name - aagb-NMPHE_pp-NMVAL-GGLY-mACPenC12C_pp_7_2674773_4; State - Running; Received - 08/04/2022 00:41:54; Report deadline - 11/04/2022 00:41:56; Estimated computation size - 80,000 GFLOPs; CPU time - 00:34:13; CPU time since checkpoint - 00:00:06; Elapsed time - 1d 20:46:22; Estimated time remaining - 01:06:58; Fraction done - 97.568%; Virtual memory size - 101.57 MB; Working set size - 2.79 GB; Directory - slots/3; Process ID - 5000; Progress rate - 2.160% per hour; Executable - vboxwrapper_26203_windows_x86_64.exe ========= Task #2: Application - rosetta python projects 1.03 (vbox64); Name - aagb-mAZE-mPHE-GPN-mB3PHG_pp_9_2612326_4; State - Running; Received - 08/04/2022 00:41:11; Report deadline - 11/04/2022 00:41:13; Estimated computation size - 80,000 GFLOPs; CPU time - 00:37:56; CPU time since checkpoint - 00:00:06; Elapsed time - 2d 02:10:06; Estimated time remaining - 00:47:31; Fraction done - 98.446%; Virtual memory size - 101.04 MB; Working set size - 2.79 GB; Directory - slots/1; Process ID - 7280; Progress rate - 1.800% per hour; Executable - vboxwrapper_26203_windows_x86_64.exe And from Task Manager I see that CPU usage fluctuates between 0% and maybe 1%. This is very much a waste of computing time if the tasks are not actually doing much... but I don't want to abort them if the task is going to complete and the "result" file is of benefit... Maybe some admin can provide more succinct answers as to why this is happening, as others seem to have reported similar issues with what appear to be "zombie" tasks. regards, Tim Founder, UK BOINC Team Join UK BOINC Team: http://www.ukboincteam.org.uk/newforum |
Bryn Mawr Joined: 26 Dec 18 Posts: 390 Credit: 12,073,013 RAC: 4,827 |
Look at the difference between CPU time and elapsed time, either there is something serious running alongside Boinc or, far more likely, those tasks are dead. |
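To put a number on that difference, here is a minimal sketch that converts the "CPU time" and "Elapsed time" strings from the task properties into seconds and compares them. It assumes the durations look like `00:34:13` or `1d 20:46:22`, as in the properties posted earlier in the thread; the function names and the 5% threshold are hypothetical choices for illustration, not anything defined by BOINC.

```python
import re

# Assumed duration format: optional "<days>d " prefix, then HH:MM:SS.
DURATION_RE = re.compile(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})")

# Hypothetical cut-off: below 5% CPU-to-elapsed, treat the task as stalled.
STALL_THRESHOLD = 0.05


def to_seconds(text):
    """Convert a duration string such as '1d 20:46:22' to whole seconds."""
    match = DURATION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_fraction(cpu_time, elapsed_time):
    """Share of elapsed time actually spent on the CPU (None if elapsed is 0)."""
    elapsed = to_seconds(elapsed_time)
    if elapsed == 0:
        return None  # nothing to judge yet
    return to_seconds(cpu_time) / elapsed


# The two tasks posted earlier in this thread.
for cpu, elapsed in [("00:34:13", "1d 20:46:22"), ("00:37:56", "2d 02:10:06")]:
    fraction = cpu_fraction(cpu, elapsed)
    verdict = "likely stalled" if fraction < STALL_THRESHOLD else "looks active"
    print(f"{cpu} CPU over {elapsed} elapsed: {fraction:.1%} ({verdict})")
```

Both of the posted tasks come out at roughly 1.3% CPU time relative to elapsed time, which is consistent with the "dead task" reading above.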
UBT - Timbo Joined: 25 Sep 05 Posts: 20 Credit: 2,291,266 RAC: 0 |
"Look at the difference between CPU time and elapsed time, either there is something serious running alongside Boinc or, far more likely, those tasks are dead." Hi, thanks for the feedback. :-) I've seen this sort of behaviour before with other non-VBox projects, and usually the rule of thumb is to "leave them be" and they will (eventually) complete... But I've not had this happen with Rosetta's VBox tasks before - and indeed I have one other host, with the same OS (Win 7 Pro), the same VBox version and the same version of BOINC Manager, and that has been fairly rattling through the tasks... Both hosts have plenty of installed, working RAM, and no other significant non-BOINC tasks are taking place simultaneously. E.g.: one VBoxHeadless.exe is taking up 71 MB, the other is at 39 MB, and VirtualBox.exe is taking up 18.5 MB - which are minute amounts of RAM in the grand scheme of things... So, it might be that my old CPU on this one host is "past it" - maybe the right CPU "core-functions" are not up to the mark... but it works fine with LHC and QuChem VBox tasks... Which leads one to assume there might be something peculiar with the Rosetta VBox tasks themselves...? regards, Tim Founder, UK BOINC Team Join UK BOINC Team: http://www.ukboincteam.org.uk/newforum |
Bryn Mawr Joined: 26 Dec 18 Posts: 390 Credit: 12,073,013 RAC: 4,827 |
"Look at the difference between CPU time and elapsed time, either there is something serious running alongside Boinc or, far more likely, those tasks are dead." Yes, there is a problem with some of the Rosetta VBox tasks that causes this behaviour. |
.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
"So, it might be my old CPU on this one host could be 'past it' - maybe the right CPU 'core-functions' are not up to the mark ...but it works fine with LHC and QuChem VBox tasks..." Whatever the CPU is, it is good enough to run them, so it is OK in that way. "Which leads one to assume there might be something peculiar with the Rosetta VBox tasks themselves... ?" Now there is an understatement... |
UBT - Timbo Joined: 25 Sep 05 Posts: 20 Credit: 2,291,266 RAC: 0 |
"Which leads one to assume there might be something peculiar with the Rosetta VBox tasks themselves... ?" Hi. Yup - it certainly seems like that :-( You'd have thought that a "tech admin" overseeing the results returned would have recognised that a certain percentage were taking far too long to be reported, would be actively figuring out that there was a problem, and would fix it. Instead, the situation seems to be that volunteers' computers are wasting time, money and electricity by spinning their wheels, due to Rosetta's poor and inefficient management of the tasks they make available. :-( regards, Tim Founder, UK BOINC Team Join UK BOINC Team: http://www.ukboincteam.org.uk/newforum |
Bryn Mawr Joined: 26 Dec 18 Posts: 390 Credit: 12,073,013 RAC: 4,827 |
Simple solution, I just refuse to run the Python tasks - too buggy and too resource hungry. |
Sid Celery Joined: 11 Feb 08 Posts: 2121 Credit: 41,179,074 RAC: 11,480 |
"I don't think I have 'stalled' tasks - as the %age work done is still increasing" There's no way of telling from the task manager which task is running or not. The difference between CPU time and elapsed time is telling you <exactly> what's happening with the task. It's the very definition of a "stalled" or "zombie" task. Things are no more complicated than that. Knowing why is the researchers' problem. We only need to know that they've stopped and, of the hundreds I've seen, they <never ever> restart, and nothing you can do will change that. Abort on sight. Don't worry about why, just do it and get on with your day. Quoting my earlier message referring specifically to VBox tasks: "Repeating my earlier message for those who haven't seen it: If you have a task you think is stalled or taking a long time, click on it and select properties on the left." |
dcdc Joined: 3 Nov 05 Posts: 1831 Credit: 119,526,853 RAC: 5,737 |
BOINCTasks shows whether a task is using CPU time or not so you can see what to abort. https://efmer.com/boinctasks/download-boinctasks/ |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,001,518 RAC: 6,291 |
"BOINCTasks shows whether a task is using CPU time or not so you can see what to abort." I use BOINCTasks on Windows and it is very obvious when a Rosetta WU hangs. The CPU usage goes to zero and stays there. I have never seen one finish after the CPU goes to 0%. On Linux I use "top -i -c -d3" to get a similar display, and I press "SHIFT P" to sort processes by CPU usage. "-i" shows only running processes; "-c" shows the command line so you can see what is burning CPU; "-d 3" samples every 3 seconds so I can read the display. I have two computers with near-identical configurations, and I saw the number of stalls/hangs increase SIGNIFICANTLY when I simply updated VirtualBox to a newer version than the one that comes with BOINC. When I uninstalled BOINC and VirtualBox and reinstalled them, the problems cleared up. It appears the Rosetta developers/integrators introduced some dependency on a particular VirtualBox version. Using VirtualBox was supposed to reduce the Rosetta developers' problems with different environments. It looks more like they just put a 3 GB vbox wrapper around it and introduced a new set of problems. BOINC startup times when running Rosetta WUs are now minutes instead of seconds. Checkpoints that write GBs of data to the BOINC drive are going to kill volunteer hardware. Excess memory demands exhaust memory and add to the unnecessary excess power needed to run Rosetta WUs. |
Sid Celery Joined: 11 Feb 08 Posts: 2121 Credit: 41,179,074 RAC: 11,480 |
"I use Windows BOINCTasks and it is very obvious when a Rosetta WU hangs. The CPU usage goes to zero and stays. I have never seen one finish after the CPU goes to 0%." A new one I've seen is a task with zero elapsed time and a status of "waiting to run". They never ever start either. Annoying, but quicker to abort. |
©2024 University of Washington
https://www.bakerlab.org