Problems and Technical Issues with Rosetta@home

Sid Celery
Joined: 11 Feb 08
Posts: 2117
Credit: 41,147,428
RAC: 16,343
Message 105617 - Posted: 21 Mar 2022, 13:15:46 UTC - in response to Message 105583.  

This is a task that is still running but barely for some reason. (aagb-mPPS-mPHE-mACHC13T-ACHC12C)
It was running fine until I shut it down and restarted.
...
I paused BOINC for a bit because it was overloading my system for some reason:
2022-03-19 19:04:46 (8908): VM state change detected. (old = 'running', new = 'paused')
2022-03-19 20:11:49 (8908): VM state change detected. (old = 'paused', new = 'running')
2022-03-19 20:11:54 (8908): Guest Log: 11:22:16.381991 timesync vgsvcTimeSyncWorker: Radical host time change: 4 033 163 000 000ns (HostNow=1 647 717 113 448 000 000 ns HostLast=1 647 713 080 285 000 000 ns)
2022-03-19 20:12:04 (8908): Guest Log: 11:22:26.382377 timesync vgsvcTimeSyncWorker: Radical guest time change: 4 509 674 906 000ns (GuestNow=1 647 717 123 448 400 000 ns GuestLast=1 647 712 613 773 494 000 ns fSetTimeLastLoop=true )
...
Shutdown #2 for the night and restart
...
2022-03-20 09:19:50 (14368): Setting CPU throttle for VM. (98%)

Greg, these are extracts from one of your reports. There's something you've said here I can confirm.

When you (or I) pause, reboot, or otherwise stop processing and then restart, that is when I notice zombie tasks appearing, straight away afterwards.
Going back to what you were saying earlier, it may well be that aagb tasks have the most trouble restarting, but check them all anyway.

And again, there's a line in there talking about "Setting CPU throttle for VM. (98%)"

So what I'm going to suggest is to be very wary after pausing or rebooting.
Re-check tasks to see if they've restarted ok.
If they haven't restarted using CPU time, abort them. They'll have done all the work they're going to do.
Maybe 2 out of 10 will have a problem in my experience.

Once you do this, see if any other problems develop. In my very limited experience, they won't as you'll have addressed the issues as soon as they appear.
Give it a few days and report how things are going.
ID: 105617

zxcvbob
Joined: 4 Jan 06
Posts: 8
Credit: 830,878
RAC: 0
Message 105618 - Posted: 21 Mar 2022, 13:55:37 UTC - in response to Message 105597.  

Yes it was an aagb* task. I aborted it and uninstalled vbox. I'm running 4.2 tasks just fine, and am going to attach a couple more computers to R@H today (without vbox.) I also have an old Xeon-based server with Linux installed (I don't run it much because the cooling fan is so loud); I wonder if it will run those tasks natively?
ID: 105618

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,717,270
RAC: 11,974
Message 105619 - Posted: 21 Mar 2022, 15:18:48 UTC - in response to Message 105618.  

Yes it was an aagb* task. I aborted it and uninstalled vbox. I'm running 4.2 tasks just fine, and am going to attach a couple more computers to R@H today (without vbox.) I also have an old Xeon-based server with Linux installed (I don't run it much because the cooling fan is so loud); I wonder if it will run those tasks natively?
Welcome to the club, I have two dual Xeon X5650 computers. Change the fan, you can get lovely quiet things on Ebay. I've done the reverse, I have mine cooled with some very loud 120V 6 inch fans I bought 30 years ago as a teenager from a bankruptcy recycling company. Not sure what they were from, but they have solid steel very sharp blades! Hurts if your finger gets in there, or on the 120V terminals!
ID: 105619

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 105624 - Posted: 21 Mar 2022, 18:01:06 UTC - in response to Message 105617.  

Sid, again aagb

Started at 10:56; it is now 18:55.
Right off the bat, the times are wrong:

2022-03-21 12:38:25 (7640): Status Report: Elapsed Time: '6000.608391'
2022-03-21 12:38:25 (7640): Status Report: CPU Time: '16.687500'

6000 seconds elapsed for 16 seconds of CPU time? WTF? And that's just under 2 hours in.

And then this:
2022-03-21 14:20:03 (7640): Status Report: Elapsed Time: '12001.508751'
2022-03-21 14:20:03 (7640): Status Report: CPU Time: '28.937500'
No pause still.

2022-03-21 17:57:22 (7640): Status Report: Elapsed Time: '24002.140787'
2022-03-21 17:57:22 (7640): Status Report: CPU Time: '49.781250'

Never paused once.

So it's deeper than my machine and BOINC.
Killed it at 62.69% after over 7.5 hrs of working.

What a joke! .19% of a core. Waste of time!
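For anyone who wants to put a number on the gap between elapsed time and CPU time in reports like the ones above, here is a minimal Python sketch. It assumes the log keeps the `Status Report: Elapsed Time` / `CPU Time` line format quoted in this post. The names `cpu_utilization` and `SAMPLE_LOG` are made up for illustration and are not part of BOINC.

```python
import re

# Matches the two kinds of status lines quoted in the post above.
STATUS_RE = re.compile(r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'")


def cpu_utilization(log_text):
    """Return (elapsed_s, cpu_s, percent_of_one_core) for each Elapsed/CPU pair."""
    results = []
    elapsed = None
    for line in log_text.splitlines():
        match = STATUS_RE.search(line)
        if not match:
            continue
        kind, value = match.group(1), float(match.group(2))
        if kind == "Elapsed Time":
            elapsed = value
        elif elapsed is not None and elapsed > 0:
            # A CPU Time line that follows an Elapsed Time line completes a pair.
            results.append((elapsed, value, 100.0 * value / elapsed))
            elapsed = None
    return results


# The six status lines copied from the post above.
SAMPLE_LOG = """\
2022-03-21 12:38:25 (7640): Status Report: Elapsed Time: '6000.608391'
2022-03-21 12:38:25 (7640): Status Report: CPU Time: '16.687500'
2022-03-21 14:20:03 (7640): Status Report: Elapsed Time: '12001.508751'
2022-03-21 14:20:03 (7640): Status Report: CPU Time: '28.937500'
2022-03-21 17:57:22 (7640): Status Report: Elapsed Time: '24002.140787'
2022-03-21 17:57:22 (7640): Status Report: CPU Time: '49.781250'
"""

for elapsed_s, cpu_s, percent in cpu_utilization(SAMPLE_LOG):
    print(f"elapsed {elapsed_s:9.0f} s   cpu {cpu_s:6.1f} s   {percent:.2f}% of one core")
```

On the three reports quoted here, each pair works out to roughly 0.2% to 0.3% of one core, so the task was effectively idle the whole time.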
ID: 105624

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,717,270
RAC: 11,974
Message 105625 - Posted: 21 Mar 2022, 18:11:45 UTC - in response to Message 105624.  
Last modified: 21 Mar 2022, 18:12:33 UTC

Some kind of a response from them would be nice, perhaps one of:

1) We know there's a problem but our programmers are too inept to fix it.

2) We didn't know there was a problem because our heads are buried in the sand.

3) We know there's a problem and we're working on fixing it by [insert date]
ID: 105625

robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,269,631
RAC: 2,588
Message 105628 - Posted: 21 Mar 2022, 19:07:39 UTC

An idea to check for the python tasks failing all at once:

Something may be writing into a location in the *.vdi files that is supposed to be read-only, and is shared among all the python tasks running at once.

To check for this:

1. While no python tasks are running, go to the shared directory for read-only Rosetta@Home files. On my computer, it is:

C:/ProgramData/BOINC/projects/boinc.bakerlab.org_rosetta

If your computer does not run under Windows 10, expect a different directory name.

For each file with a name ending with .vdi, make a copy elsewhere.

2. Allow many python tasks to start.

Watch for all of them to fail at once. If this happens, make a second copy of each of the files with names ending with .vdi.

Start a program that will check two binary files, and tell you if they are identical or not. Tell it to compare the old and new copies of each of the vdi files.

Let us know if the copies were identical or not. Unless you have special information on the file structures, don't bother with what the differences are.
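If no binary comparison tool is at hand, the check described above can be sketched in Python using only the standard library. This is an illustrative sketch, not a project-supplied tool: the function names `sha256_of` and `compare_vdi_copies` are hypothetical, and the two directory arguments stand for wherever the "before" and "after" copies were saved.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so multi-gigabyte .vdi files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_vdi_copies(before_dir, after_dir):
    """Map each .vdi name in before_dir to True (identical), False (differs),
    or None (no file of that name in after_dir)."""
    report = {}
    for before in sorted(Path(before_dir).glob("*.vdi")):
        after = Path(after_dir) / before.name
        if not after.exists():
            report[before.name] = None
        else:
            report[before.name] = sha256_of(before) == sha256_of(after)
    return report
```

Calling `compare_vdi_copies(old_copies, new_copies)` with the two copy directories returns one entry per .vdi file. Any `False` entry would mean the supposedly read-only image changed between the two copies.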


Another idea: vbox64 (which is used by python tasks) may have problems restarting from checkpoints, but not always. I have not thought of a way to check for that.
ID: 105628

kotenok2000
Joined: 22 Feb 11
Posts: 258
Credit: 483,503
RAC: 133
Message 105629 - Posted: 21 Mar 2022, 19:30:56 UTC - in response to Message 105628.  
Last modified: 21 Mar 2022, 19:31:08 UTC

VirtualBox tasks do not write to the vdi files in boinc.bakerlab.org_rosetta.
They copy the vdi files in full to the slot directories.
All 7 gigabytes.
ID: 105629

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 105630 - Posted: 21 Mar 2022, 23:00:07 UTC - in response to Message 105625.  

Some kind of a response from them would be nice, perhaps one of:

1) We know there's a problem but our programmers are too inept to fix it.

2) We didn't know there was a problem because our heads are buried in the sand.

3) We know there's a problem and we're working on fixing it by [insert date]



HAHAHAHA...yeah right. Take #1 and add we don't care
ID: 105630

Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 105631 - Posted: 21 Mar 2022, 23:01:48 UTC

Idiot server kicked me off after 2 aborts and 1 error all from aagb
If they would make things correct the first time I wouldn't have this problem.
ID: 105631

Sid Celery
Joined: 11 Feb 08
Posts: 2117
Credit: 41,147,428
RAC: 16,343
Message 105632 - Posted: 21 Mar 2022, 23:07:27 UTC - in response to Message 105624.  

Never paused once

That's fine. If they pause, abort them. They never unpause in my experience.
But I have plenty of running and successfully completed aagb tasks too.
They may certainly be most susceptible, but it's not all of them.
Just check them 10 minutes after they've started and you'll know which way it's heading, then take the appropriate action.
ID: 105632

Bruce Morse
Joined: 8 Oct 05
Posts: 5
Credit: 816,727
RAC: 0
Message 105640 - Posted: 22 Mar 2022, 13:17:58 UTC

I have two applications of
Rosetta python projects 1.03 (vbox64) running.
aagb-SAR_pp-…..
And
aaam-PRO_pp-….
They are currently showing (elapsed time; time remaining):
5d 14:45:56; 00:00:04
5d 14:39:50; 00:00:04

The elapsed timer is running.
The time remaining has been getting progressively longer and longer between changes - currently measured in hours.

Any ideas?
ID: 105640

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,717,270
RAC: 11,974
Message 105641 - Posted: 22 Mar 2022, 13:20:51 UTC - in response to Message 105640.  

I have two applications of
Rosetta python projects 1.03 (vbox64) running.
aagb-SAR_pp-…..
And
aaam-PRO_pp-….
They are currently showing (elapsed time; time remaining):
5d 14:45:56; 00:00:04
5d 14:39:50; 00:00:04

The elapsed timer is running.
The time remaining has been getting progressively longer and longer between changes - currently measured in hours.

Any ideas?
You need to see how much CPU time they're actually using. These tasks tend to sit doing nothing. If you have BoincTasks, it shows real CPU usage. Or you can use Windows Task Manager. If they aren't doing anything, abort them.
ID: 105641

Bruce Morse
Joined: 8 Oct 05
Posts: 5
Credit: 816,727
RAC: 0
Message 105642 - Posted: 22 Mar 2022, 13:51:28 UTC - in response to Message 105640.  

Additional notes:
Menu options in BOINC Manager are no longer functioning, including the snooze, about and exit options from the taskbar;
Outlook will no longer start.

Just noticed that my time remaining NOW reads three (3) seconds.

The version of Vbox is the one distributed with BOINC and has not been updated.

Is/Are there some settings in Vbox that *I* should have modified?

Vbox shows both tasks running and a pop up indicates a new version available: 6.1.32
Current version: 6.0.14r133895 (Qt5.6.2)

There is sporadic activity.
ID: 105642

Bruce Morse
Joined: 8 Oct 05
Posts: 5
Credit: 816,727
RAC: 0
Message 105644 - Posted: 22 Mar 2022, 14:01:13 UTC - in response to Message 105641.  

Checking Windows 10 Task Manager:
baseline - There is very little CPU usage, but there are some bursts of usage;
memory - minimal changes; disk - some;
Vbox Ethernet- zero; and
LAN network - some.
ID: 105644

kotenok2000
Joined: 22 Feb 11
Posts: 258
Credit: 483,503
RAC: 133
Message 105645 - Posted: 22 Mar 2022, 14:06:44 UTC - in response to Message 105644.  

can you open virtualbox gui, press show and look at what virtualbox vm screens are showing?
Also open task C:\programdata\BOINC\slots\[slotnumber]\shared and look at file modification times?
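For those who would rather script the modification-time check than click through Explorer, here is a minimal Python sketch. The function name `list_mtimes` is hypothetical, and the directory you pass in is whichever slot's shared folder applies on your machine; nothing here is specific to BOINC.

```python
import os
import time


def list_mtimes(directory):
    """Return (name, mtime, age_in_hours) for each file in directory, newest first."""
    now = time.time()
    entries = []
    with os.scandir(directory) as listing:
        for entry in listing:
            if entry.is_file():
                mtime = entry.stat().st_mtime
                entries.append((entry.name, mtime, (now - mtime) / 3600.0))
    # Newest file first, so a stalled task shows up as a large age on the top row.
    entries.sort(key=lambda item: item[1], reverse=True)
    return entries


def print_mtimes(directory):
    """Print one line per file: last-modified timestamp, age in hours, name."""
    for name, mtime, age_hours in list_mtimes(directory):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        print(f"{stamp}  ({age_hours:8.1f} h ago)  {name}")
```

If the newest file in the shared folder is days old while the task claims to be running, that suggests the work inside the VM has stopped.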
ID: 105645

Bruce Morse
Joined: 8 Oct 05
Posts: 5
Credit: 816,727
RAC: 0
Message 105646 - Posted: 22 Mar 2022, 14:25:11 UTC - in response to Message 105645.  
Last modified: 22 Mar 2022, 14:29:17 UTC

can you open virtualbox gui, press show and look at what virtualbox vm screens are showing?


Looks like it never started?
Last line:
Intel MKL FATAL ERROR: Error on loading function mkl_lapack_ps_mc3_dsytrf_l_small.

Also open task
C:\programdata\BOINC\slots\[slotnumber]\shared and look at file modification times?


Most recent are 03/16/2022 05:46 PM
(Um.. today: 03/22/2022)

Kinda saddens me - it appears I have wasted many days.
ETA: left it running for now in case anyone wants additional information.
ID: 105646

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,717,270
RAC: 11,974
Message 105647 - Posted: 22 Mar 2022, 14:58:59 UTC - in response to Message 105646.  

can you open virtualbox gui, press show and look at what virtualbox vm screens are showing?
Looks like it never started?
Last line:
Intel MKL FATAL ERROR: Error on loading function mkl_lapack_ps_mc3_dsytrf_l_small.
I get that every single time on 5 of my 7 computers. Nobody knows why.

Which of your computers are having problems? From my end it looks like older computers don't work. Mine are:

Ryzen 9 3900XT - works all the time on Rosetta Python VB.
i5 8600K - works all the time on Rosetta Python VB.
Core 2 Quad Q8400 - gets the same error as you every time.
Pentium N3700 - gets the same error as you every time.
Dual Xeon X5650 - gets the same error as you every time.
Dual Xeon X5650 - gets the same error as you every time.
i3 M350 - gets the same error as you every time.
ID: 105647

Bruce Morse
Joined: 8 Oct 05
Posts: 5
Credit: 816,727
RAC: 0
Message 105648 - Posted: 22 Mar 2022, 15:15:09 UTC
Last modified: 22 Mar 2022, 15:23:49 UTC

I currently have only two computers actively running Vbox:

Toshiba laptop: Intel Pentium CPU 2020 (two core hyper thread)@ 2.4GHz; 16.0 GB RAM; Win10/home. & doesn’t want to play nice

6-core 3.2 GHz Intel Core i7-8700; 16.0 GB RAM; Win10/home. IS playing nice.
ID: 105648

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,717,270
RAC: 11,974
Message 105649 - Posted: 22 Mar 2022, 15:36:07 UTC - in response to Message 105648.  

I currently have only two computers actively running Vbox:

Toshiba laptop: Intel Pentium CPU 2020 (two core hyper thread)@ 2.4GHz; 16.0 GB RAM; Win10/home. & doesn’t want to play nice

6-core 3.2 GHz Intel Core i7-8700; 16.0 GB RAM; Win10/home. IS playing nice.
You seem to be getting the same as me. Newer machines work, older machines don't. I'm going to guess the Python app is using newer instruction sets only available on newer processors, and the incompetent fools at Rosetta are handing them out to everybody instead of only those that can handle it. They must be relying on you failing a lot of them so it automatically switches off your computer from Python, but the trouble is they don't just quickly fail, they sit doing nothing for days. And you have to fail 100 of them (not just abort them) before it bans you from Python.
ID: 105649

[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,524,375
RAC: 7,553
Message 105650 - Posted: 22 Mar 2022, 19:44:39 UTC - in response to Message 105649.  

Newer machines work, older machines don't. I'm going to guess the Python app is using newer instruction sets only available on newer processors, and the incompetent fools at Rosetta are handing them out to everybody instead of only those that can handle it.


If I'm not wrong, VirtualBox exposes instruction sets automatically to guest machines, so your idea is not so foolish.
The Python app is running TrRosetta simulations that are, probably, compiled against TensorFlow. Someone has had this problem with an old CPU.
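One way to test the instruction-set guess on a Linux machine (or inside a Linux guest) is to look at the `flags` line in /proc/cpuinfo. The Python sketch below does that. Which extension, if any, the Python app actually requires is not established in this thread, so the default `wanted` list is an assumption for illustration only, and the function names `cpu_flags` and `missing_extensions` are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            # Every core repeats the same line on typical machines; the first is enough.
            return set(value.split())
    return set()


def missing_extensions(cpuinfo_text, wanted=("sse4_2", "avx", "avx2", "fma")):
    """List the entries of `wanted` that the CPU does not report.

    The default list is a guess at what a modern numerical build might use,
    not a confirmed requirement of the Rosetta Python app.
    """
    flags = cpu_flags(cpuinfo_text)
    return [name for name in wanted if name not in flags]
```

On Linux, `missing_extensions(open("/proc/cpuinfo").read())` returns an empty list when every flag in `wanted` is reported. Comparing the result on a machine that works against one that fails would show whether the failures line up with a missing extension.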
ID: 105650

©2024 University of Washington
https://www.bakerlab.org