Message boards : Number crunching : Rosetta 4.1+ and 4.2+
Author | Message |
---|---|
Meerb Joined: 10 Dec 10 Posts: 3 Credit: 2,925,682 RAC: 0 |
I have 3 computers running Darwin 19.4.0 (macOS 10.15.4) and BOINC 7.14.4 with Rosetta 4.12. Two of the three computers contain 7th gen and 9th gen Intel CPUs and are running WUs just fine, but the computer with the 4th gen CPU is consistently encountering computation errors less than 60 seconds after the start of a run, as many others have reported. I am seeing the same issue with two other computers containing a 2nd gen and two 4th gen Intel CPUs. The crash report indicates an illegal instruction, with exception type EXC_BAD_INSTRUCTION (SIGILL). Crash Log Excerpt: Process: rosetta_4.12_x86_64-apple-darwin [78172] Path: /Library/Application Support/BOINC Data/*/rosetta_4.12_x86_64-apple-darwin Identifier: rosetta_4.12_x86_64-apple-darwin Version: 0 Code Type: X86-64 (Native) Parent Process: ??? [70780] Responsible: BOINCManager [70776] User ID: 505 Date/Time: 2020-04-02 18:27:51.869 -0500 OS Version: Mac OS X 10.15.4 (19E266) Report Version: 12 Anonymous UUID: Time Awake Since Boot: 4000 seconds System Integrity Protection: enabled Crashed Thread: 0 Dispatch queue: com.apple.main-thread Exception Type: EXC_BAD_INSTRUCTION (SIGILL) Exception Codes: 0x0000000000000001, 0x0000000000000000 Exception Note: EXC_CORPSE_NOTIFY Termination Signal: Illegal instruction: 4 Termination Reason: Namespace SIGNAL, Code 0x4 Terminating Process: rosetta_4.12_x86_64-apple-darwin [78172] Application Specific Information: dyld2 mode abort() called rosetta_4.12_x86_64-apple-darwin(78172,0x119c54dc0) malloc: *** error for object 0x116998000: pointer being freed was not allocated |
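A general note on the SIGILL in the crash log: one common cause of an illegal-instruction fault is a binary built with instructions the CPU does not implement, although nothing in this thread confirms that this is what happens with 4.12. As a minimal sketch for comparing a working and a failing Mac, the script below reports whether a few instruction-set flags appear in the feature strings that Intel Macs expose through `sysctl`. The flags checked (SSE4.2, AVX1.0, AVX2, FMA) are an assumption chosen for illustration, not a list of what Rosetta requires, and the helper names `has_cpu_flag` and `report_flags` are hypothetical.

```shell
#!/bin/sh
# Return 0 if flag $2 appears as a whole word in the space-separated list $1.
has_cpu_flag() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Print "FLAG: yes" or "FLAG: no" for each flag named after the feature list.
report_flags() {
    features=$1
    shift
    for flag in "$@"; do
        if has_cpu_flag "$features" "$flag"; then
            echo "$flag: yes"
        else
            echo "$flag: no"
        fi
    done
}

# Only query sysctl on macOS; these machdep.cpu keys are specific to Intel Macs.
if [ "$(uname -s)" = "Darwin" ]; then
    sysctl -n machdep.cpu.brand_string 2>/dev/null
    features="$(sysctl -n machdep.cpu.features 2>/dev/null) $(sysctl -n machdep.cpu.leaf7_features 2>/dev/null)"
    # Assumption: these four flags are illustrative, not Rosetta's known requirements.
    report_flags "$features" SSE4.2 AVX1.0 AVX2 FMA
fi
```

Running it on one machine that completes 4.12 tasks and one that fails, then comparing the two outputs, would show whether the machines differ in any of these flags. It would not by itself prove the cause of the crash.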
RT Joined: 14 Mar 20 Posts: 6 Credit: 1,155,031 RAC: 0 |
Looking at the issue, it appears the client thinks the zip files for the work units are corrupt? Nearly all tasks that have failed so far have the same error: zip file probably corrupt (illegal instruction) <core_client_version>7.14.2</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255)</message> <stderr_txt> command: rosetta_4.12_x86_64-apple-darwin -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 3ho6nx2y_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 3ho6nx2y_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3644681 Starting watchdog... Watchdog active. error: zipfile probably corrupt (illegal instruction) </stderr_txt> ]]> |
andrzej Joined: 13 Mar 20 Posts: 4 Credit: 21,560 RAC: 0 |
Same here on an older iMac with a Core 2 Duo CPU. With Rosetta 4.12, all my tasks fail with a computation error a second or two after they start. Application Rosetta 4.12 Name hugh2020_HHH_rd4_0628_E18W_fragments_abinitio_SAVE_ALL_OUT_905068_626 State Computation error Received Friday, 03 April 2020 at 18:55:40 Report deadline Monday, 06 April 2020 at 18:55:40 Estimated computation size 80,000 GFLOPs CPU time --- Elapsed time 00:00:02 Executable rosetta_4.12_x86_64-apple-darwin Fri 3 Apr 18:59:24 2020 | Rosetta@home | Started download of hugh2020_HHH_rd4_0628_K6F_fragments_fold_data.zip Fri 3 Apr 18:59:27 2020 | Rosetta@home | Finished download of hugh2020_HHH_rd4_0628_K6F_fragments_fold_data.zip Fri 3 Apr 18:59:30 2020 | Rosetta@home | Starting task hugh2020_HHH_rd4_0628_K6F_fragments_abinitio_SAVE_ALL_OUT_905059_813_1 Fri 3 Apr 18:59:31 2020 | Rosetta@home | Computation for task hugh2020_HHH_rd4_0628_K6F_fragments_abinitio_SAVE_ALL_OUT_905059_813_1 finished Fri 3 Apr 18:59:31 2020 | Rosetta@home | Output file hugh2020_HHH_rd4_0628_K6F_fragments_abinitio_SAVE_ALL_OUT_905059_813_1_r987270243_0 for task hugh2020_HHH_rd4_0628_K6F_fragments_abinitio_SAVE_ALL_OUT_905059_813_1 absent Fri 3 Apr 19:00:39 2020 | Rosetta@home | Sending scheduler request: To fetch work. Fri 3 Apr 19:00:39 2020 | Rosetta@home | Reporting 1 completed tasks Fri 3 Apr 19:00:39 2020 | Rosetta@home | Requesting new tasks for CPU Fri 3 Apr 19:00:40 2020 | Rosetta@home | Scheduler request completed: got 0 new tasks |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
So after someone mentioned that they had 4.12 tasks running fine on a more recent macOS machine, I did a fresh install (BOINC 7.14.4) on the most recent iMac I have access to, a 2017 iMac 5K with an i7-7700K Kaby Lake chip @ 4.2GHz, 32GB RAM, and the latest Catalina build. It is running Rosetta 4.12 tasks just fine (at least for the last 45 minutes or so). So there's something about older CPUs running OS X that the new Rosetta 4.12 doesn't seem to like. |
csbyseti Joined: 24 Dec 05 Posts: 11 Credit: 5,053,421 RAC: 15,376 |
Version 4.12 seems faulty under Ubuntu 18.04. First I wondered about the 12-hour runtime instead of 8 hours, then I saw only 20 credits for the work. The stderr output shows: Stderr output <core_client_version>7.9.3</core_client_version> <![CDATA[ <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.12_i686-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol predictor_v11_boinc--fuse--covid_spike_design_boinc_v1.xml @flags_jhr_cv -in:file:silent 9au6ic2s_Junior_HalfRoid_vs_COVID-19_design1_dev.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 9au6ic2s_Junior_HalfRoid_vs_COVID-19_design1_dev.zip @9au6ic2s_Junior_HalfRoid_vs_COVID-19_design1_dev.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 Starting watchdog... Watchdog active. BOINC:: CPU time: 43777.2s, 14400s + 28800s[2020- 4- 3 11:11:53:] :: BOINC WARNING! cannot get file size for default.out.gz: could not open file. Output exists: default.out.gz Size: -1 InternalDecoyCount: 0 (GZ) ----- 0 ----- Stream information inconsistent. Writing W_0000001 ====================================================== DONE :: 1 starting structures 43777.2 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== 11:11:53 (4726): called boinc_finish(0) </stderr_txt> ]]> Only 1 structure was calculated in 12 hours, while the Windows counterpart shows over 70 structures in 8 hours. It seems that the Linux app doesn't really calculate. All 12 results show this behaviour. |
Sid Celery Joined: 11 Feb 08 Posts: 2123 Credit: 41,204,457 RAC: 10,266 |
BOINC:: CPU time: 43777.2s, 14400s + 28800s[2020- 4- 3 11:11:53:] :: BOINC Streaming information inconsistent, but this time on an 8hr task +4hrs for the watchdog to cut in and the 1st decoy still isn't complete :( |
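The numbers in csbyseti's stderr output can be turned into a per-decoy rate, which makes the gap to the Windows runs easier to see. The sketch below pulls the decoy count and the CPU seconds out of the two summary lines, assuming they keep the wording shown in that post ("... cpu seconds" and "This process generated N decoys ..."). The helper name `decoy_rate` is hypothetical.

```shell
#!/bin/sh
# Read Rosetta stderr text on stdin and print:
#   <decoys> <cpu_seconds> <cpu_seconds_per_decoy>
decoy_rate() {
    awk '
        # e.g. "DONE :: 1 starting structures 43777.2 cpu seconds"
        match($0, /[0-9.]+ cpu seconds/) {
            cpu = substr($0, RSTART, RLENGTH)
            sub(/ cpu seconds/, "", cpu)
        }
        # e.g. "This process generated 1 decoys from 1 attempts"
        match($0, /generated [0-9]+ decoys/) {
            n = substr($0, RSTART, RLENGTH)
            gsub(/[^0-9]/, "", n)
        }
        END {
            # Force numeric context before comparing.
            n += 0; cpu += 0
            if (n > 0 && cpu > 0)
                printf "%d %.1f %.1f\n", n, cpu, cpu / n
            else
                print "no decoy summary found"
        }
    '
}

# Usage: decoy_rate < stderr.txt
```

Fed the output quoted above, this prints `1 43777.2 43777.2`, i.e. about 43,777 CPU seconds for the single decoy. For comparison, the "over 70 structures in 8 hours" reported for Windows works out to at most 28800 / 70 ≈ 411 seconds per structure.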
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,277,304 RAC: 1,589 |
I received 9 tasks on rosetta 4.12 a few hours ago. They seem to use a lot of RAM compared to the older tasks (over 1.5 GB after 15 minutes of working on them). I'm not sure you can shut down only the 4.12 tasks. You may find it more useful to tell BOINC to run fewer tasks at once, so it will have more memory available for each of them. |
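The advice to run fewer tasks at once can be worked out with simple arithmetic. The sketch below assumes roughly 1.5 GB per task (the figure from the post above) and, purely as hypothetical example values, an 8 GB machine with 2 GB kept free for the OS. It then prints an `app_config.xml` body using BOINC's `project_max_concurrent` setting, which, as far as I understand the BOINC client configuration documentation, belongs in the project's folder inside the BOINC data directory. The function names are made up for this example.

```shell
#!/bin/sh
# Number of tasks that fit in memory: floor((total - reserve) / per_task),
# but never less than 1. All arguments are in megabytes.
max_tasks() {
    total_mb=$1; reserve_mb=$2; per_task_mb=$3
    n=$(( (total_mb - reserve_mb) / per_task_mb ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Emit an app_config.xml body that caps concurrent tasks for the whole project.
app_config() {
    cat <<EOF
<app_config>
   <project_max_concurrent>$1</project_max_concurrent>
</app_config>
EOF
}

# Hypothetical example: 8192 MB RAM, 2048 MB reserved, 1536 MB per task.
app_config "$(max_tasks 8192 2048 1536)"
```

With these example values, (8192 - 2048) / 1536 = 4, so the generated file limits the project to four tasks at a time. After saving the file, the client has to re-read its config files (or be restarted) before the limit applies.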
HoomanSacrifice Joined: 25 Dec 19 Posts: 1 Credit: 635,449 RAC: 0 |
Hey I'm having issues with R@h. For some reason, I am not receiving any tasks from R@h. I've updated it multiple times, and still no tasks. My friend next to me, who is also Running R@H, is saying he's getting tasks. So I don't know what I should do to start receiving tasks. Is there a way for me to fix this? |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
HoomanSacrifice wrote: "Hey I'm having issues with R@h. For some reason, I am not receiving any tasks from R@h. I've updated it multiple times, and still no tasks. My friend next to me, who is also Running R@H, is saying he's getting tasks. So I don't know what I should do to start receiving tasks. Is there a way for me to fix this?" Scroll to the bottom of the server status page and look at the "Tasks by application". ( https://boinc.bakerlab.org/rosetta/server_status.php ) Then note the number of tasks available to send. If it says 0, then there's no work to send out. The status page isn't the fastest to update, but should give you a rough idea of what's available. Occasionally tasks time out and get sent back, so people occasionally are getting some random tasks here or there, but currently (as of me writing this) there aren't any in the queue to send out so until they dump a big load of work into the pipeline we are waiting. Leave it running, it will grab something eventually. |
Meerb Joined: 10 Dec 10 Posts: 3 Credit: 2,925,682 RAC: 0 |
I was poking around a bit more and chatting with others on our team, and discovered something after reading Aurum's post to this morning's COVID-19 update (https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13702&postid=93202#93202). He mentioned something about the misuse of the L3 cache that was causing issues he was noticing on Xeon E5's (didn't specify which architecture, and the computers are hidden). I looked at the 2nd, 4th, 7th and 9th gen Intel processors we have on our project, and found that the 2nd and 4th gen both have 3MB of L3 cache, and the 7th and 9th gen Intel processors have 4MB of L3 cache per each two physical cores (4 MB for the 7th gen core i5, and a 12MB SmartCache for the 9th gen core i7). Maybe it's coincidence; but I find it curious. Hope this helps someone track down what's going on. Cheers! |
entity Joined: 8 May 18 Posts: 19 Credit: 5,946,387 RAC: 8,724 |
Meerb wrote: "I was poking around a bit more and chatting with others on our team, and discovered something after reading Aurum's post to this morning's COVID-19 update (https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13702&postid=93202#93202). He mentioned something about the misuse of the L3 cache that was causing issues he was noticing on Xeon E5's (didn't specify which architecture, and the computers are hidden). I looked at the 2nd, 4th, 7th and 9th gen Intel processors we have on our project, and found that the 2nd and 4th gen both have 3MB of L3 cache, and the 7th and 9th gen Intel processors have 4MB of L3 cache per each two physical cores (4 MB for the 7th gen core i5, and a 12MB SmartCache for the 9th gen core i7). Maybe it's coincidence; but I find it curious. Hope this helps someone track down what's going on. Cheers!" I've been seeing this in several threads recently. This is a response we got at WCG from the MIP project developers a couple of years ago: "The short version is that Rosetta, the program being used by the MIP to fold the proteins on all of your computers*, is pretty hungry when it comes to cache. A single instance of the program fits well into a small cache. However, when you begin to run multiple instances there is more contention for that cache. This results in L3 cache misses and the CPU sits idle while we have to make a long trip to main memory to get the data we need. This behavior is common for programs that have larger memory requirements. It's also not something that we as developers often notice; we typically run on large clusters and use hundreds to thousands of cores in parallel on machines. Nothing seemed slower for us because we are always running in that regime. We are looking to see if we can improve the cache behavior. Rosetta is ~2 million lines of C++ and improving the cache performance might involve changing some pretty fundamental parts. We have some ideas of where to start digging, but I can't make any promises. Long term, identifying these issues may end up improving Rosetta for everyone that uses it so pat yourselves on the back for that!" |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
Meerb wrote: "I was poking around a bit more and chatting with others on our team, and discovered something after reading Aurum's post to this morning's COVID-19 update (https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13702&postid=93202#93202). He mentioned something about the misuse of the L3 cache that was causing issues he was noticing on Xeon E5's (didn't specify which architecture, and the computers are hidden). I looked at the 2nd, 4th, 7th and 9th gen Intel processors we have on our project, and found that the 2nd and 4th gen both have 3MB of L3 cache, and the 7th and 9th gen Intel processors have 4MB of L3 cache per each two physical cores (4 MB for the 7th gen core i5, and a 12MB SmartCache for the 9th gen core i7). Maybe it's coincidence; but I find it curious. Hope this helps someone track down what's going on. Cheers!" Just to tie into this, I managed to test on my 2015 iMac (8 core 4GHz Core i7, 6700K Skylake based, 8MB L3) and it's now running 4.12 tasks. Any machine running OS X that's pre-Skylake seems to crash on 4.12 tasks, including several Xeon Mac Pros. Oddly, my lone Windows box is running 4.12 fine with the same Xeon processor family my Mac Pro has, so it's an OS code bug. |
Tomcat雄猫 Joined: 20 Dec 14 Posts: 180 Credit: 5,386,173 RAC: 0 |
Meerb wrote: "I was poking around a bit more and chatting with others on our team, and discovered something after reading Aurum's post to this morning's COVID-19 update (https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13702&postid=93202#93202). He mentioned something about the misuse of the L3 cache that was causing issues he was noticing on Xeon E5's (didn't specify which architecture, and the computers are hidden). I looked at the 2nd, 4th, 7th and 9th gen Intel processors we have on our project, and found that the 2nd and 4th gen both have 3MB of L3 cache, and the 7th and 9th gen Intel processors have 4MB of L3 cache per each two physical cores (4 MB for the 7th gen core i5, and a 12MB SmartCache for the 9th gen core i7). Maybe it's coincidence; but I find it curious. Hope this helps someone track down what's going on. Cheers!" That is interesting to hear. I've tested two 4.12 tasks on my early 2015 MacBook Pro equipped with a 14 nm "Broadwell" CPU wielding 2 cores and 4 threads clocked at 2.9GHz (got the 512GB/8GB variant, I believe the CPU is an i5-5257U, which has 3MB of cache) and they ran fine, netting slightly over 1000 credits/day per core. I am running macOS Catalina, however. Which version of macOS/OS X are you using? If it is an OS bug it might have been fixed in the latest macOS releases. That being said, I only run one BOINC task at a time to prevent overheating. |
bkil Joined: 11 Jan 20 Posts: 97 Credit: 4,433,288 RAC: 0 |
To prevent overheating on such a cooling-constrained system, it is much more productive to run on all cores but sleep often. For the same fan RPM, you could produce at least twice the RAC this way. The reasons: turbo boost raises the voltage when only a single core is running, so thermal output grows with the square of the voltage, and package power saving doesn't kick in while a core is constantly active. Somehow BOINC's runtime % preference isn't as power efficient, so I simply set everything to 100% and, in a root shell (sudo su), run something like this (but with temperature control, a deprivileged user, logging and whatnot): while true; do killall -em -CONT apple-darwin; sleep 0.5; killall -em -STOP apple-darwin; sleep 0.5; done Feel free to adjust the second sleep (the one after STOP) to fit your fan RPM target (to help dissipate heat, don't increase the one after CONT). Although for the sake of responsiveness, you may still opt to use only 50% of the cores (threads) or to suspend while the computer is active. I found the "suspend when other processes use the CPU" option to be inoperable on OS X. |
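The stop/continue loop described above can be written out as a small script. The version below is a minimal sketch and not bkil's actual script. It swaps `killall -em` for `pkill -f`, adds a trap so that tasks are resumed if the script is interrupted while they are stopped, and reports the resulting duty cycle. The match pattern `rosetta_`, the environment variable names, and the function names are all assumptions made for illustration. Fractional `sleep` values work on macOS and GNU systems but are not guaranteed by POSIX.

```shell
#!/bin/sh
# Hypothetical defaults; tune PAUSE_SECS to your fan-RPM target.
# Do not give this script a name that matches PATTERN, or it will stop itself.
PATTERN="${PATTERN:-rosetta_}"
RUN_SECS="${RUN_SECS:-0.5}"
PAUSE_SECS="${PAUSE_SECS:-0.5}"

# Percentage of wall-clock time the tasks are allowed to run.
duty_cycle() {
    awk -v run="$1" -v pause="$2" \
        'BEGIN { printf "%.0f\n", 100 * run / (run + pause) }'
}

# Send SIGCONT / SIGSTOP to every process whose command line matches PATTERN.
resume_tasks() { pkill -CONT -f "$PATTERN"; }
pause_tasks()  { pkill -STOP -f "$PATTERN"; }

throttle_loop() {
    # If interrupted, leave the tasks running rather than frozen.
    trap 'resume_tasks; exit 0' INT TERM
    while true; do
        resume_tasks; sleep "$RUN_SECS"
        pause_tasks;  sleep "$PAUSE_SECS"
    done
}

# Start the loop only when asked to: sh throttle.sh run
if [ "${1:-}" = "run" ]; then
    echo "duty cycle: $(duty_cycle "$RUN_SECS" "$PAUSE_SECS")%"
    throttle_loop
fi
```

With the default 0.5 s run and 0.5 s pause, the tasks run 50% of the time. Raising `PAUSE_SECS` to 1.5 would drop that to 25%. Signalling processes owned by the BOINC user generally requires running the script with sufficient privileges.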
csbyseti Joined: 24 Dec 05 Posts: 11 Credit: 5,053,421 RAC: 15,376 |
No 4.12 WU works on my Ubuntu Linux system; all have the same problem: Stderr output <core_client_version>7.9.3</core_client_version> <![CDATA[ <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.12_i686-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol predictor_v11_boinc--fuse--covid_spike_design_boinc_v1.xml @flags_jhr_cv -in:file:silent 6cp3nh2c_Junior_HalfRoid_vs_COVID-19_design1_dev.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 6cp3nh2c_Junior_HalfRoid_vs_COVID-19_design1_dev.zip @6cp3nh2c_Junior_HalfRoid_vs_COVID-19_design1_dev.flags -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 Starting watchdog... Watchdog active. BOINC:: CPU time: 43776.6s, 14400s + 28800s[2020- 4- 4 11:33:34:] :: BOINC WARNING! cannot get file size for default.out.gz: could not open file. Output exists: default.out.gz Size: -1 InternalDecoyCount: 0 (GZ) ----- 0 ----- Stream information inconsistent. Writing W_0000001 ====================================================== DONE :: 1 starting structures 43776.6 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== 11:33:34 (18264): called boinc_finish(0) </stderr_txt> ]]> I stopped all 4.12 WUs on the Linux system and am waiting for a bugfix. I switched this machine to TN-Grid; they also have work units specifically for the coronavirus problem. |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
Tomcat雄猫 wrote: "That is interesting to hear. I've tested two 4.12 tasks on my early 2015 MacBook Pro equipped with a 14 nm 'Broadwell' CPU wielding 2 cores and 4 threads clocked at 2.9GHz (got the 512GB/8GB variant, I believe the CPU is an i5-5257U, which has 3MB of cache) and they ran fine, netting slightly over 1000 credits/day per core. I am running macOS Catalina, however. Which version of macOS/OS X are you using? If it is an OS bug it might have been fixed in the latest macOS releases." I should have phrased that better. The OS doesn't crash; the Rosetta 4.12 tasks all fail with an "Error while computing" seconds after being downloaded. If you take a peek at my account you will see I run a range of macOS machines, mostly on Catalina but a few on Mojave and High Sierra. The newest machines (2015 and later, both Catalina) run 4.12 tasks fine. Anything older than the 2015 Macs won't run 4.12 tasks regardless of OS (a 2012 Mac mini running the latest Catalina won't). So it's not a macOS problem, it's just a coding issue in Rosetta for older CPUs running OS X. Hopefully a simple fix. All my machines pre-4.12 ran full load, 24/7 without issue. That said, as I mentioned before, my Mac Pros run the same CPU family (Intel Xeon X5670, X5690, all circa 2012) as my lone Windows 10 machine (Xeon X5675), and the Windows machine is happily crunching away on 4.12 tasks. Tasks running on Rosetta pre-4.12 ran (and continue to run) fine on every single machine. Mini-Rosetta also continues to run to completion without error on all machines. |
Admin Project administrator Joined: 1 Jul 05 Posts: 4805 Credit: 0 RAC: 0 |
Can people please join Ralph@home http://ralph.bakerlab.org/. I'm trying to test an updated build which may fix the OSX issue but there are not enough active participants. This application update includes some code related to COVID-19 that we'd like to push out to R@h. Thanks |
Blackbird Joined: 16 Jan 07 Posts: 5 Credit: 733,433 RAC: 587 |
Admin wrote: "Can people please join Ralph@home http://ralph.bakerlab.org/. I'm trying to test an updated build which may fix the OSX issue but there are not enough active participants. This application update includes some code related to COVID-19 that we'd like to push out to R@h." I have done so. |
yoerik Joined: 24 Mar 20 Posts: 128 Credit: 169,525 RAC: 0 |
Admin wrote: "Can people please join Ralph@home http://ralph.bakerlab.org/. I'm trying to test an updated build which may fix the OSX issue but there are not enough active participants. This application update includes some code related to COVID-19 that we'd like to push out to R@h." I'm registered for both, on all my devices - got 4 WUs crunching on my tablet right now. |
CIA Joined: 3 May 07 Posts: 100 Credit: 21,059,812 RAC: 0 |
Admin wrote: "Can people please join Ralph@home http://ralph.bakerlab.org/. I'm trying to test an updated build which may fix the OSX issue but there are not enough active participants. This application update includes some code related to COVID-19 that we'd like to push out to R@h." I've tried to sign up (new user, inside the BOINC software using the URL you list above) and it won't let me join.... /edit OSX Catalina, i7, BOINC 7.14.4. When I try to sign up from both the BOINC software and the website, it says my email is not formatted properly, must be <name>@domain.<xxx>. Which is weird as I'm typing my email in correctly and it's the same one I used for the normal Rosetta site/project. /edit2. Apparently it didn't like my VPN. I shut that off and it let me sign up fine. |
©2024 University of Washington
https://www.bakerlab.org