Message boards : Number crunching : CPU Optimization, GPU utilization: so sad!
Author | Message |
---|---|
garyn_87048 Send message Joined: 21 Feb 09 Posts: 7 Credit: 906,320 RAC: 0 |
I just finished reading threads 3879, "Number crunching: CPU Optimization", and 4042, "Number crunching: does GPU upgrade help for R@H". Both threads are about a year old, and both are a bit depressing. If I correctly captured their gist: no, R@H is not going to try to optimize, and no, R@H is not going to try to use GPUs. That just seemed sad - R@H is such an appealing project, yet its development appears to be focused on the current software fires and oblivious to a long-term processing strategy. I think "The Bad Penguin" has some good suggestions.

CPU optimization would appear to be the simplest, quickest, and easiest first step. R@H may be stuck in the paradigm that the software has to be rewritten in order to achieve optimization. That may be the ultimate solution, but I think there is a less costly initial step. At each release point there is at least some body of code that is deemed at least momentarily stable. I believe the project would benefit from producing native 32-bit and 64-bit editions, and within each group, builds targeting specific AMD and Intel math libraries. Being optimistic, I think it would also be beneficial to extend the sub-groupings to SSE2 and earlier, SSE3, and SSE4 (I'm not sure how the AMD processors divide out). Savvy R@H participants can find their best match (see the sketch below); those who either don't care or don't feel comfortable trying would keep the stock version.

CPU optimization, part 2: from "Paul D. Buck"'s comments in those threads, it appears that some sections of R@H code are stable over the longer term. This would seem to be an area where a modest software rewrite effort could be applied. As a group, BOINC projects seem to attract an at least somewhat technically savvy audience (just look at the detail in some of the user questions). My naive suggestion: would R@H consider engaging volunteers at release time to help generate the various CPU-optimized flavors? Or engage volunteers to help rewrite the stable portions of the application? Maybe these efforts are already underway (awesome!), but if not, it feels like an untapped talent pool.

GPU (CUDA) utilization: it just seems short-sighted not to devote some effort to figuring out how to use these powerful processors within R@H.

If I've missed the current CPU and GPU efforts, I apologize for my misdirected post. I did read the "Number crunching: Rosetta Application Version Release Log" for 2008/2009 and did not see any specific mention of CPU/GPU efforts. |
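To make the "find your best match" idea a bit more concrete, here is a rough sketch of how a host could detect at run time which pre-built flavor it should use. This is purely my own illustration, assuming a GCC/Clang toolchain on x86; nothing here comes from the actual Rosetta code base.

```cpp
// Sketch only: detect the host's SIMD level so the matching pre-built
// R@H flavor (stock / SSE2 / SSE3 / SSE4.1) could be selected automatically.
// Assumes a GCC/Clang toolchain on x86.
#include <cpuid.h>
#include <cstdio>

const char* best_simd_level() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return "stock build (CPUID unavailable)";
    if (ecx & bit_SSE4_1) return "SSE4.1 build";
    if (ecx & bit_SSE3)   return "SSE3 build";
    if (edx & bit_SSE2)   return "SSE2 build";
    return "stock build";
}

int main() {
    std::printf("Recommended client flavor: %s\n", best_simd_level());
    return 0;
}
```

The same check could just as well be done server-side from the host information BOINC already reports; the point is only that matching users to builds doesn't have to be a manual chore.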
Jaykay Send message Joined: 13 Nov 08 Posts: 29 Credit: 1,743,205 RAC: 0 |
sorry? short-sighted? i guess the guys at the bakerlab are clever enough to "devote some effort" to this... there are plenty of other threads about r@h with cuda; please do an advanced search for cuda and change the time limit.... i can't say anything about cpu optimisations, i don't know enough about that. but i guess the simple answer is that it's not worth the effort, or there are too few people working on rosetta@home. jaykay |
garyn_87048 Send message Joined: 21 Feb 09 Posts: 7 Credit: 906,320 RAC: 0 |
Please provide a link to the CUDA-enabled versions. I looked and didn't find them. And this message board's responses to CPU/GPU inquiries have been (or appear to me to have been) "nope, and no such plans". It is possible that I misread the 2008-2009 posts. Yes, I think this is a great project! But I don't see CUDA-enabled and CPU-optimized applications available or in the pipeline. Generally, when these questions were posted, the posters were referred to a list of reasons why neither path is doable (see yesterday's post, message 60322). These guys/gals are no doubt very clever, and I apologize if my wording obscures my primary point: I'm hoping to see recent CPU and GPU advances actively considered/embraced. If needed, consider recruiting volunteers to help. I think there are a lot of people interested in seeing this project succeed! |
nick n Send message Joined: 26 Aug 07 Posts: 49 Credit: 219,102 RAC: 0 |
yeah, they NEED to do both the gpu and cpu optimizations soon! We are losing so much processing power by not doing these things. As an example, on the project milkyway@home some gpu-enabled computers have almost 200,000 RAC! We really need to take advantage of all of these ready-to-use GPUs. Here is one card that on its own has 1 teraflop of power!! As of this writing the whole project had 90 teraflops, so just getting a couple hundred people using a card such as this could double our power. |
Viktor Astrom Send message Joined: 7 Apr 06 Posts: 3 Credit: 1,113,859 RAC: 0 |
I, like many others, have been waiting a long time to see GPU support for Rosetta@home. Now that CUDA is supported directly in BOINC, I'm hoping Rosetta@home will soon get the support we need. Just look at the stats from Folding@home: http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,208,737 RAC: 2,882 |
I, like many others, have been waiting a long time to see GPU support for Rosetta@home. Now that CUDA is supported directly in BOINC, I'm hoping Rosetta@home will soon get the support we need.

You guys do realize that CUDA means Nvidia video cards ONLY, right? All of those with ATI cards, older Nvidia cards, or any other brand of card are out of luck! I believe most projects are waiting for some dust to settle before devoting the resources to making a video card application that works for only a few video cards, which, while prices are dropping, still cost in the $100 US range. Yes, you can get some Nvidia cards for less, on sale, but the higher-end, i.e. better, cards still cost as much as SEVERAL hundred dollars. There IS a new standard starting to emerge; it will cover ATI, Nvidia, and all the other brand-name cards. Hopefully that will encourage all projects to take a more proactive approach to gpu processing. Until then... don't hold your breath!

Also, your stats link shows 138,916 TOTAL gpus being used to crunch, while at the same time showing 3,144,777 cpus being used! QUITE a difference, but it shows a growing trend to use the gpu as a cruncher too! Currently Boinc is a really bad multi-tasker between the cpu and gpu. In some cases it can take 90% of a cpu just to feed a gpu!! Paul D. Buck has written extensively, in several projects, about this. These are still the very early days of using this technology, and there are many changes to come. One problem is that, in almost all cases, gpus crunch much faster than cpus. That means that projects have to be able to handle the new workload, and some just cannot currently do that! |
Paul Send message Joined: 29 Oct 05 Posts: 193 Credit: 66,501,314 RAC: 8,264 |
While GPUs clearly offer the greatest increase in processing power, it is clear that CPU optimization is the logical first step. As I understand it, minirosetta is written in a modern language, and the compiler most likely supports CPU-based optimizations. Some of the code would have to be evaluated for SSE, SSE2, and SSE3 optimizations; some older posts indicate that the R@H code cannot easily take advantage of these instructions. It would be ideal to start with the most stable procedures and look for ways to adapt the vector instructions to maximize the utilization of the processors. I would much rather buy a $200 video card and triple my output than purchase a $2,000 Core i7 rig and only double the performance. We have lots of proteins to explore. The sooner we get GPUs involved, the sooner we can start understanding these complexities and save lives. Thx! Paul |
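As a rough illustration of what "evaluated for SSE" means in practice, here is a simple accumulation loop next to a hand-vectorized SSE2 version of the same computation. This is my own toy example (the function names are made up; nothing here is from the minirosetta sources), and with the right flags a compiler's auto-vectorizer can sometimes produce this kind of code without any source changes.

```cpp
// Illustrative only: a scalar accumulation loop and an SSE2 version of the
// same computation, processing two doubles per instruction.
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdio>

double pairwise_sq_scalar(const double* a, const double* b, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

double pairwise_sq_sse2(const double* a, const double* b, int n) {
    __m128d acc = _mm_setzero_pd();          // two partial sums in one register
    int i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d d = _mm_sub_pd(_mm_loadu_pd(a + i), _mm_loadu_pd(b + i));
        acc = _mm_add_pd(acc, _mm_mul_pd(d, d));
    }
    double partial[2];
    _mm_storeu_pd(partial, acc);
    double sum = partial[0] + partial[1];
    for (; i < n; ++i) {                     // handle an odd trailing element
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

int main() {
    double a[5] = {1, 2, 3, 4, 5}, b[5] = {5, 4, 3, 2, 1};
    std::printf("scalar=%f sse2=%f\n",
                pairwise_sq_scalar(a, b, 5), pairwise_sq_sse2(a, b, 5));
    return 0;
}
```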
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
To quickly recap some points made elsewhere: GPUs are, in essence, vector processors (see Wikipedia for more on the essence of a VP), and not all tasks are amenable to being run on a VP (see the toy sketch after this post). I do not know whether Rosetta is or is not one of this class of programs. But if it is, then the GPU will not give you a significant increase in speed.

Even if the problem IS GPU-amenable, coding against the APIs right now is a PITA ... and the support is still maturing. And there are three nascent APIs to target: the one for Nvidia cards, the one for ATI cards, and the one that should support both (OpenCL), which is very new. So the issue is, is this the right time? Were I on the project, I would say no ...

Now, understand that I am one of those that HAS GPUs actively doing work ... and I will say that for all the neato stuff here, it is not really ready for prime time. For one thing, BOINC still does not support GPUs that well ... none of the 6.6.x versions is really free of major bugs, though I do have some hope for 6.6.18 (even so, there are additional bugs and fixes posted on the Alpha list that tell me there are still issues with .18) ... To the next point, BOINC does not support the ATI GPUs at all yet. Yes, MW has an application done by a volunteer that works, sorta, and when all the planets align it is neat. But keeping a steady stream of work, and the GPU working, is nearly impossible even with micromanagement.

So, were I RaH, I would be sitting on the sidelines for another inning or three ...

Oh, and the point about needing expensive cards is also very germane. If you don't have a card that is a 9800GT or better, you may be disappointed in the totality of what you get ... |
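To make the "amenable to a vector processor" point concrete, here is a toy sketch of the kind of loop that resists vectorization: a Metropolis-style search where each step needs the previously accepted state before it can proceed. This is my own illustration, not Rosetta's actual algorithm; the easy parallelism in this kind of code is across independent trajectories, not within one.

```cpp
// Toy illustration: a Metropolis-style loop with a loop-carried dependency,
// so the steps of one trajectory cannot simply be spread across vector lanes
// or GPU threads. Independent trajectories are the GPU-friendly axis.
#include <cstdio>
#include <cstdlib>
#include <cmath>

double energy(double x) { return x * x; }   // stand-in scoring function

double run_trajectory(int steps, unsigned seed) {
    std::srand(seed);
    double x = 1.0, e = energy(x);
    for (int i = 0; i < steps; ++i) {
        double trial = x + (std::rand() / (double)RAND_MAX - 0.5); // propose a move
        double et = energy(trial);
        // Accept or reject; the next iteration needs this result first.
        if (et < e || std::rand() / (double)RAND_MAX < std::exp(e - et)) {
            x = trial;
            e = et;
        }
    }
    return e;
}

int main() {
    // One trajectory per seed: these could run in parallel, the inner steps cannot.
    for (unsigned seed = 0; seed < 4; ++seed)
        std::printf("trajectory %u final energy %f\n", seed, run_trajectory(1000, seed));
    return 0;
}
```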
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,208,737 RAC: 2,882 |
Oh, and the point about needing expensive cards is also very germane. If you don't have a card that is a 9800GT or better, you may be disappointed in the totality of what you get ...

Exactly! The more "cores" the card has, the better it crunches; some of the under-$100 cards have as few as 16! Here is a link to an Nvidia page where clicking on each card model, and then the specs tab on its page, will tell you how many processing cores each card has: http://www.nvidia.com/object/cuda_learn_products.html A card is NOT just a card in this instance! Oh, one more thing: if your card doesn't have at least 256 MB of memory on it, it won't even work for gpu crunching!!!

There IS some talk of Nvidia enabling some of the on-motherboard chips for crunching, but this is not even close to working. No one even knows how this would work, or whether the cpu would be able to feed it and still do its own crunching, etc, etc, etc! In short, gpu crunching is very limited and will probably remain that way for a while yet. Yes, for those of us who do have gpus capable of crunching, you CAN crunch and be on the very cutting edge. Seti and Milky Way for Boinc; there is a stand-alone program for Folding, which I use, and I think there is one under WCG. That is all I can remember right now; Paul probably has a more complete list. He is much more of a cutting-edge, find-out-everything-about-everything-and-every-project kind of guy than I am. |
svincent Send message Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
There are a lot of reasons why it may not be desirable to spend the effort needed to make R@h run on GPUs; Paul Buck has mentioned some of them above. I'd like to mention two more issues: testing and maintenance. I really doubt that R@h will engage the help of volunteers as suggested in the initial post, simply because of the difficulty of doing those necessary tasks on software that was written elsewhere. Here's an example, from the experimental side of this field, of what can happen when errors slip through the testing process: http://boscoh.com/protein/a-sign-a-flipped-structure-and-a-scientific-flameout-of-epic-proportions Having said that, general-purpose programming on GPUs is a rapidly advancing field, and in a few years OpenCL or its successor may be mature enough to make this viable. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Um, well, there are GPU devices and there are stream processors ... Most GPUs have one device and many stream processors. For example, my GTX 280 has one device and, well:

    # Device 0: "GeForce GTX 280"
    # Clock rate: 1350000 kilohertz
    # Total amount of global memory: 1073414144 bytes
    # Number of multiprocessors: 30
    # Number of cores: 240
    # Time per step: 32.335 ms
    # Approximate elapsed time for entire WU: 20209.391 s

While my system with 2 GTX 295 cards looks like:

    # Using CUDA device 0
    # Device 0: "GeForce GTX 295"
    # Clock rate: 1242000 kilohertz
    # Total amount of global memory: 939261952 bytes
    # Number of multiprocessors: 30
    # Number of cores: 240
    # Device 1: "GeForce GTX 295"
    # Clock rate: 1242000 kilohertz
    # Total amount of global memory: 939196416 bytes
    # Number of multiprocessors: 30
    # Number of cores: 240
    # Device 2: "GeForce GTX 295"
    # Clock rate: 1242000 kilohertz
    # Total amount of global memory: 939261952 bytes
    # Number of multiprocessors: 30
    # Number of cores: 240
    # Device 3: "GeForce GTX 295"
    # Clock rate: 1242000 kilohertz
    # Total amount of global memory: 939261952 bytes
    # Number of multiprocessors: 30
    # Number of cores: 240
    # Time per step: 35.729 ms
    # Approximate elapsed time for entire WU: 22330.781 s

But, fundamentally, Mikey is right: the more cores and multiprocessors, the faster the card ... the clocks also play a part. It is hard to know for sure what the complexity of the tasks is on GPU Grid (where I got these numbers), so I do not know whether these two tasks are comparable (they have at least three different "sizes") ...

Anyway ... WCG does not have a GPU application yet, and neither does YoYo, though rumors persist. Einstein and AI have said they are working on them now, to be released soon (though I would not hold my breath on the AI application). The Lattice Project just did some testing on their GPU application, and it works as poorly as the base application, so I am not impressed there either (the tasks "hang" and show no progress, though the project seems to think that having the tasks fail at the same rate as the CPU application is acceptable; that failure rate on 200-plus-hour tasks is why I stopped doing their work. Their application fails, wasting 200 hours or so of my compute time? Sorry, I've got better things to do ...).

Anyway ... if you have a 38x0 or 48x0 ATI card you can try the Milky Way application ... it is alpha, and getting enough work is problematic at this time. Travis announced he will work on a CUDA or OpenCL application that will work differently than the CPU application and will do different work, so as to solve server-side work-issuing problems ... The AI application will be on the ATI cards if and when it comes out ... Einstein is CUDA first (probably) and then OpenCL later (at least that was the last word before the site crashed). SETI@Home and SaH Beta have CUDA, though there are some tasks that don't run well and some that crash the GPU, causing all tasks to fail until reboot (even other projects' tasks). GPU Grid seems to be the "best behaved" at this point, though they have short deadlines (just increased to 5 days) and they have just admitted to underpaying for the work ... they are working on changing that ...

Oh, and on occasion they try new tasks that may crash, though those usually die fairly early and the project is pretty responsive in killing a bad task series when notified ... And almost no version of BOINC really handles GPUs correctly yet. The later versions of 6.6.x (17/18, or later) seem to hold promise, though watching the alpha list I still see some pretty serious issues being raised. Some seem to think that they are having decent luck with those versions, but YMMV ... |
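For anyone curious where listings like the ones above come from, this is roughly how they are produced with the CUDA runtime API (host-side C++, compiled with nvcc; assumes an Nvidia card and the CUDA toolkit). Note that the "number of cores" figure is derived by multiplying the multiprocessor count by the cores per multiprocessor for that architecture; the API does not report it directly.

```cpp
// Sketch of a CUDA device listing using the runtime API. Compile with nvcc.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA-capable device found\n");
        return 0;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        std::printf("# Device %d: \"%s\"\n", dev, prop.name);
        std::printf("# Clock rate: %d kilohertz\n", prop.clockRate);
        std::printf("# Total amount of global memory: %lu bytes\n",
                    (unsigned long)prop.totalGlobalMem);
        std::printf("# Number of multiprocessors: %d\n", prop.multiProcessorCount);
    }
    return 0;
}
```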
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
There are a lot of reasons why it may not be desirable to spend the effort needed to make R@h run on GPUs; Paul Buck has mentioned some of them above.

I don't think I say this often enough: the more that OTHER projects move to GPU processing, the more CPU resources become potentially available for Rosetta. So even though RaH does not make a GPU application, the mere existence of GPU processing means that the potential resources applied here may increase ... For example, if Einstein releases a GPU application, I will likely migrate my Einstein work to the GPU to get improved throughput, and add to my GPUs ... That would mean that the time that HAD been taken by the CPUs to do EaH work is back in the pool for other projects. |
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
OR... other Projects will optimize their code for the new(er) cpu architectures, and crunchers will donate their cpu time to those that wring every bit of science possible out of each cpu cycle, while the Projects that haven't increased the efficiency of their code in years lose out. |
garyn_87048 Send message Joined: 21 Feb 09 Posts: 7 Credit: 906,320 RAC: 0 |
Thank you for considering these options! Please consider using volunteers! I think there are several 'safe' ways (see below) to engage outside participation and make these changes happen quickly and effectively. Personally, I am very interested in seeing you succeed! I'm guessing this represents a much larger sentiment, or the current CPU power directed your way would have gone elsewhere. Svincent, I read your post and I read your link. I agree with, embrace, and understand your concern. I think you can have both a faster project and dependable results.

My proposal: admittedly, I have zero protein experience. However, I have extensive software experience and both a CS and a CE background. I think your project may benefit from a three-pronged approach, where each prong is nearly independent and each prong can be supported by various levels of volunteer hours and expertise.

Prong 1: each of your software releases represents a milestone which, at least temporarily, R@H believes to be stable and to generate reliable and useful results. I believe the easiest and quickest performance improvement is to compile this software in native 32-bit and 64-bit formats and then sub-divide these two branches into builds targeting specific CPU math libraries. This 'multiple recompile' could be an (almost) entirely volunteer-driven effort. The default R@H BOINC client gets the 32-bit version (the POR version, equivalent to no volunteer effort). Initially, only interested users download the specialized versions; later, it may be possible to auto-detect the client host's configuration. There is some chance that the various compiler optimizations may generate code that produces (hopefully only 'slightly', at worst) different results. The combined R@H and volunteer effort (mostly volunteer) would be responsible for running these versions head to head (possibly even relying partly on BOINC client muscle) and for building a library of 'known' tough proteins to test against all software releases; a sketch of what such an acceptance check might look like follows this post.

Prong 2: get portions of the existing, stable, seldom-changed, calculation-critical software rewritten into an optimized form. For this task, I'm guessing that you can tap some very experienced volunteers - this area may be particularly enticing to knowledgeable volunteers with only limited extra bandwidth to donate. Modifications would be subject to the same (or even more rigorous) testing as prong 1. Again, the original POR code can be retained.

Prong 3: review the current design and processing flow and attempt to identify the best scheme for utilizing GPU power. The GPU approach is likely much more parallel than the current design. This analysis will help define R@H's longer-term roadmap and may possibly even lead to short-term, unanticipated processing gains. Using mostly higher-level process flows, the volunteer community may provide some surprisingly innovative suggestions. Testing and implementation would be as in prongs 1 and 2.

Hope there is a useful nugget somewhere in this! |
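Here is the acceptance-check sketch promised under prong 1. It is purely hypothetical: the one-score-per-line file format and the tolerance value are invented for this illustration, and the real criteria would have to come from the lab.

```cpp
// Hypothetical acceptance check for prong 1: compare per-model scores from a
// stock build and an optimized build of the same release on the same inputs.
#include <cstdio>
#include <cmath>
#include <fstream>
#include <vector>

static std::vector<double> load_scores(const char* path) {
    std::ifstream in(path);
    std::vector<double> v;
    double s;
    while (in >> s) v.push_back(s);
    return v;
}

int main(int argc, char** argv) {
    if (argc != 3) {
        std::fprintf(stderr, "usage: %s stock_scores.txt optimized_scores.txt\n", argv[0]);
        return 2;
    }
    std::vector<double> stock = load_scores(argv[1]);
    std::vector<double> opt = load_scores(argv[2]);
    if (stock.size() != opt.size()) {
        std::fprintf(stderr, "model counts differ\n");
        return 1;
    }
    const double tol = 1e-6;   // tolerance would need to be agreed with the lab
    int failures = 0;
    for (size_t i = 0; i < stock.size(); ++i)
        if (std::fabs(stock[i] - opt[i]) > tol) ++failures;
    std::printf("%zu models compared, %d outside tolerance\n", stock.size(), failures);
    return failures == 0 ? 0 : 1;
}
```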
Murasaki Send message Joined: 20 Apr 06 Posts: 303 Credit: 511,418 RAC: 0 |
Please consider using volunteers! I think there are several 'safe' ways (see below) to engage outside participation and make these changes happen quickly and effectively.

There are several dangers inherent in allowing volunteers access to change the project code. |
mikey Send message Joined: 5 Jan 06 Posts: 1895 Credit: 9,208,737 RAC: 2,882 |
Um, well, there are GPU devices and there are stream processors...

Thanks, Paul, for using the PROPER words to describe what I was trying to say. Sometimes I know what I mean; it just doesn't come out that way!! I REALLY do appreciate it!! To all: it was Stream Processors that I was trying to refer to in my post about different gpu cards being faster than others. Generally, the more Stream Processors a card has, the faster it will crunch. Also, the more Stream Processors a card has, the more it costs!! Be VERY careful with the changing specs of cards, though; I think I have seen a card go from one number of Stream Processors to a much lower number with a new release and version of the same card! No, sorry, I do not have the details, but it was an Nvidia card. I do not buy anything anymore without going to the maker's website first!!! And then looking on the side of the box; if it doesn't say, I don't buy!! If they don't think it's important enough to put the number on the box, then it must not be enough for what I want to do with it. Marketing must be a pain sometimes! |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Um, well, there are GPU devices and there are stream processors...

Well, cool ... though my intent was to clean up MY misuse of terminology ... :)

@The_Bad_Penguin: well, yes, some may do that ... just as some chase the projects that award the highest credit so that they can collect more marbles than the next guy ... But I think that more people are in it for the science in the ultimate sense. And fundamentally, if a project is already collecting results as fast as it can use the data, more users only means it falls behind faster.

For my part, I have enough trouble keeping all of my GPUs fed and running (in particular for Milky Way), and for molecular and folding work GPU Grid is doing that, so I am content to "suffer" with that for now. When Einstein releases its GPU application it will be time to consider upgrading my suite of GPUs so that I can add projects and keep GPU Grid humming along at the current rate.

Yes, it would be nice if RaH had a GPU version in the works. But personally I am quite happy that they have been concentrating on getting mini-rosetta working better. I mean, 1.54 has very few crashes compared to when I quit doing RaH, and 1.58 seems SLIGHTLY better, though there are still a few nagging issues (and occasional new ones) ... I don't think most of you realize just how big a programming effort it would take to make the migration to a GPU version. |
Michael G.R. Send message Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0 |
I can understand why Rosetta would want to wait before supporting GPUs. Their code is much bigger than Folding@home's (f.ex.), and it changes more often. Waiting for OpenCL and other techs to mature is probably a good idea. What I have more trouble understanding is why there's no support for at least SSE/SSE2. Rosetta@home might not want to support 2 code bases, but I say they should just support one: the SSE/SSE2 code, and simply drop support for CPUs without SSE/SSE2 (in 2009, those are old enough that they don't contribute too much, and the boost from SSE/SSE2 would more than compensate for their loss). I'm not a coder, so I might be missing something, but I don't think supporting an SSE/SSE2 code base - once the transition has been made - would be harder than supporting the current non-optimized code base. |
Paul D. Buck Send message Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
The SSE and later extensions are really nothing more than low-end vectorizing instructions; in essence, doing on the CPU what we hope to do on the GPU: put more operations in parallel ... Again, not knowing the compiler, or the code, it is hard to know whether this would give a speed boost or not. It probably would, but it might not ...

More importantly, I think some are missing part of the point of RaH, which is not necessarily to get the most answers the fastest; it is to figure out how to get the answers ... What I mean by that is that they are working on the algorithm of how to do what they are doing. Yes, a GPU or SSE version would be faster. But using those compile-time optimizations can mean that figuring out where a bug is becomes that much harder.

I pretty much stopped doing RaH for almost a year because the versions of Mini-Rosetta worked so poorly on most of my systems that I just stopped. Now that it is stable and only returns an error on about 1 task in 100, it is reliable enough that I feel it is worth my time again. I stopped doing Garli on The Lattice Project and Sztaki grid for similar reasons. If the application has lots of errors, well, it is not worth my time at all ...

But seriously, the fastest way to speed up any application is to get the right algorithm. And THAT is what this project is working on right now. Once you have the right algorithm you can then consider using compile-time optimizations to speed up the process. As a former developer I can tell you that one of the hardest things in the world is figuring out why the "optimized" version fails when the "standard" one does not; it is also a quick road to madness. If you want to see this in operation, read the NC forum in SaH for a couple of months and see the range of issues when new versions are released and how much care and feeding is required ... |
Michael G.R. Send message Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0 |
Your point is a good one, Paul. But Rosetta@home is doing two things: 1) Improving the software 2) Running scientific calculations on that software I'm not saying that something like SSE/SSE2 optimizations would necessarily help with #1, but if it can help with #2 WITHOUT hurting #1, then I think it should be done (it just needs to be done once and after that the benefits add up over time). Since the science done in #2 is pretty important too (HIV research, designing proteins from scratch, etc), any speed up (which means more models can be tried, which means that lower-energy models can be found) would be important. And in a way, a speed up is actually an improvement of the software (#1) if you look at objective results (software running on machine X will give better results with SSEx than without). I'm no coder, but it seems to me that the kind of 3d modeling that Rosetta@home does would benefit significantly from SSEx. What if there's a 2x speed up there just waiting to happen? People in the Baker Lab are better positioned than I am to know what is the best use of their resources, but I just want to make sure that they're looking at all options when they make their choices. My goal is for scientific and medical breakthroughs, so however best we can reach that I'm fine with... |