Message boards : Number crunching : Add support for DB funcs over tasks
Hubington (Joined: 3 Feb 06, Posts: 24, Credit: 127,236, RAC: 0)
I was just posting information on long-running WUs (work units) for a different model within the Rosetta project when I had a thought. It gets a little technical, so if you don't know databases, prepare to get a little lost.

I was thinking about whether I could code something up to scan my log and identify the WUs that run long, and then it struck me: why isn't the project doing that? You have all the information stored in an MSSQL database: who ran what, how long it took, and how long their preferences state it should take. With access to that database I could write a stored procedure that either takes each user's preferences, allows a 20% variance to exclude minor anomalies, and compares them to their results, or looks at a user's average WU runtime over the last month and sees which results fall too far outside it. And then there is the whole built-in reporting module that comes with MSSQL now (not wanting to sound like an MS marketing pitch, but it is what you're running, after all).

Given that most users will set things up and forget about them unless the whole thing bursts into flames, setting this up and running it once a day to gather the mistakes of the last 24 hours would make a lot of sense to me. Certainly a lot more than relying on users to catch it, when most of the time the stuff is going to be reported before they even have a chance to look at it. So I can't see why you aren't doing it. Or, if you are, why not announce it in the thread so users don't waste time digging out info you already have?

P.S. I'm not asking for access to the DB to do it. My thinking is: if I can do it, you must have someone in-house who, if handed this, could pick it up and run with it.
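The check described in the post above could be sketched as a single query. A minimal illustration, using SQLite in memory and hypothetical table and column names (the real BOINC server schema differs):

```python
import sqlite3

# Hypothetical, simplified schema for illustration only;
# the actual BOINC database tables are not structured like this.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_prefs (user_id INTEGER, target_cpu_seconds REAL);
CREATE TABLE results (user_id INTEGER, task_name TEXT, cpu_seconds REAL);
INSERT INTO user_prefs VALUES (1, 10800), (2, 10800);
INSERT INTO results VALUES
  (1, 'wu_a', 11000),   -- within 20% of the 3-hour target
  (1, 'wu_b', 20000),   -- well over the threshold: flagged
  (2, 'wu_c', 9000);    -- under target: fine
""")

# Flag results whose CPU time exceeds the user's runtime
# preference by more than the 20% variance the post suggests.
rows = conn.execute("""
    SELECT r.user_id, r.task_name, r.cpu_seconds
    FROM results r
    JOIN user_prefs p ON p.user_id = r.user_id
    WHERE r.cpu_seconds > p.target_cpu_seconds * 1.2
""").fetchall()
print(rows)  # → [(1, 'wu_b', 20000.0)]
```

Run once a day, the same query (or its stored-procedure equivalent) would yield the "mistakes of the last 24 hours" report the post describes.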
Mod.Sense, Volunteer moderator (Joined: 22 Aug 06, Posts: 4018, Credit: 0, RAC: 0)
Hubington, users can change the runtime preference while a task is running, and the standard BOINC software only stores the time the entire task took. So the missing piece there is the time for each of the models produced. Changes are underway now to gather that additional detail so that these specific long-running models can be reviewed and changes made to bring them into line.

We are kinda off-topic here though, so if you'd like to continue this conversation, let's find another thread to do so.

Rosetta Moderator: Mod.Sense
Hubington (Joined: 3 Feb 06, Posts: 24, Credit: 127,236, RAC: 0)
"We are kinda off-topic here though, so if you'd like to continue this conversation, let's find another thread to do so."

Granted, the tie-in was somewhat tenuous; my apologies. I'd like to expand on this, and after a quick poke about I don't see any existing thread that relates. I don't suppose you could be so kind as to break off my previous response and your reply into a separate topic?

Many thanks
Mod.Sense, Volunteer moderator (Joined: 22 Aug 06, Posts: 4018, Credit: 0, RAC: 0)
I've created this new thread to discuss this topic.

Rosetta Moderator: Mod.Sense
Mod.Sense, Volunteer moderator (Joined: 22 Aug 06, Posts: 4018, Credit: 0, RAC: 0)
The point you haven't touched upon yet is what you would like to do with the information. I believe you are heading towards suggesting that a function be added to allow server abort requests to be generated for specific tasks, either at your request or by the project team. Perhaps you could describe in your own words what you would do with such functions.

Rosetta Moderator: Mod.Sense
Hubington (Joined: 3 Feb 06, Posts: 24, Credit: 127,236, RAC: 0)
To deal with one of your earlier points first: the time recorded isn't the time each WU started and the time it ended, but the CPU time used in seconds. One CPU second is in fact one second of complete usage of a single core. Taking a system with a single CPU as an example, if you had a WU running but only using 50% of the CPU for 2 seconds, that would equate to 1 CPU second. It's much the same principle as kilowatt-hours on an electric meter, except every CPU has its own workload associated with 1 CPU second, depending on clock speed.

To be honest, I'm not 100% sure what you would do with it; it's the people running the project who seem to have requested we report this information to them, so you might want to ask them what their interest is. If I were to hazard a guess as to why they are interested, I'd assume long-running WUs are symptomatic of a piece of botched or inefficient code or logical process. If that's the case, then investigating examples of where this has happened is the only way they will be able to resolve these issues, and to investigate you first need to identify.

To identify this you are reliant on one of two processes: have people (be that end users or project staff) look over it, and miss about 97% of occurrences; or implement the idea I've cited above and, depending on what level of variance you allow for, probably catch about 75% of long-running WUs and 100% of the extreme cases, which are going to be the ones they are really interested in.

Not wanting to assume you need this all spoon-fed, but if you were to generate an overnight report for each model, that report could be passed to the lead for that model, and they can do whatever it is they do that keeps my CPU on the redline day and night.

As for users changing their WU runtime preferences: it's my experience that people set that sort of thing how they want it and then leave it there.
In the unlikely event that a lot of false positives get thrown up owing to a user who alters their preferences on a daily basis, the SQL can be amended to exclude those users. In much the same way as the code we run for Rosetta has evolved, every other system in the world evolves to account for situations that had not been seen at its time of conception. What I can assure you of, though, is that with no system you will be missing out on a great deal compared to even a half-baked system.
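The two ideas in the post above, CPU-second accounting and flagging results that fall too far above a user's recent average, can be sketched in a few lines. The 20% variance figure comes from the earlier post; the sample numbers here are made up for illustration:

```python
# CPU seconds = fraction of a core used * wall-clock seconds.
# E.g. 50% of one core for 2 wall-clock seconds is 1 CPU second,
# the same example given in the post above.
def cpu_seconds(utilization: float, wall_seconds: float) -> float:
    return utilization * wall_seconds

assert cpu_seconds(0.5, 2.0) == 1.0

# Flag a user's results that exceed their recent average runtime
# by more than the allowed variance (20% here, per the post).
def long_runners(times, variance=0.2):
    avg = sum(times) / len(times)
    return [t for t in times if t > avg * (1 + variance)]

# A month of roughly 3-hour (10800 s) results with one extreme outlier:
history = [10800, 11200, 10500, 39000]
print(long_runners(history))  # → [39000]
```

Excluding users who change preferences daily, as the post suggests, would just be an extra filter applied before the history list is built.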
Hubington (Joined: 3 Feb 06, Posts: 24, Credit: 127,236, RAC: 0)
"The point you haven't touched upon yet is what you would like to do with the information."

Just to back up my previous statements with some facts: in one of your posts (assuming this account is used by one person and not a general administrative account shared by a group) you open by saying you started a thread for people to report long-running WUs. See https://boinc.bakerlab.org/rosetta/forum_thread.php?id=4375&nowrap=true#49379

The system I'm suggesting would not be so much for me. Although I'd find it interesting to be able to look up that data in one easy place in the event that the credit level produced by me was low, the reality is I don't really care that much. By virtue of your starting a thread for people to report these things, though, I'd have to assume your interest level was higher.

Alternatively, if my assumption is in error, no one in the project has any interest in being notified of these long-running units, and you're just trying to group together a number of irrelevant posts so they don't clog the forum, then I'd suggest it would be helpful to point this out (perhaps even amend the forum rules if it's happening a lot), as I've been going out of my way to provide this information to you not because I have a major problem with it, but because I thought it was something you wanted. If this isn't the case, then I've been wasting my time, and it wouldn't surprise me to find I was the only one.

That said, based on the reply from Mike Tyka to feedback supplied by users of long-running units to his large homology model, it would seem that this information was useful, as it highlighted an issue that had not shown up with in-house testing. This system, if implemented against the in-house testing system and RALPH, may even help identify issues which at present may be being missed. After all, humans are fallible, and an extra check could help.
The only downside I can see is that the analysis of such a large amount of data would have a noticeable impact on the database. Without knowing more about the server's current hardware and load, as well as the number of records being handled, it's hard to guess how long this would take, but I'd estimate a runtime of approximately 5 minutes, 15 at the most, which is why I suggested it only be run once a day. If the results are then stored and distributed internally, once a day should be ideal. If the performance hit of doing it all in one go is too much, you could set it up to process only one model at a time and stagger them across the day, or not run it for models where there is no interest in the results.
Mod.Sense, Volunteer moderator (Joined: 22 Aug 06, Posts: 4018, Credit: 0, RAC: 0)
Yes, when I opened that thread I also passed along the suggestion that some basic code changes could be made to gather the information on EVERYONE, not just those who post. And I pointed out how this level of analysis seems not to have been done in the past, as there were significantly more long-running models occurring than I believe the Project Team was aware of.

I believe changes were incorporated into v1.40 to return the CPU time for each model produced along with the results file. This way they can readily see in Ralph testing whether long-running models are occurring. So I believe the idea behind your suggestion has already been implemented; if not, it is definitely in the works.

Rosetta Moderator: Mod.Sense
Hubington (Joined: 3 Feb 06, Posts: 24, Credit: 127,236, RAC: 0)
This was much my thinking: this way they can get info on everyone who is showing long-running units. They don't need to amend the code for the client application, though; based on what I can see when I examine the task information for my own tasks, there is enough information being captured in the database to create what's called a stored procedure that would generate a list of long-running units, which could be collated as required. If it took someone more than an hour to do this I'd be shocked, although a good DBA with existing knowledge of the tables should be able to put it together in under 5 minutes.

If you could find out whether this has been done, it would be most useful, as I'm getting a lot of long runners and there is little point in my reporting them and clogging the forum if they are already known. If it hasn't been done, though, then I'd strongly recommend that this be looked into quickly.
Mod.Sense, Volunteer moderator (Joined: 22 Aug 06, Posts: 4018, Credit: 0, RAC: 0)
I believe they've been doing that level of analysis for years. The problem, though, is when you run the first 5 models in 2.5 hours and begin the 6th, assuming it will complete within a 3-hour (for example) runtime preference, and that 6th model takes 5 hours or something. The data used previously just told them that 6 models took 7.5 hours; not that one was much longer than the others, and not that it exceeded the client's target runtime by 4.5 hours. The result is that the task gets poor credit, because everyone else can complete 6 (different) models in 3 hours. But it's not so bad that it couldn't be an anomaly on the client machine. It slipped in between the watchdog, which would kill it if it ran several times your target length, and failing immediately. It reported back without error, and *ON AVERAGE* it looks like it took a little over an hour per model (not unusual).

With the new change, they will have the client store the CPU time taken for each model within the task, and modify their databases to store that information. Then, yes, they can and will query it, locate long-running models, and see if they can track down what makes that particular random starting point take so long to resolve. Either these long-runners are the golden eggs, or they are trash. So they will write more code to treat them as what they prove to be, and in the future the sequence of folds that leads up to such a long-running model will be resolved much more quickly.

Rosetta Moderator: Mod.Sense
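The masking effect described in the post above, where a per-task total hides one long model, can be shown with its own numbers (5 models in 2.5 hours, a 6th taking 5 hours, against a 3-hour target):

```python
# Worked version of the example above: per-task totals hide the outlier.
# Five models at half an hour each, then a sixth that takes 5 hours.
model_cpu_hours = [0.5, 0.5, 0.5, 0.5, 0.5, 5.0]
target_runtime_hours = 3.0

total = sum(model_cpu_hours)
average = total / len(model_cpu_hours)
# With only the total, the task looks ordinary: 7.5 h for 6 models,
# i.e. about 1.25 h per model on average.
print(f"total {total} h, average {average:.2f} h/model")

# With per-model times stored, the long model is trivial to flag
# (here: any single model exceeding the target runtime).
long_models = [t for t in model_cpu_hours if t > target_runtime_hours]
print(long_models)  # → [5.0]
```

This is exactly the query the per-model CPU-time change makes possible: the same data that previously collapsed into one 7.5-hour total now exposes the 5-hour model directly.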
©2024 University of Washington
https://www.bakerlab.org