Why so much variation?

Author	Message
Bikermatt Joined: 12 Feb 10 Posts: 20 Credit: 10,559,305 RAC: 0	Message 66403 - Posted: 1 Jun 2010, 15:53:51 UTC
I have watched about twenty of the ProteinInterfaceDesign work units run over the last few days and it seems like they just stall. I don't seem to remember anyone saying that suspend/resume would create a checkpoint. It does seem to restart these work units though.
The problem is that with my six hour run time I was finding about 25% of these work units stalled with a CPU time of about 4 hours, an elapsed time of anywhere between 7 and 10 hours, and the last checkpoint at ten minutes! So the work units are not checkpointing and the watchdog is not killing them either.
Suspend/resume will restart the run in these cases, but if you "show graphics" before and after the suspend/resume, the hundreds of models that had been produced are reset to zero, so all of your work is lost. The only other option in this case seems to be to abort; however, again you lose all of your models.
In some cases I noticed these would just stall and the work units were checkpointing. In those cases suspend/resume would finish the work unit several seconds after resuming.
Matt

Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0	Message 66406 - Posted: 1 Jun 2010, 19:06:01 UTC Last modified: 1 Jun 2010, 19:40:44 UTC
Matt - What you are seeing appears to be different from what I have been seeing. In my case:
1. It has always, without fail, been a ProteinInterfaceDesign task.
2. It will quit taking checkpoints. Once it quits taking checkpoints it will not start taking checkpoints again for the life of the task.
3. It will run way long, most often provoking the ire of the watchdog.
4. CPU time always continues to increment.
5. If it makes it to the desired + 4 hours, the watchdog kills it without further issues.
6. Fractional credit, often laughably minuscule, is awarded for the task.
I am not a credit hound, nor am I going to "deprecate my participation", as another user put it, over this issue. My concern, when I see a task run over to the point where the watchdog nails it and then see that it is awarded only 10% of the expected credit, centers on the question: have I accomplished anything useful?
During the few weeks I have participated in this endeavor one point has been made over and over in these forums - credit is awarded for work accomplished. To me the small credit awarded for these tasks is an indication that they were not real productive.
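
For anyone following along, the rule in point 5 can be pictured with a small sketch. This is only an illustration in Python and not the actual Rosetta watchdog code: the function name is made up, and the 4 hour grace period is simply the "desired + 4 hours" figure from the post above.

```python
GRACE_HOURS = 4.0  # assumption: the "desired + 4 hours" limit described in point 5


def watchdog_should_end(cpu_hours: float, target_hours: float,
                        grace_hours: float = GRACE_HOURS) -> bool:
    """Return True once CPU time has passed the target runtime plus the grace period.

    Hypothetical helper for illustration; the real watchdog may use other criteria.
    """
    return cpu_hours > target_hours + grace_hours


# A task with a 6 hour target is left alone at 9.1 CPU hours
# but would be ended at 10.5 CPU hours.
print(watchdog_should_end(9.1, 6.0))
print(watchdog_should_end(10.5, 6.0))
```

Under this reading, a task that has stopped checkpointing but is still accumulating CPU time keeps running until it crosses that limit, which matches points 3 to 5.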

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 66407 - Posted: 1 Jun 2010, 19:45:02 UTC
In general comment to the last several posts, such wide variation is not uncommon for new protocols. And detailed reports such as all of you are providing here are exactly how I have seen the protocols improved, checkpoints enhanced, and bugs eliminated in the past. So please hang in there, and recognize that even failure is a learning experience and a necessary part of developing something new.
Sid, just to clarify, it goes to step 500 and then what happens? It takes a long time and then runs from step 501 through 1000? Or do you actually see it run back to step 1 on the same model?
Chris, I believe

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 66408 - Posted: 1 Jun 2010, 19:45:04 UTC
In general comment to the last several posts, such wide variation is not uncommon for new protocols. And detailed reports such as all of you are providing here are exactly how I have seen the protocols improved, checkpoints enhanced, and bugs eliminated in the past. So please hang in there, and recognize that even failure is a learning experience and a necessary part of developing something new.
Sid, just to clarify, it goes to step 500 and then what happens? It takes a long time and then runs from step 501 through 1000? Or do you actually see it run back to step 1 on the same model?
Chris, I believe #2 (it quits taking checkpoints) is due to it not hitting the end of the next model. And taking 10 min. for the first model and then having the second get halted by the watchdog results in #6, because you are essentially just getting credit for 2 models that are taking others 20 min. on average. #4 confirms you aren't having the "stalled" issue.
Also, I would say your tasks and reports are very productive, and the fact that your credit doesn't reflect that is basically a result of a bug in the new protocol. So, when attempting to address questions, especially from a newer user, you tend to get the theoretical answers rather than the reality. And I would agree that the reality of your specific observations points to some areas that need fixing and/or enhancing.
Over time, new protocols produce much more consistent per model runtimes. And over time, you will see tasks from a variety of protocols processed, and so less of a concentration on the ones you are seeing problems with.
Rosetta Moderator: Mod.Sense

Sid Celery Joined: 11 Feb 08 Posts: 2602 Credit: 47,220,881 RAC: 0	Message 66412 - Posted: 2 Jun 2010, 2:44:51 UTC - in response to Message 66407.
"So please hang in there, and recognize that even failure is a learning experience and a necessary part of developing something new."
Understood. I'm observing and letting them run their course. On occasion I'm finding one finish normally at runtime +2 without hitting the watchdog. At one point today I had to reboot to uninstall/reinstall printer drivers and one job went back to its checkpoint at 3 mins when it was at 7 hours+ before the reboot. It's currently at 9h (rt+1) on decoy 408. An isolated case though.
"Sid, just to clarify, it goes to step 500 and then what happens? It takes a long time and then runs from step 501 through 1000? Or do you actually see it run back to step 1 on the same model?"
The above WU runs 500 steps per model, then goes to the next model until decoy 408. Now it goes up 500 at a time, waits 100 seconds or so (CPU time still clocking up), then 501 to 1000, repeat. It's currently at step 76,500 after 9h 6m CPU time (10h 5m wall-clock) with the last checkpoint at 5h 40m.

Bikermatt Joined: 12 Feb 10 Posts: 20 Credit: 10,559,305 RAC: 0	Message 66413 - Posted: 2 Jun 2010, 2:48:25 UTC
Chris, have you checked the graphics on some of the runs to see how they are doing at making models? As far as I can tell, when these tasks are running properly they make models really fast and it only takes 500 steps. The tasks that seem to have crashed will be stuck on a model with steps ticking into the thousands. It seems like that would account for a large variation in granted credit, because when this occurs you will end up making fewer models.
Also, I did run eight or so of these yesterday with my default run time set at two hours and they all finished; however, six of them had their checkpoints stuck between 10 and 20 minutes the whole time. I wonder if people with longer default run times are having more problems with these tasks.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 66415 - Posted: 2 Jun 2010, 5:55:34 UTC
"I wonder if people with longer default run times are having more problems with these tasks."
It shouldn't make any difference if you look at things on a per model basis. If your 2 hour default runtime means you run 3 models each task, and you ran 8 tasks uneventfully, then you ran 24 models uneventfully. If someone else runs 24 models as well, but within a single 16 hour task, their odds of encountering a long-running model or a watchdog termination (really the same thing if the long model takes more than 4 hours) should be similar to yours. Since you are running through more tasks, your failure rate per task should be lower, but on a per model or per CPU hour basis, the failure rates should be the same between short and long target runtimes.
Rosetta Moderator: Mod.Sense
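
The per task versus per model point can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes every model independently has the same small chance of turning into a long-running one, and the 2% figure is invented purely to have something to plug in.

```python
def task_failure_rate(p_model: float, models_per_task: int) -> float:
    """Chance that at least one model in a task is a long-running one,
    assuming each model goes wrong independently with probability p_model."""
    return 1.0 - (1.0 - p_model) ** models_per_task


p = 0.02  # hypothetical per-model chance, not a measured value

short_tasks = task_failure_rate(p, 3)   # 2 hour target, about 3 models per task
long_task = task_failure_rate(p, 24)    # 16 hour target, about 24 models per task

# Expected number of bad models over the same 24 models is identical either way.
expected_bad_short = 8 * 3 * p
expected_bad_long = 1 * 24 * p

print(f"per-task failure rate, 3 models:  {short_tasks:.3f}")
print(f"per-task failure rate, 24 models: {long_task:.3f}")
print(f"expected bad models over 24 models: {expected_bad_short:.2f} vs {expected_bad_long:.2f}")
```

So someone with a long target runtime would see a bad model spoil a larger share of their tasks, even though the per model rate is unchanged.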

Bikermatt Joined: 12 Feb 10 Posts: 20 Credit: 10,559,305 RAC: 0	Message 66429 - Posted: 2 Jun 2010, 23:36:20 UTC Last modified: 3 Jun 2010, 0:27:11 UTC
int_simpleTwo_1f0s_2djh_ProteinInterfaceDesign_21May2010_21289_68_0
So this task is running on my system right now and the checkpoint is updating maybe twice per min. It had about two hundred models after about an hour and I suspended the task. When I resumed, the models started over at zero. I let them run up to nine again and suspended; again, when I resumed, the models went to zero! This would really account for the variation in credit on these tasks if any time you snooze or reboot you are restarting from zero. Am I the only one seeing this?
Edit - Now the same task has made it back up to model 79 but it seems like it is now stuck. It is at step 95000 and counting. CPU time is now at 1.5 hours with the last checkpoint at one hour.
Yet another edit! - I let the steps get up to 150000 on model 79 then hit suspend/resume. Models went back to zero, CPU time went back to the one hour checkpoint stated above. It is now making models and checkpointing again.

Sid Celery Joined: 11 Feb 08 Posts: 2602 Credit: 47,220,881 RAC: 0	Message 66436 - Posted: 3 Jun 2010, 10:46:50 UTC - in response to Message 66429. Last modified: 3 Jun 2010, 10:47:43 UTC
"int_simpleTwo_1f0s_2djh_ProteinInterfaceDesign_21May2010_21289_68_0 So this task is running on my system right now and the checkpoint is updating maybe twice per min. It had about two hundred models after about an hour and I suspended the task. When I resumed, the models started over at zero. I let them run up to nine again and suspended; again, when I resumed, the models went to zero! This would really account for the variation in credit on these tasks if any time you snooze or reboot you are restarting from zero. Am I the only one seeing this? Edit - Now the same task has made it back up to model 79 but it seems like it is now stuck. It is at step 95000 and counting. CPU time is now at 1.5 hours with the last checkpoint at one hour. Yet another edit! - I let the steps get up to 150000 on model 79 then hit suspend/resume. Models went back to zero, CPU time went back to the one hour checkpoint stated above. It is now making models and checkpointing again."
I hadn't seen this at the time of reading, but later I had another strange-looking ProteinInterfaceDesign task and tried to view graphics to see how far it had got. By mistake I hit the suspend button. When I got it running again I was pleased to see it had restarted from nearly 6 hours, but it had reverted to model 0. While watching, it reached model 2 and the window closed as the task ended - that is, 1 model after 6 hours on an 8 hour runtime meant it wouldn't complete the next task. No problem, I thought, but I probably lost all the processing effort. But on checking the task details I found it was awarded healthy credit and the task reported 1104 decoys run.
int_simpleTwo_1f0s_2IWR_ProteinInterfaceDesign_21May2010_21289_63_0
All's well that ends well, but there's definitely something going wrong with these types of tasks on several levels. Hopefully the science part is successful and it's just the programming that's gone wrong.
NB: I note bikermatt aborted his task - maybe that wasn't the best idea, but with the symptoms it showed it's hard to blame him. Hopefully the wingman will fare better.

Bikermatt Joined: 12 Feb 10 Posts: 20 Credit: 10,559,305 RAC: 0	Message 66439 - Posted: 3 Jun 2010, 12:53:43 UTC - in response to Message 66436. Last modified: 3 Jun 2010, 12:58:56 UTC
"NB: I note bikermatt aborted his task - maybe that wasn't the best idea, but with the symptoms it showed it's hard to blame him. Hopefully the wingman will fare better."
Well, I checked it after another two hours and noticed it had locked up again. Models were in the three hundreds at the time but steps were counting up from 300k. The checkpoint had made it up to an hour and a half and was not updating anymore, and the CPU time was over four hours. Again suspend/resume set the models back to zero and got things running again. I let it make a few more models and then aborted it, to see what it would report and because I was getting tired of watching it. It seems like it just submitted an error report; however, I think it also shows all of the times it was suspended and resumed.
Another one I aborted the other day, with an elapsed time of over ten hours and a CPU time of two hours, produced a similar report.
int2_centerfirst2b_1fAc_3eme_ProteinInterfaceDesign_23May2010_21231_30_0
I do not like to abort tasks and I could really care less about the credit. However, after watching so many of these tasks lock up for the last week I will not let them run unattended.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 66440 - Posted: 3 Jun 2010, 13:50:07 UTC
I think Sid's report would seem to confirm the thought that either the graphic is not showing the proper model numbers, or takes a considerable time to initialize and display the proper value.
It sounds like Bikermatt is encountering the "stalled" issue where tasks stop using CPU, even though BOINC Manager shows them in a running state.
Rosetta Moderator: Mod.Sense

Matthias Lehmkuhl Joined: 20 Nov 05 Posts: 13 Credit: 2,795,681 RAC: 0	Message 66447 - Posted: 3 Jun 2010, 21:08:25 UTC Last modified: 3 Jun 2010, 21:13:55 UTC
I also got one with very low granted credit:
rhoA15May2010_1lb1_1w9w_ProteinInterfaceDesign_15May2010_20686_58_1
DONE :: 143 starting structures 32927 cpu seconds
This process generated 143 decoys from 143 attempts
Claimed credit 94.7424055372759
Granted credit 7.88833689203152
edit: no reboot or suspend appears to have happened during the runtime, based on the stderr out
Matthias

bigtuna Joined: 29 Nov 09 Posts: 1 Credit: 3,033,283 RAC: 0	Message 66450 - Posted: 4 Jun 2010, 0:13:55 UTC - in response to Message 66447.
"Got also one with very low granted credit rhoA15May2010_1lb1_1w9w_ProteinInterfaceDesign_15May2010_20686_58_1 DONE :: 143 starting structures 32927 cpu seconds This process generated 143 decoys from 143 attempts Claimed credit 94.7424055372759 Granted credit 7.88833689203152 edit: no reboot or suspend appears due runtime based on stderr out"
int_simpleTwo_1f0s_2e6m_ProteinInterfaceDesign_21May2010_21289_79
https://boinc.bakerlab.org/rosetta/result.php?resultid=343201186
cpu seconds: 25,791.81
Claimed credit: 171.73
Granted credit: 3.60
...Ouch!!
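
To put the two reports above side by side, here is a quick calculation of how much of the claimed credit was actually granted. It is just arithmetic on the figures quoted in the two posts; the shortened labels are only for display.

```python
# (claimed credit, granted credit) as quoted in the two posts above
reports = {
    "rhoA15May2010 ..._20686_58_1": (94.7424055372759, 7.88833689203152),
    "int_simpleTwo ..._21289_79": (171.73, 3.60),
}


def granted_share(claimed: float, granted: float) -> float:
    """Granted credit as a percentage of claimed credit."""
    return 100.0 * granted / claimed


for task, (claimed, granted) in reports.items():
    print(f"{task}: {granted_share(claimed, granted):.1f}% of claimed credit granted")
```

That works out to roughly 8.3% and 2.1% respectively, in the same ballpark as the "only 10% of the expected credit" mentioned earlier in the thread.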

Sid Celery Joined: 11 Feb 08 Posts: 2602 Credit: 47,220,881 RAC: 0	Message 66452 - Posted: 4 Jun 2010, 1:54:38 UTC - in response to Message 66439.
"NB: I note bikermatt aborted his task - maybe that wasn't the best idea, but with the symptoms it showed it's hard to blame him. Hopefully the wingman will fare better."
"Well I checked it after another two hours and noticed it had locked up again. Models were in the three hundreds at the time but steps were counting up from 300k. The check point had made it up to an hour and a half and was not updating anymore and the CPU time was over four hours."
I'm not sure it's clear that the model has stalled or locked up, as long as CPU time is still moving. It may well be right, but I'll leave it to the coders to confirm. The fact that the # of models goes back to zero, but the completed job, in my case, reported 1104 successful decoys says to me that appearances are deceptive. Aborting the task will certainly report 0 decoys, so I don't think we should second guess what's happening, even if that means running until the watchdog kicks in.
I fully accept credits aren't the real issue - but abandoning successful models means someone else has to go through it as well instead of doing new work. I note your wingman reported the job successfully btw.
I might also mention (a propos of nothing much) there's a new BOINC version that hasn't been withdrawn (yet!) which seems to have solved the 'lumpy' downloading of new tasks. It may help somewhere down the line.

Bikermatt Joined: 12 Feb 10 Posts: 20 Credit: 10,559,305 RAC: 0	Message 66460 - Posted: 4 Jun 2010, 13:40:46 UTC - in response to Message 66452.
"I fully accept credits aren't the real issue - but abandoning successful models means someone else has to go through it as well instead of doing new work. I note your wingman reported the job successfully btw."
Wow! I checked his reports and you can tell that it locked up on him also. His default run time is 3 hours but this one ran for 18,128 sec. Plus he only reported 277 models, and these tasks typically fly when running properly.
Anyway, I thought I had read in a thread somewhere that the tasks were supposed to report completed models even if aborted, but I sure could be wrong. Does anyone know for sure?

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 66461 - Posted: 4 Jun 2010, 14:28:23 UTC - in response to Message 66460.
"I thought I had read in a thread somewhere that the tasks were supposed to report completed models even if aborted but I sure could be wrong. Does anyone know for sure?"
Tasks will report completed models... even if ended by the watchdog. But abort drops any output produced by the application.
Rosetta Moderator: Mod.Sense

Bikermatt Joined: 12 Feb 10 Posts: 20 Credit: 10,559,305 RAC: 0	Message 66462 - Posted: 4 Jun 2010, 15:21:20 UTC - in response to Message 66461.
"I thought I had read in a thread somewhere that the tasks were supposed to report completed models even if aborted but I sure could be wrong. Does anyone know for sure?"
"Tasks will report completed models... even if ended by the watchdog. But abort drops any output produced by the application."
Thanks, good to know. I will try to let the watchdog kill them from now on.

Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0	Message 66464 - Posted: 4 Jun 2010, 17:00:49 UTC
If you wish to force a task to clean up and ship out, you can suspend and resume it several times in a row. I forget whether it takes 4 or 5 restarts with no checkpoint, but the application has been set up to presume that this indicates either a problem, or at least that the task is not well suited to the machine's environment. It then terminates the task, but reports the completed models.
Rosetta Moderator: Mod.Sense
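
As a rough picture of the restart rule just described, here is a small Python sketch. It is not taken from the Rosetta source: the class and method names are invented, and the limit of 5 is an assumption, since the post above only says "4 or 5".

```python
from dataclasses import dataclass


@dataclass
class RestartTracker:
    """Counts restarts that happen without any checkpoint in between (illustrative only)."""

    max_restarts: int = 5  # assumption: the post says "4 or 5"
    restarts_without_checkpoint: int = 0

    def on_checkpoint(self) -> None:
        # A successful checkpoint means progress was saved, so the count starts over.
        self.restarts_without_checkpoint = 0

    def on_restart(self) -> bool:
        """Record a restart; return True when the task should end and report its models."""
        self.restarts_without_checkpoint += 1
        return self.restarts_without_checkpoint >= self.max_restarts


tracker = RestartTracker()
decisions = [tracker.on_restart() for _ in range(5)]
print(decisions)  # only the final restart triggers the clean shutdown
```

The point of resetting on a checkpoint is that ordinary suspend/resume on a healthy task would never trip the limit; only a task that keeps restarting from the same place would.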

Sid Celery Joined: 11 Feb 08 Posts: 2602 Credit: 47,220,881 RAC: 0	Message 66467 - Posted: 5 Jun 2010, 1:27:13 UTC - in response to Message 66460.
"I fully accept credits aren't the real issue - but abandoning successful models means someone else has to go through it as well instead of doing new work. I note your wingman reported the job successfully btw."
"Wow! I checked his reports and you can tell that it locked up on him also. His default run time is 3 hours but this one ran for 18,128 sec. Plus he only reported 277 models and these tasks typically fly when running properly"
I noticed that too, but if it ran 2 hours over time then it ended 'normally' and without the watchdog intervening. I don't mean to criticise at all. I don't know what it means tbh, but if CPU keeps increasing I still think it means it hasn't stalled or locked up - though it may've got itself into some kind of loop. I'm guessing of course.
I keep tinkering too. I suspect I've rarely tinkered to improve anything though.