Posts by Bikermatt

1) Message boards : Number crunching : Report long-running models here (Message 68225)
Posted 27 Oct 2010 by Bikermatt
Post:
Chris, I have noticed that the PCS_ tasks run very slowly in Linux. On my 2.2 GHz Linux box they were taking 10 hours to make two models. On my 2.1 GHz Win 7 box they always seem to make at least 4 models in 6 hours.
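
That works out to roughly 0.2 models per hour on Linux versus at least 0.67 on Win 7, better than a threefold difference at nearly the same clock speed.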

A few days ago I was getting a ton of them, so I put my Linux machines on WCG for a while, but you can look at the results for my Win 7 box and pick out the PCS tasks just by looking at the granted credit.

Edit: In fact, I have looked at a lot of other Win 7 boxes out there, and all of the PCS tasks on Win 7 seem to be getting much higher granted credit than what was claimed. So maybe it is some kind of malfunction?
2) Message boards : Number crunching : minirosetta 2.16 (Message 68179)
Posted 23 Oct 2010 by Bikermatt
Post:
Anyone else notice PCS_ tasks running poorly in Linux? They are running longer than the default and producing fewer models than on my similarly equipped Win 7 box.
3) Message boards : Number crunching : no work units (Message 67478)
Posted 31 Aug 2010 by Bikermatt
Post:
Um, I think there is a really good point being made here. If any other project I crunch on goes down for even five minutes, it seems like someone who works on the project posts what happened and what they are doing about it.

Rosetta has gone down several times since I started crunching here and I have never seen a post from anyone that works on the project about what is going on. Am I missing a thread somewhere?
4) Message boards : Number crunching : simIF2...ProteinInterfaceDesign...-tasks (Message 66974)
Posted 22 Jul 2010 by Bikermatt
Post:
I have noticed that when these protein interface tasks stop checkpointing, the model in progress will show a very high step count when you show graphics.

I have seen tasks that have not checkpointed for several hours with a model at over 100k steps.

It seems like when I look at the graphics of a protein interface task that is updating its checkpoint, it is making a model every 500 steps.
5) Message boards : Number crunching : New Memory Requirements? (Message 66624)
Posted 21 Jun 2010 by Bikermatt
Post:
I just found the 24-core box running 5 CASP 9 tasks out of 24. They were all using around 600 MB per task, so I must have seen a few of them running the other day.

The box is using 13 GB right now; it runs around 10 GB with no CASP tasks. I will definitely go with one GB per core from now on.

Linux is really running great on the box also. I have actually just started using Linux in the last few months.

I tinkered a little bit with it eight years ago when I was a computer tech but never really got into it. Today I have one XP and two Win 7 licenses; as far as I am concerned, I will never need to buy a Windows license again.
6) Message boards : Number crunching : New Memory Requirements? (Message 66613)
Posted 19 Jun 2010 by Bikermatt
Post:
Chris,
I am playing with a new box I just built, and it has been running Win 7 64-bit for the last 24 hours or so.

Task Manager shows all of the Rosetta work units varying between the high 200s and the low 400s of MB for memory usage. The box is not running any of the CASP tasks right now, though, and as Murasaki said, I remember reading somewhere that those tasks would be using more memory.

Anyway, the same box was running Linux for a few hours before I loaded Windows, and I seem to remember quite a few of the tasks running up into the 600s, but I was not looking at task names, so those could have been CASP tasks.

The machine was running a lot better with Linux, so I will be reloading it this morning after it finishes up its last tasks. I will let you know for sure after it is reloaded.

Also, on boxes that I use for crunching only, I set the memory usage up to 90% for "in use" and "idle". I noticed sometimes that if a box was right at the limit it would start waiting for memory even when I was just checking the system out.
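
If you would rather set that in a file than through the website, I believe the client's global_prefs_override.xml takes tags along these lines (I am going from memory here, so check the tag names against the BOINC documentation):

<global_preferences>
  <ram_max_used_busy_pct>90</ram_max_used_busy_pct>
  <ram_max_used_idle_pct>90</ram_max_used_idle_pct>
</global_preferences>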

I also noticed on my six-core box that when I added 3 GT240s for running GPUGrid tasks, it started waiting for memory when trying to run nine tasks at once (duh). I had to bump it up to 8 GB, and it now runs all nine tasks without issue.

Four GB for six cores had been working well for me, so when I built this most recent machine I planned on 4 GB for every six cores.

I'm planning on building another 24-core box at the end of this summer, and with more of the higher-memory tasks out there, I think I will just go for a GB per core from now on.
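
To put numbers on it: if a batch runs at the 600 MB I saw under Linux, 24 tasks would already want around 14.4 GB before the OS takes its share, so a GB per core leaves a sensible margin.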

It seems like quite often when one of my machines has to download a lot of tasks at the same time, they are all the same type of task. I think this has to do with how the tasks are generated.

Anyway, if you get a whole batch of the higher-memory tasks, less than one GB per core might not cut it anymore.
7) Message boards : Number crunching : Over aggressive work fetch? (Message 66567)
Posted 13 Jun 2010 by Bikermatt
Post:
This might not help, but here is what I would try if it does not start behaving properly in a few days (a command-line version of the same steps follows the list).

- Select "No new tasks" on BOINC and let everything you have right now complete. You may have to manually update for the tasks to report.

- After everything is gone, try resetting and detaching from the project. I've read in other forums that that can knock some sense into BOINC.

- If all else fails you could try a different (possibly older?) version of BOINC.
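
For a headless box, the boinccmd tool that ships with BOINC can run those same steps. Just a sketch; substitute the URL of whichever project is misbehaving:

  boinccmd --project http://boinc.bakerlab.org/rosetta/ nomorework
  boinccmd --project http://boinc.bakerlab.org/rosetta/ update
  boinccmd --project http://boinc.bakerlab.org/rosetta/ reset
  boinccmd --project http://boinc.bakerlab.org/rosetta/ detach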

Good luck!

Matt
8) Message boards : Number crunching : Credit always low (Message 66529)
Posted 9 Jun 2010 by Bikermatt
Post:
I have been following this thread and a part of me really wants to be angry at the people that whine about credit.

Anyone who has had any biology can tell you that proteins are what allow life to exist. The other thing they can tell you is that in biology, structure equals function.

Having knowledge of protein structures and how they function will benefit every living organism on this planet!

I think Chris:

"I just look at credits as being a benchmark as to how well I am doing within a project and as a measure of what I am accomplishing for the project. And with the credit structure they have setup here it can also serve as a flag when something is not working right (see the long running work unit discussion)"

and JackOnTheRox:

"I keep track of my accumulation and RAC, and compare it to the others around me to keep it interesting, but I crunch here because I hope the research makes a difference."

gave the best reasons for credit.

But if raising the credit could bring in a few more credit whores, that would equal more models produced, right?

"Another 'theory' is that if you keep the credits low the 'credit whores' won't come and make the server work extra hard. Just the people that 'believe' in the idea of your project will come and since they are always going to be here the server load stays fairly steady. More credits means more people meaning more work for the project and that can mean more monetary expenditures. Credits ARE low here, but they may plan it that way."

But then mikey’s comment brings up a very good point:

So is there any way some kind of cost-benefit analysis can be performed?
How much throughput could be gained by raising the awarded credit, and what would it cost in server time?
And can the project accept monetary contributions? I know a lot of other projects ask.
You never know, you might have some crunchers around here that wouldn’t mind giving a few bucks to help with buying a little more server power.

Anyway, just my thoughts, I imagine the last question is the only one that can be easily answered.
9) Message boards : Number crunching : Why so much variation? (Message 66462)
Posted 4 Jun 2010 by Bikermatt
Post:
I thought I had read in a thread somewhere that the tasks were supposed to report completed models even if aborted but I sure could be wrong. Does anyone know for sure?


Tasks will report completed models... even if ended by the watchdog. But abort drops any output produced by the application.


Thanks, good to know. I will try to let the watchdog kill them from now on.
10) Message boards : Number crunching : Why so much variation? (Message 66460)
Posted 4 Jun 2010 by Bikermatt
Post:

I fully accept credits aren't the real issue - but abandoning successful models means someone else has to go through it as well instead of doing new work. I note your wingman reported the job successfully btw.


Wow! I checked his reports and you can tell that it locked up on him also. His default run time is 3 hours but this one ran for 18,128 sec. Plus he only reported 277 models and these tasks typically fly when running properly.

Anyway, I thought I had read in a thread somewhere that the tasks were supposed to report completed models even if aborted but I sure could be wrong. Does anyone know for sure?
11) Message boards : Number crunching : Why so much variation? (Message 66439)
Posted 3 Jun 2010 by Bikermatt
Post:

NB: I note bikermatt aborted his task - maybe that wasn't the best idea, but with the symptoms it showed it's hard to blame him. Hopefully the wingman will fare better.


Well, I checked it after another two hours and noticed it had locked up again. Models were in the three hundreds at the time, but steps were counting up from 300k. The checkpoint had made it up to an hour and a half and was not updating anymore, and the CPU time was over four hours.

Again, suspend/resume set the models back to zero and got things running again. I let it make a few more models and then aborted it, both to see what it would report and because I was getting tired of watching it. It seems like it just submitted an error report; however, I think it also shows all of the times it was suspended and resumed.

Another one I aborted the other day, with an elapsed time of over ten hours and a CPU time of two hours, produced a similar report.

int2_centerfirst2b_1fAc_3eme_ProteinInterfaceDesign_23May2010_21231_30_0

I do not like to abort tasks, and I couldn't care less about the credit. However, after watching so many of these tasks lock up over the last week, I will not let them run unattended.
12) Message boards : Number crunching : Why so much variation? (Message 66429)
Posted 2 Jun 2010 by Bikermatt
Post:
int_simpleTwo_1f0s_2djh_ProteinInterfaceDesign_21May2010_21289_68_0

So this task is running on my system right now, and its checkpoint is updating maybe twice per minute. It had about two hundred models after about an hour, and I suspended the task. When I resumed, the models started over at zero.

I let it run up to nine again and suspended; again, when I resumed, the models went to zero! This would really account for the variation in credit on these tasks if any time you snooze or reboot you are restarting from zero. Am I the only one seeing this?

Edit - Now the same task has made it back up to model 79, but it seems to be stuck. It is at step 95,000 and counting. CPU time is now at 1.5 hours, with the last checkpoint at one hour.

Yet another edit! - I let the steps get up to 150,000 on model 79, then hit suspend/resume. Models went back to zero, and CPU time went back to the one-hour checkpoint stated above. It is now making models and checkpointing again.
13) Message boards : Number crunching : Why so much variation? (Message 66413)
Posted 2 Jun 2010 by Bikermatt
Post:
Chris,

Have you checked the graphics on some of the runs to see how they are doing at making models? As far as I can tell, when these tasks are running properly they make models really fast, and each model only takes 500 steps.

The tasks that seem to have crashed will be stuck on a model with steps ticking into the thousands. It seems like that would account for a large variation in granted credit, because when this occurs you end up making fewer models.

Also, I did run eight or so of these yesterday with my default run time set at two hours, and they all finished; however, six of them had their checkpoints stuck between 10 and 20 minutes the whole time. I wonder if people with longer default run times are having more problems with these tasks.
14) Message boards : Number crunching : Why so much variation? (Message 66403)
Posted 1 Jun 2010 by Bikermatt
Post:
I have watched about twenty of the ProteinInterfaceDesign work units run over the last few days and it seems like they just stall. I don't seem to remember anyone saying that suspend/resume would create a checkpoint. It does seem to restart these work units though.

The problem is that with my six-hour run time, I was finding about 25% of these work units stalled with a CPU time of about 4 hours, an elapsed time of anywhere between 7 and 10 hours, and a checkpoint at ten minutes!

So the work units are not checkpointing, and the watchdog is not killing them either. Suspend/resume will restart the run in these cases, but if you "show graphics" before and after the suspend/resume, the hundreds of models that had been produced are reset to zero, so all of your work is lost. The only other option in this case seems to be to abort; however, again you lose all of your models.

In some cases I noticed these would stall even though the work units were checkpointing. In those cases, suspend/resume would finish the work unit several seconds after resuming.
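
Since I cannot babysit every box, a little script that flags slot directories nothing has written to in a while would help catch these. A rough sketch in Python, assuming a default Linux install with slot directories under /var/lib/boinc-client/slots (the path and the two-hour threshold are my assumptions; adjust for your setup):

import os
import time

SLOTS = "/var/lib/boinc-client/slots"  # default Debian/Ubuntu location; adjust for your install
STALE_HOURS = 2  # flag a slot if none of its files have been written for this long

now = time.time()
for slot in sorted(os.listdir(SLOTS)):
    path = os.path.join(SLOTS, slot)
    if not os.path.isdir(path):
        continue
    # newest modification time of anything in the slot directory
    times = [os.path.getmtime(os.path.join(path, name)) for name in os.listdir(path)]
    newest = max(times, default=0)
    hours = (now - newest) / 3600
    if hours > STALE_HOURS:
        print("slot %s: nothing written for %.1f hours - possibly stalled" % (slot, hours))

Run from cron every half hour or so, something like this would at least tell me which boxes need a suspend/resume.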

Matt
15) Message boards : Number crunching : Why so much variation? (Message 66381)
Posted 31 May 2010 by Bikermatt
Post:
I noticed my laptop, which has a three-hour run time, seems to have less trouble completing these work units. My higher-throughput systems are set to six hours. Also, it seems most of the stuck CPU times are in the two-to-four-hour range.

I am going to move one system back to a 2 hour run time and see what happens. Could this be a watchdog issue in combination with some work units that don't crash until later in the run?

Definitely not a genius! I just had a rod put in my leg and have a ruptured disc in my back! I am spending way too much time looking at these computers.

Matt
16) Message boards : Number crunching : Report long-running models here (Message 66373)
Posted 30 May 2010 by Bikermatt
Post:
gunn_fragments_SAVE_ALL_OUT_-1rkiA__20675_701_0

This one had an elapsed time of 10.5 hours with CPU time stuck at 2 hours.

Suspend/resume allowed it to finish normally in 21101.7 CPU seconds, but elapsed time was almost 15 hours.
Default run time on this system is 6 hours.

17) Message boards : Number crunching : Why so much variation? (Message 66368)
Posted 30 May 2010 by Bikermatt
Post:
Chris,

You are not alone. I have seen a huge increase in the number of long-running work units on all of my systems for the last week or two, and especially the last few days. I have AMD and Intel CPUs, I am running both Linux and Win 7, and this occurs on all of the systems.

I first noticed it in the ProteinInterfaceDesign work units and reported it in the long-running thread; however, today I noticed a gunn_fragments task that had run for 10 hours with 2 hours of CPU time, so this may be a 2.14 issue.

Usually if I suspend then resume the task it will start running again or complete itself within a few seconds.

Matt
18) Message boards : Number crunching : Report long-running models here (Message 66353)
Posted 29 May 2010 by Bikermatt
Post:
int2_centerfirst2b_1fAc_2a9o_ProteinInterfaceDesign_23May2010_21231_280

Found this one at 6 1/2 hours (default is 3 on this system) with CPU time at 28 min.
I suspended it and then resumed, and let it run for another 15 minutes, but the CPU time only went up another 2 minutes, so I aborted the WU.
19) Message boards : Number crunching : Report long-running models here (Message 66324)
Posted 25 May 2010 by Bikermatt
Post:
I have had problems with several of these on all of my systems. The CPU time will be 7+ hours, with the last checkpoint as much as 5 hours earlier.

Matt

int2_centerfirst2b_1fAc_1i76_ProteinInterfaceDesign_23May2010_21231_62_0

int2_centerfirst2b_1fAc_1k7j_ProteinInterfaceDesign_23May2010_21231_129_0
20) Message boards : Number crunching : Report long-running models here (Message 65639)
Posted 25 Mar 2010 by Bikermatt
Post:
Does anyone look at long-running models anymore? I have been seeing two to three per week.

-Matt

Win 7 64 bit

v2FcInnerW_1dAl_3fk8_ProteinInterfaceDesign_15Mar2010_18672_235_0

http://boinc.bakerlab.org/rosetta/result.php?resultid=326997251

<core_client_version>6.10.18</core_client_version>

======================================================
DONE :: 2 starting structures 20222.2 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

Validate state: Valid
Claimed credit: 115.549211591099
Granted credit: 0.467169393634457
Application version: 2.05