Posts by Plasmon_attack

1) Message boards : Number crunching : Server refuses to send more work (Message 74434)
Posted 15 Nov 2012 by Plasmon_attack
Post:
I have a 64 bit win7 machine that I updated with the correct software this morning. I only got 8 tasks (I have 16 cores) and all completed just fine and have been reported. However, the server refuses to send any more work (reached daily quota of 8 results). I've detached and reattached to the project, but it still won't send anything and is deferring communication for 9 hours. Why is this computer limited to 8 tasks/day when my other ones aren't? This machine COULD do >100 tasks/day but work needs to be sent. I know sometimes if a computer sends in too many bad work units (I tried adding this computer yesterday and 8 work units ended with computation errors) it gets stalled but since it's returned good ones now why the withholding? Thoughts?
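A rough check of the ">100 tasks/day" figure above, as a sketch only. It assumes every core stays busy around the clock; the 3-hour and 7-hour run times are illustrative assumptions, not measurements from this machine, and `tasks_per_day` is a hypothetical helper.

```python
def tasks_per_day(cores, hours_per_task):
    """Upper bound on daily throughput if every core runs tasks back to back."""
    return cores * 24 / hours_per_task

cores = 16  # core count stated in the post

# Assumed run times (hours per task); not measured on this machine.
for hours in (3, 7):
    print(f"{hours} h tasks: {tasks_per_day(cores, hours):.1f} tasks/day")

# Longest run time that still allows 100 tasks/day on 16 cores.
break_even_hours = cores * 24 / 100
print(f"break-even run time: {break_even_hours:.2f} h")
```

Under these assumptions, a quota of 8 tasks/day would leave most of the machine idle whenever tasks run for less than a full day each.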
2) Message boards : Number crunching : High priority jobs getting out of control (Message 71045)
Posted 13 Aug 2011 by Plasmon_attack
Post:
Thanks all...Rosetta is allowed access to all the memory, it's left in memory while suspended, etc. I've seen it get hung up before with memory limits so I've opened the bore wide and the system has plenty of overhead. I am noticing a mix of 3 hour and 7 hour tasks going by so I think it's getting confused about how long things are going to take. The work units it's jumping to are the longer-running units, and it's skipping over the shorter units to do it. Also running on the most recent version of BOINC. I think it's just the switching around. Looks like it'll miss the deadline on a few so we'll be able to see what happens if they're returned slightly late.
3) Message boards : Number crunching : High priority jobs getting out of control (Message 71009)
Posted 10 Aug 2011 by Plasmon_attack
Post:
Thanks all, interesting discussion. For the questions about settings, I don't know what you mean by 'default run time' or how to raise or lower it. This computer is on 24/7, has unrestricted access to the cores, and is rarely paused. It's the only project I run so there's no switching between apps. The idea about the estimated run times being off seems plausible. I did see a lot of the short workunits go by, and then I had a batch that was taking more like 6-7 hours to complete for a while, so I could see the estimator getting confused. It's just weird that it gives high priority to work units due LATER rather than the ones due earlier. I would be worried about not finishing work units due on the 11th and prioritize them over ones due on the 15th.

I'll just leave the network off until it stabilizes. It won't be much longer now. The queue might've been 5 days but I've found that it usually finishes them 2-3 days earlier than expected. I have a computer at home set to 10 days because when there is a workunit shortage it's usually out in 3-4.
4) Message boards : Number crunching : High priority jobs getting out of control (Message 70991)
Posted 9 Aug 2011 by Plasmon_attack
Post:
Thanks all...I don't think it's a memory issue. The buffer is set to about 5 days (which is pretty reasonable given the length of recent outages)....it's got 16 cores and 16 GB of RAM and isn't running ANY other projects in addition to Rosetta. The 'time to completion' for the high priority tasks is about the same as the ones that have paused. Rosetta has full access to the memory and it's only showing about 8-10 GB used at any time, and with my apps open we don't hit the limit. All 16 cores crunch pretty much all the time, just once in a while these get out of whack. If I stop network communications for a little bit it seems to even out. I've seen this happen on other machines (like a Mac mini) where things are a bit less extreme. There are no error messages around this, just, "Pausing task, xx task is high priority" so I figured this might be coming from the server somehow.
5) Message boards : Number crunching : High priority jobs getting out of control (Message 70981)
Posted 9 Aug 2011 by Plasmon_attack
Post:
I've seen this behavior several times and I'm wondering what causes it. Everything will run smoothly for a while, but then out of nowhere a large number of WUs will start running at high priority. Weirdly, these jobs are usually not due for some time (in the current case at least a week) and are crunched before jobs with earlier due dates (in this case jobs due in 3 days). Which jobs have high priority can also change while a job is running. At the moment one of my machines has 110 partially completed work units (anywhere from 1% to 90% complete) that it has paused to run these high priority jobs. Here is an example for those who dig into the database:

Job:
2p9h_lac_sum_rest_LigDes_SAVE-ALL_OUT_29943_319_0 is hung up at 70% complete and due 8/11/2011, 9:28 PM

Right above it is:
T0393_3d1l.pdb_boinc_symmetric_lr_symm_wangyr_IGNORE_THE_REST_29956_32531_0 which is now 85% complete and not due until 8/15/2011, 10:40 AM.

This starts to kill RAC because jobs aren't completing and being uploaded. Often I can fix this by turning off network communication and letting it run down ALL the work before the first WUs expire and roll it up in one massive upload, or click-and-pause all the unstarted WUs and make it finish the partial ones before it can move on. Both are tedious.

Any idea what's going on here?
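For comparison, here is a minimal sketch of the order a plain earliest-deadline-first rule would pick for the two jobs above. This is an illustration only, not BOINC's actual scheduling logic; `edf_order` is a hypothetical helper, and the deadlines and progress values are the ones quoted in the post.

```python
from datetime import datetime

# Deadlines and progress as reported in the post.
tasks = [
    {
        "name": "T0393_3d1l.pdb_boinc_symmetric_lr_symm_wangyr_IGNORE_THE_REST_29956_32531_0",
        "due": datetime(2011, 8, 15, 10, 40),
        "done": 0.85,
    },
    {
        "name": "2p9h_lac_sum_rest_LigDes_SAVE-ALL_OUT_29943_319_0",
        "due": datetime(2011, 8, 11, 21, 28),
        "done": 0.70,
    },
]

def edf_order(tasks):
    """Return tasks sorted earliest deadline first (hypothetical helper)."""
    return sorted(tasks, key=lambda t: t["due"])

ordered = edf_order(tasks)
for t in ordered:
    print(f'{t["due"]:%m/%d %H:%M}  {t["done"]:.0%}  {t["name"]}')
```

Under this simple rule the 2p9h job, due 8/11, would run ahead of the T0393 job, due 8/15, which is the opposite of the order described above.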
6) Message boards : Number crunching : Problems and Technical Issues with Rosetta@home (Message 70408)
Posted 27 May 2011 by Plasmon_attack
Post:
Yeah, this seems to happen every once in a while, the queue runs down. I'm tempted to ramp up my queues to 5 days or more since it can take that long to get work filled back in. Given it's Friday I suspect our computers are all getting a break over the weekend and our RACs will just have to eat it.
7) Message boards : Number crunching : zero work left (Message 69961)
Posted 4 Apr 2011 by Plasmon_attack
Post:
Yeah, they must've added more in the late evening. Yesterday the link you mention said 'zero' work to send.
8) Message boards : Number crunching : zero work left (Message 69930)
Posted 3 Apr 2011 by Plasmon_attack
Post:
My crunchers started running dry and I noticed there's zero work left. Will this be refilled anytime soon?
9) Message boards : Number crunching : Can't report to server on one machine (Message 69667)
Posted 17 Feb 2011 by Plasmon_attack
Post:
Ok, got it, I worked backwards through all versions of the BOINC client and found that 6.6.38 works. I'm back up, though I think the server is mad at me for losing the ~400 units that were completed so it only gave me one to start with (for 16 cores!). I guess eventually it'll recover.
10) Message boards : Number crunching : Can't report to server on one machine (Message 69665)
Posted 17 Feb 2011 by Plasmon_attack
Post:
Umm, yes, I did check that the computer has internet access and that BOINC is allowed to do network access before asking for help.

@Wolf...your situation and suggestions are a lot closer to what I think the issue is, namely, uninstall doesn't fully uninstall and leaves 'ghosts' behind that keep a fresh install from functioning. I've tried what you said and am a little further, but not quite there. I was able to locate program data in the 'program data' folder. I also found a leftover directory in program files (x86). ProgramData has the project specific data, work units, a lot of .xml files, etc., and I wiped it out. A fresh install now prompts to set up a new project.

However, it still can't communicate with the servers. It says they're temporarily unavailable (but I tried several projects) and then it gives the same network error.

For Win7, are there any other hidden places where the program could be leaving traces that remind it of this broken state? I've looked in the users directories (and I have hidden files showing) but I'm not finding anything that looks BOINC-related.

Sorry about your 3 workunits. I've been trying to fix this for a while as all 16 cores were crunching and lost ~400 workunits :(
11) Message boards : Number crunching : Can't report to server on one machine (Message 69660)
Posted 16 Feb 2011 by Plasmon_attack
Post:
Yes, I tried rebooting first, and several times, and between installations of different versions of the BOINC client.

It appears that our network admins may have disabled ping as I can't ping anything (even Google) from any computers here (note one is a laptop and ping works fine at home). Note, the network is working as there are seven other computers on the same network that are able to reach the server without an issue.

There was a power outage that crashed the machine (yes it's on a backup but the outage was too long) and this computer hasn't been able to connect.

I guess, is there a better way to uninstall Boinc? Clearly the uninstall doesn't do it totally because, once reinstalled, it knows it's connected to Rosetta AND still knows what work units are completed and not completed. Is there a way to wipe it off completely and start over?

Thanks
12) Message boards : Number crunching : Can't report to server on one machine (Message 69649)
Posted 16 Feb 2011 by Plasmon_attack
Post:
Hi Everyone, recently one of my computers (Win7 64-bit, dual quad-core Xeons, 6 GB RAM, hyperthreaded, BOINC 6.10.58) suddenly started having problems communicating with the server. It's still crunching work but says, "Scheduler request failed: Couldn't connect to server." Other computers on the same network are working fine.

I've uninstalled and reinstalled, made sure to delete the Boinc directory, and yet when I reinstall it's like nothing happened because the same work units are ready to report, the project is still attached, etc.

The error is, "Boinc couldn't do internet communication, and no default connection is selected. Please connect to the internet, or select a default connection using advanced/options/connection."

I don't need a proxy or anything for my network. Any idea what to put in? Is there a better way to reinstall? I have like 300 completed workunits piled up and would like to report them.

Thanks!
Tony
13) Message boards : Number crunching : Uploading error (Message 69013)
Posted 4 Jan 2011 by Plasmon_attack
Post:
I see the scheduler is back up and now I can upload finished work units, but it's still not sending out work. I have ~50 nodes waiting for work, several of them hyperthreaded. I joined yesterday. Is this much downtime typical? I usually have them fetch enough work for 2 days. Should I up this to 3, or 5, or 10? Since most of them were new to the project they hadn't yet received a full two days of work units.

©2024 University of Washington
https://www.bakerlab.org