Posts by Hans Schulze

1) Message boards : Number crunching : Help us solve the 1% bug! (Message 12311)
Posted 20 Mar 2006 by Hans Schulze
Post:
I just found another workunit that restarted several times, wasting the whole day on an AMD 3500+ machine.
2006-03-19 7:40:03 PM|rosetta@home|Restarting result FA_RLXct_hom018_1ctf__360_252_1 using rosetta version 482
This is the 7th so-called 1% bug I have hit in a week.
Sorry, but I will remove this application from my farm.
2) Message boards : Number crunching : Help us solve the 1% bug! (Message 12289)
Posted 19 Mar 2006 by Hans Schulze
Post:
I don't mind an occasional error, but I do have a few issues:

1) Why restart a unit that has already overrun its time? I happened to notice this message in the log. I don't see how I get credit for the restart, as the CPU time is zeroed and no additional communication happens with R@H.
2) Cancelled, unsuccessful units seem to be recycled for some other oaf to run, so the number of these units floating around is increasing. After 3 failures by different people, they should be cancelled and permanently removed from the database queue.
3) This was reported half a year ago, and doesn't seem to be considered serious enough to already be under active research.
4) Suspending a WU seems to reset the CPU time, and hence the credits. Pausing the WUs to swap also seems to zero the CPU time accumulated before the stop, and hence the credits.
5) There is no local persistent log of either error messages or completed WUs, so it is hard to tell what went wrong before the last restart forced by a Microsoft update or a company-mandated machine update/patch. I would recommend appending to the existing log on Boinc restart. We need the logs to figure out the pattern here.
6) One of my machines bluescreens (bad pool caller) since I installed Boinc; it had run Seti for almost 2 years before that with no issues. I will run diags and reinstall drivers, but with Boinc causing some R@H to calculate WUs differently, who knows what's wrong.
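
Point 2 above (retire a unit after 3 failures by different people) could be sketched roughly as follows. This is only an illustration of the suggested rule, not how the R@H scheduler actually works; the class name, method names, and the `MAX_FAILED_HOSTS` constant are all hypothetical.

```python
from collections import defaultdict

# Threshold suggested in point 2; the constant name is hypothetical.
MAX_FAILED_HOSTS = 3


class WorkunitQueue:
    """Toy model of a queue that retires units failing on several distinct hosts."""

    def __init__(self):
        self._failed_hosts = defaultdict(set)  # workunit name -> hosts that failed it
        self._retired = set()                  # workunits permanently removed

    def report_failure(self, workunit, host):
        """Record a failure; return True if the unit may be sent out again."""
        if workunit in self._retired:
            return False
        # A set is used so repeated failures on the same host count only once.
        self._failed_hosts[workunit].add(host)
        if len(self._failed_hosts[workunit]) >= MAX_FAILED_HOSTS:
            self._retired.add(workunit)
            return False
        return True

    def is_retired(self, workunit):
        return workunit in self._retired
```

Counting distinct hosts rather than raw failures means one flaky machine restarting the same unit over and over would not retire it on its own.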

I ran Seti in the days
3) Message boards : Number crunching : Help us solve the 1% bug! (Message 12189)
Posted 18 Mar 2006 by Hans Schulze
Post:
FA_RLXnp_hom022_1npsA_361_221_0 using rosetta version 482

After rebooting the system, this calculation stops in the same place.

I archived (RAR) the slot directories, deleted them all, and restarted the wu, but it still hangs in the same place.

Exited Boinc, restarted it, got the graphics on the screen, then quickly killed both Boinc and Boincmgr. The graphics continued flawlessly.

Not sure what happened next, but that wu disappeared without completing.

I will try this again next time I see a wu puddle.
4) Message boards : Number crunching : Help us solve the 1% bug! (Message 12187)
Posted 18 Mar 2006 by Hans Schulze
Post:
2006-03-17 6:55:43 AM|rosetta@home|Starting result FA_RLXnp_hom022_1npsA_361_221_0 using rosetta version 482
2006-03-17 4:18:04 PM|rosetta@home|Result FA_RLXnp_hom022_1npsA_361_221_0 exited with zero status but no 'finished' file
2006-03-17 4:18:04 PM|rosetta@home|If this happens repeatedly you may need to reset the project.
2006-03-17 4:18:04 PM||request_reschedule_cpus: process exited
2006-03-17 4:18:04 PM|rosetta@home|Restarting result FA_RLXnp_hom022_1npsA_361_221_0 using rosetta version 482
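
Log lines like the ones above could be scanned for repeated restarts with a short script. This is a minimal sketch, assuming every client message follows the `timestamp|project|message` layout seen here; the function name `count_restarts` is hypothetical, and reading the lines from a saved log file is left out.

```python
import re
from collections import Counter

# Pulls the result name out of messages such as
# "Restarting result NAME using rosetta version 482".
RESTART_RE = re.compile(r"Restarting result (\S+) using")


def count_restarts(log_lines):
    """Return a Counter mapping result name -> number of restarts seen."""
    restarts = Counter()
    for line in log_lines:
        # Fields are separated by '|': timestamp, project, message.
        parts = line.rstrip("\n").split("|", 2)
        if len(parts) != 3:
            continue  # not a line in the assumed format
        match = RESTART_RE.search(parts[2])
        if match:
            restarts[match.group(1)] += 1
    return restarts


sample = [
    "2006-03-17 6:55:43 AM|rosetta@home|Starting result FA_RLXnp_hom022_1npsA_361_221_0 using rosetta version 482",
    "2006-03-17 4:18:04 PM|rosetta@home|Result FA_RLXnp_hom022_1npsA_361_221_0 exited with zero status but no 'finished' file",
    "2006-03-17 4:18:04 PM||request_reschedule_cpus: process exited",
    "2006-03-17 4:18:04 PM|rosetta@home|Restarting result FA_RLXnp_hom022_1npsA_361_221_0 using rosetta version 482",
]

for name, count in count_restarts(sample).items():
    print(name, count)
```

Any result that shows up with more than one restart would be a candidate for the 1% problem.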

Now stuck at 1%, 10:49 PM, no graphics activity, step 21585. Suspend, resume, no effect. Exited Boinc, restarted, the thread ran to exactly the same spot and stopped in about 15 seconds. I permanently suspended that wu, and now my machine is working on the next one while I look at why it stops. Strangely enough, the suspended wu is still hogging 100MB of ram.

Running the same job with the same seed passed the 1% point no problem.

I must say that I don't like that my machine crunched for 9+6.5 hours with no result or credits. Boinc should definitely not restart the calculation without notifying HQ.

Is it possible that some files were copied incorrectly as the job was started? I will save this post, reboot, resume, and post back here if it ran correctly under the GUI.
5) Message boards : Number crunching : Credit fallen from 50-odd to approx 19? (Message 12127)
Posted 17 Mar 2006 by Hans Schulze
Post:

[snip] ... What happens when Boinc runs when you play a game? (hey, maybe 3-4 hours a week) Does it randomly do benchmarks when CPU is busy? If I am in the middle of an Excel sheet recalc, the benchmark will take forever and get a lousy score.


In short yes. If BOINC is running it could benchmark at any time.... [snip]

In that case, how is the credit calculation affected by a loaded machine getting lower benchmarks? Do the credits drop for processing the same 2h wu?
Does that mean the souped-up Boinc versions getting higher benchmarks get higher credits for the same work?
And then, why would Boinc recheck throughput on a regular basis? Could any random hit of Explorer stuck looping on a slow web page turn hours of crunching into zero credits?
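
To make the question concrete, here is a toy model. It assumes claimed credit is proportional to CPU time multiplied by the average of the two benchmark scores; the scaling constants (100 credits per CPU-day on a host scoring 1000 on both benchmarks) and the benchmark numbers are assumptions for illustration, not values confirmed in this thread.

```python
SECONDS_PER_DAY = 86_400

# Assumed scaling: 100 credits per day of CPU time on a host that
# scores 1000 on both benchmarks. Both constants are assumptions.
CREDITS_PER_REFERENCE_DAY = 100.0
REFERENCE_SCORE = 1000.0


def claimed_credit(cpu_seconds, whetstone, dhrystone):
    """Toy model: credit scales with CPU time and the mean benchmark score."""
    mean_score = (whetstone + dhrystone) / 2.0
    days = cpu_seconds / SECONDS_PER_DAY
    return days * (mean_score / REFERENCE_SCORE) * CREDITS_PER_REFERENCE_DAY


# Same 2h wu, benchmarked once on an idle machine and once on a loaded one.
# The benchmark scores are made-up numbers.
idle = claimed_credit(2 * 3600, whetstone=1500, dhrystone=3000)
busy = claimed_credit(2 * 3600, whetstone=750, dhrystone=1500)
print(idle, busy)
```

Under this assumption, a benchmark run that scores half as much because the machine was busy would halve the claimed credit for the same 2h wu, and an inflated benchmark would raise it by the same proportion.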
6) Message boards : Number crunching : Credit fallen from 50-odd to approx 19? (Message 12110)
Posted 16 Mar 2006 by Hans Schulze
Post:
Most of my machines have dropped to about 60% of their original credits/day over the last few weeks. The first issue was a large number of units that took well over 24h, and a week's worth of those ended up being past the deadline, so I aborted a whole slew of them on most of my 5 machines. I have now been back to running <2h units for about a week, and the stats look like they are levelling off at the lower level. Each machine has a different CPU and motherboard; none are laptops. I occasionally have to suspend Boinc for an hour or two when some dumb installshield apps run, as they [not!] run at the same priority, but this is very rare, like once a week. I also noticed that my queues were almost empty on two machines, as if the benchmarks were miscalculated (24h estimate each for a pair of 2h units), and as soon as a few of the 2h units were calculated, the queue filled up again, although the estimate still shows 5h.
What happens when Boinc runs when you play a game? (hey, maybe 3-4 hours a week) Does it randomly do benchmarks when CPU is busy? If I am in the middle of an Excel sheet recalc, the benchmark will take forever and get a lousy score.
7) Message boards : Number crunching : AMD vs Intel (Message 9169)
Posted 17 Jan 2006 by Hans Schulze
Post:
Best SETI machine I ever used was a Dual-XEON HT 2800 MHz machine, crunching at about 4WU per 2 hours. SETI made good use of the large on-chip caches. Would be nice to see what some Intel EE or more recent Athlon FX chips do.
That machine shows recent credits of close to 500, where dual Opteron 242 1.6 GHz 2600+ are pulling in about 175. A new Ath64 3700+ is running around 135.
Anyone know what the "recent average credit" time period is? A week?





