Credits Granted

Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,488,060
RAC: 3
Message 8946 - Posted: 13 Jan 2006, 15:32:04 UTC - in response to Message 8942.  

But to me, and I assume to others as well, it (the "Technical News" statement) looked like that was all they were going to do (rather incomplete) and that made me rather grumpy.


Reading it again, it could sound that way; _I_ took it more as "here is what we have done", and didn't give up on more to come. I'm stubborn, or optimistic, or something - until I hear a flat "no", I assume something is either "yes" or "maybe". :-)

Deamiter
Joined: 9 Nov 05
Posts: 26
Credit: 3,793,650
RAC: 0
Message 8966 - Posted: 13 Jan 2006, 20:38:22 UTC

Am I the only one who assumes a certain "loss rate" in my crunching? There are dozens if not hundreds of factors, but the biggest are probably computer resets (particularly for the highly-mobile laptop and borged boxes). Of course there are also network outages, power outages, cycle loss due to the OS, etc.

Somewhere in there there are losses due to project problems. But there's a big reason I'm working on alpha projects -- quite simply, I strongly feel I'm getting more scientific value for my CPU time by running the projects that are less popular! Of course that also means they're much less stable.

I guess I just don't complain if my RAC is down 50 points for the day because a WU got lost in the shuffle. Maybe one of the PCs in my lab was left off for the night, or maybe my home router needs a reset, or MAYBE one of my WUs was bad.

I guess I'm very content that the systematic problems are being worked on. Yeah, I'll probably lose some credit here and there for participating in the pre-release projects. The time I DO donate, however, is worth so much more to a project like Rosetta than to the overloaded SETI that I feel it more than makes up for such intermittent troubles and a slightly attenuated credit rate.

Los Alcoholicos~La Muis
Joined: 4 Nov 05
Posts: 34
Credit: 1,041,724
RAC: 0
Message 8978 - Posted: 14 Jan 2006, 0:38:54 UTC - in response to Message 8966.  

Am I the only one who assumes a certain "loss rate" in my crunching? There are dozens if not hundreds of factors, but the biggest are probably computer resets (particularly for the highly-mobile laptop and borged boxes). Of course there are also network outages, power outages, cycle loss due to the OS, etc.

Somewhere in there there are losses due to project problems. But there's a big reason I'm working on alpha projects -- quite simply, I strongly feel I'm getting more scientific value for my CPU time by running the projects that are less popular! Of course that also means they're much less stable.

I guess I just don't complain if my RAC is down 50 points for the day because a WU got lost in the shuffle. Maybe one of the PCs in my lab was left off for the night, or maybe my home router needs a reset, or MAYBE one of my WUs was bad.

I guess I'm very content that the systematic problems are being worked on. Yeah, I'll probably lose some credit here and there for participating in the pre-release projects. The time I DO donate, however, is worth so much more to a project like Rosetta than to the overloaded SETI that I feel it more than makes up for such intermittent troubles and a slightly attenuated credit rate.

I quite agree with you; the possibility of helping to achieve some of the goals of Rosetta is why I joined this project. And yes, a starting project deserves lots of understanding and courtesy.. but (there has to be a but somewhere) I value my cpu-time a lot.

And when cpu-time is wasted due to project problems, it is fine with me. But it still hurts. Because I go through quite some effort to gather as much computing power and time as I can get for this project. That is why I expect some understanding from the project staff in return. And the way of showing that is keeping us informed, involving us with the problems and granting us credits (as a matter of fact I don't give a sh*t about credits, but I keep telling myself (and my wife) that there will be a time that the electricity company will accept them as payment for their bills).

I think others go through the same amount of trouble to keep their farms crunching and that is why they like to have their lost cpu-time rewarded.

By the way, imho the Rosetta project staff is doing a great job so far... thanks.

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 8980 - Posted: 14 Jan 2006, 0:48:01 UTC - in response to Message 8917.  


I know/understand they can be busy, but it would've been nicer if they kept us a bit more informed about these things.
Not only what they do or have done, but also what they intend to do.


A lot of what we intend to do is based on feedback from users like you. For instance, we are now discussing whether we should grant credit to all "Time exceeded" errors. My vote is, rather than worrying about past lost credit, to spend time on tracking down and fixing the cause. We have an important question to answer about these errors:

Are they due to stuck jobs (i.e. 1% errors), or are they being terminated prematurely due to the rsc_fpops_bound being set too low on our end?

I have not seen evidence yet suggesting the latter in general. The bound is set conservatively.

We are definitely going to try to fix this issue in the next app update.

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 8998 - Posted: 14 Jan 2006, 11:13:28 UTC - in response to Message 8980.  

A lot of what we intend to do is based on feedback from users like you. For instance, we are now discussing whether we should grant credit to all "Time exceeded" errors. My vote is, rather than worrying about past lost credit, to spend time on tracking down and fixing the cause. We have an important question to answer about these errors:

You have losses on all projects for one reason or another.

Perhaps rather than hunting around, just give a "flat rate" bonus to the people on the project. Much simpler, easier, less time, ...

But, I also would much rather you spent the time moving forward on the fixes and improvements ...

rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 9012 - Posted: 14 Jan 2006, 15:18:16 UTC - in response to Message 8998.  

But, I also would much rather you spent the time moving forward on the fixes and improvements ...


I agree! :)

Regards,
Bob P.

Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,488,060
RAC: 3
Message 9015 - Posted: 14 Jan 2006, 15:54:37 UTC - in response to Message 8980.  

Are they due to stuck jobs (i.e. 1% errors), or are they being terminated prematurely due to the rsc_fpops_bound being set too low on our end?

I have not seen evidence yet suggesting the latter in general. The bound is set conservatively.


The ones that I looked into were definitely not "stuck" at any point, and they actually ran a fairly "normal" amount of time for that WU type - at the long end, but not unreasonable. I believe that the problem is that when calculating "is the boundary exceeded", BOINC uses the DCF as well as the benchmarks. Example...

Let's say (for simplicity of math) the bound is 100,000, and the benchmark is 100. You would expect this result to hit the boundary after 1000 seconds. If the host normally finishes an "average" result in 500 seconds (estimated fpops is 50,000), the setting is quite conservative; you're allowing this result to run up to twice the normal expected time. But, it seems _all_ Rosetta results have the same _ESTIMATED_ time, when in reality, the actual times vary quite a bit. DCF is lowered by short (quicker than estimated) results, and raised by long (longer than estimated) ones; if this host happens to get a handful of very short WUs (say 250 seconds) immediately before getting _this_ result, then the DCF could be, for example, 0.5 when this result starts. Multiply the bound by that DCF, you're suddenly down to 50,000, or 500 seconds - and if this result runs even ONE SECOND longer than the "average result", it's exceeded the boundary.
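To make the arithmetic in this hypothesis concrete, here is a minimal Python sketch using the round numbers from the example. It assumes, as the post suspects but has not confirmed, that the client scales the bound by the DCF; the function name `time_limit_seconds` and its `dcf` parameter are made up for illustration and are not BOINC identifiers.

```python
def time_limit_seconds(fpops_bound, benchmark_fpops_per_sec, dcf=1.0):
    """CPU-time cutoff in seconds, ASSUMING the DCF multiplies the bound.

    This models the hypothesis in the post, not confirmed BOINC behaviour.
    """
    return fpops_bound * dcf / benchmark_fpops_per_sec


bound = 100_000      # fpops bound from the example
estimated = 50_000   # estimated fpops for an "average" result
benchmark = 100      # benchmark, in fpops per second

average_runtime = estimated / benchmark                   # 500 seconds
limit_without_dcf = time_limit_seconds(bound, benchmark)  # 1000 seconds
limit_with_dcf = time_limit_seconds(bound, benchmark, dcf=0.5)

print(f"average result:       {average_runtime:.0f} s")
print(f"cutoff with DCF 1.0:  {limit_without_dcf:.0f} s")
print(f"cutoff with DCF 0.5:  {limit_with_dcf:.0f} s")
```

If the hypothesis held, a DCF of 0.5 would leave no headroom at all over the average run time in this example.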

Now, in general, the DCF is a very good thing; it keeps the cache filled with the correct amount of work, it lets the "to completion" times be reasonably accurate, etc. But the accuracy of the DCF itself depends _entirely_ on the accuracy of the project's estimates of "how long" a result will take. THAT is, I believe, the source of this problem; Rosetta simply isn't very accurate on those estimates, making DCF a matter of "luck" - the order in which a host did what type of results. I have seen DCF vary IN ONE DAY, on one of my machines, by a factor of 2x, and that was _without_ any "short error" WUs. (Error WUs shouldn't lower the DCF, but I haven't been able to prove that they don't...)

I don't know what your estimated fpops is or your boundary fpops, so I don't know what "conservative" means - 2x? 3x? I'm guessing it's a bit more than 2x, or we would see many more "max cpu exceeded" errors. I've had a result with an original estimated time of 10 hours take 21 on a slow Mac and not blow up, but I'd bet it was getting pretty close.

The long term solution is simple - use the alpha project or internal machines to run a hundred of each new WU type before releasing them to the project, and set the estimated fpops for that WU type based on the times you see. The short term solution may well require temporarily raising the boundary - I _don't_ think it is helping with the 1% problem, it sounds like some of those have still been "stuck" well after I think they should have gotten the "max cpu exceeded" error.

nasher
Joined: 5 Nov 05
Posts: 98
Credit: 618,288
RAC: 0
Message 9022 - Posted: 14 Jan 2006, 16:59:06 UTC - in response to Message 8878.  


The time exceeding WU have not been granted credit yet. The "Bad Random Number Seeds" WU usually error out in less than 20 CPU seconds and are a different kind of WU problem.

Regards
Phil

Yet or won't be granted at all ?

It's not about the credits, but the way this is handled which is annoying me.
If it's your own fault/mistake you can't complain, but if it's a devs problem they should show their appreciation for all the idle time of all the participants which is lost, but not because of these participants.
During the running of that time-exceeding WU I could've been crunching for another project without those WU problems.
If I knew I had to babysit I would've chosen another medical project.
It gets tiring.

And about those 20 second WU's. I've had a lot of those, but not a single one has been granted credit, and again it's not about the credits but the whole package.
I have my doubts about these criteria but one can always go somewhere else if one doesn't agree with the politics.


Well, about those 20 second WU's, just curious: how much credit do you think should be granted... most of my jobs that run 5000 seconds get about 14 credits. this may be low or high for the average .. but anyway assuming 5000 seconds = 14 credits then 1 credit is about 357 seconds (~6 min) so um.. i hate saying it but it probably isn't worth the effort to worry about losing a 20 second job (unless you pay for your UL/DL bandwidth)...
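The back-of-the-envelope figures above can be checked with a short Python sketch. The 5000 seconds and 14 credits are the poster's own rough numbers, not a project-wide credit rate, and the variable names are only illustrative.

```python
seconds_per_job = 5000   # typical run time quoted in the post
credits_per_job = 14     # credit typically granted for such a job

# Seconds of CPU time that earn one credit at this rate.
seconds_per_credit = seconds_per_job / credits_per_job

# Credit a WU would earn if it errors out after 20 seconds.
credits_for_20s = 20 / seconds_per_credit

print(f"1 credit is about {seconds_per_credit:.0f} s "
      f"({seconds_per_credit / 60:.1f} min)")
print(f"a 20 second WU is worth about {credits_for_20s:.3f} credits")
```

At this rate a 20 second failure costs a few hundredths of a credit.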

yes i love to see credits for work done and credits if an error occurs beyond your control... but personally every time i reboot one of my computers it loses back to the last benchmark (probably minutes or more of work + 3-6 min to reboot) so i expect to lose credits now and then

sorry for the soapbox lecture

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 9025 - Posted: 14 Jan 2006, 17:57:51 UTC - in response to Message 9015.  
Last modified: 14 Jan 2006, 17:58:36 UTC

I believe that the problem is that when calculating "is the boundary exceeded", BOINC uses the DCF as well as the benchmarks.


Can anyone confirm this by showing me where to find it in the boinc client code? I do not see this and, in fact, I see what I thought was the way it is calculated in client/app.C:

max_cpu_time = rp->wup->rsc_fpops_bound/gstate.host_info.p_fpops;

so it depends on your benchmark (p_fpops) and the rsc_fpops_bound that we set for the work unit, as far as I can tell. If the benchmarks are off (p_fpops too big), then there could be a chance that a result can be terminated prematurely. Also, due to the random nature of the calculations, a particular work unit may need more fpops (floating-point operations) to finish, but it would have to be quite a bit more since our bound is rather conservative.

Currently we use:
rsc_fpops_est = 2e13
rsc_fpops_bound = 9e13 (so 4.5x)

On one of my computers the benchmark currently gives 1333470000 fpops/s and successful results have completed in less than 3 hours (10800 sec), so that is a total of 1.44e13, which is not too far off the estimate. My understanding is that the fpops_est primarily affects the run time estimates shown on the client and how many work units to download given the communication interval with the server.
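As a worked example of the formula quoted above, the following Python sketch plugs in the figures from this post. The function `max_cpu_time` is only a stand-in that mirrors the quoted line from client/app.C; it is not the BOINC code itself.

```python
RSC_FPOPS_EST = 2e13       # project's estimated fpops per work unit
RSC_FPOPS_BOUND = 9e13     # project's fpops bound per work unit
P_FPOPS = 1_333_470_000    # benchmark of the machine in the post, fpops/s


def max_cpu_time(rsc_fpops_bound, p_fpops):
    """Mirror of: max_cpu_time = rsc_fpops_bound / p_fpops (in seconds)."""
    return rsc_fpops_bound / p_fpops


limit_seconds = max_cpu_time(RSC_FPOPS_BOUND, P_FPOPS)
limit_hours = limit_seconds / 3600

# fpops credited to a result that ran for 3 hours on this machine.
work_in_three_hours = P_FPOPS * 10_800

print(f"bound / estimate:     {RSC_FPOPS_BOUND / RSC_FPOPS_EST:.1f}x")
print(f"max_cpu_time:         {limit_seconds:.0f} s ({limit_hours:.1f} h)")
print(f"fpops in 3 hours:     {work_in_three_hours:.3g}")
```

On this machine the cutoff works out to a little under 19 hours, against observed run times of under 3 hours.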

Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 9027 - Posted: 14 Jan 2006, 18:22:29 UTC - in response to Message 9022.  

The time exceeding WU have not been granted credit yet. The "Bad Random Number Seeds" WU usually error out in less than 20 CPU seconds and are a different kind of WU problem.

Regards
Phil


Well, about those 20 second WU's, just curious: how much credit do you think should be granted... most of my jobs that run 5000 seconds get about 14 credits. this may be low or high for the average .. but anyway assuming 5000 seconds = 14 credits then 1 credit is about 357 seconds (~6 min) so um.. i hate saying it but it probably isn't worth the effort to worry about losing a 20 second job (unless you pay for your UL/DL bandwidth)...

yes i love to see credits for work done and credits if an error occurs beyond your control... but personally every time i reboot one of my computers it loses back to the last benchmark (probably minutes or more of work + 3-6 min to reboot) so i expect to lose credits now and then

sorry for the soapbox lecture


I agree 100% that the WUs that error at 20 seconds do not amount to anything in terms of credit, and that it is a waste of valuable project time and resources to credit those. Unfortunately, almost everyone got a few of these and people have been screaming their heads off about it on the boards. The "squeaky wheel" theory has now come into play and the project has (to their credit) responded to the demand that credits be awarded.

But the most significant loss of credit is occurring on the WUs that error for "Max CPU time exceeded." These frequently error after 80% simply because they run longer than the system expects them to run. I, and others, have had a number of these amounting to a few thousand credits over the last month or so. While I would like to see the credit for those awarded, I would prefer to see a fix for the problem. Some of us have implemented a "patch" by increasing the DCF in BOINC to allow longer run times. At best this is temporary and requires a lot of monitoring of the system to keep things running.

It is for that reason that I pointed out that the "Random number" problem and the "Max time" problem are not the same thing. If we are really concerned about loss of project resources then the effort should be focused on the Max time issue. If I have one WU that fails at 80% complete for Max time, that represents a loss of 5-8 hours of time for the project every time it happens. It would take more than a few hundred "20 second" failures to equal that single failed WU. I think the project team has a handle on the Random number problem. It may take a while to implement the fix, but it is at hand. They should not waste time awarding credit for these.
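For a sense of scale, here is a small Python sketch of that comparison. It uses only the figures given in the paragraph above (5 to 8 hours lost per Max time failure, 20 seconds per short failure); the helper name `equivalent_short_failures` is purely illustrative.

```python
def equivalent_short_failures(hours_lost, short_failure_seconds=20):
    """How many short failures waste as much CPU time as one long one."""
    return hours_lost * 3600 / short_failure_seconds


low = equivalent_short_failures(5)    # one Max time failure at 5 hours
high = equivalent_short_failures(8)   # one Max time failure at 8 hours

print(f"one Max time failure = {low:.0f} to {high:.0f} "
      f"failures of 20 seconds each")
```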

It is simple math. The project should concentrate its limited resources on the problems that slow the production of science the most. These would not necessarily be the issues that make the most noise in the user community.

In this case it is not WUs that error in 20 seconds. I lose more CPU cycles in rounding errors than I ever lost on the 20 second failures.

Regards
Phil


We Must look for intelligent life on other planets as,
it is becoming increasingly apparent we will not find any on our own.

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 9028 - Posted: 14 Jan 2006, 18:26:22 UTC

David,

For what it is worth, I confirm your analysis. The place where that boundary is used to abort the task and emit the message uses the number unmodified.

Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 9029 - Posted: 14 Jan 2006, 18:43:21 UTC - in response to Message 9025.  

I believe that the problem is that when calculating "is the boundary exceeded", BOINC uses the DCF as well as the benchmarks.


Can anyone confirm this by showing me where to find it in the boinc client code? I do not see this and, in fact, I see what I thought was the way it is calculated in client/app.C:

max_cpu_time = rp->wup->rsc_fpops_bound/gstate.host_info.p_fpops;

so it depends on your benchmark (p_fpops) and the rsc_fpops_bound that we set for the work unit, as far as I can tell. If the benchmarks are off (p_fpops too big), then there could be a chance that a result can be terminated prematurely. Also, due to the random nature of the calculations, a particular work unit may need more fpops (floating-point operations) to finish, but it would have to be quite a bit more since our bound is rather conservative.

Currently we use:
rsc_fpops_est = 2e13
rsc_fpops_bound = 9e13 (so 4.5x)

On one of my computers the benchmark currently gives 1333470000 fpops/s and successful results have completed in less than 3 hours (10800 sec), so that is a total of 1.44e13, which is not too far off the estimate. My understanding is that the fpops_est primarily affects the run time estimates shown on the client and how many work units to download given the communication interval with the server.


Clearly you are in a much better position to assess the cause of this problem than I. That said, there seems to be more at play here. These Max time failures (at least for me) started about a week after I upgraded to BOINC 5.2.13. The systems ran OK for that first week, then I started seeing a number of the Max time errors. I have just completed a WU on one system that ran over 20 hours. This is 4 times the normal run time. If I had not manually adjusted the DCF it would have errored at around 7.

It seems to me that part of the problem is the wide variation in WU size. Typically BOINC expects to see WUs of similar size, perhaps with a variation of 10% one way or the other. But R@H WUs can vary by 200% or more. Since the boundaries are for all practical purposes fixed, this causes a problem. I for one do not need the system to decide if a WU should be aborted because it is taking a long time. If it is progressing, I would prefer to let it complete. I have seen R@H WUs run as long as 35 hours and complete successfully. The most recent releases of BOINC will not allow that unless it is manually adjusted for long run times.

Regards
Phil

Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,488,060
RAC: 3
Message 9030 - Posted: 14 Jan 2006, 18:46:20 UTC - in response to Message 9027.  

Some of us have implemented a "patch" by increasing the DCF in BOINC to allow longer run times.


While the code may say it's not using the DCF, _something_ is causing this condition; why would one host get a "max CPU time exceeded" error on a WU that ran _less_ time than ones shortly before and after it that were successful? The case I investigated, the only difference I could see was a string of "short" WUs immediately before the "max CPU" one, which would strongly indicate DCF...

Snake Doctor
Joined: 17 Sep 05
Posts: 182
Credit: 6,401,938
RAC: 0
Message 9032 - Posted: 14 Jan 2006, 18:53:48 UTC - in response to Message 9030.  

Some of us have implemented a "patch" by increasing the DCF in BOINC to allow longer run times.


While the code may say it's not using the DCF, _something_ is causing this condition; why would one host get a "max CPU time exceeded" error on a WU that ran _less_ time than ones shortly before and after it that were successful? The case I investigated, the only difference I could see was a string of "short" WUs immediately before the "max CPU" one, which would strongly indicate DCF...

I agree, Bill: if the DCF was not involved then raising it would have no effect on the outcome. Clearly it does. Now it may also affect the number of WUs a particular machine can download at once, but that is a different issue.

If the DCF is raised sufficiently, all WUs seem to complete successfully irrespective of the CPU time they take. This implies that the DCF IS used in the calculations for Max time. However, I have only seen the system make very small changes in the DCF over time, and they have always been to make it less.

Regards
Phil

Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 9033 - Posted: 14 Jan 2006, 18:53:55 UTC - in response to Message 9022.  
Last modified: 14 Jan 2006, 19:22:52 UTC



Well, about those 20 second WU's, just curious: how much credit do you think should be granted... most of my jobs that run 5000 seconds get about 14 credits. this may be low or high for the average .. but anyway assuming 5000 seconds = 14 credits then 1 credit is about 357 seconds (~6 min) so um.. i hate saying it but it probably isn't worth the effort to worry about losing a 20 second job (unless you pay for your UL/DL bandwidth)...

yes i love to see credits for work done and credits if an error occurs beyond your control... but personally every time i reboot one of my computers it loses back to the last benchmark (probably minutes or more of work + 3-6 min to reboot) so i expect to lose credits now and then

sorry for the soapbox lecture


I do not care about these 20 second WU's I didn't get credits granted for.
It's just that there were a lot of these WU's that got credits and I found it a bit strange I didn't get anything, so I wondered what was wrong with the ones I had uploaded. Just curiosity.
Too much time gets wasted on these 0.00xxxx credits, but if you've had a lot of these it might count.

Tern
Joined: 25 Oct 05
Posts: 575
Credit: 4,488,060
RAC: 3
Message 9036 - Posted: 14 Jan 2006, 19:50:45 UTC - in response to Message 9015.  

But, it seems _all_ Rosetta results have the same _ESTIMATED_ time, when in reality, the actual times vary quite a bit.


Just had a good example of this; a WU that was estimated at 20+ hours just finished in 4:11:22. Remaining WUs in the queue dropped to 18:47 estimates. If you look at this host here you'll see a _6x_ range of CPU times... the DCF was set very high by the 26-hour one, is just now back _down_ to 2.33... There's only eight completed results for that host, makes it very easy to see what's going on.

I think the "Increase_cycles" WUs should have been issued with at least double the estimate they got, and the "no_sim_anneal" ones possibly half the estimate. However it's done, the estimates should definitely not be the same on them.

Los Alcoholicos~La Muis
Joined: 4 Nov 05
Posts: 34
Credit: 1,041,724
RAC: 0
Message 9040 - Posted: 14 Jan 2006, 21:09:41 UTC - in response to Message 9036.  
Last modified: 14 Jan 2006, 21:18:27 UTC

But, it seems _all_ Rosetta results have the same _ESTIMATED_ time, when in reality, the actual times vary quite a bit.


Just had a good example of this; a WU that was estimated at 20+ hours just finished in 4:11:22. Remaining WUs in the queue dropped to 18:47 estimates. If you look at this host here you'll see a _6x_ range of CPU times... the DCF was set very high by the 26-hour one, is just now back _down_ to 2.33... There's only eight completed results for that host, makes it very easy to see what's going on.

I think the "Increase_cycles" WUs should have been issued with at least double the estimate they got, and the "no_sim_anneal" ones possibly half the estimate. However it's done, the estimates should definitely not be the same on them.

After another "maximum cpu time exceeded" error I suspended network activity on a dual G5 2 GHz with 2.5 GB RAM (BOINC 5.2.13, no other projects).

I have the following (not yet uploaded) queue of results:

cpu-time - status

12:38:43 - maximum cpu time exceeded
02:32:24 - finished
03:47:37 - finished
03:12:34 - finished
12:38:43 - maximum cpu time exceeded
08:49:47 - finished
06:19:48 - finished
03:46:29 - finished
07:42:59 - finished
03:17:35 - finished
06:08:48 - finished
05:53:45 - finished
03:05:26 - finished
02:28:05 - finished
02:22:14 - finished
12:38:45 - maximum cpu time exceeded
06:47:13 - finished
04:38:56 - finished
08:04:36 - finished
04:44:13 - finished
03:14:39 - finished
01:41:28 - finished
08:02:37 - finished
06:02:18 - finished
04:58:23 - finished
05:28:48 - 70%
01:53:23 - 50%
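The queue above can also be tallied mechanically. The Python sketch below parses the same list and compares the CPU times at which the errors occurred. The last step rests on an assumption that nothing in this post confirms: that these WUs carried the rsc_fpops_bound of 9e13 mentioned elsewhere in the thread and that the cutoff is simply the bound divided by the benchmark. Under that assumption it backs out the benchmark the cutoff would imply.

```python
RESULTS = """\
12:38:43 - maximum cpu time exceeded
02:32:24 - finished
03:47:37 - finished
03:12:34 - finished
12:38:43 - maximum cpu time exceeded
08:49:47 - finished
06:19:48 - finished
03:46:29 - finished
07:42:59 - finished
03:17:35 - finished
06:08:48 - finished
05:53:45 - finished
03:05:26 - finished
02:28:05 - finished
02:22:14 - finished
12:38:45 - maximum cpu time exceeded
06:47:13 - finished
04:38:56 - finished
08:04:36 - finished
04:44:13 - finished
03:14:39 - finished
01:41:28 - finished
08:02:37 - finished
06:02:18 - finished
04:58:23 - finished
05:28:48 - 70%
01:53:23 - 50%
"""


def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


rows = [line.split(" - ", 1) for line in RESULTS.splitlines()]
errors = [to_seconds(t) for t, status in rows
          if status == "maximum cpu time exceeded"]
finished = [to_seconds(t) for t, status in rows if status == "finished"]

cutoff_spread = max(errors) - min(errors)   # how far apart the cutoffs are
longest_finished = max(finished)            # longest successful run

# ASSUMPTION: bound of 9e13 fpops and cutoff = bound / benchmark.
implied_p_fpops = 9e13 / min(errors)

print(f"{len(errors)} errors, {len(finished)} finished")
print(f"errors occurred within {cutoff_spread} s of each other")
print(f"longest finished run: {longest_finished} s")
print(f"implied benchmark: {implied_p_fpops:.3g} fpops/s")
```

All three errors stopped within a couple of seconds of the same CPU time, which is at least consistent with a fixed cutoff on this machine.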

So far I didn't keep track of the variations in the estimated_time (at the moment: 07:44:12)

Although there is a sequence of 3 short wu's before an error, I don't think that that's the real cause. As you can see, some wu's just take too much time to finish (one was at 80%, the other at 90% when they errored out). And I can't recall seeing an estimated_time on this machine greater than 12:00:00.
Unless the max_cpu_time is increased these wu's will never finish.


AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 9048 - Posted: 14 Jan 2006, 23:24:21 UTC

I notice that some of the people who see a DCF dependence have computers with *very* high benchmark scores. I assume that's because they're using an "optimized" version of boinc. Some of those boinc versions may have been modified to include DCF in the max_cpu_time calculation.

In fact, they would pretty much have to do something of the sort because otherwise the extremely high benchmark scores would result in a very short max_cpu_time, and all work units would time out.

Los Alcoholicos~La Muis
Joined: 4 Nov 05
Posts: 34
Credit: 1,041,724
RAC: 0
Message 9053 - Posted: 15 Jan 2006, 0:40:28 UTC - in response to Message 9048.  
Last modified: 15 Jan 2006, 0:55:59 UTC

I notice that some of the people who see a DCF dependence have computers with *very* high benchmark scores. I assume that's because they're using an "optimized" version of boinc. Some of those boinc versions may have been modified to include DCF in the max_cpu_time calculation.

In fact, they would pretty much have to do something of the sort because otherwise the extremely high benchmark scores would result in a very short max_cpu_time, and all work units would time out.

I don't think optimized clients cause the problem. The standard version of Boinc has "maximum cpu time exceeded" errors as well. My G4 with the standard (recommended) version of boinc (5.2.13) did have 3 out of 13 wu's with the "maximum cpu time exceeded" errors in 10 days. Just like my G4 Powerbook with the 4.44 superbench client where 4 out of 15 wu's errored out.

Besides that, Rosetta uses the Boinc platform, where the use of an optimized client is quite common (e.g. Seti). So Rosetta should meet those multi_project requirements. It would be very odd to ask people to change their Boinc clients for every other project, wouldn't it?

