How about some QC on Rosetta WU's?

Message boards : Number crunching : How about some QC on Rosetta WU's?


BadThad
Joined: 8 Nov 05
Posts: 30
Credit: 71,834,523
RAC: 0
Message 10540 - Posted: 7 Feb 2006, 16:21:54 UTC

The "max time exceeded", "stuck at 1%", or "WU hosed for whatever reason" problem is getting out of control with this project. I have too many machines to waste time babysitting them every day. I've been running DC projects for many years and this is the only project I've seen send out masses of bad WUs.

Get some QC on those WUs... PLEASE. One sure way to kill a project for people who run lots of computers is to force us to babysit the dang client. I've had countless hours of CPU time completely wasted with Rosetta. IT MUST STOP!
ID: 10540 · Rating: 0
Hoelder1in
Joined: 30 Sep 05
Posts: 169
Credit: 3,915,947
RAC: 0
Message 10554 - Posted: 8 Feb 2006, 1:13:40 UTC
Last modified: 8 Feb 2006, 1:20:49 UTC

With all the ongoing complaints about WU errors, perhaps it would help if the project stated the total error rate they see in the database to put things into perspective. Judging from the ~2000 WUs I have crunched, I would guess that it must be considerably below 1% (one failed WU among 184 on my current results page). It might also be of interest to see how the error rate varies across different hosts/OS type/BOINC versions...
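The per-host breakdown suggested above could be computed along these lines. This is only a minimal sketch: the `error_rates` helper, the `(group, ok)` record layout, and the group label are hypothetical and do not reflect the project's actual database schema. The sample data just reproduces the "one failed WU among 184" figure.

```python
from collections import defaultdict

def error_rates(results):
    """Return {group: (failed, total, rate)} for an iterable of (group, ok) pairs."""
    counts = defaultdict(lambda: [0, 0])  # group -> [failed, total]
    for group, ok in results:
        counts[group][1] += 1
        if not ok:
            counts[group][0] += 1
    return {g: (f, t, f / t) for g, (f, t) in counts.items()}

# Hypothetical sample: 184 results from one host, one of which failed.
sample = [("example-host", True)] * 183 + [("example-host", False)]
rates = error_rates(sample)
failed, total, rate = rates["example-host"]
print(f"{failed}/{total} = {rate:.2%}")  # prints "1/184 = 0.54%"
```

Grouping by OS type or BOINC version instead would only require a different first element in each pair.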
ID: 10554 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 10561 - Posted: 8 Feb 2006, 5:21:37 UTC

Paul's current opinion on the current state of the projects. Your mileage may vary ... but, basically none of the projects are without problems:

SETI@Home in the last couple months has sent out several batches of 20,000 or so work units that were all bad.

SDG is having problems with some participants' work and reported time, venue, and daily quotas. Also they have regular issues with work available but committed to other platforms.

Predictor@Home has had, off and on, problems with a pop-up dialog (the science application has not changed since this was first reported in August of 2005) and with work that overruns its time limit. Also they have regular issues with work available but committed to other platforms.

CPDN seems to be having a higher than normal failure rate with work issued against Sulfur 4.22; I have had at least one computer that has done a fair number of Slab and Sulfur work units but has not been able to start a new one up. Just as a note, most of them also failed with a second issue to another participant.

Einstein@Home had an issue with work running so fast that many participants were running out, since fixed with an increase in daily quota.

SIMAP is having some problems with participants on dial-up because of the size of the work downloads/uploads (I have yet to look into actual numbers on my systems so I do not know how bad this problem is).

PrimeGrid is having connectivity problems (non-commercial connection to the Internet).

WCG has an acute case of using United Devices technology, making it unsuitable for some participants, along with an only partially compliant BOINC infrastructure (though they are adding features, the last being team stat exports, Work Unit and Result pages are still missing). Some also report what seems to be a bug in the way preferences are handled, with WCG at times overruling other settings. There may be other problems, but I do not monitor the WCG boards as they are non-BOINC and I get lost ...

The bottom line: no project is without issues and most work fairly well, but at this time, if you cannot monitor your computers, then in my opinion Predictor@Home and Rosetta@Home are unsuitable for you. Rosetta@Home is working hard, and based on my experience the incidence of the 1% problem is lower now than when it first surfaced. I only had one time over-run, so I am not sure about that one.

Disclaimer: For various reasons I am not running the following projects: SETI@Home, Rosetta@Home, Predictor@Home, SZTAKI Desktop Grid, and PrimeGrid. This is primarily because my main interest is CPDN, Einstein@Home, and LHC@Home; WCG, Pirates@Home, and SIMAP@Home are "live" because I am trying to get their lifetime total credit values to certain positions.
ID: 10561 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 10564 - Posted: 8 Feb 2006, 6:21:19 UTC - in response to Message 10554.  

With all the ongoing complaints about WU errors, perhaps it would help if the project stated the total error rate they see in the database to put things into perspective. Judging from the ~2000 WUs I have crunched, I would guess that it must be considerably below 1% (one failed WU among 184 on my current results page). It might also be of interest to see how the error rate varies across different hosts/OS type/BOINC versions...


In fact, the overall error rate is pretty low. With the CPU time limit problem fixed, it appears that a relatively small fraction of users are having the majority of the WU problems. We wish we understood what was causing these!

ID: 10564 · Rating: 0
Scribe
Joined: 2 Nov 05
Posts: 284
Credit: 157,359
RAC: 0
Message 10565 - Posted: 8 Feb 2006, 6:44:49 UTC - in response to Message 10561.  

Paul's current opinion on the current state of the projects. Your mileage may vary ... WCG has an acute case of using United Devices technology making them unsuitable for some participants along with an only partially compliant BOINC infrastructure (though they are adding features, the last was team stat exports - but still missing are Work Unit and Result pages) ...



The results page has been there for at least 3 days now ...
ID: 10565 · Rating: 0
Ib Rasmussen
Joined: 27 Sep 05
Posts: 16
Credit: 211,416
RAC: 0
Message 10567 - Posted: 8 Feb 2006, 8:24:23 UTC

Admittedly all the projects have their problems, but I agree with the original poster that Rosetta has needed more operator intervention - or babysitting, if you like - than any of the other big BOINC projects I participate in.
That's why I only run R@h on the few machines I have daily access to.

/Ib
ID: 10567 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 10571 - Posted: 8 Feb 2006, 10:47:32 UTC - in response to Message 10564.  

In fact, the overall error rate is pretty low. With the CPU time limit problem fixed, it appears that a relatively small fraction of users are having the majority of the WU problems. We wish we understood what was causing these!

Have you considered re-issuing some of those work units to other participants?

All of the stuck-at-1% work units I had ran to completion once restarted, which, as I understand it, means that the model did not run with one random seed but did with another. Are the seeds saved?

If not, that should be on the list of things to add to the stdout recorded by the WU and returned when reported.
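As a rough illustration of what I mean by saving the seed: the sketch below is a toy stand-in, not Rosetta's code. The `run_model` function and the `rng_seed=` log format are made up for the example, and I am assuming the application could seed its generator explicitly. The point is only that once the seed is written to stdout before the run starts, a hung result can be replayed with the identical seed.

```python
import random
import sys
import time

def run_model(seed=None, steps=5):
    """Toy work unit: choose (or accept) a seed, log it, then run."""
    if seed is None:
        # Hypothetical default: derive a 32-bit seed from the clock.
        seed = int(time.time() * 1000) % (2**32)
    # Log the seed first and flush, so it is recorded even if the run hangs later.
    print(f"rng_seed={seed}", file=sys.stdout, flush=True)
    rng = random.Random(seed)
    return [rng.random() for _ in range(steps)]

# Replaying with the logged seed reproduces the same trajectory.
first = run_model(seed=12345)
replay = run_model(seed=12345)
print(first == replay)  # prints "True"
```

With the seed in the returned output, a stuck result could be re-issued with the same seed to check whether the hang follows the seed or the host.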

I am guessing you have already looked for common factors among the systems with the highest error rates.

Have you considered that it could be a cross-project issue? Perhaps it only happens if the participant also runs CPDN, for example ...

Just thinking out loud ...
ID: 10571 · Rating: 0
[B@H] Ray
Joined: 20 Sep 05
Posts: 118
Credit: 100,251
RAC: 0
Message 10611 - Posted: 10 Feb 2006, 2:48:58 UTC - in response to Message 10554.  

With all the ongoing complaints about WU errors, perhaps it would help if the project stated the total error rate they see in the database to put things into perspective. Judging from the ~2000 WUs I have crunched, I would guess that it must be considerably below 1% (one failed WU among 184 on my current results page). It might also be of interest to see how the error rate varies across different hosts/OS type/BOINC versions...


I have to agree with you. I have only had 1 bad unit since I started in Sept. 05, which I aborted at about 10 hours at 1%; it would have timed out on its own if I had let it run. At one time, when they had a batch of bad units, I had a bunch crash at about 1 second just like everyone else, but the total error rate would be less than 0.5%, which is not bad.

I have to admit that for a short while I was aborting units that other systems had already returned as bad 2 to 4 times, but then I ran some of those and had no problems with them. Guess that some systems run these better than others, but give more problems on other programs.

Will be firing up an old K6 in a while; wonder if that will run Rosetta? Think not, can only get 192 megs of RAM in it without buying more.

Cheers
Ray

System 1
P4 2.4 gig, 533 fsb, 1024 meg, 512K L2, Win XP, BOINC 4.68

System 2
Celeron 2.93 gig, 533 fsb, 1024 megs, 256K L2, Win XP, BOINC 4.68


Pizza@Home Rays Place Rays place Forums
ID: 10611 · Rating: 0
yoner
Joined: 17 Sep 05
Posts: 10
Credit: 2,581,874
RAC: 0
Message 10619 - Posted: 10 Feb 2006, 8:06:36 UTC

As for resources needed with this project, I actually have an old Dell dual PII-233 running with 128 MB RAM (running two threads of Rosetta), and am getting fairly decent results from it. The computer sits in the corner and crunches units and streams my MP3 collection to my other computer.

Your K6 may do better than you think!


ID: 10619 · Rating: 0
BadThad
Joined: 8 Nov 05
Posts: 30
Credit: 71,834,523
RAC: 0
Message 10631 - Posted: 10 Feb 2006, 15:48:34 UTC - in response to Message 10564.  

With all the ongoing complaints about WU errors, perhaps it would help if the project stated the total error rate they see in the database to put things into perspective. Judging from the ~2000 WUs I have crunched, I would guess that it must be considerably below 1% (one failed WU among 184 on my current results page). It might also be of interest to see how the error rate varies across different hosts/OS type/BOINC versions...


In fact, the overall error rate is pretty low. With the CPU time limit problem fixed, it appears that a relatively small fraction of users are having the majority of the WU problems. We wish we understood what was causing these!


I think there's a small fraction of users with wu problems because I received all the bad ones on the 30 systems I run Rosetta on. LMAO
ID: 10631 · Rating: 0
keputnam
Joined: 18 Sep 05
Posts: 24
Credit: 2,084,465
RAC: 0
Message 10637 - Posted: 10 Feb 2006, 17:45:14 UTC - in response to Message 10611.  

[quote]
Will be firing up an old K6 in a while; wonder if that will run Rosetta? Think not, can only get 192 megs of RAM in it without buying more.

[/quote]

Should be fine. I've got a PII/400/192MB/Win98se that runs Rosetta with no problems (well, very few problems ;-) )

ID: 10637 · Rating: 0
Mike Gelvin
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 10794 - Posted: 16 Feb 2006, 1:13:49 UTC - in response to Message 10564.  


ID: 10794 · Rating: 0
Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 10797 - Posted: 16 Feb 2006, 1:57:59 UTC - in response to Message 10794.  

ID: 10797 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 10812 - Posted: 16 Feb 2006, 15:54:21 UTC - in response to Message 10797.  
Last modified: 16 Feb 2006, 15:56:59 UTC


Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 10812 · Rating: 0
Mike Gelvin
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 10851 - Posted: 17 Feb 2006, 17:51:12 UTC - in response to Message 10812.  


ID: 10851 · Rating: 0
Moderator9
Volunteer moderator
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 10868 - Posted: 18 Feb 2006, 5:19:18 UTC - in response to Message 10851.  
Last modified: 18 Feb 2006, 5:23:25 UTC


Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 10868 · Rating: 0




©2024 University of Washington
https://www.bakerlab.org