Message boards : Number crunching : Project concerns after running WU FAST_ABINITIO_CENTROID_PACKING_1ctf__305_295
Author | Message |
---|---|
JChojnacki Send message Joined: 17 Sep 05 Posts: 71 Credit: 10,633,777 RAC: 4,604 |
|
Kevin Send message Joined: 15 Jan 06 Posts: 21 Credit: 109,496 RAC: 0 |
|
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
|
Charles Dennett Send message Joined: 27 Sep 05 Posts: 102 Credit: 2,081,660 RAC: 345 |
Looks like both original WUs that were mentioned in this thread were returned after the deadline. Once the deadline passed, the WUs were handed out again. The question is, why were the original WUs accepted as valid when they were returned well after the deadline? -Charlie |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
This is not what it should be doing. I will look into this bug. |
JChojnacki Send message Joined: 17 Sep 05 Posts: 71 Credit: 10,633,777 RAC: 4,604 |
This is not what it should be doing. I will look into this bug. Thank you. Joel |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I granted the appropriate credit and added the difference to the user, host, and team totals for both of you. |
JChojnacki Send message Joined: 17 Sep 05 Posts: 71 Credit: 10,633,777 RAC: 4,604 |
I granted the appropriate credit and added the difference to the user, host, and team totals for both of you. On the personal, point-earning level, I definitely appreciate that! I would also ask, though, that you please keep me/us apprised of the status of the situation in general, for instance if you are able to find out why it happened, and whether any steps are being taken to prevent it from happening again to anyone else. I would be curious to know this from a project-level perspective. Thanks again, Joel |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I sent an email describing the problem to David Anderson, who heads the BOINC project. I will also take a look at the validator code to see where it might be causing this. |
Kevin Send message Joined: 15 Jan 06 Posts: 21 Credit: 109,496 RAC: 0 |
I was looking at the team I'm on and it seems like this computer is also having the same issue. I wish the project team luck in getting this sorted out. Thank you for looking into this problem and fixing the credit. |
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
I set my max cpu time to roughly 24 hours; and checked to see if I was being affected by WUs that were having problems with running over 2 hours. They've all run fine so far.. (Yes.. small sample.) But I noticed that I'd been given a small fraction of the claimed credit for this WU that seems identical to the issues others were having: Past deadline turn in. Have we turned in enough of these to identify the problem; or have all the past-deadline turnins that occur prior to the second person's finishing the WU been causing this problem? (In which case you can look through the database and find all the affected parties and teams.) How is this supposed to be handled? Is the original person's turnin supposed to be flagged as Past Deadline - so it doesn't affect the second person's turnin? And what happens when a WU is given out.. goes past deadline, gets handed out again, the original person turns it in late, and the second person deletes that job or goes past deadline? Does it get handed out once again? And if putting in 10 hours, 24 hours, etc into a WU that's already been turned in isn't providing any additional scientific data - (is it?) then please have our systems call in every hour or so when we're working on something that's been passed out a second time so that we can stop work after finishing the current model. Credit is nice.. but I'd prefer my system to be producing something useful for the project. |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
... And if putting in 10 hours, 24 hours, etc into a WU that's already been turned in isn't providing any additional scientific data - (is it?) then please have our systems call in every hour or so when we're working on something that's been passed out a second time so that we can stop work after finishing the current model. Credit is nice.. but I'd prefer my system to be producing something useful for the project. As long as the WU completes successfully, and the result is valid, the science will be useful. The credit issue is another problem. Credit awards are separate from and do not affect the quality of the science contained in the WU result. The rest of the problem and questions you have reported are a known problem that is being looked into by the Project team at this time. They are talking with the BOINC developers, because a few things are not working right in BOINC and that is the source of the problem. Moderator9 ROSETTA@home FAQ Moderator Contact |
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
And if putting in 10 hours, 24 hours, etc into a WU that's already been turned in isn't providing any additional scientific data - (is it?) then please have our systems call in every hour or so when we're working on something that's been passed out a second time so that we can stop work after finishing the current model. Credit is nice.. but I'd prefer my system to be producing something useful for the project. Just as a sidenote in addition to what has been said already, Rosetta@home only sends out a specific workunit ONCE (apart from situations like you described, which is a BOINC issue). Most BOINC projects I've checked send out the VERY SAME workunit MULTIPLE times (2 to 7 times or even more). The fact that R@h doesn't waste donated CPU resources was a very important factor for me personally. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
Ingleside Send message Joined: 25 Sep 05 Posts: 107 Credit: 1,514,472 RAC: 0 |
Just as a sidenote in addition to what has been said already, Rosetta@home only sends out a specific workunit ONCE (apart from situations like you described, which is a BOINC issue). Most BOINC projects I've checked send out the VERY SAME workunit MULTIPLE times (2 to 7 times or even more). ... OK, this is mostly off-topic for Rosetta@home, but I'll post an answer anyway... Take SETI@Home, for example, which basically looks at random noise: how do you make sure someone reporting a signal really has found a signal if you don't check it? Of course, if the assigned WU covers 1420.2 to 1420.3 and someone reports a signal at 1420.1, you know the answer is not correct, but a signal at 1420.25 must be verified. Also, until someone does find "ET is here", it's just as important to make sure that a result reporting no signal really contains no signal. So, just as in millions of other scientific experiments, the same test is repeated multiple times to verify the validity of the results. In many scientific experiments someone stands in a lab and measures and re-measures the same thing multiple times; in BOINC projects this is done by sending the same WU to multiple computers and comparing their results at the end. Just as a person can make a mistake when measuring something, a result from a computer can also be incorrect. The causes range from too much overclocking, random hardware errors, and software bugs to, in DC, users intentionally trying to cheat. I have no idea how Rosetta@home does the job, but I would guess a wrong calculation here will mostly show up as an impossible energy or an impossible fold or something, and for all I know it's possible that the last part of one WU is repeated as the first part of another WU or something similar... But in SETI@Home, for example, random noise is random noise, so at least in my opinion it would have been a waste of CPU resources not to use redundant computing to validate the results.
Most BOINC projects have settled on 3 for this redundancy, but they may issue an extra copy to account for "lost" results and so on, or to get a usable result faster. Since LHC@Home needs results fast, they're using 5 copies; no BOINC project sends out more copies except in case of errors. BTW, looking specifically at LHC@home, after the power outage in the computer centre that led to a corrupt database, one interesting post afterwards was this: Backup power is also much more important down in the ring than for the computers. If there is a power glitch down there, the cooling will stop and then the power wires that look like a half-width ATA cable and are carrying ~10'000 Amp will stop being superconducting. Then they melt very fast, heating other superconducting material near them and creating a chain reaction. Then the magnets will fail and we will lose the beam; it is believed that the beam can burn through 30 meters of copper. So if we get a problem down there we lose the work of more than 10'000 people for 10 years and all the equipment. Here we just lose a day's worth of computer time. Now, if instead of a power glitch that melts everything the beam "just" makes a wrong turn, you'll still fry a big hole somewhere and can expect a lengthy outage to fix it. What if this "wrong turn" was due to an invalid result that wasn't verified in LHC@home because someone on the project staff didn't want to use redundancy because it would "waste donated CPU resources"... :oops: Bottom line: maybe to you it looks like a waste of time to send out the same WU to multiple computers, but for a project this redundancy can be critical. |
David Baker Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 17 Sep 05 Posts: 705 Credit: 559,847 RAC: 0 |
Just as a sidenote in addition to what has been said already, Rosetta@home only sends out a specific workunit ONCE (apart from situations like you described, which is a BOINC issue). Most BOINC projects I've checked send out the VERY SAME workunit MULTIPLE times (2 to 7 times or even more). As I've been going through the quite spectacular results from the last several weeks, I am finding a low level of computation errors. These don't cause any problem, however, because I always recompute the energy of the final structures that all of you are returning (over 99% indeed have exactly the energies that you return). Because we can recompute the energy at the end, we can avoid wasting your precious CPU hours on redundant calculations. We still need much more CPU power than is currently available, and we want to get the most out of what there is. |
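A minimal sketch of the recompute-and-compare check described in the post above, in Python. Rosetta's real energy function is not shown anywhere in this thread, so `toy_energy` is a purely hypothetical stand-in, and `verify_result` and its tolerance are likewise assumptions made for illustration. The only point is that re-scoring one returned structure is far cheaper than repeating the whole search that produced it.

```python
import math


def toy_energy(coords):
    """Hypothetical stand-in for an energy function: sum of squared coordinates."""
    return sum(x * x for x in coords)


def verify_result(coords, reported_energy, energy_fn=toy_energy, rel_tol=1e-9):
    """Recompute the energy of a returned structure and compare it with the claim."""
    recomputed = energy_fn(coords)
    return math.isclose(recomputed, reported_energy, rel_tol=rel_tol)


returned_structure = [1.0, 2.0, 2.0]
print(verify_result(returned_structure, 9.0))  # claim matches the recomputed energy
print(verify_result(returned_structure, 7.5))  # claim does not match
```

A result whose claimed energy fails this kind of check could simply be discarded, without ever sending the same workunit to a second host.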
Dimitris Hatzopoulos Send message Joined: 5 Jan 06 Posts: 336 Credit: 80,939 RAC: 0 |
The fact that R@h doesn't waste donated CPU resources was a very important factor for me personally. Not arguing, but with regard to the "many scientific experiments someone stands in a lab and often measure and re-measure the same thing multiple times", maybe I'm missing something, but I don't see how a properly programmed science app, running on an IEEE 754 CPU, would require performing the same calculations 3 times (or even 4-5 times). I guess the occasional overclocked or malfunctioning PC might return an erroneous result (without crashing the app)? If the validity of EVERY SINGLE WU is that important, then send out the same WU to 2 PCs initially, and only if the results from those 2 PCs don't match, send the same WU out a 3rd time. Maybe it's not important to other crunchers, or they don't realise how many TeraFLOPS are redundant. But it's important to ME and apparently to most people I talk with. Incidentally, yesterday one of the biggest portals in my country (Greece) chose to republish an article I wrote 20 days ago as their front-page featured article here (original here), both in Greek, and reading their readers' comments (bottom), I realised that people have many reservations / objections about taking part in DC projects (or maybe those were the "vocal minority"?). Personally I'm happy that R@h values donor resources. With reluctance, I've set "no-more-work" on a project which didn't compress data going in or out (2.5MB text-file results, compressible to 500K), more than 1 month after I pointed it out in their forum. Best UFO Resources Wikipedia R@h How-To: Join Distributed Computing projects that benefit humanity |
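The "2 PCs first, a 3rd only on mismatch" idea from the post above can be sketched roughly as follows. This is an illustration under stated assumptions, not how any BOINC server actually schedules work: `validate_adaptively`, `run_on_host`, and the host names are all hypothetical, and results are compared for exact equality, which a real validator would probably have to relax.

```python
from collections import Counter


def validate_adaptively(run_on_host, hosts, quorum=2, max_copies=3):
    """Send to `quorum` hosts first; add one copy at a time only while results disagree.

    Returns (agreed_result, copies_used); agreed_result is None if no quorum
    was reached within max_copies.
    """
    results = []
    for host in hosts[:max_copies]:
        results.append(run_on_host(host))
        if len(results) < quorum:
            continue  # not enough results yet to compare
        value, votes = Counter(results).most_common(1)[0]
        if votes >= quorum:
            return value, len(results)
    return None, len(results)


# Hypothetical hosts: "b" returns a bad result, so a third copy is needed.
outputs = {"a": -101.5, "b": -57.0, "c": -101.5}
print(validate_adaptively(outputs.get, ["a", "b", "c"]))
```

When the first two hosts agree, only two copies are ever computed; the third is spent only on the workunits where something went wrong.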
Ingleside Send message Joined: 25 Sep 05 Posts: 107 Credit: 1,514,472 RAC: 0 |
Not arguing, but with regard to the "many scientific experiments someone stands in a lab and often measure and re-measure the same thing multiple times", maybe I'm missing something, but I don't see how a properly programmed science app, running on an IEEE 754 CPU, would require performing the same calculations 3 times (or even 4-5 times). Well, after SETI@Home "classic" ended they reported: "We ended up partially or entirely removing credit for about 900 of the top 10000 users"... So, since most projects can't use a trick like the one Rosetta@home uses to validate results, they must live with the fact that in DC some users will try every possible method to pass off "garbage" as a "correct" result. I have no idea of the actual probabilities, but I would expect it's much more probable that 2 "bad" results pass validation than that 3 "bad" results pass validation... and it's even less probable that 4 "bad" results pass validation... Also, even though both Intel and AMD follow the standards, they can still, for example, round off differently, meaning the end result can differ. Are there more reasons for a project to settle on 3 results/WU? Maybe, but since AFAIK all released BOINC projects are currently CPU-limited, it's also in their best interest to have the lowest redundancy they can get away with, since the higher the redundancy, the less usable science... Anyway, as for sending out more copies initially than necessary, this can be to speed up validation (for LHC@home, for example, fast turnaround times are critical), but it can also be to account for errors or for results being "lost" and never reported, and so on. At least back when SETI@Home increased to sending one extra copy initially, a couple of users did some checking on various WUs, and the conclusion seemed to be that over 50% needed 4 results/WU anyway...
While there are some users who only care about the credit and don't care if the same WU has been crunched 100 times already as long as they get work, I would at least hope the majority of users also care about keeping the redundancy as small as possible. Since the projects are CPU-limited, it's also in their interest to keep the redundancy small to increase the speed of gathering science results. Still, the projects need redundancy to make sure the science is correct, and the higher the redundancy, the higher the probability that the results are correct, and therefore the accuracy. Speed and accuracy are 2 things pulling in opposite directions, and most BOINC projects seem to have settled on 3 results/WU, which is likely a compromise of some sort...
It's likely not that a project doesn't value the resources; more likely there are more pressing matters to deal with for the moment... As for compression specifically, it was planned to add support to the BOINC client so that a project could gzip the files and the client would expand them automatically, but this didn't work correctly yet, so the support was pulled again. Still, with a couple more changes, it will likely make it back into the client, but it is unlikely to be ready for the v5.4.0x client expected to be released "soon". So, when it comes to compression, it's possible a project was relying on a change to the BOINC core client that all projects can make use of, instead of just adding this support to its own application... Anyway, we're decidedly off-topic by now, so time to stop typing... |
Moderator9 Volunteer moderator Send message Joined: 22 Jan 06 Posts: 1014 Credit: 0 RAC: 0 |
... So, since most projects can't do a similar trick as Rosetta@home is doing to validate the results, they must live with the fact that in DC some users will try-out all possible methods to pass off "garbage" as "correct" result. ... It is much harder to create false results with Rosetta. First off, the code has never been released, which helps prevent tinkering. But more importantly, the nature of what Rosetta is doing is such that the results are almost impossible to mimic in any way that would not be detected in post-processing. No idea on the actual probabilities, but would expect it's a much higher probability that 2 "bad" results passes validation, than that 3 "bad" results passes validation... well, and it's even lower probability 4 "bad" results passes validation... Two things that people ignore when discussing redundancy for Rosetta are the actual nature of the work and the post-processing phases incorporated into the system. In its simplest form, Rosetta is seeking ways to develop a computer model that can reliably predict the lowest energy levels for a folded protein. All of the results that seem promising for this goal are examined and rerun by the project. So there is no science-driven need for redundancy in the public computing area, as the results that look useful are tested again anyway by the project. In essence, the public is separating the wrong answers from the correct answers, and the project focuses on the correct ones. SETI does the same thing, but their code is open source, and that fact opens them up to the possibility of getting a lot of fraudulent results that look correct but fail in second-phase testing. This in turn would create a large burden for the project in testing results themselves, which they have instead chosen to put back on the user community by using redundancy. The fact that the lack of redundancy might allow people to artificially claim higher credits is also not a science issue.
This occurs on projects with redundancy too. While redundancy reduces the actual awarding of artificial credit claims, it does not eliminate such awards. I want to make clear that neither I nor the Rosetta project advocate the use of non-standard software or other credit inflation techniques. But if you look at the problem objectively, all of the methods of artificially inflating credit claims are available to all users, so the playing field is actually still level for all users, even in the absence of redundancy. It is often said in these discussions that everyone will leave the project because of unfair credit practices. Well, I am certain a number of people have left and will leave, but there will always be others who for any number of reasons will stay. In the case of Rosetta, the science, the best use of donated resources, and the speed of project progress are paramount. This is a condition of participation, just as redundancy is for projects that choose to use it. Moderator9 ROSETTA@home FAQ Moderator Contact |
Ingleside Send message Joined: 25 Sep 05 Posts: 107 Credit: 1,514,472 RAC: 0 |
SETI does the same thing but their code is open source, and that fact opens them up to the possibility of getting a lot of fraudulent results that look correct, but fail in second phase testing. This in turn would create a large burden for the project for testing results themselves, which they instead have chosen to put back on the user community by using redundancy. SETI@Home "classic" was not open-source, but this did not stop some people from "customizing" clients that only returned garbage. So the fact that Rosetta@home is not open-source only slows down someone trying to cheat. But the other fact, that Rosetta@home can verify very cheaply CPU-wise whether a result is correct, is a big advantage.
Actually, crediting is also a science issue. Again looking back at SETI "classic", it was quickly detected with v3.03 that VLAR under Windows was very slow, while VHAR was fast, the difference being roughly 50% in crunch times, and actually higher if running Win9x. Some users therefore just dumped VLAR WUs without crunching them, and you can't really argue that dumping WUs based on angle range without crunching them helps the science. Also, in classic you could just crunch a WU, save a copy of the result, upload the result, wait a couple of days and re-upload it, wait another couple of days and re-upload... This method of artificially inflating granted credit was available to all users, but it does not help the science. For projects using redundancy, the awarded credit is either the lowest claim if 2 results are used for validation, or, if 3 or more are used, the average of the claims that remain after removing the highest and the lowest. This does not stop users from occasionally being very lucky and being paired with another user claiming very high, but you're also going to be paired with users claiming very low, so on average, artificially inflating your credit claims has little or no real effect compared to not inflating them. In Rosetta@home, on the other hand, you depend only on your own credit claims... Having a project where users can claim whatever they want and get it granted will at least initially attract more users, since some users will crunch whatever gives the most "credits". If you "fix" this problem, or another project starts that gives even more credits, there's a good chance many of these users will leave again. Long-term, if you don't fix the problems, the users who do not "artificially inflate" their claims will become more and more disgruntled and start looking for another project. Even users who are mainly interested in the science will start looking for something else if they're constantly being out-credited 5x by someone with a similar computer.
With multiple folding projects already under BOINC, and with Folding@home/BOINC sooner or later ironing out most of its problems and releasing an application, even users who don't want to "waste" resources on redundancy will have a choice, and the switch to another folding project that does not have an "unfair" credit system will be a short one. Users either switching to something else or dropping out completely does not help a project's science.
There will always be some who stay even if the credit system is broken, but for many, credit is one of their top priorities. In SETI@Home "classic" it didn't matter if some users left because they were disgruntled by all the cheating going on, or by any problems with the credit system, since there were still enough users left to crunch every WU 7+ times on average and therefore get the science done. But AFAIK all other projects are CPU-bound, meaning that losing anyone due to an "unfair" credit system directly reduces how much science gets done. |
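The credit rule Ingleside describes for redundant projects (lowest claim when 2 results validate; drop the highest and lowest and average the rest when 3 or more do) is easy to state in code. The sketch below follows only that description; the function name `granted_credit` is an assumption, and it is not taken from the actual BOINC validator source.

```python
def granted_credit(claims):
    """Credit granted for one WU under the rule described in the post above."""
    if len(claims) < 2:
        raise ValueError("need at least 2 validated results")
    if len(claims) == 2:
        return min(claims)
    trimmed = sorted(claims)[1:-1]  # drop the lowest and the highest claim
    return sum(trimmed) / len(trimmed)


print(granted_credit([10.0, 50.0]))         # two results: the lowest claim is granted
print(granted_credit([10.0, 12.0, 50.0]))   # three results: only the middle claim remains
print(granted_credit([10.0, 12.0, 500.0]))  # inflating the top claim changes nothing
```

The last two calls show why inflating a claim buys little under this rule: the highest claim is discarded whether it is 50 or 500. Without redundancy there is only a single claim per WU, so no such trimming is possible.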
BennyRop Send message Joined: 17 Dec 05 Posts: 555 Credit: 140,800 RAC: 0 |
What good does it do to artificially inflate your scores? For each WU, we create 10,000 models. It's easy to create a database of the models, the points/model given for each, and who turned it in. If there really are 4 different groupings of points/model from high to low (group 1 being the optimized BOINC client, group 2 the normal BOINC client, group 3 the Linux client, and group 4 those with errors), it'd be easy to figure out which is group 2 and reset the scores of group 1 and group 3 to the average of group 2. It should be pretty simple to either test in the lab and set the default score per model, or look at the layout of the score/model on a graph and reset everyone. (And then start concentrating on ways of getting the clients to produce equal values no matter the environment.) |
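The reset proposed in the post above could look something like this sketch. Everything here is hypothetical: the record layout, the group labels (`"optimized"`, `"standard"`, `"linux"`, `"error"`), and the function name are invented for illustration, and the sketch assumes each result has already been assigned to a group, which the post only suggests would be easy to do.

```python
from statistics import mean


def normalize_credit(records, reference_group="standard",
                     reset_groups=("optimized", "linux")):
    """Reset credit/model for the chosen groups to the reference group's average."""
    reference = mean(
        r["credit_per_model"] for r in records if r["group"] == reference_group
    )
    return [
        {**r, "credit_per_model": reference} if r["group"] in reset_groups else dict(r)
        for r in records
    ]


# Hypothetical per-result records for one WU.
records = [
    {"user": "u1", "group": "optimized", "credit_per_model": 9.0},
    {"user": "u2", "group": "standard", "credit_per_model": 2.0},
    {"user": "u3", "group": "standard", "credit_per_model": 4.0},
    {"user": "u4", "group": "linux", "credit_per_model": 1.0},
    {"user": "u5", "group": "error", "credit_per_model": 0.0},
]
for row in normalize_credit(records):
    print(row["user"], row["credit_per_model"])
```

Results in the error group are left untouched here, since the post only proposes resetting groups 1 and 3.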
©2024 University of Washington
https://www.bakerlab.org