Problems and Technical Issues with Rosetta@home

Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 98225 - Posted: 19 Jul 2020, 8:19:58 UTC - in response to Message 98224.  

if it says "36000s + 14400s" that indicates the watchdog has now been set back to 4hrs rather than 10hrs
The 4 hours I took to be the run time preference (per this post); the 10 hours the watchdog (per this post).
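For anyone decoding that notation, here is a rough sketch of the arithmetic, assuming the string always has the form "<seconds>s + <seconds>s". The function name `parse_limit` is made up for illustration; it is not part of BOINC or Rosetta.

```python
def parse_limit(text):
    """Split a string like '36000s + 14400s' into a list of hours."""
    hours = []
    for part in text.split("+"):
        # Drop surrounding spaces and the trailing 's', then convert to hours.
        seconds = int(part.strip().rstrip("s"))
        hours.append(seconds / 3600)
    return hours


parts = parse_limit("36000s + 14400s")
print(parts, sum(parts))  # 10 hours and 4 hours, 14 hours in total
```

Which of the two terms is the watchdog and which is the run time preference is exactly the point under discussion, so the sketch only converts the units.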
ID: 98225 · Rating: 0

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,690,215
RAC: 7,274
Message 98229 - Posted: 19 Jul 2020, 18:25:23 UTC - in response to Message 98224.  

Slightly side-tracking.
That task isn't available to view any more, but if it says "36000s + 14400s" that indicates the watchdog has now been set back to 4hrs rather than 10hrs.
I wasn't aware that'd changed back as I haven't had a long-running task for a very long time
I've got one running right now. 1 day, 5 hours, 40 minutes of CPU time: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1095559368

My wingman completed it in 13 hours, but so far I've taken 1 day, 5 hours, 40 minutes. The wingman's computer has an i5-6402P, which I've never heard of, but if it's a similar speed to an i5-6400, then it's a similar speed to my Xeon per core, so I'm not sure how he did it so quickly. How does winging work with Rosetta? Can't you end up with one guy doing more modules than another because his computer is faster?
ID: 98229 · Rating: 0

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 98230 - Posted: 19 Jul 2020, 18:42:59 UTC - in response to Message 98229.  
Last modified: 19 Jul 2020, 18:50:18 UTC

How does winging work with Rosetta?
It doesn’t. Tasks are typically not sent to more than one machine. Yours did probably only because its deadline has passed. If your machine does ever finish it, you will get the same credit as the other user. (Looking at the FLOPS: his machine is 30% faster than yours.) And yes: this is where BOINC’s credit model (designed for fixed work / variable time) breaks down on Rosetta (fixed time / variable work). (Explanation from Mod.Sense.)
ID: 98230 · Rating: 0

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,690,215
RAC: 7,274
Message 98231 - Posted: 19 Jul 2020, 18:48:19 UTC - in response to Message 98230.  
Last modified: 19 Jul 2020, 18:49:22 UTC

How does winging work with Rosetta?
It doesn’t. Tasks are typically not sent to more than one machine. Yours did probably only because its deadline has passed. If your machine does ever finish it, you will get the same credit as the other user. (Explanation from Mod.Sense.)
Ok that answers one of my two questions, but.... how did he finish it so quickly? I can only assume his CPU, although similar in a benchmark, is faster at Rosetta. Back to the question you answered - I take it Rosetta is programmed such that it cannot send back a wrong result? Most projects have to check with at least one other person to make sure you got the answer right.
ID: 98231 · Rating: 0

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 98232 - Posted: 19 Jul 2020, 19:10:35 UTC - in response to Message 98231.  

I edited while you were replying…

Looking at the stats: his machine is 30% faster at floating point ops, and 80% faster at integer ops, than yours. Using those numbers, yours should take somewhere between 17 and 25 hours. But that the task is still not finished after 30 hours suggests it’s not that simple…
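The scaling behind that estimate can be sketched as follows. This assumes run time scales linearly with the benchmark ratio, and it uses the rounded 13-hour figure quoted earlier in the thread; the exact inputs used for the 17 to 25 hour range are not stated, so the numbers here are only in the same ballpark.

```python
def estimate_hours(reference_hours, speed_ratio):
    """Scale a reference run time by how much faster the reference machine is."""
    return reference_hours * speed_ratio


wingman_hours = 13.0  # assumption: rounded run time of the faster machine
low = estimate_hours(wingman_hours, 1.30)   # 30% faster at floating point
high = estimate_hours(wingman_hours, 1.80)  # 80% faster at integer ops
print(f"{low:.1f} to {high:.1f} hours")
```

A task still running after 30 hours falls well outside either bound, which is why benchmark ratios alone do not explain the gap.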

From what Mod.Sense wrote, Rosetta would rather have two machines doing two different tasks than both doing the same and comparing results to ensure they’re ‘right’. I’m not sure there’s really such a thing as a ‘wrong’ answer with Rosetta anyway, if the tasks are simply asking: “What if…?” Any results that look promising will be investigated further, and can be discarded if they turn out to be somehow erroneous.
ID: 98232 · Rating: 0

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,690,215
RAC: 7,274
Message 98233 - Posted: 19 Jul 2020, 21:27:55 UTC - in response to Message 98232.  
Last modified: 19 Jul 2020, 21:30:27 UTC

Looking at the stats: his machine is 30% faster at floating point ops, and 80% faster at integer ops, than yours. Using those numbers, yours should take somewhere between 17 and 25 hours. But that the task is still not finished after 30 hours suggests it’s not that simple…
Where did you get the data from? I usually compare using http://cpuboss.com/compare-cpus but that has not heard of his CPU. I tried searching for a few more comparison sites, but the ones that list his don't have benchmarks, they just list all the specs side by side.

From what Mod.Sense wrote, Rosetta would rather have two machines doing two different tasks than both doing the same and comparing results to ensure they’re ‘right’. I’m not sure there’s really such a thing as a ‘wrong’ answer with Rosetta anyway, if the tasks are simply asking: “What if…?” Any results that look promising will be investigated further, and can be discarded if they turn out to be somehow erroneous.
But if a computer makes a mistake it will miss what could be an interesting combination. There must be some kinda CRC check in the programming. Astrophysics projects use at least two machines, as the answer can be incorrect.

And yes: this is where BOINC’s credit model (designed for fixed work / variable time) breaks down on Rosetta (fixed time / variable work). (Explanation from Mod.Sense.)
It only breaks down when someone returns it too late.
ID: 98233 · Rating: 0

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 98234 - Posted: 19 Jul 2020, 21:59:47 UTC - in response to Message 98233.  

Where did you get the data from?
I was just looking at the Measured floating point speed and Measured integer speed values on each Computer Details page, which come from the Whetstone and Dhrystone benchmarks that BOINC runs.
ID: 98234 · Rating: 0

robertmiles
Joined: 16 Jun 08
Posts: 1224
Credit: 13,841,472
RAC: 1,593
Message 98236 - Posted: 19 Jul 2020, 22:43:36 UTC - in response to Message 98233.  
Last modified: 19 Jul 2020, 22:44:35 UTC

[snip]

There must be some kinda CRC check in the programming. Astrophysics projects use at least two machines, as the answer can be incorrect.

It depends. If the project is searching a very large set of starting points that should all give answers converging to the best possible answer, and the server can quickly evaluate the quality of what was returned, then a few wrong answers aren't important enough to justify reducing the number of starting points that are evaluated.

On the other hand, I've seen a BOINC project where nearly all of the tasks returned answers saying nothing was found. Someone noticed this, and wrote a fake application program that always returned nothing was found, without even checking if there was anything that should have been found. The project had so few users that each workunit went to only one computer, except after timeouts and obvious errors. This means that the fake results were only noticed after someone noticed that the fake application used less than 1% of the CPU time used by the real one, and by then so many of the fake results had been declared valid and the run time deleted that a large number of workunits had to be recreated and run again.
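The kind of check that eventually exposed the fake application could look something like this. It is only a sketch: the host names and CPU times are invented, and the 1% threshold is taken from the story above rather than from any real project's validator.

```python
from statistics import median


def suspicious_results(cpu_seconds_by_host, fraction=0.01):
    """Return hosts whose CPU time is below a fraction of the typical value."""
    typical = median(cpu_seconds_by_host.values())
    return sorted(
        host
        for host, seconds in cpu_seconds_by_host.items()
        if seconds < fraction * typical
    )


# Hypothetical CPU times in seconds for one application.
sample = {
    "host_a": 14200.0,
    "host_b": 13950.0,
    "host_c": 95.0,
    "host_d": 14800.0,
    "host_e": 14100.0,
}
print(suspicious_results(sample))
```

Using the median rather than the mean keeps a handful of fake results from dragging the "typical" value down with them.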
ID: 98236 · Rating: 0

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,690,215
RAC: 7,274
Message 98237 - Posted: 19 Jul 2020, 22:58:32 UTC - in response to Message 98236.  

On the other hand, I've seen a BOINC project where nearly all of the tasks returned answers saying nothing was found. Someone noticed this, and wrote a fake application program that always returned nothing was found, without even checking if there was anything that should have been found. The project had so few users that each workunit went to only one computer, except after timeouts and obvious errors. This means that the fake results were only noticed after someone noticed that the fake application used less than 1% of the CPU time used by the real one, and by then so many of the fake results had been declared valid and the run time deleted that a large number of workunits had to be recreated and run again.
I shake my head in disgust at who would do such a thing; it's not even as if you can make money out of getting more credits.
ID: 98237 · Rating: 0

Tomcat雄猫
Joined: 20 Dec 14
Posts: 180
Credit: 5,364,639
RAC: 0
Message 98238 - Posted: 20 Jul 2020, 4:19:40 UTC - in response to Message 98234.  
Last modified: 20 Jul 2020, 4:21:04 UTC

Where did you get the data from?
I was just looking at the Measured floating point speed and Measured integer speed values on each Computer Details page, which come from the Whetstone and Dhrystone benchmarks that BOINC runs.


Those numbers are anything but accurate.
My hilariously thermally constrained Macbook from 2015 has a measured floating point speed of 5.65GFLOPs (it can go above 6.10GFLOPs sometimes, which is way higher than a well-cooled i9-9900K). That is faster than my Ryzen 3600 and many current gen high-end desktop-grade CPUs from Intel.
There is no way that can be true. Integer performance seems to match expectations, though.
ID: 98238 · Rating: 0

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1481
Credit: 14,579,319
RAC: 14,459
Message 98240 - Posted: 20 Jul 2020, 6:04:19 UTC - in response to Message 98231.  

Ok that answers one of my two questions, but.... how did he finish it so quickly?
My understanding is that for a given Work Unit, each Task actually starts with a different random seed. So while the data for 2 (or more) Tasks from a given Work Unit is the same, the starting seed values are different, and so the calculation work actually done can be significantly different, even though the data being processed is the same.
Hence why there is no comparison of results involved in Validation of work done.
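A toy illustration of that idea, not Rosetta's actual algorithm: the same input data is searched from a starting point chosen by a seeded random number generator. The `energy` function and `run_task` are made-up stand-ins.

```python
import random


def energy(x, data):
    """Toy score, lower is better. Stands in for a real energy function."""
    return sum((x - d) ** 2 for d in data)


def run_task(data, seed, steps=200):
    """Random downhill search whose path depends entirely on the seed."""
    rng = random.Random(seed)
    start = rng.uniform(-10.0, 10.0)
    x, best = start, energy(start, data)
    for _ in range(steps):
        candidate = x + rng.gauss(0.0, 0.5)
        score = energy(candidate, data)
        if score < best:  # keep only improvements
            x, best = candidate, score
    return {"start": start, "x": x, "energy": best}


data = [1.0, 2.0, 3.0]        # same data for both tasks
a = run_task(data, seed=1)
b = run_task(data, seed=2)
print(a["start"], b["start"])  # different seeds give different starting points
```

Because two tasks with different seeds never retrace the same path, there is nothing to compare line by line between them, which fits the point about validation above.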

I could be wrong of course.
Grant
Darwin NT
ID: 98240 · Rating: 0

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1481
Credit: 14,579,319
RAC: 14,459
Message 98241 - Posted: 20 Jul 2020, 6:07:59 UTC - in response to Message 98237.  

On the other hand, I've seen a BOINC project where nearly all of the tasks returned answers saying nothing was found. Someone noticed this, and wrote a fake application program that always returned nothing was found, without even checking if there was anything that should have been found. The project had so few users that each workunit went to only one computer, except after timeouts and obvious errors. This means that the fake results were only noticed after someone noticed that the fake application used less than 1% of the CPU time used by the real one, and by then so many of the fake results had been declared valid and the run time deleted that a large number of workunits had to be recreated and run again.
I shake my head in disgust at who would do such a thing; it's not even as if you can make money out of getting more credits.
Cheating by some people in the original Seti project was the reason BOINC was developed: Credits instead of just counting the number of Work Units processed, and a method for comparing results to see if a returned result is actually Valid or not.
Grant
Darwin NT
ID: 98241 · Rating: 0

Bryn Mawr
Joined: 26 Dec 18
Posts: 374
Credit: 10,698,043
RAC: 5,337
Message 98248 - Posted: 20 Jul 2020, 17:14:47 UTC - in response to Message 98233.  

Where did you get the data from? I usually compare using http://cpuboss.com/compare-cpus but that has not heard of his CPU. I tried searching for a few more comparison sites, but the ones that list his don't have benchmarks, they just list all the specs side by side.

Try :-

https://www.cpubenchmark.net/cpu_list.php
ID: 98248 · Rating: 0

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 98249 - Posted: 20 Jul 2020, 19:22:23 UTC

So far as validating results, lost results etc. Each protein study fires off thousands of tasks. Some 5% or less of those results will look to be the best. If a task ran astray out in the wild, and mistakenly reports a terrible result, that's not ideal, but there should still be a similar result in those top 5%. If the task ran astray and mistakenly reports a fantastic result, that single model is rerun in the lab and confirmed. If the lab system has the same flaw, it should get the same fantastic result. But there is also human review of the results. Sometimes you can tell, just by the shape of the result, that it doesn't look like a protein found in nature.

If a protein-protein interaction were being studied, it might be more difficult to tell that something is off just by the shape. Eventually results may be sent to the "wet lab" where they produce the two proteins and see if they actually interact as predicted by the model.

If the protein structure has already been determined, the models are compared to the known structure and the degree of their similarity is measured in RMSD (root-mean-square deviation).
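For readers unfamiliar with the measure, a minimal sketch of a root-mean-square deviation between two sets of atom positions follows. It assumes both structures list the same atoms in the same order and are already superimposed; real structure-comparison tools normally align the structures first, which this sketch does not do.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))


known = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
model = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]  # every atom off by 1
print(rmsd(known, model))  # 1.0
```

A lower value means the model sits closer to the known structure.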

Sometimes the human review of the top 5% of the results concludes that we still have not found the best model. Perhaps there is a high variability in appearance across the top scoring models. In such cases, variations of those top 5% of the results are sent out as a new round of work. It is for the same protein, and again will do thousands of models, but these will start with some assumptions or rules that cause you to begin with something much closer to one of those previous best results, and search around that same area for a better (lower energy) result.

I made up the 5% number. 1% or less is probably more realistic. Maybe I should have said something like "...the top 10 or 20 models".

Anyway, I hope that makes it more clear why R@h does not require a wingman to rerun the same models to confirm results. When you get down to those top 10 results, they should all look pretty similar. Each arrived at that model from a different start, but, in the end, the top results should all be similar to the actual protein's structure in nature. So, they should all be very similar. So if the 11th top result looks radically different due to some error, it will stand out like a sore thumb.
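The "sore thumb" idea can be sketched roughly like this. Everything here is hypothetical: each model is reduced to a name, an energy, and a single made-up "shape" number, whereas the real comparison is between full 3D structures and involves human review.

```python
from statistics import median


def top_models(models, n=10):
    """Return the n lowest-energy models; each model is (name, energy, shape)."""
    return sorted(models, key=lambda m: m[1])[:n]


def odd_ones_out(models, tolerance):
    """Names of models whose shape value is far from the group's median."""
    centre = median(m[2] for m in models)
    return [m[0] for m in models if abs(m[2] - centre) > tolerance]


# Twenty ordinary models plus one erroneous result with a too-good energy.
models = [(f"model_{i}", -100.0 + i, 5.0 + 0.1 * i) for i in range(20)]
models.append(("glitch", -150.0, 42.0))

best = top_models(models, n=10)
print(odd_ones_out(best, tolerance=2.0))
```

The erroneous result makes it into the top group because of its energy, but it is the only one that disagrees with the rest, so it is easy to single out for a rerun.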
Rosetta Moderator: Mod.Sense
ID: 98249 · Rating: 0

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,690,215
RAC: 7,274
Message 98250 - Posted: 20 Jul 2020, 19:41:52 UTC - in response to Message 98249.  

So far as validating results, lost results etc. Each protein study fires off thousands of tasks
[snip]
So, they should all be very similar. So if the 11th top result looks radically different due to some error, it will stand out like a sore thumb.
Excellent description, thanks, it's nice to know how the system operates that we're running.
ID: 98250 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 1981
Credit: 38,422,922
RAC: 13,431
Message 98252 - Posted: 21 Jul 2020, 0:12:27 UTC - in response to Message 98225.  

if it says "36000s + 14400s" that indicates the watchdog has now been set back to 4hrs rather than 10hrs
The 4 hours I took to be the run time preference (per this post); the 10 hours the watchdog (per this post).

Got it.
It's been so long since I needed to look at task overruns I must've completely forgotten the syntax.
Made worse by the task runtime being 4+10 rather than 10+4. If it was 8+watchdog I wouldn't have confused myself so easily (I hope)
ID: 98252 · Rating: 0

Jord
Joined: 16 Sep 05
Posts: 41
Credit: 204,120
RAC: 0
Message 98265 - Posted: 22 Jul 2020, 10:02:23 UTC

When you made the 4.20 app for Windows, did you add the code (via the BOINC API) that checks every 10 seconds if the client has died and will then auto-exit the app?
While testing something with BOINC/BOINC Manager, I find that when I kill BOINC Manager about 15 seconds after it starts up, while Rosetta tasks are still loading into memory, both BOINC and BOINC Manager exit normally but the Rosetta tasks that started stay in memory. Even after a handful of minutes these apps still run. I have to manually kill them.
Restarting BOINC Manager will only cause the tasks that started already to stay in memory and in BOINC Manager these show as "waiting to acquire slot directory lock. Another instance may be running."
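To make the expected behaviour concrete, here is a small simulation of the kind of check being asked about. It is not the BOINC API: the function names are invented, the 10-second check interval comes from the question above, and the 30-second timeout is an assumed value chosen only for the example.

```python
def client_is_dead(last_heartbeat, now, timeout=30.0):
    """True if no heartbeat has been seen for longer than the timeout."""
    return (now - last_heartbeat) > timeout


def exit_time(heartbeats, check_interval=10.0, timeout=30.0, max_time=300.0):
    """Simulate an app that checks for heartbeats at a fixed interval.

    heartbeats is a sorted list of times (seconds) at which the client
    signalled that it was alive. Returns the time at which the app would
    exit, or None if the client stayed alive for the whole simulation.
    """
    t = 0.0
    while t <= max_time:
        seen = [h for h in heartbeats if h <= t]
        last = seen[-1] if seen else 0.0
        if client_is_dead(last, t, timeout):
            return t
        t += check_interval
    return None


# Client sends a heartbeat every second, then is killed 15 seconds in.
killed_early = [float(i) for i in range(16)]
print(exit_time(killed_early))
```

With a check like this in place, the science app would shut itself down within a bounded time after the client disappears, instead of staying in memory and holding the slot directory lock.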
ID: 98265 · Rating: 0

Corgi
Joined: 19 Jun 19
Posts: 5
Credit: 1,388,026
RAC: 4,522
Message 98466 - Posted: 10 Aug 2020, 19:07:03 UTC

Perhaps you can help me adjust my settings - I've been getting Rosetta tasks with deadlines that would require me to walk away from my computer and not use it for anything else to ensure completion - for example, I just recontacted the project to clear two sadly-unfinished tasks with more than a day yet to run that were due two days ago. A lot of what else I do is resource-intensive, so I have to pause BOINC and F@H while they're running.

I hate seeing these tasks I can't complete! Suggestions, please?
ID: 98466 · Rating: 0

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 9,690,215
RAC: 7,274
Message 98467 - Posted: 10 Aug 2020, 19:27:52 UTC - in response to Message 98466.  

Perhaps you can help me adjust my settings - I've been getting Rosetta tasks with deadlines that would require me to walk away from my computer and not use it for anything else to ensure completion - for example, I just recontacted the project to clear two sadly-unfinished tasks with more than a day yet to run that were due two days ago. A lot of what else I do is resource-intensive, so I have to pause BOINC and F@H while they're running.

I hate seeing these tasks I can't complete! Suggestions, please?


How many cores do you have? Can you run your intensive tasks and a smaller number of Rosettas at once, by limiting Boinc to use fewer cores? Or leave the computer on more when you're not using it?
ID: 98467 · Rating: 0

Ray Murray
Joined: 22 Apr 20
Posts: 17
Credit: 270,864
RAC: 0
Message 98468 - Posted: 10 Aug 2020, 19:57:35 UTC - in response to Message 98466.  

Hi Corgi,
Running Boinc and Folding together can cause resource conflicts. Boinc can't see that Folding is using 1 (light), 3 (medium) or all 4 (full) cores so Boinc will, itself, try to use those cores as well causing Folding, Boinc and anything else you're trying to do, to slow down. You could set Folding to light, to use only 1 core and Boinc to use 3 of 4, 75%, or medium, 3 core and limit Boinc to 1 of 4, 25%. Or maybe set Folding to light, Boinc to 50% or 25%, leaving 1, or 2 cores free.
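The splits described above come down to simple arithmetic. A quick sketch, with a made-up helper name, that turns "cores for Folding" and "cores left free" into the percentage of CPUs to allow Boinc to use:

```python
def boinc_cpu_percent(total_cores, folding_cores, free_cores=0):
    """Percentage of cores left for Boinc after Folding and idle cores."""
    boinc_cores = total_cores - folding_cores - free_cores
    if boinc_cores < 0:
        raise ValueError("more cores assigned than the machine has")
    return 100 * boinc_cores / total_cores


print(boinc_cpu_percent(4, 1))     # Folding light, Boinc gets 3 of 4
print(boinc_cpu_percent(4, 3))     # Folding medium, Boinc gets 1 of 4
print(boinc_cpu_percent(4, 1, 1))  # Folding light, 1 core kept free
print(boinc_cpu_percent(4, 1, 2))  # Folding light, 2 cores kept free
```

The point is that the two programs' shares, plus whatever you keep free for your own work, should add up to no more than the cores you actually have.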

I've noticed with Folding, if set to medium, 3 cores, before a task starts, you can turn it down to light, 1 core, and back up to medium later, but if a task starts in light, 1 core, turning it up to medium has no effect and it will run as 1 core to the end of that task.

Hope that helps.
ID: 98468 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org