Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Sid Celery Send message Joined: 11 Feb 08 Posts: 2146 Credit: 41,570,180 RAC: 8,210 |
When this has happened to me it has self corrected after about an hour - give it time and then go for another update and you should get some new tasks. For some reason, aborting the transfer first before aborting the task didn't work. Aborting the task first, then the download was always much more successful. |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 404 Credit: 12,294,748 RAC: 2,551 |
Odd, when I abort the transfer the task disappears about 10 seconds later, I don’t need to abort it. I’m on 7.16 |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
That would be my expectation of the BOINC Manager. When one of the files required by a task fails to download, then the task is aborted. And there will be times when several of the tasks you are downloading depend upon the same file, and all of them abort when a file transfer fails. Rosetta Moderator: Mod.Sense |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 4,044 |
It's a pity the BOINC Manager doesn't seem to notice for hours. I get the message in the log "some download is stalled" (which prevents any more tasks getting downloaded from that project) up to an hour or so after I've cancelled both the download and the task. 1) Rosetta needs to stop the server stalling downloads. 2) BOINC needs to fix their program so it doesn't get upset just because 1 file failed, then fail to notice the user told it to give up. |
amgthis Send message Joined: 25 Mar 06 Posts: 81 Credit: 203,879,282 RAC: 0 |
I haven't seen the small file stall on d/l lately, but did have this one hang a couple of days ago, more than once: rb_02_20_16480_16303_ab_t000_h001_robetta.zip (3.06 KiB). I'm still not convinced it's a by-product of super slow DSL on my end. In any case I think it's cleared up. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 4,044 |
"I haven't seen the small file stall on d/l lately, but did have this one hang a ..." I doubt it was your connection. I have fibre here and got the same problem. But as you say, it's pretty much sorted somehow; pity nobody admitted what they did wrong! |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 404 Credit: 12,294,748 RAC: 2,551 |
"I haven't seen the small file stall on d/l lately, but did have this one hang a ..." I had another stalled download today, but my main problem over the past few days has been jobs erroring out after 10 hours. Each job has a single decoy that was still running 4 hours after my 6 hour limit. So far there have been 11 such jobs over 2 machines, 3 of them on one machine yesterday alone, which is a quarter of that machine’s allocation for Rosetta: https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985018 https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985251 https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985278 |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 4,044 |
"I haven't seen the small file stall on d/l lately, but did have this one hang a ..." What is a decoy? All my machines complete a task in 7.5 to 8.5 hours. And they're not particularly fast machines, one is 12 years old. I've seen no errors in over a week. There must be a pattern here. |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 404 Credit: 12,294,748 RAC: 2,551 |
"I haven't seen the small file stall on d/l lately, but did have this one hang a ..." Within the stdout file the app reports the number of scenarios that have been processed in the time you make available. Normally, in the standard 8 hour window, you’ll process the data with maybe 40 different starting positions (these are known as decoys, for some reason that escapes me). I set my processing window to 6 hours, and a normal work unit will process maybe 30 decoys. The work units that are erroring out have not finished the first run through the data after 10 hours (6 hour preference plus 4 hours allowed overrun), at which point the watchdog aborts the process. |
robertmiles Send message Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 826 |
Another small file download problem, which is currently blocking me from getting any more R@H tasks: 10v3nmgb_c14394_10mer_gb_000420_SAVE_ALL_OUT_896889_53_1 https://boinc.bakerlab.org/rosetta/result.php?resultid=1125377637 A wingmate timed out and therefore may have had the same problem: https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1012342810 Relevant lines from the log: 3/4/2020 4:00:05 AM | Rosetta@home | Started download of 10v3nmgb_c14394_10mer_gb_000420.zip 3/4/2020 4:05:12 AM | Rosetta@home | Temporarily failed download of 10v3nmgb_c14394_10mer_gb_000420.zip: transient HTTP error 3/4/2020 4:05:12 AM | Rosetta@home | Backing off 04:50:13 on download of 10v3nmgb_c14394_10mer_gb_000420.zip 3/4/2020 4:05:13 AM | | Project communication failed: attempting access to reference site 3/4/2020 4:05:15 AM | | Internet access OK - project servers may be temporarily down. 3/4/2020 4:42:49 AM | Rosetta@home | Sending scheduler request: To report completed tasks. 3/4/2020 4:42:49 AM | Rosetta@home | Reporting 2 completed tasks 3/4/2020 4:42:49 AM | Rosetta@home | Not requesting tasks: some download is stalled 3/4/2020 4:42:51 AM | Rosetta@home | Scheduler request completed Does the file that fails to download even exist on the server? The expected size of the file is only 3.23 KB. Could the server have problems downloading files of a certain small size? DSL speed here is not especially high or low. Enough other BOINC projects are selected on this computer to keep it busy. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 4,044 |
"What is a decoy? All my machines complete a task in 7.5 to 8.5 hours. And they're not particularly fast machines, one is 12 years old. I've seen no errors in over a week. There must be a pattern here." Are you setting this limit or is it done by the server? I just let all projects take whatever time they feel necessary. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 4,044 |
"Another small file download problem, which is currently blocking me from getting any more R@H tasks:" It's always the little 3kB files that get stuck for me. That suggests either that a different, misbehaving server is producing those, or that the files are corrupt for some reason. They don't even download using a web browser. |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 404 Credit: 12,294,748 RAC: 2,551 |
"What is a decoy? All my machines complete a task in 7.5 to 8.5 hours. And they're not particularly fast machines, one is 12 years old. I've seen no errors in over a week. There must be a pattern here." You’ll find it under your account > project preferences > target cpu time. It defaults to 8 hours, but after I had quite a few of these errors I dropped mine to 6 hours in the hope I’d waste slightly less processing time. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 4,044 |
"What is a decoy? All my machines complete a task in 7.5 to 8.5 hours. And they're not particularly fast machines, one is 12 years old. I've seen no errors in over a week. There must be a pattern here." Strange, all the projects that have given me errors produce them well within the normal processing time. And some projects like LHC seem to have tasks that usually take 2 hours but can take 4 days, yet still complete fine. |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 404 Credit: 12,294,748 RAC: 2,551 |
Rosetta works differently to most in that it does as much processing as it can in the time allowed rather than takes as much time as the fixed amount of processing takes. To ensure that a rogue work unit does not lock out a core, it has a watchdog that closes out the WU if it is still running 4 hours beyond the time limit set. As the WU closes down at the end of a decoy when it predicts that there is not time to process another before the deadline, this can only really happen if a single decoy runs for more than the 4 hours. I’d guess that this implies the decoy is in a loop, but I cannot say that for certain. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 4,044 |
"Rosetta works differently to most in that it does as much processing as it can in the time allowed rather than takes as much time as the fixed amount of processing takes." My Rosetta WUs all seem to finish in a very precise timeframe of 7.5 hours (8.5 hours on the slower machines); I've seen no variation. Perhaps the limiter only takes effect occasionally. LHC has a huge variation: the theory tasks can take from 1.5 hours to 4 days! The percentage moves very slowly as though it will take 4 days, but it seems to complete at a random point somewhere in there. Often at only "2% completed" it jumps to 100% and says it was successful. I guess it's looking for an answer somewhere in there and finds it early? |
Bryn Mawr Send message Joined: 26 Dec 18 Posts: 404 Credit: 12,294,748 RAC: 2,551 |
"Rosetta works differently to most in that it does as much processing as it can in the time allowed rather than takes as much time as the fixed amount of processing takes." Pass, I’ve never looked at LHC so I wouldn’t know. |
Mr P Hucker Send message Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 4,044 |
"Pass, I’ve never looked at LHC so I wouldn’t know." They've got Atlas tasks, which will run one WU on all your CPU cores at once. I want to get a Ryzen Threadripper to see if they'll give me a 64 core task :-) |
rlpm Send message Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0 |
All WUs for Rosetta v4.07 i686-pc-linux-gnu on my 1st gen AppleTV running Linux (OSMC with all GUI etc. disabled) are failing for going over the RAM limit. See here. E.g.: "working set size > client RAM limit: 167.87MB > 167.55MB". Is there something wrong with how the working set size is matched to the amount of available RAM? Or can I limit to the Rosetta Mini application only? |
rlpm Send message Joined: 23 Mar 20 Posts: 13 Credit: 84 RAC: 0 |
Looks like the same is happening for mini tasks, e.g. task 1132535295: "working set size > client RAM limit: 170.39MB > 167.55MB" |
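
For anyone comparing the two RAM-limit errors quoted in the last two posts, here is a minimal sketch that pulls the numbers out of the message and shows how far over the limit each task was. The regular expression is an assumption based only on those two quoted samples, and `ram_overshoot` is a hypothetical helper, not part of BOINC or Rosetta.

```python
import re

# Pattern for the error text quoted in the two posts above. The exact
# message format is an assumption based on those two samples only.
LIMIT_RE = re.compile(
    r"working set size > client RAM limit: "
    r"(?P<used>\d+(?:\.\d+)?)MB > (?P<limit>\d+(?:\.\d+)?)MB"
)


def ram_overshoot(message):
    """Return (used_mb, limit_mb, overshoot_mb), or None if no match."""
    match = LIMIT_RE.search(message)
    if match is None:
        return None
    used = float(match.group("used"))
    limit = float(match.group("limit"))
    # Round to two decimals, the precision used in the quoted messages.
    return used, limit, round(used - limit, 2)


samples = [
    "working set size > client RAM limit: 167.87MB > 167.55MB",
    "working set size > client RAM limit: 170.39MB > 167.55MB",
]

for text in samples:
    used, limit, over = ram_overshoot(text)
    print(f"{used:.2f} MB used, {limit:.2f} MB allowed, {over:.2f} MB over")
```

On the two quoted messages the overshoot works out to 0.32 MB and 2.84 MB, so both tasks were only marginally above the client's limit rather than far beyond it.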
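
The run-time behaviour Bryn Mawr describes in this thread (work until the target CPU time is used up, stop at the end of a decoy, and let a watchdog abort anything still running 4 hours past the target) can be sketched as a small simulation. This is only an illustration under stated assumptions: the decoy durations are made up, the "would another decoy fit?" prediction uses a simple running average, and `run_work_unit` is a hypothetical name, not Rosetta's actual code.

```python
TARGET_HOURS = 6.0   # the "target cpu time" preference mentioned in the thread
OVERRUN_HOURS = 4.0  # watchdog allowance described in the thread


def run_work_unit(decoy_hours, target=TARGET_HOURS, overrun=OVERRUN_HOURS):
    """Simulate one work unit; return (status, decoys_done, hours_used).

    decoy_hours: hypothetical run time of each successive decoy, in hours.
    """
    elapsed = 0.0
    done = 0
    for hours in decoy_hours:
        # Watchdog: a decoy that would still be running past
        # target + overrun is aborted at that point.
        if elapsed + hours > target + overrun:
            return "aborted by watchdog", done, target + overrun
        elapsed += hours
        done += 1
        # Assumed prediction rule: stop cleanly when another decoy of
        # average length would not fit inside the target time.
        average = elapsed / done
        if elapsed + average > target:
            break
    return "finished", done, elapsed


# A normal work unit: quick decoys fill the 6 hour window, then it stops.
print(run_work_unit([0.25] * 100))

# A work unit whose first decoy never finishes in time: the watchdog
# aborts it at 10 hours with nothing completed.
print(run_work_unit([11.0]))
```

In this model a work unit can only hit the watchdog if a single decoy runs for more than the 4 hour allowance, which matches the reasoning given in the thread.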
©2024 University of Washington
https://www.bakerlab.org