Problems and Technical Issues with Rosetta@home

Sid Celery

Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 8,210
Message 91767 - Posted: 24 Feb 2020, 14:03:26 UTC - in response to Message 91718.  

When this has happened to me it has self-corrected after about an hour - give it time, then go for another update and you should get some new tasks.

Do you mean completely self-corrected, or self-corrected after you aborted the task? If I don't abort the task, I've seen it still stuck after about 18 hours. It just keeps retrying and failing to download about every 3 hours.

I abort the transfer (not the task), and normally that is enough to allow downloads to restart when I do a project update.

On the odd occasion, however, it has given the message you reported after the update. In that case I leave it an hour and redo the update; on every occasion so far the update has succeeded in bringing down new WUs.

For some reason, aborting the transfer first, before aborting the task, didn't work for me. Aborting the task first, then the download, was always much more successful.
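If you find yourself doing this a lot, the same abort-and-update sequence can be scripted against the client with boinccmd. A minimal sketch, assuming boinccmd is on your PATH and is allowed to talk to the local client; the stalled filename below is a placeholder:

# Hypothetical recovery script for a stuck Rosetta@home download, driven
# through boinccmd (shipped with the BOINC client). The filename below is
# a placeholder; list the transfers first to find the real one.
import subprocess

PROJECT = "https://boinc.bakerlab.org/rosetta/"

def boinccmd(*args):
    # Run one boinccmd subcommand and return its output.
    out = subprocess.run(["boinccmd", *args],
                         capture_output=True, text=True, check=True)
    return out.stdout

print(boinccmd("--get_file_transfers"))            # spot the stalled file
boinccmd("--file_transfer", PROJECT,
         "example_input.zip", "abort")             # abort the transfer
boinccmd("--project", PROJECT, "update")           # then update the project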
Bryn Mawr

Joined: 26 Dec 18
Posts: 404
Credit: 12,294,748
RAC: 2,551
Message 91769 - Posted: 24 Feb 2020, 16:18:16 UTC - in response to Message 91767.  


For some reason, aborting the transfer first, before aborting the task, didn't work for me. Aborting the task first, then the download, was always much more successful.


Odd; when I abort the transfer, the task disappears about 10 seconds later, so I don’t need to abort it.

I’m on BOINC 7.16.
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 91773 - Posted: 24 Feb 2020, 19:12:47 UTC - in response to Message 91769.  


Odd; when I abort the transfer, the task disappears about 10 seconds later, so I don’t need to abort it.


That would be my expectation of the BOINC Manager. When one of the files required by a task permanently fails to download, the task is aborted. There will also be times when several of the tasks you are downloading depend upon the same file, and all of them abort when that file's transfer fails.
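As a toy model of that cascade (illustrative only, not BOINC source): several queued tasks can share one input file, so one permanently failed download takes them all down together.

# Toy model of the cascade described above, not actual BOINC code:
# tasks that depend on a file whose download permanently fails are
# aborted together.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    input_files: set = field(default_factory=set)
    state: str = "downloading"

def on_download_failed(tasks, failed_file):
    # Abort every not-yet-running task that needs the failed file.
    for task in tasks:
        if failed_file in task.input_files and task.state == "downloading":
            task.state = "aborted"

tasks = [
    Task("rb_job_01", {"common_db.zip", "job_a.zip"}),   # hypothetical names
    Task("rb_job_02", {"common_db.zip", "job_b.zip"}),
]
on_download_failed(tasks, "common_db.zip")
print([(t.name, t.state) for t in tasks])   # both end up aborted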
Rosetta Moderator: Mod.Sense
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91774 - Posted: 24 Feb 2020, 19:19:29 UTC - in response to Message 91773.  
Last modified: 24 Feb 2020, 19:20:15 UTC


That would be my expectation of the BOINC Manager. When one of the files required by a task permanently fails to download, the task is aborted. There will also be times when several of the tasks you are downloading depend upon the same file, and all of them abort when that file's transfer fails.


It's a pity the BOINC Manager doesn't seem to notice for hours. I get the log message "some download is stalled" (which prevents any more tasks being downloaded from that project) up to an hour or so after I've cancelled both the download and the task.

1) Rosetta needs to stop its server stalling downloads.

2) BOINC needs to fix the client so it doesn't get upset just because one file failed, and then fail to notice that the user has told it to give up.
amgthis

Joined: 25 Mar 06
Posts: 81
Credit: 203,879,282
RAC: 0
Message 91836 - Posted: 2 Mar 2020, 22:05:46 UTC - in response to Message 91700.  

I haven't seen the small-file stall on download lately, but I did have this one hang a couple of days ago, more than once.

rb_02_20_16480_16303_ab_t000_h001_robetta.zip 3.06 KiB

I'm still not convinced it's a by-product of the super-slow DSL on my end.

In any case, I think it's cleared up.
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91838 - Posted: 2 Mar 2020, 22:18:57 UTC - in response to Message 91836.  

I haven't seen the small-file stall on download lately, but I did have this one hang a couple of days ago, more than once.

rb_02_20_16480_16303_ab_t000_h001_robetta.zip 3.06 KiB

I'm still not convinced it's a by-product of the super-slow DSL on my end.

In any case, I think it's cleared up.


I doubt it was your connection; I have fibre here and got the same problem. But as you say, it's pretty much sorted somehow. Pity nobody admitted what they did wrong!
Bryn Mawr

Joined: 26 Dec 18
Posts: 404
Credit: 12,294,748
RAC: 2,551
Message 91851 - Posted: 3 Mar 2020, 20:20:17 UTC - in response to Message 91838.  

I doubt it was your connection; I have fibre here and got the same problem. But as you say, it's pretty much sorted somehow. Pity nobody admitted what they did wrong!


I had another stalled download today, but my main problem over the past few days has been jobs erroring out after 10 hours. Each job had a single decoy that was still running 4 hours after my 6-hour limit.

So far there have been 11 such jobs across 2 machines, 3 on one machine yesterday alone, which is a quarter of that machine’s allocation for Rosetta:

https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985018
https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985251
https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985278
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91852 - Posted: 3 Mar 2020, 20:39:52 UTC - in response to Message 91851.  

I had another stalled download today, but my main problem over the past few days has been jobs erroring out after 10 hours. Each job had a single decoy that was still running 4 hours after my 6-hour limit.

So far there have been 11 such jobs across 2 machines, 3 on one machine yesterday alone, which is a quarter of that machine’s allocation for Rosetta:

https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985018
https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985251
https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985278


What is a decoy? All my machines complete a task in 7.5 to 8.5 hours, and they're not particularly fast machines; one is 12 years old. I've seen no errors in over a week. There must be a pattern here.
Bryn Mawr

Joined: 26 Dec 18
Posts: 404
Credit: 12,294,748
RAC: 2,551
Message 91855 - Posted: 3 Mar 2020, 23:46:07 UTC - in response to Message 91852.  

What is a decoy? All my machines complete a task in 7.5 to 8.5 hours, and they're not particularly fast machines; one is 12 years old. I've seen no errors in over a week. There must be a pattern here.


Within the stdout file the app reports the number of scenarios processed in the time you make available. Normally, in the standard 8-hour window, you’ll process the data from maybe 40 different starting positions (these are known as decoys, for some reason that escapes me). I set my processing window to 6 hours, and a normal work unit will process maybe 30 decoys. The work units that are erroring out have not finished even their first run through the data after 10 hours (the 6-hour preference plus the 4 hours of allowed overrun), at which point the watchdog aborts the process.
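In rough pseudocode, that scheme works out as below; a sketch under assumed numbers (a hypothetical average decoy cost), not the app's real logic:

# Sketch of the run-to-time-target scheme described above (illustrative,
# not the Rosetta app's source). A decoy always runs to completion; a new
# one starts only if one more is predicted to fit the target CPU time.
import random

TARGET_CPU_HOURS = 6.0           # the "target CPU time" preference

def simulate_task(avg_decoy_hours=0.2):   # hypothetical average decoy cost
    cpu_used, decoys = 0.0, 0
    while True:
        cpu_used += random.expovariate(1.0 / avg_decoy_hours)  # one decoy
        decoys += 1
        predicted_next = cpu_used / decoys
        if cpu_used + predicted_next > TARGET_CPU_HOURS:
            break                # no room for another decoy: close down
    return decoys, cpu_used

print(simulate_task())           # typically around 30 decoys in 6 hours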
robertmiles

Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 826
Message 91862 - Posted: 4 Mar 2020, 13:39:06 UTC
Last modified: 4 Mar 2020, 13:44:58 UTC

Another small-file download problem, which is currently blocking me from getting any more R@H tasks:

10v3nmgb_c14394_10mer_gb_000420_SAVE_ALL_OUT_896889_53_1

https://boinc.bakerlab.org/rosetta/result.php?resultid=1125377637

A wingmate timed out and therefore may have had the same problem:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1012342810

Relevant lines from the log:

3/4/2020 4:00:05 AM | Rosetta@home | Started download of 10v3nmgb_c14394_10mer_gb_000420.zip
3/4/2020 4:05:12 AM | Rosetta@home | Temporarily failed download of 10v3nmgb_c14394_10mer_gb_000420.zip: transient HTTP error
3/4/2020 4:05:12 AM | Rosetta@home | Backing off 04:50:13 on download of 10v3nmgb_c14394_10mer_gb_000420.zip
3/4/2020 4:05:13 AM | | Project communication failed: attempting access to reference site
3/4/2020 4:05:15 AM | | Internet access OK - project servers may be temporarily down.
3/4/2020 4:42:49 AM | Rosetta@home | Sending scheduler request: To report completed tasks.
3/4/2020 4:42:49 AM | Rosetta@home | Reporting 2 completed tasks
3/4/2020 4:42:49 AM | Rosetta@home | Not requesting tasks: some download is stalled
3/4/2020 4:42:51 AM | Rosetta@home | Scheduler request completed

Does the file that fails to download even exist on the server?

The expected size of the file is only 3.23 KB.

Could the server have problems serving files below a certain size?

DSL speed here is not especially high or low.

Enough other BOINC projects are selected on this computer to keep it busy.
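As an aside, the "Backing off 04:50:13" line above is the client's randomized retry delay, which grows with each failure. A rough sketch of that style of policy; the constants are illustrative guesses, not BOINC's exact values:

# Sketch of a BOINC-style randomized exponential backoff for a failing
# download. Constants are illustrative, not BOINC's actual numbers.
import random

MIN_DELAY_S = 60               # first retry after about a minute
MAX_DELAY_S = 5 * 3600         # cap, in the ballpark of the 04:50:13 above

def backoff_delay(failures):
    # Double the delay per failure, capped, with jitter to spread load.
    base = min(MIN_DELAY_S * 2 ** failures, MAX_DELAY_S)
    return random.uniform(base / 2, base)

for n in range(8):
    print(f"failure {n}: retry in ~{backoff_delay(n) / 60:.1f} min")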
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91865 - Posted: 4 Mar 2020, 18:46:43 UTC - in response to Message 91855.  

Within the stdout file the app reports the number of scenarios processed in the time you make available. Normally, in the standard 8-hour window, you’ll process the data from maybe 40 different starting positions (these are known as decoys, for some reason that escapes me). I set my processing window to 6 hours, and a normal work unit will process maybe 30 decoys. The work units that are erroring out have not finished even their first run through the data after 10 hours (the 6-hour preference plus the 4 hours of allowed overrun), at which point the watchdog aborts the process.


Are you setting this limit or is it done by the server? I just let all projects take whatever time they feel necessary.
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91866 - Posted: 4 Mar 2020, 18:48:35 UTC - in response to Message 91862.  

Another small-file download problem, which is currently blocking me from getting any more R@H tasks [...] The expected size of the file is only 3.23 KB. Could the server have problems serving files below a certain size?


It's always the little 3 kB files that get stuck for me. That suggests either a different server produces those and is misbehaving, or the files themselves are corrupt for some reason. They don't even download using a web browser.
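One way to check that by hand is to fetch the file outside BOINC. A small sketch; the URL is a guess at the project's download area, so take the real one from the event log or client_state.xml:

# Fetch a suspect input file directly, bypassing BOINC, to see whether
# the server serves it at all. The URL below is hypothetical.
import urllib.request

url = ("https://boinc.bakerlab.org/rosetta/download/"
       "rb_02_20_16480_16303_ab_t000_h001_robetta.zip")   # placeholder path

try:
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = resp.read()
    print(f"OK: HTTP {resp.status}, {len(data)} bytes received")
except Exception as exc:     # HTTP errors, timeouts, DNS failures, ...
    print(f"Download failed: {exc}")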
Bryn Mawr

Joined: 26 Dec 18
Posts: 404
Credit: 12,294,748
RAC: 2,551
Message 91867 - Posted: 4 Mar 2020, 21:35:23 UTC - in response to Message 91865.  

Are you setting this limit or is it done by the server? I just let all projects take whatever time they feel necessary.


You’ll find it in your account > project preferences > target CPU time. It defaults to 8 hours, but after I had quite a few of these errors I dropped mine to 6 hours in the hope of wasting slightly less processing time.
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91868 - Posted: 4 Mar 2020, 21:54:33 UTC - in response to Message 91867.  

You’ll find it in your account > project preferences > target CPU time. It defaults to 8 hours, but after I had quite a few of these errors I dropped mine to 6 hours in the hope of wasting slightly less processing time.


Strange; all the projects that have given me errors produced them well under the normal processing time. And some projects, like LHC, seem to have tasks that usually take 2 hours but can take 4 days, yet still complete fine.
Bryn Mawr

Joined: 26 Dec 18
Posts: 404
Credit: 12,294,748
RAC: 2,551
Message 91871 - Posted: 5 Mar 2020, 11:59:25 UTC - in response to Message 91868.  



Strange; all the projects that have given me errors produced them well under the normal processing time. And some projects, like LHC, seem to have tasks that usually take 2 hours but can take 4 days, yet still complete fine.


Rosetta works differently from most projects in that it does as much processing as it can in the time allowed, rather than taking however long a fixed amount of processing requires.

To ensure that a rogue work unit does not lock out a core, it has a watchdog that closes out the WU if it is still running 4 hours beyond the set time limit. Because a WU normally shuts down at the end of a decoy, once it predicts there is not enough time to process another before the deadline, this can really only happen when a single decoy runs for more than 4 hours. I’d guess that implies the decoy is stuck in a loop, but I can’t say that for certain.
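As a sketch, that watchdog rule amounts to the check below (numbers taken from this thread; illustrative, not the app's source):

# Sketch of the watchdog rule described above: abort a work unit still
# running 4 hours past the target CPU time. Illustrative only.
TARGET_HOURS = 6.0     # the user's "target CPU time" preference
GRACE_HOURS = 4.0      # allowed overrun before the watchdog fires

def watchdog_should_abort(cpu_hours_used):
    return cpu_hours_used > TARGET_HOURS + GRACE_HOURS

print(watchdog_should_abort(9.5))    # False: still inside the overrun
print(watchdog_should_abort(10.2))   # True: aborted, reported as an error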
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91875 - Posted: 5 Mar 2020, 19:01:19 UTC - in response to Message 91871.  

Rosetta works differently from most projects in that it does as much processing as it can in the time allowed, rather than taking however long a fixed amount of processing requires.

To ensure that a rogue work unit does not lock out a core, it has a watchdog that closes out the WU if it is still running 4 hours beyond the set time limit. Because a WU normally shuts down at the end of a decoy, once it predicts there is not enough time to process another before the deadline, this can really only happen when a single decoy runs for more than 4 hours. I’d guess that implies the decoy is stuck in a loop, but I can’t say that for certain.


My Rosetta WUs all seem to finish in a very precise timeframe of 7.5 hours (8.5 hours on the slower machines); I've seen no variation, so perhaps the limiter only takes effect occasionally. LHC has a huge variation: the theory tasks can take from 1.5 hours to 4 days! The percentage moves very slowly, as though the task will take 4 days, but it seems to complete at a random point, often at only "2% completed"; it jumps to 100% and reports success. I guess it's looking for an answer in there and sometimes finds it early?
Bryn Mawr

Joined: 26 Dec 18
Posts: 404
Credit: 12,294,748
RAC: 2,551
Message 91876 - Posted: 5 Mar 2020, 20:43:13 UTC - in response to Message 91875.  

My Rosetta WUs all seem to finish in a very precise timeframe of 7.5 hours (8.5 hours on the slower machines); I've seen no variation, so perhaps the limiter only takes effect occasionally. LHC has a huge variation: the theory tasks can take from 1.5 hours to 4 days! The percentage moves very slowly, as though the task will take 4 days, but it seems to complete at a random point, often at only "2% completed"; it jumps to 100% and reports success. I guess it's looking for an answer in there and sometimes finds it early?


Pass; I’ve never looked at LHC, so I wouldn’t know.
Mr P Hucker

Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 4,044
Message 91877 - Posted: 5 Mar 2020, 21:01:09 UTC - in response to Message 91876.  

Pass; I’ve never looked at LHC, so I wouldn’t know.


They've got ATLAS tasks, which will run one WU on all your CPU cores at once. I want to get a Ryzen Threadripper to see if they'll give me a 64-core task :-)
rlpm

Joined: 23 Mar 20
Posts: 13
Credit: 84
RAC: 0
Message 92241 - Posted: 24 Mar 2020, 21:38:52 UTC

All WUs for Rosetta v4.07 i686-pc-linux-gnu on my 1st-gen Apple TV running Linux (OSMC with the GUI etc. disabled) are failing for going over the RAM limit, e.g.:
working set size > client RAM limit: 167.87MB > 167.55MB

Is there something wrong with how the working set size is matched to the amount of available RAM?

Or can I restrict my host to the Rosetta Mini application only?
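For what it's worth, that message looks like the client comparing the task's working set against the share of physical RAM the preferences let BOINC use. A sketch with assumed numbers; the percentage is a guess chosen to reproduce the 167.55 MB limit on a 256 MB Apple TV:

# Sketch of the check behind "working set size > client RAM limit"
# (illustrative, not BOINC source). The limit is physical RAM times the
# "use at most X% of memory" preference; the 65.45% here is a guess.
TOTAL_RAM_MB = 256.0        # 1st-gen Apple TV
MEM_PREF_PCT = 65.45        # hypothetical memory-usage preference

ram_limit_mb = TOTAL_RAM_MB * MEM_PREF_PCT / 100.0   # ~167.55 MB

def exceeds_limit(working_set_mb):
    return working_set_mb > ram_limit_mb

print(exceeds_limit(167.87))   # True: matches the failing tasks above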
rlpm

Joined: 23 Mar 20
Posts: 13
Credit: 84
RAC: 0
Message 92247 - Posted: 25 Mar 2020, 2:07:34 UTC - in response to Message 92241.  

Looks like the same thing is happening for Rosetta Mini tasks, e.g. task 1132535295:
working set size > client RAM limit: 170.39MB > 167.55MB