Problems and Technical Issues with Rosetta@home

amgthis
Joined: 25 Mar 06
Posts: 69
Credit: 187,265,755
RAC: 117,527
Message 91700 - Posted: 14 Feb 2020, 16:28:25 UTC

Failed downloads: I, too, have seen many small downloads (~3 KB or so) hang or 'stall' at somewhere around 80-90% completion. Then they just sit there and seem to rob my limited bandwidth, impeding other uploads and downloads. I delete the stalled download, then refresh, and it gets replaced by a new one. Then I watch to make sure it downloads successfully. Sometimes stopping and restarting 'network access or activity' will let it resume, but usually it stalls out again. I think I've been noticing this for the last couple of weeks. The file names vary, but they are always small files, ~3 KB or so in size.

When you have 20 boxes sharing a 7 Mbps DSL line, bandwidth can be sketchy under the best conditions. 8^(
/Mike

Trotador
Joined: 30 May 09
Posts: 107
Credit: 114,185,810
RAC: 266,149
Message 91701 - Posted: 14 Feb 2020, 17:58:16 UTC - in response to Message 91700.  

Yes, same here: stalled downloads can only be fixed by manual intervention (abort or abort), which makes it a big pain to keep crunching the project. They require continuous attention, which is not sustainable.

Peter Hucker
Joined: 12 Aug 06
Posts: 51
Credit: 1,113,098
RAC: 4,102
Message 91708 - Posted: 16 Feb 2020, 11:03:28 UTC - in response to Message 91701.  
Last modified: 16 Feb 2020, 11:05:56 UTC

Just had one I can't fix. Usually aborting the download, then aborting the task, then reporting it, allows me to continue. But now Boinc is still saying:

Rosetta@home 16/02/2020 11:00:16 AM Not requesting tasks: some download is stalled

I'll try a fresh post on this here, and ask in the main Boinc forum why Boinc thinks something is still stalled which isn't.

P.S. For some reason I'm not getting emailed when someone posts in this thread. Another problem! Works fine in forums of all other projects. Ah, a hidden preference defaulting to a daft way - why would I subscribe to a thread if I didn't want to be told?

Bryn Mawr
Joined: 26 Dec 18
Posts: 28
Credit: 3,151,149
RAC: 8,224
Message 91712 - Posted: 16 Feb 2020, 14:11:16 UTC - in response to Message 91708.  

When this has happened to me it has self corrected after about an hour - give it time and then go for another update and you should get some new tasks.

Peter Hucker
Joined: 12 Aug 06
Posts: 51
Credit: 1,113,098
RAC: 4,102
Message 91713 - Posted: 16 Feb 2020, 15:12:06 UTC - in response to Message 91712.  

Do you mean completely self corrected, or self corrected after you aborted the task? If I don't abort the task, I've seen it still stuck after about 18 hours. It just keeps on retrying and failing to download about every 3 hours.

Bryn Mawr
Joined: 26 Dec 18
Posts: 28
Credit: 3,151,149
RAC: 8,224
Message 91718 - Posted: 16 Feb 2020, 19:32:03 UTC - in response to Message 91713.  

I abort the transfer (not the task) and normally that is enough to allow downloads to restart when I do an update project.

On the odd occasion, however, it has given the message you reported after the update. In that case I leave it an hour and redo the update; on all occasions so far, the second update has succeeded in bringing down new WUs.
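For anyone who would rather script this workaround than click through the Manager, here is a minimal sketch. It only builds (and by default just prints) `boinccmd` command lines for aborting one transfer and then updating the project. The project URL is an assumption taken from the result links in this thread and has to match the URL your client is actually attached with; the helper names are made up for illustration, and the flags should be checked against `boinccmd --help` on your own version.

```python
import subprocess

# Assumption: must match the master URL your client is attached to.
PROJECT_URL = "https://boinc.bakerlab.org/rosetta/"


def abort_transfer_cmd(filename, project_url=PROJECT_URL):
    """Build the argv that asks the client to abort one file transfer."""
    return ["boinccmd", "--file_transfer", project_url, filename, "abort"]


def update_project_cmd(project_url=PROJECT_URL):
    """Build the argv that asks the client to contact the project scheduler."""
    return ["boinccmd", "--project", project_url, "update"]


def run(cmd, dry_run=True):
    """Print the command; only execute it when dry_run is False."""
    print(" ".join(cmd))
    if dry_run:
        return None
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file name; use the exact name shown in the Transfers tab.
    run(abort_transfer_cmd("example_stalled_file.zip"))
    run(update_project_cmd())
```

If the "some download is stalled" message still shows up after the update, the advice above still applies: wait about an hour and run the update again.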

Peter Hucker
Joined: 12 Aug 06
Posts: 51
Credit: 1,113,098
RAC: 4,102
Message 91719 - Posted: 16 Feb 2020, 20:00:03 UTC - in response to Message 91718.  
Last modified: 16 Feb 2020, 20:00:32 UTC

Ok thanks, in the future I'll just abort then leave it alone. Although the next time it happens I'm going to try to gather technical info on the problem - see this thread over at Boinc: https://boinc.berkeley.edu/dev/forum_thread.php?id=13435 I've been requested to:

"1) if you see it happening, set <http_debug> in Event Log options, and retry the transfer - find out what's happening behind that 'transient HTTP error'.
2) make a careful and exact note of the file name in question. Cancel the download, and make sure it disappears from the transfers tab. Restart the client, and if the 'stalled download' message reappears, have a very careful 'read only' (no edits) peek inside client_state.xml - same folder. Find the reference (if any) to the file you cancelled, and post the whole of the

<file>
...
</file>

section it's enclosed in."
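A minimal, read-only sketch of step 2, assuming the file entries in client_state.xml are `<file>` elements with a `<name>` child, as the quoted instructions imply. The sample XML and its other child tags are made up for illustration, so check them against your own file. The function only parses text and never writes anything back.

```python
import xml.etree.ElementTree as ET


def find_file_sections(xml_text, filename):
    """Return every <file>...</file> section whose <name> equals filename."""
    root = ET.fromstring(xml_text)
    matches = []
    for elem in root.iter("file"):
        if (elem.findtext("name") or "").strip() == filename:
            matches.append(ET.tostring(elem, encoding="unicode").strip())
    return matches


# Hypothetical, heavily shortened stand-in for a real client_state.xml.
SAMPLE = """<client_state>
  <file>
    <name>example_stalled.zip</name>
    <nbytes>3133.000000</nbytes>
    <status>0</status>
  </file>
  <file>
    <name>other.zip</name>
  </file>
</client_state>"""

if __name__ == "__main__":
    for section in find_file_sections(SAMPLE, "example_stalled.zip"):
        print(section)
```

For the real thing, read the file into a string first (for example with `pathlib.Path(...).read_text()`) and pass the exact file name you noted before cancelling the download.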

Sid Celery
Joined: 11 Feb 08
Posts: 1009
Credit: 23,438,962
RAC: 10,828
Message 91767 - Posted: 24 Feb 2020, 14:03:26 UTC - in response to Message 91718.  

For some reason, aborting the transfer first before aborting the task didn't work. Aborting the task first, then the download was always much more successful.

Bryn Mawr
Joined: 26 Dec 18
Posts: 28
Credit: 3,151,149
RAC: 8,224
Message 91769 - Posted: 24 Feb 2020, 16:18:16 UTC - in response to Message 91767.  


Odd, when I abort the transfer the task disappears about 10 seconds later, I don’t need to abort it.

I’m on 7.16

Mod.Sense
Volunteer moderator
Project administrator
Joined: 22 Aug 06
Posts: 3620
Credit: 0
RAC: 0
Message 91773 - Posted: 24 Feb 2020, 19:12:47 UTC - in response to Message 91769.  


That would be my expectation of the BOINC Manager. When one of the files required by a task fails to download, then the task is aborted. And there will be times when several of the tasks you are downloading depend upon the same file, and all of them abort when a file transfer fails.
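As a rough illustration of that behaviour (not the BOINC client's actual code), the sketch below models each task as a set of required input files and lists the tasks that would be dropped when one file fails. All task and file names are hypothetical.

```python
def tasks_aborted_by(failed_file, task_inputs):
    """Return the tasks, sorted by name, that require the failed file."""
    return sorted(
        task for task, files in task_inputs.items() if failed_file in files
    )


# Hypothetical tasks: two of them share one small input file.
TASK_INPUTS = {
    "task_a": {"shared_input.zip", "task_a_flags"},
    "task_b": {"shared_input.zip"},
    "task_c": {"other_input.zip"},
}

if __name__ == "__main__":
    print(tasks_aborted_by("shared_input.zip", TASK_INPUTS))
```

One failed transfer of a shared file therefore takes out every task that lists it, which matches several tasks vanishing at once.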
Rosetta Moderator: Mod.Sense

Peter Hucker
Joined: 12 Aug 06
Posts: 51
Credit: 1,113,098
RAC: 4,102
Message 91774 - Posted: 24 Feb 2020, 19:19:29 UTC - in response to Message 91773.  
Last modified: 24 Feb 2020, 19:20:15 UTC


It's a pity the Boinc manager doesn't seem to notice for hours. I get the message in the log "some download is stalled" (which prevents any more tasks getting downloaded from that project) up to an hour or so after I've cancelled both the download and the task.

1) Rosetta needs to stop the server stalling downloads.

2) Boinc needs to fix their program so it doesn't get upset just because 1 file failed, then fail to notice the user told it to give up.

amgthis
Joined: 25 Mar 06
Posts: 69
Credit: 187,265,755
RAC: 117,527
Message 91836 - Posted: 2 Mar 2020, 22:05:46 UTC - in response to Message 91700.  

I haven't seen the small file stall on download lately, but did have this one hang a couple of days ago, more than once.

rb_02_20_16480_16303_ab_t000_h001_robetta.zip 3.06 KiB

I'm still not convinced it's a by-product of super slow DSL on my end.

In any case I think it's cleared up.

Peter Hucker
Joined: 12 Aug 06
Posts: 51
Credit: 1,113,098
RAC: 4,102
Message 91838 - Posted: 2 Mar 2020, 22:18:57 UTC - in response to Message 91836.  

I doubt it was your connection. I have fibre here and got the same problem. But as you say it's pretty much sorted somehow, pity nobody admitted what they did wrong!

Bryn Mawr
Joined: 26 Dec 18
Posts: 28
Credit: 3,151,149
RAC: 8,224
Message 91851 - Posted: 3 Mar 2020, 20:20:17 UTC - in response to Message 91838.  

I had another stalled download today, but my main problem over the past few days has been jobs erroring out after 10 hours. Each job has a single decoy that was still running 4 hours after my 6 hour limit.

So far there have been 11 such jobs over 2 machines, 3 on one machine yesterday alone which is a quarter of that machine’s allocation for Rosetta :-

https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985018
https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985251
https://boinc.bakerlab.org/rosetta/result.php?resultid=1124985278

Peter Hucker
Joined: 12 Aug 06
Posts: 51
Credit: 1,113,098
RAC: 4,102
Message 91852 - Posted: 3 Mar 2020, 20:39:52 UTC - in response to Message 91851.  

What is a decoy? All my machines complete a task in 7.5 to 8.5 hours. And they're not particularly fast machines, one is 12 years old. I've seen no errors in over a week. There must be a pattern here.

Bryn Mawr
Joined: 26 Dec 18
Posts: 28
Credit: 3,151,149
RAC: 8,224
Message 91855 - Posted: 3 Mar 2020, 23:46:07 UTC - in response to Message 91852.  

Within the stdout file, the app reports the number of scenarios that have been processed in the time you make available. Normally, in the standard 8 hour window, you’ll process the data with maybe 40 different starting positions (these are known as decoys, for some reason that escapes me). I set my processing window to 6 hours, and a normal work unit will process maybe 30 decoys. The work units that are erroring out have not finished the first run through the data after 10 hours (6 hour preference plus 4 hours allowed overrun), at which point the watchdog aborts the process.
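To put rough numbers on that, here is a small arithmetic sketch using only the figures quoted in this post: a 4 hour allowed overrun, and about 40 decoys in 8 hours or 30 in 6. The function names are made up for illustration, and the decoy counts are ballpark figures rather than project constants.

```python
OVERRUN_HOURS = 4  # allowed overrun, as described in the post


def watchdog_limit_hours(target_hours):
    """Run time after which the watchdog would abort the task."""
    return target_hours + OVERRUN_HOURS


def minutes_per_decoy(target_hours, decoys):
    """Average minutes per decoy for a normal work unit."""
    return target_hours * 60 / decoys


if __name__ == "__main__":
    limit = watchdog_limit_hours(6)       # 6 hour preference
    typical = minutes_per_decoy(6, 30)    # about 30 decoys in 6 hours
    print(f"watchdog limit: {limit} h, typical decoy: {typical:.0f} min")
    print(f"stuck decoy is at least {limit * 60 / typical:.0f}x the typical time")
```

On these figures, a normal decoy takes about 12 minutes, so a first decoy that is still running at the 10 hour limit has taken at least 50 times as long as usual.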

robertmiles
Joined: 16 Jun 08
Posts: 711
Credit: 9,962,001
RAC: 3,980
Message 91862 - Posted: 4 Mar 2020, 13:39:06 UTC
Last modified: 4 Mar 2020, 13:44:58 UTC

Another small file download problem, which is currently blocking me from getting any more R@H tasks:

10v3nmgb_c14394_10mer_gb_000420_SAVE_ALL_OUT_896889_53_1

http://boinc.bakerlab.org/rosetta/result.php?resultid=1125377637

A wingmate timed out and therefore may have had the same problem:

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=1012342810

Relevant lines from the log:

3/4/2020 4:00:05 AM | Rosetta@home | Started download of 10v3nmgb_c14394_10mer_gb_000420.zip
3/4/2020 4:05:12 AM | Rosetta@home | Temporarily failed download of 10v3nmgb_c14394_10mer_gb_000420.zip: transient HTTP error
3/4/2020 4:05:12 AM | Rosetta@home | Backing off 04:50:13 on download of 10v3nmgb_c14394_10mer_gb_000420.zip
3/4/2020 4:05:13 AM | | Project communication failed: attempting access to reference site
3/4/2020 4:05:15 AM | | Internet access OK - project servers may be temporarily down.
3/4/2020 4:42:49 AM | Rosetta@home | Sending scheduler request: To report completed tasks.
3/4/2020 4:42:49 AM | Rosetta@home | Reporting 2 completed tasks
3/4/2020 4:42:49 AM | Rosetta@home | Not requesting tasks: some download is stalled
3/4/2020 4:42:51 AM | Rosetta@home | Scheduler request completed

Does the file that fails to download even exist on the server?

The expected size of the file is only 3.23 KB.

Could the server have problems downloading files of a certain small size?

DSL speed here is not especially high or low.

Enough other BOINC projects are selected on this computer to keep it busy.
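For anyone wanting to scan a longer event log for the same pattern, here is a minimal sketch that assumes the `timestamp | project | message` layout of the log lines above. It collects the back-off per file and flags whether work fetch was blocked by a stalled download. The regular expressions are fitted to this excerpt only and may need adjusting for other client versions or locales.

```python
import re
from datetime import timedelta

# timestamp | project (may be empty) | message
LINE_RE = re.compile(
    r"^(?P<ts>[^|]+?)\s*\|\s*(?P<project>[^|]*?)\s*\|\s*(?P<msg>.*)$"
)
BACKOFF_RE = re.compile(r"Backing off (\d+):(\d+):(\d+) on download of (\S+)")

LOG = """\
3/4/2020 4:00:05 AM | Rosetta@home | Started download of 10v3nmgb_c14394_10mer_gb_000420.zip
3/4/2020 4:05:12 AM | Rosetta@home | Temporarily failed download of 10v3nmgb_c14394_10mer_gb_000420.zip: transient HTTP error
3/4/2020 4:05:12 AM | Rosetta@home | Backing off 04:50:13 on download of 10v3nmgb_c14394_10mer_gb_000420.zip
3/4/2020 4:05:13 AM | | Project communication failed: attempting access to reference site
3/4/2020 4:42:49 AM | Rosetta@home | Not requesting tasks: some download is stalled
"""


def parse_line(line):
    """Split one event-log line into ts, project and msg (None if no match)."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def summarize(lines):
    """Return ({file: back-off timedelta}, whether work fetch was blocked)."""
    backoffs, blocked = {}, False
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        m = BACKOFF_RE.search(rec["msg"])
        if m:
            h, mi, s, name = m.groups()
            backoffs[name] = timedelta(hours=int(h), minutes=int(mi), seconds=int(s))
        if "some download is stalled" in rec["msg"]:
            blocked = True
    return backoffs, blocked


if __name__ == "__main__":
    print(summarize(LOG.splitlines()))
```

Run against the excerpt, this reports a back-off of nearly five hours on one 3 KB file while the client declines to request new tasks.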

Peter Hucker
Joined: 12 Aug 06
Posts: 51
Credit: 1,113,098
RAC: 4,102
Message 91865 - Posted: 4 Mar 2020, 18:46:43 UTC - in response to Message 91855.  

Are you setting this limit or is it done by the server? I just let all projects take whatever time they feel necessary.

Peter Hucker
Joined: 12 Aug 06
Posts: 51
Credit: 1,113,098
RAC: 4,102
Message 91866 - Posted: 4 Mar 2020, 18:48:35 UTC - in response to Message 91862.  

It's always the little 3 KB files that get stuck for me. That suggests either that a different server serving those files is misbehaving, or that the files are corrupt for some reason. They don't even download using a web browser.

Bryn Mawr
Joined: 26 Dec 18
Posts: 28
Credit: 3,151,149
RAC: 8,224
Message 91867 - Posted: 4 Mar 2020, 21:35:23 UTC - in response to Message 91865.  

You’ll find it under your account > project preferences > target cpu time. It defaults to 8 hours, but after I had quite a few of these errors I dropped mine to 6 hours in the hope I’d waste slightly less processing time.



©2020 University of Washington
http://www.bakerlab.org