Stuck on uploading is a new problem?

shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81422 - Posted: 13 Apr 2017, 19:22:42 UTC

Sure thought that I had seen this behavior before, but no mention of "uploading" anywhere here? At least the search reports there is no such comment?

Anyway, my client has a completed unit from the 160122cc... project that has been stuck in the "uploading" status for a couple of days now. Just now I saw another unit from the same project get completed and uploaded successfully, which shows enough of the servers are running properly (though the server status shows most of them are down again).

The problem work unit shows on the Transfers tab with the status "Upload: retry in..." Clicking "Retry Now" results in a few seconds of retrying, and then it goes back to that waiting-to-retry status. The deadline of the stuck unit is the 17th, so maybe it will get unstuck before then, or the work will just get discarded when the deadline arrives. However, as mentioned, it's already been stuck for a couple of days.

As regards the old problems with bad scheduling and wasted bandwidth, mostly I quit worrying about it. There were various complicated and tedious suggestions offered. I think most of them were well intentioned and even sincere, but some of them are just wildly guessing and mostly I just don't want to be bothered.

At this point I mostly don't care, but I will add the minor observation that it seems the BOINC client for Macs works "properly". At least it always seems to start the units based on the correct deadlines and (in contrast to the Windows and Linux clients) I've never noticed it in obvious trouble with downloaded units that cannot be completed. My usage pattern for the Mac is most similar to one of the Windows machines, so I don't see that as a cause.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 81423 - Posted: 13 Apr 2017, 19:33:38 UTC

Are you running the same version of BOINC Manager on the Windows and Mac you refer to? The projects and work units have no control over which is dispatched next. BOINC Manager controls the workflow and dispatching of CPU resources.
Rosetta Moderator: Mod.Sense
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81426 - Posted: 14 Apr 2017, 0:26:45 UTC - in response to Message 81423.  

Are you running the same version of BOINC Manager on the Windows and Mac you refer to? The projects and work units have no control over which is dispatched next. BOINC Manager controls the workflow and dispatching of CPU resources.


Sorry, I don't want to spend a lot of time beyond trying to make sure everything is running the latest version of everything. Perhaps more to the point, I'm not sure about the point of your question, since each platform has to have some platform-dependent code. Since you seem to be asking about the "by the way" part, perhaps I should clarify that what seems to be going on is that the Mac always picks work based on earliest deadlines, whereas the Windows and Linux clients sometimes pick units with much later deadlines. (On all platforms, there are short-deadline units that get jumped to the front, causing other work to be suspended, sometimes a bit awkwardly (but that's just the checkpointing problem).)

Anyway, on the original question, I can add a bit of data. Not specific to the sub-project. Just noticed that another computer has a stuck-on-uploading re12dslf... Deadline for that one is the 18th, to be compared with the deadline of the 17th for the one that's stuck on this computer. Both Windows, but I'm not putting many hours on the Linux boxen these days.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 81427 - Posted: 14 Apr 2017, 1:44:05 UTC

I have one that has been stuck for several hours. I am running Win7 64-bit and BOINC 7.7.2 (x64).
failed upload

512 rosetta@home 4/13/2017 9:35:35 PM Started upload of UN-NM_C4Yang_001512_2L8HC4-12_DHR62_0009.pdb_C4Yang_17_04_20_48_34_localDocking_0_SAVE_ALL_OUT_479518_11_0_0
513 4/13/2017 9:35:56 PM Project communication failed: attempting access to reference site
514 rosetta@home 4/13/2017 9:35:56 PM Temporarily failed upload of UN-NM_C4Yang_001512_2L8HC4-12_DHR62_0009.pdb_C4Yang_17_04_20_48_34_localDocking_0_SAVE_ALL_OUT_479518_11_0_0: transient HTTP error
515 rosetta@home 4/13/2017 9:35:56 PM Backing off 03:56:54 on upload of UN-NM_C4Yang_001512_2L8HC4-12_DHR62_0009.pdb_C4Yang_17_04_20_48_34_localDocking_0_SAVE_ALL_OUT_479518_11_0_0
516 4/13/2017 9:35:57 PM Internet access OK - project servers may be temporarily down.

Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,640,402
RAC: 36
Message 81428 - Posted: 14 Apr 2017, 5:02:44 UTC - in response to Message 81427.  

I also have a task that is stuck uploading. I will try to post more details about it tomorrow if it's still stuck when I wake up. I tried putting back the hosts information in case it's a DNS problem, and I also tried flushing the DNS resolver cache and removing any hosts entries. It doesn't seem to be DNS-related.
**38 cores crunching for R@H on behalf of cancercomputer.org - a non-profit supporting High Performance Computing in Cancer Research
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 81430 - Posted: 14 Apr 2017, 10:55:35 UTC - in response to Message 81427.  

The bad one is still stuck after retrying 17 times. But another Rosetta WU on that machine has since uploaded successfully in the last hour, so it appears that the server is OK.

I tried rebooting the PC to try to fix the stuck one, but the BOINC Manager would not reconnect to the client (red dot in icon), and the desktop froze. I had to force a reboot, and then run Windows repair to regain the desktop. I then tried to abort the stuck one, but it would not abort. I will have to try using Task Manager to get rid of it somehow. So clearly it is the work unit itself that is bad.
Ace Casino
Joined: 16 Jul 07
Posts: 17
Credit: 11,373,226
RAC: 12,738
Message 81431 - Posted: 14 Apr 2017, 13:56:25 UTC

I have 2 stuck WU's.

I'm running Windows 10 and BOINC 7.6.33, if that's of any help.

Tried hitting retry a few times...no luck.

Oh well...stuff happens.

Happy Crunch'n
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81438 - Posted: 15 Apr 2017, 9:07:11 UTC - in response to Message 81431.  

Just got a second one on the first machine. This one is an rb... unit, so now I have two units stuck uploading on one machine and one on another. Three different projects, but I've also seen at least one successful upload of a completed unit from one of the same three projects.

Not sure if it's a useful diagnostic, but when told to try again, it seems they fail in two ways. Sometimes nothing goes up, and other times a small packet goes up. Also the return to waiting mode is sometimes faster, but usually it takes a while.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
Luigi R.
Joined: 7 Feb 14
Posts: 39
Credit: 2,045,527
RAC: 0
Message 81441 - Posted: 15 Apr 2017, 9:59:46 UTC

5 tasks stuck on uploading here.



client_state.xml

<file>
    <name>rb_03_23_72525_116778__t000__ab_robetta_IGNORE_THE_REST_474917_815_0_0</name>
    <nbytes>530178.000000</nbytes>
    <max_nbytes>25000000.000000</max_nbytes>
    <md5_cksum>221c7cf702ff15910a96060fed236335</md5_cksum>
    <status>1</status>
    <upload_url>http://srv1.bakerlab.org/rosetta_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>11</num_retries>
        <first_request_time>1492170993.617707</first_request_time>
        <next_request_time>1492253144.919740</next_request_time>
        <time_so_far>2552.406322</time_so_far>
        <last_bytes_xferred>32768.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>

<file>
    <name>rb_03_23_72525_116778__t000__4_C1_SAVE_ALL_OUT_IGNORE_THE_REST_474917_259_0_0</name>
    <nbytes>887888.000000</nbytes>
    <max_nbytes>25000000.000000</max_nbytes>
    <md5_cksum>33e5d718ef03b6c814fecaa4d08c9b81</md5_cksum>
    <status>1</status>
    <upload_url>http://srv4.bakerlab.org/rosetta_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>11</num_retries>
        <first_request_time>1492169234.382923</first_request_time>
        <next_request_time>1492257568.171602</next_request_time>
        <time_so_far>2357.694327</time_so_far>
        <last_bytes_xferred>207.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>

<file>
    <name>UN-NM_C4Yang_000006_2L8HC4-12_DHR32_0019.pdb_C4Yang_17_04_20_47_25_localDocking_9_SAVE_ALL_OUT_479492_23_0_0</name>
    <nbytes>22502.000000</nbytes>
    <max_nbytes>50000000.000000</max_nbytes>
    <md5_cksum>db8309e5c372f565885d1075ac2b9683</md5_cksum>
    <status>1</status>
    <upload_url>http://srv4.bakerlab.org/rosetta_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>8</num_retries>
        <first_request_time>1492198192.963454</first_request_time>
        <next_request_time>1492258150.352014</next_request_time>
        <time_so_far>2485.533146</time_so_far>
        <last_bytes_xferred>22502.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>

<file>
    <name>3566f810a5e0096440dc8f17796115d2_eehee_pd1-docking_CancerImmunotherapy_17_04_13_32_36_globalDocking_4_SAVE_ALL_OUT_478149_7_0_0</name>
    <nbytes>99159.000000</nbytes>
    <max_nbytes>50000000.000000</max_nbytes>
    <md5_cksum>a72161808b6852c6bb6f86c8fc85619f</md5_cksum>
    <status>1</status>
    <upload_url>http://srv3.bakerlab.org/rosetta_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>5</num_retries>
        <first_request_time>1492214322.948390</first_request_time>
        <next_request_time>1492248107.442623</next_request_time>
        <time_so_far>2275.530723</time_so_far>
        <last_bytes_xferred>32768.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>

<file>
    <name>14dslfv5_14re4np_gb_0037_0001_30_0002_SAVE_ALL_OUT_480050_322_0_0</name>
    <nbytes>337741.000000</nbytes>
    <max_nbytes>50000000.000000</max_nbytes>
    <md5_cksum>8b05674e7048a0d3632f82d93a4d9571</md5_cksum>
    <status>1</status>
    <upload_url>http://srv1.bakerlab.org/rosetta_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>5</num_retries>
        <first_request_time>1492243267.332962</first_request_time>
        <next_request_time>1492251204.143131</next_request_time>
        <time_so_far>1553.738969</time_so_far>
        <last_bytes_xferred>32768.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
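
For anyone who wants to compare entries like the ones above across machines, here is a minimal Python sketch. It assumes the `<file>` blocks have been copied out of client_state.xml into a string; the function name `summarize_uploads` and the output keys are made up for illustration and are not part of BOINC. Because the pasted fragment has several top-level elements, the sketch wraps it in a dummy root element before parsing.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# One <file> entry copied from the listing above (max_nbytes and md5 omitted).
SAMPLE = """
<file>
    <name>rb_03_23_72525_116778__t000__ab_robetta_IGNORE_THE_REST_474917_815_0_0</name>
    <nbytes>530178.000000</nbytes>
    <status>1</status>
    <upload_url>http://srv1.bakerlab.org/rosetta_cgi/file_upload_handler</upload_url>
    <persistent_file_xfer>
        <num_retries>11</num_retries>
        <first_request_time>1492170993.617707</first_request_time>
        <next_request_time>1492253144.919740</next_request_time>
        <time_so_far>2552.406322</time_so_far>
        <last_bytes_xferred>32768.000000</last_bytes_xferred>
        <is_upload>1</is_upload>
    </persistent_file_xfer>
</file>
"""


def summarize_uploads(fragment):
    """Return one dict per pending upload found in a client_state.xml fragment."""
    # Wrap the fragment so that multiple sibling <file> elements parse as one tree.
    root = ET.fromstring("<root>" + fragment + "</root>")
    rows = []
    for f in root.iter("file"):
        xfer = f.find("persistent_file_xfer")
        # Skip files with no transfer in progress, and downloads.
        if xfer is None or xfer.findtext("is_upload") != "1":
            continue
        next_try = float(xfer.findtext("next_request_time"))
        rows.append({
            "name": f.findtext("name"),
            "nbytes": int(float(f.findtext("nbytes"))),
            "retries": int(xfer.findtext("num_retries")),
            "last_bytes": int(float(xfer.findtext("last_bytes_xferred"))),
            # The timestamps look like Unix epoch seconds; shown here in UTC.
            "next_try_utc": datetime.fromtimestamp(next_try, tz=timezone.utc),
        })
    return rows


if __name__ == "__main__":
    for row in summarize_uploads(SAMPLE):
        print(row["name"], row["retries"], row["last_bytes"], row["next_try_utc"])
```

Reading the timestamps as epoch seconds is an assumption, but it fits the thread: the first entry's next_request_time falls on 15 Apr 2017, the day of this post.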
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81443 - Posted: 15 Apr 2017, 12:01:15 UTC

Third machine now. That's my last Windows 10 box, so all of them have at least one. Still seeing some units go through without any problem. The Mac's okay so far, and I'll run a Linux box next week to see how it's going there.
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 81449 - Posted: 15 Apr 2017, 17:44:33 UTC

This has been a very elusive issue. Our sys admins have been working pretty hard at trying to figure out what is going on. It may be a network issue on the UW side but we are not sure. Still working on trying to figure this out. Thanks for all the feedback/updates.
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81451 - Posted: 15 Apr 2017, 19:15:44 UTC - in response to Message 81449.  

This has been a very elusive issue. Our sys admins have been working pretty hard at trying to figure out what is going on. It may be a network issue on the UW side but we are not sure. Still working on trying to figure this out. Thanks for all the feedback/updates.


Not sure if this is a helpful clue, but I notice that the "Elapsed" column on the "Transfers" tab does not seem to make any sense. At least not if it is supposed to be related to elapsed time for the current transfer, which is how I've been interpreting it. They never seem to go to zero now? Clicking on "Retry Now" causes them to start ticking while the client is trying and the transfer is active, but then it stops counting while it's in the "Upload: retry in..." status. One of them is now up to 13:14 and the other is at 4:31, but they don't ever get back to zero. So maybe it's some kind of initialization failure in the transfer of those stuck units, and then it never resets properly, so it isn't really retrying?
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
Sid Celery
Joined: 11 Feb 08
Posts: 1965
Credit: 38,174,417
RAC: 10,123
Message 81452 - Posted: 16 Apr 2017, 1:27:04 UTC

I was going to say I wasn't seeing this on 3 of my devices (Android, AMD desktop, Intel desktop) but just got home to find one each on my main AMD desktop and Intel laptop. Almost all other tasks go through, but when one sticks it keeps on failing.

Could some flag be getting set on these individual tasks? (Guessing)
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81454 - Posted: 16 Apr 2017, 3:37:48 UTC - in response to Message 81452.  

Just checking one of my other machines, and I notice that the stuck packets seem to be hanging at the same point. Can you see it on your machine? Either 64K if the results are small, or 0.06 for larger results, but I think that's just rounding from 64K.

If that's true, then it seems about 40 packets of the results are uploaded before something goes wrong.
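
A quick check of that arithmetic, as a rough Python sketch. Two things here are assumptions that nothing in this thread confirms: that the Transfers tab shows bytes divided by 1024^2 with two decimals, and that each packet carries a typical 1460-byte TCP payload (1500-byte Ethernet MTU minus headers). The helper names are made up for illustration.

```python
import math

STUCK_BYTES = 64 * 1024   # the 64K point where transfers appear to hang
TCP_PAYLOAD = 1460        # assumption: payload bytes per packet on a 1500-byte MTU


def shown_mb(nbytes):
    """Format a byte count the way a two-decimal MB column would (assumed)."""
    return f"{nbytes / 1024**2:.2f}"


def segments_needed(nbytes, payload=TCP_PAYLOAD):
    """Number of packets needed to carry nbytes at the given payload size."""
    return math.ceil(nbytes / payload)


if __name__ == "__main__":
    print(shown_mb(STUCK_BYTES))         # MB figure for 64K
    print(segments_needed(STUCK_BYTES))  # packet count for 64K
```

Under those assumptions 64K is 0.0625 MB, which would display as 0.06, and it takes 45 packets, which is in the same range as the "about 40" estimate.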

One more thought. Anyone running other projects? Is this only a rosetta@home problem or is it some new problem at the BOINC client level? (Seems unlikely since the client software hasn't been upgraded recently.)
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
Luigi R.
Joined: 7 Feb 14
Posts: 39
Credit: 2,045,527
RAC: 0
Message 81455 - Posted: 16 Apr 2017, 5:11:47 UTC

I'm running LHC@Home, WCG and NumberFields too. No problem for these projects.
John C MacAlister
Joined: 6 Dec 10
Posts: 16
Credit: 927,554
RAC: 0
Message 81456 - Posted: 16 Apr 2017, 12:06:24 UTC

I am running WCG and FAH with no problems. One Rosetta task stuck at 100% upload progress quoting 'Upload retry in 5:10:37'
Jim1348
Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 81458 - Posted: 16 Apr 2017, 17:02:00 UTC

The problem is not just limited to Windows. I now have one stuck on my Ubuntu machine.
Stuck Ubuntu upload

2159 rosetta@home 4/16/2017 12:57:28 PM Temporarily failed upload of rb_03_29_73500_116896__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_477229_82_0_0: transient HTTP error
2160 rosetta@home 4/16/2017 12:57:28 PM Backing off 00:02:38 on upload of rb_03_29_73500_116896__t000__1_C1_SAVE_ALL_OUT_IGNORE_THE_REST_477229_82_0_0
2161 4/16/2017 12:57:31 PM Internet access OK - project servers may be temporarily down.


shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81459 - Posted: 16 Apr 2017, 19:40:32 UTC - in response to Message 81458.  
Last modified: 16 Apr 2017, 19:42:01 UTC

Hmm... Interesting, but at least no reports for the Mac client. Pretty sure that code is significantly different from the other BOINC clients. My own Mac continues to run without stuck-unit problems.

Minor data point: The 64K thing has changed. Now most of the stuck units seem to upload slightly different amounts of data before they freeze. Here are the four results I currently have stuck on this machine:

0.00/1.62 MB
0.21/511.73 KB
0.22/219.68 KB
0.20/407.93 KB

(The last two are rb... and the other two on this machine are different ones. The oldest one will hit its deadline today.)

Just woke up another machine with a stuck unit. It was stuck at 64K, but after telling it to retry, the new stuck condition is 0.19/496.52 KB for an re12dslf... project.

Some kind of input buffer size problem? Maybe the rosetta@home people are trying to increase the buffer sizes for incoming data? The problem is somehow related to certain work units requesting smaller buffers than they actually need, and then getting stuck because they can't send the rest of their data?
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)
Iceshard_
Joined: 3 Dec 16
Posts: 1
Credit: 651,985
RAC: 0
Message 81460 - Posted: 16 Apr 2017, 19:40:45 UTC

Got a stuck upload

4/16/2017 12:34:06 PM | | Project communication failed: attempting access to reference site
4/16/2017 12:34:06 PM | rosetta@home | Temporarily failed upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0: transient HTTP error
4/16/2017 12:34:06 PM | rosetta@home | Backing off 00:02:15 on upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0
4/16/2017 12:34:08 PM | | Internet access OK - project servers may be temporarily down.
4/16/2017 12:36:21 PM | rosetta@home | Started upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0
4/16/2017 12:36:22 PM | rosetta@home | [error] Error reported by file upload server: [8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0] locked by file_upload_handler PID=2741
4/16/2017 12:36:22 PM | rosetta@home | Temporarily failed upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0: transient upload error
4/16/2017 12:36:22 PM | rosetta@home | Backing off 00:06:39 on upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0
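
To compare logs like the one above across machines, here is a minimal Python sketch that tallies the failure reasons and the latest backoff per file. It only assumes the two message wordings visible in this thread ("Temporarily failed upload of ..." and "Backing off HH:MM:SS on upload of ..."); the function name `summarize_log` is made up for illustration and is not part of BOINC.

```python
import re
from collections import Counter

# Patterns based only on the message wording quoted in this thread.
FAILED = re.compile(r"Temporarily failed upload of (\S+): (.+)$")
BACKOFF = re.compile(r"Backing off (\d+):(\d\d):(\d\d) on upload of (\S+)")

# Three lines copied from the log above.
SAMPLE_LOG = """\
4/16/2017 12:34:06 PM | rosetta@home | Temporarily failed upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0: transient HTTP error
4/16/2017 12:36:22 PM | rosetta@home | Temporarily failed upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0: transient upload error
4/16/2017 12:36:22 PM | rosetta@home | Backing off 00:06:39 on upload of 8HnT3821_fold_and_dock_SAVE_ALL_OUT_480303_164_0_0
"""


def summarize_log(lines):
    """Count (file, reason) failures and record the last backoff in seconds."""
    reasons = Counter()
    backoff_seconds = {}
    for line in lines:
        line = line.strip()
        m = FAILED.search(line)
        if m:
            reasons[(m.group(1), m.group(2))] += 1
            continue
        m = BACKOFF.search(line)
        if m:
            hours, minutes, seconds, name = m.groups()
            backoff_seconds[name] = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
    return reasons, backoff_seconds


if __name__ == "__main__":
    reasons, backoff = summarize_log(SAMPLE_LOG.splitlines())
    for (name, reason), count in reasons.items():
        print(count, reason, name)
    print(backoff)
```

Because the patterns use a search rather than matching from the start of the line, the same sketch should also cope with the numbered log format posted earlier in the thread.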
shanen
Joined: 16 Apr 14
Posts: 195
Credit: 12,662,308
RAC: 0
Message 81461 - Posted: 16 Apr 2017, 19:53:36 UTC - in response to Message 81459.  

Retried a few minutes later, and the new stuck status is:

0.06/1.62 MB
64.00/511.73 KB
64.00/219.68 KB
64.00/407.93 KB

It would seem to mean something, but what?

Of course I'm going to go meta on you again... Confidence in the quality of the code has a relationship to confidence in the quality of the results. *sigh*
#1 Freedom = (Meaningful - Constrained) Choice{5} != (Beer^3 | Speech)


©2024 University of Washington
https://www.bakerlab.org