Work unit failures.

Message boards : Number crunching : Work unit failures.

adrianxw
Joined: 18 Sep 05
Posts: 614
Credit: 10,475,027
RAC: 5,537
Message 101518 - Posted: 26 Apr 2021, 12:47:25 UTC

Since yesterday, I've had four work units run for the regular 12 hours I have set for target run time. Each has obviously run its model several times with different random start points. Each, at completion, has errored out with a status of 0x00000000. That's 48 hours of work dumped. I've set "No new tasks" on both systems.

Task | Work unit | Computer | Sent | Reported | Status | Run time (sec) | CPU time (sec) | Credit | Application
1372319356 | 1226215916 | 3117659 | 25 Apr 2021, 6:25:50 UTC | 26 Apr 2021, 12:33:06 UTC | Error while computing | 43,766.03 | 42,977.44 | --- | Rosetta v4.20 windows_x86_64
1372042794 | 1226012455 | 3161065 | 24 Apr 2021, 14:39:12 UTC | 25 Apr 2021, 20:18:07 UTC | Error while computing | 43,119.99 | 43,083.17 | 388.00 | Rosetta v4.20 windows_x86_64
1371978435 | 1225956023 | 3161065 | 24 Apr 2021, 10:46:43 UTC | 25 Apr 2021, 14:10:27 UTC | Error while computing | 43,156.42 | 43,092.81 | 388.00 | Rosetta v4.20 windows_x86_64
1371983541 | 1225891749 | 3161065 | 24 Apr 2021, 9:25:48 UTC | 25 Apr 2021, 13:39:11 UTC | Error while computing | 43,110.57 | 43,042.69 | 387.00 | Rosetta v4.20 windows_x86_64
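
As a rough check on the "48 hours" figure, the four task run times listed above can be summed. This is only a sketch: it assumes the first of the two numeric columns in each row is the wall-clock run time in seconds, as on a standard BOINC results page, and the variable names are purely illustrative.

```python
# Run times in seconds, copied from the four task rows above
# (assumed to be the "run time" column of the results listing).
run_times_sec = [43766.03, 43119.99, 43156.42, 43110.57]

# Convert each run time, and the total, to hours.
per_task_hours = [t / 3600 for t in run_times_sec]
total_hours = sum(run_times_sec) / 3600

print(f"total: {total_hours:.1f} h")
print("per task:", ", ".join(f"{h:.2f} h" for h in per_task_hours))
```

The total comes to roughly 48.1 hours, with each task close to the 12 hour target run time, so the "48 hours of work" estimate is consistent with the listing.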
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.

Sid Celery
Joined: 11 Feb 08
Posts: 1680
Credit: 31,044,898
RAC: 13,029
Message 101519 - Posted: 26 Apr 2021, 13:52:32 UTC - in response to Message 101518.  

You're right - it's been noticed by others too. You do get credit awarded when the daily clean-up job runs.
They're all "norn_struct_profile_layered_design" tasks, but it doesn't happen to all of them, oddly.
When it happens, the task only shows "upload failure: <file_xfer_error>", indicating it ran ok but something goes wrong after the server receives it back.

adrianxw
Message 101520 - Posted: 26 Apr 2021, 13:59:57 UTC

>>> goes wrong after the server

So, is it still the case that the work is lost, or do the files still exist for retrying?

Sid Celery
Message 101522 - Posted: 26 Apr 2021, 15:30:02 UTC - in response to Message 101520.  
Last modified: 26 Apr 2021, 15:30:34 UTC

>>> goes wrong after the server
So, is it still the case that the work is lost, or do the files still exist for retrying?

I don't honestly know - good question.
If I were to guess, we run them successfully - and we do eventually get credit for them as part of the daily clean-up job the project runs - but the project likely isn't getting the results for the 75% or so that report a Computation Error.
That makes it a far bigger problem for them than for us. A very good reason to fix it.

Others have previously mentioned it (and I confirm I see the same issue myself) so I've just reported it.
I don't know how I've given myself this job - no-one's given it to me - but it seems I've got it, like it or not.
I'm reluctant to keep bugging the project guys, so I wait until it's apparent it's a consistent failure rather than a one-off.
As long as I don't come across as a moaner or time-waster, I get a very good response, to be fair.
They just seem to avoid looking at the forums - a time killer.

adrianxw
Message 101535 - Posted: 26 Apr 2021, 17:35:02 UTC

I see my work units that were in that state, that I can still see on my results page, have been credited now. The credit is a little "odd", i.e. it is 388 +/- 1 or 2. This is below, by quite a lot, what I would expect; the other tasks still visible on my results page show upper 400s to lower 500s. Everyone knows, of course, that the credit is not of any use.

Sid Celery
Message 101539 - Posted: 26 Apr 2021, 18:38:49 UTC - in response to Message 101535.  

Yes, that's often the way. The average for a default machine rather than the host's average.
Still, a lot better than last year when it was a flat zero!

Err... you may have noticed that our "norn_struct_profile_layered_design" tasks have now been aborted by the server.
You were right that none of the results were getting back to them, so thanks for reporting it.
To be investigated, but very likely because of a long filename error in the results file, which might explain why some were ok and some weren't.

adrianxw
Message 101541 - Posted: 26 Apr 2021, 20:11:10 UTC
Last modified: 26 Apr 2021, 20:14:44 UTC

>>> The average for a default machine rather than the host's average.

Probably, a default machine with a default run time. Both my machines here are 4GHz i7's with 12 hour target run times. I'd set the time up a while ago; our network was busy at the time, and it was a simple thing to do which chopped a little load off it.

>>> aborted by the server

Yes, I saw that. I've allowed new tasks again, there were a number waiting to start anyway, so I doubt the fairly short time I closed downloads actually had any noticeable effect.

Sid Celery
Message 101542 - Posted: 26 Apr 2021, 21:36:21 UTC - in response to Message 101541.  
Last modified: 26 Apr 2021, 21:36:52 UTC

>>> The average for a default machine rather than the host's average.
Probably, a default machine with a default run time. Both my machines here are 4GHz i7's with 12 hour target run times. I'd set the time up a while ago; our network was busy at the time, and it was a simple thing to do which chopped a little load off it.

My impression is it does account for runtime, hence the +/- 1, but not the processing power you have compared to the average.
It is what it is.

>>> aborted by the server
Yes, I saw that. I've allowed new tasks again, there were a number waiting to start anyway, so I doubt the fairly short time I closed downloads actually had any noticeable effect.

It aborted one or two of my running tasks, which was unfortunate. And then it downloaded a few more of the same type of tasks...
It'll sort itself out before long - some stragglers slipping through for now. Give it a day or so.

©2021 University of Washington
https://www.bakerlab.org