Problems and Technical Issues with Rosetta@home

Message boards : Number crunching : Problems and Technical Issues with Rosetta@home

Tom M

Joined: 20 Jun 17
Posts: 97
Credit: 16,726,096
RAC: 36,642
Message 94410 - Posted: 13 Apr 2020, 22:57:24 UTC - in response to Message 94367.  

I had nothing but errors on both the i686 applications on my Ryzen. Gave up on Rosetta and moved to Einstein. Discovered later that you can set a flag in cc_config.xml to ignore alternate platforms.
<no_alt_platform>1</no_alt_platform>
That would have told the Rosetta scheduler to not send me x86 applications and just send me the x86_64 applications.
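A minimal sketch of setting that flag, assuming the usual `cc_config.xml` layout in which client options sit inside an `<options>` element under a `<cc_config>` root. The helper name `set_cc_option` is hypothetical; it only builds the XML text and leaves writing the file into the BOINC data directory to the reader. `ET.indent` needs Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def set_cc_option(xml_text: str, name: str, value: str) -> str:
    """Return cc_config XML text with <options>/<name> set to value.

    Hypothetical helper: existing options are preserved, and an empty
    input produces a fresh <cc_config> document.
    """
    root = ET.fromstring(xml_text) if xml_text.strip() else ET.Element("cc_config")
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    elem = options.find(name)
    if elem is None:
        elem = ET.SubElement(options, name)
    elem.text = value
    ET.indent(root)  # pretty-print in place (Python 3.9+)
    return ET.tostring(root, encoding="unicode")
```

Calling `set_cc_option("", "no_alt_platform", "1")` yields a complete document containing the flag quoted above. The client may need to re-read its config files, or be restarted, before the change takes effect.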


Huh, I didn't notice mine apparently.

Tom M
Help, my tagline is missing..... Help, my tagline is......... Help, m........ Hel.....

Sven

Joined: 7 Feb 16
Posts: 8
Credit: 222,005
RAC: 0
Message 94421 - Posted: 14 Apr 2020, 7:37:04 UTC - in response to Message 94410.  

Hi all,

concerning my problems with this issue:

****
Rosetta@home | Task xxxx exited with zero status but no 'finished' file
Rosetta@home | If this happens repeatedly you may need to reset the project.
*****

... I've found out what to do to avoid this kind of error message:

It seems advisable to make sure that the following setting is adjusted:
Use at most 100% of CPU time

All other settings, including max usage of CPUs, don't influence the processing of Rosetta tasks.
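
For anyone who prefers a local preferences file to the Manager dialog, here is a minimal sketch, assuming the "Use at most X% of CPU time" setting corresponds to the `<cpu_usage_limit>` element of `global_prefs_override.xml` in the BOINC data directory. The function name is hypothetical and it only renders the text; it does not touch any file.

```python
def render_cpu_limit_override(cpu_usage_limit: float = 100.0) -> str:
    """Render a minimal global_prefs_override.xml body.

    Assumption: only the CPU-time limit is overridden; every other
    preference keeps its existing value.
    """
    if not 0.0 < cpu_usage_limit <= 100.0:
        raise ValueError("cpu_usage_limit must be in (0, 100]")
    return (
        "<global_preferences>\n"
        f"    <cpu_usage_limit>{cpu_usage_limit:.1f}</cpu_usage_limit>\n"
        "</global_preferences>\n"
    )
```

With the default argument the rendered fragment requests 100% of CPU time, which is the value recommended above.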

Sven

Sid Celery

Joined: 11 Feb 08
Posts: 2146
Credit: 41,570,180
RAC: 6,141
Message 94423 - Posted: 14 Apr 2020, 8:41:13 UTC - in response to Message 94421.  

Concerning my problems with this issue:

****
Rosetta@home | Task xxxx exited with zero status but no 'finished' file
Rosetta@home | If this happens repeatedly you may need to reset the project.
*****

...I've found out what to do to avoid this kind of error message:

It seems advisable to make sure that the following setting is adjusted:
Use at most 100% of CPU time

Very useful information, thanks

Keith Myers

Joined: 29 Mar 20
Posts: 97
Credit: 332,619
RAC: 8
Message 94455 - Posted: 14 Apr 2020, 15:57:03 UTC - in response to Message 94421.  

Interesting. You shouldn't still be getting those now that you've updated to the latest 7.16.5 client, which has the revised code fix to stop those errors. Your system would have to be too busy to service the slot cleanup for longer than five minutes to still get those errors.

Grant (SSSF)

Joined: 28 Mar 20
Posts: 1735
Credit: 18,532,940
RAC: 14,716
Message 94496 - Posted: 14 Apr 2020, 23:58:01 UTC - in response to Message 94455.  
Last modified: 14 Apr 2020, 23:59:47 UTC

You're thinking of the "Finished file present too long" issue.
But it is very probably some sort of I/O problem: the settings for the systems barely allowed any processing, with extremely frequent suspending & resuming occurring.
Grant
Darwin NT

Ian

Joined: 12 Oct 07
Posts: 3
Credit: 2,611,432
RAC: 0
Message 94520 - Posted: 15 Apr 2020, 8:56:28 UTC

Hello Rosetta community,

I have a fairly persistent issue with uploading work units. What seems to happen is that the upload looks to have proceeded normally, but then sticks at 100% and never gets removed from the transfer queue. The net effect of this is that BOINC eventually runs out of disk space as it is all in use by pending Rosetta uploads. Rosetta is the only BOINC project I have this issue with. I have tried suspending/restarting uploads as suggested in another thread. I have also tried resetting and deleting and re-adding the Rosetta project entirely, but I have the same issue. Running on Windows 10, latest BOINC client.

Does anyone have anything I can check or try in addition to the above?
Thanks, Ian

Grant (SSSF)

Joined: 28 Mar 20
Posts: 1735
Credit: 18,532,940
RAC: 14,716
Message 94521 - Posted: 15 Apr 2020, 9:00:47 UTC - in response to Message 94520.  
Last modified: 15 Apr 2020, 9:01:26 UTC

Do you use any sort of 3rd party AV/Internet security programme? (just grasping at straws here).
Grant
Darwin NT

Sven

Joined: 7 Feb 16
Posts: 8
Credit: 222,005
RAC: 0
Message 94522 - Posted: 15 Apr 2020, 9:02:16 UTC - in response to Message 94496.  

Most of the error messages occurred overnight, when there was no other workload on the system.

Usually I would say that a project should be able to handle every kind of client setting. Obviously I'm wrong about that.

Due to heavy fan noise I actually adjusted the settings to "suspend when computer is in use", and I still don't get any more error messages like the ones I had when I reduced the percentage of allowed CPU time.


Additional information:
As I freshly downloaded the newest BOINC client on this new computer, I'm working with version 7.16.5 (x64), which doesn't help against the "Task xxxx exited with zero status but no 'finished' file" error.

Ian

Joined: 12 Oct 07
Posts: 3
Credit: 2,611,432
RAC: 0
Message 94528 - Posted: 15 Apr 2020, 10:42:45 UTC - in response to Message 94521.  

Hi, the machine has Trend Micro Security agent installed. If I exit this and retry the upload, I see the same behaviour - the upload does not seem to be blocked at all - the progress bar goes up to 100%, it just doesn't ever get removed once the upload is complete. Windows moans about there not being any virus checking active, so the virus checker is seemingly off at this point (well as far as Windows can detect). Same behaviour with the Windows firewall off, just sits at 100% progress.

If I leave it for a bit I get a project backoff message, so maybe it is just load on the server end. I have been having this for a while though, since before the current interest in COVID-19 work. One other point of note is that I do have the upload rate throttled to 500 KBps, as it is on a shared network, if that makes a difference. Temporarily turning this off does not fix the issue, however.

Grant (SSSF)

Joined: 28 Mar 20
Posts: 1735
Credit: 18,532,940
RAC: 14,716
Message 94532 - Posted: 15 Apr 2020, 11:29:11 UTC - in response to Message 94528.  
Last modified: 15 Apr 2020, 11:29:37 UTC

A few weeks back there were some upload issues, but they've been sorted, and no one else has been posting about similar upload issues. Getting to 100% and then stopping would indicate it's not getting a final acknowledgement for the upload, but I've no idea why everything else would work bar that final ACK.
If all else has failed, I'd reboot your modem and reboot the computer.
*fingers crossed*
Grant
Darwin NT

robertmiles

Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 677
Message 94538 - Posted: 15 Apr 2020, 12:52:54 UTC
Last modified: 15 Apr 2020, 12:54:11 UTC

The download issues a few weeks back were due to additional servers near the Rosetta@home end of the connections, running overly aggressive antivirus programs examining everything that went by.

Does the shared network at your end of such connections have similar additional servers? If so, you may need to talk to whoever runs those servers, and ask them to set it up so that everything sent to the Rosetta@home upload server is excluded from checking.

amazoph

Joined: 24 Nov 13
Posts: 3
Credit: 2,099,022
RAC: 0
Message 94548 - Posted: 15 Apr 2020, 15:05:01 UTC - in response to Message 93970.  

Looks like you have many access violations. I am not seeing such errors with other people's problem reports. Have you run memtest?


Thanks - have tried both Memtest86 and Windows built in memtest on that host, both came back clean after several passes.
Not sure as to the reason for the errors, for the moment I've stopped this host from taking Rosetta tasks and have put it on WCG til I have a chance to look further.

Seems to be only Rosetta that's affected as other applications (GPUGrid and WCG's MCM) run OK.


I found the issue: this machine had XMP memory timings enabled in the BIOS. I reverted to the stock, lower-speed timings and am starting to get WUs completed without errors now.

Grant (SSSF)

Joined: 28 Mar 20
Posts: 1735
Credit: 18,532,940
RAC: 14,716
Message 94588 - Posted: 16 Apr 2020, 3:46:10 UTC

rb_04_12_21176_20979__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_912410_4
Sent                          Time reported/deadline        Status
15 Apr 2020, 23:33:22 UTC     16 Apr 2020, 2:30:44 UTC      Cancelled by server
Cancelled only 3 hours after it was sent.

If Rosetta is going to allow a grace period for Tasks that are returned after the deadline, then the next replication of it shouldn't be sent out until after the deadline grace period has passed. Saves having things like this occur.
Or do away with the grace period & send the next copy out, and cancel the original Task. Or keep the grace period but still cancel the original Task once the next one is sent.
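
The first option can be sketched as a simple check. This is a hypothetical illustration, not how the Rosetta@home scheduler is implemented: the function names are made up, and the length of the grace period is unknown and is passed in as a parameter.

```python
from datetime import datetime, timedelta, timezone


def earliest_resend(deadline: datetime, grace: timedelta) -> datetime:
    """First moment at which a replacement copy may be sent out."""
    return deadline + grace


def may_send_replica(now: datetime, deadline: datetime,
                     grace: timedelta, original_reported: bool) -> bool:
    """Hold the replica back until deadline + grace has passed.

    If the original Task has already been reported, no replica is needed.
    """
    if original_reported:
        return False
    return now >= earliest_resend(deadline, grace)
```

Under this rule a late host can still report within the grace period without a second copy ever being issued, so nothing has to be cancelled three hours after it was sent.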

Pretty much everything that host does arrives late.
Grant
Darwin NT

Grant (SSSF)

Joined: 28 Mar 20
Posts: 1735
Credit: 18,532,940
RAC: 14,716
Message 94591 - Posted: 16 Apr 2020, 6:20:56 UTC - in response to Message 94588.  
Last modified: 16 Apr 2020, 6:22:12 UTC


And then I got hit with this one.

hgfp_split2_562_fold_SAVE_ALL_OUT_916305_1
Sent                          Time reported/deadline        Status
15 Apr 2020, 12:54:58 UTC     16 Apr 2020, 3:41:43 UTC      Cancelled by server

Errors: Too many errors (may have bug); Too many total results; WU cancelled



The server cancelling a bad batch, sure. But killing off the Task due to an error on the other system seems wrong, as the problem could be with the system, not the Task (which appears to be the case this time around), especially when it sent out the new Task after the Error Task had been returned.
I've done quite a few resends previously without issue.
Grant
Darwin NT

Ian

Joined: 12 Oct 07
Posts: 3
Credit: 2,611,432
RAC: 0
Message 94594 - Posted: 16 Apr 2020, 7:29:59 UTC - in response to Message 94538.  

I'll check, but as far as I am aware, the connection goes straight onto the internet via a firewall. What I don't understand is that Rosetta is the only BOINC project that I have this issue with - I am connected to several. Do they not all share the same upload infrastructure (and acknowledgement message)? I could understand the behaviour if the ACKs were not getting through for any BOINC projects. Is there a particular port that the response comes through that is peculiar to Rosetta?

It looks like my contributions for this host are not getting credited looking at the host average page - I see a flat line. Interestingly the contributions for my home machine do get through though, which would seem to point to something blocking the upload response somewhere.

Raistmer

Joined: 7 Apr 20
Posts: 49
Credit: 797,293
RAC: 0
Message 94597 - Posted: 16 Apr 2020, 8:02:33 UTC

"Too many restarts with no progress. Keep application in memory while preempted."
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13811

Grant (SSSF)

Joined: 28 Mar 20
Posts: 1735
Credit: 18,532,940
RAC: 14,716
Message 94598 - Posted: 16 Apr 2020, 8:47:47 UTC - in response to Message 94594.  

I'll check, but as far as I am aware, the connection goes straight onto the internet via a firewall. What I don't understand is that Rosetta is the only BOINC project that I have this issue with - I am connected to several. Do they not all share the same upload infrastructure (and acknowledgement message)? I could understand the behaviour if the ACKs were not getting through for any BOINC projects.
Each project has its own servers, and all use TCP/IP for connections.



Is there a particular port that the response comes through that is peculiar to Rosetta?
No idea on that one. My guess is no.



It looks like my contributions for this host are not getting credited looking at the host average page - I see a flat line.
A Result has to be returned before it can be reported. And it has to be reported for the project to be able to Validate it, then allocate Credit.



Interestingly the contributions for my home machine do get through though, which would seem to point to something blocking the upload response somewhere.
Yep.
It's an issue that only you, with that system, appear to be experiencing, so it's something to do with that particular system or its connection to the Rosetta servers.


Can you get a cheap USB mobile modem (or do you know someone with one)?
It's a lot of stuffing around to set up, but if you can get one for $20, you could use it to connect to the internet instead of your existing connection (still a lot easier than taking the whole computer somewhere else). That way you can see whether it is your system or the internet connection you are using that's causing the issue.
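
Before swapping the connection, a basic reachability check can be run from both the affected machine and the working one. This is a minimal sketch using only the Python standard library. It tests whether a TCP connection can be opened at all, so it can rule out outright blocking but cannot show a lost upload acknowledgement. The helper name is hypothetical.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

For example, `can_connect("boinc.bakerlab.org", 443)` uses the host name from the forum URL posted earlier in this thread. That the uploads go to the same host is an assumption; the actual upload server may differ.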
Grant
Darwin NT

Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94612 - Posted: 16 Apr 2020, 13:41:32 UTC - in response to Message 94591.  

The server cancelling a bad batch, sure, but to kill off the Task due to an error on the other system, as the problem could be with the system, not the Task (which appears to be the case this time around). Especially when it sent out the new Task after the Error Task had been returned.


I think you are taking it a bit too personally. A batch of WUs is not cancelled by the project when they have good reason to believe your attempt to crunch it will go better. They set up most WUs to do one additional try after a failure for the reason you mention, maybe the second attempt will go better. But, looking across the whole batch is the only way to make a decision about whether to withdraw the batch, and that has nothing to do with your current state on the WU.
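
The two independent decisions described here can be sketched as follows. This is a hypothetical illustration of the policy as stated in this post, not the project's server code. The function name, the outcome strings, and the default of two attempts ("one additional try after a failure") are assumptions.

```python
from typing import Sequence


def next_action(outcomes: Sequence[str], max_attempts: int = 2,
                batch_withdrawn: bool = False) -> str:
    """Decide what happens to a WU given its returned attempts.

    outcomes holds "success" or "error" for each attempt returned so far.
    A batch-level withdrawal overrides everything, regardless of how any
    individual host is doing on the WU.
    """
    if batch_withdrawn:
        return "cancel"
    if "success" in outcomes:
        return "done"
    if len(outcomes) < max_attempts:
        return "send"  # first try, or the one additional try after a failure
    return "give_up"
```

The `batch_withdrawn` check comes first, which is the point being made above: a resend that is running perfectly well is still cancelled once the batch as a whole is withdrawn.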
Rosetta Moderator: Mod.Sense

crystalsys

Joined: 11 Aug 09
Posts: 8
Credit: 1,648,888
RAC: 566
Message 94618 - Posted: 16 Apr 2020, 15:20:06 UTC
Last modified: 16 Apr 2020, 15:24:53 UTC

Version 3.8 (? I'm not sure and log does not show the version) error and tie up

I keep getting jobs running on a 3.X (?) application that hang at some point, and then the log shows something like this:

4/16/2020 11:05:45 AM | Rosetta@home | Task hgfp_het2_576_fold_SAVE_ALL_OUT_911081_124_1 exited with zero status but no 'finished' file
4/16/2020 11:05:45 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.

Resetting the project is not necessary, but aborting that task seems to be, unless you want to waste more CPU time on it. Is there a way to restrict which app versions you get? I've looked, can't seem to find it.

I've not seen this with version 4.15
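
To see which tasks keep hitting this, the event log lines can be tallied per task name. A minimal sketch, assuming the log has been copied out of the BOINC Manager as plain text; the function name is hypothetical and the message text is taken from the log excerpt above.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the message quoted above and captures the task name.
NO_FINISHED_FILE = re.compile(
    r"Task (\S+) exited with zero status but no 'finished' file"
)


def count_no_finished_file(lines: Iterable[str]) -> Counter:
    """Count 'no finished file' events per task name."""
    counts: Counter = Counter()
    for line in lines:
        match = NO_FINISHED_FILE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Tasks with a high count are the ones that restart repeatedly and are candidates for aborting, as described above.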

robertmiles

Joined: 16 Jun 08
Posts: 1234
Credit: 14,338,560
RAC: 677
Message 94628 - Posted: 16 Apr 2020, 18:17:59 UTC - in response to Message 94618.  

Upgrading to BOINC 7.16.5 makes that problem much less likely.



©2025 University of Washington
https://www.bakerlab.org