Problems and Technical Issues with Rosetta@home

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1330
Credit: 13,624,788
RAC: 10
Message 94591 - Posted: 16 Apr 2020, 6:20:56 UTC - in response to Message 94588.  
Last modified: 16 Apr 2020, 6:22:12 UTC

rb_04_12_21176_20979__t000__0_C1_SAVE_ALL_OUT_IGNORE_THE_REST_912410_4
          Sent                 Time reported/deadline           Status
15 Apr 2020, 23:33:22 UTC     16 Apr 2020, 2:30:44 UTC     Cancelled by server
Cancelled only 3 hours after it was sent.

If Rosetta is going to allow a grace period for Tasks that are returned after the deadline, then the next replication of it shouldn't be sent out until after the deadline grace period has passed. Saves having things like this occur.
Or do away with the grace period & send the next copy out, and cancel the original Task. Or keep the grace period but still cancel the original Task once the next one is sent.

Pretty much everything that host does arrives late.



And then I got hit with this one.

hgfp_split2_562_fold_SAVE_ALL_OUT_916305_1
          Sent                 Time reported/deadline           Status
15 Apr 2020, 12:54:58 UTC     16 Apr 2020, 3:41:43 UTC     Cancelled by server

errors: Too many errors (may have bug), Too many total results, WU cancelled



The server cancelling a bad batch, sure, but why kill off the Task due to an error on the other system? The problem could be with that system, not the Task (which appears to be the case this time around). Especially when it sent out the new Task after the Error Task had been returned.
I've done quite a few resends previously without issue.
Grant
Darwin NT
ID: 94591 · Rating: 0
Ian
Joined: 12 Oct 07
Posts: 3
Credit: 2,593,227
RAC: 39
Message 94594 - Posted: 16 Apr 2020, 7:29:59 UTC - in response to Message 94538.  

I'll check, but as far as I am aware, the connection goes straight onto the internet via a firewall. What I don't understand is that Rosetta is the only BOINC project that I have this issue with - I am connected to several. Do they not all share the same upload infrastructure (and acknowledgement message)? I could understand the behaviour if the ACKs were not getting through for any BOINC projects. Is there a particular port that the response comes through that is peculiar to Rosetta?

It looks like my contributions for this host are not getting credited looking at the host average page - I see a flat line. Interestingly the contributions for my home machine do get through though, which would seem to point to something blocking the upload response somewhere.
ID: 94594 · Rating: 0
Raistmer
Joined: 7 Apr 20
Posts: 49
Credit: 788,998
RAC: 0
Message 94597 - Posted: 16 Apr 2020, 8:02:33 UTC

"Too many restarts with no progress. Keep application in memory while preempted."
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=13811
ID: 94597 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1330
Credit: 13,624,788
RAC: 10
Message 94598 - Posted: 16 Apr 2020, 8:47:47 UTC - in response to Message 94594.  

I'll check, but as far as I am aware, the connection goes straight onto the internet via a firewall. What I don't understand is that Rosetta is the only BOINC project that I have this issue with - I am connected to several. Do they not all share the same upload infrastructure (and acknowledgement message)? I could understand the behaviour if the ACKs were not getting through for any BOINC projects.
Each project has its own servers, and all use TCP/IP for connections.



Is there a particular port that the response comes through that is peculiar to Rosetta?
No idea on that one. My guess is no.



It looks like my contributions for this host are not getting credited looking at the host average page - I see a flat line.
A Result has to be returned before it can be reported. And it has to be reported for the project to be able to Validate it, then allocate Credit.
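The ordering described here can be sketched as a small Python model. It only illustrates the sequence (returned, reported, validated, credited); the stage labels and the `next_missing_stage` helper are made up for this example and are not BOINC identifiers.

```python
# Stages a result goes through, in the order described in the post.
# The names are illustrative labels, not BOINC's internal state names.
STAGES = ["returned", "reported", "validated", "credited"]


def next_missing_stage(completed):
    """Return the first stage not yet completed, or None once credit is granted."""
    for stage in STAGES:
        if stage not in completed:
            return stage
    return None


# A host whose uploads never get acknowledged is stuck at the first stage,
# so nothing after it (including credit) can happen.
print(next_missing_stage(set()))
# A result that was returned and reported is still waiting on validation.
print(next_missing_stage({"returned", "reported"}))
```

Under this model, a flat line on the host average page is what you would expect if results never get past the first stage.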



Interestingly the contributions for my home machine do get through though, which would seem to point to something blocking the upload response somewhere.
Yep.
It's an issue that only you with that system appear to be experiencing, so it's something to do with that particular system, or its connection to the Rosetta servers.


Can you get a cheap USB mobile modem (or know someone with one)?
It's a lot of stuffing around to set up, but if you can get one for $20, use it to connect to the internet instead of your existing connection (still a lot easier than taking the whole computer somewhere else). That way you can see whether it is your system, or the internet connection you are using, that's causing the issue.
Grant
Darwin NT
ID: 94598 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94612 - Posted: 16 Apr 2020, 13:41:32 UTC - in response to Message 94591.  

The server cancelling a bad batch, sure, but why kill off the Task due to an error on the other system? The problem could be with that system, not the Task (which appears to be the case this time around). Especially when it sent out the new Task after the Error Task had been returned.


I think you are taking it a bit too personally. A batch of WUs is not cancelled by the project when they have good reason to believe your attempt to crunch it will go better. They set up most WUs to do one additional try after a failure for the reason you mention, maybe the second attempt will go better. But, looking across the whole batch is the only way to make a decision about whether to withdraw the batch, and that has nothing to do with your current state on the WU.
Rosetta Moderator: Mod.Sense
ID: 94612 · Rating: 0
crystalsys
Joined: 11 Aug 09
Posts: 8
Credit: 1,394,460
RAC: 59
Message 94618 - Posted: 16 Apr 2020, 15:20:06 UTC
Last modified: 16 Apr 2020, 15:24:53 UTC

Version 3.8 (? I'm not sure and log does not show the version) error and tie up

I keep getting jobs running on a 3.X (?) application that hang at some point, and then the log shows something like this:

4/16/2020 11:05:45 AM | Rosetta@home | Task hgfp_het2_576_fold_SAVE_ALL_OUT_911081_124_1 exited with zero status but no 'finished' file
4/16/2020 11:05:45 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.

Resetting the project is not necessary, but aborting that task seems to be, unless you want to waste more CPU time on it. Is there a way to restrict which app versions you get? I've looked, can't seem to find it.

I've not seen this with version 4.15
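To see which tasks keep hitting this, the event log can be scanned for the message. The following is a minimal Python sketch that assumes the log lines look exactly like the two quoted above (timestamp, project, and message separated by " | "); the `count_restarts` helper and the pattern are written for this example and are not part of BOINC.

```python
import re
from collections import Counter

# Matches the event-log line quoted in the post. The project column sits
# between two "|" separators and the task name follows the word "Task".
NO_FINISHED_FILE = re.compile(
    r"\|\s*(?P<project>[^|]+?)\s*\|\s*Task (?P<task>\S+) "
    r"exited with zero status but no 'finished' file"
)


def count_restarts(log_lines):
    """Return a Counter mapping task name to the number of matching log lines."""
    counts = Counter()
    for line in log_lines:
        match = NO_FINISHED_FILE.search(line)
        if match:
            counts[match.group("task")] += 1
    return counts


sample_log = [
    "4/16/2020 11:05:45 AM | Rosetta@home | Task hgfp_het2_576_fold_SAVE_ALL_OUT_911081_124_1 exited with zero status but no 'finished' file",
    "4/16/2020 11:05:45 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.",
]

for task, n in count_restarts(sample_log).items():
    print(f"{task}: {n} exit(s) without a 'finished' file")
```

A task that shows up many times in the output is a candidate for aborting by hand, as described above.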
ID: 94618 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1176
Credit: 13,195,130
RAC: 5,920
Message 94628 - Posted: 16 Apr 2020, 18:17:59 UTC - in response to Message 94618.  

Version 3.8 (? I'm not sure and log does not show the version) error and tie up

I keep getting jobs running on a 3.X (?) application that hang at some point, and then the log shows something like this:

4/16/2020 11:05:45 AM | Rosetta@home | Task hgfp_het2_576_fold_SAVE_ALL_OUT_911081_124_1 exited with zero status but no 'finished' file
4/16/2020 11:05:45 AM | Rosetta@home | If this happens repeatedly you may need to reset the project.

Resetting the project is not necessary, but aborting that task seems to be, unless you want to waste more CPU time on it. Is there a way to restrict which app versions you get? I've looked, can't seem to find it.

I've not seen this with version 4.15

Upgrading to BOINC 7.16.5 makes that problem much less likely.
ID: 94628 · Rating: 0
crystalsys
Joined: 11 Aug 09
Posts: 8
Credit: 1,394,460
RAC: 59
Message 94632 - Posted: 16 Apr 2020, 19:58:29 UTC - in response to Message 94628.  
Last modified: 16 Apr 2020, 19:59:19 UTC

OK, so I decided to take your advice, though I don't have any recent notices of a new version.

In BOINC Manager (currently 7.14.2 x64) I clicked 'check for new version'. It came back and told me there wasn't one. I normally don't have the log window open, but I did because I was monitoring the stalled tasks. In THAT window, I got a message in RED saying there was a new version. Did the check again, NOW it says there is a new version.

Hopefully they also fixed the bogus 'there is no newer version' message.

Thanks!
ID: 94632 · Rating: 0
Keith Myers
Joined: 29 Mar 20
Posts: 95
Credit: 289,903
RAC: 32
Message 94635 - Posted: 16 Apr 2020, 20:41:46 UTC - in response to Message 94632.  

That feature in the Manager does not work. You can check for the latest BOINC version on the BOINC download page. The latest is 7.16.5.

https://boinc.berkeley.edu/download_all.php

You can also restrict the applications you run by configuring a cc_config.xml file.

https://boinc.berkeley.edu/wiki/Client_configuration#Application_configuration
ID: 94635 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1330
Credit: 13,624,788
RAC: 10
Message 94644 - Posted: 17 Apr 2020, 0:20:04 UTC - in response to Message 94612.  

A batch of WUs is not cancelled by the project when they have good reason to believe your attempt to crunch it will go better. They set up most WUs to do one additional try after a failure for the reason you mention, maybe the second attempt will go better. But, looking across the whole batch is the only way to make a decision about whether to withdraw the batch, and that has nothing to do with your current state on the WU.
That's just it.
As near as I can tell, that batch of WUs wasn't cancelled by the servers (I've actually processed 3 others that were resends, and one initial issue with no problems); that Task was sent out to see if it was dodgy or not. But then the Server cancelled it anyway without giving me a chance to even process it.

                    minimum quorum 1
               initial replication 1
max # of error/total/success tasks 1, 2, 1
errors: Too many errors (may have bug), Too many total results, WU cancelled
Why would the server cancel that Task? Given the time and effort it takes to produce work, I'd have thought Tasks being cancelled before checking them out would be worth looking into.
Grant
Darwin NT
ID: 94644 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1176
Credit: 13,195,130
RAC: 5,920
Message 94650 - Posted: 17 Apr 2020, 3:04:30 UTC - in response to Message 94644.  
Last modified: 17 Apr 2020, 3:10:15 UTC

[snip]


                    minimum quorum 1
               initial replication 1
max # of error/total/success tasks 1, 2, 1
errors: Too many errors (may have bug), Too many total results, WU cancelled
Why would the server cancel that Task? Given the time and effort it takes to produce work, I'd have thought Tasks being cancelled before checking them out would be worth looking into.

The previous failed attempt WAS enough checking it out. Tasks already downloaded are normally cancelled only if they haven't started.
ID: 94650 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1330
Credit: 13,624,788
RAC: 10
Message 94652 - Posted: 17 Apr 2020, 3:50:54 UTC - in response to Message 94650.  
Last modified: 17 Apr 2020, 3:51:54 UTC

[snip]


                    minimum quorum 1
               initial replication 1
max # of error/total/success tasks 1, 2, 1
errors: Too many errors (may have bug), Too many total results, WU cancelled
Why would the server cancel that Task? Given the time and effort it takes to produce work, I'd have thought Tasks being cancelled before checking them out would be worth looking into.
The previous failed attempt WAS enough checking it out. Tasks already downloaded are normally cancelled only if they haven't started.
If that was the case, why resend it? The whole point of resending something is to check it out. If it doesn't need to be checked out, it doesn't need to be resent.
And as I posted, I've done 3 others of that type that had errored on other systems, without them being cancelled.
Grant
Darwin NT
ID: 94652 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94676 - Posted: 17 Apr 2020, 13:32:27 UTC - in response to Message 94652.  

Human intervention is required to make the decision about whether there is a specific problem with one machine, or a more general problem with the WU batch. By the time the human had enough information to make that decision, some WUs of the batch were already out to a second host.
Rosetta Moderator: Mod.Sense
ID: 94676 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 1826
Credit: 33,755,837
RAC: 6,911
Message 94727 - Posted: 18 Apr 2020, 4:02:48 UTC

Not sure if I should report this as a problem, but...

On an Android phone I'm running 4 tasks and have another (varying) 3 or 4 waiting to follow.
I've been reporting and receiving more tasks regularly. All sounds good.

Trouble is, the Server Status page has been reporting no tasks available to download for at least a day.
And the number of in progress tasks has been reducing steadily until a few hours ago and currently reads nil. Right now I have 7.

I've certainly received and reported tasks since both read nil.

Not complaining, obviously. Just reporting
ID: 94727 · Rating: 0
GoldenHat
Joined: 14 Apr 20
Posts: 3
Credit: 122,663
RAC: 0
Message 94754 - Posted: 18 Apr 2020, 11:26:17 UTC

Hello, I'm a newbie to Rosetta and got things set up and running OK. In the last two days I've noticed my laptop running this app in an odd manner. Instead of running at 100% CPU, it fluctuates between 33% and 100%, and my fan turns on and off each time, yet I have the settings set as default - 100% CPU time. I'm concerned because 1) It's slower to process the data, 2) It's wearing out my PC and I'm inclined to delete the app from the computer if this continues. I have a Toshiba Qosmio i7 Quad-core with 8 logical processors. It runs the CPU, GPU 0 and GPU 1 at full capacity, with CPU speed at 3 GHz with a base speed of 2.4 GHz.
Any ideas how I can get it running flat at 100% rather than this fluctuation? When I started it was fine, just in the last two days it's gone funky.

Thanks,
Richard.
ID: 94754 · Rating: 0
Bryn Mawr
Joined: 26 Dec 18
Posts: 315
Credit: 9,213,793
RAC: 434
Message 94755 - Posted: 18 Apr 2020, 11:31:57 UTC - in response to Message 94754.  

Hello, I'm a newbie to Rosetta and got things set up and running OK. In the last two days I've noticed my laptop running this app in an odd manner. Instead of running at 100% CPU, it fluctuates between 33% and 100%, and my fan turns on and off each time, yet I have the settings set as default - 100% CPU time. I'm concerned because 1) It's slower to process the data, 2) It's wearing out my PC and I'm inclined to delete the app from the computer if this continues. I have a Toshiba Qosmio i7 Quad-core with 8 logical processors. It runs the CPU, GPU 0 and GPU 1 at full capacity, with CPU speed at 3 GHz with a base speed of 2.4 GHz.
Any ideas how I can get it running flat at 100% rather than this fluctuation? When I started it was fine, just in the last two days it's gone funky.

Thanks,
Richard.


What OS are you running?

Have you tried running the system monitor to see what processes are taking CPU time and maybe which processes are cutting in and out to cause the fluctuations?
ID: 94755 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1330
Credit: 13,624,788
RAC: 10
Message 94756 - Posted: 18 Apr 2020, 11:54:35 UTC - in response to Message 94754.  

Any ideas how I can get it running flat at 100% rather than this fluctuation? When I started it was fine, just in the last two days it's gone funky.
In the last couple of days you have picked up more work from Seti. Some was run on the iGPU; the rest is on the Nvidia GPU.
And the problem with how it was before is that the system was producing nothing but errors here at Rosetta. If you set your Computing preferences to the following, things should settle down: fewer errors, and fewer fans starting up & slowing down continually.
Computing
   Usage limits	
                                   Use at most 100% of the CPUs
                                   Use at most 100% of CPU time

   When to suspend	
           Suspend when computer is on battery (selected)
               Suspend when computer is in use (not selected)
 Suspend GPU computing when computer is in use (not selected)
   'In use' means mouse/keyboard input in last 3 minutes
  Suspend when no mouse/keyboard input in last --- minutes
     Suspend when non-BOINC CPU usage is above --- %
                          Compute only between ---

   Other	
                                Store at least 1 days of work
                     Store up to an additional 0.02 days of work
                    Switch between tasks every 60 minutes
     Request tasks to checkpoint at most every 60 seconds

   Disk
                              Use no more than 20 GB
                                Leave at least 2 GB free
                              Use no more than 60 % of total

   Memory
          When computer is in use, use at most 95 %
      When computer is not in use, use at most 95 %
 Leave non-GPU tasks in memory while suspended (not selected)
                   Page/swap file: use at most 75 %

Grant
Darwin NT
ID: 94756 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94791 - Posted: 18 Apr 2020, 18:01:44 UTC

What is your setting for "Suspend when non-BOINC CPU usage is above --- %"? Perhaps you have other tasks popping in and consuming CPU, which is causing BOINC to snooze. Especially if the value is at the default (25%?) it can be easy for the various other tasks to exceed that (briefly). I would set it to 75% or higher. Don't worry, the BOINC tasks still have low priority.
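A toy Python model of the threshold behaviour described above. It assumes BOINC pauses whenever a reading of non-BOINC CPU usage is above the configured percentage, and that a value of 0 turns the check off. The readings are invented for illustration, and `suspended_samples` is a hypothetical helper, not BOINC code.

```python
def suspended_samples(non_boinc_cpu_pct, threshold_pct):
    """Return the indices of readings at which BOINC would pause under this model.

    Model assumptions: BOINC pauses when non-BOINC CPU usage is above the
    threshold, and a threshold of 0 disables the check entirely.
    """
    if threshold_pct <= 0:
        return []
    return [i for i, usage in enumerate(non_boinc_cpu_pct) if usage > threshold_pct]


# Hypothetical readings of non-BOINC CPU usage (%), one per sampling interval.
readings = [5, 12, 40, 8, 30, 60, 10]

print(suspended_samples(readings, 25))  # short spikes above 25% pause BOINC
print(suspended_samples(readings, 75))  # none of these readings exceed 75%
```

With the low threshold, three of the seven readings would pause the work, which would look like CPU usage and fans cycling up and down; with the higher threshold none of them would.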

If you are also running work on your GPU, keep in mind that CPU is still required to service the active work on the GPU. I believe many set things to use at most some % of CPUs that leaves one core free to service the GPU work. I don't run GPU work, perhaps someone could reply with details on how to set up that arrangement.
Rosetta Moderator: Mod.Sense
ID: 94791 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1330
Credit: 13,624,788
RAC: 10
Message 94804 - Posted: 19 Apr 2020, 0:02:39 UTC - in response to Message 94791.  

I believe many set things to use at most some % of CPUs that leaves one core free to service the GPU work. I don't run GPU work, perhaps someone could reply with details on how to set up that arrangement.
I find it's best to reserve a core to support the GPU. If a GPU Task is running, it gets the CPU support it needs and it doesn't impact on the processing time of CPU Tasks that are running. If there is no GPU work being done, then the CPU core/thread is free to do CPU work.
The app_config.xml file needs to go into the Seti project folder.

If installed on the C: drive
C:/ProgramData/BOINC/projects/setiathome.berkeley.edu/app_config.xml

Make sure to use Notepad or similar to create or edit the file (NOT Word or Wordpad)
<app_config>
 <app>
  <name>setiathome_v8</name>
  <gpu_versions>
   <gpu_usage>1.0</gpu_usage>
   <cpu_usage>1.0</cpu_usage>
  </gpu_versions>
 </app>
 <app>
  <name>astropulse_v7</name>
  <gpu_versions>
   <gpu_usage>1.0</gpu_usage>
   <cpu_usage>1.0</cpu_usage>
  </gpu_versions>
 </app>
</app_config>
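Since the file is edited by hand, it can be worth checking that it is well-formed before restarting BOINC. Below is a minimal Python sketch using the standard library's XML parser; the `gpu_reservations` helper is a hypothetical name for this example, and the embedded text is simply the file shown above.

```python
import xml.etree.ElementTree as ET

# The app_config.xml content from the post, embedded here so the example is
# self-contained. In practice you would read it from the project folder.
APP_CONFIG = """
<app_config>
 <app>
  <name>setiathome_v8</name>
  <gpu_versions>
   <gpu_usage>1.0</gpu_usage>
   <cpu_usage>1.0</cpu_usage>
  </gpu_versions>
 </app>
 <app>
  <name>astropulse_v7</name>
  <gpu_versions>
   <gpu_usage>1.0</gpu_usage>
   <cpu_usage>1.0</cpu_usage>
  </gpu_versions>
 </app>
</app_config>
"""


def gpu_reservations(xml_text):
    """Parse app_config XML and return {app name: (gpu_usage, cpu_usage)}."""
    # ET.fromstring raises xml.etree.ElementTree.ParseError on malformed XML,
    # for example a missing closing tag.
    root = ET.fromstring(xml_text.strip())
    if root.tag != "app_config":
        raise ValueError(f"unexpected root element: {root.tag}")
    result = {}
    for app in root.findall("app"):
        name = app.findtext("name")
        gpu = app.findtext("gpu_versions/gpu_usage")
        cpu = app.findtext("gpu_versions/cpu_usage")
        if name is None or gpu is None or cpu is None:
            continue  # skip entries that do not set both values
        result[name] = (float(gpu), float(cpu))
    return result


for name, (gpu, cpu) in gpu_reservations(APP_CONFIG).items():
    print(f"{name}: {gpu} GPU and {cpu} CPU per task")
```

This only checks the XML structure and reads back the two usage values; it does not confirm that the application names match what the project actually sends.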

Grant
Darwin NT
ID: 94804 · Rating: 0
Admin
Project administrator
Joined: 1 Jul 05
Posts: 4804
Credit: 0
RAC: 0
Message 94807 - Posted: 19 Apr 2020, 0:23:59 UTC - in response to Message 94727.  

We are deprecating the 'rosetta_for_devices' app. The ARM platforms have been added to the 'rosetta' application group. We will also be deprecating the minirosetta app and will soon have just the rosetta app. There are still some minirosetta jobs in our queue.
ID: 94807 · Rating: 0

©2022 University of Washington
https://www.bakerlab.org