Jobs seem to complete OK but have status 'abandoned'

Questions and Answers : Unix/Linux : Jobs seem to complete OK but have status 'abandoned'

loris

Joined: 26 Mar 20
Posts: 7
Credit: 3,937
RAC: 0
Message 96009 - Posted: 4 May 2020, 12:29:31 UTC

Hi,

I am running jobs on a cluster via a resource manager. The batch script I use starts BOINC in the following manner:

boinc --no_gui_rpc --fetch_minimal_work --exit_when_idle --attach_project ${URL} ${AUTH}

The jobs seem to complete OK and do consume CPU time on the cluster, and there are no errors in the client log. However, the status shown on the R@H website often seems to be 'abandoned'.

Is the way I am calling BOINC incorrect?
ID: 96009 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4016
Credit: 0
RAC: 0
Message 96039 - Posted: 4 May 2020, 16:01:08 UTC - in response to Message 96009.  

Have you just changed to the new project URL with the S after the http?

That has been the only time I've seen "abandoned" work units personally.
Rosetta Moderator: Mod.Sense
ID: 96039 · Rating: 0
loris

Joined: 26 Mar 20
Posts: 7
Credit: 3,937
RAC: 0
Message 96154 - Posted: 6 May 2020, 7:49:00 UTC - in response to Message 96039.  

I changed the URL to https and a single job was subsequently completed and validated. However, of an array of 10 jobs started at the same time, 6 complete almost immediately with "exiting because no more results", but I think that is a different problem. I have already added some random delay to prevent too many requests for tasks happening at the same time, but perhaps this delay needs to be longer.
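
Editor's sketch of the staggering described above: each array task waits for a delay derived from its index plus some random jitter before it starts the client. The function name, the TASK_INDEX variable, and the 120 s / 30 s values are assumptions for illustration only. The thread does not establish that any particular delay avoids the "no more results" exit.

```bash
#!/usr/bin/env bash
# Sketch only: compute a start delay so that array tasks do not all
# contact the project at the same moment.

# stagger_delay INDEX SPACING JITTER
# Prints INDEX * SPACING plus a random 0..JITTER-1 seconds.
# JITTER must be at least 1.
stagger_delay() {
    local index=$1 spacing=$2 jitter=$3
    echo $(( index * spacing + RANDOM % jitter ))
}

# Usage inside a batch script (not executed here; TASK_INDEX stands in
# for whatever array index the resource manager provides):
#   sleep "$(stagger_delay "${TASK_INDEX:-0}" 120 30)"
#   boinc --no_gui_rpc --fetch_minimal_work --exit_when_idle \
#         --attach_project "${URL}" "${AUTH}"
```

Spacing by index gives a guaranteed minimum gap between tasks, which a purely random delay does not.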
ID: 96154 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4016
Credit: 0
RAC: 0
Message 96168 - Posted: 6 May 2020, 13:04:49 UTC - in response to Message 96154.  

Now that the DB is shared across all running R@h tasks, I doubt you need the delays. But I'm not sure what you mean by "happening at the same time": do you mean starting, or running? A delay wouldn't change how many eventually get running, so I think you mean you are staggering their start. I doubt you need this now with v4.20.
Rosetta Moderator: Mod.Sense
ID: 96168 · Rating: 0
loris

Joined: 26 Mar 20
Posts: 7
Credit: 3,937
RAC: 0
Message 96202 - Posted: 7 May 2020, 7:15:51 UTC - in response to Message 96168.  

Yes, I mean staggering. This does seem to be necessary although I still got

07-May-2020 08:57:57 [Rosetta@home] Not sending work - last request too recent: 0 sec

for one of four jobs started one minute apart.

What version are you referring to? I have client version 7.16.5.
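
Editor's sketch for checking how often this reply shows up: a small function that counts matching lines in captured client output. The function name and the file name client.log are hypothetical. The thread does not say where the client output is stored.

```bash
#!/usr/bin/env bash
# count_too_recent: reads client log text on stdin and prints how many
# lines contain the scheduler's "last request too recent" reply.
count_too_recent() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the function's exit status at zero in that case.
    grep -c 'Not sending work - last request too recent' || true
}

# Usage (not executed here; client.log is a hypothetical capture file):
#   count_too_recent < client.log
```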
ID: 96202 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4016
Credit: 0
RAC: 0
Message 96367 - Posted: 11 May 2020, 14:39:45 UTC - in response to Message 96202.  

I was referring to the Rosetta version. v4.20 made changes to share the large database directory across all active threads, rather than each expanding its own copy in each slot directory.
Rosetta Moderator: Mod.Sense
ID: 96367 · Rating: 0
Bryn Mawr

Joined: 26 Dec 18
Posts: 158
Credit: 4,851,671
RAC: 8,954
Message 96392 - Posted: 12 May 2020, 12:28:35 UTC

Please pardon my confusion, but why pair --fetch_minimal_work with --exit_when_idle?
ID: 96392 · Rating: 0
loris

Joined: 26 Mar 20
Posts: 7
Credit: 3,937
RAC: 0
Message 96623 - Posted: 19 May 2020, 7:30:59 UTC - in response to Message 96392.  

I am not sure I understand your question but I am trying to set things up so that each job I submit to the cluster just fetches a single r@h task.

Currently I am starting single jobs by hand with a separation of a couple of minutes, but each job seems to cause the previous job to be abandoned.
ID: 96623 · Rating: 0
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 754
Credit: 5,213,843
RAC: 21,680
Message 96624 - Posted: 19 May 2020, 9:03:02 UTC
Last modified: 19 May 2020, 9:06:01 UTC

With your computers hidden, helping you is pretty much impossible.


Having said that, BOINC is not designed to be run on a cluster, so that is most likely where your issues are.
Install BOINC on each system, attach to the project, and then things should work (as long as the hardware is sufficient).
Grant
Darwin NT
ID: 96624 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4016
Credit: 0
RAC: 0
Message 96628 - Posted: 19 May 2020, 15:49:07 UTC - in response to Message 96623.  

@loris, if you are submitting tasks to the Robetta server, these message boards are not the place to look for help.
Rosetta Moderator: Mod.Sense
ID: 96628 · Rating: 0
loris

Joined: 26 Mar 20
Posts: 7
Credit: 3,937
RAC: 0
Message 96645 - Posted: 20 May 2020, 6:16:37 UTC - in response to Message 96628.  

@loris, if you are submitting tasks to the Robetta server, these message boards are not the place to look for help.


Where is the correct place? I thought this forum was for questions relating to "Installing and running BOINC on Unix and Linux".
ID: 96645 · Rating: 0
loris

Joined: 26 Mar 20
Posts: 7
Credit: 3,937
RAC: 0
Message 96646 - Posted: 20 May 2020, 6:25:52 UTC - in response to Message 96624.  

In what way are my computers hidden?

Regarding the cluster, the software is installed (via NFS) on all nodes of the cluster. The problem, I think, is more to do with the way I start the jobs via the scheduling system. Possibly it is to do with the fact that the scheduler could try to start multiple jobs on one node. Perhaps max_ncpus_pct then applies to all the jobs, so all but one get terminated.
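
Editor's sketch for the case where several jobs land on one node: giving each job its own client data directory, keyed by node name and job ID, so that clients do not share state. The function name, the JOB_ID variable, and the base path are assumptions. To my knowledge the client accepts --dir to choose its data directory, but check `boinc --help` for your version. Whether this prevents the 'abandoned' status is not confirmed anywhere in this thread.

```bash
#!/usr/bin/env bash
# job_data_dir BASE HOST JOBID
# Prints a data directory path that is unique per node and per job.
job_data_dir() {
    local base=$1 host=$2 jobid=$3
    printf '%s/%s/job-%s\n' "$base" "$host" "$jobid"
}

# Usage inside a batch script (not executed here; JOB_ID stands in for
# whatever variable the resource manager provides, and --dir should be
# verified against the installed client):
#   dir=$(job_data_dir "${TMPDIR:-/tmp}/boinc" "$(hostname)" "${JOB_ID}")
#   mkdir -p "$dir"
#   boinc --dir "$dir" --no_gui_rpc --fetch_minimal_work --exit_when_idle \
#         --attach_project "${URL}" "${AUTH}"
```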
ID: 96646 · Rating: 0
Profile Grant (SSSF)

Joined: 28 Mar 20
Posts: 754
Credit: 5,213,843
RAC: 21,680
Message 96648 - Posted: 20 May 2020, 8:40:12 UTC - in response to Message 96646.  
Last modified: 20 May 2020, 8:43:21 UTC

In what way are my computers hidden?

loris
                  User ID 2120609
Rosetta@home member since 26 Mar 2020
                  Country International
              Total credit 3,588
     Recent average credit 122.29
                 Computers hidden

In your account page, Preferences, Preferences for this project "Rosetta@home preferences"
"Should Rosetta@home show your computers on its web site?" would be unselected.


As I posted before, BOINC was not designed to make use of a cluster.
It is for installing on individual computers and the Manager on each computer is responsible for getting work, downloading the appropriate application as required, and returning the results & reporting them.
Grant
Darwin NT
ID: 96648 · Rating: 0
loris

Joined: 26 Mar 20
Posts: 7
Credit: 3,937
RAC: 0
Message 96656 - Posted: 20 May 2020, 13:11:04 UTC - in response to Message 96648.  

Thanks for the info regarding my computers being hidden.

As far as installing on a cluster is concerned, I realize that is not what BOINC was designed for. However, since every node essentially behaves as an individual computer, I thought it wouldn't be too hard to get it to work. I'll try running a number of jobs serially and see how that goes.
ID: 96656 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4016
Credit: 0
RAC: 0
Message 96659 - Posted: 20 May 2020, 14:00:48 UTC - in response to Message 96645.  

@loris, sorry, we seem to be talking about two different things, and it sounds like you are indeed in the right place. You just (jokingly) have to put up with all of the questions about how you went about setting this up and are trying to run it. The simplest way would be to install BOINC on each machine and let each one make its own connection to the project for work. In that sense, the project never sees a cluster.
Rosetta Moderator: Mod.Sense
ID: 96659 · Rating: 0


©2020 University of Washington
https://www.bakerlab.org