Problems and Technical Issues with Rosetta@home




Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,649,459
RAC: 0
Message 79408 - Posted: 13 Jan 2016, 15:04:46 UTC - in response to Message 79406.  

I aborted them after 25+ hours of running.

Until this is fixed, I would suggest that rather than aborting such tasks, you simply restart BOINC (a full shutdown of BOINC and its work, then a re-launch), as this will usually cause any 'hung' tasks that are past their target runtimes to wrap up and report in for full credit. I'll leave any further troubleshooting to the admins.

**38 cores crunching for R@H on behalf of cancercomputer.org - a non-profit supporting High Performance Computing in Cancer Research
ID: 79408 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Steve

Joined: 22 Nov 15
Posts: 8
Credit: 164,345
RAC: 0
Message 79409 - Posted: 13 Jan 2016, 17:44:52 UTC - in response to Message 79362.  
Last modified: 13 Jan 2016, 18:11:49 UTC

A few things to consider:

Where did you set your preferences? Changes made in the BOINC Manager will override any web-based settings.

Double check the wording. In my version of BOINC Manager a box must be checked to keep tasks running while the computer is in use, whereas in the web-based prefs you must select the “no” radio button to achieve the same thing.

What I'm puzzled about is that BOINC is starting new tasks when older ones still are Waiting to Run...

This can happen if there isn’t enough memory to continue running a particular task. BOINC will set that one aside and try another. Rosetta tasks are among the most memory hungry tasks you will encounter in the BOINC world. So how much memory per core do you have and, more importantly, how much is BOINC allowed to use?

Could computer (not BOINC) sleep/hibernation settings be coming into play?

Thanks Snags - useful input. I have used local settings and the option window confirms that it's using those (it has a button to use prefs from the web but I haven't clicked that)

PC is a quad core with 12GB RAM, but it's running several large Java-based services, so memory typically runs around 80-90% used but with very little swapping. However, as I'm not using the largest of those services most days, I've now stopped it (releasing around 4GB) and will only run it when I need to access it. Rosetta tasks are usually under 200MB each in Task Manager, so that should now mean there's plenty of memory available.

Making previously suggested changes seems to have improved things somewhat (only one overdue task waiting this morning) so I'll see if the latest change does any better.

I saw you had 12GB RAM so didn't expect RAM to be an issue, but now that I read this, it is likely to have been a factor. My 8 concurrent tasks typically contribute 1.5GB out of 6.5GB RAM in use, but I have 16GB RAM in total to utilise.


Well, a week or so on after those option changes and things are much better although not entirely clean - I have 3 tasks that were deadlined 10th Jan, and sat 'waiting to run' for several days while BOINC chose to start running other tasks rather than resume these waiting tasks [I still wonder if this is correct behaviour].

Task Manager typically shows 75% memory in use (9GB out of 12GB) so there's no longer any memory shortage and I have unchecked all the Suspend options in BOINC except CPU above 70% (which it pretty much never is).
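As a rough check of the figures in this post, here is a minimal Python sketch of the memory headroom. It assumes (pessimistically) that the 75% already in use does not include the Rosetta tasks, that each task needs about 200MB, and that one task runs per core; the function name and parameters are made up for illustration, and BOINC's own "use at most X% of memory" limit is not modelled.

```python
def memory_headroom_gb(total_gb, used_fraction, n_tasks, task_mb):
    """Return the memory (in GB) left over after n_tasks of task_mb each are started.

    Hypothetical helper: a back-of-envelope estimate, not how BOINC itself decides.
    """
    free_gb = total_gb * (1.0 - used_fraction)   # memory not used by other programs
    needed_gb = n_tasks * task_mb / 1024.0       # memory the tasks would need
    return free_gb - needed_gb


# Figures from this post: 12GB total, 75% in use, quad core, ~200MB per task.
headroom = memory_headroom_gb(total_gb=12, used_fraction=0.75, n_tasks=4, task_mb=200)
print(f"Headroom with 4 tasks running: {headroom:.2f} GB")
```

A positive result suggests memory alone should not be forcing tasks into 'waiting to run', though individual Rosetta tasks can peak well above their typical footprint.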

I'm going to reboot now as this PC has been up since before Christmas, so I'll see if that causes any cleanup. *edit* After a reboot BOINC has started running my two oldest 'waiting' tasks (even though we are past the deadline of 10th Jan). It seems as though a reboot every couple of days is probably the answer.

cheers
Steve
ID: 79409 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Steve

Joined: 22 Nov 15
Posts: 8
Credit: 164,345
RAC: 0
Message 79410 - Posted: 13 Jan 2016, 18:19:47 UTC - in response to Message 79406.  
Last modified: 13 Jan 2016, 18:34:40 UTC

Hi All,

Been running Rosetta for a while and now encountering serious issues with near-endless or endless loops.
Normal running time is 6 hours per task, and half of the WUs seem to adhere to that; however, the other half is showing some weird behaviour:
1. Running forever without any estimated time left, going on for 20+ hours
as an example : nkid_1_3_2016_final3_0716_00058_0043.pdb343_TG_dez_fold_SAVE_ALL_OUT_322141_663_0
nkid_1_3_2016_final3_0692_00366_0042.pdb342_TG_dez_fold_SAVE_ALL_OUT_322134_678_0

2. Running forever, but with an estimated time left which keeps creeping up.
I don't have examples here; I aborted them after 25+ hours of running.

This happens on a laptop. On my desktop it seems to work well, although I have other issues there with the scheduling of Rosetta.

Could you please investigate?

Many thanks in advance!

Kind Regards,

B.E.

Hi B.E.

This looks very similar to my experience (see my recent posts and replies). May I ask: do you suspend/hibernate your laptop or do you shut down and then reboot?

My experience has been that some Rosetta tasks go 'waiting to run' or run with no remaining estimate on my system if I don't reboot for several days and they then don't complete for many hours; other Rosetta tasks complete with no issue.

If you are using suspend/hibernate (or just shutting the lid on the laptop) try rebooting it occasionally and see if that helps.

Best wishes
Steve
ID: 79410 · Rating: 0 · rate: Rate + / Rate - Report as offensive
BelgianEnthousiast

Joined: 25 May 15
Posts: 5
Credit: 1,023,045
RAC: 0
Message 79411 - Posted: 13 Jan 2016, 22:29:36 UTC

Hi Timo, Steve,

Thanks for your quick reactions & advice!

To answer your questions:
1) I tried exiting "gracefully" from BOINC Mgr, hoping that it would pick it up and wrap the WUs up,
but to no avail, unfortunately.
I also tried resetting the project, and it keeps giving these mixed results.

2) I do not hibernate my laptop or put it in (deep) sleep mode. I always shut it down fully.
Tried that multiple times too, but again without much success...

To add to my initial post, I have another trio hanging at the moment:
rb_01_10_61977_106329_T000__1C1_SAVE_ALL_OUT etc. (running for 12h14, no remaining estimate and only at 27.492 %)
shrtNTF2_2_UM_1_N16E92S12_noCH_NTF2_bb-610__1_0001 etc. (running for 9h44, no remaining estimate and at 89.556 % but no longer moving)
and another nkid_1_2_2016 etc. (running for 9h45, no remaining estimate and only at 11.186 %)

All the while I have a tj_2016A_insert_X_DHR53_DHR18 etc. running just fine, showing 2h19 elapsed, 3h47 remaining and at 34.973 %.
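To put those numbers side by side, here is a small Python sketch that linearly extrapolates each task's total runtime from its elapsed time and progress, and compares the elapsed time with the 6 hour target mentioned earlier in the thread. The linear extrapolation is an assumption for illustration only (BOINC computes its own estimates differently), and the task labels are shorthand, not the full task names.

```python
def hm(hours, minutes):
    """Convert an 'XhYY' reading into fractional hours."""
    return hours + minutes / 60.0


def projected_total_hours(elapsed_h, progress_pct):
    """Total runtime if progress continued at the average rate seen so far."""
    if progress_pct <= 0:
        return float("inf")  # no progress yet, so no finite projection
    return elapsed_h * 100.0 / progress_pct


TARGET_H = 6.0  # normal runtime per task, as stated earlier in the thread

# (elapsed hours, progress %) as reported above; labels are shorthand.
tasks = {
    "rb_01_10 (hanging)": (hm(12, 14), 27.492),
    "shrtNTF2 (hanging)": (hm(9, 44), 89.556),
    "nkid_1_2_2016 (hanging)": (hm(9, 45), 11.186),
    "tj_2016A (running fine)": (hm(2, 19), 34.973),
}

for name, (elapsed, pct) in tasks.items():
    total = projected_total_hours(elapsed, pct)
    status = "past target" if elapsed > TARGET_H else "within target"
    print(f"{name}: {elapsed:.2f}h elapsed, ~{total:.1f}h projected total, {status}")
```

All three hanging tasks are already past the 6 hour target, while the healthy one projects to a total in the same range as its elapsed plus remaining time.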

Nice evening !

B.E.
ID: 79411 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Chris

Joined: 2 Nov 05
Posts: 1
Credit: 2,687,913
RAC: 0
Message 79415 - Posted: 15 Jan 2016, 20:50:28 UTC

I've been having problems downloading jobs. I'm using the mobile program. Any help would be awesome!
ID: 79415 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 79418 - Posted: 17 Jan 2016, 7:06:06 UTC - in response to Message 79415.  

I've been having problems downloading jobs. I'm using the mobile program. Any help would be awesome!


There is only very limited Android capability, and tasks that can utilize it are only generated sporadically. There is not a steady stream of work available that can run on that platform.
Rosetta Moderator: Mod.Sense
ID: 79418 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Grimshiire

Joined: 14 May 06
Posts: 1
Credit: 142,733
RAC: 0
Message 79694 - Posted: 4 Mar 2016, 21:40:40 UTC

Since the latest Rosetta application (3.71) has been running, I've experienced multiple Rosetta projects becoming non-responsive. Progress % doesn't increase even after letting it run for 24 hours. Projects are no longer downloaded, and there are no issues with the other projects that are downloaded & executed.
Is there an existing issue with the current Rosetta application?
ID: 79694 · Rating: 0 · rate: Rate + / Rate - Report as offensive
rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 23,054,272
RAC: 5,361
Message 79695 - Posted: 5 Mar 2016, 4:14:42 UTC - in response to Message 79694.  

Since the latest Rosetta application (3.71) has been running, I've experienced multiple Rosetta projects becoming non-responsive. Progress % doesn't increase even after letting it run for 24 hours. Projects are no longer downloaded, and there are no issues with the other projects that are downloaded & executed.
Is there an existing issue with the current Rosetta application?



There have been a number of problems with the "backrub" workloads (like your "02_2016_2fbn_backrub_design_327089_168_0"). They execute significantly into the workload and then start requesting memory. The failure mode indicates that the command line that starts up Rosetta wants too much memory.

IMO, delete the backrub tasks.

Here is a search of the forums referencing the backrub tasks.
https://boinc.bakerlab.org/rosetta/forum_search_action.php


ID: 79695 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Erich56

Joined: 11 Jan 16
Posts: 35
Credit: 1,437,503
RAC: 0
Message 79722 - Posted: 8 Mar 2016, 17:52:20 UTC

Since today (March 8th), around noon UTC, I have been experiencing numerous "compute errors". Almost 80% of the tasks are terminated after varying lengths of time, showing this error.
Has anyone else had the same experience?
ID: 79722 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 79723 - Posted: 8 Mar 2016, 17:58:32 UTC - in response to Message 79722.  

Since today (March 8th), around noon UTC, I have been experiencing numerous "compute errors". Almost 80% of the tasks are terminated after varying lengths of time, showing this error.
Has anyone else had the same experience?


Those "supercoil_trimer" jobs fail on my Linux computer, too.

PS: How about some pagination for this thread? It takes forever to load...
ID: 79723 · Rating: 0 · rate: Rate + / Rate - Report as offensive
TJ

Joined: 29 Mar 09
Posts: 127
Credit: 4,799,890
RAC: 0
Message 79725 - Posted: 8 Mar 2016, 19:12:11 UTC

No new work here, while there are enough WUs according to the server status.

It has been going on for about a week now. All tasks are ready but are not reported, and no new work arrives.
Doing a manual Update helps sometimes.
Now it says: Communication deferred 23:55:55, so a whole day.
This happened to me twice before in the last week or so. A manual update does not help; it only restarts the countdown from 24 hours.
So the project won't get my computers anymore. This project has been troublesome for a very long time, years, and is still on the old server software.

My computers can do other work too.
Greetings,
TJ.
ID: 79725 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 79726 - Posted: 8 Mar 2016, 19:37:21 UTC

It appears the project has had over a thousand new hosts come in each of the last several days. This puts an abnormally high load on the servers. I believe this is why folks are seeing retries and failures to get work and report completed tasks.

New work is being generated, and results are being accepted, just apparently not fast enough for all requests to be satisfied. Yesterday I saw the number of outstanding tasks jump from about 500,000 to over a million in just about half an hour. So not only were half a million new tasks assigned during that time, but also new tasks to cover all of those reported back as completed during that period.

The automatic retries the BOINC Manager performs will resolve the problem. Moving to a longer runtime preference will help reduce load on the servers. And a manual update to the server will eventually hit a window when tasks are available. But reboots, or detaching/reattaching the project, are NOT what makes things work. It is simply the timing of when your request hits the server, and what it has available.
Rosetta Moderator: Mod.Sense
ID: 79726 · Rating: 0 · rate: Rate + / Rate - Report as offensive
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 79727 - Posted: 8 Mar 2016, 19:48:30 UTC

More work will be issued soon!
ID: 79727 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 79729 - Posted: 8 Mar 2016, 20:00:22 UTC

11,476 new hosts registered from March 1 through March 8. So about 1,500 per day.

Normal is about 300 per day.
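The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: it assumes "March 1 through March 8" is counted inclusively as 8 days, and the variable names are chosen here for illustration.

```python
new_hosts = 11_476       # hosts registered March 1 through March 8 (from this post)
days = 8                 # assumption: both end dates counted
normal_per_day = 300     # typical daily registrations (from this post)

per_day = new_hosts / days
ratio = per_day / normal_per_day

print(f"{per_day:.1f} new hosts per day, about {ratio:.1f}x the normal rate")
```

Under that assumption the average is a little over 1,430 hosts per day, which is in line with the rough "about 1,500" figure and a little under five times the normal rate.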
Rosetta Moderator: Mod.Sense
ID: 79729 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Sid Celery

Joined: 11 Feb 08
Posts: 2125
Credit: 41,249,734
RAC: 9,368
Message 79737 - Posted: 9 Mar 2016, 2:34:51 UTC - in response to Message 79729.  

11,476 new hosts registered from March 1 through March 8. So about 1,500 per day.

Normal is about 300 per day.

As shown here

Rosetta Users Overview
ID: 79737 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 79740 - Posted: 9 Mar 2016, 9:08:12 UTC - in response to Message 79737.  

How could you double the number of users overnight, even if you wanted to? Is it a measurement anomaly?
ID: 79740 · Rating: 0 · rate: Rate + / Rate - Report as offensive
dcdc

Joined: 3 Nov 05
Posts: 1832
Credit: 119,664,803
RAC: 11,191
Message 79743 - Posted: 9 Mar 2016, 14:39:37 UTC - in response to Message 79740.  

How could you double the number of users overnight, even if you wanted to? Is it a measurement anomaly?

Another Charity Engine surge?
ID: 79743 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 79744 - Posted: 9 Mar 2016, 14:44:53 UTC - in response to Message 79743.  

Another Charity Engine surge?


Yes, many of the user names were "ce" followed by numbers.

Rosetta Moderator: Mod.Sense
ID: 79744 · Rating: 0 · rate: Rate + / Rate - Report as offensive
BarryAZ

Joined: 27 Dec 05
Posts: 153
Credit: 30,843,285
RAC: 0
Message 79745 - Posted: 9 Mar 2016, 16:31:11 UTC

I've been encountering two problems of late.

1) For whatever reason, reporting gets stalled periodically and requires a manual 'push' -- which then reports 5, 10 or more work units and downloads work units -- this does not happen with any other project.

2) A small but significant number of work units throw off computation errors -- this has happened on several systems and it does so after processing for anywhere from 1 to 5 hours.

ID: 79745 · Rating: 0 · rate: Rate + / Rate - Report as offensive
Jim1348

Joined: 19 Jan 06
Posts: 881
Credit: 52,257,545
RAC: 0
Message 79746 - Posted: 9 Mar 2016, 17:02:41 UTC - in response to Message 79745.  
Last modified: 9 Mar 2016, 17:05:07 UTC

1) For whatever reason, reporting gets stalled periodically and requires a manual 'push' -- which then reports 5, 10 or more work units and downloads work units -- this does not happen with any other project.

I have seen that too. I attributed it to the fact that I was coming off a CPDN work unit that took 8 or 9 days to complete, and the BOINC scheduler had not picked up on that yet. But another thought is that the BOINC server version on Rosetta is rather old (it is said), and may not work so well with the latest BOINC clients; I am using 7.6.22 or 7.6.29, depending on the machine.

I have not seen the errors yet, but just started again a week ago, with 40 successes thus far (24 hour runs).
ID: 79746 · Rating: 0 · rate: Rate + / Rate - Report as offensive



©2024 University of Washington
https://www.bakerlab.org