Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Timo Joined: 9 Jan 12 Posts: 185 Credit: 45,649,459 RAC: 0 |
I aborted them after 25+ hours of running. Until this is fixed, I would suggest that, rather than aborting such tasks, you simply restart BOINC (a full shutdown of BOINC and its running work, then a re-launch), as this will usually cause any 'hung' tasks that are past their target runtimes to wrap up and report in for full credit. I'll leave any further troubleshooting to the admins. 38 cores crunching for R@H on behalf of cancercomputer.org - a non-profit supporting High Performance Computing in Cancer Research |
Steve Joined: 22 Nov 15 Posts: 8 Credit: 164,345 RAC: 0 |
A few things to consider: Well, a week or so on after those option changes and things are much better although not entirely clean - I have 3 tasks that were deadlined 10th Jan, and sat 'waiting to run' for several days while BOINC chose to start running other tasks rather than resume these waiting tasks [I still wonder if this is correct behaviour]. Task Manager typically shows 75% memory in use (9GB out of 12GB) so there's no longer any memory shortage and I have unchecked all the Suspend options in BOINC except CPU above 70% (which it pretty much never is). I'm going to reboot now as this PC has been up since before Christmas, so I'll see if that causes any cleanup. *edit* After a reboot BOINC has started running my two oldest 'waiting' tasks (even though we are past the deadline of 10th Jan). It seems as though a reboot every couple of days is probably the answer. cheers Steve |
Steve Joined: 22 Nov 15 Posts: 8 Credit: 164,345 RAC: 0 |
Hi All, Hi B.E. This looks very similar to my experience (see my recent posts and replies). May I ask: do you suspend/hibernate your laptop or do you shut down and then reboot? My experience has been that some Rosetta tasks go 'waiting to run' or run with no remaining estimate on my system if I don't reboot for several days and they then don't complete for many hours; other Rosetta tasks complete with no issue. If you are using suspend/hibernate (or just shutting the lid on the laptop) try rebooting it occasionally and see if that helps. Best wishes Steve |
BelgianEnthousiast Joined: 25 May 15 Posts: 5 Credit: 1,023,045 RAC: 0 |
Hi Timo, Steve, Thanks for your quick reactions and advice! To answer your questions: 1) I tried exiting "gracefully" from BOINC Mgr, hoping that it would pick it up and wrap the WUs up, but to no avail unfortunately. I also tried resetting the project, and it keeps giving these mixed results. 2) I do not hibernate my laptop or put it in (deep) sleep mode. I always shut it down fully. Tried that multiple times too, but again without much success... To add to my initial post, I have another trio hanging at the moment: rb_01_10_61977_106329_T000__1C1_SAVE_ALL_OUT etc. (running for 12h14, no remaining estimate and only at 27.492 %), shrtNTF2_2_UM_1_N16E92S12_noCH_NTF2_bb-610__1_0001 etc. (running for 9h44, no remaining estimate and at 89.556 % but no longer moving) and another nkid_1_2_2016 etc. (running for 9h45, no remaining estimate and only at 11.186 %), all the while I have a tj_2016A_insert_X_DHR53_DHR18 etc. running just fine, showing 2h19, 3h47 remaining and at 34.973 %. Nice evening! B.E. |
Chris Joined: 2 Nov 05 Posts: 1 Credit: 2,687,913 RAC: 0 |
I've been having problems downloading jobs. I'm using the mobile program. Any help would be awesome! |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I've been having problems downloading jobs. I'm using the mobile program. Any help would be awesome! There is only very limited Android capability, and tasks that can utilize it are only generated sporadically. There is not a steady stream of work available that can run on that platform. Rosetta Moderator: Mod.Sense |
Grimshiire Joined: 14 May 06 Posts: 1 Credit: 142,733 RAC: 0 |
Since the latest Rosetta application (3.71) has been running, I've experienced multiple Rosetta projects becoming non-responsive. Progress % doesn't increase even after letting it run for 24 hours. No longer will projects be downloaded & there are no issues with the other projects that are downloaded & executed. Is there an existing issue with the current Rosetta application? |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 22,572,678 RAC: 2,754 |
Since the latest Rosetta application (3.71) has been running, I've experienced multiple Rosetta projects becoming non-responsive. Progress % doesn't increase even after letting it run for 24 hours. No longer will projects be downloaded & there are no issues with the other projects that are downloaded & executed. There have been a number of problems with the "backrub" workloads (like your "02_2016_2fbn_backrub_design_327089_168_0"). They execute significantly into the workload and then start requesting memory. The failure mode indicates the command line that starts up Rosetta wants too much memory. IMO, delete the backrub tasks. Here is a search of the forums referencing the backrub tasks. https://boinc.bakerlab.org/rosetta/forum_search_action.php |
Erich56 Joined: 11 Jan 16 Posts: 35 Credit: 1,437,503 RAC: 0 |
Since today (March 8th), around noon UTC, I am experiencing numerous "compute errors". Almost 80% of the tasks are terminated after various time lengths, showing this error. Has anyone else had the same experience? |
Dr. Merkwürdigliebe Joined: 5 Dec 10 Posts: 81 Credit: 2,657,273 RAC: 0 |
Since today (March 8th), around noon UTC, I am experiencing numerous "compute errors". Almost 80% of the tasks are terminated after various time lengths, showing this error. Those "supercoil_trimer" jobs fail on my Linux computer, too. PS: How about some pagination for this thread? It takes forever to load... |
TJ Joined: 29 Mar 09 Posts: 127 Credit: 4,799,890 RAC: 0 |
No new work here, while there are enough WUs according to the server status. This has been going on for about a week now. All tasks are ready, but then they are not reported and no new work arrives. Doing a manual update helps sometimes. Now it says: Communication deferred 23:55:55, so a whole day. This happened to me two times earlier in the last week or so. A manual update does not help; it only restarts the countdown from 24 hours. So the project won't use my computers anymore. This project has been troublesome for a very long time, years, and it still runs the old server software. My computers can do other work too. Greetings, TJ. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
It appears the project has had over a thousand new hosts come in on each of the last several days. This puts an abnormally high load on the servers. I believe this is why folks are seeing retries and failures to get work and report completed tasks. New work is being generated, and results are being accepted, just apparently not fast enough for all requests to be satisfied. Yesterday I saw the number of outstanding tasks jump from about 500,000 to over a million in just about half an hour. So not only were half a million new tasks assigned during that time, but also new tasks to cover all of those reported back as completed during that period. The automatic retries the BOINC Manager performs will resolve the problem. Moving to a longer runtime preference will help reduce load on the servers. And a manual update to the server will eventually hit a window when tasks are available. But reboots, or detaching/reattaching the project, are NOT what makes things work. It is simply the timing of when your request hits the server, and what it has available. Rosetta Moderator: Mod.Sense |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
More work will be issued soon! |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
11,476 new hosts registered from March 1 through March 8. So about 1,500 per day. Normal is about 300 per day. Rosetta Moderator: Mod.Sense |
Sid Celery Joined: 11 Feb 08 Posts: 2082 Credit: 40,621,050 RAC: 4,944 |
11,476 new hosts registered from March 1 through March 8. So about 1,500 per day. As shown here Rosetta Users Overview |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
How can you double the number of users overnight if you wanted to? Is it a measurement anomaly? |
dcdc Joined: 3 Nov 05 Posts: 1831 Credit: 119,210,217 RAC: 1,667 |
How can you double the number of users overnight if you wanted to? Is it a measurement anomaly? Another Charity Engine surge? |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Another Charity Engine surge? Yes, many of the user names were "ce" followed by numbers. Rosetta Moderator: Mod.Sense |
BarryAZ Joined: 27 Dec 05 Posts: 153 Credit: 30,843,285 RAC: 0 |
I've been encountering two problems of late. 1) For whatever reason, reporting gets stalled periodically and requires a manual 'push' -- which then reports 5, 10 or more work units and downloads work units -- this does not happen with any other project. 2) A small but significant number of work units throw off computation errors -- this has happened on several systems and it does so after processing for anywhere from 1 to 5 hours. |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
1) For whatever reason, reporting gets stalled periodically and requires a manual 'push' -- which then reports 5, 10 or more work units and downloads work units -- this does not happen with any other project. I have seen that too. I attributed it to the fact that I was coming off a CPDN work unit that took 8 or 9 days to complete, and the BOINC scheduler had not picked up on that yet. But another thought is that the BOINC server version on Rosetta is rather old (it is said), and may not work so well with the latest BOINC clients; I am using 7.6.22 or 7.6.29, depending on the machine. I have not seen the errors yet, but just started again a week ago, with 40 successes thus far (24 hour runs). |
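
For anyone wanting to sanity-check the load figures Mod.Sense quoted earlier in the thread, here is a minimal Python sketch of the arithmetic. The inputs are the numbers from those posts. Treating March 1 through March 8 as 8 full days, and using exactly 1,000,000 for the "over a million" figure, are assumptions, so the task rate is only a lower bound (it also ignores tasks issued to replace completed ones).

```python
# Back-of-envelope load figures taken from Mod.Sense's posts in this thread.
NEW_HOSTS = 11_476           # hosts registered March 1 through March 8
DAYS = 8                     # assumption: both end dates are counted
NORMAL_HOSTS_PER_DAY = 300   # "normal" rate quoted in the thread

hosts_per_day = NEW_HOSTS / DAYS
surge_factor = hosts_per_day / NORMAL_HOSTS_PER_DAY

# Outstanding tasks went from about 500,000 to over a million in about half an hour.
TASKS_BEFORE = 500_000
TASKS_AFTER = 1_000_000      # assumption: lower bound for "over a million"
WINDOW_MINUTES = 30          # "about half an hour"

net_new_tasks = TASKS_AFTER - TASKS_BEFORE
tasks_per_minute = net_new_tasks / WINDOW_MINUTES
tasks_per_second = tasks_per_minute / 60

print(f"New hosts per day: {hosts_per_day:.1f}")
print(f"Multiple of the normal rate: {surge_factor:.2f}x")
print(f"Net new outstanding tasks per minute (lower bound): {tasks_per_minute:.1f}")
print(f"Net new outstanding tasks per second (lower bound): {tasks_per_second:.1f}")
```

On those assumptions the daily average comes out a little under the "about 1,500" quoted, which is still several times the normal 300 per day.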
©2024 University of Washington
https://www.bakerlab.org