Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Sid Celery Joined: 11 Feb 08 Posts: 2073 Credit: 40,602,258 RAC: 5,342 |
Where did all the WUs go? There were loads to download the last time I looked. Now none. Went up to 14k tasks, then all gone again. Something weird happening. |
robertmiles Joined: 16 Jun 08 Posts: 1229 Credit: 14,172,067 RAC: 1,295 |
Quoted: "If you look at the top ten computers https://boinc.bakerlab.org/rosetta/top_hosts.php?sort_by=expavg_credit&offset=0, the first 4 places are occupied by [DPC] Nifhack with AMD: [snip]" Looks like the main limitation of CPUs with this many processors is not the number of processors, but the speed of the memory that all the processors in the same package share. If so, some of these processors could even be beyond the point where deciding which processor gets to make the next memory access takes up enough of the run time to cause a significant slowdown. You might also look up the cache size inside each of these CPUs. Competing for cache space could also cause a significant slowdown. |
Sam Joined: 9 Mar 06 Posts: 3 Credit: 3,281,268 RAC: 1,312 |
Hi Franko, I get the same error (and not only on Rosetta@home). I think my hard disk is too busy to output a 'finished file' in time for BOINC. You can just ignore it, because most of the time your workunits are fine. Sjmielh |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
Quoted: "I get the same error (and not only on Rosetta@home). I think my hard disk is too busy to output a 'finished file' on time for Boinc." That is an interesting thought. I have not seen that error for a long time, and I now use only SSDs on all my machines. Also, I usually use a write cache (or ramdisk), so most of my writes, and even reads, are served from main memory. I think that does it. |
fcbrants Joined: 25 Mar 13 Posts: 13 Credit: 3,933,177 RAC: 0 |
Quoted: "I get the same error (and not only on Rosetta@home). I think my hard disk is too busy to output a 'finished file' on time for Boinc." This machine uses two SSDs in RAID 0 on a Dell PERC H710 RAID card with 1 GB of RAM (which could be the source of the problem), with the write policy set to "Write Back", which is defined as, "In Write Back mode the controller sends a data transfer completion signal to the host when the controller cache has received all of the data in a transaction." For some reason, Windows Explorer (Exploder?) hangs when this machine is NOT under load, AND I have several Windows Explorer windows open. Is there a way to increase this timeout to accommodate this machine's peculiarities? Thanks!! Franko |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 22,572,458 RAC: 4,322 |
The error message is displayed by the BOINC Client. I think it is just a BOINC Client timing issue that they have declared "fixed" several times. I don't think it is ever a problem, just annoying. From client/app_control.cpp:

    // Check for finish files every 10 sec.
    // If we already found a finish file, abort the app;
    // it must be hung somewhere in boinc_finish();
    //
    static double last_finish_check_time = 0;
    if (gstate.clock_change || gstate.now - last_finish_check_time > 10) {
        last_finish_check_time = gstate.now;
        for (i=0; i<active_tasks.size(); i++) {
            ACTIVE_TASK* atp = active_tasks[i];
            if (atp->task_state() == PROCESS_UNINITIALIZED) continue;
            if (atp->finish_file_time) {
                // process is still there 10 sec after it wrote finish file.
                // abort the job
                atp->abort_task(EXIT_ABORTED_BY_CLIENT, "finish file present too long");  // <-- line 140
            } else if (atp->finish_file_present()) {
                atp->finish_file_time = gstate.now;
            }
        }
    }
|
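The two-pass logic in that excerpt can be tried out in isolation. What follows is only a minimal sketch, not BOINC code: `Task` and `check_finish_files` are hypothetical stand-ins for `ACTIVE_TASK` and the client's polling loop, and the 10-second spacing is assumed to be handled by the caller. It shows why a task that is slow to exit after writing its finish file gets aborted on the second pass.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for BOINC's ACTIVE_TASK; only what the check needs.
struct Task {
    bool running = true;               // app process still exists
    bool finish_file_on_disk = false;  // app has written its finish file
    double finish_file_time = 0;       // 0 means "finish file not seen yet"
    std::string abort_reason;          // set when the client gives up on the task
};

// One polling pass over all tasks; the caller is assumed to run it
// roughly every 10 seconds, as the excerpt does.
void check_finish_files(std::vector<Task>& tasks, double now) {
    assert(now > 0);  // 0 is reserved as the "not seen yet" marker
    for (Task& t : tasks) {
        if (!t.running) continue;
        if (t.finish_file_time != 0) {
            // Seen on an earlier pass and the process is still here: abort.
            t.abort_reason = "finish file present too long";
            t.running = false;
        } else if (t.finish_file_on_disk) {
            // First sighting: record the time and allow one more interval.
            t.finish_file_time = now;
        }
    }
}
```

In this sketch, a task whose finish file is first seen at one pass and whose process is still alive at the next pass is aborted, while a task that exits between the two passes is left alone. That would be consistent with the idea in this thread that a busy disk or a briefly hung machine can trigger the message.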
fcbrants Joined: 25 Mar 13 Posts: 13 Credit: 3,933,177 RAC: 0 |
Quoted: "The error message is displayed by the BOINC Client." Thanks, but after looking at the affected tasks, it looks like the result was discarded & no credit granted. That said, it's looking more & more like this was a problem with my Dell PERC H710P RAID card. The machine was sluggish as hell with the disk cache write-back enabled & everything really went south (machine became unbootable) after I tried a backup. Fiddled with it for days, finally pulled the backup battery off the card, which disabled the cache, & let it sit overnight. Next morning, reinstalled the card, and back on go. Jacked my "use at most" CPUs back up to 100% & the machine is still snappy. Back to Munching & Crunching ;) Thanks for looking this up for me; if I run into problems again, I will try increasing this timeout. Franko |
fcbrants Joined: 25 Mar 13 Posts: 13 Credit: 3,933,177 RAC: 0 |
Quoted: "The error message is displayed by the BOINC Client." Dang it, I'm still getting the same error. I tried to find the file app_control.cpp, but couldn't find it. Is this a file I can edit? Thanks!! Franko |
robertmiles Joined: 16 Jun 08 Posts: 1229 Credit: 14,172,067 RAC: 1,295 |
Quoted: "Dang it, I'm still getting the same error. [snip]" Files with the .cpp extension are usually C++ source files, which can be edited. However, doing so is not useful unless: 1. You have a copy of the file. Most BOINC downloads do not include the source files; you have to know where to find the source files and download the entire package of source files. 2. You know enough C++ to make useful edits. 3. You have all of the compilers installed to compile the entire program for your operating system. 4. You have the instructions to compile all source files needed, and then link them into a new version of the program. 5. You know how to substitute the new version of the program for the old version. |
fcbrants Joined: 25 Mar 13 Posts: 13 Credit: 3,933,177 RAC: 0 |
Quoted: "Dang it, I'm still getting the same error." Got it, thanks!! I spent some more time with this machine running at 100% (32 Rosetta tasks + 1 SETI task on the GPU) & it DID hang occasionally, which would explain this error. As this is also my daily driver, I backed the "Use at most CPUs" option down to 93.75% (30 of 32 threads) & I haven't seen the problem since. Problem resolved. Thanks!! Franko |
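The 93.75% figure is just 30 divided by 32. Below is a small sketch of that arithmetic in both directions. The helper names are hypothetical, and the rounding rule in `threads_for_percent` (round down, never below one thread) is an assumption for illustration, not something confirmed in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Percentage to enter so that `use` of `total` threads are available to BOINC.
double percent_for_threads(int use, int total) {
    assert(total > 0 && use >= 0 && use <= total);
    return 100.0 * use / total;
}

// Threads implied by a percentage setting.
// Assumption: the result is rounded down and is never less than one thread.
int threads_for_percent(double percent, int total) {
    assert(total > 0 && percent >= 0.0);
    int n = static_cast<int>(std::floor(total * percent / 100.0));
    return std::max(1, n);
}
```

With these helpers, `percent_for_threads(30, 32)` gives 93.75, and under the stated rounding assumption `threads_for_percent(93.75, 32)` gives back 30.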
anklab Joined: 1 Jun 10 Posts: 1 Credit: 9,548,113 RAC: 211 |
Hi! Recently, I have noticed that WUs that run for a long time are granted about the same credit as WUs that run for a short time. For example, my computers are an Intel Core2Duo E8500 and an Intel Core i5-2500. The E8500 gets WUs with 4 hours of crunching, the i5-2500 with 24 hours. It is strange that different tasks with different amounts of work are granted nearly equal credit. Core i5-2500 // 24 hours // granted 160.33. E8500 // 4 hours // granted 152.93. Much earlier, the i5-2500 received approximately 800~850 credits for each completed WU. What can I do? |
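One way to put a number on this complaint is to divide the granted credit by the run time for each of the two results in the post. The sketch below does only that arithmetic. `credit_per_hour` is a hypothetical helper, and nothing here describes how Rosetta@home actually computes credit.

```cpp
#include <cassert>

// Granted credit divided by run time in hours.
double credit_per_hour(double credit, double hours) {
    assert(hours > 0);
    return credit / hours;
}
```

For the figures in the post, `credit_per_hour(160.33, 24)` is about 6.68 for the i5-2500 and `credit_per_hour(152.93, 4)` is about 38.23 for the E8500, so the 24-hour task earned roughly one sixth as much per hour.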
LarryMajor Joined: 1 Apr 16 Posts: 22 Credit: 31,533,212 RAC: 0 |
Quoted: "Much earlier, i5-2500 received for each completed WU approximately 800~850 credits." I'd do nothing for a few days. It appears that the recent WUs/scoring caused a big drop. Mine started to look more typical in the past 24 hours. |
[AF>Le_Pommier] Jerome_C2005 Joined: 22 Aug 06 Posts: 38 Credit: 1,258,039 RAC: 0 |
Hi, I have tasks erroring after 10 hours of calculation: <core_client_version>7.14.2</core_client_version> A few did succeed from the same lot after the same amount of calculation time: <core_client_version>7.14.2</core_client_version> |
wshadw Joined: 3 Dec 18 Posts: 1 Credit: 0 RAC: 0 |
I am getting a message of "Abandoned by Project" on too many workunits. With 8 hour workunits this is unacceptable and since I compute in the Gridcoin pool I cannot change my settings. |
robertmiles Joined: 16 Jun 08 Posts: 1229 Credit: 14,172,067 RAC: 1,295 |
Quoted: "I am getting a message of "Abandoned by Project" on too many workunits. With 8 hour workunits this is unacceptable and since I compute in the Gridcoin pool I cannot change my settings." Could this mean that your computer is so slow that two other computers have finished the workunit before yours does? Does your computer finish workunits before their deadlines? |
Arnav Sood Joined: 20 Aug 18 Posts: 2 Credit: 11,782,086 RAC: 0 |
Have been unable to upload work units since yesterday (two have timed out). Keeps telling me "project backoff." I'm on an iMac Pro 2017 running macOS 10.14 Mojave and BOINC 7.12 |
fcbrants Joined: 25 Mar 13 Posts: 13 Credit: 3,933,177 RAC: 0 |
Quoted: "Have been unable to upload work units since yesterday (two have timed out). Keeps telling me 'project backoff.'" I just checked my logs back to 12/10 15:00 CST & it looks like I've been uploading continuously, uninterrupted. Win64 BOINC 7.12.1. |
Jim1348 Joined: 19 Jan 06 Posts: 881 Credit: 52,257,545 RAC: 0 |
I was away from home (of course), and Rosetta took out my i7-4770. Everything was frozen up. I have never seen that before for Rosetta. Apparently it was this work unit: https://boinc.bakerlab.org/result.php?resultid=1046921926 <core_client_version>7.12.0</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255)</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.07_i686-pc-linux-gnu @foldit_2006238_0004_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_foldit_2006238_0004_data.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 2498717 Starting watchdog... Watchdog active. Starting watchdog... Watchdog active. ERROR: Unable to open database file for dun10 rotamer library: minirosetta_database/rotamer/shapovalov/StpDwn_0-0-0/cys.bbdep.rotamers.lib ERROR:: Exit from: src/core/pack/dunbrack/RotamerLibrary.cc line: 1085 BACKTRACE: [0xe8ca514] [0xca17443] [0xca178ce] [0xca92145] [0xc90133c] [0xc9ef641] [0xd019a4b] [0xd3e6e18] [0xd3eb9ce] [0xc96b2d1] [0xc963eb2] [0xb7fef3f] [0xac8f844] [0x9404246] [0x9299a6c] [0xc232777] [0xc234a84] [0xc2f46c0] [0xc2f323b] [0x929e531] [0x8054670] [0xedcf791] [0xedcf98d] [0x8266087] BOINC:: Error reading and gzipping output datafile: default.out 14:21:38 (2187): called boinc_finish(1) </stderr_txt> Rosetta is the only project I have running on that machine (limited to six cores, with two cores free); I don't even have a GPU installed. It probably won't happen again, but once is enough. EDIT: I updated Ubuntu 16.04, and upon reboot, picked up this in my BOINC log. I have never seen it before, and have no idea what it means. 
6 Rosetta@home 12/14/2018 2:51:39 PM [error] App version has unsupported platform i686-pc-linux-gnu; changing to x86_64-pc-linux-gnu 7 Rosetta@home 12/14/2018 2:51:39 PM [error] State file error: duplicate app version: minirosetta x86_64-pc-linux-gnu 378 8 Rosetta@home 12/14/2018 2:51:39 PM [error] App version has unsupported platform i686-pc-linux-gnu; changing to x86_64-pc-linux-gnu But everything appears to be back to normal, and Rosetta is running OK now. |
Killersocke@rosetta Joined: 13 Nov 06 Posts: 29 Credit: 2,579,125 RAC: 0 |
To my surprise, I see 24 tasks in my profile listed as sent to my PC. In reality I have 10 in my BOINC Manager. What's going on there? Application Rosetta 4.07 Name foldit_2006238_0005_fold_and_dock_SAVE_ALL_OUT_707998_5433 Status Suspended by user, received; Application Rosetta 4.07 Name foldit_2006238_0002_fold_and_dock_SAVE_ALL_OUT_707992_5434 Status Suspended by user, received; Application Rosetta 4.07 Name foldit_2006254_0004_fold_and_dock_SAVE_ALL_OUT_708044_5432 Status Suspended by user, received, slots/2; Application Rosetta 4.07 Name foldit_2006238_0003_fold_and_dock_SAVE_ALL_OUT_707994_5434 Status Suspended by user, received, slots/7; Application Rosetta 4.07 Name foldit_2006238_1059_fold_and_dock_SAVE_ALL_OUT_708020_5431 Status Suspended by user, received, slots/5; Application Rosetta 4.07 Name foldit_2006238_1059_fold_and_dock_SAVE_ALL_OUT_708020_4988 Status Suspended by user, received, slots/4; Application Rosetta 4.07 Name foldit_2006254_0002_fold_and_dock_SAVE_ALL_OUT_708040_5432 Status Suspended by user, received, slots/3; Application Rosetta 4.07 Name foldit_2006254_0003_fold_and_dock_SAVE_ALL_OUT_708042_5432 Status Active, received, slots/6; Application Rosetta 4.07 Name foldit_2006238_0004_fold_and_dock_SAVE_ALL_OUT_707996_5434 Status Active, received, slots/11; Application Rosetta 4.07 Name foldit_2006238_0005_fold_and_dock_SAVE_ALL_OUT_707998_5434 Status Active, received, slots/13 |
jjch Joined: 10 Nov 13 Posts: 14 Credit: 439,549,091 RAC: 4,209 |
I think I may be experiencing a similar issue. Recently I noted that the work-in-progress value appeared to be approximately double the normal number of work units I have running at a time. In order to troubleshoot this, I set Rosetta to no new tasks and let them run out. Checking Boincstats, I no longer have any work left on any host. According to Rosetta I currently have a total of 1709 tasks in progress. For example, host 1770544 is not running any Rosetta tasks, yet its in-progress count is 216. https://boinc.bakerlab.org/rosetta/results.php?hostid=1770544&offset=0&show_names=0&state=1&appid= I did try resetting the project on that host but it didn't make any difference. My impression is that there is a problem on the Rosetta server side and it isn't updating the task status properly. I think we need the Rosetta programming team to look into this further. |
©2024 University of Washington
https://www.bakerlab.org