Message boards : Number crunching : minirosetta 2.17
Author | Message |
---|---|
Tex1954 Joined: 3 Apr 11 Posts: 9 Credit: 3,394,752 RAC: 1 |
"I am seeing validate errors (with matching wingman results) on tasks whose name has the form of:" Well, I suggest they give us a storage fee in the form of double normal points to house their flawed software on our systems... LOL! 8-) Tex1954 |
Chris Holvenstot Joined: 2 May 10 Posts: 220 Credit: 9,106,918 RAC: 0 |
Tex1954 said: "Well, I suggest they give us a storage fee in the form of double normal points to house their flawed software on our systems..." There are even a few additional types of tasks getting the validate errors with matching wingman results, but there are fewer of them, so I'm just going to sit back and see what they do with these before sorting through more of the chaff. It sure would be nice if they would update their server software so that we could pull a task list by Server State / Outcome like some of the other projects have. It would make digging through the results a bunch easier. |
Tex1954 Joined: 3 Apr 11 Posts: 9 Credit: 3,394,752 RAC: 1 |
"There are even a few additional types of tasks getting the validate errors with matching wingman results, but there are fewer of them, so I'm just going to sit back and see what they do with these before sorting through more of the chaff." Well, I am not a developer for their tasks, just a helper with hardware. These (all BOINC etc. tasks) are all cooperative ventures. Sometimes certain folks feel superior and/or embarrassed and clog/break the information circle... Would be nice if "someone" who actually writes the apps would pop in and let us know somebody is awake! As mentioned before, some sort of status message germane to the current situation? A one-line sticky NOTE, for crying out loud? LOL! Anyway, plugging along with the rest of y'all... 8-) |
bobgoblin Joined: 15 Oct 05 Posts: 2 Credit: 1,616,056 RAC: 0 |
I was not concerned about, nor did I believe, rosetta was causing harm to the i7's. I have not seen the "waiting for memory" message on either machine. The 12+ hour crunch time is a very recent development, only in the last few weeks. Have the memory demands of rosetta increased? If so, then I will not run them on those machines and continue to run seti, cpdn, and einstein instead. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
R@h memory demands vary considerably with different types of tasks. I've noticed several recently that are using more than 300MB of memory at times. If you catch it happening again, please look at the BOINC Manager to see if you've got 8 tasks in a "running" status, and then look at Windows Task Manager to see if all 8 are getting CPU time. There has been another problem that seems to come up from time to time where the task looks like it is running according to BOINC, but is not actually getting any CPU time. And because it never gets CPU time, the R@h watchdog can't take action to end the task or clean it up (because it would need CPU to take any action). As I recall, the only way around that one (other than aborting the task) is to completely end and restart BOINC (just suspending the task and resuming it doesn't seem to resolve the problem). With 7 other tasks running, and standing to lose the work they've done since their last checkpoints when you restart BOINC, you may be time ahead to just abort such tasks, or, if you know you are going to reboot the machine soon, suspend them until you reboot. I have not been able to determine any patterns as to what makes this occur when BOINC says the task is running, yet doesn't allocate CPU time to it. So, any details about the mix with other projects, or the number of tasks involved, or the amount of memory the stalled task shows being used in Windows Task Manager... hopefully with enough detail a pattern will begin to emerge. I'm not positive, but I believe this has only been occurring on Windows machines, so perhaps that's a start. Rosetta Moderator: Mod.Sense |
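
The check described in the post above (tasks that BOINC lists as "running" but that are not accumulating CPU time) can be sketched in a few lines. This is only an illustration, not anything BOINC or Rosetta provides: the `TaskSample` record, the `find_stalled` helper and the one-second threshold are all hypothetical, and the samples would have to be noted by hand (for example from the BOINC Manager and the OS process list) a few minutes apart.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    """One hand-collected observation of a task (hypothetical record)."""
    name: str
    state: str          # state shown by the manager, e.g. "running"
    cpu_seconds: float  # cumulative CPU time shown by the OS process list


def find_stalled(before, after, min_progress=1.0):
    """Return names of tasks reported as running in both samples
    whose CPU time advanced by less than min_progress seconds."""
    earlier = {sample.name: sample for sample in before}
    stalled = []
    for sample in after:
        prev = earlier.get(sample.name)
        if prev is None:
            continue  # task was not present in the first sample
        if prev.state == "running" and sample.state == "running":
            if sample.cpu_seconds - prev.cpu_seconds < min_progress:
                stalled.append(sample.name)
    return stalled


# Made-up numbers: task "b" claims to be running but gained no CPU time.
first = [TaskSample("a", "running", 100.0),
         TaskSample("b", "running", 50.0),
         TaskSample("c", "suspended", 10.0)]
second = [TaskSample("a", "running", 160.0),
          TaskSample("b", "running", 50.0),
          TaskSample("c", "suspended", 10.0)]
print(find_stalled(first, second))
```

A suspended task is deliberately not flagged, since the problem being described is specifically a task that BOINC shows as running.
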
Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
"I have not been able to determine any patterns as to what makes this occur when BOINC says the task is running, yet doesn't allocate CPU time to it. So, any details about mix with other projects, or number of tasks involved or amount of memory the stalled task shows being used in Windows task manager... hopefully with enough detail a pattern will begin to emerge. I'm not positive, but I believe this has only been occurring on Windows machines, so perhaps that's a start." Sorry, Mod.Sense, I've seen it occur, albeit extremely rarely, on my Mac. It also occurs on many other projects, although it seems more prevalent here on Rosetta; at least, there are more complaints here than on the other boards I peruse. Have you been in contact with Josef Segur? Judging from his most recent contribution to the boinc_dev list ("check_progress option") he has an interest in this problem and could probably point you to discussions elsewhere and/or individuals who are also collecting observations and trying to discern patterns. It also might be helpful if a project the size and import of Rosetta expressed an interest in having BOINC address the issue. Best, Snags |
bump Joined: 13 Apr 10 Posts: 1 Credit: 2,315,841 RAC: 0 |
I too am getting the compute errors, and I do not believe memory to be the issue. I have 3 boxes and they all get the errors. Two of them have 4GB and they are essentially idle most of the day. According to the stats, a maximum of 2.1 GB of the 4GB has ever been used. Along with the compute errors, I am seeing difficulty in uploading finished jobs. Right now I have 10 queued up on one box and they keep getting paused and set for retry. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Yes, just from my own observations it seemed to be more likely to occur when tasks were contending for memory. What makes it hard to study is the lack of messages about tasks being suspended to wait for memory. And when multiple projects were in the mix, it got very difficult to tell whether another task was started due to memory limits, or due to project switching, or what. Another factor is that when you get on to your machine to look at things, your preference for "when active" memory is often less than "when idle", and so now simply observing it is affecting it too. I never found a way to cause it to happen. And even when memory is constrained, it doesn't seem to happen very often. Yet when it does happen, it seems to come in waves where you see it several times in just a few days, and then not again for weeks or months. Makes me question if the OS is not properly swapping memory back in when a task is resumed. Rosetta Moderator: Mod.Sense |
robertmiles Joined: 16 Jun 08 Posts: 1234 Credit: 14,338,560 RAC: 1,227 |
I've been seeing it occasionally with a computer running 64-bit Windows Vista, with 8 GB memory available and BOINC allowed to use 40% of that. I've already posted more details about some of those workunits earlier in this thread. No error messages about why unless they're in that workunit's log file. For the last few, the time when it stopped using any CPU time at all was around 1 minute after it resumed processing after the last checkpoint. I have the same computer participating in most of the other BOINC projects related to medical research, with those that do not have checkpoints currently disabled. Occasionally two minirosetta workunits at a time; three CPU cores set to allow BOINC use, but I don't remember seeing three minirosetta workunits try to run at once. |
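
To put rough numbers on the memory discussion, here is a small arithmetic sketch. It uses the 8 GB and 40% figures from the post above and the roughly 300 MB peak mentioned earlier in the thread; the 25% "when active" value is purely an assumed example, and the helper names are made up. It only illustrates how a lower limit while the machine is in use shrinks the room available to tasks; it does not reproduce how the BOINC client actually schedules.

```python
def memory_budget_mb(total_mb: int, percent_allowed: int) -> int:
    """Memory BOINC may use, given a percentage preference (whole MB)."""
    return total_mb * percent_allowed // 100


def tasks_that_fit(budget_mb: int, per_task_mb: int) -> int:
    """How many tasks of a given peak size fit inside the budget."""
    return budget_mb // per_task_mb


TOTAL_MB = 8 * 1024   # 8 GB machine, from the post above
PEAK_TASK_MB = 300    # illustrative peak, from the earlier "300MB" remark

idle_budget = memory_budget_mb(TOTAL_MB, 40)    # 40% limit from the post
active_budget = memory_budget_mb(TOTAL_MB, 25)  # assumed "when active" limit

print(idle_budget, tasks_that_fit(idle_budget, PEAK_TASK_MB))
print(active_budget, tasks_that_fit(active_budget, PEAK_TASK_MB))
```

Whether the 40% in the post applies while the computer is in use or while it is idle is not stated, so treat the split between the two budgets as an assumption.
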
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Task (blind_rhoda_boinc_nmr_control.2nz6A_330_abrelax_cs_frags_sgourn_IGNORE_THE_REST_25677_1336_0) 418323998 failed on Mac after about 5 minutes. Other tasks with names like blind* fail similarly. ERROR: ct == final_atoms ERROR:: Exit from: src/core/scoring/rms_util.cc line: 475 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> ]]> |
James Thompson Joined: 13 Oct 05 Posts: 46 Credit: 186,109 RAC: 0 |
Thanks svincent. This is another input file issue, this time from a different user. The jobs have been removed, and we're working on the problem right now. |
Ian_D Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
So what's with the following No heartbeat from core client for 30 sec - exiting messages ? Job had been sitting doing NOTHING for 13.5 hours (???) which I noticed and subsequently restarted BOINC. The Windows XP PC concerned is using nVidia onboard graphics (no idea if this has any bearing) https://boinc.bakerlab.org/rosetta/result.php?resultid=419011804 <core_client_version>6.10.58</core_client_version> <![CDATA[ <stderr_txt> [2011- 4-30 2:38:23:] :: BOINC:: Initializing ... ok. [2011- 4-30 2:38:23:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev39052.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/casd_sr10_boinc_nmr_control.1ff3B_20_abrelax_cs_frags_tex.boinc.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... BOINC:: Worker startup. Starting watchdog... Watchdog active. # cpu_run_time_pref: 28800 Unhandled Exception Detected... 
- Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x7C910F1E read attempt to address 0x00000001 Engaging BOINC Windows Runtime Debugger... [2011- 4-30 21:39:36:] :: BOINC:: Initializing ... ok. [2011- 4-30 21:39:36:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev39052.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/casd_sr10_boinc_nmr_control.1ff3B_20_abrelax_cs_frags_tex.boinc.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... BOINC:: Worker startup. Starting watchdog... Watchdog active. Continuing computation from checkpoint: chk_S_00046_FragmentSampler__stage1 ... success! Continuing computation from checkpoint: chk_S_00046_FragmentSampler__stage2 ... success! Continuing computation from checkpoint: chk_S_00046_FragmentSampler__stage3 ... success! Continuing computation from checkpoint: chk_S_00046_FragmentSampler__stage4_kk_1 ... success! Continuing computation from checkpoint: chk_S_00046_FragmentSampler__stage4_kk_2 ... success! 
# cpu_run_time_pref: 28800 No heartbeat from core client for 30 sec - exiting [2011- 4-30 22: 6:36:] :: BOINC:: Initializing ... ok. [2011- 4-30 22: 6:36:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev39052.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/casd_sr10_boinc_nmr_control.1ff3B_20_abrelax_cs_frags_tex.boinc.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... BOINC:: Worker startup. Starting watchdog... Watchdog active. Continuing computation from checkpoint: chk_S_00052_FragmentSampler__stage1 ... success! Continuing computation from checkpoint: chk_S_00052_FragmentSampler__stage2 ... success! Continuing computation from checkpoint: chk_S_00052_FragmentSampler__stage3 ... success! Continuing computation from checkpoint: chk_S_00052_FragmentSampler__stage4_kk_1 ... success! # cpu_run_time_pref: 28800 No heartbeat from core client for 30 sec - exiting [2011- 4-30 22: 7:11:] :: BOINC:: Initializing ... ok. 
[2011- 4-30 22: 7:11:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev39052.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/casd_sr10_boinc_nmr_control.1ff3B_20_abrelax_cs_frags_tex.boinc.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... BOINC:: Worker startup. Starting watchdog... Watchdog active. Continuing computation from checkpoint: chk_S_00052_FragmentSampler__stage1 ... success! Continuing computation from checkpoint: chk_S_00052_FragmentSampler__stage2 ... success! Continuing computation from checkpoint: chk_S_00052_FragmentSampler__stage3 ... success! Continuing computation from checkpoint: chk_S_00052_FragmentSampler__stage4_kk_1 ... success! # cpu_run_time_pref: 28800 Continuing computation from checkpoint: chk_S_00052_FastRelax__chk1_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk2_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk3_fa ... success! 
Continuing computation from checkpoint: chk_S_00052_FastRelax__chk4_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk5_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk6_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk7_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk8_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk9_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk10_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk11_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk12_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk13_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk14_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk15_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk16_fa ... success! Continuing computation from checkpoint: chk_S_00052_FastRelax__chk17_fa ... success! No heartbeat from core client for 30 sec - exiting [2011- 4-30 22: 8:47:] :: BOINC:: Initializing ... ok. [2011- 4-30 22: 8:47:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Initializing options.... 
ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev39052.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/casd_sr10_boinc_nmr_control.1ff3B_20_abrelax_cs_frags_tex.boinc.zip Setting database description ... Setting up checkpointing ... Setting up graphics native ... BOINC:: Worker startup. Starting watchdog... Watchdog active. # cpu_run_time_pref: 28800 No heartbeat from core client for 30 sec - exiting [2011- 4-30 22:17:41:] :: BOINC:: Initializing ... ok. [2011- 4-30 22:17:41:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing broker options ... Registered extra options. Initializing core... Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Initializing options.... ok Options::initialize() Options::adding_options() Options::initialize() Check specs. Options::initialize() End reached Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev39052.zip Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/casd_sr10_boinc_nmr_control.1ff3B_20_abrelax_cs_frags_tex.boinc.zip Setting database description ... 
Setting up checkpointing ... Setting up graphics native ... BOINC:: Worker startup. Starting watchdog... Watchdog active. # cpu_run_time_pref: 28800 ====================================================== DONE :: 56 starting structures 14704.5 cpu seconds This process generated 56 decoys from 56 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish </stderr_txt> ]]> |
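
For anyone who wants to pull the restart times out of a stderr dump like the one above, here is a minimal sketch. It assumes only the timestamp layout visible in that log (`[2011- 4-30 22: 6:36:]`, with space-padded fields); the function names are made up for illustration, and the sample string is a shortened stand-in for the real output rather than a copy of it.

```python
import re
from datetime import datetime

# Matches stamps such as "[2011- 4-30 22: 6:36:]" (fields padded with spaces).
STAMP = re.compile(
    r"\[(\d{4})-\s*(\d+)-\s*(\d+)\s+(\d+):\s*(\d+):\s*(\d+):\]"
)


def restart_times(log_text):
    """Return the distinct start-up timestamps in order of appearance.

    Each start-up prints the same stamp twice in this log, so repeats
    are collapsed.
    """
    stamps = [datetime(*map(int, m.groups())) for m in STAMP.finditer(log_text)]
    return list(dict.fromkeys(stamps))


def gaps_in_seconds(times):
    """Seconds elapsed between consecutive start-ups."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


sample = (
    "[2011- 4-30 2:38:23:] :: BOINC:: Initializing ... ok. "
    "[2011- 4-30 2:38:23:] :: BOINC :: boinc_init() "
    "[2011- 4-30 21:39:36:] :: BOINC:: Initializing ... ok. "
    "[2011- 4-30 22: 6:36:] :: BOINC:: Initializing ... ok. "
)
starts = restart_times(sample)
print(starts)
print(gaps_in_seconds(starts))
```

A long gap between two start-ups shows when the task sat idle; a run of short gaps shows the quick exit-and-restart cycles.
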
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Despite the heartbeat issues, you did complete the task: DONE :: 56 starting structures 14704.5 cpu seconds This process generated 56 decoys from 56 attempts "So what's with the following" |
Ian_D Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
Yep, task completed after I restarted BOINC - put into snooze, then shutdown and started (in that order) ?? "Despite the heartbeat issues, you did complete the task:" |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Well, guess we will have to wait for the grad student to wake up and come on duty to fully address your question. I found a little something from the BOINC wiki that addresses this issue: Why am I getting a 'Reason: Access Violation (0xc0000005) error'? 1. Change your preferences to leave Rosetta@Home in memory: General Preferences (log in first if you haven't already) -> Edit Preferences (down the bottom) -> Leave applications in memory while preempted? Check yes and click the update preferences button; also, remember to "update" the BOINC Client Software so that the changes are downloaded. Open the BOINC Manager and select the "Projects Tab", left-click on "Rosetta@home" to select the project, and click the "Update" Button. 2. An error occurred somewhere on the computer; it could have been the BOINC Client Software or the Rosetta@Home Science Application or any programme that your computer was running at the time. This is not a Rosetta@Home specific error; as far as I am aware it happens, on occasion, in all of the BOINC Powered Projects with all of the Science Applications. Keep Rosetta@Home in memory and ignore this problem if it's not getting out of hand. I'm going to leave it at that... wait for the big experts. "Yep, task completed after I restarted BOINC - put into snooze , then shutdown and started (in that order) ??" |
Ian_D Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
Cheers Greg! "Well, guess we will have to wait for the grad student to wake up and come on duty to fully address your question. I found a little something from the BOINC wiki that addresses this issue:" |
Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
"Well, guess we will have to wait for the grad student to wake up and come on duty to fully address your question. I found a little something from the BOINC wiki that addresses this issue:" The "no heartbeat" message means the science app and BOINC client lost contact with each other. When the science application doesn't receive the heartbeat (the "I'm alive" message) from BOINC, it is supposed to exit. As long as it was merely a temporary obstruction and BOINC hasn't actually crashed, BOINC should see that the application has stopped, restart it and proceed merrily on its way. Only when it happens repeatedly with a single task (100 times) does BOINC give up, sending that task back and starting a brand new task. If I'm reading correctly, the "no heartbeat" messages occurred after you had restarted BOINC, and Rosetta was able to successfully complete the task despite them. They may or may not be related to the cause of the error Greg highlighted, which may have led to a BOINC crash that it couldn't recover from without a restart; thus the long delay until you noticed, restarted, and set BOINC and Rosetta on their merry way again. You might try to recall what else was running on your computer at the time of the "no heartbeat" messages (22:6:36, 22:7:11, 22:8:47, 22:17:41). Anti-virus, anti-spyware, some other maintenance-type scan, indexing? Could be something you started deliberately or could be something running automatically in the background. I don't suppose you started some new process (indexing, say) between 2:38:23 and the time BOINC stopped (which, if BOINC hadn't been running for 13.5 hours when you restarted, must have been about 8 o'clock. Is that right?). That could point to the cause of the crash and, if the process was ongoing (or maybe set to check for changes, like an index or a backup), could also explain the "no heartbeat" messages. Best, Snags |
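
The behaviour described in the post above can be sketched as a toy model. This is not BOINC's actual code: the 30-second timeout comes from the "No heartbeat from core client for 30 sec" message in the log, the limit of 100 is simply the figure quoted in the post, and the function names and the idea of summarising each run by its longest heartbeat silence are simplifying assumptions made for this example.

```python
HEARTBEAT_TIMEOUT = 30.0  # seconds, from the "for 30 sec" log message
MAX_RESTARTS = 100        # figure quoted in the post; not verified here


def app_should_exit(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """The science app exits once the client has been silent too long."""
    return now - last_heartbeat > timeout


def simulate(longest_silences, timeout=HEARTBEAT_TIMEOUT,
             max_restarts=MAX_RESTARTS):
    """Walk through successive runs of one task.

    longest_silences holds, for each run, the longest gap (seconds)
    between heartbeats seen by the app. Returns (outcome, restarts).
    """
    restarts = 0
    for silence in longest_silences:
        if not app_should_exit(0.0, silence, timeout):
            return ("completed", restarts)  # this run was never starved
        restarts += 1                       # app exited, client restarts it
        if restarts >= max_restarts:
            return ("abandoned", restarts)  # client gives up on the task
    return ("still running", restarts)


# Four exits followed by a clean run, loosely like the log earlier in the thread.
print(simulate([45, 31, 40, 35, 5]))
```

The point of the model is only that a handful of heartbeat exits is recoverable, which matches the task in question finishing normally after its last restart.
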
TPCBF Joined: 29 Nov 10 Posts: 111 Credit: 5,143,328 RAC: 1,511 |
Hey guys, is it really necessary to full quote the same stuff over and over again? :( Ralf |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"Hey guys, is it really necessary to full quote the same stuff over and over again? :(" Well, in this case it keeps everything together in one block so we can reference ALL the information: the error messages, the initial complaint, possible solutions and information about the error. This is a small enough thread that it wasn't that big of a deal. In bigger threads it can be a problem. I've been around long enough to know how some of us, as a joke, created a thread so long by just replying to the same quote time after time. Mod remembers this. So this is just a piddly thread. |
Ian_D Joined: 21 Sep 05 Posts: 55 Credit: 4,216,173 RAC: 0 |
Think I may have "solved" this one and, as you so rightly said, it looks like it was a hardware problem. Looking at the System info messages, I've been getting a lot of intermittent paging problems on one of the hard disks around the times of the Reason: Access Violation (0xc0000005) failures. Cheers for the steer! Ian |
©2024 University of Washington
https://www.bakerlab.org