Processing Ceases

deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 65896 - Posted: 29 Apr 2010, 23:53:34 UTC

I have a recurring problem that BOINC Support is not able to resolve, and I was advised to submit the problem to Rosetta. I have e-mail messages and log excerpts that I can PM, if anybody is interested.

I have a dual-core machine with 2 Gigs RAM and Windows XP-SP3, so two MiniRosetta tasks are usually running at the same time. Seemingly at random, a task will just stop processing, and my CPU utilization will drop by 50%. If I do nothing, it is only a matter of time before the other task will also stop processing, and my CPUs will be at 99% System Idle. During these interruptions, the BOINC Manager tells me that the tasks are either "Running," or are "Running, high priority."

The only way I can recover and restart the Rosetta tasks is to reboot my computer, and this is getting very frustrating.

Is this a known problem? Is there a fix?

deesy58
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65901 - Posted: 30 Apr 2010, 4:57:02 UTC
Last modified: 30 Apr 2010, 4:59:39 UTC

It seems every month or two I will see a report like this. No root cause has yet been found. I forwarded your PM with log details to the Project Team for review. I am sure they are fully engaged with CASP beginning.

The only pattern I've noticed is that some machines have the problem chronically, and others never see it at all. Do you have a feel for what % of tasks stop using CPU in this manner on your machine?
Rosetta Moderator: Mod.Sense
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 65910 - Posted: 30 Apr 2010, 18:58:36 UTC - in response to Message 65901.  

It seems every month or two I will see a report like this. No root cause has yet been found. I forwarded your PM with log details to the Project Team for review. I am sure they are fully engaged with CASP beginning.

The only pattern I've noticed is that some machines have the problem chronically, and others never see it at all. Do you have a feel for what % of tasks stop using CPU in this manner on your machine?


I have noticed this problem for several days, and it continued to worsen until yesterday when it appeared to have peaked. Two of the last Rosetta tasks have been particularly unstable. One completed last night after "crashing" about six or seven times. The other is still running after about 21.5 hours elapsed, but it has halted two or three times, too. I have also "aborted" at least two tasks. As of right now, both tasks have been running without interruption for about 12 hours.

If I suspend the tasks, then attempt to resume processing on them, they will report that they are running, but will use no CPU resources to do so. Only a reboot will solve the problem.

deesy
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 65919 - Posted: 1 May 2010, 16:59:39 UTC - in response to Message 65910.  

It happened again during the night. When I went to bed, both tasks were processing fine. This morning, only one of them is using the CPU and advancing towards completion. The other is not using the CPU, and is not advancing towards completion, even though BOINC Manager reports that it is "Running, high priority."

If this is the Rosetta software, it really needs to be fixed. If it is hardware, then the software does not seem to be responding appropriately to an error condition. Just halting without any sort of notification would seem to be a little less than consistent with good programming practice, and I am left with no clue as to where to look for the problem. This is really frustrating!

deesy
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 65924 - Posted: 1 May 2010, 20:35:28 UTC

It just happened again. My leading task "froze" at 35.2% complete. I exited the BOINC Manager, and set the option to halt the science projects when doing so. Then I restarted BOINC Manager. Both tasks restarted, and I waited to see progress towards completion on the previously frozen task. I was quite surprised when the progress suddenly changed from 35.2% complete to 30.9% complete while I was looking at it. This is just weird! What happened to the other 4.3%?

I now must keep the Windows Task Manager open at all times so that I can monitor CPU usage in order to detect when one or both of the Rosetta science tasks freezes.

deesy
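The manual check described in this post (watching CPU usage to spot a task that has stopped) can be sketched in Python. This is only an illustration under assumptions: `find_stalled`, the task names, and the readings are all hypothetical, and collecting the cumulative CPU-time readings (for example from the CPU Time column in Task Manager) is not shown.

```python
def find_stalled(samples, min_cpu_seconds=1.0):
    """Return names of tasks whose CPU time barely advanced between samples.

    samples maps a task name to a list of cumulative CPU-seconds readings,
    oldest first, taken at a fixed interval (for example every 5 minutes).
    """
    stalled = []
    for name, task_readings in samples.items():
        if len(task_readings) < 2:
            continue  # not enough data to judge this task
        # A task that is really running keeps adding CPU time; a frozen one
        # shows (almost) the same cumulative value in every reading.
        if task_readings[-1] - task_readings[0] < min_cpu_seconds:
            stalled.append(name)
    return sorted(stalled)


# Hypothetical readings: task_a keeps accumulating CPU time, task_b does not.
readings = {
    "task_a": [1200.0, 1490.5, 1785.2],
    "task_b": [900.0, 900.1, 900.1],
}
print(find_stalled(readings))
```

Comparing cumulative CPU time rather than instantaneous CPU percentage avoids false alarms from a task that happens to be briefly idle at the moment of sampling.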
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 65926 - Posted: 2 May 2010, 1:47:09 UTC

The task finally finished after 13 hours of processing. I had to exit and restart BOINC three times today. I have put the BOINC Manager icon on my desktop so that it will be easier to access for restarts. :(

deesy
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65935 - Posted: 3 May 2010, 3:04:04 UTC - in response to Message 65924.  

It just happened again. My leading task "froze" at 35.2% complete. I exited the BOINC Manager, and set the option to halt the science projects when doing so. Then I restarted BOINC Manager. Both tasks restarted, and I waited to see progress towards completion on the previously frozen task. I was quite surprised when the progress suddenly changed from 35.2% complete to 30.9% complete while I was looking at it. This is just weird! What happened to the other 4.3%?

I now must keep the Windows Task Manager open at all times so that I can monitor CPU usage in order to detect when one or both of the Rosetta science tasks freezes.

deesy


Any time you end a task and it is removed from memory, some amount of completed work is lost. Work in progress is periodically saved by a process called checkpointing.
Rosetta Moderator: Mod.Sense
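A rough Python sketch of why progress can fall back after a restart, as in the drop from 35.2% to 30.9% reported above. This is not Rosetta's or BOINC's actual checkpoint code; the file format, the `CHECKPOINT_EVERY` interval, and the function names are assumptions made for illustration. The point is only that work done after the last saved checkpoint exists in memory and is lost when the task is removed from memory.

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 5  # hypothetical interval, in work steps


def save_checkpoint(path, step):
    """Write the checkpoint to a temp file first, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved step, or 0 if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)["step"]
    except FileNotFoundError:
        return 0


def run(path, total_steps, stop_at=None):
    """Resume from the last checkpoint and work until done or interrupted.

    Returns the step reached in memory, which may be ahead of what is on disk.
    """
    step = load_checkpoint(path)
    while step < total_steps:
        if stop_at is not None and step >= stop_at:
            break  # simulates the task being removed from memory
        step += 1  # stand-in for one unit of real computation
        if step % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, step)
    return step


with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "task.ckpt")
    reached = run(ckpt, total_steps=100, stop_at=37)  # interrupted mid-run
    resumed_from = load_checkpoint(ckpt)              # what survived on disk
    finished = run(ckpt, total_steps=100)             # restart and complete
    print(reached, resumed_from, finished)
```

In this sketch the interrupted run had reached a later step than the one it resumes from, which is the same kind of gap as the "missing" 4.3%.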
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 65963 - Posted: 5 May 2010, 2:41:09 UTC - in response to Message 65935.  

Any time you end a task and it is removed from memory, some amount of completed work is lost. Work in progress is periodically saved by a process called checkpointing.


Okay. I understand that. What I do NOT understand, however, is why some Work Units just seem to "freeze" at some level of completion, and stop using computer resources (CPU). These events seem to be random. After more than 48 hours of continuous processing on both of my CPU cores, one of the two tasks being processed just "froze" at a level of 7.763% complete. The only way to resume processing on this WU was to exit BOINC, and then restart it.

deesy
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 65973 - Posted: 5 May 2010, 12:16:00 UTC

Tasks dropping off and no longer using CPU, even when BOINC Manager would indicate they should be (i.e. they aren't suspended for any reason nor preempted to run another project), is one quirk that hasn't been tracked down yet. Fortunately it seems to affect very few people and, even within those machines, only a small number of tasks. I've been gathering all the hints I can to report to the Project Team.
Rosetta Moderator: Mod.Sense
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 65980 - Posted: 5 May 2010, 16:24:05 UTC - in response to Message 65973.  

Tasks dropping off and no longer using CPU, even when BOINC Manager would indicate they should be (i.e. they aren't suspended for any reason nor preempted to run another project), is one quirk that hasn't been tracked down yet. Fortunately it seems to affect very few people and, even within those machines, only a small number of tasks. I've been gathering all the hints I can to report to the Project Team.


Okay, perhaps I might be able to supply what could be a helpful hint.

I have determined that these "freezes" appear to be WU-related. The "freeze" happened again last night, making it four times in a little more than 16 hours. Each time, it was the same WU that stopped:

rb_05_02_122_331_rs_stg0_lrlx_t000_boincid_SAVE_ALL_OUT.IGNORE_THE_REST_B_20262_857_0

Elapsed: 14:01:17 Progress: 27.103% To Completion: 21:31:52

Perhaps this will help.

Each time it freezes and I must restart BOINC to recover, I lose progress on the task, so it is going to take a lot longer to complete these types of WUs, and it requires constant attention to the Windows Task Manager to detect the "freezing." Perhaps these types of WUs are sufficiently different from others that the miniboinc_2.11 software can't process them reliably. Should I immediately abort each of these kinds of WUs when I see that they have been assigned? Do you think that might alleviate my problem?

deesy
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 65991 - Posted: 5 May 2010, 22:32:52 UTC - in response to Message 65980.  

Tasks dropping off and no longer using CPU, even when BOINC Manager would indicate they should be (i.e. they aren't suspended for any reason nor preempted to run another project), is one quirk that hasn't been tracked down yet. Fortunately it seems to affect very few people and, even within those machines, only a small number of tasks. I've been gathering all the hints I can to report to the Project Team.


After the offending WU "froze" for the sixth (and final) time about ten minutes ago, I aborted the task. A new, similar, WU began processing immediately. After an elapsed time of one minute and 21 seconds, it aborted itself with a "Computation Error" message.

Now, it appears that one of the WUs being processed might be some sort of "test," since the word "test" appears in the name of the WU.

From here on out, I will immediately abort all WUs that appear to be of the same type as those that appear unable to process on my machine.

deesy
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 66006 - Posted: 6 May 2010, 16:31:31 UTC

You have to do what works for you. Ideally, if you are around the machine and able to look in on it, you would let them run and see if you can confirm the theory that specific names consistently have problems. Or suspend them until a time when you are around the machine. If everyone had the same symptoms as you, then there would be a glaring lack of returned results for those tasks and a big red flag would have risen some time ago. So, I suspect you will find that most of them will run fine.

And no, the word "test", or any other word in a task name, has no relation to expected reliability in crunching. It is just a reference to this task's relationship to others in the study of the protein.
Rosetta Moderator: Mod.Sense
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 66009 - Posted: 6 May 2010, 17:37:25 UTC - in response to Message 66006.  

You have to do what works for you. Ideally, if you are around the machine and able to look in on it, you would let them run and see if you can confirm the theory that specific names consistently have problems. Or suspend them until a time when you are around the machine. If everyone had the same symptoms as you, then there would be a glaring lack of returned results for those tasks and a big red flag would have risen some time ago. So, I suspect you will find that most of them will run fine.

And no, the word "test", or any other word in a task name, has no relation to expected reliability in crunching. It is just a reference to this task's relationship to others in the study of the protein.


Okay. I'll just abort any task that ever freezes in the future. If I collect the names of the "offending" tasks, do you want me to post them here?

deesy
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 66021 - Posted: 7 May 2010, 16:37:42 UTC - in response to Message 66009.  

If I collect the names of the "offending" tasks, do you want me to post them here?

deesy


Certainly, yes. Posting is even better than emailing them directly to me, because it allows others to compare their own notes with yours and offer further information.
Rosetta Moderator: Mod.Sense
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 66029 - Posted: 8 May 2010, 8:01:01 UTC - in response to Message 66021.  

If I collect the names of the "offending" tasks, do you want me to post them here?

deesy


Certainly, yes. Posting is even better than emailing them directly to me, because it allows others to compare their own notes with yours and offer further information.



Okay, Mod.Sense, I will do that. It is difficult for me to imagine that my machine regularly halts processing certain Work Units, but that nobody else experiences the problem. That seems extremely unlikely to me. Perhaps other contributors have become annoyed with the problem and either stopped contributing to Rosetta@Home entirely, or they simply abort the task whenever it happens (like I will now do). In 40 years of software development, our development teams always made an intense effort to locate and fix bugs, even if only a very few users were adversely affected. :-|

deesy
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 66251 - Posted: 21 May 2010, 8:33:46 UTC

Okay! Here is another one. After processing without problems since May 5, 2010, the latest WU "crashed" a few minutes ago. Here is the name of the WU:

rb_05_19_162_579_rs_stg0_lrlx_t000_casp9_SAVE_ALL_OUT.IGNORE_THE_REST_B_21112_2094_0

I will no longer attempt to restart these defective WUs, but will abort them as soon as I notice that processing has ceased.

I hope that this information will assist in tracking down the root of the problem.

deesy
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 66283 - Posted: 22 May 2010, 15:04:07 UTC
Last modified: 22 May 2010, 15:06:18 UTC

Here's one more:

rs_stg0_lrlxjcst_t308_run6_SAVE_ALL_OUT_20984_304_0

This would be a lot easier, and perhaps more contributors would post these defective WUs, if we could copy the name to the clipboard, and then paste it. Either that, or come up with simpler WU names.

deesy
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 66443 - Posted: 3 Jun 2010, 18:18:40 UTC

Here we go again. I wish these work units were more stable.

rs_stg0_lrlxcst_T477_casp8_SAVE_ALL_OUT_20745_1622_0

Progress: 12.369% (aborted the task)

deesy
deesy58

Joined: 20 Apr 10
Posts: 75
Credit: 193,831
RAC: 0
Message 66496 - Posted: 6 Jun 2010, 16:28:05 UTC

Well, here's another one:

rs_stg0_lrlx_T389_casp8_SAVE_ALL_OUT_20772_2567_0

It seems a waste that these work units complete 10%-15% before crashing. This one quit processing in the middle of the night, again. BOINC points a finger at the project, and the project just shrugs.

Is anybody watching? Does anybody care? Is this a normal occurrence? Should this information be posted elsewhere?

WTF!

deesy
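One way to start testing the theory that specific names consistently have problems is to look for name fragments shared by every work unit reported in this thread. The Python sketch below does only that. The splitting rule (on `_` and `.`) is an assumption, no meaning is claimed for any fragment, and without the names of work units that completed normally a shared fragment proves nothing on its own.

```python
import re
from collections import Counter

# Work unit names reported as freezing earlier in this thread.
names = [
    "rb_05_02_122_331_rs_stg0_lrlx_t000_boincid_SAVE_ALL_OUT.IGNORE_THE_REST_B_20262_857_0",
    "rb_05_19_162_579_rs_stg0_lrlx_t000_casp9_SAVE_ALL_OUT.IGNORE_THE_REST_B_21112_2094_0",
    "rs_stg0_lrlxjcst_t308_run6_SAVE_ALL_OUT_20984_304_0",
    "rs_stg0_lrlxcst_T477_casp8_SAVE_ALL_OUT_20745_1622_0",
    "rs_stg0_lrlx_T389_casp8_SAVE_ALL_OUT_20772_2567_0",
]


def tokens(name):
    """Split a work unit name into its distinct fragments."""
    return set(re.split(r"[_.]", name))


# Count in how many names each fragment appears (at most once per name).
counts = Counter(tok for name in names for tok in tokens(name))

# Fragments present in every reported name.
shared = sorted(tok for tok, n in counts.items() if n == len(names))
print(shared)
```

Running the same count over a list of work units that finished without trouble would show whether any of these fragments is actually specific to the failures.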



©2024 University of Washington
https://www.bakerlab.org