Several workunits stuck on computer, had to manually abort

Message boards : Number crunching : Several workunits stuck on computer, had to manually abort

NewtonianRefractor

Joined: 29 Sep 08
Posts: 19
Credit: 2,350,860
RAC: 0
Message 72622 - Posted: 29 Mar 2012, 6:18:48 UTC
Last modified: 29 Mar 2012, 6:19:25 UTC

I have a Linux computer running Scientific Linux attached to this project (host 1526762). Recently at least 2 workunits appeared to be 'stuck'. Each ran for over 40 hours with the progress percentage frozen somewhere between 60% and 75%, and I had to abort them manually.

Here is one of them: wuid 450705292. The interesting thing is that the reported CPU time is only 19,736.05 seconds (about 5.5 hours). When I checked the computer, the CPU core that the WU was assigned to was completely idle.
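For anyone who wants to spot this kind of stall without checking by hand, here is a rough Python sketch of the comparison described above: CPU time versus wall-clock time. The function names and the 0.5 threshold are assumptions made up for illustration, not anything BOINC or Rosetta provides, and the 40 hours is only the lower bound mentioned in the report.

```python
def cpu_utilization(cpu_seconds: float, elapsed_seconds: float) -> float:
    """Return the fraction of wall-clock time the task spent on the CPU."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return cpu_seconds / elapsed_seconds


def looks_stalled(cpu_seconds: float, elapsed_seconds: float,
                  min_ratio: float = 0.5) -> bool:
    """Flag a task whose CPU time lags far behind its elapsed time.

    min_ratio is a hypothetical cutoff; pick one that suits the host.
    """
    return cpu_utilization(cpu_seconds, elapsed_seconds) < min_ratio


# Figures from the report: 19,736.05 s of CPU time, at least 40 hours elapsed.
ratio = cpu_utilization(19736.05, 40 * 3600)
print(f"CPU/elapsed ratio: {ratio:.2f}")
print("stalled?", looks_stalled(19736.05, 40 * 3600))
```

A healthy CPU-bound task should sit close to 1.0, so a ratio this low suggests the task stopped computing long before it was aborted.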

Since this has happened twice, and I only check on this computer every few weeks, I aborted the rest of the Rosetta workunits and switched the machine to another project (World Community Grid) for now.
[VENETO] boboviz

Joined: 1 Dec 05
Posts: 1868
Credit: 8,259,674
RAC: 9,401
Message 72623 - Posted: 29 Mar 2012, 7:12:46 UTC - in response to Message 72622.  

On my Windows 7 machine I simply reboot the PC and the WU resumes crunching...

mikey

Joined: 5 Jan 06
Posts: 1894
Credit: 8,773,304
RAC: 3,957
Message 72625 - Posted: 29 Mar 2012, 10:45:21 UTC - in response to Message 72623.  

Normally you can also just exit BOINC itself and then restart it, and the unit will resume normal crunching.


©2024 University of Washington
https://www.bakerlab.org