Report stuck & aborted WU here please

Profile Nite Owl
Avatar

Send message
Joined: 2 Nov 05
Posts: 87
Credit: 3,019,449
RAC: 0
Message 12593 - Posted: 24 Mar 2006, 0:19:40 UTC
Last modified: 24 Mar 2006, 0:52:35 UTC

ID: 12593 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile dgnuff
Avatar

Send message
Joined: 1 Nov 05
Posts: 350
Credit: 24,773,605
RAC: 0
Message 12609 - Posted: 24 Mar 2006, 8:10:21 UTC - in response to Message 12547.  
Last modified: 24 Mar 2006, 8:13:46 UTC

The last 1% guy that I reported: FA_RLXbk_hom010_1bk2__359_406_0 ran OK after a reboot.

However, another system has a different one stuck: FA_RLXpg_hom005_1pgx__361_334_0. Same drill - rebooting now.


Everyone having WUs stuck or their PC crash on a WU that begins with FA_RLX, please report your PC configuration, including number of CPUs. I wonder if this is a problem with dual CPU AMD chips.


Probably not. The system this WU is stuck on is a single core Celeron: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=157182

ID: 12609 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Grutte Pier [Wa Oars]~Nemesis

Send message
Joined: 8 Nov 05
Posts: 3
Credit: 386,730
RAC: 0
Message 12624 - Posted: 24 Mar 2006, 16:20:22 UTC
Last modified: 24 Mar 2006, 16:21:35 UTC

This one got stuck on 1%:

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11045076
ID: 12624 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
MAOJC

Send message
Joined: 19 Jan 06
Posts: 15
Credit: 2,727,567
RAC: 0
Message 12633 - Posted: 24 Mar 2006, 18:28:19 UTC - in response to Message 12565.  
Last modified: 24 Mar 2006, 18:30:15 UTC

FA_RLXpt-hom006_1ptq_361_120_0
FA_RLXpt-hom004_1ptq_361_120_0

Both of the above WUs were stuck at 10+ hours, with 1+ days to completion, at ~14-15% complete on a Linux dual-core Opteron. Aborting the first WU with BOINC Manager caused BOINC to segfault and dump core, leaving a running Rosetta process and the gui_rpc port bound. I had to hard-kill the running Rosetta process.

Here is the killed process from ps -ef; note the -cpu_run_time 7200 parameter.

XXXXXX 23730 1 12 04:50 pts/0 00:00:30 rosetta_4.81_i686-pc-linux-gnu xx 1ptq _ -output_silent_gz -silent -increase_cycles 10 -relax_score_filter -new_centroid_packing -abrelax -output_chi_silent -stringent_relax -vary_omega -omega_weight 0.5 -farlx -ex1 -ex2 -short_range_hb_weight 0.50 -long_range_hb_weight 1.0 -no_filters -nstruct 10 -protein_name_prefix hom008_ -frags_name_prefix hom008_ -filter1 -45 -filter2 -55 -termini -cpu_run_time 7200 -constant_seed -jran 2482751
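For anyone who wants to pull a single parameter out of a long command line like the one above, here is a minimal Python sketch. The helper name flag_value is hypothetical, the command string is an abbreviated copy of the quoted one, and treating the -cpu_run_time value as seconds is an assumption, not something confirmed in this thread.

```python
import shlex


def flag_value(command_line, flag):
    """Return the token that follows `flag`, or None if the flag is absent or last."""
    tokens = shlex.split(command_line)
    for i, tok in enumerate(tokens[:-1]):
        if tok == flag:
            return tokens[i + 1]
    return None


# Abbreviated copy of the command line quoted in the post above.
cmd = (
    "rosetta_4.81_i686-pc-linux-gnu xx 1ptq _ -output_silent_gz -silent "
    "-increase_cycles 10 -nstruct 10 -cpu_run_time 7200 -constant_seed -jran 2482751"
)

run_time = int(flag_value(cmd, "-cpu_run_time"))
# Assumption: the value is in seconds, which would make it a 2-hour target.
print(run_time, "seconds =", run_time / 3600, "hours")
```

The same helper works for the other value-carrying flags (for example -jran), since it simply returns whatever token follows the flag.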


Got another one, it looks like:

FA_RLXpt_hom003-1ptq_361_274_0 stuck at 8.73% after 3:45 hrs of an 8-hour target, with the predicted time at 10+ hours and climbing. After aborting, it shows 0:42 hours of computing time.

ID: 12633 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
John Perko

Send message
Joined: 1 Jan 06
Posts: 3
Credit: 604,568
RAC: 0
Message 12640 - Posted: 24 Mar 2006, 20:07:15 UTC

3/24/2006 3:04:45 PM|rosetta@home|Unrecoverable error for result FA_RLXch_hom007_2chf__362_214_0 (aborted via GUI RPC)

The above unit stuck at 1% for about 2:40:00. It is the responsibility of the people who program these projects to find the bugs. My advice to people is to abort them and let the program move on.
ID: 12640 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Moderator9
Volunteer moderator

Send message
Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 12642 - Posted: 24 Mar 2006, 20:22:30 UTC - in response to Message 12640.  
Last modified: 24 Mar 2006, 20:27:09 UTC

3/24/2006 3:04:45 PM|rosetta@home|Unrecoverable error for result FA_RLXch_hom007_2chf__362_214_0 (aborted via GUI RPC)

The above unit stuck at 1% for about 2:40:00. It is the responsibility of the people who program these projects to find the bugs. My advice to people is to abort them and let the program move on.



The project is trying to find and fix this bug. They are aggressively seeking the answer, and the information people are providing on this thread is being used in that effort. The hangs are not the same every time and it is taking time to locate the cause of the problem.

So if you want to abort a hung WU, at least report the Result ID from your stats page here so the programmers can take a look at it.

Thank you for your assistance.
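If you only have a link to hand, the ID can be read straight out of it. The following is a small Python sketch, assuming links of the two forms posted in this thread (result.php?resultid=... and workunit.php?wuid=...); the function name extract_id is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def extract_id(url):
    """Return (parameter_name, numeric_id) for a result or workunit link, else None."""
    query = parse_qs(urlparse(url).query)
    for key in ("resultid", "wuid"):
        if key in query:
            return key, int(query[key][0])
    return None


print(extract_id("https://boinc.bakerlab.org/rosetta/result.php?resultid=14676361"))
print(extract_id("https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11045076"))
```

Note that a workunit ID and a result ID are different numbers, so it helps to say which one you are reporting.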


Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 12642 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile dag
Avatar

Send message
Joined: 16 Dec 05
Posts: 106
Credit: 1,000,020
RAC: 0
Message 12645 - Posted: 24 Mar 2006, 22:45:29 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=14676361

24 hours, ~46%. It had 100% usage of a CPU core for that whole time.
dag
--Finding aliens is cool, but understanding the structure of proteins is useful.
ID: 12645 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Delk

Send message
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 12647 - Posted: 25 Mar 2006, 1:27:25 UTC

work-units aborted at 1%:

FA_RLXci_hom029_2ci2I_362_311_0 after 109,513.44 secs
FA_RLXpt_hom002_1ptq__361_439_0 after 207,726.50 secs

Maybe it's time to stop doing longer work units, so I can see when servers haven't reported results in the last few hours...
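For scale, the CPU times above can be converted from seconds to hours. This is just the arithmetic, sketched in Python; the helper name to_hours is hypothetical.

```python
def to_hours(cpu_seconds):
    """Convert CPU seconds to hours."""
    return cpu_seconds / 3600.0


aborted = [
    ("FA_RLXci_hom029_2ci2I_362_311_0", 109513.44),
    ("FA_RLXpt_hom002_1ptq__361_439_0", 207726.50),
]

for name, secs in aborted:
    print(f"{name}: {to_hours(secs):.1f} h")
```

That works out to roughly 30.4 and 57.7 hours of CPU time respectively.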
ID: 12647 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Bo-Arne

Send message
Joined: 16 Dec 05
Posts: 2
Credit: 2,850,415
RAC: 0
Message 12661 - Posted: 25 Mar 2006, 7:19:01 UTC - in response to Message 12557.  

I had to abort FA_RLXpt_hom006_1ptq_361_427. I restarted it three times, but it got stuck at 23.73%, Model 3, Step 332516, just a few minutes after entering full-atom relax.

Result ID 14585714. (per moderator request)
ID: 12661 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Loki

Send message
Joined: 9 Dec 05
Posts: 9
Credit: 36,264
RAC: 0
Message 12662 - Posted: 25 Mar 2006, 7:38:54 UTC

Stuck WU (Result ID = 14857109) at 17.5% in the ab initio calculation.
ID: 12662 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Delk

Send message
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 12666 - Posted: 25 Mar 2006, 8:21:38 UTC - in response to Message 12647.  

work-units aborted at 1%:

FA_RLXci_hom029_2ci2I_362_311_0 after 109,513.44 secs
FA_RLXpt_hom002_1ptq__361_439_0 after 207,726.50 secs

Maybe it's time to stop doing longer work units, so I can see when servers haven't reported results in the last few hours...


Result IDs: 14740903 & 14592911
ID: 12666 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Team_Elteor_Borislavj~Intelligence

Send message
Joined: 7 Dec 05
Posts: 14
Credit: 56,027
RAC: 0
Message 12670 - Posted: 25 Mar 2006, 9:54:21 UTC

HB_BARCODE_30_4ubpA_351_16734_0 still stuck at 1% after 9 hours of crunching with 100% load!

ID: 12670 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
mewbysea

Send message
Joined: 29 Jan 06
Posts: 17
Credit: 15,346,891
RAC: 4,068
Message 12678 - Posted: 25 Mar 2006, 12:31:26 UTC

FA_RLXpt_hom004_1ptq_361_127_0 stuck at 83.81%.
WU ID = 11670028; Result ID = 14405752
PC (153231) = Dell 8400, P4 (HT) 3.2 GHz (stock), WIN XP (SP2)
Aborted after over 30 hours of crunching.


ID: 12678 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Grutte Pier [Wa Oars]~Nemesis

Send message
Joined: 8 Nov 05
Posts: 3
Credit: 386,730
RAC: 0
Message 12691 - Posted: 25 Mar 2006, 16:14:46 UTC
Last modified: 25 Mar 2006, 16:15:49 UTC

After a bogus WU on one of my PCs cost me over 300 credits (it was hanging for a long time), I went through all of my WUs. This is a list of all my recent WUs that were aborted with an error:

Intel(R) Pentium(R) M processor 1.73GHz
Microsoft Windows XP Professional Edition, Service Pack 2, (05.01.2600.00)

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11410757
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10507541
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10454400
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10309222
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10644097

AuthenticAMD mobile AMD Athlon(tm) XP 2000+
Microsoft Windows XP Professional Edition, Service Pack 2, (05.01.2600.00)

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11665942
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11639527
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11478408

Intel(R) Pentium(R) 4 CPU 1.60GHz (@2.40GHz)
Microsoft Windows XP Professional Edition, Service Pack 2, (05.01.2600.00)

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11045076
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11068185
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11008648
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10993712
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10976761
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10961239
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10961160
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10931034
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10928750
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10808712

AuthenticAMD mobile AMD Athlon(tm) XP-M 2800+ (LV)
Microsoft Windows XP Professional Edition, Service Pack 2, (05.01.2600.00)

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10419709
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10421027
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10438624
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10529395
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10455024
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10417302
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10390604
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10095664
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10064299
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10015662

AuthenticAMD AMD Sempron(tm) Processor 3000+
Microsoft Windows XP Professional Edition, Service Pack 2, (05.01.2600.00)

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10387309
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=9956247

AuthenticAMD AMD Athlon(tm) 64 X2 Dual Core Processor 4400+
Microsoft Windows XP Professional Edition, Service Pack 2, (05.01.2600.00)

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10629816
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10459506
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10283291
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10059896
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=9544176
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=5796746
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=5733085

AuthenticAMD AMD Sempron(tm) 2400+
Microsoft Windows XP Professional Edition, Service Pack 2, (05.01.2600.00)

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11452627
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11439431
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10345630
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10727823
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=10548190

I'm running Rosetta for its medical purpose, but I think there are over 1000 credits in the list above...
ID: 12691 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
TA_GeoffS

Send message
Joined: 16 Dec 05
Posts: 2
Credit: 704,640
RAC: 0
Message 12700 - Posted: 25 Mar 2006, 19:56:56 UTC
Last modified: 25 Mar 2006, 19:59:37 UTC

I'll try to be more vigilant about noting the status of each WU when I kill it, but I don't think any of these were 1% issues. They were well into the WU and stuck (no progress over a 20-minute span, graphic not moving at all; should I be looking for something else?). All machines are dedicated crunchers with very little else being done on them.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11711529 (68k CPU seconds, 358 pts)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11682684 (117k CPU seconds, 606 pts)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11011944 (142k CPU seconds, 732 pts)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11733000 (43k CPU seconds, 247 pts)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11732999 (71k CPU seconds, 406 pts)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11733500 (60k CPU seconds, 347 pts)
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11580576 (3k CPU seconds, 19 pts)
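The "no progress over a 20-minute span" check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the (elapsed_seconds, percent_done) samples are hypothetical readings you would have to collect yourself, the function name is_stalled is made up, and a flat percentage alone does not prove a hang, since a WU may sit in a long step without updating its progress figure.

```python
def is_stalled(samples, window_s=20 * 60):
    """Report True if percent done has not risen over at least `window_s` seconds.

    `samples` is a list of (elapsed_seconds, percent_done) pairs.
    """
    samples = sorted(samples)
    if len(samples) < 2:
        return False
    last_t, last_pct = samples[-1]
    # Samples that are at least `window_s` older than the latest reading.
    older = [(t, pct) for t, pct in samples if last_t - t >= window_s]
    if not older:
        return False  # not enough history to judge
    _, ref_pct = older[-1]  # the most recent of those older samples
    return last_pct <= ref_pct


# Hypothetical readings taken every 10 minutes.
print(is_stalled([(0, 1.0), (600, 1.0), (1200, 1.0)]))  # flat for 20 minutes
print(is_stalled([(0, 1.0), (600, 5.0), (1200, 9.0)]))  # still advancing
```

The window is a parameter so that a longer span can be used for work units known to have slow steps.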

ID: 12700 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Rossmor35

Send message
Joined: 24 Sep 05
Posts: 4
Credit: 84,870
RAC: 0
Message 12711 - Posted: 26 Mar 2006, 13:33:04 UTC


This WU was stuck at 1% for 6.5 hrs before I aborted it.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11960938
ID: 12711 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Hoogie

Send message
Joined: 4 Nov 05
Posts: 13
Credit: 1,572,894
RAC: 0
Message 12712 - Posted: 26 Mar 2006, 14:21:12 UTC
Last modified: 26 Mar 2006, 14:24:22 UTC

The following workunit 12125177, HB_BARCODE_30_1c8cA_351_20458, has stopped at Model 1 Step 20167. This is repeatable, and I have aborted it.
ID: 12712 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile anders n

Send message
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 12713 - Posted: 26 Mar 2006, 16:31:31 UTC

This WU https://boinc.bakerlab.org/rosetta/workunit.php?wuid=11694953 got stuck at Model 1, step 20690, at 1% with 100% CPU.
After a restart it got stuck at the same place again.

Aborted

Anders n


ID: 12713 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Rich

Send message
Joined: 30 Nov 05
Posts: 5
Credit: 594,384
RAC: 0
Message 12719 - Posted: 27 Mar 2006, 10:56:10 UTC

Good morning.

Below is a work unit I just aborted at 1% after 16 hrs or so. I assume you can pull up the result codes. Let me know if there is more information y'all usually collect and report. I just discovered this thread and will make an effort to report more often.

Hope y'all find a solution. I get these about once every 2 weeks. What is really frustrating to me is to come home from travel and find several days wasted on a 1% work unit. However, I understand that it is a work in progress.

Take care and have a good day.

Rich Seyfert

Work unit name = FA_RLX56_hom014_256bA_362_392_0
Rich Seyfert
Eatontown, NJ
SeyfertR@att.net
ID: 12719 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
casio7131

Send message
Joined: 10 Oct 05
Posts: 35
Credit: 149,748
RAC: 0
Message 12721 - Posted: 27 Mar 2006, 12:59:35 UTC
Last modified: 27 Mar 2006, 13:05:12 UTC

Stuck at 1% after 11h40min:
27/03/2006 10:32:47 PM|rosetta@home|Pausing task HB_BARCODE_30_5croA_351_23561_0 (left in memory)
https://boinc.bakerlab.org/rosetta/result.php?resultid=14998210

Command executed: projects/boinc.bakerlab.org_rosetta/rosetta_4.82_windows_intelx86.exe cc 5cro A -abrelax -stringent_relax -more_relax_cycles -output_chi_silent -vary_omega -rand_envpair_res_wt -rand_SS_wt -farlx -ex1 -ex2 -silent -barcode_from_fragments -new_centroid_packing -barcode_from_fragments_length 30 -ssblocks -barcode_mode 3 -omega_weight 0.5 -jitter_frag -jitter_variation gauss -output_silent_gz -nstruct 10 -paths ccfrags200.txt -relax_score_filter -filter1 -85 -filter2 -95 -short_range_hb_weight 0.50 -long_range_hb_weight 1.0 -increase_cycles 10 -cpu_run_time 7200 -constant_seed -jran 3349200

I've looked at it for a further 10-20 min and it didn't seem to have moved any more. I will restart BOINC now and see what happens.
---
After the restart, it got stuck again (at the same point). Workunit aborted.
ID: 12721 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote



©2024 University of Washington
https://www.bakerlab.org