Message boards : Number crunching : Problems with version 5.96
Author | Message |
---|---|
ConflictingEmotions Send message Joined: 5 Jun 08 Posts: 10 Credit: 3,081,990 RAC: 0 |
We haven't been able to reproduce this behavior yet. Tomorrow I'll update rosetta with the latest boinc api and double check the source code to see if there were any changes between versions that could be causing this. We are seeing an odd error at the end of a local run on our linux machines that suggests an api issue but it may or may not be related. Why are you doing a local run? It should always be the same as for us. If you add some useful error messages to the output, then probably some of us would be willing to run it for you. It does take about 2 hrs to appear, but restarting seems to go back to the same place. The error I reported indicates that there is something wrong with the memory call - Google indicates that it is freeing non-existent memory or providing an insufficient size. I cannot offer more because these systems are behind a firewall. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Are people seeing this problem with other work units or is it a t405 specific problem for now? |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 225 |
All that I had that "got stuck" were t405 WUs. That, of course, is only circumstantial evidence. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
sslickerson Send message Joined: 14 Oct 05 Posts: 101 Credit: 578,497 RAC: 0 |
Are people seeing this problem with other work units or is it a t405 specific problem for now? It seems to be only or mostly t405 work units. |
ConflictingEmotions Send message Joined: 5 Jun 08 Posts: 10 Credit: 3,081,990 RAC: 0 |
Are people seeing this problem with other work units or is it a t405 specific problem for now? Yes, t405 work units have caused problems on two different systems. I have not had any other work units have the problem. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
There was a little miscommunication and the person who ran the local test confirms that we can reproduce this with the t405 task so we have something to work with now. It appears to be just a linux issue. When the boinc api calls exit, it stalls. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Here's a t401 that blew up https://boinc.bakerlab.org/rosetta/result.php?resultid=170539103 <core_client_version>5.10.45</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> # cpu_run_time_pref: 14400 # random seed: 3431315 ERROR:: Exit from: .loop_relax.cc line: 1814 </stderr_txt> ]]> |
Helix Von Smelix Send message Joined: 16 Oct 05 Posts: 12 Credit: 4,029,747 RAC: 17 |
There was a little miscommunication and the person who ran the local test confirms that we can reproduce this with the t405 task so we have something to work with now. It appears to be just a linux issue. When the boinc api calls exit, it stalls. Out of ten+ Win XP (SP2&3) machines, ALL had this problem, and it was with t407 too. I also think there was the same issue with other t4xx WUs. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
We can't reproduce this on our windows machines and it appears to be a linux specific issue. Can others confirm seeing this with windows platforms? If so, can you point me to your tasks so I can see the stderr? |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
Unfortunately... mine will all say "aborted by user". Because I didn't take the time to end and restart BOINC 5 times. And suspend and resume didn't seem to resolve the problem of BOINC thinking the task is active, but it wasn't getting any CPU. https://boinc.bakerlab.org/rosetta/result.php?resultid=171538656 Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
sslickerson Send message Joined: 14 Oct 05 Posts: 101 Credit: 578,497 RAC: 0 |
we can't reproduce this on our windows machines and it appears to be a linux specific issue. Can others confirm seeing this with windows platforms? can you point me to your tasks so I can see the stderr if so? Here you are David, Windows-only problem on my end: Task 171979908. I aborted all of the other 5.96 tasks, but they were behaving the same way. The rest on my list therefore have a status of "aborted by user" with nothing else to go off of. |
Pepo Send message Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
we can't reproduce this on our windows machines and it appears to be a linux specific issue. Can others confirm seeing this with windows platforms? can you point me to your tasks so I can see the stderr if so? Yes, I can. My t405__CASP8_JUMPAB_TYPE2_RES81to192_SAVE_ALL_OUT_BARCODE__3785_53598_0 is stuck at 44.231% on a Windows host. I've not yet tried to restart the client, just suspended the task (+keep in memory) after seeing it idle and immediately resumed it afterwards. In the past, such tasks occasionally continued flawlessly to a successful end after their executable was really restarted. [...] Letting the applications restart due to missed heartbeat did not help. I've thus restarted the client and the task is now happily crunching. I've elevated Rosetta's STD, so the task should finish in some 2 hours, but will probably not be reported until maybe 8 am UTC. You can see the result afterwards. Unluckily, I did not think to make a snapshot of its slot... But if it helps, stderr.txt contains a lot of "res 13 and var 1 at position 1 is not a proper Nterm variant" lines and stderr.txt has the following inside: Unhandled Exception Detected...but without any debugger output afterwards. No other suspicious files around. Peter |
Pepo Send message Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
I've restarted the client and the task is now happily crunching. I've elevated Rosetta's STD, so the task should finish in some 2 hours, but will probably not be reported until maybe 8 am UTC. You can see the result afterwards. In 58 minutes the task got to 54.843% and the same Access Violation happened again. Paused, waiting on requests. As Feet1st noted, if necessary for the result data, the task might finish in 5 BOINC restarts :-D Peter |
The_Bad_Penguin Send message Joined: 5 Jun 06 Posts: 2751 Credit: 4,271,025 RAC: 0 |
From The_Brain_QC, who posted in the Science section: Problem of multithreading with rosetta 5.96 beta |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 225 |
I can confirm that all of the t405 WUs that "stuck" were running on Win XP SP3 systems. The symptoms were not the same as the "stuck at 100%"; mine were typically sticking between 40 and 50%. Otherwise the story is the same: showing as "Running" in BM, but the wall time, % and completion were static, and the Windows Idle process was using the quota - it was the fall in CPU temperature that alerted me to the problem. As above, the stderr will just show aborted by user. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Conan Send message Joined: 11 Oct 05 Posts: 150 Credit: 4,191,010 RAC: 2,944 |
Are the actual rosetta processes not running and the boinc client stays idle as if it doesn't know that the error occurred? We need some more feedback to assess the situation. There is definitely a problem right now with these jobs that were submitted yesterday. If the client doesn't report back, we can't tell that the errors are occurring. I sent an email to Rom and David Anderson to see if there may have been an issue with the BOINC api. I have had this occur on both Windows and Linux. On Linux, a task gets to 100% and keeps saying that it is 'running' but nothing is happening. I have aborted 2 of these so far with another half dozen to go. I have now had one that only got to 16% on Windows before it stopped doing anything, though the status still says 'running'. I aborted that one also, and will now abort all "t405" type work units, as I am losing many hours with nothing to show for it. |
Pepo Send message Joined: 28 Sep 05 Posts: 115 Credit: 101,358 RAC: 0 |
I've restarted the client and the task is now happily crunching. I've elevated Rosetta's STD, so the task should finish in some 2 hours, but will probably not be reported until maybe 8 am UTC. You can see the result afterwards. The t405__CASP8_JUMPAB_TYPE2_RES81to192_SAVE_ALL_OUT_BARCODE__3785_53598_0 finally finished after another client restart. Peter |
Feet1st Send message Joined: 30 Dec 05 Posts: 1755 Credit: 4,690,520 RAC: 0 |
I've got 2 T405s I had suspended on a Win XP machine. What should I capture when they hang up? And when to capture? Before or after suspending the task? [edit] I shortened my runtime, and the first completed normally at 1:51. Do they have to run longer to see the problem? Add this signature to your EMail: Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might! https://boinc.bakerlab.org/rosetta/ |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
A t401 that ended with a compute error https://boinc.bakerlab.org/rosetta/result.php?resultid=170539103 And a t411 that also ended with a compute error https://boinc.bakerlab.org/rosetta/result.php?resultid=170539090 |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I posted minirosetta version 1.29 and rosetta_beta 5.97 on ralph which both include a fix for this bad bug that stalls clients. The problem was a possible infinite loop in the boinc api when an access violation caused by our t405 job was caught after the job completed. Hopefully the tests running on ralph will confirm the fix. |
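
The stall symptom reported throughout this thread (a task shown as running while its CPU time and percentage stay static) can be recognised mechanically. The following is only a minimal sketch under stated assumptions, not part of BOINC or Rosetta: the `Sample` type, the `is_stalled` function and the 600-second window are all hypothetical names and values chosen for illustration, and the samples are assumed to be recorded by hand or by a script, in increasing wall-clock order.

```python
from typing import NamedTuple, Sequence


class Sample(NamedTuple):
    """One observation of a task (hypothetical structure, not a BOINC type)."""
    wall_s: float         # wall-clock time of the observation, in seconds
    cpu_s: float          # CPU time the task has accumulated so far, in seconds
    fraction_done: float  # progress reported for the task, 0.0 to 1.0


def is_stalled(samples: Sequence[Sample], window_s: float = 600.0) -> bool:
    """Return True if neither CPU time nor progress advanced for at least window_s.

    Assumes samples are sorted by wall_s. With too little history to cover
    the window, the task is given the benefit of the doubt.
    """
    if len(samples) < 2:
        return False
    latest = samples[-1]
    baseline = None
    for s in samples:
        # Keep the most recent sample that is at least window_s older than the latest.
        if latest.wall_s - s.wall_s >= window_s:
            baseline = s
    if baseline is None:
        return False
    # Stalled only if both counters failed to move since the baseline.
    return (latest.cpu_s <= baseline.cpu_s
            and latest.fraction_done <= baseline.fraction_done)


if __name__ == "__main__":
    # A task frozen at 44.231% with CPU time no longer increasing, sampled every minute.
    stuck = [Sample(t, 3000.0, 0.44231) for t in range(0, 1201, 60)]
    print(is_stalled(stuck))
```

The values fed in would be the ones posters were reading off the BOINC Manager task list. Requiring both counters to be static avoids flagging a task whose percentage display lags while it is still using CPU.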
©2024 University of Washington
https://www.bakerlab.org