David E K Forum moderator Project administrator Project developer Project scientist Joined: Jul 1 05 Posts: 660 ID: 14 Credit: 838,217 RAC: 28
Please post bugs and issues regarding minirosetta version 1.45.
This update includes fixes for long runtimes in 'relax' jobs, validation errors, checkpoint recovery issues, and numerical instability in hydrogen-bond scoring.
We think we might have fixed the preemption problem, so please keep an eye out for this. The "can't acquire lockfile" issue might also be related. If you are having lockfile problems, please make sure there are no other BOINC applications running in the same slot. If necessary, turn off the client, make sure no BOINC apps are still running, and then restart the client.
Please, please, please post new versions in Rosetta Application Version Release Log. That's how we get notified by email so that we can adjust our firewalls--it's very important.
David E K Forum moderator Project administrator Project developer Project scientist Joined: Jul 1 05 Posts: 660 ID: 14 Credit: 838,217 RAC: 28
This problem is getting very frustrating. I run Rosetta as well as nine other BOINC projects, so I try to balance approximate compute times across all of them except Climate Prediction, whose tasks take 2000 CPU minutes each but allow two years to complete.
So when a Rosetta task runs away for 18-22 hours of CPU time, I end up aborting it, since it still shows a little more than 9 minutes remaining on a task that was estimated to run about 6 hours.
All other BOINC-based projects are fairly accurate, and none have come close to 3-4 times the initial estimate as these have. Right now I have two Rosetta tasks running, and both have hung: one at a little more than nine minutes remaining and the other at ten.
So until this problem is resolved I have no choice but to suspend all further Rosetta tasks. I feel bad about having aborted those 5-6 tasks previously, since that is over 100 hours of CPU time wasted, with two more still suspended. That comes to over 160 hours of wasted CPU time, which could have completed dozens of tasks for other projects. Even SETI hasn't had a task that took more than 40 hours, while the LHCAT tasks only take about two hours each, so I could have processed 80 of them with the CPU time wasted on Rosetta 1.40 tasks.
So I do hope they can fix what is wrong so I can get back to processing Rosetta tasks again.
David E K Forum moderator Project administrator Project developer Project scientist Joined: Jul 1 05 Posts: 660 ID: 14 Credit: 838,217 RAC: 28
Do you still have the names of those problem tasks? Can you try our recently updated version and see if you have the same problems?
Mod.Sense Forum moderator Project administrator Joined: Aug 22 06 Posts: 2397 ID: 106194 Credit: 0 RAC: 0
ChiTownDale, have you seen problems like this with v1.45? It includes changes that should eliminate the long-running models and unpredictable completion times.
____________ Rosetta Moderator: Mod.Sense
Task 212334548, workunit 193576676, failed on my iMac2 running 10.4.11 after about half an hour. It seems to have been completed successfully by someone on an XP system.
<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
SIGBUS: bus error
Crashed executable name: minirosetta_1.45_i686-apple-darwin
built using BOINC library version 6.5.0
Machine type Intel 80486 (32-bit executable)
System version: Macintosh OS 10.4.11 build 8S2167
Fri Dec 5 12:23:43 2008
Promising results. All the silly warning messages have disappeared for me. No NAN hbonding errors either (yet?). Where I used to have 2 out of 5 WUs crash out with the error "Can't acquire lockfile" it's dropped to 2 out of 7 crashing out for that reason (over my first 21 results only - may not turn out to be representative).
In addition, similar to RottenMutt and JChojnacki, task 212336883 errored out with:
Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
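The decimal and hexadecimal forms of the exit statuses quoted in this thread are the same bit pattern read as signed or unsigned. A minimal Python sketch of the conversion follows; the helper names are ours for illustration and are not part of BOINC.

```python
def to_unsigned32(status):
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return status & 0xFFFFFFFF


def to_signed(value, bits):
    """Reinterpret an unsigned value of the given width as two's-complement signed."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


# Windows access violation status, as shown in the result pages
print(hex(to_unsigned32(-1073741819)))  # 0xc0000005
# "too many exit(0)s" status reported by the client
print(hex(to_unsigned32(-226)))         # 0xffffff1e
# Mac/Linux exit code 193 (0xc1) read as a signed byte
print(to_signed(0xC1, 8))               # -63
```

So "-1073741819" and "0xc0000005" describe one and the same error, as do "193", "0xc1" and "-63".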
After running a while, the WUs exit with code 193 and a stack trace.
Note that this is on 4 different Linux nodes, (all of which were running well with version 1.40, except for the NANs problem).
David Ball Joined: Nov 25 05 Posts: 20 ID: 19653 Credit: 1,031,407 RAC: 43
Vista 64 bit on stock HP machine with Q6600 CPU and 5 GB memory - no OC
BOINC 6.2.19
App: Mini 1.45
Name cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_hr1958_olange_5387_12341_1
Ran for around 4 hours and exited with
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000
Stack trace is in the result
http://boinc.bakerlab.org/rosetta/result.php?resultid=212406604
____________
Have you read a good Science Fiction book lately?
I have had 3 WUs error out on me but seems to be much more stable than it was:
http://boinc.bakerlab.org/rosetta/result.php?resultid=212602945
Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 948562
Report deadline 16 Dec 2008 20:17:24 UTC
CPU time 18577.93
stderr out <core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x007FA877 read attempt to address 0x1F59DCA6
Engaging BOINC Windows Runtime Debugger...
********************
http://boinc.bakerlab.org/rosetta/result.php?resultid=212495875
Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 948562
Report deadline 16 Dec 2008 10:40:05 UTC
CPU time 6441.172
stderr out <core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000
Engaging BOINC Windows Runtime Debugger...
http://boinc.bakerlab.org/rosetta/result.php?resultid=212434493
Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
Computer ID 948562
Report deadline 16 Dec 2008 4:04:00 UTC
CPU time 13200.43
stderr out <core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 28800
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x004EAD47 read attempt to address 0x00000000
I notice that this is one of the first two 1.45 workunits I've seen running on my dual-core machine at the same time - is there some problem with that and my memory size (2 GB total)?
Promising results. All the silly warning messages have disappeared for me. No NAN hbonding errors either (yet?). Where I used to have 2 out of 5 WUs crash out with the error "Can't acquire lockfile" it's dropped to 2 out of 7 crashing out for that reason (over my first 21 results only - may not turn out to be representative).
9 out of the next 11 were successful too, making 24 good out of 32, which is the best performance I've had for a very long time. Combined with a continuing 100% record on Beta 5.98s (much more coming through recently) I'm officially happier and less frustrated.
My 5th best day ever!
Not perfect yet, but just reporting some better news instead of constant misery. You must be working on the right lines. Keep it up!
____________
Another crash on Mac OSX 10.4.11. Task 212684901: Workunit 193888088
Same area of code as before (update_domain_map)
<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
# cpu_run_time_pref: 14400
SIGBUS: bus error
Crashed executable name: minirosetta_1.45_i686-apple-darwin
built using BOINC library version 6.5.0
Machine type Intel 80486 (32-bit executable)
System version: Macintosh OS 10.4.11 build 8S2167
Sun Dec 7 06:56:14 2008
steve Joined: Nov 27 08 Posts: 7 ID: 289641 Credit: 1,085 RAC: 0
David,
I just received an error during this file analysis:
Time of DownLoad:
12/7/2008 12:30:34 PM|rosetta@home|Starting task fixed_bb_hb_rlbd_1tig_IGNORE_THE_REST_DECOY_5470_34_0 using minirosetta version 145
Time of Error
12/7/2008 3:31:37 PM|rosetta@home|Started upload of fixed_bb_hb_rlbd_1tig_IGNORE_THE_REST_DECOY_5470_34_0_0
Error Message
"Could not write to a specified memory location". The message asked me if I wanted to DeBug.. I pressed cancel.
Time of Upload:
The file was uploaded to your server as:
12/7/2008 3:31:43 PM|rosetta@home|Finished upload of fixed_bb_hb_rlbd_1tig_IGNORE_THE_REST_DECOY_5470_34_0_0
Steve
P . P . L . Joined: Aug 20 06 Posts: 365 ID: 105843 Credit: 361,915 RAC: 771
Promising results. All the silly warning messages have disappeared for me. No NAN hbonding errors either (yet?). Where I used to have 2 out of 5 WUs crash out with the error "Can't acquire lockfile" it's dropped to 2 out of 7 crashing out for that reason (over my first 21 results only - may not turn out to be representative).
9 out of the next 11 were successful too, making 24 good out of 32, which is the best performance I've had for a very long time. Combined with a continuing 100% record on Beta 5.98s (much more coming through recently) I'm officially happier and less frustrated.
And now 16 more successes out of 18 making 40 out of 50. Most errors came early, so I'm now confident enough to up my run-time from 2 to 3 hours again.
Good work on this "Can't acquire lockfile" problem. I'm just going to tidy up the lockfiles, reboot and see if the good results continue.
Efforts much appreciated here. Let's see if it can be nailed in the next update.
____________
I'm just going to tidy up the lockfiles, reboot and see if the good results continue.
On this, when trying to stop the BOINC service I wasn't allowed to until I'd ended the boinc.exe client process under User Name boinc_master (Vista64 OS quad-core AMD Phenom).
In the Task Manager I 'showed processes for all users' to do this and saw that 2 rosetta_beta_5.98_windows_x86_64.exe*32 processes were still running (correct) but also about 20 minirosetta_1.4x_windows_x86_64.exe*32 processes were running. About half of those were for MiniRosetta 1.40 and the other half for v1.45. All under the User Name boinc_project. There should just have been 2 for v1.45.
My last Mini 1.40 WU was completed late last Friday, so these 10-ish 1.40 processes have persisted for 3 days (no re-boots in that time).
I manually ended all these processes.
Going to the C:\ProgramData\BOINC\slots folder, there were 23 folders (numbered from 0 to 22), the first 19 of which contained a 0-byte boinc_lockfile file and a stderr.txt and a stdout.txt file. The other 4 folders contained the files I'd expect for running processes.
I deleted all the boinc_lockfile files and re-booted. On start-up, the first 19 folders had been removed, leaving the 4 active ones.
I'm no programmer and may be talking out of my hat, but could these old processes still running have something to do with being unable to acquire boinc_lockfile ?
When there's a Compute Error is there some fault in the way the process closes down and releases the files it holds open? Or could it be to do with the user (me) aborting a WU manually when I see it's stalled and producing error messages?
I'm guessing wildly, obviously, but hopefully this means something more sensible to you clever chaps. You seem to be close to some solutions for a problem that's persisted over several versions, so maybe this is the final clue you need? Hope it helps.
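The manual check described above (looking through each slot folder for an empty boinc_lockfile) can be sketched in Python. This is only an illustration under stated assumptions: the function name is ours, it lists candidates and deletes nothing, and a 0-byte lockfile is not proof of a stale slot, so it only makes sense to act on the list while the BOINC client and all science applications are stopped.

```python
from pathlib import Path


def find_empty_lockfiles(slots_dir):
    """Return paths of 0-byte boinc_lockfile files found directly inside slot folders."""
    found = []
    for slot in sorted(Path(slots_dir).iterdir()):
        lock = slot / "boinc_lockfile"
        # Only consider real slot directories that contain an empty lockfile
        if slot.is_dir() and lock.is_file() and lock.stat().st_size == 0:
            found.append(lock)
    return found


# Example call, using the slots path from the post above:
# for lock in find_empty_lockfiles(r"C:\ProgramData\BOINC\slots"):
#     print(lock)
```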
____________
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x007FA87A read attempt to address 0x000002E8
Jack
Chu Forum moderator Project administrator Project developer Project scientist Joined: Feb 23 06 Posts: 120 ID: 61076 Credit: 112,439 RAC: 4
Hi everyone, due to a limitation on command line length set by BOINC, the jobs with name "*_ZN_ABRELAX_*" have their command lines automatically truncated when sent out on Rosetta@Home. That is why you see it stopped with an ERROR like "ERROR: Illegal value for integer option -run:jran specified: " right away. In fact, the same workunits returned with very good success rate on the testing server. We are investigating why the alpha testing server did not catch such errors in the first place. This is another unfortunate incident which will be a new lesson for us. Sorry for any inconvenience this has brought to you.
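As an illustration of how this kind of truncation could be caught before workunits go out, here is a small Python sketch. It is not the project's actual tooling: the function name is hypothetical, and the length limit is left as a parameter because the exact BOINC limit is not stated in this thread. Only the `-run:jran` option name comes from the error message above.

```python
def check_command_line(cmdline, max_len):
    """Return a list of problems found in a workunit command line."""
    problems = []
    if len(cmdline) > max_len:
        problems.append(
            "command line is %d characters, limit is %d" % (len(cmdline), max_len)
        )
    tokens = cmdline.split()
    for i, tok in enumerate(tokens):
        if tok == "-run:jran":
            # A truncated command line may end right after the option name
            value = tokens[i + 1] if i + 1 < len(tokens) else ""
            try:
                int(value)
            except ValueError:
                problems.append("-run:jran has non-integer value %r" % value)
    return problems


full = "-run:jran 1234567"
cut = full[:len("-run:jran ")]          # simulate truncation at the length limit
print(check_command_line(full, 512))    # []
print(check_command_line(cut, 512))     # one problem: missing integer value
```

Running a check like this against the command line after applying the same limit the production server applies would flag a missing `-run:jran` value before any volunteer machine sees it.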
____________
I notice that for most of those workunits, your wingman returned an error on the same workunit, so I'd suspect problems built into either the cs_vanilla workunits or a new feature of minirosetta that few other workunits have used before.
In those cases where your wingman completed the workunit without a problem, it was on an Intel Xeon CPU with a lot more memory. This leads me to believe that cs_vanilla workunits need a lot more memory than your computer has, and suggests that your rather old version of BOINC (5.2.13) may not be sending the information needed to choose only workunits that will work with your memory size. The version of BOINC used where those workunits were successful was 6.2.19.
My computer is in between - using BOINC 5.10.45 with a total RAM memory of 2 GB, and the only cs_vanilla workunit I've had so far ran longer than most of yours, but still failed.
Hi everyone, due to a limitation on command line length set by BOINC, the jobs with name "*_ZN_ABRELAX_*" have their command lines automatically truncated when sent out on Rosetta@Home. That is why you see it stopped with an ERROR like "ERROR: Illegal value for integer option -run:jran specified: " right away.
Thanks. Just about to report 3 of them. Aborted another in advance.
I haven't gotten Minirosetta to run successfully on my XP Pro computer since its inception. Every time there is a new version of Minirosetta or BOINC, or any other change that might affect the success of Minirosetta, I give it a try. The rest of the time I abort Minirosetta until I get a Rosetta Beta WU.
I run many programs on my computer, even new programs that may need debugging; and Minirosetta is the only one that crashes or hangs my computer.
Here are my stats for the last 100 Rosetta WUs:
2 new Rosetta Beta WUs (anticipate success)
2 done (OK) Rosetta Beta WUs
8 failed Rosetta Mini WUs, most 1.40, some 1.45
80 aborted Rosetta Mini WUs, most 1.40, some 1.45
So I waste my time with 88 Rosetta Mini WUs, for 4 Rosetta Beta WUs that are good. (22 to 1)
Another computer in my house, an XP HE with 1/5th the horsepower, can sometimes compute Minirosetta WUs. Its stats are:
5 done (OK) Rosetta Mini WUs, most 1.40
3 failed Rosetta Mini WUs, most 1.40
How much RAM memory do each of those computers have, and how many CPU cores do each of them have? minirosetta is now memory-hungry enough that the answers make a big difference.
Also, have you tried a 1.45 workunit with a name which doesn't start with cs_vanilla? Those workunits have been especially troublesome lately.
What's the total amount of RAM memory on your machine? I just looked over many of the cs_vanilla type of workunits with enough information posted to this thread to find the workunits, and found the following:
Most of that type of workunit that ran under BOINC 6.2.19 on a machine with at least 4 GB of memory succeeded.
Perhaps half of those under BOINC 6.2.19 and 3 GB succeeded.
Most of those with BOINC 6.2.14 or older failed.
Most of those with 2 GB or less failed.
I didn't find enough under BOINC 6.2.18 to be sure, but perhaps half of those I saw succeeded.
Most of the workunits with _ZN_ABRELAX_ in the workunit name have problems; see the earlier message about them.
I saw a lot fewer failures for workunits with different types of names.
Naturally, what I was able to find was probably biased by the fact that people aren't likely to post enough information about workunits that don't fail on at least one machine for me to be able to find them.
I suspect that we need a new system requirements evaluation specific to workunits that use the same features as the cs_vanilla workunits.
How much RAM memory do each of those computers have, and how many CPU cores do each of them have? minirosetta is now memory-hungry enough that the answers make a big difference.
Also, have you tried a 1.45 workunit with a name which doesn't start with cs_vanilla? Those workunits have been especially troublesome lately.
The first one I spoke of has two cores, and -
Memory: 1.94 GB physical, 3.78 GB virtual
The second one has one core, and -
Memory: 1.02 GB physical, 3.91 GB virtual
I'll try a non-vanilla workunit. Do they have chocolate?
If memory is a critical issue, why can't Minirosetta check available memory at the start, and quit right away if memory is too scarce?
Also, it seems that the scheduler ought to notice when a client aborts more than 20 (or some other number of) Minirosetta workunits, and then stop sending them to that client. Or, better, let clients select the applications that they will accept, as other projects allow.
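The two suggestions above can be written down as simple rules. The Python sketch below is purely illustrative: the function names are hypothetical, it does not reflect how minirosetta or the BOINC scheduler actually behave, and the 1.5 GiB per-task requirement is an invented figure.

```python
GIB = 1024 ** 3


def can_start(available_bytes, required_bytes):
    """Startup guard: refuse to begin a workunit that cannot fit in memory."""
    return available_bytes >= required_bytes


def should_send(app_aborts, max_aborts=20):
    """Scheduler rule: stop sending an application a host has aborted too often."""
    return app_aborts < max_aborts


# The two-core machine above reports 1.94 GB of physical memory.
# With two tasks sharing it, each gets roughly half (assumed even split).
print(can_start(1.94 * GIB / 2, 1.5 * GIB))  # False
# The 80 aborted Minirosetta workunits reported earlier exceed the threshold.
print(should_send(80))                       # False
```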
In those cases where your wingman completed the workunit without a problem, it was on an Intel Xeon CPU with a lot more memory. This leads me to believe that cs_vanilla workunits need a lot more memory than your computer has, and suggests that your rather old version of BOINC (5.2.13) may not be sending the information needed to choose only workunits that will work with your memory size.
I find my BOINC version quite reliable, and it certainly sends the memory size information properly. The computers' links at Rosetta show the proper memory size, and in the past my 512MB machines have been refused work when none was available for their memory size.
Your suggestion about cs_vanilla WUs needing more memory may be right, though. I have two quads (8 cores total) with lots of memory, and neither has had any of the cs_vanilla errors.
Maybe those WU eventually hit a model where they are using a lot more memory than they are supposed to.
I don't know about these units passing on Intel Core CPUs with more memory. I am having dozens of these cs_vanilla units bomb out on machines with dual Core 2 quad Xeons and 16 GB of RAM. I think these machines are big enough to handle anything out there. I was running 14 of them up to a day or two ago. Most are still crunching but are starting to wind down so that they can be part of a compute farm.
So I think there is something wrong with these units or with v1.45 of mini.
____________ Looking for a team ??? Join BoincSynergy!!
Tony Joined: Dec 12 05 Posts: 6 ID: 35547 Credit: 1,691,507 RAC: 3,364
ModLoad: 77140000 00160000 C:\Windows\SysWOW64\ntdll.dll (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : wntdll.pdb
File Version : 6.0.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6001.18000
ModLoad: 76c90000 00110000 C:\Windows\syswow64\kernel32.dll (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : wkernel32.pdb
File Version : 6.0.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6001.18000
ModLoad: 75520000 000d0000 C:\Windows\syswow64\USER32.dll (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : wuser32.pdb
File Version : 6.0.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6001.18000
ModLoad: 760e0000 00090000 C:\Windows\syswow64\GDI32.dll (6.0.6001.18023) (-exported- Symbols Loaded)
Linked PDB Filename : wgdi32.pdb
File Version : 6.0.6001.18023 (vistasp1_gdr.080221-1537)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6001.18023
ModLoad: 759d0000 000c6000 C:\Windows\syswow64\ADVAPI32.dll (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : advapi32.pdb
File Version : 6.0.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6001.18000
ModLoad: 75640000 000f0000 C:\Windows\syswow64\RPCRT4.dll (6.0.6001.18051) (-exported- Symbols Loaded)
Linked PDB Filename : wrpcrt4.pdb
File Version : 6.0.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6001.18000
ModLoad: 752f0000 00060000 C:\Windows\syswow64\Secur32.dll (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : wsecur32.pdb
File Version : 6.0.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6001.18000
ModLoad: 75f10000 00060000 C:\Windows\system32\IMM32.DLL (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : wimm32.pdb
File Version : 6.0.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6001.18000
ModLoad: 75e00000 000c8000 C:\Windows\syswow64\MSCTF.dll (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : msctf.pdb
File Version : 6.0.6000.16386 (vista_rtm.061101-2205)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6000.16386
ModLoad: 75f70000 000aa000 C:\Windows\syswow64\msvcrt.dll (7.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : msvcrt.pdb
File Version : 7.0.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 7.0.6001.18000
ModLoad: 76170000 00009000 C:\Windows\syswow64\LPK.DLL (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : wlpk.pdb
File Version : 6.0.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6001.18000
ModLoad: 75aa0000 0007d000 C:\Windows\syswow64\USP10.dll (1.626.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : usp10.pdb
File Version : 1.0626.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft(R) Uniscribe Unicode script processor
Product Version : 1.0626.6001.18000
ModLoad: 73b00000 00021000 C:\Windows\system32\NTMARTA.DLL (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : ntmarta.pdb
File Version : 6.0.6000.16386 (vista_rtm.061101-2205)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6000.16386
ModLoad: 755f0000 0004a000 C:\Windows\syswow64\WLDAP32.dll (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : wldap32.pdb
File Version : 6.0.6000.16386 (vista_rtm.061101-2205)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6000.16386
ModLoad: 75ed0000 0002d000 C:\Windows\syswow64\WS2_32.dll (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : ws2_32.pdb
File Version : 6.0.6000.16386 (vista_rtm.061101-2205)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6000.16386
ModLoad: 75f00000 00006000 C:\Windows\syswow64\NSI.dll (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : nsi.pdb
File Version : 6.0.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6001.18000
ModLoad: 75930000 00007000 C:\Windows\syswow64\PSAPI.DLL (6.0.6000.16386) (-exported- Symbols Loaded)
Linked PDB Filename : psapi.pdb
File Version : 6.0.6000.16386 (vista_rtm.061101-2205)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6000.16386
ModLoad: 73ae0000 00011000 C:\Windows\system32\SAMLIB.dll (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : samlib.pdb
File Version : 6.0.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6001.18000
ModLoad: 75b20000 00144000 C:\Windows\syswow64\ole32.dll (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : ole32.pdb
File Version : 6.0.6000.16386 (vista_rtm.061101-2205)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6000.16386
ModLoad: 6e470000 000dc000 C:\Windows\system32\dbghelp.dll (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : dbghelp.pdb
File Version : 6.0.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6001.18000
ModLoad: 74fe0000 00008000 C:\Windows\system32\version.dll (6.0.6001.18000) (-exported- Symbols Loaded)
Linked PDB Filename : version.pdb
File Version : 6.0.6001.18000 (longhorn_rtm.080118-1840)
Company Name : Microsoft Corporation
Product Name : Microsoft® Windows® Operating System
Product Version : 6.0.6001.18000
Now I've seen some cs_vanilla errors on my biggest-memory computer, an 8 GB quad. This can't be due to running out of memory (unless they're hitting the limit of a 32-bit process's address space).
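For scale, the ceiling mentioned in that parenthesis can be worked out directly. A 32-bit pointer can address 2^32 bytes, and a 32-bit Windows process gets 2 GiB of user-mode address space by default (more only if the executable is marked large-address-aware). Whether minirosetta actually approaches this limit is not established in this thread; the sketch below only shows the arithmetic.

```python
GIB = 1024 ** 3

# Total range a 32-bit pointer can address
address_space = 2 ** 32
print(address_space // GIB)        # 4

# Default user-mode share for a 32-bit Windows process
default_user_space = 2 ** 31
print(default_user_space // GIB)   # 2

# An 8 GB machine has more RAM than one 32-bit process can ever map
installed = 8 * GIB
print(installed > address_space)   # True
```

So a 32-bit task can fail on an allocation even when the machine as a whole has plenty of free RAM.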
Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 95 ID: 5612 Credit: 2,190 RAC: 0
Thanks for all the error reports! I think we've found the issue here. This was damn tricky to find since, for some reason, it doesn't appear to occur on Linux platforms nearly as frequently as on Mac and Windows. I ran the equivalent of several thousand WUs on our local cluster and didn't have a single job crash.
But I think we've found at least one issue by testing on our limited Windows/Mac resources, and a bug fix is going out to Ralph tonight or tomorrow morning, depending on how much caffeine I can get hold of.
Our aim is to get mini in line with old Rosetta in terms of error rate as soon as we can!
Thanks for all the feedback, it totally helps finding these bugs!
Hi everyone, due to a limitation on command line length set by BOINC, the jobs with name "*_ZN_ABRELAX_*" have their command lines automatically truncated when sent out on Rosetta@Home. That is why you see it stopped with an ERROR like "ERROR: Illegal value for integer option -run:jran specified: " right away. In fact, the same workunits returned with very good success rate on the testing server. We are investigating why the alpha testing server did not catch such errors in the first place. This is another unfortunate incident which will be a new lesson for us. Sorry for any inconvenience this has brought to you.
Is this still the case or have they been corrected and re-issued?
I ask this because more are coming through and I just had one crash out on me. I aborted the rest, just in case, to save wasting processing time.
____________
If your computer used any processing time at all on the unit, it must have been another error, because in this particular case the tasks fail to process at all.
Of course, thanks. I forgot. Looks like they were corrected then - crashed out after 25 minutes. I'll let them run.
____________
Task 212871743
cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_hi0719_olange_5386_35341_0
compute error
died at 12807.86 seconds with the usual (0xc0000005) error
task 212923776
cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_ccr19_olange_5384_35553_1
died at 3373.781 seconds with the usual (0xc0000005) error
Task 212986916
cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_mth1598_olange_5388_38521_1
died at 1881.563 seconds with the usual (0xc0000005) error
Task 213039289
cs_vanilla_abrelax_homo_bench_cs_vanilla_abrelax_cs_hi0719_olange_5386_42252_0
died at 5740.828 seconds with the usual (0xc0000005) error
Aborted the remaining vanilla tasks; too many compute errors.
The last one was after 18 (!) hours of computation. 18 hours of wasted electric power.
That's it. I will suspend my participation until the application runs more stably. To that end, I will set all my computers not to download any further workunits on Monday. Maybe I will come back next spring to see if things are working again.
Got a validation error on score12_rlbd_1gvp_IGNORE_THE_REST_DECOY_5473_170. Any indication as to what may have caused this?
The task ran for the full time so no indication on my end of a problem.
I don't know, but I noticed that your wingman on that workunit seemed to have chosen a shorter workunit size, and therefore shut down before reaching whatever caused that problem. Also, I've noticed that choosing a preferred workunit length above 10 hours seems to get me more problematic workunits, so if you get such problems often, you might want to try reducing your preferred workunit size.
I took another look at your results, and noticed that it returned 596 decoys. I don't think I've seen a workunit before that returned a 3 digit number of decoys, so perhaps there needs to be a check of whether both minirosetta 1.45 and the workunit validation software can handle that many decoys for one workunit and still do it properly.
I've been running at this setting for several months without any major trouble and have had several tasks that returned triple-digit decoys. I set the run time to 1 day after running at less than 10 hours for a long time and having units run for what seemed like forever. This way I've not had any tasks run over my preference, and it works well for my setup. I just don't remember a task that ran to completion here and did not validate before this one.
Then perhaps the limit handled successfully is higher than 99 decoys per workunit, but not as high as 596.
When I encountered two cs_vanilla compute errors in a row I set Rosetta to NNW. That was 4 days ago. Until the software is fixed and announced here it will remain so. It behooves the project team to fix these errors ASAP rather than wait until this thread (like its predecessors) is cluttered with hundreds of posts reporting the same stuff. I do not understand this counter-productive behavior.
David E K Forum moderator Project administrator Project developer Project scientist Joined: Jul 1 05 Posts: 660 ID: 14 Credit: 838,217 RAC: 28
We are definitely working on it and will likely have an update within a few days, after testing on Ralph.
Mike Tyka Forum moderator Project administrator Project developer Project scientist Joined: Oct 20 05 Posts: 95 ID: 5612 Credit: 2,190 RAC: 0
Assertion failure in Task 213968874 (abinitio_abrelax_nohomfrag_129_B_1qgvA_5483_146_0)
Workunit 195032150, Mac OS X 10.4.11
Failed after 30 seconds
<core_client_version>6.2.18</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>
<stderr_txt>
Apologies for this - I screwed up the submit for two proteins: 1qgv and 1t2j. I tried to remove the jobs as soon as I noticed, but around 200 WUs went out anyway. If you get a WU with either of those two protein tags, please abort it!
Read here for two links on how to take care of lockfiles.
The last 24 hours have produced this error on 5 WUs:
Server state Over
Outcome Client error
Client state Compute error
Exit status -226 (0xffffff1e)
Computer ID 963376
Report deadline 22 Dec 2008 1:30:07 UTC
CPU time 21570.15
stderr out <core_client_version>6.2.19</core_client_version>
<![CDATA[
<message>
too many exit(0)s
</message>
<stderr_txt>
# cpu_run_time_pref: 86400
# cpu_run_time_pref: 86400
# cpu_run_time_pref: 86400
# cpu_run_time_pref: 86400
# cpu_run_time_pref: 86400
Can't acquire lockfile - exiting
Can't acquire lockfile - exiting
While that is possible, you'd think that if there were a limit it would be coded into the app, and tasks would end once the limit was reached.
1wjdA_ZNMP_ABRELAX_tetraL_IGNORE_THE_REST_ZINC_METALLOPROTEIN-1wjdA-_5478_4043_0 got stuck and was showing 23.45% remaining, which is odd given that the messages in BOINC Manager showed it had started about 5 minutes earlier, before getting interrupted by benchmark testing.
After aborting the task, the next one started and the cores went to 100% immediately.
I notice that your results are the first I've seen that were run under boinc 6.4.1. I wonder if that's the source of the problem instead of minirosetta 1.45?
The Server Status Page is showing a problem as of 8:39am on 12/14/08.
As of 14 Dec 2008 20:26:34 UTC the Server Status Page shows:
Program rah_make_work1 on host srv3 with status "Not running".
Work units Ready to send: 1
It looks like program rah_make_work2 isn't able to handle the load all by itself.
I have uninstalled and reinstalled XP, reinstalled BOINC, enabled the save-in-memory option, run the computer at standard clocks, run memtest on the RAM and a burn-in test on the whole machine, and I'm still getting these compute errors and lockfile errors. At the same time SETI has no problems at all with the machine, its speed, its RAM or anything else.
"A good program doesn't need 54 hoops to jump through before it works"
After a clean install and full format, I have to lean towards the Rosetta coding as the cause.
Problems with this code are:
It doesn't release all (or releases only some) of its processes when asked to snooze, the lockfile is always present, and it says too many restarts.
Other machines here are running fine, but this one seems to have problems with Rosetta@home only.
It does not follow fair sharing of resources: BOINC Manager is set to 50:50 and Rosetta has basically locked out all other projects.
How about a nice little MSI file to patch up the damage, and let's get folding.
The core client is the one doing the resource shuffling not the rosetta app. And I suspect it is doing its work alright as long as you're not trying to micro-manage BOINC.
____________
Very strange... but I think that installing BOINC again just reinstalls the base program and does nothing to the project files. Did you go into your slots folder and erase the slots? That is where the lockfiles are located. Be sure to complete all your currently running tasks before deleting. I run BOINC off a different partition than C; perhaps you can complete your current work, then install on a different partition and see if that takes care of the problem. After I did the slot cleanup on my system, everything worked OK.
Greg be
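For anyone who wants to check their slots before deleting anything, here is a minimal sketch that lists which slot directories still contain a lockfile. It assumes the lockfile is named `boinc_lockfile` and that slots live in a `slots` folder under the BOINC data directory; both are assumptions to verify against your own installation, and the function name is made up for this example. It only reads the directory and deletes nothing.

```python
from pathlib import Path
from typing import List

# Assumed lockfile name; look inside your own slots folder to confirm it.
LOCKFILE_NAME = "boinc_lockfile"


def slots_with_lockfiles(boinc_data_dir: str) -> List[str]:
    """Return the names of slot directories that still contain a lockfile."""
    slots_dir = Path(boinc_data_dir) / "slots"
    if not slots_dir.is_dir():
        return []  # no slots folder at this location
    return sorted(
        slot.name
        for slot in slots_dir.iterdir()
        if slot.is_dir() and (slot / LOCKFILE_NAME).exists()
    )
```

Run it with the client shut down, passing the path to your own BOINC data directory. A lockfile that is still listed while no BOINC application is running would be a candidate for the cleanup described above.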
Greg,
I did a full hard drive format, so there are no files left, project or otherwise, and then reinstalled XP. The slots are cleaned up.
After reinstalling BOINC, adding Rosetta as a project and letting it manage itself for 24 hours, I had all the same errors as before: lockfile, not releasing, and as always no credit. I have been running the services from a non-system disk; I will reinstall on the system disk and see how that works.
Thanks for taking time to help out.
And now 16 more successes out of 18 making 40 out of 50. Most errors came early, so I'm now confident enough to up my run-time from 2 to 3 hours again.
Update on this. In the last week, 116 MiniRosetta 1.45 tasks, 3hr runtime:
64 Success (55%)
52 Failure (45%)
So, better than last time I ran with 3hr runtimes (was 43%) but still some way to go. I think the figure for 2hr run times was 73% (up to 80% with v1.45 on above figures).
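The percentages in this update can be rechecked with a few lines. This is only a sketch of the arithmetic, using the counts quoted in the post; the function name is made up for this example.

```python
def success_rate(successes: int, failures: int) -> float:
    """Return the percentage of tasks that succeeded."""
    total = successes + failures
    return 100.0 * successes / total


rate = success_rate(64, 52)  # 64 successes out of 116 tasks at 3hr runtime
print(f"{rate:.0f}% success, {100 - rate:.0f}% failure")  # prints "55% success, 45% failure"
```

The same function gives 80% for the earlier 40-out-of-50 figure at 2hr runtime.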
____________
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
minirosetta_1.45_i686-apple-darwin(47077,0xa0538fa0) malloc: *** error for object 0x17478c0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
# cpu_run_time_pref: 14400
minirosetta_1.45_i686-apple-darwin(47077,0xa0538fa0) malloc: *** error for object 0x17478c0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
SIGBUS: bus error
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
minirosetta_1.45_i686-apple-darwin(48148,0xa0538fa0) malloc: *** error for object 0x17478c0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
# cpu_run_time_pref: 14400
SIGBUS: bus error
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
minirosetta_1.45_i686-apple-darwin(48486,0xa0538fa0) malloc: *** error for object 0x17478c0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
# cpu_run_time_pref: 14400
minirosetta_1.45_i686-apple-darwin(48486,0xa0538fa0) malloc: *** error for object 0x17478c0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
minirosetta_1.45_i686-apple-darwin(48486,0xa0538fa0) malloc: *** error for object 0x17478c0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
SIGBUS: bus error
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
minirosetta_1.45_i686-apple-darwin(58522,0xa0538fa0) malloc: *** error for object 0x17478c0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
# cpu_run_time_pref: 14400
minirosetta_1.45_i686-apple-darwin(58522,0xb0087000) malloc: *** error for object 0x17478c0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
terminate called after throwing an instance of 'std::length_error'
what(): basic_string::_S_create
SIGABRT: abort called
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
minirosetta_1.45_i686-apple-darwin(47419,0xa0538fa0) malloc: *** error for object 0x17478c0: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
# cpu_run_time_pref: 14400
SIGSEGV: segmentation violation
Come on guys, you're killing me.
2 compute errors in 8 hours today, and then 4 out of 6 compute errors on the 11th that can be blamed on bad tasks. What is with tasks getting halfway and then crashing with no credit? You should make Rosie grant the claimed credit on these errors, since it is computing the credit anyway. Then we are not wasting our CPU time and electricity on 0 points. I could have got 101 points for the second crash. This month I have lost 18.5 hrs on bad tasks that died halfway, and I could have got 508.4 credits if credit were granted for crashes. The crash rate this month so far has been 6% on my system.
You seem to be inserting a few extra characters when you create links, probably quote marks, which prevents me from following the links.
**Note** I edited his URL lines to get rid of the " ". Things should be OK now.
There is nothing to see there, as it was a user abort and that cancels the run-time information. This applies to both links.
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00427BEA write attempt to address 0x08787FFC
What the heck is this error now? Did your task go bad at the last minute?
Can someone from the team explain what the heck the 0xc blah blah error means?
I was giving you 6hr run times but now I have dropped to 4. Too many credit losses lately. If the 1.47's crash I will be reducing my resource share as well, until you guys figure out what the heck is going on.
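On the question of what the codes mean: 0xc0000005 is the Windows status code STATUS_ACCESS_VIOLATION, meaning the program tried to read or write memory it was not allowed to touch, which matches the "Access Violation ... write attempt" line in the report above. The other numbers in this thread are the same exit value shown in different forms. A minimal sketch of the conversions follows; the helper names are made up for this example and are not part of BOINC.

```python
def as_unsigned32(code: int) -> str:
    """Show a signed exit status as a 32-bit hex value, the form seen in the logs."""
    return f"0x{code & 0xFFFFFFFF:08x}"


def as_signed8(code: int) -> int:
    """Reinterpret the low 8 bits of an exit code as a signed byte."""
    low = code & 0xFF
    return low - 256 if low >= 128 else low


# Only the one code discussed here; this is not a complete table.
KNOWN = {0xC0000005: "STATUS_ACCESS_VIOLATION (Windows: invalid memory access)"}

print(as_unsigned32(-226))  # prints "0xffffff1e"
print(as_signed8(193))      # prints "-63"
```

This is why the logs show "Exit status -226 (0xffffff1e)" and "process exited with code 193 (0xc1, -63)": each pair is one value written two ways. The conversions say nothing about why the application crashed.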
It would appear that there are no new tasks at the moment. Be patient and all will be revealed. There hasn't been any announcement from the team, so either there is a problem at their end or they are getting some new work units ready.
____________