Rosetta 4.1+ and 4.2+

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,576,016
RAC: 20,379
Message 98818 - Posted: 7 Sep 2020, 22:36:51 UTC - in response to Message 98810.  

Well I have no clue how to troubleshoot the issue. As I stated, I have no issues with any other tasks. I have 32GB of memory and memory usage is only 25% of max. I see less than 1GB of memory usage on the Rosetta tasks.


I swear by "Memtest 86" (or "Memtest 86+"), whichever works on your system - one doesn't work on older machines and one doesn't work on newer ones, I can't remember which. You download it for free, it makes a bootable OS-independent CD, and you run it for about an hour or so until it says "pass complete". If even a single RAM error is reported, you need to replace the RAM. You can easily find out which chip is faulty by testing one at a time.

Keith Myers
Joined: 29 Mar 20
Posts: 96
Credit: 322,693
RAC: 1,374
Message 98819 - Posted: 7 Sep 2020, 22:37:50 UTC - in response to Message 98813.  

I only run two Rosetta tasks at a time at most. The one task that I mentioned uses all available memory (32GB) plus all of the 6GB swap file every ten minutes or so. Must be writing out to a scratch file or something. The most memory I ever saw a single Rosetta task use before was around 4GB, which is what prompted me to bump to 32GB in the first place.
So this task species is most definitely an extreme outlier.

As far as changing settings, since no tasks from any other project have any issues, the solution is just to quit crunching Rosetta.

Keith Myers
Joined: 29 Mar 20
Posts: 96
Credit: 322,693
RAC: 1,374
Message 98821 - Posted: 7 Sep 2020, 22:41:01 UTC - in response to Message 98818.  

Well I have no clue how to troubleshoot the issue. As I stated, I have no issues with any other tasks. I have 32GB of memory and memory usage is only 25% of max. I see less than 1GB of memory usage on the Rosetta tasks.


I swear by "Memtest 86" (or "Memtest 86+"), whichever works on your system - one doesn't work on older machines and one doesn't work on newer ones, I can't remember which. You download it for free, it makes a bootable OS-independent CD, and you run it for about an hour or so until it says "pass complete". If even a single RAM error is reported, you need to replace the RAM. You can easily find out which chip is faulty by testing one at a time.

I swear by stressapptest. My systems pass 24 hours of memory testing using all available memory and all available cores with no errors.

Keith Myers
Joined: 29 Mar 20
Posts: 96
Credit: 322,693
RAC: 1,374
Message 98823 - Posted: 7 Sep 2020, 22:44:10 UTC - in response to Message 98815.  

[Edit 2] Well it was this Rosetta task kp8RjDVk_fold_and_dock_SAVE_ALL_OUT_1009390_201. It is grabbing all the memory and the swap file every five minutes or so.
It's a resend, this is what the first system got with it.

            Outcome Computation error
       Client state Compute error
        Exit status 1 (0x00000001) Unknown error code
        Computer ID 5159178
           Run time 19 min 44 sec
           CPU time 18 min 38 sec
     Validate state Invalid
             Credit 0.00
  Device peak FLOPS 3.28 GFLOPS
Application version Rosetta v4.20 windows_x86_64



Stderr output
<core_client_version>7.0.80</core_client_version>
<![CDATA[
<message>
Incorrect function.
 (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @kp8RjDVk_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_kp8RjDVk_data.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3873245
Using database: database_357d5d93529_n_methylminirosetta_database

ERROR: Error in core::kinematics::FoldTree::get_jump_that_builds_residue(): This residue is not the child of (built by) a jump!
ERROR:: Exit from: ..\..\..\src\core\kinematics\FoldTree.cc line: 436
BOINC:: Error reading and gzipping output datafile: default.out
16:00:04 (3796): called boinc_finish(1)

</stderr_txt>
]]>

Thanks for the reply. That report is exactly what I am seeing on this task. My memory usage for the task climbs from 1GB all the way to all memory and swap in use for the task every ten minutes or so and then falls back to normal. Looking at it in htop was what allowed me to figure out the culprit.
So I assume a faulty work unit and I will just abort it now.
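
The climb-and-fall pattern described here can also be logged instead of watched live in htop. A minimal sketch in Python, assuming a Linux host: it reads the `VmRSS` field from `/proc/<pid>/status` and marks samples above a threshold. The function names, the 4 GiB threshold and the sampling interval are illustrative assumptions, not anything BOINC or Rosetta provides.

```python
import re
import time

def parse_vm_rss_kib(status_text):
    """Return the VmRSS value (in KiB) from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def read_rss_kib(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kib(f.read())

def watch(pid, threshold_kib=4 * 1024 * 1024, interval_s=10, samples=6):
    """Print one line per sample and mark those above the assumed 4 GiB threshold."""
    for _ in range(samples):
        rss = read_rss_kib(pid)
        if rss is None:
            print(f"pid={pid} has no VmRSS field (e.g. a kernel thread)")
        else:
            flag = "  <-- above threshold" if rss > threshold_kib else ""
            print(f"{time.strftime('%H:%M:%S')} pid={pid} rss={rss / 1024:.0f} MiB{flag}")
        time.sleep(interval_s)
```

Calling `watch(pid)` with the process ID that htop shows for the Rosetta task would print a short history; a larger `samples` value would be needed to catch a spike that only happens every ten minutes or so.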

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,576,016
RAC: 20,379
Message 98824 - Posted: 7 Sep 2020, 22:44:18 UTC - in response to Message 98819.  

I only run two Rosetta tasks at a time at most. The one task that I mentioned uses all available memory (32GB) plus all of the 6GB swap file every ten minutes or so. Must be writing out to a scratch file or something. The most memory I ever saw a single Rosetta task use before was around 4GB, which is what prompted me to bump to 32GB in the first place.
So this task species is most definitely an extreme outlier.

As far as changing settings, since no tasks from any other project have any issues, the solution is just to quit crunching Rosetta.


Kinda looks like you're getting unlucky and receiving dodgy tasks that eat memory. Oh well, put up with the errors and let them see the problem, or switch it off and let someone else get the horrid ones for a while.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,576,016
RAC: 20,379
Message 98825 - Posted: 7 Sep 2020, 22:45:27 UTC - in response to Message 98821.  

Well I have no clue how to troubleshoot the issue. As I stated, I have no issues with any other tasks. I have 32GB of memory and memory usage is only 25% of max. I see less than 1GB of memory usage on the Rosetta tasks.


I swear by "Memtest 86" (or "Memtest 86+"), whichever works on your system - one doesn't work on older machines and one doesn't work on newer ones, I can't remember which. You download it for free, it makes a bootable OS-independent CD, and you run it for about an hour or so until it says "pass complete". If even a single RAM error is reported, you need to replace the RAM. You can easily find out which chip is faulty by testing one at a time.

I swear by stressapptest. My systems pass 24 hours of memory testing using all available memory and all available cores with no errors.


I'd never heard of that, I assume it does the same thing as Memtest. Does it run within the OS? If so I'd not trust it, as the OS can't let it test memory in use by the kernel.

Keith Myers
Joined: 29 Mar 20
Posts: 96
Credit: 322,693
RAC: 1,374
Message 98828 - Posted: 7 Sep 2020, 22:56:18 UTC - in response to Message 98825.  
Last modified: 7 Sep 2020, 23:10:56 UTC

Well I have no clue how to troubleshoot the issue. As I stated, I have no issues with any other tasks. I have 32GB of memory and memory usage is only 25% of max. I see less than 1GB of memory usage on the Rosetta tasks.


I swear by "Memtest 86" (or "Memtest 86+"), whichever works on your system - one doesn't work on older machines and one doesn't work on newer ones, I can't remember which. You download it for free, it makes a bootable OS-independent CD, and you run it for about an hour or so until it says "pass complete". If even a single RAM error is reported, you need to replace the RAM. You can easily find out which chip is faulty by testing one at a time.

I swear by stressapptest. My systems pass 24 hours of memory testing using all available memory and all available cores with no errors.


I'd never heard of that, I assume it does the same thing as Memtest. Does it run within the OS? If so I'd not trust it, as the OS can't let it test memory in use by the kernel.

Well I first used Memtest from a USB stick. But the memory testers on OCN state that is a very poor tester for Linux. They recommend the Google stressapptest. That is the one Google developed to test their servers before putting them into service in their server farms. It is a standard application in the repositories.

I then follow up the memory stress testing with several hours of Prime95 and y-cruncher to put the system under actual compute loads to make sure it is stable before starting up BOINC with my actual loads. Closest I can come to actual BOINC loads. But BOINC is the final arbiter of stability. If I don't run Rosetta, I don't get any errors on any of my other projects.
[Edit]
Here are some links about it.
https://www.ghacks.net/2009/10/19/google-stress-app-test/
https://rog.asus.com/forum/showthread.php?73665-Our-preferred-memory-stress-test

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1663
Credit: 17,329,705
RAC: 24,442
Message 98830 - Posted: 7 Sep 2020, 23:07:10 UTC - in response to Message 98828.  

If I don't run Rosetta, I don't get any errors on any of my other projects.
Yet you're the only one that is having signal 11 issues with WUs that others can process with no problems at all- even with the same application.

Signal 11 (SIGSEGV) indicates a memory problem. The problem only occurs with Rosetta Tasks, which in general use way more RAM than other projects' Tasks. And you've been getting these errors since before the faulty memory pig WUs came out.
Everything points towards a hardware memory issue- be it too much/too little voltage, too much overclock, or just a dodgy address(es).
*shrug*
Grant
Darwin NT
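
For readers matching the number in a task log to a signal name, here is a small Python sketch. On Linux, signal 11 is SIGSEGV, meaning the process touched memory it was not allowed to access; the name alone does not say whether software or hardware was at fault. The helper name `describe_signal` is an assumption made for this illustration.

```python
import signal

def describe_signal(number):
    """Map a signal number to its symbolic name, for reading task error logs."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known to this platform.
        return f"unknown signal {number}"

print(describe_signal(11))
```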

Brian Nixon
Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 98831 - Posted: 7 Sep 2020, 23:17:46 UTC - in response to Message 98828.  

If I don't run Rosetta, I don't get any errors on any of my other projects.
The thing is: other people don’t get any errors on the Rosetta tasks that fail on your machine. Rosetta seems to be uncovering a fault that those synthetic stress testers fail to detect.

Keith Myers
Joined: 29 Mar 20
Posts: 96
Credit: 322,693
RAC: 1,374
Message 98832 - Posted: 7 Sep 2020, 23:18:06 UTC - in response to Message 98830.  

If I don't run Rosetta, I don't get any errors on any of my other projects.
Yet you're the only one that is having signal 11 issues with WUs that others can process with no problems at all- even with the same application.

Signal 11 (SIGSEGV) indicates a memory problem. The problem only occurs with Rosetta Tasks, which in general use way more RAM than other projects' Tasks. And you've been getting these errors since before the faulty memory pig WUs came out.
Everything points towards a hardware memory issue- be it too much/too little voltage, too much overclock, or just a dodgy address(es).
*shrug*

Not arguing with you. As I previously stated, I guess Rosetta tasks work the memory harder than any other project. The Einstein GW tasks are supposedly very hard on memory yet I have no issues. The TN-Grid tasks which are also molecular modeling like Rosetta have no issues.
And I never got any response to my question about whether VSYSCALL=emulate is needed for Rosetta apps. Maybe that is the problem. I can either continue to run tasks here and have errors or give up completely. No skin off my nose as far as I am concerned. Only using 2 of 30 cores so not losing too much compute time.

Keith Myers
Joined: 29 Mar 20
Posts: 96
Credit: 322,693
RAC: 1,374
Message 98833 - Posted: 7 Sep 2020, 23:20:55 UTC - in response to Message 98831.  
Last modified: 7 Sep 2020, 23:21:43 UTC

If I don't run Rosetta, I don't get any errors on any of my other projects.
The thing is: other people don’t get any errors on the Rosetta tasks that fail on your machine. Rosetta seems to be uncovering a fault that those synthetic stress testers fail to detect.

Well, neither Prime95 nor y-cruncher is a synthetic application. They are real compute loads like Rosetta. And yet they uncover no memory issues and cause no sigsegv errors.

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,576,016
RAC: 20,379
Message 98851 - Posted: 8 Sep 2020, 1:53:23 UTC - in response to Message 98828.  


Well I first used Memtest from a USB stick. But the memory testers on OCN state that is a very poor tester for Linux. They recommend the Google stressapptest. That is the one Google developed to test their servers before putting them into service in their server farms. It is a standard application in the repositories.

I then follow up the memory stress testing with several hours of Prime95 and y-cruncher to put the system under actual compute loads to make sure it is stable before starting up BOINC with my actual loads. Closest I can come to actual BOINC loads. But BOINC is the final arbiter of stability. If I don't run Rosetta, I don't get any errors on any of my other projects.
[Edit]
Here are some links about it.
https://www.ghacks.net/2009/10/19/google-stress-app-test/
https://rog.asus.com/forum/showthread.php?73665-Our-preferred-memory-stress-test


I think I've always used CDs because older machines sucked at booting from USB.

What do you mean a "poor tester for Linux"? It tests the physical RAM, and it doesn't matter what OS you run on the machine afterwards.

I've used all sorts of dodgy 2nd hand crap from Ebay, and Memtest has always spotted faulty RAM within 5 or 10 minutes. Never had anything crash that's passed a 2 hour memtest.

It's weird that it's only Rosetta and only your machine. It has to be a bug in Rosetta that only occurs on certain models of CPU. If you had hardware problems, other projects would screw up too. Rosetta is hardly the most difficult project to run. I'd say LHC stresses it most (the virtual machine apps, not Sixtrack), do you run that?

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1663
Credit: 17,329,705
RAC: 24,442
Message 98857 - Posted: 8 Sep 2020, 2:16:00 UTC - in response to Message 98851.  

It's weird that it's only Rosetta and only your machine. It has to be a bug in Rosetta that only occurs on certain models of CPU.
That isn't the case.
There are the same types of Tasks running using the same application on the same model CPUs without error.

If it occurred on all systems of a given CPU using different applications, but the same applications on similar CPUs were OK, then it would be a problem with that CPU type (needing a micro-code fix, or a specific fix for that CPU in the Application). If the errors were produced by a particular application on a particular CPU, but other applications on that same CPU work OK, then it'd be a problem with the application. As the errors are only occurring on a given system, and not on other systems using the same application & the same CPU, it's a pretty fair bet that it is an issue with that system.
Grant
Darwin NT
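
The elimination reasoning in this post can be written down as a tiny decision function. This is only a sketch of the argument as stated, in Python; the function and parameter names are hypothetical, and real-world faults can of course have more than one cause.

```python
def likely_fault_domain(fails_elsewhere_same_cpu_same_app, other_apps_fail_on_this_cpu_type):
    """Return the most likely culprit according to the elimination argument.

    fails_elsewhere_same_cpu_same_app: do other systems with the same CPU
        model also fail when running the same application?
    other_apps_fail_on_this_cpu_type: do different applications also fail
        on that CPU model?
    """
    if not fails_elsewhere_same_cpu_same_app:
        # Same app and same CPU work fine on other machines.
        return "this system"
    if other_apps_fail_on_this_cpu_type:
        # Failures follow the CPU model regardless of application.
        return "CPU type"
    # Failures follow the application on that CPU model only.
    return "application"

# The situation described in the thread: other hosts with the same CPU
# complete the same tasks, and other projects run cleanly.
print(likely_fault_domain(False, False))
```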

Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 11,576,016
RAC: 20,379
Message 98859 - Posted: 8 Sep 2020, 2:19:18 UTC - in response to Message 98857.  

It's weird that it's only Rosetta and only your machine. It has to be a bug in Rosetta that only occurs on certain models of CPU.
That isn't the case.
There are the same types of Tasks running using the same application on the same model CPUs without error.

If it occurred on all systems of a given CPU using different applications, but the same applications on similar CPUs were OK, then it would be a problem with that CPU type (needing a micro-code fix, or a specific fix for that CPU in the Application). If the errors were produced by a particular application on a particular CPU, but other applications on that same CPU work OK, then it'd be a problem with the application. As the errors are only occurring on a given system, and not on other systems using the same application & the same CPU, it's a pretty fair bet that it is an issue with that system.


Agreed. I guess he either has to mess around with hardware settings or pull RAM chips, or just not give that PC Rosetta to do.

Keith Myers
Joined: 29 Mar 20
Posts: 96
Credit: 322,693
RAC: 1,374
Message 98868 - Posted: 8 Sep 2020, 6:30:18 UTC - in response to Message 98851.  

What do you mean a "poor tester for Linux"? It tests the physical RAM, and it doesn't matter what OS you run on the machine afterwards.

It is the opinion of the memory testers at OCN that Memtest is a particularly poor tester. It does not test very thoroughly or very consistently. Since those experts have much more experience than I, I trust their opinions.

Keith Myers
Joined: 29 Mar 20
Posts: 96
Credit: 322,693
RAC: 1,374
Message 98869 - Posted: 8 Sep 2020, 6:33:29 UTC - in response to Message 98857.  

but the same applications on similar CPUs were OK then it would be a problem with that CPU type (needing a micro-code fix, or a specific fix for that CPU in the Application). If the errors were produced by a particular application on a particular CPU, but other applications on that same CPU work OK, then it'd be a problem with the application.

But we still do not know that. As far as I can tell in my research in various threads here and at Seti and Einstein, if an application is written expecting the deprecated VSYSCALL function to be available, the application will segfault. Only applies to Linux systems. Not applicable in Windows.
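
One way to see what a given Linux box is doing about vsyscall is sketched below in Python. It assumes an x86-64 Linux kernel, where the legacy vsyscall page is controlled by a `vsyscall=` boot parameter and, when present, appears as a `[vsyscall]` region in a process memory map. The function names are made up for this example, and a missing parameter only means the kernel's build-time default applies, not that the feature is off.

```python
def vsyscall_setting(cmdline):
    """Return the value of the first vsyscall= kernel parameter, or None if absent."""
    for token in cmdline.split():
        if token.startswith("vsyscall="):
            return token.split("=", 1)[1]
    return None

def vsyscall_page_mapped(maps_text):
    """True if the text of /proc/<pid>/maps lists a [vsyscall] region."""
    return any(line.rstrip().endswith("[vsyscall]") for line in maps_text.splitlines())

def report():
    """Print both checks for the current machine (Linux only)."""
    with open("/proc/cmdline") as f:
        print("vsyscall parameter:", vsyscall_setting(f.read()))
    with open("/proc/self/maps") as f:
        print("[vsyscall] mapped:", vsyscall_page_mapped(f.read()))
```

Running `report()` on the affected host and on a host that completes the same tasks would show whether the two actually differ on this point.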

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1663
Credit: 17,329,705
RAC: 24,442
Message 98870 - Posted: 8 Sep 2020, 6:50:58 UTC - in response to Message 98869.  

but the same applications on similar CPUs were OK then it would be a problem with that CPU type (needing a micro-code fix, or a specific fix for that CPU in the Application). If the errors were produced by a particular application on a particular CPU, but other applications on that same CPU work OK, then it'd be a problem with the application.

But we still do not know that. As far as I can tell in my research in various threads here and at Seti and Einstein, if an application is written expecting the deprecated VSYSCALL function to be available, the application will segfault. Only applies to Linux systems. Not applicable in Windows.
Does the OS version info give you an idea of whether the VSYSCALL function is likely to be available or not? Does the fact that you do complete some Tasks indicate it's not the issue?

Here are other systems running the same Linux application that completed WUs that errored out on your system:
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1125457579
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1125456414
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=1125187602
Grant
Darwin NT

MarkJ
Joined: 28 Mar 20
Posts: 72
Credit: 25,238,680
RAC: 0
Message 98873 - Posted: 8 Sep 2020, 9:53:52 UTC
Last modified: 8 Sep 2020, 10:20:42 UTC

I'm seeing a bunch of fold_and_dock work units that are using huge amounts of memory. I just spotted one where its properties had a working set size of 43GB. The machine in question has 64GB. I have suspended all the other tasks so it can get out of the way. I am now seeing the disk LED on constantly, so it's probably grown past the available memory and is paging.

I have a couple of failures like this which only wanted 18GB.

Not sure what they're doing but the average BOINC user isn't going to have machines with that much memory and most are going to fail.
BOINC blog

Grant (SSSF)
Joined: 28 Mar 20
Posts: 1663
Credit: 17,329,705
RAC: 24,442
Message 98874 - Posted: 8 Sep 2020, 10:03:22 UTC - in response to Message 98873.  
Last modified: 8 Sep 2020, 10:29:28 UTC

I have a couple of failures like this which only wanted 18GB.
And on the other system that tried to process it (also Linux), while it only used 228MB of RAM, it crashed out with a Signal 11 error in 45 sec.

Still waiting to see the result of one of these Tasks on a Windows system.



Edit- just found one against the out of control RAM Task that Keith aborted.

            Outcome Computation error
       Client state Compute error
        Exit status 1 (0x00000001) Unknown error code
        Computer ID 5159178
           Run time 19 min 44 sec
           CPU time 18 min 38 sec
     Validate state Invalid
             Credit 0.00
  Device peak FLOPS 3.28 GFLOPS
Application version Rosetta v4.20 windows_x86_64


Stderr output
<core_client_version>7.0.80</core_client_version>
<![CDATA[
<message>
Incorrect function.
 (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @kp8RjDVk_fold_and_dock_flags -silent_gz -mute all -out:file:silent default.out -in:file:boinc_wu_zip fold_and_dock_kp8RjDVk_data.zip -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3873245
Using database: database_357d5d93529_n_methylminirosetta_database

ERROR: Error in core::kinematics::FoldTree::get_jump_that_builds_residue(): This residue is not the child of (built by) a jump!
ERROR:: Exit from: ..\..\..\src\core\kinematics\FoldTree.cc line: 436
BOINC:: Error reading and gzipping output datafile: default.out
16:00:04 (3796): called boinc_finish(1)

</stderr_txt>
]]>




Looks like another batch of dud Work Units.


Edit- just found one running on one of my systems.
Been going for just under an hour, and its properties at this stage are

Virtual memory size 45.69GB
   Working set size  9.35GB
And it hasn't checkpointed since 7 minutes after it started.

Checking Task Manager, the RAM usage for that task is increasing at roughly 2MB per second.
Grant
Darwin NT
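
The growth rate quoted here gives a rough idea of how long such a task can run before it exhausts a machine. A back-of-the-envelope sketch in Python, assuming the growth stays constant and treating 1 GB as 1024 MB; the function name and the 32 GB machine size are hypothetical, since the post does not state how much RAM that system has.

```python
def hours_until_full(current_gb, total_gb, growth_mb_per_s):
    """Hours until usage grows from current_gb to total_gb at a constant rate."""
    if growth_mb_per_s <= 0:
        raise ValueError("growth rate must be positive")
    remaining_mb = max(total_gb - current_gb, 0) * 1024
    return remaining_mb / growth_mb_per_s / 3600

# Working set and growth rate from the post; 32 GB total is a made-up example.
print(f"{hours_until_full(9.35, 32, 2):.1f} h")
```

With those numbers the machine would fill in a little over three hours, well short of the 8 hour `-cpu_run_time 28800` target shown in the command line of the failed task.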

MarkJ
Joined: 28 Mar 20
Posts: 72
Credit: 25,238,680
RAC: 0
Message 98876 - Posted: 8 Sep 2020, 10:31:13 UTC - in response to Message 98874.  
Last modified: 8 Sep 2020, 10:43:14 UTC

I have a couple of failures like this which only wanted 18GB.
And on the other system that tried to process it (also Linux), while it only used 228MB of RAM, it crashed out with a Signal 11 error in 45 sec.

The other system only has 8GB of memory, that might account for why it only ran for 45 seconds.

I have aborted all the fold_and_dock tasks and let the other tasks run.
BOINC blog



©2024 University of Washington
https://www.bakerlab.org