Rosetta 4.0+

Message boards : Number crunching : Rosetta 4.0+

spRocket
Joined: 23 Mar 20
Posts: 22
Credit: 3,008,018
RAC: 0
Message 92374 - Posted: 27 Mar 2020, 2:43:07 UTC

(Reposting here, since this is a 4.08 issue)

I'm finding that I get signal 11 issues with a couple of older AMD processors, an Athlon II X4 630 and a Phenom II X2 550 Black Edition (the latter running with two unlocked cores). Both of these systems are running on ASUS M4A785-M motherboards with 4 GB of ECC RAM.

It seems that Rosetta Mini works OK, but the full Rosetta consistently gets errors on tasks.

An example from Task 1133622372:
<core_client_version>7.9.3</core_client_version>
<![CDATA[
<message>
process got signal 11</message>
<stderr_txt>
command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting -parser:protocol jhr_boinc.xml @flags -in:file:silent 7hp5zr7e_jhr_design1_COVID-19.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip 7hp5zr7e_jhr_design1_COVID-19.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3696211
Starting watchdog...
Watchdog active.

</stderr_txt>
]]>
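For reference, signal 11 on Linux is SIGSEGV, a segmentation fault. Below is a minimal sketch for pulling the signal number out of a stderr report like the one above, assuming the report text is already held in a string. The function name `decode_signal` is made up for this illustration and is not part of BOINC or Rosetta.

```python
import re
import signal


def decode_signal(stderr_text):
    """Return (number, name) for a 'process got signal N' line, or None."""
    match = re.search(r"process got signal (\d+)", stderr_text)
    if match is None:
        return None
    number = int(match.group(1))
    try:
        # Map the number to its symbolic name on the current platform.
        name = signal.Signals(number).name
    except ValueError:
        name = "unknown"
    return number, name


# Excerpt copied from the task report above.
report = """<message>
process got signal 11</message>"""

print(decode_signal(report))
```

The same function could be run over several saved task reports to check whether every failure on a host is the same signal.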


Both of these CPUs are shown as "Family 16" in the CPU type listing.

In the meantime, I've shifted both of these systems over to World Community Grid, which is working as it should. On the other hand, my Ryzen 7/1700 is happily devouring Rosetta tasks, as is an old ThinkPad with an i7 L 640.
ID: 92374 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,632,350
RAC: 1,054
Message 92377 - Posted: 27 Mar 2020, 7:05:03 UTC
Last modified: 27 Mar 2020, 7:20:12 UTC

Just an observation. I was getting the download problem almost daily on my machines, but have not had one for 5 days now.

What I can see in my task list is a failure with insufficient memory. Both machines (4 cores, 8 threads) have 16 GB, with 90% use figures.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 92377 · Rating: 0
Falconet
Joined: 9 Mar 09
Posts: 350
Credit: 1,000,634
RAC: 0
Message 92425 - Posted: 27 Mar 2020, 22:58:29 UTC

"ERROR: Assertion `copy_pose.size() == native.size()` failed. MSG:the reference pose must be the same size as the working pose
ERROR:: Exit from: ..\..\..\src\protocols\protein_interface_design\filters\RmsdFilter.cc line: 323
22:47:47 (7828): called boinc_finish(0)"

https://boinc.bakerlab.org/rosetta/result.php?resultid=1134099093

Is this an error? The work unit is validated. Is the result usable? I see this across 2 Ryzen hosts.
ID: 92425 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 1960
Credit: 38,076,311
RAC: 6,958
Message 92430 - Posted: 28 Mar 2020, 0:07:46 UTC - in response to Message 92371.  

I just saw a 24.01 KB zip file being downloaded. 24.0 KB appeared to download at normal speed, then it was several seconds before it downloaded the last 0.01 KB.

In other words, the larger zip files aren't fully exempt from the problem; they just aren't affected severely enough to shut down Rosetta@Home new tasks.

Confirmed here too, many times
ID: 92430 · Rating: 0
Peti
Joined: 17 Mar 20
Posts: 5
Credit: 142,053
RAC: 0
Message 92434 - Posted: 28 Mar 2020, 1:58:04 UTC - in response to Message 92425.  

"ERROR: Assertion `copy_pose.size() == native.size()` failed. MSG:the reference pose must be the same size as the working pose
ERROR:: Exit from: ..\..\..\src\protocols\protein_interface_design\filters\RmsdFilter.cc line: 323
22:47:47 (7828): called boinc_finish(0)"

https://boinc.bakerlab.org/rosetta/result.php?resultid=1134099093

Is this an error? The work unit is validated. Is the result usable? I see this across 2 Ryzen hosts.

Hi,
I'd think it's an error if it says so. But it's inside the Rosetta software or data; some tasks are getting this very same error message on my PC, too.
For example, this one: https://boinc.bakerlab.org/rosetta/result.php?resultid=1134374428
ERROR: Assertion `copy_pose.size() == native.size()` failed. MSG:the reference pose must be the same size as the working pose
ERROR:: Exit from: src/protocols/protein_interface_design/filters/RmsdFilter.cc line: 323


And to note, my PC was not overclocked at that time, and I did not reboot the PC or stop BOINC in any way around that time.
So it must be the software...
ID: 92434 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1223
Credit: 13,806,125
RAC: 3,336
Message 92435 - Posted: 28 Mar 2020, 2:58:19 UTC - in response to Message 92425.  

"ERROR: Assertion `copy_pose.size() == native.size()` failed. MSG:the reference pose must be the same size as the working pose
ERROR:: Exit from: ..\..\..\src\protocols\protein_interface_design\filters\RmsdFilter.cc line: 323
22:47:47 (7828): called boinc_finish(0)"

https://boinc.bakerlab.org/rosetta/result.php?resultid=1134099093

Is this an error? The work unit is validated. Is the result usable? I see this across 2 Ryzen hosts.

Getting credit with an error reported depends on how the validator was written. It may have been written to accept task output as valid if two different computers report identical errors for the same workunit.
ID: 92435 · Rating: 0
Falconet
Joined: 9 Mar 09
Posts: 350
Credit: 1,000,634
RAC: 0
Message 92438 - Posted: 28 Mar 2020, 10:02:17 UTC

I did reboot the PC at least once while that WU ran. The error appears only in the Rosetta log; the server says the unit validated successfully.
And because the server validated the WU, no other copy was sent to another host.


These COVID-19 WUs are heavy on RAM, so I try not to use other programs that use lots of RAM. It would be a pity if they weren't even working properly on my PCs.
ID: 92438 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 92473 - Posted: 28 Mar 2020, 17:11:16 UTC
Last modified: 28 Mar 2020, 20:10:26 UTC

Work units are comprised of a number of models ("decoys"). Credit is issued by the number of completed models. A fast machine completes more models per hour than a slower machine, and is granted credit per completed model, so higher credit per hour.

The error report must relate to the last model that was attempted in the work unit. Any prior completed models still report in and get credit.
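As a rough illustration of the per-model accounting described above, here is a small sketch. All numbers are hypothetical, and the real credit formula is not given in this thread; the function names are invented for the example.

```python
def credited_models(models_attempted, last_model_failed):
    """Models that still count when only the last attempted model failed."""
    if last_model_failed:
        return max(models_attempted - 1, 0)
    return models_attempted


def credit_per_hour(models_per_hour, credit_per_model):
    """Credit rate when credit is granted per completed model."""
    return models_per_hour * credit_per_model


# Hypothetical work unit: 5 models attempted, the last one hit an error.
print(credited_models(5, last_model_failed=True))

# Hypothetical hosts: same credit per model, different speeds.
print(credit_per_hour(models_per_hour=4, credit_per_model=10.0))
print(credit_per_hour(models_per_hour=1, credit_per_model=10.0))
```

Under these assumptions the faster host earns four times the credit per hour, and the failed work unit still reports four completed models.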

Some number of failures is to be expected. Every model is a combination of things that no one has tried before. So as we collectively navigate the search space, some of the models can get lost.

Observation of failure is a part of the scientific process. Your machines report the failures back to the Project Team, and they can then be studied for details on why they fail and how to modify the program to work better in the future.

It is important that everyone realize that things BOINC calls "failures" are more aptly described as "learning experiences". Until your machine came across the combination of factors that caused it to fail, no one knew the program needed improvement. Keep 'em coming.
Rosetta Moderator: Mod.Sense
ID: 92473 · Rating: 0
Falconet
Joined: 9 Mar 09
Posts: 350
Credit: 1,000,634
RAC: 0
Message 92476 - Posted: 28 Mar 2020, 17:37:15 UTC - in response to Message 92473.  
Last modified: 28 Mar 2020, 17:41:47 UTC

Thanks for the reply, Mod.Sense.

A quick forum search shows this "ERROR: Assertion" issue is new, with the first report a mere 5 days ago. And IIRC from looking at the work units, it is specific to the Rosetta 4.07 COVID-19 work units. I haven't seen it with Rosetta Mini.
ID: 92476 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1223
Credit: 13,806,125
RAC: 3,336
Message 92696 - Posted: 31 Mar 2020, 1:48:28 UTC

Some of you may want to extend each task to get more work done during the current shortage of tasks.

If so, try this:

If you are using the Simple view, click on View near the top line, then Advanced view....

Click on Projects, then Rosetta@home, then Your account.

Under Preferences, click on Rosetta@home preferences.

In each preferences section, click on Edit preferences.

Click on the V for Target CPU run time.

Click on a value just above your current setting. Increasing this value too fast causes problems.

Click on Update preferences.

Click on the X at the top right corner of the Rosetta@home preferences window to shut it down.

Click on Projects, then Rosetta@home, then Update.

If you want to go back to the Simple view, click on View, then Simple view....

You might repeat the above every few days until you reach the maximum value for Target CPU run time.
ID: 92696 · Rating: 0
[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 38
Credit: 1,131,592
RAC: 2,897
Message 94267 - Posted: 12 Apr 2020, 19:20:18 UTC
Last modified: 12 Apr 2020, 19:20:40 UTC

Hi

I have minirosetta tasks running on a Linux machine, like this one, where I realized it had been running for a long time with almost no CPU used.

In the slot file I found some errors so I decided to cancel them

No heartbeat from core client for 30 sec - exiting
FILE_LOCK::unlock(): close failed.: Bad file descriptor
*** glibc detected *** ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu: double free or corruption (!prev): 0x0fdefa10 ***

FILE_LOCK::unlock(): close failed.: Bad file descriptor
SIGSEGV: segmentation violation
*** glibc detected *** ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu: free(): corrupted unsorted chunks: 0x101dabd0 ***
*** glibc detected *** ../../projects/boinc.bakerlab.org_rosetta/minirosetta_3.78_x86_64-pc-linux-gnu: corrupted double-linked list: 0x101dadd8 ***

It seems that I have others going the same way: the CPU time is almost nil despite a substantial run time...

I had only 2 Rosetta Mini tasks that succeeded.

My rosetta tasks seem to be all OK.

What should I do? Completely stop Rosetta Mini from being sent to that machine?

Thanks
ID: 94267 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94272 - Posted: 12 Apr 2020, 20:23:39 UTC - in response to Message 94267.  

24 processors, and 8GB of RAM is not going to work very well for Rosetta@home. You probably have more than half of the tasks in a "waiting for memory" status.
Rosetta Moderator: Mod.Sense
ID: 94272 · Rating: 0
[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 38
Credit: 1,131,592
RAC: 2,897
Message 94352 - Posted: 13 Apr 2020, 15:40:23 UTC - in response to Message 94272.  

I have limited to 4 rosetta + 4 mini using an app_config. The rest is running TN-Grid.
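For readers who have not used one, a per-application limit of this kind goes in an app_config.xml file in the project directory. The sketch below generates such a file from Python. The app names `rosetta` and `minirosetta` are assumptions based on the executable names seen in this thread, so check the names your client actually reports before relying on them.

```python
import xml.etree.ElementTree as ET


def build_app_config(limits):
    """Build app_config.xml text from a {app_name: max_concurrent} mapping."""
    root = ET.Element("app_config")
    for app_name, max_concurrent in limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    ET.indent(root)  # pretty-print; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


# Assumed app names; 4 concurrent tasks each, as described in the post.
config_xml = build_app_config({"rosetta": 4, "minirosetta": 4})
print(config_xml)
```

The printed text would be saved as app_config.xml in the Rosetta project directory, after which the client has to re-read its config files (or be restarted) for the limits to apply.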

The system is currently only using 4 GB out of 8, so plenty of RAM left.

The mini tasks keep having the same issue; I have some that have been running for over 30 hours with no CPU use and no completion.

I have limited mini to 1 task and rosetta to 6 now. I currently have trouble accessing the machine except from a Linux ssh command line, I have loads of mini tasks waiting, and I don't know how to bulk cancel them all...
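One possible way to bulk-abort from an ssh session is to script the `boinccmd` tool, which has a `--get_tasks` listing and a `--task URL task_name abort` operation. The sketch below only parses a listing and builds the commands; it does not run anything. The sample listing, the task names, and the idea of matching mini tasks by a substring of the task name are all assumptions for illustration, so inspect your own `--get_tasks` output before using anything like this.

```python
def parse_tasks(get_tasks_output):
    """Extract (project_url, task_name) pairs from `boinccmd --get_tasks` style text."""
    tasks = []
    name = None
    for raw in get_tasks_output.splitlines():
        line = raw.strip()
        if line.startswith("name: "):
            name = line[len("name: "):]
        elif line.startswith("project URL: ") and name is not None:
            tasks.append((line[len("project URL: "):], name))
            name = None
    return tasks


def abort_commands(tasks, name_filter):
    """Build one `boinccmd --task URL NAME abort` command per matching task."""
    return [
        ["boinccmd", "--task", url, name, "abort"]
        for url, name in tasks
        if name_filter(name)
    ]


# Hypothetical listing in the assumed output layout.
SAMPLE_OUTPUT = """\
======== Tasks ========
1) -----------
   name: example_mini_task_0
   WU name: example_mini_task
   project URL: https://boinc.bakerlab.org/rosetta/
2) -----------
   name: example_full_task_0
   WU name: example_full_task
   project URL: https://boinc.bakerlab.org/rosetta/
"""

commands = abort_commands(parse_tasks(SAMPLE_OUTPUT), lambda n: "mini" in n)
for cmd in commands:
    print(" ".join(cmd))
```

To act on a real host, the listing would come from running `boinccmd --get_tasks`, and each printed command could be executed with `subprocess.run(cmd, check=True)` once the list has been reviewed.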
ID: 94352 · Rating: 0
[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 38
Credit: 1,131,592
RAC: 2,897
Message 94360 - Posted: 13 Apr 2020, 17:39:43 UTC - in response to Message 94352.  
Last modified: 13 Apr 2020, 17:40:10 UTC

I have now managed to access that BOINC client using BoincTasks from another machine, and I aborted all pending mini tasks (BT is great for managing many tasks at once).

I'll let it run for some days to see how it goes; for the moment I think there are enough rosetta (normal) tasks for some time.
ID: 94360 · Rating: 0
adrianxw
Joined: 18 Sep 05
Posts: 650
Credit: 11,632,350
RAC: 1,054
Message 94361 - Posted: 13 Apr 2020, 17:49:58 UTC

The post a couple back about memory, I fully concur. When I build a system, I always try to have at least 2GB per thread.
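To put numbers on that rule of thumb, here is a small worked sketch using the figures mentioned in this thread (24 processors with 8 GB, and 8 threads with 16 GB). The 2 GB per thread value is the rule of thumb from this post, not an official Rosetta@home requirement, and the function names are invented for the example.

```python
def ram_per_thread_gb(total_ram_gb, threads):
    """Average RAM available per thread if every thread runs a task."""
    return total_ram_gb / threads


def ram_needed_gb(threads, gb_per_thread=2.0):
    """RAM needed to give every thread the rule-of-thumb amount."""
    return threads * gb_per_thread


def tasks_that_fit(total_ram_gb, gb_per_task=2.0):
    """How many tasks fit at once under the rule-of-thumb amount."""
    return int(total_ram_gb // gb_per_task)


# 24 processors with 8 GB, as in the server discussed above.
print(round(ram_per_thread_gb(8, 24), 2))
print(ram_needed_gb(24))
print(tasks_that_fit(8))

# 8 threads with 16 GB, as in my own machines.
print(ram_per_thread_gb(16, 8))
```

By this estimate the 24-processor server has about a third of a gigabyte per thread and would need 48 GB to meet the rule on every thread.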
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 94361 · Rating: 0
[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 38
Credit: 1,131,592
RAC: 2,897
Message 94449 - Posted: 14 Apr 2020, 15:31:24 UTC - in response to Message 94361.  

It is a dedicated server hosted by a foreign provider, cheap and not recent: I cannot upgrade memory or anything.

All rosetta tasks are running fine (biggest use of RAM) and finishing successfully, even with 6 concurrent tasks running, and all mini tasks are ending in error (except one), even when limited to 1 at a time. So it cannot be a lack of RAM (rosetta uses more than mini), and I double-checked that I still have a fair amount of free RAM at any given time.

I realize I cannot select applications to exclude mini in the Rosetta preferences! (unlike all other BOINC projects)

And in app_config I cannot set the max number to 0 because it is ignored; I have to set it to 1 for BOINC to apply the limit... Is there any other way to completely exclude mini and avoid wasting processing time?
ID: 94449 · Rating: 0
Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 94454 - Posted: 14 Apr 2020, 15:54:29 UTC - in response to Message 94449.  

Being able to see mini tasks that have failed without being aborted would be the best way to see why they aren't working on that machine. Please post with links to the host and problem WUs.
Rosetta Moderator: Mod.Sense
ID: 94454 · Rating: 0
[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 38
Credit: 1,131,592
RAC: 2,897
Message 94462 - Posted: 14 Apr 2020, 17:30:13 UTC - in response to Message 94454.  
Last modified: 14 Apr 2020, 17:31:48 UTC

I had posted these details in my message above (April 12), but it seems the tasks have now been purged from the website; my first link no longer shows the example task I had given (how long do they remain visible? This was only 2 days ago).

But I posted examples of the error messages I could find in the slot directory at that time in that same message above.
ID: 94462 · Rating: 0
[AF>Le_Pommier] Jerome_C2005
Joined: 22 Aug 06
Posts: 38
Credit: 1,131,592
RAC: 2,897
Message 94640 - Posted: 16 Apr 2020, 22:52:15 UTC - in response to Message 94462.  

I realize the last mini task I had has been stuck for 3 days without using any CPU (at least no progress is being made on the task); none of the files in the slot have been updated for 3 days.

I don't know how to extract the err file from the Linux-hosted machine, so I made screenshots, because I'm going to abort this task and, as we said, Rosetta doesn't keep the task log on the server after one or two days.





I was given a solution to exclude all mini on that machine by using an app_info config file (re-describe all rosetta apps, and no mini app).

The rosetta tasks continue to run normally on that machine...
ID: 94640 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1223
Credit: 13,806,125
RAC: 3,336
Message 94651 - Posted: 17 Apr 2020, 3:17:17 UTC - in response to Message 94640.  
Last modified: 17 Apr 2020, 3:18:54 UTC

[snip]

I realize the last mini task I had has been stuck for 3 days without using any CPU (at least no progress is being made on the task); none of the files in the slot have been updated for 3 days.

I don't know how to extract the err file from the Linux-hosted machine, so I made screenshots, because I'm going to abort this task and, as we said, Rosetta doesn't keep the task log on the server after one or two days.


Upgrading to BOINC 7.16.5 makes that error much less likely.
ID: 94651 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org