Posts by Mugurel

1) Message boards : Number crunching : too many instances of rosetta (Message 69943)
Posted 4 Apr 2011 by Mugurel
Post:
Well, it looks like your failing tasks show a "process got signal 11" message. I had assumed these were tasks you may have killed as you were studying the situation... but your latest comments indicate this is happening on its own.

The BOINC FAQ indicates this could be faulty memory or page file... but it can ALSO be an issue when running 32-bit applications on a 64-bit machine. Rosetta currently just wraps the 32-bit app for the executable sent to 64-bit machines, so that description would seem to fit.

The FAQ indicates that installing ia32 will help the situation. I am not familiar with ia32. Do you know if you have it already? Would you be ok with installing it on a machine to see if it helps?
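
As a side note on the "signal 11" message: on Linux, signal numbers can be mapped to names with Python's standard `signal` module. This is only a minimal sketch; the helper name `describe_signal` is made up for illustration, and the numbering is platform dependent (11 is SIGSEGV, a segmentation fault, on Linux).

```python
import signal


def describe_signal(number):
    """Return the symbolic name for a signal number, or 'unknown'."""
    try:
        # signal.Signals is an enum of the signals known on this platform.
        return signal.Signals(number).name
    except ValueError:
        # Raised when the number is not a valid signal on this platform.
        return "unknown"


if __name__ == "__main__":
    print(describe_signal(11))
```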



I have ia32-libs installed, so that should be OK, I think. I tried to run ldd on the executable, but it is a static executable:
beo-39:~/.boinc/boinc.beo-39/projects/boinc.bakerlab.org_rosetta$ ldd minirosetta_2.17_x86_64-pc-linux-gnu 
        not a dynamic executable


I did kill a few of the running processes to get BOINC going, but that did not help. The rest of the time the processes were killed on their own. If I keep "top" open, I see that they start, run for 3 minutes or less, and then are killed (and show up as zombies). Then the WUs get corrupted, BOINC starts other instances of Rosetta, and it all happens again, until BOINC decides to start fewer instances of Rosetta together with some other projects. That is usually after about 300-500 WUs are wasted. Then it is fine.

As for memory, the settings allow at most 90% of system memory when idle and at most 75% when in use. As I said earlier, none of the 48-core systems comes close to using 50% of its RAM. No swap is used at all.

For the moment I have made sure Rosetta doesn't use 100% of the CPUs by running many other projects in parallel. This does not solve the problem; it just avoids it... :-(

Ionel
2) Message boards : Number crunching : too many instances of rosetta (Message 69921)
Posted 31 Mar 2011 by Mugurel
Post:
Rosetta uses more memory per active WU than most other BOINC projects. But it looks like you've got about 2.5GB per core on this machine, which should be plenty. Is that the machine you are talking about?

Oh, now I see you've got SEVERAL machines with 48 cores, in a variety of configurations... I'm guessing the machine you are seeing delayed responsiveness on has less memory. Basically, when you get all 48 cores fired up, the machine is probably doing considerable page swapping, which slows it down.

My prior comments and suggestions on helping a machine run well with less memory should pertain to your situation as well.



Yes, there are a few machines with 48 cores. None of them uses all of its RAM; less than half is used. No swapping at all. All applications are kept in memory.

The problem happens when BOINC decides to switch from whatever it is running to Rosetta. If more than 8 instances of Rosetta start at the same time, they get killed or hang after running for about 3 minutes. If some other instances were active, they also get killed, and the WUs are no longer usable.

Again, this doesn't happen on machines with 8 or fewer cores, only on those with 16 or more (like 48).

Ionel
3) Message boards : Number crunching : too many instances of rosetta (Message 69914)
Posted 29 Mar 2011 by Mugurel
Post:
Hi,

Does anyone else have difficulties with Rosetta when too many instances run at the same time?

I have problems on 48-core Linux 64-bit platforms. The same setup runs fine on 4- and 8-core systems. With more than 8 cores I get into trouble...

The nodes become less responsive, and many instances of Rosetta are started and then stopped, other ones get started, and so on. There is nothing useful in the logs.

29-Mar-2011 18:15:20 [rosetta@home] Restarting task mem_widd_run03_centroid_A_2kdc_SAVE_ALL_OUT_IGNORE_THE_REST_22158_915157_0 using minirosetta version 217
29-Mar-2011 18:15:20 [rosetta@home] Task mem_widd_run03_centroid_A_1zll_SAVE_ALL_OUT_IGNORE_THE_REST_22158_914948_0 exited with zero status but no 'finished' file
29-Mar-2011 18:15:20 [rosetta@home] If this happens repeatedly you may need to reset the project.
29-Mar-2011 18:15:23 [rosetta@home] Restarting task mem_widd_run03_centroid_A_1zll_SAVE_ALL_OUT_IGNORE_THE_REST_22158_914948_0 using minirosetta version 217
29-Mar-2011 18:15:23 [rosetta@home] Task mem_widd_run03_centroid_A_2hac_SAVE_ALL_OUT_IGNORE_THE_REST_22158_603074_0 exited with zero status but no 'finished' file
29-Mar-2011 18:15:23 [rosetta@home] If this happens repeatedly you may need to reset the project.
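
To get a feel for how many tasks are affected, the client log can be scanned for the "no 'finished' file" message. This is a rough sketch that only assumes the log format shown in the lines above; the function name `count_unfinished_exits` and the `SAMPLE` list are made up for illustration.

```python
import re
from collections import Counter

# Pattern for the client message quoted above; the task name is captured.
NO_FINISHED = re.compile(
    r"\[rosetta@home\] Task (\S+) exited with zero status but no 'finished' file"
)


def count_unfinished_exits(lines):
    """Count 'exited with zero status but no finished file' events per task."""
    counts = Counter()
    for line in lines:
        match = NO_FINISHED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three of the log lines from above, used as sample input.
SAMPLE = [
    "29-Mar-2011 18:15:20 [rosetta@home] Task mem_widd_run03_centroid_A_1zll_SAVE_ALL_OUT_IGNORE_THE_REST_22158_914948_0 exited with zero status but no 'finished' file",
    "29-Mar-2011 18:15:20 [rosetta@home] If this happens repeatedly you may need to reset the project.",
    "29-Mar-2011 18:15:23 [rosetta@home] Task mem_widd_run03_centroid_A_2hac_SAVE_ALL_OUT_IGNORE_THE_REST_22158_603074_0 exited with zero status but no 'finished' file",
]

if __name__ == "__main__":
    for task, n in count_unfinished_exits(SAMPLE).items():
        print(n, task)
```

For a real log, the lines of the client's log file would be passed in place of `SAMPLE`.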


A reset would not help. The BOINC Manager is not really able to get in touch with the BOINC client; it gets stuck on the small window: Please wait, communicating with the client.

It is only fine when Rosetta runs just a few instances and the rest of the CPUs are occupied by other processes.

I have no similar issues with any other project (I am attached to almost 40 projects).

Does anyone have any clue about this?

Thank you.

Ionel






©2024 University of Washington
https://www.bakerlab.org