Problems and Technical Issues with Rosetta@home

Bryn Mawr
Joined: 26 Dec 18
Posts: 325
Credit: 9,229,618
RAC: 521
Message 90521 - Posted: 15 Mar 2019, 16:34:42 UTC - in response to Message 90520.  

Ok, thanks. It was worth asking :-)
ID: 90521 · Rating: 0

rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 19,170,269
RAC: 430
Message 90522 - Posted: 15 Mar 2019, 18:08:00 UTC - in response to Message 90517.  
Last modified: 15 Mar 2019, 18:28:22 UTC

OK, extra memory ordered for both machines so we’ll see if that sorts it.


It seems like Rosetta gets into a state where it consumes 1 GB+ per WU. I am running 35 WUs and there are always a couple taking over 1 GB.

I watch the difference between CPU and RUN times and swap used. As long as the swap used is very low, you are probably not running into memory problems. I tend to buy more GB of memory than threads. I originally got my 36 thread machine with 32GB and that was not enough. You can see that 19gb of my swap space has been used even though the machine has 64gb installed for the 36 threads. 19gb swap space used is concerning.

Based on over a thousand jobs each, the credit difference between the 64-bit Rosetta WUs and the 32-bit Minirosetta WUs is negligible: 44.0 credits/CPU hr for Rosetta 4.08 and 45.7 credits/CPU hr for Minirosetta.
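
As a quick check on "negligible", the two rates can be compared directly. This is a small sketch that assumes the second figure (45.7) is the Minirosetta rate; the function name is illustrative only.

```python
def relative_difference(a: float, b: float) -> float:
    """Return (b - a) / a expressed as a percentage."""
    return (b - a) / a * 100.0


rosetta_rate = 44.0      # credits per CPU hour, Rosetta 4.08 (from the post)
minirosetta_rate = 45.7  # credits per CPU hour, assumed to be Minirosetta

diff_pct = relative_difference(rosetta_rate, minirosetta_rate)
print(f"Difference: {diff_pct:.1f}% per CPU hour")
```

On these figures the gap works out to a little under 4%.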

top output (i and c toggles), sorted by memory use:

top - 10:55:55 up 1 day, 18:24, 0 users, load average: 40.22, 36.72, 36.27
Tasks: 524 total, 37 running, 487 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.4 us, 1.4 sy, 96.6 ni, 1.1 id, 0.0 wa, 0.4 hi, 0.1 si, 0.0 st
MiB Mem : 64090.7 total, 1051.2 free, 16283.5 used, 46756.0 buff/cache
MiB Swap: 32112.0 total, 32093.0 free, 19.0 used. 45874.0 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
24590 boinc 39 19 1722808 1.5g 75400 R 98.3 2.5 219:01.25 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_15_1798_1948__t000__1_C1+
25349 boinc 39 19 1384300 1.2g 75400 R 99.3 1.9 198:57.60 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_15_1798_1948__t000__1_C1+
22988 boinc 39 19 838204 723668 75400 R 97.7 1.1 259:32.35 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_14_1536_1929__t000__0_C1+
24878 boinc 39 19 706928 592640 75784 R 99.3 0.9 211:30.53 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_15_1805_1950__t000__0_C1+
15222 boinc 39 19 605140 491200 76104 R 99.0 0.7 459:54.12 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_15_1674_1946__t000__0_C1+
20625 boinc 39 19 605492 491108 75400 R 99.3 0.7 319:46.33 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_15_1808_1947__t000__0_C1+
16439 boinc 39 19 583112 468876 75784 R 97.4 0.7 428:23.20 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -run:protocol jd2_scripting @flags_rb_03_15_1674_1946__t000__0_C1+
24082 boinc 39 19 583664 465920 68044 R 99.3 0.7 231:28.63 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu @flags_rb_03_15_1805_1950__t000__ab_robetta -in:file:boinc_wu_zip+
17334 boinc 39 19 575680 457680 68620 R 99.3 0.7 404:59.21 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
22280 boinc 39 19 543464 425512 68556 R 99.7 0.6 276:44.21 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
19901 boinc 39 19 533536 415428 68556 R 99.7 0.6 338:55.09 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
22209 boinc 39 19 530860 413260 68236 R 99.3 0.6 278:15.90 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu @flags_rb_03_15_1808_1947__t000__ab_robetta -in:file:boinc_wu_zip+
25711 boinc 39 19 523612 408668 70668 R 99.3 0.6 190:02.19 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu @foldit_2007571_0001_fold_and_dock_flags -silent_gz -mute all -ou+
21481 boinc 39 19 521072 406132 70604 R 99.3 0.6 297:12.91 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu @foldit_2007571_0005_fold_and_dock_flags -silent_gz -mute all -ou+
17873 boinc 39 19 516024 398184 68620 R 99.3 0.6 391:55.17 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
15374 boinc 39 19 511956 394116 68556 R 99.3 0.6 455:50.04 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
30825 boinc 39 19 509260 391232 68620 R 99.3 0.6 78:42.82 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
14998 boinc 39 19 508228 390160 68620 R 98.0 0.6 465:27.01 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
18209 boinc 39 19 503324 385500 68620 R 99.0 0.6 383:28.39 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
31538 boinc 39 19 500516 382744 68620 R 99.3 0.6 60:22.53 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.08_x86_64-pc-linux-gnu -abinitio::fastrelax 1 -ex2aro 1 -frag3 00001.200.3mers -in:file:+
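
To put numbers on the listing above, the RES column can be totalled. The sketch below assumes top's default scaling, where a bare number is KiB and a trailing m or g means MiB or GiB. The sample values are the first five RES entries copied from the listing, and the helper name is made up for illustration.

```python
def res_to_mib(field: str) -> float:
    """Convert a top RES field to MiB (bare number = KiB, m = MiB, g = GiB)."""
    suffix = field[-1].lower()
    if suffix == "g":
        return float(field[:-1]) * 1024.0
    if suffix == "m":
        return float(field[:-1])
    return float(field) / 1024.0


# First five RES values from the listing above.
res_fields = ["1.5g", "1.2g", "723668", "592640", "491200"]

total_mib = sum(res_to_mib(f) for f in res_fields)
over_1gib = [f for f in res_fields if res_to_mib(f) > 1024.0]

print(f"Total resident memory: {total_mib:.0f} MiB")
print(f"Tasks over 1 GiB: {len(over_1gib)}")
```

Feeding in the whole RES column the same way gives the total resident memory of all running tasks, which can then be compared against installed RAM.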
ID: 90522 · Rating: 0

Juha
Joined: 28 Mar 16
Posts: 13
Credit: 705,034
RAC: 0
Message 90523 - Posted: 15 Mar 2019, 18:29:56 UTC - in response to Message 90522.  

19gb swap space used is concerning.


19 GB would indeed be a lot of swap in use but haven't you got the unit wrong? It looks like 19 MB to me.
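
For reference, the unit can be read straight off the summary line, which top labels in MiB. Below is a minimal sketch that pulls the "used" figure out of that line and converts it, assuming the line format shown in message 90522; the regular expression and function name are illustrative, not part of any tool.

```python
import re


def swap_used_mib(swap_line: str) -> float:
    """Extract the 'used' value from a top 'MiB Swap:' summary line."""
    match = re.search(r"MiB Swap:.*?([\d.]+)\s+used", swap_line)
    if match is None:
        raise ValueError("not a 'MiB Swap' line")
    return float(match.group(1))


line = "MiB Swap: 32112.0 total, 32093.0 free, 19.0 used. 45874.0 avail Mem"
used_mib = swap_used_mib(line)
used_gib = used_mib / 1024.0

print(f"Swap used: {used_mib} MiB = {used_gib:.3f} GiB")
```

So the machine had about 0.02 GiB of swap in use, not 19 GiB.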
ID: 90523 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 1864
Credit: 34,376,924
RAC: 7,599
Message 90532 - Posted: 18 Mar 2019, 12:38:20 UTC - in response to Message 90512.  

You may need more memory. You have 8 GB on your Ryzen, but the Rosetta work units sometimes take up to 1 GB each.

Ouch, I know my free memory sometimes goes down to 2 or 3% but I hadn’t thought of it going negative.

Thanks for the suggestion, I’ll look at getting another 8gb and maybe some more for the FX rig as well, that only has 4gb for the 4 cores.

Hmm, that raises a thought. They’re both running half and half between Rosetta and WCG which, I think, has a lower memory requirement?

Sorry to be a bit late on this, but I did notice that around 13th March I had a task consuming 2.4 GB, with 14 GB of my 16 GB (total) RAM in use to run 8 tasks.

I can't recall the tasks involved. Right now I'm back to my more usual level of 7.74 GB in use.
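
The figures in this exchange (up to about 1 GB per work unit, with an occasional 2.4 GB outlier) lend themselves to a quick capacity estimate. The sketch below is only a rule of thumb under assumed numbers: the 2 GB held back for the operating system is a guess, and the function name is hypothetical.

```python
def tasks_that_fit(total_gb: float, per_task_gb: float, reserve_gb: float = 2.0) -> int:
    """Return how many tasks fit in RAM after holding back a reserve."""
    usable = total_gb - reserve_gb
    if usable <= 0 or per_task_gb <= 0:
        return 0
    return int(usable // per_task_gb)


# 8 GB machine, assuming every task peaks at 1 GB
fit_8gb = tasks_that_fit(8, 1.0)
# 16 GB machine, same assumption
fit_16gb = tasks_that_fit(16, 1.0)
# 16 GB machine if every task hit the 2.4 GB outlier
fit_16gb_worst = tasks_that_fit(16, 2.4)

print(fit_8gb, fit_16gb, fit_16gb_worst)
```

The worst-case line is deliberately pessimistic, since only one task in the post above reached 2.4 GB.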
ID: 90532 · Rating: 0

rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 19,170,269
RAC: 430
Message 90533 - Posted: 18 Mar 2019, 16:14:06 UTC - in response to Message 90523.  

19gb swap space used is concerning.


19 GB would indeed be a lot of swap in use but haven't you got the unit wrong? It looks like 19 MB to me.


DOH! You are obviously correct. I got units of GB dancing in my head.
ID: 90533 · Rating: 0

Bryn Mawr
Joined: 26 Dec 18
Posts: 325
Credit: 9,229,618
RAC: 521
Message 90535 - Posted: 19 Mar 2019, 10:40:22 UTC - in response to Message 90532.  


Sorry to be a bit late on this, but I did notice that around 13th March I had a task consuming 2.4 GB, with 14 GB of my 16 GB (total) RAM in use to run 8 tasks.

I can't recall the tasks involved. Right now I'm back to my more usual level of 7.74 GB in use.


As of this AM I have 16gb on the ryzen and it's currently showing 81% free memory but that's with no Rosetta as no WUs have come down since early yesterday.

I'll monitor going forward and report back.
ID: 90535 · Rating: 0

Link
Joined: 4 May 07
Posts: 303
Credit: 379,182
RAC: 0
Message 90536 - Posted: 19 Mar 2019, 11:41:37 UTC - in response to Message 90535.  
Last modified: 19 Mar 2019, 11:45:37 UTC

As of this AM I have 16gb on the ryzen and it's currently showing 81% free memory but that's with no Rosetta as no WUs have come down since early yesterday.

Yes, we are back to 8086 tasks ready to send according to the server status page, which actually means 0 tasks ready to send. Maybe the admins should investigate what those 8086 tasks are and whether they might be causing the issues.
ID: 90536 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 1864
Credit: 34,376,924
RAC: 7,599
Message 90539 - Posted: 19 Mar 2019, 23:53:13 UTC - in response to Message 90535.  

Sorry to be a bit late on this, but I did notice that around 13th March I had a task consuming 2.4 GB, with 14 GB of my 16 GB (total) RAM in use to run 8 tasks.

I can't recall the tasks involved. Right now I'm back to my more usual level of 7.74 GB in use.

As of this AM I have 16gb on the ryzen and it's currently showing 81% free memory but that's with no Rosetta as no WUs have come down since early yesterday.

I'll monitor going forward and report back.

So you've got your extra RAM installed already? If it was a RAM issue (with 8 GB), you'll be fine now.

I was only indicating there were some rogue tasks around last week that may have tripped you up back then. Hopefully new tasks play nicer as standard.

Your original question was to ask if there was anything you could do - there probably wasn't at that time and you've more than covered yourself now under normal conditions.
ID: 90539 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 1864
Credit: 34,376,924
RAC: 7,599
Message 90540 - Posted: 19 Mar 2019, 23:57:12 UTC - in response to Message 90536.  

As of this AM I have 16gb on the ryzen and it's currently showing 81% free memory but that's with no Rosetta as no WUs have come down since early yesterday.

Yes, we are back to 8086 tasks ready to send according to the server status page, which actually means 0 tasks ready to send. Maybe the admins should investigate what those 8086 tasks are and whether they might be causing the issues.

Up to 10 minutes ago it was still showing those 8086 so that doesn't sound right.

However, I'm here to say a whole load of tasks just came down and the server status page has just changed to show an additional 20k Rosetta tasks in progress and 15k still unsent. No idea how long that will last, but there is some progress.
ID: 90540 · Rating: 0

rjs5
Joined: 22 Nov 10
Posts: 272
Credit: 19,170,269
RAC: 430
Message 90542 - Posted: 21 Mar 2019, 4:47:11 UTC
Last modified: 21 Mar 2019, 4:47:43 UTC

I was watching when a couple of the Rosetta WUs failed.

They computed properly until the TIME REMAINING was zero seconds and the compute time was 8 hours and a few minutes. Instead of reporting the completion, the WU was marked as WAITING with zero seconds remaining. When the WU restarted, it indicated a COMPUTE ERROR with the message "finish file present too long". The 34 failing WUs all seemed to fail at the end and were 4.08 Linux WUs.

https://boinc.bakerlab.org/rosetta/result.php?resultid=1063704662
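
The error text suggests that the client found a completion marker file but the task process had not exited within some grace period. The following is a hypothetical sketch of that kind of check, not BOINC's actual code: the marker file name, the 10-second timeout, and the function name are all assumptions made for illustration.

```python
import os
import tempfile
import time
from typing import Optional


def finish_marker_overdue(marker_path: str, timeout_s: float,
                          now: Optional[float] = None) -> bool:
    """Return True if the marker file exists and is older than timeout_s."""
    if not os.path.exists(marker_path):
        return False  # the task has not signalled completion yet
    if now is None:
        now = time.time()
    age = now - os.path.getmtime(marker_path)
    return age > timeout_s


# Demonstration with a temporary stand-in for the marker file.
with tempfile.TemporaryDirectory() as workdir:
    marker = os.path.join(workdir, "finish_marker")  # hypothetical file name
    open(marker, "w").close()
    written_at = os.path.getmtime(marker)
    # One second after the marker was written: still within the grace period.
    fresh = finish_marker_overdue(marker, 10.0, now=written_at + 1)
    # One minute after: the marker has been present too long.
    stale = finish_marker_overdue(marker, 10.0, now=written_at + 60)

print(fresh, stale)
```

If the real check works along these lines, a task that sits in WAITING after writing its marker would trip the timeout on restart, which would match the symptoms described above.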
ID: 90542 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 1864
Credit: 34,376,924
RAC: 7,599
Message 90544 - Posted: 21 Mar 2019, 8:18:56 UTC

We had a good run, but no tasks left to download (and that mysterious 8086 ready to send again, whatever that is)
ID: 90544 · Rating: 0

Bryn Mawr
Joined: 26 Dec 18
Posts: 325
Credit: 9,229,618
RAC: 521
Message 90546 - Posted: 21 Mar 2019, 14:26:12 UTC - in response to Message 90542.  

I was watching when a couple of the Rosetta WUs failed.

They computed properly until the TIME REMAINING was zero seconds and the compute time was 8 hours and a few minutes. Instead of reporting the completion, the WU was marked as WAITING with zero seconds remaining. When the WU restarted, it indicated a COMPUTE ERROR with the message "finish file present too long". The 34 failing WUs all seemed to fail at the end and were 4.08 Linux WUs.

https://boinc.bakerlab.org/rosetta/result.php?resultid=1063704662


That sounds very similar to mine.

I did notice that a few of mine showed n decoys and then appeared to restart and showed a session with 1 decoy before failing.
ID: 90546 · Rating: 0

bcavnaugh
Joined: 7 Dec 13
Posts: 7
Credit: 2,389,640
RAC: 0
Message 90547 - Posted: 21 Mar 2019, 17:54:38 UTC
Last modified: 21 Mar 2019, 17:54:52 UTC

Not getting any Tasks on this Host https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3112116
A T630 Server but my other T630 is getting them fine https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3282035
Both running Server 2012 R2
ID: 90547 · Rating: 0

bcavnaugh
Joined: 7 Dec 13
Posts: 7
Credit: 2,389,640
RAC: 0
Message 90548 - Posted: 21 Mar 2019, 19:01:45 UTC - in response to Message 90547.  

Looks OK now https://boinc.bakerlab.org/rosetta/results.php?hostid=3112116

ID: 90548 · Rating: 0

Bryn Mawr
Joined: 26 Dec 18
Posts: 325
Credit: 9,229,618
RAC: 521
Message 90551 - Posted: 22 Mar 2019, 18:44:01 UTC - in response to Message 90548.  

Not getting any Tasks on this Host https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3112116
A T630 Server but my other T630 is getting them fine https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3282035
Both running Server 2012 R2

Looks OK now https://boinc.bakerlab.org/rosetta/results.php?hostid=3112116


I suspect those were the last splutterings as the pool was draining; project status is showing 0 tasks unsent (but, as has been said, 8086 tasks ready to send).
ID: 90551 · Rating: 0

Sid Celery
Joined: 11 Feb 08
Posts: 1864
Credit: 34,376,924
RAC: 7,599
Message 90560 - Posted: 23 Mar 2019, 1:52:23 UTC - in response to Message 90551.  

Not getting any Tasks on this Host https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3112116
A T630 Server but my other T630 is getting them fine https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=3282035
Both running Server 2012 R2

Looks OK now https://boinc.bakerlab.org/rosetta/results.php?hostid=3112116


I suspect those were the last splutterings as the pool was draining; project status is showing 0 tasks unsent (but, as has been said, 8086 tasks ready to send).

Maybe they're tasks for pre-80386 machines? ...
ID: 90560 · Rating: 0

Bryn Mawr
Joined: 26 Dec 18
Posts: 325
Credit: 9,229,618
RAC: 521
Message 90567 - Posted: 23 Mar 2019, 19:53:39 UTC

Despite having a 6 hour limit set I am currently processing a batch of Rosetta 4.08 WUs that have been running for 8 hours and are showing an estimated 2 hours remaining.

They all have names starting :-

rb_03_21_2022_2162_ab_t000__robetta_cstwt_5.0_FT

Is this normal or are they likely to error out?
ID: 90567 · Rating: 0

Admin
Project administrator
Joined: 1 Jul 05
Posts: 4805
Credit: 0
RAC: 0
Message 90569 - Posted: 24 Mar 2019, 0:43:53 UTC - in response to Message 90567.  

This seems odd but I would continue to let it run since it is a relatively large protein to model.
ID: 90569 · Rating: 0

Link
Joined: 4 May 07
Posts: 303
Credit: 379,182
RAC: 0
Message 90574 - Posted: 24 Mar 2019, 9:17:59 UTC - in response to Message 90569.  

This seems odd but I would continue to let it run since it is a relatively large protein to model.

Besides that, the limit is CPU-hours, so depending on what else the CPU has to do, the runtime can be a lot longer.
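
The distinction between CPU time and elapsed time can be seen in a few lines. This sketch uses Python's standard time module; the sleep stands in for periods when the task is not running on a core, and the loop count and durations are arbitrary.

```python
import time
from typing import Tuple


def measure(busy_loops: int, idle_s: float) -> Tuple[float, float]:
    """Return (cpu_seconds, elapsed_seconds) for some work plus idle time."""
    cpu_start = time.process_time()
    wall_start = time.perf_counter()
    total = 0
    for i in range(busy_loops):  # CPU-bound work: counts toward CPU time
        total += i * i
    time.sleep(idle_s)           # waiting: counts toward elapsed time only
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return cpu, elapsed


cpu_s, elapsed_s = measure(200_000, 0.2)
print(f"CPU time: {cpu_s:.3f} s, elapsed: {elapsed_s:.3f} s")
```

A 6-hour target measured in CPU time can therefore take longer than 6 hours on the clock whenever the task has to share the processor.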
ID: 90574 · Rating: 0

Bryn Mawr
Joined: 26 Dec 18
Posts: 325
Credit: 9,229,618
RAC: 521
Message 90575 - Posted: 24 Mar 2019, 11:37:54 UTC - in response to Message 90569.  

This seems odd but I would continue to let it run since it is a relatively large protein to model.


After 10 hours (elapsed and CPU) 2 of them (1064201222 and 1064201281) errored out with the same symptoms I’ve been seeing.

Interestingly, the 4 that succeeded (1064201216, 1064201223, 1064201224 and 1064201283) also had the "default.out.gz exist, stream information inconsistent" error, so that is also a red herring.
ID: 90575 · Rating: 0



©2022 University of Washington
https://www.bakerlab.org