Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,097,379 RAC: 17,254 |
Rosetta 4.20 here again. I am running 5 of them and 1 rosetta python. They must come in regular small bursts, because there's always a fair amount running according to server status. I've only got 5 pythons on the only computer that will run them. For some reason it refuses to run 6 (it has 6 cores), even if there's loads of RAM left. |
tullio Joined: 10 May 20 Posts: 63 Credit: 630,125 RAC: 0 |
I can run 2 rosetta pythons at most on my 12 GB RAM. If there is a third, it will be waiting for memory. My Intel i5 9400F has 3 cores, that is six processors. Tullio |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,097,379 RAC: 17,254 |
"I can run 2 rosetta pythons at most on my 12 GB RAM. If there is a third, it will be waiting for memory. My Intel i5 9400F has 3 cores, that is six processors." You could be right, they probably ask for more RAM than they actually use, just in case. I forgot that machine with 6 cores only had 16GB. I stole some of it to put in my new Ryzen. Must have 64GB on my gaming machine! It's currently running 5 pythons using 11.5/16GB. But it doesn't say "waiting for memory" like I've seen before. It just doesn't start them. And I'm sure I've seen it using under half the memory and still not starting one. |
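The idea that tasks "ask for more RAM than they actually use" can be sketched as below. This is a simplified model, not BOINC's actual scheduling logic: the function name, the 90% usable fraction, and the 2.8 GB declared requirement per task are all hypothetical values chosen for illustration. It only shows how a limit based on declared memory could cap a 6-core, 16 GB machine at 5 tasks while real usage stays well below the total.

```python
def max_concurrent(total_ram_gb, usable_fraction, declared_per_task_gb, cores):
    """Tasks that fit if admission uses the declared (not actual) RAM per task.

    All parameters are illustrative assumptions, not values read from BOINC.
    """
    budget_gb = total_ram_gb * usable_fraction          # RAM the client may hand out
    by_memory = int(budget_gb // declared_per_task_gb)  # whole tasks that fit
    return min(by_memory, cores)                        # never more than one per core


# Hypothetical numbers: 16 GB machine, 90% usable, 2.8 GB declared per task, 6 cores
print(max_concurrent(16, 0.9, 2.8, 6))
# Same assumptions on a 64 GB machine: the core count becomes the limit instead
print(max_concurrent(64, 0.9, 2.8, 6))
```

Under these assumed numbers the sixth task never starts on the 16 GB machine, even though the five running tasks may be using noticeably less than their declared requirement.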
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
This is weird: https://boinc.bakerlab.org/rosetta/result.php?resultid=1464088106
1.5 days of processing for 20 minutes or so of CPU time. So here's the breakdown:
2022-01-18 10:33:11 (15556): Status Report: Elapsed Time: '15314.521130'
2022-01-18 10:33:11 (15556): Status Report: CPU Time: '29.109375'
2022-01-18 00:17:37 (15156): Creating new snapshot for VM.
2022-01-18 00:17:42 (15156): Deleting stale snapshot.
2022-01-18 00:17:43 (15156): Checkpoint completed.
2022-01-18 00:21:45 (15156): VM state change detected. (old = 'running', new = 'paused')
2022-01-18 00:22:01 (15156): Powering off VM.
2022-01-18 00:22:01 (15156): Successfully stopped VM
(End of my day, so I shut down via suspend, shut down client (leave in memory), exit BOINC.)
Now I restart:
2022-01-18 08:12:38 (15556): VM state change detected. (old = 'poweredoff', new = 'running')
2022-01-18 08:12:38 (15556): Status Report: Elapsed Time: '9314.493395'
2022-01-18 08:12:38 (15556): Status Report: CPU Time: '18.328125'
2022-01-18 08:12:38 (15556): Preference change detected
2022-01-18 08:12:38 (15556): Setting CPU throttle for VM. (100%)
2022-01-18 08:12:38 (15556): Setting checkpoint interval to 600 seconds. (Higher value of (Preference: 180 seconds) or (Vbox_job.xml: 600 seconds))
2022-01-18 08:32:02 (15556): Creating new snapshot for VM.
2022-01-18 08:32:12 (15556): Deleting stale snapshot.
Then this point:
2022-01-18 10:33:11 (15556): Status Report: Elapsed Time: '15314.521130'
2022-01-18 10:33:11 (15556): Status Report: CPU Time: '29.109375'
Here is 6 hrs:
2022-01-18 12:31:57 (15556): Status Report: Elapsed Time: '21314.549383'
2022-01-18 12:31:57 (15556): Status Report: CPU Time: '39.125000'
2022-01-18 12:37:42 (15556): Creating new snapshot for VM.
2022-01-18 12:37:43 (15556): Deleting stale snapshot.
2022-01-18 14:28:55 (15556): Status Report: Elapsed Time: '27314.711735'
2022-01-18 14:28:55 (15556): Status Report: CPU Time: '49.218750'
2022-01-18 16:09:27 (15556): Status Report: Elapsed Time: '33315.182032'
2022-01-18 16:09:27 (15556): Status Report: CPU Time: '59.093750'
2022-01-18 18:11:20 (15556): Status Report: Elapsed Time: '39315.521685'
2022-01-18 18:11:20 (15556): Status Report: CPU Time: '68.562500'
Something went nuts, but does not show up in the report:
2022-01-18 19:27:47 (15556): Checkpoint completed.
2022-01-18 19:33:12 (11508): Detected: vboxwrapper 26202
2022-01-18 19:33:12 (11508): Detected: BOINC client v7.16.20
2022-01-18 19:33:13 (11508): Detected: VirtualBox VboxManage Interface (Version: 6.1.30)
2022-01-18 19:33:13 (11508): Feature: Checkpoint interval offset (88 seconds)
2022-01-18 19:33:13 (11508): Detected: Minimum checkpoint interval (600.000000 seconds)
2022-01-18 19:33:13 (11508): Restore from previously saved snapshot.
2022-01-18 19:33:14 (11508): Restore completed.
2022-01-18 19:33:19 (11508): Status Report: Elapsed Time: '43879.012785'
2022-01-18 19:33:19 (11508): Status Report: CPU Time: '75.46875
2022-01-18 21:13:48 (11508): Status Report: Elapsed Time: '49879.776962'
2022-01-18 21:13:48 (11508): Status Report: CPU Time: '86.453125'
2022-01-18 22:59:05 (11508): Status Report: Elapsed Time: '55880.065147'
2022-01-18 22:59:05 (11508): Status Report: CPU Time: '96.125000'
2022-01-19 00:02:14 (11508): VM state change detected. (old = 'running', new = 'paused')
2022-01-19 00:02:44 (11508): Powering off VM.
2022-01-19 00:02:44 (11508): Successfully stopped VM.
*End of day 1*
Start day 2:
2022-01-19 07:58:26 (16032): VM state change detected. (old = 'poweredoff', new = 'running')
2022-01-19 07:58:26 (16032): Status Report: Elapsed Time: '58981.617149'
2022-01-19 07:58:26 (16032): Status Report: CPU Time: '100.656250'
2022-01-19 10:00:34 (16032): Status Report: Elapsed Time: '64981.857656'
2022-01-19 10:00:34 (16032): Status Report: CPU Time: '112.250000'
2022-01-19 11:46:01 (16032): Status Report: Elapsed Time: '70982.433000'
2022-01-19 11:46:01 (16032): Status Report: CPU Time: '122.140625'
2022-01-19 13:26:46 (16032): Status Report: Elapsed Time: '76982.663074'
2022-01-19 13:26:46 (16032): Status Report: CPU Time: '132.531250'
2022-01-19 15:11:43 (16032): Status Report: Elapsed Time: '82982.833196'
2022-01-19 15:11:43 (16032): Status Report: CPU Time: '142.390625'
2022-01-19 17:17:08 (16032): Status Report: Elapsed Time: '88982.986887'
2022-01-19 17:17:08 (16032): Status Report: CPU Time: '152.312500'
2022-01-19 19:05:11 (16032): Status Report: Elapsed Time: '94983.557718'
2022-01-19 19:05:11 (16032): Status Report: CPU Time: '161.968750'
This is where I take the time to look and see how things are going and say WTF! 2 days! Come on! ABORT |
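A log like the one above can be checked for this kind of stall automatically. The following is a minimal sketch that pairs each "Elapsed Time" report with the "CPU Time" report carrying the same timestamp and prints the ratio. The regular expression is written against the line format shown in this post only; the function name `cpu_efficiency` is made up for the example, and other vboxwrapper versions may format these lines differently.

```python
import re

# Matches lines shaped like the ones quoted in the post, e.g.
# 2022-01-18 10:33:11 (15556): Status Report: CPU Time: '29.109375'
STATUS_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): "
    r"Status Report: (Elapsed Time|CPU Time): '([0-9.]+)'"
)


def cpu_efficiency(log_text):
    """Return (timestamp, elapsed_s, cpu_s, cpu/elapsed) for each report pair."""
    pending = {}  # timestamp -> elapsed seconds still waiting for its CPU line
    rows = []
    for stamp, _pid, kind, value in STATUS_RE.findall(log_text):
        if kind == "Elapsed Time":
            pending[stamp] = float(value)
        elif stamp in pending:
            elapsed = pending.pop(stamp)
            cpu = float(value)
            rows.append((stamp, elapsed, cpu, cpu / elapsed))
    return rows


# The last two report pairs from the post above
sample = """
2022-01-19 17:17:08 (16032): Status Report: Elapsed Time: '88982.986887'
2022-01-19 17:17:08 (16032): Status Report: CPU Time: '152.312500'
2022-01-19 19:05:11 (16032): Status Report: Elapsed Time: '94983.557718'
2022-01-19 19:05:11 (16032): Status Report: CPU Time: '161.968750'
"""

for stamp, elapsed, cpu, ratio in cpu_efficiency(sample):
    print(f"{stamp}  elapsed={elapsed:.0f}s  cpu={cpu:.1f}s  ratio={ratio:.4%}")
```

For both pairs the CPU time is well under 1% of the elapsed time, which is the pattern that suggests the VM is sitting idle rather than crunching. Lines with a damaged value, such as a missing closing quote, are simply skipped by the pattern.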
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,097,379 RAC: 17,254 |
Welcome to the club. ALL tasks for 6 of my machines do that. 1 in 50 tasks for my "good" machine do that. Whatever the bug is, it can be visible sometimes on some hardware and always on other hardware. I think we can't see enough information unless we're inside the VM. Is that possible? And wow, you were in bed for under 8 hours. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"And wow, you were in bed for under 8 hours." Yeah, and I am paying for that. I didn't think of looking in the VM for info. If I get stuck next time, I'll have a look. |
.clair. Joined: 2 Jan 07 Posts: 274 Credit: 26,399,595 RAC: 0 |
They make snapshots even if the app is stuck in a loop going nowhere, I have seen 30+ snapshots with only 5 minutes of CPU time; time wasters. By the way, what happened to the 700,000 workunits that vanished from the front page queue? It's down to only 1.8 million. Are they trying to find the buggy ones? |
Sid Celery Joined: 11 Feb 08 Posts: 2132 Credit: 41,484,592 RAC: 17,238 |
"Just checking in because I had a fair few Rosetta 4.20 tasks come down." "YOU!! You stole them! I wanted those. I'm going to hunt you down, and I mean physically!" I actually did. Full buffer on both machines I have near me before mentioning it. No need to thank me. I'll make tea - do you take sugar? |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,097,379 RAC: 17,254 |
"Just checking in because I had a fair few Rosetta 4.20 tasks come down." "YOU!! You stole them! I wanted those. I'm going to hunt you down, and I mean physically!" I don't like hot drinks. Orange juice or vodka please, or both. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,097,379 RAC: 17,254 |
"They make snapshots even if the app is stuck in a loop going nowhere," Interesting, now up to 2.2 million. I'll try grabbing some and see what happens. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,097,379 RAC: 17,254 |
Well that didn't work, I tried 3 machines. Two of them failed pythons (no CPU time) and the other took four 4.20 tasks and got a computation error! |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"They make snapshots even if the app is stuck in a loop going nowhere," "Interesting, now up to 2.2 million. I'll try grabbing some and see what happens." Ignore that big fancy number on the front page. That is what they have in the queue for both the AI and RAH, of which 99% are AI tasks. Go one layer deeper, where it breaks down into 4.2 and python. This is the real number for us lowly PC crunchers:

Application | Unsent | In progress | Runtime of last 100 tasks in hours: average, min, max | Users in last 24 hours |
---|---|---|---|---|
Rosetta | 0 | 61887 | 6.62 (0.28 - 51.23) | 2600 |
rosetta python projects | 4999 | 13547 | 4.59 (0.71 - 57.86) | 1059 |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Check this out from a 4.2 task today:
<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @rb_01_17_185861_181891_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 1 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_01_17_185861_181891_ab_t000__robetta.zip -frag3 rb_01_17_185861_181891_ab_t000__robetta.200.3mers.index.gz -fragA rb_01_17_185861_181891_ab_t000__robetta.200.9mers.index.gz -fragB rb_01_17_185861_181891_ab_t000__robetta.200.5mers.index.gz -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1484534
Using database: database_357d5d93529_n_methylminirosetta_database
[ ERROR ]: Caught exception:
File: C:cygwin64homeboinc4.17Rosettamainsourcesrccore/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306 chi angle must be between -180 and 180: -nan(ind)
------------------------ Begin developer's backtrace -------------------------
BACKTRACE:
------------------------- End developer's backtrace --------------------------
AN INTERNAL ERROR HAS OCCURED. PLEASE SEE THE CONTENTS OF ROSETTA_CRASH.log FOR DETAILS.
</stderr_txt>
]]>
Gees...really?!?!? |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,097,379 RAC: 17,254 |
"Ignore that big fancy number on the front page." But the point is that it was spotted dropping suddenly, so they must have removed some, presumably due to problems. I wish other projects had that number. All we get to see is the little front end buffer on most projects. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,097,379 RAC: 17,254 |
"Check this out from a 4.2 task today" Looks rather like something tried to use the 4th dimension. Does your processor not support that function? |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"Check this out from a 4.2 task today" "Looks rather like something tried to use the 4th dimension. Does your processor not support that function?" I have no graphics on my CPU. Besides, the program is supposed to take care of any graphics or whatever. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"Ignore that big fancy number on the front page." "But the point is that was spotted dropping suddenly, so they must have removed some, presumably due to problems. I wish other projects had that number. All we get to see is the little front end buffer on most projects." But again, that number is mute to this aspect of the project. It has no bearing on what we do. Just watch the numbers I quoted. That is all you need to be concerned about, because that is the work WE get, not the machine. It just looks cool to say "oh, we have 2 million tasks queued up", but when you dig deeper on Robetta, you see AI, AI, AI, AI... Rosetta, AI, AI, AI, AI, maybe a Rosetta. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,097,379 RAC: 17,254 |
"I have no graphics on my CPU." I was trying to make a joke. It said the angle wasn't within the normal 360 degrees. |
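The value in that error, `-nan(ind)`, looks like the way Microsoft's C runtime prints an indeterminate NaN ("not a number"), so the angle was probably not a real number at all rather than one outside 360 degrees. The sketch below shows why any range check rejects such a value. The function `chi_in_range` is a hypothetical stand-in written for this example, not the actual Rosetta check.

```python
import math


def chi_in_range(chi_degrees):
    # Hypothetical range check, loosely modelled on the wording of the error message
    return -180.0 <= chi_degrees <= 180.0


# One way a NaN can arise: an undefined floating-point operation
nan_angle = float("inf") - float("inf")

print(chi_in_range(179.9))    # True: an ordinary angle passes
print(chi_in_range(nan_angle))  # False: every comparison with NaN is false
print(math.isnan(nan_angle))  # True: the value is NaN, not just out of range
```

Because NaN compares false against everything, the check fails even though the value is neither below -180 nor above 180. Where the NaN came from in the first place is not visible in the posted output.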
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,097,379 RAC: 17,254 |
Actually the word is moot, not that I ever use that strange sounding word. I also have an aversion to the word dupe. "Ignore that big fancy number on the front page." "But the point is that was spotted dropping suddenly, so they must have removed some, presumably due to problems. I wish other projects had that number. All we get to see is the little front end buffer on most projects." "It has no bearing on what we do." Admittedly it doesn't tell you which of the two apps it is, but it's more meaningful than the tiny number, which is just the first bunch in their RAM buffer or whatever. The main number is what we always used to look at to see how much work was left. You could see there was a month's supply etc. |
Greg_BE Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Ah ha! Well sorry for the misspelling. Quick typing and no thinking after work. 4th dimension does exist...but anyway...again after work..not thinking. Missed the pun. Now...if you go here: https://robetta.bakerlab.org/queue.php?id=&target=&username=&seq=&page=2 and look at the active tasks at random, you will see that the majority are queued for RoseTTAFold which is the AI. You can read about it here: https://www.ipd.uw.edu/2021/07/rosettafold-accurate-protein-structure-prediction-accessible-to-all/ Now bed... |
©2024 University of Washington
https://www.bakerlab.org