Minirosetta 3.73-3.78

Message boards : Number crunching : Minirosetta 3.73-3.78

Profile Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,710,284
RAC: 2,004
Message 80064 - Posted: 10 May 2016, 13:28:39 UTC - in response to Message 80063.  

Now at 99% of total CPUs: 5 CPUs dedicated to CPU work and 1 left to work with the GPU and other stuff. Percent of CPU time is back to 100%.


[quote]There is no difference between 99% and 96% of CPUs in the computing configuration of your machine. Any minor change was likely due to background churning of other jobs ... either normal system tasks or other Boinc compute jobs.

There are two BOINC COMPUTING PREFERENCES -> COMPUTING controls for the CPU.
One is "% of CPUs" which controls the number of CPUs that are active.
Second is "% of CPU time" which intentionally inserts idle into the compute time.

Use "% of CPUs" and AVOID the "% of CPU time" like the plague. Inserting non-BOINC time into the project execution is like what you saw with Rosetta running at 8%. Your 8% was like setting the "% of CPU time" at 50%.

The "% of CPUs" deals in whole CPUs.
"% of CPUs" set to 99% will allow 5 of your 6 CPUs to run CPU-only jobs.
You can drop "% of CPUs" down to 100% - 100%/6 ≈ 83.33% (use 83.4%, rounded up) and it should still allow 5 of your CPUs to run. If you set "% of CPUs" to 83%, then BOINC will idle a second CPU and only 4 would run.

EXAMPLE:
On my i7 with 8 CPUs, setting "% of CPUs" to 99% disables 1 CPU ... and displays the following message in the EVENT LOG:

5/10/2016 6:00:32 AM | | Number of usable CPUs has changed from 8 to 7.
5/10/2016 6:00:32 AM | | max CPUs used: 7

Setting "% of CPUs" to 88% yields the same message.
Setting "% of CPUs" to 87% drops another CPU with the EVENT LOG message:

5/10/2016 6:02:32 AM | | Number of usable CPUs has changed from 7 to 6.
5/10/2016 6:02:32 AM | | max CPUs used: 6
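
The Event Log figures above are consistent with BOINC rounding (number of CPUs × "% of CPUs" / 100) down to a whole CPU. A minimal sketch of that arithmetic follows. The function name `usable_cpus` and the round-down, at-least-one rule are assumptions inferred from the numbers in this post, not taken from the BOINC client source.

```python
def usable_cpus(total_cpus: int, pct_of_cpus: float) -> int:
    """Estimate how many whole CPUs BOINC would use.

    Assumed rule (inferred from the Event Log example): multiply,
    round down to a whole CPU, and never go below one CPU.
    """
    return max(1, int(total_cpus * pct_of_cpus / 100))


# The 8-CPU i7 from the example: 99% and 88% both give 7, 87% gives 6.
for pct in (99, 88, 87):
    print(f"8 CPUs at {pct}% -> {usable_cpus(8, pct)} usable")

# The 6-CPU machine: 99% and 83.4% both give 5, 83% gives 4.
for pct in (99, 83.4, 83):
    print(f"6 CPUs at {pct}% -> {usable_cpus(6, pct)} usable")
```

Under this assumed rule, any setting from 83.4% up to 99% behaves the same on a 6-CPU machine, which is why moving between 96% and 99% makes no difference.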


[quote]Ok, I will lower my overall Boinc CPU load to 98% and see if that helps.
And what you see on POEM is the same with me. 100% GPU and grabbing a significant percent of CPU. So it could be like you said, Rosetta getting bounced.
- Lowered both levels of processor usage to 96%. Will let things run and see if that helps Rosie catch back up. Thanks for the help. Let you know later if that solves the issue.

[quote]If the Rosetta job is bouncing between 16% and 8%, the CPU caches are getting cleared out during the 8% time that Rosetta is being idled by other programs executing on your system. You cannot tell how many times Rosetta is getting/losing control during that 1 second sample but it is probably a large number of times.

This is a very good indication that CPU cache thrashing (two or more jobs wanting to have their code/data in CPU caches) is a problem. Since the Boinc Whetstone benchmark ran full speed on your machine and other user machines, when Rosetta bounces between 16%-8% and cache contents are evicted, your machine is not making as much Rosetta compute progress because it is waiting for code/data to be retrieved again from slower main memory. When compared to the other machine ratios of Rosetta/Whetstone, their ratio is higher than yours appears and they are getting a higher % of claimed credits.

It is hard to estimate the exact impact based on these high level numbers but if you saw 8% on Rosetta, that is not good and likely part of the problem.

I have seen the GPU job load on the CPU vary as a function of the SYSTEM and as a function of the GPU, CPU and memory bandwidth. POEM is taking 100% of a CPU on my i7-3770k/Nvidia 970 GPU.

The newer OpenCL GPU apps do seem to take a good chunk of a CPU. They take more CPU than their CUDA counterparts. On machines where I run POEM or similar OpenCL GPU projects, I set:

BOINC -> COMPUTER PREFERENCES -> USAGE LIMITS -> % of CPUs = 99%

to keep 1 CPU available for the GPU jobs AND for reasonable response on the system.



[quote]It might be Poem. Even though it is mainly a GPU app, it grabs .263% of the CPU, but when looking at processes it takes 17% of the CPU and Rosetta jumps around between 16 and 8%.

[quote]I am not sure about VHC or its control knobs. I looked at the SixTrack source code many years ago and had a SixTrack (LHC@Home) account, but they could not generate work to crunch, so I gave up. I have also never run a VirtualBox version of any project, so I have no experience there either.

If the app runs under BOINC control, you can set the PROJECT->NO NEW TASKS and let the tasks drain out or simply suspend the VLHC project application for a period and see if it makes a difference in the Rosetta results.

A quick examination of the Windows 10 task manager might tell:

TASK MANAGER -> MORE DETAILS -> PROCESSES

screen should tell you a lot.

The CPU column should total close to 100% if you allow all CPU to be busy.
SORT BY CPU by clicking on the CPU column.

The Rosetta jobs should each be consuming 1/6 (one of your 6 CPUs) or 16.6% of the machine. If they are consuming noticeably less than 16.6%, then the Rosetta job is not running 100% of the time and its code and data are being evicted from the L1/.../Lx CPU caches. It takes only a few cycles for the CPU to get data from those near caches. If the CPU has to go to main memory for evicted code/data, it takes 10x that long, and Rosetta will run but VERY inefficiently while it waits for code/data to warm the caches again. Rosetta works hard but is waiting on code/data from memory.
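
The per-task share and the earlier 8% reading can be related with a little arithmetic. This is a minimal sketch, assuming Task Manager reports each process as a share of the whole machine and that one Rosetta task can use at most one CPU. The function names are made up for illustration.

```python
def expected_share_pct(total_cpus: int) -> float:
    """Share of the whole machine one fully busy single-CPU task should show."""
    return 100.0 / total_cpus


def running_fraction(observed_pct: float, total_cpus: int) -> float:
    """Fraction of the time the task is actually on a CPU, given the observed share."""
    return observed_pct / expected_share_pct(total_cpus)


print(f"expected share on 6 CPUs: {expected_share_pct(6):.2f}%")
print(f"running fraction at 8%:   {running_fraction(8.0, 6):.2f}")
```

An observed 8% on a 6-CPU machine works out to the task running roughly half the time, which lines up with the comparison to setting "% of CPU time" at 50%.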

It is worth your time to run a couple of experiments on your machine to see if anything is affecting progress.




[quote]You think that VHC could be interfering? They both seem stuck on low average credit and VHC runs on 24 time slots. You can not alter the run time on that project.

Since I have been on Rosetta longer than VHC, I may have to drop VHC.
I was trying it because I wanted to see how virtual box worked.

[quote][quote]Is my CPU not strong enough for the current tasks that have been running
ID: 80064 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 80066 - Posted: 10 May 2016, 15:07:28 UTC - in response to Message 80064.  

@Greg_BE: You might want to learn how those nasty quote tags work...
ID: 80066 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,710,284
RAC: 2,004
Message 80068 - Posted: 10 May 2016, 17:56:13 UTC - in response to Message 80066.  
Last modified: 10 May 2016, 17:57:15 UTC

@Greg_BE: You might want to learn how those nasty quote tags work...


I think that is because I am writing above the previous post instead of below like here. The computer for the forum can't read backwards.

I haven't posted on here in years so I have forgotten how this works.
ID: 80068 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
rjs5

Joined: 22 Nov 10
Posts: 273
Credit: 21,449,661
RAC: 16,596
Message 80069 - Posted: 10 May 2016, 19:24:32 UTC - in response to Message 80068.  

You can put messages above the old message .... AND I thought that was a clever idea since it worked for me.

@Greg_BE: You might want to learn how those nasty quote tags work...


I think that is because I am writing above the previous post instead of below like here. The computer for the forum can't read backwards.

I haven't posted on here in years so I have forgotten how this works.

ID: 80069 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Greg_BE
Joined: 30 May 06
Posts: 5664
Credit: 5,710,284
RAC: 2,004
Message 80070 - Posted: 10 May 2016, 20:01:27 UTC - in response to Message 80069.  
Last modified: 10 May 2016, 20:02:18 UTC

We are going way off topic now, so time to end this.


You can put messages above the old message .... AND I thought that was a clever idea since it worked for me.

@Greg_BE: You might want to learn how those nasty quote tags work...


I think that is because I am writing above the previous post instead of below like here. The computer for the forum can't read backwards.

I haven't posted on here in years so I have forgotten how this works.

ID: 80070 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile [VENETO] boboviz

Joined: 1 Dec 05
Posts: 1864
Credit: 8,184,911
RAC: 7,140
Message 80103 - Posted: 19 May 2016, 12:22:54 UTC

825724761

ERROR: in::file::boinc_wu_zip 5H2LD-13_tj58_5_054307_0014_I_0001_data.zip does not exist!
ERROR:: Exit from: ..\..\..\src\apps\public\boinc\minirosetta.cc line: 226
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

ID: 80103 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 11,805,838
RAC: 0
Message 80107 - Posted: 19 May 2016, 23:04:49 UTC - in response to Message 80103.  

825724761

ERROR: in::file::boinc_wu_zip 5H2LD-13_tj58_5_054307_0014_I_0001_data.zip does not exist!
ERROR:: Exit from: ..\..\..\src\apps\public\boinc\minirosetta.cc line: 226
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish



I see the same error: seems to happen with all tasks named yh_*. Boinc 7.2.42/Ubuntu 14.04
ID: 80107 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 80109 - Posted: 20 May 2016, 19:29:30 UTC

Yes, all of the yh* jobs are failing on my computer, too.
ID: 80109 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 80110 - Posted: 20 May 2016, 19:59:46 UTC - in response to Message 80109.  

Thanks for the report! I've contacted the authors of these jobs!

Yes, all of the yh* jobs are failing on my computer, too.

ID: 80110 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
yhsia
Volunteer moderator
Project developer
Project scientist

Joined: 21 May 16
Posts: 2
Credit: 97,172
RAC: 0
Message 80111 - Posted: 21 May 2016, 1:03:20 UTC - in response to Message 80110.  

Thanks for the report! I've contacted the authors of these jobs!

Yes, all of the yh* jobs are failing on my computer, too.



Sorry, those were my jobs! Apologies for the wasted run time; I'm figuring out what went wrong :(
ID: 80111 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sid Celery

Joined: 11 Feb 08
Posts: 1990
Credit: 38,522,839
RAC: 15,277
Message 80113 - Posted: 21 May 2016, 4:22:55 UTC
Last modified: 21 May 2016, 4:30:31 UTC

I got several of the above, so no need to report them, but another isolated one came up:

4hi0_B_16_BEN_SUP_hyb_cst_v02_i00_t000__krypton_SAVE_ALL_OUT_03_09_358432_163_1
ERROR: Cannot open file "i11.pdb"
ERROR:: Exit from: ..\..\..\src\core\import_pose\import_pose.cc line: 255
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Exited after just 50 seconds, so no harm done at my end

Oh, also another odd one that seemed to run ok, but claimed credit yet received 0 but without a validate error

rb_05_17_65554_109652__t000__ab_robetta_IGNORE_THE_REST_358733_4959_1
======================================================
DONE :: 1 starting structures 28567 cpu seconds
This process generated 47 decoys from 47 attempts
======================================================
BOINC :: WS_max 2.60878e+008

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>


Validate state Workunit error - check skipped
Claimed credit 200.579495343863
Granted credit 0
application version 3.73

ID: 80113 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 80116 - Posted: 21 May 2016, 19:26:01 UTC - in response to Message 80113.  

Thanks Sid! I've fixed the issue, but unfortunately some units already got sent out =[

I got several of the above, so no need to report them, but another isolated one came up:

4hi0_B_16_BEN_SUP_hyb_cst_v02_i00_t000__krypton_SAVE_ALL_OUT_03_09_358432_163_1
ERROR: Cannot open file "i11.pdb"
ERROR:: Exit from: ..\..\..\src\core\import_pose\import_pose.cc line: 255
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Exited after just 50 seconds, so no harm done at my end

Oh, also another odd one that seemed to run ok, but claimed credit yet received 0 but without a validate error

rb_05_17_65554_109652__t000__ab_robetta_IGNORE_THE_REST_358733_4959_1
======================================================
DONE :: 1 starting structures 28567 cpu seconds
This process generated 47 decoys from 47 attempts
======================================================
BOINC :: WS_max 2.60878e+008

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish


]]>


Validate state Workunit error - check skipped
Claimed credit 200.579495343863
Granted credit 0
application version 3.73

ID: 80116 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sid Celery

Joined: 11 Feb 08
Posts: 1990
Credit: 38,522,839
RAC: 15,277
Message 80117 - Posted: 22 May 2016, 1:31:27 UTC - in response to Message 80116.  

Good stuff. The one with the credit problem seems to have got cleaned up in the meantime and granted credit equal to claimed credit, so all's well there too.

A few more failed tasks but only of the type already reported, so all in hand as they work their way out of the queue.

Thanks Sid! I've fixed the issue, but unfortunately some units already got sent out =[

I got several of the above, so no need to report them, but another isolated one came up:

4hi0_B_16_BEN_SUP_hyb_cst_v02_i00_t000__krypton_SAVE_ALL_OUT_03_09_358432_163_1
ERROR: Cannot open file "i11.pdb"
ERROR:: Exit from: ..\..\..\src\core\import_pose\import_pose.cc line: 255
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

Exited after just 50 seconds, so no harm done at my end

Oh, also another odd one that seemed to run ok, but claimed credit yet received 0 but without a validate error

rb_05_17_65554_109652__t000__ab_robetta_IGNORE_THE_REST_358733_4959_1
======================================================
DONE :: 1 starting structures 28567 cpu seconds
This process generated 47 decoys from 47 attempts
======================================================
BOINC :: WS_max 2.60878e+008

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>


Validate state Workunit error - check skipped
Claimed credit 200.579495343863
Granted credit 0
application version 3.73


ID: 80117 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Dr. Merkwürdigliebe
Joined: 5 Dec 10
Posts: 81
Credit: 2,657,273
RAC: 0
Message 80132 - Posted: 28 May 2016, 14:29:32 UTC

ID: 80132 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 80133 - Posted: 28 May 2016, 21:10:45 UTC - in response to Message 80132.  

Compute error

Thanks! I've informed the author of the job.
ID: 80133 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile wyxchari

Joined: 27 Nov 14
Posts: 11
Credit: 85,318
RAC: 0
Message 80149 - Posted: 2 Jun 2016, 11:08:27 UTC

I'm tired of computer errors on Rosetta. Many tasks fail at the end and then do not receive credit. I prefer to use my computer time on other projects, as WCG Cancer never gives me errors. Goodbye forever.
ID: 80149 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Timo
Joined: 9 Jan 12
Posts: 185
Credit: 45,644,168
RAC: 214
Message 80150 - Posted: 2 Jun 2016, 12:43:43 UTC - in response to Message 80149.  

I'm tired of computer errors on Rosetta. Many tasks fail at the end and then do not receive credit. I prefer to use my computer time on other projects, as WCG Cancer never gives me errors. Goodbye forever.



Actually, your errored task received full credit (see the bottom of the page here: https://boinc.bakerlab.org/rosetta/result.php?resultid=824626167). As with most invalid tasks, it does not get credit right away: a job runs once a day that grants credit to invalid tasks, and this granted credit only shows on the result summary page.
ID: 80150 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sid Celery

Joined: 11 Feb 08
Posts: 1990
Credit: 38,522,839
RAC: 15,277
Message 80160 - Posted: 6 Jun 2016, 0:55:36 UTC

A new error report for yhsia to look at - cuts in at 30-45 minutes into the tasks for some reason:

yh160603_5H2LD-13-R_tj59_5_043651_0011_E_0001_SAVE_ALL_OUT_377694_12_1
<message>
(unknown error) - exit code -529697949 (0xe06d7363)
</message>

yh160603_5H2LD-13-R_tj58_5_000001_0002_C_0001_SAVE_ALL_OUT_377666_89_0
<message>
(unknown error) - exit code -529697949 (0xe06d7363)
</message>

ID: 80160 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sid Celery

Joined: 11 Feb 08
Posts: 1990
Credit: 38,522,839
RAC: 15,277
Message 80161 - Posted: 6 Jun 2016, 6:53:40 UTC - in response to Message 80160.  

Also yh160603_5H2LD-13-R_tj59_5_000001_0001_E_0001_SAVE_ALL_OUT_377690_131_1
<message>
(unknown error) - exit code -1073741819 (0xc0000005)
</message>
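
The two numbers on each error line in this and the previous post are the same exit code, shown first as a signed 32-bit integer and then as unsigned hexadecimal. The sketch below does that conversion. The short descriptions in `KNOWN_CODES` are the usual Windows meanings of these two values (0xC0000005 is an access violation, 0xE06D7363 is the code the Microsoft C++ runtime raises for an unhandled C++ exception). They are general background and an assumption as far as these tasks go: they do not say what actually went wrong inside them.

```python
# General Windows meanings of the two codes seen in these reports.
# The table and function names are illustrative, not part of BOINC.
KNOWN_CODES = {
    0xC0000005: "access violation (STATUS_ACCESS_VIOLATION)",
    0xE06D7363: "unhandled Microsoft C++ exception",
}


def to_unsigned32(exit_code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


def describe(exit_code: int) -> str:
    """Format an exit code as hex plus a short description if one is known."""
    value = to_unsigned32(exit_code)
    return f"0x{value:08x}: {KNOWN_CODES.get(value, 'unknown')}"


for code in (-529697949, -1073741819):
    print(code, "->", describe(code))
```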


ID: 80161 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
krypton
Volunteer moderator
Project developer
Project scientist

Joined: 16 Nov 11
Posts: 108
Credit: 2,164,309
RAC: 0
Message 80162 - Posted: 6 Jun 2016, 21:18:16 UTC

Hi Sid,

Thanks for the alert! Looks like these jobs require lots of memory. We have a way to specify how much memory to use. It will be corrected in the next round of submissions!
ID: 80162 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote



©2024 University of Washington
https://www.bakerlab.org