Message boards : Number crunching : Rosetta 4.1+ and 4.2+
Author | Message |
---|---|
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
thanks, i'd check that out. 3 "finish file present too long" errors on Pi4, Rosetta v4.20 aarch64-unknown-linux-gnu |
reindl Joined: 31 Mar 20 Posts: 1 Credit: 1,765,751 RAC: 0 |
Can you reduce the size of tasks for Android phones? I have a Samsung S20 equipped with Qualcomm's flagship processor, the Snapdragon 865, and it can take more than half a day to finish one task. And the deadline was set to about 3 days after the task was downloaded. I have to keep my phone charged most of the day to finish the tasks received. This is not reasonable and puts a lot of pressure on me. So, could you please reduce the size of each task? Thanks. There are 2 things you can do:
2. Go to your settings and create a separate profile for your phones with a shorter target runtime |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Can the servers be updated such that a wingman is only created once the originally created task is unable to report results? Otherwise the first guy reports late, but gets in before the second guy, and then the second guy reports back the same WU. See discussion here, and sample WU here. Rosetta Moderator: Mod.Sense |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1725 Credit: 18,380,064 RAC: 20,136 |
Can the servers be updated such that a wingman is only created once the originally created task is unable to report results? Project options: <report_grace_period>x</report_grace_period> <grace_period_hours>x</grace_period_hours> A "grace period" (in seconds or hours respectively) for task reporting. A task is considered timed out (and a new replica generated) if it is not reported by client_deadline + x. So my thought is the grace period needs to be 12 hours. The deadline can be 3 days, 7 days etc.; then there is the watchdog timer, which is presently 10 hours. Allow another couple of hours (just because...) and that gives you 12 hours for the grace_period_hours x. So a new task won't be created until 12 hours after the deadline for the initial replication has passed (thinking about it, even 6 hours would probably be long enough most of the time). Grant Darwin NT |
sgaboinc Joined: 2 Apr 14 Posts: 282 Credit: 208,966 RAC: 0 |
i'm getting more "finish file present too long" errors on Pi4 4.20. i've not upgraded boinc-client: can't find a binary package that would install problem free, many dependencies. however, i noticed one thing about these errors. they seem related to the Junior_HalfRoid tasks https://boinc.bakerlab.org/rosetta/result.php?resultid=1172540347 https://boinc.bakerlab.org/rosetta/result.php?resultid=1172395662 and when these wu run, my Pi4 is close to using up all ram available. I'm not too sure if memory may after all be involved, e.g. that they generate many error messages in the 'finish file' due to low memory conditions. there doesn't seem to be an easy way to solve it, if it is due to memory, short of running fewer tasks. but the point is that when the tasks start, memory consumption normally looks ok, and it grows as the work progresses. |
Sid Celery Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 10,612 |
then there is the Watchdog timer which is presently 10 hours. Minor diversion from the topic: I know this is what the watchdog was set to the last time we heard, but wasn't it for a very specific reason? Does that reason apply any more? Because if it doesn't, it's a really long time for nominally 8hr task runtimes. My sense of the watchdog was that it's there to allow for relatively short overruns that happen from time to time, while providing a cutoff for tasks if they've kind of gone rogue for some unknown reason. 10hrs doesn't really do the job any more and should be reduced to something more appropriate (was 4hrs) |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
If I'm not mistaken, I believe the watchdog was extended to 10 hours, specifically for these potentially long-running Halfroids. Rosetta Moderator: Mod.Sense |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
One of those WUs used over 1GB and the other used over 2GB. What was in the out file about memory? It would seem that running fewer threads would be better than failing WUs. But I would suspect that BOINC client would have had to put the others to "waiting for memory" in order to run the larger one anyway. So, reducing the number of threads should basically be occurring automatically, and only when the specific WU requires it. Rosetta Moderator: Mod.Sense |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1725 Credit: 18,380,064 RAC: 20,136 |
and when these wu run, my Pi4 is close to using up all ram available. I'm not too sure if memory may after all be involved. Low available system RAM would impact on RAM available for disk caching. Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 10,612 |
If I'm not mistaken, I believe the watchdog was extended to 10 hours, specifically for these potentially long-running Halfroids. So the reason may still apply in future? I haven't seen one for a while. Ok |
dduggan47 Joined: 18 Sep 05 Posts: 12 Credit: 4,298,437 RAC: 4,162 |
I apologize for asking a question that's probably already been asked and answered but, despite having been running BOINC since the early days (and SETI before BOINC existed), I'm not always sufficiently technical to follow all the details discussed here. My problem is that I'm getting many tasks which get "timed out - no response". For a while I was trying to look ahead and abort a lot of tasks, started and unstarted, which weren't going to finish by the deadline. I gather though that that might not be my best strategy for resolving this. On one machine a couple of days ago I changed the "store at least" and "... additional" to 1 day and 0.25 days respectively, but on the other box I forgot and didn't make that change until today. At the moment I have 3 running tasks on the 1st machine that will not make it. On the other machine it's 12 running and about that many more which haven't started yet but won't make the deadline. Am I right in assuming that BOINC will eventually figure this out? In the meantime, what's my best move? Abort all that won't make it? Abort only the unstarted? Let them all go until BOINC figures it out? Thanks. |
Raistmer Joined: 7 Apr 20 Posts: 49 Credit: 797,293 RAC: 0 |
This project (unlike SETI, which you are familiar with) allows you to reduce the length of already received tasks. The best option for your host is to set them to the minimum possible length and then gradually increase it as long as you don't miss deadlines. This can be done in the project options here: https://clip2net.com/s/47qBO85 As you can see, I have 2 different sets of options: one for powerful hosts (big task length) and one for netbooks/smartphones (short length, currently 4 hours per task). P.S. You need to update the project settings (update project from BOINC) and then restart the BOINC client itself to update the length of already downloaded tasks. Newly downloaded tasks will already have the new length. |
dduggan47 Joined: 18 Sep 05 Posts: 12 Credit: 4,298,437 RAC: 4,162 |
Thanks for your help, Raistmer.
This seems counterintuitive. Wouldn't I be better off to increase the expected length and then (I hope) run them in less time than to decrease the time and risk not making the deadlines? P.S. You need to update project settings (update project from BOINC) and then restart BOINC client itself to update already downloaded tasks length. Newly downloaded will be of new length already. I changed the expected times before reading your note but did it the opposite way as I described above. I can redo that if you advise that it would work better, even though I can't say I understand why. I also aborted anything that didn't look like it was going to make the deadline. After seeing your post I stopped and restarted the BOINC client. This seemed to increase the expected times by a lot more than my change on some (but not all) running tasks but had little or no effect on unstarted tasks. In my decades of running BOINC on around 40 different projects I've never run into this problem before. I'm finding it quite confusing. OTOH I was decades younger then too. Age tends not to reduce confusion! :-) Thanks again. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1725 Credit: 18,380,064 RAC: 20,136 |
The best option is just to use the default Target CPU Runtime, and to have no cache at all, given the number of projects you are running. Am I right in assuming that BOINC will eventually figure this out? In the meantime, what's my best move? Abort all that won't make it? Abort only the unstarted? Let them all go until BOINC figures it out? Even if Rosetta were your only project, 0.5 days & 0.02 days extra is plenty. Grant Darwin NT |
Raistmer Joined: 7 Apr 20 Posts: 49 Credit: 797,293 RAC: 0 |
Expected length is the amount of CPU time a task will be allowed to run. And here is the big difference from SETI and most other projects. A task doesn't contain a fixed number of calculations to complete. If CPU time allows, a new model will be started for the same task (with a slightly different initial configuration of atoms, or something like that). So, if you allow 8 hours per task, it will run for 8 hours. If only 2 hours, it will end in 2 hours. And yes, to avoid cache overflow in the future, it is better to set the BOINC cache size as small as it can be. But changing the cache size will not help with already downloaded tasks. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1725 Credit: 18,380,064 RAC: 20,136 |
rb_05_09_24541_24116_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_05_10_927507_5_0 <core_client_version>7.6.22</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @rb_05_09_24541_24116_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 2 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_05_09_24541_24116_ab_t000__robetta.zip -frag3 rb_05_09_24541_24116_ab_t000__robetta.200.3mers.index.gz -fragA rb_05_09_24541_24116_ab_t000__robetta.200.10mers.index.gz -fragB rb_05_09_24541_24116_ab_t000__robetta.200.5mers.index.gz -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1576447 Using database: database_357d5d93529_n_methylminirosetta_database [ ERROR ]: Caught exception: File: C:cygwin64homeboinc4.17Rosettamainsourcesrccore/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306 chi angle must be between -180 and 180: -nan(ind) ------------------------ Begin developer's backtrace ------------------------- BACKTRACE: ------------------------- End developer's backtrace -------------------------- AN INTERNAL ERROR HAS OCCURED. PLEASE SEE THE CONTENTS OF ROSETTA_CRASH.log FOR DETAILS. </stderr_txt> ]]> This is the second time i've had this particular error message- last time it was dodgy WU, the other system that got it also got the same error. Waiting to see if that's the case again this time around. Grant Darwin NT |
Ivailo Bonev Joined: 9 May 07 Posts: 15 Credit: 4,552,971 RAC: 9,921 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=1176852042 <core_client_version>7.16.5</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol jhr_boinc_v4.xml @flags -in:file:silent Junior_HalfRoid_design5_COVID-19_SAVE_ALL_OUT_IGNORE_THE_REST_6gx3kn9p.silent -in:file:silent_struct_type binary -silent_gz -mute all -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip Junior_HalfRoid_design5_COVID-19_SAVE_ALL_OUT_IGNORE_THE_REST_6gx3kn9p.zip @Junior_HalfRoid_design5_COVID-19_SAVE_ALL_OUT_IGNORE_THE_REST_6gx3kn9p.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3876534 Using database: database_357d5d93529_n_methylminirosetta_database ERROR: [ERROR] Unable to open constraints file: f39b38c813752ceb1e616c99588b316d_n0_c0_1_0001.MSAcst ERROR:: Exit from: ......srccorescoringconstraintsConstraintIO.cc line: 457 BOINC:: Error reading and gzipping output datafile: default.out 11:22:22 (11520): called boinc_finish(1) </stderr_txt> ]]> |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1725 Credit: 18,380,064 RAC: 20,136 |
rb_05_09_24541_24116_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_05_10_927507_5_0 Looks like it was another dodgy WU- other system had the same error. Grant Darwin NT |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 2002 Credit: 9,780,807 RAC: 5,492 |
|
James W Joined: 25 Nov 12 Posts: 130 Credit: 1,766,254 RAC: 0 |
Reference message 96433, which discusses problems with similar WUs. Edit- having said that, i just had one of those WUs do the same thing on my system, yet was processed OK on another system, and even though i've processed several others of the same type with no problems. (unknown error) - exit code -1073741819 (0xc0000005) Name: new_3cl_10aa_6lu7_modified_AVLstub_relaxed_renumbered_0674_33_extract_B_SAVE_ALL_OUT_928500_391_1 Application: Rosetta v4.20 windows_x86_64 Device: 3710630 Task: 1178942057. WU: 1058857778 Status: Error while computing. Exit status: -1073741819 (0xC0000005) STATUS_ACCESS_VIOLATION Errors: Too many errors (may have bug) Too many total results. Stderr output: (unknown error) - exit code -1073741819 (0xc0000005) My task was the 2nd try for this WU. The first host got the same error, so I suspect an issue with this type of WU/task. My host also received the same error with WU 1058853076, with my host again being the 2nd try for the same task. Edit: As mentioned by others, some of the above WUs process normally while others receive the above-mentioned error. My host quoted above processed task 1178341520 (new_3cl_10aa_6lu7****) normally. |
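
For anyone puzzled by how the exit status in the last post appears both as -1073741819 and as 0xC0000005: the two are the same 32-bit value, read once as a signed integer and once as unsigned hex. A minimal Python sketch of that conversion follows. The function names and the one-entry lookup table are made up for illustration; only the pairing of 0xC0000005 with STATUS_ACCESS_VIOLATION comes from the task report quoted in the thread.

```python
# Status names keyed by the unsigned 32-bit code. Only the single entry
# quoted in the thread is listed; this is not a complete table.
KNOWN_STATUS = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def exit_code_to_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


def describe(exit_code: int) -> str:
    """Return a one-line description similar to the task report format."""
    unsigned = exit_code & 0xFFFFFFFF
    name = KNOWN_STATUS.get(unsigned, "unknown status")
    return f"{exit_code} ({exit_code_to_hex(exit_code)}) {name}"


print(describe(-1073741819))
# prints: -1073741819 (0xC0000005) STATUS_ACCESS_VIOLATION
```

Masking with 0xFFFFFFFF is the usual way to get the unsigned view of a negative 32-bit number in Python, since Python integers have no fixed width.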
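
The grace-period reasoning from Grant (SSSF) earlier in the thread (watchdog timeout plus a couple of hours of slack, added to the client deadline) can be worked through in a few lines of Python. This is only a sketch of the arithmetic, not BOINC server code: the function names and the sample deadline are hypothetical, while the 10-hour watchdog and the 2-hour margin are the figures from his post.

```python
from datetime import datetime, timedelta

WATCHDOG_HOURS = 10  # watchdog timeout quoted in the thread
MARGIN_HOURS = 2     # the "couple of hours" of slack suggested in the thread


def suggested_grace_hours(watchdog_hours: int = WATCHDOG_HOURS,
                          margin_hours: int = MARGIN_HOURS) -> int:
    """Grace period = watchdog timeout plus some slack."""
    return watchdog_hours + margin_hours


def replica_creation_time(client_deadline: datetime, grace_hours: float) -> datetime:
    """Earliest moment a replacement task would be generated: deadline + grace."""
    return client_deadline + timedelta(hours=grace_hours)


# Hypothetical deadline, chosen only to make the example concrete.
deadline = datetime(2020, 5, 13, 12, 0)
grace = suggested_grace_hours()
print(grace)                                   # 12
print(replica_creation_time(deadline, grace))  # 2020-05-14 00:00:00
```

With these numbers a late result still has half a day after the deadline to be reported before a wingman is created.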
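
Raistmer's point that a Rosetta task has no fixed amount of work, and simply keeps starting new models while CPU time remains, can be illustrated with a toy loop. This is an assumption-laden sketch, not how the Rosetta application actually decides when to stop: the function name, the fixed per-model cost and the 3000-second figure are all hypothetical. Only the 28800-second target (the `-cpu_run_time` value in the quoted command lines) comes from the thread.

```python
def models_completed(target_runtime_s: int, seconds_per_model: int) -> tuple[int, int]:
    """Start a new model whenever the CPU time used so far is below the target.

    Returns (number of models finished, CPU seconds used). The last model may
    run past the target, which is why some overrun is normal.
    """
    elapsed = 0
    models = 0
    while elapsed < target_runtime_s:
        elapsed += seconds_per_model  # hypothetical fixed cost per model
        models += 1
    return models, elapsed


# 8-hour target as in the quoted command lines (-cpu_run_time 28800),
# with a hypothetical 3000 s per model.
print(models_completed(28800, 3000))  # (10, 30000)
# A 2-hour target on the same hypothetical workload.
print(models_completed(7200, 3000))   # (3, 9000)
```

The same workload produces fewer models with a shorter target, which is why lowering the target runtime shortens tasks that have already been downloaded.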
©2024 University of Washington
https://www.bakerlab.org