Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Sid Celery Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 12,941 |
From Sid Celery 9 Apr It could be a lot of things, but when I check the start of the Event log I'm finding like 44 tasks uploaded and a few coming down and online they all report with Computation errors at that time. It may be different for others, but it's been taking out every task of mine, good or bad, and crashing the whole PC. If everything's good tomorrow morning, it'll be because the Server aborted all those tasks today. Let's see if I'm right. |
PorkyPies Joined: 6 Apr 20 Posts: 45 Credit: 1,650,779 RAC: 0 |
I've been in contact with Project admins and this was a deliberate change, not a misconfiguration. I noticed the dud tasks have stopped coming down. Well done for getting them removed. I thought the increased memory and disk space requirement was deliberate. The project clearly thinks they'll have some work that needs that much memory and/or disk space. Pity for the machines that don't have more than 4GB but I guess it can't be helped unless they want to split tasks into small or large types and have different queues of work. Probably a lot of work on the project side to implement for not much gain. I've taken my 4GB Pi4s out of my Pi cluster. MarksRpiCluster |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1725 Credit: 18,378,164 RAC: 20,578 |
And as I indicated with that link I posted, you are talking about decades for normal drives under normal usage conditions. SSD Endurance Experiment I've read many articles complaining that SSDs last nowhere near as long as HDDs. A few HDDs do fail unexpectedly, but SSDs wear out, because they have a finite number of writes. They cannot possibly last longer than that time. Just as some HDDs fail before their time, so too do some SSDs. For all of the articles that complain about SSD failures, there would be just as many about HDD failures. SSD vs HDD: Which One is More Reliable? But in terms of data security, evidence of flash wear appeared after 200TB of writes for TechReport’s Solid State Drives, when their Samsung 840 Series started logging reallocated sectors. As the only TLC candidate in the bunch, this drive was expected to show the first cracks. The 840 Series didn’t encounter actual problems until 300TB, when it failed a hash check during the setup for an unpowered data retention test. The drive went on to pass that test and continue writing, but it recorded a rash of uncorrectable errors around the same time. Uncorrectable errors can compromise data integrity and system stability, so I’d recommend taking drives out of service the moment they appear. I'll take an SSD over a HDD any day. Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 12,941 |
From Sid Celery 9 Apr Partly right. No re-boot, but my entire cache showing Computation errors and a message in the Event log saying: 20/04/2021 10:05:11 | Rosetta@home | [error] Signature verification failed for database_357d5d93529_n_methyl.zip and a back-off from re-contacting the server for 24hrs. 2 steps forward, one step back... |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1725 Credit: 18,378,164 RAC: 20,578 |
I'd back off any overclocks for memory & CPU and let things run at stock for a while. Some of the errors could be due to internet/AV issues, e.g. <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> app_version download error: couldn't get input files: <file_xfer_error> <file_name>database_357d5d93529_n_methyl.zip</file_name> <error_code>-120 (RSA key check failed for file)</error_code> <error_message>signature verification failed</error_message> </file_xfer_error> </message> ]]> But the Tasks that are starting and then erroring out after a while, e.g. <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> (unknown error) - exit code 3221225477 (0xc0000005)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol fr_cart_fast.xml @fr_flags_bcov2 -in:file:silent miniprotein_relax9_SAVE_ALL_OUT_IGNORE_THE_REST_9fk2oh9e.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip miniprotein_relax9_SAVE_ALL_OUT_IGNORE_THE_REST_9fk2oh9e.zip @miniprotein_relax9_SAVE_ALL_OUT_IGNORE_THE_REST_9fk2oh9e.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3225505 Using database: database_357d5d93529_n_methylminirosetta_database Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0000000000000004 Engaging BOINC Windows Runtime Debugger... indicate some other issue. 
I've had a couple of miniprotein_relax8_ error out after a while with a similar error message <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> (unknown error) - exit code 3221225477 (0xc0000005)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol fr_cart_fast.xml @fr_flags_bcov2 -in:file:silent miniprotein_relax8_SAVE_ALL_OUT_IGNORE_THE_REST_5mm6sc7p.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip miniprotein_relax8_SAVE_ALL_OUT_IGNORE_THE_REST_5mm6sc7p.zip @miniprotein_relax8_SAVE_ALL_OUT_IGNORE_THE_REST_5mm6sc7p.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1040802 Using database: database_357d5d93529_n_methylminirosetta_database Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x00007FF736388316 read attempt to address 0xFFFFFFFF Engaging BOINC Windows Runtime Debugger..., but 95% or more of them have completed without issue. And while a few pre_helical_bundles_round1_attempt1_ error out in seconds <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> Incorrect function. 
(0x1) - exit code 1 (0x1)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pre_helix_boinc_v1.xml @helix_design.flags -in:file:silent pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.zip @pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3386203 Using database: database_357d5d93529_n_methylminirosetta_database ERROR: [ERROR] Unable to open constraints file: d13b0a13bd57de6e8dc1565c1b82259f_0001.MSAcst ERROR:: Exit from: ......srccorescoringconstraintsConstraintIO.cc line: 457 BOINC:: Error reading and gzipping output datafile: default.out 10:12:15 (5600): called boinc_finish(1) </stderr_txt> ]]> But once again, the vast majority have completed ok. I've gone from over 150 errors to just 5. Grant Darwin NT |
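
The decimal exit codes in the stderr output above are easier to compare once they are shown in hexadecimal. A minimal Python sketch, assuming only the two codes quoted in these posts; `KNOWN_CODES` and `describe_exit_code` are hypothetical names made up for illustration and are not part of BOINC or Rosetta:

```python
# Hypothetical lookup table: only the codes quoted in this thread are listed.
KNOWN_CODES = {
    0xC0000005: "Access Violation",    # label as reported in the stderr output
    0x00000001: "Incorrect function",  # label as reported in the stderr output
}


def describe_exit_code(code: int) -> str:
    """Render a task exit code as 32-bit hex, with a label when one is known."""
    unsigned = code & 0xFFFFFFFF  # also folds a negative (signed) value into 32 bits
    label = KNOWN_CODES.get(unsigned, "unknown")
    return f"0x{unsigned:08x} ({label})"


if __name__ == "__main__":
    print(describe_exit_code(3221225477))
    print(describe_exit_code(1))
```

The mask is there because the same code can show up as a negative signed number in some tools; both forms map to the same hex value.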
Sid Celery Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 12,941 |
From Brian Nixon, 31 Mar After 1 day (a very short amount of time) it appears I'm being too optimistic. Using the number of tasks In Progress as a proxy for how successful people are at downloading tasks: In March, the figure was 550k. When all the problems began, the figure dropped to around 318k - a loss of 41%. Today the figure is around 360k - loss reduced to 34.5%. Usually it's a good thing to have a large queue of tasks to run. A week ago this figure increased to over 20m tasks. After the 2 or 3 rogue task-types that were causing all the crashes were removed, this dropped to 19m. Now it seems like the change to RAM & Disk requirements will only take effect for new tasks added to the queue - the amounts showing in my client_state.xml are largely the same as before. It may take 7 or 8 weeks for 19m tasks in the current queue to be ploughed through to see the (slightly) lower resource demands. June 2021... This is me speculating after just 1 day. Hopefully I'm wrong and it's quicker than that. I'm working on the basis that "bad news early" is better than no news at all. |
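
The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration using the rounded numbers from the post (550k, "around 318k", 360k, 19m tasks, 7 or 8 weeks); the function names are hypothetical, and because the inputs are rounded the results can differ by a point or so from the percentages quoted:

```python
BASELINE_IN_PROGRESS = 550_000  # March figure quoted in the post


def loss_percent(current: int, baseline: int = BASELINE_IN_PROGRESS) -> float:
    """Percentage drop in tasks In Progress relative to the baseline."""
    return round(100 * (baseline - current) / baseline, 1)


def implied_daily_rate(queue_size: int, weeks: float) -> float:
    """Tasks per day that would clear queue_size in the given number of weeks."""
    return queue_size / (weeks * 7)


if __name__ == "__main__":
    for current in (318_000, 360_000):
        print(f"{current:,} in progress -> loss of {loss_percent(current)}%")
    for weeks in (7, 8):
        rate = implied_daily_rate(19_000_000, weeks)
        print(f"{weeks} weeks -> about {rate:,.0f} tasks per day")
```

Read the other way round, the 7 to 8 week guess corresponds to the project returning somewhere in the region of 340k to 390k tasks per day.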
Sid Celery Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 12,941 |
I'd back off any overclocks for memory & CPU and let things run at stock for a while. Is this directed at me? If so, yes, I've assumed some of my problems are of my own making. I'm edging things down every couple of days and I've got a particular setting I'm looking to move down a lot the next chance I get. My temps are abnormally high atm, so I have to fix that. I've had a couple of miniprotein_relax8_ error out after a while with a similar error message Haven't all those tasks been aborted by the server now? And while a few pre_helical_bundles_round1_attempt1_ error out in seconds <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1)</message> <stderr_txt> command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pre_helix_boinc_v1.xml @helix_design.flags -in:file:silent pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.zip @pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3386203 Using database: database_357d5d93529_n_methylminirosetta_database ERROR: [ERROR] Unable to open constraints file: d13b0a13bd57de6e8dc1565c1b82259f_0001.MSAcst ERROR:: Exit from: ......srccorescoringconstraintsConstraintIO.cc line: 457 BOINC:: Error reading and gzipping output datafile: default.out 10:12:15 (5600): called boinc_finish(1) </stderr_txt> ]]> I've reported that as well. 
Some crash out within 20secs with a Computation error, while others stop short after 7 or 8mins but validate as if nothing went wrong. But both report errors, which is weird. ERROR: [ERROR] Unable to open constraints file: e1096e175045f039d630a9b7543a561f_0001.MSAcst |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 12,028 |
You don't have moods?! I'm a very calm person actually. The only mood I get in here is amused when people get upset over nothing. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 12,028 |
In addition, tasks with the names "miniprotein_relax8" and "_abinitio_1_abinitio_" have been deleted from the queue and another bad batch they noticed before we informed them of these two. That's odd, I've never had a computer crash due to a faulty task from any project. A whole machine going down from one program error, that's a Windows XP problem isn't it? |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 12,028 |
The only reboots I've had is that criminally auto-rebooting Windows 10. I've thwarted that though. My updates are "managed by my organisation" or so it thinks. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 12,028 |
Depends what you mean by normal. Mine has a security camera recording onto it, two graphics cards and a 24 core CPU doing Boinc, I record TV to it, .... I guess there are some people who just play solitaire and use email, those might last that long. And as I indicated with that link I posted, you are talking about decades for normal drives under normal usage conditions. SSD Endurance Experiment I've read many articles complaining that SSDs last nowhere near as long as HDDs. A few HDDs do fail unexpectedly, but SSDs wear out, because they have a finite number of writes. They cannot possibly last longer than that time. |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,214,047 RAC: 1,768 |
That's funny....you actually thinking MS gives a crap about what YOU, or your organization, wants to do with THEIR software. I hope it works for you, I really really do, but past history suggests MS just ups the priority of their updates and you get unwanted ones anyway because it serves their tracking needs. |
Sid Celery Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 12,941 |
In addition, tasks with the names "miniprotein_relax8" and "_abinitio_1_abinitio_" have been deleted from the queue and another bad batch they noticed before we informed them of these two. That's odd, I've never had a computer crash due to a faulty task from any project. A whole machine going down from one program error, that's a Windows XP problem isn't it? It never did with my previous PC - and after the removal of these tasks it didn't happen last night either - but while those particular tasks were running and crashing, they took out every other task of any type and the whole PC with it. Maybe it's just me. Anyway, it seems to have stopped now. |
Sid Celery Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 12,941 |
Using the number of tasks In Progress as a proxy for how successful people are at downloading tasks Currently 384k in progress - loss reduced to 30% <guessing> maybe back-up project tasks are being replaced by Rosetta? Every little helps |
Sid Celery Joined: 11 Feb 08 Posts: 2140 Credit: 41,518,559 RAC: 12,941 |
I've been in contact with Project admins and this was a deliberate change, not a misconfiguration. There had been some talk of larger tasks for more capable machines in the past. You may well be right that it was an attempt to provide them. But on machines with lower available resources, they seem not to get anything rather than only being offered low-resource-reqt tasks. And now it seems <everything> needs large resources. I'm sure there's a better way of implementing the provision of appropriately-sized tasks, but no-one's hit on it yet. Perhaps it needs info from the host requesting tasks first. But I'm guessing again. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1725 Credit: 18,378,164 RAC: 20,578 |
They were still going through yesterday, but given the low percentage of errors I didn't consider them to be an issue. That you did have such a high number of errors indicated that there was something going on with your system. I've had a couple of miniprotein_relax8_ error out after a while with a similar error message Haven't all those tasks been aborted by the server now? Grant Darwin NT |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1725 Credit: 18,378,164 RAC: 20,578 |
The disk I/O from BOINC projects is bugger all as a factor of DWPD (Drive Writes Per Day), even for a system with 64 cores/128 threads all in use. Depends what you mean by normal. Mine has a security camera recording onto it, two graphics cards and a 24 core CPU doing Boinc, I record TV to it, .... I guess there are some people who just play solitaire and use email, those might last that long. And as I indicated with that link I posted, you are talking about decades for normal drives under normal usage conditions. SSD Endurance Experiment I've read many articles complaining that SSDs last nowhere near as long as HDDs. A few HDDs do fail unexpectedly, but SSDs wear out, because they have a finite number of writes. They cannot possibly last longer than that time. And SSDs used for recording video streams 24/7 will also last just as long if they have plenty of free space (30% or more) to allow for garbage collection & wear levelling to occur as needed. Grant Darwin NT |
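
To put rough numbers on the DWPD point, here is a small Python sketch. The 200TB threshold is the figure quoted earlier in the thread for the first signs of wear in the endurance experiment; the 20 GB/day workload and 500 GB capacity are assumptions picked purely for illustration, not measurements of any BOINC project, and the function names are hypothetical:

```python
def dwpd(daily_writes_gb: float, capacity_gb: float) -> float:
    """Drive Writes Per Day: fraction of the drive's capacity written each day."""
    return daily_writes_gb / capacity_gb


def years_to_reach(total_writes_tb: float, daily_writes_gb: float) -> float:
    """Years of writing daily_writes_gb per day before total_writes_tb is reached."""
    days = (total_writes_tb * 1000) / daily_writes_gb  # 1 TB taken as 1000 GB
    return days / 365


if __name__ == "__main__":
    # Assumed workload: 20 GB/day on a 500 GB drive (illustrative only).
    print(f"DWPD: {dwpd(20, 500):.2f}")
    print(f"Years to 200TB written: {years_to_reach(200, 20):.1f}")
```

Swapping in a heavier figure for the daily writes (camera footage, TV recording) shortens the estimate proportionally, which is really what the disagreement above comes down to.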
Grant (SSSF) Joined: 28 Mar 20 Posts: 1725 Credit: 18,378,164 RAC: 20,578 |
I'm sure there's a better way of implementing the provision of appropriately-sized tasks, but no-one's hit on it yet. There's a simple quick & dirty method that would be easy for the project to implement. The present application is v4.2x. The project compiles another copy, exactly the same, calls it v5.2x, and uses that one for processing large RAM requirement Tasks. In the Rosetta@home preferences they give the option of which version to run. The default for current & new users is v4.2x. People can choose to also process large RAM Tasks by selecting v5.2x, e.g. Default settings: Run only the selected applications: Rosetta v4: yes, Rosetta v5: no. If no work for selected applications is available, accept work from other applications? no. Settings for those that choose to run large RAM Tasks: Run only the selected applications: Rosetta v4: yes, Rosetta v5: yes. If no work for selected applications is available, accept work from other applications? no. People can also choose to run just the one type, but do the other type if their preferred type isn't available at the time they request work, by setting the bottom line "If no work..." to yes. When a Work Unit is created, the researcher flags which application needs to be used to process it: Regular or large RAM requirement. That way any Task that requires large amounts of RAM will only go to systems that are capable of handling it (if the user pays attention to the requirements before selecting the option to do those types of Tasks....). Of course when they move beyond v4, they'd need to go to v6 for regular Tasks, and v7 for large RAM Tasks, and update the Rosetta preferences page, and let people know what's happening beforehand. Grant Darwin NT |
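
The selection rules in that proposal can be written down as a short Python sketch. This is only one reading of the idea and not how the BOINC scheduler is actually implemented; `Preferences`, `eligible_apps` and the app labels "v4" and "v5" are hypothetical names invented for the example:

```python
from dataclasses import dataclass


@dataclass
class Preferences:
    run_v4: bool = True         # regular Tasks (the proposed default)
    run_v5: bool = False        # large RAM Tasks (opt-in)
    accept_other: bool = False  # "If no work for selected applications..."


def eligible_apps(prefs: Preferences, available: set) -> set:
    """Return the apps a host may be sent work for.

    available is the set of app labels that currently have work queued.
    """
    selected = set()
    if prefs.run_v4:
        selected.add("v4")
    if prefs.run_v5:
        selected.add("v5")

    preferred = selected & available
    if preferred:
        return preferred
    # Nothing available for the selected apps: fall back only if allowed.
    return set(available) if prefs.accept_other else set()


if __name__ == "__main__":
    print(eligible_apps(Preferences(), {"v4", "v5"}))
    print(eligible_apps(Preferences(run_v4=False, run_v5=True, accept_other=True), {"v4"}))
```

With the defaults a host never receives large RAM work, which is the point of the scheme: anything flagged for the second app only reaches hosts whose owners opted in.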
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 12,028 |
It's my computer and they can't make me do anything, including pay for it. |
Mr P Hucker Joined: 12 Aug 06 Posts: 1600 Credit: 12,116,986 RAC: 12,028 |
It could also be people manually doing other things. I sometimes like to concentrate on one project. If that runs out of work, I'll pick another and might not be back for a while. Somebody just knocked me into 3rd place elsewhere, this will not do, back in a week.... Using the number of tasks In Progress as a proxy for how successful people are at downloading tasks |
©2024 University of Washington
https://www.bakerlab.org