Problems and Technical Issues with Rosetta@home

Sid Celery

Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 12,941
Message 101380 - Posted: 20 Apr 2021, 0:25:18 UTC - in response to Message 101379.  

From Sid Celery 9 Apr

I've regularly found my own PCs have rebooted overnight due to these faulty tasks.


I've never considered that being the cause of a reboot before... hmmmmm light bulb going off icon needed!!!

It could be a lot of things, but when I check the start of the Event log I'm finding like 44 tasks uploaded and a few coming down and online they all report with Computation errors at that time.
It may be different for others, but it's been taking out every task of mine, good or bad, and crashing the whole PC.

If everything's good tomorrow morning, it'll be because the Server aborted all those tasks today. Let's see if I'm right.
ID: 101380 · Rating: 0
PorkyPies

Joined: 6 Apr 20
Posts: 45
Credit: 1,650,779
RAC: 0
Message 101382 - Posted: 20 Apr 2021, 5:34:11 UTC - in response to Message 101373.  
Last modified: 20 Apr 2021, 5:37:27 UTC

I've been in contact with Project admins and this was a deliberate change, not a misconfiguration.
It's been looked at more closely and brought down to a figure nearer 4GB - hopefully we see the result of that soon.
I note In Progress tasks are edging up, but let's see how that pans out.

There was obviously a need for that change, but I don't know what it is.
I've asked if a brief note can be posted to explain what they're working on that requires the increase.
No idea when or if that will happen.

I noticed the dud tasks have stopped coming down. Well done for getting them removed.

I thought the increased memory and disk space requirement was deliberate. The project clearly thinks they'll have some work that needs that much memory and/or disk space. Pity for the machines that don't have more than 4GB, but I guess it can't be helped unless they want to split tasks into small and large types and have different queues of work. That's probably a lot of work on the project side to implement for not much gain. I've taken my 4GB Pi 4s out of my Pi cluster.
MarksRpiCluster
ID: 101382 · Rating: 0
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1725
Credit: 18,378,164
RAC: 20,578
Message 101385 - Posted: 20 Apr 2021, 7:06:52 UTC - in response to Message 101368.  

SSD Endurance Experiment
I've read many articles complaining that SSDs last nowhere near as long as HDDs. A few HDDs do fail unexpectedly, but SSDs wear out, because they have a finite number of writes. They cannot possibly last longer than that time.
And as I indicated with that link I posted, you are talking about decades for normal drives under normal usage conditions.
Just as some HDDs fail before their time, so too do some SSDs.

For all of the articles that complain about SSD failures, there would be just as many about HDD failures.

SSD vs HDD: Which One is More Reliable?
But in terms of data security, evidence of flash wear appeared after 200TB of writes for TechReport’s Solid State Drives, when their Samsung 840 Series started logging reallocated sectors. As the only TLC candidate in the bunch, this drive was expected to show the first cracks. The 840 Series didn’t encounter actual problems until 300TB, when it failed a hash check during the setup for an unpowered data retention test. The drive went on to pass that test and continue writing, but it recorded a rash of uncorrectable errors around the same time. Uncorrectable errors can compromise data integrity and system stability, so I’d recommend taking drives out of service the moment they appear.

Recalculating the limit until data becomes compromised at 300TB, an SSD like the Samsung 840 Series is theoretically reliable up to 21.4 years. Compare that to the fact that an HD drive is 50% likely to fail after 6 years.
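The 21.4-year figure only makes sense alongside a daily write volume, which the quoted passage doesn't state. A rough sketch of the arithmetic, assuming decimal units (1 TB = 1000 GB) and a constant write rate; the function names are mine, not the article's:

```python
def implied_daily_writes_gb(endurance_tb: float, lifetime_years: float) -> float:
    """Daily write volume (GB) that uses up `endurance_tb` in `lifetime_years`."""
    return endurance_tb * 1000 / (lifetime_years * 365)


def years_until_worn(endurance_tb: float, daily_writes_gb: float) -> float:
    """Years until `endurance_tb` of writes is reached at a constant daily rate."""
    return endurance_tb * 1000 / (daily_writes_gb * 365)


if __name__ == "__main__":
    rate = implied_daily_writes_gb(300, 21.4)
    print(f"300 TB over 21.4 years is about {rate:.1f} GB written per day")
    # 100 GB/day is an arbitrary heavier workload, chosen only for comparison.
    print(f"At 100 GB/day, 300 TB lasts about {years_until_worn(300, 100):.1f} years")
```

So the article's number seems to assume somewhere under 40 GB of writes a day; a heavier workload shortens the estimate proportionally.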
I'll take an SSD over a HDD any day.
Grant
Darwin NT
ID: 101385 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 12,941
Message 101387 - Posted: 20 Apr 2021, 9:51:37 UTC - in response to Message 101380.  

From Sid Celery 9 Apr
I've regularly found my own PCs have rebooted overnight due to these faulty tasks.

I've never considered that being the cause of a reboot before... hmmmmm light bulb going off icon needed!!!

It could be a lot of things, but when I check the start of the Event log I'm finding like 44 tasks uploaded and a few coming down and online they all report with Computation errors at that time.
It may be different for others, but it's been taking out every task of mine, good or bad, and crashing the whole PC.

If everything's good tomorrow morning, it'll be because the Server aborted all those tasks today. Let's see if I'm right.

Partly right.
No re-boot, but my entire cache showing Computation errors and a message in the Event log saying:
20/04/2021 10:05:11 | Rosetta@home | [error] Signature verification failed for database_357d5d93529_n_methyl.zip

and a back-off from re-contacting the server for 24hrs
2 steps forward, one step back...
ID: 101387 · Rating: 0
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1725
Credit: 18,378,164
RAC: 20,578
Message 101389 - Posted: 20 Apr 2021, 10:07:52 UTC

I'd back off any overclocks for memory & CPU and let things run at stock for a while.
Some of the errors could be due to internet/AV issues eg
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
  <file_name>database_357d5d93529_n_methyl.zip</file_name>
  <error_code>-120 (RSA key check failed for file)</error_code>
  <error_message>signature verification failed</error_message>
</file_xfer_error>
</message>
]]>

But the Tasks that are starting and then erroring out after a while eg
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 3221225477 (0xc0000005)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol fr_cart_fast.xml @fr_flags_bcov2 -in:file:silent miniprotein_relax9_SAVE_ALL_OUT_IGNORE_THE_REST_9fk2oh9e.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip miniprotein_relax9_SAVE_ALL_OUT_IGNORE_THE_REST_9fk2oh9e.zip @miniprotein_relax9_SAVE_ALL_OUT_IGNORE_THE_REST_9fk2oh9e.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3225505
Using database: database_357d5d93529_n_methyl\minirosetta_database


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0000000000000004 

Engaging BOINC Windows Runtime Debugger...
indicate some other issue.
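For anyone puzzled by the big decimal exit code: 3221225477 is just 0xC0000005 written in decimal, which Windows uses for an access violation. A small sketch for decoding these; the lookup table is a short hand-picked list of common NTSTATUS values, not a complete one, and the function name is made up for illustration:

```python
# Hand-picked subset of Windows NTSTATUS values; deliberately incomplete.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}


def describe_exit_code(code: int) -> str:
    """Show an exit code in decimal and hex, with a name if it is in the table."""
    code &= 0xFFFFFFFF  # some tools report the same value as a negative number
    name = KNOWN_NTSTATUS.get(code, "unknown")
    return f"{code} = 0x{code:08X} ({name})"


if __name__ == "__main__":
    print(describe_exit_code(3221225477))
    print(describe_exit_code(1))
```

An exit code of 1, as in the constraints-file failures, is the application exiting on its own error rather than Windows killing it.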






I've had a couple of miniprotein_relax8_ error out after a while with a similar error message
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code 3221225477 (0xc0000005)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol fr_cart_fast.xml @fr_flags_bcov2 -in:file:silent miniprotein_relax8_SAVE_ALL_OUT_IGNORE_THE_REST_5mm6sc7p.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip miniprotein_relax8_SAVE_ALL_OUT_IGNORE_THE_REST_5mm6sc7p.zip @miniprotein_relax8_SAVE_ALL_OUT_IGNORE_THE_REST_5mm6sc7p.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1040802
Using database: database_357d5d93529_n_methyl\minirosetta_database


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00007FF736388316 read attempt to address 0xFFFFFFFF

Engaging BOINC Windows Runtime Debugger...
, but 95% or more of them have completed without issue.


And while a few pre_helical_bundles_round1_attempt1_ error out in seconds
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
Incorrect function.
 (0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pre_helix_boinc_v1.xml @helix_design.flags -in:file:silent pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.zip @pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3386203
Using database: database_357d5d93529_n_methyl\minirosetta_database

ERROR: [ERROR] Unable to open constraints file: d13b0a13bd57de6e8dc1565c1b82259f_0001.MSAcst
ERROR:: Exit from: ..\..\..\src\core\scoring\constraints\ConstraintIO.cc line: 457
BOINC:: Error reading and gzipping output datafile: default.out
10:12:15 (5600): called boinc_finish(1)

</stderr_txt>
]]>

But once again, the vast majority have completed ok.

I've gone from over 150 errors to just 5.
Grant
Darwin NT
ID: 101389 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 12,941
Message 101390 - Posted: 20 Apr 2021, 10:31:41 UTC - in response to Message 101373.  

From Brian Nixon, 31 Mar
I've had no issues with insufficient disk space or memory.
This points to a misconfiguration of the new batch of work units, as it seems unlikely it would be the project’s intention to cut off a third of its capacity…

Look in client_state.xml for the rsc_memory_bound and rsc_disk_bound settings of the new work units: they used to be 1,800,000,000 each; to yield the errors people are reporting they must now be set to 7,000,000,000 and 9,000,000,000.

Brian, I looked at my client_state.xml file and, as you speculated(?), those are the figures showing there.
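For anyone who wants to check their own figures without scrolling through the file by hand, here is a minimal sketch. The embedded sample is a made-up, heavily trimmed stand-in for client_state.xml (the workunit names are hypothetical); only the rsc_memory_bound and rsc_disk_bound element names and the values come from the discussion above. It assumes the real file parses as well-formed XML, which I haven't verified for every client version.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for client_state.xml; a real file has
# many more elements inside each <workunit>.
SAMPLE = """
<client_state>
  <workunit>
    <name>example_old_wu</name>
    <rsc_memory_bound>1800000000.000000</rsc_memory_bound>
    <rsc_disk_bound>1800000000.000000</rsc_disk_bound>
  </workunit>
  <workunit>
    <name>example_new_wu</name>
    <rsc_memory_bound>7000000000.000000</rsc_memory_bound>
    <rsc_disk_bound>9000000000.000000</rsc_disk_bound>
  </workunit>
</client_state>
"""


def workunit_bounds(xml_text):
    """Return (name, memory_bound_bytes, disk_bound_bytes) for each <workunit>."""
    root = ET.fromstring(xml_text)
    rows = []
    for wu in root.iter("workunit"):
        name = wu.findtext("name", default="?")
        mem = float(wu.findtext("rsc_memory_bound", default="0"))
        disk = float(wu.findtext("rsc_disk_bound", default="0"))
        rows.append((name, mem, disk))
    return rows


if __name__ == "__main__":
    for name, mem, disk in workunit_bounds(SAMPLE):
        print(f"{name}: RAM bound {mem / 1e9:.1f} GB, disk bound {disk / 1e9:.1f} GB")
```

To try it on a real machine, read the contents of client_state.xml from the BOINC data directory and pass the text to workunit_bounds instead of SAMPLE.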

I've been in contact with Project admins and this was a deliberate change, not a misconfiguration.
It's been looked at more closely and brought down to a figure nearer 4GB - hopefully we see the result of that soon.
I note In Progress tasks are edging up, but let's see how that pans out.

After 1 day (a very short amount of time) it appears I'm being too optimistic.

Using the number of tasks In Progress as a proxy for how successful people are at downloading tasks
In March, the figure was 550k
When all the problems began, the figure dropped to around 318k - a loss of 41%
Today the figure is around 360k - loss reduced to 34.5%
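The percentages are simply the drop relative to the March baseline. A quick sketch using the rounded figures quoted above; note that with those rounded numbers the first loss comes out nearer 42% than 41%, presumably because the counts are approximate:

```python
def percent_loss(baseline: float, current: float) -> float:
    """Drop from `baseline` to `current`, as a percentage of `baseline`."""
    return (baseline - current) / baseline * 100


if __name__ == "__main__":
    BASELINE = 550_000  # In Progress tasks in March, as quoted
    for label, in_progress in [("problems began", 318_000), ("today", 360_000)]:
        loss = percent_loss(BASELINE, in_progress)
        print(f"{label}: {in_progress:,} in progress, loss {loss:.1f}%")
```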

Usually it's a good thing to have a large queue of tasks to run. A week ago the queue grew to over 20m tasks.
After the 2 or 3 rogue task-types that were causing all the crashes were removed, this dropped to 19m.
Now it seems like the change to RAM & Disk requirements will only take effect for new tasks added to the queue - the amounts showing in my client_state.xml are largely the same as before.
It may take 7 or 8 weeks for 19m tasks in the current queue to be ploughed through to see the (slightly) lower resource demands. June 2021...
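The 7-or-8-week guess implies a completion rate that isn't stated anywhere. A back-of-envelope sketch, assuming the queue is worked through oldest-first at a steady rate and ignoring resends; the function names are only for illustration:

```python
def implied_daily_rate(queue_size: float, weeks: float) -> float:
    """Tasks per day needed to clear `queue_size` tasks in `weeks` weeks."""
    return queue_size / (weeks * 7)


def weeks_to_drain(queue_size: float, tasks_per_day: float) -> float:
    """Weeks needed to clear `queue_size` tasks at a steady daily rate."""
    return queue_size / tasks_per_day / 7


if __name__ == "__main__":
    QUEUE = 19_000_000  # queued tasks, as quoted
    for weeks in (7, 8):
        rate = implied_daily_rate(QUEUE, weeks)
        print(f"{weeks} weeks implies about {rate:,.0f} tasks completed per day")
```

If the real daily completion count is lower than that, the wait is longer, and weeks_to_drain gives the figure for any rate you care to plug in.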

This is me speculating after just 1 day. Hopefully I'm wrong and it's quicker than that.
I'm working on the basis that "bad news early" is better than no news at all.
ID: 101390 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 12,941
Message 101391 - Posted: 20 Apr 2021, 10:59:10 UTC - in response to Message 101389.  

I'd back off any overclocks for memory & CPU and let things run at stock for a while.
Some of the errors could be due to internet/AV issues eg
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
  <file_name>database_357d5d93529_n_methyl.zip</file_name>
  <error_code>-120 (RSA key check failed for file)</error_code>
  <error_message>signature verification failed</error_message>
</file_xfer_error>
</message>
]]>

Is this directed at me?
If so, yes, I've assumed some of my problems are of my own making. I'm edging things down every couple of days and I've got a particular setting I'm looking to move down a lot the next chance I get.
My temps are abnormally high atm, so I have to fix that.

I've had a couple of miniprotein_relax8_ error out after a while with a similar error message

Haven't all those tasks been aborted by the server now?

And while a few pre_helical_bundles_round1_attempt1_ error out in seconds
<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
Incorrect function.
 (0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe -run:protocol jd2_scripting -parser:protocol pre_helix_boinc_v1.xml @helix_design.flags -in:file:silent pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.silent -in:file:silent_struct_type binary -silent_gz -mute all -silent_read_through_errors true -out:file:silent_struct_type binary -out:file:silent default.out -in:file:boinc_wu_zip pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.zip @pre_helical_bundles_round1_attempt1_SAVE_ALL_OUT_IGNORE_THE_REST_7tc3qf4n.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 3386203
Using database: database_357d5d93529_n_methyl\minirosetta_database

ERROR: [ERROR] Unable to open constraints file: d13b0a13bd57de6e8dc1565c1b82259f_0001.MSAcst
ERROR:: Exit from: ..\..\..\src\core\scoring\constraints\ConstraintIO.cc line: 457
BOINC:: Error reading and gzipping output datafile: default.out
10:12:15 (5600): called boinc_finish(1)

</stderr_txt>
]]>

But once again, the vast majority have completed ok.

I've gone from over 150 errors to just 5.

I've reported that as well. Some crash out within 20 secs with a Computation error, while others stop short after 7 or 8 mins but are validated as if nothing went wrong.
But both report errors, which is weird.
ERROR: [ERROR] Unable to open constraints file: e1096e175045f039d630a9b7543a561f_0001.MSAcst
ERROR:: Exit from: ..\..\..\src\core\scoring\constraints\ConstraintIO.cc line: 457

ID: 101391 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 12,028
Message 101400 - Posted: 20 Apr 2021, 17:40:30 UTC - in response to Message 101372.  

You don't have moods?!
Not only do you have moods, sometimes they're arsey - that is, more than one.
Never mind, though. I wouldn't want you to get moody over my facts and opinions... lol

Let's go back to you making a good point - then everyone's happy
I'm a very calm person actually. The only mood I get in here is amused when people get upset over nothing.
ID: 101400 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 12,028
Message 101401 - Posted: 20 Apr 2021, 17:43:01 UTC - in response to Message 101378.  

In addition, tasks with the names "miniprotein_relax8" and "_abinitio_1_abinitio_" have been deleted from the queue, along with another bad batch they noticed before we informed them of these two.
Hopefully we'll all see a lot fewer crashes than we have recently.
I've regularly found my own PCs have rebooted overnight due to these faulty tasks.
That's odd, I've never had a computer crash due to a faulty task from any project. A whole machine going down from one program error, that's a Windows XP problem isn't it?
ID: 101401 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 12,028
Message 101402 - Posted: 20 Apr 2021, 17:44:13 UTC - in response to Message 101379.  

From Sid Celery 9 Apr

I've regularly found my own PCs have rebooted overnight due to these faulty tasks.


I've never considered that being the cause of a reboot before...hmmmmm light bulb going off icon needed!!!
The only reboots I've had are from that criminally auto-rebooting Windows 10. I've thwarted that though. My updates are "managed by my organisation" or so it thinks.
ID: 101402 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 12,028
Message 101403 - Posted: 20 Apr 2021, 17:46:48 UTC - in response to Message 101385.  
Last modified: 20 Apr 2021, 17:47:09 UTC

SSD Endurance Experiment
I've read many articles complaining that SSDs last nowhere near as long as HDDs. A few HDDs do fail unexpectedly, but SSDs wear out, because they have a finite number of writes. They cannot possibly last longer than that time.
And as I indicated with that link I posted, you are talking about decades for normal drives under normal usage conditions.
Depends what you mean by normal. Mine has a security camera recording onto it, two graphics cards and a 24 core CPU doing Boinc, I record TV to it, .... I guess there are some people who just play solitaire and use email, those might last that long.
ID: 101403 · Rating: 0
mikey
Joined: 5 Jan 06
Posts: 1895
Credit: 9,214,047
RAC: 1,768
Message 101405 - Posted: 20 Apr 2021, 21:36:21 UTC - in response to Message 101402.  

From Sid Celery 9 Apr

I've regularly found my own PCs have rebooted overnight due to these faulty tasks.


I've never considered that being the cause of a reboot before...hmmmmm light bulb going off icon needed!!!


The only reboots I've had are from that criminally auto-rebooting Windows 10. I've thwarted that though. My updates are "managed by my organisation" or so it thinks.


That's funny... you actually thinking MS gives a crap about what YOU, or your organization, want to do with THEIR software. I hope it works for you, I really, really do, but past history suggests MS just ups the priority of their updates and you get unwanted ones anyway because it serves their tracking needs.
ID: 101405 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 12,941
Message 101410 - Posted: 21 Apr 2021, 0:02:40 UTC - in response to Message 101401.  

In addition, tasks with the names "miniprotein_relax8" and "_abinitio_1_abinitio_" have been deleted from the queue, along with another bad batch they noticed before we informed them of these two.
Hopefully we'll all see a lot fewer crashes than we have recently.
I've regularly found my own PCs have rebooted overnight due to these faulty tasks.
That's odd, I've never had a computer crash due to a faulty task from any project. A whole machine going down from one program error, that's a Windows XP problem isn't it?

It never did with my previous PC - and after the removal of these tasks it didn't happen last night either - but while those particular tasks were running and crashing, they took out every other task of any type and the whole PC with it.
Maybe it's just me.

Anyway, it seems to have stopped now
ID: 101410 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 12,941
Message 101411 - Posted: 21 Apr 2021, 0:08:50 UTC - in response to Message 101390.  

Using the number of tasks In Progress as a proxy for how successful people are at downloading tasks
In March, the figure was 550k
When all the problems began, the figure dropped to around 318k - a loss of 41%
Today the figure is around 360k - loss reduced to 34.5%

Currently 384k in progress - loss reduced to 30%
<guessing> maybe back-up project tasks are being replaced by Rosetta? Every little helps
ID: 101411 · Rating: 0
Sid Celery

Joined: 11 Feb 08
Posts: 2140
Credit: 41,518,559
RAC: 12,941
Message 101412 - Posted: 21 Apr 2021, 0:39:34 UTC - in response to Message 101382.  

I've been in contact with Project admins and this was a deliberate change, not a misconfiguration.
It's been looked at more closely and brought down to a figure nearer 4GB - hopefully we see the result of that soon.
I note In Progress tasks are edging up, but let's see how that pans out.

There was obviously a need for that change, but I don't know what it is.
I've asked if a brief note can be posted to explain what they're working on that requires the increase.
No idea when or if that will happen.

I noticed the dud tasks have stopped coming down. Well done for getting them removed.

I thought the increased memory and disk space requirement was deliberate. The project clearly thinks they'll have some work that needs that much memory and/or disk space. Pity for the machines that don't have more than 4GB, but I guess it can't be helped unless they want to split tasks into small and large types and have different queues of work. That's probably a lot of work on the project side to implement for not much gain. I've taken my 4GB Pi 4s out of my Pi cluster.

There had been some talk of larger tasks for more capable machines in the past. You may well be right that it was an attempt to provide them.
But on machines with lower available resources, they seem not to get anything rather than only being offered low-resource-requirement tasks.
And now it seems <everything> needs large resources.

I'm sure there's a better way of implementing the provision of appropriately-sized tasks, but no-one's hit on it yet.
Perhaps it needs info from the host requesting tasks first. But I'm guessing again.
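To make the "info from the host first" idea concrete, here is a purely hypothetical sketch: the host reports what it can spare, and only tasks whose bounds fit are offered. The class and function names are invented, the host figures are made up, and this is not a description of how the BOINC scheduler actually decides; the task bounds reuse the figures quoted earlier in the thread.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    memory_bound: float  # bytes the task may need in RAM
    disk_bound: float    # bytes the task may need on disk


@dataclass
class Host:
    usable_ram: float   # bytes the host is willing to give to tasks
    usable_disk: float  # bytes of disk the host is willing to give


def tasks_that_fit(host: Host, tasks: list) -> list:
    """Keep only tasks whose RAM and disk bounds fit within the host's limits."""
    return [
        t for t in tasks
        if t.memory_bound <= host.usable_ram and t.disk_bound <= host.usable_disk
    ]


if __name__ == "__main__":
    queue = [Task("small", 1.8e9, 1.8e9), Task("large", 7e9, 9e9)]
    pi4 = Host(usable_ram=3.5e9, usable_disk=20e9)  # made-up figures for a 4GB machine
    print([t.name for t in tasks_that_fit(pi4, queue)])
```

The point is only that a smaller machine could still be offered the tasks that fit, instead of getting nothing.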
ID: 101412 · Rating: 0
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1725
Credit: 18,378,164
RAC: 20,578
Message 101414 - Posted: 21 Apr 2021, 7:33:34 UTC - in response to Message 101391.  

I've had a couple of miniprotein_relax8_ error out after a while with a similar error message
Haven't all those tasks been aborted by the server now?
They were still going through yesterday, but given the low percentage of errors I didn't consider them to be an issue. That you did have such a high number of errors indicated that there was something going on with your system.
Grant
Darwin NT
ID: 101414 · Rating: 0
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1725
Credit: 18,378,164
RAC: 20,578
Message 101415 - Posted: 21 Apr 2021, 7:41:06 UTC - in response to Message 101403.  

SSD Endurance Experiment
I've read many articles complaining that SSDs last nowhere near as long as HDDs. A few HDDs do fail unexpectedly, but SSDs wear out, because they have a finite number of writes. They cannot possibly last longer than that time.
And as I indicated with that link I posted, you are talking about decades for normal drives under normal usage conditions.
Depends what you mean by normal. Mine has a security camera recording onto it, two graphics cards and a 24 core CPU doing Boinc, I record TV to it, .... I guess there are some people who just play solitaire and use email, those might last that long.
The disk I/O from BOINC projects is bugger all as a factor of DWPD (Drive Writes Per Day), even for a system with 64 cores/128 threads all in use.
And SSDs used for recording video streams 24/7 will also last just as long if they have plenty of free space (30% or more) to allow for garbage collection & wear levelling to occur as needed.
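For reference, DWPD is just the amount written per day divided by the drive's capacity. A small sketch with made-up numbers (I don't have measured BOINC write volumes to hand); the 30% free-space threshold is the figure mentioned above, and the function names are only illustrative:

```python
def dwpd(bytes_written_per_day: float, drive_capacity_bytes: float) -> float:
    """Drive Writes Per Day: how many times the full capacity is written daily."""
    return bytes_written_per_day / drive_capacity_bytes


def has_headroom(free_bytes: float, drive_capacity_bytes: float,
                 min_free_fraction: float = 0.30) -> bool:
    """True if at least `min_free_fraction` of the drive is free."""
    return free_bytes / drive_capacity_bytes >= min_free_fraction


if __name__ == "__main__":
    capacity = 500e9  # hypothetical 500 GB SSD
    writes = 20e9     # hypothetical 20 GB written per day
    print(f"DWPD: {dwpd(writes, capacity):.2f}")
    print(f"Enough free space: {has_headroom(200e9, capacity)}")
```

Compare the result with the DWPD rating on the drive's spec sheet to see how far under the rated load a given workload sits.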
Grant
Darwin NT
ID: 101415 · Rating: 0
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1725
Credit: 18,378,164
RAC: 20,578
Message 101416 - Posted: 21 Apr 2021, 7:50:04 UTC - in response to Message 101412.  
Last modified: 21 Apr 2021, 8:08:58 UTC

I'm sure there's a better way of implementing the provision of appropriately-sized tasks, but no-one's hit on it yet.
Perhaps it needs info from the host requesting tasks first. But I'm guessing again.
There's a simple quick & dirty method that would be easy for the project to implement.
The present application is v4.2x
The project compiles another copy, exactly the same, and calls it v5.2x and uses that one for processing large RAM requirement Tasks.

In the Rosetta@home preferences they give the option of which version to run. The default for current & new users is v4.2x
People can choose to also process large RAM tasks by selecting v5.2x

eg
Default settings
                                         Run only the selected applications Rosetta v4: yes
                                                                            Rosetta v5: no
If no work for selected applications is available, accept work from other applications? no


Settings for those that choose to run large RAM Tasks.
                                         Run only the selected applications Rosetta v4: yes
                                                                            Rosetta v5: yes
If no work for selected applications is available, accept work from other applications? no

People can also choose to run just the one type, but do the other type if their preferred type isn't available at the time they request work, by setting the bottom line "If no work..." to yes.

When a Work Unit is created, the researcher flags which application needs to be used to process it: regular or large RAM requirement. That way any Task that requires large amounts of RAM will only go to systems that are capable of handling it (if the user pays attention to the requirements before selecting the option to do those types of Tasks....).
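Roughly, the selection logic described above could look like the sketch below. Everything here is hypothetical: the app names "rosetta_v4" and "rosetta_v5", the dictionary layout and the function are invented to illustrate the idea, and it is not the project's or BOINC's actual scheduler code.

```python
def pick_work(selected_apps: set, accept_other: bool, queue: list) -> list:
    """Return the work units a host would be offered.

    selected_apps: apps ticked in the preferences, e.g. {"rosetta_v4"}
    accept_other:  the "If no work for selected applications..." setting
    queue:         work units, each a dict with "name" and "app" keys
    """
    preferred = [wu for wu in queue if wu["app"] in selected_apps]
    if preferred or not accept_other:
        return preferred
    # Nothing for the selected apps, but the host opted in to other work.
    return list(queue)


if __name__ == "__main__":
    queue = [
        {"name": "regular_task", "app": "rosetta_v4"},
        {"name": "large_ram_task", "app": "rosetta_v5"},
    ]
    # Default settings: v4 only, no fallback.
    print([wu["name"] for wu in pick_work({"rosetta_v4"}, False, queue)])
    # Opted in to large RAM Tasks as well.
    print([wu["name"] for wu in pick_work({"rosetta_v4", "rosetta_v5"}, False, queue)])
```

With the default settings a host never sees the large RAM work units, which is the whole point of the split.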


Of course when they move beyond v4, they'd need to go to v6 for regular Tasks, and v7 for large RAM Tasks, and update the Rosetta preferences page, and let people know what's happening beforehand.
Grant
Darwin NT
ID: 101416 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 12,028
Message 101420 - Posted: 21 Apr 2021, 17:28:44 UTC - in response to Message 101405.  

From Sid Celery 9 Apr

I've regularly found my own PCs have rebooted overnight due to these faulty tasks.


I've never considered that being the cause of a reboot before...hmmmmm light bulb going off icon needed!!!


The only reboots I've had are from that criminally auto-rebooting Windows 10. I've thwarted that though. My updates are "managed by my organisation" or so it thinks.


That's funny... you actually thinking MS gives a crap about what YOU, or your organization, want to do with THEIR software. I hope it works for you, I really, really do, but past history suggests MS just ups the priority of their updates and you get unwanted ones anyway because it serves their tracking needs.
It's my computer and they can't make me do anything, including pay for it.
ID: 101420 · Rating: 0
Mr P Hucker
Joined: 12 Aug 06
Posts: 1600
Credit: 12,116,986
RAC: 12,028
Message 101421 - Posted: 21 Apr 2021, 17:30:35 UTC - in response to Message 101411.  

Using the number of tasks In Progress as a proxy for how successful people are at downloading tasks
In March, the figure was 550k
When all the problems began, the figure dropped to around 318k - a loss of 41%
Today the figure is around 360k - loss reduced to 34.5%

Currently 384k in progress - loss reduced to 30%
<guessing> maybe back-up project tasks are being replaced by Rosetta? Every little helps
It could also be people manually doing other things. I sometimes like to concentrate on one project. If that runs out of work, I'll pick another and might not be back for a while. Somebody just knocked me into 3rd place elsewhere, this will not do, back in a week....
ID: 101421 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org