Problems and Technical Issues with Rosetta@home

Grant (SSSF)

Joined: 28 Mar 20
Posts: 1467
Credit: 14,309,918
RAC: 16,071
Message 99686 - Posted: 15 Nov 2020, 6:36:33 UTC - in response to Message 99685.  

Edit: still no luck with Scheduler responses; it says it's down for maintenance.
Still down.
Grant
Darwin NT
ID: 99686
MarkJ

Joined: 28 Mar 20
Posts: 72
Credit: 25,010,478
RAC: 383
Message 99696 - Posted: 17 Nov 2020, 4:38:08 UTC - in response to Message 99674.  

app_config.xml in the Rosetta project directory:

<app_config>
<app>
<name>rosetta</name>
</app>
<project_max_concurrent>3</project_max_concurrent>
</app_config>

Crunch those numbers!


You don't need the app tags if you are using project_max_concurrent; it applies to the project as a whole, not to a particular app. You can simplify it to:

<app_config>
<project_max_concurrent>3</project_max_concurrent>
</app_config>
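
For the opposite case, capping a single application rather than the whole project, app_config.xml also supports a per-app max_concurrent element. A minimal sketch, assuming the app's short name is rosetta (check client_state.xml for the exact name):

<app_config>
<app>
<name>rosetta</name>
<!-- limit this app, rather than the whole project, to 3 running tasks -->
<max_concurrent>3</max_concurrent>
</app>
</app_config>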
BOINC blog
ID: 99696
tom

Joined: 29 Nov 08
Posts: 10
Credit: 6,044,733
RAC: 0
Message 99762 - Posted: 25 Nov 2020, 3:48:33 UTC

For some reason, I have been limited to ONE work unit a day for quite a while now. After literally years of processing lots of work units trouble-free, I still don't understand why. AFAIK it started when BOINC switched to SSL, but since I can successfully connect to other sites over SSL, the switchover (yes, I switched too) shouldn't have nuked my ability to communicate with the project. And no, I don't see any errors in the event log, although I'm not very expert at looking through it.

Currently running:
BOINC 7.16.11
Mac OS X 10.7.5
Mac mini Server (i7)
ID: 99762
Falconet

Joined: 9 Mar 09
Posts: 350
Credit: 1,000,634
RAC: 0
Message 99764 - Posted: 25 Nov 2020, 8:09:22 UTC - in response to Message 99762.  
Last modified: 25 Nov 2020, 8:10:57 UTC

Deleted.
ID: 99764
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1467
Credit: 14,309,918
RAC: 16,071
Message 99765 - Posted: 25 Nov 2020, 8:09:25 UTC - in response to Message 99762.  

For some reason, I have been limited to ONE work unit a day for quite a while now.
Because all you do is produce errors.
If you want more than 1 Task per day, you need to start producing Valid work.


Try detaching and re-attaching to the project; that will dump all your current work, but it will make the system re-download the science application.
If you are still producing errors, then it is most likely a hardware issue: memory, power supply, or an overheating CPU (or overheating memory or PSU). It could possibly be an OS issue, but that's very, very unlikely, unless you recently did an update of some sort.
Grant
Darwin NT
ID: 99765
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 99767 - Posted: 25 Nov 2020, 12:24:03 UTC - in response to Message 99762.  

This is the same problem you reported five months ago? The server is limiting the amount of work it sends because your computer is returning so many errors.

It can’t be related to SSL, since BOINC is successfully communicating with the server and able to download tasks.

The other thing that changed around the same time was the update to application version 4.20. Your recent tasks have all failed within seconds of starting, which suggests there's some kind of fundamental incompatibility between the application and your system. Any Mac OS experts here who can offer suggestions on how to diagnose that? You could try the Mac forum, but it's pretty quiet in there…
ID: 99767
nikolce

Joined: 28 Apr 07
Posts: 2
Credit: 2,002,356
RAC: 0
Message 99785 - Posted: 27 Nov 2020, 11:16:27 UTC

Hi all,

Can someone tell me if I should abort the tasks below? It's a bit annoying to find your CPU crunching nothing for two days.



Thank you!
ID: 99785
Bryn Mawr

Joined: 26 Dec 18
Posts: 373
Credit: 10,588,101
RAC: 7,998
Message 99786 - Posted: 27 Nov 2020, 12:34:00 UTC - in response to Message 99785.  

Hi all,

Can someone tell me if I should abort the tasks below? It's a bit annoying to find your CPU crunching nothing for two days.



Thank you!


Have you any other Rosetta jobs running correctly alongside those?

Can you see the file names of the WUs to see if they’re all the same type?

Could you suspend (some of) those tasks and see if the replacement tasks run OK? That should tell you whether the problem is with those tasks or with your setup.
ID: 99786
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 99788 - Posted: 27 Nov 2020, 13:20:15 UTC - in response to Message 99785.  

Those look badly broken. They’re running but making no progress, which is why the estimated remaining time has grown so large. Kill them.

Is this your Intel machine? It is returning nothing but errors lately. There seems to be something seriously wrong with it.
ID: 99788
nikolce

Joined: 28 Apr 07
Posts: 2
Credit: 2,002,356
RAC: 0
Message 99792 - Posted: 27 Nov 2020, 21:01:10 UTC

Thanks for bringing the errors to my attention. Apparently from that point on it has been returning nothing but errors. I caught it today since my RAC dropped a little.

As recommended, I killed the tasks, and within minutes the estimates on the next tasks started to drop.



I restarted the PC and tested the CPU and memory with Prime95 on smallest and large FFTs for 15 minutes each (I know it should be longer), with no errors. Meanwhile the PC was not showing any signs of instability. I've resumed the project and the tasks have been doing fine for almost an hour now. I'll keep a close eye on it over the next couple of days. I thought I'd have to retire the old bugger.

Thank you!
ID: 99792
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1467
Credit: 14,309,918
RAC: 16,071
Message 99795 - Posted: 27 Nov 2020, 21:26:35 UTC - in response to Message 99792.  

Thanks for bringing the errors to my attention. Apparently from that point on it has been returning nothing but errors. I caught it today since my RAC dropped a little.

As recommended, I killed the tasks, and within minutes the estimates on the next tasks started to drop.
If it occurs again, try rebooting before aborting the Tasks.
Grant
Darwin NT
ID: 99795
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 99885 - Posted: 3 Dec 2020, 16:04:05 UTC

stub_cyc_target tasks are completing in anywhere from under 2 hours to over 19 (against a default target of 8).

(Not a problem; just an observation…)
ID: 99885
lazyacevw

Joined: 18 Mar 20
Posts: 12
Credit: 93,576,463
RAC: 0
Message 99910 - Posted: 4 Dec 2020, 18:08:24 UTC
Last modified: 4 Dec 2020, 18:11:05 UTC

Can anyone comment on the level of compression the tasks are sent with, and separately the level of compression applied before completed workloads are submitted? Does anyone know the type of compression? lzma2? bzip2?

The reason I ask is that beginning about 1 or 2 months ago, I started exceeding my 22 GB monthly data cap about 22 to 25 days into the month. I've been running pretty much the same batch of clients since April, and the internet connection is dedicated solely to R@H, so I can only assume that work units have become more complex and larger. If I don't have any data hiccups, I average around 150,000 credits. I've started to shut down a few clients in order to stay just under my data cap. I'm not sure if the usage is purely work units or if it is the ancillary files R@H downloads (like the database_357d5d93529_n_methyl files) that they use to set up different variables and references. I'm running dedicated Linux machines that have been pared down to preclude any unnecessary or ancillary data usage. I've even gone so far as to set the updater service to not check for updates on all of the machines.

I'd really like to bring a few dozen more cores online but I'm in a holding pattern until my data usage goes down.

When I run out of data, I have to tether my phone to each computer each day to do batch uploads and downloads. This isn't ideal because it's a little inconvenient having to plug into each computer, plus I'm sure it creates unnecessary bursts of traffic at the R@H distribution server. There have been a few days where I forgot to connect and ended up breaking the 2-day crunch time limit. I'm sure that is inefficient for the project as a whole.

Would the devs consider upping the compression levels? It slightly increases client and server overhead, but most computers are more than capable of spending an extra minute or two on more compression. It might help people bring more systems online.
ID: 99910
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 99916 - Posted: 4 Dec 2020, 19:27:10 UTC - in response to Message 99910.  
Last modified: 4 Dec 2020, 19:38:47 UTC

In my observation, most downloaded data files use Zip compression. I haven’t paid any attention to what gets uploaded, though I’ve seen gzip mentioned in some logs and file names.

The big (500 MB Zip) database and the applications can go months between updates, so even though they’re relatively large they shouldn’t be affecting your recent usage.

You can see a history of data download and upload totals in the file daily_xfer_history.xml in your BOINC data directory; you could analyse that to see how usage has changed over time.
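
For example, a rough way to chart that (a sketch only; it assumes the file uses BOINC's usual <dx> records with <when> as a day number and <up>/<down> byte counts, so check your copy of the file first):

import xml.etree.ElementTree as ET
from datetime import date, timedelta

# Summarise daily_xfer_history.xml: one line per recorded day.
tree = ET.parse("daily_xfer_history.xml")
for dx in tree.getroot().iter("dx"):
    day = int(float(dx.findtext("when", "0")))   # days since the Unix epoch
    up = float(dx.findtext("up", "0"))           # bytes uploaded that day
    down = float(dx.findtext("down", "0"))       # bytes downloaded that day
    print(f"{date(1970, 1, 1) + timedelta(days=day)}  "
          f"up {up / 1e6:8.1f} MB  down {down / 1e6:8.1f} MB")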

As a single data point, my current usage seems to be averaging around 4 MB per task.

You could increase the target CPU run time in your project preferences to run each task for longer (and thus complete fewer tasks in total, requiring less overall data transfer, in any given period of time) while maintaining the same credit rate.
ID: 99916
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1467
Credit: 14,309,918
RAC: 16,071
Message 99920 - Posted: 4 Dec 2020, 22:52:54 UTC - in response to Message 99916.  
Last modified: 4 Dec 2020, 23:12:00 UTC

You could increase the target CPU run time in your project preferences to run each task for longer (and thus complete fewer tasks in total, requiring less overall data transfer, in any given period of time) while maintaining the same credit rate.
Someone would need to try that out to see if it would make any significant difference; I suspect the difference would be minimal (if any).
The downloads of Tasks and their support files are rather small. It's the returned result files that can be extremely large (I've noticed a couple over 900 MB in size). Running the Tasks for longer will result in a larger result file, so instead of returning 2 smaller files, you're returning one larger one. The only saving is fewer downloaded files, which, as I mentioned, are very small in comparison, so there would be little if any reduction in data transfer.



The reason I ask is that beginning about 1 or 2 months ago, I started exceeding my 22 GB monthly data cap about 22 to 25 days into the month.
Which is around the time you brought some new systems online.
A 4-core/4-thread system, an 8c/t system, and of course the 64c/t Threadripper.
They would all have a significant impact on the amount of results you return.



When I run out of data, I have to tether my phone to each computer each day to do batch uploads and downloads. This isn't ideal because it's a little inconvenient having to plug into each computer
Does your modem/router support WiFi? If so, just set up your phone as a Personal Hotspot.
If not, put a WiFi dongle on one of the systems, connect that system to the phone, and enable internet sharing from it for all of the others.



I'm running dedicated Linux machines that have been pared down to preclude any unnecessary or ancillary data usage. I've even gone so far as to set the updater service to not check for updates on all of the machines.
I suspect Linux has a similar option to Windows Update that allows systems to check for their updates on the local network before checking for them over the internet.



Would the devs consider upping the compression levels? It slightly increases client and server overhead, but most computers are more than capable of spending an extra minute or two on more compression.
The problem is that you can only compress data up to a certain point, after which no further compression is possible. And spending 2, 3 or 4 times as long compressing (and decompressing) the data for a 3-5% saving in file size is really not an option.
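
If anyone wants to gauge that trade-off on their own result files, a rough sketch (the file name is a placeholder, and zlib's levels stand in for whatever the project actually uses):

import time
import zlib

# Compression level vs. output size and CPU time for one sample file,
# to show where extra effort stops buying meaningful savings.
data = open("result_sample.out", "rb").read()   # any representative result file
for level in (1, 6, 9):
    start = time.perf_counter()
    size = len(zlib.compress(data, level))
    elapsed = time.perf_counter() - start
    print(f"level {level}: {size / 1e6:7.2f} MB in {elapsed:.2f} s")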
Grant
Darwin NT
ID: 99920
MarkJ

Joined: 28 Mar 20
Posts: 72
Credit: 25,010,478
RAC: 383
Message 99924 - Posted: 5 Dec 2020, 3:46:24 UTC

You could put a proxy server on your network to save on duplicate transfers. Unfortunately, since Rosetta switched to https it won't help with project data files.

I find the biggest benefit is for OS updates: the first machine downloads them, and subsequent requests are served from the proxy's cache. It also works well with Einstein and their locality scheduler.
BOINC blog
ID: 99924
lazyacevw

Joined: 18 Mar 20
Posts: 12
Credit: 93,576,463
RAC: 0
Message 99935 - Posted: 6 Dec 2020, 13:10:10 UTC - in response to Message 99924.  
Last modified: 6 Dec 2020, 13:10:38 UTC

In my observation, most downloaded data files use Zip compression. I haven’t paid any attention to what gets uploaded, though I’ve seen gzip mentioned in some logs and file names.

Thanks. I too have seen gz and zip files in project slot directories, along with a lot of uncompressed files. It just isn't clear which files were generated, which were decompressed from files received, and which were created when all of the work is said and done. Essentially, I can only easily see the middle part of the data processing.
You can see a history of data download and upload totals in the file daily_xfer_history.xml in your BOINC data directory; you could analyse that to see how usage has changed over time.

I took a look at the file but I couldn't make much sense of it.
Which is around the time you brought some new systems online.
A 4 core/thread, 8c/t & of course the 64c/t Threadripper system.
They would all have a significant impact of the amount of results you return.

It does appear from my profile that I brought new systems online recently, but in actuality I just reinstalled the OS on most of my systems when R@H ran out of tasks. That TR has been running every day since early April.
Does your modem/router support WiFi? If so, just set up your phone as a Personal Hotspot.
If not, put a WiFi dongle on one of the systems, connect that system to the phone, and enable internet sharing from it for all of the others.

I'm using switches connected to a wired personal hotspot. I tried using a USB-C to Ethernet adapter, but my Samsung S8 doesn't appear to support it. Internet sharing isn't as easy to set up on Linux as it is with Windows, but I will look into it. My clients do not have wireless adapters either. I might see about using a Windows laptop, having it connect wirelessly to my phone and then share its connection over the LAN. So, I do have a few options...
You could put a proxy server on your network to save on duplicate transfers

I really should set up a pfSense/Squid box for the network, partly for extra security but also for the network caching feature. It would work as long as the Linux updater doesn't use HTTPS. I could probably also set up a local update mirror. Thanks for the tip!
ID: 99935
robertmiles

Joined: 16 Jun 08
Posts: 1223
Credit: 13,824,497
RAC: 2,340
Message 99948 - Posted: 7 Dec 2020, 5:08:29 UTC

A task running for over 12 hours so far, even though I've selected a run length of 8 hours:

3stub_cyc_target_1cwa_01152_14_extract_B_SAVE_ALL_OUT_1044879_311

The estimated time remaining is INCREASING, not decreasing.

It is doing checkpoints WITHOUT ending the task. 26 seconds since the last one.

Is something wrong with this task? Should I abort it?
ID: 99948
Speedy

Joined: 25 Sep 05
Posts: 163
Credit: 800,690
RAC: 173
Message 99949 - Posted: 7 Dec 2020, 7:03:35 UTC - in response to Message 99948.  

A task running for over 12 hours so far, even though I've selected a run length of 8 hours:

3stub_cyc_target_1cwa_01152_14_extract_B_SAVE_ALL_OUT_1044879_311

The estimated time remaining is INCREASING, not decreasing.

It is doing checkpoints WITHOUT ending the task. 26 seconds since the last one.

Is something wrong with this task? Should I abort it?

Have you tried exiting BOINC and opening it again, or restarting your computer/laptop? If after a restart it starts back at, for example, 10 hours, let it run and see if it finishes within 12 hours. If it doesn't and keeps running past 13 hours, feel free to abort.
Have a crunching good day!!
ID: 99949
Brian Nixon

Joined: 12 Apr 20
Posts: 293
Credit: 8,432,366
RAC: 0
Message 99950 - Posted: 7 Dec 2020, 10:56:28 UTC - in response to Message 99948.  

I had some stub_cyc_target tasks complete in under 2 hours; some after more than 19. I aborted one that had been running for 1½ days, as even though it still seemed to be running, its progress percentage was increasing so slowly that it didn’t seem likely it would reach 100% in any reasonable amount of time.

Maybe leave yours for a few more hours, and kill it if it gets to a full day without completing?

Once a task has overrun, its remaining time estimate becomes meaningless, as BOINC has no way of knowing when it will finish. And sometimes tasks can get in a state where they are running but not reporting progress, so BOINC estimates progress using elapsed time towards a target perpetually 10 minutes in the future, meaning the value only asymptotically approaches 100%.
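
To put numbers on that last point (an illustration of the behaviour described above, not BOINC's actual code): if completion is always assumed to be 10 minutes away, the displayed fraction behaves like elapsed / (elapsed + 600) and never reaches 100%.

# Sketch of the asymptotic progress estimate described above
# (an illustration only, not BOINC's implementation).
def estimated_progress(elapsed_s: float, horizon_s: float = 600.0) -> float:
    # Completion is always assumed to be horizon_s seconds away.
    return elapsed_s / (elapsed_s + horizon_s)

for hours in (1, 6, 12, 36):
    t = hours * 3600
    print(f"{hours:3d} h -> {estimated_progress(t):.3%}")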
ID: 99950