wget is much faster than scp or rsync
I needed to copy 3TB of data from my old homeserver to my new one. I decided to spend as much time "sharpening my axe" as possible. I spent ages dicking around with ZFS configs, tweaking BIOS settings, flashing firmware, and all the other yak-shaving necessary for convincing yourself you're doing useful work.
Then I started testing large file transfers. Both scp and rsync started well - transferring files at around 112MBps. That pretty much saturated my Gigabit link. Nice! This was going to take no time at all...
And then, after a few GB of a single large file had transferred, the speed slowed to a crawl, eventually dropping to about 16MBps, where it stayed for the majority of the transfer.
I spent ages futzing around with the various options. Disabling encryption, disabling compression, flicking obscure switches. I tried using an SSD as a ZIL. I rebuilt my ZFS pool as an MDADM RAID. I mounted disks individually. Nothing seemed to work. It seemed that something was filling a buffer somewhere when I used scp or rsync.
So I tried a speed test. Using curl I could easily hit my ISP's limit of 70MBps (about 560Mbps).
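For reference, a rough way to measure raw HTTP download throughput with curl - the URL here is just a placeholder for whatever large test file you point it at - is:
curl -o /dev/null -w '%{speed_download}\n' https://example.com/large-test-file
That discards the body and prints the average download speed in bytes per second.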
It was getting late and I wanted to start the backup before going to bed. Time for radical action!
On the sending server, I opened a new tmux and ran:
cd /my/data/dir/
python3 -m http.server 1234
That starts a webserver which lists all the files and folders in that directory.
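Before letting a multi-terabyte mirror loose, it's worth a quick sanity check from the receiving machine that the listing is actually reachable - assuming the same address and port as the command below:
curl -s http://server.ip.address:1234/ | head
That should print the start of the auto-generated directory listing.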
On the receiving server, I opened a new tmux and ran:
wget --mirror http://server.ip.address:1234/
That downloads all the files that it sees, follows all the directories and subdirectories, and recreates them on the server.
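One refinement that may be worth trying - I haven't benchmarked it, so treat it as a sketch: by default wget puts everything under a directory named after the host and port, and the -nH (--no-host-directories) flag keeps the tree rooted in the current directory instead:
wget --mirror -nH http://server.ip.address:1234/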
After running overnight, the total transfer speed reported by wget was about 68MBps. Not exactly saturating my link - but better than the puny throughput I experienced earlier.
Downsides
There are a few (minor) problems with this approach.
- No encryption. As this was a LAN transfer, I didn't really care.
- No preservation of Linux attributes. I didn't mind losing metadata.
- No directory timestamps, although file timestamps are preserved.
- An index.html file is stored in every directory. I couldn't find an option to turn that off. They can be deleted with:
find . -newermt '2023-04-01' \! -newermt '2023-04-03' -name 'index.html' -delete
- It all just feels a bit icky. But, hey, if it's stupid and it works, it isn't stupid.
I'm sure someone in the comments will tell me exactly which obscure setting I needed to turn on to make scp work at the same speed as wget. But this was a quick way to transfer a bunch of large files with the minimum of fuss.
Karey Higuera says:
I ran into this as well and ended up using croc to transfer the files. It maintained a relatively high speed, but there was a significant loading time when starting (parsing the files, perhaps?).
Barney Livingston said on mastodon.me.uk:
@Edent For this kind of bulk LAN copy I tend to use tar piped to netcat. Something like nc -l 9999 | tar -x -f - on the receiver, and tar -c -f - <dir> | nc <host> 9999 on the sender. Can chuck a gzip in the pipeline if you're sending something compressible. It can keep all the file attributes, links, etc. with the right tar options.
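Spelling that out a little - this is my expansion rather than Barney's exact commands, and the -p and --xattrs flags are the sort of "right tar options" he alludes to for preserving attributes:
nc -l 9999 | tar --xattrs -xpf -
on the receiver, and:
tar --xattrs -cf - /my/data/dir | nc receiver.ip.address 9999
on the sender. Depending on your netcat flavour you may need nc -l -p 9999, and receiver.ip.address is obviously a placeholder.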
Denilson says:
Yes, I've done something similar recently also using netcat.
Bonus points: you only need netcat on one of the computers. On the other computer, you can just use the bash special filename /dev/tcp/host/port in a file redirection.
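Denilson's example seems to have been lost in the formatting; a rough sketch of the idea - with receiver.ip.address and port 9999 as placeholders of mine - would be to run:
nc -l 9999 | tar -xf -
on the receiving server, and then on the sending server, which needs no netcat at all (this only works in bash, since /dev/tcp is a bash feature rather than a real file):
tar -cf - /my/data/dir > /dev/tcp/receiver.ip.address/9999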
rythie said on mastodon.social:
@Edent I think this is due to lots of small files. Probably a “tar copy” would be faster: https://qameta.com/posts/copy-files-compressed-with-tar-via-ssh-to-a-linux-server/
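For completeness, the sort of "tar copy" rythie is pointing at is usually something along these lines - paths and hostname are placeholders, and the z flag adds optional gzip compression:
tar -czf - -C /my/data/dir . | ssh new.server.ip 'tar -xzf - -C /destination/dir'
That streams the whole directory over a single ssh connection, avoiding per-file overhead, though the destination directory needs to exist first.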
Alex B says:
I always find some way of getting old and new storage hosted in the same system when doing this kind of migration. Obviously, an HP microserver isn't optimal for that, though! Maybe use USB-to-SATA adaptors, or migrate from a single device to RAID after the copy, allowing the use of SATA ports for the old storage?
Ivan says:
In theory, uftp could help you reach the absolute maximum speed your LAN could handle (no TCP ACKs!), but that would probably need a bit of careful hand-tuning of the transfer speed. (uftp can afford to avoid the slow-start part of the TCP algorithm.)
In practice, the slowdown could be due to rsync and scp being careful and calling fsync() every now and then, while plain wget/curl has no need to do that and relies on the OS to flush its buffers when it's comfortable. But I'm not sure at all that's the case here.
Fazal Majid says:
Combining rsync with GNU parallel is usually much faster, but is there any reason why you are not using “zfs send”?
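A sketch of the rsync + GNU parallel approach - my approximation rather than Fazal's exact invocation, with the job count and destination as placeholders:
cd /my/data/dir && find . -type f -print0 | parallel -0 -j8 rsync -aR {} new.server.ip:/destination/dir/
Note that this launches one rsync (and one ssh connection) per file, so it helps most with lots of smaller files and benefits from ssh connection re-use.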
@edent says:
Firstly, because I didn't know about it. Secondly, because the sending system wasn't using ZFS. And, thirdly, it still requires a transfer over ssh - so that doesn't solve the problem.
More comments on Mastodon.