HPR4341: Transferring Large Data Sets
Very large data sets present their own problems. Not everyone has
directories with hundreds of gigabytes of project files, but I do,
and I assume I'm not the only one.
For instance, I have a directory with over 700 radio shows. Many
of these show directories also hold a podcast version, along with
pictures and text files.
Doing a properties check on the directory I see 450 gigabytes of
data.
When I started envisioning Libre Indie Archive I wanted to move
the directories into archival storage on optical discs. My first
attempt at this didn't work: I lost metadata when I wrote the
discs, and since optical discs are read only there was no way to
fix it afterward.
After further work and study I learned that tar files can preserve
metadata if they are created and extracted as root. In fact,
if you are running tar as root, preserving file ownership and
permissions is the default.
So this means that optical discs are an option if you write tar
archives onto them.
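As a quick sanity check, here is a small sketch (using throwaway paths in /tmp, not the real archive layout) showing that tar records permissions in the archive and that -p restores them on extraction. Run as root, tar also restores file ownership by default.

```shell
# Throwaway demo paths; not part of the real archive layout.
mkdir -p /tmp/meta-demo/src /tmp/meta-demo/restore
echo "hello" > /tmp/meta-demo/src/note.txt
chmod 640 /tmp/meta-demo/src/note.txt

# Create the archive; owner, group, and mode are stored in the tar headers.
tar -cf /tmp/meta-demo/demo.tar -C /tmp/meta-demo src

# Extract with -p to restore permissions exactly. When run as root
# (sudo tar ...), --same-owner is also the default, so ownership
# comes back as well.
tar -xpf /tmp/meta-demo/demo.tar -C /tmp/meta-demo/restore
```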
I have better success rates with 25 GB Blu-ray discs than with
the 50 GB discs. So, if your directory breaks up into projects
that fit on 25 GB discs, that's great.
My data did not break up that easily, but tar does have an option
to write a data set to multiple tar files, each with a maximum
size, labelling them -0, -1, and so on. When using this
multi-volume feature you cannot use compression, so you will get
tar files, not tar.gz files.
It's better to break the file sets up into more reasonable sizes,
so I decided to divide the shows alphabetically by title: all the
shows starting with the letter a would be one data set, and so on
down the alphabet, one letter at a time.
Most of the letters resulted in a single tar file, labeled -0,
that would fit on a 25 GB disc. Many letters, however, took two
or even three tar files that had to be written to different discs
and then concatenated on the primary system before being
extracted to the correct location in primaryfiles.
There is a companion program to tar, called tarcat, that I used to
combine 2 or 3 tar files split by length into a single tar file
that could be extracted.
I ran engrampa as root to extract the files.
So, I used a tar command on the working system where my Something
Blue radio shows are stored. Then I used K3b to burn these files
onto 25 GB Blu-ray discs, carefully labeling the discs and keeping
a text file to track which files I had already copied to disc.
Then, on the Libre Indie Archive primary system, I copied the
file or files for that data set from the Blu-ray to the boot
drive. Next I would use tarcat to combine the files if there was
more than one file for that data set. And finally I would extract
the files to primaryfiles by running engrampa as root.
Now I'm going to go into details on each of these steps.
First, make sure that the Libre Indie Archive program, prep.sh,
is in your home directory on your workstation. Then, from the
data directory to be archived (in my case the something_blue
directory), run prep.sh like this:
~/prep.sh
This will create a file named IA_Origin.txt that lists the date,
the computer and directory being archived, and the users and
userids on that system. All of this is very helpful information
to have if at some time in the future you need to do a restore.
Next create a tar data set for each letter of the alphabet. (You
may want to divide your data set in a different way.)
Open a terminal in the same directory as the data directory, my
something_blue directory, so that ls displays something_blue (your
data directory). I keep the Something Blue shows and podcasts in
subdirectories in the something_blue directory.
Here's the tar command.
Example a:
sudo tar -cv --tape-length=20000000 --file=somethingblue-a-{0..50}.tar /home/larry/delta/something_blue/a*
This is for the letter a, so the --file parameter includes the
letter a. The numbers 0..50 in the curly brackets are sequence
numbers for the volume files; the shell expands them into a list
of --file arguments that tar uses in turn. I only had one file
for the letter a, somethingblue-a-0.tar.
The last parameter is the source for the tar files, in this case
/home/larry/delta/something_blue/a*, which matches all of the
files and directories in the something_blue directory that start
with the letter a.
You may want to change the --tape-length parameter. As listed it
stores up to about 19.1 GiB per volume. The maximum capacity of a
25 GB Blu-ray disc is about 23.3 GiB of data storage.
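The arithmetic behind those numbers: GNU tar's --tape-length counts 1024-byte units by default, so 20000000 works out to a little over 19 GiB per volume.

```shell
# --tape-length=20000000 means 20,000,000 x 1024-byte records.
bytes=$((20000000 * 1024))
echo "$bytes bytes"
# In binary units that is the 19.1 figure quoted above.
awk -v b="$bytes" 'BEGIN { printf "%.1f GiB\n", b / (1024*1024*1024) }'
```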
Example b:
For the letter b, I ended up with three tar files.
somethingblue-b-0.tar
somethingblue-b-1.tar
somethingblue-b-2.tar
I will use these files in the example below using tarcat to
combine the files.
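The per-letter command generalizes naturally. Here is a dry-run sketch that just prints the command for each letter instead of executing it (the real run needs sudo and the actual data directory), mainly to show how the {0..50} brace expansion turns into a list of --file volume names:

```shell
# Dry run: echo the per-letter tar command instead of executing it.
src=/home/larry/delta/something_blue
for letter in a b c; do
  echo sudo tar -cv --tape-length=20000000 \
    --file="somethingblue-$letter"-{0..50}.tar "$src/$letter"'*'
done
# Brace expansion happens before the variables are filled in, so each
# command line carries 51 --file=... volume names for tar to use in turn.
```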
I use K3b to burn Blu-ray data discs. Besides installing K3b you
have to install some other programs, and there is a particular
setup that needs to be done, including selecting cdrecord and no
multisession. Here's an excellent article that goes step by step
through the installation and setup.
How to burn Blu-ray discs on Ubuntu and derivatives using K3b?
https://en.ubunlog.com/how-to-burn-blu-ray-discs-on-ubuntu-and-derivatives-using-k3b/
I also always check Verify data, and I use the Linux/Unix file
system, not Windows, which will rename your files if the
filenames are too long.
I installed a Blu-ray reader into the primary system and used
Thunar to copy the files from the Blu-ray disc to the boot drive.
In the primaryfiles directory I make a subdirectory,
something_blue, to hold the archived shows.
If there is only one file, like in example a above, you can skip
the concatenation step.
If there is more than one file, like Example b above, you use
tarcat to concatenate these files into one tar file.
You have to do this: if you try to extract from just one of the
numbered files when there is more than one, you will get an
error. So if I try to extract from somethingblue-b-0.tar and get
an error, it doesn't mean that there's anything wrong with that
file. It just has to be concatenated with the other b files
before it can be extracted.
Here's the tarcat command I used for Example b, above.
tarcat somethingblue-b-0.tar somethingblue-b-1.tar somethingblue-b-2.tar > sb-b.tar
This will concatenate the three smaller tar files into one bigger
tar file named sb-b.tar.
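Before extracting as root, it's worth checking that the combined archive is readable from end to end; listing its table of contents is a cheap way to do that. The sketch below uses a throwaway archive in /tmp standing in for the real sb-b.tar.

```shell
# Throwaway archive standing in for the real sb-b.tar.
mkdir -p /tmp/verify-demo
echo "episode" > /tmp/verify-demo/show.txt
tar -cf /tmp/verify-demo/sb-demo.tar -C /tmp/verify-demo show.txt

# If the concatenation was good, tar can list every entry without error.
tar -tf /tmp/verify-demo/sb-demo.tar && echo "archive reads cleanly"
```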
In order to preserve the metadata you have to extract the files
as root. To make it easier to select the files to be extracted,
and where to store them, I use the GUI archive manager, engrampa.
To run engrampa as root, open a terminal with Ctrl-Alt-T and use
this command:
sudo -H engrampa
Click Open and select the tar file to extract. Then follow the
path until you are in the something_blue directory (or your own
data directory) and you can see the folders and files you want to
extract. Type Ctrl-A to select them all.
Then click Extract at the top of the window. Open the directory
where you want the files to go. In my case,
primaryfiles/something_blue
Then click Extract again in the lower right.
After the files are extracted go to your data directory in
primaryfiles and check that the directories and files are where
you expect them to be.
You can also open a terminal in that directory and type
ls -l
to review the meta data.
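Beyond ls -l, stat prints the owner, group, and octal mode in a compact form that is easy to compare against the working system (GNU coreutils stat; demonstrated here on a throwaway file rather than the real archive).

```shell
# Throwaway file to demonstrate the check.
mkdir -p /tmp/perm-check
echo "demo" > /tmp/perm-check/show.txt
chmod 644 /tmp/perm-check/show.txt

# %U user, %G group, %a octal permissions, %n file name.
stat -c '%U:%G %a %n' /tmp/perm-check/show.txt
```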
When dealing with data chunks of 20 GB or more, each one of these
steps takes time. The reason I like using an optical disc backup
to transfer the files from the working system to Libre Indie
Archive is that it gives me an easy-to-store backup that is not
on a spinning drive and cannot be overwritten. Still, optical
disc storage is not perfect either. It's just another belt to go
with your suspenders.
Another way to transfer directories into the primaryfiles
directory is with ssh over the network. This is not as safe as
using optical discs, and it does not provide the extra snapshot
backup. It also takes a long time, but it is not as labor
intensive.
After I spend some more time thinking about this and testing I
will do a podcast about transferring large data sets with ssh.
Although I am transferring large data sets to move them into
archival storage using Libre Indie Archive, there are many other
situations where you might want to move a large data set while
preserving the metadata. So what I have written about tar files,
optical discs, and running Thunar and engrampa as root is
generally applicable.
As always, comments are appreciated. You can comment on Hacker
Public Radio or on Mastodon. Visit my blog at home.gamerplus.org
where I will post the show notes and embed the Mastodon thread
for comments about this podcast.
Thanks