A new approach: git for backup/snapshot?

I am searching for a backup/snapshot solution, and I have come to the conclusion that git may be the best option for me (although it was not intended for that :) ).

I will use it for hourly backups on a notebook that often runs on battery power. It shall protect against a failure of the hard drive, but also against files being deleted or changed accidentally by the user or by an application (backup operations will be done by root, plus chown/chgrp root and chmod go-w -R). Overhead is not a relevant issue (at least to a large extent). RAID is not appropriate because it permanently increases power consumption and does not protect against deletions/changes either. LVM snapshots are too unreliable (e.g., when their size exceeds the allocated storage) and do not protect against drive failure.
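
For illustration, a minimal sketch of that root-owned, write-protected setup (the path is a placeholder and the script is assumed to run as root):

```python
# Sketch only: harden the backup area as described above (run as root).
# The path is a placeholder for illustration.
import subprocess

BACKUP_GIT_DIR = "/backup/.git"  # hypothetical location of the repository data

def harden_permissions(path: str) -> None:
    """Make the backup data root-owned and not writable by group/others."""
    subprocess.run(["chown", "-R", "root:root", path], check=True)
    subprocess.run(["chmod", "-R", "go-w", path], check=True)

if __name__ == "__main__":
    harden_permissions(BACKUP_GIT_DIR)
```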

Most solutions I used before, e.g., rsnapshot, are based on rsync: it always (in my case, hourly) creates chunks and hashes of both origin and target (i.e., of all stored data!) to identify the changes to back up. This consumes too much power when I am on battery, and it happens every hour for all data, even if there are no (or only minor) changes.

This is what led me to git: add+commit creates a backup, including snapshots that can be restored, and it does not always chunk and hash both origin and target. It does its hashing, compressing, and structure-creation once, when a "backup" (= commit) is created, and only for the things that have changed (re-hashing etc. of unchanged data is not necessary; that data simply remains stored).
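
For illustration, a minimal sketch of such an hourly snapshot step that skips the commit entirely when nothing has changed (the repository path is a placeholder):

```python
# Sketch: commit a snapshot only if git reports any changes.
import subprocess
from datetime import datetime

REPO = "/home/user/data"  # hypothetical working tree to back up

def snapshot(repo: str) -> bool:
    """Stage everything and commit; return True if a commit was made."""
    # --porcelain prints nothing when the working tree is clean
    status = subprocess.run(
        ["git", "-C", repo, "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    if not status.stdout.strip():
        return False  # nothing changed, nothing to hash or compress
    subprocess.run(["git", "-C", repo, "add", "-A"], check=True)
    subprocess.run(
        ["git", "-C", repo, "commit", "-m",
         f"hourly snapshot {datetime.now().isoformat(timespec='seconds')}"],
        check=True,
    )
    return True

if __name__ == "__main__":
    print("committed" if snapshot(REPO) else "no changes")
```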

Nevertheless, the git process takes more time than I expected (I tested with a 700 MB file) because in the end it still does SHA-1 hashing and, additionally, compression. But I think a precise comparison is not needed: I expect the files I store to need real storage (GBs rather than MBs, maybe more over time), while the changes per hour will usually be in the KB or MB range (seldom more). Therefore, git will not hash/compress much each hour, and it will not touch the remaining GBs of unchanged data (rsync would). So, under the assumption that a lot of data is stored (all of which rsync would always chunk and hash) while there are only minor changes per hour (or none at all), I think git is a good solution that consumes less power on average than rsync-based solutions. Also, git is a well-proven and reliable technology.
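
As a side note on the compression cost: git's zlib level can be lowered to trade some disk space for less CPU per commit. A sketch (core.compression is a standard git setting; the value 1 is just an example):

```python
# Sketch: lower git's compression level in the backup repo to save CPU/battery.
import subprocess

REPO = "/home/user/data"  # hypothetical working tree

# 0 = no compression, 9 = maximum; 1 is a low-CPU compromise (example value)
subprocess.run(["git", "-C", REPO, "config", "core.compression", "1"], check=True)
```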

My test: I ran git init in the folder I want to back up, then moved the content of .git to the second drive and mounted that drive on the .git folder. I then simulated a loss of all data on the first drive except the .git folder (which now lives on the other drive). I could still do a git clone of the "empty" folder (because it still contains the root-owned .git), and the clone again contained all the data. The backup process will be automated with a Python script.
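
For illustration, a minimal sketch of that restore test plus an integrity check of the repository on the second drive (paths are placeholders):

```python
# Sketch: verify the repository and restore the data by cloning it elsewhere.
import subprocess

DATA_DIR = "/home/user/data"        # folder whose .git lives on the second drive
RESTORE_DIR = "/home/user/restore"  # hypothetical target for the test restore

# Check the object store for corruption before relying on it
subprocess.run(["git", "-C", DATA_DIR, "fsck", "--full"], check=True)

# Cloning the (possibly emptied) folder recreates all committed data
subprocess.run(["git", "clone", DATA_DIR, RESTORE_DIR], check=True)
```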

However, git is not intended for this, so I would like to know what you think of it. Have I missed something? Does the trade-off against rsync make sense (given my assumption: a lot of stored data, not many hourly changes)? Let me know if something is not clear. I worry about thinking errors :smiley:

I am aware that I will need to delete old commits from time to time. But that is fine.
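
One possible way to do that pruning, as a sketch with placeholder paths: collapse the whole history into a single fresh commit of the current tree and let git gc drop the old objects (this discards all earlier snapshots, so it would need adapting if the last N should be kept):

```python
# Sketch: squash the backup history into one commit and reclaim space.
import subprocess

REPO = "/home/user/data"  # hypothetical working tree

def run(args):
    return subprocess.run(["git", "-C", REPO] + args,
                          capture_output=True, text=True, check=True).stdout.strip()

tree = run(["rev-parse", "HEAD^{tree}"])                         # current snapshot's tree
new_root = run(["commit-tree", tree, "-m", "squashed history"])  # parentless commit
run(["reset", "--hard", new_root])                               # point the branch at it
run(["reflog", "expire", "--expire=now", "--all"])               # forget old refs
run(["gc", "--prune=now"])                                       # delete unreachable objects
```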

1 Like

I think you may be misinterpreting the behavior of rsync. It does not copy data that is already the same at both source and destination; instead, it only copies over the changed data (thus the "sync" in the name). Rsync is capable of syncing in one direction only (preventing unintended deletes at the destination) or bidirectionally, including deletes at either end (a true sync at that point in time).
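
For reference, a typical one-direction rsync call looks like this (a sketch; the paths are placeholders, and --delete is only needed if deletions should propagate to the destination):

```python
# Sketch: one-way rsync of a data folder to an external disk.
import subprocess

SRC = "/home/user/data/"       # trailing slash: copy the contents, not the folder itself
DST = "/mnt/external/backup/"  # hypothetical mount point of the external drive

# -a preserves permissions/times and recurses; add --delete only if removals
# at the source should also be removed at the destination.
subprocess.run(["rsync", "-a", SRC, DST], check=True)
# subprocess.run(["rsync", "-a", "--delete", SRC, DST], check=True)
```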

That said, though, neither git nor rsync is ideal for the type of backup you want to do. There are dedicated backup tools out there that are designed to do a full backup followed by incremental backups of only the changes, on whatever schedule you define. I use Bacula for my system and rsync for my home data, but there are several choices available that can be found with a quick search; then look at the features to see what fits your needs best.

Most backup tools are not capable of identifying whether changes in file content are intended or unintended, though, so if you are concerned about that you may need to dig deeper to find out what will actually meet your needs. For that purpose a version control system may be needed.

3 Likes

I know it does not copy :wink: But it always chunks and hashes all data at the origin and at the target to identify the changes (or, more precisely, the differences), and then it transfers only the difference between origin and target. That alone already has a large impact, because it hashes a lot of data every hour. This is what I want to avoid (and this issue is shared by most related/incremental solutions).

Absolutely. This will be the "ongoing" backup "within" my notebook when I have to be mobile without much external equipment; in the end it is a compromise that shall consume as little power as possible while still fulfilling both purposes :frowning: . I will do additional backups on external hard disks, but only when I am at home. Those will remain rsync-based.

Agreed :slight_smile: I am mostly worried that I have overlooked some specific behavior of git, since it is ultimately not designed for this. It is always good to have someone else verify such approaches for thinking errors :wink:

Bacula is indeed something I had forgotten, but I think it is less suitable than git for my purpose. Still, I will have another look at it. Thanks!

This part is just about snapshots: the capability to revert (but more reliably and flexibly than with, e.g., LVM).
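
For illustration, a minimal sketch of that revert capability (repository path, snapshot reference, and file name are placeholders): any earlier snapshot can be pulled back file by file.

```python
# Sketch: restore one file from an older commit without touching anything else.
import subprocess

REPO = "/home/user/data"       # hypothetical working tree
SNAPSHOT = "HEAD~24"           # e.g. the state from 24 hourly snapshots ago
FILE = "documents/report.odt"  # placeholder path inside the repository

subprocess.run(
    ["git", "-C", REPO, "checkout", SNAPSHOT, "--", FILE],
    check=True,
)
```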

1 Like

Two possibilities to consider (or both?):

- a FUSE-based git filesystem that commits changes as they happen, giving you real-time backups instead of a rigid hourly schedule
- an external (online) git remote, with git lfs for the large files, so critical data also leaves the notebook

2 Likes

I love the idea of the resulting real-time backup without a rigid hourly schedule. That would indeed be great. However, I am a bit skeptical about the inefficient kernel/user-space transitions of FUSE in terms of battery impact. But when I can spare some time in the coming days, I will test it! If the impact is limited, this would indeed add a strong advantage.

I was already considering using an external online git repository for uploading critical files regularly (additional protection against loss of the notebook). git lfs could be used to automate and implement that for specific file types (although by itself it is more of a shift than a backup, it does protect against loss of the notebook). Nevertheless, I don't always have WiFi when I'm not at home, so this cannot replace my offline solution (my LTE traffic is limited); it would just become an additional "secondary" backup that protects some specific critical files against loss of the notebook whenever WiFi is available. Also, I would have to implement something to retain a local copy of the files.

However, I was not aware of git lfs at all (shame on me, but I had not heard of it before :) ). I am currently thinking about how far it can solve some issues with my external hard drive backups at home. Thanks!
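
For illustration, a sketch of how the git lfs part could be wired up for specific file types (the patterns, paths, and remote name are placeholders; the remote would need LFS support):

```python
# Sketch: track selected large file types with git-lfs and push when online.
import subprocess

REPO = "/home/user/data"       # hypothetical working tree
PATTERNS = ["*.raw", "*.mkv"]  # placeholder file types to hand over to LFS

def run(args):
    subprocess.run(["git", "-C", REPO] + args, check=True)

run(["lfs", "install"])             # one-time setup of the LFS hooks
for pattern in PATTERNS:
    run(["lfs", "track", pattern])  # writes the patterns into .gitattributes
run(["add", ".gitattributes"])
run(["commit", "-m", "track large files with git-lfs"])

# Later, when WiFi is available:
# run(["push", "origin", "main"])   # assumes a remote named "origin" with LFS support
```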

2 Likes

This is really brilliant. I do think this update mechanism could be a better form of ostree.

What annoys me the most in Kinoite is that the whole OS is copied over as a second ISO image, instead of just linking the differences or something. It's such a waste of space and resources. I update daily if I can, and it's an immense load every day.

Having the whole system tracked in git does not seem to be a problem, as it is more or less done automatically once the repository is initialized (?), so why not use this as a backup. Git gets updates and is a crucial product; it will be as secure as the Linux kernel, stable, and, if you use the network connection in a LAN, also really fast.

Could you share the script? I didn't quite get which chmods you did, and why. Also, I think this would be a systemd hourly timer, a systemd service that gets activated and runs a script doing the work, and a logfile logging the process?

How would initializing git over the LAN be done?

I agree that having the complete system as FUSE may be a nice model integration, but this should not be done to the system; it is a proof of concept, not something efficient. An OS should not be written in Python, I guess.

Git-lfs seems to be completely useless? Or is the "text pointer to cloud-outsourced storage" concept somehow useful in a different way?