
Dec 2014: Understanding Git History as GitHub rejects >100MB files

We recently had to make sense of a large Git history for a project and found these commands so useful that we thought we would share them. The main issue was that GitHub rejects files larger than 100MB, so we had to get rid of them even though they were not present on the master branch! They were still in the history, and a push uploads the history along with the latest commit.

  • This command lists the SHA of every object (commits, trees and blobs, i.e. files) throughout the entire Git history, with the path next to each tree and blob. Open allfileshas.txt after running it to see the result:
  • git rev-list --objects --all | sort -k 2 > allfileshas.txt
    (This other command lists the unique file paths on the console; -s skips lines that have no path, such as commits): git rev-list --objects --all | sort -k 2 | cut -s -d ' ' -f 2- | uniq

  • Example:
  • 00f1a4b14f828a1c5fe514c3bf141097817c239d
    015cf543686c2811cc882e6d9b7fce2570855bbd
    01a8d89ae8447525b042cac3bd57c4c7ec2c1346
    02199e29286e9582d5905877ce64d79690a6ba6c

  • This command finds the biggest blobs (files) in your repo and lists them from biggest to smallest in bigobjects.txt. The columns are SHA, type, size, size in the pack file and offset in the pack file:
  • git gc && git verify-pack -v .git/objects/pack/pack-*.idx | grep -E "^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$" | sort -k 3 -n -r > bigobjects.txt

  • Example:
  • 7371a5106259c75ff8ef0ea122cc4939e92ab9c0 blob 1630546 1109088 98891
    06251121d1843fa59a70d23b069daaa421e8d976 blob 398240 397892 6982294
    dcc17e328a1b5724e298759caf7d1bae6d3e3f8f blob 397658 90680 5900453
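As a cross-check on the pack listing above, blob sizes can also be read straight from the object database. The sketch below is an alternative technique, not the method used in this post: it pipes `git rev-list --objects --all` into `git cat-file --batch-check`, where `%(objectsize)` is the uncompressed size of each object. The function name `list_blobs_by_size` is hypothetical, and the sketch assumes a Git recent enough to support `git -C` and the `%(rest)` placeholder.

```shell
# list_blobs_by_size is a hypothetical helper name, not a git command.
# Usage: list_blobs_by_size [repo-dir]
# Prints "<sha> <size-in-bytes> <path>" for every blob reachable from any ref,
# largest first.
list_blobs_by_size() {
    dir="${1:-.}"
    # 1. rev-list prints "<sha> <path>" for each object in the history.
    # 2. cat-file adds type and size; %(rest) echoes the path back unchanged.
    # 3. sed keeps blob lines only and drops the leading type column.
    # 4. sort orders numerically on the size column, biggest first.
    git -C "$dir" rev-list --objects --all |
        git -C "$dir" cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        sed -n 's/^blob //p' |
        sort -k 2,2nr
}
```

For example, `list_blobs_by_size . | awk '$2 > 100 * 1024 * 1024'` would show only the blobs above 100MB, assuming the limit is counted as 100 × 1024 × 1024 bytes.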

  • After running the commands above, we can use the results to list the actual filenames in order of size. Look at bigtosmall.txt and you will see the largest files in your repo history, biggest first:
  • for SHA in $(cut -f 1 -d ' ' < bigobjects.txt); do
    echo $(grep "$SHA" bigobjects.txt) $(grep "$SHA" allfileshas.txt) | awk '{print $1,$3,$7}' >> bigtosmall.txt
    done

  • Example:
  • 7371a5106259c75ff8ef0ea122cc4939e92ab9c0 1630546 MyFile.pdf
    06251121d1843fa59a70d23b069daaa421e8d976 398240 images/2014/carousel/contact.jpg
    dcc17e328a1b5724e298759caf7d1bae6d3e3f8f 397658 js/jquery.js
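The loop above runs two greps per SHA, which gets slow on a large history, and its awk step keeps only the first word of a path that contains spaces. A single-pass join in awk is one possible alternative. This is a minimal sketch, assuming both files have the layout shown in the examples ("SHA path" and "SHA blob size ..."); `join_sizes_to_paths` is a hypothetical name.

```shell
# join_sizes_to_paths is a hypothetical helper name.
# Usage: join_sizes_to_paths allfileshas.txt bigobjects.txt
# Prints "<sha> <size> <path>" in the order of the second file (biggest first).
join_sizes_to_paths() {
    awk '
        NR == FNR {                  # first file: remember the path for each SHA
            sha = $1
            sub(/^[^ ]+ ?/, "")      # strip the SHA, keep the rest (path may contain spaces)
            path[sha] = $0
            next
        }
        { print $1, $3, path[$1] }   # second file: SHA, size, looked-up path
    ' "$1" "$2"
}
```

Running `join_sizes_to_paths allfileshas.txt bigobjects.txt > bigtosmall.txt` overwrites the output file instead of appending to it, so a second run does not duplicate lines.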

  • To purge a file PERMANENTLY from history, use this command (obviously enter the correct path, relative to the repo root, exactly as it appears in bigtosmall.txt!):
  • git filter-branch --prune-empty --index-filter 'git rm -rf --cached --ignore-unmatch PATH/TO/MY/FILE/TEXT.TXT' --tag-name-filter cat -- --all

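    And that's it! The file is gone from every commit, and once Git has garbage-collected the old objects your repo is small again! The rewritten commits get new SHAs, so the cleaned branches have to be force-pushed.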
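One caveat: filter-branch keeps a backup of the old refs under refs/original, and the reflog still points at the old commits, so the purged blobs stay on disk until both are cleared. The sketch below follows the shrinking checklist in the git filter-branch documentation. The function name `purge_old_objects` is hypothetical, and the sketch assumes you have already checked the rewritten history and no longer need the backup.

```shell
# purge_old_objects is a hypothetical helper name, not a git command.
# Usage: purge_old_objects [repo-dir]
# WARNING: this discards the refs/original backups and every reflog entry.
purge_old_objects() {
    dir="${1:-.}"
    # Delete the refs/original/* backups that filter-branch leaves behind.
    git -C "$dir" for-each-ref --format='delete %(refname)' refs/original |
        git -C "$dir" update-ref --stdin
    # Expire all reflog entries so nothing still references the old commits.
    git -C "$dir" reflog expire --expire=now --all
    # Repack and drop every object that is now unreachable.
    git -C "$dir" gc --prune=now --quiet
}
```

After this, re-running the size listing from the earlier steps should no longer show the purged file.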