The Great Gist Heist
I have crawled, downloaded, and archived all of gist.github.com. Please hear my story before jumping to conclusions.
I’m currently building software that requires a large corpus of source code.
I began searching for an existing collection of source code documents, but my pursuit appeared fruitless. Displeased, I attempted to gather all of my own source code. My collection lacked variety, perhaps because of my reverence for the Python language.
Regardless of the reason, I needed a larger quantity of samples. I needed unbiased samples from all programming languages. Most importantly, I needed samples of the varying quality that only the most popular paste sites have… sites like gist.
Why are you sharing it?
I feel a little bad about using GitHub’s bandwidth.
Sharing this collection should reduce the chances that others will crawl for the same data. If you need a large collection of source code, download this torrent.
How did you do it?
I wrote a short, 30-line Python script. The script is part of the torrent.
At the peak of the scrape I had 14 threads of the script running, using approximately 580 Kbps (as measured with iftop).
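For readers who do not want to open the torrent just to see the general shape of such a script, here is a minimal sketch of a threaded downloader. It is not the script from the torrent. The names fetch and crawl, the numbered output files, and the list of URLs passed in are all assumptions made for illustration. Only the worker count of 14 comes from the numbers above.

```python
import concurrent.futures
import pathlib
import urllib.request


def fetch(url, timeout=30):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, out_dir, workers=14, fetch_fn=fetch):
    """Fetch every URL on a pool of worker threads and save each body.

    Hypothetical helper: returns a dict mapping each URL to the path it
    was saved to, or to the exception raised while fetching it.
    """
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        # Remember which URL (and which output slot) each future belongs to.
        futures = {
            pool.submit(fetch_fn, url): (index, url)
            for index, url in enumerate(urls)
        }
        for future in concurrent.futures.as_completed(futures):
            index, url = futures[future]
            try:
                body = future.result()
            except Exception as error:
                # One failed download should not stop the rest of the crawl.
                results[url] = error
                continue
            path = out_dir / f"{index:08d}.txt"
            path.write_bytes(body)
            results[url] = path
    return results
```

Calling crawl(list_of_urls, "archive") would write one numbered file per URL. Passing fetch_fn separately makes it easy to swap in a slower, rate-limited fetcher, or a fake one for testing without touching the network.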