The Great Gist Heist

I have crawled, downloaded, and archived all of gist.github.com. Please hear my story before jumping to conclusions.

Why?

I’m currently building software that requires a large corpus of source code.

I began searching for an existing collection of source code documents, but my pursuit appeared fruitless. Displeased, I attempted to gather all of my own source code. My collection lacked variety, perhaps because of my reverence for the Python language.

Regardless of the reason, I needed a larger quantity of samples. I needed unbiased samples from all programming languages. Most importantly, I needed samples of varying quality, the kind that only the most popular paste sites have… sites like gist.

Why are you sharing it?

I feel a little bad about using GitHub’s bandwidth.

Sharing this collection should reduce the chances that others will crawl for the same data. If you need a large collection of source code, download this torrent.

How did you do it?

I wrote a short, 30-line Python script. The script is part of the torrent.

At the peak of the scrape I had 14 threads of the script running, using approximately 580 Kbps (measured with iftop).
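
For anyone curious what a threaded fetch like this can look like, here is a minimal sketch. It is not the script from the torrent. The names `fetch`, `crawl`, and `fetch_fn` are hypothetical, and the only detail taken from the story above is the default of 14 workers. It assumes you already have a list of URLs to download.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one URL and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, fetch_fn=fetch, workers=14):
    """Fetch every URL using a pool of worker threads.

    Returns a dict mapping each URL to its body, or to None when the
    fetch raised an exception. The default of 14 workers mirrors the
    peak thread count mentioned in the post; everything else here is
    an assumption.
    """
    urls = list(urls)  # materialise so the input can be iterated twice

    def safe_fetch(url):
        # One failed download should not abort the whole crawl.
        try:
            return fetch_fn(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        bodies = pool.map(safe_fetch, urls)
        return dict(zip(urls, bodies))
```

Calling `crawl(list_of_urls)` returns the downloaded bodies keyed by URL. Passing a different `fetch_fn` makes it easy to add a delay between requests, which is the polite thing to do when borrowing someone else's bandwidth.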

One thought on “The Great Gist Heist”

  1. Hi Russel,

    I found this link through criticue (after they updated to let me see the rest of the feedback). Already downloading this so cheers for providing the feedback. If you want to contact me directly use the email (in this post).

    Always looking for someone to help out.

    Ben
