An alternative approach to the problem of backlinks is to start from existing links between pages, using something like Common Crawl. I imagine you could parse each page in the corpus, extract its outbound links, and build a dataset of all the links to and from each page. This would take an enormous amount of computational effort, though. It's also based only on historical snapshots of the pages, so the backlinks wouldn't update until a new crawl comes out.
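The core inversion step can be sketched in a few lines. This is a toy, in-memory version assuming the corpus is already fetched as a `dict` of URL to HTML (real Common Crawl work would stream WARC files, likely with a distributed framework); the URLs below are hypothetical:

```python
from html.parser import HTMLParser
from collections import defaultdict

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def build_backlinks(corpus):
    """corpus: dict mapping page URL -> HTML source.
    Returns dict mapping target URL -> set of pages linking to it."""
    backlinks = defaultdict(set)
    for url, html in corpus.items():
        parser = LinkExtractor()
        parser.feed(html)
        for target in parser.links:
            backlinks[target].add(url)
    return backlinks

# Hypothetical two-page corpus where both pages link to b.example.
corpus = {
    "https://a.example": '<a href="https://b.example">B</a>',
    "https://c.example": '<a href="https://b.example">also B</a>',
}
print(sorted(build_backlinks(corpus)["https://b.example"]))
# → ['https://a.example', 'https://c.example']
```

The expensive part at crawl scale isn't this loop but the extraction over billions of pages; that said, Common Crawl also publishes pre-extracted link graphs, which would avoid re-parsing the HTML yourself.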