Working with the British Library Dataset

20 April 2016

Earlier this year Tim Weyrich directed me onto a dataset published by the British Library and since then my research has focused heavily around this. Within Computer Vision it is unusual to get a large dataset not skewed to achieve a specific research goal. Sometimes datasets can be repurposed, but this is requires extensive effort to get the data in its rawest form.

The British Library dataset is a quite literally a "dump" of all the unknown to OCR elements from the book scanning performed by Microsoft. Therefore is not just a collection of line art imagery, but also of elaborate charachters or section embroidery.

I intend to post more on working with this dataset as time goes on, but for now there is a github which contains the directory of images: Directory of Images (Github)

And what is the main repository on Flickr:Flickr Repository