So where we're at is that it's hosted, but we're still organising the data. Scraping the document IDs automatically is difficult as they move around inside the documents, they are not terribly consistent in format, and many documents don't have visible part numbers at all. Many uploaders were nice enough to name their files after the Volvo part number of the document, but even then I expect 99% of visitors will want to filter by model and/or year, which is again not easy to automatically pluck from the documents with certainty.
We played around with indexing: we chucked the whole lot into ES (Elasticsearch) and used Ambar as a frontend, but it was resource-intensive and not actually terribly useful. In a bunch of my scenarios I was looking for something like "rivet", but Volvo call it something really generic like "retaining fastener". It was good at finding wiring diagrams though.
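To illustrate the "rivet" problem, here's a rough sketch of expanding a search term with a hand-built synonym table before it goes to the index. This is not what Ambar or ES does out of the box. The `SYNONYMS` table and `expand_query` function are made up for illustration, and only the rivet entry comes from the example above.

```python
# Hypothetical synonym table: everyday term -> wording the manuals might use.
# Only the "rivet" entry comes from this post; a real table would be built by hand.
SYNONYMS = {
    "rivet": ["retaining fastener"],
}


def expand_query(query: str) -> list[str]:
    """Return the normalised query plus any known manual-speak equivalents."""
    key = query.strip().lower()
    return [key] + SYNONYMS.get(key, [])


print(expand_query("Rivet"))  # ['rivet', 'retaining fastener']
```

The catch is that somebody still has to know that a rivet is a "retaining fastener" in the first place, so this only helps as far as the table has been filled in.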
Anyway, so we ditched Ambar, now we're looking at ES Service with Searchkit as a frontend out of S3.
If we don't get that working to a satisfactory level in the next few days, we'll switch to the alternative approach, which is to recruit a bunch of people to go through the documents and mark them up in a spreadsheet, then have a script break it all up into useful metadata and organise the files from that.
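As a rough idea of what the spreadsheet script could look like, here's a minimal sketch that reads a CSV export and works out a model/year/part-number destination for each file. The column names (`filename`, `part_number`, `model`, `year`), the folder layout, the `plan_moves` function and the sample rows are all assumptions for illustration, not the actual plan.

```python
import csv
import io
from pathlib import PurePosixPath


def plan_moves(csv_text: str) -> dict[str, str]:
    """Map each source filename to a model/year/name destination path."""
    moves = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        source = row["filename"].strip()
        part = row["part_number"].strip()
        # Blank cells mean the volunteer couldn't tell, so park those files.
        model = row["model"].strip() or "unknown-model"
        year = row["year"].strip() or "unknown-year"
        suffix = PurePosixPath(source).suffix
        # Fall back to the original filename when no part number was recorded.
        name = f"{part}{suffix}" if part else source
        moves[source] = str(PurePosixPath(model, year, name))
    return moves


# Made-up sample rows standing in for the volunteers' spreadsheet.
SAMPLE = """filename,part_number,model,year
scan_0001.pdf,EXAMPLE-123,240,1988
IMG_2044.jpg,,,
"""

for src, dst in plan_moves(SAMPLE).items():
    print(src, "->", dst)
```

It only plans the moves and doesn't touch any files, so the result can be eyeballed before anything is actually renamed.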
If we go the Mechanical Turk route, I'll post here looking for some volunteers to help out.
As an aside, "final" count is 14.6GB and 2190 documents/images after auto deduplication, but we've found there are some documents that sneak through (i.e. they're the same actual document but scanned by different people).
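For anyone wondering why rescans sneak through: assuming the auto deduplication compares file contents byte for byte (this post doesn't go into how it actually works), two scans of the same paper document will never match. A minimal sketch with made-up file names and contents, where `dedupe` is a hypothetical helper:

```python
import hashlib


def dedupe(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Group file names by the SHA-256 digest of their contents."""
    groups: dict[str, list[str]] = {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        groups.setdefault(digest, []).append(name)
    return groups


# Made-up stand-ins: two byte-identical uploads and one rescan of the same page.
files = {
    "a.pdf": b"same bytes",
    "a_copy.pdf": b"same bytes",
    "a_rescan.pdf": b"same page, different scan",
}

groups = dedupe(files)
print(len(groups))  # 2: the copy is caught, the rescan is not
```

Catching the rescans would need something that compares what's on the page rather than the raw bytes, which is a much fuzzier problem.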