
/hydrus/ - Hydrus Network

Bug reports, feature requests, and other discussion for the hydrus network.

New user? Start here ---> http://hydrusnetwork.github.io/hydrus/

Current to-do list has: 1035 items

Current big job: downloader engine overhaul

File: 1424393184272.jpg (1.57 MB, 1500x1978, 750:989, 1b42a554ea243f40c1ec80b391….jpg)

438960 No.290

Here is a 7zip of the client database, version 147, freshly initialised and synced up to my public tag repository as of today.


If you want to start a new client that connects to my public tag repository, you can swap in this database right after you install, and you won't have to spend twenty hours sitting around waiting for 7.8 million mappings to process.

If you have no idea what this is, I suggest you ignore it and install the client normally, learning about how hydrus works using my help files first.

438960 No.811

File: 1433964257285.jpg (3.07 MB, 2042x1915, 2042:1915, 0592004c4a4926ad1300c2da28….jpg)

Here is an update, up to version 159 and update 1107, which is today, June 10th 2015. It now has 14.2 million mappings.


Also, here is a tag archive of the same:


438960 No.1204

File: 1444430383249.jpg (1.03 MB, 3000x1977, 1000:659, 0de26cbb2ba79d0cefa6114018….jpg)

And another, up to version 176 and update 1212, which is today, October 9th 2015. It now has 20.5 million mappings.


And the tag archive:


438960 No.1263

Do you need both the database and the tag archive? I'm new, but I'm not going to be able to do it the normal way; it hangs an insane amount when processing.

438960 No.1266

File: 1445447637573.jpg (247.01 KB, 1280x866, 640:433, faa387277148bad454632afb35….jpg)


Just the db. Swap out the db directory with the 'bare_database' 7z, and you should be good to go.

This is a bare database, though. It is essentially a fresh install that has synced to my public tag repo and done nothing else. All the options and files already in your client will disappear, so you will have to reset and reimport them, which might be more than a new user wants to deal with.

If you want to stick with what you already have, you can change when and how the client does its high-CPU processing in file->options->maintenance and processing. If the client is processing and freezing up when you don't want it to, let me know.

438960 No.1436

If I'm already fully synched with the PTR, then I don't need the PTR tag archive, right? Since they're the same thing?

438960 No.1437

File: 1447525498974.jpg (1.44 MB, 4433x1858, 4433:1858, 19fce413c5a9c4e7aa85b7e036….jpg)


Yeah, they are the same, just in different containers. The archive is really just for programmers who want to screw around with the data alone.

438960 No.1585

File: 1450227585621.png (32.5 KB, 830x648, 415:324, screen.1450219925.png)

Finally got around to playing with tag archives thanks to an anon's scripts. I have some questions though.

1. I made a tag archive of xbooru. It has 556,667 hashes. Does this mean Hydrus should be able to recognize 556k images from xbooru once imported? As in, I shouldn't need to run the downloader in order for it to grab tags for the image.

2. Would having too many tag archives slow or clutter my client.db? I currently have gelbooru, danbooru, e621, and derpibooru tag archives imported. (I'm in the process of importing rule34xxx & xbooru; here's hoping the import is successful.) In the future I plan on doing more, such as rule34paheal & rule34hentai, so I'm curious if everything will still hold up. I don't know if I'm the only one taking full advantage of these, so I'm asking to make sure.

3. This is kind of unrelated, but is it possible to sync my client.db with the PTR? The main reason I'm making the tag archives is that I'm not sure which booru siterips the PTR already contains. Hopefully I can get the majority of them into the PTR.

438960 No.1588

File: 1450287229719.jpg (1.38 MB, 2036x1523, 2036:1523, f1665da89398eeda0549eb9dc7….jpg)



Yes. Any files you already have that match any of those hashes will gain the namespaces you chose in manage services. Any files you import in future will be checked against the tag archive in case they match as well.


The only time the client queries a tag archive is on the initial sync and the check every subsequent file import. It typically only takes a couple of milliseconds per import per archive, so don't worry about it–add as many as you want.


It depends on the hash type of your archive. The hydrus client and server refer to files using sha256, but I assume xbooru has provided you with md5. The hydrus client only has md5s for local files, so it can only cross-reference to make the conversion of (md5->tag)->(sha256->tag) for files you actually have. As you have probably experienced with your other tag archives, this still usually represents a hell of a lot of tags!
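The (md5->tag)->(sha256->tag) conversion described above can be sketched roughly like this. This is a toy illustration, not hydrus's actual code; the dict shapes are assumptions.

```python
# Toy sketch of the (md5 -> tag) -> (sha256 -> tag) conversion described
# above. Hydrus's real implementation differs; the dict shapes are assumptions.
def convert_mappings(archive_md5_to_tags, local_md5_to_sha256):
    """Keep only the archive mappings whose files we actually hold locally,
    since only local files let us pair an md5 with its sha256."""
    sha256_to_tags = {}
    for md5_hash, tags in archive_md5_to_tags.items():
        sha256_hash = local_md5_to_sha256.get(md5_hash)
        if sha256_hash is not None:  # we only win when we hold the file
            sha256_to_tags[sha256_hash] = tags
    return sha256_to_tags
```

This is why coverage grows as the archive is shared around: each user contributes the md5->sha256 pairs for the files only they hold.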

If you happen to have sha256, though, then you can potentially sync the entire gigantic tag archive into the PTR! This is a slightly terrifying prospect, but if you can, give it a go! I expect trying to upload 9 million mappings or whatever in one go will cause an error in one way or another, so let me know if you end up in this situation–I might have to rewrite the upload system a little to spread it out a bit or add a cutoff or something.

If you do have md5 or sha1, you can get better coverage by sharing the archive around many users, who will have many of the files that you don't, enabling a greater number of md5->sha256 conversions. Would you be willing to share your archive? I would 7zip it and mirror it up on my mediafire like the others.

438960 No.1594

File: 1450365776257.png (75.61 KB, 970x635, 194:127, screen.1450326881.png)

I apologize. I had to delete my xbooru.db download link. Although it does work, it seems my xbooru namespace.csv was missing a bunch of namespaces. (could have sworn I changed the max Pid) As you can see in the picture, it did match the md5 with the proper tags, but "isabelle (animal crossing)" should have had the 'character:' namespace. I'll put the link back up when it's finished.

438960 No.1599

File: 1450542830467.png (55.51 KB, 914x509, 914:509, screen.1450541833.png)


Some updates.

Rule34xxx tag archive is done

xbooru tag archive is done

Rule34Pheal tag archive is done

sof.booru.org tag archive is done(*1)

ohdd.booru.org tag archive is done(*1)

I will post all of them once they're done syncing (or on my off day). Everything seems to be going fine though.

(*1) Might be interesting to some, but I discovered that some of the sites from booru.org (sof/ohdd, etc.) use incorrect hashing. Initially I thought they were using SHA1, since the hashes were 40 characters long, but then noticed that none of my files were being recognized by hydrus, even after using HASH_TYPE_SHA1 to generate the db. Turns out the hashes were completely wrong, and were a mixture of SHA1 + MD5. This meant converting all the hashes in the csv to md5. Here is the script I made to convert the hashes.

#USAGE: md5-convert.py "directory_with_files" ".csv file"

import codecs
import hashlib
import os
import re
import sys

DIR = sys.argv[1]
CSV_FILE = sys.argv[2]
count = 0

#gets the md5 hash of a file, reading it in 4KB chunks
def md5(fname):
    h = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            h.update(chunk)
    return h.hexdigest()

#read the whole csv in once so we can do global replacements
with codecs.open(CSV_FILE, encoding='utf-8', mode='r') as f:
    data = f.read()

for line in data.splitlines():

    fields = line.split('\t')
    match = re.search(r"\.jpg|\.jpeg|\.png|\.gif|\.webm", fields[2])
    if match is None:
        continue
    ext = match.group(0)
    hash = fields[1]

    #construct the absolute path for the md5 function
    path = os.path.join(DIR, hash + ext)

    #check to make sure the file exists and is readable
    if os.path.isfile(path) and os.access(path, os.R_OK):
        filehash = md5(path)
        data = data.replace(hash, filehash)
        count += 1
        print("line #{count} converted to md5.".format(count=count))

#write the converted data out to a temp file
with codecs.open('temp.csv', encoding='utf-8', mode='w') as f2:
    f2.write(data)

print("All hashes have been converted.")


After converting and recreating the db, they now work fine.

438960 No.1605

File: 1450553671877.jpg (690.45 KB, 2000x997, 2000:997, 2d321ed7865d08a6d9b0a6b774….jpg)


That mixed hash thing is odd. Where are you getting the incorrect 40-character hash from? I tried gelbooru's api call:



but it just sends me to blah.booru.org//. Yet it works for SizeBooru, here:


Am I doing something wrong there? Maybe those boorus just have the API turned off.

By the way, if you want to generate sha256 for those files, it works exactly the same; just use hashlib.sha256(). If you are interested, I use it because modern CPUs can generate it quickly and it isn't easy to generate a collision. md5 is easy to spoof:


And as I understand it, sha1 isn't far behind.

438960 No.1608


>That mixed hash thing is odd


>The Gelbooru author who made the hashing function applies SHA-1 to MD5 hashes.

I got that info from their forums. I have no idea if it's accurate or not, but it seems it only applies to boorus running Gelbooru Beta 0.1.11 and such.

>Where are you getting the incorrect 40-character hash from?

If you look at their thumbnail urls/image urls, you'll see the hash is something like 09efcc1d867698c11734ec52120f133bdc9faf9f, which is 40 chars.

>I tried gelbooru's api call

I'm guessing 0.1.11 didn't have an API yet. However, it was still pretty straightforward to do HTML parsing with Beautiful Soup. Hell, I think I'm starting to like Python. (I mainly use C#.)

>Yet it works for SizeBooru

Yeah, it seems it's because they're running Gelbooru 0.2.

>just use hashlib.sha256()

I probably should have done this instead. Would using sha256 speed up the syncing of tag archives too? I will use that for the next batch of sites I do.

>md5 is easy to spoof

Wow didn't know about that. Good thing you were thinking ahead

438960 No.1611

File: 1450640925906.jpg (165.65 KB, 1280x960, 4:3, 38685f183aa7c1aad5c19502fe….jpg)


>sha1 of the literal md5

lol wat

I can sympathise, though–I bet the dev wanted to move to sha1 but didn't want to recalculate everything or something, or he thought sha1( md5( file ) ) was the same as sha1( file ). I see that 0.2 goes back to something more reasonable.
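For the curious, here is one plausible reading of that legacy scheme, sketched in Python. Whether the sha1 was taken over the md5 hex digest string or its raw bytes is a guess; the point is just that both outputs are 40 hex characters, which is why the mixed hashes passed for ordinary sha1 at first glance.

```python
import hashlib

def plain_sha1(data):
    """Straight sha1 of the file bytes, what the 40-char hashes looked like."""
    return hashlib.sha1(data).hexdigest()

def legacy_booru_hash(data):
    # sha1 applied to the md5 *hex digest* of the file, i.e. sha1( md5( file ) )
    # as discussed above -- a guess at the Gelbooru Beta 0.1.11 behaviour
    md5_hex = hashlib.md5(data).hexdigest()
    return hashlib.sha1(md5_hex.encode('ascii')).hexdigest()
```

Both functions return 40-character hex strings for the same input, but the values never line up, matching the "none of my files were recognized" symptom.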

I have come to like python in a not dissimilar way. It is a bit janky sometimes (GIL is almost a deal-breaker, imo, and should have been fixed years ago), and it runs a bit slow, but I find I can prototype things so much faster than if I do them 'properly'. Since I am always short on time, fast development is a high priority for me. I think they should teach it in school, as any smart kid, given the python console, can be up and running in about fifteen seconds. Instead it seems they often go for some opaque bullshit like Java.

Hydrus indexes all the hashes it stores, so it is about the same speed for any of them. The smaller ones are probably quicker just because there are fewer bytes to compare, and the db index is smaller because it only stores local files' hashes, but I expect the difference is measurable in microseconds.

sha256 is a good bet for anything hydrus-related just because it is the native hash, so you can transfer it over the network. I knew when I chose it that it would be a small problem interfacing with the legacy formats most other sites use, but I am spergish enough that I would rather do what makes engineering sense to me than work with consensus. Then again, someone was telling me IPFS uses some metahash or something where it basically adds a hash_type descriptor-header to its hashes, like "sha1:[sha_hash]", which mite b cool to support in the future, especially if it turns out we all need to move to SHA3. I would rather that the booru/imageboard devs just jumped to sha256 for now, though.

438960 No.1619

File: 1450847115418.png (641.78 KB, 1595x833, 1595:833, screen.1450846694.png)


Well here are the tag archives.

Hopefully you can mirror them to the hydrus mediafire folder too. They should be complete tag rips, as of Dec 10~20ish. Some of the sites do not use namespaces, so you'll get some creator/character/series tags as generic tags. I couldn't think of a way to fix that without manually editing the tags (yea.. not going to map out 1mil tags lol) or using another site's namespaces, which could technically work, but it'd probably be messy and miss a bunch of tags too, especially if it's gelbooru's namespace, since they use Japanese names. (For example, rule34 uses the name "Dawn", yet Gelbooru uses "Hikari", meaning all Pokemon characters wouldn't be mapped if that tag was found.)

Link: https://mega.nz/#F!HUIX1ZIB!buAIE_bgKCKZ0G0Mjk4CGQ

db: ohdd.db

name: Onahole Doll - オナホドール

site: http://ohdd.booru.org/index.php

db: rule34hentai.db

name: Rule34Hentai

site: http://rule34hentai.net/post/list/1

(includes all the loli tags too)

db: rule34pheal.db

name: Rule34paheal

site: http://rule34.paheal.net/post/list/1

db: rule34xxx.db

name: Rule34xxx

site: http://rule34.xxx/index.php?page=post&s=list

db: sof.db

name: Semen on Figures

site: http://sof.booru.org/index.php?page=post&s=list

db: xbooru.db

name: Xbooru

site: http://xbooru.com/index.php?page=post&s=list

Pic-related is an import of some rule34hentai webms I saved, being tagged accordingly.

438960 No.1620


db: drunkenpumken.db

name: Drunken Pumken

site: http://drunkenpumken.booru.org/index.php

438960 No.1621


Wow! Thank you so much.

438960 No.1622


Sure thing. The more images that can be recognized and tagged, the better.

Planning on doing yande, shuushuu and konachan next. If anyone has any other suggestions, I'd be happy to try them.

438960 No.1623

File: 1450903343032.jpg (10.47 KB, 228x250, 114:125, e33b9cfba9d16b30b5561a9aa7….jpg)



This is fantastic, thank you for the hard work!

I am uploading them to my mediafire now:


438960 No.1630


Oh shit! Thanks for xbooru.


sankaku chan, sankaku idol and gelbooru come to mind~

438960 No.1668




db: konachan.db

name: Konachan

site: http://konachan.com/post

db: yande.db

name: Yande.re

site: https://yande.re/post

db: lolibooru-moe.db

name: Lolibooru Moe

site: https://lolibooru.moe/post

db: tbib.db

name: The Big ImageBoard (TBIB)

site: http://tbib.org/index.php?page=post&s=list

(Note: Take caution when adding this one, as it is very large (4.6mil hashes, 550k tags). The site is apparently a sync between many boorus, though it doesn't mention which ones. I haven't actually synced this one yet, but I imagine it'd take a long time, as in like a week to sync.)


No problem. I'd love to do as many as I can.



There is a problem with this site. I can't figure out how to get past page 1000 (a hard limit). In other boorus like gelbooru, you can bypass it by searching IDs, but it's disabled in sankaku for some reason. However, it shouldn't be too bad, since sankaku is basically a mirror of Gelbooru, I thought? I can do sankaku idol up to page 1k too, but something about an incomplete archive feels wrong.


Sure, though there's already one done in the hydrus mediafire folder (from July 2015). I can make an updated one though.

438960 No.1764

File: 1452655151954.png (42.06 KB, 1577x743, 1577:743, screen.1452654740.png)



Here's the next batch that I will release.

Gelbooru [2016] - Done/Synced

Safebooru - Done/Synced

Dollbooru - Done/Synced

Uberbooru - Done/Not synced.

Nihonomaru - Done/Synced

My Figure Collection - In progress. Should be done in a week or so. The problem is the site does not store any of the images' hashes, and they also have a page limit like Sankaku. This meant using the dirty method (iterating through every id one by one; don't want to spam the server), and it also meant I have to download every image off the site to get their hashes and use them in a .db, which is roughly around 1.5mil. Not a huge deal, it just takes a while. Once done, it should also support the proper namespaces the image belongs to (Figures/Items/Collections/etc). Pic-related. (The '00000' column will be replaced with the sha256 of the files once they're downloaded.)

Sankaku Idol/Sankaku - In progress. I found a way around their page limit, so I should be able to make a .db of it. (Same method as MyFigureCollection.)

438960 No.1814


I wanted to release them all this Wednesday, but MyFigureCollection & Sankaku will take longer than I thought, so I'll just post the complete ones now.

Link: https://mega.nz/#F!CQw1lSBK!kWbjIxugVOA73f_vXPTxTQ

db: gelbooru2016.db

name: Gelbooru

site: http://gelbooru.com/

db: dollbooru.db

name: Dollbooru

site: http://dollbooru.org/post/list

db: nihonomaru.db

name: Nihonomaru

site: http://www.nihonomaru.net/chan/

db: safebooru.db

name: Safebooru

site: http://safebooru.org/

db: uberbooru.db

name: Überbooru

site: https://uberbooru.com/

438960 No.1820


Question: how do I add these extra tag databases?

Also, could you by any chance grab the drawfriends and deviants depository boorus?



438960 No.1821

File: 1453149182012.png (1.78 MB, 858x725, 858:725, screen.1453149002.png)


1. Shut down Hydrus first, if it's running.

2. Drop the .db in '\db\client_archives', and start hydrus up again.

3. Go to services -> manage services.

4. In the local tab under 'local tags', click the add button under archive sync. (Pic-related.)

5. Then check all the boxes for the tags you want to sync (no namespace/character/artist/etc).

Depending on the size, it can take quite some time. Once synced, hydrus will be able to tag any file that came from that site, up to the date the archive was created.

> grab the drawfriends and deviants depository boorus

Sure thing

438960 No.1822


Oooooooooh, I see. Does this enable me to download from those sites afterwards too?

Also, will it retroactively check images I already have against these databases and tag them?

I am seeing a ton of images I have but don't have tagged properly, so this might help speed up my tagging a lot if I can do that.


438960 No.1823


>does this enable me to download from those sites afterwards too

The dbs aren't related to downloading. If you're talking about downloading images with Hydrus, then you can only download from sites that hydrus supports. In the future, I believe he will make it easier to add custom sites.

>Also will it retroactively check images I already have to these databases and tag them

Yep, if you already added an image, then it should get tagged if it matches anything from the database.

>I am seeing a ton of images I have but don't have tagged properly

As long as you did not modify the file (as this would change the hash), it should be recognized accordingly. Also note, none of the dbs check for resampled images. (Sometimes you might accidentally download the sample version instead of the original from gelbooru/etc.)
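The hash sensitivity mentioned above is easy to demonstrate: even a one-byte difference, as any resave or resample would introduce, produces a completely unrelated digest, so the archive lookup misses.

```python
import hashlib

original = b'stand-in for a file\'s bytes'  # placeholder, not a real image
resaved = original + b'\x00'                # a resave/resample alters the bytes

# even a one-byte difference gives a completely unrelated digest
assert hashlib.sha256(original).hexdigest() != hashlib.sha256(resaved).hexdigest()
```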

>so this might help speed up my tagging a lot if i can do that

Definitely. I don't think I've manually tagged a single one of my images yet ;p

438960 No.1833

File: 1453228249286.png (777.42 KB, 1600x839, 1600:839, screen.1453228186.png)


Alright, here are the ones you wanted, tested and working. Since they were small, I could finish them quickly.





438960 No.1843

File: 1453274431305.jpg (103.94 KB, 1280x720, 16:9, 123456789.jpg)

That's a lot of archives, thanks.

But what should I do if I already synced my db with the old gelbooru archive? Should I delete it before adding the new one?

438960 No.1845


>Should i delete it before adding new one?

I have both synced just in case, and I haven't noticed any performance degradation/issues with it, so I assume it's fine to keep both.

438960 No.1865

File: 1453412529336.png (55.39 KB, 780x609, 260:203, problems.png)

Archives downloaded from:


Tested & Working Archive Tag Packs:







Broken or malformed Archive Tag Packs:



(See image)

Danbooru's fails to load

Gelbooru's has a bunch of extra tags that may or may not cause an issue (i.e. only select Character, Artist, Series, Creator instead of the misspelled ones).

If anyone can confirm Gelbooru's status that would be great.

438960 No.1866

438960 No.1872


I'm guessing the person who made the first gelbooru archive didn't get rid of the bad namespaces. It should still work, though, if you pick the correct ones.

I didn't make the danbooru archive either, but I was able to successfully sync with the creator namespace.

I guess it's time for a danbooru 2016 too

438960 No.1890






(About 262k more hashes compared to the old one)

438960 No.1891




name:All The Fallen


(loli-themed booru. It probably has the same content as gelbooru, but I ripped it just in case it has some older images which may have been removed from other sites, or missed. 136k images, 43k tags)

438960 No.1892




name: Zombooru


(another old site. ripped it because it may have old images)

438960 No.1909

File: 1453977985065.png (578.18 KB, 1280x720, 16:9, princess_kenny_1.png)


awesome, thanks friend, these two have a lot of tags for some more obscure images

438960 No.1931

File: 1454183107382.jpg (7.8 MB, 3832x5320, 479:665, 3df15aa7ef0627c94c67eaea58….jpg)


And now up to version 191 and update 1310, which is today, January 30th 2016. It now has 33.7 million mappings.

EDIT: The previous link was lacking client_updates and the other folders, here is a corrected upload:



438960 No.1934


Is there a way to import just the tags to an already existing db without overwriting all personal settings?

438960 No.1935


For anyone wanting to try, the answer is no, especially not when your backup software has been dead ever since you installed it, without your knowledge. Time to drink.

438960 No.1969

File: 1454512352323-0.png (586.31 KB, 1415x829, 1415:829, screen1.png)

File: 1454512352324-1.png (391.57 KB, 1371x575, 1371:575, screen2.png)

File: 1454512352324-2.png (777.7 KB, 1595x833, 1595:833, s3.png)



(Edit: some series weren't being recognized because the tags had colons in them. I realized I could use this to my advantage and add the "series:" namespace to them, so if you downloaded the .db 3-5 hours ago, please use this instead.)


name:My Figure Collection


Finally done. This is a tag archive of the pictures section from MFC, which includes over 1.3 million photos of figurines.

In Hydrus, you can search by category (category:figures, category:kits&customs, etc.), image ID and the uploader.

There are two portions to the site. One where users upload photos they've taken (the current tag archive), and another where the creators sell their figures, which has more detailed information such as the company, character, materials, etc. That part will be done in a few days.

Hopefully anyone else who is interested in figurines will benefit.


Glad I could help.

438960 No.2000

File: 1454772742005-0.jpeg (70.35 KB, 600x900, 2:3, nguyedt1452743803.jpeg)

File: 1454772742005-1.png (492.81 KB, 1357x782, 59:34, 1.png)

File: 1454772742005-2.png (803.38 KB, 1595x785, 319:157, 2.png)

File: 1454772742006-3.png (20.84 KB, 774x753, 258:251, 3.png)




name:My Figure Collection


Here is the second portion of MFC, the item database. You can use it with the mfc picture db to find a bit more information from a figurine. You can search most of the same things you would from the myfigurecollection.net/item/ page. Basically, same as using the site, except in Hydrus. Don't forget to add colors to all the namespaces added.

List of item namespaces:



creator (both sculptor and illustrator)





category (prepainted, action/dolls, garage kits)


As for Sankaku Idol/Chan.. those will be a while. I don't know what it is with their site, but they go to great lengths to prevent it from being crawled. Even after using 10-second delays between each page request, I still get request timeouts every so often, which makes the whole process extremely slow. I've never used Hentai Foundry much, but they seem to have a great deal of western art, which I really want, so I'll probably make a tag archive of that next.

438960 No.2001


I made the old gelbooru, danbooru and e621 archives, and indeed I didn't clear the namespaces properly. I realized it far too late and was too lazy to fix it; there should be some post where I realize this, maybe it's on github still.

Sorry about that.

438960 No.2002


Maybe hydrus could add support for namespace siblings/parents; then we could correct those namespaces and wouldn't have to bother fixing them in the sources themselves (although it is a good idea to fix those in the tag archives, because some namespaces and tags are utter shit and should be cleaned in a centralized tag system/provider).

438960 No.2006

File: 1454839954706.jpg (577.79 KB, 1247x1556, 1247:1556, mfc.jpg)


No problem, it's an easy fix. Major thanks for that csv converter script, though. It would have taken much longer for me to get started without it.


>support for namespace siblings / parents

That's a good idea. Some namespaces overlap, like 'origin' from the MFC item db being the same as 'series' for the most part. I only used origin to keep in line with the site.

>cleaned in a centralized tag system/provider

Isn't that basically what the PTR is? Users submit tags, which can then be petitioned by other users. I've yet to see how that all works though since my client is usually in a constant state of importing files and testing tag archives.

438960 No.2008


> Isn't that basically what the PTR is?

Yeah, it was just my roundabout way of saying it would probably be smarter to fix it at the source (the tag archives).

438960 No.2037


It's been a while and it seems you found a way around it, but for Sankaku: while the id metatag does not work, you can actually use date instead.


This might be a faster way to do it.

438960 No.2038


Great find! It means I can actually finish in a sane amount of time now since I can use their API.

438960 No.2056

File: 1455801456813.png (1.1 MB, 1594x834, 797:417, screen.1455801245.png)




name: Hentai Foundry


Namespaces include creator and title of work. Keywords are considered to have no namespace. Contains around 200k hashes. To be honest, I thought it'd be bigger, but I believe I did grab all content from every user on the site.

Sankaku chan is nearly done too. I just need to get namespace tags.

438960 No.2072

File: 1456048705682.png (3.38 KB, 397x109, 397:109, Безымянный.png)


>Hentai Foundry

Hmm, when I try to sync this archive with my db, it seems like Hydrus thinks that I have all these images and tries to sync them all. But I only have around 70k images. Other archives like gelbooru2016 or safebooru match only a few hundred or thousand of my images.

Is there something wrong with this archive, is it a problem in my db, or is there no problem at all?

438960 No.2079


As far as I understand it, it's the number of images found in the archive, which it then checks against the present files.

I have the same number shown in that dialogue, but I have 369,892 images in the db.

438960 No.2083


Maybe. It just confuses me a little, because the other archives give much smaller numbers.

438960 No.2084

File: 1456111527009-0.png (2.45 KB, 206x81, 206:81, 01.png)

File: 1456111527009-1.png (3.2 KB, 392x92, 98:23, 02.png)

Yep. There's definitely some difference between these archives. I made a new db and tried to sync it with the gelbooru2016 archive. It processed the whole archive, around 2 million files, but matched 0, because it's a new db without any files. The whole synchronization process takes around 1-2 minutes.

Then I made a new db again and tried to sync it with the hentaifoundry archive. It matched all 207376 files and started syncing them, which is much slower than the gelbooru archive.

438960 No.2090

Is there a "best way" to actually organize full manga on hydrus, as in, having chapters labeled and pages ordered? I have a few full series downloaded as numbered images in individual chapter folders.

438960 No.2092

File: 1456250755849.jpg (835.68 KB, 1600x799, 1600:799, b4be967a65a0ef79cc3601ed77….jpg)





That db seems to have sha256 hashes, which means hydrus can import all its hash-tag pairs (about two million, I think!), not just the ones it can match to your local files. Syncing with that db (as opposed to a one-time import) isn't actually important, as newly imported files that are checked against it are not going to produce any new mappings, since everything will have been added already. Also, once one person syncs all that stuff to a repo, like my ptr, then no one else needs to, as there will be nothing left to add.

There isn't a way, yet, to limit the tag archive sync to a particular file domain (like local files). Would you like one?


I don't think so. You can use page: and chapter: namespace to sort and collect inside hydrus, but I'm not really happy about the workflow. I expect I will support a multi-page, single-file .cbr-type format in future.

Have a play with a couple of chapters and the 'add tags based on filename' dialog when you import files, if you like, and let me know what you think.

438960 No.2101


>There isn't a way, yet, to limit the tag archive sync to a particular file domain (like local files). Would you like one?

Well, I don't know. I mean, I can wait a little longer; this archive is not that big. I was just curious why these archives have different sync processes. Thanks for the explanation.

But then again, here's this >>747 big archive, for example. And if I understand it right, syncing only local files would be much faster than importing all tags from this archive. If creating this function does not require much time, then having it just in case would be a good idea, I guess.

438960 No.2147

I've been reading through the docs for a while, and it's quite possible I'm just an autistic idiot, but I cannot figure out a way to tag things that were already imported. I accidentally imported some images from the boorus without selecting "tag", and now they are untagged. Is it possible to retag them?

438960 No.2148

File: 1456697816967.jpg (1.61 MB, 1254x1024, 627:512, 37fbae148ddcba63cd89c0813e….jpg)


Hydrus isn't yet clever enough to do this sort of reverse lookup, so you'll have to repeat the same query with the tags checked.

Thankfully, it is clever enough to remember when it has downloaded a file from a specific url, so it won't waste bandwidth trying to download the files themselves again–it should only fetch the html page to parse the tags and then apply them to the file.

438960 No.2175

Hi guys, OP of the original (2014) e621 db here. I'm reripping e621 now, under my original sanitization regime - read, normalized creator: series: species: character: namespaces instead of the site's artist: copyright: species: character: native set.

Should be up in a week or two, for anyone interested.

To the other guy making site archives: what hardware are you using, out of curiosity? I've been playing with short-duration EC2 c4.8xlarge instances, and they're obscenely quick for this, but pricy.

438960 No.2176


(by "obscenely quick" I mean "I ripped sha1 and md5 hashes of all of e621 in about 90 minutes at 1gbit". It's like renting a Bugatti to spin cookies on the front lawn of the Augusta National.)

438960 No.2186

File: 1457097320773.png (3.76 KB, 716x247, 716:247, screen.1457096014.png)



>To the other guy making site archives: what hardware are you using, out of curiosity?

I'm just using my cheap seedbox from https://seedboxes.cc/. Only $14 a month. I've been with them probably since 2013 now, and they're definitely reliable. Although advertised as a seedbox, they let you do basically anything on it. Of course, it's not just for downloading images and making tag archives. I also run h@h from sad panda on it, and download a ton of torrents from private trackers.

20Gbps up/down, though there's a 3TB monthly upload cap. As for making the tag archive, it's just a simple 50-line python script that parses their index.xml page: "https://e621.net/post/index.xml?limit=1000&page={page_id}"
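A rough sketch of what that 50-line script boils down to: page through the index.xml endpoint quoted above and pull (md5, tags) pairs out of each response. The function names here are illustrative, not the poster's actual code, and the XML attribute names are assumptions based on e621's API of the time.

```python
# Hedged sketch: page through e621's index.xml and extract (md5, tags)
# pairs. page_url/parse_posts are made-up names for illustration.
import xml.etree.ElementTree as ET

def page_url(page_id, limit=1000):
    # The XML post index endpoint quoted in the post above.
    return f"https://e621.net/post/index.xml?limit={limit}&page={page_id}"

def parse_posts(xml_text):
    """Yield (md5, tag_list) for each <post> element in one index page."""
    for post in ET.fromstring(xml_text).iter("post"):
        md5 = post.get("md5")
        tags = (post.get("tags") or "").split()
        if md5:
            yield md5, tags
```

From there it is just a loop over page ids, writing each batch of pairs into the archive db.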

>short-duration EC2 c4.8xlarge instances

I actually had no idea Amazon offered this type of service. I'll have to look into it.

>(by "obscenely quick" I mean "I ripped sha1 and md5 hashes of all of e621 in about 90 minutes at 1gbit"

Yeah e621 and gelbooru can be ripped quickly. Most tag archives from big sites can be done in a few hours, with the only exception being sankaku complex. Their incredibly restrictive server makes it a huge bottleneck.

Speaking of sankaku, I had to start over. I didn't realize they used a limit of 100 instead of the usual 1000 when doing post api searches, which meant my date ranges had gotten messed up. It doesn't help that it has 10 minute timeouts after 150 requests either. Makes it a huge pain.

438960 No.2192


Ah, interesting service - will have to look into it. I need something less limited than a t2.micro, and c4.8xlarges are $1.67 an HOUR (!).

I remember the last time I did this having an insane amount of difficulty with the e621 API - part of it was that I was a much worse programmer then, but part of it was that individual API pages didn't seem to include things like tag categories and ratings. I also don't, in theory, trust their hashes - in practice they work fine, but I'm developing here with an eye towards sites like Derpibooru that optimize the image and don't update the hash in their API, so if I want anything to match on that (since as you mentioned, nobody downloads originals) I'm going to have to manually get the hashes of both the orig_ and the optimized images.

So the way I'm doing it this time is using BeautifulSoup to actually scrape every single page on the site (/post/show/{id}, iterated from 000001 to 900000 ignoring 404s), grab the tags including namespaces, grab the rating, stream the image and hash it myself, put everything in a JSON file, save it into the locally running instance of MongoDB that I keep around, repeat.

I broke up the iteration into work-blocks, which are 1000 IDs each and are served from Mongo as well. So I can fire up any number of workers, which point themselves at Mongo, grab the top block, and send the data back.
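The work-block scheme above can be sketched without Mongo at all — the core is just "split the ID space into blocks, workers claim the next pending one". These function names are stand-ins for what the poster's MongoDB collection does, not his actual code.

```python
# Minimal, Mongo-free sketch of the work-block scheme: IDs grouped into
# blocks of 1000, workers claim the next pending block.
def make_blocks(start_id, end_id, size=1000):
    return [{"start": s, "end": min(s + size - 1, end_id), "status": "pending"}
            for s in range(start_id, end_id + 1, size)]

def claim_block(blocks):
    # A worker grabs the first still-pending block and marks it claimed.
    for b in blocks:
        if b["status"] == "pending":
            b["status"] = "claimed"
            return b
    return None
```

In the real setup this claim step is an atomic find-and-update against the Mongo collection, which is what makes it safe to run many workers in parallel.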

Now that I have it all in Mongo, all I'll have to do is write a quick script to iterate through it and dump it through the Hydrus Tag Archive generator.

This is definitely overkill, but it's extensible overkill - I don't need to worry about API differences, I don't need to worry about a process choking and hosing my output file, I don't need to worry about parallelism conflicts, etc.

The ultimate goal is to re-do Furaffinity, properly this time. Have to do hashing on the fly for that, too, and this method will work for sure now that their servers are less shit (thanks, IMVU! :P)

438960 No.2278

If I synced with a downloaded db, gelbooru for example, do I still need to keep that db file after the sync, or are all the mappings now inside my client db?

438960 No.2441

File: 1460661336311.jpg (90.4 KB, 600x417, 200:139, Cf_YwHgXEAAtXO6.jpg)

is there a way to reverse engineer these tag databases and add them to a booru?

I know someone starting up a rozen maiden booru, and I want all the tags I've done in hydrus to be able to cross over, as well as the tags from other boorus.

Any way to do this?

I've pushed all my tags to the public tag database

438960 No.2459

File: 1460834106347.png (477.56 KB, 640x548, 160:137, b5fa8aa827813449c96e8de503….png)


For most, you should keep it synced.

A mapping is essentially a pair of (file_hash, tag). Hydrus uses sha256 for its hash, but most boorus use md5, so the information cannot be imported without a cross-reference, which can only be generated if you have the original file. Syncing your client to an HTA imports all the mappings it can match against your local files, and then when you import another, it rechecks the tag archive and applies any new mappings it finds.
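The cross-reference boils down to hashing the original file bytes both ways: once with md5 for the booru side and once with sha256 for the hydrus side. A minimal sketch (the function name is just for illustration):

```python
# Compute both the booru-side md5 and the hydrus-side sha256 of a file's
# bytes, so mappings keyed on one hash can be matched to the other.
import hashlib

def hash_pair(data: bytes):
    return hashlib.md5(data).hexdigest(), hashlib.sha256(data).hexdigest()
```

This only works if you have the untouched original file — if the booru re-encoded the image, neither hash will line up.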


If you can do a little programming, sure! An HTA is a sqlite database, so if you look at it with SQLiteStudio or another sqlite program, you can see how it is structured. If your language has a sqlite library, you can write a script to convert the data in the db into whatever you need for your booru POST form or however you intend to import tags.

If you can program in python, you can use the interface I wrote to make it even easier. It is under any install's install_dir/include/HydrusTagArchive.py. I wrote a little intro at the top of the file, but let me know if you would like any extra help.
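If you'd rather go straight at the sqlite file than use HydrusTagArchive.py, a query like the following is the general shape. Be warned: the table and column names here (hashes, tags, mappings) are assumptions for illustration — open your HTA in SQLiteStudio or read HydrusTagArchive.py to confirm the actual schema before relying on them.

```python
# Hedged sketch of reading an HTA with plain sqlite3. Table/column names
# are assumed, not confirmed against a real HTA.
import sqlite3

def tags_for_hash(db_path, file_hash):
    """Return all tags mapped to one file hash, sorted alphabetically."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT t.tag FROM mappings m "
            "JOIN hashes h ON h.hash_id = m.hash_id "
            "JOIN tags t ON t.tag_id = m.tag_id "
            "WHERE h.hash = ? ORDER BY t.tag", (file_hash,)).fetchall()
        return [tag for (tag,) in rows]
    finally:
        con.close()
```

Iterating the mappings table the other way (hash per tag) is the same join in reverse, which is all a booru-import script really needs.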

To create your HTA, you want to do something like services->review services->public tag repo->perform a service-wide operation->export to HTA, then pick the hash_type you care about. I expect the booru uses md5, but you will have to check with your friend.

438960 No.2479

I tried to add the danbooru .db to my Hydrus but I get this error while it's trying to sync http://pastebin.com/JT1zKkfH don't know if this is the right place to ask for help.

438960 No.2480

I tried to add the danbooru .db to my Hydrus but I get this error while it's trying to sync http://pastebin.com/JT1zKkfH don't know if this is the right place to ask for help.

438960 No.2481

I tried to add the danbooru .db to my Hydrus but I get this error while it's trying to sync http://pastebin.com/JT1zKkfH don't know if this is the right place to ask for help.

438960 No.2482

I was not supposed to post that 3 times.

438960 No.2496

hydrus_dev, should syncing a 1.2gb tag archive (via the client_archives -> services method) take 16-24 hours plus?

I threw what I think is my final namespaced e621 HTA into my live Hydrus instance to check, and it matched 33,835 files instantly and has been spending the last day syncing them - 16ish hours and it's only up to 10,000.


-Mappings in HTA: 25 million

-Hashes in HTA: 750,000

-Matched files in library: 33,835

-Computer: i5-4670, 32gb

-Hydrus resource usage: 25% CPU, 1gb

-Library: 204,000 files

-Mappings in client.mappings.db prior to operation: 1.3 million

-Hydrus version: 201 (updating to 202 after this is done)

Is there anything I can do to optimize the HTA for this before I release it? Force an index on Mappings or something?

438960 No.2497


Worth noting, I am using sha1 hashes for this. If there is a significant user-facing benefit to using another hash type, I can evaluate its impact on crawl time and see if it makes sense to switch.

438960 No.2505

File: 1461535573291.jpg (2.44 MB, 1932x3173, 1932:3173, a1f0d149a9440b313235039ee6….jpg)


I'm sorry about this. Thank you for the report. It looks like the danbooru HTA has a single NULL (i.e. invalid) hash for hash_id = 12. I am synced to this HTA myself and never had any problems, so I presume some previous check for this got removed somehow, or sqlite only recently started complaining about it.

I have added a check-and-skip for NULL hashes for v203. Please try the sync again in that and let me know if you have any more problems.



It is actually the client db which is being slow adding those mappings. The recent db split-up and reshaping have knocked some of my processing efficiency about. v201 does some big jobs more slowly due to some naive analyze statistics and bad multiple-db caching settings. If you can, I suggest you cancel and update to v202, which is much better, especially with the folded-in ac_cache, which reduces upwards of 100ms hdd lag per tag transaction.

Before you retry the sync in v202, go help->debug->force idle and let the db do some improved v202 maintenance, or since you know what you are doing, you can rush it by opening client.mappings.db in sqlite3.exe or SQLiteStudio and running ANALYZE;, which will take a while but will improve processing speed a bit by populating some statistical tables sqlite uses to plan queries.
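For anyone who wants to run that ANALYZE from python rather than sqlite3.exe, it's a one-liner against the db file (the function wrapper here is just for illustration — and as above, it can take a while on a big client.mappings.db):

```python
# Run ANALYZE on a sqlite db to populate the sqlite_stat1 statistics
# table, which sqlite's query planner uses to pick better plans.
import sqlite3

def analyze_db(path):
    con = sqlite3.connect(path)
    try:
        con.execute("ANALYZE")
        con.commit()
    finally:
        con.close()
```

Make sure the client is closed first — sqlite does not like two writers on the same file.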

438960 No.2512

Probably some silly question, please excuse my ignorance. I hope I didn't miss it in the docs.

1 - How do you guys keep your databases (or .db files) updated with new images/tags that appear everyday?

2 - In context with the previous question, with gelbooru: I was going to make a script that updates >>1866's .db file with new tags until the current day when executed. To do this I thought about using their public API (http://gelbooru.com/index.php?page=help&topic=dapi), however they don't seem to offer namespace parsing. How are you able to discern which namespace for each tag? Is there a more convenient way to update without using their API?


438960 No.2513


The answer to question 1 is: we don't - we re-rip the entire tag databases from the sites every time we release a new DB, which is why they're usually months apart. I'm working on a plan to update e621 in a more streamlined fashion (with diff DBs - the actual update will still be a full rip on my side, but I'll calculate and only release added mappings), but I have to finish validating my initial rerip and release that first (see my conversation with hydrus_dev above).
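The diff-DB idea reduces to a set difference over (hash, tag) pairs: everything in the new full rip that wasn't in the old one is the update. A sketch (function name is illustrative):

```python
# Given old and new full rips as iterables of (hash, tag) pairs, the
# update release is just the pairs present in new but not in old.
def added_mappings(old, new):
    return sorted(set(new) - set(old))
```

Deleted mappings would be the difference the other way round, if you ever wanted to ship removals too.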

For question 2, I can't answer you about Gelbooru because I'm not the guy who does that rip, but I believe e621 uses a similar API, in that it doesn't show namespaces in API queries. To resolve that, I simply don't use the e621 API - I use Python and BeautifulSoup4 to load each page, download/hash/discard each image myself, and scrape the tags out of the HTML, including their namespaces. I designed the process to use individual workers pulling off a common work database that I keep in MongoDB - I have a VPS with online.net that I can roll with about 80 workers on, which completes an e621 rip in 8 hours. Once I release my initial rerip, I'll probably start posting the code to do all of this somewhere, after I document it a bit.

I think, if the anon who does Gelbooru now is the same one as a couple years ago, that he found an API query that shows a list of tags with the associated namespaces, and then he worked backwards from there to associate them in his final DB. There are tradeoffs to each way of doing it.

438960 No.2525


Alright, my initial e621 rip cleaned/sanitized Hydrus Tag Archive is complete, and can be downloaded here:


This rip contains all e621 tag mappings as they existed 4/23/2016; the most recent image captured is /post/show/875576. There are about 25 million mappings to about 750,000 hashes.

It will take some time to import; I recommend the client_archives method for permanence.

Namespaces are:

-No namespace



-Series (standardized version of e621's "copyright")

-Creator (standardized version of e621's "artist")

-Rating (direct rip of e621's safe/questionable/explicit)

-Gender (all gender-related tags I could find/think of are converted to this namespace)

-Tag Source (all hashes have "tag source:e621")

e621 uses underscores instead of spaces in their tags. This HTA converts them to spaces as that is the usual way of doing things in Hydrus.

Bad spellings of namespaces are automatically fixed as much as is possible; I have also corrected them manually on e621 itself, for the most part.

A small number of tags are dropped/blacklisted; this includes "creator:unknown artist", "invalid color", a bunch of namespace fragments with no tag parts, and anything at all related to aspect ratio (since we have system:aspect ratio in Hydrus).

Updates will be forthcoming once a month or two has passed and I've had time to figure out the best way to diff them. My goal is to make generating them very low-effort for myself, so that this relatively high-quality source of mappings for furry images can be relied upon for Hydrus users.
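The sanitization pass described above — namespace renames, underscores to spaces, blacklist drops — can be sketched per tag like this. The RENAMES and BLACKLIST contents are only the examples mentioned in the post, not the full lists used for the real rip:

```python
# Sketch of the per-tag sanitization described above. RENAMES/BLACKLIST
# hold example entries from the post, not the complete real lists.
RENAMES = {"artist": "creator", "copyright": "series"}
BLACKLIST = {"creator:unknown artist", "invalid color"}

def sanitize(tag):
    tag = tag.replace("_", " ")          # e621 underscores -> hydrus spaces
    if ":" in tag:
        ns, sub = tag.split(":", 1)
        tag = RENAMES.get(ns, ns) + ":" + sub
    return None if tag in BLACKLIST else tag  # None == drop the tag
```

Run every scraped tag through this before it goes into the HTA and the output stays consistent across rips.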

438960 No.2527

Is the bare PTR database snapshot going to be updated to the new split db format? Because now importing the initial DB takes way longer due to the mandatory splitting of the database

438960 No.2528

File: 1461790025529.jpg (478.51 KB, 1280x1135, 256:227, ccf17816325aeb1aa6538303a5….jpg)


Yes, once the new split db schema is settled, I'll create a new pinned thread and start regularly releasing bare PTR dbs again. I have two more big changes to make, both of which will significantly reduce the size of the db. I expect I'll be happy in a month from now.

438960 No.2641


>once the new split db schema is settled

Is this the "client compacting" you mentioned being done in >>2598 ? Sorry if I come off as impatient, but can you give an estimate on when you'll release a new bare PTR dump?

438960 No.2652

File: 1462745113599.jpg (84.62 KB, 716x917, 716:917, f7269549b2b662802cced057d1….jpg)


Thank you for the prompt. I've since decided to put off the second compacting, so I'm happy to make bare dbs again. I'll put a new one together every five weeks or so. I've just made a new sticky with the new one:


Let me know if it gives you any trouble.

438960 No.3166

Warning: The HentaiFoundry tag database is broken.

More specifically the TITLE namespace. It only contains the first word of the title, the other words are put as separate tags without a namespace.

I noticed this too late and now I have a bunch of trash tags with no idea how to easily clean them up.

438960 No.3170


Good catch.

Looking at my db, this happened because I mistakenly assumed all the tags from the site had underscores to link single words. Instead, it seems it's completely mixed.

There are tags like "character name here", and then some tags like "character_name_here" (how it's supposed to be). Since I used space as a delimiter, it was grabbing 'character', 'name', 'here' as separate tags. I'd have to redo the rip to fix it.

I'd also like to know how to get rid of the trash tags now. I thought removing the tag archive would do the trick, but they're still there.

438960 No.3179


Looks like these trash tags also leaked onto the PTR on some images, so it doesn't matter if you clean them up locally if you're using that. That's unfortunate.

438960 No.3200


There may be a way to fix it. I could make a script that checks which hashes are part of HentaiFoundry inside the PTR, then purge all tags from them. However, this would be pointless if hydrus dev doesn't use the fixed PTR. I would need his input on this.

438960 No.3206

File: 1468947561716.jpg (1.03 MB, 1522x1757, 1522:1757, 6cebd4b7ff9dbaceb39f401655….jpg)


I've noticed this problem through some individual petitions on my end, and I'd like to fix it for good, but I'm not sure the best way to go about it. I think it might be a script to create an HTA with the bad tags (by doing something like 'include all unnamespaced single words that also occur in the title tag') and then mass petition that HTA's contents through the advanced service-wide operation dialog. I'm not sure if the current in-client workflows support that completely (for matching non-local files and any other surprises), and my petition approve/deny control might implode if a 10,000+ strong petition gets thrown at it, but I'm confident we can figure it out.
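The heuristic suggested there ('unnamespaced single words that also occur in the title tag') is easy to sketch per file. This is only an illustration of the idea, not hydrus_dev's script — the function name and exact matching rules are assumptions:

```python
# Sketch of the bad-tag heuristic: flag any unnamespaced single-word tag
# that also appears as a word in a title: tag on the same file.
def find_bad_tags(tags):
    titles = [t.split(":", 1)[1] for t in tags if t.startswith("title:")]
    title_words = {w for title in titles for w in title.lower().split()}
    return sorted(t for t in tags
                  if ":" not in t and " " not in t and t.lower() in title_words)
```

Run that over every affected file's tag list and the collected output is the contents of the petition HTA. It will have some false positives (a legitimate single-word tag that happens to appear in the title), which is part of why the petitions would still want human review.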

Did the HF HTA use sha256? And does it have the correct title tag as well as the individual words?

438960 No.3363


>And does it have the correct title tag as well as the individual words?

No. An image with the title "This is my image" will get these tags from hentaifoundry.db:





Unless someone set the full title tag in the PTR as well, that is.

