# /hydrus/ - Hydrus Network

Bug reports, feature requests, and other discussion for the hydrus network.

New user? Start here ---> http://hydrusnetwork.github.io/hydrus/

Current to-do list has: 1035 items

File: 1424393184272.jpg (1.57 MB, 1500x1978, 750:989, 1b42a554ea243f40c1ec80b391….jpg)

438960 No.290

Here is a 7zip of the client database, version 147, freshly initialised and synced up to my public tag repository as of today.

If you want to start a new client that connects to my public tag repository, you can swap in this database right after you install, and you won't have to spend twenty hours sitting around waiting for 7.8 million mappings to process.

If you have no idea what this is, I suggest you ignore it and install the client normally, learning about how hydrus works using my help files first.

438960 No.811

File: 1433964257285.jpg (3.07 MB, 2042x1915, 2042:1915, 0592004c4a4926ad1300c2da28….jpg)

Here is an update, up to version 159 and update 1107, which is today, June 10th 2015. It now has 14.2 million mappings.

Also, here is a tag archive of the same:

438960 No.1204

File: 1444430383249.jpg (1.03 MB, 3000x1977, 1000:659, 0de26cbb2ba79d0cefa6114018….jpg)

And another, up to version 176 and update 1212, which is today, October 9th 2015. It now has 20.5 million mappings.

And the tag archive:

438960 No.1263

Do you need both the database and the tag archive? I'm new, but I'm not going to be able to do it the normal way; it hangs an insane amount when processing.

438960 No.1266

File: 1445447637573.jpg (247.01 KB, 1280x866, 640:433, faa387277148bad454632afb35….jpg)

>>1263

Just the db. Swap out the db directory with the 'bare_database' 7z, and you should be good to go.

This is a bare database, though. It is essentially a fresh install that has synced to my public tag repo and done nothing else. All the options and files already in your client will disappear, so you will have to reset and reimport them, which might be more than a new user wants to deal with.

If you want to stick with what you already have, you can change when and how the client does its high-CPU processing in file->options->maintenance and processing. If the client is processing and freezing up when you don't want it to, let me know.

438960 No.1436

If I'm already fully synced with the PTR, then I don't need the PTR tag archive, right? Since they're the same thing?

438960 No.1437

File: 1447525498974.jpg (1.44 MB, 4433x1858, 4433:1858, 19fce413c5a9c4e7aa85b7e036….jpg)

>>1436

Yeah, they are the same, just in different containers. The archive is really just for programmers who want to screw around with the data alone.
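For anyone who does want to poke at the data alone, here is a minimal sketch. It assumes the archive is an ordinary SQLite file and makes no assumption about its table layout: it only lists whatever tables it finds and how many rows each one holds. The function name is made up for illustration.

```python
import sqlite3


def summarize_sqlite(path):
    """Return {table_name: row_count} for every table in an SQLite file."""
    con = sqlite3.connect(path)
    try:
        names = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        counts = {}
        for name in names:
            # Table names cannot be bound as parameters, so quote them by hand.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = con.execute(
                "SELECT COUNT(*) FROM " + quoted).fetchone()[0]
        return counts
    finally:
        con.close()
```

Calling `summarize_sqlite("some_tag_archive.db")` (a placeholder filename) would show what is inside before you write any real queries against it.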

438960 No.1585

File: 1450227585621.png (32.5 KB, 830x648, 415:324, screen.1450219925.png)

Finally got around to playing with tag archives thanks to an anon's scripts. I have some questions though.

1. I made a tag archive of xbooru. It has 556,667 hashes. Does this mean Hydrus should be able to recognize 556k images from xbooru once imported? As in, I shouldn't need to run the downloader in order for it to grab tags for the image.

2. Would having too many tag archives slow or clutter my client.db? I currently have gelbooru, danbooru, e621 and derpibooru tag archives imported. (I'm in the process of importing rule34xxx & xbooru; here's hoping the import is successful.) In the future I plan on doing more, such as rule34paheal & rule34hentai, so I'm curious if everything will still hold up. I don't know if I'm the only one taking full advantage of these, so I'm asking to make sure.

3. This is kind of unrelated, but is it possible to sync my client.db with the PTR? The main reason I'm making the tag archives is because I'm not sure what booru siterips the PTR already contains. Hopefully I can get the majority of them into the PTR.

438960 No.1588

File: 1450287229719.jpg (1.38 MB, 2036x1523, 2036:1523, f1665da89398eeda0549eb9dc7….jpg)

>>1585

1

Yes. Any files you already have that match any of those hashes will gain the namespaces you chose in manage services. Any files you import in future will be checked against the tag archive in case they match as well.

2

The only time the client queries a tag archive is on the initial sync and then once on every subsequent file import. It typically only takes a couple of milliseconds per import per archive, so don't worry about it; add as many as you want.

3

It depends on the hash type of your archive. The hydrus client and server refer to files using sha256, but I assume xbooru has provided you with md5. The hydrus client only has md5s for local files, so it can only cross-reference to make the conversion of (md5->tag)->(sha256->tag) for files you actually have. As you have probably experienced with your other tag archives, this still usually represents a hell of a lot of tags!

If you happen to have sha256, though, then you can potentially sync the entire gigantic tag archive into the PTR! This is a slightly terrifying prospect, but if you can, give it a go! I expect trying to upload 9 million mappings or whatever in one go will cause an error in one way or another, so let me know if you end up in this situation–I might have to rewrite the upload system a little to spread it out a bit or add a cutoff or something.

If you do have md5 or sha1, you can get better coverage by sharing the archive around many users, who will have many of the files that you don't, enabling a greater number of md5->sha256 conversions. Would you be willing to share your archive? I would 7zip it and mirror it up on my mediafire like the others.
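The (md5->tag)->(sha256->tag) conversion described above can be sketched as a plain dictionary re-keying. This is only an illustration of the idea, not how the hydrus client does it internally; the function name and both input dictionaries are hypothetical.

```python
def convert_archive(md5_to_tags, local_md5_to_sha256):
    """Re-key an md5-based tag archive by sha256, for files held locally.

    Entries whose md5 is not among the local files are returned separately,
    because a sha256 cannot be derived from an md5 alone.
    """
    converted = {}
    unmatched = []
    for md5_hex, tags in md5_to_tags.items():
        sha256_hex = local_md5_to_sha256.get(md5_hex)
        if sha256_hex is None:
            unmatched.append(md5_hex)
        else:
            converted[sha256_hex] = set(tags)
    return converted, unmatched
```

The `unmatched` list is the part that sharing the archive among many users helps with: every extra client holds files, and therefore md5->sha256 pairs, that yours does not.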

438960 No.1594

File: 1450365776257.png (75.61 KB, 970x635, 194:127, screen.1450326881.png)

I apologize. I had to delete my xbooru.db download link. Although it does work, it seems my xbooru namespace.csv was missing a bunch of namespaces. (could have sworn I changed the max Pid) As you can see in the picture, it did match the md5 with the proper tags, but "isabelle (animal crossing)" should have had the 'character:' namespace. I'll put the link back up when it's finished.

438960 No.1599

File: 1450542830467.png (55.51 KB, 914x509, 914:509, screen.1450541833.png)

>>1594

Rule34xxx tag archive is done

xbooru tag archive is done

Rule34Paheal tag archive is done

sof.booru.org tag archive is done (*1)

ohdd.booru.org tag archive is done (*1)

I will post all of them once they're done syncing (or on my day off). Everything seems to be going fine though.

(*1) Might be interesting to some, but I discovered that some of the sites from booru.org(sof/ohdd,etc) use incorrect hashing. Initially I thought they were using SHA1 since they were 40 characters long, but then noticed that none of my files were being recognized by hydrus, even after using HASH_TYPE_SHA1 to generate the db. Turns out, the hashes were completely wrong, and were a mixture of SHA1 + MD5. This meant converting all hashes in the csv to md5. Here is the script I made to convert the hashes.

#USAGE: md5-convert.py "directory_with_files" ".csv file"
import hashlib
import codecs
import sys
import os
import re

DIR = sys.argv[1]
CSV_FILE = sys.argv[2]
count = 0

# gets md5 hash of file, reading in chunks so large files are not loaded whole
def md5(fname):
    hash = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash.update(chunk)
    return hash.hexdigest()

# read the whole csv once; the old hashes are replaced inside this string
with codecs.open(CSV_FILE, encoding='utf-8', mode='r') as f:
    data = f.read()

with codecs.open(CSV_FILE, encoding='utf-8', mode='r') as f:
    for line in f:
        line = line.rstrip('\r\n').split('\t')
        # column 2 carries the file extension, column 1 the site's hash
        match = re.search(r"\.jpg|\.jpeg|\.png|\.gif|\.webm", line[2])
        if match is None:
            continue  # skip rows with an unknown extension instead of crashing
        ext = match.group(0)
        hash = line[1]

        # construct absolute path for the md5 function
        path = os.path.join(DIR, hash + ext)

        # check to make sure the file exists
        if os.path.isfile(path) and os.access(path, os.R_OK):
            filehash = md5(path)
            data = data.replace(hash, filehash)
            count += 1
            print("line #{count} converted to md5.".format(count=count))

# write the converted data to the temp file once, after all rows are processed
with codecs.open('temp.csv', encoding='utf-8', mode='w') as f2:
    f2.write(data)

print("All hashes have been converted.")

http://pastebin.com/TdakdrpN

After converting and recreating the db, they now work fine.
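A quick way to sanity-check a column of hashes before building a db is to look at the hex length. The sketch below is an illustration only: the function and table names are made up, and, as the sof/ohdd case shows, a 40-character value only looks like SHA1. It says nothing about what was actually hashed.

```python
import re

# Hex digest lengths of the three hash types discussed in this thread.
HEX_LENGTH_TO_NAME = {32: "md5", 40: "sha1", 64: "sha256"}


def guess_hash_type(hex_string):
    """Guess a hash type from the hex length, or return None if unrecognised."""
    s = hex_string.strip().lower()
    if not re.fullmatch(r"[0-9a-f]+", s):
        return None
    return HEX_LENGTH_TO_NAME.get(len(s))
```

If the guessed type never matches any local file, that is a hint the site is doing something non-standard with its hashes.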

438960 No.1605

File: 1450553671877.jpg (690.45 KB, 2000x997, 2000:997, 2d321ed7865d08a6d9b0a6b774….jpg)

>>1599

That mixed hash thing is odd. Where are you getting the incorrect 40-character hash from? I tried gelbooru's api call:

http://sof.booru.org/index.php?page=dapi&s=post&q=index

http://ohdd.booru.org/index.php?page=dapi&s=post&q=index

but it just sends me to blah.booru.org//. Yet it works for SizeBooru, here:

http://size.booru.org/index.php?page=dapi&s=post&q=index

Am I doing something wrong there? Maybe those boorus just have the API turned off.

By the way, if you want to generate sha256 for those files, it works exactly the same, just use hashlib.sha256(). If you are interested, I use it because modern CPUs can generate it quick and it isn't easy to generate a collision. md5 is easy to spoof:

http://www.mscs.dal.ca/~selinger/md5collision/

And as I understand it, sha1 isn't far behind.
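As a sketch of "it works exactly the same", here is the chunked reader from the md5 script with the algorithm made a parameter. The function name and defaults are illustrative; `hashlib.new` accepts "md5", "sha1" and "sha256" among others.

```python
import hashlib


def file_hash(path, algorithm="sha256", chunk_size=4096):
    """Hash a file in chunks so large files are never read into memory whole."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

`file_hash(path)` gives the sha256 that hydrus uses natively, and `file_hash(path, "md5")` gives the value most boorus publish.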

438960 No.1608

>>1605

>That mixed hash thing is odd

>The Gelbooru author who made the hashing function applies SHA-1 to MD5 hashes.

I got that info from their forums. I have no idea if it's accurate or not, but it seems it only applies to boorus running Gelbooru Beta 0.1.11 and such.

>Where are you getting the incorrect 40-character hash from?

If you look at their thumbnail urls/image urls, you'll see the hash is something like 09efcc1d867698c11734ec52120f133bdc9faf9f, which is 40 chars.

>I tried gelbooru's api call

I'm guessing 0.1.11 didn't have an api yet. However, it was still pretty straightforward to do html parsing with beautiful soup. Hell, I think I'm starting to like python. (I mainly use C#)

>Yet it works for SizeBooru

Yeah, it seems it's because they're running Gelbooru 0.2.

>just use hashlib.sha256()

I probably should have done this instead. Would using sha256 speed up the syncing of tag archives too? I will use that for the next batch of sites I do.

>md5 is easy to spoof

438960 No.1611

File: 1450640925906.jpg (165.65 KB, 1280x960, 4:3, 38685f183aa7c1aad5c19502fe….jpg)

>>1608

>sha1 of the literal md5

lol wat

I can sympathise, though–I bet the dev wanted to move to sha1 but didn't want to recalculate everything or something, or he thought sha1( md5( file ) ) was the same as sha1( file ). I see that 0.2 goes back to something more reasonable.
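To make the difference concrete, here is a small sketch of sha1( file ) next to sha1( md5( file ) ). Whether those boorus fed the md5 hex string or the raw 16-byte digest into sha1 is not stated anywhere in this thread, so the hex-string variant below is an assumption.

```python
import hashlib


def sha1_of_file(data):
    """Plain sha1 of the file bytes."""
    return hashlib.sha1(data).hexdigest()


def sha1_of_md5_hex(data):
    """sha1 applied to the md5 of the file bytes.

    Assumption: the md5 *hex string* is what gets hashed. Hashing the raw
    digest instead is equally plausible and would give yet another value.
    """
    md5_hex = hashlib.md5(data).hexdigest()
    return hashlib.sha1(md5_hex.encode("ascii")).hexdigest()
```

Both results are 40 hex characters, which is why the site hashes looked like ordinary sha1 until none of them matched any file.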

I have come to like python in a not dissimilar way. It is a bit janky sometimes (GIL is almost a deal-breaker, imo, and should have been fixed years ago), and it runs a bit slow, but I find I can prototype things so much faster than if I do them 'properly'. Since I am always short on time, fast development is a high priority for me. I think they should teach it in school, as any smart kid, given the python console, can be up and running in about fifteen seconds. Instead it seems they often go for some opaque bullshit like Java.

Hydrus indexes all the hashes it stores, so it is about the same speed for any of them. The smaller ones are probably quicker just because there are fewer bytes to compare, and the db index is smaller because it only stores local files' hashes, but I expect the difference is measurable in microseconds.

sha256 is a good bet for anything hydrus-related just because it is the native hash, so you can transfer it over the network. I knew when I chose it that it would be a small problem interfacing with the legacy formats most other sites use, but I am spergish enough that I would rather do what makes engineering sense to me than work with consensus. Then again, someone was telling me IPFS uses some metahash or something where it basically adds a hash_type descriptor-header to its hashes, like "sha1:[sha_hash]", which mite b cool to support in the future, especially if it turns out we all need to move to SHA3. I would rather that the booru/imageboard devs just jumped to sha256 for now, though.
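The 'hash_type descriptor-header' idea can be sketched in a few lines. This is only the textual "type:hexdigest" form mentioned above, not an attempt to reproduce whatever encoding IPFS actually uses, and both function names are made up for illustration.

```python
import hashlib


def make_typed_hash(algorithm, data):
    """Return a digest prefixed with its algorithm, e.g. 'sha256:<hex>'."""
    return algorithm + ":" + hashlib.new(algorithm, data).hexdigest()


def parse_typed_hash(text):
    """Split 'algorithm:hexdigest' into a lower-cased (algorithm, hexdigest) pair."""
    algorithm, sep, hex_digest = text.partition(":")
    if not sep or not algorithm or not hex_digest:
        raise ValueError("expected 'algorithm:hexdigest', got %r" % (text,))
    return algorithm.lower(), hex_digest.lower()
```

With a prefix like this, one column could hold md5, sha1 and sha256 values side by side, and a later move to another hash would not need a schema change.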

438960 No.1619

File: 1450847115418.png (641.78 KB, 1595x833, 1595:833, screen.1450846694.png)

>>1611

Well here are the tag archives.

Hopefully you can mirror them to the hydrus mediafire folder too. They should be complete tag rips, as of Dec 10~20ish. Some of the sites do not use namespaces, so you'll get some creator/character/series tags as generic tags. I couldn't think of a way to fix that without manually editing the tags (yea.. not going to map out 1mil tags lol) or using another site's namespaces, which could technically work, but it'd probably be messy and miss a bunch of tags too, especially if it's gelbooru's namespaces, since they use Japanese names. (For example, rule34 uses the name "Dawn", yet Gelbooru uses "Hikari", meaning all pokemon characters wouldn't be mapped if that tag was found.)

db: ohdd.db

name: Onahole Doll - オナホドール

db: rule34hentai.db

name: Rule34Hentai

(includes all the loli tags too)

db: rule34pheal.db

name: Rule34paheal

db: rule34xxx.db

name: Rule34xxx

db: sof.db

name: Semen on Figures

db: xbooru.db

name: Xbooru

Pic-related is an import of some rule34hentai webms I saved, being tagged accordingly.

438960 No.1620

>>1619

db: drunkenpumken.db

name: Drunken Pumken

438960 No.1621

>>1619

Wow! Thank you so much.

438960 No.1622

>>1621

Sure thing. The more images that can be recognized and tagged, the better.

Planning on doing yande, shuushuu and konachan next. If anyone has any other suggestions I'd be happy to try them.

438960 No.1623

File: 1450903343032.jpg (10.47 KB, 228x250, 114:125, e33b9cfba9d16b30b5561a9aa7….jpg)

>>1619

>>1620

This is fantastic, thank you for the hard work!

https://www.mediafire.com/folder/yoy1dx6or0tnr/tag_archives

438960 No.1630

>>1619

Oh shit! Thanks for xbooru.

>>1622

sankaku chan, sankaku idol and gelbooru comes to mind~

438960 No.1668

Update

https://mega.nz/#F!HUIX1ZIB!buAIE_bgKCKZ0G0Mjk4CGQ

db: konachan.db

name: Konachan

db: yande.db

name: Yande.re

db: lolibooru-moe.db

name: Lolibooru Moe

db: tbib.db

name: The Big ImageBoard (TBIB)

(Note: Take caution when adding this one, as it is very large: 4.6mil hashes, 550k tags. The site is apparently a sync between many boorus, though it doesn't mention which ones. I haven't actually synced this one yet, but I imagine it would take a long time, as in about a week.)

>>1623

No problem. I'd love to do as many as I can.

>>1630

>sankaku

There is a problem with this site. I can't figure out how to get past page 1000 (a hard limit). In other boorus like gelbooru, you can bypass it by searching by ID, but that is disabled on sankaku for some reason. It shouldn't be too bad, though, since I thought sankaku was basically a mirror of Gelbooru? I can do sankaku idol up to page 1k too, but something about an incomplete archive feels wrong.

>gelbooru

Sure, though there's already one in the hydrus mediafire folder (from July 2015). I can make an updated one though.

438960 No.1764

File: 1452655151954.png (42.06 KB, 1577x743, 1577:743, screen.1452654740.png)

>>1668

Update.

Here's the next batch that I will release.

Gelbooru [2016] - Done/Synced

Safebooru - Done/Synced

Dollbooru - Done/Synced

Uberbooru - Done/Not synced.

Nihonomaru - Done/Synced

My Figure Collection - In progress. Should be done in a week or so. The problem is that the site does not store any of the images' hashes, and it also has a page limit like Sankaku. That meant using the dirty method: iterating through every id one by one, since I don't want to spam the server. It also means I have to download every image off the site, roughly 1.5mil of them, to get their hashes for the .db. Not a huge deal, it just takes a while. Once done, it should also support the proper namespaces each image belongs to (Figures/Items/Collections/etc). Pic-related. (The '00000' column will be replaced with the sha256 of each file once it's downloaded.)

Sankaku Idol/Sankaku - In progress. I found a way around their page limit, so I should be able to make a .db of it.(Same method as MyFigureCollection)

438960 No.1814

>>1764

I wanted to release them all this Wednesday, but MyFigureCollection & Sankaku will take longer than I thought, so I'll just post the complete ones now.

db: gelbooru2016.db

name: Gelbooru

site: http://gelbooru.com/

db: dollbooru.db

name: Dollbooru

db: nihonomaru.db

name: Nihonomaru

db: safebooru.db

name: Safebooru

db: uberbooru.db

name: Überbooru

438960 No.1820

>>1814

Question: how do I add these extra tag databases?

also could you by any chance grab the drawfriends and deviants depository boorus?

http://drawfriends.booru.org/

http://deviants-despository.booru.org/

438960 No.1821

File: 1453149182012.png (1.78 MB, 858x725, 858:725, screen.1453149002.png)

>>1820

1. Shut down Hydrus if it's running first

2. Drop the .db in '\db\client_archives', and start hydrus up again.

3. Go to services -> manage services

4. In the local tab under 'local tags', click the add button under archive sync. (Pic-related)

5. Then check all the boxes for the tag namespaces you want to sync (no namespace/character/artist/etc).

Depending on the size it can take quite some time. Once synced, hydrus will be able to tag any file that came from that site, up to the date the archive was created.

> grab the drawfriends and deviants depository boorus

Sure thing

438960 No.1822

>>1821

oooooooooh I see, does this enable me to download from those sites afterwards too?

Also will it retroactively check images I already have to these databases and tag them?

I am seeing a ton of images I have but don't have tagged properly, so this might help speed up my tagging a lot if i can do that

thanks

438960 No.1823

>>1822

>Also will it retroactively check images I already have to these databases and tag them

Yep, if you already added an image, then it should get tagged if it matches anything from the database.

>I am seeing a ton of images I have but don't have tagged properly

As long as you did not modify the file(as this would change the hash), it should be recognized accordingly. Also note, none of the db's check for resampled images.(Sometimes you might accidentally download the sample version instead of the original from gelbooru/etc)

>so this might help speed up my tagging a lot if i can do that

definitely. i don't think i've manually tagged a single of my images yet ;p

438960 No.1833

File: 1453228249286.png (777.42 KB, 1600x839, 1600:839, screen.1453228186.png)

>>1820

Alright, here are the ones you wanted, tested and working. Since they were small, I could finish them quickly.

drawfriends

https://mega.nz/#!KJhE0YQS!ez8HoC-LS4mZB-xW5-phcTZahYFX6dGGzBehkd_tZco

deviants-despository

https://mega.nz/#!vB52CRxZ!lenOJaEAWsp2ovMSU3UoDG3iuZB0s0PXQhNekwdghpI

438960 No.1843

File: 1453274431305.jpg (103.94 KB, 1280x720, 16:9, 123456789.jpg)

That's a lot of archives, thanks.

But what should I do if I already synced my db with the old gelbooru archive? Should I delete it before adding the new one?

438960 No.1845

>>1843

>Should i delete it before adding new one?

I have both synced just in case, and I haven't noticed any performance degradation/issues with it, so I assume it's fine to keep both.

438960 No.1865

File: 1453412529336.png (55.39 KB, 780x609, 260:203, problems.png)

https://www.mediafire.com/folder/yoy1dx6or0tnr/tag_archives

Tested & Working Archive Tag Packs:

rule34paheal

rule34.xxx

drunkenpumken.booru.org

tbib

xbooru

lolibooru-moe

Broken or malformed Archive Tag Packs:

Danbooru

Gelbooru

(See image)

Gelbooru's has a bunch of extra tags that may or may not cause an issue (i.e. only select Character, Artist, Series, Creator instead of the misspelled ones).

If anyone can confirm Gelbooru's status that would be great.

438960 No.1866

438960 No.1872

>>1865

I'm guessing the person who made the first gelbooru archive didn't get rid of the bad namespaces. It should still work though if you pick the correct ones.

I didn't make the danbooru archive either, but I was able to successfully sync with the creator namespace.

I guess it's time for a danbooru 2016 too

438960 No.1890

https://mega.nz/#!HRATjLhQ!oebEwFVg2QaQn0XToCUX5bdBReZXfsYlxr-xBn2zU2g

db:e621[2016].db

name:e621.net

(About 262k more hashes compared to the old one)

438960 No.1891

https://mega.nz/#!SYBjkRgb!YSAtNzUxaXEOkwOGuTLJP3Pnn7_Chi5bb-zwsN2RNx8

db:atf.db

name:All The Fallen

(loli-themed booru. It probably has the same content as gelbooru, but I ripped it just in case it has some older images which may have been removed from other sites, or missed. 136k images, 43k tags)

438960 No.1892

https://mega.nz/#!7NY1VDbB!fwX4Pgxvf8HiaVUDAvO9c1OPnIEHgvN4R0UMwpbayDQ

db:zombooru.db

name: Zombooru

(another old site. ripped it because it may have old images)

438960 No.1909

File: 1453977985065.png (578.18 KB, 1280x720, 16:9, princess_kenny_1.png)

>>1833

awesome, thanks friend, these two have a lot of tags for some more obscure images

438960 No.1931

File: 1454183107382.jpg (7.8 MB, 3832x5320, 479:665, 3df15aa7ef0627c94c67eaea58….jpg)

>>1204

And now up to version 191 and update 1310, which is today, January 30th 2016. It now has 33.7 million mappings.

438960 No.1934

>>1931

Is there a way to import just the tags to an already existing db without overwriting all personal settings?

438960 No.1935

>>1934

For anyone wanting to try, the answer is no, and especially not when, without your knowledge, your backup software has been dead ever since you installed. Time to drink.

438960 No.1969

File: 1454512352323-0.png (586.31 KB, 1415x829, 1415:829, screen1.png)

File: 1454512352324-1.png (391.57 KB, 1371x575, 1371:575, screen2.png)

File: 1454512352324-2.png (777.7 KB, 1595x833, 1595:833, s3.png)

https://mega.nz/#!SZR2GaSY!0OzfLhBPO0EHsqF4py8jk3m39mVCe6E7UYfYHw9Itq8

(edit, some series weren't being recognized because the tags had colons in them. I realized I could use this to my advantage and add "series:" namespace to them, so if you downloaded the .db 3-5 hours ago, then please use this instead)

db:myfigurecollectionV2.db

name:My Figure Collection

Finally done. This is a tag archive of the pictures section from MFC, which includes over 1.3 million photos of figurines.

In Hydrus, you can search by category (category:figures, category:kits&customs, etc), image ID and uploader.

There are two portions to the site: one where users upload photos they've taken (the current tag archive), and another where the creators sell their figures, which has more detailed information such as the company, character, materials, etc. That part will be done in a few days.

Hopefully anyone else who is interested in figurines will benefit.

>>1909

438960 No.2000

File: 1454772742005-0.jpeg (70.35 KB, 600x900, 2:3, nguyedt1452743803.jpeg)

File: 1454772742005-1.png (492.81 KB, 1357x782, 59:34, 1.png)

File: 1454772742005-2.png (803.38 KB, 1595x785, 319:157, 2.png)

File: 1454772742006-3.png (20.84 KB, 774x753, 258:251, 3.png)

https://mega.nz/#!DE5EUaYA!lQkqVE6_35m26XFQJs8TH_O2P6lS_U-vzh2JM4xnaZQ

db:myfigurecollection-item.db

name:My Figure Collection

Here is the second portion of MFC, the item database. You can use it with the mfc picture db to find a bit more information from a figurine. You can search most of the same things you would from the myfigurecollection.net/item/ page. Basically, same as using the site, except in Hydrus. Don't forget to add colors to all the namespaces added.

list of item namespaces;

origin

company

creator(both sculptor and illustrator)

character

material

id

category(prepainted, action/dolls, garage kits)

scale

As for Sankaku Idol/Chan.. those will be a while. I don't know what it is with their site, but they go to great lengths to prevent it from being crawled. Even with 10 second delays between page requests, I still get request timeouts every so often, which makes the whole process extremely slow. I've never used Hentai Foundry much, but they seem to have a great deal of western art, which I really want, so I'll probably make a tag archive of that next.
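The delay-and-retry pattern behind a slow crawl like this can be sketched independently of any particular site. In the sketch below the `fetch` callable, the function name and all defaults are hypothetical; it is not the script used for these archives.

```python
import time


def fetch_politely(fetch, keys, delay=10.0, retries=3, sleep=time.sleep):
    """Call fetch(key) for each key, pausing between requests.

    fetch is any callable that returns a result or raises an exception
    (for example on a timeout). Keys that keep failing are collected
    instead of aborting the whole run.
    """
    results = {}
    failed = []
    for key in keys:
        for attempt in range(1, retries + 1):
            try:
                results[key] = fetch(key)
                break
            except Exception:
                # Wait a little longer after each failed attempt.
                sleep(delay * attempt)
        else:
            # Every attempt failed; remember the key for a later pass.
            failed.append(key)
        sleep(delay)
    return results, failed
```

Keeping the failed keys means a rip that hits timeouts can be resumed for just those pages or ids rather than started over.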

438960 No.2001

>>1872

I made the old gelbooru, danbooru and e621 archives, and indeed I didn't clear the namespaces properly. I realized it far too late and was too lazy to fix it. There should be some post where I realize it; maybe it's on github still.

438960 No.2002

>>2001

Maybe hydrus could add support for namespace siblings / parents, then we could correct those namespaces and wouldn't have to bother fixing it from the sources themselves (although it is a good idea to fix those in the tag archives because some namespaces and tags are utter shit, and should be cleaned in a centralized tag system/provider)

438960 No.2006

File: 1454839954706.jpg (577.79 KB, 1247x1556, 1247:1556, mfc.jpg)

>>2001

No problem, it's an easy fix. Major thanks to that csv converter script though. Would have taken much longer for me to get started without it

>>2002

>support for namespace siblings / parents

That's a good idea. Some namespaces overlap, like 'origin' from MFC item db is the same as 'series' for the most part. I only used origin to keep inline with the site.

>cleaned in a centralized tag system/provider

Isn't that basically what the PTR is? Users submit tags, which can then be petitioned by other users. I've yet to see how that all works though since my client is usually in a constant state of importing files and testing tag archives.

438960 No.2008

>>2006

> Isn't that basically what the PTR is?

Yeah, it was just my roundabout way of saying, it would be probably smarter to fix it at the source (the tag archives)

438960 No.2037

>>1764

It's been a while and it seems you found a way around it, but on Sankaku, while the id metatag does not work, you can actually use date instead.

https://idol.sankakucomplex.com/?tags=date%3A2010-05-01..2010-06-01

This might be a faster way to do it.
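Splitting a whole site into date windows like the one in that URL could look roughly like this. The one-month window size and the function names are assumptions. Whether the end date of a range is inclusive is not stated here, so adjacent windows share a boundary day and results are best de-duplicated by post id.

```python
from datetime import date
from urllib.parse import quote


def month_windows(first, last):
    """Yield (start, end) date pairs, one per calendar month, covering first..last."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        start = date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1
        yield start, date(year, month, 1)


def date_query_url(base_url, start, end):
    """Build a search URL for the date:START..END metatag."""
    tag = "date:{}..{}".format(start.isoformat(), end.isoformat())
    return base_url + "?tags=" + quote(tag, safe="")
```

If a single month still returns more posts than the page limit allows, the same approach works with narrower windows.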

438960 No.2038

>>2037

Great find! It means I can actually finish in a sane amount of time now since I can use their API.

438960 No.2056

File: 1455801456813.png (1.1 MB, 1594x834, 797:417, screen.1455801245.png)

db:hentaifoundry.db

name: Hentai Foundry

Namespaces include creator and title of work. Keywords are added with no namespace. Contains around 200k hashes. To be honest I thought it'd be bigger, but I believe I did grab all content from every user from the site.

Sankaku chan is nearly done too. I just need to get namespace tags.

438960 No.2072

File: 1456048705682.png (3.38 KB, 397x109, 397:109, Безымянный.png)

>>2056

>Hentai Foundry

Hmm, when I try to sync this archive with my db, it seems like Hydrus thinks that I have all these images and is trying to sync them all. But I only have around 70k images. Other archives like gelbooru2016 or safebooru match only a few hundred or a few thousand of my images.

Is there something wrong with this archive, is it a problem in my db, or is there no problem at all?

438960 No.2079

>>2072

As far as I understand it it's the amount of images found in the archive, which it then checks against the present files.

I have the same number shown in that dialogue but I have 369.892 images in the db.

438960 No.2083

>>2079

Maybe. It just confuses me a little, because the other archives give much smaller numbers.

438960 No.2084

File: 1456111527009-0.png (2.45 KB, 206x81, 206:81, 01.png)

File: 1456111527009-1.png (3.2 KB, 392x92, 98:23, 02.png)

Yep. There's definitely some difference between these archives. I made a new db and tried to sync it with the gelbooru2016 archive. It processed the whole archive, around 2 million files, but matched 0 because it's a new db without any files. The whole synchronization process took around 1-2 minutes.

Then I made a new db again and tried to sync it with the hentaifoundry archive. It matched all 207376 files and started syncing them, which is much slower than the gelbooru archive.

438960 No.2090

Is there a "best way" to actually organize full manga on hydrus, as in, having chapters labeled and pages ordered? I have a few full series downloaded as numbered images in individual chapter folders.

438960 No.2092

File: 1456250755849.jpg (835.68 KB, 1600x799, 1600:799, b4be967a65a0ef79cc3601ed77….jpg)

>>2072

>>2079

>>2083

>>2084

That db seems to have sha256 hashes, which means hydrus can import all its hash-tag pairs (about two million, I think!), not just the ones it can match to your local files. Syncing with that db (as opposed to a one-time import) isn't actually important, as newly imported files that are checked against it are not going to produce any new mappings, since everything will have been added already. Also, once one person syncs all that stuff to a repo, like my ptr, then no one else needs to, as there will be nothing left to add.

There isn't a way, yet, to limit the tag archive sync to a particular file domain (like local files). Would you like one?

>>2090

I don't think so. You can use page: and chapter: namespace to sort and collect inside hydrus, but I'm not really happy about the workflow. I expect I will support a multi-page, single-file .cbr-type format in future.

Have a play with a couple of chapters and the 'add tags based on filename' dialog when you import files, if you like, and let me know what you think.
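For numbered images in per-chapter folders, the page: and chapter: tags can be derived from the path. The sketch below is a standalone illustration, not hydrus's 'add tags based on filename' dialog, and the 'Chapter 03/012.jpg' layout is an assumed example of such a folder structure.

```python
import os
import re


def manga_tags(path):
    """Derive chapter:/page: tags from a path like '.../Chapter 03/012.jpg'."""
    folder = os.path.basename(os.path.dirname(path))
    stem = os.path.splitext(os.path.basename(path))[0]
    tags = []
    # First run of digits in the folder name is taken as the chapter number.
    chapter = re.search(r"\d+", folder)
    if chapter:
        tags.append("chapter:%d" % int(chapter.group(0)))
    # First run of digits in the file name is taken as the page number.
    page = re.search(r"\d+", stem)
    if page:
        tags.append("page:%d" % int(page.group(0)))
    return tags
```

Converting to integers drops the zero padding, so page:2 and page:10 can be ordered as numbers.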

438960 No.2101

>>2092

>There isn't a way, yet, to limit the tag archive sync to a particular file domain (like local files). Would you like one?

Well, I don't know. I mean, I can wait a little longer, this archive is not that big. I was just curious why these archives have different sync processes. Thanks for the explanation.

But then again, there's this >>747 big archive, for example. And if I understand it right, syncing only local files would be much faster than importing all tags from this archive. If creating this function does not require much time, then having it just in case would be a good idea, I guess.

438960 No.2147

I've been reading through the docs for a while, and it's quite possible I'm just an autistic idiot, but I cannot figure out a way to tag things that were already imported. I accidentally imported some images from the boorus without selecting "tag" and now they are untagged, is it possible to retag them?

438960 No.2148

File: 1456697816967.jpg (1.61 MB, 1254x1024, 627:512, 37fbae148ddcba63cd89c0813e….jpg)

>>2147

Hydrus isn't yet clever enough to do this sort of reverse lookup, so you'll have to repeat the same query with the tags checked.

Thankfully, it is clever enough to remember when it has downloaded a file from a specific url, so it won't waste bandwidth trying to download the files themselves again–it should only fetch the html page to parse the tags and then apply them to the file.

438960 No.2175

Hi guys, OP of the original (2014) e621 db here. I'm reripping e621 now, under my original sanitization regime - read, normalized creator: series: species: character: namespaces instead of the site's artist: copyright: species: character: native set.

Should be up in a week or two, for anyone interested.

To the other guy making site archives: what hardware are you using, out of curiosity? I've been playing with short-duration EC2 c4.8xlarge instances, and they're obscenely quick for this, but pricy.
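The namespace normalisation mentioned above (the site's artist: and copyright: becoming creator: and series:) amounts to a small lookup on the part before the first colon. This sketch is an illustration, not the actual sanitization code; the mapping holds only the two renames named in the post.

```python
# Only the two renames named in the post; species: and character: pass through.
NAMESPACE_MAP = {"artist": "creator", "copyright": "series"}


def normalize_tag(tag, namespace_map=NAMESPACE_MAP):
    """Rename the namespace of 'namespace:subtag'; leave other tags untouched."""
    namespace, sep, subtag = tag.partition(":")
    if not sep:
        return tag  # unnamespaced tag
    return namespace_map.get(namespace, namespace) + ":" + subtag
```

One caveat: an unnamespaced tag that happens to contain a colon will be read as namespaced, which is the same quirk that came up with the MFC series tags earlier in the thread.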

438960 No.2176

>>2175

(by "obscenely quick" I mean "I ripped sha1 and md5 hashes of all of e621 in about 90 minutes at 1gbit". It's like renting a Bugatti to spin cookies on the front lawn of the Augusta National.)

438960 No.2186

File: 1457097320773.png (3.76 KB, 716x247, 716:247, screen.1457096014.png)

>>2175

>To the other guy making site archives: what hardware are you using, out of curiosity?

I'm just using my cheap seedbox from https://seedboxes.cc/. Only $14 a month. I've been with them probably since 2013 now, and they're definitely reliable. Although advertised as a seedbox, they let you basically do anything on it. Of course, it's not just used for downloading images/making tag archives.. I also run h@h from sad panda on it, and also download a ton of torrents from private trackers. 20Gbps Up/Down. However, 3TB /month upload cap monthly.

As for making the tag archive, it's just a simple 50 line python script that parses their index.xml page.

"https://e621.net/post/index.xml?limit=1000&page={page_id}"

>short-duration EC2 c4.8xlarge instances

I actually had no idea Amazon offered this type of service. I'll have to look into it.

>(by "obscenely quick" I mean "I ripped sha1 and md5 hashes of all of e621 in about 90 minutes at 1gbit"

Yeah e621 and gelbooru can be ripped quickly. Most tag archives from big sites can be done in a few hours, with the only exception being sankaku complex. Their incredibly restrictive server makes it a huge bottleneck.

Speaking of sankaku, I had to start over. I didn't realize they used a limit of 100 instead of the usual 1000 when doing post api searches, which meant my date ranges had gotten messed up. It doesn't help that it has 10 minute time outs after 150 requests either. Makes it a huge pain

438960 No.2192

>>2186

Ah, interesting service - will have to look into it. I need something less limited than a t2.micro, and c4.8xlarges are $1.67 an HOUR (!).

I remember the last time I did this having an insane amount of difficulty with the e621 API - part of it was that I was a much worse programmer then, but part of it was that individual API pages didn't seem to include things like tag categories and ratings. I also don't, in theory, trust their hashes - in practice they work fine, but I'm developing here with an eye towards sites like Derpibooru that optimize the image and don't update the hash in their API, so if I want anything to match on that (since as you mentioned, nobody downloads originals) I'm going to have to manually get the hashes of both the orig_ and the optimized images.

So the way I'm doing it this time is using BeautifulSoup to actually scrape every single page on the site (/post/show/{id}, iterated from 000001 to 900000 ignoring 404s), grab the tags including namespaces, grab the rating, stream the image and hash it myself, put everything in a JSON file, save it into the locally running instance of MongoDB that I keep around, repeat.

I broke up the iteration into work-blocks, which are 1000 IDs each and are served from Mongo as well. So I can fire up any number of workers, which point themselves at Mongo, grab the top block, and send the data back.

Now that I have it all in Mongo, all I'll have to do is write a quick script to iterate through it and dump it through the Hydrus Tag Archive generator.

This is definitely overkill, but it's extensible overkill - I don't need to worry about API differences, I don't need to worry about a process choking and hosing my output file, I don't need to worry about parallelism conflicts, etc.

The ultimate goal is to re-do Furaffinity, properly this time. Have to do hashing on the fly for that, too, and this method will work for sure now that their servers are less shit (thanks, IMVU! :P)
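
For anyone curious what the work-block scheme described above might look like, here is a rough sketch. It is not the poster's code: an in-memory deque stands in for the MongoDB collection, and the names make_blocks, claim_block and block_paths are made up for illustration. Only the block size of 1000 and the /post/show/{id} path come from the post.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

Block = Tuple[int, int]  # inclusive (first_id, last_id)


def make_blocks(first_id: int, last_id: int, size: int = 1000) -> Deque[Block]:
    """Split an inclusive ID range into consecutive blocks of at most `size` IDs."""
    blocks: Deque[Block] = deque()
    start = first_id
    while start <= last_id:
        end = min(start + size - 1, last_id)
        blocks.append((start, end))
        start = end + 1
    return blocks


def claim_block(queue: Deque[Block]) -> Optional[Block]:
    """Hand the top block to a worker, or None once the queue is empty."""
    return queue.popleft() if queue else None


def block_paths(block: Block, template: str = "/post/show/{id}") -> List[str]:
    """List the page paths a worker would fetch for one block."""
    first, last = block
    return [template.format(id=post_id) for post_id in range(first, last + 1)]
```

With a real shared store, claim_block would need to be an atomic "find one unclaimed block and mark it claimed" operation so that two workers never grab the same block; the deque only shows the shape of the idea.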

438960 No.2278

If I synced with a downloaded db, gelbooru for example, do I still need to keep that db file after the sync, or are all the mappings now inside my client db?

438960 No.2441

File: 1460661336311.jpg (90.4 KB, 600x417, 200:139, Cf_YwHgXEAAtXO6.jpg)

is there a way to reverse engineer these tag databases and add them to a booru?

I know someone starting up a Rozen Maiden booru, and I want all the tags I've done in hydrus to be able to cross over, as well as the tags from other boorus.

Any way to do this?

I've added all my tags to the public tag database.

438960 No.2459

File: 1460834106347.png (477.56 KB, 640x548, 160:137, b5fa8aa827813449c96e8de503….png)

>>2278

For most, you should keep it synced.

A mapping is essentially a pair of (file_hash, tag). Hydrus uses sha256 for its hash, but most boorus use md5, so the information cannot be imported without a cross-reference, which can only be generated if you have the original file. Syncing your client to an HTA imports all the mappings it can match against your local files, and then whenever you import another file, it rechecks the tag archive and applies any new mappings it finds.
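
To illustrate why the original file is needed for the cross-reference: the only way to know that a given md5 and a given sha256 belong together is to hash the same bytes with both. A minimal sketch using Python's hashlib follows. The function name hash_stream is hypothetical, and this is not how the client does it internally.

```python
import hashlib
import io
from typing import BinaryIO, Dict


def hash_stream(stream: BinaryIO, chunk_size: int = 65536) -> Dict[str, str]:
    """Read a file-like object once and return its md5, sha1 and sha256 hex digests."""
    hashers = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # Feed every chunk to all three hashers so the file is only read once.
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


# Example: in practice the stream would be open(path, "rb") or a download stream.
digests = hash_stream(io.BytesIO(b"example file contents"))
```

The md5 in the result is what a booru-style archive would be keyed on, and the sha256 is what it would be matched to locally.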

>>2441

If you can do a little programming, sure! An HTA is a sqlite database, so if you look at it with SQLiteStudio or another sqlite program, you can see how it is structured. If your language has a sqlite library, you can write a script to convert the data in the db into whatever you need for your booru POST form or however you intend to import tags.
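
If you would rather look at the structure from a script than from SQLiteStudio, something like this sketch should work for any sqlite file. It assumes nothing about the HTA's table names and just asks sqlite what is there; describe_sqlite is a made-up name.

```python
import sqlite3
from typing import Dict, List


def describe_sqlite(conn: sqlite3.Connection) -> Dict[str, List[str]]:
    """Return {table name: [column names]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema: Dict[str, List[str]] = {}
    for table in tables:
        # Table names cannot be bound as parameters, so quote the identifier by hand.
        quoted = '"{}"'.format(table.replace('"', '""'))
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute("PRAGMA table_info({})".format(quoted)).fetchall()
        schema[table] = [column[1] for column in columns]
    return schema
```

Calling describe_sqlite(sqlite3.connect("some_archive.db")) on a copy of the archive (the filename is a placeholder) gives you the table and column names to write your conversion queries against.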

If you can program in python, you can use the interface I wrote to make it even easier. It is under any install's install_dir/include/HydrusTagArchive.py. I wrote a little intro at the top of the file, but let me know if you would like any extra help.

To create your HTA, you want to do something like services->review services->public tag repo->perform a service-wide operation->export to HTA, then pick the hash_type you care about. I expect the booru uses md5, but you will have to check with your friend.

438960 No.2479

I tried to add the danbooru .db to my Hydrus but I get this error while it's trying to sync http://pastebin.com/JT1zKkfH don't know if this is the right place to ask for help.

438960 No.2480

I tried to add the danbooru .db to my Hydrus but I get this error while it's trying to sync http://pastebin.com/JT1zKkfH don't know if this is the right place to ask for help.

438960 No.2481

I tried to add the danbooru .db to my Hydrus but I get this error while it's trying to sync http://pastebin.com/JT1zKkfH don't know if this is the right place to ask for help.

438960 No.2482

I was not supposed to post that 3 times.

438960 No.2496

hydrus_dev, should syncing a 1.2gb tag archive (via the client_archives -> services method) take 16-24 hours plus?

I threw what I think is my final namespaced e621 HTA into my live Hydrus instance to check, and it matched 33,835 files instantly and has been spending the last day syncing them - 16ish hours and it's only up to 10,000.

Stats:

-Mappings in HTA: 25 million

-Hashes in HTA: 750,000

-Matched files in library: 33,835

-Computer: i5-4670, 32gb

-Hydrus resource usage: 25% CPU, 1gb

-Library: 204,000 files

-Mappings in client.mappings.db prior to operation: 1.3 million

-Hydrus version: 201 (updating to 202 after this is done)

Is there anything I can do to optimize the HTA for this before I release it? Force an index on Mappings or something?

438960 No.2497

>>2496

Worth noting, I am using sha1 hashes for this. If there is a significant user-facing benefit to using another hash type, I can evaluate its impact on crawl time and see if it makes sense to switch.

438960 No.2505

File: 1461535573291.jpg (2.44 MB, 1932x3173, 1932:3173, a1f0d149a9440b313235039ee6….jpg)

>>2479

I'm sorry about this. Thank you for the report. It looks like the danbooru HTA has a single NULL (i.e. invalid) hash for hash_id = 12. I am synced to this HTA myself and never had any problems, so I presume some previous check for this got removed somehow, or sqlite only recently started complaining about it.

I have added a check-and-skip for NULL hashes for v203. Please try the sync again in that and let me know if you have any more problems.
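
If you build archives yourself and want to check for this before releasing one, a sketch along these lines could find the bad rows. It assumes a table called hashes with columns hash_id and hash, which is a guess based on the description above and should be checked against the real file. It is not the v203 client code.

```python
import sqlite3
from typing import Iterator, List, Tuple


def find_null_hash_ids(conn: sqlite3.Connection) -> List[int]:
    """Return the hash_ids whose hash is NULL (assumed table: hashes(hash_id, hash))."""
    rows = conn.execute(
        "SELECT hash_id FROM hashes WHERE hash IS NULL ORDER BY hash_id"
    )
    return [row[0] for row in rows]


def iter_valid_hashes(conn: sqlite3.Connection) -> Iterator[Tuple[int, bytes]]:
    """Yield (hash_id, hash) pairs, skipping any row with a NULL hash."""
    query = "SELECT hash_id, hash FROM hashes ORDER BY hash_id"
    for hash_id, hash_bytes in conn.execute(query):
        if hash_bytes is None:
            continue  # check-and-skip: an invalid row is ignored rather than fatal
        yield hash_id, hash_bytes
```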

>>2496

>>2497

It is actually the client db which is being slow adding those mappings. The recent db split-up and reshaping have knocked some of my processing efficiency about. v201 does some big jobs more slowly due to some naive analyze statistics and bad multiple-db caching settings. If you can, I suggest you cancel and update to v202, which is much better, especially with the folded-in ac_cache, which reduces upwards of 100ms hdd lag per tag transaction.

Before you retry the sync in v202, go help->debug->force idle and let the db do some improved v202 maintenance, or since you know what you are doing, you can rush it by opening client.mappings.db in sqlite3.exe or SQLiteStudio and running ANALYZE;, which will take a while but will improve processing speed a bit by populating some statistical tables sqlite uses to plan queries.
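
The manual ANALYZE route can also be done from Python's built-in sqlite3 module instead of sqlite3.exe. This is only a sketch: analyze_database is a made-up helper, and it is probably safest to run it with the client closed.

```python
import sqlite3
from typing import List, Tuple


def analyze_database(conn: sqlite3.Connection) -> List[Tuple[str, str, str]]:
    """Run ANALYZE and return the (tbl, idx, stat) rows sqlite stored in sqlite_stat1."""
    conn.execute("ANALYZE")
    conn.commit()
    # sqlite_stat1 is the statistics table the query planner reads.
    # It is only created once ANALYZE has found something to analyze.
    return conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
```

Usage would be something like analyze_database(sqlite3.connect("client.mappings.db")), run from the folder that holds the db. On a large file, expect it to take a while, as noted above.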

438960 No.2512

Probably some silly question, please excuse my ignorance. I hope I didn't miss it in the docs.

1 - How do you guys keep your databases (or .db files) updated with new images/tags that appear every day?

2 - In context with the previous question, with gelbooru: I was going to make a script that updates >>1866's .db file with new tags up to the current day when executed. To do this I thought about using their public API (http://gelbooru.com/index.php?page=help&topic=dapi), however they don't seem to offer namespace parsing. How are you able to discern the namespace for each tag? Is there a more convenient way to update without using their API?

Thanks

438960 No.2513

>>2512

The answer to question 1 is, we don't - we re-rip the entire tag databases from the sites every time we release a new DB, which is why they're months apart usually. I'm working on a plan to update e621 in a more streamlined fashion (with diff DBs - the actual update will still be a full rip on my side, but I'll calculate and only release added mappings) but I have to finish validating my initial rerip and release that first (see my conversation with hydrus_dev above).
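
The "only release added mappings" idea boils down to a set difference between two full rips. A minimal sketch, assuming each rip can be read as (hash, tag) pairs; the function name and the pair representation are hypothetical and this is not the planned implementation:

```python
from typing import Iterable, Set, Tuple

Mapping = Tuple[str, str]  # (hash as hex string, tag)


def added_mappings(old_rip: Iterable[Mapping], new_rip: Iterable[Mapping]) -> Set[Mapping]:
    """Return the mappings present in the new rip but absent from the old one."""
    return set(new_rip) - set(old_rip)


old = [("aa11", "creator:someone"), ("aa11", "blue sky")]
new = [("aa11", "creator:someone"), ("aa11", "blue sky"), ("bb22", "species:cat")]
diff = added_mappings(old, new)
```

Holding tens of millions of pairs in memory at once may be too heavy, in which case the same difference could be computed inside sqlite instead.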

For question 2, I can't answer you about Gelbooru because I'm not the guy who does that rip, but I believe e621 uses a similar API, in that it doesn't show namespaces in API queries. To resolve that, I simply don't use the e621 API - I use Python and BeautifulSoup4 to load each page, download/hash/discard each image myself, and scrape the tags out of the HTML, including their namespaces. I designed the process to use individual workers pulling off a common work database that I keep in MongoDB - I have a VPS with online.net that I can roll with about 80 workers on, which completes an e621 rip in 8 hours. Once I release my initial rerip, I'll probably start posting the code to do all of this somewhere, after I document it a bit.

I think, if the anon who does Gelbooru now is the same one as a couple years ago, that he found an API query that shows a list of tags with the associated namespaces, and then he worked backwards from there to associate them in his final DB. There are tradeoffs to each way of doing it.
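
If that guess is right, "working backwards" would amount to a lookup table from plain tag to namespace, applied to each post's tag list. A sketch under that assumption follows. The data shapes and the name apply_namespaces are invented here, and the real tag-list query is not shown because its format is not known.

```python
from typing import Dict, Iterable, List


def apply_namespaces(plain_tags: Iterable[str], tag_to_namespace: Dict[str, str]) -> List[str]:
    """Prefix each tag with its namespace when the lookup has one, else leave it bare."""
    result: List[str] = []
    for tag in plain_tags:
        namespace = tag_to_namespace.get(tag, "")
        result.append("{}:{}".format(namespace, tag) if namespace else tag)
    return result


# The lookup would be built once from the site's tag list, then reused for every post.
lookup = {"alice": "character", "some artist": "creator"}
tagged = apply_namespaces(["alice", "some artist", "blue sky"], lookup)
```

The tradeoff mentioned above shows up here: one extra query builds the lookup, but a tag that changed category between the two queries would get the wrong namespace.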

438960 No.2525

>>2513

Alright, my initial e621 rip cleaned/sanitized Hydrus Tag Archive is complete, and can be downloaded here:

https://mega.nz/#!04VlRLzQ!vc0VWQ0E5pdFCFu6-mo-NpVcorObldTcY93XvCVR0dY

This rip contains all e621 tag mappings as they existed 4/23/2016; the most recent image captured is /post/show/875576. There are about 25 million mappings to about 750,000 hashes.

It will take some time to import; I recommend the client_archives method for permanence.

Namespaces are:

-No namespace

-Character

-Species

-Series (standardized version of e621's "copyright")

-Creator (standardized version of e621's "artist")

-Rating (direct rip of e621's safe/questionable/explicit)

-Gender (all gender-related tags I could find/think of are converted to this namespace)

-Tag Source (all hashes have "tag source:e621")

e621 uses underscores instead of spaces in their tags. This HTA converts them to spaces as that is the usual way of doing things in Hydrus.

Bad spellings of namespaces are automatically fixed as much as is possible; I have also corrected them manually on e621 itself, for the most part.

A small number of tags are dropped/blacklisted; this includes "creator:unknown artist", "invalid color", a bunch of namespace fragments with no tag parts, and anything at all related to aspect ratio (since we have system:aspect ratio in Hydrus).

Updates will be forthcoming once a month or two has passed and I've had time to figure out the best way to diff them. My goal is to make generating them very low-effort for myself, so that this relatively high-quality source of mappings for furry images can be relied upon for Hydrus users.
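
As an illustration of the clean-up rules listed in this post, here is a rough sketch of what such a sanitizer could look like. It is not the code used for the archive. It assumes scraped tags arrive as "namespace:tag_with_underscores" strings, and it only covers the renames, underscore conversion, empty-fragment drop, two of the blacklisted tags and the source tag. The gender and aspect-ratio rules are left out.

```python
from typing import Iterable, Optional, Set

NAMESPACE_RENAMES = {"artist": "creator", "copyright": "series"}
BLACKLIST = {"creator:unknown artist", "invalid color"}
SOURCE_TAG = "tag source:e621"


def sanitize_tag(raw: str) -> Optional[str]:
    """Normalize one scraped tag, or return None if it should be dropped."""
    tag = raw.strip().replace("_", " ")
    namespace, separator, subtag = tag.partition(":")
    if separator:
        namespace = namespace.strip()
        namespace = NAMESPACE_RENAMES.get(namespace, namespace)
        subtag = subtag.strip()
        if not subtag:
            return None  # namespace fragment with no tag part
        tag = "{}:{}".format(namespace, subtag)
    if not tag or tag in BLACKLIST:
        return None
    return tag


def sanitize_post(raw_tags: Iterable[str]) -> Set[str]:
    """Sanitize every tag on a post and add the source tag."""
    cleaned = {tag for tag in map(sanitize_tag, raw_tags) if tag is not None}
    cleaned.add(SOURCE_TAG)
    return cleaned
```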

438960 No.2527

Is the bare PTR database snapshot going to be updated to the new split db format? Because now importing the initial DB takes way longer due to the mandatory splitting of the database

438960 No.2528

File: 1461790025529.jpg (478.51 KB, 1280x1135, 256:227, ccf17816325aeb1aa6538303a5….jpg)

>>2527

Yes, once the new split db schema is settled, I'll create a new pinned thread and start regularly releasing bare PTR dbs again. I have two more big changes to make, both of which will significantly reduce the size of the db. I expect I'll be happy in a month from now.

438960 No.2641

>>2528

>once the new split db schema is settled

Is this the "client compacting" you mentioned being done in >>2598 ? Sorry if I come off as impatient, but can you give an estimate on when you'll release a new bare PTR dump?

438960 No.2652

File: 1462745113599.jpg (84.62 KB, 716x917, 716:917, f7269549b2b662802cced057d1….jpg)

>>2641

Thank you for the prompt. I've since decided to put off the second compacting, so I'm happy to make bare dbs again. I'll put a new one together every five weeks or so. I've just made a new sticky with the new one:

>>2651

Let me know if it gives you any trouble.

438960 No.3166

Warning: The HentaiFoundry tag database is broken.

More specifically the TITLE namespace. It only contains the first word of the title, the other words are put as separate tags without a namespace.

I noticed this too late and now I have a bunch of trash tags with no idea how to easily clean them up.

438960 No.3170

>>3166

Good catch.

Looking at my db, this happened because I mistakenly assumed all the tags from the site had underscores to link single words. Instead, it seems it's completely mixed.

There are tags like "character name here", and then some tags like "character_name_here" (how it's supposed to be). Since I used space as a delimiter, it was grabbing 'character', 'name', 'here' as separate tags. I'd have to redo the rip to fix it.

I'd also like to know how to get rid of the trash tags now. I thought removing the tag archive would do the trick, but they're still there.

438960 No.3179

>>3170

Looks like these trash tags also leaked onto the PTR on some images, so it doesn't matter if you clean them up locally if you're using that. That's unfortunate.

438960 No.3200

>>3179

There may be a way to fix it. I could make a script that checks to see which hashes are part of HentaiFoundry inside the PTR, then purge all tags from them. However this would be pointless if Hydrus dev doesn't use the fixed PTR. I would need his input on this

438960 No.3206

File: 1468947561716.jpg (1.03 MB, 1522x1757, 1522:1757, 6cebd4b7ff9dbaceb39f401655….jpg)

>>3200

I've noticed this problem through some individual petitions on my end, and I'd like to fix it for good, but I'm not sure the best way to go about it. I think it might be a script to create an HTA with the bad tags (by doing something like 'include all unnamespaced single words that also occur in the title tag') and then mass petition that HTA's contents through the advanced service-wide operation dialog. I'm not sure if the current in-client workflows support that completely (for matching non-local files and any other surprises), and my petition approve/deny control might implode if a 10,000+ strong petition gets thrown at it, but I'm confident we can figure it out.
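
The "unnamespaced single words that also occur in the title" rule could be sketched like this. It is only an illustration: the function name is made up, the full title is passed in separately on the assumption that it can be obtained from somewhere, and matching is case-insensitive as a guess.

```python
from typing import Iterable, Set


def suspect_title_fragments(tags: Iterable[str], full_title: str) -> Set[str]:
    """Return unnamespaced single-word tags that also appear as words in the title."""
    title_words = {word.lower() for word in full_title.split()}
    suspects: Set[str] = set()
    for tag in tags:
        if ":" in tag:
            continue  # namespaced tags are left alone
        if " " in tag:
            continue  # multi-word tags cannot be a single split-off word
        if tag.lower() in title_words:
            suspects.add(tag)
    return suspects


file_tags = ["title:This", "is", "my", "image", "solo", "big hat"]
bad = suspect_title_fragments(file_tags, "This is my image")
```

A rule like this will also catch legitimate tags that happen to be words in the title, so the output is a list of candidates for petition rather than a safe delete list.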

Did the HF HTA use sha256? And does it have the correct title tag as well as the individual words?

438960 No.3363

>>3206

>And does it have the correct title tag as well as the individual words?

No. An image with the title "This is my image" will get these tags from hentaifoundry.db:

title:This

is

my

image

Unless someone set the full title tag in the PTR as well, that is.

Bumping