Commons:Bots/Requests/ImagehashBot


ImagehashBot (talk · contribs)

Operator:

Bot's tasks for which permission is being sought: adding pHash checksum (P9310) and Imagehash difference hash (P12563) values to the photos.

Documentation for the hashes
Example images with P9310 and P12563 values

The first targets are photos from Europeana, Estonia, Finland, Sweden and Flickr, but the long-term goal is to add image hashes to all Commons photos. So far we have used FinnaUploadBot for Finna images. The new account is meant to be a dedicated account and service for edits not related to Finna.

Automatic or manually assisted: automatic

Edit type (e.g. Continuous, daily, one time run): first batch jobs, later continuous

Maximum edit rate (e.g. edits per minute):

Bot flag requested: (Y/N): Y

Programming language(s):

Zache (talk) 15:08, 12 April 2024 (UTC)

Discussion
What is the use of such hashes? --EugeneZelenko (talk) 14:47, 13 April 2024 (UTC)
One can use them to compare the similarity of pictures: the fewer bits in which two hashes differ, the more alike the images are. This makes it possible to detect duplicates and to match photos across different repositories. We have used image hashes to prevent duplicates when uploading files, and to avoid updating the wrong photos when reuploading photos from Finna in better quality and/or updating their metadata. --Zache (talk) 16:31, 13 April 2024 (UTC)
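To illustrate the comparison described above, here is a minimal Python sketch. It assumes the hashes are available as equal-length hexadecimal strings; the encoding actually stored in P9310 and P12563 may differ. The function names and the threshold of 4 bits are hypothetical and chosen only for illustration, not taken from the bot's code.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the bits that differ between two hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 bit wherever the two hashes disagree.
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def is_probable_duplicate(hash_a: str, hash_b: str, max_distance: int = 4) -> bool:
    """Treat two images as likely duplicates if few bits differ.

    max_distance is an assumed, illustrative threshold.
    """
    return hamming_distance(hash_a, hash_b) <= max_distance
```

A distance of 0 means the hashes are identical. Where to draw the line between "similar" and "different" depends on the hash type and the collection, so the threshold would need tuning in practice.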
Such hashes would make much more sense as part of the Commons database. --EugeneZelenko (talk) 14:26, 14 April 2024 (UTC)
In SDC they are file metadata, and SPARQL in particular would be an easy way to query and share the hashes for external use. I.e., they are part of the metadata of the files. Zache (talk) 14:52, 14 April 2024 (UTC)
Also, even if the information were added to the Wikimedia Commons database (there are good technical reasons why one would prefer an external service over adding this to MediaWiki core), I would like to note that we already populate SDC values from the Commons internal database using bots. The most notable in this context are the SHA-1 checksum, MIME type, image width, and image height (Commons:Structured data/Modeling/Meta). And yes, there would probably be better ways to do this, but currently using bots is the preferred method. --Zache (talk) 06:42, 18 April 2024 (UTC)
Has there been any community discussion on whether such data should be generated at large scale? Krd 06:53, 18 April 2024 (UTC)
I am not aware of any wider discussion. The existing discussions, to my knowledge, are related to Fæ's User:Fæ/Imagehash and village pump discussions 1 and 2. My structured data property proposal in 2021 received no follow-up comments on Wikimedia Commons. Phabricator has some tickets (for example, phab:T121797) related to image hashing.
Also, just for background, I am running ImageHash-Toolforge, which has approximately 25% of Wikimedia Commons bitmap images (jpg, tiff, png) indexed with phash and dhash. I also made a Wikimania lightning talk proposal for it. (Proposals are currently under review.) My current idea was to proceed gradually when adding values to SDC, and my current personal need was to add hashes to European and Estonian photos before the Wikimedia Hackathon, Tallinn, in May so they would be available there. (see my question in Commons_talk:Bots/Requests#Extending_FinnaUploadBot).
However, if you think I should start a discussion at the village pump or on the Structured Data talk pages, I am happy to do so. --Zache (talk) 07:49, 18 April 2024 (UTC)
Please do. Krd 05:48, 21 April 2024 (UTC)
I have now made a village pump proposal. --Zache (talk) 16:44, 17 May 2024 (UTC)