Hashing videos

Berthold Stoeger

2018-05-22 20:45:44 UTC

Dear all,

I'm currently implementing addition of videos to the dive photos. This happens
to be dog-slow, because we calculate hashes of the file contents. As you can
imagine, addition of multiple videos with a few GB each is a major CPU hog.
Granted, the UI stays responsive, since this is done in background threads.
Nevertheless, it gives a bad impression if the CPUs run at 100% for a few
minutes.

What are we supposed to do? Hash only the first MB? That would unfortunately
not be backwards-compatible. Do different things for images and videos? Sounds
hard to get right.

Or perhaps even remove the hashes? I found three places that use them:
1) In git storage. This is unsupported afaik.
2) The "Find moved images" functionality. Perhaps searching for (case-
insensitive?) filenames is enough? Or perhaps match by metadata?
3) In current head it is also used for the thumbnail files, but this could be
changed before doing the next release.

Ideas?

Berthold

Lubomir I. Ivanov

2018-05-22 21:06:03 UTC

hello,

i believe Robert should comment on this as he originally wrote the
first implementation of the profile photos.
(CC).

Post by Berthold Stoeger
Dear all,
I'm currently implementing addition of videos to the dive photos. This happens
to be dog-slow, because we calculate hashes of the file contents. As you can
imagine, addition of multiple videos with a few GB each is a major CPU hog.
Granted, the UI stays responsive, since this is done in background threads.
Nevertheless, it gives a bad impression if the CPUs run at 100% for a few
minutes.
What are we supposed to do? Hash only the first MB? That would unfortunately
not be backwards-compatible. Do different things for images and videos? Sounds
hard to get right.

i was thinking about running hashes on the thumbnails but that has a
couple of problems:
1) if Qt changes the backend of the code we use for thumbnail
generation the hashes would stop matching
2) thumbnail generation for videos would need to happen not for the
first frame but rather for an arbitrary point of the video timeline -
e.g. thumbnail at 30% length.
(that's actually a good generic way of doing it, instead of using
always the first frame)
but if two thumbnails for two videos happen to have exactly the same
frame at those 30% of the length (e.g. consider a black screen
transition), we risk generating the same hash for two different videos
for the same frame.

that on the other hand might not be ever possible for compressed
video, as the compression adds noise which would essentially generate
thumbnails with slightly different bytes, unless it's uncompressed RAW
video in which case the thumbnails would match perfectly and therefore
the hashes too.

i would still consider this as an option if we really need hashes and
we want them to be fast.

i guess the biggest question here is what are the hashes used for?
if they are used to skip the generation of thumbnails for already
existing media, then the above proposal is completely invalid.

Post by Berthold Stoeger
1) In git storage. This is unsupported afaik.
2) The "Find moved images" functionality. Perhaps searching for (case-
insensitive?) filenames is enough? Or perhaps match by metadata?
3) In current head it is also used for the thumbnail files, but this could be
changed before doing the next release.

something like hashing the date/time + metadata is a good option too i guess.
depends on what we need a hash for.

lubomir
--

Berthold Stoeger

2018-05-23 05:22:19 UTC

Hi Lubomir and Willem,

Post by Lubomir I. Ivanov
i was thinking about running hashes on the thumbnails but that has a

I think generating hashes of thumbnails is out of the question. Not only, as
you note, may Qt's scaling algorithm change; extracting thumbnails from videos
is at the moment not even supported. You can have different streams, embedded
thumbnail(s), and other complexities. This is all very unstable.

Post by Lubomir I. Ivanov
i guess the biggest question here is what are the hashes used for?
if they are used to skip the generation of thumbnails for already
existing media, then the above proposal is completely invalid.

Indeed, let's wait for Robert's assessment.

Post by Lubomir I. Ivanov

something like hashing the date/time + metadata is a good option too i
guess. depends on what we need a hash for.

We wouldn't even have to hash that, as we just store it unhashed. One scheme
that came to mind (supposing the only point of the hashes is to find moved
pictures): We consider two pictures as equivalent if
1) They have the same filename (modulo path and case)
2) They have the same length
3) They have the same meta-data in the case of JPEG
Finding two different pictures fulfilling 1-3 must be very bad luck. We
currently don't store file-length, but that can be trivially rectified when
opening an old log.

It would not find renamed pictures, but that also sounds like a case of "tough
luck".
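
A minimal sketch of the three-part comparison above, in Python for brevity. The `MediaRecord` type, its field names, and `same_media` are hypothetical and not part of the existing code; the metadata is reduced to a single optional timestamp purely for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MediaRecord:
    """Hypothetical record of what the log would store per picture."""
    path: str                                  # full path as saved in the log
    length: int                                # file length in bytes
    metadata_timestamp: Optional[int] = None   # e.g. from JPEG metadata, if any


def _filename(path: str) -> str:
    # Strip the directory part for both "/" and "\" separators, then
    # fold case so "DSC_0001.JPG" and "dsc_0001.jpg" compare equal.
    return re.split(r"[\\/]", path)[-1].casefold()


def same_media(a: MediaRecord, b: MediaRecord) -> bool:
    """Return True if a and b satisfy criteria 1-3 from the list above."""
    if _filename(a.path) != _filename(b.path):
        return False                           # 1) filename, modulo path and case
    if a.length != b.length:
        return False                           # 2) file length
    return a.metadata_timestamp == b.metadata_timestamp   # 3) metadata
```

As noted above, a renamed file would not match under this scheme.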

Berthold

Robert Helling

2018-05-23 08:01:20 UTC

Hi,

Post by Berthold Stoeger
1) They have the same filename (modulo path and case)
2) They have the same length
3) They have the same meta-data in the case of JPEG
Finding two different pictures fulfilling 1-3 must be very bad luck. We
currently don't store file-length, but that can be trivially rectified when
opening an old log.

we originally introduced the hashing to make the "find images" thing possible so you don't have to preserve paths (and filename conventions) between different computers. On the other hand, we want to notice when the user changed the image (for example by photoshopping, so I guess we have to take the content into account).

So my choice would be: Completely ignore filename and path, but maybe take into account length and creation date. I don't have a lot of experience but why not hash 1MB of data after seeking to 30% of file size? I would guess that is a pretty good test. Or maybe there is an easy way to take internal metadata into account as well?
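
A rough sketch of the "1MB at 30%" idea, in Python for brevity. The function name `partial_hash`, the choice of SHA-1, and mixing the file length into the hash are assumptions made for illustration, not a description of the existing code.

```python
import hashlib
import os

CHUNK_SIZE = 1024 * 1024   # read at most 1 MB


def partial_hash(path: str) -> str:
    """Hash the file length plus up to 1 MB starting at 30% of the file size."""
    size = os.path.getsize(path)
    offset = size * 3 // 10            # 30% of the size, in integer arithmetic
    h = hashlib.sha1()
    # Assumption: include the length so that files of different size differ
    # even if the sampled window happens to be identical.
    h.update(str(size).encode("ascii"))
    with open(path, "rb") as f:
        f.seek(offset)
        h.update(f.read(CHUNK_SIZE))   # short read near the end of file is fine
    return h.hexdigest()
```

The cost is bounded by the window size regardless of how large the video is. The price is that an edit which keeps the length and leaves the sampled window untouched goes unnoticed.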

Best
Robert

Berthold Stoeger

2018-05-23 13:23:36 UTC

Hi Robert,

Hi,

we originally introduced the hashing to make the „find images“ thing
possible so you don’t have to preserve paths (and filename conventions)
between different computers. On the other hand, we want to notice when the
user changed the image (for example by photoshopping, so I guess we have to
take the content into account).

Under which circumstances do we note that the file changed? The only way I
currently know of is when the thumbnail is recalculated.

So my choice would be: Completely ignore filename and path, but maybe take
into account length and creation date. I don’t have a lot of experience but
why not hash 1MB of data after seeking to 30% of file size? I would guess
that is a pretty good test. Or maybe there is an easy way to take internal
metadata into account as well?

I fear that any such change would not be backwards-compatible with the current
hashes. What we could do is for <10 MB files hash all and for >10 MB hash
filesize + metadata or some such scheme. I hope the <10 MB rule would catch
nearly all current pictures (we're currently not supporting RAW images, are
we?).
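
A sketch of that size-dependent scheme, in Python for brevity. The name `media_key`, the use of SHA-1, and passing the metadata in as raw bytes are assumptions for illustration. A file of exactly 10 MB is treated as large here, which the rule above leaves open.

```python
import hashlib
import os

FULL_HASH_LIMIT = 10 * 1024 * 1024   # 10 MB threshold from the rule above


def media_key(path: str, metadata: bytes = b"", limit: int = FULL_HASH_LIMIT) -> str:
    """Hash the whole file if it is small, else only file size + metadata."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    if size < limit:
        # Small file: hash the complete contents, read in 64 kB blocks.
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(65536), b""):
                h.update(block)
    else:
        # Large file: never touch the contents.
        h.update(str(size).encode("ascii"))
        h.update(metadata)
    return h.hexdigest()
```

For files below the limit the result is a plain hash of the contents, so existing hashes of small pictures would stay valid only if the current code also uses a plain SHA-1 of the file contents.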

I think a combination of file-length + meta-data would in principle be good
enough for most cases. For PNGs we already get the created time-stamp as a
replacement for the missing metadata. But unfortunately, I was wrong in a
previous mail: We're currently not saving the metadata timestamps - we only
save an "offset", which may be changed by drag&dropping to the profile. :(

One fundamental problem with the metadata is of course that we might change
the metadata extractor in the future to e.g. support XMP, which would
invalidate all old stored metadata.

Dirk, any opinion?

Berthold

Robert Helling

2018-05-23 14:30:36 UTC

Hi,

Post by Berthold Stoeger
Under which circumstances do we note that the file changed? The only way I
currently know of is when the thumbnail is recalculated.

that was at least intended. As I said, I introduced the hashes originally as an abstraction of filename/path in order to be able to show the log on different machines (including different OSs with different path conventions) as long as the files are in some form locally available. So when loading via the hash (meaning: from the hash we would infer the actual filename) we would still compute the hash of the file that was loaded and update the hash accordingly in the log.

It was only later that the hash was also used as a key to cached thumbnails.

Best
Robert

Robert Helling

2018-05-23 14:31:54 UTC

Hi,

Post by Berthold Stoeger
I fear that any such change would not be backwards-compatible with the current
hashes. What we could do is for <10 MB files hash all and for >10 MB hash
filesize + metadata or some such scheme. I hope the <10 MB rule would catch
nearly all current pictures (we're currently not supporting RAW images, are
we?).

I don't think backwards compatibility is any problem at all: At worst, with a new version of hashes we get some cache misses and have to once load the actual file. So startup is slower when first running a new version.

Best
Robert

Berthold Stoeger

2018-05-23 14:53:49 UTC

I don’t think backwards compatibility is any problem at all: At worst, with
a new version of hashes we get some cache misses and have to once load the
actual file. So startup is slower when first running a new version.

Wouldn't the whole "Find moved pictures" functionality fail if we changed the
hash-algorithm?

Berthold

Linus Torvalds

2018-05-23 14:58:56 UTC

Post by Robert Helling
So my choice would be: Completely ignore filename and path, but maybe
take into account length and creation date.

I would disagree. Moving machines, or having the same pictures just on
multiple machines, means that you definitely do not want to take creation
date into account.

Also, I suspect that if you *do* end up photoshopping the picture (fixing
color etc), you would still want it referenced.

So I think it should just do it by filename (not path, although maybe you
want to match up directories "eagerly" - save the path, and use as much of
it as possible, but search for pictures just by filename if you don't find
it in the primary location).
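
One possible reading of that lookup order, sketched in Python for brevity. The function `locate` and its `search_root` parameter are hypothetical: try the saved path as-is, then reuse as much of its tail as possible under a local root, and only then search by bare filename.

```python
import os
import re
from typing import Optional


def locate(stored_path: str, search_root: str) -> Optional[str]:
    """Find a picture by saved path, then by path suffix, then by filename."""
    # 1) Primary location: the path exactly as saved.
    if os.path.isfile(stored_path):
        return stored_path
    parts = [p for p in re.split(r"[\\/]", stored_path) if p]
    if not parts:
        return None
    # 2) Match directories eagerly: longest suffix of the saved path first,
    #    down to the bare filename directly under search_root.
    for i in range(len(parts)):
        candidate = os.path.join(search_root, *parts[i:])
        if os.path.isfile(candidate):
            return candidate
    # 3) Fall back to searching the whole tree by filename only.
    filename = parts[-1]
    for dirpath, _dirs, files in os.walk(search_root):
        if filename in files:
            return os.path.join(dirpath, filename)
    return None
```

The filename comparison in step 3 is case-sensitive in this sketch, and the first match wins if several directories contain a file of the same name.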

The hashing was always questionable. I think it came from the original "put
the whole picture in the repository" which was a complete disaster.

Linus