Comments about SHA256 are irrelevant - you can misuse the prefix of a SHA256 hash just as easily. The issue is that people got used to treating human-readable hash prefixes of 10-12 characters as "unique" for all intents and purposes, despite the fact that there were never any uniqueness guarantees for prefixes, and git has always treated colliding short object IDs as ambiguous. It's just that collisions are so rare in the real world that lots of script writers treated "mostly unique" as a guarantee.
IMO support for short object IDs is a mistake, as is any behavior that "works this way 99.999% of the time, but hey developer, don't forget you need to also code for that 0.001% edge case". I'm always just copying and pasting things around anyway, so it really doesn't make much difference to me if I'm copying 12 chars or 64.
poincaredisk 17 days ago [-]
>so it really doesn't make much difference to me if I'm copying 12 chars or 64.
It doesn't have to be 12 characters. I often type short git hashes - maybe 5 characters - when jumping between commits.
throw0101d 19 days ago [-]
First twelve and last twelve characters are the same:
There were some plans to migrate to SHA256, but somehow it still hasn't happened.
The practical upshot is a git commit hash is not enough to know you are distributing/executing the legitimate code, as opposed to a malicious doppelganger. This is particularly important for tools that rely on it for dependency management, local caches, etc.
usr1106 18 days ago [-]
TFA has nothing to do with SHA-1 or SHA-1 collisions. It's about abbreviated hash values introduced for readability by humans. Now these values are used by auxiliary scripts. Which again has little to do with git proper. It's just what the kernel community writes into commit messages and what scripts they use to parse those messages.
TheDong 18 days ago [-]
> The practical upshot is a git commit hash is not enough to know you are distributing/executing the legitimate code, as opposed to a malicious doppelganger
Really now? Mind if I challenge you?
I have on my machine a git repo with commit '75eb4e3b1369706a4dcd61cc80e49660ac341ea4'.
If you can give me a second git repo with such a commit containing different contents, I'll happily send you $10k USD, or donate it to a charity of your choice.
sgjohnson 18 days ago [-]
> If you can give me a second git repo with such a commit containing different contents, I'll happily send you $10k USD, or donate it to a charity of your choice.
Calculating that SHA1 collision is going to be a bit more expensive than $10k, by a couple of orders of magnitude.
Finding it in the wild is improbable, but calculating it is definitely possible, and has been done before. http://shattered.io/
chippiewill 18 days ago [-]
Shattered didn't produce a collision for an arbitrary hash, it produced two documents with the same hash (which is a slightly easier problem, about 100,000x faster).
SHA1 is certainly insecure at this point, but not even close to trivially so.
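A plausible origin for that "100,000x" figure (the comment doesn't cite a source, so this is a reconstruction): SHAttered reported roughly $2^{63.1}$ SHA-1 computations, versus the generic birthday bound of $2^{80}$ for a brute-force collision:

$$\frac{2^{80}}{2^{63.1}} = 2^{16.9} \approx 1.2 \times 10^{5}$$

By contrast, a second preimage against one fixed hash costs on the order of $2^{160}$ generically, which is astronomically more than 100,000x harder.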
bawolff 18 days ago [-]
Indeed. We can't even do this for md5, let alone sha1.
Preimage attacks are very different from collision attacks.
codeflo 18 days ago [-]
That is enough to distribute malicious code though, at least in certain scenarios. Someone might create a setup where reviewers check/sign one version of the source code, and what gets distributed is another version with the same hash.
usr1106 18 days ago [-]
Code review in the Linux kernel still happens by email to a large degree.
Further up in the contribution tree there is additional signing. Would that further complicate the insertion of a false commit? I am not convinced that signing is used all the way down to every contribution.
codeflo 18 days ago [-]
Linux probably has enough eyeballs on its source to make attacks like that unlikely anyway, but Git isn't just used by Linux.
ExoticPearTree 18 days ago [-]
Can you create a proof of concept and show it here?
codeflo 18 days ago [-]
What's your point?
ExoticPearTree 18 days ago [-]
My point is that you need to put the money where your mouth is.
sgjohnson 18 days ago [-]
You're talking about hundreds of millions of dollars to calculate that. "Put your money where your mouth is".
We know it's theoretically possible. And we also know that this theoretical possibility is within the reach of a couple of countries.
Terr_ 18 days ago [-]
If your goal was to prove that SHA1 collisions are unimportant, far too hard for any group to exploit within the next X years of processing improvements...
That means math.
In contrast, this "challenge" stuff is just chasing outrage endorphins and internet points.
Think it through, and it's pointless. Any refusal or negative result is utterly compromised and confounded by things like: How trustworthy you appear; whether the amount is reasonable; whether the random commenter has the skillset, free time, and financial assets to try; whether they're part of a larger group they can recruit; etc.
ExoticPearTree 17 days ago [-]
My goal is to see an actual proof of concept that whatever the person I replied to claims is feasible. Not the daily BS from security wannabes who start with "In certain scenarios it is possible to X and Y" and then never show proof.
"In certain scenarios I could be a ninja": it means absolutely nothing without proving that I actually have the skills and I could actually use them.
It is not pointless, but if you claim something, show the proof.
Dylan16807 17 days ago [-]
The math is the proof of concept when an attack costs that much money to pull off. Or the various papers that show successful attacks on reduced-round versions of the hash.
Do you not accept those? What would you accept as a proof of concept?
ExoticPearTree 16 days ago [-]
I expected a proof of concept for this statement:
_That is enough to distribute malicious code though, at least in certain scenarios. Someone might create a setup where reviewers check/sign one version of the source code, and what gets distributed is another version with the same hash._
Dylan16807 16 days ago [-]
Well the proof of concept without actually having two colliding files is really simple, so I thought it was generally understood.
Here's the easiest to explain way: Upload the malicious version of the file to github. Send an innocuous patch to the kernel devs that creates a file with the same hash. It gets accepted, and anyone that downloads the kernel from github gets the malicious version. Done. That's a small fraction of linux downloaders, but this is just the proof of concept.
rurban 15 days ago [-]
A proof of concept became much easier with C11 unicode identifiers and email patch review. You can trivially hide Cyrillic chars, e.g. between whitespace changes or other trivial "optimizations". Even without collisions.
And with the current surge of GPUs, even collisions are realistic now. The H100s are not doing much when not in training.
poincaredisk 17 days ago [-]
>which is a slightly easier problem, about 100,000x faster
Where did you get this number from? I was under the impression that this is completely infeasible (just like we can generate a good md5 collision in seconds, but we still can't do a preimage attack).
onedognight 19 days ago [-]
This is not a full git commit hash collision. It has to do with a git note which only needs to match a 12-character prefix of the git commit.
usr1106 18 days ago [-]
While you corrected one mistake, you added a new one :)
Those are git trailers, see git-interpret-trailers(1).
git-notes(1) is something completely different and not used by the kernel.
theamk 18 days ago [-]
People don't really care, because the current collision methods are mitigated.
Git does not actually use "sha1", despite what all the docs say. It uses "sha1dc", which is just like sha1 except for inputs that can cause collisions, in which case it either fails with a clear error message or returns a completely different value:
https://news.ycombinator.com/item?id=17825441
so don't worry, git hashes _are_ enough to know you are distributing/executing the legitimate code.
(not to mention you need a preimage attack to replace known commit, and this is not yet possible with sha1)
akoboldfrying 18 days ago [-]
From your link:
>In this case "hash" will be the same as SHA1(input) in all cases, except those where the input is detected to be malicious (as in the SHAttered attack)
I don't see how this can be more than a fundamentally forward-incompatible sticking plaster over the problem. The problem isn't merely that "detecting maliciousness" seems fraught in itself (how does one infer intent reliably?) -- it's that today's SHA1DC() implementation can only detect and optionally correct today's known attacks, so each new attack necessitates a new, incompatible version of SHA1DC().
theamk 18 days ago [-]
Each new _unrelated_ SHA1 attack will need an update in SHA1DC. But it has to be truly unrelated, as the collision detection method is fairly robust. I recommend reading the original "Counter-cryptanalysis" paper [0] for details on how the attacks work and how they can be mitigated (there is certain internal state in SHA1 that is used in all known attacks). BTW, this paper has an interesting anecdote: apparently the Flame malware had exploited MD5 collisions using a novel unpublished attack method... and yet it was detected by the collision detector (section 3.2). Another example: the SHA-mbles attack, published 3 years after "Counter-cryptanalysis", was detected as well, with no code changes required.
No, there is nothing "fundamentally incompatible" in the new SHA1DC method. After all, git came out in 2005, 11 years before SHA1 attacks were known, so it used regular SHA1. The collision detector was added in 2017 and nothing broke, because the false positive chance is 2^-90 [1].
I have not heard of any new SHA1 collision results, but if they are based on no-difference differential paths, git has nothing to worry about. And if they are not, it may be possible to extend DC detector to seamlessly detect and prevent those attacks, and then only upgrade git clients, keeping backward and forward compatibility for data.
Of course there is always a chance that someone will come out with all-new SHA1 preimage attack that cannot be detected without high rate of false positive, so it's prudent to switch git to sha256. There is a lot of work being done: git's sha256 mode went out of beta in 2.42 (2023), but neither github nor gitlab support it.
But since the current state of git's sha is that there is nothing broken, and a git commit hash _is_ "enough to know you are distributing/executing the legitimate code, as opposed to a malicious doppelganger", there is no real pressure.
[0] https://marc-stevens.nl/research/papers/C13-S.pdf
[1] https://github.com/cr-marcstevens/sha1collisiondetection
> The problem isn't merely that "detecting maliciousness" seems fraught in itself (how does one infer intent reliably?)
It's not detecting "intent"; it is detecting that the hash is one vulnerable to the attack, which is extremely unlikely to happen by accident, so if you see it you can assume malice.
It might be a band-aid, and sha256 is certainly a much better solution, but it's more robust than it sounds at first glance (since it sounds crazy at first glance).
Dylan16807 18 days ago [-]
Sure it's plaster, but plaster can last a good while. Sufficiently new attacks don't come around all that often, and every hash has a risk of new attacks showing up.
But agreed that it's a rather tougher patch than I'd originally thought, reading elsewhere in this thread.
oefrha 18 days ago [-]
Switching to SHA-256 and switching to longer substrings of hashes for identification are basically orthogonal problems. The former is hardly going to help with the latter, except in the "we already broke everything, so why not take the chance to break some more" sense.
shakna 19 days ago [-]
Compatibility between remotes using one or the other hasn't arrived yet, and git doesn't want to break compatibility. But you can create SHA256 ones today. [0]
[0] https://lwn.net/Articles/898522/
The hash space is atoms-in-the-universe range; this is a collision in a much, much smaller subset of that space
dec0dedab0de 18 days ago [-]
I think multiple hashes is the way to go to avoid collisions. It can even be something simple like md5. The chances of finding a collision that matches two or more algorithms are near impossible. Obviously that doesn't work for passwords, but for verifying that data hasn't been tampered with, it works.
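A minimal sketch of that idea in Python, assuming the goal is tamper-evidence rather than password storage (the helper names `fingerprint` and `verify` are illustrative, not from any particular tool):

```python
import hashlib

def fingerprint(data: bytes) -> tuple[str, str]:
    # Two independent digests: forging data that still verifies would
    # require a simultaneous collision in both algorithms at once.
    return (hashlib.md5(data).hexdigest(),
            hashlib.sha256(data).hexdigest())

def verify(data: bytes, expected: tuple[str, str]) -> bool:
    # Accept the data only if *both* digests match.
    return fingerprint(data) == expected
```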
Presumably the problem is that these tools only take the abbreviated hash into account. Not also the subject:
<abbrev. hash> ("<subject>")
You also have another data point. You only need to search in the history from the commit that you are reading. Assuming that the "Fixes" commit is an ancestor of the commit whose commit footer you are reading.
I always just assumed that tools would take all the data into account. Which means that you both need to collide with the abbreviated hash as well as the subject. Now I don't do that since I just copy-paste the hash, but I would quickly notice in case the subject is different (and likely the commit message and the diff just look irrelevant).
I don't understand why the Linux Kernel has this hard-coded rule[2] -- again, you were going to get collisions eventually, so the tools should have just taken all the data into account (at least the subject) from the start. The recommendation in the Git project is to use `git show -s --pretty=reference`, without any fiddling with the abbreviation:
<abbrev. hash> (subject, ISO date)
Although the Git maintainer uses `--abbrev=8` since git-show will just use a longer abbreviation in case the output would be ambiguous[1].
They could have used this instead if they wanted simpler, future-proof tooling:
Fixes: <full hash>
Just like tools like git-revert and git-cherry-pick do.
[1]: https://lore.kernel.org/git/xmqq34j5h7v9.fsf@gitster.g/
[2]: Edit: hard-coded as opposed to Git just figuring out how long the abbreviation should be based on how many objects there are.
Well the first mentioned script:
> > Tools like linux-next's “Fixes tag checker”,
has `get_full_hash`[1] which uses the subject to search through the abbreviated matches.
Edit: And that check was added two weeks ago by Kees [2].
[1]: https://github.com/kees/kernel-tools/blob/trunk/helpers/chec...
[2]: https://github.com/kees/kernel-tools/commit/5bf6a1e71df59a23...
git itself does not use a fixed size abbreviation but determines the length necessary with the birthday paradox formula. It just happens to be seven characters most of the time, because most repos are small.
This is just for the abbreviated hash which git uses only for specific cases like display in the git log and similar, where it is a user experience improvement and relatively safe.
chrishill89 18 days ago [-]
I know. I don’t understand why the Linux Kernel uses a hard-coded abbreviation instead of just letting Git figure out how long it should be (edited my comment now).
Too 17 days ago [-]
The length of the required abbreviation grows over time, while content you put in commit messages sticks around. Meaning a commit done in year 2005 may have gotten away with "Fixes: abcd" and was unambiguous at that time, whereas after 20 years of growth, there could now be multiple other commits with that prefix.
By that logic, any length of abbreviation will eventually fail if stored for a long time; those abbreviations should be seen as temporary, for user interaction only. For the "Fixes:" footer they should have gone with the full hash: it's hidden away in the footer so it doesn't disrupt anyone, and anyone interacting with it will double-click to copy it regardless of whether it is 8 or 40 characters long. Abbreviating it simply adds no value.
Dylan16807 17 days ago [-]
Well, a fix has to be for a parent commit, so future commits and future growth aren't nearly as important.
dotancohen 18 days ago [-]
> with the birthday paradox formula.
At what probability? Can that probability be configured as a global Git option?
Dylan16807 17 days ago [-]
So as noted you can configure the length, but you can't actually adjust the probability used by the auto-length formula.
The formula calculates how many bits are needed to store the approximate number of objects, then uses twice that many bits, rounding up.
For example, at 16000 objects git is still using the minimum length of 7 characters. But at 17000 objects it now takes 15 bits to store the object count, so it wants 30 bits of hash, which means 8 characters.
https://git-scm.com/docs/git-config#Documentation/git-config...
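A minimal sketch of that heuristic in Python (the function name `abbrev_len` is made up; this mirrors the calculation described above, not git's exact source):

```python
def abbrev_len(nr_objects: int, min_len: int = 7) -> int:
    # Bits needed to represent the approximate object count.
    bits = nr_objects.bit_length()
    # Square the space (birthday bound): use twice as many bits,
    # rounded up to whole hex characters (4 bits each).
    hex_chars = -(-2 * bits // 4)  # ceil(2 * bits / 4)
    return max(min_len, hex_chars)

print(abbrev_len(16000))  # 7  (14 bits -> 28 bits -> 7 hex chars)
print(abbrev_len(17000))  # 8  (15 bits -> 30 bits -> 8 hex chars)
```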
I love when the word "hopefully" is used in technical documentation ))
MadnessASAP 18 days ago [-]
All technical documentation is wishful thinking, they're just being explicit about it.
loeg 18 days ago [-]
Cute. I've seen CPU commit prefix brute-forcers but not GPU ones. On the CPU, you can generate chosen 7-8 digit prefixes pretty easily (within a few minutes, I forget exactly). 12 digits is, of course, a massively bigger (factor of 2^16) space.
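For the curious, a rough sketch of such a CPU brute-forcer in Python. The helper names are invented, and a real commit payload needs valid tree/parent/author headers for git to accept it; this only shows the hash-grinding loop:

```python
import hashlib

def git_object_id(payload: bytes, obj_type: bytes = b"commit") -> str:
    # A git object id is the SHA-1 of "<type> <size>\0" plus the payload.
    return hashlib.sha1(obj_type + b" %d\0" % len(payload) + payload).hexdigest()

def grind_prefix(base: bytes, prefix: str) -> bytes:
    # Vary a throwaway trailer in the commit message until the id starts
    # with the chosen hex prefix. Expected work is 16**len(prefix) hashes:
    # ~2.7e8 for a 7-char prefix, and 2**16 times more for 12 chars than 8.
    nonce = 0
    while True:
        candidate = base + b"\nnonce: %d\n" % nonce
        if git_object_id(candidate).startswith(prefix):
            return candidate
        nonce += 1
```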
18 days ago [-]
timewizard 18 days ago [-]
Would tooling break if the radix was changed from 16 to something higher? You could double your resistance by just changing to base 32 representation in the same length
Dylan16807 18 days ago [-]
So much would break if you changed the character set. And there would be pretty much no benefit. Making it a hundred times harder* is not nearly enough to prevent short collisions.
* The exact amount would depend on repo and method. Using base 32 and changing nothing else would slow collisions by 2^length.
timewizard 18 days ago [-]
> would slow collisions by 2^length.
It would add a bit to each position, going from the current 48 bits to 60 bits. Or 4,096 times more resistant to _accidental_ collision. You can have as many bits as you want; you just make it more time consuming to create one. But if your entire infrastructure blithely relies on one never _intentionally_ being submitted, you've solved nothing really.
Moreover, if you find a commit with a collision, it's very easy to slightly alter its contents to alleviate the problem, presuming we're only having to contend with unintentional conflicts.
Dylan16807 18 days ago [-]
The length of the short hash automatically adjusts to keep the chance of accidental collisions low. There's no need to lower the risk of accidental collisions.
So changing the character set doesn't notably improve the accidental situation, and it doesn't notably improve the deliberate situation. And it would be a huge pain to do.
timewizard 18 days ago [-]
In the kernel case the length of the short hash is fixed at 12 characters. This is part of the article.
Dylan16807 18 days ago [-]
Okay, sorry, I was talking about the normal short hash rather than this specific tool. And I first looked before the link was changed.
Still, they already figured out a solution. They don't need multiple mechanisms for the same thing.
(Submitted URL was https://lwn.net/Articles/1003797/)
quotemstr 18 days ago [-]
We could just encode hashes in base64 instead of hex and get several times more entropy with the same size text
Retr0id 18 days ago [-]
base64 only gets you 1.5x more entropy per symbol (measuring in bits)
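The arithmetic, for reference: entropy per symbol is $\log_2$ of the alphabet size, so a wider alphabet only helps linearly per character:

$$\frac{\log_2 64}{\log_2 16} = \frac{6}{4} = 1.5$$

Twelve hex characters carry 48 bits; twelve base64 characters would carry 72.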
quotemstr 18 days ago [-]
Oh, right. It's logarithmic. That'll teach me to post before coffee.
userbinator 18 days ago [-]
In other words, 6 out of 20 bytes match. Good luck colliding the other 112 bits.
bandrami 18 days ago [-]
The problem is some (broken and wrong but in the wild) tooling relies on substrings like that
Dilettante_ 18 days ago [-]
Somebody ought to tell them not to do that! [Spoken in the tone of 'You can't park there sir!']