> a bug in “Archive Utility” on macOS prevents it from decompressing the resulting file
I looked into this in the past, it's because they check for a "PK" header at the start of the file - which is of course not actually required. I assumed it was deliberate because it does exclude most "weird" ZIPs.
> it's because they check for a "PK" header at the start of the file
Lots of FOSS tooling will have a similar limitation due to the lack of support in the shared-mime-info spec for reading identifying features from the ends of files. Please vote/comment on this issue to voice your support: https://gitlab.freedesktop.org/xdg/shared-mime-info/-/issues...
garaetjjte 22 days ago [-]
But the EOCD record is not guaranteed to sit at a fixed offset from the end of the file either (well, it is the last record, but it ends with that stupid variable-length comment field).
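To illustrate the point about the comment field, here is a minimal Python sketch of how a reader might locate the End of Central Directory (EOCD) record by scanning backward from the end of the file instead of checking for "PK" at offset 0. The function names `find_eocd` and `build_polyglot` are made up for this example, and the check that the comment ends exactly at EOF is a simplifying assumption (real tools differ in how strict they are).

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_MIN_SIZE = 22          # fixed-size part of the EOCD record
MAX_COMMENT_SIZE = 0xFFFF   # the comment length field is 16 bits wide


def find_eocd(data: bytes) -> int:
    """Return the offset of the EOCD record, or -1 if none is found."""
    # The record can start at most 22 + 65535 bytes before the end of the file.
    start = max(0, len(data) - EOCD_MIN_SIZE - MAX_COMMENT_SIZE)
    pos = data.rfind(EOCD_SIGNATURE, start)
    while pos != -1:
        if pos + EOCD_MIN_SIZE <= len(data):
            # The 2-byte comment length sits at offset 20 of the record.
            (comment_len,) = struct.unpack_from("<H", data, pos + 20)
            # Assumption: accept a candidate only if its comment ends at EOF.
            if pos + EOCD_MIN_SIZE + comment_len == len(data):
                return pos
        # Keep looking for an earlier candidate signature.
        pos = data.rfind(EOCD_SIGNATURE, start, pos)
    return -1


def build_polyglot() -> bytes:
    """Build a ZIP with a trailing comment and non-ZIP data prepended."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("index.html", "<p>hello</p>")
        zf.comment = b"trailing comment"
    return b"<!-- not a PK header -->" + buf.getvalue()
```

Python's own `zipfile` module reads the output of `build_polyglot()` even though it does not start with "PK", which is the behavior Archive Utility lacks.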
gildas 22 days ago [-]
Note that you can also take advantage of the fact that a ZIP can be password-protected and make your web page secret! For example https://gildas-lormeau.github.io/private/ (password: "thisisapage").
jclarkcom 22 days ago [-]
If you are loading external libraries like in this example, your encrypted data is at risk. It would be better to include the decryption code directly in the JS or embed a JS zlib implementation.
gildas 22 days ago [-]
It's possible to define the Content Security Policy with a <META> tag in the "bootstrap page" and prevent this kind of security issue, e.g. <META http-equiv="content-security-policy" content="connect-src 'self' data: blob:;">
Thorrez 22 days ago [-]
I don't think that will prevent data exfiltration. Malicious javascript could create e.g. an img element with the data to exfiltrate stored in a query parameter of the image URL.
gildas 22 days ago [-]
The request will be blocked by the CSP.
Thorrez 21 days ago [-]
Why? The CSP policy isn't setting default-src or img-src. So image loads are allowed from everywhere.
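The fallback behavior being discussed can be sketched in a few lines of Python. This is a simplified model, not a full CSP parser: it only covers fetch directives that fall back directly to default-src (such as img-src and connect-src), and the helper names `parse_csp` and `effective_sources` are invented for this example.

```python
from typing import Optional


def parse_csp(policy: str) -> dict:
    """Split a CSP string into {directive name: list of source expressions}."""
    directives = {}
    for part in policy.split(";"):
        tokens = part.split()
        if tokens:
            # If a directive is repeated, only its first occurrence counts.
            directives.setdefault(tokens[0].lower(), tokens[1:])
    return directives


def effective_sources(policy: str, directive: str) -> Optional[list]:
    """Return the source list governing `directive`, or None if unrestricted.

    Simplification: assumes `directive` falls back directly to default-src.
    """
    directives = parse_csp(policy)
    if directive in directives:
        return directives[directive]
    return directives.get("default-src")


EXAMPLE_POLICY = "connect-src 'self' data: blob:;"
STRICTER_POLICY = "default-src 'none'; connect-src 'self' data: blob:"
```

With `EXAMPLE_POLICY`, `effective_sources(EXAMPLE_POLICY, "img-src")` returns `None`, meaning image loads are not restricted at all, while `STRICTER_POLICY` yields `["'none'"]` for the same directive.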
gildas 21 days ago [-]
That was just an example of the syntax; nothing prevents you from blocking more resources and sandboxing the page.
Thorrez 20 days ago [-]
If we make it strict enough to block exfiltration, it'll block the external libraries from loading. So that means we have to load our scripts from the same origin instead of external origins (as jclarkcom suggested).
But the whole reason for CSP was to allow us to use external libraries without exfiltration risk. If we stop using external libraries, then our motivation for using CSP is gone. So CSP is useless for the purpose of this conversation.
gildas 19 days ago [-]
I think there's been a misunderstanding: there was an error in the article suggesting that zip.min.js is not inlined in the page. This error has since been corrected. I'm sorry for that. The goal is obviously to create pages that work offline, as shown in the demo.
infotogivenm 22 days ago [-]
Subresource Integrity (the integrity attribute) is probably the more applicable feature for gp's concerns
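For reference, an integrity value is the hash algorithm name, a hyphen, and the base64-encoded digest of the exact bytes of the script. A minimal Python sketch for computing one (the helper name `sri_hash` is made up for this example):

```python
import base64
import hashlib


def sri_hash(content: bytes, algorithm: str = "sha384") -> str:
    """Return a value usable in an integrity attribute, e.g. 'sha384-...'."""
    digest = hashlib.new(algorithm, content).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"
```

The result would go into something like `<script src="..." integrity="sha384-..." crossorigin="anonymous">`, so the browser refuses to run the library if its bytes change.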
nhinck3 22 days ago [-]
You can also use the SubtleCrypto API
zzo38computer 22 days ago [-]
I would probably prefer to use text other than "Please wait..." since it won't work if JavaScript is disabled. This can be fixed by changing the text to something such as "This is an HTML/ZIP/PNG polyglot file". And then, omit the <title> to save space.
A <noscript> tag would be even more suitable, but I agree with the principle. I added a link to view the demo without downloading the file, see https://gildas-lormeau.github.io/Polyglot-HTML-ZIP-PNG/demo.... (it was not working previously because GitHub serves pages in UTF-8).
OkGoDoIt 22 days ago [-]
I was hoping for an example PNG on the webpage to showcase that it actually works. I’m on my phone so I can’t do much with a downloaded zip file. But it would be cool to see that the PNG renders like a normal image on Safari mobile.
gildas 22 days ago [-]
Note that if you're on iOS, it's possible that the HTML page doesn't work at all because when it's opened from the filesystem, it's displayed by a viewer which doesn't support JS instead of Safari.
Dwedit 22 days ago [-]
It's the "Rennes JS User Group" image that you see in the middle of the HTML page.
a1o 22 days ago [-]
I am also on my phone and found it weird that there wasn't a single online demo
This opens a download dialog for me rather than rendering the HTML (in Firefox on Android)
gildas 22 days ago [-]
This is done on purpose, so you can rename the file to make sure it's polyglot.
Aardwolf 22 days ago [-]
Thanks, on an actual computer it's easy to check :)
gildas 21 days ago [-]
For the record, I've just added a link to view the file without downloading it.
gavindean90 22 days ago [-]
A screenshot would help
edflsafoiewq 22 days ago [-]
A screenshot of what? It just looks like a normal web page.
Dwedit 22 days ago [-]
I think there's probably a much more efficient way to pack the correction data than JSON. For example, if you wanted to embed a 10MB video file in there, the correction data would be huge.
In the project there, correction data is used to recover bytes that have been changed into LF when they are actually CR or CRLF.
One idea is to store the correction data as binary, then read two bits every time you see a LF byte. It's either an actual LF, a CR, or a CRLF. The downside is that binary data itself could need correction as well, and encoding nearly 1-bit data in 2 bits is still wasteful (but simple). Packing five 3-state values into a byte is less wasteful and would eliminate forbidden symbols, but is still not optimal.
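The "five 3-state values per byte" idea works because 3^5 = 243 fits in one byte. Here is a rough Python sketch of it, assuming the correction data only needs to say whether each LF in the degraded data was originally LF, CR, or CRLF. All function names are invented for this example, this is not how the project or SingleFile stores its data, and the sketch does not remap the packed bytes to avoid forbidden values (the 13 unused byte values would leave room for that).

```python
LF, CR, CRLF = 0, 1, 2
ORIGINALS = (b"\n", b"\r", b"\r\n")


def normalize(data: bytes):
    """Replace CR and CRLF by LF; record what each resulting LF used to be."""
    out, states, i = bytearray(), [], 0
    while i < len(data):
        byte = data[i]
        if byte == 0x0D:
            if i + 1 < len(data) and data[i + 1] == 0x0A:
                states.append(CRLF)
                i += 2
            else:
                states.append(CR)
                i += 1
            out.append(0x0A)
        else:
            if byte == 0x0A:
                states.append(LF)
            out.append(byte)
            i += 1
    return bytes(out), states


def pack_trits(states) -> bytes:
    """Pack five 3-state values into each byte (values 0..242)."""
    packed = bytearray()
    for i in range(0, len(states), 5):
        value = 0
        for state in reversed(states[i:i + 5]):
            value = value * 3 + state  # first state ends up least significant
        packed.append(value)
    return bytes(packed)


def unpack_trits(packed: bytes, count: int):
    """Inverse of pack_trits; `count` trims padding from the last byte."""
    states = []
    for value in packed:
        for _ in range(5):
            states.append(value % 3)
            value //= 3
    return states[:count]


def restore(normalized: bytes, states) -> bytes:
    """Rebuild the original bytes from LF-normalized data and the states."""
    out, remaining = bytearray(), iter(states)
    for byte in normalized:
        if byte == 0x0A:
            out += ORIGINALS[next(remaining)]
        else:
            out.append(byte)
    return bytes(out)
```

This costs 1.6 bits per LF instead of 2, at the price of needing the number of states (or the data length) to trim the padding in the final byte.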
gildas 22 days ago [-]
You're right, SingleFile (which is capable of saving pages in this format) does a little better than the demo, but it can also be optimized. In fact, I chose the JSON format to keep things as simple and didactic as possible for the presentation. I think I need to use your suggestions to optimize this structure in SingleFile ;)
ElectricalUnion 22 days ago [-]
I believe at that point (huge blobs compared to small amounts of plaintext strings), it's easier to embed a universal binary web server and have it serve the contents of the zip, like https://redbean.dev/
porridgeraisin 22 days ago [-]
> However, there's a problem: due to the same-origin policy, retrieving ZIP data directly with fetch("") fails when the page is opened from the filesystem (except in Firefox).
chromium --allow-file-access-from-files
lifthrasiir 22 days ago [-]
> The bootstrap page is now encoded in windows-1252, which allows data to be read from the DOM with minimum degradation.
This is not always the case if the encoded content happens to have `-->`, for example. A better approach would be the `<plaintext>` element which can never be closed.
gildas 22 days ago [-]
Indeed. For example, the HTML of the files used for the presentation slides [1] uses <noframes> tags to keep the HTML well-formed. This point is addressed in the conclusion of the presentation.
By the way, if you're interested in this sort of file format wrangling, check out Ange Albertini's talk tomorrow at 38c3: https://fahrplan.events.ccc.de/congress/2024/fahrplan/talk/Q...
The URL jar:https://raw.githubusercontent.com/gildas-lormeau/Polyglot-HT... can be used to display the HTML file in some web browsers, although it cannot display the PNG file in this way since it uses # as the URL of the picture.
[1] https://github.com/gildas-lormeau/Polyglot-HTML-ZIP-PNG/raw/...