Estimate compressibility of file
Is there a quick and dirty way of estimating the gzip-compressibility of a file without having to fully compress it with gzip?

In bash, I could do:

bc <<<"scale=2; $(gzip -c file | wc -c) / $(wc -c <file)"

This gives me the compression factor without having to write the .gz file to disk; this way I can avoid replacing a file on disk with its .gz version if the resultant disk space savings do not justify the trouble. But with this approach the file is indeed fully put through gzip; it's just that the output is piped to wc rather than written to disk.
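For context, here is roughly how such a ratio could gate the replacement. This is a minimal sketch; the 0.90 cutoff is an arbitrary example, not a recommendation:

ratio=$(bc <<<"scale=2; $(gzip -c file | wc -c) / $(wc -c <file)")
# bc prints 1 if the comparison holds, 0 otherwise
if [ "$(bc <<<"$ratio < 0.90")" -eq 1 ]; then
    gzip file    # worthwhile: replaces file with file.gz
else
    echo "ratio $ratio: not worth compressing"
fi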
Is there a way to get a rough compressibility estimate for a file without having gzip work on all its contents?
compression gzip
asked Sep 16 '14 at 16:48 by iruvar
3 Answers
You could try compressing one block out of every 10, for instance, to get an idea:

perl -MIPC::Open2 -nE 'BEGIN{$/=\4096;open2(*I,*O,"gzip|wc -c")}
if ($. % 10 == 1) {print O $_; $l+=length}
END{close O; $c = <I>; say $c/$l}' < file
(here with 4K blocks).
answered Sep 16 '14 at 18:23 by Stéphane Chazelas
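If you'd rather stay in the shell, here is a rough sketch of the same sampling idea (my own illustration, not part of the answer above): read one 4 KiB block out of every ten with dd, pipe the samples through gzip, and compare sizes. It assumes GNU stat, a seekable regular file at least a few blocks long, and one dd invocation per sampled block makes it slow on very large files:

f=file
size=$(stat -c %s "$f")
compressed=$(
    i=0
    while [ $((i * 4096)) -lt "$size" ]; do
        # read block i, then jump ahead 10 blocks
        dd if="$f" bs=4096 skip="$i" count=1 2>/dev/null
        i=$((i + 10))
    done | gzip | wc -c
)
# roughly one tenth of the file was sampled
bc <<<"scale=2; $compressed / ($size / 10)"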
Here's a (hopefully equivalent) Python version of Stéphane Chazelas's solution:
python -c "
import zlib
from itertools import islice
from functools import partial
import sys

# Python 2: read 4 KiB chunks and compress every 10th one
with open(sys.argv[1]) as f:
    compressor = zlib.compressobj()
    t, z = 0, 0.0
    for chunk in islice(iter(partial(f.read, 4096), ''), 0, None, 10):
        t += len(chunk)
        z += len(compressor.compress(chunk))
    z += len(compressor.flush())
    print z/t
" file
edited Apr 13 '17 at 12:36 by Community♦
answered Sep 16 '14 at 19:14 by iruvar

I don't know if that's equivalent (as Python is not my cup of coffee ;-)), but it gives slightly different results (even for very large files, where the gzip header size overhead can be ignored), possibly because zlib.compressobj uses different settings than gzip (I find it's closer to gzip -3), or maybe because zlib.compressobj compresses each chunk in isolation (as opposed to the stream as a whole). In any case, both approaches should be good enough. I find that the Perl one is slightly faster. – Stéphane Chazelas, Sep 17 '14 at 11:55
@StéphaneChazelas, judging by the documentation, compressobj should be acting on the stream as a whole. I find that both the Python and Perl solutions produce approximately the same result on my data files (they differ in the second decimal place, which is good enough for me). Thanks for the great idea! – iruvar, Sep 17 '14 at 12:48
I had a multi-gigabyte file and I wasn't sure if it was compressed, so I test-compressed the first 10M bytes:
head -c 10000000 large_file.bin | gzip | wc -c
It's not perfect, but it worked well for me.
answered 8 mins ago by aidan
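To turn that byte count into a ratio comparable with the one in the question (my own addition, reusing the file name from the answer), divide by the number of bytes sampled:

bc <<<"scale=2; $(head -c 10000000 large_file.bin | gzip | wc -c) / 10000000"

Note that head -c on a file smaller than 10000000 bytes outputs the whole file, so for small files divide by $(wc -c <large_file.bin) instead.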