How can I grep in PDF files?
Is there a way to search pdf files using the power of grep, without converting to text first in Ubuntu?
grep search pdf
edited Jun 22 '13 at 13:19 by Flow
asked Jan 31 '11 at 13:31 by Dervin Thunk
1
I think you need to pass it through pdftotext in order to get usable results back...
– Johan
Jan 31 '11 at 14:29
1
See also "Is there some sort of PDF to text -converter?" and "Command line tool to search phrases in large number of pdf files".
– Gilles
Jan 31 '11 at 20:01
1
For people coming here via search: if you are willing to convert to text files first, have a look at "How to search contents of multiple pdf files?"
– Martin Thoma
Jan 2 '16 at 22:09
14 Answers
Recoll can search PDFs. It doesn't support regular expressions, but it has lots of other search options, so it might fit your needs.
You could pipe it through strings first:
    cat file.pdf | strings | grep <...etc...>
8
Just use strings file.pdf | grep <...>, you don't need cat.
– phunehehe
Jan 31 '11 at 14:31
Yeah - my mind seems to work better with streams... :-)
– Andy Smith
Jan 31 '11 at 14:57
12
This won't work if the text is compressed, which it is most of the time.
– akira
Jan 31 '11 at 15:18
6
Even if the text is uncompressed, it's generally small pieces of sentences (not even necessarily whole words!) finely intermixed with formatting information. Not very friendly for strings or grep.
– Jander
Jan 31 '11 at 16:08
Can you think of another reason why using strings for this wouldn't work? I found that using strings works on some PDFs but not others.
– hourback
Nov 24 '15 at 19:58
Take a look at the common resource grep tool crgrep which supports searching within PDF files.
It also allows searching other resources like content nested in archives, database tables, image meta-data, POM file dependencies and web resources - and combinations of these including recursive search.
Try this (it needs bash, because of read -d ''):
    find /path -iname '*.pdf' -print0 | while IFS= read -r -d '' i; do echo "$i";
    pdftotext "$i" - | grep pattern; done
It prints each file name, followed by the lines in that PDF where the pattern occurs.
cd to the folder containing your PDF file and then run:
    pdfgrep 'pattern' your.pdf
or, if you want to search more than one PDF file (e.g. all PDF files in your folder):
    pdfgrep 'pattern' `ls *.pdf`
or
    pdfgrep 'pattern' $(ls *.pdf)
Why on earth do you use ls to put filenames in parameters? It's not only slower but also a bad idea to use ls output as the input to other commands. Just pdfgrep 'pattern' *.pdf is enough.
– phuclv
7 mins ago
There is a duplicate question on StackOverflow. The people there suggest a variation of harish.venkarts answer:
    find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;
The advantage over the similar answer here is the --with-filename flag for grep. This is somewhat superior to pdfgrep as well, because the standard grep has more features.
https://stackoverflow.com/questions/4643438/how-to-search-contents-of-multiple-pdf-files
I think it would have been better to leave this as a comment (or edit) in the similar answer you are referring to.
– Bernhard
May 9 '14 at 12:07
gpdf might be what you need if you're using GNOME! Check this in case you're not using GNOME. It's got a list of CLI PDF viewers. Then you can use grep to find some pattern.
Hope that helps.
Here is a quick script to search the PDFs in the current directory:
    #!/bin/bash
    if [ $# -ne 1 ]; then
        echo "usage: $0 VALUE" 1>&2
        exit 1
    fi
    echo 'SEARCH IS CASE SENSITIVE' 1>&2
    # Inside bash -c, $0 is the search value and $1 is the file found by find.
    find . -name '*.pdf' -exec /bin/bash -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "$0"' "$1" {} \;
I assume you mean not converting it on disk. You can convert the files to stdout with pdftotext and then grep the result. Grepping the PDF without any sort of conversion is not a practical approach, since PDF is mostly a binary format.
In the directory:
    ls -1 ./*.pdf | xargs -L1 -I {} pdftotext {} - | grep "keyword"
or in the directory and its subdirectories:
    tree -fai . | grep -P '\.pdf$' | xargs -L1 -I {} pdftotext {} - | grep "keyword"
Also, because some PDFs are scans, they need to be OCRed first. I wrote a pretty simple way to find all PDFs that cannot be grepped and OCR them.
I noticed that if a PDF file doesn't have any fonts, it is usually not searchable. Knowing this, we can use pdffonts.
The first two lines of the pdffonts output are the table header, so a searchable file produces more than two lines of output. Knowing this, we can create:
    gedit check_pdf_searchable.sh
then paste this:
    #!/bin/bash
    #set -vx
    # Fewer than 3 lines from pdffonts means header only, i.e. no fonts.
    if (( $(pdffonts "$1" | wc -l) < 3 )); then
        echo "$1"
        pypdfocr "$1"
    fi
then make it executable:
    chmod +x check_pdf_searchable.sh
then list all non-searchable PDFs in the directory (and OCR them):
    ls -1 ./*.pdf | xargs -L1 -I {} ./check_pdf_searchable.sh {}
or in the directory and its subdirectories:
    tree -fai . | grep -P '\.pdf$' | xargs -L1 -I {} ./check_pdf_searchable.sh {}
If you just want to search for PDF names/properties, or for simple strings that are not compressed or encoded, then instead of strings you can use either of the commands below:
    grep -a STRING file.pdf
    cat -v file.pdf | grep STRING
From grep --help:
    --binary-files=TYPE   assume that binary files are TYPE;
                          TYPE is 'binary', 'text', or 'without-match'
    -a, --text            equivalent to --binary-files=text
and cat --help:
    -v, --show-nonprinting   use ^ and M- notation, except for LFD and TAB
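As a rough sketch of the grep -a approach: some PDFs keep document properties such as the title as plain text, in a form like /Title (My Report). The helper below pulls such an entry out of the raw bytes. The function name pdf_raw_info is made up for this example, and it will find nothing when the property is stored compressed or as UTF-16, which is common.

```shell
# pdf_raw_info KEY FILE
# Print uncompressed "/KEY (value)" entries found in the raw bytes of FILE.
# -a treats the binary file as text, -o prints only the matching part.
pdf_raw_info() {
    grep -a -o "/$1 *([^)]*)" "$2"
}
```

For example, pdf_raw_info Title file.pdf or pdf_raw_info Author file.pdf. For anything beyond such simple properties, extract the text first.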
Install the package pdfgrep, then use the command:
    find /path -iname '*.pdf' -exec pdfgrep pattern {} +
answered Dec 23 '11 at 18:40 by enzotib
4
This works on Mac OS X (Mavericks) as well. Install it using brew. Simple. Thanks.
– mikiemorales
Jan 23 '14 at 1:28
6
Out of curiosity I checked the source of pdfgrep, and it uses poppler to extract strings from the PDF. Almost exactly like @wag's answer, only page by page rather than, presumably, the entire document.
– Andrew Martin
Sep 16 '14 at 11:11
3
pdfgrep also has a recursive flag, so this answer could perhaps be reduced to: pdfgrep -R pattern /path/. Though it might be less efficient if it goes through every file even if it isn't a PDF. And I notice that it has issues with international characters such as å, ä and ö.
– Rovanion
Jan 14 '16 at 12:11
Actually, the -n option is a pro for pdfgrep, as it allows including the page number in the output (might be helpful for further processing).
– JepZ
Nov 10 '17 at 20:18
3
This answer would be easier to use if it explained which bits of the command are meant to be copied literally and which are placeholders. What's pattern? What's {}? What's up with the ` +`? I have no idea upon first reading... so off to the manpage I go, I suppose.
– Mark Amery
Apr 20 '18 at 14:44
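To spell out the placeholders asked about in the last comment, here is the same find command wrapped in a small function. This is only a sketch: the name search_pdfs and its two arguments are invented for illustration. /path becomes the directory to search, pattern becomes the text to look for, and {} + is literal find syntax.

```shell
# search_pdfs DIR PATTERN
# DIR     : directory to search (the "/path" placeholder)
# PATTERN : text or regular expression to look for (the "pattern" placeholder)
search_pdfs() {
    # -iname '*.pdf' : names ending in .pdf, ignoring case; the quotes stop
    #                  the shell from expanding the * itself
    # -exec ... {} + : find replaces {} with the files it found and passes
    #                  as many as fit to a single pdfgrep call
    find "$1" -type f -iname '*.pdf' -exec pdfgrep "$2" {} +
}
```

Usage would look like search_pdfs ~/Documents 'invoice'. Ending the -exec with \; instead of + also works, but starts one pdfgrep process per file.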
If you have poppler-utils installed (default on Ubuntu Desktop), you could "convert" it on the fly and pipe it to grep:
    pdftotext my.pdf - | grep 'pattern'
This won't create a .txt file.
answered Jan 31 '11 at 13:45 by wag
1
So... you extract the text before you grep it, which means the answer is "no".
– akira
Jan 31 '11 at 15:18
18
@akira The OP probably meant "without opening the PDF in a viewer and exporting to text"
– Michael Mrozek♦
Jan 31 '11 at 17:36
5
@akira Where do you see "grep only"?
– Michael Mrozek♦
Jan 31 '11 at 18:55
6
@akira Well, I already said what I think he probably meant; he doesn't want to export to text before processing it. I very much doubt he has a problem with any command that converts to text in any way; there's no reason not to
– Michael Mrozek♦
Feb 1 '11 at 5:52
1
@sherrellbc The second argument of pdftotext is the filename it should write to. However, by convention, tools typically allow you to write to stdout instead of to a file by specifying a - instead. Similarly, some tools would write to stdout by default if you omit such an argument entirely (but this is not always possible without creating ambiguity).
– Joost
Sep 23 '16 at 14:06
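Building on the - convention described in the last comment, here is a small sketch that limits the search to a page range. It relies on the -f (first page) and -l (last page) options of poppler's pdftotext. The function name grep_pdf_pages and its argument order are just assumptions for this example.

```shell
# grep_pdf_pages FIRST LAST PATTERN FILE
# Extract pages FIRST..LAST of FILE to stdout (the trailing "-") and grep
# the result. Nothing is written to disk.
grep_pdf_pages() {
    # -n prefixes each hit with its line number in the extracted text
    pdftotext -f "$1" -l "$2" "$4" - | grep -n -- "$3"
}
```

For instance, grep_pdf_pages 10 20 'pattern' my.pdf searches only pages 10 to 20. The line numbers refer to the extracted text, not to anything visible in the PDF.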
No.
A PDF consists of chunks of data: some of them text, some of them pictures, and some of them really magical fancy XYZ (e.g. .u3d files). Those chunks are compressed most of the time (e.g. with Flate; see http://www.verypdf.com/pdfinfoeditor/compression.htm). In order to 'grep' a PDF you have to reverse the compression, i.e. extract the text.
You can do that either per file, with tools such as pdftotext, and grep the result, or you can run an 'indexer' (look at xapian.org or Lucene) which builds a searchable index out of your PDF files; you can then use the search engine tools of that indexer to get at the content of the PDFs.
But no, you cannot grep PDF files and hope for reliable answers without extracting the text first.
answered Jan 31 '11 at 15:17 by akira
1
Considering pdfgrep exists (see above), a flat "no" is incorrect.
– Jonathan Cross
Aug 28 '18 at 10:18
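The compression point can be tried without any PDF at all. In this sketch gzip stands in for the Flate compression used inside PDFs (both are based on DEFLATE); it is an analogy only, not how a PDF is actually laid out, and the function names are made up.

```shell
# count_raw TEXT WORD
# Compress TEXT and count lines matching WORD in the compressed bytes.
# The literal word is normally no longer present there.
count_raw() {
    printf '%s\n' "$1" | gzip -c | grep -a -c -- "$2"
}

# count_inflated TEXT WORD
# Same, but decompress before searching, which is the "extract first" step.
count_inflated() {
    printf '%s\n' "$1" | gzip -c | gzip -dc | grep -c -- "$2"
}
```

Comparing count_raw 'needle needle needle' needle with count_inflated 'needle needle needle' needle shows the difference: only the second one reliably sees the word.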
pdfgrep was written for exactly this purpose and is available in Ubuntu.
It tries to be mostly compatible to grep
and thus provides "the power of grep", only specialized for PDFs. That includes common grep options, such as --recursive
, --ignore-case
or --color
.
In contrast to pdftotext | grep
, pdfgrep can output the page number of a match in a performant way and is generally faster when it doesn't have to search the whole document (e.g. --max-count
or --quiet
).
The basic usage is:
pdfgrep PATTERN FILE..
where PATTERN
is your search string and FILE
a list of filenames (or wildcards in a shell).
See the manpage for more infos.
answered Jun 19 '15 at 1:06
hpdeifel
Recoll can search PDFs. It doesn't support regular expressions, but it has lots of other search options, so it might fit your needs.
edited May 16 '13 at 22:18
Michael Mrozek♦
answered May 16 '13 at 20:52
user39336
You could pipe it through strings first:
cat file.pdf | strings | grep <...etc...>
answered Jan 31 '11 at 13:45
Andy Smith
Just use strings file.pdf | grep <...>, you don't need cat
– phunehehe
Jan 31 '11 at 14:31
Yeah - my mind seems to work better with streams... :-)
– Andy Smith
Jan 31 '11 at 14:57
Won't work if the text is compressed, which it is most of the time.
– akira
Jan 31 '11 at 15:18
Even if the text is uncompressed, it's generally small pieces of sentences (not even necessarily whole words!) finely intermixed with formatting information. Not very friendly for strings or grep.
– Jander
Jan 31 '11 at 16:08
Can you think of another reason why using strings for this wouldn't work? I found that using strings works on some PDFs but not others.
– hourback
Nov 24 '15 at 19:58
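One rough way to tell which kind of PDF you have is to look for the /FlateDecode filter name, which marks zlib-compressed streams (page content, but also images or fonts). A sketch; the function name is made up, and a count of 0 still does not guarantee that strings finds readable text, for the reasons Jander gives:

```shell
# count_flate_streams FILE
# Print how many lines of FILE mention the /FlateDecode filter.
# -a makes grep treat the binary PDF as text; -c prints the number of matching lines.
count_flate_streams() {
    grep -a -c '/FlateDecode' "$1"
}
```

A non-zero count suggests that strings will mostly see compressed bytes in that file rather than the text itself.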
Take a look at the common resource grep tool crgrep which supports searching within PDF files.
It also allows searching other resources like content nested in archives, database tables, image meta-data, POM file dependencies and web resources - and combinations of these including recursive search.
answered Oct 23 '13 at 12:30
Craig
Try this:
find /path -iname '*.pdf' -print0 | while IFS= read -r -d '' i; do echo "$i";
pdftotext "$i" - | grep pattern; done
It prints the lines inside each PDF where the pattern occurs.
edited Dec 23 '11 at 20:17
enzotib
answered Dec 23 '11 at 19:35
harish.venkat
cd to the folder containing your PDF file and then:
pdfgrep 'pattern' your.pdf
or, if you want to search in more than one PDF file (e.g. in all PDF files in your folder):
pdfgrep 'pattern' `ls *.pdf`
or
pdfgrep 'pattern' $(ls *.pdf)
answered Apr 19 '15 at 19:26
Rasmuss Rall
Why on earth do you use ls to put filenames in parameters? It's not only slower but also a bad idea to use ls output as the input to other commands. Just pdfgrep 'pattern' *.pdf is enough.
– phuclv
7 mins ago
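To see why, here is a small self-contained demonstration with a file name that contains a space. The helper count_args is made up for illustration and simply stands in for pdfgrep:

```shell
# count_args prints how many arguments it received.
count_args() {
    echo "$#"
}

# Scratch directory holding a single PDF whose name contains a space.
demo_dir=$(mktemp -d)
touch "$demo_dir/my report.pdf"

# Word splitting breaks the ls output into "my" and "report.pdf": two arguments.
ls_count=$(cd "$demo_dir" && count_args $(ls *.pdf))
# The glob hands the file name over intact: one argument.
glob_count=$(cd "$demo_dir" && count_args *.pdf)

echo "via ls: $ls_count, via glob: $glob_count"
```

With the ls form, pdfgrep would be asked to open two files that don't exist; with the plain glob it gets the real file name.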
There is a duplicate question on StackOverflow. The people there suggest a variation of harish.venkat's answer:
find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;
The advantage over the similar answer here is the --with-filename flag for grep. This is somewhat superior to pdfgrep as well, because the standard grep has more features.
https://stackoverflow.com/questions/4643438/how-to-search-contents-of-multiple-pdf-files
answered May 9 '14 at 10:00
user7610
I think it would have been better to leave this as a comment (or edit) in the similar answer you are referring to.
– Bernhard
May 9 '14 at 12:07
gpdf might be what you need if you're using Gnome! Check this in case you're not using Gnome. It's got a list of CLI pdf viewers. Then you can use grep to find some pattern.
Hope that helps.
answered Jan 31 '11 at 14:03
Dharmit
Here is a quick script to search the PDFs in the current directory:
#!/bin/bash
if [ $# -ne 1 ]; then
    echo "usage: $0 VALUE" 1>&2
    exit 1
fi
echo 'SEARCH IS CASE SENSITIVE' 1>&2
# Inside bash -c, $0 is the search pattern and $1 is the file found by find
find . -name '*.pdf' -exec /bin/bash -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "$0"' "$1" {} \;
answered Jun 1 '16 at 19:01
Nico
I assume you mean not converting it on disk. You can convert the PDFs to stdout with pdftotext and then grep the output. Grepping a PDF without any sort of conversion is not a practical approach, since PDF is mostly a binary format.
In the directory:
ls -1 ./*.pdf | xargs -L1 -I {} pdftotext {} - | grep "keyword"
or in the directory and its subdirectories:
tree -fai . | grep -P "\.pdf$" | xargs -L1 -I {} pdftotext {} - | grep "keyword"
Also, because some PDFs are scans, they need to be OCRed first. I wrote a pretty simple way to find all PDFs that cannot be grepped and OCR them.
I noticed that if a PDF file doesn't have any fonts, it is usually not searchable. Knowing this, we can use pdffonts.
The first 2 lines of the pdffonts output are the table header, so a searchable file produces more than two lines of output. Knowing this, we can create:
gedit check_pdf_searchable.sh
then paste this:
#!/bin/bash
#set -vx
if ((`pdffonts "$1" | wc -l` < 3 )); then
    echo "$1"
    pypdfocr "$1"
fi
then make it executable:
chmod +x check_pdf_searchable.sh
then list (and OCR) all non-searchable PDFs in the directory:
ls -1 ./*.pdf | xargs -L1 -I {} ./check_pdf_searchable.sh {}
or in the directory and its subdirectories:
tree -fai . | grep -P "\.pdf$" | xargs -L1 -I {} ./check_pdf_searchable.sh {}
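If you only want to see which files would be sent to OCR, the same heuristic can be wrapped in a function and run over a glob instead of ls | xargs. A sketch under the same assumption as above (a pdffonts listing with nothing below its two header lines means the file has no fonts); the function names are made up:

```shell
# has_fonts FILE
# Succeed if pdffonts lists at least one font below its 2-line table header.
has_fonts() {
    n=$(pdffonts "$1" 2>/dev/null | wc -l)
    # $((n)) strips the padding some wc implementations add around the number
    [ "$((n))" -ge 3 ]
}

# list_unsearchable FILE...
# Print the files that have no fonts, without running OCR on them.
list_unsearchable() {
    for f in "$@"; do
        has_fonts "$f" || printf '%s\n' "$f"
    done
}
```

Usage would be list_unsearchable ./*.pdf. Because the file names come from a glob and are always quoted, names with spaces are handled too.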
edited Feb 8 '18 at 9:33
answered Feb 8 '18 at 8:38
Eduard Florinescu
If you just want to search for PDF names/properties, or for simple strings that are not compressed or encoded, then instead of strings you can use the following:
grep -a STRING file.pdf
cat -v file.pdf | grep STRING
From grep --help:
--binary-files=TYPE assume that binary files are TYPE;
TYPE is 'binary', 'text', or 'without-match'
-a, --text equivalent to --binary-files=text
and cat --help:
-v, --show-nonprinting use ^ and M- notation, except for LFD and TAB
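As an example of the "properties" case: in PDFs that don't pack their objects into compressed object streams, the document information dictionary is plain text, so an entry such as the producer can often be pulled out directly. A sketch, assuming the value is a plain (...) string literal without nested parentheses; the function name is made up:

```shell
# pdf_producer FILE
# Print the raw /Producer entry if it is stored as an uncompressed literal string.
# -a treats the binary file as text; -o prints only the part of the line that matched.
pdf_producer() {
    grep -a -o '/Producer *([^)]*)' "$1"
}
```

If this prints nothing, the entry is either absent, encoded differently, or sitting inside a compressed stream, and you are back to extracting the text with a proper tool.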
answered 9 mins ago
phuclv