How can I grep in PDF files?












108















Is there a way to search PDF files using the power of grep, without first converting them to text, in Ubuntu?










  • 1





    I think you need to parse it through pdf2text in order to get some usable results back...

    – Johan
    Jan 31 '11 at 14:29






  • 1





    See also Is there some sort of PDF to text -converter? and Command line tool to search phrases in large number of pdf files.

    – Gilles
    Jan 31 '11 at 20:01






  • 1





    For people coming here via search: If you are willing to convert them to text files first, have a look at How to search contents of multiple pdf files?

    – Martin Thoma
    Jan 2 '16 at 22:09
















grep search pdf






edited Jun 22 '13 at 13:19 by Flow










asked Jan 31 '11 at 13:31 by Dervin Thunk


















14 Answers


















113














Install the package pdfgrep, then use the command:



find /path -iname '*.pdf' -exec pdfgrep pattern {} +
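For readers wondering which parts are literal (a sketch, not part of the original answer): /path and pattern are placeholders, while {} and + belong to find itself. The function name search_pdfs and its two parameters are made up here for illustration.

```shell
# Hypothetical helper: search_pdfs DIR PATTERN
search_pdfs() {
    dir=$1       # replaces the /path placeholder: where to start searching
    pattern=$2   # replaces the pattern placeholder: the text or regex to find
    # -iname '*.pdf' matches names ending in .pdf, ignoring case.
    # find substitutes the matched file names for {}; the trailing +
    # batches many names into as few pdfgrep invocations as possible.
    find "$dir" -iname '*.pdf' -exec pdfgrep "$pattern" {} +
}
```

It could then be called as, say, search_pdfs ~/Documents 'invoice'.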





  • 4





    This works on Mac OS X (Mavericks) as well. Install it using brew. Simple. Thanks.

    – mikiemorales
    Jan 23 '14 at 1:28








  • 6





    Out of curiosity I checked the source of pdfgrep and it uses poppler to extract strings from the pdf. Almost exactly as @wag's answer only pagewise rather than, presumably, the entire document.

    – Andrew Martin
    Sep 16 '14 at 11:11






  • 3





    pdfgrep also has a recursive flag. So this answer could perhaps be reduced to: pdfgrep -R pattern /path/. Though it might be less efficient if it goes through every file even if it isn't a PDF. And I notice that it has issues with international characters such as å, ä and ö.

    – Rovanion
    Jan 14 '16 at 12:11











  • Actually, the -n option is a pro for pdfgrep, as it lets you include the page number in the output (might be helpful for further processing).

    – JepZ
    Nov 10 '17 at 20:18






  • 3





    This answer would be easier to use if it explained which bits of the command are meant to be copied literally and which are placeholders. What's pattern? What's {}? What's up with the ` +`? I have no idea upon first reading... so off to the manpage I go, I suppose.

    – Mark Amery
    Apr 20 '18 at 14:44



















54














If you have poppler-utils installed (default on Ubuntu Desktop), you could "convert" it on the fly and pipe it to grep:



pdftotext my.pdf - | grep 'pattern'


This won't create a .txt file.
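To search several files this way, one option is a small loop (a sketch, not from the original answer). grep_pdfs is a name made up here; -H and --label are GNU grep options that prefix each match with a file name, which is needed because grep itself only sees standard input.

```shell
# Hypothetical helper: grep_pdfs PATTERN FILE...
grep_pdfs() {
    pattern=$1
    shift
    for f in "$@"; do
        # Convert to text on stdout (the "-"), never touching the disk,
        # and label each match with the PDF it came from.
        pdftotext "$f" - | grep -H --label="$f" -- "$pattern"
    done
}
```

For example: grep_pdfs 'pattern' *.pdf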






  • 1





    So... you extract the text before you grep it, which means the answer is "no".

    – akira
    Jan 31 '11 at 15:18






  • 18





    @akira The OP probably meant "without opening the PDF in a viewer and exporting to text"

    – Michael Mrozek
    Jan 31 '11 at 17:36






  • 5





    @akira Where do you see "grep only"?

    – Michael Mrozek
    Jan 31 '11 at 18:55






  • 6





    @akira Well, I already said what I think he probably meant; he doesn't want to export to text before processing it. I very much doubt he has a problem with any command that converts to text in any way; there's no reason not to

    – Michael Mrozek
    Feb 1 '11 at 5:52






  • 1





    @sherrellbc The second argument of pdftotext is the filename it should write to. However, by convention, tools typically allow you to write to stdout instead of to a file by specifying a - instead. Similarly, some tools would write to stdout by default if you omit such an argument entirely (but this is not always possible without creating ambiguity).

    – Joost
    Sep 23 '16 at 14:06





















8














No.



A PDF consists of chunks of data: some of them text, some of them pictures, and some of them really magical fancy XYZ (e.g. .u3d files). Those chunks are compressed most of the time (e.g. with Flate; see http://www.verypdf.com/pdfinfoeditor/compression.htm). In order to 'grep' a .pdf you have to reverse the compression, i.e. extract the text.

You can do that either per file, with tools such as pdftotext, and grep the result, or you can run an 'indexer' (look at xapian.org or lucene), which builds a searchable index out of your .pdf files; you can then use the search engine tools of that indexer to get at the content of the PDFs.

But no, you cannot grep PDF files and hope for reliable answers without extracting the text first.
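The effect of compression on grep can be seen without any PDF at all. The sketch below is an illustration only: gzip stands in for the Flate compression used inside PDFs, and compare_grep is a made-up name. It counts matching lines in the same text as-is, compressed, and decompressed again.

```shell
# Illustration only: gzip stands in for the Flate compression inside a PDF.
# Prints how many lines match PATTERN in TEXT as-is, compressed, and decompressed.
compare_grep() {
    text=$1
    pattern=$2
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    plain=$(printf '%s\n' "$text" | grep -a -c -- "$pattern" || true)
    packed=$(printf '%s\n' "$text" | gzip -c | grep -a -c -- "$pattern" || true)
    restored=$(printf '%s\n' "$text" | gzip -c | gzip -dc | grep -a -c -- "$pattern" || true)
    echo "plain=$plain compressed=$packed decompressed=$restored"
}
```

Called as compare_grep 'needle in a haystack, needle in a haystack' needle, the compressed count should come out as 0 while the other two are 1, which is why the text has to be extracted before grep can find anything.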






  • 1





    Considering pdfgrep exists (see above), a flat "no" is incorrect.

    – Jonathan Cross
    Aug 28 '18 at 10:18



















8














pdfgrep was written for exactly this purpose and is available in Ubuntu.



It tries to be mostly compatible with grep and thus provides "the power of grep", only specialized for PDFs. That includes common grep options, such as --recursive, --ignore-case or --color.

In contrast to pdftotext | grep, pdfgrep can output the page number of a match efficiently and is generally faster when it doesn't have to search the whole document (e.g. with --max-count or --quiet).

The basic usage is:

pdfgrep PATTERN FILE...

where PATTERN is your search string and FILE is a list of filenames (or wildcards in a shell).

See the man page for more information.
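As an illustration of the early-exit options mentioned above (a sketch; list_matching_pdfs is a made-up name), --quiet can be used to list only the files that contain a match, without printing the matching lines:

```shell
# Hypothetical helper: list_matching_pdfs PATTERN FILE...
list_matching_pdfs() {
    pattern=$1
    shift
    for f in "$@"; do
        # --quiet prints nothing; the exit status alone says whether
        # the pattern was found. --ignore-case works as in grep.
        if pdfgrep --quiet --ignore-case "$pattern" "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example: list_matching_pdfs 'invoice' *.pdf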






    6














    Recoll can search PDFs. It doesn't support regular expressions, but it has lots of other search options, so it might fit your needs.






      4














      You could pipe it through strings first:



      cat file.pdf | strings | grep <...etc...>





      • 8





        Just use strings file.pdf | grep <...>, you don't need cat

        – phunehehe
        Jan 31 '11 at 14:31











      • Yeah - my mind seems to work better with streams... :-)

        – Andy Smith
        Jan 31 '11 at 14:57






      • 12





        Won't work if the text is compressed, which it is most of the time.

        – akira
        Jan 31 '11 at 15:18








      • 6





        Even if the text is uncompressed, it's generally small pieces of sentences (not even necessarily whole words!) finely intermixed with formatting information. Not very friendly for strings or grep.

        – Jander
        Jan 31 '11 at 16:08











      • Can you think of another reason why using strings for this wouldn't work? I found that using strings works on some PDFs but not others.

        – hourback
        Nov 24 '15 at 19:58



















      3














      Take a look at the common resource grep tool crgrep which supports searching within PDF files.



      It also allows searching other resources like content nested in archives, database tables, image meta-data, POM file dependencies and web resources - and combinations of these including recursive search.






        2














        Try this (in bash):

        find /path -iname '*.pdf' -print0 | while IFS= read -r -d '' i; do echo "$i";
        pdftotext "$i" - | grep pattern; done

        to print the lines in which the pattern occurs inside each PDF.






          2














          cd to the folder containing your PDF file and then:



          pdfgrep 'pattern' your.pdf


          or if you want to search in more than just one pdf-file (e.g. in all pdf-files in your folder)



          pdfgrep 'pattern'  `ls *.pdf`


          or



          pdfgrep 'pattern' $(ls *.pdf)





          • Why on earth do you use ls to put filenames in parameters? It's not only slower but also a bad idea to use ls output as the input to other commands. Just pdfgrep 'pattern' *.pdf is enough.

            – phuclv
            7 mins ago



















          1














          There is a duplicate question on StackOverflow. The people there suggest a variation of harish.venkarts answer:



          find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;


          The advantage over the similar answer here is the --with-filename flag for grep. This is somewhat superior to pdfgrep as well, because the standard grep has more features.



          https://stackoverflow.com/questions/4643438/how-to-search-contents-of-multiple-pdf-files






          • I think it would have been better to leave this as a comment (or edit) in the similar answer you are referring to.

            – Bernhard
            May 9 '14 at 12:07



















          0














          gpdf might be what you need if you're using Gnome! Check this in case you're not using Gnome. It's got a list of CLI pdf viewers. Then you can use grep to find some pattern.



          Hope that helps.






            0














            Here is a quick script to search PDFs in the current directory:



            #!/bin/bash

            if [ $# -ne 1 ]; then
                echo "usage: $0 VALUE" 1>&2
                exit 1
            fi

            echo 'SEARCH IS CASE SENSITIVE' 1>&2

            # The search value is passed to the inner shell as $0 and each file name as $1
            find . -name '*.pdf' -exec /bin/bash -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color -- "$0"' "$1" {} \;





              0














              I assume you mean not converting it on disk. You can convert the files to stdout with pdftotext and then grep the output. Grepping the PDF without any sort of conversion is not a practical approach, since PDF is mostly a binary format.



              In the directory:



              ls -1 ./*.pdf | xargs -L1 -I {} pdftotext {}  - | grep "keyword"


              or in the directory and its subdirectories:



              tree -fai . | grep -P "\.pdf$" | xargs -L1 -I {} pdftotext {}  - | grep "keyword"


              Also, because some PDFs are scans, they need to be OCRed first. I wrote a pretty simple way to find all PDFs that cannot be grepped and OCR them.

              I noticed that if a PDF file doesn't have any fonts, it is usually not searchable. Knowing this, we can use pdffonts.

              The first 2 lines of the pdffonts output are the table header, so a searchable file produces more than two lines of output. Knowing this, we can create:



              gedit check_pdf_searchable.sh


              then paste this



              #!/bin/bash
              #set -vx
              # pdffonts prints a two-line header, so fewer than 3 lines means no fonts were found
              if (( $(pdffonts "$1" | wc -l) < 3 )); then
                  echo "$1"
                  pypdfocr "$1"
              fi


              then make it executable



              chmod +x check_pdf_searchable.sh


              then list (and OCR) all non-searchable PDFs in the directory:



              ls -1 ./*.pdf | xargs -L1 -I {} ./check_pdf_searchable.sh {}


              or in the directory and its subdirectories:



              tree -fai . | grep -P "\.pdf$" | xargs -L1 -I {} ./check_pdf_searchable.sh {}
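The tree | grep | xargs pipelines above can trip over unusual file names. A possible alternative (a sketch; check_all_pdfs is a made-up name, and it assumes check_pdf_searchable.sh from above sits in the current directory) lets find pass each file directly to the script:

```shell
# Hypothetical helper: check_all_pdfs [DIR]
# Runs ./check_pdf_searchable.sh once per PDF under DIR (default: current directory).
check_all_pdfs() {
    find "${1:-.}" -type f -iname '*.pdf' -exec ./check_pdf_searchable.sh {} \;
}
```

Because find hands over each name as a single argument, spaces in file or directory names are not a problem here.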





                0














                If you just want to search for PDF names/properties, or for simple strings that are not compressed or encoded, then instead of strings you can use the following:



                grep -a STRING file.pdf
                cat -v file.pdf | grep STRING


                From grep --help:



                      --binary-files=TYPE   assume that binary files are TYPE;
                                            TYPE is 'binary', 'text', or 'without-match'
                  -a, --text                equivalent to --binary-files=text


                and cat --help:



                  -v, --show-nonprinting   use ^ and M- notation, except for LFD and TAB
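For example, in some PDFs the document title is stored uncompressed as a /Title (...) entry, which grep -a can pull out directly. This is a sketch; raw_pdf_title is a made-up name, and files that compress or encode their metadata will simply produce no output.

```shell
# Hypothetical helper: raw_pdf_title FILE
raw_pdf_title() {
    # -a treats the binary file as text; -o prints only the matching part.
    # LC_ALL=C makes grep work on raw bytes rather than locale characters.
    LC_ALL=C grep -a -o '/Title ([^)]*)' "$1"
}
```

For example: raw_pdf_title file.pdf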




answered Dec 23 '11 at 18:40 by enzotib (the pdfgrep answer, 113 votes)








                  • 4





                    This works in mac osx (Mavericks) as well. Install it using brew. Simple. Thanks.

                    – mikiemorales
                    Jan 23 '14 at 1:28








                  • 6





                    Out of curiosity I checked the source of pdfgrep and it uses poppler to extract strings from the pdf. Almost exactly as @wag's answer only pagewise rather than, presumably, the entire document.

                    – Andrew Martin
                    Sep 16 '14 at 11:11






                  • 3





                    pdfgrep also has a recursive flag. So this answer could perhaps be reduced to: pdfgrep -R pattern /path/. Though it might be less effective if it goes through every file even if it isn't a PDF. And I notice that it has issues with international characters such as å, ä and ö.

                    – Rovanion
                    Jan 14 '16 at 12:11











                  • Actually, the -n option is a pro for pdfgrep as it allows to include the page number in the output (might be helpful for further processing).

                    – JepZ
                    Nov 10 '17 at 20:18






                  • 3





                    This answer would be easier to use if it explained which bits of the command are meant to copied literally and which are placeholders. What's pattern? What's {}? What's up with the ` +`? I have no idea upon first reading... so off to the manpage I go, I suppose.

                    – Mark Amery
                    Apr 20 '18 at 14:44














                  • 4





                    This works in mac osx (Mavericks) as well. Install it using brew. Simple. Thanks.

                    – mikiemorales
                    Jan 23 '14 at 1:28








                  • 6





                    Out of curiosity I checked the source of pdfgrep and it uses poppler to extract strings from the pdf. Almost exactly as @wag's answer only pagewise rather than, presumably, the entire document.

                    – Andrew Martin
                    Sep 16 '14 at 11:11






                  • 3





                    pdfgrep also has a recursive flag. So this answer could perhaps be reduced to: pdfgrep -R pattern /path/. Though it might be less effective if it goes through every file even if it isn't a PDF. And I notice that it has issues with international characters such as å, ä and ö.

                    – Rovanion
                    Jan 14 '16 at 12:11











                  • Actually, the -n option is a pro for pdfgrep as it allows to include the page number in the output (might be helpful for further processing).

                    – JepZ
                    Nov 10 '17 at 20:18






                  • 3





                    This answer would be easier to use if it explained which bits of the command are meant to copied literally and which are placeholders. What's pattern? What's {}? What's up with the ` +`? I have no idea upon first reading... so off to the manpage I go, I suppose.

                    – Mark Amery
                    Apr 20 '18 at 14:44








                  4




                  4





                  This works in mac osx (Mavericks) as well. Install it using brew. Simple. Thanks.

                  – mikiemorales
                  Jan 23 '14 at 1:28







                  This works in mac osx (Mavericks) as well. Install it using brew. Simple. Thanks.

                  – mikiemorales
                  Jan 23 '14 at 1:28






                  6




                  6





                  Out of curiosity I checked the source of pdfgrep and it uses poppler to extract strings from the pdf. Almost exactly as @wag's answer only pagewise rather than, presumably, the entire document.

                  – Andrew Martin
                  Sep 16 '14 at 11:11





                  Out of curiosity I checked the source of pdfgrep and it uses poppler to extract strings from the pdf. Almost exactly as @wag's answer only pagewise rather than, presumably, the entire document.

                  – Andrew Martin
                  Sep 16 '14 at 11:11




                  3




                  3





                  pdfgrep also has a recursive flag. So this answer could perhaps be reduced to: pdfgrep -R pattern /path/. Though it might be less effective if it goes through every file even if it isn't a PDF. And I notice that it has issues with international characters such as å, ä and ö.

                  – Rovanion
                  Jan 14 '16 at 12:11





                  pdfgrep also has a recursive flag. So this answer could perhaps be reduced to: pdfgrep -R pattern /path/. Though it might be less effective if it goes through every file even if it isn't a PDF. And I notice that it has issues with international characters such as å, ä and ö.

                  – Rovanion
                  Jan 14 '16 at 12:11













                  Actually, the -n option is a pro for pdfgrep as it allows to include the page number in the output (might be helpful for further processing).

                  – JepZ
                  Nov 10 '17 at 20:18





                  Actually, the -n option is a pro for pdfgrep as it allows to include the page number in the output (might be helpful for further processing).

                  – JepZ
                  Nov 10 '17 at 20:18




                  3




                  3





                  This answer would be easier to use if it explained which bits of the command are meant to copied literally and which are placeholders. What's pattern? What's {}? What's up with the ` +`? I have no idea upon first reading... so off to the manpage I go, I suppose.

                  – Mark Amery
                  Apr 20 '18 at 14:44





                  This answer would be easier to use if it explained which bits of the command are meant to copied literally and which are placeholders. What's pattern? What's {}? What's up with the ` +`? I have no idea upon first reading... so off to the manpage I go, I suppose.

                  – Mark Amery
                  Apr 20 '18 at 14:44













                  54














                  If you have poppler-utils installed (default on Ubuntu Desktop), you could "convert" it on the fly and pipe it to grep:



                  pdftotext my.pdf - | grep 'pattern'


                  This won't create a .txt file.






                  share|improve this answer



















                  • 1





                    so .. you extract the text before you grep it which means the answer is "no".

                    – akira
                    Jan 31 '11 at 15:18






                  • 18





                    @akira The OP probably meant "without opening the PDF in a viewer and exporting to text"

                    – Michael Mrozek
                    Jan 31 '11 at 17:36






                  • 5





                    @akira Where do you see "grep only"?

                    – Michael Mrozek
                    Jan 31 '11 at 18:55






                  • 6





                  54







                  If you have poppler-utils installed (default on Ubuntu Desktop), you could "convert" it on the fly and pipe it to grep:



                  pdftotext my.pdf - | grep 'pattern'


                  This won't create a .txt file.
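To search several PDFs this way and still see which file each hit came from, the same pipe can be wrapped in a small shell function. This is only a sketch: the name grep_pdfs is made up for illustration, and it assumes pdftotext (from poppler-utils) is on your PATH.

```sh
# grep_pdfs PATTERN FILE...
# Hypothetical helper (not part of poppler-utils): runs each PDF through
# pdftotext and prints matching lines as "file:line-number:text".
# The line number refers to the extracted text, not to a PDF page.
grep_pdfs() {
    pattern=$1
    shift
    for f in "$@"; do
        pdftotext "$f" - 2>/dev/null |
            grep -n -- "$pattern" |
            while IFS= read -r line; do
                printf '%s:%s\n' "$f" "$line"
            done
    done
}

# Example call:
#   grep_pdfs 'pattern' my.pdf other.pdf
```

Still no .txt file is written; the extracted text only ever lives in the pipe.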







                  answered Jan 31 '11 at 13:45









                  wag

                  • 1





                    So you extract the text before you grep it, which means the answer is "no".

                    – akira
                    Jan 31 '11 at 15:18






                  • 18





                    @akira The OP probably meant "without opening the PDF in a viewer and exporting to text"

                    – Michael Mrozek
                    Jan 31 '11 at 17:36






                  • 5





                    @akira Where do you see "grep only"?

                    – Michael Mrozek
                    Jan 31 '11 at 18:55






                  • 6





                    @akira Well, I already said what I think he probably meant; he doesn't want to export to text before processing it. I very much doubt he has a problem with any command that converts to text in any way; there's no reason not to

                    – Michael Mrozek
                    Feb 1 '11 at 5:52






                  • 1





                    @sherrellbc The second argument of pdftotext is the filename it should write to. However, by convention, tools typically allow you to write to stdout instead of to a file by specifying a - instead. Similarly, some tools would write to stdout by default if you omit such an argument entirely (but this is not always possible without creating ambiguity).

                    – Joost
                    Sep 23 '16 at 14:06
















                  8







                  No.

                  A PDF consists of chunks of data: some are text, some are pictures, and some are more exotic objects (e.g. embedded .u3d files). Those chunks are compressed most of the time (e.g. with Flate; see http://www.verypdf.com/pdfinfoeditor/compression.htm). To 'grep' a .pdf, you first have to reverse the compression, i.e. extract the text.

                  You can do that either per file, with a tool such as pdftotext, and grep the result, or you can run an indexer (look at xapian.org or Lucene) that builds a searchable index out of your .pdf files; you then use that indexer's search tools to query the content of the PDFs.

                  But no, you cannot grep PDF files and hope for reliable results without extracting the text first.
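To get a feel for how much of a given PDF is hidden from a plain grep, you can count how often the raw file declares the /FlateDecode filter (the PDF name for Flate compression). The helper below is a rough sketch under assumptions: count_flate_lines is a made-up name, and it relies on a grep that supports -a (GNU and BSD grep do).

```sh
# count_flate_lines FILE
# Hypothetical helper: prints how many lines of the raw file mention the
# /FlateDecode filter (0 if none). A rough heuristic only: filters can
# also be declared inside compressed object streams, where this misses them.
count_flate_lines() {
    # -a treats the binary file as text, -c prints a count of matching lines.
    n=$(grep -a -c '/FlateDecode' -- "$1") || true
    printf '%s\n' "${n:-0}"
}

# Example call:
#   count_flate_lines my.pdf
```

A non-zero count means at least some streams are compressed, so grepping the raw bytes for your search term is unlikely to find text stored in them.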







                  answered Jan 31 '11 at 15:17









                  akira

                  • 1





                    Considering pdfgrep exists (see above), a flat "no" is incorrect.

                    – Jonathan Cross
                    Aug 28 '18 at 10:18














                      8







                      pdfgrep was written for exactly this purpose and is available in Ubuntu.

                      It tries to be mostly compatible with grep and thus provides "the power of grep", only specialized for PDFs. That includes common grep options such as --recursive, --ignore-case or --color.

                      In contrast to pdftotext | grep, pdfgrep can output the page number of a match efficiently and is generally faster when it doesn't have to search the whole document (e.g. with --max-count or --quiet).

                      The basic usage is:

                      pdfgrep PATTERN FILE...

                      where PATTERN is your search string and FILE is a list of filenames (or wildcards expanded by the shell).

                      See the man page for more information.
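As an illustration of how those options combine, here are two small wrappers. They are a sketch, not part of pdfgrep: the function names search_pdfs and pdf_mentions are invented here, and only the long options named above plus --page-number are used.

```sh
# search_pdfs PATTERN [DIR]
# Hypothetical wrapper: case-insensitive, recursive search below DIR
# (default: current directory), printing the page number of each match.
search_pdfs() {
    pattern=$1
    dir=${2:-.}
    pdfgrep --recursive --ignore-case --page-number "$pattern" "$dir"
}

# pdf_mentions PATTERN FILE
# Hypothetical wrapper: exits with status 0 if FILE contains PATTERN
# and prints nothing, so it can be used directly in an "if".
pdf_mentions() {
    pdfgrep --quiet "$1" "$2"
}

# Example calls:
#   search_pdfs 'invoice' ~/Documents
#   if pdf_mentions 'invoice' scan.pdf; then echo "found"; fi
```

Patterns that begin with a dash would be read as options by these wrappers, so treat them as a starting point only.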







                      answered Jun 19 '15 at 1:06









                      hpdeifel























                              6







                              Recoll can search PDFs. It doesn't support regular expressions, but it has lots of other search options, so it might fit your needs.







                              edited May 16 '13 at 22:18









                              Michael Mrozek











                              answered May 16 '13 at 20:52









                              user39336























                                  4







                                  You could pipe it through strings first:



                                  cat file.pdf | strings | grep <...etc...>






                                  answered Jan 31 '11 at 13:45









                                  Andy Smith








                                  • 8





                                    Just use strings file.pdf | grep <...>, you don't need cat

                                    – phunehehe
                                    Jan 31 '11 at 14:31











                                  • Yeah - my mind seems to work better with streams... :-)

                                    – Andy Smith
                                    Jan 31 '11 at 14:57






                                  • 12





                                    Won't work if the text is compressed, which it is most of the time.

                                    – akira
                                    Jan 31 '11 at 15:18








                                  • 6





                                    Even if the text is uncompressed, it's generally small pieces of sentences (not even necessarily whole words!) finely intermixed with formatting information. Not very friendly for strings or grep.

                                    – Jander
                                    Jan 31 '11 at 16:08











                                  • Can you think of another reason why using strings for this wouldn't work? I found that using strings works on some PDFs but not others.

                                    – hourback
                                    Nov 24 '15 at 19:58














                                      3







                                      Take a look at the common resource grep tool crgrep which supports searching within PDF files.



                                      It also allows searching other resources like content nested in archives, database tables, image meta-data, POM file dependencies and web resources - and combinations of these including recursive search.







                                      answered Oct 23 '13 at 12:30









                                      Craig























                                              2







                                              Try this (bash):

                                              find /path -iname '*.pdf' -print0 | while IFS= read -r -d '' i; do echo "$i";
                                              pdftotext "$i" - | grep pattern; done

                                              This prints each file name, followed by the lines of that PDF in which the pattern occurs.







                                              share|improve this answer














                                              share|improve this answer



                                              share|improve this answer








edited Dec 23 '11 at 20:17 by enzotib
answered Dec 23 '11 at 19:35 by harish.venkat























                                                  2














cd to the folder containing your PDF file and then run:

pdfgrep 'pattern' your.pdf

Or, if you want to search more than one PDF file (e.g. all PDF files in your folder):

pdfgrep 'pattern' `ls *.pdf`

or

pdfgrep 'pattern' $(ls *.pdf)
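For reference, the same multi-file search can be written with a plain shell glob, so that no ls output is parsed. This is only a sketch: pdfgrep_here is a hypothetical helper name, and it assumes a pdfgrep that supports -H (print the file name) and -n (print the page number).

```shell
# pdfgrep_here PATTERN  (hypothetical helper name)
# Searches every *.pdf in the current directory, passing the names via a glob.
pdfgrep_here() {
    # Keep the pattern as $1 and append the expanded glob as $2, $3, ...
    set -- "$1" ./*.pdf
    # Without any match the glob stays literal, so check that the first entry exists
    [ -e "$2" ] || { echo "no PDF files in this directory" >&2; return 1; }
    # -H prefixes each match with the file name, -n with the page number
    pdfgrep -H -n "$@"
}
```

Because the shell expands the glob itself, file names containing spaces reach pdfgrep as single arguments.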
























• Why on earth do you use ls to put filenames in parameters? It's not only slower, it's also a bad idea to use ls output as the input to other commands. Just pdfgrep 'pattern' *.pdf is enough.

                                                    – phuclv
                                                    7 mins ago


























answered Apr 19 '15 at 19:26 by Rasmuss Rall
























                                                  1














There is a duplicate question on Stack Overflow. The people there suggest a variation of harish.venkat's answer:

find /path -name '*.pdf' -exec sh -c 'pdftotext "{}" - | grep --with-filename --label="{}" --color "your pattern"' \;

The advantage over the similar answer here is that --with-filename, together with --label, makes grep prefix every match with the name of the PDF it came from. This is also somewhat superior to pdfgrep, because standard grep has more features.

https://stackoverflow.com/questions/4643438/how-to-search-contents-of-multiple-pdf-files
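A possible variation, not taken from the linked thread: passing the file names to the inner shell as arguments, instead of pasting {} into the command string, avoids breakage on names that contain quotes or a $ sign. The sketch below assumes GNU grep and pdftotext; grep_pdf_tree is a hypothetical name.

```shell
# grep_pdf_tree PATTERN [DIR]  (hypothetical helper name)
# Prints every matching line, prefixed with the PDF it came from.
grep_pdf_tree() {
    find "${2:-.}" -type f -name '*.pdf' -exec sh -c '
        # $1 is the pattern, the remaining arguments are PDF paths
        pattern=$1; shift
        for pdf in "$@"; do
            # --label supplies the name grep prints for its standard input
            pdftotext "$pdf" - |
                grep --with-filename --label="$pdf" --color=auto -e "$pattern"
        done
    ' sh "$1" {} +
}
```

Using {} + instead of {} \; also starts one inner shell per batch of files rather than one per file.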
























                                                  • I think it would have been better to leave this as a comment (or edit) in the similar answer you are referring to.

                                                    – Bernhard
                                                    May 9 '14 at 12:07


























answered May 9 '14 at 10:00 by user7610
























                                                  0














gpdf might be what you need if you're using GNOME! Check this in case you're not using GNOME; it's got a list of CLI PDF viewers. Then you can use grep to find some pattern.

Hope that helps.






































answered Jan 31 '11 at 14:03 by Dharmit























                                                          0














Here is a quick script to search the PDF files under the current directory:

#!/bin/bash

# Require exactly one argument: the pattern to search for
if [ $# -ne 1 ]; then
    echo "usage: $0 VALUE" 1>&2
    exit 1
fi

echo 'SEARCH IS CASE SENSITIVE' 1>&2

# Inside the inner shell, $0 is the pattern and $1 is the PDF found by find
find . -name '*.pdf' -exec /bin/bash -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "$0"' "$1" {} \;






































answered Jun 1 '16 at 19:01 by Nico























                                                                  0














I assume you mean not converting the files on disk. You can have pdftotext write the text to stdout and grep that. Grepping a PDF without any sort of conversion is not practical, since PDF is mostly a binary format.

In the directory:

ls -1 ./*.pdf | xargs -L1 -I {} pdftotext {}  - | grep "keyword"

or in the directory and its subdirectories:

tree -fai . | grep -P "\.pdf$" | xargs -L1 -I {} pdftotext {}  - | grep "keyword"

Also, because some PDFs are scans, they need to be OCRed first. I wrote a pretty simple way to find all PDFs that cannot be grepped and OCR them.

I noticed that if a PDF file doesn't have any fonts, it is usually not searchable. Knowing this, we can use pdffonts.

The first two lines of the pdffonts output are the table header, so a searchable file produces more than two lines of output. Knowing this, we can create:

gedit check_pdf_searchable.sh

then paste this:

#!/bin/bash
#set -vx
# Fewer than 3 lines means pdffonts printed only its header: no fonts, so OCR the file
if (( $(pdffonts "$1" | wc -l) < 3 )); then
    echo "$1"
    pypdfocr "$1"
fi

then make it executable:

chmod +x check_pdf_searchable.sh

then list and OCR all non-searchable PDFs in the directory:

ls -1 ./*.pdf | xargs -L1 -I {} ./check_pdf_searchable.sh {}

or in the directory and its subdirectories:

tree -fai . | grep -P "\.pdf$" | xargs -L1 -I {} ./check_pdf_searchable.sh {}
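The font-count heuristic can also be wrapped in a function that only reports the files, without running OCR. A rough sketch under the same assumption as above (pdffonts prints two header lines, then one line per font); has_text_layer and list_scanned_pdfs are invented names:

```shell
# has_text_layer FILE  (hypothetical helper name)
# Succeeds when pdffonts prints more than its two header lines,
# i.e. when the PDF uses at least one font.
has_text_layer() {
    [ "$(pdffonts "$1" 2>/dev/null | wc -l)" -gt 2 ]
}

# list_scanned_pdfs [DIR]  (hypothetical helper name)
# Prints the PDFs in DIR that have no fonts and therefore probably need OCR.
list_scanned_pdfs() {
    for f in "${1:-.}"/*.pdf; do
        [ -e "$f" ] || continue          # the glob matched nothing
        has_text_layer "$f" || printf '%s\n' "$f"
    done
}
```

The resulting list can be reviewed first and then fed to an OCR tool of your choice.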






































edited Feb 8 '18 at 9:33
answered Feb 8 '18 at 8:38 by Eduard Florinescu























If you just want to search for PDF names or properties, or for simple strings that are not compressed or encoded, then instead of the strings utility you can use either of the following:



                                                                          grep -a STRING file.pdf
                                                                          cat -v file.pdf | grep STRING


                                                                          From grep --help:



      --binary-files=TYPE   assume that binary files are TYPE;
                            TYPE is 'binary', 'text', or 'without-match'
  -a, --text                equivalent to --binary-files=text


                                                                          and cat --help:



                                                                            -v, --show-nonprinting   use ^ and M- notation, except for LFD and TAB
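
As a rough illustration of what grep -a can and cannot find, the sketch below builds a small binary file that stands in for a PDF (it is not a valid PDF) with one uncompressed metadata string in it. The function name grep_raw and the sample title are assumptions made for this example. Text inside compressed streams would not be matched this way.

```shell
#!/bin/bash
# Hypothetical wrapper: treat the file as text (-a) and print only the
# matched part (-o). LC_ALL=C avoids locale trouble with stray bytes.
grep_raw() {
    LC_ALL=C grep -a -o -- "$1" "$2"
}

# Stand-in for a PDF: header line, some binary bytes, and an uncompressed
# /Title entry like those found in a document information dictionary.
tmp=$(mktemp)
printf '%%PDF-1.4\n\000\001\002/Title (Quarterly Report)\n\000\377\n' > "$tmp"

match=$(grep_raw '/Title ([^)]*)' "$tmp")
echo "$match"

rm -f "$tmp"
```

The pattern is a basic regular expression, so the parentheses are matched literally.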




                                                                              answered 9 mins ago









phuclv





























