Downloading nested PDF files with wget





I am trying to download dozens of PDF files located on pages linked from here:



http://machineknittingetc.com/passap.html?limit=all



Each PDF is referred to by a URL ending with /downloadable/download/sample/sample_id/[some three-digit number]/.



I have tried these:



wget -r -l2 -A.pdf http://machineknittingetc.com/passap.html?limit=all
wget -r -l2 -np http://machineknittingetc.com/passap.html?limit=all -A "*.pdf"
wget -r -l2 -np http://machineknittingetc.com/passap.html?limit=all -A "*.###"


None of these gets the PDFs.



Does it have something to do with the server not exposing these URLs as a browsable file hierarchy? Is there a way to make this work?










linux wget






asked Jan 2 '17 at 7:47 by Kallaste, edited Jan 2 '17 at 8:05 by Kusalananda






















2 Answers






































@rajaganesh87, you are guessing at the directory link numbers, and your code does not work for the actual links reached from the base page http://machineknittingetc.com/passap.html?limit=all and the .pdf files that correspond to them.

The problem is that you are being blocked by the site's robots.txt file, and that you are using the dot (.) in

    -A .pdf

Try the command below; I tested it and it works:

    wget -np -nd -r -l2 -A pdf -e robots=off http://machineknittingetc.com/passap.html?limit=all
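
Since the download links reportedly end in /sample_id/[number]/ rather than in .pdf, a suffix filter may not match them at all; filtering on the whole URL is an alternative. Below is only a sketch, untested against this site, and it assumes GNU wget 1.14 or newer (for --accept-regex) and that every wanted link contains /downloadable/download/sample/sample_id/:

    # Recurse from the listing page, ignore robots.txt, flatten directories,
    # and only follow links whose URL looks like a sample download.
    wget -r -l2 -np -nd -e robots=off \
         --accept-regex 'downloadable/download/sample/sample_id/' \
         "http://machineknittingetc.com/passap.html?limit=all"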





answered May 26 '17 at 8:19 by Jason Swartz














Does this work for you?

    #!/bin/bash
    # Try each sample_id from 000 to 175 in turn.
    for i in {000..175}
    do
        wget "http://machineknittingetc.com/downloadable/download/sample/sample_id/$i"
    done





answered Jan 2 '17 at 8:51 by rajaganesh87
• Yes, thanks! But it gets a lot more than the links on that page. Apparently the downloadable subpath has a lot of files. I will look for a range for the files I want (hopefully they are not randomly numbered) and see if I can alter it. – Kallaste, Jan 2 '17 at 9:21

• I really should have thought of that. – Kallaste, Jan 2 '17 at 9:23

• No, it seems the files I want are not numbered predictably. Some are consecutive but then they start to jump around. Short of passing it a list of each file path, this will not work. Is there no way to do it with the wget filters? – Kallaste, Jan 2 '17 at 9:37

• @Kallaste In that case, get the HTML with wget, grep it for the document numbers, and download again from that list (see the sketch below). – rajaganesh87, Jan 4 '17 at 11:26
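
A minimal sketch of the two-step approach from the last comment: save the listing page, grep out the download links, then feed that list back to wget. The grep pattern assumes the page contains absolute links of the form quoted in the question; if they are relative, the pattern (and a prefixing step) would need adjusting. Untested against this site:

    #!/bin/bash
    # 1. Save the page that links to all of the PDFs.
    wget -O listing.html "http://machineknittingetc.com/passap.html?limit=all"

    # 2. Extract every download link of the form .../sample/sample_id/NNN/.
    grep -oE 'http://machineknittingetc\.com/downloadable/download/sample/sample_id/[0-9]+/?' \
        listing.html | sort -u > urls.txt

    # 3. Download each URL in the list.
    wget -i urls.txt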











