Downloading nested PDF files with wget
I am trying to download dozens of PDF files located on pages linked from here:
http://machineknittingetc.com/passap.html?limit=all
Each PDF is referred to by a URL ending with /downloadable/download/sample/sample_id/[some three digit number]/.
I have tried these:
wget -r -l2 -A.pdf http://machineknittingetc.com/passap.html?limit=all
wget -r -l2 -np http://machineknittingetc.com/passap.html?limit=all -A "*.pdf"
wget -r -l2 -np http://machineknittingetc.com/passap.html?limit=all -A "*.###"
It doesn't get the PDFs.
Does it have something to do with the server not being indexed to allow me to access the URLs like a file hierarchy? Is there a way to make it work?
linux wget
2 Answers
@rajaganesh87 you are guessing at the directory link numbers, and your code does not cover the actual links reachable from the base page http://machineknittingetc.com/passap.html?limit=all and the .pdf files they point to.
The problem is that you are being blocked by the robots.txt file, and that you are using a dot in -A .pdf.
Try the command below; I tested it and it works:
wget -np -nd -r -l2 -A pdf -e robots=off http://machineknittingetc.com/passap.html?limit=all
Does this work for you?
#!/bin/bash
# Try each sample_id from 000 to 175 in turn
for i in {000..175}
do
    wget http://machineknittingetc.com/downloadable/download/sample/sample_id/$i
done
Yes, thanks! But it gets a lot more than the links on that page. Apparently the downloadable subpath has a lot of files. I will look for a range for the files I want (hopefully they are not randomly numbered) and see if I can alter it.
– Kallaste
Jan 2 '17 at 9:21
I really should have thought of that.
– Kallaste
Jan 2 '17 at 9:23
No, it seems the files I want are not numbered predictably. Some are consecutive but then they start to jump around. Short of passing it a list of each file path, this will not work. Is there no way to do it with the wget filters?
– Kallaste
Jan 2 '17 at 9:37
@Kallaste In that case, get the HTML with wget, grep it for the document numbers, then download again from that list (see the sketch after this comment).
– rajaganesh87
Jan 4 '17 at 11:26
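A minimal sketch of that suggestion, assuming the sample_id links appear verbatim in the page's HTML (the exact URL pattern is inferred from the question, not verified against the site):
#!/bin/bash
# 1. Fetch the listing page.
wget -q -O page.html 'http://machineknittingetc.com/passap.html?limit=all'

# 2. Extract every .../downloadable/download/sample/sample_id/NNN/ link
#    (assumed pattern) and de-duplicate the list.
grep -oE 'http://machineknittingetc\.com/downloadable/download/sample/sample_id/[0-9]+/?' page.html \
    | sort -u > pdf-urls.txt

# 3. Download only those URLs. --content-disposition lets wget use the
#    server-suggested filename, useful since the URLs do not end in .pdf.
wget --content-disposition -i pdf-urls.txt
This restricts the download to the IDs actually linked from the page, instead of looping over a guessed numeric range.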