Downloading nested PDF files with wget
I am trying to download dozens of PDF files located on pages linked from here:
http://machineknittingetc.com/passap.html?limit=all
Each PDF is referred to by a URL ending with /downloadable/download/sample/sample_id/[some three digit number]/.
I have tried these:
wget -r -l2 -A.pdf http://machineknittingetc.com/passap.html?limit=all
wget -r -l2 -np http://machineknittingetc.com/passap.html?limit=all -A "*.pdf"
wget -r -l2 -np http://machineknittingetc.com/passap.html?limit=all -A "*.###"
It doesn't get the PDFs.
Does it have something to do with the server not being indexed to allow me to access the URLs like a file hierarchy? Is there a way to make it work?
linux wget
2 Answers
@rajaganesh87 you are guessing at the directory link numbers, and your code does not cover the actual links reachable from the base page http://machineknittingetc.com/passap.html?limit=all and the .pdf files they point to.
The problem is that you are being blocked by the robots.txt file, and that you are using a dot in -A .pdf.
Try the command below; I tested it and it works:
wget -np -nd -r -l2 -A pdf -e robots=off http://machineknittingetc.com/passap.html?limit=all
Does this work for you?
#!/bin/bash
# Try each sample_id from 000 to 175 in turn
for i in {000..175}
do
    wget http://machineknittingetc.com/downloadable/download/sample/sample_id/$i
done
Yes, thanks! But it gets a lot more than the links on that page. Apparently the downloadable subpath has a lot of files. I will look for a range for the files I want (hopefully they are not randomly numbered) and see if I can alter it.
– Kallaste
Jan 2 '17 at 9:21
I really should have thought of that.
– Kallaste
Jan 2 '17 at 9:23
No, it seems the files I want are not numbered predictably. Some are consecutive but then they start to jump around. Short of passing it a list of each file path, this will not work. Is there no way to do it with the wget filters?
– Kallaste
Jan 2 '17 at 9:37
@Kallaste In that case, get the HTML with wget, grep it for the document numbers, then download again from that list (see the sketch after this comment).
– rajaganesh87
Jan 4 '17 at 11:26
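A minimal sketch of that suggestion, assuming the sample_id links appear verbatim in the page's HTML (the exact URL pattern is inferred from the question, not verified against the site):
#!/bin/bash
# 1. Fetch the listing page.
wget -q -O page.html 'http://machineknittingetc.com/passap.html?limit=all'

# 2. Extract every .../downloadable/download/sample/sample_id/NNN/ link
#    (assumed pattern) and de-duplicate the list.
grep -oE 'http://machineknittingetc\.com/downloadable/download/sample/sample_id/[0-9]+/?' page.html \
    | sort -u > pdf-urls.txt

# 3. Download only those URLs. --content-disposition lets wget use the
#    server-suggested filename, useful since the URLs do not end in .pdf.
wget --content-disposition -i pdf-urls.txt
This restricts the download to the IDs actually linked from the page, instead of looping over a guessed numeric range.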