I am looking for a file leecher to save time downloading PDF files. Does anyone know of a good program I can use to do this?
I want to be able to get ALL PDFs from a site without having to enter each URL for each PDF, just the main site’s address, and have it go inside searching for PDFs to download.
Have you tried using wget? It’s an immensely powerful web grabber. See:
http://linuxreviews.org/quicktips/wget/
and
man wget
Or you could take a look at HTTrack; it’s in the repos as webhttrack.
I’ve just been looking at wget, but for what I want to do I would need a HUGE script, as I want to download ALL PDFs from http://www.legislation.gov.uk/
Or is there an easier way to do it?
This site mentions Larbin, but I can’t find it in the repo.
It looks like the type of program I would need to get the URLs for the script but, as above, I’m unable to find it. Is there any other program out there that could grab all the PDF URLs?
> I’ve just been looking at wget but for what I want to do I would need a HUGE script as I want to download ALL PDFs from http://www.legislation.gov.uk/
What is wrong with:
wget --random-wait --limit-rate=20k -r -A .pdf http://website.com
With this command you download all PDF files from website.com.
Use it at your own risk.
Nothing, I just can’t understand the last two options, even after reading below. I can understand some of them but not all.
How does -r know which address to go to next from the ‘home page address’?
How does -A work? What list?
And where does it save the collected files, as that’s not specified?
Anyway, thanks loads for the answer ;D
--random-wait
Some web sites may perform log analysis to identify retrieval
programs such as Wget by looking for statistically significant
similarities in the time between requests. This option causes the
time between requests to vary between 0.5 and 1.5 * wait seconds,
where wait was specified using the --wait option, in order to mask
Wget’s presence from such analysis.
A 2001 article in a publication devoted to development on a popular
consumer platform provided code to perform this analysis on the
fly. Its author suggested blocking at the class C address level to
ensure automated retrieval programs were blocked despite changing
DHCP-supplied addresses.
--limit-rate=amount
Limit the download speed to amount bytes per second. Amount may be
expressed in bytes, kilobytes with the k suffix, or megabytes with
the m suffix. For example, --limit-rate=20k will limit the
retrieval rate to 20KB/s. This is useful when, for whatever
reason, you don’t want Wget to consume the entire available
bandwidth.
This option allows the use of decimal numbers, usually in
conjunction with power suffixes; for example, --limit-rate=2.5k is
a legal value.
Note that Wget implements the limiting by sleeping the appropriate
amount of time after a network read that took less time than
specified by the rate. Eventually this strategy causes the TCP
transfer to slow down to approximately the specified rate.
However, it may take some time for this balance to be achieved, so
don't be surprised if limiting the rate doesn't work well with very
small files.
Recursive Retrieval Options
-r
--recursive
Turn on recursive retrieving. The default maximum depth is 5.
Recursive Accept/Reject Options
-A acclist --accept acclist
-R rejlist --reject rejlist
Specify comma-separated lists of file name suffixes or patterns to
accept or reject. Note that if any of the wildcard characters, *,
?, [ or ], appear in an element of acclist or rejlist, it will be
treated as a pattern, rather than a suffix.
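As for where the files end up: by default wget saves into a directory named after the site’s host, created under whatever directory you run it from; the -P (--directory-prefix) option changes that. A small sketch pulling the options together, assuming you want everything under a ./pdfs folder; the -P target and the --wait value here are illustrative choices, not anything suggested above:
# save under ./pdfs, recurse, keep only .pdf files,
# and give --random-wait a base wait of 2 seconds to vary around
wget --wait=2 --random-wait --limit-rate=20k -r -A .pdf -P pdfs http://website.com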
Tried it but got this; I don’t know if I did anything wrong:
pooky2483@pooky2483-ubuntu12:~$ wget --random-wait --limit-rate=20k -r -A=.pdf http://www.legislation.gov.uk/
--2012-12-03 17:57:36-- http://www.legislation.gov.uk/
Resolving www.legislation.gov.uk (www.legislation.gov.uk)… 23.65.22.26, 23.65.22.35
Connecting to www.legislation.gov.uk (www.legislation.gov.uk)|23.65.22.26|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 10953 (11K) [text/html]
Saving to: ‘www.legislation.gov.uk/index.html’
100%[==============================================>] 10,953 20.0KB/s in 0.5s
2012-12-03 17:57:37 (20.0 KB/s) - ‘www.legislation.gov.uk/index.html’ saved [10953/10953]
Removing www.legislation.gov.uk/index.html since it should be rejected.
FINISHED --2012-12-03 17:57:37--
Total wall clock time: 1.3s
Downloaded: 1 files, 11K in 0.5s (20.0 KB/s)
Tried a different address; still not getting the result I wanted…
pooky2483@pooky2483-ubuntu12:~$ wget --random-wait --limit-rate=20k -r -A=.pdf http://www.legislation.gov.uk/browse
--2012-12-03 18:12:51-- http://www.legislation.gov.uk/browse
Resolving www.legislation.gov.uk (www.legislation.gov.uk)… 23.65.22.26, 23.65.22.16
Connecting to www.legislation.gov.uk (www.legislation.gov.uk)|23.65.22.26|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 12807 (13K) [text/html]
Saving to: ‘www.legislation.gov.uk/browse’
100%[==============================================>] 12,807 20.0KB/s in 0.6s
2012-12-03 18:12:52 (20.0 KB/s) - ‘www.legislation.gov.uk/browse’ saved [12807/12807]
Loading robots.txt; please ignore errors.
--2012-12-03 18:12:52-- http://www.legislation.gov.uk/robots.txt
Reusing existing connection to www.legislation.gov.uk:80.
HTTP request sent, awaiting response… 200 OK
Length: 1226 (1.2K) [text/plain]
Saving to: ‘www.legislation.gov.uk/robots.txt’
100%[==============================================>] 1,226 --.-K/s in 0s
2012-12-03 18:12:53 (44.2 MB/s) - ‘www.legislation.gov.uk/robots.txt’ saved [1226/1226]
Removing www.legislation.gov.uk/browse since it should be rejected.
--2012-12-03 18:12:53-- http://www.legislation.gov.uk/
Reusing existing connection to www.legislation.gov.uk:80.
HTTP request sent, awaiting response… 200 OK
Length: 10953 (11K) [text/html]
Saving to: ‘www.legislation.gov.uk/index.html’
100%[==============================================>] 10,953 20.0KB/s in 0.5s
2012-12-03 18:12:53 (20.0 KB/s) - ‘www.legislation.gov.uk/index.html’ saved [10953/10953]
Removing www.legislation.gov.uk/index.html since it should be rejected.
FINISHED --2012-12-03 18:12:53--
Total wall clock time: 1.8s
Downloaded: 3 files, 24K in 1.2s (21.0 KB/s)
pooky2483@pooky2483-ubuntu12:~$
AND
pooky2483@pooky2483-ubuntu12:~$ wget --random-wait --limit-rate=20k -r -A=.pdf http://www.legislation.gov.uk/ukpga
--2012-12-03 18:17:28-- http://www.legislation.gov.uk/ukpga
Resolving www.legislation.gov.uk (www.legislation.gov.uk)… 23.65.22.16, 23.65.22.26
Connecting to www.legislation.gov.uk (www.legislation.gov.uk)|23.65.22.16|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: unspecified [text/html]
Saving to: ‘www.legislation.gov.uk/ukpga’
[ <=> ] 81,079 20.0KB/s in 4.0s
2012-12-03 18:17:33 (20.0 KB/s) - ‘www.legislation.gov.uk/ukpga’ saved [81079]
Loading robots.txt; please ignore errors.
--2012-12-03 18:17:33-- http://www.legislation.gov.uk/robots.txt
Reusing existing connection to www.legislation.gov.uk:80.
HTTP request sent, awaiting response… 200 OK
Length: 1226 (1.2K) [text/plain]
Saving to: ‘www.legislation.gov.uk/robots.txt’
100%[==============================================>] 1,226 --.-K/s in 0s
2012-12-03 18:17:33 (58.2 MB/s) - ‘www.legislation.gov.uk/robots.txt’ saved [1226/1226]
Removing www.legislation.gov.uk/ukpga since it should be rejected.
--2012-12-03 18:17:33-- http://www.legislation.gov.uk/
Reusing existing connection to www.legislation.gov.uk:80.
HTTP request sent, awaiting response… 200 OK
Length: 10953 (11K) [text/html]
Saving to: ‘www.legislation.gov.uk/index.html’
100%[==============================================>] 10,953 20.0KB/s in 0.5s
2012-12-03 18:17:34 (20.0 KB/s) - ‘www.legislation.gov.uk/index.html’ saved [10953/10953]
Removing www.legislation.gov.uk/index.html since it should be rejected.
FINISHED --2012-12-03 18:17:34--
Total wall clock time: 5.8s
Downloaded: 3 files, 91K in 4.5s (20.3 KB/s)
pooky2483@pooky2483-ubuntu12:~$
robots.txt
User-agent: *
Sitemap: http://www.legislation.gov.uk/sitemap-static.xml
Sitemap: http://www.legislation.gov.uk/sitemap-ukpga.xml
Sitemap: http://www.legislation.gov.uk/sitemap-ukla.xml
Sitemap: http://www.legislation.gov.uk/sitemap-apgb.xml
Sitemap: http://www.legislation.gov.uk/sitemap-aep.xml
Sitemap: http://www.legislation.gov.uk/sitemap-aosp.xml
Sitemap: http://www.legislation.gov.uk/sitemap-asp.xml
Sitemap: http://www.legislation.gov.uk/sitemap-aip.xml
Sitemap: http://www.legislation.gov.uk/sitemap-apni.xml
Sitemap: http://www.legislation.gov.uk/sitemap-mnia.xml
Sitemap: http://www.legislation.gov.uk/sitemap-nia.xml
Sitemap: http://www.legislation.gov.uk/sitemap-ukcm.xml
Sitemap: http://www.legislation.gov.uk/sitemap-mwa.xml
Sitemap: http://www.legislation.gov.uk/sitemap-anaw.xml
Sitemap: http://www.legislation.gov.uk/sitemap-uksi.xml
Sitemap: http://www.legislation.gov.uk/sitemap-ssi.xml
Sitemap: http://www.legislation.gov.uk/sitemap-wsi.xml
Sitemap: http://www.legislation.gov.uk/sitemap-nisr.xml
Sitemap: http://www.legislation.gov.uk/sitemap-ukci.xml
Sitemap: http://www.legislation.gov.uk/sitemap-nisi.xml
Sitemap: http://www.legislation.gov.uk/sitemap-ukmo.xml
What’s all this???
Though I’m no website dev, my guess is they are maps to directories on the server that you don’t (and never will) have permission to enter, and as the PDFs are contained in them, “you can’t av em … stop tryin to nick all our content in one go, and sod off”.
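For what it’s worth, those Sitemap: lines are just XML files listing the site’s public page URLs for search engines, so another angle might be to pull one sitemap, extract the URLs, and hand them back to wget. This is only a rough, untested sketch; the grep/sed extraction and the assumption that PDFs are linked one level below the listed pages are mine, not from the thread:
# grab one of the sitemaps named in robots.txt
wget -q -O sitemap-ukpga.xml http://www.legislation.gov.uk/sitemap-ukpga.xml
# pull the <loc> URLs out of the XML into a plain list
grep -o '<loc>[^<]*</loc>' sitemap-ukpga.xml | sed -e 's/<loc>//' -e 's|</loc>||' > urls.txt
# fetch each listed page and keep only PDFs linked directly from it
wget --random-wait --limit-rate=20k -r -l1 -A.pdf -i urls.txt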
Tried it and saw this (at the end); it says it’s in the repo but it comes up blank:
A GUI wget…
I saw your last post and wonder if there’s a workaround with a longer pause?
But they’re all free-to-download documents.
Of course they are, but they don’t want you to download them all in one go and -
a) use all their bandwidth
and
b) set up a competing site with all their content.
So not only are they not going to make it easy … they’ll go out of their way to make it difficult.
I never said what you are attempting to do was wrong, or illegal … I just said they don’t want you to.
Tried to get the GUI version of wget
pooky2483@pooky2483-ubuntu12:~$ sudo apt-get install gwget
[sudo] password for pooky2483:
Reading package lists… Done
Building dependency tree
Reading state information… Done
Package gwget is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source
E: Package ‘gwget’ has no installation candidate
pooky2483@pooky2483-ubuntu12:~$
It’s no longer in the repos … there’s a probable reason in the title here:
http://projects.gnome.org/gwget/
wget::gui
seems to work … you’ll need to make sure perl and perl-tk are installed.
then just download the .zip for Linux … unpack it … and run it from the CLI
But it’s unlikely to help in this case.
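In case it helps, a bare-bones sketch of those steps; the zip file name and the script name inside it are placeholders for whatever the wget::gui download actually contains:
sudo apt-get install perl-tk   # Tk bindings for Perl; perl itself is normally already installed
unzip wgetgui-linux.zip        # placeholder name for the Linux .zip you downloaded
perl wgetgui.pl                # placeholder script name; run it from inside the unpacked folder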
Looks like I’m at a dead end then. Back to manually downloading them…
> Looks like I'm at a dead end then. Back to manually downloading them...
Wget follows the breadcrumbs (links) in the pages it comes across.
I guess if the pages are loaded by script, then you are outta luck.
However, if you start with a known subsection which contains the links to the PDFs, then you might get somewhere.
wget --random-wait --limit-rate=20k -r --no-parent -l10 -A.pdf http://website.com/subfolder
That should start from the specified URL (ignoring the directories above it) and go up to 10 levels deep.
Like Mark said, they might just tell you to sod off and do the downloading like everyone else does.
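If you want to see whether wget can even reach any PDFs from a given section before committing to a long download, --spider crawls and reports without saving anything. A hedged example using one of the sections tried earlier; whether anything ending in .pdf is actually linked statically there is exactly what’s in doubt:
# crawl two levels from /ukpga without saving, and show any .pdf URLs wget encounters
wget --spider -r --no-parent -l2 http://www.legislation.gov.uk/ukpga 2>&1 | grep '\.pdf'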
I had a “quick” look at the site in question, and the PDF always seems to get loaded from the same link, which “becomes” a link to the selected PDF … so I’m guessing it’s all scripted.
Nor will it let you “browse” the directory it says the PDF comes from.
That said, playing with wget::gui and trying a few random options seemed to download quite a few directories … but no PDFs, just a bunch of XML files, even though I’d told it to get PDFs
(maybe I just didn’t give it enough time to follow all the links).
I’d also guess there will be gigabytes of PDFs on that site … which would explain them not wanting thousands of people all downloading the lot in one go … they’d lock up the site for hours.
Just thought it’d be worth a try to get the PDFs in one go, but if it can’t be done, never mind.
Thanks anyway for trying.
I’m trying ‘wget’ again and I’m having problems getting it to download picture files. I am getting conflicting info regarding what to enter, as some say to type ‘-A jpg,jpeg,gif’, others ‘-A .jpg,.jpeg,.gif’ or ‘-A.jpg,.jpeg,.gif’.
It is downloading them but then deleting them!
bash-4.2$ wget -r -e robots=off --limit-rate=50k --wait=8 --no-parent -l50 -Ajpg,jpeg,gif http://www.crayola.co.uk/free-coloring-pages/
--2013-12-28 22:43:02-- http://www.crayola.co.uk/free-coloring-pages/
Resolving www.crayola.co.uk (www.crayola.co.uk)… 198.78.218.126
Connecting to www.crayola.co.uk (www.crayola.co.uk)|198.78.218.126|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 106758 (104K) [text/html]
Saving to: `www.crayola.co.uk/free-coloring-pages/index.html’
100%[===================================================================================================================================================>] 106,758 50.0K/s in 2.1s
2013-12-28 22:43:06 (50.0 KB/s) - `www.crayola.co.uk/free-coloring-pages/index.html’ saved [106758/106758]
Removing www.crayola.co.uk/free-coloring-pages/index.html since it should be rejected.
--2013-12-28 22:43:14-- http://www.crayola.co.uk/free-coloring-pages/~/media/Crayola/Doodads/doodadbee.jpg?mh=1600&mw=150
Connecting to www.crayola.co.uk (www.crayola.co.uk)|198.78.218.126|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 16928 (17K) [image/jpeg]
Saving to: `www.crayola.co.uk/free-coloring-pages/~/media/Crayola/Doodads/doodadbee.jpg?mh=1600&mw=150’
100%[===================================================================================================================================================>] 16,928 75.1K/s in 0.2s
2013-12-28 22:43:15 (75.1 KB/s) - `www.crayola.co.uk/free-coloring-pages/~/media/Crayola/Doodads/doodadbee.jpg?mh=1600&mw=150’ saved [16928/16928]
Removing www.crayola.co.uk/free-coloring-pages/~/media/Crayola/Doodads/doodadbee.jpg?mh=1600&mw=150 since it should be rejected.
^Z
[8]+ Stopped wget -r -e robots=off --limit-rate=50k --wait=8 --no-parent -l50 -Ajpg,jpeg,gif http://www.crayola.co.uk/free-coloring-pages/
bash-4.2$
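For the record, the log shows why those JPEGs get deleted again: the saved name is doodadbee.jpg?mh=1600&mw=150, so a plain jpg suffix in -A never matches once the query string is tacked on. Two possible workarounds, offered only as untested sketches: give -A wildcard patterns (the man page excerpt above says elements containing wildcards are treated as patterns rather than suffixes), or match on the whole URL with --accept-regex, which needs a reasonably recent wget:
# patterns instead of suffixes, so the ?mh=...&mw=... part doesn't matter
wget -r -l5 --no-parent --limit-rate=50k --wait=8 -A '*.jpg*,*.jpeg*,*.gif*' http://www.crayola.co.uk/free-coloring-pages/
# or filter on the full URL (wget 1.14 or later)
wget -r -l5 --no-parent --limit-rate=50k --wait=8 --accept-regex '\.(jpg|jpeg|gif)' http://www.crayola.co.uk/free-coloring-pages/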