Author Topic: File leecher  (Read 1928 times)

Offline pooky2483

  • Hero Member
  • *****
  • Posts: 1616
  • Karma: 0
  • Gender: Male
  • Slowly getting the hang of it.
    • View Profile
    • Get your FREE Ubuntu stickers here. I'm the UK address
    • Awards
File leecher
« on: December 03, 2012, 04:37:54 am »
I am looking for a file leecher to save time downloading PDF files. Does anyone know of a good program I can use to do this?
I want to be able to get ALL the PDFs from a site without having to enter each URL for each PDF, just the main site's address, and for it to go inside searching for PDFs to download.
« Last Edit: December 03, 2012, 04:39:35 am by pooky2483 »

Kubuntu 12.04LTS 64bit|KDE 4.13.2|QT 4.8.6|Linux 3.2.0-61-generic|M3A76-CM|BIOS 2101|AMD PhenomII X4 965 3400+|Realtek RTL8168C(P)|8111C(P) PCI-E Gigabit Ethernet NIC|NVIDIA 128MB GeForce6200 Turbocache|8.0GB Single-Channel DDR2|

Online Mark Greaves (PCNetSpec)

  • Administrator
  • Hero Member
  • *****
  • Posts: 13191
  • Karma: 321
  • Gender: Male
  • "-rw-rw-rw-" .. The Number Of The Beast
    • View Profile
    • PCNetSpec
    • Awards
Re: File leecher
« Reply #1 on: December 03, 2012, 11:11:44 am »
Have you tried using wget, which is an immensely powerful web grabber?
http://www.webupd8.org/2009/08/how-to-download-files-from-web-using.html
and
http://linuxreviews.org/quicktips/wget/

and
Code:
man wget



Or you could take a look at HTTrack:
http://www.httrack.com/

It's in the repos as webhttrack.
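For the command-line version, a minimal sketch of mirroring just the PDFs from a site (example.com and the ./mirror directory are placeholders, and the exact filter combination may need tuning; see `man httrack`):

Code:
# crawl the site and keep a local mirror under ./mirror;
# "+*.pdf" is an HTTrack scan filter that accepts PDF files
httrack "http://example.com/" -O ./mirror "+*.pdf" -v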

.
« Last Edit: December 03, 2012, 12:34:53 pm by Mark Greaves (PCNetSpec) »
WARNING: You are logged into reality as 'root'

logging in as 'insane' is the only safe option.

Offline pooky2483

  • Hero Member
  • *****
  • Posts: 1616
  • Karma: 0
  • Gender: Male
  • Slowly getting the hang of it.
    • View Profile
    • Get your FREE Ubuntu stickers here. I'm the UK address
    • Awards
Re: File leecher
« Reply #2 on: December 03, 2012, 12:32:48 pm »
I've just been looking at wget, but for what I want to do I would need a HUGE script, as I want to download ALL the PDFs from http://www.legislation.gov.uk/
Or is there an easier way to do it?


Kubuntu 12.04LTS 64bit|KDE 4.13.2|QT 4.8.6|Linux 3.2.0-61-generic|M3A76-CM|BIOS 2101|AMD PhenomII X4 965 3400+|Realtek RTL8168C(P)|8111C(P) PCI-E Gigabit Ethernet NIC|NVIDIA 128MB GeForce6200 Turbocache|8.0GB Single-Channel DDR2|

Offline pooky2483

  • Hero Member
  • *****
  • Posts: 1616
  • Karma: 0
  • Gender: Male
  • Slowly getting the hang of it.
    • View Profile
    • Get your FREE Ubuntu stickers here. I'm the UK address
    • Awards
Re: File leecher
« Reply #3 on: December 03, 2012, 12:42:40 pm »
This site mentions Larbin, but I can't find it in the repo:
http://www.linuxquestions.org/questions/programming-9/i-need-a-web-crawler-and-indexer-for-linux-248214/

It looks like the type of program I would need to get the URLs for the script, but as above, I'm unable to find it. Is there any other program out there that could grab all the PDF URLs?
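(For what it's worth, wget itself can act as a crude URL harvester: in spider mode it crawls without saving anything, and the PDF links can be grepped out of its log. A sketch, where example.com is a placeholder and the grep pattern is only rough:)

Code:
# crawl without saving anything, then pull the PDF links out of the log
wget -r --spider http://example.com 2>&1 \
  | grep -o 'http[^ ]*\.pdf' | sort -u > pdf-urls.txt
# the list can then be fed back in with: wget -i pdf-urls.txt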


Kubuntu 12.04LTS 64bit|KDE 4.13.2|QT 4.8.6|Linux 3.2.0-61-generic|M3A76-CM|BIOS 2101|AMD PhenomII X4 965 3400+|Realtek RTL8168C(P)|8111C(P) PCI-E Gigabit Ethernet NIC|NVIDIA 128MB GeForce6200 Turbocache|8.0GB Single-Channel DDR2|

Offline SeZo

  • Hero Member
  • *****
  • Posts: 1421
  • Karma: 118
  • Gender: Male
    • View Profile
    • Awards
Re: File leecher
« Reply #4 on: December 03, 2012, 02:10:17 pm »
Quote
I've just been looking at wget, but for what I want to do I would need a HUGE script, as I want to download ALL the PDFs from http://www.legislation.gov.uk/

What is wrong with:
Code:
wget --random-wait --limit-rate=20k -r -A=.pdf http://website.com
With this command you download all PDF files from website.com.
Use it at your own risk.
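(One caveat worth noting: with the short option, `-A=.pdf` makes the accept list the literal suffix `=.pdf`, which matches nothing; the usual spellings are `-A pdf` or `--accept=pdf`. A commented sketch of the same idea follows; the `--wait` value and the `-P ./pdfs` destination are illustrative additions, not part of the original command:)

Code:
# Recursively crawl the site, keeping only files whose names end in "pdf".
# --random-wait varies the --wait delay between requests;
# --limit-rate caps bandwidth; -P sets the save directory.
wget --recursive --accept pdf \
     --wait=1 --random-wait \
     --limit-rate=20k \
     -P ./pdfs \
     http://website.com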

Offline pooky2483

  • Hero Member
  • *****
  • Posts: 1616
  • Karma: 0
  • Gender: Male
  • Slowly getting the hang of it.
    • View Profile
    • Get your FREE Ubuntu stickers here. I'm the UK address
    • Awards
Re: File leecher
« Reply #5 on: December 03, 2012, 05:48:49 pm »
What is wrong with:
Code:
wget --random-wait --limit-rate=20k -r -A=.pdf http://website.com
With this command you download all PDF files from website.com.
Use it at your own risk.


Nothing, I just can't understand the last two options, even after reading the man page excerpts below. I can understand some of them, but not all.

How does -r know which address to go to next from the 'home page' address?

How does -A work? What list?

And where does it save the collected files? It's not specified.

Anyway, thanks loads for the answer  ;D
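(On the last question: by default wget saves into the current working directory, building a directory tree named after the site's host, e.g. ./www.legislation.gov.uk/... Two real wget options change that; example.com and the target directory below are placeholders:)

Code:
wget -r -A pdf -P ~/Downloads/legislation http://example.com  # -P / --directory-prefix sets where the tree goes
wget -r -A pdf -nd http://example.com                         # -nd / --no-directories puts all files in one folder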

--random-wait

           Some web sites may perform log analysis to identify retrieval
           programs such as Wget by looking for statistically significant
           similarities in the time between requests. This option causes the
           time between requests to vary between 0.5 and 1.5 * wait seconds,
           where wait was specified using the --wait option, in order to mask
           Wget's presence from such analysis.

           A 2001 article in a publication devoted to development on a popular
           consumer platform provided code to perform this analysis on the
           fly.  Its author suggested blocking at the class C address level to
           ensure automated retrieval programs were blocked despite changing
           DHCP-supplied addresses.

--limit-rate=amount
           Limit the download speed to amount bytes per second.  Amount may be
           expressed in bytes, kilobytes with the k suffix, or megabytes with
           the m suffix.  For example, --limit-rate=20k will limit the
           retrieval rate to 20KB/s.  This is useful when, for whatever
           reason, you don't want Wget to consume the entire available
           bandwidth.

           This option allows the use of decimal numbers, usually in
           conjunction with power suffixes; for example, --limit-rate=2.5k is
           a legal value.

           Note that Wget implements the limiting by sleeping the appropriate
           amount of time after a network read that took less time than
           specified by the rate.  Eventually this strategy causes the TCP
           transfer to slow down to approximately the specified rate.
           However, it may take some time for this balance to be achieved, so
           don't be surprised if limiting the rate doesn't work well with very
           small files.

Recursive Retrieval Options
       -r
       --recursive

           Turn on recursive retrieving.    The default maximum depth is 5.

Recursive Accept/Reject Options
      -A acclist --accept acclist
       -R rejlist --reject rejlist

           Specify comma-separated lists of file name suffixes or patterns to
           accept or reject. Note that if any of the wildcard characters, *,
           ?, [ or ], appear in an element of acclist or rejlist, it will be
           treated as a pattern, rather than a suffix.
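(Putting -r and -A together explains the behaviour seen in the logs that follow: wget must still download each HTML page to find further links, and only afterwards deletes anything that doesn't match the accept list. A minimal sketch, with example.com as a placeholder:)

Code:
# HTML pages are fetched for link extraction, then removed as "rejected";
# only names matching the -A suffix list are kept on disk.
wget -r -A pdf http://example.com/docs/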

Kubuntu 12.04LTS 64bit|KDE 4.13.2|QT 4.8.6|Linux 3.2.0-61-generic|M3A76-CM|BIOS 2101|AMD PhenomII X4 965 3400+|Realtek RTL8168C(P)|8111C(P) PCI-E Gigabit Ethernet NIC|NVIDIA 128MB GeForce6200 Turbocache|8.0GB Single-Channel DDR2|

Offline pooky2483

  • Hero Member
  • *****
  • Posts: 1616
  • Karma: 0
  • Gender: Male
  • Slowly getting the hang of it.
    • View Profile
    • Get your FREE Ubuntu stickers here. I'm the UK address
    • Awards
Re: File leecher
« Reply #6 on: December 03, 2012, 06:04:52 pm »
Tried it but got this; I don't know if I did anything wrong.

pooky2483@pooky2483-ubuntu12:~$ wget --random-wait --limit-rate=20k -r -A=.pdf http://www.legislation.gov.uk/
--2012-12-03 17:57:36--  http://www.legislation.gov.uk/
Resolving www.legislation.gov.uk (www.legislation.gov.uk)... 23.65.22.26, 23.65.22.35
Connecting to www.legislation.gov.uk (www.legislation.gov.uk)|23.65.22.26|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10953 (11K) [text/html]
Saving to: ‘www.legislation.gov.uk/index.html’

100%[==============================================>] 10,953      20.0KB/s   in 0.5s   

2012-12-03 17:57:37 (20.0 KB/s) - ‘www.legislation.gov.uk/index.html’ saved [10953/10953]

Removing www.legislation.gov.uk/index.html since it should be rejected.

FINISHED --2012-12-03 17:57:37--
Total wall clock time: 1.3s
Downloaded: 1 files, 11K in 0.5s (20.0 KB/s)
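(Two things are likely at work in this log. First, `-A=.pdf` spelled with the `=` makes the accept list the literal suffix `=.pdf`, which matches nothing. Second, this site uses extension-less URLs such as /browse, and with an accept list wget only follows links that either match the list or look like HTML, so the crawl dies at the first page. A corrected spelling to try; whether the site's structure then exposes the PDFs to a crawler is a separate question, as the later replies suggest:)

Code:
wget --random-wait --wait=1 --limit-rate=20k -r -A pdf http://www.legislation.gov.uk/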

Kubuntu 12.04LTS 64bit|KDE 4.13.2|QT 4.8.6|Linux 3.2.0-61-generic|M3A76-CM|BIOS 2101|AMD PhenomII X4 965 3400+|Realtek RTL8168C(P)|8111C(P) PCI-E Gigabit Ethernet NIC|NVIDIA 128MB GeForce6200 Turbocache|8.0GB Single-Channel DDR2|

Offline pooky2483

  • Hero Member
  • *****
  • Posts: 1616
  • Karma: 0
  • Gender: Male
  • Slowly getting the hang of it.
    • View Profile
    • Get your FREE Ubuntu stickers here. I'm the UK address
    • Awards
Re: File leecher
« Reply #7 on: December 03, 2012, 06:23:55 pm »
Tried a different address; not getting the result I wanted...

pooky2483@pooky2483-ubuntu12:~$ wget --random-wait --limit-rate=20k -r -A=.pdf http://www.legislation.gov.uk/browse
--2012-12-03 18:12:51--  http://www.legislation.gov.uk/browse
Resolving www.legislation.gov.uk (www.legislation.gov.uk)... 23.65.22.26, 23.65.22.16
Connecting to www.legislation.gov.uk (www.legislation.gov.uk)|23.65.22.26|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12807 (13K) [text/html]
Saving to: ‘www.legislation.gov.uk/browse’

100%[==============================================>] 12,807      20.0KB/s   in 0.6s   

2012-12-03 18:12:52 (20.0 KB/s) - ‘www.legislation.gov.uk/browse’ saved [12807/12807]

Loading robots.txt; please ignore errors.
--2012-12-03 18:12:52--  http://www.legislation.gov.uk/robots.txt
Reusing existing connection to www.legislation.gov.uk:80.
HTTP request sent, awaiting response... 200 OK
Length: 1226 (1.2K) [text/plain]
Saving to: ‘www.legislation.gov.uk/robots.txt’

100%[==============================================>] 1,226       --.-K/s   in 0s     

2012-12-03 18:12:53 (44.2 MB/s) - ‘www.legislation.gov.uk/robots.txt’ saved [1226/1226]

Removing www.legislation.gov.uk/browse since it should be rejected.

--2012-12-03 18:12:53--  http://www.legislation.gov.uk/
Reusing existing connection to www.legislation.gov.uk:80.
HTTP request sent, awaiting response... 200 OK
Length: 10953 (11K) [text/html]
Saving to: ‘www.legislation.gov.uk/index.html’

100%[==============================================>] 10,953      20.0KB/s   in 0.5s   

2012-12-03 18:12:53 (20.0 KB/s) - ‘www.legislation.gov.uk/index.html’ saved [10953/10953]

Removing www.legislation.gov.uk/index.html since it should be rejected.

FINISHED --2012-12-03 18:12:53--
Total wall clock time: 1.8s
Downloaded: 3 files, 24K in 1.2s (21.0 KB/s)
pooky2483@pooky2483-ubuntu12:~$

AND

pooky2483@pooky2483-ubuntu12:~$ wget --random-wait --limit-rate=20k -r -A=.pdf http://www.legislation.gov.uk/ukpga
--2012-12-03 18:17:28--  http://www.legislation.gov.uk/ukpga
Resolving www.legislation.gov.uk (www.legislation.gov.uk)... 23.65.22.16, 23.65.22.26
Connecting to www.legislation.gov.uk (www.legislation.gov.uk)|23.65.22.16|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.legislation.gov.uk/ukpga’

    [           <=>                                 ] 81,079      20.0KB/s   in 4.0s   

2012-12-03 18:17:33 (20.0 KB/s) - ‘www.legislation.gov.uk/ukpga’ saved [81079]

Loading robots.txt; please ignore errors.
--2012-12-03 18:17:33--  http://www.legislation.gov.uk/robots.txt
Reusing existing connection to www.legislation.gov.uk:80.
HTTP request sent, awaiting response... 200 OK
Length: 1226 (1.2K) [text/plain]
Saving to: ‘www.legislation.gov.uk/robots.txt’

100%[==============================================>] 1,226       --.-K/s   in 0s     

2012-12-03 18:17:33 (58.2 MB/s) - ‘www.legislation.gov.uk/robots.txt’ saved [1226/1226]

Removing www.legislation.gov.uk/ukpga since it should be rejected.

--2012-12-03 18:17:33--  http://www.legislation.gov.uk/
Reusing existing connection to www.legislation.gov.uk:80.
HTTP request sent, awaiting response... 200 OK
Length: 10953 (11K) [text/html]
Saving to: ‘www.legislation.gov.uk/index.html’

100%[==============================================>] 10,953      20.0KB/s   in 0.5s   

2012-12-03 18:17:34 (20.0 KB/s) - ‘www.legislation.gov.uk/index.html’ saved [10953/10953]

Removing www.legislation.gov.uk/index.html since it should be rejected.

FINISHED --2012-12-03 18:17:34--
Total wall clock time: 5.8s
Downloaded: 3 files, 91K in 4.5s (20.3 KB/s)
pooky2483@pooky2483-ubuntu12:~$
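(Note the robots.txt fetches in these logs: wget honours a site's robots.txt by default, so anything disallowed there is skipped on top of the accept-list filtering. The switch below is real wget, but use it thoughtfully, as a site's robots rules usually exist for a reason; example.com is a placeholder:)

Code:
# -e robots=off tells wget to ignore the site's robots.txt
wget -e robots=off --wait=2 --random-wait -r -A pdf http://example.com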

Kubuntu 12.04LTS 64bit|KDE 4.13.2|QT 4.8.6|Linux 3.2.0-61-generic|M3A76-CM|BIOS 2101|AMD PhenomII X4 965 3400+|Realtek RTL8168C(P)|8111C(P) PCI-E Gigabit Ethernet NIC|NVIDIA 128MB GeForce6200 Turbocache|8.0GB Single-Channel DDR2|


Online Mark Greaves (PCNetSpec)

  • Administrator
  • Hero Member
  • *****
  • Posts: 13191
  • Karma: 321
  • Gender: Male
  • "-rw-rw-rw-" .. The Number Of The Beast
    • View Profile
    • PCNetSpec
    • Awards
Re: File leecher
« Reply #9 on: December 03, 2012, 06:40:25 pm »
Though I'm no website dev, my guess is they are maps to directories on the server that you don't (and never will) have permission to enter, and as the PDFs are contained in them, "you can't av em .. stop tryin to nick all our content in one go, and sod off" ;)
« Last Edit: December 03, 2012, 06:48:49 pm by Mark Greaves (PCNetSpec) »
WARNING: You are logged into reality as 'root'

logging in as 'insane' is the only safe option.

Offline pooky2483

  • Hero Member
  • *****
  • Posts: 1616
  • Karma: 0
  • Gender: Male
  • Slowly getting the hang of it.
    • View Profile
    • Get your FREE Ubuntu stickers here. I'm the UK address
    • Awards
Re: File leecher
« Reply #10 on: December 03, 2012, 06:46:08 pm »
Have you tried using wget, which is an immensely powerful web grabber?
http://www.webupd8.org/2009/08/how-to-download-files-from-web-using.html


Tried it, and at the end it mentions a GUI wget and says it's in the repo, but searching for it comes up blank.
I saw your last post and wonder if there's a workaround with a longer pause?
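(On the longer pause: --wait takes a delay in seconds, and --random-wait then varies it, so slowing the crawl right down is just, for example, the following; example.com is a placeholder:)

Code:
wget --wait=10 --random-wait --limit-rate=20k -r -A pdf http://example.com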

Kubuntu 12.04LTS 64bit|KDE 4.13.2|QT 4.8.6|Linux 3.2.0-61-generic|M3A76-CM|BIOS 2101|AMD PhenomII X4 965 3400+|Realtek RTL8168C(P)|8111C(P) PCI-E Gigabit Ethernet NIC|NVIDIA 128MB GeForce6200 Turbocache|8.0GB Single-Channel DDR2|

Offline pooky2483

  • Hero Member
  • *****
  • Posts: 1616
  • Karma: 0
  • Gender: Male
  • Slowly getting the hang of it.
    • View Profile
    • Get your FREE Ubuntu stickers here. I'm the UK address
    • Awards
Re: File leecher
« Reply #11 on: December 03, 2012, 06:47:20 pm »
Though I'm no website dev, my guess is they are maps to directories on the server that you don't (and will never) have permission to enter, and as the .pdf's are contained in them, "you can't av em .. stop tryin to nick all our content in one go, and sod off" ;)
But they're all free-to-download documents.

Kubuntu 12.04LTS 64bit|KDE 4.13.2|QT 4.8.6|Linux 3.2.0-61-generic|M3A76-CM|BIOS 2101|AMD PhenomII X4 965 3400+|Realtek RTL8168C(P)|8111C(P) PCI-E Gigabit Ethernet NIC|NVIDIA 128MB GeForce6200 Turbocache|8.0GB Single-Channel DDR2|

Online Mark Greaves (PCNetSpec)

  • Administrator
  • Hero Member
  • *****
  • Posts: 13191
  • Karma: 321
  • Gender: Male
  • "-rw-rw-rw-" .. The Number Of The Beast
    • View Profile
    • PCNetSpec
    • Awards
Re: File leecher
« Reply #12 on: December 03, 2012, 06:54:05 pm »
Of course they are, but they don't want you to download them all in one go and -

a) use all their bandwidth
and
b) set up a competing site with all their content.

So not only are they not going to make it easy .. they'll go out of their way to make it difficult.

I never said what you are attempting to do was wrong, or illegal .. I just said they don't want you to.

.
« Last Edit: December 03, 2012, 06:56:53 pm by Mark Greaves (PCNetSpec) »
WARNING: You are logged into reality as 'root'

logging in as 'insane' is the only safe option.

Offline pooky2483

  • Hero Member
  • *****
  • Posts: 1616
  • Karma: 0
  • Gender: Male
  • Slowly getting the hang of it.
    • View Profile
    • Get your FREE Ubuntu stickers here. I'm the UK address
    • Awards
Re: File leecher
« Reply #13 on: December 03, 2012, 07:04:32 pm »
Tried to get the GUI version of wget

pooky2483@pooky2483-ubuntu12:~$ sudo apt-get install gwget
[sudo] password for pooky2483:
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package gwget is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'gwget' has no installation candidate
pooky2483@pooky2483-ubuntu12:~$

Kubuntu 12.04LTS 64bit|KDE 4.13.2|QT 4.8.6|Linux 3.2.0-61-generic|M3A76-CM|BIOS 2101|AMD PhenomII X4 965 3400+|Realtek RTL8168C(P)|8111C(P) PCI-E Gigabit Ethernet NIC|NVIDIA 128MB GeForce6200 Turbocache|8.0GB Single-Channel DDR2|

Online Mark Greaves (PCNetSpec)

  • Administrator
  • Hero Member
  • *****
  • Posts: 13191
  • Karma: 321
  • Gender: Male
  • "-rw-rw-rw-" .. The Number Of The Beast
    • View Profile
    • PCNetSpec
    • Awards
Re: File leecher
« Reply #14 on: December 03, 2012, 07:30:11 pm »
It's no longer in the repos .. there's a probable reason in the title here:
http://projects.gnome.org/gwget/

wget::gui
http://www.martin-achern.de/wgetgui/#download
seems to work .. you'll need to make sure perl and perl-tk are installed.

then just download the .zip for Linux .. unpack it .. and run it from the CLI
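(A sketch of those steps; the archive and script names are illustrative, so check what's actually in the zip:)

Code:
sudo apt-get install perl-tk        # Tk bindings for Perl
unzip wgetgui-linux.zip -d wgetgui  # archive name is illustrative
cd wgetgui && perl wgetgui.pl       # script name may differ in the actual download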

But it's unlikely to help in this case.
WARNING: You are logged into reality as 'root'

logging in as 'insane' is the only safe option.

 

