Author Topic: File leacher  (Read 2268 times)


Offline pooky2483

  • Hero Member
  • *****
  • Posts: 1620
  • Karma: 0
  • Gender: Male
  • Slowly getting the hang of it.
    • View Profile
    • Get your FREE Ubuntu stickers here. I'm the UK address
    • Awards
Re: File leacher
« Reply #15 on: December 03, 2012, 07:43:57 pm »
Looks like I'm at a dead end then. Back to manually downloading them...

Kubuntu 12.04LTS 64bit|KDE 4.13.2|QT 4.8.6|Linux 3.2.0-68-generic|M3A76-CM|BIOS 2101|AMD PhenomII X4 965 3400+|Realtek RTL8168C(P)|8111C(P) PCI-E Gigabit Ethernet NIC|NVIDIA 128MB GeForce6200 Turbocache|8.0GB Single-Channel DDR2|

Offline SeZo

  • Hero Member
  • *****
  • Posts: 1529
  • Karma: 121
  • Gender: Male
    • View Profile
    • Awards
Re: File leacher
« Reply #16 on: December 03, 2012, 08:57:52 pm »
Quote
Looks like I'm at a dead end then. Back to manually downloading them...

Wget follows the breadcrumbs (links) in the pages it comes across.

I guess if the pages are loaded by script, then you are outta luck.
However if you would start with a known subsection which contains the links to the pdfs then you might get somewhere.
Code:
wget --random-wait --limit-rate=20k -r --no-parent -l10 -A.pdf http://website.com/subfolder
That should start from the specified URL (ignoring the directories above it) and go up to 10 levels deep.
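For reference, here is the same command broken down flag by flag. It is built up as a string and printed rather than executed, since http://website.com/subfolder is just the placeholder URL from the post:

```shell
#!/bin/sh
# Sketch: the wget invocation from the post, one flag per line with notes.
cmd="wget"
cmd="$cmd --random-wait"     # vary the pause between requests
cmd="$cmd --limit-rate=20k"  # throttle downloads to roughly 20 KB/s
cmd="$cmd -r"                # recurse through links found on each page
cmd="$cmd --no-parent"       # never ascend above the starting directory
cmd="$cmd -l10"              # follow links at most 10 levels deep
cmd="$cmd -A.pdf"            # accept (keep) only names ending in .pdf
cmd="$cmd http://website.com/subfolder"
echo "$cmd"                  # print the assembled command instead of running it
```

The --random-wait and --limit-rate throttling matter here: they make the crawl look less like a bulk scrape, which is exactly the behaviour site owners object to.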

Like Mark said, they might just tell you to sod off and do the downloading like everyone else does.

Offline Mark Greaves (PCNetSpec)

  • Administrator
  • Hero Member
  • *****
  • Posts: 14048
  • Karma: 348
  • Gender: Male
  • "-rw-rw-rw-" .. The Number Of The Beast
    • View Profile
    • PCNetSpec
    • Awards
Re: File leacher
« Reply #17 on: December 03, 2012, 09:15:56 pm »
I had a "quick" look at the site in question, and the pdf always seems to get loaded from the same link, which "becomes" a link to the selected pdf .. so I'm guessing it's all scripted.

Nor will it let you "browse" the directory it says the pdf comes from.

That said, playing with wget::gui and trying a few random options seemed to download quite a few directories .. but no pdf's, just a bunch of xml files, even though I'd told it to get pdf's
(maybe I just didn't give it enough time to follow all the links).

I'd also guess there will be gigabytes of pdf's on that site .. which would explain them not wanting 1000's of people all downloading the lot in one go .. they'd lock up the site for hours.
WARNING: You are logged into reality as 'root'

logging in as 'insane' is the only safe option.

Offline pooky2483

  • Hero Member
  • *****
  • Posts: 1620
  • Karma: 0
  • Gender: Male
  • Slowly getting the hang of it.
    • View Profile
    • Get your FREE Ubuntu stickers here. I'm the UK address
    • Awards
Re: File leacher
« Reply #18 on: December 11, 2012, 06:07:44 pm »
Just thought it'd be worth a try to get the PDFs in one go, but if it can't be done, never mind.
Thanks anyway for trying.


Offline pooky2483

  • Hero Member
  • *****
  • Posts: 1620
  • Karma: 0
  • Gender: Male
  • Slowly getting the hang of it.
    • View Profile
    • Get your FREE Ubuntu stickers here. I'm the UK address
    • Awards
Re: File leacher
« Reply #19 on: December 28, 2013, 11:10:37 pm »
I'm trying 'wget' again and I'm having problems getting it to download picture files. I'm getting conflicting info regarding what to enter, as some say to type '-A jpg,jpeg,gif', others '-A .jpg,.jpeg,.gif' or '-A.jpg,.jpeg,.gif'.

It is downloading them but then deleting them!

bash-4.2$ wget -r -e robots=off --limit-rate=50k --wait=8 --no-parent -l50 -Ajpg,jpeg,gif http://www.crayola.co.uk/free-coloring-pages/
--2013-12-28 22:43:02--  http://www.crayola.co.uk/free-coloring-pages/
Resolving www.crayola.co.uk (www.crayola.co.uk)... 198.78.218.126
Connecting to www.crayola.co.uk (www.crayola.co.uk)|198.78.218.126|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 106758 (104K) [text/html]
Saving to: `www.crayola.co.uk/free-coloring-pages/index.html'

100%[===================================================================================================================================================>] 106,758     50.0K/s   in 2.1s   

2013-12-28 22:43:06 (50.0 KB/s) - `www.crayola.co.uk/free-coloring-pages/index.html' saved [106758/106758]

Removing www.crayola.co.uk/free-coloring-pages/index.html since it should be rejected.

--2013-12-28 22:43:14--  http://www.crayola.co.uk/free-coloring-pages/~/media/Crayola/Doodads/doodadbee.jpg?mh=1600&mw=150
Connecting to www.crayola.co.uk (www.crayola.co.uk)|198.78.218.126|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16928 (17K) [image/jpeg]
Saving to: `www.crayola.co.uk/free-coloring-pages/~/media/Crayola/Doodads/doodadbee.jpg?mh=1600&mw=150'

100%[===================================================================================================================================================>] 16,928      75.1K/s   in 0.2s   

2013-12-28 22:43:15 (75.1 KB/s) - `www.crayola.co.uk/free-coloring-pages/~/media/Crayola/Doodads/doodadbee.jpg?mh=1600&mw=150' saved [16928/16928]

Removing www.crayola.co.uk/free-coloring-pages/~/media/Crayola/Doodads/doodadbee.jpg?mh=1600&mw=150 since it should be rejected.

^Z
[8]+  Stopped                 wget -r -e robots=off --limit-rate=50k --wait=8 --no-parent -l50 -Ajpg,jpeg,gif http://www.crayola.co.uk/free-coloring-pages/
bash-4.2$
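On the three spellings in the post above: wget splits the -A list on commas and, when an entry contains no wildcard characters, treats it as a filename suffix, so 'jpg', '.jpg' and the no-space '-Ajpg' form all end up accepting the same files. A rough sketch of that suffix test (not wget's actual code, just the matching rule), assuming plain filenames with no query string:

```shell
#!/bin/sh
# Rough sketch of wget's -A behaviour for entries without wildcards:
# the file is accepted when its name ends with the given suffix.
accepts() {  # accepts SUFFIX FILENAME -> exit 0 if the suffix matches
  case "$2" in
    *"$1") return 0 ;;
    *)     return 1 ;;
  esac
}
accepts jpg  doodadbee.jpg && echo "doodadbee.jpg kept by 'jpg'"
accepts .jpg doodadbee.jpg && echo "doodadbee.jpg kept by '.jpg' too"
accepts jpg  index.html    || echo "index.html rejected, hence 'Removing ...'"
```

So the "Removing ... index.html since it should be rejected" lines in the log are normal: wget has to fetch the HTML pages to find links, then deletes them because they fail the accept list.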


Offline SeZo

  • Hero Member
  • *****
  • Posts: 1529
  • Karma: 121
  • Gender: Male
    • View Profile
    • Awards
Re: File leacher
« Reply #20 on: December 29, 2013, 12:02:32 am »
Quote
I am getting conflicting info regarding what to enter as some say  type '-A jpg,jpeg,gif' or '-A .jpg,.jpeg,.gif' or '-A.jpg,.jpeg,.gif'

Try adding this:
Code:
-A gif,jpg
(see here)

Offline pooky2483

  • Hero Member
  • *****
  • Posts: 1620
  • Karma: 0
  • Gender: Male
  • Slowly getting the hang of it.
    • View Profile
    • Get your FREE Ubuntu stickers here. I'm the UK address
    • Awards
Re: File leacher
« Reply #21 on: December 29, 2013, 03:39:03 am »
lol... that happens to be one of the sites I looked at.


Offline SeZo

  • Hero Member
  • *****
  • Posts: 1529
  • Karma: 121
  • Gender: Male
    • View Profile
    • Awards
Re: File leacher
« Reply #22 on: December 29, 2013, 04:58:42 pm »
OK, I think the problem lies in the filenames (wget does not recognise doodadbee.jpg?mh=1600&mw=150 as a jpg) as it seems to be a redirect.
Try adding as an option:
Code:
wget --trust-server-names

Quote
--trust-server-names
    If this is set to on, on a redirect the last component of the redirection URL will be used as the local file name. By default the last component in the original URL is used.
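SeZo's diagnosis can be seen with a small sketch: the saved filename keeps the query string, so it no longer ends in .jpg and fails the -A suffix check, which is why wget deletes it straight after downloading. Stripping everything from the '?' onwards restores the match:

```shell
#!/bin/sh
# Why wget printed "Removing ... since it should be rejected":
# the saved name keeps the query string, so it doesn't end in .jpg.
name='doodadbee.jpg?mh=1600&mw=150'
case "$name" in
  *.jpg) echo "accepted" ;;
  *)     echo "rejected: '$name' does not end in .jpg" ;;
esac
# Dropping the query string (everything from the first '?') fixes the match:
stripped=${name%%\?*}
case "$stripped" in
  *.jpg) echo "accepted after stripping: $stripped" ;;
esac
```

This is only an illustration of the mismatch, not what --trust-server-names itself does; that option changes which URL's final component wget uses for the local filename after a redirect, which can have the same effect of producing a name that ends in the real extension.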
