Wednesday, 6 May 2015

Recursively download all files from a directory listing using wget

This is going to be a quick blog post about wget, and a trick I believe is well worth knowing. From your Linux box you can use wget to recursively download all the files exposed through a directory listing. 

If you have seen something similar to Figure 1, that is what a directory listing looks like. If someone wants to give you access to their files over HTTP, it is a quick and easy way of doing it, but most of the time it is a misconfiguration that makes the hosted files publicly available to unauthorised users. 

Figure 1 - Directory Listing



Let's assume that you want to download all the files from: www.example.com/admin/backup

There might be other directories on the server alongside admin, like admin2, user, old, etc., but you only want to download the files and sub-directories inside the backup directory, recursively. 


This can be done with the following command: 

wget -r -np -nH --cut-dirs=1 -R index.html http://www.example.com/admin/backup/

More specifically, the above command will download all files and sub-directories within backup/, thanks to the switches provided:

           -r -> recurse into sub-directories
          -np -> no-parent: do not ascend to directories above the one specified (backup/)
          -nH -> do not create a local folder named after the hostname (www.example.com)
--cut-dirs=1 -> ignore the first 1 remote directory component(s) when saving (/admin/)
-R index.html -> reject (do not keep) the index.html listing pages
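For example, assuming backup/ contains a (hypothetical) file db.sql and a sub-directory configs/ holding old.conf, the command above would save them locally as:

backup/db.sql
backup/configs/old.conf

instead of www.example.com/admin/backup/db.sql, because -nH drops the hostname folder and --cut-dirs=1 drops the leading /admin/ component.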

Note: If the download fails for any reason midway, you can re-run the command with -c so that wget resumes any partially downloaded files, skips the ones already fully retrieved, and continues downloading what is left:
wget -r -np -nH --cut-dirs=1 -c -R index.html http://www.example.com/admin/backup/

Now, just in case you want to download everything without any restrictions, you can give the command:

wget -m http://www.example.com/admin/backup/

In this case it doesn't matter that you pointed the URL at /admin/backup/. The -m (mirror) switch is shorthand for -r -N -l inf --no-remove-listing, and since -np is no longer there, wget will also follow the "Parent Directory" links in the listings, so it can climb out of /admin/backup/ and end up downloading everything it can reach on www.example.com
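If you want the convenience of -m but still want to stay inside backup/, one option (simply combining the switches already covered above) is to add -np back in:

wget -m -np http://www.example.com/admin/backup/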

---------------------------------------------------------------------------------------------------------------

[TIP] The -e robots=off flag tells wget to ignore restrictions in the robots.txt file
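For example, combined with the recursive command from earlier it becomes:
wget -e robots=off -r -np -nH --cut-dirs=1 -R index.html http://www.example.com/admin/backup/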

[NOTE] Using -O allows you to download a single file and save it under a specific path/name (-X, by contrast, excludes directories). 
wget -O [path/to/save] http://www.example.com/admin/file


---------------------------------------------------------------------------------------------------------------

The above focuses on directory listings. To mirror an actual website, consider:
wget -mpEk <URL>
(-m mirrors the site, -p also downloads page requisites such as images and CSS, -E adds an .html extension to saved pages where appropriate, and -k converts links so the local copy can be browsed offline)

Note: Consider adding --wait=seconds, which pauses between consecutive requests so you are less likely to trigger any rate-limiting or DoS alarms.
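As a rough sketch (the URL and delay are only examples), a polite full-site mirror could look like:

wget -mpEk --wait=2 --random-wait http://www.example.com/

The --random-wait switch varies the delay between requests so the traffic pattern looks less automated.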

Another option is HTTrack:
sudo apt-get install httrack
httrack --ext-depth=1 <URL>


