Wednesday, May 26, 2010

FTP/HTTP File Retrieval Tips

There comes a time when you want to download a directory or a set of files from a site, be it an FTP or a web server. You may want to retrieve files with a specific name or string of characters in the name. Perhaps you need to download an entire directory tree. Here are some tips and tricks that might help you automate this process. In this exercise we are going to look at three tools you can use to download files and directory trees: NcFTP, Wget and Curl. Each program has its own strengths and weaknesses, so take a look at the examples and find the one that best suits your needs.




The NcFTP suite (FTP retrieval)

NcFTP debuted in 1990 as the first alternative FTP client program. It was created as an alternative to the standard UNIX ftp program, and offers a number of additional features and greater ease of use. NcFTP is a command-line user interface program, and runs on a large number of platforms. The program provides features such as bookmarks, automatic use of binary transfers, dynamic progress meters, automatic anonymous logins, recursive downloading, resumption of failed downloads, "redialing" unresponsive hosts, and the use of passive transfers. NcFTP is a group of binaries including:
  • NcFTP - FTP client browser with tab completion
  • NcFTPGet - command-line utility to retrieve ftp files
  • NcFTPPut - command-line utility to store files to a server
  • NcFTPLs - command-line utility to get a remote directory listing


Example 1: Sometimes you already know the name of the file, set of files or directory you want to download. Perhaps you want to set up a script to collect those files every week for backups. In that case the following line is especially useful. Ncftpget does not open an interactive ftp shell; it only retrieves files. The following line will...
  • connect to ftp://server_name.com/
  • use passive (PASV) data connections, which are more firewall friendly (-F)
  • use recursive mode; copy whole directory trees if needed (-R)
  • retry three times when trying to connect to the server (-r3)
  • time out if an operation takes more than ten seconds (-t10)
  • download "file_or_directory"
ncftpget -FR -r3 -t10 ftp://server_name.com/file_or_directory
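
For the weekly backup case mentioned above, the same ncftpget line can be wrapped in a small script that stores each run in a dated directory. This is only a sketch: the names REMOTE_URL, BACKUP_ROOT and DRY_RUN, the backup location, and the dry-run switch are assumptions made for illustration and are not part of NcFTP.

```shell
#!/bin/sh
# Sketch of a weekly FTP backup. REMOTE_URL, BACKUP_ROOT and DRY_RUN are
# hypothetical names chosen for this example.
REMOTE_URL="${REMOTE_URL:-ftp://server_name.com/file_or_directory}"
BACKUP_ROOT="${BACKUP_ROOT:-$HOME/ftp_backups}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = really download

# Print the directory for today's backup, e.g. /home/user/ftp_backups/2010-05-26
backup_dir() {
    printf '%s/%s\n' "$BACKUP_ROOT" "$(date +%Y-%m-%d)"
}

# Download REMOTE_URL into today's directory with the flags explained above.
fetch_backup() {
    dest=$(backup_dir)
    if [ "$DRY_RUN" = "1" ]; then
        echo "cd $dest && ncftpget -FR -r3 -t10 $REMOTE_URL"
        return 0
    fi
    mkdir -p "$dest" || return 1
    # With a URL argument ncftpget saves into the current directory,
    # so run it from inside the destination in a subshell.
    ( cd "$dest" && ncftpget -FR -r3 -t10 "$REMOTE_URL" )
}

fetch_backup
```

By default the script only prints what it would run. Setting DRY_RUN=0 in the environment performs the real download, which makes it easy to check the paths before adding the script to cron.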




Wget (FTP and HTTP retrieval)

GNU Wget is a free software program that implements simple and powerful content retrieval from web servers and is part of the GNU Project. Its name is derived from World Wide Web and get, connotative of its primary function. It currently supports downloading via HTTP, HTTPS, and FTP protocols, the most popular TCP/IP-based protocols used for web browsing. Its features include recursive download, conversion of links for offline viewing of local HTML, support for proxies, and much more. It appeared in 1996, coinciding with the boom of popularity of the web, causing its wide use among Unix users and distribution with most major Linux distributions. Written in portable C, Wget can be easily installed on any Unix-like system and has been ported to many environments.
Wget is especially useful because it can retrieve files from both FTP and HTTP servers.


Example 1: General Use : What if you wanted to download a file or directory from an anonymous ftp site? Let's make sure that we resume connections that break and resume files that were incomplete. Also, we do _not_ want to put a heavy load on the server. By limiting our download rate and waiting a second or two before getting the next file, our client will not get banned for server abuse. Additionally, make sure we do _not_ download any file that ends with ".iso".
  • connect to ftp://server_name.com/
  • continue failed downloads (-c)
  • recurse only one level, i.e. the current directory only (-l1)
  • use recursive mode; copy whole directory trees if needed (-r)
  • do not recreate the directory structure, just download the files (-nd)
  • get the files symlinks are pointing to (--retr-symlinks)
  • try each download up to three times (-t3)
  • time out an inactive connection after 30 seconds (-T30)
  • limit the download rate to 50 kilobytes per second (--limit-rate=50k)
  • wait between 1 and 3 seconds before downloading the next file (--wait=2 --random-wait); --random-wait varies the delay between 0.5 and 1.5 times the --wait value, so it has no effect without --wait
  • do not download any file that ends with ".iso" (--reject "*.iso"), or use --accept "*.iso" to get only .iso files
  • download "file_or_directory" (this can be ftp or http, but the trailing /* glob works for ftp only; quote the URL so the shell does not expand it)
wget -c -l1 -r -nd --retr-symlinks -t3 -T30 --limit-rate=50k --wait=2 --random-wait --reject "*.iso" "ftp://server_name.com/file_or_directory/*"
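
If you use these options often, the short flags can be hard to read back later. Below is a sketch of the same settings written with wget's long option names and wrapped in a function. The names polite_wget and DRY_RUN are made up for this example. The --wait=2 option is included because --random-wait only varies an existing wait interval (0.5 to 1.5 times the --wait value).

```shell
#!/bin/sh
# polite_wget and DRY_RUN are hypothetical names used only in this sketch.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command only, 0 = run wget

polite_wget() {
    url=$1
    # Long-option spelling of the flags described above.
    set -- wget \
        --continue \
        --recursive --level=1 \
        --no-directories \
        --retr-symlinks \
        --tries=3 --timeout=30 \
        --limit-rate=50k \
        --wait=2 --random-wait \
        --reject '*.iso' \
        "$url"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"     # show what would be executed
    else
        "$@"          # run wget with exactly these arguments
    fi
}

polite_wget "ftp://server_name.com/file_or_directory/"
```

Run with DRY_RUN=0 to perform the download. Building the argument list with "set --" keeps the pattern '*.iso' and the URL intact even if they contain characters the shell would otherwise expand.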


Example 2: Get all gif pictures : You want to download all the GIFs from a directory on an HTTP server. You tried wget http://www.server.com/dir/*.gif, but that did not work because HTTP retrieval does not support globbing. In that case, use:
wget -r -l1 --no-parent -A.gif http://www.server_name.com/directory/
More verbose, but the effect is the same. -r -l1 means to retrieve recursively, with maximum depth of 1. --no-parent means that references to the parent directory are ignored, and -A.gif means to download only the GIF files. -A "*.gif" would have worked too.


Example 4: Download all files listed on a page : Wget can download all of the files linked from a web page. To do this, have one wget retrieve the HTML file and pipe that output to a second wget process. Also note that most pages use relative links rather than absolute ones. For example, if you look at the HTML source of the page you are getting the files from, a link might look like /scripts/index.html and not http://www.example.org/scripts/index.html. For this reason you have to use the --base= option so the second wget knows which site the relative links belong to. In the second command, -F treats the input as HTML, -r follows the links, -i - reads the input from standard input, and -nd saves the files without recreating directories. It sounds complicated, but take a look at this line and it will make sense. This line will download all of the links listed in http://www.example.org/scripts/ and put them in the current local directory.
wget http://www.example.org/scripts/ -O - | wget --base=http://www.example.org -nd -Fri -
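
To see why --base is needed, here is a simplified illustration of how a relative link is combined with a base URL. The function resolve_link is a hypothetical helper written for this post, not wget's actual code, and it assumes the base is the site root, as in the command above.

```shell
#!/bin/sh
# Simplified illustration of base-URL resolution; resolve_link is a
# hypothetical helper and not part of wget.
resolve_link() {
    base=${1%/}   # drop one trailing slash from the base URL
    link=$2
    case "$link" in
        http://*|https://*|ftp://*)
            printf '%s\n' "$link" ;;            # already absolute, keep as is
        /*)
            printf '%s%s\n' "$base" "$link" ;;  # relative to the site root
        *)
            printf '%s/%s\n' "$base" "$link" ;; # relative to the base itself
    esac
}

resolve_link "http://www.example.org" "/scripts/index.html"
# prints http://www.example.org/scripts/index.html
resolve_link "http://www.example.org" "http://www.example.org/a.gif"
# prints http://www.example.org/a.gif
```

Without a base, the second wget would receive /scripts/index.html on standard input and have no way to tell which server it belongs to.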


Example 5: Backup mirror : If you wish Wget to keep a mirror of a page (or FTP subdirectories), use --mirror (-m), which is the shorthand for -r -l inf -N. We also suggest logging the output of the transfer to a file (mirror_log) for later review. You can put Wget in the crontab file and have it recheck a site each Sunday:
#minute (0-59)
#|   hour (0-23)
#|   |    day of the month (1-31)
#|   |    |   month of the year (1-12 or Jan-Dec)
#|   |    |   |   day of the week (0-6 with 0=Sun or Sun-Sat)
#|   |    |   |   |   commands
#|   |    |   |   |   |
#### mirror web site (Sunday at midnight)
 0   0    *   *   0   wget --mirror http://www.mywebsite.com/ -o /home/user/mirror_log


Example 6: Off-line browsing : Create a five-level-deep mirror image of the web site www.site_name.com with the same directory structure as the original (-r -l5), with only one try per document (-t1), saving the log of the activities to web_log (-o). Also, convert the links in the HTML files to point to local files so you can view the documents off-line (--convert-links), and get all files necessary to display each page (-p):
wget --convert-links -p -r -l5 -t1 http://www.site_name.com/ -o web_log




Curl

cURL is a command-line tool for transferring files with URL syntax, supporting FTP, FTPS, HTTP, HTTPS, TFTP, SCP, SFTP, Telnet, DICT, FILE and LDAP. cURL supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploading, Kerberos, HTTP form-based upload, proxies, cookies, user+password authentication (Basic, Digest, NTLM and Negotiate for HTTP and kerberos4 for FTP), file transfer resume, HTTP proxy tunneling and many other features. cURL is open source/free software distributed under the MIT License. The main purpose and use for cURL is to automate unattended file transfers or sequences of operations. For example, it is a good tool for simulating a user's actions at a web browser. libcurl is the corresponding library/API that users may incorporate into their programs; cURL acts as a stand-alone wrapper to the libcurl library. libcurl is used to provide URL transfer capabilities to numerous applications (open-source as well as proprietary).


Example 1: List files: The following command logs in with a username and password and lists the files in the root directory of the FTP server. The trailing slash tells curl that the URL is a directory.
curl ftp://site_name.com/ --user myname:mypassword


Example 2: Download a file: By default curl writes the downloaded data to standard output. The "-o" argument saves it to the named file instead.
curl ftp://site_name.com/examplefile.zip -o examplefile.zip
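
Curl can also resume broken transfers and limit its bandwidth, much like the Wget example earlier. The sketch below uses curl's long option names; resume_download and DRY_RUN are hypothetical names for this example, and the retry count and rate limit are arbitrary values chosen to match the Wget settings.

```shell
#!/bin/sh
# resume_download and DRY_RUN are hypothetical names used only in this sketch.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command only, 0 = run curl

resume_download() {
    url=$1
    # --continue-at -  : resume from where a previous partial download stopped
    # --retry 3        : retry up to three times on transient errors
    # --limit-rate 50k : cap the transfer speed at 50 kilobytes per second
    # --remote-name    : save under the file name taken from the URL (same as -O)
    set -- curl --continue-at - --retry 3 --limit-rate 50k --remote-name "$url"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

resume_download "ftp://site_name.com/examplefile.zip"
```

With DRY_RUN=0 the file is saved as examplefile.zip in the current directory, and running the same command again after an interruption continues the partial file.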
