The HTTP Protocol

This is the protocol used to retrieve Web pages from a server (normally on port 80) and also to send information obtained from a form back to a server. It is perhaps the most complex of the protocols mentioned in these notes. Like the other protocols, it can be explored by using Telnet as a primitive web browser, sending and receiving information according to the protocol.

The HTTP protocol works like this:

Connection between client and server: The client opens a TCP connection to the server, on port 80 unless some other port is specified.
Client request: The client sends a request to the server asking for a page at a specified location. A typical request might be:
```
GET /index.html HTTP/1.0
```
This gives only the path name, since the machine name is already implied by the connection.
Server response: The server sends its reply back to the client, beginning with header lines of ASCII text. The first line is typically something like this:
```
HTTP/1.0 200 OK
```
Closing the connection: The connection is closed by the client, the server, or both.

In HTTP/1.0 a separate connection is used for each request. HTTP/1.1 allows a connection to be kept open and reused for several requests.
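The four steps above can be sketched in Python with the standard socket module. This is a minimal illustration, not a full client: the function names build_request and fetch are made up for this example, and it assumes an HTTP/1.0 style exchange in which the server closes the connection after replying. Lines are ended with CR LF, as HTTP requires.

```python
import socket


def build_request(path, version="HTTP/1.0"):
    """Return a minimal GET request: the request line followed by an empty line."""
    return f"GET {path} {version}\r\n\r\n".encode("ascii")


def fetch(host, path="/", port=80, timeout=10.0):
    """Send one GET request and return everything the server sends back, as bytes."""
    # Step 1: the client makes a TCP connection with the server.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Step 2: the client sends its request.
        sock.sendall(build_request(path))
        # Step 3: read the response until recv() returns b"",
        # which means the server has closed its side of the connection.
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    # Step 4: leaving the "with" block closes the client's side.
    return b"".join(chunks)
```

Calling something like fetch("example.com", "/index.html") needs network access. The host name here is only a placeholder.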

The response code 200 OK is the most common response, signaling that the request was successful. There are many response codes, grouped as shown below:

Response Code Grouping	General Meaning
200 - 299	success
300 - 399	redirection (the web browser usually needs to go to another page)
400 - 499	client error
500 - 599	server error
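As a small illustration of the grouping, the table can be written as a lookup on the numeric code. The function name status_group is hypothetical, and the labels are short paraphrases of the table entries.

```python
def status_group(code):
    """Return the general meaning of the group a response code belongs to."""
    if 200 <= code <= 299:
        return "success"
    if 300 <= code <= 399:
        return "redirection"
    if 400 <= code <= 499:
        return "client error"
    if 500 <= code <= 599:
        return "server error"
    # Codes outside 200-599 are not covered by the table above.
    return "not in table"


for code in (200, 301, 404, 503):
    print(code, "->", status_group(code))
```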

Some common particular response codes are:

Response Code	Meaning
200 OK	The request was successful.
301 Moved Permanently	The page has moved to a new URL.
304 Not Modified	The client asked for the page only if it has changed since the client's copy was made, and the page has not changed, so it is not sent again.
400 Bad Request	The request has faulty syntax.
401 Unauthorized	Authorization is needed to access this page. Either the authorization is wrong or it has not been supplied.
404 Not Found	The server cannot find the page. This is a common error.
503 Service Unavailable	The server is temporarily unable to handle the request, perhaps due to maintenance or overloading.
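Python's standard library carries the same code-to-phrase mapping in http.HTTPStatus, which gives a quick way to look up a code that is not in the table. The helper name reason_phrase is hypothetical; HTTPStatus itself is part of the standard library.

```python
from http import HTTPStatus


def reason_phrase(code):
    """Return the standard phrase for a response code, or "Unknown" if there is none."""
    try:
        return HTTPStatus(code).phrase
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not define.
        return "Unknown"


for code in (200, 301, 304, 400, 401, 404, 503):
    print(code, reason_phrase(code))
```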

A typical request might look like this:

GET /index.html HTTP/1.0
Accept: text/html
Accept: image/gif
User-Agent: Lynx/2.4

This is a sequence of ASCII lines terminated by an empty line. As we have seen, the second item on the first line is the path name. It is followed by the version of the HTTP protocol that the client understands. In HTTP/1.0 this first line is all that is required (HTTP/1.1 also requires a Host line naming the server). However, other information can be provided by the client. Each piece of information is on a separate line and takes the form:

keyword: value

For example, the line

Accept: text/html

says that the client can accept HTML documents, while the line

Accept: image/gif

says that the client can accept images in the Graphics Interchange Format (one of the very common image file formats used on the web). This kind of information allows the server to tailor its responses to what the client is able to process. The client can also say which web browser and version it is, as in

User-Agent: Lynx/2.4
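To make the structure of a request concrete, here is a sketch that splits the typical request into its first line and its keyword: value pairs. The function name parse_request is hypothetical. The pairs are kept in a list rather than a dictionary because a keyword such as Accept may appear more than once.

```python
def parse_request(text):
    """Return ((method, path, version), [(keyword, value), ...]) for a request."""
    lines = text.splitlines()
    method, path, version = lines[0].split()
    headers = []
    for line in lines[1:]:
        if line == "":
            # The empty line marks the end of the request.
            break
        keyword, _, value = line.partition(":")
        headers.append((keyword.strip(), value.strip()))
    return (method, path, version), headers


request = (
    "GET /index.html HTTP/1.0\r\n"
    "Accept: text/html\r\n"
    "Accept: image/gif\r\n"
    "User-Agent: Lynx/2.4\r\n"
    "\r\n"
)
request_line, headers = parse_request(request)
print(request_line)
for keyword, value in headers:
    print(keyword, "=", value)
```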

There are also other request types in addition to GET. For example, HEAD retrieves only the header lines of the response, not the file itself, so that the browser can see whether the file has been updated since it last retrieved a copy, and POST is used in conjunction with forms and CGI (the Common Gateway Interface).
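The sketch below shows roughly what HEAD and POST requests look like on the wire. It is an illustration under assumptions: the function names, the path /cgi-bin/form and the form field names are invented for the example. A POST differs from a GET in that the form data follows the empty line, with a Content-Length line saying how many bytes to expect.

```python
from urllib.parse import urlencode


def build_head(path):
    """A HEAD request looks like a GET; only the method name differs."""
    return f"HEAD {path} HTTP/1.0\r\n\r\n".encode("ascii")


def build_post(path, fields):
    """Build a POST request carrying form fields in the request body."""
    # urlencode turns {"a": "b c"} into "a=b+c", the usual form encoding.
    body = urlencode(fields).encode("ascii")
    head = (
        f"POST {path} HTTP/1.0\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


# Hypothetical path and form fields, for illustration only.
post = build_post("/cgi-bin/form", {"name": "Ada Lovelace", "course": "networks"})
print(build_head("/index.html").decode("ascii"))
print(post.decode("ascii"))
```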

A typical response consists of a number of header lines, followed by an empty line, followed by the contents of the file - usually in the form of HTML. For example, we might get this:

HTTP/1.1 200 OK
Date: Mon, 12 Jul 1999 12:42:22 GMT
Server: Apache/1.3.6 (Unix)
Last-Modified: Wed, 07 Jul 1999 17:14:42 GMT
ETag: "fcdd-17e-37838b02"
Accept-Ranges: bytes
Content-Length: 382
Connection: close
Content-Type: text/html

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
etc.

The first line gives the HTTP version number and a response code (see above). The third line gives the name and version number of the server program. The last line of the header specifies the type of content being returned.
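A response can be taken apart in the same way: everything before the first empty line is header, everything after it is the file contents. The sketch below assumes the whole response is already in memory as bytes. The name parse_response is hypothetical, and the sample is an abridged copy of the response shown above with a shortened body.

```python
def parse_response(raw):
    """Return (version, code, reason, headers, body) for a raw HTTP response."""
    # The first empty line (CR LF CR LF) separates the header from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # First line: HTTP version, numeric response code, then the phrase.
    version, code, reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Split at the first colon only; values such as dates contain colons.
        keyword, _, value = line.partition(":")
        headers[keyword.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Date: Mon, 12 Jul 1999 12:42:22 GMT\r\n"
    b"Server: Apache/1.3.6 (Unix)\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<HTML>\n"
)
version, code, reason, headers, body = parse_response(sample)
print(version, code, reason)
print(headers["server"], "|", headers["content-type"])
```

Header keywords are lower-cased here because they are not case-sensitive, which makes lookups such as headers["content-type"] predictable.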

Exercise: Begin by trying to replicate the session shown below, and then try the same thing with another server or two. Note that you only have to enter the telnet command, the line beginning with GET, and the blank line immediately following the GET line. Use upper case as shown. GET is only one of several HTTP commands. You might choose to request some file other than the one shown (the index file). Study the reply carefully. The meaning of any response code you receive may be found in the tables given above.

If the web page you are retrieving is a long one, it may be difficult to display if your Telnet program does not allow its window to be scrolled. In that case you might want to switch on logging before connecting. Afterwards you can display the dialogue using any editor. You should see several header lines, followed by the "raw" HTML of the Web page.

$ telnet cs.smu.ca 80
Trying 140.184.76.9...
Connected to cs.stmarys.ca.
Escape character is '^]'.
GET /index.html HTTP/1.0

HTTP/1.1 200 OK
Date: Fri, 21 Mar 2003 15:34:52 GMT
Server: Apache/1.3.26 (Unix) Debian GNU/Linux PHP/4.1.2
Connection: close
Content-Type: text/html; charset=iso-8859-1

<html>
<head>
   <meta http-equiv="Refresh"
   content="1;URL=http://www.stmarys.ca/academic/science/compsci">
</head>
<body>
If you are not automatically forwarded click
<a href="http://www.stmarys.ca/academic/science/compsci">here.</a>
</body>
</html>
Connection closed by foreign host.