Basic Internet Principles

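This page reviews the central concepts of software on the Internet. It briefly explains TCP/IP, UDP, IP addresses, domain names, the Domain Name System, clients, servers and peers, port numbers, sockets, streams and URLs.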


TCP/IP

The Internet is the network that connects computers all over the world. It works according to a set of agreed-upon protocols. TCP (Transmission Control Protocol) and IP (Internet Protocol) are the most commonly used of these protocols. (There are others at lower levels.) The combination is known simply as TCP/IP.

The Internet is a packet-switching system. Any message is broken into packets that are transmitted independently across the Internet (sometimes by different routes). These packets are called datagrams. The route chosen for each datagram depends on the traffic at that point in time. Each datagram has a header of between 20 and 60 bytes, followed by a payload of up to 65,515 bytes of data. Amongst other data, the header contains:

  1. The version number of the protocol in use
  2. The IP address of the sender (or source, or origin)
  3. The IP address of the recipient (or destination)
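
The three fields listed above can be seen in a minimal sketch that packs and unpacks a fixed 20-byte IPv4 header with Python's struct module. The destination address, the time-to-live value and the zeroed checksum are assumptions made for illustration; in practice the operating system builds the header.

```python
import socket
import struct

# Layout of the fixed 20-byte IPv4 header, in network (big-endian) byte order:
# version/header-length, type of service, total length, identification,
# flags/fragment offset, time to live, protocol, checksum, source, destination.
IPV4_HEADER = struct.Struct("!BBHHHBBH4s4s")


def build_header(source: str, destination: str, payload_length: int) -> bytes:
    version_and_length = (4 << 4) | 5        # version 4, header of five 32-bit words
    return IPV4_HEADER.pack(
        version_and_length,
        0,                                   # type of service
        IPV4_HEADER.size + payload_length,   # total datagram length
        0,                                   # identification
        0,                                   # flags and fragment offset
        64,                                  # time to live (assumed value)
        socket.IPPROTO_TCP,                  # the payload is a TCP segment
        0,                                   # checksum left at zero in this sketch
        socket.inet_aton(source),
        socket.inet_aton(destination),
    )


def parse_header(header: bytes) -> dict:
    fields = IPV4_HEADER.unpack(header[:IPV4_HEADER.size])
    return {
        "version": fields[0] >> 4,
        "header_bytes": (fields[0] & 0x0F) * 4,
        "total_length": fields[2],
        "source": socket.inet_ntoa(fields[8]),
        "destination": socket.inet_ntoa(fields[9]),
    }


header = build_header("131.123.2.220", "127.0.0.1", payload_length=100)
print(parse_header(header))
```

The version number and the header length share the first byte, which is why the parser shifts and masks it.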

TCP breaks down a message into packets. At the destination, it re-assembles the packets into the message. It attaches a checksum to each packet. If the checksum computed at the destination doesn't match the one attached, the packet is discarded and the sender re-transmits it. Thus TCP ensures reliable transmission of information. In summary, TCP:

  1. Provides re-transmission of lost data
  2. Ensures delivery of data in the correct order
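
The splitting, checking and re-ordering described above can be sketched as follows. This is a simplified illustration, not TCP itself: the packet layout (a packet number, a checksum and a payload) and the function names are hypothetical, and real TCP numbers bytes rather than packets and computes its checksum over the header as well.

```python
import random


def checksum16(data: bytes) -> int:
    """16-bit ones' complement checksum of the kind used by TCP and IP."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of 16-bit words
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def split_message(message: bytes, size: int) -> list:
    """Break a message into (packet number, checksum, payload) packets."""
    packets = []
    for number, start in enumerate(range(0, len(message), size)):
        payload = message[start:start + size]
        packets.append((number, checksum16(payload), payload))
    return packets


def is_intact(packet) -> bool:
    _, checksum, payload = packet
    return checksum16(payload) == checksum


def reassemble(packets) -> bytes:
    """Put intact packets back in order; a damaged one would need re-sending."""
    for packet in packets:
        if not is_intact(packet):
            raise ValueError(f"packet {packet[0]} is damaged, re-transmission needed")
    return b"".join(payload for _, _, payload in sorted(packets))


message = b"The Internet is a packet switching system."
packets = split_message(message, size=8)
random.shuffle(packets)                      # packets may arrive in any order
print(reassemble(packets) == message)
```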

Exercise: Suggest some typical causes of lost or damaged data during transmission.

IP is concerned with routing. IP attaches the destination address to each packet and is responsible for getting packets to the right place.

TCP is the higher-level protocol that uses the lower-level IP.

When an application is written, the general principle is to use the highest-level protocol available, provided that it offers the functionality and performance required. Many applications can be written on top of TCP/IP without ever dealing with it directly. For example, a Web browser can be written in Java using only URLs, without any explicit mention of sockets.
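
The notes mention Java; the same idea is sketched here in Python, as an illustration of working at the URL level only. The function name fetch is hypothetical, and the data: URL is used only so that the example runs without a network connection.

```python
from urllib.request import urlopen


def fetch(url: str) -> bytes:
    """Return the resource named by a URL; no socket appears in this code."""
    with urlopen(url) as response:
        return response.read()


# A data: URL carries its content inside the URL itself, so this runs
# without a network connection. An http: URL would be fetched the same way.
print(fetch("data:text/plain,Hello%20from%20a%20URL"))
```

The library opens and closes whatever connection is needed, so the application never deals with TCP directly.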

On each machine an application program makes calls on procedures in the transport layer (normally TCP). In turn the transport layer makes calls on the Internet layer (normally IP). In turn the Internet layer makes calls on the physical layer, which is different depending on the technology of the communication link.

At the destination machine, information is passed up through the layers to the application program. Each application program acts as if it is communicating directly with the application on another machine. The lower levels of the communication software and hardware are invisible.

This four-layer model is sufficient for understanding Internet software. But there are other models that use a different number of layers, like the ISO seven-layer model.

The application layer produces some data, adds a header to it and passes the complete package to the transport layer. The transport layer adds another header and passes the package to the internet layer. The internet layer adds another header and passes it to the physical layer, which adds its own. The application data is thus enclosed by four headers, one for each layer. This process can be thought of as repeatedly putting a letter into an envelope and then addressing the envelope.
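
The envelope analogy can be sketched in a few lines. The bracketed text headers are purely hypothetical stand-ins; real headers are binary and carry addresses, port numbers and so on.

```python
LAYERS = ["application", "transport", "internet", "physical"]


def encapsulate(data: bytes) -> bytes:
    """Wrap the data in one header per layer, application layer first."""
    for layer in LAYERS:
        data = f"[{layer}]".encode() + data
    return data


def decapsulate(frame: bytes) -> bytes:
    """Strip the headers at the destination, outermost (physical) first."""
    for layer in reversed(LAYERS):
        header = f"[{layer}]".encode()
        if not frame.startswith(header):
            raise ValueError(f"missing {layer} header")
        frame = frame[len(header):]
    return frame


frame = encapsulate(b"hello")
print(frame)                # b'[physical][internet][transport][application]hello'
print(decapsulate(frame))   # b'hello'
```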


UDP

Most applications use TCP. However, an example of a situation in which it is desirable to use a lower-level protocol is the case of audio streaming. If you want to download a sound file, it can take some time, even though it may be compressed. You have to wait (perhaps some considerable time, relatively speaking) for the complete file to download, before it can be played. An alternative is to listen to the sound as it is being downloaded - which is called streaming. One of the most popular technologies is called RealAudio.

RealAudio does not use TCP because of its overhead. The sound file is sent in IP packets using UDP (User Datagram Protocol) instead of TCP. UDP is an unreliable protocol: it doesn't re-send a packet if it is missing or damaged, and it doesn't put packets back into the correct order.

But it is faster than TCP. In this application, losing a few bits of data is better than waiting for the re-transmission of some missing data. The application's major mission is to keep playing the sound without interruption. (In contrast, the main goal of a file transfer program is to transmit the data accurately.)

The same mechanism is used with video streaming.

UDP is a protocol at the same level as TCP, above the level of IP.
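
A minimal sketch of sending a single UDP datagram, assuming both ends run on the local machine. The function name udp_roundtrip and the message are hypothetical. Note that no connection is set up and nothing is acknowledged.

```python
import socket


def udp_roundtrip(message: bytes) -> bytes:
    """Send one datagram to a receiver on the loopback address and return what arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))      # port 0: let the system pick a free port
        receiver.settimeout(2.0)             # a lost datagram is never re-sent
        # No connection is set up; the datagram is simply addressed and sent.
        sender.sendto(message, receiver.getsockname())
        data, _ = receiver.recvfrom(4096)
        return data


print(udp_roundtrip(b"one frame of audio"))
```

If the datagram were lost, the receiver would simply time out; it is up to the application to decide whether that matters.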


IP Addresses

An IP address is a unique address for every host computer in the world. It consists of 4 bytes (32 bits). This is represented in quad notation (or dot notation) as four 8-bit numbers, each in the range 0 to 255, e.g. 131.123.2.220.
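
The relationship between the dotted form and the underlying 32-bit value can be sketched as follows. The function names are hypothetical; Python's standard ipaddress module offers the same conversion.

```python
def quad_to_int(address: str) -> int:
    """Combine the four 8-bit numbers of a dotted address into one 32-bit value."""
    parts = [int(part) for part in address.split(".")]
    if len(parts) != 4 or any(not 0 <= part <= 255 for part in parts):
        raise ValueError(f"not a valid dotted address: {address}")
    value = 0
    for part in parts:
        value = (value << 8) | part
    return value


def int_to_quad(value: int) -> str:
    """Split a 32-bit value back into four 8-bit numbers."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


number = quad_to_int("131.123.2.220")
print(hex(number))           # 0x837b02dc
print(int_to_quad(number))   # 131.123.2.220
```

Each pair of hexadecimal digits corresponds to one of the four numbers: 83 is 131, 7b is 123, 02 is 2 and dc is 220.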

IP addresses are registered so that they stay unique.

You can find the IP address of the local machine under Windows NT by typing the following command at the DOS prompt in a console window:

ipconfig

Under Unix or Linux, this command is:

ifconfig

Exercise: Type these commands at the Windows DOS command prompt and/or the Unix/Linux prompt.

The IP address 127.0.0.1 is a special address, called the local loopback address, that denotes the local machine. A message sent to this address will simply return to the sender, without leaving the machine. It is useful for testing purposes.


Domain Names

A domain name is the user-friendly equivalent of an IP address. It is used because the numbers in an IP address are hard to remember and use. It is also known as a host name.

Example:

cs.stmarys.ca

Such a name starts with the most local part and ends with the most general. The whole name space is a tree, whose root has no name. The first level in the tree is something like com, org, edu, ca, etc.

The parts of a domain name don't correspond to the parts of an IP address. Indeed, domain names don't always have 4 parts - they can have 2, 5 or whatever.
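
As a small illustration of the tree structure, the parts of a name can be listed in the order in which they are met when walking down from the root. The function name is hypothetical.

```python
def labels_from_root(domain_name: str) -> list:
    """List the parts of a domain name from the most general to the most local."""
    return list(reversed(domain_name.split(".")))


print(labels_from_root("cs.stmarys.ca"))   # ['ca', 'stmarys', 'cs']
```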

All applications that use an address should work whether an IP address or a domain name is used. In fact, a domain name is converted to an IP address before it is used.

Exercise: Compare and contrast IP addresses with domain names.


The Domain Name System

A program, say a Web browser, that wants to use a domain name usually needs to convert it into an IP address before making contact with the server. The domain name system (DNS) provides a mapping between IP addresses and domain names. All this information cannot be located in one place, so it is held in a distributed database.
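
A program rarely queries the DNS itself; it asks the system's resolver. The following is a minimal sketch using Python's socket.gethostbyname. The name localhost is chosen as an assumption that works without a network connection on most machines, and the lookup is wrapped so that a failure is reported rather than fatal.

```python
import socket


def resolve(host: str) -> str:
    """Return the IPv4 address for a domain name or a dotted address."""
    return socket.gethostbyname(host)


# A dotted address comes back unchanged; a name is looked up first.
for host in ("127.0.0.1", "localhost"):
    try:
        print(host, "->", resolve(host))
    except OSError as error:
        print(host, "could not be resolved:", error)
```

Because the same call accepts either form, an application built on it works whether it is given an IP address or a domain name.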


Clients, Servers and Peers

A network application usually involves a client and a server. Each is a process (an independently running program) running on a (different) computer.

A server runs on a host and provides some particular service, e.g. e-mail, or access to local Web pages. Thus a Web server is a server. A commonly-used web server program is called Apache.

A client runs on a host, but generally needs to connect with a server on another host to accomplish its task. Usually, different clients are used for different tasks, e.g. Web browsing and e-mail. Thus a Web browser is a client.

Some programs are not structured as clients and servers. For example, a game played across the Internet by two or more players is a peer-to-peer relationship. Other examples of peer-to-peer relationships are chat, Internet phone and shared whiteboard.


Port Numbers

To identify a host machine, an IP address or a domain name is needed. To identify a particular server on a host, a port number is used. A port is like a logical connection to a machine. Port numbers can take values from 1 to 65,535. A port number does not correspond to any physical connection on the machine, of which there might be just one. Each type of service has, by convention, a standard port number. Thus 80 usually means Web Serving and 21 means File Transfer. If the default port number is used, it can be omitted in the URL (see below). For each port supplying a service there is a server program waiting for any requests. Thus a web server program "listens on port 80" for any incoming requests. All these server programs run together in parallel on the host machine.

When a packet of information is received by a host, the port number is examined and the packet sent to the program responsible for that port. Thus the different types of request are distinguished and dispatched to the relevant program.
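
The dispatching step can be pictured as a lookup from port number to program. This is only a sketch of the idea: in reality the operating system does the dispatching, and the two service functions here are hypothetical stand-ins for real server programs.

```python
import time


def echo_service(data: bytes) -> bytes:
    """Send back whatever was received."""
    return data


def daytime_service(data: bytes) -> bytes:
    """Ignore the input and reply with the current date and time as ASCII text."""
    return time.ctime().encode("ascii")


# Conventional port numbers mapped to the program responsible for each.
SERVICES = {7: echo_service, 13: daytime_service}


def dispatch(port: int, data: bytes) -> bytes:
    """Hand incoming data to the service registered for the port."""
    service = SERVICES.get(port)
    if service is None:
        raise LookupError(f"no service is listening on port {port}")
    return service(data)


print(dispatch(7, b"hello"))
print(dispatch(13, b""))
```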

The following table lists the common services, together with their normal port numbers. These conventional port numbers are sometimes not used, for a variety of reasons. One example is when a host provides (say) multiple web servers, so only one can be on port 80. Another reason might be that the server program has not been assigned the necessary privilege to use port 80.

Protocol Name   Port Number   Nature of Service
-------------   -----------   -----------------
echo            7             The server simply echoes the data sent to it. This is useful for testing purposes.
daytime         13            Provides the ASCII representation of the current date and time on the server.
ftp-data        20            Transferring files. (ftp uses two ports)
ftp             21            Sending ftp commands like RETR and STOR.
telnet          23            Remote login and command line interaction.
smtp            25            E-mail (Simple Mail Transfer Protocol)
http            80            Web
nntp            119           Usenet (Network News Transfer Protocol)

Some of these protocols are described later in these notes.


Exercise: Although the main use of telnet is remote login, it can be used as a general-purpose client that simply sends text (character by character) to a port on a server and then displays any reply. Thus it can be used to simulate any text-based protocol, such as daytime, echo, HTTP, SMTP, or FTP. Use a telnet program to enhance your understanding of ports, investigate the services provided by various servers, and understand something of high-level protocols.

For example, try the daytime protocol on two different servers by entering each of the following commands (and just to keep your life really exciting, do it from both Linux and Windows):

telnet ug.cs.dal.ca 13
telnet cs.smu.ca 13

If you know a server in some other part of the world, try it as well, and see whether the time difference makes sense.

Next, continue with the echo protocol, which simply echoes whatever you type in. This is accessible by port number 7, if provided by the server. Once again, to see the differences that may occur, it is useful to try this on different systems. For example, try

telnet ug.cs.dal.ca 7
telnet cs.smu.ca 7

Again, try it from both Windows and Linux. On Windows you will probably have to set your telnet program to LOCAL_ECHO to get any response.

If you know a server on the other side of the world, try it as well to see if the echo takes any longer.


Sockets

A socket is the software mechanism for one program to connect to another. A pair of programs open a socket connection between themselves. This then acts like a telephone connection - they can converse in both directions for as long as the connection is open. (In fact, data can flow in both directions at the same time.) More than one socket can use any particular port. The network software ensures that data is routed to or from the correct socket.

When a server (on a particular port number) gets an initial request, it often spawns a separate thread to deal with the client. This is because different clients may well run at different speeds. Having one thread per client means that the different speeds can be accommodated. The new thread creates a (software) socket to use as the connection to the client. Thus one port may be associated with many sockets.
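
The thread-per-client arrangement can be sketched with Python's socketserver module, here as a small echo server on the loopback address. The class and function names are hypothetical, and port 0 is used so that the system picks a free port rather than the conventional port 7, which normally needs special privilege.

```python
import socket
import socketserver
import threading


class EchoHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # self.request is the socket created for this one client.
        while True:
            data = self.request.recv(1024)
            if not data:          # the client closed the connection
                break
            self.request.sendall(data)


def start_echo_server():
    """Start an echo server on a free loopback port; each client gets its own thread."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), EchoHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def echo_once(port: int, message: bytes) -> bytes:
    """Connect as a client, send a message and collect the echoed reply."""
    with socket.create_connection(("127.0.0.1", port), timeout=2.0) as client:
        client.sendall(message)
        received = b""
        while len(received) < len(message):
            chunk = client.recv(1024)
            if not chunk:
                break
            received += chunk
        return received


server = start_echo_server()
port = server.server_address[1]
print(echo_once(port, b"first client"))
print(echo_once(port, b"second client"))
server.shutdown()
server.server_close()
```

Both clients connect to the same port number, yet each conversation takes place over its own socket.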


Streams

Accessing information across the Internet is accomplished using streams. A stream is a serial collection of data. An output stream can be sent to a printer, a display, a serial file, or an Internet connection, for example. Likewise, an input stream can come from a keyboard, a serial file, or an Internet connection. Thus reading from or writing to another program across a network or the Internet is just like reading from or writing to a serial file.
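
To illustrate that the same reading code serves both cases, the sketch below reads lines first from an in-memory stream standing in for a serial file, and then from a socket wrapped as a stream. The function name read_lines is hypothetical, and socket.socketpair is used as an assumption so that both ends of the connection live in one program.

```python
import io
import socket


def read_lines(stream) -> list:
    """Read text lines from any binary input stream."""
    return [line.decode("ascii").rstrip("\r\n") for line in stream]


# A stream backed by memory, standing in for a serial file.
file_like = io.BytesIO(b"first line\nsecond line\n")

# A stream backed by a socket connection.
left, right = socket.socketpair()
left.sendall(b"first line\nsecond line\n")
left.close()                       # closing marks the end of the stream
with right, right.makefile("rb") as network_stream:
    network_lines = read_lines(network_stream)

print(read_lines(file_like) == network_lines)
```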


URL

A URL (Uniform Resource Locator) has this structure:

protocol://hostname[:port]/[pathname/][filename][#section]

Square brackets indicate that an item can be omitted.

The first part of a URL is the particular protocol. Some commonly-used protocols are:

  http     The service is the Web. The file is accessed using the HTTP protocol.
  ftp      The service is file transfer protocol. The URL locates a file, a directory or an FTP server.
  telnet   The service is remote login to a host. No file name is needed.
  mailto   The service is e-mail.
  news     The URL specifies a usenet newsgroup.
  file     This locates a file on the local system. The server part of the URL is omitted.

The host name is the name of the server that provides the service. This can either be a domain name or an IP address.

The port number is only needed when the server does not use the default port number. For example, 80 is the default port number for HTTP.

A pathname (optional) specifies a directory (folder). The pathname is not the complete directory name, but is relative to some directory (folder) designated by the administrator as the directory in which publicly-accessible files are held. It would be unusual for a server to make available its entire file system to clients.

The file name can either be a data file name or can specify an executable file that produces a valid HTML document as its output. A file name is often omitted. In this case, the server decides which file to use. Many servers send a default file from the directory specified in the path name - for example a file called default.html, index.html or welcome.html.

The section part of a URL (optional) specifies a named anchor in an HTML document. Such a place in a document is specified by an HTML entry like:

<a name="thisplace"></a>

which would be referred to by using thisplace as the section in the URL, that is, by ending the URL with #thisplace.
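
The parts of a URL can be pulled apart with Python's urllib.parse module, as in this minimal sketch. The function name url_parts and the path notes/internet.html are hypothetical, and the table of default ports lists only a few of the conventional values.

```python
import posixpath
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "ftp": 21, "telnet": 23}


def url_parts(url: str) -> dict:
    """Split a URL into protocol, host name, port, path name, file name and section."""
    pieces = urlsplit(url)
    pathname, filename = posixpath.split(pieces.path)
    return {
        "protocol": pieces.scheme,
        "hostname": pieces.hostname,
        # Fall back to the conventional port when the URL omits it.
        "port": pieces.port or DEFAULT_PORTS.get(pieces.scheme),
        "pathname": pathname,
        "filename": filename,
        "section": pieces.fragment,
    }


print(url_parts("http://cs.stmarys.ca/notes/internet.html#thisplace"))
```

When the URL ends in a slash, the file name comes back empty, which is the case in which the server chooses a default file.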


Exercise: Distinguish between a domain name, a host name, a URL, a path name and an e-mail address.