UC Berkeley Home IP Web Traces

Description
This dataset consists of 18 days' worth of HTTP traces gathered from the Home IP service offered by UC Berkeley to its students, faculty, and staff. Home IP provides dial-up PPP/SLIP IP connectivity using 2.4 kb/s, 9.6 kb/s, 14.4 kb/s, or 28.8 kb/s wireline modems, or Metricom Ricochet (approximately 20-30 kb/s) wireless modems. These client traces were unobtrusively gathered by a packet sniffing machine placed at the head-end of the Home IP modem bank; the tracing program was a custom module written on top of the Internet Protocol Scanning Engine (IPSE) created by Ian Goldberg. Only traffic destined for port 80 was traced; all non-HTTP protocols and HTTP connections to other ports were excluded from these traces.

The traces contain the following information:

Format
For the sake of storage efficiency, the (gzipped) traces are stored in a binary representation. This archive of tools includes the following code to parse and manipulate the archives:

The showtrace tool will display lines in the following format:

848278028:829593 848278028:893670 848278028:895350 23.240.8.98:1462
207.36.205.194:80 2 8 4294967295 4294967295 835418853 170 844
37 GET 9168504434183313441..gif HTTP/1.0

The interpretation of the client and server header bitfields is as defined in the logparse.h header in the tools code.
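
As a rough illustration, a showtrace record can be split into its whitespace-separated fields. The sketch below is not part of the tools archive: the names Record, parse_time, and parse_line are hypothetical. It assumes that the three-line sample above is a single record wrapped for display, and that the three leading timestamps are in request, first-server-byte, last-server-byte order. The eight integer fields after the server address are kept but deliberately left uninterpreted.

```python
from typing import List, NamedTuple, Optional

UNKNOWN = 0xFFFFFFFF  # showtrace prints unknown values as 4294967295


class Record(NamedTuple):
    request_us: Optional[int]     # first timestamp, in microseconds
    first_byte_us: Optional[int]  # second timestamp (meaning assumed)
    last_byte_us: Optional[int]   # third timestamp (meaning assumed)
    client: str                   # anonymized client address:port
    server: str                   # anonymized server address:port
    numeric_fields: List[int]     # eight integers, left uninterpreted here
    request: str                  # e.g. "GET <hash>..gif HTTP/1.0"


def parse_time(token: str) -> Optional[int]:
    """Convert 'seconds:microseconds' to integer microseconds, or None if unknown."""
    parts = token.split(":")
    sec = int(parts[0])
    usec = int(parts[1]) if len(parts) > 1 else 0
    if sec == UNKNOWN:
        return None
    return sec * 1_000_000 + usec


def parse_line(line: str) -> Record:
    """Split one showtrace record into positional fields."""
    tokens = line.split()
    times = [parse_time(t) for t in tokens[:3]]
    numeric = [int(t) for t in tokens[5:13]]
    request = " ".join(tokens[13:])
    return Record(times[0], times[1], times[2],
                  tokens[3], tokens[4], numeric, request)
```

Keeping timestamps as integer microseconds avoids any floating-point rounding when two times are subtracted.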

The tools code has been tested on both Linux and Solaris. The provided Makefile assumes Solaris - you may have to play with the LIBS definition for other platforms. HPUX is a mess; I didn't even try, but it should be possible to get these tools to work with little effort. If you do, please let me know what you did so that I can make your changes available to the world.

Measurement
The Home IP population gains IP connectivity using PPP or SLIP across their 2.4 kb/s, 9.6 kb/s, 14.4 kb/s, or 28.8 kb/s wireline modem, or their (approximately) 20-30 kb/s wireless Metricom Ricochet modem. There are a total of roughly 600 modems available via the Home IP bank. All traffic from these modems ends up feeding over a single 10 Mb/s shared Ethernet segment, on which we placed a network monitoring computer (a Pentium Pro 200 MHz running Linux 2.0.27). The monitor was running the IPSE user-level packet scanning engine and a custom-written HTTP module that reconstructed HTTP connections from the gathered IP packets on-the-fly and emitted an unanonymized trace file. Each trace file was then anonymized and transmitted to our research workstations for further postprocessing and analysis.

The trace gathering engine was brought down and restarted approximately every 4 hours (for administrative and address-space-growth reasons). This implies that there are two weaknesses in these traces that you should be aware of:

  1. any connection active when the engine was brought down will have a possibly incorrect timestamp for the last byte seen from the server, and a possibly incorrect reported size. We estimate that no more than 150 such entries (out of roughly 90000-100000) are misreported for each 4-hour period.

  2. any connection that was established in the very small time window (about 300 milliseconds) between when the engine was shut down and restarted will not appear in the logs. We estimate that no more than 30 such drops occur for each 4-hour period.

The packet capture tool reported no packet drops. Considering that a Pentium Pro 200MHz was used to capture the traces on a 10 Mb/s Ethernet segment, it is virtually certain that no trace drops besides those mentioned above occurred. There may be periods of uncharacteristically low activity in the traces - these correspond to network outages from Berkeley's ISP, rather than trace failures.

The traces do contain entries for requests that the client issued but that were not completed (because, for instance, the user pressed the STOP button and the TCP connection was shut down before the request completed). Unknown timestamps in the traces contain the value 0xFFFFFFFF (reported by showtrace as 4294967295), and incomplete requests contain header and data length values that report as much header/data as was seen.

The trace data is sorted by completion time (i.e. the time at which the last byte of the server response was seen, or the time at which the connection was dropped). However, because of inaccuracies and apparent time travel in the Linux system clock, some trace entries appear slightly out of order.

All timestamps within the traces are as reported by the gettimeofday() system call, so these timestamps ostensibly have microsecond resolution.
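
If an analysis depends on strict ordering, the slightly out-of-order entries can be located and re-sorted before use. The sketch below is only one way to do that, and it assumes completion times have already been converted to integer microseconds and that none of them are unknown; the function name out_of_order_indices is hypothetical.

```python
from typing import List, Sequence


def out_of_order_indices(completion_us: Sequence[int]) -> List[int]:
    """Return indices whose time is earlier than the largest time seen before them."""
    late = []
    running_max = None
    for i, t in enumerate(completion_us):
        if running_max is not None and t < running_max:
            late.append(i)   # this entry went "back in time"
        else:
            running_max = t  # new high-water mark
    return late


def restore_order(completion_us: Sequence[int]) -> List[int]:
    """Return the times in non-decreasing order (sorted() is stable)."""
    return sorted(completion_us)
```

Counting the flagged indices per trace file gives a quick measure of how much reordering a given file contains.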

Privacy
To maintain the privacy of each individual Home IP user, we have stripped identity information out of the traces through a post-processing phase. Because it is trivial to identify a user based solely on the pages that the user has visited, we were forced to anonymize the URL and destination IP address of each web request as well as the source IP address. All anonymization was done using a keyed MD5 hash of the data (32 bits for client and server IP addresses, 64 bits for URLs). We ourselves do not know the key used to salt the MD5 hash, so don't bother asking us for it. Similarly, don't bother asking us for unanonymized traces.
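
To make the truncation concrete, here is a minimal sketch of a keyed MD5 anonymizer. It is not the code used to produce these traces. How the key was combined with the data, which digest bytes were kept, and the byte order are not stated here, so prefixing the key and keeping the leading digest bytes are assumptions, and the function names are hypothetical.

```python
import hashlib
import socket
import struct


def keyed_md5(key: bytes, data: bytes, nbytes: int) -> int:
    """Hash key+data with MD5 and keep the first nbytes of the digest (assumed scheme)."""
    digest = hashlib.md5(key + data).digest()
    return int.from_bytes(digest[:nbytes], "big")


def anonymize_ip(key: bytes, dotted: str) -> int:
    """Map an IPv4 address to a 32-bit value."""
    return keyed_md5(key, socket.inet_aton(dotted), 4)


def anonymize_url(key: bytes, url: str) -> int:
    """Map a URL to a 64-bit value."""
    return keyed_md5(key, url.encode("utf-8"), 8)


def as_dotted(value: int) -> str:
    """Render a 32-bit value in dotted-quad form, as addresses appear in showtrace output."""
    return socket.inet_ntoa(struct.pack("!I", value))
```

The same input and key always yield the same output, which is what lets repeated requests from one client, or to one URL, be correlated within the traces without revealing the original values.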

In order to preserve some information about the URLs, the post-processed URLs have the following format:

COMMAND URLHASH.[flags][.suffix] [HTTPVERS]

where:

Here are some examples of URLs contained in the traces:
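
For instance, the showtrace sample in the Format section carries the request "GET 9168504434183313441..gif HTTP/1.0": an empty flags field followed by the suffix "gif". A minimal sketch of splitting such a string into its parts follows. The regular expression is an assumption based on the format line above and on that one sample (it assumes the hash is printed in decimal and that flags never contain a period); parse_request is a hypothetical name.

```python
import re
from typing import Optional

# COMMAND URLHASH.[flags][.suffix] [HTTPVERS]
REQUEST_RE = re.compile(
    r"^(?P<command>\S+)\s+"
    r"(?P<urlhash>\d+)\."        # decimal hash, then the mandatory period
    r"(?P<flags>[^.\s]*)"        # optional flags, possibly empty
    r"(?:\.(?P<suffix>\S+))?"    # optional ".suffix"
    r"(?:\s+(?P<version>\S+))?$" # optional HTTP version
)


def parse_request(text: str) -> Optional[dict]:
    """Return the parts of an anonymized request, or None if it does not match."""
    m = REQUEST_RE.match(text.strip())
    if m is None:
        return None
    parts = m.groupdict()
    parts["urlhash"] = int(parts["urlhash"])
    return parts
```

Grouping records by the suffix field is a simple way to break requests down by file type even though the URLs themselves are hashed.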

Privacy was the foremost concern during this trace gathering experiment. UC Berkeley and the CS department consider the privacy of the student body to be paramount, and whenever we had the choice of putting more information in these published logs at the cost of sacrificing the privacy of the traced users, we have invariably chosen to maintain the users' privacy at the cost of losing this information. It is our hope that someday the web protocols and servers will become secure enough to make a tracing effort of the kind we have done impossible.

Acknowledgements

Steven D. Gribble contributed the traces to the ITA. He also maintains the official UC Berkeley page dedicated to this tracing effort.

For inquiries, contact Steve Gribble at gribble@cs.berkeley.edu.

These traces, documentation, and associated trace tools were created by Steve Gribble with the assistance of Armando Fox, Ian Goldberg, Eric Brewer, and Cliff Frost.

Restrictions

Copyright (C) 1996-1997 by the Regents of the University of California.

IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY DERIVATIVES THEREOF, EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, AND THE AUTHORS AND DISTRIBUTORS HAVE NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

You have permission to use and redistribute these traces freely, as long as this Copyright and Disclaimer is distributed unmodified. If you publish any results based on these traces, please send us a copy of this publication (in electronic or print form) and give the following reference or attribution in your publication:

Steven D. Gribble, "UC Berkeley Home IP HTTP Traces", July 1997. Available at http://www.acm.org/sigcomm/ITA/.
Distribution
The web traces have been split into the following 4 files:

We have also made the following small 4-hour snippet of trace data available in case you want to evaluate the traces without downloading such a large data set:

