The traces contain the following information:
The no-cache, keep-alive, cache-control, if-modified-since, and unless client headers.
The no-cache, cache-control, expires, and last-modified server headers.
The if-modified-since, the server expires, and the server last-modified headers, if present.
To display the contents of a trace file, run:
gzcat <tracefile> | showtrace
Read showtrace.c to see how you can use logparse.[ch] to write code that parses and manipulates the traces. All times displayed are as reported by the gettimeofday() system call.
The showtrace tool will display lines in the following format:
848278028:829593 848278028:893670 848278028:895350 23.240.8.98:1462 207.36.205.194:80 2 8 4294967295 4294967295 835418853 170 844 37 GET 9168504434183313441..gif HTTP/1.0
The interpretation of the client and server header bitfields is as defined in the logparse.h header in the tools code.
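As a minimal sketch of consuming this output outside the C tools, the Python below splits one showtrace line into typed fields. The field names are assumptions inferred from the sample line and the surrounding description (three timestamps, client and server address:port, two header bitfields, three header values, header and data lengths, and a URL length); they are not taken from logparse.h, which remains the authoritative definition.

```python
from dataclasses import dataclass
from typing import List, Tuple

SAMPLE = ("848278028:829593 848278028:893670 848278028:895350 "
          "23.240.8.98:1462 207.36.205.194:80 2 8 4294967295 4294967295 "
          "835418853 170 844 37 GET 9168504434183313441..gif HTTP/1.0")


@dataclass
class TraceEntry:
    # All field names are hypothetical labels chosen for this sketch.
    timestamps: List[Tuple[int, int]]  # three (seconds, microseconds) pairs
    client: str                        # "ip:port" as printed
    server: str                        # "ip:port" as printed
    client_flags: int                  # bitfield, see logparse.h
    server_flags: int                  # bitfield, see logparse.h
    header_values: List[int]           # three integers, in printed order
    header_len: int                    # assumed order: header, then data
    data_len: int
    url_len: int
    url: str                           # "COMMAND URLHASH... [HTTPVERS]"


def parse_showtrace_line(line: str) -> TraceEntry:
    # 13 leading whitespace-separated fields, then the request string,
    # which itself contains spaces and is kept whole.
    parts = line.split(None, 13)
    if len(parts) != 14:
        raise ValueError("expected 14 fields, got %d" % len(parts))
    stamps = []
    for p in parts[0:3]:
        sec, usec = p.split(":")
        stamps.append((int(sec), int(usec)))
    return TraceEntry(
        timestamps=stamps,
        client=parts[3],
        server=parts[4],
        client_flags=int(parts[5]),
        server_flags=int(parts[6]),
        header_values=[int(v) for v in parts[7:10]],
        header_len=int(parts[10]),
        data_len=int(parts[11]),
        url_len=int(parts[12]),
        url=parts[13].strip(),
    )


entry = parse_showtrace_line(SAMPLE)
print(entry.client, entry.server, entry.url_len, len(entry.url))
```

For the sample line, the thirteenth field (37) equals the length of the request string that follows it, which is why the sketch treats that field as a URL length.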
The tools code has been tested on both Linux and Solaris. The provided Makefile assumes Solaris - you may have to play with the LIBS definition for other platforms. HPUX is a mess; I didn't even try, but it should be possible to get these tools to work with little effort. If you do, please let me know what you did so that I can make your changes available to the world.
The trace gathering engine was brought down and restarted approximately every 4 hours (for administrative and address-space-growth reasons). This implies that there are two weaknesses in these traces that you should be aware of:
The packet capture tool reported no packet drops. Considering that a Pentium Pro 200MHz was used to capture the traces on a 10 Mb/s Ethernet segment, it is virtually certain that no trace drops besides those mentioned above occurred. There may be periods of uncharacteristically low activity in the traces - these correspond to network outages from Berkeley's ISP, rather than trace failures.
The traces do contain entries for requests that were issued by the client but weren't completed (because, for instance, the user pressed the STOP button and the TCP connection was shut down before the request completed). Unknown timestamps in the traces contain the value 0xFFFFFFFF (reported by showtrace as 4294967295), and incomplete requests contain header and data length values that report as much of the header/data as was seen.
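When post-processing showtrace output, it can help to map the 0xFFFFFFFF sentinel to an explicit "unknown" value so that it is never mistaken for a real time. The helper below is a small sketch; the names UNKNOWN and known_or_none are hypothetical and do not come from the tools code.

```python
UNKNOWN = 0xFFFFFFFF  # printed by showtrace as 4294967295


def known_or_none(value):
    """Return None for the unknown sentinel, otherwise the value itself."""
    return None if value == UNKNOWN else value


# The three header values from the sample showtrace line.
values = [4294967295, 4294967295, 835418853]
print([known_or_none(v) for v in values])
```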
The trace data is sorted by completion time (i.e. the time at which the last byte of the server response was seen, or the time at which the connection was dropped). However, because of inaccuracies and apparent time travel in the Linux system clock, some trace entries appear slightly out of order.
All timestamps within the traces are as reported by the gettimeofday() system call, so these timestamps ostensibly have microsecond resolution.
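The sketch below shows one way to work with these timestamps in Python: convert the seconds:microseconds form printed by showtrace into integer microseconds (which avoids floating-point rounding), then count how many entries complete earlier than the entry before them. The function names are hypothetical, and the assumption that the third printed timestamp is the completion time is inferred from the sample line rather than from logparse.h.

```python
def to_micros(stamp):
    """Convert 'seconds:microseconds' to an integer microsecond count."""
    sec, usec = stamp.split(":")
    return int(sec) * 1_000_000 + int(usec)


def count_out_of_order(completion_stamps):
    """Count entries whose completion time precedes the previous entry's."""
    times = [to_micros(s) for s in completion_stamps]
    return sum(1 for a, b in zip(times, times[1:]) if b < a)


# Elapsed time between the first two timestamps of the sample line.
elapsed = to_micros("848278028:893670") - to_micros("848278028:829593")
print(elapsed, "microseconds")
```

If strictly ordered input is needed, sorting the entries on the converted completion time is a simple remedy, since the text above describes the disorder as slight.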
In order to preserve some information about the URLs, the post-processed URLs have the following format:
COMMAND URLHASH.[flags][.suffix] [HTTPVERS]
where:
COMMAND is one of GET, HEAD, POST, or PUT,
URLHASH is the string representation of the 64-bit MD5 hash of the URL,
flags contains the character q to indicate that a question mark was seen in the URL, and the character c to indicate that the string CGI or cgi was seen in the URL,
suffix is the filename suffix, if present, and
HTTPVERS is the HTTP version field of the HTTP command issued by the client.
Here are some examples of URLs contained in the traces:
GET 8252631242092696791.q.map HTTP/1.0 - the client issued a GET request, the URL contained a question mark, the URL ended in the suffix .map, and HTTP/1.0 was used by the client. An example of a request that may generate this anonymized URL is GET /foo.map?BAR=BAZ HTTP/1.0.
POST 36782605103285618862.c HTTP/1.0 - the client issued a POST, the URL contained the substring CGI or cgi, the URL did not end with a dotted suffix, and HTTP/1.0 was used by the client. An example of a request that may generate this anonymized URL is POST /cgi-bin/foo HTTP/1.0.
GET 103551731373256697..gif HTTP/1.0 - the client issued a GET request, the URL contained neither the substring [CGI|cgi] nor a question mark, the filename ended with the .gif suffix, and HTTP/1.0 was used. An example of a request that may generate this anonymized URL is GET /image.gif HTTP/1.0.
GET 41438582632480924518. HTTP/1.0 - the client issued a GET request, the URL contained neither the substring [CGI|cgi] nor a question mark, the filename didn't end with a dotted suffix, and HTTP/1.0 was used. An example of a request that may generate this anonymized URL is GET /foo HTTP/1.0.
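The format and examples above can be decoded with a short regular expression. The Python below is a sketch under a few assumptions that the text does not state outright: the hash is always a run of decimal digits, the flags field holds only the characters q and c, and the bracketed HTTPVERS may be absent. The hash is kept as a string because two of the example hashes are larger than 2^64 - 1 and would not fit a 64-bit integer. The function and key names are hypothetical.

```python
import re

# URLHASH "." flags, optionally followed by "." suffix
_URL_RE = re.compile(r"^(?P<hash>\d+)\.(?P<flags>[qc]*)(?:\.(?P<suffix>.+))?$")


def parse_anonymized_request(text):
    """Split 'COMMAND URLHASH.[flags][.suffix] [HTTPVERS]' into its parts."""
    fields = text.split()
    if len(fields) not in (2, 3):
        raise ValueError("unexpected request format: %r" % text)
    command, url = fields[0], fields[1]
    version = fields[2] if len(fields) == 3 else None
    match = _URL_RE.match(url)
    if match is None:
        raise ValueError("unexpected URL format: %r" % url)
    flags = match.group("flags")
    return {
        "command": command,
        "hash": match.group("hash"),       # opaque string, not an integer
        "has_query": "q" in flags,         # a question mark was seen
        "has_cgi": "c" in flags,           # CGI or cgi was seen
        "suffix": match.group("suffix"),   # None when there is no suffix
        "version": version,
    }


for example in ("GET 8252631242092696791.q.map HTTP/1.0",
                "POST 36782605103285618862.c HTTP/1.0",
                "GET 103551731373256697..gif HTTP/1.0",
                "GET 41438582632480924518. HTTP/1.0"):
    print(parse_anonymized_request(example))
```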
Privacy was the foremost concern during this trace gathering experiment - UC Berkeley and the CS department consider the privacy of the student body to be paramount, and whenever we had the choice of putting more information in these published logs at the cost of sacrificing the privacy of the traced users, we have invariably chosen to maintain the users' privacy at the cost of losing this information. It is our hope that someday the web protocols and servers will become secure enough to make a tracing effort of the kind we have done impossible.
Acknowledgements
For inquiries, contact Steve Gribble at gribble@cs.berkeley.edu.
These traces, documentation, and associated trace tools were created by Steve Gribble with the assistance of Armando Fox, Ian Goldberg, Eric Brewer, and Cliff Frost.
Restrictions
IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY DERIVATIVES THEREOF, EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. THIS SOFTWARE IS PROVIDED ON AN "AS IS" BASIS, AND THE AUTHORS AND DISTRIBUTORS HAVE NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.
You have permission to use and redistribute these traces freely, as long as this Copyright and Disclaimer is distributed unmodified. If you publish any results based on these traces, please send us a copy of this publication (in electronic or print form) and give the following reference or attribution in your publication:
Steven D. Gribble, "UC Berkeley Home IP HTTP Traces", July 1997. Available at http://www.acm.org/sigcomm/ITA/.