Several hundred thousand files, each containing a summary of a single web page request to one of many websites. The summary consists of a timestamp and a direction for each packet.

(Direction is: 1 = outbound/ -1 = inbound for each packet transmitted.)

There are three groups of files:

  • Alexa_Monitored: Requests to the top 55 websites (as determined by www.alexa.com). There are perhaps 100 requests for each website.

  • HS_Monitored: requests to the same websites as in Alexa_Monitored (perhaps only the top 30) but this time accessed through the Tor anonymising service.

  • Unmonitored: The top 100,000 websites excluding the top 55, one request per site.

The filename for the “monitored” requests have the form nn_mm.txt, where nn identifies the website being accessed; and mm is the particular request.

Challenge

Given a novel file for a previously unseen request – ie, the packet timestamp and direction timeseries – identify which website has been accessed.