De-anonymisation of web browsing history
The dataset consists of several hundred thousand files, each containing a summary of a single web page request to one of many websites. The summary records a timestamp and a direction for each packet transmitted (direction is 1 for outbound and -1 for inbound).
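As an illustration of the per-packet summary described above, here is a minimal parsing sketch. The on-disk layout is not specified here, so the code assumes one packet per line with a timestamp and a direction separated by whitespace; the name `parse_trace` is hypothetical.

```python
from typing import List, Tuple

# A packet is (timestamp, direction), where direction is 1 (outbound) or -1 (inbound).
Packet = Tuple[float, int]


def parse_trace(text: str) -> List[Packet]:
    """Parse one trace file's contents into a list of packets.

    ASSUMPTION: each non-empty line is "<timestamp> <direction>",
    separated by whitespace. Adjust if the real files differ.
    """
    packets: List[Packet] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ts_str, dir_str = line.split()[:2]
        direction = int(float(dir_str))  # tolerate "1.0" as well as "1"
        if direction not in (1, -1):
            raise ValueError(f"unexpected direction: {dir_str!r}")
        packets.append((float(ts_str), direction))
    return packets


# Usage with a made-up three-packet trace.
sample = "0.00\t1\n0.12\t-1\n0.15\t-1\n"
trace = parse_trace(sample)
```

Keeping the trace as a list of `(timestamp, direction)` pairs preserves the ordering, which is the only information the files are described as carrying.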
There are three groups of files:
- Alexa_Monitored: Requests to the top 55 websites (as determined by www.alexa.com). There are perhaps 100 requests for each website.
- HS_Monitored: Requests to the same websites as in Alexa_Monitored (perhaps only the top 30), but this time accessed through the Tor anonymising service.
- Unmonitored: The top 100,000 websites excluding the top 55, one request per site.
The filenames for the “monitored” requests have the form nn_mm.txt, where nn identifies the website being accessed and mm is the particular request.
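The nn_mm.txt naming scheme can be turned into labels with a short helper. This is a sketch that assumes nn and mm are both decimal integers; the function name `parse_monitored_name` is hypothetical.

```python
import re
from typing import Optional, Tuple

# ASSUMPTION: nn and mm are both runs of decimal digits.
_MONITORED_NAME = re.compile(r"^(\d+)_(\d+)\.txt$")


def parse_monitored_name(filename: str) -> Optional[Tuple[int, int]]:
    """Return (site_id, request_id) for a name like "12_34.txt".

    Returns None when the name does not follow the nn_mm.txt pattern.
    """
    match = _MONITORED_NAME.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Usage: the site id (nn) serves as the class label for that trace.
label = parse_monitored_name("12_34.txt")
```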
Challenge
Given a novel file for a previously unseen request (i.e., its time series of packet timestamps and directions), identify which website has been accessed.
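One possible starting point for the challenge, offered only as a naive baseline and not as a recommended method, is to summarise each trace with a few counts and classify a new trace by its nearest labelled neighbour. The feature choice and the names `features` and `nearest_site` are assumptions made for this sketch.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Packet = Tuple[float, int]  # (timestamp, direction)


def features(trace: Sequence[Packet]) -> List[float]:
    """Summarise a trace as [outbound count, inbound count, duration]."""
    n_out = sum(1 for _, d in trace if d == 1)
    n_in = sum(1 for _, d in trace if d == -1)
    duration = trace[-1][0] - trace[0][0] if trace else 0.0
    return [float(n_out), float(n_in), duration]


def nearest_site(train: Dict[int, List[List[float]]], query: List[float]) -> int:
    """Return the site id whose training vector is closest to the query.

    Uses plain Euclidean distance on unscaled features, so the feature
    with the largest numeric range dominates; scaling would be a
    sensible next step.
    """
    best_site: Optional[int] = None
    best_dist = math.inf
    for site, vectors in train.items():
        for vec in vectors:
            dist = math.dist(vec, query)
            if dist < best_dist:
                best_site, best_dist = site, dist
    if best_site is None:
        raise ValueError("training set is empty")
    return best_site


# Usage with two made-up sites, one training trace each.
train = {
    0: [features([(0.0, 1), (1.0, -1)])],
    1: [features([(0.0, 1), (0.1, -1), (0.2, -1), (0.3, -1), (5.0, -1)])],
}
guess = nearest_site(train, features([(0.0, 1), (0.9, -1)]))
```

This sketch only chooses among the monitored sites. Deciding whether a trace belongs to an unmonitored site would need an extra step, such as a distance threshold.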