Dataset for detecting DNS over HTTPS (DoH) using machine learning.

The dataset consists of three different data sources:

  1. DoH-enabled Firefox
  2. DoH-enabled Google Chrome
  3. Cloudflared DoH proxy

The web browser data was generated using the Selenium framework, which simulated typical user browsing. The browsers received commands to visit domains taken from Alexa's list of the top 10K most visited websites. The traffic was captured on the host by listening on the network interface of the virtual machine. Overall, the dataset contains almost 5,000 web-page visits by Firefox and 1,000 by Chrome.

The Cloudflared DoH proxy was installed on a Raspberry Pi, and the IP address of the Raspberry Pi was set as the default DNS resolver in two separate offices at our university. It continuously captured the DNS/DoH traffic created by six working computers for around three months.

The dataset contains 1,128,904 flows, of which around 33,000 are labeled as DoH. We provide it as a CSV file with the following data fields:

  • Label
  • Data source
  • Duration
  • Minimum inter-packet delay
  • Maximum inter-packet delay
  • Average inter-packet delay
  • Variance of incoming packet sizes
  • Variance of outgoing packet sizes
  • Ratio of the number of incoming and outgoing bytes
  • Ratio of the number of incoming and outgoing packets
  • Average of incoming packet sizes
  • Average of outgoing packet sizes
  • Median of incoming packet sizes
  • Median of outgoing packet sizes
  • Ratio of bursts and pauses
  • Number of bursts
  • Number of pauses
  • Autocorrelation
  • Transmission symmetry in the first third of the connection
  • Transmission symmetry in the second third of the connection
  • Transmission symmetry in the last third of the connection
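To make some of the simpler fields above concrete, the following is a minimal sketch of how such per-flow statistics could be derived from a packet trace. It is not the extraction code used to build the dataset. The function name flow_features, the (timestamp, size, direction) tuple layout, the use of population variance, and the direction of the ratios (incoming divided by outgoing) are all assumptions made for illustration. Bursts, pauses, autocorrelation, and transmission symmetry are left out because their exact definitions are not given here.

```python
from statistics import mean, median, pvariance


def flow_features(packets):
    """Compute a few per-flow statistics.

    `packets` is a hypothetical input format: a time-ordered list of
    (timestamp_seconds, size_bytes, direction) tuples, where direction
    is "in" or "out". The flow needs at least two packets and at least
    one packet in each direction.
    """
    times = [t for t, _, _ in packets]
    # Inter-packet delay: gap between consecutive packets, either direction.
    delays = [later - earlier for earlier, later in zip(times, times[1:])]
    incoming = [size for _, size, d in packets if d == "in"]
    outgoing = [size for _, size, d in packets if d == "out"]
    return {
        "duration": times[-1] - times[0],
        "min_ipd": min(delays),
        "max_ipd": max(delays),
        "avg_ipd": mean(delays),
        # Population variance is an assumption; sample variance is also plausible.
        "var_in_size": pvariance(incoming),
        "var_out_size": pvariance(outgoing),
        # The ratio direction (incoming / outgoing) is also an assumption.
        "bytes_ratio": sum(incoming) / sum(outgoing),
        "packets_ratio": len(incoming) / len(outgoing),
        "avg_in_size": mean(incoming),
        "avg_out_size": mean(outgoing),
        "median_in_size": median(incoming),
        "median_out_size": median(outgoing),
    }


# Toy flow: two outgoing and two incoming packets.
example = flow_features([
    (0.0, 100, "out"),
    (0.5, 300, "in"),
    (2.0, 200, "out"),
    (3.0, 500, "in"),
])
```

The sketch keeps every statistic in a plain dictionary so that one flow maps naturally to one CSV row. The keys are illustrative and are not the column names of the published file.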

The observed network traffic does not contain privacy-sensitive information.
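As a usage sketch, the class balance of the CSV file could be checked with Python's standard csv module. The column header "Label" and the label values "DoH" and "non-DoH" in the in-memory sample are assumptions. The actual header names and label encoding should be read from the published file.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column="Label"):
    """Count rows per label value in an open CSV file object.

    `label_column` defaults to "Label", which is an assumed header name.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Tiny in-memory stand-in for the real file; values are made up.
sample = io.StringIO(
    "Label,Duration\n"
    "DoH,1.2\n"
    "non-DoH,0.4\n"
    "DoH,3.0\n"
)
counts = count_labels(sample)
```

For the real dataset, the same function can be given a file opened with open(path, newline=""), where path is wherever the CSV was downloaded to.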