Abstract:
Since ICANN initiated the delegation of new generic top-level domains (new gTLDs) in 2013, more than a thousand of new gTLDs have been added to the domain name system (DNS). Previous work has shown that while new gTLD domains bring flexibility to registrants, they are also commo
nly used for malicious behavior because of their low registration costs, and it is im
portant to identify malicious new gTLD domains. However, because of the unique characteristics (e.g., domain length) of new gTLD domains, the accuracy is low when applying existing malicious domain identification methods to malicious new gTLD domain identification. To address this issue, we first characterize the resolution behavior of new gTLD domains ba
sed on massive domain name resolution data from five aspects including the number of associated SLDs per new gTLD, query volume, query failure rate, co
ntent replication and hosting infrastructure sharing. Then we analyze the resolution behavior of malicious new gTLD domains and find their unique behavioral characteristics in terms of co
ntent hosting infrastructure concentration, the number of FQDNs per SLD, the number of queries, the distribution of end users’ network footprints, and the distribution of the length of SLDs. Finally, according to these features, we design a malicious new gTLD domain identification method ba
sed on random forest. The results of the experiment show that the proposed method achieves 94% accuracy, which is better than the existing malicious domain identification methods.