Problem: A few months ago, Giancarlo Gonzales, a former CIO for the island of Puerto Rico, pointed out the lack of updates to open data on data.pr.gov. As part of an open-data initiative, Puerto Rico created its own version of data.gov, called data.pr.gov, which provides free and open access to government datasets. Giancarlo noted that data.pr.gov datasets have gone without updates across government administrations since 2016.
A few related posts by Giancarlo about expired TLS/SSL certificates on government portals, along with recent posts about the confusion over .pr domains, led our President, Jose Fernandez, to a dataset on data.gov that describes itself as a semi-complete list of all .gov domains. The dataset correctly excludes some sensitive domains, such as military domains, but we were surprised to see only six pr.gov domains in the list. To gather a more accurate list, we engaged in non-commercial academic research to fill this information void.
Solution: To solve this problem, we used censys.io to extract all public information contained in TLS/SSL certificates on port 443 that included a subdomain of pr.gov.
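As a rough illustration of the filtering step, the sketch below reduces a list of certificate names to unique pr.gov hostnames. It makes no API calls. The function name `extract_subdomains` and the sample names are hypothetical and are not taken from the script we used. It also assumes the certificate names have already been retrieved as plain strings.

```python
def extract_subdomains(names, parent="pr.gov"):
    """Return the sorted, de-duplicated hostnames at or under `parent`."""
    parent = parent.lower()
    suffix = "." + parent
    found = set()
    for raw in names:
        # Normalize case and drop any trailing dot.
        name = raw.strip().lower().rstrip(".")
        # A wildcard entry such as *.example.pr.gov covers a whole zone;
        # keep the zone name itself.
        if name.startswith("*."):
            name = name[2:]
        # Match the parent or a true subdomain, not look-alikes
        # such as notpr.gov or pr.gov.example.com.
        if name == parent or name.endswith(suffix):
            found.add(name)
    return sorted(found)


# Hypothetical names, for illustration only.
sample_names = [
    "www.example.pr.gov",
    "*.example.pr.gov",
    "WWW.EXAMPLE.PR.GOV",
    "pr.gov.example.com",
    "notpr.gov",
]
print(extract_subdomains(sample_names))
```

Matching on the ".pr.gov" suffix rather than on a plain substring is a deliberate choice: a substring search for "pr.gov" would also pick up unrelated names that merely contain that text.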
Explanation: censys.io provides search criteria for TLS/SSL certificates. Although it seems limited to just 443.http, it gave a more complete answer than shodan.io did. For example, censys.io returned 1539 results for our pr.gov query, while shodan.io had fewer than 160 at the time of writing, a count that fell to 82 over the week before publication.
Script: We found https://crazybulletctfwriteups.wordpress.com/2018/01/18/subdomain-scanner-using-censys-python/ from P3t3rp4rk3r (https://github.com/P3t3rp4rk3r). Santhosh Baswa’s script, written for a CTF, provided essentially everything needed to answer this question. However, the censys.io API is limited to 1000 results or fewer per query. To overcome this limitation, we slightly modified Santhosh Baswa’s script to accept an additional filter that reduces the dataset queried.
Primer: The simplest way to reduce the number of hits at the time was to query the parsed.validity.start date in conjunction with pr.gov. Since we know the total number of hits, 1539, we can query between a start and an end date to reduce the number of records returned, then query the remaining date period to get the missing records. Although the methodology sounds correct, we ended up with more certificates than our simple math expected. Still, we now have a better starting point for anyone looking to use censys.io to solve research problems such as this. We have posted a cited and modified version of Santhosh Baswa’s script on our github.com page.
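The date-splitting idea can be sketched as follows. This is a minimal illustration, not the modified script itself. `split_windows` and `fake_count` are hypothetical names, and the certificate dates are made up so the example runs without an API key. In practice the count callback would wrap a censys.io query filtered on parsed.validity.start. The windows are inclusive and do not share a boundary day, because a boundary date included in two windows is one possible (unconfirmed) way to end up with more records than the total suggests.

```python
from datetime import date, timedelta

API_LIMIT = 1000  # result cap per query, as described above


def split_windows(start, end, count_in_window, limit=API_LIMIT):
    """Split [start, end] into inclusive date windows of at most `limit` hits.

    `count_in_window(start, end)` is a hypothetical callback returning how
    many certificates have a validity start date inside the window.
    A single-day window is returned as-is, even if it exceeds the limit.
    """
    if start == end or count_in_window(start, end) <= limit:
        return [(start, end)]
    mid = start + (end - start) // 2
    left = split_windows(start, mid, count_in_window, limit)
    # The right half begins the day after `mid`, so no date is counted twice.
    right = split_windows(mid + timedelta(days=1), end, count_in_window, limit)
    return left + right


# Made-up validity start dates standing in for real query results.
fake_starts = [date(2016, 1, 1) + timedelta(days=i % 700) for i in range(1539)]


def fake_count(start, end):
    """Count the made-up certificates whose start date falls in the window."""
    return sum(1 for d in fake_starts if start <= d <= end)


windows = split_windows(date(2016, 1, 1), date(2017, 12, 31), fake_count)
for w_start, w_end in windows:
    print(w_start, w_end, fake_count(w_start, w_end))
```

Each printed window can then be queried on its own, and the per-window counts should add up to the overall total. If they do not, de-duplicating the combined records by certificate fingerprint is a reasonable first check.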
On Shodan Problems: Shodan.io reported fewer than 160 pr.gov domains in its datasets. Based on our familiarity with the Puerto Rico government, we considered this number much smaller than anticipated. For example, a certain government website we expected to find in shodan.io was not present. This led us to consider two possible scenarios:
- Shodan.io scanning is not 100% complete when parsing TLS/SSL certificates.
- The government of Puerto Rico has begun blocking shodan.io scans.
Both scenarios are plausible. In the past, we have sent academic, well-intentioned notifications to government officials, and warnings to the private sector, about unnecessary service exposure on shodan.io. Given that history, we are inclined to believe that, instead of removing unnecessary services from their networks, the administrators may simply be dropping the scan traffic. Over time, if you have older datasets to compare against, you can validate these assumptions. For now, we will assume that shodan.io scanning/indexing is incomplete.
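For readers who do keep older result sets, a comparison along these lines is one way to start checking those assumptions. It is only a sketch: `compare_snapshots` and the hostnames are hypothetical, and a host that disappears from the results is consistent with either scenario, so the output is a starting point for investigation rather than proof.

```python
def compare_snapshots(older, newer):
    """Compare two lists of hostnames collected at different times."""
    older_set = {h.strip().lower() for h in older if h.strip()}
    newer_set = {h.strip().lower() for h in newer if h.strip()}
    return {
        # Seen before, missing now: removed, blocked, or no longer indexed.
        "disappeared": sorted(older_set - newer_set),
        # Newly visible hosts.
        "appeared": sorted(newer_set - older_set),
        # Present in both snapshots.
        "unchanged": sorted(older_set & newer_set),
    }


# Hypothetical hostnames, for illustration only.
older_snapshot = ["a.example.pr.gov", "b.example.pr.gov", "c.example.pr.gov"]
newer_snapshot = ["B.example.pr.gov", "d.example.pr.gov"]

for label, hosts in compare_snapshots(older_snapshot, newer_snapshot).items():
    print(label, hosts)
```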
Conclusion: As cybersecurity professionals, we often struggle with the effects of compiling open-source datasets for public dissemination. We must ask ourselves: does this do more harm than good? To date, we are unsure what effect distributing such a list will have, but we remain positive that compiling this dataset will help the government of Puerto Rico, its residents, and data scientists abroad understand how the government websites of Puerto Rico are used in our digital era. We provided the dataset to an admin on data.pr.gov. If the list is put to use, we will try to match the format used on https://raw.githubusercontent.com/GSA/data/master/dotgov-domains/current-full.csv to improve the list.