The mysteries of the web root
There are many files in the web root folder that are used for many different purposes and are often fetched automatically by bots or by your browser. Here is a list of various things I have used and read about.
A command like this, run in your
/var/log/apache2 directory, can be used to find all kinds of things that have been accessed on your server according to the logs.
cat *.gz | gunzip | grep -o -P '(GET|POST) [^ ]+' | sort | uniq
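The same extraction can be sketched in Python if you want to do more with the results, assuming the default Apache combined log format (the sample lines below are made up):

```python
import collections
import re

# Matches the request part of an Apache access log line,
# e.g. '... "GET /index.html HTTP/1.1" 200 ...'
REQUEST = re.compile(r'"(GET|POST) ([^ ]+)')

def count_requests(lines):
    """Count unique method/path pairs, like the sort | uniq pipeline."""
    counts = collections.Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            counts[f"{match.group(1)} {match.group(2)}"] += 1
    return counts

sample = [
    '1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 123',
    '1.2.3.4 - - [01/Jan/2024:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 123',
    '5.6.7.8 - - [01/Jan/2024:00:00:02 +0000] "POST /xmlrpc.php HTTP/1.1" 403 0',
]
print(count_requests(sample))
```

For the gzipped logs you can open them with the gzip module instead of piping through gunzip first.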
- index files
- favicon.ico
- robots.txt
- humans.txt
- .well-known
- .htaccess and .htpasswd
- cgi-bin
- Verification files
- xmlrpc.php
- crossdomain.xml
- Probably accidental files
index.html
This is a common file to exist here. There might be all kinds of index.* files, such as
index.php and even index.json.
The typical behavior for a web server is that, for every directory you visit on a website, it looks for an index file. If that file exists it is displayed or executed; otherwise the server shows a directory listing, if that is enabled. You can in fact disable directory listing by creating an empty
index.html, which is good if you want to hide things or provide your own directory lister.
If index.html exists on a website where index.php is also considered a valid index file, the server will prioritize one of them, but you can still try accessing either file directly by its full name.
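On Apache, that priority order is set by the DirectoryIndex directive; a minimal sketch of what such a configuration might look like (the order shown here is just an example, not a default):

```apache
# Try index.html first, then index.php, before falling back to a
# directory listing (which only appears if Options +Indexes is set).
DirectoryIndex index.html index.php
```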
favicon.ico
This one is really important: browsers tend to automatically try to fetch it to display as an icon on the browser tab, but its location can be overridden like the following.
<link rel="icon" type="image/x-icon" href="/favicon.ico">
You should have some icon here to uniquely identify your website, and you can even use that HTML element to change the icon dynamically after the page has loaded, for example to show a different icon when the user has notifications.
robots.txt
The Robots Exclusion Standard, for friendly bots. Bots tend to follow these rules, which are intended both to steer bots toward or away from certain places and to provide extra info for them, like sitemaps. Below is an example of one that allows absolutely everything.
User-agent: *
Allow: /
User-agent can be set to specific bots, and many Disallow and Allow paths may follow for each bot. Remember that malicious bots, being contrarian in nature, will specifically go to the places you tell them not to go, so do not put a Disallow rule on your secret admin panel or the like.
A HUGE amount of robots will visit your website for all kinds of reasons. Here is a whole bunch of reasons a robot might visit your website:
- Search indexing
- Someone pasted in a link to your website on social media or a chat client
- Some malicious bot appears and tries to hit things in hope to find a vulnerability
- Someone used a downloader or user-operated spider on your website, like Wget (this can bypass robots.txt at times, if the user tells it to)
- An API or webhook on your website received a call
- A malfunctioning robot accidentally hitting your site
- RSS clients that are looking for new articles on your website. Maybe you have a tag like
<link rel="alternate" type="application/rss+xml" title="RSS" href="https://ellietheyeen.github.io/feed.xml">
- Tools like Google Search Console or the Bing equivalent trying to find the verification file to see if it is still there (yes, you can block this)
- Someone told a bot, maybe like ChatGPT, to fetch something from your website, or it arrived automatically
- Site accelerators that fetch pages they predict you might fetch, or that convert data for slow clients
- You clicked on some link on a website that then read the
Referer header on your request and sent a bot back to analyze it. The Creative Commons license website has been known to do this.
- Pingbacks for your blog, like WordPress's
xmlrpc.php, which is used to find out who referenced your article in their own posts
- Vulnerability scanners that are meant to warn users
- Bots simply fetching the page title, such as ones on wikis
- Down detectors that verify that your website is online
- Mastodon verifying that you have a rel="me" link for verification, looking like this:
<a href="https://toot.cat/@DPSsys" target="_blank" rel="me">Mastodon</a>
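One of the bullets above mentions RSS clients discovering feeds through a link tag. A minimal sketch of how such a client might find the feed URL, using only Python's standard library and the feed tag from this site as sample input:

```python
from html.parser import HTMLParser

class FeedFinder(HTMLParser):
    """Collects href values from <link rel="alternate"> tags whose
    type looks like a feed, the way a feed reader discovers feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate" \
                and "xml" in a.get("type", ""):
            self.feeds.append(a["href"])

page = '''<html><head>
<link rel="alternate" type="application/rss+xml" title="RSS"
      href="https://ellietheyeen.github.io/feed.xml">
</head><body></body></html>'''

finder = FeedFinder()
finder.feed(page)
print(finder.feeds)
```

A real client would fetch the page over HTTP first; this only shows the discovery step.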
Example of disallowing
Below is an example of blocking 2 different ChatGPT-related robots. The first rule blocks the crawling bot and the second blocks the user-operated bot. This was found at https://www.furaffinity.net/robots.txt
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
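You can check how rules like these are evaluated with Python's built-in robots.txt parser; a small sketch using the rules above:

```python
import urllib.robotparser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# GPTBot is blocked everywhere; bots with no matching rules are allowed
print(rp.can_fetch("GPTBot", "/gallery/"))     # False
print(rp.can_fetch("Googlebot", "/gallery/"))  # True
```

This is also handy for verifying your own robots.txt before deploying it.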
humans.txt
Mostly a thing to allow people to see the humans behind a website. Not as relevant for me, as I am a creature, but you can read about it here.
.well-known
This is defined in the following document.
It is a general-purpose folder made for various bots, and you can find a list of what bots use it at
It is an extremely interesting article and lists a lot of other things that use the standard, such as
.htaccess and .htpasswd
These are used by Apache to add things such as password protection to directories, where
.htaccess defines various rules and
.htpasswd contains hashed passwords. These tend not to be accessible over HTTP but rather just sit there.
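A minimal sketch of what password-protecting a directory this way can look like (the .htpasswd path here is an assumption; ideally it lives outside the web root):

```apache
# .htaccess - protect this directory with HTTP Basic auth
AuthType Basic
AuthName "Restricted"
AuthUserFile /var/www/.htpasswd
Require valid-user
```

The matching .htpasswd file is usually created with the htpasswd command-line tool, e.g. `htpasswd -c /var/www/.htpasswd username`.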
cgi-bin
While technically not in the web root itself, at least most of the time, since it tends to be aliased there, it is still a very common location you can access on servers by writing
/cgi-bin/, though it tends to have directory listing off. The actual location tends to be something like
/usr/lib/cgi-bin/ on Linux. If you know the name of a script there, you can access it tho.
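If CGI is enabled, a script in that folder can be as small as this sketch: the server executes it and relays its stdout to the client, so it prints headers, a blank line, then the body (the file also needs to be marked executable):

```python
#!/usr/bin/env python3
# Minimal CGI script: headers first, then a blank line, then the body.

def response():
    return "Content-Type: text/plain\r\n\r\nHello from cgi-bin\n"

if __name__ == "__main__":
    print(response(), end="")
```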
Verification files
Used for verification in Google Search Console, which used to be called Google Webmasters, in order to let you see performance statistics for your website and add sitemaps.
Similarly for Bing Webmaster Tools.
xmlrpc.php
This was sort of an early attempt at federating the internet: WordPress blogs that got referred to by other blogs had links to those blogs added automatically. As you can guess, this was heavily exploited, and newer solutions have replaced it.
crossdomain.xml
Used by Flash and the Unity Web Player in order to define what is allowed to open sockets to the site. You can find information about it in the following link, but it tends to be very legacy at this point, with both of those having fallen into disuse.
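For reference, a minimal sketch of what such a policy file looks like (the domain is a placeholder; a wildcard "*" would allow everyone, which is usually a bad idea):

```xml
<?xml version="1.0"?>
<cross-domain-policy>
  <!-- Allow only this one domain to open connections -->
  <allow-access-from domain="example.com"/>
</cross-domain-policy>
```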
Probably accidental files
It is easy to accidentally upload files that you did not want to upload, and some bots will try to find them in order to exploit your website.
.env
Dotenv file, which very likely contains credentials.
.git
Git version control system folder, which might have credentials stored in it.
.vscode
Someone accidentally uploaded their VSCode project folder to the web. Files containing secrets like usernames and passwords, such as
.vscode/sftp.json from an SFTP extension, might exist inside.
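The paths above can be sketched as a small self-check you could run against your own site (only your own!) to see whether any of them are exposed; the path list is an assumption based on this section, not an exhaustive scanner:

```python
import urllib.request

# Paths that scanners commonly probe for and that are easy to
# upload by accident, taken from the list above.
SENSITIVE_PATHS = [".env", ".git/config", ".vscode/sftp.json", ".htpasswd"]

def check_site(base_url):
    """Return which sensitive paths respond with HTTP 200 under base_url.

    base_url is a placeholder, e.g. "https://example.com/".
    """
    exposed = []
    for path in SENSITIVE_PATHS:
        try:
            with urllib.request.urlopen(base_url + path) as resp:
                if resp.status == 200:
                    exposed.append(path)
        except OSError:
            pass  # 404/403 raise HTTPError, an OSError subclass; skip
    return exposed
```

Anything this returns should be deleted from the server or blocked in the server configuration.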