The mysteries of the web root
There are many files in the web root folder that are used for many different purposes and are often fetched automatically by bots or by your browser. Here is a list of various things I have used and read about.
A command like this, run in your
/var/log/apache2 directory, can be used to find all kinds of things that have been accessed on your server according to the logs.
cat *.gz | gunzip | grep -o -P '(GET|POST) [^ ]+' | sort | uniq
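The same extraction can be sketched in Python if you want to do more with the results, assuming the default Apache combined log format (the sample lines below are made up):

```python
import collections
import re

# Matches the request part of an Apache access log line,
# e.g. '... "GET /index.html HTTP/1.1" 200 ...'
REQUEST = re.compile(r'"(GET|POST) ([^ ]+)')

def count_requests(lines):
    """Count unique method/path pairs, like the sort | uniq pipeline."""
    counts = collections.Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match:
            counts[f"{match.group(1)} {match.group(2)}"] += 1
    return counts

sample = [
    '1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 123',
    '1.2.3.4 - - [01/Jan/2024:00:00:01 +0000] "GET /index.html HTTP/1.1" 200 123',
    '5.6.7.8 - - [01/Jan/2024:00:00:02 +0000] "POST /xmlrpc.php HTTP/1.1" 403 0',
]
print(count_requests(sample))
```

For the gzipped logs you can open them with the gzip module instead of piping through gunzip first.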
- index files
- favicon.ico
- robots.txt
- humans.txt
- .well-known
- .htaccess and .htpasswd
- cgi-bin
- Verification files
- xmlrpc.php
- crossdomain.xml
- Probably accidental files
index.html
This is a common file to exist here. There might be all kinds of index.* files, such as
index.php and even index.json.
The typical behavior for a web server is that, for every directory you visit on a website, it looks for an index file. If that file exists it is displayed or executed; otherwise the server shows a directory listing, if that is enabled. You can in fact disable directory listing by creating an empty
index.html, which is good if you want to hide things or provide your own directory lister.
If index.html exists on a website where index.php is also considered a valid index file, the server will prioritize one of them, but you can still try accessing either file directly by its full name.
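On Apache, that priority order is set by the DirectoryIndex directive; a minimal sketch of what such a configuration might look like (the order shown here is just an example, not a default):

```apache
# Try index.html first, then index.php, before falling back to a
# directory listing (which only appears if Options +Indexes is set).
DirectoryIndex index.html index.php
```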
favicon.ico
This one is really important: browsers tend to automatically try to fetch it to display as an icon on the browser tab, but its location can be overridden like the following.
<link rel="icon" type="image/x-icon" href="/favicon.ico">
You should have some icon here to uniquely identify your website, and you can even use that HTML element to change the icon dynamically after the page has loaded, for example to show a different icon when the user has notifications.
robots.txt
The Robots Exclusion Standard, for friendly bots. Bots tend to follow these rules, which are intended both to steer bots toward or away from certain places and to provide extra info for them, like sitemaps. Below is an example of one that allows absolutely everything.
User-agent: *
Allow: /
User-agent can be set to specific bots, and many Disallow and Allow paths may follow for each bot. Remember that malicious bots, being contrarian in nature, will specifically go to the places you tell them not to go, so do not put a Disallow rule on your secret admin panel or the like.
A HUGE amount of robots will visit your website for all kinds of reasons. Here is a whole bunch of reasons a robot might visit your website:
- Search indexing
- Someone pasted in a link to your website on social media or a chat client
- Some malicious bot appears and tries to hit things in hope to find a vulnerability
- Someone used a downloader or user-operated spider on your website, like Wget (this can bypass robots.txt at times, if the user tells it to)
- An API or webhook on your website received a call
- A malfunctioning robot accidentally hitting your site
- RSS clients that are looking for new articles on your website. Maybe you have a tag like
<link rel="alternate" type="application/rss+xml" title="RSS" href="https://ellietheyeen.github.io/feed.xml">
- Tools like Google Search Console or the Bing equivalent trying to find the verification file to see if it is still there (yes, you can block this)
- Someone told a bot, maybe like ChatGPT, to fetch something from your website, or it arrived automatically
- Site accelerators that fetch pages they predict you might fetch, or that convert data for slow clients
- You clicked on some link on a website that then read the
Referer header on your request and sent a bot back to analyze it. The Creative Commons license website has been known to do this.
- Pingbacks for your blog, like WordPress's
xmlrpc.php, which is used to find out who referenced your article in their own posts
- Vulnerability scanners that are meant to warn users
- Bots simply fetching the page title, such as ones on wikis
- Down detectors that verify that your website is online
- Mastodon verifying that you have a rel="me" link for verification, looking like this:
<a href="https://toot.cat/@DPSsys" target="_blank" rel="me">Mastodon</a>
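One of the bullets above mentions RSS clients discovering feeds through a link tag. A minimal sketch of how such a client might find the feed URL, using only Python's standard library and the feed tag from this site as sample input:

```python
from html.parser import HTMLParser

class FeedFinder(HTMLParser):
    """Collects href values from <link rel="alternate"> tags whose
    type looks like a feed, the way a feed reader discovers feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate" \
                and "xml" in a.get("type", ""):
            self.feeds.append(a["href"])

page = '''<html><head>
<link rel="alternate" type="application/rss+xml" title="RSS"
      href="https://ellietheyeen.github.io/feed.xml">
</head><body></body></html>'''

finder = FeedFinder()
finder.feed(page)
print(finder.feeds)
```

A real client would fetch the page over HTTP first; this only shows the discovery step.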
Example of disallowing
Below is an example of blocking 2 different ChatGPT-related robots. The first rule blocks the crawling bot and the second blocks the user-operated bot. This was found at https://www.furaffinity.net/robots.txt
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
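You can check how rules like these are evaluated with Python's built-in robots.txt parser; a small sketch using the rules above:

```python
import urllib.robotparser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# GPTBot is blocked everywhere; bots with no matching rules are allowed
print(rp.can_fetch("GPTBot", "/gallery/"))     # False
print(rp.can_fetch("Googlebot", "/gallery/"))  # True
```

This is also handy for verifying your own robots.txt before deploying it.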
humans.txt
Mostly a thing to allow people to see the humans behind a website. Not as relevant for me, as I am a creature, but you can read about it here.
.well-known
This is defined in the following document.
It is a general-purpose folder made for various bots, and you can find a list of what bots use it at
It is an extremely interesting article and lists a lot of other things that use the standard, such as
.htaccess and .htpasswd
These are used by Apache to add things such as password protection to directories, where
.htaccess defines various rules and
.htpasswd contains hashed passwords. These tend not to be accessible over HTTP but rather just sit there.
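A minimal sketch of what password-protecting a directory this way can look like (the .htpasswd path here is an assumption; ideally it lives outside the web root):

```apache
# .htaccess - protect this directory with HTTP Basic auth
AuthType Basic
AuthName "Restricted"
AuthUserFile /var/www/.htpasswd
Require valid-user
```

The matching .htpasswd file is usually created with the htpasswd command-line tool, e.g. `htpasswd -c /var/www/.htpasswd username`.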
cgi-bin
While technically not in the web root itself, at least most of the time, since it tends to be aliased there, it is still a very common location you can access on servers by writing
/cgi-bin/, though it tends to have directory listing off. The actual location tends to be something like
/usr/lib/cgi-bin/ on Linux. If you know the name of a script there, you can access it tho.
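If CGI is enabled, a script in that folder can be as small as this sketch: the server executes it and relays its stdout to the client, so it prints headers, a blank line, then the body (the file also needs to be marked executable):

```python
#!/usr/bin/env python3
# Minimal CGI script: headers first, then a blank line, then the body.

def response():
    return "Content-Type: text/plain\r\n\r\nHello from cgi-bin\n"

if __name__ == "__main__":
    print(response(), end="")
```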
Verification files
Used for verification in Google Search Console, which used to be called Google Webmasters, in order to let you see performance statistics for your website and add sitemaps.
Similarly for Bing Webmaster Tools.
xmlrpc.php
This was sort of an early attempt at federating the internet: WordPress blogs that got referred to by other blogs had links to those blogs added automatically. As you can guess, this was heavily exploited, and newer solutions have replaced it.
crossdomain.xml
Used by Flash and the Unity Web Player in order to define what is allowed to open sockets to the site. You can find information about it in the following link, but it tends to be very legacy at this point, with both of those having fallen into disuse.
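For reference, a minimal sketch of what such a policy file looks like (the domain is a placeholder; a wildcard "*" would allow everyone, which is usually a bad idea):

```xml
<?xml version="1.0"?>
<cross-domain-policy>
  <!-- Allow only this one domain to open connections -->
  <allow-access-from domain="example.com"/>
</cross-domain-policy>
```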
Probably accidental files
It is easy to accidentally upload files that you did not want to upload, and some bots will try to find them in order to exploit your website.
.env
Dotenv file, which very likely contains credentials.
.git
Git version control system folder, which might have credentials stored in it.
.vscode
Someone accidentally uploaded their VSCode project folder to the web. Files containing secrets like usernames and passwords, such as
.vscode/sftp.json from an SFTP extension, might exist inside.
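The paths above can be sketched as a small self-check you could run against your own site (only your own!) to see whether any of them are exposed; the path list is an assumption based on this section, not an exhaustive scanner:

```python
import urllib.request

# Paths that scanners commonly probe for and that are easy to
# upload by accident, taken from the list above.
SENSITIVE_PATHS = [".env", ".git/config", ".vscode/sftp.json", ".htpasswd"]

def check_site(base_url):
    """Return which sensitive paths respond with HTTP 200 under base_url.

    base_url is a placeholder, e.g. "https://example.com/".
    """
    exposed = []
    for path in SENSITIVE_PATHS:
        try:
            with urllib.request.urlopen(base_url + path) as resp:
                if resp.status == 200:
                    exposed.append(path)
        except OSError:
            pass  # 404/403 raise HTTPError, an OSError subclass; skip
    return exposed
```

Anything this returns should be deleted from the server or blocked in the server configuration.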