One of the very first steps when conducting an SEO audit or preparing for a website redesign is “crawling” (aka “scraping”) the website. Crawling a website can provide crucial information about potential issues and possible areas of improvement.
My personal tool of choice for crawling a website is Screaming Frog. The software is inexpensive, easy to use, and extremely versatile. Through the years I’ve evaluated many other options but I’ve yet to find one that truly competes when all considerations are factored in.
What are the reasons for crawling a website?
There are many different things you can accomplish using Screaming Frog. Below are a few of the uses I find particularly helpful.
- Create a list of all URLs/pages on a website
- Find 302 redirects
- Perform QA for 301 redirect implementations
- Verify Google Analytics is located on every page
- Find broken links (internal and external)
- Find missing meta content and alt attributes
- Find duplicate content
- Find or verify Schema content
- Find malware or spam
- Find slow loading pages
- Create an XML sitemap
- Validate a list of PPC URLs
Again, this is just a small list of the potential uses for crawling a website. There are many more things you can do once you get a full understanding of Screaming Frog’s capabilities.
How do you crawl a very large website?
Screaming Frog’s default setup is great for tackling websites with fewer than 10,000 pages, but what do you do when you run into a website like the one below?
The website pictured has over 3 million indexed pages, which makes crawling it a challenge to say the least. Taking on a website this size requires a handful of important tweaks to Screaming Frog’s configuration. The following is a tutorial for how to set up Screaming Frog to crawl very large websites.
1. Increase Computer RAM
This step is optional but highly recommended. The Screaming Frog software relies on RAM to hold data during a website crawl. The more RAM your computer has, the more pages you can crawl. The good news is RAM has become extremely inexpensive. For most computers an 8 GB memory module costs around $50.
You’ll need to do a little research on how much RAM your computer can hold since it ranges anywhere from 4 GB to over 64 GB. Once you know how much RAM you can add, make sure you purchase modules that are compatible with your computer.
Installing RAM usually takes about 30 seconds. All you do is pop off the back of the computer and slide the modules into the available slots. You may want to perform a quick search on YouTube for your computer model + “RAM Upgrade” to find a tutorial for how to perform the process in case there are specific nuances with your computer.
The best part about increasing RAM is it enhances more than just your ability to crawl websites. If you’re like me and you always have an assortment of programs running simultaneously while also bouncing back-and-forth between 1,000 different Chrome tabs you’ll be pleasantly surprised by the improvement in day-to-day computer speed.
2. Increase Screaming Frog’s RAM Availability
The default setting for Screaming Frog allows it to access 512 MB of RAM. Adding more RAM to your computer won’t change how many pages you can crawl until you tell the software it can use more of the available memory.
To do this, you’ll first need to open the folder where Screaming Frog is installed. Then use a text editor like Notepad to open the file named “ScreamingFrogSEOSpider.l4j.ini”.
The file is very simple and only includes two lines of information. The number located on the second line is the one you’ll need to update.
The default value when you first open this file is “-Xmx512M”. This means Screaming Frog is allowed to use 512 MB (0.5 GB) of your computer’s RAM. The value is in megabytes, so add 1024 for every additional 1 GB of RAM you want Screaming Frog to have access to. For example, to allocate 15 GB (pictured in the above screenshot) simply replace the number with “15360” (1024 x 15 = 15360). Make sure to leave the “-Xmx” and “M” text that appear around the number. Then save the file and you should be ready to use your extra RAM.
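For reference, after the edit the memory line in the file would read as follows (a minimal sketch matching the 15 GB example above; leave any other lines in the file untouched):

```
-Xmx15360M
```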
TIP 1: I recommend allocating at least 3 GB less than your computer’s total RAM. If you allocate the full amount of system RAM, it’s possible for large website crawls to freeze your computer once Screaming Frog gets close to the maximum. For example, if your computer has 16 GB of RAM then only allow Screaming Frog to access 13 GB of it. You can always reduce this buffer if you determine it’s more than your computer actually requires.
TIP 2: To check whether the RAM increase was successful: restart Screaming Frog, click on “Help”, and then click on “Debug”. In the line labeled “Memory”, your new amount of RAM should be listed directly after the word “Max”.
3. Increase Screaming Frog’s Crawl Speed
Crawling large websites does take time, but there are ways to help expedite the process. To minimize the amount of time it takes, you’ll need to increase the crawl speed under the “Configuration” menu.
Once you click on “Speed” the spider speed configuration box will appear. The number listed next to “Max Threads” is what determines how quickly you can crawl a website.
Increasing this number will greatly reduce the time it takes to crawl a website. Test a few different max thread counts and see how crawl speed is affected (e.g., 10, 50, 100, 200, 500, 1,000).
It’s worth noting that setting a high number of threads will increase the number of HTTP requests made to the server, which can impact the website’s response times. I’ve never run into this issue with any of our clients, but it doesn’t hurt to reach out to the site’s webmaster to approve a crawl rate just in case. Another option is to monitor response times and adjust the speed if you notice any issues.
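If you want a quick way to keep an eye on response times while you experiment with thread counts, a short script can spot-check the server for you. This is a minimal sketch, not a Screaming Frog feature, and the URL is a hypothetical placeholder:

```python
import time
import urllib.request

def average_response_time(url, samples=5):
    """Fetch the URL several times and return the mean response time in seconds."""
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        with urllib.request.urlopen(url) as response:
            response.read()  # read the body so the timing covers a full request
        timings.append(time.monotonic() - start)
    return sum(timings) / len(timings)

# Hypothetical placeholder URL - swap in the site you're crawling
print("Average response time: %.2fs" % average_response_time("http://yourdomain.com/"))
```

If the average climbs noticeably after you raise the thread count, dial the speed back down.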
4. Segment Website into Crawlable Chunks
No matter how much RAM you put into your computer there will always be some websites with too many pages to crawl in one session. In order to crawl websites of this size you’ll need to segment them into crawlable chunks.
The first thing you’ll need to do is run a test crawl to find out the maximum number of URLs your computer can handle. That maximum will help determine your segmentation strategy. As a reference, in my experience a computer with 15 GB of RAM allocated to Screaming Frog can crawl 600,000-900,000 URLs per session.
After determining your maximum number of URLs, you’ll need to map out the website by breaking it into subdomains or directories that fall below your maximum URL count. Use a site search command in Google to determine the number of indexed pages in each section (e.g., “site:yourdomain.com/targeted-directory/”).
As you can see in the screenshot above, this particular directory includes only 304,000 indexed pages even though the website as a whole contains over 3 million.
It’s important to note that the site search command only displays the number of indexed pages in a directory, not the total number of pages that exist there; pages could exist in the directory that aren’t currently indexed. For that reason, give yourself a buffer between the maximum number of pages your computer can handle and the number of indexed pages in the section you’re trying to crawl. For example, if your computer can handle around 700,000 pages per crawl, try to segment the website into sections that include 500,000 or fewer indexed pages. This gives you a 200,000-page buffer to account for any non-indexed pages.
Tip: You may run into situations where a subdomain or directory has more indexed pages than your computer can crawl. An explanation for how to handle that situation is included in section 6 of this tutorial.
5. Include Target Directories
The next step in the setup process is to let Screaming Frog know what specific content you want it to crawl. In order to do that you’ll need to click on “Include” in the configuration drop-down.
A box will open that allows you to use regular expressions (regex) to limit the software to only crawl certain content. If you’re not familiar with regular expressions Microsoft has a pretty good list of what each character can be used for.
The example below shows a regular expression that limits a crawl to only the /foodscores/ directory (a representative pattern; swap in your own domain and path):
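```
http://yourdomain.com/foodscores/.*
```

Any URL matching that pattern will be crawled; everything else will be ignored.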
It’s important to make sure you begin the crawl from a page that has at least one internal link pointing to your target content. If not, your crawl will end after only one page.
6. Exclude Unneeded URLs
The exclude feature is similar to the include feature except, as you may have guessed, it removes sections you specifically don’t want in your results.
This is very helpful when breaking up websites that have nested directories, or for splitting up a single directory that has more pages than your computer can handle in one crawl.
This functionality is also extremely helpful when working with a CMS that auto-generates dynamic pages. If you start a crawl and notice that a large number of the URLs being pulled in have dynamic segments like “?search=” or “?tag=”, you’ll most likely want to add those segments to the exclude section to clean up your results and reduce the number of unneeded pages crawled.
Below are a few regular expression examples of the exclude functionality, along the lines of those Screaming Frog highlights on their website (illustrative patterns; check their documentation for the exact list):
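```
http://yourdomain.com/do-not-crawl-this-folder/.*
.*\?search=.*
.*\?tag=.*
.*jpg$
```

The first pattern excludes an entire directory, the middle two exclude the dynamic “?search=” and “?tag=” URLs mentioned above, and the last excludes any URL ending in .jpg.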
TIP: It can sometimes be tricky to break up individual directories that have more indexed pages than your computer can handle. You’ll need to leverage both the include and exclude functionalities to accomplish this task.
One possible solution is to break up the directory by URL keyword usage. If a large portion of the URLs in a directory have a specific keyword in them you can use it to your advantage. Run one crawl including URLs in the directory that contain the targeted keyword and then run a separate crawl where you include the directory and exclude the keyword.
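As a rough sketch of that approach (assuming, hypothetically, that many URLs in the /foodscores/ directory contain the word “recipes”), the two crawls could be configured like this:

```
Crawl 1 include:
http://yourdomain.com/foodscores/.*recipes.*

Crawl 2 include:
http://yourdomain.com/foodscores/.*

Crawl 2 exclude:
.*recipes.*
```

Between them, the two crawls cover the entire directory without either one exceeding your per-crawl URL ceiling.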
The best solution to this problem usually depends on the specific website you’re trying to crawl. Spend some time getting familiar with regular expressions. If you do, you’ll discover many different options for breaking up large directories into crawlable chunks.
Here are a few additional tips that can be helpful for setting up Screaming Frog and crawling large websites.
- If you have a 64-bit machine, ensure you download and install the 64-bit version of Java. If you don’t, you’ll run into the error displayed below.
- Save crawl backups when getting close to your maximum RAM usage. For example, if your computer can usually handle 700,000 URLs per crawl and you’re in the middle of a crawl that just passed 600,000 URLs, it’s a good idea to pause and save in case the high RAM usage eventually causes your computer to freeze. There’s a default setting in Screaming Frog that’s supposed to pause the software on high memory usage, but I’ve experienced situations where it doesn’t always work.
- As noted earlier, the more RAM your computer has available for Screaming Frog the more URLs you can crawl. That means it can be helpful to reduce the amount of RAM your other programs require. If you use Chrome and you keep many tabs open at once there’s a free extension called “The Great Suspender” that temporarily freezes tabs you haven’t used in a while. It then allows you to quickly unfreeze the tabs whenever you need them again. The extension helps reduce Chrome’s RAM usage and I have personally found it extremely beneficial.
Crawling huge websites can certainly be a challenge, but it’s a process that’s vitally important for SEO, website maintenance, and the redesign process. Luckily, once you’ve finished implementing all the above configurations you’ll be all set to start crawling websites the size of Mount Everest.
Do you have your own tips for crawling large websites? Or do you have additional questions? Feel free to post your questions and comments below or reach out to me directly at: Brian.Rogel@beaconfire-red.com or @BrianRogel
Good luck, and happy crawling!