Last updated: January 26th, 2016
Screaming Frog SEO Spider is one of the few tools no SEO can do without. It’s unquestionably a superb crawler, until you start feeling its limitations. Does that situation sound familiar? Like running out of RAM when using Screaming Frog SEO Spider? And not being able to put more RAM into the computer to satisfy Screaming Frog SEO Spider’s needs when crawling really big sites either? There is a solution: Run Screaming Frog SEO Spider in the cloud, and more specifically on Google Cloud infrastructure.
Screaming Frog SEO Spider
For most SEOs, Screaming Frog SEO Spider is THE crawler to crawl and analyse a website from an SEO perspective. Developed by an agency in the United Kingdom called Screaming Frog, Screaming Frog SEO Spider is also one of the few quality crawlers – if not the only one – that works on Debian-based systems like Ubuntu. Screaming Frog SEO Spider can be used for many different purposes, such as on-page analysis, and checking your backlink profile.
The biggest challenge with Screaming Frog SEO Spider is that the program needs a lot of RAM to crawl big sites or large lists of URLs. Although the development team are working on several fixes, you can utilise Google infrastructure with lots of RAM to run Screaming Frog SEO Spider. This is where Google Compute Engine comes in.
Google Compute Engine
In case you are not aware, Google allows you to run large-scale workloads on virtual machines powered by the Google infrastructure. This is called Google Compute Engine, a great opportunity to rent computer resources on a per-use basis. Google Compute Engine is very much still under development, but is already a serious competitor to the Amazon Cloud and Windows Azure Cloud services in the cloud-computing field. Its pricing is often as competitive as, or even cheaper than, that of the already-established Amazon Cloud service, which makes it a great alternative.
And to be honest, as an SEO Consultant, former Google Support Engineer and Google Search Quality team member, I love the notion of using Google hardware and infrastructure to crawl websites for me. So let’s get started by setting up your new Google Compute Engine instance with the Screaming Frog SEO Spider.
Installing the Google Cloud SDK
To get started, you first need to install the Google Cloud SDK locally on your computer. This process is rather lengthy, but only needs to be done once. As it is much easier to run this from a Linux distribution such as Ubuntu, or from a Mac, this article only covers the steps for Linux machines. Follow these steps if you prefer to install the Google Cloud SDK on a local Windows machine.
First open a terminal and make sure that you have installed curl. You can test this by typing “curl” into the terminal and seeing if it suggests that you install it. If needed and assuming you run Ubuntu, install curl by executing the following command:
sudo apt-get install curl
After you have verified that curl is installed, run the following command in the terminal (from the home directory). This will download and install the Google Cloud SDK:
curl https://sdk.cloud.google.com | bash
The next step is to let the terminal know that Google Cloud SDK has been installed. You have two options, the first of which is the more straightforward: close the terminal and then reopen it. Alternatively, you can also execute the following command (which will avoid you having to restart the terminal):
exec -l $SHELL
Once you have done this, you can verify that the Google Cloud SDK is installed by the following command:
gcloud version
If this does not return an error, and gives you an output with a version number and a list of different tools installed, then the Google Cloud SDK is installed and works on your system.
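The same check can be scripted for all the tools this article relies on. The following is a minimal sketch, not part of the original setup: the function name check_tools is made up for illustration, and it only tests whether each tool is on the PATH, not whether it works.

```shell
# Report whether each named tool is on the PATH.
# Returns non-zero if at least one tool is missing.
# (check_tools is a hypothetical helper name, not a gcloud command.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_tools curl gcloud gsutil
```

If a tool is reported as missing right after the installation, restarting the terminal as described above is the first thing to try.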
The next step is to authenticate your computer with the Google Cloud services. This will allow you to send commands to the Google Cloud to manage different Google Cloud services. In the terminal, execute the following command:
gcloud auth login
Depending on whether you have a browser installed on your local computer, a new browser window will open and ask you to give permission for Google Cloud SDK to access your Google account (you may be asked to log into your Google account first). Alternatively, you may also be asked to copy-paste a link into a browser from the terminal, and complete the process that way. Once accepted, you should see a confirmation: “You are now authenticated with the Google Cloud SDK.”
In the meantime, in the terminal you will be asked to enter the Google Cloud project ID. Just press Enter for now, as we will get back to this in a minute. If everything went well, then you will have received a message in the terminal that you are logged in with your Google account.
Google Developer Console
The next step is to go to the Google Cloud Console for developers. This is the website for managing the Google Cloud services, such as Google Compute Engine.
To get started, a new project needs to be created, so click the red button at the top of the page that says “Create Project”. A pop-up appears where you can give the project a name and define a Project ID. The Project Name is not that important, as it is only used in the Google Cloud Console. Just fill in here “Screaming Frog”.
The Project ID is very important, as it uniquely identifies the project among all Google Cloud Services users. So if you don’t go with the default suggestions from Google (click the little arrow on the right side in the input field for more suggestions from Google) it may take some effort to find an available and unique Project ID. For this project I am going with the “screaming-frog-wb”.
Once you click the Create button, you will be redirected to the overview page of the project – in this case https://console.developers.google.com/home/dashboard?project=screaming-frog-wb (note that the Project ID is used in the URL here).
Now comes an important step: we need to enable billing. Google Compute Engine does not have any free quotas, so in order to use Google Compute Engine, we need to enable billing. Find out more information about the pricing of Google Compute Engine here.
To enable billing, click on Settings in the left sidebar of the project overview page. The first option you now should see is Billing, with a grey button entitled “Enable billing”. Have your credit card ready, then click this. Select the right country, and enter your address, tax information (if applicable), phone number and name, and continue to enter your credit card data. Once you have completed this step you are ready to start using Google Compute Engine.
Tip: Once you have enabled billing and started using the Google Cloud services, you will see a link on the overview page of the project (in the right upper corner) to more details of the estimated charges for the present month.
After you created the project and enabled billing, let’s check the settings to make sure you can access the different tools. Click on the APIs and auth menu option in the left sidebar of the project overview page. Now check for Google Compute Engine and Google Cloud Storage and enable the services by clicking on the “off” button (which now turns green and displays the text “on”). To be safe, also enable the Google Cloud Storage JSON API service.
All the steps until now have been about setting up your computer and the project. Most of these steps you don’t need to repeat unless you change computer or want to set up new projects.
Are you ready to start the first virtual machine on Google infrastructure?
Running Your First Google Compute Engine Instance
Open the terminal again, and execute the following command (replace <project-id> with your unique Project ID – for this article that would be screaming-frog-wb):
gcloud config set project <project-id>
Everything you do now with the Google Cloud SDK will be executed as part of project Screaming Frog, which includes the billing.
Next, we need to set a zone by executing the following command (replace <zone> with a zone name – for this article that would be a zone in the US central region, such as “us-central1-a”):
gcloud config set compute/zone <zone>
Be aware that European zones tend to be slightly more expensive than US zones. Setting a zone is optional; the same can be accomplished by adding
--zone <zone>
to almost every gcloud command in this article. When no zone is set using “gcloud config”, you can use this command line parameter to run multiple Google Compute Engine instances, utilizing many CPUs, in multiple zones.
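To illustrate the per-command zone parameter, here is a small sketch that creates one instance per zone. The helper names run and create_in_zones, the instance name prefix and the example zones are assumptions made for illustration. By default the script only prints the gcloud commands; set DRY_RUN=0 to actually execute them.

```shell
# Print the command (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create one instance per zone, named <prefix>-<zone>.
# Usage: create_in_zones <prefix> <zone> [<zone> ...]
create_in_zones() {
    prefix=$1
    shift
    for zone in "$@"; do
        run gcloud compute instances create "$prefix-$zone" --zone "$zone"
    done
}

# Example (prints two commands without running them):
# create_in_zones screaming-frog us-central1-a europe-west1-b
```

Keep in mind that every instance created this way is billed separately, so delete the instances you no longer need.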
To confirm that you don’t have anything running, execute the following command in the terminal:
gcloud compute instances list
This should display an empty table.
All good so far. Now you can create a new instance by executing the following command:
gcloud compute instances create screaming-frog-test
The observant reader may have noticed that I omitted the “-wb” part in the previous command. This is because “screaming-frog-test” is another unique identifier for the instance within the project “screaming-frog-wb”.
In the terminal you will now be asked to select a machine type and an image. For this stage you can go with the f1-micro machine type (the cheapest), and the debian-7-wheezy image. The instance is now being set up. Once completed you can run the following command to log in using SSH (command line):
gcloud compute ssh screaming-frog-test
Note: You may be asked to set up the SSH keys. Just follow the instructions and use a passphrase that you can remember.
Congratulations! You are now connected to the virtual machine on Google infrastructure. You can confirm this by going to the project in the Google Cloud Console, or executing the following command in the terminal on your local machine:
gcloud compute instances list
Just for this stage, let’s shut the instance down again. Assuming that you are still connected to the instance using SSH, execute the following command in the terminal:
exit
This will log you out and close the connection between your computer and the instance. Now go to the Google Cloud Console, select the project, select Compute Engine, select VM Instances, click on the screaming-frog-test link, and go to the bottom of the page. Here you can click the Delete button to delete the instance. When you click this, don’t forget to also delete the boot disk, screaming-frog-test.
Alternatively, to shut down an instance again, execute the following command in the terminal:
gcloud compute instances delete screaming-frog-test --delete-disks boot
You will be asked to confirm that you want to delete the instance and the boot disk. After you confirm this, Google Compute Engine will try to delete the virtual machine instance and the boot disk. You can again confirm that the instance is shut down (and therefore not accruing any cost) by executing the following command on your local machine:
gcloud compute instances list
Note: Sometimes Google Cloud services may have some lag, and the commands may time out in the terminal. If this happens, you can review the deletion progress in the Google Cloud Console.
Setting up Your Screaming Frog Instance
Now that the Google Cloud project is set up, and you know the basic commands to work with Google Compute Engine instances, it is time to set up an instance with Screaming Frog SEO Spider.
First we create a new instance by executing the following command in the terminal (note I am using “screaming-frog” as the unique identifier for the instance):
gcloud compute instances create screaming-frog --scopes storage-rw
Next, choose a machine with enough RAM (I tend to go for n1-standard-8), and the debian-7-wheezy image (most important!).
The observant reader may also have noticed the added --scopes flag (which sets the service account scopes) in the previous command. This flag enables you to store your installation later, and will save time whenever you want to use the image with Screaming Frog SEO Spider on Google Compute Engine in the future.
After the instance is up and running, SSH into the instance with the following command:
gcloud compute ssh screaming-frog
Now that you are logged into the instance, you need to switch to root by executing the following command in the terminal:
sudo -s
Now that you are in root, you need to update the software packages. Execute the following command:
apt-get update
Then execute the following command to install the necessary programs:
apt-get install tightvncserver xfce4 xfce4-goodies xdg-utils openjdk-6-jre software-properties-common python-software-properties
This will take a few minutes, and will install a VNC server and a minimalistic Graphical User Interface that uses very little resources. When asked for the keyboard configuration, just choose the default (use Tab on your keyboard to navigate to the “OK” option and Enter to execute).
At this point it may also be handy to execute the following command to avoid future warnings about locales not set:
dpkg-reconfigure locales
You will be asked to select a locale. The easiest (but also the most time-consuming) option is to select “All Locales” (use Tab to navigate to the “OK” option), and then select the default “None” as the default locale for the system environment. This process may take a few minutes to complete and is completely optional.
Once that process is completed, you need to add another user, named “vnc”, to the system by executing the following command:
adduser vnc
When prompted, enter a secure password of eight characters. You can skip all the other values by just pressing Enter for the default. Choose Y to confirm that the information is correct.
Now you need to set up a new password for the user. First switch to the new user by executing the following command:
su vnc
and then execute the following command:
vncpasswd
When prompted, say (N)o to the question of whether you would like to enter a view-only password. I recommend you use the same eight-character password as the one you chose when creating the user.
This process will create a new directory in the /home/ directory of the VNC user, and set a new password that will later be used to make a VNC connection to the instance. Keep in mind that this password can not be more than eight characters long.
Setting up Startup Scripts
Now that the VNC user has been set up, a few startup scripts need to be installed that will run the VNC server every time the instance gets started and/or rebooted. First change back to the root user by typing the following command:
exit
Now download the first startup script by executing the following command:
wget http://filiwiese.com/files/vncserver -O /etc/init.d/vncserver
Then download the second startup script by executing the following command:
wget http://filiwiese.com/files/xstartup -O /home/vnc/.vnc/xstartup
Now that the startup scripts have been downloaded and installed, you can make the VNCserver work by executing the following commands:
chown -R vnc. /home/vnc/.vnc && chmod +x /home/vnc/.vnc/xstartup
sed -i 's/allowed_users.*/allowed_users=anybody/g' /etc/X11/Xwrapper.config
chmod +x /etc/init.d/vncserver
Now reboot the instance by executing the following command:
reboot
The SSH connection will be closed at this time. It may take a minute or two, but then access the instance through SSH again by executing the following command:
gcloud compute ssh screaming-frog
and switch again to the root user by executing the following command:
sudo -s
Now let’s start the VNC service by executing the following two commands:
update-rc.d vncserver defaults
service vncserver start
Congratulations, you can now use any VNC-capable program to access the instance using a VNC connection.
Installing Screaming Frog SEO Spider
Before connecting through VNC, let’s finish the installation process by installing Screaming Frog SEO Spider and the Oracle Java library by executing the following commands:
echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" | tee /etc/apt/sources.list.d/webupd8team-java.list
echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" | tee -a /etc/apt/sources.list.d/webupd8team-java.list
apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EEA14886
apt-get update
apt-get install oracle-java8-installer
When prompted, select OK and use the arrow keys to select YES. Next, set Oracle Java as the default Java library:
apt-get install oracle-java8-set-default
To confirm that Oracle Java 8 has been successfully installed and made the default Java library, execute the following command:
java -version
and the following is returned:
java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)
Now that the Oracle Java library is installed, we need to add the “ttf-mscorefonts-installer” package before we can install the latest version of Screaming Frog SEO Spider.
add-apt-repository "deb http://http.debian.net/debian wheezy main contrib non-free" && apt-get update && apt-get install ttf-mscorefonts-installer
Now execute the following command to download Screaming Frog SEO Spider:
Then install Screaming Frog SEO Spider for all users by executing the following command:
dpkg -i screamingfrogseospider_5.1_all.deb
This may throw up an error due to a dependency called “zenity”. This can be solved by entering the following command:
apt-get -f install
After which the following is returned (in a long list):
Setting up screamingfrogseospider (5.10) ...
Screaming Frog SEO Spider is now installed!
Note: A newer version of Screaming Frog for Ubuntu may have been released by the time you read this article. If that is the case, go to the website of Screaming Frog SEO Spider to find the new URL to the latest Ubuntu version.
Connecting through VNC
Now let’s connect to the VNC server. To find out which IP address you need to access the instance upon, log out of the SSH connection and execute the following command on your local machine to get the external IP address listed in the table:
gcloud compute instances list
Before you continue in the terminal, the firewall rules of the instance need to be updated. To do this, go to the Google Cloud Console in your browser, select the project, select Networking in the left sidebar, and click on the “Firewall rules” link. Here you need to add a new rule. Click on the “Create a firewall rule” button, and use the following details to fill in the form:
Name: vnc
Source IP ranges: <YOUR_IP_ADDRESS>
Allowed protocols & ports: tcp:5800,5900-5909
Replace “YOUR_IP_ADDRESS” with your IP address. If needed, you can find your IP address here. Use the defaults for the remaining fields and click the blue “Create” button. A new firewall rule will be created that will allow you to access the VNC server.
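The same firewall rule can most likely also be created from the terminal on your local machine with the gcloud tool. The sketch below is an assumption on my part and not a step from the original setup: the helper names run and allow_vnc are made up, 203.0.113.7 is a placeholder address, and the /32 suffix restricts the rule to that single address. Note that on the command line each port range carries its own protocol prefix. By default the command is only printed; set DRY_RUN=0 to execute it.

```shell
# Print the command (DRY_RUN=1, the default) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create a firewall rule named "vnc" that opens the VNC ports
# for one source IP address only.
# Usage: allow_vnc <your-ip-address>
allow_vnc() {
    source_ip=$1
    run gcloud compute firewall-rules create vnc \
        --allow tcp:5800,tcp:5900-5909 \
        --source-ranges "$source_ip/32"
}

# Example (203.0.113.7 is a placeholder, use your own IP address):
# allow_vnc 203.0.113.7
```

If your IP address changes, the rule needs to be updated or recreated before you can connect through VNC again.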
Now, to connect to the instance through VNC on Ubuntu, try the program Remmina, using the following details:
hostname: <EXTERNAL_IP_ADDRESS>:5901
password: <VNC_PASSWORD>
If needed, you can install Remmina by executing the following command on your local machine:
sudo apt-get install remmina
Once the VNC connection has been established, a pop-up can be seen on the desktop. Select the “Use default config” option.
Next create a shortcut on the desktop for Screaming Frog SEO Spider. Right-click on the background of the desktop, select “Create Launcher” and use the following details to fill in the pop-up:
Name: Screaming Frog
Command: screamingfrogseospider %f
and click the Create button. A new shortcut will be created, and can be found on the desktop. Just double-click this shortcut to start Screaming Frog SEO Spider. At this point, it may be helpful to enter the licence information before closing the program again.
Note: When typing this name you may get a suggestion to select “Create Launcher Screaming Frog SE…”. If this happens, select this option. Also by clicking on the ICON button, before clicking the “Create” button and after selecting the suggestion, the standard Screaming Frog SEO Spider icon can be selected for the launcher.
You are now almost done with the installation process. There is one more step, which is to adapt the allocated memory to the instance RAM we have available. This process will only work if you have started Screaming Frog SEO Spider at least once through the VNC. Go back to the terminal and connect through SSH again with the following command:
gcloud compute ssh screaming-frog
Now switch users with the following commands:
sudo -s
su vnc
Now open the Screaming Frog SEO Spider configuration file for allocated memory with the following command:
nano ~/.screamingfrogseospider
Depending on the type of machine you have chosen for the instance, change the number 512 to a number close to the maximum RAM of the instance. For example, if using the n1-standard-8 then the available RAM is 30GB – in this case update the number 512 to 29000. Close the file and save the changes by pressing Ctrl-X, and answer Y(es) to the question whether or not to save the modified buffer. In the future, whenever the type of machine of the instance is smaller or bigger, be sure to update this number to a size below the available RAM.
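To pick a number for a given machine type, the arithmetic can be sketched as follows. This is an illustration only: the function names are made up, and the default reserve of 1024 MB for the operating system, VNC and the desktop is an assumption, not a value recommended by Screaming Frog.

```shell
# Suggest a memory allocation in MB: total RAM minus a reserve that is
# left for the operating system, the VNC server and the desktop.
# Usage: suggest_heap_mb <total-ram-mb> [<reserve-mb>]
suggest_heap_mb() {
    total_mb=$1
    reserve_mb=${2:-1024}   # assumed default reserve
    echo $(( total_mb - reserve_mb ))
}

# Total RAM of the current machine in MB, read from /proc/meminfo
# (Linux only; MemTotal is reported in kB).
total_ram_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

# Example on the instance itself:
# suggest_heap_mb "$(total_ram_mb)"
```

For the n1-standard-8 example above, 30GB is 30720 MB, so a reserve of 1720 MB gives the 29000 used in this article.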
This is also a good time to start the Screaming Frog SEO Spider and change the configuration, e.g. User Agent, Speed, Crawl settings, etc. After you are done, be sure to set your current configuration to default.
At this point VNC and Screaming Frog SEO Spider are set up. To make sure you don’t have to repeat all of these steps each time you want to start another instance with Screaming Frog SEO Spider and VNC, let’s save everything in a custom image.
Backing up Your Instance
To back everything up, first log in to the instance from your computer by executing the following command:
gcloud compute ssh screaming-frog
Then execute the following command to start the back-up process:
sudo gcimagebundle -d /dev/sda -o /tmp/ --log_file=/tmp/abc.log
This command will create an image of all the settings and programs installed in the previous steps. The output of this command will show a long hex number that represents the name and location of the newly created image, such as:
Temporarily copy-paste this long hex number somewhere, because you will need it in the next few steps.
Now that a back-up image has been created, the image needs to be stored in Google Cloud Storage. First authenticate and configure your Google Cloud Storage access with the following command:
gsutil config
Follow the instructions and open a new browser window with the provided URL, accept the permission request and copy-paste back the authorisation code provided in the input field of the provided URL. Then enter the Project ID, in this case screaming-frog-wb, and hit Enter.
Next create a new bucket in Google Cloud Storage with a unique name by executing the following command:
gsutil mb gs://<bucket-name>
Note: Replace the <bucket-name> with a name that is unique across all Google Cloud Storage buckets. Be aware that you may need to be creative to find an available name.
The next step is to copy the image into the Google Cloud Storage bucket by executing the following command:
gsutil cp /tmp/<long-hex-name>.image.tar.gz gs://<bucket-name>
Note: Be sure to update the <long-hex-name> and the <bucket-name> from the previous command.
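If you did not write down the long hex name, it can usually be recovered from the output directory. The following sketch assumes that the back-up was written to /tmp/ as in the command above and that the file name ends in .image.tar.gz; the function name latest_image is made up for illustration.

```shell
# Print the most recently modified *.image.tar.gz file in a directory
# (defaults to /tmp). Prints nothing if no such file exists.
# Usage: latest_image [<directory>]
latest_image() {
    dir=${1:-/tmp}
    ls -t "$dir"/*.image.tar.gz 2>/dev/null | head -n 1
}

# Example: copy the newest image into the bucket
# gsutil cp "$(latest_image)" gs://<bucket-name>
```

This is only a convenience; when several images are stored in the same directory, check that the printed file is the one you intend to upload.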
Once the process of copying the back-up image to the Google Cloud Storage bucket has completed, log out of the SSH connection with the instance by executing the following command:
exit
and add the custom back-up image to the Images collection of the Google Cloud Project by executing the following command in the terminal on your local machine:
gcloud compute images --project screaming-frog-wb create screaming-frog-image --source-uri gs://<bucket-name>/<long-hex-name>.image.tar.gz
Once this process has been completed, the back-up image is safely stored in the Google Cloud Storage and will be accessible within the project the next time you create an instance. If needed, the image can again be deleted using the Google Cloud Console or by executing the following command:
gcloud compute images --project screaming-frog-wb delete screaming-frog-image
Alternatively, to verify that the creation of the back-up image was successful, go to the Google Cloud Console, select the project, select Compute Engine, select Images on the left sidebar and here you should see the image “screaming-frog-image” in the list of available images.
At this point everything is configured and saved, so the current instance can be de-activated by using the following command:
gcloud compute instances delete screaming-frog --delete-disks boot
This will de-activate the instance and delete the disk, avoiding any additional cost (except for storage of the custom image) until the next time you need to use Screaming Frog SEO Spider on Google Compute Engine. This can be confirmed by executing the following command:
gcloud compute instances list
which should return again an empty table.
Using Your Preconfigured Instance
When you are ready to use Screaming Frog SEO Spider on Google Compute Engine, open the terminal on your computer and execute the following command:
gcloud compute instances create screaming-frog
When prompted, choose a machine with enough RAM (preferably n1-standard-8 or higher) and the “screaming-frog-image” image. Once the instance is up and running, note the external IP address assigned to the instance.
Next, start up the VNC program, such as Remmina, and connect to the instance using the external IP address on port 5901, and the eight-character password you previously set.
Start Screaming Frog SEO Spider and start crawling…
Adding Extra Disk Space
When you start using the instance, you will soon notice that the maximum space on the default instance is only ten gigabytes, of which approximately one gigabyte is used for the installation of the operating system, Screaming Frog, VNC and its dependencies. If you need more disk space, you can add it using a second persistent disk.
The simplest way of adding additional disk space to your Google Compute Engine instance is to go to the Google Cloud Console in your browser, select your project, then Compute Engine, then Disks, and click on the red “New Disk” button. In the form that appears, select the same Zone as your instance (this is extremely important, as you will not be able to attach the disk to an instance running in a different zone) and select a blank disk as Source Type. For size, I recommend sticking to the default 500 gigabyte disk unless you already know in advance that you need more. You will be surprised how quickly a second disk fills up.
After you have created the disk, navigate in the Google Cloud Console to your instance and attach the disk to your instance.
When prompted choose the read/write option.
After this you will see the disk attached to your instance (if not try rebooting the instance) and it can be accessed through the command line. Next log into your instance through SSH and switch to the root user by executing the following commands:
gcloud compute ssh screaming-frog
sudo -s
Now identify the disk designation by executing the following command:
fdisk -l
You will most likely see a message stating that the second disk (e.g. /dev/sdb) does not have a valid partition table.
We solve that by adding a partition table to the new disk by executing the following command:
fdisk /dev/sdb
When prompted, type ‘n’, choose ‘e’ and just press Enter for the defaults after this. When finished, type ‘q’ to quit fdisk again.
The next step is to format the disk by executing the following command:
mkfs.ext4 /dev/sdb
Now the disk can be mounted to the instance so it can be accessed through command line and/or the file explorer in VNC and used to store data. Execute the following commands to mount the disk:
mkdir /mnt/disk1
mount /dev/sdb /mnt/disk1
To make sure that all users can access the disk and the contents of the disk, the rights to the disk need to be updated by executing the following command:
chmod 777 /mnt/disk1
When adding additional files to the disk, you may need to keep updating the rights of your files so that other users (e.g. vnc) can also read and write to these files. This can be done by navigating to the relevant directory through the command line and executing the following command:
chmod 777 *
Note: It is generally considered a bad idea to apply “chmod 777” to files on any remote server. However, since our instance is meant to be temporary and is only accessible through VNC and SSH, I consider it to be a lower risk, and it makes things easier. If the instance is meant to run for longer periods of time, I recommend you explore other options, such as using user groups instead.
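As a rough idea of what the user group alternative could look like, here is a minimal sketch. It is not part of the original setup: the function name share_with_group and the group name “crawl” are made up, and the group is assumed to exist already with the relevant users added to it.

```shell
# Give a group read/write access to a directory tree instead of
# opening it up to everyone with "chmod 777".
# Usage: share_with_group <directory> <group>
share_with_group() {
    dir=$1
    group=$2
    # Hand the whole tree to the group ...
    chgrp -R "$group" "$dir" &&
    # ... let the group read and write (and enter directories) ...
    chmod -R g+rwX "$dir" &&
    # ... and make new files inherit the group (setgid on directories).
    find "$dir" -type d -exec chmod g+s {} +
}

# Example, run as root (the group name "crawl" is an assumption):
# groupadd crawl
# usermod -a -G crawl vnc
# share_with_group /mnt/disk1 crawl
```

Users added to a group typically need to log in again before the new group membership takes effect.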
Now the disk can be used to store data from Screaming Frog SEO Spider crawls and more.
Be aware that the disk operates independently of the instance. When the instance is deleted, the disk will not be automatically deleted. This makes it possible to preserve the disk (with all stored data) in the Google cloud and re-attach it in the future by mounting it to a new instance. Keep in mind that there are costs involved with not deleting the additional disk.
Transferring Data Through SSH
Data between your local machine and the Google Compute Engine instance can also be transferred through SSH using the copy-files command of gcloud tool.
On your local machine you can upload data to the Google Compute Engine instance by executing the following command:
gcloud compute copy-files /home/user/local-file instance-name:/home/remote-user/remote-file
For example, when this command is translated to uploading a zip file to the second disk attached on the screaming-frog instance the following command works:
gcloud compute copy-files /home/fili/local.zip screaming-frog:/mnt/disk1/
Alternatively, data can also be downloaded from the instance through SSH by executing the following command:
gcloud compute copy-files instance-name:/mnt/disk1/remote-file /home/user/
Again, when this command is translated to downloading a zip file from the second disk attached on the screaming-frog instance to the local computer the following command works:
gcloud compute copy-files screaming-frog:/mnt/disk1/remote.zip /home/fili/
More information about the copy-files command of gcloud tool (and other commands) can be found here.
Keeping Command Line Processes Running
The instance can also be used for processing large data files through command line or for other command line processes. When disconnecting with an instance through SSH, for example through loss of a network connection, processes you are running in the command line are shut down and data may be lost. This can easily be solved by installing and using tmux.
First connect to the instance through SSH and install tmux by executing the following commands:
gcloud compute ssh screaming-frog
sudo -s
apt-get install tmux
exit
Now tmux can be used by executing the following command, which opens a new command line shell within the existing command line shell:
tmux
Now any command executed in the tmux shell will continue running when disconnected. When needed, leave the tmux window by pressing ‘Ctrl-B’ and then ‘d’ on your keyboard. This will detach the tmux shell from the shell window. You can get back into the tmux shell window, for example to see the status of your processes, by executing the following command:
tmux attach
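When you run more than one long process, named tmux sessions are easier to find again. The sketch below is one possible way to do this and an assumption on my part: the function name tmux_session and the default session name “crawl” are made up for illustration.

```shell
# Attach to the named tmux session if it already exists,
# otherwise create it.
# Usage: tmux_session [<name>]
tmux_session() {
    name=${1:-crawl}   # assumed default session name
    if tmux has-session -t "$name" 2>/dev/null; then
        tmux attach -t "$name"
    else
        tmux new -s "$name"
    fi
}

# Example: tmux_session crawl
```

Running the same command again after a lost connection brings you back to the session with your processes still running.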
Although getting started with Google Compute Engine may seem intimidating at first, I discovered a lot of benefits (speed and pure computing power) by using Google Compute Engine for processing large data files and crawling URLs. I highly recommend learning more and experimenting with Google Cloud services, especially as a SEO Consultant.
Do you have any additional tips about using Google Compute Engine for SEO? Do share them in the comments!
A previous German version of this article was published in the 25th anniversary print edition of the German magazine Website Boosting on behalf of SearchBrothers.com. If you read German and are interested in SEO, I highly recommend you check out this magazine.
Running Screaming Frog SEO Spider on the Google Cloud Servers by Fili Wiese