Running Screaming Frog SEO Spider on the Google Cloud Servers

Last updated: January 26th, 2016

Screaming Frog SEO Spider is one of the few tools no SEO can do without. It is unquestionably a superb crawler, until you start feeling its limitations. Does this situation sound familiar: running out of RAM when crawling a really big site with Screaming Frog SEO Spider, and not being able to put more RAM into your computer to satisfy its needs? There is a solution: run Screaming Frog SEO Spider in the cloud, and more specifically on Google Cloud infrastructure.

Screaming Frog SEO Spider

For most SEOs, Screaming Frog SEO Spider is THE crawler for crawling and analysing a website from an SEO perspective. Developed by Screaming Frog, an agency in the United Kingdom, it is also one of the few quality crawlers – if not the only one – that works on Debian-based systems like Ubuntu. Screaming Frog SEO Spider can be used for many different purposes, such as on-page analysis and checking your backlink profile.

The biggest challenge with Screaming Frog SEO Spider is that the program needs a lot of RAM to crawl big sites or large lists of URLs. Although the development team are working on several improvements, in the meantime you can utilise Google infrastructure with lots of RAM to run Screaming Frog SEO Spider. This is where Google Compute Engine comes in.

Google Compute Engine

In case you are not aware, Google allows you to run large-scale workloads on virtual machines powered by Google infrastructure. This is called Google Compute Engine, and it is a great opportunity to rent computing resources on a pay-per-use basis. Google Compute Engine is still very much under development, but it is already a serious competitor to the Amazon Cloud and Windows Azure Cloud services in the cloud-computing field. Pricing with Google Compute Engine is often as competitive as, or even cheaper than, the alternatives, making it a great option next to the already-established Amazon Cloud service.

And to be honest, as an SEO consultant, former Google Support Engineer and Google Search Quality team member, I love the notion of using Google hardware and infrastructure to crawl websites for me. So let’s get started by setting up your new Google Compute Engine instance with Screaming Frog SEO Spider.

Installing the Google Cloud SDK

To get started, you first need to install the Google Cloud SDK locally on your computer. This process is rather lengthy, but only needs to be done once. As it is much easier to run this from a Linux distribution such as Ubuntu, or from a Mac, this article will only dive into the steps for Linux machines. Follow these steps if you prefer to install the Google Cloud SDK on a local Windows machine.

First open a terminal and make sure that curl is installed. You can test this by typing “curl” into the terminal and seeing whether the shell suggests installing it. If needed, and assuming you run Ubuntu, install curl by executing the following command:

sudo apt-get install curl

After you have verified that curl is installed, run the following command in the terminal (from the home directory). This will download and install the Google Cloud SDK:

curl https://sdk.cloud.google.com | bash

The next step is to let the terminal know that Google Cloud SDK has been installed. You have two options, the first of which is the more straightforward: close the terminal and then reopen it. Alternatively, you can also execute the following command (which will avoid you having to restart the terminal):

exec -l $SHELL

Once you have done this, you can verify that the Google Cloud SDK is installed by executing the following command:

gcloud version

If this does not return an error, and gives you an output with a version number and a list of different tools installed, then the Google Cloud SDK is installed and works on your system.

Verifying the installed version of Google Cloud SDK on local machine.

The next step is to authenticate your computer with the Google Cloud services. This will allow you to send commands to the Google Cloud to manage different Google Cloud services. In the terminal, execute the following command:

gcloud auth login

If you have a browser installed on your local computer, a new browser window will open and ask you to give permission for the Google Cloud SDK to access your Google account (you may be asked to log into your Google account first). Otherwise, you will be asked to copy-paste a link from the terminal into a browser, and complete the process that way. Once accepted, you should see a confirmation: “You are now authenticated with the Google Cloud SDK.”

In the meantime, in the terminal you will be asked to enter the Google Cloud project ID. Just press Enter for now, as we will get back to this in a minute. If everything went well, then you will have received a message in the terminal that you are logged in with your Google account.

Google Developer Console

The next step is to go to the Google Cloud Console for developers. This is the website for managing the Google Cloud services, such as Google Compute Engine.

To get started, a new project needs to be created, so click the red button at the top of the page that says “Create Project”. A pop-up appears where you can give the project a name and define a Project ID. The Project Name is not that important, as it is only used in the Google Cloud Console. Just fill in “Screaming Frog” here.

The Project ID is very important, as it uniquely identifies the project among all Google Cloud Services users. So if you don’t go with the default suggestions from Google (click the little arrow on the right side of the input field for more suggestions from Google) it may take some effort to find an available and unique Project ID. For this project I am going with “screaming-frog-wb”.

Creating a new Google Cloud project in the Google Cloud Console.

Once you click the Create button, you will be redirected to the overview page of the project – in this case https://console.developers.google.com/home/dashboard?project=screaming-frog-wb (note that the Project ID is used in the URL here).

Now comes an important step: we need to enable billing. Google Compute Engine does not have any free quotas, so in order to use Google Compute Engine, we need to enable billing. Find out more information about the pricing of Google Compute Engine here.

To enable billing, click on Settings in the left sidebar of the project overview page. The first option you should now see is Billing, with a grey button labelled “Enable billing”. Have your credit card ready, then click this. Select the right country, enter your address, tax information (if applicable), phone number and name, and continue to enter your credit card data. Once you have completed this step you are ready to start using Google Compute Engine.

Tip: Once you have enabled billing and started using the Google Cloud services, you will see a link on the overview page of the project (in the upper right corner) to more details of the estimated charges for the present month.

After you have created the project and enabled billing, let’s check the settings to make sure you can access the different tools. Click on the “APIs & auth” menu option in the left sidebar of the project overview page. Now check for Google Compute Engine and Google Cloud Storage, and enable these services by clicking on the “off” button (which then turns green and displays the text “on”). To be safe, also enable the Google Cloud Storage JSON API service.

Enable APIs for Google Cloud Services

All the steps until now have been about setting up your computer and the project. Most of these steps do not need to be repeated unless you change computers or want to set up new projects.

Are you ready to start the first virtual machine on Google infrastructure?

Running Your First Google Compute Engine Instance

Open the terminal again, and execute the following command (replace <project-id> with your unique Project ID – for this article that would be screaming-frog-wb):

gcloud config set project <project-id>

Everything you do now with the Google Cloud SDK will be executed as part of project Screaming Frog, which includes the billing.

Next, we need to set a zone by executing the following command (replace <zone> with your preferred zone – for this article that would be “us-central1-a”):

gcloud config set compute/zone <zone>

Be aware that European zones tend to be slightly more expensive than US zones, and that setting a default zone is optional. It can also be accomplished by adding

--zone <zone>

to almost every gcloud command in this article. When no default is set using “gcloud config”, you can use this added command line parameter to run multiple Google Compute Engine instances, utilising many CPUs, across multiple zones.
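For example, a second instance could be started in a European zone like this (the instance name and zone below are illustrative, not part of this setup):

```shell
gcloud compute instances create screaming-frog-eu --zone europe-west1-b
```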

To confirm that you don’t have anything running, execute the following command in the terminal:

gcloud compute instances list

This should display an empty table.

List all active Google Compute Engine instances through command line, using gcloud tool.

All good so far. Now you can create a new instance by executing the following command:

gcloud compute instances create screaming-frog-test

The observant reader may have noticed that I omitted the “-wb” part in the previous command. This is because “screaming-frog-test” is the unique identifier for this instance within the project “screaming-frog-wb”, not the Project ID itself.

In the terminal you will now be asked to select a machine type and an image. For this stage you can go with the f1-micro machine type (the cheapest) and the debian-7-wheezy image. The instance will now be set up. Once completed, you can run the following command to log in using SSH (command line):
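If your version of gcloud does not prompt interactively, the same choices can be passed as flags instead (a sketch; exact flag names may vary between SDK versions):

```shell
gcloud compute instances create screaming-frog-test \
    --machine-type f1-micro \
    --image debian-7-wheezy
```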

gcloud compute ssh screaming-frog-test

Note: You may be asked to set up the SSH keys. Just follow the instructions and use a passphrase that you can remember.

Congratulations! You are now connected to the virtual machine on Google infrastructure. You can confirm this by going to the project in the Google Cloud Console, or executing the following command in the terminal on your local machine:

gcloud compute instances list

Just for this stage, let’s shut the instance down again. Assuming that you are still connected to the instance using SSH, execute the following command in the terminal:

exit

This will log you out and close the connection between your computer and the instance. Now go to the Google Cloud Console, select the project, select Compute Engine, select VM Instances, click on the screaming-frog-test link, and go to the bottom of the page. Here you can click the Delete button to delete the instance. When you click this, don’t forget to also delete the boot disk, screaming-frog-test.

Alternatively, to shut down an instance again, execute the following command in the terminal:

gcloud compute instances delete screaming-frog-test --delete-disks boot

You will be asked to confirm that you want to delete the instance and the boot disk. After you confirm this, Google Compute Engine will try to delete the virtual machine instance and the boot disk. You can again confirm that the instance is shut down (and therefore not accruing any cost) by executing the following command on your local machine:

gcloud compute instances list

Note: Sometimes Google Cloud services may have some lag, and the commands may time out in the terminal. If this happens, you can review the deletion progress in the Google Cloud Console.

Setting up Your Screaming Frog Instance

Now that the Google Cloud project is set up, and you know the basic commands to work with Google Compute Engine instances, it is time to set up an instance with Screaming Frog SEO Spider.

First we create a new instance by executing the following command in the terminal (note I am using “screaming-frog” as the unique identifier for the instance):

gcloud compute instances create screaming-frog --scopes storage-rw

Next, choose a machine with enough RAM (I tend to go for n1-standard-8), and the debian-7-wheezy image (most important!).

The observant reader may also have noticed the added --scopes flag in the previous command. This flag gives the instance read/write access to Google Cloud Storage, which enables you to store your installation later as a custom image, and will save time whenever you want to use the image with Screaming Frog SEO Spider on Google Compute Engine in the future.
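Putting the choices from this step together, the full create command could look like this (a sketch; exact flag names may vary between SDK versions):

```shell
gcloud compute instances create screaming-frog \
    --scopes storage-rw \
    --machine-type n1-standard-8 \
    --image debian-7-wheezy
```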

After the instance is up and running, SSH into the instance with the following command:

gcloud compute ssh screaming-frog

Now that you are logged into the instance, you need to switch to root by executing the following command in the terminal:

sudo -s

Now that you are in root, you need to update the software packages. Execute the following command:

apt-get update

Then execute the following command to install the necessary programs:

apt-get install tightvncserver xfce4 xfce4-goodies xdg-utils openjdk-6-jre software-properties-common python-software-properties

This will take a few minutes, and will install a VNC server and a minimalistic Graphical User Interface that uses very little resources. When asked for the keyboard configuration, just choose the default (use Tab on your keyboard to navigate to the “OK” option and Enter to execute).

At this point it may also be handy to execute the following command to avoid future warnings about locales not set:

dpkg-reconfigure locales

You will be asked to select a locale. The easiest (but also the most time-consuming) option is to select “All locales” (use Tab to navigate to the “OK” option). Then select the default “None” as the default locale for the system environment. This process may take a few minutes to complete and is completely optional.

Once that process is completed, you need to add another user, named “vnc”, to the system by executing the following command:

adduser vnc

When prompted, enter a secure password of no more than eight characters. You can skip all the other values by just pressing Enter for the defaults. Choose Y to confirm that the information is correct.

Now you need to set up a new password for the user. First switch to the new user by executing the following command:

su vnc

and then execute the following command:

vncpasswd

When prompted, answer (N)o to the question whether you would like to enter a view-only password. I recommend using the same eight-character password as the one you chose when creating the user.

This process will create a new directory in the /home/ directory of the VNC user, and set a new password that will later be used to make a VNC connection to the instance. Keep in mind that this password cannot be more than eight characters long.

Setting up Startup Scripts

Now that the VNC user has been set up, a few startup scripts need to be installed that will run the VNC server every time the instance gets started and/or rebooted. First change back to the root user by typing the following command:

exit

Now download the first startup script by executing the following command:

wget http://filiwiese.com/files/vncserver -O /etc/init.d/vncserver

Then download the second startup script by executing the following command:

wget http://filiwiese.com/files/xstartup -O /home/vnc/.vnc/xstartup

Now that the startup scripts have been downloaded and installed, you can make the VNCserver work by executing the following commands:

chown -R vnc. /home/vnc/.vnc && chmod +x /home/vnc/.vnc/xstartup
sed -i 's/allowed_users.*/allowed_users=anybody/g' /etc/X11/Xwrapper.config
chmod +x /etc/init.d/vncserver

Now reboot the instance by executing the following command:

reboot

The SSH connection will be closed at this point. It may take a minute or two, but then you can access the instance through SSH again by executing the following command:

gcloud compute ssh screaming-frog

and switch again to the root user by executing the following command:

sudo -s

Now let’s start the VNC service by executing the following two commands:

update-rc.d vncserver defaults
service vncserver start

Congratulations, you can now use any VNC-capable program to access the instance using a VNC connection.

Installing Screaming Frog SEO Spider

Before connecting through VNC, let’s finish the installation process by installing Screaming Frog SEO Spider and the Oracle Java library by executing the following commands:

echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" | tee /etc/apt/sources.list.d/webupd8team-java.list
echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" | tee -a /etc/apt/sources.list.d/webupd8team-java.list
apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EEA14886
apt-get update
apt-get install oracle-java8-installer

When prompted, select OK and use the arrow keys to select YES. Next, set Oracle Java as the default Java library:

apt-get install oracle-java8-set-default

To confirm that Oracle Java 8 has been successfully installed and made the default Java library, execute the following command:

java -version

The following should be returned:

java version "1.8.0_25"
Java(TM) SE Runtime Environment (build 1.8.0_25-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.25-b02, mixed mode)

Now that the Oracle Java library is installed, we need to add the “ttf-mscorefonts-installer” package before we can install the latest version of Screaming Frog SEO Spider:

add-apt-repository "deb http://http.debian.net/debian wheezy main contrib non-free" && apt-get update && apt-get install ttf-mscorefonts-installer

Now execute the following command to download Screaming Frog SEO Spider:

wget http://www.screamingfrog.co.uk/products/seo-spider/screamingfrogseospider_5.1_all.deb

Then install Screaming Frog SEO Spider for all users by executing the following command:

dpkg -i screamingfrogseospider_5.1_all.deb

This may throw up an error due to a dependency called “zenity”. This can be solved by entering the following command:

apt-get -f install

After which, at the end of a long list of output, the following is returned:

Setting up screamingfrogseospider (5.10) ...

Screaming Frog SEO Spider is now installed!

Note: A newer version of Screaming Frog for Ubuntu may have been released by the time you read this article. If that is the case, go to the website of Screaming Frog SEO Spider to find the new URL to the latest Ubuntu version.

Connecting through VNC

Now let’s connect to the VNC server. To find out which IP address you need in order to access the instance, log out of the SSH connection and execute the following command on your local machine; the external IP address is listed in the table:

gcloud compute instances list

Before you continue in the terminal, the firewall rules of the instance need to be updated. To do this, go to the Google Cloud Console in your browser, select the project, select Networking in the left sidebar, and click on the “Firewall rules” link. Here you need to add a new rule. Click on the “Create a firewall rule” button, and use the following details to fill in the form:

Name: vnc
Source IP ranges: <YOUR_IP_ADDRESS>
Allowed protocols & ports: tcp:5800,5900-5909

Replace “YOUR_IP_ADDRESS” with your IP address, if needed you can find your IP address here. Use the defaults for the remaining fields and click the blue “Create” button. A new firewall rule will be created that will allow you to access the VNC server.
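If you prefer the terminal, the same rule can be created with gcloud (a sketch; the firewall-rules syntax may differ between SDK versions, and the /32 suffix assumes you are allowing a single IP address):

```shell
gcloud compute firewall-rules create vnc \
    --allow tcp:5800,tcp:5900-5909 \
    --source-ranges <YOUR_IP_ADDRESS>/32
```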

Now, to connect to the instance through VNC on Ubuntu, try the program Remmina, using the following details:

hostname: <external-ip>:5901
password: <your-vnc-password>

Example of configuring your VNC connection (Remote Desktop) to a Google Compute Engine instance.

If needed, you can install Remmina by executing the following command on your local machine:

sudo apt-get install remmina

Once the VNC connection has been established, a pop-up can be seen on the desktop. Select the “Use default config” option.

Next create a shortcut on the desktop for Screaming Frog SEO Spider. Right-click on the background of the desktop, select “Create Launcher” and use the following details to fill in the pop-up:

Name: Screaming Frog
Command: screamingfrogseospider %f

and click the Create button. A new shortcut will be created, and can be found on the desktop. Just double-click this shortcut to start Screaming Frog SEO Spider. At this point, it may be helpful to enter the licence information before closing the program again.

Note: When typing this name you may get a suggestion to select “Create Launcher Screaming Frog SE…”. If this happens, select this option. Also, by clicking on the ICON button (after selecting the suggestion and before clicking the “Create” button), the standard Screaming Frog SEO Spider icon can be selected for the launcher.

Example of creating a new launcher and choosing the suggestion from Debian.

You are now almost done with the installation process. There is one more step, which is to adapt the allocated memory to the instance RAM we have available. This process will only work if you have started Screaming Frog SEO Spider at least once through the VNC. Go back to the terminal and connect through SSH again with the following command:

gcloud compute ssh screaming-frog

Now switch users with the following commands:

sudo -s
su vnc

Now open the Screaming Frog SEO Spider configuration file for allocated memory with the following command:

pico /home/vnc/.screamingfrogseospider

Depending on the type of machine you have chosen for the instance, change the number 512 to a number close to the maximum RAM of the instance. For example, if using the n1-standard-8 then the available RAM is 30GB – in this case update the number 512 to 29000. Close the file and save the changes by pressing Ctrl-X, and answer Y(es) to the question whether or not to save the modified buffer. In the future, whenever the type of machine of the instance is smaller or bigger, be sure to update this number to a size below the available RAM.
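The edit above can also be sketched as a single sed command. The real file on the instance is /home/vnc/.screamingfrogseospider; the demo below operates on a local copy, and the 29000M heap value assumes an n1-standard-8 with 30GB of RAM:

```shell
# Demonstration on a local copy; on the instance the file is
# /home/vnc/.screamingfrogseospider (path taken from this guide).
CONF=/tmp/screamingfrog-memory-demo
echo "-Xmx512M" > "$CONF"
# Raise the Java heap to just below the machine's RAM,
# e.g. 29000M on an n1-standard-8 with 30GB.
sed -i 's/-Xmx[0-9]*[MGmg]*/-Xmx29000M/' "$CONF"
cat "$CONF"   # prints -Xmx29000M
```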

This is also a good time to start the Screaming Frog SEO Spider and change the configuration, e.g. User Agent, Speed, Crawl settings, etc. After you are done, be sure to set your current configuration to default.

At this point VNC and Screaming Frog SEO Spider is set up. Now to make sure you don’t have to repeat all of these steps again each time you want to start another instance with Screaming Frog SEO Spider and VNC, let’s save everything in a custom image.

Backing up Your Instance

To back everything up, first log in to the instance from your computer by executing the following command:

gcloud compute ssh screaming-frog

Then execute the following command to start the back-up process:

sudo gcimagebundle -d /dev/sda -o /tmp/ --log_file=/tmp/abc.log

This command will create an image of all the settings and programs installed in the previous steps. The output of this command will show a long hex number that represents the name and location of the newly created image, such as:

/tmp/<long-hex-name>.image.tar.gz

Temporarily copy-paste this long hex number somewhere, because you will need it in the next few steps.

Now that a back-up image has been created, the image needs to be stored in Google Cloud Storage. First authenticate and configure your Google Cloud Storage access with the following command:

gsutil config

Follow the instructions: open a new browser window with the provided URL, accept the permission request, and copy-paste the authorisation code shown in the browser back into the terminal. Then enter the Project ID, in this case screaming-frog-wb, and hit Enter.

Next create a new bucket in Google Cloud Storage with a unique name by executing the following command:

gsutil mb gs://<bucket-name>

Note: Replace <bucket-name> with a name that is unique across all Google Cloud Storage buckets. Be aware that you may need to be creative to find an available name.

The next step is to copy the image into the Google Cloud Storage bucket by executing the following command:

gsutil cp /tmp/<long-hex-name>.image.tar.gz gs://<bucket-name>

Note: Be sure to update the <long-hex-name> and the <bucket-name> from the previous command.

Once the process of copying the back-up image to the Google Cloud Storage bucket has completed, log out of the SSH connection with the instance by executing the following command:

exit

and add the custom back-up image to the Images collection of the Google Cloud Project by executing the following command in the terminal on your local machine:

gcloud compute images --project screaming-frog-wb create screaming-frog-image --source-uri gs://<bucket-name>/<long-hex-name>.image.tar.gz

Once this process has been completed, the back-up image is safely stored in the Google Cloud Storage and will be accessible within the project the next time you create an instance. If needed, the image can again be deleted using the Google Cloud Console or by executing the following command:

gcloud compute images --project screaming-frog-wb delete screaming-frog-image

Alternatively, to verify that the creation of the back-up image was successful, go to the Google Cloud Console, select the project, select Compute Engine, select Images on the left sidebar and here you should see the image “screaming-frog-image” in the list of available images.

At this point everything is configured and saved, so the current instance can be de-activated by using the following command:

gcloud compute instances delete screaming-frog --delete-disks boot

This will de-activate the instance and delete the disk, avoiding any additional cost (except for storage of the custom image) until the next time you need to use Screaming Frog SEO Spider on Google Compute Engine. This can be confirmed by executing the following command:

gcloud compute instances list

which should return again an empty table.

Using Your Preconfigured Instance

When you are ready to use Screaming Frog SEO Spider on Google Compute Engine, open the terminal on your computer and execute the following command:

gcloud compute instances create screaming-frog

When prompted, choose a machine with enough RAM (preferably n1-standard-8 or higher) and the “screaming-frog-image” image. Once the instance is up and running, note the external IP address assigned to the instance.
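As before, the prompts can be skipped by passing the choices as flags (a sketch; the syntax may vary between SDK versions, and --scopes is included again so a future backup of this instance remains possible):

```shell
gcloud compute instances create screaming-frog \
    --machine-type n1-standard-8 \
    --image screaming-frog-image \
    --scopes storage-rw
```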

Next, start up the VNC program, such as Remmina, and connect to the instance using the external IP address on port 5901, and the eight-character password you previously set.

Start Screaming Frog SEO Spider and start crawling…

Adding Extra Disk Space

When you start using the instance, you will soon notice that the maximum space on the default instance is only ten gigabytes, of which approximately one gigabyte is used for the installation of the operating system, Screaming Frog, VNC and its dependencies. If you need more disk space, you can add it using a second persistent disk.

The simplest way of adding additional disk space to your Google Compute Engine instance is through the Google Cloud Console in your browser: select your project, then Compute Engine, then Disks, and click on the red “New Disk” button. In the form that appears, select the same Zone as your instance (this is extremely important, as you will not be able to attach a disk to an instance running in a different zone) and select a blank disk as the Source Type. For the size, I recommend sticking to the default 500 gigabyte disk unless you already know in advance that you need more. You will be surprised how quickly a second disk fills up.

Creating a new persistent blank disk in the Google Compute Engine Cloud.

After you have created the disk, navigate in the Google Cloud Console to your instance and attach the disk to your instance.

Attaching a new persistent disk to a Google Compute Engine instance through the Google Cloud Console.

When prompted choose the read/write option.

Choose the mode of the persistent disk in Google Compute Engine.

After this you will see the disk attached to your instance (if not, try rebooting the instance), and it can be accessed through the command line. Next, log into your instance through SSH and switch to the root user by executing the following commands:

gcloud compute ssh screaming-frog
sudo -s

Now identify the disk designation by executing the following command:

fdisk -l

You will most likely see a message stating that the second disk (e.g. /dev/sdb) does not have a valid partition table.

Check the status of attached disks using fdisk.

We solve that by adding a partition table to the new disk, executing the following command:

fdisk /dev/sdb

When prompted, type ‘n’, choose ‘e’ and just press Enter for the defaults after this. When finished, type ‘w’ to write the partition table and exit fdisk.

Create a new partition table on a disk using fdisk.

The next step is to format the disk by executing the following command:

mkfs.ext3 /dev/sdb

Now the disk can be mounted to the instance so it can be accessed through command line and/or the file explorer in VNC and used to store data. Execute the following commands to mount the disk:

mkdir /mnt/disk1
mount /dev/sdb /mnt/disk1

To make sure that all users can access the disk and the contents of the disk, the rights to the disk need to be updated by executing the following command:

chmod 777 /mnt/disk1

When adding additional files to the disk, you may need to keep updating the rights of your files so that other users (e.g. vnc) can also read and write to these files. This can be done by navigating to the relevant directory through the command line and execute the following command:

chmod 777 *

Note: It is generally considered a bad idea to apply “chmod 777” to files on any remote server. However, since our instance is meant to be temporary and is only accessible through VNC and SSH, I consider it a lower risk, and it makes things easier. If the instance is meant to run for longer periods of time, I recommend exploring other options, such as using user groups instead.
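One sketch of such a group-based alternative is a shared, setgid directory. The directory name below is hypothetical, and on a real instance you would additionally add both your user and the vnc user to a common group:

```shell
# Create a shared directory where group members can read and write.
mkdir -p /tmp/shared-demo
# 2775: rwx for owner and group, plus the setgid bit so new files
# inherit the directory's group instead of the creating user's.
chmod 2775 /tmp/shared-demo
stat -c '%a' /tmp/shared-demo   # prints 2775
```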

Now the disk can be used to store data from Screaming Frog SEO Spider crawls and more.

Be aware that the disk operates independently of the instance. When the instance is deleted, the disk will not be automatically deleted. This makes it possible to preserve the disk (with all stored data) in the Google cloud and re-attach it to another instance in the future by mounting it again on a new instance. Keep in mind that there are costs involved with not deleting the additional disk.

Transferring Data Through SSH

Data between your local machine and the Google Compute Engine instance can also be transferred through SSH using the copy-files command of the gcloud tool. On your local machine you can upload data to the Google Compute Engine instance by executing the following command:

gcloud compute copy-files /home/user/local-file instance-name:/home/remote-user/remote-file

For example, translated to uploading a zip file to the second disk attached to the screaming-frog instance, the following command works:

gcloud compute copy-files /home/fili/local.zip screaming-frog:/mnt/disk1/

Alternatively, data can also be downloaded from the instance through SSH by executing the following command:

gcloud compute copy-files instance-name:/mnt/disk1/remote-file /home/user/

Again, translated to downloading a zip file from the second disk attached to the screaming-frog instance to the local computer, the following command works:

gcloud compute copy-files screaming-frog:/mnt/disk1/remote.zip /home/fili/

More information about the gcloud tool's copy-files command (and other commands) can be found here.
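When transferring many small files at once (for example a directory of crawl exports), it can be faster to bundle them into a single archive first and copy that instead. A sketch with illustrative file and directory names (the actual upload step is shown as a comment, since it requires a running instance):

```shell
# Bundle crawl output into one archive before transferring (paths illustrative)
mkdir -p crawl-results
echo "url,status" > crawl-results/pages.csv
tar czf crawl-results.tar.gz crawl-results/

# Upload the single archive instead of many individual files:
#   gcloud compute copy-files crawl-results.tar.gz screaming-frog:/mnt/disk1/

# Inspect the archive contents locally:
tar tzf crawl-results.tar.gz
```

On the instance, unpack with `tar xzf crawl-results.tar.gz` in the target directory.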

Keeping Command Line Processes Running

The instance can also be used for processing large data files on the command line, or for other command line processes. When your SSH connection to an instance drops, for example because of a network failure, any processes running in that shell are terminated and data may be lost. This can easily be solved by installing and using tmux.

First connect to the instance through SSH and install tmux by executing the following commands:

gcloud compute ssh screaming-frog
sudo -s
apt-get install tmux
exit

Now tmux can be used by executing the following command, which opens a new command line shell within the existing one:

tmux

Any command executed in the tmux shell will continue running after you disconnect. When needed, detach from the tmux session by pressing ‘Ctrl-B’ followed by ‘d’. You can get back into the tmux session, for example to see the status of your processes, by executing the following command:

tmux attach

More information about tmux can be found here, here and here.
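Putting it together, a long-running job can be started directly in a named, detached session, so nothing is lost if the SSH connection drops. A sketch where the session name and the command are illustrative:

```shell
# Start a detached tmux session named "crawl" running a long job
tmux new-session -d -s crawl 'sleep 60'

# List sessions to confirm it is running
tmux ls

# Reattach later with: tmux attach -t crawl
# Kill it when done:   tmux kill-session -t crawl
```

Naming sessions becomes useful once you run more than one job, since `tmux attach` alone only picks the most recent session.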

Conclusion

Although getting started with Google Compute Engine may seem intimidating at first, I discovered a lot of benefits (speed and pure computing power) by using it for processing large data files and crawling URLs. I highly recommend learning more and experimenting with Google Cloud services, especially as an SEO consultant.

If you have any additional tips about using Google Compute Engine for SEO, do share them in the comments!

A previous German version of this article was published in the 25th anniversary print edition of the German magazine Website Boosting on behalf of SearchBrothers.com. If you read German and are interested in SEO, I highly recommend you check out this magazine.

About the Author

Fili is a former member of the Google Search Quality team and the Google Ad Traffic Quality team. At Google, Fili spent seven years hunting web spam and click spam, defining spam policies, educating webmasters and improving the quality of Google search results worldwide. Based in Berlin, Germany, Fili nowadays works at SearchBrothers.com as an SEO consultant, helping clients optimize their websites for search engines and recover from Google penalties and algorithmic updates.

Comments

  1. Interesting! Have you tried out what is the maximum amount of URLs you can crawl with this installation? Is there a limit?

  2. Brilliant! I have been exploring the power of Google Computing Engine and its awesome. I am not a programmer or developer and of course I am even more wowed that your average developer dude. Thanks for the share. The challenge is on!

  3. Hello David,

    Yes, there is still a limit but this limit is much higher than on a standard computer. For example, crawling one million URLs is not a problem. Having said that, Screaming Frog can still improve things and they are actively working on this. And when this goes live, maybe crawling one million URLs on Compute Engine can then become 10 million URLs or more…

  4. Hello Eider,

    Thanks, and let me know how it goes!

  5. After I setup billing, I type gcutil listinstances in the terminal but it states API rate limit exceeded. Any ideas?

  6. Interesting approach,

    Yes, big data will change everything in the way we work, that’s for sure.

    But it seems a bit complicated and as you told to Eider, there is still a limit.

    Another point is that when you are working on a big website, KPIs are not the same as on small ones. You have to adapt them to get a macro point of view.

    We have developed a solution based on pure Big Data tools (hadoop, nosql),
    where there is really no limit. It changes everything when you come to big websites and need to understand competitors.

    Regards

  7. @Jemery did you try this solution?

  8. I logged out and logged back in.
    I got stuck at two different spots after this.

    Once when I’m trying to up the memory using:
    pico /home/vnc/.screamingfrogseospider
    The document is blank.

    Also when I am trying to backup my image:
    gsutil mb gs://
    No matter what I try I get Creating gs://nomatterwhateverIenter/…
    GSResponseError: status=403, code=AccessDenied, reason=”Forbidden”, message=”Access denied.”

    Any ideas?

  9. @Jemery,

    To answer your first question, make sure that you start the Screaming Frog program at least once before trying to edit the memory settings. The most likely reason why your memory settings file is blank is because the file does not exist yet. This is because the program has never been started yet.

    To answer your second question, please make sure that you enable Google Cloud API through the Google Cloud Console > Your Project > APIs and auth > Google Cloud Storage and click on the off button to switch it on. Same for Google Compute Engine if that hasn’t been enabled yet.

    I will include this last part later today or tomorrow in the article.

    Hope this helps,

    Fili

  10. Hi Fili, great write up… thanks for putting so much detail into your guide. I do have a little problem though in that I am trying to use your guide and adapt it for a Windows install, that’s why I have to use mRemoteNG instead of Remmina. The last hurdle is getting the program to connect, and it’s not having it. Would there be any reason why I would get “———-
    ErrorMsg
    5/5/2014 6:53:55 PM
    Opening connection failed!
    Unable to connect to the server. Error was: No such host is known
    ———-

    despite using the external IP in my Google developers account and using the 5901 port?

  11. @Paul I haven’t used mRemoteNG but have you tried RealVNC? Also please make sure the port is set correctly.

  12. Hi Fili,

    Thankfully got over that hurdle, and now connected.

    The only problem I am having now is changing the RAM allowance in the frog through PICO “pico /home/vnc/.screamingfrogseospider” there is no file that appears after using that command, I have tried appending that with ScreamingFrogSEPSopider.l4j (which is the file to edit), and for the life of me cannot load that to amend? any suggestions?

    Thanks again

  13. @Paul don’t forget to first launch Screaming Frog SEO Spider at least once through VNC. Otherwise the file .screamingfrogseospider will not exist for you to edit.

  14. @fili thanks for the suggestion…that has worked.

    once again thanks for the fantastic guide, which is very clear… this coming from someone who is non-technical (as you could probably tell)

  15. @Paul Happy to see you succeeded :)

  16. The solution is interesting and the idea of remote machine running screaming frog sounds good.

    Not sure why you went for such a complicated Google cloud system though.
    Wouldn’t a VPS with enough RAM and disk do the same?
    How much does Google cloud cost? Is it an hourly fee?

  17. @Andrea,

    Thanks for your response. The reason why I went for Google Cloud is because you indeed only pay per minute (after the minimum of the first 10 minutes). In addition, I can expand it as much or as little as I want and I can save custom OS images in the cloud. Last but not least, the processing power is awesome and I have a chance to run my stuff on Google servers. It may seem like a complicated system at first, but once you use it for one thing like Screaming Frog you will soon use it for other things as well, such as parsing through a CSV file with 200 million links :)

  18. Fili,

    Thanks for this write up. Managed to get it set up. Pretty straight forward. The pricing structure for Google Cloud is one of the most convoluted things I have ever read. They really need some type calculator where you can select your machine and enter the number of hours you are going to be using.

    One thing not mentioned here is this is a great solution for a team environment to give multiple team members the IP and pw to the vnc and let them run queries on Screaming Frog and other data tools without having to buy multiple versions for each user.

    It is very powerful and burns through sites.

  19. Hi Fili,

    Great tutorial! Can’t wait to get started in the cloud. I have tried installing the vnc server with your command but get this error,

    root@ubuntu:~# apt-get install tightvncserver xfce4 xfce4-goodies xdg-utils openjdk-6-jre
    Reading package lists… Done
    Building dependency tree
    Reading state information… Done
    Package openjdk-6-jre is not available, but is referred to by another package.
    This may mean that the package is missing, has been obsoleted, or
    is only available from another source
    However the following packages replace it:
    icedtea-netx-common icedtea-netx

    E: Unable to locate package tightvncserver
    E: Unable to locate package xfce4
    E: Unable to locate package xfce4-goodies
    E: Package ‘openjdk-6-jre’ has no installation candidate

    Any idea on how to fix this?

  20. @Sung

    thanks for your question. You seem to have skipped a step. Be sure to first run “apt-get update” before this command.

    I hope this helps. Please let me know if you have any other problems.

  21. Hey there.

    I’m installing this through windows using CygWin. Installing the SDK using curl worked fine, but when I try “$ gcloud auth login” or even “$ gcloud version” I just get a command not found error. I think I need to path to the command (this makes sense since it’s all installed under CygWin) but I have no idea how to do that…

    Path to the SDK seems to be:

    C:\cygwin\home\[user]\google-cloud-sdk

  22. @Studiumcirclus

    Sorry I will not be able to help you much as I don’t use Windows at all. Alternatively to CygWin you can also try to install Ubuntu within your Windows installation (just like you would install any other program within Windows). Check out https://wiki.ubuntu.com/WubiGuide for more information.

  23. Thanks for trying to help! I’ll look into some other alternatives. I was planning on giving Mint/Chakra another try anyway

  24. Thank you so much for such an excellent guide, i was wondering when selecting the high memory one that you suggested in your post

    how many URI/s per second are you getting as mine seems to be 20 which seems quite low

    any speed tips?

    Thanks
    James

  25. @James

    Thanks, I am happy you found it useful. To answer your question, the crawl rate depends totally on what you are crawling. If you are crawling a list of URLs from different servers then you can go as high up as 500 threads per second (as I have done easily). However if you are crawling the same website, the server may not be able to respond to such a high thread rate and may slow down significantly. So then if you get 20 URLs per second it is kinda nice. Be sure to play around with the speed configuration in Screaming Frog. Hope this helps.

  26. Hi Fili,
    To ensure we don’t get charged when not in use, do we just turn off Google Cloud on the dashboard?

  27. Hello,

    If you don’t have any instances running or storage space used, you will not be charged in Google cloud.

    Hope that helps,

  28. Hi Fili,
    Where do you check to see if there are any instances currently running? Sorry for my lack of understanding

  29. Hello,

    You can verify if you have any instances running by using the command line or at the online interface of https://console.developers.google.com/.

    See also https://developers.google.com/compute/docs/instances#listingvms for more information.

    Thanks for trying this out.

  30. Hi Fili,
    Thanks for the response. Would this be a viable solution for me?

    http://stackoverflow.com/questions/20153695/how-to-stop-compute-engine-instance-without-terminating-the-instance

    This is exactly why I would like to “turn off” the instance when not in use.

  31. Hello,

    No I would not use this option as it does not delete the boot disk so you may still accrue cost. Instead follow the instructions from the official documentation I shared in my previous reply.

    Hope this helps :)

  32. Thanks Fili,
    Looks like I’ll have to “gcutil deleteinstance” to delete and “sudo reboot” to restart the instance again.

  33. Great Share

    So… what are the average monthly costs to do this vs using a local box for free?

    Wouldn’t it be a lot cheaper to just build an economical dev box?

  34. Hello Edward,

    it can indeed be cheaper to buy your own local box, but don’t forget that the Google cloud solution can provide me with RAM up to 104GB (as of now, however this can still increase in the future) and CPU power from the Google infrastructure and network connection. A local box will not be able to grow as fast and cheaply as using the Google cloud. Also, I may just need my Google cloud instances for a few days per month, so I only pay those days and no more. Unlike a local box.

    I am biased at this point, but I had my own local servers (hardware) in the past, and every time I do the calculation again I just don’t see a cheap solution providing me a similar setup with 60GB or 104GB of RAM and an uber-fast network connection.

    I hope this answers your question.

  35. Scott Clark says:

    It seems most comments/discussions are at an academic level. Are others getting this running successfully for daily use?

  36. Hi Fili, took a bit of effort but we now have this up and running and it works very well. So well we almost knocked a site over due to how high we had set the crawl rate!

    Incredibly useful and thanks for taking the time to put together, a great great resource, and saved us a lot of money from having to invest in another tool to do the job

  37. @Scott I and others are using this on a daily basis.

  38. @Darren glad it was helpful :) Anything in particular you had difficulty with?

  39. First of all, great tutorial! Thank you for taking the time to post this. I’m sure it took some time.

    Secondly, I’m stuck at:

    “Backing up Your Instance
    To back everything up, first log in to the instance from your computer by executing the following command:
    gcutil ssh
    Then execute the following command to start the back-up process:
    sudo gcimagebundle -d /dev/sda -o /tmp/ –log_file=/tmp/abc.log”

    I’m getting this response:

    gcimagebundlelib.block_disk.InvalidRawDiskError: The operation may require up to 7477972992 bytes of disk space. However, the free disk space for /tmp/tmpYPv6g7 is 5225410560 bytes. Please consider freeing more disk space. Note that the disk space required may be overestimated because it does not exclude temporary files that will not be copied. You may use –skip_disk_space_check to disable this check.

    I tried to do this a couple times and because it froze while “copying contents”, now i can’t create the disk image. Thoughts?

  40. Fili,

    Hi, I have tried to connect through both RealVNC and TightVNC viewer but I get:

    The connection was refused by the host computer
    or
    No connection could be made because the target machine actively refused it.

    Respectively.

    I have the exceptions in the Firewall, but I still can’t connect.
    Thank you for any help you could give me!

    Luis

  41. @Luis How long was your instance running? Did you try restart the instance? If your firewall was set up correctly then it is possible that you had a DOS attack on the IP address and port, which caused the VNC server to reject any connection.

  42. @Doug Did you install anything else but the programs mentioned here? If not, try rebooting your instance. This can often free up the necessary memory in /tmp/. Alternatively, you can add an additional disk (see below) of 100GB and change the command to sudo gcimagebundle -d /dev/sda -o /NEW-LOCATION/ –log_file=/tmp/abc.log” and replace /NEW-LOCATION/ with the path to the additional disk space.

  43. Hi Fili, looks like great article. But I am trapped in part where I have to start Google Compute Engine Instance where terminal write error “Error: The resource ‘projects/screaming-frog-pu’ was not found”. Details bellow:

    You are now logged in as [jsem@pavelungr.cz].
    Your current project is [None]. You can change this setting by running:
    $ gcloud config set project PROJECT
    pavelungr@Dagon:~$ gcloud config set project screaming-frog-pu
    pavelungr@Dagon:~$ gcutil listinstances
    Error: The resource ‘projects/screaming-frog-pu’ was not found

    All previous steps work well and without any problems. Can you help me, please? Thank you for any advise.

  44. Adam Whittles says:

    Hi Fili,

    I had read your article last year but I’ve only now finally managed to get around to trying this. Your instructions are really great and easy to follow, so thank you very much!

    The only issue I encountered was with the install of Dropbox. For some reason my keyboard wouldn’t function correctly within that program. My workaround was to install Copy instead. To be honest, I actually prefer it to Dropbox.

    I also had one question, you mentioned that you were able to achieve 500 threads per second, however Screaming Frog limits the number of threads to 200 as a maximum. I was just wondering how you were able to open up to 500 threads?

    Thanks.

  45. @Pavel be sure to check out the documentation of the gcloud SDK. I hope this helps!

  46. @Adam I had no problem boosting the number up in the settings of Screaming Frog during my initial test. It seems though that this does not work for me either anymore :(

  47. Hey Fili,

    Awesome article, I just had one question. What is the average cost of running a check on 10 millions URLs?

  48. @Matt Thanks, this depends on your speed, the time you are running your instance, which size of instance, and which zone. Too many variables to give any estimate.

  49. Peter French says:

    Hey Fili
    Great tutorial – love it – easy to follow and get going and is extremely helpful!

    A few questions:
    1. Approx how many URI’s can one process with 50GB RAM ?
    2. What are you using to analyze large CSV’s in a compute instance? Any special tips or tricks?

    Thanks again!

  50. @Peter Great to hear you found the tutorial easy to follow. Regarding your questions:

    1) It depends a bit on what exactly you are crawling, but I found I can easily crawl websites up to two million URLs with 60GB. I haven’t tried more but it may be possible.

    2) To process large CSV files, I tend to use the Python pandas library. You can use a separate Google Compute instance (with pandas installed, using the same tricks for additional disk space and backing up your instance as described in this article) to process with up to 104GB RAM. Alternatively, you can import your CSV files into Google BigQuery and run SQL-like queries on them.

    Hope this helps,

  51. Hi Fili!

    Great tutorial! Thanks for the share.

    I had some errors while setting locales because I was connecting via ssh from a Spanish computer, but I solved it modifying the VM files /etc/default/locale and /etc/environment:

    LANG=”es_ES.UTF-8″
    LC_ALL=”es_ES.UTF-8″
    LANGUAGE=”es_ES”

    I hope this will be useful for other people! :)

  52. Great, easy to follow tutorial Fili! I’m going to implement it now. Thanks for your time at Brighton SEO. I got some great insight from our chat about disavowing links and footer links etc.

  53. Hi Fili, and thanks for supporting us with the Google Cloud and Screaming Frog installation. Following your tutorial, I can’t access the /home/vnc/.screamingfrogseospider file (it’s blank despite having already run the spider). But if I connect through VNC to the server and use the terminal emulator I can read and modify that file.


Trackbacks

  1. […] to minimise all these inconveniences, I came across an interesting article on how to run Screaming Frog on Google Cloud Servers, which has turned out to be an EPIC WIN. Although, being a year-old article […]

  2. […] only resource that I found by googling about it, is this outdated instructions post. It was from last year and there have been a few big changes to Google Cloud’s platform that […]