At Airbyte’s content team, analyzing Google Analytics (GA) is essential for the visibility of our website’s performance. Currently, we export our Google analytics data to a Google Data Studio report to track our content KPIs. Still, it doesn’t provide us with the capabilities we would like.
If you’re familiar with it, you know that GA does not have an ideal format for in-depth analysis. Hence, moving the data to a data warehouse becomes necessary, especially when combining GA data with other sources like Google Search Console or other SEO data.
Since our analytics team is already making extensive use of BigQuery, we thought it’s a great idea to use it for GA, therefore centralizing all our data. The Google Analytics to BigQuery integration will serve three primary purposes: querying raw GA data, combining data from other sources, and exporting data for visualization.
I decided to use our newest product: Airbyte Cloud, to set up the integration! I have to say I was impressed with how easy it is to configure it. I had never created a data integration this fast before, and the best part is that I didn’t need to code.
In addition to providing hosting and management, Airbyte Cloud enables companies to have multiple workspaces and access management for their teams. It also supports OAuth authentication to help less technical users to connect their tools.
But why would you use Airbyte if GA and BigQuery integrate natively? Well, there are a couple of reasons. The BigQuery Data Transfer Service (DTS) cannot be used to integrate services outside of Google. Airbyte can be integrated with practically any service. Also, BigQuery DTS goes only in one direction: into BigQuery, which means you’ll need to resort to third-party solutions to move data further. Airbyte can replicate data into and from BigQuery.
Airbyte provides the advantage of being the only service you need for all your data integrations, making data management a lot easier.
Now that you have all the context let’s see how to export Google Analytics data to BigQuery!
Note that Airbyte’s Google Analytics source is currently in Beta, which means the engineering team is still working on new features and bug fixes. It’s also worth mentioning that Airbyte’s GA connector is based on the Google Analytics Reporting API v4, and it only works with Universal Analytics properties – GA4 properties will be supported soon. You can keep track of progress in this Github issue.
Setting up a GA source on Airbyte Cloud is very easy. Just go to the “Sources” tab and create a new source. As the source type, select Google Analytics. Then, a menu will appear, and you’ll need to fill in the following parameters.
Write a name to help you identify the source in your workspace. You may have more than one GA source if you are a marketing agency or marketing SaaS tool, so selecting a descriptive name is convenient.
Currently, there are two options to authenticate: via Google using OAuth or using the JSON key of your GCS’s service account. I used OAuth since it only requires a few clicks. OAuth is a standard behind the scenes to ensure the secure handling of permissions.
Select when you want to start replicating data.
Input the ID for the Google Analytics View you want to fetch data from. If you go to Google Analytics Account Explorer, you’ll see it under the “View” column.
The GA connector can replicate Custom Reports as well. We will explore Custom Reports in a dedicated section of this article. For now, you can leave it blank as it's an optional parameter.
Data request time increment in days
The Data request time increment in days parameter allows you to specify the number of days that data each chunk contains. It doesn’t change how the data at the destination is presented, but it can potentially affect fetch speed – in general, the bigger the value, the faster the fetch.
This parameter is also relevant to reducing sampling. If you get a warning in the sync logs about your data being sampled, you can try using a smaller window. We talk more about sampling in the Challenges section.
Now your source is set up. Check if the connection can be established, and that’s it! You’re ready to export Google Analytics data. For more information about the Google Analytics V4 connector, check Airbyte’s documentation.
Now it’s time to set up the destination of your data, BigQuery. To do that, go to “Destinations” and create a new one. As destination type select BigQuery, and the following parameters will appear.
You can write a name to help you identify the destination in your workspace.
Default Dataset ID
Indicate the dataset ID where your tables will be stored in BigQuery. If you don’t have a dataset created yet, check GCS’s guide. It’s important to note that the default dataset will be used if you don’t specify a namespace in your connection settings.
The ID of the project containing the BigQuery dataset defined above.
As a loading method for BigQuery, you can select between “Standard Inserts” and “GCS Staging” options. The “Standard Inserts” option is inefficient and should only be used for testing. If you want to sync data regularly, we recommend setting up GCS Staging.
You may also need a credentials JSON key. To create it, check out GCS's helpful guide.
The Dataset Location parameter is optional, but in my experience, you should set it to match the value of the “Data location” in your BigQuery dataset. Otherwise, you may run into errors when syncing the data.
Transformation Query Run Type
An advanced option indicates when the transformation queries – used for Airbyte’s basic normalization – are executed in BigQuery. If you don’t know which method is best for you, read more about interactive and batch query jobs.
Google Big Query Client Chunk Size
This is the BigQuery client's buffer size, from 1MB to 15MB. I used the default option, “15”.
You’re all set! Now set up the destination and wait for the connection to be tested.
Now it’s time to set up the connection from Google Analytics to BigQuery. For that, go to “Connections” and add a new connection. Then, select the source and destination that you just created.
Airbyte will try to fetch the schema of your source and, if everything goes well, you’ll see a screen with the following options to set up your connection.
As the name suggests, the replication frequency parameter indicates how often your data should be synced. I recommend setting it to “manual” to avoid triggering unnecessary syncs if you're just testing.
When using BigQuery as a destination, the namespace refers to the dataset where your tables will be stored.
At Airbyte, we load data from multiple sources to a single BigQuery project. Because I don’t want data to be held in the default dataset specified in the destination configuration, I used the “Custom format” option. I indicated the dataset ID I wanted to use for GA’s data. In this case, I named it “google_analytics.” If you want to know more about namespaces, check Airbyte’s documentation.
Destination stream prefix
If you would like your BigQuery tables to have a prefix, you can indicate it in this parameter. In my case, I added the “ga_” prefix to identify the tables coming from GA quickly.
Next, you need to set up the streams. In this case, Airbyte automatically showed me all the available streams that have been configured for the GA connector. Each stream represents a table that will be created in BigQuery.
You can expand each stream to see the fields they contain, their data type, how they will be named in BigQuery, etc. You can also select which streams you want to sync by using the “Sync” toggle. I chose all of them.
One of the essential aspects that you should set up is the Sync mode, as this defines how the data will be replicated from the source to the destination.
To understand the implications of choosing a Sync mode for GA’s data, we should clarify the concept of golden data: Data is golden when the same request will not produce any new results if asked later.
In other words, golden is data that will not change – like page views from 2 months ago. Not-golden data still needs to be updated – today’s page views, for example.
You have the following sync options:
Sync all data from GA and replace data in BigQuery by overwriting it.
Sync all data from GA and append data in BigQuery.
Sync new records from GA and append data in BigQuery. The incremental cursor is the column “ga_date.” The connector adds a lookback window of 2 days to ensure any previously synced non-golden data is re-synced.
Sync new records from GA and append data in BigQuery to keep a history of changes. The incremental cursor is the column “ga_date.” Same as the previous sync mode, the connector adds a lookback window of 2 days to ensure any previously synced non-golden data is re-synced.
The final table is produced by de-duplicating the intermediate ones using a primary key, which you can select.
Each of the methods above has advantages and disadvantages, and your selection should depend on your specific use case. If you want to know more, you can read about sync modes.
At Airbyte, we selected the Full refresh | Overwrite sync mode because it provides the most advantages, like no duplicate data, with acceptable disadvantages since we currently don’t have large amounts of data. We sync every 24 hours, which is not so often.
On top of replication, Airbyte provides the option to enable or disable an additional transformation step at the end of the sync, called basic normalization. In this case, the basic normalization maps the JSON object coming from the Google Analytics API to a format compatible with BigQuery. Hence, I suggest using the “Normalized tabular data” option.
Now you’re ready to start your first sync! To do that, click on “Set up connection” and the sync process should start running.
Meanwhile, you could monitor the logs to see what’s happening under the hood. If there were any errors, the logs would allow you to debug. If everything goes well, you should see a “Success” status for your sync.
In my case, replicating 102 MB took around 7 minutes.
After a successful sync, you should be able to see the data in BigQuery. If you go to your dataset, you’ll see that several sets of tables were created.
The tables named _airbyte_raw_* contain the raw JSON data that was fetched from the source. They only have 3 columns: _airbyte_ab_id, airbyte_emitted_at, _airbyte_data.
Tables whose name matches the stream name are your final, normalized tables. The raw JSON data should have been mapped into columns.
Now, you could easily run queries or create reports on top of the normalized tables. For example, I was curious to know which of Airbyte’s pages had the most unique page views during Q1 2022, and there you have it!
Airbyte’s GA connector allows you to sync data from Custom Reports, too. A Custom Report is a report that you create: you pick the dimensions and metrics and decide how they should be displayed. Read more about creating custom reports here.
In the Custom Reports parameter of the GA source, you can input a JSON array of objects, using the following parameters:
For example, I want to generate a report that shows how many users we have per day and where they are, and another report showing the users per city. I name these reports “new_users_per_day” and “users_per_city”.
Note that incremental syncs are supported only if you add “ga:date” dimension to your custom report.
If you had already configured your connection without using Custom Reports in the source, it’s important to refresh the source schema in your replication settings, so the new streams are added.
Then, go ahead and run a sync. If everything goes well, you should see your reports as new tables in BigQuery!
The following are some limitations that you may encounter when replicating data from Google Analytics (Beta version connector) to BigQuery using Airbyte Cloud.
Google Analytics Reporting API v4 may limit the returned data by using sampling, which can affect the results of your data and is the reason why sometimes the values returned from the API do not match the web interface. Sampling is an approximation of the actual data.
The data request time increment in days parameter is used to specify the number of days to look back and can be used to avoid sampling. The smaller the number of days, the less chance of sampling.
When sampling occurs, a warning is logged to the sync log.
Currently, the cursor field for incremental syncs is set to “ga_date” for all streams and it cannot be changed to another column.
Taking your Google Analytics data to the next level by replicating it to a data warehouse can give you great advantages, and this tutorial has shown you that achieving it is not difficult and requires no code.
In this tutorial you have learned:
Now that we have Google Analytics raw data in BigQuery, the next step for the content team will be to create some transformations using dbt and build a Metabase dashboard. Our plan is to include more sources, like Google Search Console, to enrich our data. We might cover our next steps in upcoming tutorials!
Join the conversation at Airbyte’s community Slack Channel to share your ideas with over 5000 data engineers and help make everyone’s project a success. With Airbyte, the integration possibilities are endless, and we can't wait to see what you're going to build!
Thalia Barrera is a data engineer and technical writer at Airbyte. She has over a decade of experience as an engineer in the IT industry. She enjoys crafting technical and training materials for fellow engineers. Drawing on her computer science expertise and client-oriented nature, she loves turning complex topics into easy-to-understand content.
Learn how to easily build a data ingestion pipeline to load business data from Mailchimp to Snowflake.
Learn how to load Facebook Ads data to BigQuery using Airbyte Cloud.