Google Cloud Data - Quick Bites

Nilesh Khandalkar
4 min readOct 13, 2020

Cloud SQL vs Cloud Spanner

Cloud SQL: Fully managed relational database service for MySQL, PostgreSQL, and SQL Server

Cloud Spanner: Fully managed relational database with unlimited scale, strong consistency and very high availability.

There is not a big difference between then in terms on what they do (storing data in tables). The difference is how they handle the data in a small and big scale. Cloud Spanner is used when you need to handle massive amounts of data with an elevated level of consistency and with a big amount of data handling (+100,000 reads/write per second). Spanner gives much better scalability.
On the other hand, Spanner is also much more expensive than Cloud SQL. If you just want to store some data of your customer in a cheap way but still don’t want to face server configuration Cloud SQL is the right choice.

Google BigQuery vs Google Cloud BigTable

Google BigQuery: Data Warehouse to analyze terabytes of data in seconds. Run super-fast, SQL-like queries against terabytes of data in seconds, using the processing power of Google’s infrastructure Load data with ease. Bulk load your data using Google Cloud Storage or stream it in. Easy access. Access BigQuery by using a browser tool, a command-line tool, or by making calls to the BigQuery REST API with client libraries such as Java, PHP or Python. BigQuery is a query Engine for datasets that don’t change much, or change by appending. It’s a great choice when your queries require a “table scan” or the need to look across the entire database. Think sums, averages, counts, groupings. BigQuery is what you use when you have collected a large amount of data, and need to ask questions about it.

Google Cloud BigTable: Google Cloud Bigtable offers you a fast, fully managed, massively scalable NoSQL database service that’s ideal for web, mobile, and Internet of Things applications requiring terabytes to petabytes of data. Unlike comparable market offerings, Cloud Bigtable doesn’t require you to sacrifice speed, scale, or cost efficiency when your applications grow.
Also please keep in mind that Bigtable is not a relational database and it does not support SQL queries or JOINs, nor does it support multi-row transactions. Also, it is not a good solution for small amounts of data. If you want an RDBMS OLTP, you might need to look at cloudSQL (mysql/ postgres) or spanner.

Google BigQuery belongs to “Big Data as a Service” category of the tech stack, while Google Cloud Bigtable can be primarily classified under “NoSQL Database as a Service”.

DataProc vs DataFlow

DataProc: is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way. It is designed to run on clusters. Which makes it compatible with Apache Hadoop, hive and spark. It is significantly faster at creating clusters and can auto scale clusters without interruption of running job. It supports both streaming and batch processing.

DataFlow: is a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing. It is better if your data has no implementation with spark or Hadoop. It does not run on clusters, instead it is based on parallel data processing. As such data is split processed on multiple microprocessors to reduce processing time. It supports both streaming and batch processing.

Both Dataproc and Dataflow have workflow templates that are easier to use.

Legacy SQL vs Standard SQL

BigQuery supports two SQL dialects: standard SQL and legacy SQL , Previously, BigQuery executed queries using a non-standard SQL dialect known as BigQuery SQL. With the launch of BigQuery 2.0, BigQuery released support for standard SQL, and renamed BigQuery SQL to legacy SQL. Standard SQL is the preferred SQL dialect for querying data stored in BigQuery.

How can you tell at a glance whether a query is written in legacy SQL or standard SQL? Just look at the syntax used to specify tables or project names.

  • In legacy SQL: Use square brackets to start and end the table name, and use a colon (:) to delimit dataset and table names:
  • In standard SQL: Use the backtick character (`) to start and end the table name, and use the period character (.) to delimit dataset and table names:

Standard SQL complies with the SQL 2011 standard, and has extensions that support querying nested and repeated data.

Federated Data Source

An external data source (also known as a federated data source) is a data source that you can query directly even though the data is not stored in BigQuery. Instead of loading or streaming the data, you create a table that references the external data source.

Watch out this space for more.

--

--

Nilesh Khandalkar

Passionate about Data and Cloud, working as Data Engineering Manager at Capgemini UK. GCP Professional Data Engineering Certified Airflow Fundamentals Certified