PubSub BigQuery Subscription 

PracticalGCP
Google Cloud has recently released a very neat feature to stream data directly from PubSub to BigQuery without needing to manage any pipelines in the middle. I've had some time to explore it in detail, and today I would like to share what I've done, the areas where it can be very useful, but also some limitations that can restrict the use cases of this new feature. A minimal sketch of creating one of these subscriptions is included after the links below.
Further reading
- Slides: slides.com/ric...
- BigQuery Subscription: cloud.google.c...
- Avro Logical Types: fastavro.readt...
- Code repo: github.com/roc...
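
As a rough illustration of how the feature is wired up (not from the video itself): a minimal sketch of creating a BigQuery subscription with the google-cloud-pubsub Python client, assuming a recent client release that exposes BigQueryConfig. The project, topic, subscription, and table IDs are hypothetical placeholders.

```python
from google.cloud import pubsub_v1

project_id = "my-project"                        # hypothetical project
topic_id = "my-topic"                            # hypothetical topic
subscription_id = "my-bq-subscription"           # hypothetical subscription
bigquery_table = "my-project.my_dataset.events"  # hypothetical destination table

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic_path = publisher.topic_path(project_id, topic_id)
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# Deliver messages straight into BigQuery; write_metadata also writes columns
# such as publish_time and message_id alongside the payload.
bigquery_config = pubsub_v1.types.BigQueryConfig(
    table=bigquery_table,
    write_metadata=True,
)

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": bigquery_config,
        }
    )
```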

Published: 9 Sep 2024

Comments: 19
@andredesouza7077 A year ago
Thank you, this was awesome! I was struggling with data types, particularly around the date issue! I could not for the life of me understand why anyone would use the single data column to receive data; your perspective on this opened up a number of ideas.
@practicalgcp2780 A year ago
Thanks for the feedback, Andre. That is the purpose of this channel: to share things that give others ideas 💡 Glad you found it useful.
@DadoscomCaio 2 years ago
Pretty useful. Thanks for sharing your knowledge.
@srivathsaballa6958 A year ago
Thanks for the video. Could you walk through an example of an optional column in the schema, and publishing a message to BigQuery without that column?
@happymanu5 A year ago
Thank you, this was useful. Can we also use the "Write to BigQuery" delivery type to delete or update table data in a BigQuery dataset?
@practicalgcp2780 A year ago
Hi Raunak, this feature is designed to “ingest” data into BigQuery, not to transform it. Typically it’s not a good idea to deal with transforming data at ingestion time; instead, get data into BigQuery first, then handle the transformation using another tool such as dbt. This is because any processes or logic added at ingestion time are likely to introduce a “magic” layer, and if anything goes wrong in that step it’s very difficult to debug, because the source data may no longer exist to validate what happened. If you only want to keep ingested data for a period of time, add a TTL to the BigQuery table at the partition level so that data doesn’t get kept longer than it should; see cloud.google.com/bigquery/docs/managing-tables
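
For reference, a minimal sketch of setting that partition-level TTL with the google-cloud-bigquery Python client, assuming the destination table is already day-partitioned; the table name and 30-day expiration are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical day-partitioned table that the BigQuery subscription writes to.
table = client.get_table("my-project.my_dataset.events")

# Expire each partition 30 days after its partitioning date, so ingested data
# is not kept longer than it should be.
table.time_partitioning.expiration_ms = 30 * 24 * 60 * 60 * 1000
client.update_table(table, ["time_partitioning"])
```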
@keisou9765 2 years ago
Very informative. Do you know if there's any way to split a single message into multiple rows in BQ? The JSON in question has an array component, and I would like there to be a separate row for each object in the array. Is this something that can only be handled by more traditional methods of propagating messages from Pub/Sub to BQ? Thanks.
@practicalgcp2780 2 years ago
I am glad you found it useful, Cong. I think this is what you are looking for, if I understood you correctly: cloud.google.com/bigquery/docs/reference/standard-sql/json-data#extract_arrays_from_json. Basically, you can use this function to get an array of JSON values and then use UNNEST to unpack it into multiple rows.
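
As a rough sketch of that approach (the table and field names are made up): extract the array from the raw JSON payload stored in the subscription's data column, then UNNEST it so each element becomes its own row.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumes a hypothetical table whose `data` column holds the raw JSON message,
# with an "items" array inside it.
query = """
SELECT
  JSON_VALUE(data, '$.id') AS id,
  item
FROM `my-project.my_dataset.raw_events`,
  UNNEST(JSON_EXTRACT_ARRAY(data, '$.items')) AS item
"""
for row in client.query(query).result():
    print(row.id, row.item)
```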
@keisou9765 2 years ago
@practicalgcp2780 Thanks for your reply. Sorry, no, I meant to ask if it is possible to get multiple BQ entries out of a single message containing an array of elements, through BQ subscriptions as described in the video.
@practicalgcp2780 2 years ago
@keisou9765 OK, I see. It depends on your input; it supports the RECORD data type, which is nested, so you can use it like this: stackoverflow.com/questions/11764287/how-to-nest-records-in-an-avro-schema. That will be mapped to the BigQuery RECORD type. I don't think you want them to be separate rows in BQ; storing them as a RECORD and unnesting is a better way to do it, because BigQuery natively supports nested and repeated data in a single row.
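
To illustrate, a sketch using the fastavro library linked above, with hypothetical field names: an Avro schema where "items" is an array of nested records, which should map to a REPEATED RECORD column when the subscription uses the topic schema, keeping all the elements in a single row for later unnesting.

```python
import io
import fastavro

# Hypothetical schema: "items" is an array of nested records.
schema = fastavro.parse_schema({
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {
            "name": "items",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "Item",
                    "fields": [
                        {"name": "sku", "type": "string"},
                        {"name": "quantity", "type": "int"},
                    ],
                },
            },
        },
    ],
})

# Encode one message payload; these bytes are what gets published to the topic.
buf = io.BytesIO()
fastavro.schemaless_writer(buf, schema, {
    "order_id": "o-123",
    "items": [{"sku": "abc", "quantity": 2}, {"sku": "def", "quantity": 1}],
})
payload = buf.getvalue()
```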
@1itech A year ago
Bro, please explain how we can send CSV files via Dataflow and Pub/Sub to store in GCS using Python code.
@AlexanderBelozerov A year ago
Hey, thanks for the video. Are you aware of any delay when using BigQuery subscriptions? I'm noticing a delay of a few minutes for some messages and can't find any details about it online.
@practicalgcp2780 A year ago
Not that I am aware of. There may be a bit of a delay when you first set it up, but I don't expect it to be continuous like that unless there are issues with Google services (behind the scenes it might be using Cloud Dataflow). If you notice unexpected delays, I would note down the message timestamps (you can see these if you export all columns into BigQuery) and then raise it with Google Cloud support with screenshots etc.; they can debug it for you.
@AlexanderBelozerov A year ago
@practicalgcp2780 Thanks for the advice.
@arshad.mp4 2 years ago
Very informative video. Hey, we want the same thing, but the problem is we have a timestamp field. You said that it needs to be converted, but how can we do that? That part is confusing to me; can you please tell me how to convert the timestamp field?
@practicalgcp2780 2 years ago
I am glad you found it useful ;P You can convert it in two ways. 1) Convert it to an integer or long as a Unix timestamp (an integer count since the epoch) during your ingestion process, in other words, in the system that sends the message to PubSub. 2) Leave it as a string, and make sure your PubSub schema and the BigQuery schema both use the STRING type so the data types match, then, during the data modelling process, convert it using the BigQuery function cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions#timestamp. Hope that makes sense?
@practicalgcp2780 2 years ago
And I would say I prefer the second way, because it applies the least amount of transformation, which means less logic, and if something goes wrong you know it's not this conversion process that is causing it. Also, you don't always have the option to convert it in the source system.
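
A minimal sketch of that second option, with hypothetical table and column names: keep the field as STRING end to end, then cast it during modelling with BigQuery's TIMESTAMP function (or PARSE_TIMESTAMP if the string is in a non-standard format).

```python
from google.cloud import bigquery

client = bigquery.Client()

# Assumes a hypothetical raw table where `event_time` arrived as a STRING,
# e.g. "2024-09-09 12:34:56+00", matching both the Pub/Sub and BigQuery schemas.
query = """
SELECT
  id,
  TIMESTAMP(event_time) AS event_time
FROM `my-project.my_dataset.raw_events`
"""
for row in client.query(query).result():
    print(row.id, row.event_time)
```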
@ridhoheranof3413 A year ago
I've seen your video, but I have a problem streaming data from GCS to BigQuery. I'm trying to use a Pub/Sub subscription, but I'm a bit confused about where I should run the Python producer file.
@practicalgcp2780 A year ago
From GCS to BigQuery? Is this a question specific to this video? If you want to load files from GCS into BQ, a much easier way is a BQ load job; have a look at the docs here: cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv. This video is about getting data from PubSub to BigQuery, where the data must already exist in PubSub. The producer Python file is just there to send data to PubSub; GCS is not involved in this design.
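
For completeness, a minimal sketch of such a GCS-to-BQ load with the google-cloud-bigquery Python client; the bucket, path, and destination table are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # assumes the CSV files have a header row
    autodetect=True,       # infer the schema from the files
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/*.csv",       # hypothetical source files
    "my-project.my_dataset.events",       # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
```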