Using Airbyte to get data from websites and dataset platforms like Kaggle


I am new to Airbyte. Our team is looking to use Airbyte for different sources, ranging from HTTP APIs (web-scraped websites) to websites hosting datasets, such as Kaggle. We want to create custom connectors for these sources, and I am looking for a guide on how to get started.

I have implemented a custom connector for a sample API using the guide below: https://docs.airbyte.com/connector-development/tutorials/cdk-tutorial-python-http/creating-the-source

I now need to look at ways of:

  1. getting data from a website (scraped into my destination) using a custom connector.
  2. getting data from Kaggle or an equivalent data source using a custom connector.

Please let me know how to achieve the above tasks.

There are 2 answers

Alexander Marquardt On

I've written an example Webflow (CMS) source connector, which we use internally at Airbyte to extract data about our website/blogs/tutorials. This is accompanied by an associated blog article which gives an extensive description of the connector's implementation, including details about how to use the Python CDK to extract data from the Webflow API.

Details that are covered include authentication, requesting data, and paginating through responses, as well as how to dynamically create streams and how to automatically extract schemas.
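To give a rough idea of the pagination part, here is a minimal standard-library sketch of the pattern. It is not the Webflow connector's code and does not import the Airbyte CDK. It assumes an offset/limit style API whose responses carry `items`, `offset`, `count` and `total` fields (hypothetical field names), and `fake_fetch_page` is an in-memory stand-in for the real HTTP request. In a Python CDK `HttpStream`, the same logic would normally be split across `next_page_token`, `request_params` and `parse_response`.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

Page = Dict[str, Any]


def next_page_token(page: Page, limit: int) -> Optional[Dict[str, int]]:
    """Return the parameters for the next request, or None when no pages remain."""
    next_offset = page["offset"] + page["count"]
    # Stop on an empty page or once every record has been seen.
    if page["count"] == 0 or next_offset >= page["total"]:
        return None
    return {"offset": next_offset, "limit": limit}


def read_records(
    fetch_page: Callable[[int, int], Page], limit: int = 2
) -> Iterator[Dict[str, Any]]:
    """Yield every record by following page tokens until they run out."""
    token: Optional[Dict[str, int]] = {"offset": 0, "limit": limit}
    while token is not None:
        page = fetch_page(token["offset"], token["limit"])
        yield from page["items"]
        token = next_page_token(page, limit)


# Hypothetical stand-in for an HTTP call, so the sketch runs offline.
_FAKE_ITEMS: List[Dict[str, Any]] = [{"id": i} for i in range(5)]


def fake_fetch_page(offset: int, limit: int) -> Page:
    items = _FAKE_ITEMS[offset : offset + limit]
    return {
        "items": items,
        "offset": offset,
        "count": len(items),
        "total": len(_FAKE_ITEMS),
    }


if __name__ == "__main__":
    for record in read_records(fake_fetch_page, limit=2):
        print(record)
```

Swapping `fake_fetch_page` for a function that performs the real request is the only change needed to try this against an actual endpoint, provided the response shape matches the assumptions above.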

Much of the information presented in the connector and associated article should be generalizable to your specific requirements.

Disclaimer: I am an Airbyte employee and author of the linked article.

Homer6 On

In addition to Alexander's excellent answer, you can also use Apify to scrape and parse the website content into an Apify Dataset, then use Airbyte to sync that dataset.

https://docs.airbyte.com/integrations/sources/apify-dataset

https://apify.com/

Additionally, Apify datasets can be used in other applications, such as LangChain: https://python.langchain.com/docs/integrations/document_loaders/apify_dataset
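If you want to check what a dataset contains before pointing Airbyte at it, something like the following sketch might help. It assumes the Apify API v2 "dataset items" endpoint and its `format`, `clean`, `offset`, `limit` and `token` query parameters. The dataset ID and token shown are placeholders, and the helper names are my own, not part of any Apify or Airbyte library.

```python
import json
from typing import Any, Dict, List, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

APIFY_BASE = "https://api.apify.com/v2"


def dataset_items_url(
    dataset_id: str, token: Optional[str] = None, limit: int = 100, offset: int = 0
) -> str:
    """Build the URL for listing items of an Apify dataset as JSON."""
    params: Dict[str, Any] = {
        "format": "json",
        "clean": "true",  # assumed to skip empty items and hidden fields
        "offset": offset,
        "limit": limit,
    }
    if token:
        # Only needed for datasets that are not publicly readable.
        params["token"] = token
    return f"{APIFY_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"


def fetch_items(dataset_id: str, token: Optional[str] = None) -> List[Dict[str, Any]]:
    """Download one page of items. Performs a real network request."""
    with urlopen(dataset_items_url(dataset_id, token), timeout=30) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # "YOUR_DATASET_ID" is a placeholder; replace it before calling fetch_items.
    print(dataset_items_url("YOUR_DATASET_ID", limit=10))
```

This is only for inspection. The Airbyte source linked above handles the actual sync.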