How to Scrape Crunchbase With Python in 2024

Updated: October 15, 2024

Want to extract valuable business data from Crunchbase? Here's a quick guide to scraping it with Python:

  1. Set up Python and install key libraries (httpx, parsel, selenium, pandas)
  2. Analyze Crunchbase's structure and respect their robots.txt
  3. Build a scraper using BeautifulSoup or Selenium
  4. Extract company info, funding data, and team details
  5. Handle pagination and search results
  6. Implement anti-detection measures (rotate IPs, use delays)
  7. Clean and store the scraped data

Remember: Always scrape ethically and respect Crunchbase's terms of service.

Quick Comparison: Scraping vs. API

| Factor | Scraping | API |
| --- | --- | --- |
| Cost | Can be free | Paid license required |
| Data Quality | May have errors | Usually accurate |
| Updates | Manual/scheduled | Near real-time |
| Ease of Use | Varies | Requires API knowledge |
| Data Volume | Limited by scraping speed | Higher limits |
| Legal Compliance | Potential issues | Officially approved |

Choose based on your needs, budget, and technical skills. Scraping works for tight budgets, while the API offers official, up-to-date data.

What You Need to Start

To scrape Crunchbase with Python in 2024, you'll need these essentials:

Python Setup

  1. Install Python 3+ from python.org
  2. Create a project directory
  3. Open it in PyCharm, VSCode, or Jupyter Notebook

Key Python Libraries

Install these packages:

pip install httpx parsel selenium pandas

| Package | Use |
| --- | --- |
| httpx/requests | Web requests |
| parsel/BeautifulSoup | HTML parsing |
| selenium | Browser automation |
| pandas | Data handling |

Web Scraping Basics

Get familiar with:

  • HTML and CSS selectors
  • HTTP requests/responses
  • Handling dynamic content
  • Rate limiting
  • robots.txt rules

Here's a simple Selenium scraping example:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.crunchbase.com/organization/brightdata")
print(driver.title)  # confirm the page actually loaded
driver.quit()  # close the browser when you're done

Remember: Crunchbase uses tough anti-bot measures. You might need CAPTCHA solving, browser fingerprint spoofing, and IP rotation.

Always check Crunchbase's robots.txt before scraping!
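
You can automate that check with Python's built-in robots.txt parser. A minimal sketch using the standard library ("MyScraperBot" is a placeholder user agent string):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://www.crunchbase.com/robots.txt')
rp.read()

# Ask whether a given path may be fetched by your crawler
print(rp.can_fetch('MyScraperBot', 'https://www.crunchbase.com/organization/brightdata'))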

Preparing Your Workspace

Let's set up your workspace for Crunchbase scraping:

Virtual Environment

First, create a virtual environment:

  1. Open your terminal
  2. Go to your project folder
  3. Run:
python -m venv crunchbase_scraper
  4. Activate it:
    • Windows: crunchbase_scraper\Scripts\activate
    • macOS/Linux: source crunchbase_scraper/bin/activate

Install Packages

Now, install the needed libraries:

pip install requests beautifulsoup4 selenium pandas

Check Your Setup

Let's make sure everything's working:

  1. Open Python in your terminal:
python
  2. Import the packages:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

No errors? You're good to go!

  3. Exit Python:
exit()

Here's what each package does:

| Package | Purpose |
| --- | --- |
| requests | Sends HTTP requests |
| beautifulsoup4 | Parses HTML |
| selenium | Automates browsers |
| pandas | Handles data |

That's it! Your workspace is ready for Crunchbase scraping.

Looking at Crunchbase's Layout

Crunchbase organizes company info into different sections. Here's how it breaks down:

  • /organization/openai gives you the overview
  • /organization/openai/company_financials shows the money stuff
  • /organization/openai/people lists who's who

When you're scraping, you'll need to hop between these sections to get the full picture.
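
To hop between them in code, you can build each section's URL from the company's slug. A small sketch based on the URL patterns above:

BASE = 'https://www.crunchbase.com/organization'

def section_urls(slug):
    # Build the overview, financials, and people URLs for one company
    return {
        'overview': f'{BASE}/{slug}',
        'financials': f'{BASE}/{slug}/company_financials',
        'people': f'{BASE}/{slug}/people',
    }

print(section_urls('openai')['people'])
# https://www.crunchbase.com/organization/openai/people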

Building Your Scraper

Let's create a scraper for Crunchbase using Python and BeautifulSoup.

Getting Started

Set up your Python environment:

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

def crawl(page_url, api_token):
    # Fetch a page through the Crawlbase proxy API
    api = CrawlingAPI({'token': api_token})
    response = api.get(page_url)
    if response['status_code'] == 200:
        return response['body']
    print(f"Error: status code {response['status_code']}")
    return None

api_token = 'YOUR_CRAWLBASE_TOKEN'
page_url = 'https://www.crunchbase.com/organization/openai'
html_content = crawl(page_url, api_token)

This code fetches HTML content from Crunchbase using the Crawlbase API.

Handling Logins

For login-required data:

  1. Create a session with httpx
  2. Send login credentials via POST request
  3. Save the session cookie
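
Here's a minimal sketch of that flow with httpx. The login URL and form field names are assumptions; inspect the real login request in your browser's dev tools before using them:

import httpx

# Hypothetical endpoint and form fields - check the actual request in dev tools
login_url = 'https://www.crunchbase.com/login'
credentials = {'email': 'you@example.com', 'password': 'your_password'}

with httpx.Client(follow_redirects=True) as client:
    response = client.post(login_url, data=credentials)
    response.raise_for_status()
    # The client keeps session cookies, so later requests stay logged in
    profile = client.get('https://www.crunchbase.com/organization/openai')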

Respecting Server Load

To be a good web citizen:

  • Add 5-10 second delays between requests (see the helper below)
  • Use random intervals
  • Limit concurrent requests
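
Wrapped into a helper, that might look like this (a sketch; tune the delay range to your needs):

import random
import time

import httpx

def polite_get(client, url, min_delay=5, max_delay=10):
    # Sleep a random interval before each request so traffic looks less mechanical
    time.sleep(random.uniform(min_delay, max_delay))
    return client.get(url)

with httpx.Client() as client:
    for slug in ['openai', 'anthropic']:
        response = polite_get(client, f'https://www.crunchbase.com/organization/{slug}')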

Parsing HTML

Use BeautifulSoup to extract data:

soup = BeautifulSoup(html_content, 'html.parser')

company_name = soup.find('h1', class_='profile-name').text.strip()
funding_info = soup.find('span', class_='funding-total').text.strip()

print(f"Company: {company_name}")
print(f"Total Funding: {funding_info}")

This code grabs the company name and funding details from a Crunchbase page. The class names here are illustrative; inspect the live HTML for the current ones, since Crunchbase changes its markup regularly.

Getting Company Information

Let's extract specific company data from Crunchbase.

Basic Company Details

Here's how to grab the core company info:

def get_basic_info(soup):
    return {
        'name': soup.find('h1', class_='profile-name').text.strip(),
        'description': soup.find('span', class_='description').text.strip(),
        'website': soup.find('a', class_='website-link')['href'],
        'headquarters': soup.find('span', class_='location').text.strip()
    }

company_info = get_basic_info(soup)
print(f"Company: {company_info['name']}")
print(f"Description: {company_info['description']}")

This code pulls the company name, description, website, and HQ location.

Financial and Funding Data

Crunchbase is a goldmine for funding info. Here's how to dig it up:

def get_funding_info(soup):
    funding_info = {'total_funding': soup.find('span', class_='funding-total').text.strip()}
    funding_info['rounds'] = [{
        'date': rnd.find('span', class_='date').text.strip(),
        'amount': rnd.find('span', class_='amount').text.strip(),
        'series': rnd.find('span', class_='series').text.strip()
    } for rnd in soup.find_all('div', class_='funding-round')]  # 'rnd' avoids shadowing built-in round()
    return funding_info

funding_data = get_funding_info(soup)
print(f"Total Funding: {funding_data['total_funding']}")
print(f"Latest Round: {funding_data['rounds'][0]['series']} - {funding_data['rounds'][0]['amount']}")

This snippet grabs the total funding and details of each round.

Team and Leader Info

Want to know who's running the show? Here's how:

def get_team_info(soup):
    return [{
        'name': member.find('span', class_='name').text.strip(),
        'title': member.find('span', class_='title').text.strip(),
        'linkedin': member.find('a', class_='linkedin-link')['href'] if member.find('a', class_='linkedin-link') else None
    } for member in soup.find_all('div', class_='person-card')]

team_data = get_team_info(soup)
for member in team_data[:3]:
    print(f"{member['name']} - {member['title']}")

This function scoops up names, titles, and LinkedIn profiles of team members listed on the company page.

Dealing with Multiple Pages and Search Results

Scraping Crunchbase often means handling multiple pages and search results. Here's how to do it:

Moving Through Page Numbers

To scrape across multiple pages:

import random
import time

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.crunchbase.com/search/organizations/field/organizations/location_identifiers/'
location = 'san-francisco-bay-area'
page = 1

while True:
    url = f"{base_url}{location}?page={page}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract data from current page
    companies = soup.find_all('div', class_='company-card')
    for company in companies:
        # Process company data
        pass

    # Check for next page
    if not soup.find('a', class_='next-page'):
        break

    page += 1
    time.sleep(random.uniform(3, 6))  # polite delay before fetching the next page

This script keeps going until there's no "next" button.

Getting Data from Search Pages

To extract search result data:

def extract_search_results(soup):
    results = []
    for card in soup.find_all('div', class_='company-card'):
        results.append({
            'name': card.find('span', class_='company-name').text.strip(),
            'description': card.find('div', class_='description').text.strip(),
            'funding': card.find('span', class_='funding-total').text.strip()
        })
    return results

# Use in main loop
search_results = extract_search_results(soup)

Handling Content That Loads Later

For dynamic content, use Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.crunchbase.com/search/funding_rounds')

# Wait for funding rounds to load
funding_rounds = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, 'funding-round-card'))
)

for rnd in funding_rounds:
    # Extract funding round data from each card
    pass

driver.quit()

This waits for funding rounds to load before extracting data.

Staying Under the Radar While Scraping

Want to scrape Crunchbase without getting caught? Here's how:

Mix Up Your Identity

Make your scraper blend in:

  • Swap out user agent strings. It's not perfect, but it helps.
  • Change IP addresses for each request. Don't make it easy for Crunchbase to spot you.

Here's a quick Python example:

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}
response = requests.get('https://www.crunchbase.com', headers=headers)

Act Human

Mimic real browsing:

  • Pause between requests (3-10 seconds).
  • Mix up your timing.

Like this:

import time
import random

time.sleep(random.uniform(3, 10))

Hide Behind Proxies

Mask your real IP:

  • Use rotating proxies for each request (a code sketch follows the table below).
  • Try residential proxies - they're tougher to spot.

| Proxy Provider | IP Pool Size | Starting Price |
| --- | --- | --- |
| Bright Data | 72M+ IPs | $5.04/GB |
| Oxylabs | 100M+ IPs | $8/GB |
| Smartproxy | Not specified | $7/GB |

Pro tip: Rotate proxies by subnet. It makes you even harder to catch.
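
In code, rotating through a pool can be as simple as this sketch (the proxy addresses are placeholders; substitute your provider's endpoints and credentials):

import random

import requests

# Placeholder proxy endpoints
PROXIES = [
    'http://user:pass@proxy1.example.com:8000',
    'http://user:pass@proxy2.example.com:8000',
]

def get_with_proxy(url):
    # Pick a different proxy for each request
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=30)

response = get_with_proxy('https://www.crunchbase.com/organization/openai')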

Organizing and Saving Your Data

After scraping Crunchbase, you need to clean and store your data. Here's how:

Clean Up Your Data

Raw scraped data is messy. Use Pandas to tidy it up:

import pandas as pd

df = pd.DataFrame(scraped_data)
df.drop_duplicates(inplace=True)
df['column_name'] = df['column_name'].fillna('N/A')  # assignment instead of chained inplace fillna
df['funding_amount'] = df['funding_amount'].astype(float)  # assumes the values are plain numeric strings

Save Your Data

You've got options for storing cleaned data:

| Format | Good For | Not So Good For |
| --- | --- | --- |
| CSV | Simple use, wide compatibility | Complex data, large files |
| JSON | Keeping data structure | Quick read/write |
| Excel | Small datasets | Automation |
| Database | Big data, queries | Quick setup |

For most Crunchbase scrapes, CSV works well:

df.to_csv("crunchbase_data.csv", index=False, encoding='utf-8')

Check and Fix Your Data

Always double-check your data:

  1. Look for weird values
  2. Make sure formats match (dates, money, etc.)
  3. Check if important stuff is missing

Use Pandas to spot issues:

print(df.isnull().sum())
print(df.describe())
print(df['category'].unique())

Fix errors as you go. For example, to clean up company names:

df['company_name'] = df['company_name'].str.strip().str.title()

Remember: Good data is clean data. Take the time to get it right.

Making Scraping Happen Automatically

Want to keep your Crunchbase data fresh? Here's how to set up regular scraping:

Setting Up Regular Scraping Times

Use cron jobs to schedule your scraper. Here's an example that runs your script daily at 2 AM:

0 2 * * * /usr/bin/python3 /path/to/your_crunchbase_scraper.py >> /path/to/scraper_log.log 2>&1

Scraping Only New Information

To save time and resources, focus on new data:

  1. Store the last scrape date (sketched below)
  2. Use Crunchbase's API to fetch only updated records
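
For step 1, persisting the timestamp between runs can be a simple text file. A sketch (the filename and fallback date are arbitrary):

import os
from datetime import datetime

STATE_FILE = 'last_scrape.txt'

def load_last_scrape(default='2024-01-01'):
    # Fall back to a fixed date on the very first run
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as fp:
            return fp.read().strip()
    return default

def save_last_scrape():
    with open(STATE_FILE, 'w') as fp:
        fp.write(datetime.now().strftime('%Y-%m-%d'))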

Here's a Python snippet to get updates:

import requests
from datetime import datetime, timedelta

api_key = 'YOUR_API_KEY'
base_url = 'https://api.crunchbase.com/api/v4/'
last_update = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d')

params = {
    'user_key': api_key,
    'updated_since': last_update
}

response = requests.get(f'{base_url}entities/organizations', params=params)

if response.status_code == 200:
    new_data = response.json()
    # Process new_data here
else:
    print(f'Error: {response.status_code}')

This code grabs data updated in the last 24 hours.

Updating Existing Data

Got new data? Here's how to update your existing records:

  1. Use a unique identifier (like company name or ID)
  2. Check if the record exists in your database
  3. If it does, update the existing record
  4. If not, add a new record
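
With SQLite, steps 2-4 collapse into a single upsert. A sketch assuming a simple companies table keyed on a Crunchbase slug or ID:

import sqlite3

conn = sqlite3.connect('crunchbase.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS companies (
        id TEXT PRIMARY KEY,
        name TEXT,
        total_funding TEXT
    )
""")

def upsert_company(record):
    # Insert a new row, or update the existing one if the id already exists
    conn.execute("""
        INSERT INTO companies (id, name, total_funding)
        VALUES (:id, :name, :total_funding)
        ON CONFLICT(id) DO UPDATE SET
            name = excluded.name,
            total_funding = excluded.total_funding
    """, record)
    conn.commit()

upsert_company({'id': 'openai', 'name': 'OpenAI', 'total_funding': 'N/A'})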

Other Ways to Get Crunchbase Data

Scraping isn't the only game in town. Let's look at some other options:

Crunchbase API

Crunchbase has an official API. Here's the scoop:

  • Launched v4.0 in April 2020
  • Lets you customize searches and filter data
  • Needs an Enterprise or Applications License
  • 200 calls per minute
  • Uses token-based auth

API Pros and Cons:

| Pros | Cons |
| --- | --- |
| Fresh data | Costs money |
| Customize data pulls | Limited free stuff |
| Lots of calls allowed | Might be tricky to use |
| Crunchbase support | Coding skills helpful |

ScrapingLab: No-Code Option

Don't code? No problem. ScrapingLab's got you covered:

  • Easy to use, no coding needed
  • Grabs key company info (ID, size, location, etc.)
  • Multiple output formats
  • Follows data protection laws

Scraping vs. API: Which to Pick?

Let's break it down:

| Factor | Scraping | API |
| --- | --- | --- |
| Cost | Can be cheap or free | Paid license needed |
| Data Quality | Might have hiccups | Usually spot-on |
| Updates | Depends on your schedule | Almost real-time |
| Ease of Use | Varies | Need API know-how |
| Data Amount | Limited by scraping speed | Can get more data |
| Legal Stuff | Might break rules | Officially OK |

Your choice depends on what you need, your budget, and your tech skills. Want official, current data and have cash? Go API. Tight budget or prefer no-code? Scraping might be your best bet.

Fixing Common Problems

Scraping Crunchbase? You'll hit some snags. Here's how to deal with the big ones:

When Websites Change Their Layout

Websites love to shake things up. Your scraper might break. Here's what to do:

  • Use flexible selectors
  • Set up automated tests
  • Keep error logs

Don't use exact CSS selectors. Try XPath expressions that target content based on text or attributes instead.
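
For example, with parsel you can anchor on visible label text instead of an exact class name. A sketch (the label text is an assumption about the page):

from parsel import Selector

sel = Selector(text=html_content)

# Matching on visible text survives class-name changes that break CSS selectors
funding = sel.xpath(
    "//span[contains(text(), 'Total Funding')]/following-sibling::span/text()"
).get()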

Handling Internet Problems

Bad connections can mess up your scraping. Try this:

  • Use retry mechanisms with exponential backoff (sketched below)
  • Get a good proxy service
  • Save your progress often
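
A retry helper with exponential backoff might look like this sketch:

import time

import requests

def get_with_retries(url, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            # Wait 1s, 2s, 4s, 8s... between attempts
            time.sleep(2 ** attempt)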

When Data Isn't Collected Properly

Sometimes your scraper misses stuff. Fix it like this:

  • Validate scraped data
  • Use FuzzyWuzzy to fix typos (example below)
  • Check for missing or weird data
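
For instance, FuzzyWuzzy (published these days as thefuzz) can snap a misspelled company name to the closest known one. A sketch with a made-up name list:

from fuzzywuzzy import process

known_names = ['OpenAI', 'Anthropic', 'DeepMind']

# Find the closest known name for a scraped, possibly misspelled one
match, score = process.extractOne('OpenAl', known_names)
if score >= 90:
    print(f'Corrected to: {match}')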

Here's a quick look at common scraping errors and fixes:

| Error | Cause | Fix |
| --- | --- | --- |
| 404 Not Found | Broken links | Update URL list, validate URLs |
| 403 Forbidden | Bot detection | Rotate IPs, set proper User-Agent |
| Parsing errors | Changed HTML | Use better selectors, update parsing |
| Rate limiting (429) | Too many requests | Add delays, use rotating proxies |

Crunchbase might try to stop you. If you get a 403 error:

  1. Set a realistic User-Agent
  2. Add delays between requests
  3. Try a scraping API like ZenRows

Tips for Better Scraping

Speed Up Your Scraper

Want to make your Crunchbase scraper faster? Use parallel processing with async requests. Here's how:

import json
import os
import time

from concurrent.futures import as_completed
from requests_futures.sessions import FuturesSession

os.makedirs('tmp', exist_ok=True)  # make sure the output directory exists

def response_hook(resp, *args, **kwargs):
    # Save each response body to its own JSON file as soon as it arrives
    with open(f'tmp/{time.thread_time_ns()}.json', 'wb') as fp:
        fp.write(json.dumps(resp.json()).encode('utf-8'))

with FuturesSession() as session:
    session.hooks['response'] = response_hook  # one hook for every request
    futures = [
        session.get(f'https://www.crunchbase.com/api/v4/entities/{i}')
        for i in range(1000)
    ]
    for future in as_completed(futures):
        resp = future.result()

Running requests concurrently like this can be several times faster than fetching pages one at a time; the exact speedup depends on network latency and how aggressively the site throttles you.

Keep Your Data Clean

Good data is crucial. Here's how to keep your Crunchbase data squeaky clean:

  1. Check for missing or weird data
  2. Use FuzzyWuzzy to fix small typos
  3. Compare your data with other trusted sources

Pro tip: Keep a log of data issues. It'll help you spot and fix problems fast.

Play Nice When Scraping

Follow these rules to stay on Crunchbase's good side:

| Rule | Why? |
| --- | --- |
| Check robots.txt | Tells you where you can and can't scrape |
| Add delays between requests | Doesn't overload their servers |
| Switch up user agents | Makes your scraper look like different browsers |
| Use proxies | Hides your real IP address |

As James Densmore, a Data Scientist, says:

"With a little respect we can keep a good thing going."

Remember: Good scraping is about balance. Be fast, but be nice too.

Wrapping Up

Here's how to scrape Crunchbase with Python in 2024:

  1. Set up Python
  2. Install libraries
  3. Analyze Crunchbase's structure
  4. Build your scraper
  5. Extract data
  6. Handle pagination
  7. Add anti-detection measures
  8. Clean and store data

What can you do with this data? A lot:

| Use Case | Description |
| --- | --- |
| Market Research | Track trends and competitors |
| Lead Generation | Find potential clients |
| Investment Analysis | Evaluate startups |
| Recruitment | Spot top talent |
| Partnership Scouting | Find collaboration opportunities |

Keep your scraper running smoothly:

  • Check for website changes weekly
  • Update your code
  • Monitor performance
  • Stay on top of Crunchbase's terms

Remember: scrape ethically. As data scientist James Densmore says:

"With a little respect we can keep a good thing going."
