Want to extract valuable business data from Crunchbase? Here's a quick guide to scraping it with Python:
- Set up Python and install key libraries (httpx, parsel, selenium, pandas)
- Analyze Crunchbase's structure and respect their robots.txt
- Build a scraper using BeautifulSoup or Selenium
- Extract company info, funding data, and team details
- Handle pagination and search results
- Implement anti-detection measures (rotate IPs, use delays)
- Clean and store the scraped data
Remember: Always scrape ethically and respect Crunchbase's terms of service.
Quick Comparison: Scraping vs. API
Factor | Scraping | API |
---|---|---|
Cost | Can be free | Paid license required |
Data Quality | May have errors | Usually accurate |
Updates | Manual/scheduled | Near real-time |
Ease of Use | Varies | Requires API knowledge |
Data Volume | Limited by scraping speed | Higher limits |
Legal Compliance | Potential issues | Officially approved |
Choose based on your needs, budget, and technical skills. Scraping works for tight budgets, while the API offers official, up-to-date data.
What You Need to Start
To scrape Crunchbase with Python in 2024, you'll need these essentials:
Python Setup
- Install Python 3+ from python.org
- Create a project directory
- Open it in PyCharm, VSCode, or Jupyter Notebook
Key Python Libraries
Install these packages:
pip install httpx parsel selenium pandas
Package | Use |
---|---|
httpx/requests | Web requests |
parsel/BeautifulSoup | HTML parsing |
selenium | Browser automation |
pandas | Data handling |
Web Scraping Basics
Get familiar with:
- HTML and CSS selectors
- HTTP requests/responses
- Handling dynamic content
- Rate limiting
- robots.txt rules
Here's a simple Selenium scraping example:
from selenium import webdriver

# Launch Chrome, load a company profile, then close the browser
driver = webdriver.Chrome()
driver.get("https://www.crunchbase.com/organization/brightdata")
driver.quit()
Remember: Crunchbase uses tough anti-bot measures. You might need CAPTCHA solving, browser fingerprint spoofing, and IP rotation.
Always check Crunchbase's robots.txt before scraping!
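A quick way to do that from Python is the standard library's urllib.robotparser — a minimal sketch:

from urllib.robotparser import RobotFileParser

# Download and parse Crunchbase's robots.txt
rp = RobotFileParser()
rp.set_url("https://www.crunchbase.com/robots.txt")
rp.read()

# Ask whether a generic crawler may fetch a given URL
url = "https://www.crunchbase.com/organization/openai"
print(rp.can_fetch("*", url))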
Preparing Your Workspace
Let's set up your workspace for Crunchbase scraping:
Virtual Environment
First, create a virtual environment:
- Open your terminal
- Go to your project folder
- Run:
python -m venv crunchbase_scraper
- Activate it:
- Windows:
crunchbase_scraper\Scripts\activate
- macOS/Linux:
source crunchbase_scraper/bin/activate
Install Packages
Now, install the needed libraries:
pip install requests beautifulsoup4 selenium pandas
Check Your Setup
Let's make sure everything's working:
- Open Python in your terminal:
python
- Import the packages:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
No errors? You're good to go!
- Exit Python:
exit()
Here's what each package does:
Package | Purpose |
---|---|
requests | Sends HTTP requests |
beautifulsoup4 | Parses HTML |
selenium | Automates browsers |
pandas | Handles data |
That's it! Your workspace is ready for Crunchbase scraping.
Looking at Crunchbase's Layout
Crunchbase organizes company info into different sections. Here's how it breaks down:
- /organization/openai gives you the overview
- /organization/openai/company_financials shows the money stuff
- /organization/openai/people lists who's who
When you're scraping, you'll need to hop between these sections to get the full picture.
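Since those paths follow one fixed pattern, you can build all three from a company's slug. A small helper like this (the 'openai' slug comes from the URLs above):

def section_urls(slug):
    # Build the overview, financials, and people URLs for one company slug
    base = f"https://www.crunchbase.com/organization/{slug}"
    return {
        'overview': base,
        'financials': f"{base}/company_financials",
        'people': f"{base}/people",
    }

print(section_urls('openai')['financials'])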
Building Your Scraper
Let's create a scraper for Crunchbase using Python and BeautifulSoup.
Getting Started
Set up your Python environment:
from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

def crawl(page_url, api_token):
    # Request a page through the Crawlbase API and return its HTML
    api = CrawlingAPI({'token': api_token})
    response = api.get(page_url)
    if response['status_code'] == 200:
        return response['body']
    print(f"Error: {response}")
    return None

api_token = 'YOUR_CRAWLBASE_TOKEN'
page_url = 'https://www.crunchbase.com/organization/openai'
html_content = crawl(page_url, api_token)
This code fetches HTML content from Crunchbase using the Crawlbase API.
Handling Logins
For login-required data:
- Create a session with httpx
- Send login credentials via POST request
- Save the session cookie
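Here's a minimal sketch of that flow with httpx. The login URL and form field names are hypothetical placeholders — inspect the actual login form before relying on them:

import httpx

# Hypothetical login endpoint and form field names -- check the real form first
login_url = 'https://www.crunchbase.com/login'
credentials = {'email': 'you@example.com', 'password': 'YOUR_PASSWORD'}

with httpx.Client() as client:
    # The Client stores cookies, so the session cookie persists after login
    resp = client.post(login_url, data=credentials)
    resp.raise_for_status()
    # Later requests on the same client reuse that cookie
    page = client.get('https://www.crunchbase.com/organization/openai')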
Respecting Server Load
To be a good web citizen:
- Add 5-10 second delays between requests
- Use random intervals
- Limit concurrent requests
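The first two points boil down to a randomized pause before every request — a minimal sketch with httpx:

import time
import random
import httpx

def polite_get(client, url):
    # A random 5-10 second pause before each request keeps the load gentle
    time.sleep(random.uniform(5, 10))
    return client.get(url)

with httpx.Client() as client:
    response = polite_get(client, 'https://www.crunchbase.com/organization/openai')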
Parsing HTML
Use BeautifulSoup to extract data:
soup = BeautifulSoup(html_content, 'html.parser')
company_name = soup.find('h1', class_='profile-name').text.strip()
funding_info = soup.find('span', class_='funding-total').text.strip()
print(f"Company: {company_name}")
print(f"Total Funding: {funding_info}")
This code grabs the company name and funding details from a Crunchbase page.
Getting Company Information
Let's extract specific company data from Crunchbase.
Basic Company Details
Here's how to grab the core company info:
def get_basic_info(soup):
    # Pull the core profile fields from the parsed page
    return {
        'name': soup.find('h1', class_='profile-name').text.strip(),
        'description': soup.find('span', class_='description').text.strip(),
        'website': soup.find('a', class_='website-link')['href'],
        'headquarters': soup.find('span', class_='location').text.strip()
    }

company_info = get_basic_info(soup)
print(f"Company: {company_info['name']}")
print(f"Description: {company_info['description']}")
This code pulls the company name, description, website, and HQ location.
Financial and Funding Data
Crunchbase is a goldmine for funding info. Here's how to dig it up:
def get_funding_info(soup):
    # Total raised plus per-round details
    funding_info = {'total_funding': soup.find('span', class_='funding-total').text.strip()}
    funding_info['rounds'] = [{
        'date': rnd.find('span', class_='date').text.strip(),
        'amount': rnd.find('span', class_='amount').text.strip(),
        'series': rnd.find('span', class_='series').text.strip()
    } for rnd in soup.find_all('div', class_='funding-round')]  # rnd avoids shadowing the built-in round()
    return funding_info

funding_data = get_funding_info(soup)
print(f"Total Funding: {funding_data['total_funding']}")
print(f"Latest Round: {funding_data['rounds'][0]['series']} - {funding_data['rounds'][0]['amount']}")
This snippet grabs the total funding and details of each round.
Team and Leader Info
Want to know who's running the show? Here's how:
def get_team_info(soup):
    # One dict per person card: name, title, optional LinkedIn URL
    team = []
    for member in soup.find_all('div', class_='person-card'):
        linkedin = member.find('a', class_='linkedin-link')
        team.append({
            'name': member.find('span', class_='name').text.strip(),
            'title': member.find('span', class_='title').text.strip(),
            'linkedin': linkedin['href'] if linkedin else None
        })
    return team

team_data = get_team_info(soup)
for member in team_data[:3]:
    print(f"{member['name']} - {member['title']}")
This function scoops up names, titles, and LinkedIn profiles of team members listed on the company page.
Dealing with Multiple Pages and Search Results
Scraping Crunchbase often means handling multiple pages and search results. Here's how to do it:
Moving Through Page Numbers
To scrape across multiple pages:
import time
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.crunchbase.com/search/organizations/field/organizations/location_identifiers/'
location = 'san-francisco-bay-area'
page = 1

while True:
    url = f"{base_url}{location}?page={page}"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract data from the current page
    companies = soup.find_all('div', class_='company-card')
    for company in companies:
        # Process company data
        pass

    # Stop when there's no "next" link
    if not soup.find('a', class_='next-page'):
        break
    page += 1
    time.sleep(5)  # pause between pages, per the rate-limiting advice above
This script keeps going until there's no "next" button.
Getting Data from Search Pages
To extract search result data:
def extract_search_results(soup):
    # Collect name, description, and funding from each result card
    results = []
    for card in soup.find_all('div', class_='company-card'):
        results.append({
            'name': card.find('span', class_='company-name').text.strip(),
            'description': card.find('div', class_='description').text.strip(),
            'funding': card.find('span', class_='funding-total').text.strip()
        })
    return results

# Use in the main pagination loop
search_results = extract_search_results(soup)
Handling Content That Loads Later
For dynamic content, use Selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.crunchbase.com/search/funding_rounds')

# Wait up to 10 seconds for the funding round cards to render
funding_rounds = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, 'funding-round-card'))
)

for card in funding_rounds:
    # Extract funding round data from each card element
    pass

driver.quit()
This waits for funding rounds to load before extracting data.
Staying Under the Radar While Scraping
Want to scrape Crunchbase without getting caught? Here's how:
Mix Up Your Identity
Make your scraper blend in:
- Swap out user agent strings. It's not perfect, but it helps.
- Change IP addresses for each request. Don't make it easy for Crunchbase to spot you.
Here's a quick Python example:
import requests
from fake_useragent import UserAgent

# Send a randomly chosen real-world User-Agent with the request
ua = UserAgent()
headers = {'User-Agent': ua.random}
response = requests.get('https://www.crunchbase.com', headers=headers)
Act Human
Mimic real browsing:
- Pause between requests (3-10 seconds).
- Mix up your timing.
Like this:
import time
import random
time.sleep(random.uniform(3, 10))
Hide Behind Proxies
Mask your real IP:
- Use rotating proxies for each request.
- Try residential proxies - they're tougher to spot.
Proxy Provider | IP Pool Size | Starting Price |
---|---|---|
Bright Data | 72M+ IPs | $5.04/GB |
Oxylabs | 100M+ IPs | $8/GB |
Smartproxy | Not specified | $7/GB |
Pro tip: Rotate proxies by subnet. It makes you even harder to catch.
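In code, per-request rotation is just picking a random proxy from your pool. A minimal sketch with requests — the addresses are placeholders for whatever endpoints your provider gives you:

import random
import requests

# Placeholder proxy endpoints -- substitute your provider's
proxy_pool = [
    'http://user:pass@198.51.100.1:8000',
    'http://user:pass@203.0.113.7:8000',
]

proxy = random.choice(proxy_pool)
response = requests.get(
    'https://www.crunchbase.com',
    proxies={'http': proxy, 'https': proxy},
)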
Organizing and Saving Your Data
After scraping Crunchbase, you need to clean and store your data. Here's how:
Clean Up Your Data
Raw scraped data is messy. Use Pandas to tidy it up:
import pandas as pd

df = pd.DataFrame(scraped_data)
df.drop_duplicates(inplace=True)
# Fill gaps; assigning back avoids pandas' chained-assignment pitfall
df['column_name'] = df['column_name'].fillna('N/A')
df['funding_amount'] = df['funding_amount'].astype(float)
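One caveat: astype(float) only works once the column holds plain numbers. Scraped funding totals often arrive as strings like "$1.2M", so you may need a converter first — a rough sketch:

def parse_funding(value):
    # Convert strings like "$1.2M" or "$3B" to floats; leave numbers untouched
    if not isinstance(value, str):
        return value
    multipliers = {'K': 1e3, 'M': 1e6, 'B': 1e9}
    cleaned = value.replace('$', '').replace(',', '').strip()
    if cleaned and cleaned[-1].upper() in multipliers:
        return float(cleaned[:-1]) * multipliers[cleaned[-1].upper()]
    return float(cleaned)

df['funding_amount'] = df['funding_amount'].map(parse_funding)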
Save Your Data
You've got options for storing cleaned data:
Format | Good For | Not So Good For |
---|---|---|
CSV | Simple use, wide compatibility | Complex data, large files |
JSON | Keeping data structure | Quick read/write |
Excel | Small datasets | Automation |
Database | Big data, queries | Quick setup |
For most Crunchbase scrapes, CSV works well:
df.to_csv("crunchbase_data.csv", index=False, encoding='utf-8')
Check and Fix Your Data
Always double-check your data:
- Look for weird values
- Make sure formats match (dates, money, etc.)
- Check if important stuff is missing
Use Pandas to spot issues:
print(df.isnull().sum())
print(df.describe())
print(df['category'].unique())
Fix errors as you go. For example, to clean up company names:
df['company_name'] = df['company_name'].str.strip().str.title()
Remember: Good data is clean data. Take the time to get it right.
Making Scraping Happen Automatically
Want to keep your Crunchbase data fresh? Here's how to set up regular scraping:
Setting Up Regular Scraping Times
Use cron jobs to schedule your scraper. Here's an example that runs your script daily at 2 AM:
0 2 * * * /usr/bin/python3 /path/to/your_crunchbase_scraper.py >> /path/to/scraper_log.log 2>&1
Scraping Only New Information
To save time and resources, focus on new data:
- Store the last scrape date
- Use Crunchbase's API to fetch only updated records
Here's a Python snippet to get updates:
import requests
from datetime import datetime, timedelta

api_key = 'YOUR_API_KEY'
base_url = 'https://api.crunchbase.com/api/v4/'

# Only ask for records updated since yesterday
last_update = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d')
params = {
    'user_key': api_key,
    'updated_since': last_update
}

response = requests.get(f'{base_url}entities/organizations', params=params)
if response.status_code == 200:
    new_data = response.json()
    # Process new_data here
else:
    print(f'Error: {response.status_code}')
This code grabs data updated in the last 24 hours.
Updating Existing Data
Got new data? Here's how to update your existing records:
- Use a unique identifier (like company name or ID)
- Check if the record exists in your database
- If it does, update the existing record
- If not, add a new record
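A minimal sketch of that upsert logic with SQLite, assuming a table keyed on a unique company identifier (the ON CONFLICT upsert needs SQLite 3.24+):

import sqlite3

conn = sqlite3.connect('crunchbase.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS companies (
        company_id TEXT PRIMARY KEY,
        name TEXT,
        total_funding REAL
    )
""")

def upsert_company(record):
    # Insert a new row, or update the existing one when the ID already exists
    conn.execute("""
        INSERT INTO companies (company_id, name, total_funding)
        VALUES (:company_id, :name, :total_funding)
        ON CONFLICT(company_id) DO UPDATE SET
            name = excluded.name,
            total_funding = excluded.total_funding
    """, record)
    conn.commit()

upsert_company({'company_id': 'openai', 'name': 'OpenAI', 'total_funding': 0.0})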
Other Ways to Get Crunchbase Data
Scraping isn't the only game in town. Let's look at some other options:
Crunchbase API
Crunchbase has an official API. Here's the scoop:
- Launched v4.0 in April 2020
- Lets you customize searches and filter data
- Needs an Enterprise or Applications License
- 200 calls per minute
- Uses token-based auth
API Pros and Cons:
Pros | Cons |
---|---|
Fresh data | Costs money |
Customize data pulls | Limited free stuff |
Lots of calls allowed | Might be tricky to use |
Crunchbase support | Coding skills helpful |
ScrapingLab: No-Code Option
Don't code? No problem. ScrapingLab's got you covered:
- Easy to use, no coding needed
- Grabs key company info (ID, size, location, etc.)
- Multiple output formats
- Follows data protection laws
Scraping vs. API: Which to Pick?
Let's break it down:
Factor | Scraping | API |
---|---|---|
Cost | Can be cheap or free | Paid license needed |
Data Quality | Might have hiccups | Usually spot-on |
Updates | Depends on your schedule | Almost real-time |
Ease of Use | Varies | Need API know-how |
Data Amount | Limited by scraping speed | Can get more data |
Legal Stuff | Might break rules | Officially OK |
Your choice depends on what you need, your budget, and your tech skills. Want official, current data and have cash? Go API. Tight budget or prefer no-code? Scraping might be your best bet.
Fixing Common Problems
Scraping Crunchbase? You'll hit some snags. Here's how to deal with the big ones:
When Websites Change Their Layout
Websites love to shake things up. Your scraper might break. Here's what to do:
- Use flexible selectors
- Set up automated tests
- Keep error logs
Don't rely on exact, auto-generated CSS selectors. Try XPath expressions that target content based on text or attributes instead, as in the sketch below.
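For example, with parsel (installed earlier), an XPath query can anchor on visible label text rather than a class name that might be renamed:

from parsel import Selector

# Toy HTML standing in for a profile page with auto-generated class names
html = '<div><span class="xyz-1">Total Funding</span><span class="xyz-2">$11.3B</span></div>'
sel = Selector(text=html)

# Anchor on the label text, then take the next span -- class renames won't break this
funding = sel.xpath('//span[contains(text(), "Total Funding")]/following-sibling::span/text()').get()
print(funding)  # $11.3B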
Handling Internet Problems
Bad connections can mess up your scraping. Try this:
- Use retry mechanisms with exponential backoff
- Get a good proxy service
- Save your progress often
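A minimal sketch of retries with exponential backoff, using plain requests:

import time
import requests

def get_with_retries(url, max_retries=5):
    # Double the wait after each failure: 1s, 2s, 4s, 8s, ...
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            wait = 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")

response = get_with_retries('https://www.crunchbase.com/organization/openai')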
When Data Isn't Collected Properly
Sometimes your scraper misses stuff. Fix it like this:
- Validate scraped data
- Use FuzzyWuzzy to fix typos
- Check for missing or weird data
Here's a quick look at common scraping errors and fixes:
Error | Cause | Fix |
---|---|---|
404 Not Found | Broken links | Update URL list, validate URLs |
403 Forbidden | Bot detection | Rotate IPs, set proper User-Agent |
Parsing errors | Changed HTML | Use better selectors, update parsing |
Rate limiting (429) | Too many requests | Add delays, use rotating proxies |
Crunchbase might try to stop you. If you get a 403 error:
- Set a realistic User-Agent
- Add delays between requests
- Try a scraping API like ZenRows
Tips for Better Scraping
Speed Up Your Scraper
Want to make your Crunchbase scraper faster? Use parallel processing with async requests. Here's how:
from requests_futures.sessions import FuturesSession
from concurrent.futures import as_completed
import json
import time

def response_hook(resp, *args, **kwargs):
    # Dump each JSON response to its own file as soon as it arrives
    # (the tmp/ directory must already exist)
    with open(f'tmp/{time.thread_time_ns()}.json', 'wb') as fp:
        parsed = resp.json()
        fp.write(json.dumps(parsed).encode('utf-8'))

futures_session = FuturesSession()
futures_session.hooks['response'] = response_hook

with futures_session as session:
    # Queue up requests; they run concurrently on a thread pool
    futures = [
        session.get(f'https://www.crunchbase.com/api/v4/entities/{i}')
        for i in range(1000)
    ]
    for future in as_completed(futures):
        resp = future.result()
This trick can make your scraper up to 10 times faster than regular requests.
Keep Your Data Clean
Good data is crucial. Here's how to keep your Crunchbase data squeaky clean:
- Check for missing or weird data
- Use FuzzyWuzzy to fix small typos
- Compare your data with other trusted sources
Pro tip: Keep a log of data issues. It'll help you spot and fix problems fast.
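For the typo fix-ups, here's a minimal FuzzyWuzzy sketch that snaps scraped names to the closest entry in a trusted list (the canonical names here are stand-in example data):

from fuzzywuzzy import process

# Stand-in list of known-good names from a trusted source
canonical_names = ['OpenAI', 'Bright Data', 'Smartproxy']

def fix_name(scraped_name, threshold=80):
    # Swap in the closest canonical match when the similarity score is high enough
    match, score = process.extractOne(scraped_name, canonical_names)
    return match if score >= threshold else scraped_name

print(fix_name('OpenAl'))  # likely corrected to 'OpenAI'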
Play Nice When Scraping
Follow these rules to stay on Crunchbase's good side:
Rule | Why? |
---|---|
Check robots.txt | Tells you where you can and can't scrape |
Add delays between requests | Doesn't overload their servers |
Switch up user agents | Makes your scraper look like different browsers |
Use proxies | Hides your real IP address |
As James Densmore, a Data Scientist, says:
"With a little respect we can keep a good thing going."
Remember: Good scraping is about balance. Be fast, but be nice too.
Wrapping Up
Here's how to scrape Crunchbase with Python in 2024:
- Set up Python
- Install libraries
- Analyze Crunchbase's structure
- Build your scraper
- Extract data
- Handle pagination
- Add anti-detection measures
- Clean and store data
What can you do with this data? A lot:
Use Case | Description |
---|---|
Market Research | Track trends and competitors |
Lead Generation | Find potential clients |
Investment Analysis | Evaluate startups |
Recruitment | Spot top talent |
Partnership Scouting | Find collaboration opportunities |
Keep your scraper running smoothly:
- Check for website changes weekly
- Update your code
- Monitor performance
- Stay on top of Crunchbase's terms
Remember: scrape ethically. As data scientist James Densmore says:
"With a little respect we can keep a good thing going."