
LinkedIn URL Scraper

Edited November 22, 2024

Scrape LinkedIn URLs from a list of existing website links. This saves the time of clicking through each original website one by one.

Working Mechanism of the LinkedIn URL Scraper

The script you're working with is designed to extract LinkedIn URLs from a list of URLs provided in an Excel file. It handles both static and dynamic pages, with the latter rendered by JavaScript. Here's an overview of how the script works step by step:


1. Graphical User Interface (GUI)

The script uses Tkinter to create a simple graphical user interface (GUI). The main components of the GUI are:

  • Load Excel File Button: Allows the user to select an Excel file (.xlsx or .xls).
  • Start Scraping Button: Starts the process of scraping LinkedIn URLs from the Excel file.
  • Progress Bar: Displays the progress of the scraping process.
  • Log Area: Shows logs of the scraping process, including which URLs were processed and whether LinkedIn links were found or not.

The core of the GUI is built from Tkinter's Tk, Button, Label, and Text widgets, plus the Progressbar widget from tkinter.ttk.
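
Stripped of the event handlers, the widget skeleton looks roughly like this (the complete script at the end of this post wires up the button commands):

import tkinter as tk
from tkinter import ttk

root = tk.Tk()
root.title("LinkedIn URL Scraper")

# Buttons trigger the two main actions; commands are attached in the full script.
load_button = tk.Button(root, text="Load Excel File")
load_button.pack(pady=10)
start_button = tk.Button(root, text="Start Scraping")
start_button.pack(pady=10)

# Progress bar and log area give feedback while scraping runs.
progress_bar = ttk.Progressbar(root, length=300, mode='determinate')
progress_bar.pack(pady=10)
log_text = tk.Text(root, width=80, height=15)
log_text.pack(pady=10)

root.mainloop()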

2. Loading Excel File

Once the Load Excel File button is clicked, the program:

  • Prompts the user to select an Excel file containing a column named URL (this column should contain the list of URLs to scrape).
  • After loading the file, it reads the Excel file into a pandas DataFrame, which is a structured representation of the Excel data.
  • The program checks if the URL column exists and loads the URLs into a list, displaying them in the GUI.
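
In pandas this boils down to a couple of calls; a minimal sketch (load_urls is an illustrative helper, not a function from the script):

import pandas as pd

def load_urls(file_path: str) -> list[str]:
    # Read the spreadsheet into a DataFrame; openpyxl handles .xlsx files.
    df = pd.read_excel(file_path)
    if 'URL' not in df.columns:
        raise ValueError("Excel file must have a 'URL' column.")
    # Drop empty cells and return the URLs as a plain list.
    return df['URL'].dropna().tolist()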

3. Scraping Process (Scraping LinkedIn URLs)

Once the user clicks Start Scraping, the program begins the actual scraping process:

  • Static Page Scraping (Using Regular Expressions):

    • Initially, it tries to scrape LinkedIn URLs using a static method.
    • For each URL in the URL column, it uses requests to fetch the page's HTML content.
    • The program then uses a regular expression (re.findall) to search the page content for LinkedIn URLs, matching links that begin with http(s)://linkedin.com/ or http(s)://www.linkedin.com/.
  • Dynamic Page Scraping (Using Selenium):

    • If static scraping doesn’t yield results, the program uses Selenium (a web automation tool) to handle dynamic pages (pages where the content is rendered by JavaScript).
    • Selenium launches a headless Chrome browser (i.e., a browser that operates without a graphical user interface).
    • The script loads the webpage, waits for elements to render (using WebDriverWait), and retrieves the fully rendered page source.
    • It then uses the same regular expression to find LinkedIn URLs from the dynamically generated HTML content.
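
Here is a minimal sketch of that static-first, Selenium-fallback flow. The helper names find_linkedin_urls_static and find_linkedin_urls are illustrative; note that the complete script at the end of this post simply calls the Selenium path for every URL:

import re
import requests

LINKEDIN_RE = re.compile(r'https?://(?:www\.)?linkedin\.com/[^\s"]+')

def find_linkedin_urls_static(url: str) -> list[str]:
    # Plain HTTP fetch: cheap, but only sees HTML that the server returns directly.
    resp = requests.get(url, timeout=10)
    return LINKEDIN_RE.findall(resp.text)

def find_linkedin_urls(url: str) -> list[str]:
    # Try the fast static fetch first; fall back to Selenium for JavaScript-rendered pages.
    try:
        links = find_linkedin_urls_static(url)
        if links:
            return links
    except requests.RequestException:
        pass  # network errors fall through to the Selenium path
    return find_linkedin_urls_from_dynamic(url)  # defined in the complete script below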

4. Removing Duplicate LinkedIn URLs

  • Set Data Structure:

    • If multiple LinkedIn URLs are found for the same source URL, the program converts the list of LinkedIn URLs into a set. A set automatically removes duplicate items, ensuring that only unique LinkedIn URLs are retained.
    • The unique URLs are then saved back into the output DataFrame.
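
For example (the URLs here are made up):

linkedin_links = [
    "https://www.linkedin.com/company/acme",
    "https://www.linkedin.com/company/acme",   # duplicate from the same page
    "https://www.linkedin.com/in/jane-doe",
]
unique_links = list(set(linkedin_links))       # the set drops the duplicate
cell_value = ', '.join(unique_links)           # stored as one comma-separated cell

Note that a set does not preserve order; if the original order matters, list(dict.fromkeys(linkedin_links)) de-duplicates while keeping it.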

5. Updating the GUI

While the script is scraping the URLs, it updates the following:

  • Log Text Area: Displays logs about the current URL being processed and the LinkedIn URLs found. This is useful for tracking the scraping process.
  • Progress Bar: Shows the progress of the scraping operation. It updates as URLs are processed, giving the user a visual indication of how much of the task is completed.
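
A self-contained toy version of the update loop (time.sleep stands in for the real scraping work):

import time
import tkinter as tk
from tkinter import ttk

root = tk.Tk()
progress_bar = ttk.Progressbar(root, length=300, mode='determinate')
progress_bar.pack(pady=10)
log_text = tk.Text(root, width=50, height=5)
log_text.pack(pady=10)

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
for i, url in enumerate(urls):
    time.sleep(0.5)                       # stand-in for the actual scraping work
    log_text.insert(tk.END, f"Processed {url}\n")
    progress_bar['value'] = int((i + 1) / len(urls) * 100)
    root.update_idletasks()               # repaint the widgets mid-loop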

6. Saving Results to a New Excel File

Once all URLs are processed:

  • The program adds a new column to the Excel DataFrame called 领英 (LinkedIn).
  • This new column contains the LinkedIn URLs found for each source URL.
  • The results are saved to a new Excel file with _with_linkedin.xlsx appended to the original filename.
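
A minimal sketch of the saving step (file names are made up; os.path.splitext keeps the suffix logic correct for .xls inputs as well as .xlsx):

import os
import pandas as pd

df = pd.DataFrame({'URL': ['https://example.com']})          # stand-in input data
df['领英'] = ['https://www.linkedin.com/company/example']    # scraped results column

input_file = 'companies.xlsx'                                # hypothetical source file
base, _ = os.path.splitext(input_file)
output_file = base + '_with_linkedin.xlsx'                   # companies_with_linkedin.xlsx
df.to_excel(output_file, index=False)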

7. Error Handling

Throughout the script, error handling is implemented to ensure smooth operation:

  • File errors: If the Excel file is not in the correct format or is missing the URL column, an error message is shown to the user.
  • Scraping errors: If an error occurs while scraping (either during the static or dynamic scraping), the program catches the error and logs it in the GUI.
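
The pattern looks roughly like this (scrape_safely is an illustrative name; in the listed script the try/except lives inside find_linkedin_urls_from_dynamic and the GUI loop):

def scrape_safely(url: str) -> list[str]:
    # Pattern used throughout the script: catch, report, and keep going.
    try:
        return find_linkedin_urls_from_dynamic(url)
    except Exception as e:
        print(f"Error processing {url}: {e}")   # the GUI writes this to the log Text widget instead
        return []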

Workflow Summary

  1. Load Excel File: User selects an Excel file.
  2. Start Scraping: Program reads the URLs from the file and starts scraping.

    • It attempts to scrape LinkedIn URLs via static methods (using regular expressions).
    • If the page is rendered dynamically, it uses Selenium to scrape the rendered HTML.
  3. De-duplicate LinkedIn URLs: Any duplicate LinkedIn URLs are removed (using a set).
  4. Log Updates and Progress: Logs are shown in the GUI, and the progress bar is updated.
  5. Save Results: The results are saved to a new Excel file.

Dependencies and Tools Used

  • Requests: For fetching static page content.
  • Regular Expressions (re): For matching LinkedIn URLs in the page content.
  • Selenium: For handling pages that are dynamically rendered by JavaScript.
  • BeautifulSoup (optional): Not used in this version, but it could be added if more structured HTML parsing is ever needed.
  • Tkinter: For creating the graphical user interface.
  • Pandas: For handling the Excel files and data manipulation.
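
These can typically be installed in one go (openpyxl is the engine pandas uses for .xlsx files; reading legacy .xls files additionally needs xlrd):

pip install requests pandas selenium openpyxl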

Performance Considerations

  • Headless Browser (Selenium): Since Selenium launches a real browser instance, it can be slow compared to just fetching static HTML using requests. However, it's essential for scraping dynamic content that can't be accessed through a simple HTTP request.
  • Excel File Size: For large Excel files, the program might take some time to process all URLs and can require significant memory depending on the number of URLs.

Possible Enhancements

  • Error Recovery: Retry failed URLs a certain number of times before reporting them as failed (see the sketch after this list).
  • Logging: The script could be extended to write errors to an external log file for easier debugging.
  • Concurrency: Threads or asynchronous programming could speed up scraping when processing many URLs.
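
For instance, error recovery could be a small wrapper around the existing scraper. Since find_linkedin_urls_from_dynamic returns an empty list on failure, this hypothetical sketch simply retries on an empty result:

import time

def scrape_with_retry(url: str, retries: int = 3, delay: float = 2.0) -> list[str]:
    # Hypothetical wrapper: retry a few times before reporting a URL as failed.
    for attempt in range(1, retries + 1):
        links = find_linkedin_urls_from_dynamic(url)
        if links:
            return links
        if attempt < retries:
            time.sleep(delay)  # brief pause before the next attempt
    return []

For concurrency, concurrent.futures.ThreadPoolExecutor could map the scraper over the URL list, though each worker would then need its own WebDriver instance.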

The Complete Script

import os
import re
import threading
import tkinter as tk
from tkinter import filedialog, messagebox, ttk

import pandas as pd
import requests  # needed only for the static-scraping variant sketched earlier; this version is Selenium-only
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# Fetch the HTML of a dynamically rendered page with Selenium and extract LinkedIn links
def find_linkedin_urls_from_dynamic(url: str):
    try:
        # Configure the Selenium WebDriver (Chrome in this example)
        chrome_options = Options()
        chrome_options.add_argument("--headless")  # run Chrome without a visible window
        service = Service(r"C:\Users\Eddie.Hu\Desktop\chromedriver-win64\chromedriver.exe")  # path to your local chromedriver
        driver = webdriver.Chrome(service=service, options=chrome_options)

        try:
            # Open the page and wait until at least one <a> element has rendered
            driver.get(url)
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "a")))

            # Grab the fully rendered page source
            page_content = driver.page_source
        finally:
            # Always close the browser, even if the page failed to load
            driver.quit()

        # Extract all LinkedIn links with a regular expression
        linkedin_urls = re.findall(r'https?://(?:www\.)?linkedin\.com/[^\s"]+', page_content)

        return linkedin_urls

    except Exception as e:
        print(f"Error processing {url}: {e}")
        return []

# Process an Excel file end-to-end (standalone helper; the GUI path below uses run_scraping instead)
def process_excel(input_file: str, output_file: str):
    try:
        df = pd.read_excel(input_file)
        if 'URL' not in df.columns:
            messagebox.showerror("Error", "Excel file must have a 'URL' column.")
            return

        linkedin_urls = []
        for url in df['URL']:
            # Extract LinkedIn links with Selenium
            linkedin_links = find_linkedin_urls_from_dynamic(url)

            if linkedin_links:
                # De-duplicate with a set to guarantee uniqueness
                unique_links = list(set(linkedin_links))
                linkedin_urls.append(', '.join(unique_links))  # join multiple links into one string
            else:
                linkedin_urls.append('No LinkedIn found')

        df['领英'] = linkedin_urls
        df.to_excel(output_file, index=False)
        messagebox.showinfo("Success", f"Results saved to {output_file}")

    except Exception as e:
        messagebox.showerror("Error", f"Error processing the Excel file: {e}")

# Tkinter interface
def create_gui():
    def load_file():
        global excel_file_path
        file_path = filedialog.askopenfilename(title="Select an Excel file", filetypes=[("Excel files", "*.xlsx;*.xls")])
        if file_path:
            try:
                urls = pd.read_excel(file_path)
                if 'URL' not in urls.columns:
                    messagebox.showerror("Error", "Excel file must have a 'URL' column.")
                    return
                url_list.delete(0, tk.END)
                for url in urls['URL']:
                    if pd.notna(url):
                        url_list.insert(tk.END, url)
                excel_file_path = file_path
            except Exception as e:
                messagebox.showerror("Error", f"Failed to read Excel file: {e}")
    
    def start_scraping():
        if not excel_file_path:
            messagebox.showerror("Error", "Please load an Excel file first.")
            return

        start_button.config(state=tk.DISABLED)
        load_button.config(state=tk.DISABLED)
        progress_bar['value'] = 0
        progress_bar['maximum'] = 100
        progress_label.config(text="Starting scraping...")
        
        threading.Thread(target=run_scraping, daemon=True).start()  # daemon thread so closing the window doesn't hang the process

    def run_scraping():
        try:
            df = pd.read_excel(excel_file_path)
            df = df.dropna(subset=['URL'])
            linkedin_urls = []
            log_text.config(state=tk.NORMAL)  # the Text widget silently ignores edits while disabled
            log_text.delete(1.0, tk.END)

            for i, url in enumerate(df['URL']):
                linkedin_links = find_linkedin_urls_from_dynamic(url)

                if linkedin_links:
                    # De-duplicate and display the unique links
                    unique_links = list(set(linkedin_links))
                    log_text.insert(tk.END, f"Found LinkedIn links for {url}: {', '.join(unique_links)}\n")
                    linkedin_urls.append(', '.join(unique_links))
                else:
                    log_text.insert(tk.END, f"No LinkedIn found for {url}\n")
                    linkedin_urls.append('No LinkedIn found')

                progress_bar['value'] = int((i + 1) / len(df) * 100)
                root.update_idletasks()

            df['领英'] = linkedin_urls
            base, _ = os.path.splitext(excel_file_path)  # works for .xls inputs as well as .xlsx
            output_file = base + '_with_linkedin.xlsx'
            df.to_excel(output_file, index=False)

            messagebox.showinfo("Success", f"Scraping completed. Results saved to {output_file}")
        except Exception as e:
            messagebox.showerror("Error", f"Scraping failed: {e}")
        finally:
            log_text.config(state=tk.DISABLED)  # lock the log again once the run ends
            start_button.config(state=tk.NORMAL)
            load_button.config(state=tk.NORMAL)

    # Build the Tkinter interface
    global root, url_list, start_button, load_button, progress_bar, progress_label, log_text, excel_file_path

    excel_file_path = None  # set once the user loads a file

    root = tk.Tk()
    root.title("LinkedIn URL Scraper")

    load_button = tk.Button(root, text="Load Excel File", command=load_file)
    load_button.pack(pady=10)

    start_button = tk.Button(root, text="Start Scraping", command=start_scraping)
    start_button.pack(pady=10)

    url_list = tk.Listbox(root, width=50, height=10)
    url_list.pack(pady=10)

    progress_label = tk.Label(root, text="Progress:")
    progress_label.pack(pady=5)
    progress_bar = ttk.Progressbar(root, length=300, mode='determinate')
    progress_bar.pack(pady=10)

    log_label = tk.Label(root, text="Scraping Log:")
    log_label.pack(pady=5)
    log_text = tk.Text(root, width=80, height=15)
    log_text.pack(pady=10)
    log_text.config(state=tk.DISABLED)

    root.mainloop()

# Launch the GUI
create_gui()