LinkedIn URL Scraper

Scrape LinkedIn URLs from the websites in an existing link list, instead of opening and scrolling through each original website one by one.

Working Mechanism of the LinkedIn URL Scraper

The script extracts LinkedIn URLs from a list of website URLs provided in an Excel file. Because many sites render their content with JavaScript, every page is loaded in a headless browser before the links are extracted. Here's an overview of how the script works, step by step:


1. Graphical User Interface (GUI)

The script uses Tkinter to create a simple graphical user interface (GUI). The main components of the GUI are:

    • a Load Excel File button and a Start Scraping button,
    • a Listbox showing the URLs read from the file,
    • a progress bar with a progress label, and
    • a Text widget that serves as the scraping log.

The core of the GUI is handled by Tkinter's Tk, Button, Label, Listbox, Text, and ttk.Progressbar widgets.

2. Loading the Excel File

Once the Load Excel File button is clicked, the program:

    • opens a file dialog so the user can pick an .xlsx or .xls file,
    • reads it with pandas and verifies it has a 'URL' column (showing an error dialog if not),
    • fills the Listbox with the non-empty URLs, and
    • remembers the file path for the scraping step.
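The column check at the heart of this step can be sketched as a small helper; here a hand-built DataFrame stands in for the result of `pd.read_excel(file_path)`:

```python
import pandas as pd

def load_urls(df: pd.DataFrame) -> list:
    """Validate that the sheet has a 'URL' column and return its non-empty values."""
    if 'URL' not in df.columns:
        raise ValueError("Excel file must have a 'URL' column.")
    return [u for u in df['URL'] if pd.notna(u)]

# A tiny frame standing in for pd.read_excel(file_path)
frame = pd.DataFrame({'URL': ['https://example.com', None, 'https://example.org']})
print(load_urls(frame))  # NaN/empty rows are dropped
```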

3. Scraping Process (Scraping LinkedIn URLs)

Once the user clicks Start Scraping, the program begins the actual scraping process:

    • the two buttons are disabled and the progress bar is reset,
    • a background thread is started so the window stays responsive, and
    • for each URL, a headless Chrome instance loads the page and a regular expression pulls every linkedin.com link out of the rendered HTML.
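Running the work off the main thread is what keeps the window responsive; the pattern can be sketched with a stand-in for the real per-URL Selenium work:

```python
import threading

results = []

def scrape_all(urls):
    # Stand-in for the real per-URL Selenium scraping loop.
    for url in urls:
        results.append(f"processed {url}")

worker = threading.Thread(target=scrape_all, args=(['a', 'b'],))
worker.start()
worker.join()  # the GUI version does not join(); it updates the progress bar instead
print(results)
```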

4. Removing Duplicate LinkedIn URLs

A page often repeats the same LinkedIn link, so the matches for each URL are passed through a set before being joined into a single comma-separated string.
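A set drops repeats but also scrambles order; if the original order matters, `dict.fromkeys` gives order-preserving de-duplication with the same result:

```python
links = [
    "https://www.linkedin.com/company/acme",
    "https://www.linkedin.com/in/jane",
    "https://www.linkedin.com/company/acme",
]

unique_unordered = list(set(links))          # what the script does
unique_ordered = list(dict.fromkeys(links))  # same links, original order kept

print(unique_ordered)
```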

5. Updating the GUI

While the script is scraping the URLs, it updates the following:

    • the log Text widget, with one line per URL reporting the links found (or "No LinkedIn found"), and
    • the progress bar, advanced after each URL as a percentage of the total.

6. Saving Results to a New Excel File

Once all URLs are processed:

    • the results are added to the DataFrame as a new '领英' (LinkedIn) column,
    • the file is saved next to the input with a _with_linkedin suffix, and
    • a success dialog shows the output path.
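The output path is derived from the input path with a plain string replacement (note this only fires for .xlsx inputs, not .xls):

```python
excel_file_path = "companies.xlsx"
output_file = excel_file_path.replace('.xlsx', '_with_linkedin.xlsx')
print(output_file)  # companies_with_linkedin.xlsx
```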

7. Error Handling

Throughout the script, error handling keeps one failure from aborting the run:

    • each page is scraped inside a try/except, so a timeout or bad URL just yields an empty result, and
    • file loading and the scraping loop are wrapped in try/except blocks that report problems in a messagebox and re-enable the buttons.
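The per-URL guard can be sketched independently of Selenium, with a deliberately failing fetch function standing in for a page that times out:

```python
def safe_scrape(url, fetch):
    """Run fetch(url); on any failure, log it and return an empty result."""
    try:
        return fetch(url)
    except Exception as e:
        print(f"Error processing {url}: {e}")
        return []

def flaky(url):
    # Stand-in for a page load that fails.
    raise ConnectionError("timed out")

print(safe_scrape("https://example.com", flaky))  # []
```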

Workflow Summary

  1. Load Excel File: User selects an Excel file.
  2. Start Scraping: Program reads the URLs from the file and starts scraping.

    • Each page is loaded with Selenium so JavaScript-rendered content is included.
    • LinkedIn URLs are extracted from the rendered HTML with a regular expression.
  3. De-duplicate LinkedIn URLs: Any duplicate LinkedIn URLs are removed (using a set).
  4. Log Updates and Progress: Logs are shown in the GUI, and the progress bar is updated.
  5. Save Results: The results are saved to a new Excel file.
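The extraction step in this workflow is a single regular expression over the rendered HTML, using the same pattern as the script:

```python
import re

html = '''<a href="https://www.linkedin.com/company/acme">LinkedIn</a>
<a href="https://example.com/about">About</a>'''

pattern = r'https?://(?:www\.)?linkedin\.com/[^\s"]+'
print(re.findall(pattern, html))  # ['https://www.linkedin.com/company/acme']
```

The `[^\s"]+` tail stops the match at the closing quote of the href attribute.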

Dependencies and Tools Used

    • pandas (with an Excel engine such as openpyxl) for reading and writing .xlsx files
    • selenium plus a matching chromedriver binary for rendering pages
    • re for the LinkedIn URL pattern
    • tkinter / ttk for the GUI and threading for the background worker

(requests is imported but unused in the version shown below.)
Performance Considerations

The script launches a fresh headless Chrome instance for every URL, which dominates the runtime; pages are also processed strictly one at a time, so a long list takes a while.
Possible Enhancements:

    • Reuse a single WebDriver for the whole run instead of starting one per URL.
    • Try a fast requests + regex pass first and fall back to Selenium only when nothing is found.
    • Make the chromedriver path configurable instead of hard-coded.

import requests
import re
import pandas as pd
import tkinter as tk
from tkinter import filedialog, messagebox, ttk
import threading
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# Fetch the HTML of a dynamically rendered page with Selenium
def find_linkedin_urls_from_dynamic(url: str):
    try:
        # Configure the Selenium WebDriver (Chrome in this example)
        chrome_options = Options()
        chrome_options.add_argument("--headless")  # run without a visible browser window
        service = Service(r"C:\Users\Eddie.Hu\Desktop\chromedriver-win64\chromedriver.exe")  # path to the chromedriver binary
        driver = webdriver.Chrome(service=service, options=chrome_options)

        try:
            # Open the page
            driver.get(url)

            # Wait until at least one <a> tag is present, i.e. the page has rendered
            WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "a")))

            # Grab the rendered page source
            page_content = driver.page_source
        finally:
            # Always close the browser, even if loading or waiting failed
            driver.quit()

        # Extract all LinkedIn links with a regular expression
        linkedin_urls = re.findall(r'https?://(?:www\.)?linkedin\.com/[^\s"]+', page_content)

        return linkedin_urls

    except Exception as e:
        print(f"Error processing {url}: {e}")
        return []

# Process an Excel file end to end (non-GUI batch path)
def process_excel(input_file: str, output_file: str):
    try:
        df = pd.read_excel(input_file)
        if 'URL' not in df.columns:
            messagebox.showerror("Error", "Excel file must have a 'URL' column.")
            return

        linkedin_urls = []
        for url in df['URL']:
            # Extract LinkedIn links via Selenium
            linkedin_links = find_linkedin_urls_from_dynamic(url)

            if linkedin_links:
                # De-duplicate via a set, then join into one string
                unique_links = list(set(linkedin_links))
                linkedin_urls.append(', '.join(unique_links))
            else:
                linkedin_urls.append('No LinkedIn found')

        df['领英'] = linkedin_urls
        df.to_excel(output_file, index=False)
        messagebox.showinfo("Success", f"Results saved to {output_file}")

    except Exception as e:
        messagebox.showerror("Error", f"Error processing the Excel file: {e}")

# Tkinter interface
def create_gui():
    def load_file():
        global excel_file_path
        file_path = filedialog.askopenfilename(title="Select an Excel file", filetypes=[("Excel files", "*.xlsx;*.xls")])
        if file_path:
            try:
                urls = pd.read_excel(file_path)
                if 'URL' not in urls.columns:
                    messagebox.showerror("Error", "Excel file must have a 'URL' column.")
                    return
                url_list.delete(0, tk.END)
                for url in urls['URL']:
                    if pd.notna(url):
                        url_list.insert(tk.END, url)
                excel_file_path = file_path  # already declared global above
            except Exception as e:
                messagebox.showerror("Error", f"Failed to read Excel file: {e}")
    
    def start_scraping():
        if not excel_file_path:
            messagebox.showerror("Error", "Please load an Excel file first.")
            return

        start_button.config(state=tk.DISABLED)
        load_button.config(state=tk.DISABLED)
        progress_bar['value'] = 0
        progress_bar['maximum'] = 100
        progress_label.config(text="Starting scraping...")
        
        threading.Thread(target=run_scraping).start()

    def run_scraping():
        try:
            df = pd.read_excel(excel_file_path)
            df = df.dropna(subset=['URL'])
            linkedin_urls = []
            log_text.delete(1.0, tk.END)

            for i, url in enumerate(df['URL']):
                linkedin_links = find_linkedin_urls_from_dynamic(url)

                if linkedin_links:
                    # De-duplicate and show the unique links
                    unique_links = list(set(linkedin_links))
                    log_text.insert(tk.END, f"Found LinkedIn links for {url}: {', '.join(unique_links)}\n")
                    linkedin_urls.append(', '.join(unique_links))
                else:
                    log_text.insert(tk.END, f"No LinkedIn found for {url}\n")
                    linkedin_urls.append('No LinkedIn found')

                progress_bar['value'] = int((i + 1) / len(df) * 100)
                root.update_idletasks()

            df['领英'] = linkedin_urls
            output_file = excel_file_path.replace('.xlsx', '_with_linkedin.xlsx')
            df.to_excel(output_file, index=False)

            messagebox.showinfo("Success", f"Scraping completed. Results saved to {output_file}")
        except Exception as e:
            messagebox.showerror("Error", f"Scraping failed: {e}")
        finally:
            start_button.config(state=tk.NORMAL)
            load_button.config(state=tk.NORMAL)

    # Build the Tkinter interface
    global root, url_list, start_button, load_button, progress_bar, progress_label, log_text, excel_file_path
    excel_file_path = None  # set once a file is loaded

    root = tk.Tk()
    root.title("LinkedIn URL Scraper")

    load_button = tk.Button(root, text="Load Excel File", command=load_file)
    load_button.pack(pady=10)

    start_button = tk.Button(root, text="Start Scraping", command=start_scraping)
    start_button.pack(pady=10)

    url_list = tk.Listbox(root, width=50, height=10)
    url_list.pack(pady=10)

    progress_label = tk.Label(root, text="Progress:")
    progress_label.pack(pady=5)
    progress_bar = ttk.Progressbar(root, length=300, mode='determinate')
    progress_bar.pack(pady=10)

    log_label = tk.Label(root, text="Scraping Log:")
    log_label.pack(pady=5)
    log_text = tk.Text(root, width=80, height=15)
    log_text.pack(pady=10)
    # Keep the widget enabled: a Text in the DISABLED state silently ignores insert(),
    # so disabling it here would leave the scraping log permanently empty.

    root.mainloop()

# Launch the GUI
if __name__ == "__main__":
    create_gui()