This post describes a practical approach to scraping data from a platform with a hostile API, focusing on extracting user posts about AI technologies from Xiaohongshu. The author ran into several obstacles and settled on Playwright for browser automation, listening to the responses the browser receives rather than calling the API directly. Here are the key takeaways:
Challenges Faced
- Cookie vs Login State: It was initially unclear whether the presence of a cookie meant the session was actually logged in.
- Endpoint Discovery: Guessing endpoint paths repeatedly returned 404 errors, so guessing was not a workable way to find the API.
- Signed Requests: The API required signed requests, which were difficult to replicate by hand.
Solution Approach
- Login Verification:
  - Used an identity API call to verify login state instead of relying on cookies.
- Browser Sniffing:
  - Instead of guessing endpoints, the author used browser automation (Playwright) to observe actual network requests made by the browser when interacting with the platform.
- Response Listener:
  - Listened for specific API responses (collect/page) triggered by user actions like scrolling.
  - Captured paginated JSON data without needing to replicate complex signing schemes.
- Idle Detection:
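
The response-listener idea can be sketched roughly as follows, using Playwright's Python sync API. This is an illustration under stated assumptions, not the author's code: the function names `make_collector` and `scrape`, the scroll distance, the wait time, and the stop-after-quiet-rounds rule are all hypothetical choices made here. The summary does not describe how the author's idle detection works, so the stopping rule below is only a placeholder. The sketch also assumes the browser session is already logged in.

```python
from typing import Any, Callable, List


def make_collector(url_fragment: str, sink: List[Any]) -> Callable[[Any], None]:
    """Build a handler that stores the JSON body of every response
    whose URL contains `url_fragment` (for example "collect/page")."""

    def on_response(response: Any) -> None:
        if url_fragment not in response.url:
            return  # not the endpoint we care about
        try:
            sink.append(response.json())
        except Exception:
            # Body was not valid JSON or was no longer available; skip it.
            pass

    return on_response


def scrape(start_url: str, url_fragment: str = "collect/page",
           max_scrolls: int = 20, quiet_rounds: int = 3) -> List[Any]:
    """Scroll a page and collect matching API responses.

    The browser itself issues the signed requests, so no signing
    logic is reproduced here. All numeric defaults are assumptions.
    """
    # Imported lazily so the collector above can be used without Playwright.
    from playwright.sync_api import sync_playwright

    pages: List[Any] = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        page.on("response", make_collector(url_fragment, pages))
        page.goto(start_url)

        quiet = 0
        for _ in range(max_scrolls):
            seen = len(pages)
            page.mouse.wheel(0, 4000)     # scroll down to trigger pagination
            page.wait_for_timeout(1500)   # give the request time to complete
            # Placeholder stop rule: give up after several scrolls
            # that produced no new matching responses.
            quiet = quiet + 1 if len(pages) == seen else 0
            if quiet >= quiet_rounds:
                break
        browser.close()
    return pages
```

Calling `scrape(...)` with the URL of the page to scroll would return a list of decoded JSON payloads, one per captured response. Keeping the handler separate from the browser loop means the filtering logic can be checked with stand-in response objects before running a real browser.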
Read the full article at DEV Community