Exotic (representing Exotic Star) is a professional version of PulsarR, which contains a upgraded PulsarR server, a set of top e-commerce site scraping examples, and a applet for auto extraction supported by advanced AI.
Never write another web scraper. Exotic learns from the website, automatically generates all the extract rules, queries the Web as a database, and delivers web data completely and accurately at scale:
STEP1: automatically extract every field in webpages using advanced AI and generate extract SQLs
STEP2: test the SQLs and improve them to match frontend business requirements if necessary
STEP3: create crawl rules in the web console to run extract SQLs continuously and download all the web data to drive your business forward
There are already dozens of scraping cases for the most popular websites, we are constantly adding more cases.
Web spider: browser rendering, ajax data crawling
High performance: highly optimized, rendering hundreds of pages in parallel on a single machine without be blocked
Low cost: scraping 100,000 browser rendered e-comm webpages, or n * 10,000,000 data point each day, only 8 core CPU/32G memory are required
Web UI: a very simple yet powerful web UI to manage spiders and download data
Machine learning: automatically extract every field in webpages using unsupervised machine learning and generate extract rules and SQLs
Data quantity assurance: smart retry, accurate scheduling, web data lifecycle management
Large scale: fully distributed, designed for large scale crawling
Simple API: single line of code to scrape, or single SQL to turn a website into a table
X-SQL: extended SQL to manage web data: Web crawling, scraping, Web content mining, Web BI
Bot stealth: IP rotation, web driver stealth, never get banned
RPA: simulating human behaviors, SPA crawling, or do something else awesome
Big data: various backend storage support: MongoDB/HBase/Gora
Logs & metrics: monitored closely and every event is recorded
Memory 4G+
The latest version of the Java 11 JDK
Java and jar on the PATH
Google Chrome 90+
Download the latest executable jar:
wget http://static.platonic.fun/repo/ai/platon/exotic/exotic-standalone.jar
Add the following lines to your .m2/settings.xml.
<mirrors>
<mirror>
<id>maven-default-http-blocker</id>
<mirrorOf>dummy</mirrorOf>
<name>Dummy mirror to override default blocking mirror that blocks http</name>
<url>http://0.0.0.0/</url>
</mirror>
</mirrors>
git clone https://github.com/platonai/exotic.git
cd exotic
mvn clean && mvn
cd exotic-standalone/target/
# Linux:
java -jar exotic-standalone*.jar serve
# Windows:
java -jar exotic-standalone[-the-actual-version].jar serve
Note: if you are using CMD or PowerShell on Windows, you may need to remove the wildcard *
and use the full name of the jar.
If Exotic is running in GUI mode, the web console should open within a few seconds, or you can open it manually:
We can use the harvest
command to leans from a set of item pages using unsupervised machine learning.
java -jar exotic-standalone*.jar harvest https://shopee.sg/Computers-Peripherals-cat.11013247 -diagnose -refresh
The URL in the command above should be an portal URL, such as the URL of the product listing page.
Exotic visits the portal URL, finds out the best link set for item pages, fetches item pages and then learn from them.
Here is a snapshot of the result of auto extract using unsupervised machine learning for an e-comm site.
The best CSS selectors for each field are generated automatically, you can use these rules for web scraping in the old-fashioned way:
And also the generated SQL:
Note that the website in this demo uses CSS obfuscation techniques, so the CSS selectors are hard to read and changes frequently. There is no other effective technology to solve this problem other than machine learning based solutions.
The complete code can be found here.
The harvest
command extracts fields automatically using unsupervised machine learning, and also generates the best css selectors for all possible fields and the extract SQLs. We can execute the SQLs using sql
command.
# Note: remove the wildcard `*` and use the full name of the jar on Windows
java -jar exotic-standalone*.jar sql "
select
dom_first_text(dom, 'div.-Esc+w.card.product-briefing div.HLQqkk div.flex-column.imEX5V span') as T1C2,
dom_first_text(dom, 'div.HLQqkk div.flex-column.imEX5V div.W2tD8- div.MrYJVA.Ga-lTj') as T1C3,
dom_first_text(dom, 'div.HLQqkk div.flex-column.imEX5V div.W2tD8- div.MrYJVA') as T1C4,
dom_first_text(dom, 'div.HLQqkk div.flex-column.imEX5V div.W2tD8- div.Wz7RdC') as T1C5,
dom_first_text(dom, 'div.HLQqkk div.flex-column.imEX5V div.W2tD8- div._45NQT5') as T1C6,
dom_first_text(dom, 'div.HLQqkk div.flex-column.imEX5V div.W2tD8- div.Cv8D6q') as T1C7,
dom_first_text(dom, 'div.-Esc+w.card.product-briefing div.HLQqkk div.imEX5V div.pmmxKx') as T1C8,
dom_first_text(dom, 'div.-Esc+w.card.product-briefing div.HLQqkk div.imEX5V div.mini-vouchers__label') as T1C9,
dom_first_text(dom, 'div.imEX5V div.PMuAq5 div.flex-no-overflow span.voucher-promo-value.voucher-promo-value--absolute-value') as T1C10,
dom_first_text(dom, 'div.HLQqkk div.imEX5V div.PMuAq5 label._0b8hHE') as T1C11,
dom_first_text(dom, 'div.PMuAq5 div.MGNOw3.hInOdW div.dHS5e4.xIMb1R div.LgUWja') as T1C12,
dom_first_text(dom, 'div.PMuAq5 div.MGNOw3.hInOdW div.dHS5e4.xIMb1R div.Nd79Ux') as T1C13,
dom_first_text(dom, 'div.MGNOw3.hInOdW div.dHS5e4.xIMb1R div.flex-row div.NPdOlf') as T1C14,
dom_first_text(dom, 'div.imEX5V div.PMuAq5 div.-+gikn.hInOdW label._0b8hHE') as T1C15,
dom_first_text(dom, 'div.PMuAq5 div.-+gikn.hInOdW div.items-center button.product-variation') as T1C16,
dom_first_text(dom, 'div.PMuAq5 div.-+gikn.hInOdW div.items-center button.product-variation') as T1C17,
dom_first_text(dom, 'div.imEX5V div.PMuAq5 div.-+gikn.hInOdW div._0b8hHE') as T1C18,
dom_first_text(dom, 'div.PMuAq5 div.-+gikn.hInOdW div.G2C2rT.items-center div') as T1C19,
dom_first_text(dom, 'div.flex-column.imEX5V div.vdf0Mi div.OozJX2 span') as T1C20,
dom_first_text(dom, 'div.HLQqkk div.flex-column.imEX5V div.vdf0Mi button.btn.btn-solid-primary.btn--l.GfiOwy') as T1C21,
dom_first_text(dom, 'div.-Esc+w.card.product-briefing div.HLQqkk div.flex-column.imEX5V span.zevbuo') as T1C22,
dom_first_text(dom, 'div.-Esc+w.card.product-briefing div.HLQqkk div.flex-column.imEX5V span') as T1C23
from load_and_select('https://shopee.sg/(Local-Stock)-(GEBIZ-ACRA-REG)-PLA-3D-Printer-Filament-Standard-Colours-Series-1.75mm-1kg-i.182524985.8326053759?sp_atk=3afa9679-22cb-4c30-a1db-9d271e15b7a2&xptdk=3afa9679-22cb-4c30-a1db-9d271e15b7a2', 'div.page-product');
"
Run the executable jar directly for help to explore more power provided:
# Note: remove the wildcard `*` and use the full name of the jar on Windows
java -jar exotic-standalone*.jar
This command will print the help message and most useful examples.
Q: How to use proxies?
A: Follow this guide for proxy rotation.
此处可能存在不合适展示的内容,页面不予展示。您可通过相关编辑功能自查并修改。
如您确认内容无涉及 不当用语 / 纯广告导流 / 暴力 / 低俗色情 / 侵权 / 盗版 / 虚假 / 无价值内容或违法国家有关法律法规的内容,可点击提交进行申诉,我们将尽快为您处理。