We recently scraped the Yelp business directory to acquire high-quality B2B leads (company and CEO information). The job required a range of techniques, including proxying, scraping external company websites, email verification and more.
Lead requirements
- Revenues above $xxxxx/year
- Location: Halifax Area, Nova Scotia, Canada
- Industries: a diverse range, including Health, Investment, Finance, Supply Chain, Education, Food, Marketing, Real Estate and Retail.
The REQUIRED lead information to be collected:
- Name
- Gender
- Approximate Age
- Phone
- E-mail (verified)
- Address
- Website
- Social Media Accounts (if applicable)
- Company Name
- Company Size / Approximate Revenue
CHALLENGES
1. The CEO’s first and last names are rarely given in Yelp directory listings; they are often incomplete or missing.
2. Links to businesses’ social media accounts are not in the public business directory.
3. Gender is stated neither in the business directory nor on company websites.
4. Company size, approximate revenue and the CEO’s email are not available on Yelp.
So most of the information has to be extracted separately from each company’s own website.
Scrape Protection & Tech Obstacles
1. Selectors (the class names of HTML elements on web pages) are obfuscated with appended sets of random characters, e.g. <div class=" margin-b6__09f24__wgl48 border-color--default__09f24__NPAKY">
2. Some company websites are restricted to specific locations, so they cannot be reached without a proxy or VPN.
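One way to cope with such obfuscated class names is to match only their stable part. The sketch below is a minimal illustration, not the code used in the project: it assumes that everything from the first double underscore onward is the random suffix, as in the example above, and the class and method names (StableClassNames, stablePart, stableClasses) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StableClassNames {
    // Matches the value of a class="..." attribute (double-quoted only, for brevity).
    private static final Pattern CLASS_ATTR = Pattern.compile("class=\"([^\"]*)\"");

    /** Drops everything from the first "__" on; assumes the suffix always starts there. */
    static String stablePart(String className) {
        int cut = className.indexOf("__");
        return cut < 0 ? className : className.substring(0, cut);
    }

    /** Returns the stable class-name prefixes of the first class attribute in the tag. */
    static List<String> stableClasses(String tag) {
        List<String> result = new ArrayList<>();
        Matcher m = CLASS_ATTR.matcher(tag);
        if (m.find()) {
            for (String name : m.group(1).trim().split("\\s+")) {
                if (!name.isEmpty()) {
                    result.add(stablePart(name));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String tag = "<div class=\" margin-b6__09f24__wgl48 border-color--default__09f24__NPAKY\">";
        // Prints the class names with the random suffixes stripped.
        System.out.println(stableClasses(tag));
    }
}
```

With an HTML parser that supports CSS selectors, the same idea can usually be expressed as a substring attribute selector such as [class*="margin-b6__"], so the scraper keeps working when the random suffix changes.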
Solution
1. The scrape was done with the Scrape Machine scraping suite, which is developed in Java with the Spring framework.
2. To get the CEO’s email, the script visits the company’s own website, found via Yelp, and extracts the address with a regex.
3. Parsing emails to retrieve the CEO’s name. If the first/last name is not present in the selector on Yelp, we check whether the CEO’s email contains personal information. Since there is a huge variety of names in the world, the supposed names found in emails are checked for authenticity. For example, imagine we have one of the following 3 email addresses:
JESSI@ORIGINSNUTRITION.CA
or
JESSI-COEN@ORIGINSNUTRITION.CA
or
JESSI.COEN@ORIGINSNUTRITION.CA
In each of the 3 options, “Jessi” is very likely a first name. We check whether such a name exists in a preloaded list of English given names. If it does, we write the supposed name into a column called “Supposed Name” (see figure below). This significantly augmented the information taken from the Yelp directory.
4. Using an API service to determine gender from the CEO’s first name:
https://genderize.io/
https://gender-api.com/
https://genderapi.io/
5. We verified the emails with the help of an external service:
https://www.verifyemailaddress.org/email-validation
https://verify-email.org/
https://gsuite.tools/verify-email
6. To get other business details, such as company size and approximate revenue, we made additional requests to other open directories (e.g. LinkedIn).
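Steps 2 and 3 above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Scrape Machine code: the e-mail pattern is deliberately loose, the three-entry GIVEN_NAMES set stands in for the preloaded given-names list, and the names EmailNameGuess, extractEmails and supposedName are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailNameGuess {
    // Deliberately simple e-mail pattern; it will miss unusual but valid addresses.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Stand-in for the preloaded given-names list (hypothetical sample data).
    private static final Set<String> GIVEN_NAMES = Set.of("jessi", "john", "mary");

    /** Collects distinct, lower-cased e-mail addresses found in a page's HTML or text. */
    static Set<String> extractEmails(String page) {
        Set<String> emails = new LinkedHashSet<>();
        Matcher m = EMAIL.matcher(page);
        while (m.find()) {
            emails.add(m.group().toLowerCase(Locale.ROOT));
        }
        return emails;
    }

    /** Returns the first token of the local part, capitalized, if it is a known given name. */
    static Optional<String> supposedName(String email) {
        int at = email.indexOf('@');
        if (at <= 0) {
            return Optional.empty();
        }
        // Limit 2 keeps a leading empty token, so index 0 always exists.
        String first = email.substring(0, at).split("[._-]", 2)[0].toLowerCase(Locale.ROOT);
        if (!GIVEN_NAMES.contains(first)) {
            return Optional.empty();
        }
        return Optional.of(Character.toUpperCase(first.charAt(0)) + first.substring(1));
    }

    public static void main(String[] args) {
        String page = "<a href=\"mailto:JESSI.COEN@ORIGINSNUTRITION.CA\">Contact</a> info@example.com";
        for (String email : extractEmails(page)) {
            System.out.println(email + " -> " + supposedName(email).orElse("(no name)"));
        }
    }
}
```

Checking the candidate against a names list is what keeps generic mailboxes such as info@ or sales@ out of the “Supposed Name” column. The resulting first name is also the input that a gender API from step 4 would take.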
The results
The Yelp B2B lead extraction for Halifax, NS, yielded 461 records in total.
Names found: 83, 18%
Phone numbers found: 100%
Emails found: 126, 27%
Social profiles found: 186, 40%
Genders recognized: 73, 16%