Proactively Sourcing Candidates in Talent Acquisition

Ying Li
9 min read · May 3, 2024

What is proactive sourcing? How can it be done well? And what is its value?

In today’s world, hiring the right people is always among the top priorities of any organization’s business plan. People are by far the largest investment any company will make in its future. While traditional talent acquisition focuses on slating active candidates, proactive sourcing, which refers to proactively identifying prospects among passive candidates, has become popular due to the proliferation of professional development-oriented social websites such as LinkedIn and ResearchGate.

A fair amount of work in the Talent Acquisition space applies text analytics or concept matching to identify qualified candidates for job openings. For instance, Zaroor et al. [1] proposed an approach that first extracts concepts covering education, experience and skills from both candidate resumes and job posts, then uses these concepts to construct a semantic network, on which the closeness between a resume and a job post is measured. Kmail et al. [2] proposed a similar approach, which first employs multiple semantic resources to highlight the semantic content of resumes and job posts, then uses statistical concept-relatedness measures to enrich the highlighted content with relevant concepts that the semantic resources did not initially recognize. Senthil Kumaran and Sankar [3] proposed an ontology-based approach in which they first built ontologies from the candidates’ resumes and from the job posts, then mapped the job post ontology onto the candidate ontology to retrieve eligible candidates. Occupational category-based ideas have also appeared in prior work. For instance, Lamba et al. [4] proposed a Smart Resume Selector (SRS), which first extracts a set of information from a candidate’s resume and converts it into tokens. The resume is then compared against a predefined array of information, including the skill sets required by the job post. Finally, an overall matching score is calculated for each of 5 occupational categories to indicate the candidate’s expertise level.

Karen.ai [5] took a very different approach by applying Watson Personality Insights [6] to determine how well a candidate fits a job description based on the candidate’s derived personality traits. Also claiming to use the power of Artificial Intelligence to assist hiring, HiredScore [7] developed Spotlight to score a candidate’s fit to the company and job using both application data and publicly available data about the candidate.

While some of the prior work has shown promising results, most of it focuses on active candidates who have applied for the jobs, so their resumes and application data are already available to the hiring company. Proactive sourcing, which identifies promising candidates among people who never applied for the posted jobs, has its own merit, as it taps into a much wider and potentially more diverse pool. Moreover, proactive sourcing can bring speed and specialization to finding the right candidates, screening them instantly, and providing just-in-time recommendations to recruiters for slating. This article introduces an approach, termed Expertise-Based Candidate Ranking, that can be used to achieve this goal. In particular, it uses publicly available social profile data and applies natural language processing (NLP) and machine learning techniques to identify promising passive candidates.

1. Expertise-based candidate ranking

At a very high level, this approach takes two inputs: a job requisition with a job description and skill requirements, and a list of passive candidates. Each passive candidate’s information is in the form of a social profile, which could be aggregated from one or more social websites such as LinkedIn, GitHub, Twitter, Stack Overflow, Facebook, ResearchGate, Blogger, etc. The profile usually contains professional information such as education, employment, expertise, summary of experience, publications, certifications, etc. The output of the approach is a ranked list of candidates, obtained by scoring the level of “fitness” between each candidate’s profile and the job requisition based on their expertise similarity. Here, expertise refers to words describing technical skills listed in both profiles and requisitions, which can be keywords as well as high-level concepts (termed dimensions hereafter). For instance, “Java”, “C++” and “Ruby” are keywords, and “programming language” is a dimension that covers all three skills.

To accomplish this goal, candidate profiles and job requisitions are usually first transformed into weighted numerical representations of skill set descriptions, then similarity metrics like Cosine similarity are applied to calculate a matching score.
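As a rough illustration of the matching step, the sketch below computes cosine similarity between two weighted skill vectors stored as dictionaries. The skill names and weights are hypothetical and are not taken from any real requisition or profile.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile or requisition matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical weighted skill vectors, for illustration only.
requisition = {"java": 0.8, "sql": 0.4, "docker": 0.4}
candidate_a = {"java": 0.7, "sql": 0.5, "python": 0.3}
candidate_b = {"photoshop": 0.9, "sql": 0.2}

score_a = cosine_similarity(requisition, candidate_a)
score_b = cosine_similarity(requisition, candidate_b)
```

Here candidate_a shares two heavily weighted skills with the requisition and so scores higher than candidate_b, who shares only one lightly weighted skill.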

A. Feature Vector Generation

Generating feature vectors is typically the first step in a machine learning solution. For text documents, it involves obtaining a meaningful summary of each document and converting that summary into a feature vector. For this approach, a feature vector consists of expertise keywords and words with similar semantic meanings, as well as job role-based keywords and high-level concepts, or dimensions.

A.1) Keyword Extraction

This is the process of extracting expertise keywords from a candidate’s social profile or a job requisition using a standard NLP preprocessing pipeline, which includes stop word removal, tokenization, lemmatization, and TF-IDF vectorization. Stop words are commonly used words, such as articles, that are not useful for creating a document summary. Tokenization is the process of splitting sentences into words. Lemmatization converts the different forms of a word into its base form so that all forms are treated as the same word. TF-IDF vectorization [8] is a standard method for assigning a weight to each token such that less common, specialized expertise keywords are weighted more heavily than popular skill words.
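The pipeline above can be sketched in plain Python as follows. This is a minimal illustration and not the article's actual implementation: the stop-word list and lemma table are tiny hypothetical stand-ins for what an NLP library would provide, the smoothed IDF formula is one common variant, and the code tokenizes before filtering stop words simply because filtering operates on tokens.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny resources; a real pipeline would use a full
# stop-word list and a proper lemmatizer from an NLP library.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}
LEMMAS = {"databases": "database", "services": "service", "building": "build"}

def preprocess(text):
    """Tokenize, drop stop words, and map each token to its base form."""
    tokens = re.findall(r"[a-z][a-z0-9+#]*", text.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

def tfidf(documents):
    """Return one {token: weight} dict per document, using a smoothed IDF."""
    tokenized = [preprocess(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            tok: (count / len(toks))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, count in counts.items()
        })
    return vectors

# Hypothetical snippets standing in for profiles and requisitions.
docs = [
    "Building services in Java with SQL databases",
    "Experience with Java and Verilog",
    "Java developer building REST services",
]
vectors = tfidf(docs)
```

In the second snippet, "verilog" appears in only one document while "java" appears in all three, so "verilog" receives the larger weight, which is the behavior the paragraph above describes.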

These preprocessing techniques are applied to both job requisitions and candidate profiles to generate document summaries. Once keywords are extracted from the job requisition, a pre-trained Word2Vec model is applied to recommend keywords similar to the extracted ones. For example, skills like “Java” and “J2EE” are 80% similar, meaning that in the absence of an exact match for “Java”, a similar match like “J2EE” can be considered with a confidence of 0.8. Moreover, if the job post includes an expanded definition or description of the job role, keywords from that description should also be extracted and added to the feature vector.
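The keyword expansion step might look like the sketch below. The three-dimensional vectors are hypothetical stand-ins for a pre-trained Word2Vec model (real embeddings have hundreds of dimensions), and the `similar_keywords` helper and its threshold are assumptions made for illustration.

```python
import math

# Hypothetical 3-dimensional embeddings standing in for a pre-trained
# Word2Vec model; the numbers are made up for illustration.
EMBEDDINGS = {
    "java": [0.9, 0.1, 0.0],
    "j2ee": [0.8, 0.3, 0.1],
    "python": [0.5, 0.6, 0.1],
    "verilog": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def similar_keywords(keyword, embeddings, threshold=0.8):
    """Return (word, similarity) pairs at or above the threshold, best first."""
    if keyword not in embeddings:
        return []  # unknown keyword: nothing to recommend
    target = embeddings[keyword]
    scored = [(w, cosine(target, vec))
              for w, vec in embeddings.items() if w != keyword]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

expanded = similar_keywords("java", EMBEDDINGS)
```

With these toy vectors, only "j2ee" clears the 0.8 threshold for "java". The returned similarity can then serve as the confidence attached to the substitute match.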

A.2) Dimension Extraction

Apart from keywords, high-level concepts, or dimensions, form another way of representing a document. These high-level concepts can be considered broad categories into which specific keywords fall. For example, “Java” and “Python” are expertise keywords, while “Programming” denotes a category or concept that these skills belong to. Using dimensions to match a candidate profile with a job requisition helps avoid cases where candidates are penalized for not using the exact skill names that appear in the job description.

To extract dimensions, one first needs a skills taxonomy, which represents a set of skills in a hierarchical way. Figure 1 below shows an excerpt of such a taxonomy, where the leaves are keywords and the non-leaf nodes represent concepts.

Figure 1: An excerpt of a skill taxonomy

A dimension model can then be built from the job requisition and the taxonomy, and subsequently used to identify the right dimension for a given keyword. For instance, assuming that the job requisition contains the keyword “Java”, four different hierarchical paths that all lead to “Java” can be extracted from the taxonomy tree in Figure 1, as shown in Table 1. Since “Programming” is the parent that occurs most often in these 4 paths, it is chosen as the dimension covering “Java”. The same method is also applied to avoid choosing hierarchies that are not relevant to the context. For instance, for the keyword “Chef”, a more relevant hierarchy for a job role of “System Developer” would be “DevOps → Build_Systems → Build_and_Release_Version_Control_tools → Chef” rather than “Hospitality → Hotels_Restaurants → Chef”.

Table 1: Four hierarchies leading to keyword “Java”
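The selection rule described above, picking the ancestor that occurs most often across the paths leading to a keyword, can be sketched as follows. The table's contents are not shown here, so the four paths below are hypothetical and only mimic its shape.

```python
from collections import Counter

# Hypothetical taxonomy paths (root to leaf); these are illustrative only
# and are not the actual entries of Table 1.
PATHS_TO_JAVA = [
    ["Software", "Programming", "Object_Oriented", "Java"],
    ["Software", "Programming", "Backend", "Java"],
    ["Web", "Programming", "Java"],
    ["Mobile", "Android", "Java"],
]

def choose_dimension(paths):
    """Pick the ancestor that appears in the most paths leading to a keyword."""
    counts = Counter()
    for path in paths:
        # Count ancestors only; the last element is the keyword itself.
        counts.update(set(path[:-1]))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

dimension = choose_dimension(PATHS_TO_JAVA)
```

With these paths, "Programming" appears in three of the four hierarchies and is selected. One plausible way to handle ambiguous keywords such as "Chef" would be to count ancestors across the paths of all keywords in the requisition, so that the surrounding context favors the relevant hierarchy; that extension is an assumption and is not shown here.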

B. Candidate Ranking

Once the feature vectors representing the candidate profiles and the job requisition are generated, a matching algorithm such as TF-IDF-based scoring [8], perplexity-based scoring [9], or the Word2Vec approach [10] can be applied to compute a similarity score between each profile and the job requisition, which can subsequently be used to rank candidates. Moreover, given that feature vectors can be matched on exact keywords, similar keywords and dimensions, different weights can be assigned to each type of match. For instance, a higher weight can be assigned to exact keyword matches and a lower weight to dimension matches, since dimensions are very high-level concepts. Finally, the similarity scores can be normalized to rank candidates in decreasing order of relevance to the job requisition.
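The weighting idea can be sketched as below. For brevity, this example scores each type of match by simple set overlap rather than by any of the scoring methods cited above, and the weights, field names and candidate data are all hypothetical.

```python
# Hypothetical weights: exact matches count most, dimension matches least.
WEIGHTS = {"exact": 1.0, "similar": 0.6, "dimension": 0.3}

def match_score(req, profile, weights=WEIGHTS):
    """Weighted sum of the fraction of requisition terms found in a profile."""
    channels = {
        "exact": (req["keywords"], profile["keywords"]),
        "similar": (req["similar_keywords"], profile["keywords"]),
        "dimension": (req["dimensions"], profile["dimensions"]),
    }
    total = 0.0
    for name, (wanted, offered) in channels.items():
        if wanted:  # skip channels the requisition does not use
            total += weights[name] * len(wanted & offered) / len(wanted)
    return total

def rank_candidates(req, profiles, weights=WEIGHTS):
    """Return (name, normalized score) pairs in decreasing order of score."""
    raw = {name: match_score(req, p, weights) for name, p in profiles.items()}
    top = max(raw.values(), default=0.0)
    normalized = {n: (s / top if top else 0.0) for n, s in raw.items()}
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)

# Hypothetical requisition and profiles, for illustration only.
requisition = {
    "keywords": {"java", "sql", "docker"},
    "similar_keywords": {"j2ee", "kubernetes"},
    "dimensions": {"Programming", "Databases"},
}
profiles = {
    "candidate_a": {"keywords": {"java", "sql"},
                    "dimensions": {"Programming", "Databases"}},
    "candidate_b": {"keywords": {"j2ee"}, "dimensions": {"Programming"}},
    "candidate_c": {"keywords": {"photoshop"}, "dimensions": {"Design"}},
}
ranking = rank_candidates(requisition, profiles)
```

Normalizing by the top raw score puts the best candidate at 1.0 and makes the remaining scores easy to read as relative relevance. Here candidate_a ranks first on exact and dimension matches, candidate_b follows on a similar-keyword match, and candidate_c scores zero.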

Figure 2 illustrates a matching example, where the text on the left is a job requisition of “hardware developer”, and the one on the right shows an excerpt of a top candidate’s profile. The yellow words are expertise keywords that are exactly matched, and blue ones are those matched by dimension.

Figure 2: An example of matching keywords and dimensions between a candidate profile and a job requisition

2. Exploitation of Generative AI technology

Generative AI has taken the world by storm since November 2022, and it has been applied, or at least explored, in many different areas including marketing, sales, knowledge management, HR, IT, R&D, supply chain, legal, education, media & entertainment, etc. Within HR in particular, people have been exploring capabilities such as translation, summarization, text generation and enhancement, and information retrieval in areas including Talent Acquisition, Total Rewards, Talent Management, Performance Management, Learning & Development, etc.

Ranking candidates by matching their social profiles with job requisitions would be a good application of Generative AI technology. Nevertheless, given that it mostly performs as a black box whose output lacks explainability and does not support AI fairness assessment, its ranking outcome needs to be taken with caution. One suggestion is to combine the Generative AI-based output with the expertise-based output, leveraging the best of both worlds.

3. Discussion

There are many benefits to proactively sourcing potential candidates from promising or desired sources and channels. Nevertheless, one challenge recruiters frequently face is that the response rate from those prospects is rather low. This, to some extent, defeats the purpose of using proactive sourcing to help recruiters improve productivity and increase hiring success. One idea to address this challenge is to predict how likely a passive candidate is to be open to exploring external job opportunities, or even to switch jobs in the near term. Once such a job-changing propensity score is available, recruiters can prioritize the prospects who not only have the skills and experience for the job openings, but also have the propensity to respond to recruiters’ inquiries and to explore the opportunities presented.

A potential approach to predicting such a propensity-to-explore score will be discussed in a separate blog post.

References

[1] A. Zaroor, M. Maree, and M. Sabha, “A Hybrid Approach to Conceptual Classification and Ranking of Resumes and Their Corresponding Job Posts”, International Conference on Intelligent Decision Technologies, pp. 107–119, Springer, 2017

[2] A. Kmail, M. Maree, M. Belkhatir, and S. Alhashmi, “An Automatic Online Recruitment System Based on Exploiting Multiple Semantic Resources and Concept-Relatedness Measures”, IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 620–627, 2015

[3] V. Senthil Kumaran, and A. Sankar, “Towards an Automated System for Intelligent Screening of Candidates for Recruitment Using Ontology Mapping (EXPERT)”, International Journal of Metadata, Semantics and Ontologies, 8(1), pp. 56–64, 2013

[4] D. Lamba, S. Goyal, V. Chitresh, and N. Gupta, “An Integrated System for Occupational Category Classification based on Resume and Job Matching”, International Conference on Innovative Computing & Communications, 2020

[5] Karen.ai, https://karen.ai

[6] IBM, “Watson Personality Insights”, https://www.ibm.com/cloud/watson-personality-insights

[7] HiredScore, https://hiredscore.com/

[8] tf-idf, https://en.wikipedia.org/wiki/Tf–idf

[9] M. Fernandez-Pichel et al., “An Unsupervised Perplexity-Based Method for Boilerplate Removal”, Natural Language Engineering, 2024

[10] D. Jatnika et al., “Word2vec Model Analysis for Semantic Similarities in English Words”, Procedia Computer Science, 2019

Ying Li

Ying Li is the Global Head of People Analytics at PepsiCo. She leads a team developing advanced analytics solutions to support leaders in key decision-making.