
Making sense of the City of Cape Town using NLP

Gordon Inggs, Data Scientist, City of Cape Town

Outline

  1. Context
  2. Transforming data into a form for analysis
  3. Understanding the data

Context

Why were we doing this?

  • City of Cape Town has a Data Strategy.
  • City-wide initiative to improve how the City works with data.
  • One part of the strategy (Data Capabilities) concerns City employees.
  • Need to understand the "data intensity" of their work.

Caveats

  1. Use of Formal HR data
  2. Use of pre-trained models

* Qualification: for brevity, administrative points have been removed.

Transforming Data

In [3]:
IPython.display.HTML('./gordon_source_df.html')
Out[3]:
| Directorate | Department | PositionName | CriteriaGroup | Row | AppraisalScoreWeight |
|---|---|---|---|---|---|
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | Competencies | CFPRO: Discipline Specific Skills L3 | 25 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | Competencies | CFPRO: Impact and Influence L3 | 30 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | Competencies | CFPRO: Organisational Awareness L3 | 15 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | Competencies | CFPRO: Planning and Organising L3 | 30 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | KPA's | ANALYTIC DRIVEN CULTURE | 20 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | KPA's | DATA AUTOMATION | 20 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | KPA's | DATA INSIGHT | 30 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | KPA's | DATA REQUIREMENTS | 30 |

Cleaning

First, loading the model:

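import spacy

# Load the large pre-trained English model, which ships with word vectors.
nlp = spacy.load('en_core_web_lg')
# Domain-specific stop words: generic HR terms, criteria prefixes
# (e.g. "cfpro") and level codes that appear across many positions.
stop_words = {
    "service", "delivery",
    "function", "functions",
    "orientation", "orientations",
    "problem", "solving",
    "cfadm", "cfpro", "cfuni", "cfsup", "cfart", "cfman", "cftec",
    "kpaa", "kpan",
    "l1",  "l2", "l3", "l4", "l5"
}
# Extend spaCy's default English stop-word set in place.
nlp.Defaults.stop_words |= stop_words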
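
To see what the custom stop words are meant to achieve, it can help to apply them to a criterion string by hand. The sketch below is a spaCy-free illustration and not the notebook's actual cleaning step: `strip_stop_words` is a hypothetical helper, `STOP_WORDS` is only a subset of the set defined above, and the regex tokeniser is an assumption standing in for spaCy's.

```python
import re

# Subset of the custom stop words defined above (illustrative only).
STOP_WORDS = {"cfpro", "kpaa", "kpan", "l1", "l2", "l3", "l4", "l5",
              "service", "delivery"}

def strip_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split on non-alphanumerics, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(token for token in tokens if token not in stop_words)

cleaned = strip_stop_words("CFPRO: Discipline Specific Skills L3")
print(cleaned)
```

Note that spaCy's `Doc.vector` defaults to the average of all token vectors, so the stop words only change an embedding if tokens flagged `is_stop` are dropped before averaging.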

Embedding

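# Embed each criterion's text as a single document vector.
# Bracket assignment is required: attribute assignment does not create a column.
hr_df["RowVector"] = hr_df["Row"].apply(
    lambda row: nlp(row).vector
)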
  • Takes $\approx$ 2 mins using 16 cores.
  • 4 chunks per core, at least 10k entries per core
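
The notebook does not show how the rows were split across cores. As a rough sketch under stated assumptions: `split_into_chunks` is a hypothetical helper, and reading "4 chunks per core" as `cores * chunks_per_core` roughly equal chunks is an interpretation, not the author's confirmed method.

```python
import math

def split_into_chunks(rows, cores=16, chunks_per_core=4):
    """Split `rows` into at most cores * chunks_per_core contiguous chunks."""
    if not rows:
        return []
    n_chunks = min(len(rows), cores * chunks_per_core)
    size = math.ceil(len(rows) / n_chunks)
    return [rows[start:start + size] for start in range(0, len(rows), size)]

# Toy input: 1000 row indices spread over 4 cores, 4 chunks per core.
rows = list(range(1000))
chunks = split_into_chunks(rows, cores=4, chunks_per_core=4)
print(len(chunks), len(chunks[0]))
```

Each chunk could then be handed to a worker process that loads the model once and embeds its rows, which keeps all cores busy even when some chunks take longer than others.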
In [4]:
IPython.display.HTML('./gordon_source_wv_df.html')
Out[4]:
| Directorate | Department | PositionName | CriteriaGroup | Row | RowVector | AppraisalScoreWeight |
|---|---|---|---|---|---|---|
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | Competencies | CFPRO: Discipline Specific Skills L3 | [-0.115235664, 0.094851844, -0.032811504, -0.1... | 25 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | Competencies | CFPRO: Impact and Influence L3 | [-0.185014, 0.27602965, -0.020265013, -0.01785... | 30 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | Competencies | CFPRO: Organisational Awareness L3 | [-0.0405184, 0.15518801, 0.110339, 0.008534556... | 15 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | Competencies | CFPRO: Planning and Organising L3 | [-0.03400433, 0.02099867, 0.016796663, -0.0964... | 30 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | KPA's | ANALYTIC DRIVEN CULTURE | [-0.01125025, 0.105551496, 0.2284475, -0.08509... | 20 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | KPA's | DATA AUTOMATION | [-0.24759, 0.0056599975, 0.28850502, 0.09628, ... | 20 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | KPA's | DATA INSIGHT | [-0.107594505, 0.18723, -0.019495003, 0.2254, ... | 30 |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | KPA's | DATA REQUIREMENTS | [0.006064996, -0.272295, -0.1181675, 0.003385,... | 30 |

Reducing criteria -> positions

Using the centre of mass formula:

$$C = \frac{\sum_{i=1}^{N}{W_i X_i}}{\sum_{i=1}^{N}{W_i}}$$
  • $C$ - combined vector for the group (its "centre of mass")
  • $N$ - number of rows in the group being reduced
  • $W_i$ - row $i$'s weight
  • $X_i$ - row $i$'s vector
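
The centre of mass formula can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notebook's own code: `centre_of_mass` is a hypothetical helper, and the 2-D vectors are made up for readability. Only the weights (25, 30, 15, 30) come from the Competencies rows shown earlier.

```python
import numpy

def centre_of_mass(vectors, weights):
    """Weighted average C = sum(W_i * X_i) / sum(W_i) over the rows."""
    X = numpy.asarray(vectors, dtype=float)   # shape (N, dims)
    W = numpy.asarray(weights, dtype=float)   # shape (N,)
    return (W[:, None] * X).sum(axis=0) / W.sum()

# Made-up 2-D vectors; the weights are the Competencies appraisal weights.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_weights = [25, 30, 15, 30]
centre = centre_of_mass(toy_vectors, toy_weights)
print(centre)
```

The same result is available from `numpy.average(toy_vectors, axis=0, weights=toy_weights)`. The explicit form is shown here because it mirrors the formula term by term.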

a few weighted averages later...

In [5]:
IPython.display.HTML('./gordon_cg_df.html')
Out[5]:
| Directorate | Department | PositionName | CriteriaGroup | CriteriaGroupVector |
|---|---|---|---|---|
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | Competencies | [-0.10059217, 0.13609965, 0.00730747, -0.06769... |
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | KPA's | [-0.0822269, -0.0032772017, 0.062091753, 0.070... |

and a few more...

In [6]:
IPython.display.HTML('./gordon_position_df.html')
Out[6]:
| Directorate | Department | PositionName | PositionVector |
|---|---|---|---|
| CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | [-0.08773648, 0.038535856, 0.045656465, 0.0293... |

But what does this actually look like?!?

In [7]:
IPython.display.HTML('./hr_translation_I.html')
Out[7]:
Bokeh Plot
In [8]:
IPython.display.HTML('./hr_translation_II_na.html')
Out[8]:
Bokeh Plot

Data-relevance Scoring

Relationship to data-intensive work

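import numpy
import sklearn.metrics.pairwise

data_words = [
    "data",
    "gathering",
    "processing",
    "analysis",
    "dissemination"
]
# Embed each key word with the same model used for the positions.
data_word_vectors = {
    word: nlp(word.lower()).vector
    for word in data_words
}
# Stack the position vectors once into an (n_positions, n_dims) matrix.
position_matrix = numpy.vstack(score_df.PositionVector.values)
for word, word_vector in data_word_vectors.items():
    # cosine_similarity returns shape (n_positions, 1); flatten it to a column.
    score_df[f"{word.title()}Score"] = sklearn.metrics.pairwise.cosine_similarity(
        position_matrix,
        numpy.array([word_vector])
    ).ravel()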
  • Fairly fast: a few seconds at most
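
For readers unfamiliar with the score itself: cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction and ignores magnitude. A minimal NumPy sketch with made-up 2-D vectors follows; the local `cosine_similarity` function is a hypothetical stand-in written for illustration, not the scikit-learn function used above.

```python
import numpy

def cosine_similarity(a, b):
    """Return a . b / (|a| * |b|) for two 1-D vectors."""
    a = numpy.asarray(a, dtype=float)
    b = numpy.asarray(b, dtype=float)
    return float(a @ b / (numpy.linalg.norm(a) * numpy.linalg.norm(b)))

# Same direction, different length: similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity close to 0.
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, perpendicular)
```

A position whose vector points in nearly the same direction as the vector for "data" therefore scores close to 1, regardless of how long either vector is.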
In [9]:
IPython.display.HTML('./data_score_df.html')
Out[9]:
| | Directorate | Department | PositionName | DataScore | GatheringScore | ProcessingScore | AnalysisScore | DisseminationScore |
|---|---|---|---|---|---|---|---|---|
| 4869 | CORPORATE SERVICES | Organisational Performance Management | Principal Professional Officer: Data Sci | 0.861595 | 0.411872 | 0.603572 | 0.698752 | 0.495789 |

But what does this actually look like?!?

In [10]:
IPython.display.HTML('./na_data_scoring.html')
Out[10]:
Bokeh Plot

On validation...

  • Those affiliated with the Data Strategy are probably doing data-related work...
In [13]:
IPython.display.HTML('./data_scoring_comparison.html')
Out[13]:
Bokeh Plot

Conclusion

Key Findings

  • City job description data appears amenable to NLP analysis
  • City positions seem to have three groupings in relation to data key words:
    • Intensive workers (the green band)
    • Majority in the middle (the grey band)
    • Low intensity/bad data (the red band)
  • 'Processing' and 'Analysis' terminology is more prevalent than 'Gathering' and 'Dissemination'.

Recommendations

  1. Validate the analysis qualitatively
  2. Use the 'green band' as beta testers for Data Strategy initiatives
  3. Data Strategy leadership needs to reflect on the absence of 'gathering' and 'dissemination' intensive positions.
  4. ?

Bonus Slides

Investigating Dynamics

  • Principal Component Analysis (PCA) looks for the directions along which the dataset varies most.
  • It remaps the data onto those directions, giving a new, lower-dimensional form.
  • Sometimes, these dimensions have meanings.
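
The idea behind PCA can be sketched with NumPy's singular value decomposition. This is an illustration on made-up data, not the code behind the plot: `pca` is a hypothetical helper, and the notebook may well have used a library implementation instead.

```python
import numpy

def pca(X, n_components=2):
    """Project X onto its top principal components using an SVD."""
    X = numpy.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)              # PCA works on mean-centred data
    _, S, Vt = numpy.linalg.svd(centred, full_matrices=False)
    components = Vt[:n_components]            # directions of greatest variance
    projected = centred @ components.T        # data in the reduced space
    explained = (S ** 2) / numpy.sum(S ** 2)  # share of variance per component
    return projected, components, explained[:n_components]

# Made-up points lying on a straight line: one component explains everything.
points = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
projected, components, explained = pca(points, n_components=1)
print(projected.shape, explained)
```

Applied to the position vectors, the first few components would give the axes for a low-dimensional plot, and inspecting which positions sit at each end of an axis is one way to look for a meaning.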
In [12]:
IPython.display.HTML('./data_scoring_pca.html')
Out[12]:
Bokeh Plot