Have you ever wondered which technologies and skills are most in demand among IT employees? I have, so I visualized it. In the picture above you can see which skills and technologies appear most frequently in over 3800 IT job offers in Poland from Jan-Feb 2018. The bigger the font, the higher the frequency of a given word. Each technology is counted at most once per offer.

 

Important notes:

  • This graphic should contain the most common IT technologies; however, I can’t guarantee it (I will explain why shortly). Please treat it as an interesting summary rather than a scientific source.
  • There is a bias related to technologies with multiple variants. For example, the phrase SQL will be found in both SQL and MS SQL. As a result, technologies with multiple variants may have a higher, overestimated weight.
  • This graphic is an introduction to a bigger project, so it’s a summary of a proof of concept (POC) and it will be improved.
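
The overlap bias from the second note can be seen in a minimal Python sketch. The offers and skill names below are made up for illustration, and plain substring matching is an assumption about how the phrases were matched:

```python
# Hypothetical offers, not taken from the scraped data.
offers = [
    "Experience with MS SQL and C#",
    "Good knowledge of SQL",
]
skills = ["sql", "ms sql"]

# Count each skill at most once per offer, using plain substring matching
# on the lowercased offer text.
counts = {skill: sum(skill in offer.lower() for offer in offers) for skill in skills}
print(counts)  # {'sql': 2, 'ms sql': 1}
```

Here sql is counted in both offers, although the first one only mentions MS SQL.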

 

I owe you a short explanation of how the word cloud was created:

  1. I scraped all open IT-related job offers from pracuj.pl, one of the biggest job portals in Poland. I say IT-related because the set also includes jobs in data analysis, project management and the like.
  2. I created a list of IT skills and technologies from Bulldogsjob.com (they have nice tags for technologies). I scraped the technologies listed by employers and copied some manually from approx. 100 offers, so some technologies may be missing. As the list was created almost automatically, some phrases may be too general (e.g. Windows or Microsoft, which were probably used in relation to technologies like Microsoft Excel). Finally, all the skills were merged into one array.
  3. I collected all phrases from the IT skills list that occurred in the offers. I didn’t look at the context, so this may create some noise. For example, in the phrase:

    Our company is the leader in web and mobile solutions

    web and mobile will be captured even though they may not be listed among the required skills for the position. However, I checked this on several samples and the effect seems to be negligible.

  4. Visualize it! I used WordCloud, a Python library that makes it easy to create word clouds. You can find my cloud code and the scraped data in my GitHub repository.
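
Steps 3 and 4 could look roughly like the sketch below. This is not the code from the repository: the function names count_skills and save_cloud, the sample offers and the image settings are all assumptions made for illustration. Only WordCloud, generate_from_frequencies and to_file come from the wordcloud package.

```python
from collections import Counter


def count_skills(offers, skills):
    """Count in how many offers each skill phrase occurs (once per offer)."""
    counts = Counter()
    for offer in offers:
        text = offer.lower()
        # A set keeps each technology unique within a single offer.
        found = {skill for skill in skills if skill.lower() in text}
        counts.update(found)
    return counts


def save_cloud(counts, path="cloud.png"):
    """Render the frequencies with the third-party wordcloud package."""
    from wordcloud import WordCloud  # imported here so counting works without it

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(counts)
    cloud.to_file(path)


# Hypothetical sample data, not the scraped offers.
offers = [
    "Our company is the leader in web and mobile solutions. Python, Python and SQL required.",
    "Java developer with SQL experience",
]
skills = ["Python", "Java", "SQL", "web", "mobile"]
counts = count_skills(offers, skills)
print(counts.most_common())
```

Python is mentioned twice in the first offer but counted once, and web and mobile are picked up from the company description, which is the kind of noise described in step 3. Calling save_cloud(counts) would write the image, provided wordcloud is installed.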

 

To wrap up, I’d like to share some challenges related to text preprocessing:

  • Noise in the IT skills list. I tested more detailed lists, but they contain very rare technologies whose names appeared in non-technical contexts, like pizzamaxmake etc. So it’s very important to create a reliable list.
  • Duplicates. In my IT skills list, many values were not detected as duplicates because of capital letters: Ruby On Rails and Ruby on rails were recognized as different technologies. The easiest way to deal with that was to convert the offers and the IT skills list to lowercase. And then appeared…
  • R. R is a programming language used mostly by data scientists, analysts and statisticians. It might seem that, converted to lowercase, it would not appear as a standalone word. But what if I told you that many Polish job offers contain a disclaimer like this:

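Wyrażam zgodę na przetwarzanie moich danych osobowych zawartych w mojej aplikacji dla potrzeb niezbędnych do realizacji procesów rekrutacji (zgodnie z Ustawą z dnia 29 sierpnia 1997 r. o ochronie danych osobowych tj. Dz. U. z 2002 r., Nr 101, poz. 926, ze zm.), prowadzonych przez XYZ Sp. z o.o. z siedzibą w XYZ

(In English: I consent to the processing of my personal data contained in my application for the purposes necessary to carry out recruitment processes (in accordance with the Act of 29 August 1997 on the Protection of Personal Data, consolidated text: Journal of Laws of 2002, No. 101, item 926, as amended), conducted by XYZ Sp. z o.o. with its registered office in XYZ)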

Do you see the r. that stands for rok (year)? It was captured as an IT skill and appeared in the word cloud as one of the most common technologies. It’s very important to print and inspect the preprocessed data before using it.
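
Both effects of lowercasing can be reproduced in a few lines of Python. This is only an illustration: the skills list is made up, and splitting the text into words with a regular expression is an assumption, not necessarily how the original matching worked.

```python
import re

# Lowercasing merges spellings that differ only in letter case.
skills = ["Ruby On Rails", "Ruby on rails", "R"]
unique_skills = sorted({skill.lower() for skill in skills})
print(unique_skills)  # ['r', 'ruby on rails']

# Fragment of the disclaimer quoted above.
clause = "zgodnie z Ustawą z dnia 29 sierpnia 1997 r. o ochronie danych osobowych"

# After lowercasing, the skill "r" matches the abbreviation "r." as a whole word.
words = re.findall(r"\w+", clause.lower())
print("r" in words)  # True
```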

 

Finally, I’ve captured technologies from job offers in two steps:

  1. Detect single-letter technologies (R, C) first, keeping the original letter case. Matching only capital letters avoids the lowercase r. in the disclaimers.
  2. Convert the rest of the technologies and the offers to lowercase, and then capture them.
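
The two steps above can be sketched in Python as follows. The function name capture_skills and the sample texts are hypothetical. The regular expression is also my assumption about how "detect a single letter" could be done: it requires the letter to stand alone, and it additionally rejects a following # or + so that C is not found inside C# or C++.

```python
import re


def capture_skills(offer, skills):
    """Return the set of skills found in one offer, using two passes."""
    found = set()
    lowered = offer.lower()
    for skill in skills:
        if len(skill) == 1:
            # Step 1: single-letter skills keep their case and must stand alone,
            # so the lowercase "r." in "1997 r." does not count as R.
            pattern = r"(?<!\w)" + re.escape(skill) + r"(?![\w#+])"
            if re.search(pattern, offer):
                found.add(skill)
        elif skill.lower() in lowered:
            # Step 2: everything else is matched on lowercased text.
            found.add(skill)
    return found


skills = ["R", "C", "SQL"]
disclaimer = "Ustawą z dnia 29 sierpnia 1997 r. o ochronie danych osobowych"
offer = "We use R, C++ and ms sql"

print(sorted(capture_skills(disclaimer, skills)))  # []
print(sorted(capture_skills(offer, skills)))       # ['R', 'SQL']
```

The disclaimer no longer produces a false R, while a genuine uppercase R in an offer is still captured.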