Before going to this conference, I didn’t know what Information Retrieval was. Now I know what it comprises and how it relates to software development at Yoast.
In our recently released Insights feature in Yoast SEO, we are using parts of Information Retrieval research. We show a list of five words that are most prominent in your post or page. The way we generate the list of prominent words is by stripping all irrelevant words. For example, we don’t want to include the word “The” in the prominent words. It is not relevant for you to know that you used “The” a lot of times.
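To make the idea concrete, here is a minimal sketch of that approach in Python: drop the irrelevant words, count what is left, and keep the five most frequent. This is not the actual Yoast SEO implementation; the `STOP_WORDS` list and the `prominent_words` function are hypothetical names made up for this illustration, and a real stop-word list would be much longer.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "for", "on", "that", "this", "with", "you",
}


def prominent_words(text, limit=5):
    """Return up to `limit` of the most frequent words that are not stop words."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    # Strip the irrelevant words before counting.
    relevant = [word for word in words if word not in STOP_WORDS]
    # Keep only the words themselves, most frequent first.
    return [word for word, _count in Counter(relevant).most_common(limit)]
```

Calling `prominent_words(post_text)` on the text of a post would give a list of at most five words, and “the” can never be one of them, however often it occurs.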
One of the talks at the conference was very relevant to this feature. The talk was about so-called “Poison Pills”. These are documents that are relevant, but that you still don’t want in your results. In the same vein, an article might contain a word that occurs frequently, but that you don’t want in the results. For example, in an article about Twitter Cards, “Twitter” is a word that shows up many times. However, you don’t want to see it in the Insights: the text isn’t about Twitter, but about Twitter Cards.
Another highly related talk was by Suzan Verberne: “Evaluation and analysis of term scoring methods for term extraction”. This is basically about the same problem we are trying to solve. The biggest difference is that we apply our solution to a single post or page on a website, while the research uses datasets that are much bigger than that.
In her research, she compared five different methods of extracting terms from a given text. Her conclusion led to a best method for datasets under 5,000 words and a best method for datasets over 10,000 words. Because our datasets are almost always under 5,000 words, we are definitely going to investigate the algorithm for datasets under 5,000 words.
Blows our mind
Several presentations weren’t directly related to any current projects at Yoast but were too cool not to mention in this article.
The presentation that stood out the most for me was by Carlos Castillo, on the topic of Detecting Algorithmic Discrimination. I am very aware of the diversity challenges we face in the IT community, so having someone on stage discussing these issues within the niche of creating algorithms was a nice way of looking at the problem. He challenged the assumption that an algorithm cannot be biased because it only works with objective data, which according to Carlos is a very dangerous way to reason. Instead, he suggested that both the data and the algorithm can carry biases without the creator explicitly putting them in. He concluded that a neutral algorithm has a bias because our society has a bias, so the only way to create a fair algorithm is to take this bias into account.
Exploring Deep Space
Another presentation that inspired us was a talk titled “Exploring Deep Space: Learning Personalized Ranking in a Semantic Space” by Jeroen Vuurens. The concrete example he used was predicting which movies a person would like based on their ratings or reviews. He does this by looking at several factors that play a role in liking a movie, and then removing the factors that the person is indifferent to. So if you don’t care about the genre of the movie or the amount of swearing, that factor is not taken into account. The approach he showed performed a lot better than previous attempts at the same thing.
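The part about ignoring factors a person is indifferent to can be illustrated with a toy sketch in Python. To be clear, this is not the method from the talk, which learns its ranking in a semantic space. It only shows the single idea of leaving indifferent factors out of the score. The `rank_movies` function, the factor names, the weights, and the `threshold` value are all assumptions invented for this example.

```python
def rank_movies(movies, preferences, threshold=0.1):
    """Rank movie titles for one person, ignoring factors they are indifferent to.

    movies:      maps a title to its factor values, e.g. {"action": 0.9}
    preferences: maps a factor to how strongly the person cares about it
    threshold:   hypothetical cut-off; weaker preferences count as indifference
    """
    # Keep only the factors this person actually cares about.
    active = {
        factor: weight
        for factor, weight in preferences.items()
        if abs(weight) >= threshold
    }

    def score(features):
        # Weighted sum over the remaining factors only.
        return sum(weight * features.get(factor, 0.0)
                   for factor, weight in active.items())

    return sorted(movies, key=lambda title: score(movies[title]), reverse=True)


# Made-up example data: this person cares about action, but not about swearing.
example_movies = {
    "Movie A": {"action": 0.9, "swearing": 0.8},
    "Movie B": {"action": 0.2, "swearing": 0.0},
    "Movie C": {"action": 0.5, "swearing": 1.0},
}
example_preferences = {"action": 1.0, "swearing": 0.05}
ranking = rank_movies(example_movies, example_preferences)
```

Because the swearing preference falls below the threshold, the ranking here depends on the action factor alone.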
New to the academic world, I was surprised by one particular part of the schedule: the poster session. In the poster session, several researchers hang a poster about their paper on a wall or board. After that, they are available to present their poster or answer questions about the paper. It is a personal way to get introduced to a certain paper. When a researcher is on stage telling you about a paper, it can sometimes be intimidating to ask a question or go up to the speaker afterward. With the poster session, this barrier is removed, and it is much easier to ask the researcher a niche question.
The Dutch-Belgian Information Retrieval Workshop was a very nice conference to expand our knowledge of the information retrieval research field and how it applies to development at Yoast. We have also connected with researchers, which could lead to great cooperation in the future.