ATIP Tools Tech Corner - Introduction to the ATIP Online Request Service and the use of artificial intelligence

Introduction and background

Welcome to the ATIP Tools Tech Corner, where information and updates about the new ATIP Online Request Service will be shared. We will build this corner as we go, adding more information to explain our journey and fill you in on the status of the implementation of the service.

The ATIP Online Request Service is a simple, centralized website that enables users to complete access to information and personal information requests and submit them to any of the institutions that are subject to the Government of Canada’s Access to Information Act and Privacy Act.

This service will be implemented in an incremental, agile way. This means that a first version (we call this a "beta" version) has been released, with only a few institutions participating. This allows us to test the service, get user feedback, and fix problems as we go.

Onboarding institutions

Over the course of the next few years, all institutions subject to the Access to Information Act and the Privacy Act will onboard to the ATIP Online Request Service. "Onboarding" in this context means that the institutions will be set up to leverage all the features of the application, and users will be able to send initial access to information and personal information requests to them using only the application.

The onboarding plan has been broken down into phases, based on a strategy that takes into account the requirements of the different types of institutions that are involved in the project. The onboarding phases also map to specific target time frames and the planned iterations of the ATIP Online Request Service.

The following is an update on the state of onboarding, which shows the next institutions to be onboarded.

Onboarding status as of March 5, 2019
Partially onboarded: 133
Onboarded: 51

Figure 1: Onboarding Progress – Number of institutions by percentage onboarded as of March 5, 2019

Figure 1 - Text version

Percentage onboarded    Number of institutions
100% (table 1 note *)   51
76-99%                  5
51-75%                  84
26-50%                  21
1-25%                   9

Table 1 Notes

Table 1 Note *

Institutions onboarded at 100% are listed below.


The following institutions have been onboarded:

  • Administrative Tribunals Support Services of Canada
  • Atlantic Canada Opportunities Agency
  • Canada Agricultural Review Tribunal
  • Canada Industrial Relations Board
  • Canada School of Public Service
  • Canada-Nova Scotia Offshore Petroleum Board
  • Canadian Environmental Assessment Agency
  • Canadian Grain Commission
  • Canadian Heritage
  • Canadian Human Rights Commission
  • Canadian Institutes of Health Research
  • Canadian Northern Economic Development Agency
  • Canadian Radio-television and Telecommunications Commission
  • Canadian Transportation Agency
  • Civilian Review and Complaints Commission for the Royal Canadian Mounted Police
  • Communications Security Establishment Canada
  • Copyright Board of Canada
  • Crown-Indigenous Relations and Northern Affairs Canada
  • Farm Products Council of Canada
  • First Nations Tax Commission
  • Indigenous Services Canada
  • Infrastructure Canada
  • Military Grievances External Review Committee
  • Military Police Complaints Commission
  • National Battlefields Commission
  • National Energy Board
  • National Film Board of Canada
  • National Research Council Canada
  • Natural Resources Canada
  • Natural Sciences and Engineering Research Council of Canada
  • Northern Pipeline Agency Canada
  • Office of the Administrator of the Fund for Railway Accidents Involving Designated Goods
  • Office of the Administrator of the Ship-source Oil Pollution Fund
  • Office of the Commissioner of Lobbying of Canada
  • Office of the Correctional Investigator of Canada
  • Office of the Superintendent of Financial Institutions Canada
  • Parole Board of Canada
  • Patented Medicine Prices Review Board
  • Port Alberni Port Authority
  • Privy Council Office
  • Public Prosecution Service of Canada
  • Public Service Commission of Canada
  • Sept-Îles Port Authority
  • Social Sciences and Humanities Research Council of Canada
  • Status of Women Canada
  • Transportation Appeal Tribunal of Canada
  • Transportation Safety Board of Canada
  • Treasury Board of Canada Secretariat
  • Vancouver Fraser Port Authority
  • Veterans Review and Appeal Board Canada
  • Western Economic Diversification Canada

Artificial intelligence

In this update, we will explain how the ATIP Online Request Service is leveraging artificial intelligence (AI).

This update is also one of our first efforts to explain the use of AI in government, so please give us feedback so that we can understand how to make this as clear as possible. Contact us at open.ouvert@tbs-sct.gc.ca.

What is the impact of our use of artificial intelligence?

To assess the impact of our use of AI, we have used the Algorithmic Impact Assessment Tool.

The assessment tells us that our use of AI has little socio-economic impact on citizens and little impact on government operations.

Using artificial intelligence

The search functionality provided by the ATIP Online Request Service uses AI to improve the user experience.

The first instance where AI is used is when searching for information that may have already been released in response to another request. The search results are based on information readily available on the Open Government website.

The second instance where AI is used is when helping to identify which institution may have the information pertaining to the request. The search will recommend institutions that are most suitable for the type of request. The data used to make this recommendation comes from the following locations:

  • Open Government summaries
  • departmental reports
  • "scraping" on government websites
  • institutions’ ATIP web pages
  • Government of Canada taxonomies
  • unified master data organization schema
  • Part III of Departmental Results Reports for the 2016 to 2017 fiscal year

How are we using AI?

Ensuring that a web search finds all the correct documents can be a difficult task. The search system leverages machine learning to identify contextual and latent relationships that are more fundamental than keywords. To do this, the search looks at concepts and the relationship between past searches to improve result quality.

The search system that was developed uses advanced natural language processing and machine learning techniques to enhance searches across multiple sources, covering websites, forums and anything else that is publicly accessible. By going beyond simple word similarity and instead "understanding" the meaning of search terms, this solution can compare a user’s search needs to the corpus of documents in near real time, returning all relevant documents, or components of documents, that relate to a given search query or comparable document.

Synonyms, abbreviations and typos often mean that key documents go overlooked. By using advanced machine learning and natural language processing, the algorithm we built is able to read an entire corpus of documents (such as an enterprise website, or a course curriculum with its textbooks and related documents). After reading the documents, the AI search system is able to semantically "understand" the phrases and ideas, going beyond simple keyword matching.

AI algorithm

The following will give more technical information about the algorithm that was used:

  • category of algorithm used: natural language processing
  • models used: tf-idf (term frequency-inverse document frequency) and cosine similarity models
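To make the two models named above concrete, here is a minimal sketch, using scikit-learn, of how a user's request could be scored against previously released summaries with tf-idf vectors and cosine similarity. The document snippets, the query and the variable names are invented placeholders, not the service's actual data or code.

```python
# Minimal sketch: tf-idf vectors plus cosine similarity (scikit-learn).
# The documents and query below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Summary of records released about federal bridge inspections",
    "Briefing note on grain export statistics for the Prairie provinces",
    "Correspondence concerning access to information processing times",
]
query = "bridge inspection reports"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)   # one tf-idf vector per document
query_vector = vectorizer.transform([query])       # same vocabulary as the documents

scores = cosine_similarity(query_vector, doc_matrix)[0]
for doc, score in sorted(zip(documents, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```

The documents with the highest cosine similarity to the request would be the ones suggested to the user.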

Tf-idf improvements

Tf-idf is a method to score how related two pieces of text are. The input text from the user will be matched against all documents previously released under ATIP. Public documents that have the highest scores will be suggested to the user.

The tf-idf algorithm begins by counting the number of words in the request that are also present in each public document. This count is then divided by how common each of the matched words is. This division reduces noise and accounts for the fact that common words such as "Canada" are likely to match many documents regardless of the ATIP request, so such a match isn’t as informative. It is more valuable to know that a less common word is found in both the user’s submission and the publicly available document.
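As a rough illustration of that idea (not the production scoring code), the toy sketch below counts the request words found in each document and down-weights each match by how many documents contain that word, so a match on a rare word contributes more than a match on a common one.

```python
# Toy version of the idea described above: count matched words, then
# divide each match by the number of documents that contain the word.
documents = {
    "doc_a": "canada bridge inspection report",
    "doc_b": "canada grain export statistics",
}
request = "bridge inspection canada"

doc_words = {name: set(text.split()) for name, text in documents.items()}
request_words = set(request.split())

def doc_frequency(word):
    return sum(1 for words in doc_words.values() if word in words)

for name, words in doc_words.items():
    matched = request_words & words
    score = sum(1.0 / doc_frequency(word) for word in matched)
    print(name, round(score, 2), sorted(matched))

# "canada" appears in both documents, so it adds only 0.5 per match,
# while "bridge" and "inspection" each add a full 1.0 to doc_a.
```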

At its core, tf-idf is a word-matching algorithm. Words that are similar but not identical in the query and the documents will not register as a match with tf-idf alone. In fact, tf-idf is what powers many off-the-shelf search platforms, including Apache Solr, which ATIP has suggested does not return relevant results. Therefore, a number of improvements to the tf-idf algorithm had to be made.

Stemming

The most common way of improving tf-idf is to use a technique called stemming. Stemming is the process of simplifying a word to its "stem" or root. For example, the root word of "stemming" is "stem." If we reduce all words to their base and then look for matches, we will count two words such as "fishing" and "fisher" as matching. This technique works similarly in English and French.
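Here is a minimal sketch of stemming using the Snowball stemmers that ship with NLTK, which support both English and French; the choice of NLTK here is an assumption for illustration, not a statement about the service's actual implementation.

```python
# Reduce words to their stems in English and French using NLTK's Snowball stemmers.
from nltk.stem.snowball import SnowballStemmer

english = SnowballStemmer("english")
french = SnowballStemmer("french")

for word in ["stemming", "fishing", "fisher"]:
    print(word, "->", english.stem(word))

for word in ["demandes", "documents", "renseignements"]:
    print(word, "->", french.stem(word))
```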

Stop words

As we move through the content to reduce words to their stem, we can also remove stop words. A stop word is a common word that does not contribute meaning to the phrase. For example, if we remove "the" and "a" from a sentence, we can still infer its general meaning. Removing stop words improves the speed of our algorithm and reduces false matches.
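A small sketch of stop-word removal is shown below; the tiny stop-word list is illustrative only, since a real system would use a fuller list such as those bundled with NLTK or scikit-learn.

```python
# Remove common words that carry little meaning before matching.
# This tiny stop-word list is illustrative only.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "for", "in"}

def remove_stop_words(text):
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("A request for the records of the bridge inspection"))
# -> ['request', 'records', 'bridge', 'inspection']
```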

Word embeddings

Stemming is useful when two words share the same root. But often there are words that are practically the same but do not share a common root. For example, "Access to Information and Privacy" and "ATIP" have the exact same meaning but share no words in common. In order for tf-idf to register matches for similar words, we need a way of measuring the similarity, or distance, between any two words. For example, "kids" and "children" should be close together, and "sheep" and "lion" should be far apart. In order to measure the distance between words, we can use a method called word embedding.

Word embedding is a tool that converts a word to a vector. Typically, this vector has hundreds of dimensions. We tend to use word embeddings that have 100 to 300 dimensions. Despite having a high number of dimensions, we can calculate the distance between any two words the same way we calculate distance in a smaller number of dimensions.
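To make the idea of distance between word vectors concrete, here is a sketch using made-up three-dimensional vectors and cosine similarity; real embeddings (such as word2vec, GloVe or fastText vectors) have 100 to 300 dimensions, but the calculation is the same.

```python
# Cosine similarity between word vectors. The 3-dimensional vectors below are
# invented for illustration; real embeddings have hundreds of dimensions.
import numpy as np

embeddings = {
    "kids":     np.array([0.9, 0.1, 0.0]),
    "children": np.array([0.8, 0.2, 0.1]),
    "sheep":    np.array([0.1, 0.9, 0.0]),
    "lion":     np.array([0.0, 0.1, 0.9]),
}

def similarity(a, b):
    va, vb = embeddings[a], embeddings[b]
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

print("kids / children:", round(similarity("kids", "children"), 3))  # close to 1
print("sheep / lion:   ", round(similarity("sheep", "lion"), 3))     # close to 0
```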

To combine tf-idf and embeddings, we convert every word to a vector by way of an embedding. We then measure the distance between every word in the user’s request (source) and every word in the public document (target). Words that are very close together are given a score close to 1 (or exactly 1 if they’re the same word), and words that are very far apart are given a score of 0. In this way, a word will be considered a match if the meaning of the word is similar.
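The sketch below shows one way this soft matching could look (again with invented low-dimensional vectors, not the production code): each request word is scored against its best-matching document word, and similarities below a cut-off are treated as 0.

```python
# Soft word matching: a request word counts as a match if the closest
# document word is similar enough in embedding space. Vectors are invented.
import numpy as np

embeddings = {
    "atip":        np.array([0.7, 0.3, 0.0]),
    "access":      np.array([0.6, 0.4, 0.1]),
    "information": np.array([0.5, 0.5, 0.0]),
    "bridge":      np.array([0.0, 0.2, 0.9]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def soft_match_score(request_words, document_words, cutoff=0.8):
    score = 0.0
    for rw in request_words:
        best = max(cosine(embeddings[rw], embeddings[dw]) for dw in document_words)
        score += best if best >= cutoff else 0.0  # near-identical words count, distant ones don't
    return score

print(soft_match_score(["atip"], ["access", "information"]))  # strong match
print(soft_match_score(["atip"], ["bridge"]))                 # no match
```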

Dimensionality reduction

A word embedding converts a single word to a vector of hundreds of numbers. This is done for all words in all publicly available government documents. Ultimately this generates a tremendous amount of data, and this data must be searched and analyzed with every ATIP request. We can reduce the amount of computation (and therefore increase search performance) by using an algorithm called singular value decomposition (SVD). In short, SVD can be used to compress the information in each document (and total data generated) while still retaining the information and search accuracy.
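As a minimal sketch of this compression step, the example below applies scikit-learn's TruncatedSVD to a tf-idf matrix; the document snippets and the number of components are placeholders. The reduced vectors are much smaller than the original ones but can still be compared with cosine similarity.

```python
# Compress tf-idf document vectors with truncated SVD, then compare in the
# reduced space. Documents and n_components are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "records released about bridge inspections",
    "bridge inspection reports for federal structures",
    "grain export statistics for the Prairie provinces",
    "access to information processing times",
]

tfidf = TfidfVectorizer().fit_transform(documents)   # sparse, one column per word
svd = TruncatedSVD(n_components=2, random_state=0)   # keep only 2 latent dimensions
reduced = svd.fit_transform(tfidf)                   # dense, 4 documents x 2 numbers

print("original shape:", tfidf.shape)
print("reduced shape: ", reduced.shape)
print(cosine_similarity(reduced[:1], reduced))       # similarity of doc 0 to all docs
```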