About this project
The Research Scholarship, Productivity, and Impact Report (RSPIR) project generates and visualizes research impact data for both individuals and research groups. It translates research impact theory into practical data services for medical school leadership, clinical and basic science department heads, research institute directors, and research team leads. The project and its data service are designed to be reliable, sustainable, and flexible, adapting to the specific requirements of stakeholders and audiences. Research impact reports generated by this project can be used in many scenarios, including government reporting, grant applications, recruitment, and the identification of development opportunities.
The project currently consists of five major components: traditional metrics, DORA-aligned metrics, collaboration, original research, and research revenue. Three additional components, namely innovation, transdisciplinary research, and media impact, are in development.
The goal of the project is to visualize and analyze all the research impact indicators that stakeholders and audiences need, providing a one-stop experience.
Project statistics
Code: 7,421 lines
Coverage: 4,750 tenured, clinical, and adjunct faculty members
Research impact indicators: 52
Key indicators in RSPIR
Traditional metrics: publications, citations, topics, FWCI, individual performance, journals, collaboration type.
Original research: CSM original publications, cross-analysis (publication x grants).
Collaboration: transdisciplinary collaboration, collaboration map, institution collaborators.
DORA-aligned metrics: DORA, policy documents, policy impact, SDG, innovation, patent impact.
Research revenue: total grants, sponsor analysis, project analysis, type analysis, return on investment.
Challenges and current solutions
Scale
A significant challenge in scaling research impact assessment is report generation. Typically, schools consult with librarians to obtain publication lists using complex Boolean queries that encompass all potential researchers. This approach, while adaptable to researcher inclusion or exclusion, has limitations. Lengthy queries hinder efficient publication-researcher matching, particularly for co-authored works, and identical researcher names and initials compromise the accuracy of publication lists. Together, these problems impede scalable individual research impact reporting. Manual resolution, while possible, is impractical: our department of family medicine alone has more than 1,800 physicians, researchers, and educators, most of whom publish nothing, or only a few works, in a five-year interval. Validating the publication history of every physician in the department would take approximately one month, which is not feasible.
My solution addresses these challenges from two angles. First, instead of searching for researchers by name and initials, I manually look up each researcher's ID once and retain it as a unique identifier. The identifier can be reused and produces more accurate publication searches: experimentation indicates that searching by unique identifier yields 8–13% more publications than searching by name and initials. Furthermore, with unique identifiers in hand, programs can automatically extract the data from the source on a more frequent schedule. Second, because the entire data extraction process is automated and programmable, linking each researcher to their publication list by unique identifier is far easier than matching by name, especially by combinations of initials, as sketched below.
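To illustrate the second point, here is a minimal sketch of ID-based linking, assuming publication records that carry author IDs. The field names and ID values below are hypothetical, not the actual schema of our research intelligence source:

```python
# A minimal sketch of linking publications to researchers by unique
# author ID rather than by name. Field names and IDs are illustrative.

# Roster keyed by manually verified unique identifiers.
roster = {
    "57190000001": "Jane Doe",
    "57190000002": "J. Doe",  # a different researcher with a similar name
}

publications = [
    {"title": "Paper A", "author_ids": ["57190000001", "99999999999"]},
    {"title": "Paper B", "author_ids": ["57190000002"]},
]

def link_publications(roster, publications):
    """Group publication titles under each rostered researcher by ID match."""
    linked = {rid: [] for rid in roster}
    for pub in publications:
        for author_id in pub["author_ids"]:
            if author_id in linked:
                linked[author_id].append(pub["title"])
    return linked

for rid, titles in link_publications(roster, publications).items():
    print(f"{roster[rid]} ({rid}): {titles}")
```

Because the match is an exact ID comparison, the two similarly named researchers above can never be confused, which is precisely the failure mode of name-and-initials matching.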
Data governance
Building the research impact report revealed that data governance is an even more significant challenge than scale. The difficulty arises from the complexity of the entire data and reporting system: an automated data processing pipeline is essential to scale reports out to various stakeholders and to keep the service sustainable, and such a pipeline can only be built on good data governance practices.
In the context of CSM, data governance primarily comprises data access, lineage, quality, sharing, and definition. Its importance is underscored by the suboptimal state of data governance at the university. A straightforward illustration is version control of unit rosters: different employees may submit multiple copies of the same unit's roster, and the manager has no way of knowing which copy is the most up-to-date and accurate. Another example is sponsor names: a single sponsor may be spelled in 12 different ways in the financial database. Beyond version control and data quality, the university publishes no guidelines defining crucial data variables, such as what counts as a grant, causing discrepancies in reported research revenue between the school and the university.
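As a taste of the kind of lightweight fix this enables, the sketch below collapses sponsor-name variants onto a canonical list with fuzzy matching. The names and similarity cutoff are illustrative; in practice such rules are manually reviewed rather than applied blindly:

```python
# A minimal sketch of normalizing sponsor-name variants against a
# canonical list using standard-library fuzzy matching.
from difflib import get_close_matches

CANONICAL_SPONSORS = [
    "Canadian Institutes of Health Research",
    "Natural Sciences and Engineering Research Council",
]

def normalize_sponsor(raw_name, cutoff=0.8):
    """Map a raw sponsor string to its closest canonical name, if any."""
    match = get_close_matches(raw_name.strip(), CANONICAL_SPONSORS,
                              n=1, cutoff=cutoff)
    # Leave unmatched names untouched so they can be flagged for review.
    return match[0] if match else raw_name

print(normalize_sponsor("Canadian Institute of Health Research"))  # variant spelling
```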
Improving data governance requires a systematic solution, and it is an ongoing task (as you read this, it remains one of my priority tasks). I have led the faculty analytics team in tackling the data governance challenges with easy-to-implement solutions, which I will cover in a separate post. Our service philosophy is to implement meaningful data governance practices while minimizing the added workload, to encourage adoption.
Metrics and indicators
The selection of report metrics and indicators presented considerable challenges. First, given the requirement to produce over 30 annual reports, every metric and indicator must be automatically extracted, processed, and visualized programmatically in Power BI; this precludes many qualitative metrics on service-sustainability grounds. Second, while numerous quantitative metrics are easy to extract, their reliability and validity are often uncertain, and data without verifiable reliability and validity is not used, regardless of how easy it is to obtain. Third, the chosen metrics must be comprehensive yet readily comprehensible to stakeholders and audiences, predominantly medical professionals accustomed to traditional, established measures; despite advances in scientometrics, many contemporary metrics remain unfamiliar to this group, and including novel, unfamiliar indicators would not be meaningful to them. Fourth, all report metrics must be pertinent to medical research, a highly competitive domain that attracts leading researchers and substantial funding. Researchers need impact reports that bolster grant and award applications; if the metrics are not contextually relevant to medical research (for example, social science research impact indicators), the report's utility shrinks to governmental or institutional reporting. The aim is a versatile report applicable to multiple purposes rather than specialized reports for singular objectives.
A primary factor in selecting metrics and indicators is to underscore our institution's preeminence in scientometrics among Canadian medical schools. While my expertise does not lie within medical research, I have been instrumental in maintaining the Cumming School of Medicine's leadership position within the university and the U15 medical schools. Consequently, I have sought to integrate both traditional and DORA-aligned metrics and indicators within our reports. Though I do not fully endorse DORA, certain metrics and indicators therein can enhance our reporting. Furthermore, incorporating DORA-aligned metrics and indicators serves as a model for other faculties and medical schools to become acquainted with these methodologies.
It is important to note that metrics and indicators are merely a preliminary step and do not offer a complete picture. My focus has progressively shifted from metrics and indicators themselves to the impact they reflect. For instance, rather than simply reporting the number of publications cited by policy documents, I emphasize the impact of these cited policies, including their geographic reach and influence, to fully inform stakeholders and audiences. This aspect is currently under development, and I am utilizing advanced large language models to support this endeavor.
Data extraction
Extracting 52 indicators from the research intelligence service, within API service quotas and limitations, has proven to be an arduous yet essential component of every report. Data extraction is a mandatory step both for initial report generation and for updating existing reports. Our internal assessments indicate that it typically consumes 20–30% of total project completion time, and this can stretch further with API failures, intermittent university internet connectivity, or unidentifiable random errors (occasionally resolved only by a full program restart after a workstation reboot).
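To soften the API failures and quota limits, each call can be wrapped in a retry loop. Below is a minimal sketch of that pattern against a generic HTTP endpoint; the URL and status-code handling are assumptions, not the vendor's documented behavior:

```python
# A minimal sketch of retrying flaky API calls with exponential backoff.
import time
import requests

def fetch_with_retry(url, params=None, max_attempts=5, base_delay=2.0):
    """GET a URL, backing off exponentially on failures or 429 responses."""
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, params=params, timeout=30)
            if resp.status_code == 429:  # quota exceeded: wait, then retry
                time.sleep(base_delay * 2 ** attempt)
                continue
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)
```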
My current approach to these challenges is a database solution. It has several prerequisites, including an up-to-date member list (detailed in the scale section) and predefined indicators and rules (outlined in the data governance section). Data extraction is scheduled for the beginning of each month, with all data processed before it is loaded into the database. To optimize efficiency, indicators whose values remain static across updates, such as original research and collaboration, are not re-extracted or re-processed after their initial acquisition, whereas indicators such as citations, number of publications, and FWCI are refreshed in every extraction cycle. Preliminary testing shows that the database solution reduces data extraction from 20–30% of project completion time to an estimated 5 minutes per update.
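The static-versus-dynamic split boils down to a simple upsert rule. Here is a minimal sketch with SQLite; the actual system's schema differs, so the table and column names are hypothetical:

```python
# A minimal sketch of the monthly update rule: static indicators are
# written once, dynamic indicators are refreshed every cycle.
# Requires SQLite 3.24+ for ON CONFLICT ... DO UPDATE.
import sqlite3

conn = sqlite3.connect("rspir.db")
conn.execute("""CREATE TABLE IF NOT EXISTS indicators (
    researcher_id TEXT PRIMARY KEY,
    original_research INTEGER,   -- static: extracted once, never refreshed
    citations INTEGER,           -- dynamic: refreshed monthly
    fwci REAL                    -- dynamic: refreshed monthly
)""")

def monthly_update(conn, researcher_id, original_research, citations, fwci):
    """Insert a new row, or refresh only the dynamic columns if it exists."""
    conn.execute("""
        INSERT INTO indicators (researcher_id, original_research, citations, fwci)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(researcher_id) DO UPDATE SET
            citations = excluded.citations,
            fwci = excluded.fwci
    """, (researcher_id, original_research, citations, fwci))
    conn.commit()

monthly_update(conn, "57190000001", 12, 340, 1.8)
```

Because the conflict clause touches only the dynamic columns, a re-run can never clobber the once-extracted static values, which is what keeps each monthly update down to minutes.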
Reflection
Communication
Creating a report like this from scratch is a significant undertaking. This is the first school-level impact report for CSM, and the first comprehensive report covering both metrics and impacts across the Canadian U15 universities, so there are no established templates or previous examples to follow. Building it required extensive research, data collection, and analysis to ensure it accurately reflects the school's impact and compares meaningfully with other U15 institutions. Throughout the process, communication has to be the first priority: I could achieve nothing without clear, consistent communication with all stakeholders, which means actively seeking input, understanding expectations, and providing regular progress updates. I have heard that the previous data person in my position made the mistake of isolating himself and refusing to communicate with department heads and leadership; without that communication, he could not fully understand the needs and goals of the project, and his final report was inadequate and missed the mark.
In some scenarios, department heads and leadership might not have a fully crystallized vision of their requirements. They may only be able to provide a general direction, leaving me to interpret and translate that into concrete deliverables. To bridge this gap, I often need to develop a prototype or proof-of-concept that demonstrates how data can be structured and analyzed based on their initial requests. This tangible representation serves as a starting point for further discussion and refinement. However, it's crucial to recognize that these initial requests might not perfectly encapsulate their true needs. There could be underlying assumptions, unarticulated goals, or evolving priorities that are not immediately apparent. Therefore, maintaining open and continuous communication throughout the development process is paramount. By actively engaging stakeholders, soliciting feedback, and iterating on the prototype, I can progressively guide them towards a solution that aligns with their actual requirements and delivers genuine business value.