Standardization to allow for consistent content and context management
The first challenge was that two different source systems were used, each structured in a different way. Even within a single source, the structures had diverged over time.
A second major challenge was the lack of a proper naming convention for source documents: either the document names were not discriminative, or they did not coherently describe the content.
The third major challenge was the historically grown lack of standardized terminology for the metadata used to describe characteristics such as substances, product names, applicants, etc.
Full exports (CSV files) of records from both repositories were made, collecting as many properties as possible: all obvious document properties, but also the folder path and even the file size.
As an initial filter, records that were definitely not to be migrated (e.g. minor versions that had been superseded by a major version) were omitted from the export.
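The export-and-filter step can be sketched as follows. This is a minimal illustration, not the actual tooling: the column names (doc_id, version, folder_path, size_bytes) are assumptions, and the filter is simplified to keeping only the highest version per document.

```python
import csv
import io

# Illustrative full export; the schema is an assumption, not the
# actual repository export format.
export_csv = """doc_id,name,version,folder_path,size_bytes
DOC-1,spec.pdf,1.0,/prodA/quality,1024
DOC-1,spec.pdf,1.1,/prodA/quality,1100
DOC-1,spec.pdf,2.0,/prodA/quality,1200
DOC-2,label.docx,0.1,/prodA/labelling,512
"""

def latest_versions(rows):
    """Keep only the highest version per document, so that minor
    versions superseded by a later version are dropped up front."""
    best = {}
    for row in rows:
        major, minor = (int(p) for p in row["version"].split("."))
        if row["doc_id"] not in best or (major, minor) > best[row["doc_id"]][0]:
            best[row["doc_id"]] = ((major, minor), row)
    return [row for _, row in best.values()]

rows = list(csv.DictReader(io.StringIO(export_csv)))
kept = latest_versions(rows)
print([r["version"] for r in kept])  # → ['2.0', '0.1']
```

In practice the superseded-version logic would follow the repositories' own versioning rules; the point is simply that the filter runs before any cleansing effort is spent on records that will never be migrated.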
For a few of these companies, eCTDs were explored using Qdossier’s Dossplorer. Within Dossplorer the sequences could be structured in accordance with the regulatory activities, and harmonized activity names could be assigned across the various countries. In addition, the submission date, approval date, document status and all other envelope information could be reflected on individual documents. The resulting database could be exported to MS-Excel.
The CSV files from the repositories were then enriched with the eCTD information from this MS-Excel export.
In general, metadata consisted of attributes/properties belonging to the documents (mostly describing the content), but also included the entire folder path and document name (describing content and/or context). In addition, a connection was made between source documents in the repositories and the eCTDs in which those documents had been included. As a result, additional contextual data could be added to each document.
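The enrichment step described above can be sketched as a join of repository records with eCTD envelope data, keyed on the document. The field names (doc_id, submission_date, status, activity) are illustrative assumptions:

```python
# Repository export records; field names are illustrative assumptions.
repo_records = [
    {"doc_id": "DOC-1", "name": "spec.pdf", "folder_path": "/prodA/quality"},
    {"doc_id": "DOC-2", "name": "label.docx", "folder_path": "/prodA/labelling"},
]

# Contextual data derived from the eCTD sequences in which each
# document was included (envelope information such as submission
# date, status, and harmonized activity name).
ectd_context = {
    "DOC-1": {"submission_date": "2019-03-01", "status": "approved",
              "activity": "Initial MAA"},
}

def enrich(records, context):
    """Join eCTD context onto repository records by doc_id; documents
    never included in an eCTD keep empty context fields."""
    enriched = []
    for rec in records:
        extra = context.get(rec["doc_id"], {})
        enriched.append({**rec,
                         "submission_date": extra.get("submission_date", ""),
                         "status": extra.get("status", ""),
                         "activity": extra.get("activity", "")})
    return enriched

enriched = enrich(repo_records, ectd_context)
```

A document that was never submitted in an eCTD simply carries empty context fields, so the enriched export remains complete for the later cleansing pass.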
The data in the enriched exports then had to be cleansed. Business rules were defined to cleanse as many records as possible; these rules were captured as formulas in MS-Excel. We used a workstation with a powerful single-core processor, since MS-Excel cannot use multiple cores. In the beginning, hundreds or thousands of records could be cleansed per business rule, but gradually the number cleansed by a single formula went down to just a few. At some point we had to stop the effort on rules and define which set of records had to be handled manually. Those documents were tagged as ‘legacy’. Similarly, if a document could be identified but not all of its corresponding metadata, the missing property was filled with the value ‘legacy’. Post-migration, all ‘legacy’ documents and/or legacy values could then be cleansed on an ad hoc basis. If it turns out that nobody misses a document, those legacy documents, or documents with legacy terms, can remain there forever.
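A rule-based cleansing pass with a ‘legacy’ fallback can be sketched as follows. The substance values and the rules themselves are illustrative assumptions standing in for the MS-Excel formulas; the mechanism (apply rules in order, tag whatever no rule matches) is what the text describes:

```python
import re

# Records with free-text substance metadata; the values are
# illustrative assumptions, not an actual company vocabulary.
records = [
    {"doc_id": "DOC-1", "substance": " paracetamol "},
    {"doc_id": "DOC-2", "substance": "Acetaminophen"},
    {"doc_id": "DOC-3", "substance": "???"},
]

# Each business rule maps a recognisable variant onto the standard
# term, mirroring one cleansing formula in the spreadsheet.
RULES = [
    (re.compile(r"^\s*paracetamol\s*$", re.IGNORECASE), "Paracetamol"),
    (re.compile(r"^\s*acetaminophen\s*$", re.IGNORECASE), "Paracetamol"),
]

def cleanse(record):
    """Apply the rules in order; any value no rule matches is tagged
    'legacy' for manual, post-migration clean-up."""
    for pattern, standard in RULES:
        if pattern.match(record["substance"]):
            record["substance"] = standard
            return record
    record["substance"] = "legacy"
    return record

cleansed = [cleanse(dict(r)) for r in records]
print([r["substance"] for r in cleansed])  # → ['Paracetamol', 'Paracetamol', 'legacy']
```

Unlike the spreadsheet formulas, a scripted rule set like this trivially re-runs on a fresh export, which matters for the migration-day rerun described below.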
Once the formulas had been validated using the exports obtained, an actual migration date could be selected (best practice: over a weekend). New exports were created and the same set of formulas was run on them. Subsequently, all documents could be migrated into the new environment using the harmonized naming convention, resulting in a single source of truth.