mex.extractors.biospecimen package

Subpackages

Submodules

mex.extractors.biospecimen.extract module

mex.extractors.biospecimen.extract.extract_biospecimen_contacts_by_email(biospecimen_resource: Iterable[BiospecimenResource]) Generator[LDAPPerson, None, None]

Extract LDAP persons for Biospecimen contacts.

Parameters:

biospecimen_resource – Biospecimen resources

Returns:

Generator for LDAP persons

mex.extractors.biospecimen.extract.extract_biospecimen_organizations(biospecimen_resources: list[BiospecimenResource]) dict[str, MergedOrganizationIdentifier]

Search for and extract organizations from Wikidata.

Parameters:

biospecimen_resources – Iterable of biospecimen resources

Returns:

dict mapping each external partner to its WikidataOrganization ID

mex.extractors.biospecimen.extract.extract_biospecimen_resource(resource: DataFrame, sheet_name: str, file_name: str) BiospecimenResource | None

Extract one Biospecimen resource from an xlsx file.

Parameters:
  • resource – DataFrame containing resource information

  • sheet_name – Name of the Excel sheet the data came from

  • file_name – Name of the Excel file

Settings:

key_col: column in the file with keys
val_col: column in the file with values

Returns:

Biospecimen resource

mex.extractors.biospecimen.extract.extract_biospecimen_resources() Generator[BiospecimenResource, None, None]

Extract Biospecimen resources by loading data from MS-Excel file.

Settings:

dir_path: Path to the biospecimen directory, absolute or relative to assets_dir

Returns:

Generator for Biospecimen resources

mex.extractors.biospecimen.extract.get_clean_file_name(file_name: str) str

Clean file name string.

Parameters:

file_name – file_name string

Returns:

cleaned file name string

mex.extractors.biospecimen.extract.get_clean_string(series: Series[Any]) str

Clean the strings in a Series and concatenate them into one string.

Parameters:

series – series of related field

Returns:

string of extracted field

mex.extractors.biospecimen.extract.get_values(resource: DataFrame | None, key_col: str, val_col: str, field_name: str) str | None

Extract the values of the resource that correspond to the given field name (Feldname).

Parameters:
  • resource – Biospecimen resource

  • key_col – column in the file with keys

  • val_col – column in the file with values

  • field_name – column name of extracted field

Returns:

string of extracted field
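The key/value lookup described above can be sketched without pandas. This is a minimal illustration, not the package's implementation: the helper name lookup_values, the list-of-dicts stand-in for the DataFrame, the sample field names, and the choice to join multiple matches with a space are all assumptions. Only the two column names come from the BiospecimenSettings defaults.

```python
from __future__ import annotations

# Default column names, copied from the BiospecimenSettings signature.
KEY_COL = "Feldname"
VAL_COL = "zu extrahierender Wert (maschinenlesbar)"


def lookup_values(
    rows: list[dict[str, str]] | None,
    key_col: str,
    val_col: str,
    field_name: str,
) -> str | None:
    """Hypothetical helper: collect the values whose key equals field_name."""
    if rows is None:
        return None
    values = [
        str(row[val_col]).strip()
        for row in rows
        if row.get(key_col) == field_name and row.get(val_col) not in (None, "")
    ]
    # Assumption: several matching rows are joined into one string.
    return " ".join(values) if values else None


# Made-up rows standing in for one sheet of the Excel file.
rows = [
    {KEY_COL: "Titel", VAL_COL: " Beispielsammlung "},
    {KEY_COL: "Kontakt", VAL_COL: "person@example.org"},
]
print(lookup_values(rows, KEY_COL, VAL_COL, "Titel"))
print(lookup_values(rows, KEY_COL, VAL_COL, "Unbekannt"))
```

Returning None for a missing resource or an unmatched field mirrors the `str | None` return type in the signature above.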

mex.extractors.biospecimen.extract.get_year_from_zeitlicher_bezug(resource: DataFrame | None, key_col: str, val_col: str, field_name: str) str | None

Extract the first four connected digits of the string as year.

Parameters:
  • resource – Biospecimen resource

  • key_col – column in the file with keys

  • val_col – column in the file with values

  • field_name – column name of extracted field

Returns:

string with first four digits treated as zeitlicher_bezug year
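Taking "the first four connected digits" as the year can be sketched with a regular expression. The helper below is hypothetical and works on a plain string rather than a DataFrame. Whether the package itself uses a regular expression is an assumption.

```python
from __future__ import annotations

import re


def first_four_digit_year(text: str | None) -> str | None:
    """Hypothetical helper: return the first run of four consecutive digits."""
    if not text:
        return None
    match = re.search(r"\d{4}", text)
    return match.group(0) if match else None


print(first_four_digit_year("seit 2014 laufend"))
print(first_four_digit_year("01.03.1998 - 2005"))
print(first_four_digit_year("unbekannt"))
```

Note that this pattern takes the first four digits of any longer digit run, so an input such as "12345" would yield "1234". A stricter variant would need explicit boundaries around the match.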

mex.extractors.biospecimen.main module

mex.extractors.biospecimen.settings module

class mex.extractors.biospecimen.settings.BiospecimenSettings(*, raw_data_path: AssetsPath = AssetsPath('raw-data/biospecimen'), key_col: str = 'Feldname', val_col: str = 'zu extrahierender Wert (maschinenlesbar)', mapping_path: AssetsPath = AssetsPath('mappings/__final__/biospecimen'))

Bases: BaseModel

Settings submodel for the Biospecimen extractor.

key_col: str
mapping_path: AssetsPath
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'key_col': FieldInfo(annotation=str, required=False, default='Feldname', description='column name of the biospecimen metadata keys'), 'mapping_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("mappings/__final__/biospecimen"), description='Path to the directory with the biospecimen mapping files containing the default values, absolute path or relative to `assets_dir`.'), 'raw_data_path': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("raw-data/biospecimen"), description='Path to the directory with the biospecimen excel files, absolute path or relative to `assets_dir`.'), 'val_col': FieldInfo(annotation=str, required=False, default='zu extrahierender Wert (maschinenlesbar)', description='column name of the biospecimen metadata values')}

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

raw_data_path: AssetsPath
val_col: str

mex.extractors.biospecimen.transform module

mex.extractors.biospecimen.transform.get_or_create_externe_partner(externe_partner: str, extracted_organizations: dict[str, MergedOrganizationIdentifier], extracted_primary_source_biospecimen: ExtractedPrimarySource) MergedOrganizationIdentifier

Get extracted organization for label or create new organization.

Parameters:
  • externe_partner – external partner label

  • extracted_organizations – merged organization identifiers extracted from Wikidata, keyed by label

  • extracted_primary_source_biospecimen – extracted primary source

Returns:

matched or created merged organization identifier
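The get-or-create behaviour described above can be sketched with a plain dict. This is an illustration under stated assumptions, not the package's code: identifiers are plain strings here, and the `create` callback stands in for whatever the package does to build a new organization from the primary source.

```python
from __future__ import annotations

from typing import Callable


def get_or_create(
    label: str,
    known: dict[str, str],
    create: Callable[[str], str],
) -> str:
    """Hypothetical helper: return the known identifier or create a new one."""
    if label in known:
        return known[label]
    # Assumption: an unmatched label triggers creation of a new organization.
    return create(label)


# Made-up identifiers, for illustration only.
known = {"Example University": "org-0001"}
print(get_or_create("Example University", known, lambda label: "org-new"))
print(get_or_create("Unknown Institute", known, lambda label: "org-new"))
```

This sketch does not store the newly created identifier back into the dict. Whether the real function caches created organizations is not stated in this reference.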

mex.extractors.biospecimen.transform.transform_biospecimen_resource_to_mex_resource(biospecimen_resources: Iterable[BiospecimenResource], extracted_primary_source_biospecimen: ExtractedPrimarySource, unit_stable_target_ids_by_synonym: dict[str, Identifier], mex_persons: Iterable[ExtractedPerson], extracted_organization_rki: ExtractedOrganization, extracted_synopse_activities: Iterable[ExtractedActivity], resource_mapping: Any, extracted_organizations: dict[str, MergedOrganizationIdentifier]) Generator[ExtractedResource, None, None]

Transform Biospecimen resources to extracted resources.

Parameters:
  • biospecimen_resources – Biospecimen resources

  • extracted_primary_source_biospecimen – Extracted primary source for Biospecimen

  • unit_stable_target_ids_by_synonym – Unit stable target ids by synonym

  • mex_persons – Generator for ExtractedPerson

  • extracted_synopse_activities – extracted synopse activities

  • extracted_organization_rki – extracted organization

  • resource_mapping – resource mapping model with default values

  • extracted_organizations – extracted organizations by label

Returns:

Generator for ExtractedResource instances

Module contents