mex.extractors.artificial package

Submodules

mex.extractors.artificial.identity module

mex.extractors.artificial.identity._create_numeric_ids(faker: Faker) dict[str, list[int]]

Create a mapping from entity type to a list of numeric ids.

These numeric ids can be used as seeds for the identity of artificial items. The seeds are passed to Identifier.generate(seed=…) so that consecutive runs of the artificial extractor produce the same identifiers.

Parameters:

faker – Faker instance

Returns:

Dict with entity types and lists of numeric ids
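The determinism described above can be illustrated with a minimal sketch. It is not the package's implementation: the function name create_numeric_ids, the per_type parameter, and the use of the standard library random module in place of Faker are all assumptions made to keep the example self-contained.

```python
import random


def create_numeric_ids(
    entity_types: list[str], per_type: int, seed: int
) -> dict[str, list[int]]:
    """Map each entity type to a reproducible list of numeric ids (hypothetical helper)."""
    # A dedicated, seeded generator keeps the output identical across runs.
    rng = random.Random(seed)
    return {
        entity_type: [rng.randrange(2**32) for _ in range(per_type)]
        for entity_type in entity_types
    }


# Two runs with the same seed yield the same mapping.
first_run = create_numeric_ids(["ExtractedPerson", "ExtractedActivity"], per_type=2, seed=0)
second_run = create_numeric_ids(["ExtractedPerson", "ExtractedActivity"], per_type=2, seed=0)
```

Because the ids depend only on the seed and the order of entity types, any identifier derived from them stays stable between runs.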

mex.extractors.artificial.identity._get_offset_int(cls: type) int

Calculate an integer based on the crc32 checksum of the name of a class.
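A minimal sketch of such a calculation, assuming the checksum is taken over the UTF-8 encoded class name with zlib.crc32. The function name get_offset_int and the two stand-in classes are hypothetical; the actual function may post-process the checksum differently.

```python
import zlib


def get_offset_int(cls: type) -> int:
    """Return an integer derived from the CRC32 checksum of the class name."""
    # zlib.crc32 expects bytes and returns an unsigned 32-bit integer.
    return zlib.crc32(cls.__name__.encode("utf-8"))


class ExtractedPerson:  # hypothetical stand-in for a real model class
    pass


class ExtractedActivity:  # hypothetical stand-in for a real model class
    pass


offset_person = get_offset_int(ExtractedPerson)
offset_activity = get_offset_int(ExtractedActivity)
```

Unlike the built-in hash(), which is salted per process for strings, a CRC32 checksum is the same in every interpreter run, which suits the goal of deterministic identifiers.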

mex.extractors.artificial.identity.create_identities(faker: Faker) dict[str, list[Identity]]

Create the identities of the to-be-faked models.

We do this before actually creating the models, because we need to be able to set existing stableTargetIds on reference fields.

Parameters:

faker – Faker instance

Returns:

Dict with entity types and lists of Identities

mex.extractors.artificial.identity.restore_identities(identity_map: dict[str, list[Identity]]) None

Restore the state of the memory identity provider.

Because identity creation and model instantiation happen in different subprocesses, the identity provider does not have access to previously stored identities.

Parameters:

identity_map – Identity map that needs to be loaded back into the provider

mex.extractors.artificial.main module

mex.extractors.artificial.provider module

class mex.extractors.artificial.provider.BuilderProvider(generator: Any)

Bases: Provider

Faker provider that deals with interpreting pydantic model fields.

extracted_items(model: type[ExtractedAccessPlatform | ExtractedActivity | ExtractedBibliographicResource | ExtractedConsent | ExtractedContactPoint | ExtractedDistribution | ExtractedOrganization | ExtractedOrganizationalUnit | ExtractedPerson | ExtractedPrimarySource | ExtractedResource | ExtractedVariable | ExtractedVariableGroup]) list[ExtractedAccessPlatform | ExtractedActivity | ExtractedBibliographicResource | ExtractedConsent | ExtractedContactPoint | ExtractedDistribution | ExtractedOrganization | ExtractedOrganizationalUnit | ExtractedPerson | ExtractedPrimarySource | ExtractedResource | ExtractedVariable | ExtractedVariableGroup]

Get a list of extracted items for the given model class.

field_value(field: FieldInfo, identity: Identity) list[Any]

Get a single artificial value for the given field and identity.

inner_type_and_pattern(field: FieldInfo) tuple[Any, str | None]

Return the inner type and pattern of a field.

If the field's annotation has multiple type arguments, randomly pick one of them.

min_max_for_field(field: FieldInfo) tuple[int, int]

Return a min and max item count for a field.

class mex.extractors.artificial.provider.IdentityProvider(factory: Any, identities: dict[str, list[Identity]])

Bases: BaseProvider

Faker provider that creates identities and helps with referencing them.

__init__(factory: Any, identities: dict[str, list[Identity]]) None

Create and persist identities for all entity types.

identities(model: type[ExtractedAccessPlatform | ExtractedActivity | ExtractedBibliographicResource | ExtractedConsent | ExtractedContactPoint | ExtractedDistribution | ExtractedOrganization | ExtractedOrganizationalUnit | ExtractedPerson | ExtractedPrimarySource | ExtractedResource | ExtractedVariable | ExtractedVariableGroup]) list[Identity]

Return a list of identities for the given model class.

reference(inner_type: type[Identifier], exclude: Identity) Identifier | None

Return ID for random identity of given type (that is not excluded).
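The selection logic can be sketched as follows. This is an illustration only: FakeIdentity is a hypothetical stand-in for Identity, pick_reference is a hypothetical free function, and returning None when no other candidate exists is an assumption based on the Identifier | None return type.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeIdentity:
    """Hypothetical stand-in for Identity with only the fields needed here."""

    stable_target_id: str
    entity_type: str


def pick_reference(
    candidates: list[FakeIdentity], exclude: FakeIdentity, rng: random.Random
) -> str | None:
    """Return the id of a random candidate that is not the excluded identity."""
    choices = [candidate for candidate in candidates if candidate != exclude]
    if not choices:
        # Assumption: without any other candidate there is nothing to reference.
        return None
    return rng.choice(choices).stable_target_id


rng = random.Random(0)
person_a = FakeIdentity("id-a", "ExtractedPerson")
person_b = FakeIdentity("id-b", "ExtractedPerson")
chosen = pick_reference([person_a, person_b], exclude=person_a, rng=rng)
```

Excluding the current identity prevents an artificial item from referencing itself.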

class mex.extractors.artificial.provider.LinkProvider(generator: Any)

Bases: Provider, Provider

Faker provider that can return links with optional title and language.

Return a link with optional title and language.

class mex.extractors.artificial.provider.PatternProvider(factory: Any)

Bases: BaseProvider

Faker provider to create strings matching given patterns.

MESH_TO_TEMPLATE = {'^http://id\\.nlm\\.nih\\.gov/mesh/[A-Z0-9]{2,64}$': 'http://id.nlm.nih.gov/mesh/{}'}

REGEX_TO_NUMERIFY = {'^(((http)|(https))://(dx.)?doi.org/)(10.\\d{4,9}/[-._;()/:A-Z0-9]+)$': 'https://dx.doi.org/10.####/#######', '^https://d\\-nb\\.info/gnd/[-X0-9]{3,10}$': 'https://d-nb.info/gnd/3########', '^https://gepris\\.dfg\\.de/gepris/institution/[0-9]{1,64}$': 'https://gepris.dfg.de/gepris/institution/#######', '^https://isni\\.org/isni/[X0-9]{16}$': 'https://isni.org/isni/################', '^https://loinc.org/([a-zA-z]*)|(([0-9]*(-[0-9])*))$': 'https://loinc.org/#####-#', '^https://orcid\\.org/[-X0-9]{9,21}$': 'https://orcid.org/0000-####-####-###X', '^https://ror\\.org/[a-z0-9]{9}$': 'https://ror.org/#########', '^https://viaf\\.org/viaf/[0-9]{2,22}$': 'https://viaf.org/viaf/#########', '^https://www\\.wikidata\\.org/entity/[PQ0-9]{2,64}$': 'https://www.wikidata.org/entity/P######'}

__init__(factory: Any) None

Initialize the provider by loading the contents of the mesh_file.

pattern(regex: str) str

Return a randomized string matching the given pattern.
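The name REGEX_TO_NUMERIFY suggests that each known regex is mapped to a template whose "#" placeholders are replaced with random digits. The sketch below illustrates that idea for the ROR entry using only the standard library; the numerify helper here is a hypothetical reimplementation, not the provider's own code.

```python
import random
import re

# Regex and template taken from the ROR entry of REGEX_TO_NUMERIFY.
ROR_REGEX = r"^https://ror\.org/[a-z0-9]{9}$"
ROR_TEMPLATE = "https://ror.org/#########"


def numerify(template: str, rng: random.Random) -> str:
    """Replace every '#' in the template with a random digit (hypothetical helper)."""
    return "".join(
        str(rng.randint(0, 9)) if char == "#" else char for char in template
    )


rng = random.Random(0)
value = numerify(ROR_TEMPLATE, rng)

# The generated string satisfies the regex it was derived from.
matches = re.match(ROR_REGEX, value) is not None
```

Filling a hand-written template is much simpler than generating strings from an arbitrary regex, at the cost of only supporting patterns that have a template.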

class mex.extractors.artificial.provider.TemporalEntityProvider(generator: Any)

Bases: Provider

Faker provider that can return a custom TemporalEntity with random precision.

temporal_entity(allowed_precision_levels: list[TemporalEntityPrecision]) TemporalEntity

Return a custom temporal entity with random date, time and precision.

class mex.extractors.artificial.provider.TextProvider(generator: Any)

Bases: Provider

Faker provider that handles custom text related requirements.

text_object() Text

Return a random text paragraph with an auto-detected language.

text_string() str

Return a randomized sequence of words as a string.

mex.extractors.artificial.settings module

class mex.extractors.artificial.settings.ArtificialSettings(*, count: Annotated[int, Ge(ge=26), Lt(lt=10000000.0)] = 100, chattiness: Annotated[int, Gt(gt=1), Le(le=100)] = 10, seed: int = 0, locale: list[str] = ['de_DE', 'en_US'], mesh_file: AssetsPath = AssetsPath('raw-data/artificial/asciimesh.bin'))

Bases: BaseModel

Artificial settings submodel definition for the artificial data creator.

chattiness: int
count: int
locale: list[str]
mesh_file: AssetsPath
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}

Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'chattiness': FieldInfo(annotation=int, required=False, default=10, description='Maximum amount of words to produce for textual fields.', metadata=[Gt(gt=1), Le(le=100)]), 'count': FieldInfo(annotation=int, required=False, default=100, description='Amount of artificial entities to create. At least 2 per entity type are required, to ensure valid linking between the entities.', metadata=[Ge(ge=26), Lt(lt=10000000.0)]), 'locale': FieldInfo(annotation=list[str], required=False, default=['de_DE', 'en_US'], description='The locale to use for faker.'), 'mesh_file': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("raw-data/artificial/asciimesh.bin"), description='Binary MeSH file, absolute path or relative to `assets_dir`. MeSH (Medical Subject Headings) are used by the US National Library of Medicine as a controlled vocabulary thesaurus for indexing articles in PubMed. See: https://www.ncbi.nlm.nih.gov/mesh'), 'seed': FieldInfo(annotation=int, required=False, default=0, description='The seed value for faker randomness.')}

Metadata about the fields defined on the model, a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

seed: int
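One plausible reading of the lower bound on count (Ge(ge=26)) is that it follows from the field description: at least 2 entities for each of the 13 extracted model types listed in the provider signatures. The arithmetic can be checked with a short sketch; the constant names are hypothetical.

```python
# The 13 extracted model types named in the provider signatures.
ENTITY_TYPES = [
    "ExtractedAccessPlatform",
    "ExtractedActivity",
    "ExtractedBibliographicResource",
    "ExtractedConsent",
    "ExtractedContactPoint",
    "ExtractedDistribution",
    "ExtractedOrganization",
    "ExtractedOrganizationalUnit",
    "ExtractedPerson",
    "ExtractedPrimarySource",
    "ExtractedResource",
    "ExtractedVariable",
    "ExtractedVariableGroup",
]

# Per the field description, at least 2 items per entity type are required.
MIN_PER_TYPE = 2

minimum_count = len(ENTITY_TYPES) * MIN_PER_TYPE
```

Under this reading, adding a new entity type would also require raising the lower bound on count.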

Module contents