mex.artificial package¶
Submodules¶
mex.artificial.identity module¶
- mex.artificial.identity._create_numeric_ids(faker: Faker, weights: dict[type[ExtractedData], int]) dict[str, Sequence[int]] ¶
Create a mapping from entity type to a list of numeric ids.
These numeric ids can be used as seeds for the identity of artificial items. The seeds are passed to Identifier.generate(seed=…) to get deterministic identifiers across consecutive runs of the artificial extractor.
- Parameters:
faker – Instance of faker
weights – Mapping from extracted data classes to an integer weight. The weights control the relative number of items created per class; they are normalized so that the total stays below Settings.artificial.count.
- Returns:
Dict with entity types and lists of numeric ids
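The weight normalization described above could be sketched roughly as follows. This is an illustration only, not the package's implementation: the helper names `allocate_counts` and `numeric_ids`, the id range, and the use of `random.Random` in place of a Faker instance are assumptions. The floor of 2 items per type is taken from the description of the `count` setting.

```python
import random
from collections.abc import Sequence


def allocate_counts(
    weights: dict[str, int], total: int, minimum: int = 2
) -> dict[str, int]:
    """Scale integer weights to item counts (hypothetical helper).

    Floor division keeps the summed counts at or below `total`,
    unless the per-type minimum pushes a small share upwards.
    """
    weight_sum = sum(weights.values())
    return {
        name: max(minimum, total * weight // weight_sum)
        for name, weight in weights.items()
    }


def numeric_ids(
    weights: dict[str, int], total: int, seed: int = 0
) -> dict[str, Sequence[int]]:
    """Draw a reproducible list of numeric ids per entity type."""
    rng = random.Random(seed)  # stand-in for a seeded Faker instance
    counts = allocate_counts(weights, total)
    return {
        name: [rng.randrange(10**6, 10**7) for _ in range(count)]
        for name, count in counts.items()
    }


ids = numeric_ids({"Person": 3, "Resource": 1}, total=100)
```

Because the random generator is seeded, calling `numeric_ids` twice with the same arguments yields the same ids, which is the property the seeds need for deterministic identifiers.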
- mex.artificial.identity._get_offset_int(cls: type) int ¶
Calculate an integer based on the crc32 checksum of the name of a class.
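A minimal sketch of such a checksum-based offset, using `zlib.crc32` from the standard library. Whether the real function uses the plain class name or applies further arithmetic to the checksum is not stated here, so `offset_int` and `ExtractedPerson` below are hypothetical.

```python
import zlib


def offset_int(cls: type) -> int:
    """Return the CRC32 checksum of the class name as an unsigned integer."""
    return zlib.crc32(cls.__name__.encode("utf-8"))


class ExtractedPerson:
    """Hypothetical stand-in for an extracted data class."""


person_offset = offset_int(ExtractedPerson)
```

Unlike the built-in `hash()` of a string, which is randomized per interpreter process by default, a CRC32 checksum is stable across runs, so it suits deterministic seeding.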
- mex.artificial.identity.create_identities(faker: Faker, weights: dict[type[ExtractedData], int]) dict[str, list[Identity]] ¶
Create the identities of the to-be-faked models.
We do this before actually creating the models, because we need to be able to set existing stableTargetIds on reference fields.
- Parameters:
faker – Instance of faker
weights – Mapping from extracted data classes to an integer weight. The weights control the relative number of items created per class; they are normalized so that the total stays below Settings.artificial.count.
- Returns:
Dict with entity types and lists of Identities
- mex.artificial.identity.restore_identities(identity_map: dict[str, list[Identity]]) None ¶
Restore the state of the memory identity provider.
Because identity creation and model instantiation happen in different subprocesses, the identity provider does not have access to previously stored identities.
- Parameters:
identity_map – Identity map that needs to be loaded back into the provider
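The idea of reloading an identity map into a fresh in-memory provider might look like the sketch below. The `Identity` and `InMemoryProvider` classes are reduced, hypothetical stand-ins, and `restore` is not the real function; the actual provider's storage layout may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Identity:
    """Hypothetical, reduced stand-in for the real Identity model."""

    identifier: str
    stable_target_id: str


@dataclass
class InMemoryProvider:
    """Hypothetical provider that keeps identities in a plain list."""

    database: list[Identity] = field(default_factory=list)


def restore(
    provider: InMemoryProvider, identity_map: dict[str, list[Identity]]
) -> None:
    """Replace the provider's contents with every identity from the map."""
    provider.database = [
        identity
        for identities in identity_map.values()
        for identity in identities
    ]


provider = InMemoryProvider()  # starts empty, as it would in a new subprocess
restore(
    provider,
    {
        "Person": [Identity("id-1", "st-1")],
        "Resource": [Identity("id-2", "st-2")],
    },
)
```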
mex.artificial.main module¶
mex.artificial.provider module¶
- class mex.artificial.provider.BuilderProvider(generator: Any)¶
Bases:
Provider
Faker provider that deals with interpreting pydantic model fields.
- extracted_data(model: type[ExtractedData]) list[ExtractedData] ¶
Get a list of extracted data instances for the given model class.
- field_value(field: FieldInfo, identity: Identity) list[Any] ¶
Get a single artificial value for the given field and identity.
- inner_type_and_pattern(field: FieldInfo) tuple[Any, str | None] ¶
Return the inner type and pattern of a field.
If the field's annotation has multiple type arguments, randomly pick one of them.
- min_max_for_field(field: FieldInfo) tuple[int, int] ¶
Return a min and max item count for a field.
- class mex.artificial.provider.IdentityProvider(factory: Any, identities: dict[str, list[Identity]])¶
Bases:
BaseProvider
Faker provider that creates identities and helps with referencing them.
- __init__(factory: Any, identities: dict[str, list[Identity]]) None ¶
Create and persist identities for all entity types.
- identities(model: type[ExtractedData]) list[Identity] ¶
Return a list of identities for the given model class.
- reference(inner_type: type[Identifier], exclude: Identity) Identifier | None ¶
Return ID for random identity of given type (that is not excluded).
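Picking a reference while excluding one identity can be sketched as below. This is an assumption-laden illustration: `pick_reference` and the reduced `Identity` class are hypothetical, and returning the `stable_target_id` is inferred from the note that reference fields are set to existing stableTargetIds.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Identity:
    """Hypothetical, reduced stand-in for the real Identity model."""

    identifier: str
    stable_target_id: str


def pick_reference(
    candidates: list[Identity], exclude: Identity, rng: random.Random
) -> Optional[str]:
    """Return the stable target id of a random candidate other than `exclude`.

    Returns None when no other candidate is available.
    """
    choices = [candidate for candidate in candidates if candidate != exclude]
    if not choices:
        return None
    return rng.choice(choices).stable_target_id


alice = Identity("id-1", "st-1")
bob = Identity("id-2", "st-2")
rng = random.Random(0)
```

Excluding the item's own identity presumably keeps an artificial item from referencing itself, and the `None` case covers entity types that have no other identity to point at.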
- class mex.artificial.provider.LinkProvider(generator: Any)¶
Bases:
Provider, Provider
Faker provider that can return links with optional title and language.
- link() Link ¶
Return a link with optional title and language.
- class mex.artificial.provider.PatternProvider(factory: Any)¶
Bases:
BaseProvider
Faker provider to create strings matching given patterns.
- MESH_TO_TEMPLATE = {'^http://id\\.nlm\\.nih\\.gov/mesh/[A-Z0-9]{2,64}$': 'http://id.nlm.nih.gov/mesh/{}'}¶
- REGEX_TO_NUMERIFY = {'^https://d\\-nb\\.info/gnd/[-X0-9]{3,10}$': 'https://d-nb.info/gnd/3########', '^https://gepris\\.dfg\\.de/gepris/institution/[0-9]{1,64}$': 'https://gepris.dfg.de/gepris/institution/#######', '^https://isni\\.org/isni/[X0-9]{16}$': 'https://isni.org/isni/################', '^https://orcid\\.org/[-X0-9]{9,21}$': 'https://orcid.org/0000-####-####-###X', '^https://ror\\.org/[a-z0-9]{9}$': 'https://ror.org/#########', '^https://viaf\\.org/viaf/[0-9]{2,22}$': 'https://viaf.org/viaf/#########', '^https://www\\.wikidata\\.org/entity/[PQ0-9]{2,64}$': 'https://www.wikidata.org/entity/P######'}¶
- __init__(factory: Any) None ¶
Initialize the provider by loading the contents of the mesh_file.
- pattern(regex: str) str ¶
Return a randomized string matching the given pattern.
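How a regex-to-template mapping such as REGEX_TO_NUMERIFY can produce matching strings is sketched below with the standard library only. The helper names `fill_template` and `pattern_for` are hypothetical and the lookup logic is an assumption; two of the mapping entries are copied from the attribute above. Faker's own `numerify` method performs a comparable `#` substitution.

```python
import random
import re

# Two entries copied from REGEX_TO_NUMERIFY, written as raw strings.
REGEX_TO_TEMPLATE = {
    r"^https://ror\.org/[a-z0-9]{9}$": "https://ror.org/#########",
    r"^https://orcid\.org/[-X0-9]{9,21}$": "https://orcid.org/0000-####-####-###X",
}


def fill_template(template: str, rng: random.Random) -> str:
    """Replace every '#' with a random digit, keeping other characters."""
    return "".join(
        str(rng.randrange(10)) if char == "#" else char for char in template
    )


def pattern_for(regex: str, rng: random.Random) -> str:
    """Look up the template for a known regex and fill in its digits."""
    return fill_template(REGEX_TO_TEMPLATE[regex], rng)


rng = random.Random(0)
samples = {regex: pattern_for(regex, rng) for regex in REGEX_TO_TEMPLATE}
```

Keying the templates by the exact regex string avoids generating text from arbitrary regular expressions; each supported pattern only needs one hand-written template that is known to satisfy it.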
- class mex.artificial.provider.TemporalEntityProvider(generator: Any)¶
Bases:
Provider
Faker provider that can return a custom TemporalEntity with random precision.
- temporal_entity(allowed_precision_levels: list[TemporalEntityPrecision]) TemporalEntity ¶
Return a custom temporal entity with random date, time and precision.
mex.artificial.settings module¶
- class mex.artificial.settings.ArtificialSettings(*, c: Annotated[int, Gt(gt=10), Lt(lt=10000000.0)] = 100, matched: Annotated[int, Ge(ge=0), Le(le=100)] = 25, chattiness: Annotated[int, Gt(gt=1), Le(le=100)] = 10, seed: int = 0, locale: list[str] = ['de_DE', 'en_US'], mesh_file: AssetsPath = AssetsPath('raw-data/artificial/asciimesh.bin'))¶
Bases:
BaseModel
Artificial settings submodel definition for the artificial data creator.
- chattiness: int¶
- count: int¶
- locale: list[str]¶
- matched: int¶
- mesh_file: AssetsPath¶
- model_computed_fields: ClassVar[dict[str, ComputedFieldInfo]] = {}¶
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'extra': 'ignore', 'populate_by_name': True, 'str_max_length': 100000, 'str_min_length': 1, 'str_strip_whitespace': True, 'use_enum_values': True, 'validate_assignment': True, 'validate_default': True}¶
Configuration for the model, should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- model_fields: ClassVar[dict[str, FieldInfo]] = {'chattiness': FieldInfo(annotation=int, required=False, default=10, description='Maximum amount of words to produce for textual fields.', metadata=[Gt(gt=1), Le(le=100)]), 'count': FieldInfo(annotation=int, required=False, default=100, alias='c', alias_priority=2, description='Amount of artificial entities to create. At least 2 per entity type will be created regardless of setting or hardcoded weights.', metadata=[Gt(gt=10), Lt(lt=10000000.0)]), 'locale': FieldInfo(annotation=list[str], required=False, default=['de_DE', 'en_US'], description='The locale to use for faker.'), 'matched': FieldInfo(annotation=int, required=False, default=25, description='Integer percentage of matched items with same target ID to produce.', metadata=[Ge(ge=0), Le(le=100)]), 'mesh_file': FieldInfo(annotation=AssetsPath, required=False, default=AssetsPath("raw-data/artificial/asciimesh.bin"), description='Binary MeSH file, absolute path or relative to `assets_dir`. MeSH (Medical Subject Headings) are used by the US National Library of Medicine as a controlled vocabulary thesaurus for indexing articles in PubMed. See: https://www.ncbi.nlm.nih.gov/mesh'), 'seed': FieldInfo(annotation=int, required=False, default=0, description='The seed value for faker randomness.')}¶
Metadata about the fields defined on the model, a mapping of field names to FieldInfo (pydantic.fields.FieldInfo).
This replaces Model.__fields__ from Pydantic V1.
- seed: int¶