This link contains all 'proper names' in Wiktionary, in all languages. That includes personal names like Tatiana, Zadie or Richard, but also names of countries, towns, rivers, and so on.
I want to extract all records which are personal names.
The records I want to extract either have the string "given name" in them or the string "surname" (some have both).
For example the name Fabian:
{"pos": "name", "wikipedia": ["Fabian (name)"], "head_templates": [{"name": "en-proper noun", "args": {}, "expansion": "Fabian"}], "etymology_text": "From Latin Fabiānus (“belonging to Fabius”), derived from Fabius + -ānus.", "etymology_templates": [{"name": "der", "args": {"1": "en", "2": "la", "3": "Fabiānus", "4": "", "5": "belonging to Fabius"}, "expansion": "Latin Fabiānus (“belonging to Fabius”)"}, {"name": "m", "args": {"1": "la", "2": "Fabius"}, "expansion": "Fabius"}, {"name": "m", "args": {"1": "la", "2": "-ānus"}, "expansion": "-ānus"}], "sounds": [{"ipa": "/ˈfeɪbi.ən/"}, {"audio": "LL-Q1860 (eng)-Vealhurl-Fabian.wav", "text": "Audio (Southern England)", "tags": ["Southern-England"], "ogg_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/2/2b/LL-Q1860_%28eng%29-Vealhurl-Fabian.wav/LL-Q1860_%28eng%29-Vealhurl-Fabian.wav.ogg", "mp3_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/2/2b/LL-Q1860_%28eng%29-Vealhurl-Fabian.wav/LL-Q1860_%28eng%29-Vealhurl-Fabian.wav.mp3"}], "word": "Fabian", "lang": "English", "lang_code": "en", "senses": [{"links": [["given name", "given name"]], "raw_glosses": ["(rare) A male given name from Latin."], "glosses": ["A male given name from Latin."], "tags": ["rare"], "id": "Fabian-en-name-XC4~mcw6", "categories": [{"name": "English given names", "kind": "topical", "parents": ["Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}, {"name": "English male given names", "kind": "topical", "parents": ["Male given names", "Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}], "translations": [{"lang": "Aragonese", "code": "an", "sense": "male given name", "tags": ["masculine"], "word": "Fabián", "_dis1": "96 4"}, {"lang": "Catalan", "code": "ca", "sense": "male given name", "word": "Fabià", "_dis1": "96 4"}, {"lang": "Faroese", "code": "fo", 
"sense": "male given name", "tags": ["masculine"], "word": "Fabian", "_dis1": "96 4"}, {"lang": "French", "code": "fr", "sense": "male given name", "word": "Fabien", "_dis1": "96 4"}, {"lang": "Galician", "code": "gl", "sense": "male given name", "word": "Fabián", "_dis1": "96 4"}, {"lang": "Galician", "code": "gl", "sense": "male given name", "word": "Fabiano", "_dis1": "96 4"}, {"lang": "Galician", "code": "gl", "sense": "male given name", "word": "Fabio", "_dis1": "96 4"}, {"lang": "German", "code": "de", "sense": "male given name", "tags": ["masculine"], "word": "Fabian", "_dis1": "96 4"}, {"lang": "Hungarian", "code": "hu", "sense": "male given name", "word": "Fábián", "_dis1": "96 4"}, {"lang": "Italian", "code": "it", "sense": "male given name", "word": "Fabiano", "_dis1": "96 4"}, {"lang": "Polish", "code": "pl", "sense": "male given name", "tags": ["masculine", "person"], "word": "Fabian", "_dis1": "96 4"}, {"lang": "Portuguese", "code": "pt", "sense": "male given name", "word": "Fabiano", "_dis1": "96 4"}, {"lang": "Spanish", "code": "es", "sense": "male given name", "word": "Fabián", "_dis1": "96 4"}, {"lang": "Swedish", "code": "sv", "sense": "male given name", "word": "Fabian", "_dis1": "96 4"}]}, {"links": [["surname", "surname"]], "glosses": ["A surname."], "id": "Fabian-en-name-EMUC1F3L", "categories": [{"name": "English surnames", "kind": "other", "parents": [], "source": "w"}]}]}
{"pos": "name", "head_templates": [{"name": "head", "args": {"1": "fo", "2": "proper noun", "g": "m"}, "expansion": "Fabian m"}], "inflection_templates": [{"name": "fo-decl-proper-noun-s-indef", "args": {"1": "Fabian", "2": "Fabian", "3": "Fabiani", "4": "Fabians"}}], "forms": [{"form": "", "source": "declension", "tags": ["table-tags"]}, {"form": "fo-decl-proper-noun-s-indef", "source": "declension", "tags": ["inflection-template"]}, {"form": "Fabian", "tags": ["indefinite", "nominative"], "source": "declension"}, {"form": "Fabian", "tags": ["accusative", "indefinite"], "source": "declension"}, {"form": "Fabiani", "tags": ["dative", "indefinite"], "source": "declension"}, {"form": "Fabians", "tags": ["genitive", "indefinite"], "source": "declension"}], "word": "Fabian", "lang": "Faroese", "lang_code": "fo", "senses": [{"links": [["given name", "given name"]], "glosses": ["a male given name"], "tags": ["masculine"], "id": "Fabian-fo-name-h8YdwBAs", "categories": [{"name": "Faroese given names", "kind": "topical", "parents": ["Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}, {"name": "Faroese male given names", "kind": "topical", "parents": ["Male given names", "Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}]}]}
{"pos": "name", "head_templates": [{"name": "head", "args": {"1": "de", "2": "proper noun", "g": "m"}, "expansion": "Fabian m"}], "etymology_text": "Borrowed from Latin Fabiānus (“belonging to Fabius”).", "etymology_templates": [{"name": "glossary", "args": {"1": "loanword", "2": "Borrowed"}, "expansion": "Borrowed"}, {"name": "bor", "args": {"1": "de", "2": "la", "3": "Fabiānus", "4": "", "5": "belonging to Fabius", "lit": "", "pos": "", "tr": "", "ts": "", "id": "", "sc": "", "g": "", "g2": "", "g3": "", "nocat": "", "sort": ""}, "expansion": "Latin Fabiānus (“belonging to Fabius”)"}, {"name": "bor+", "args": {"1": "de", "2": "la", "3": "Fabiānus", "4": "", "5": "belonging to Fabius"}, "expansion": "Borrowed from Latin Fabiānus (“belonging to Fabius”)"}], "sounds": [{"ipa": "/ˈfaːbian/"}, {"audio": "De-Fabian.ogg", "text": "Audio", "ogg_url": "https://upload.wikimedia.org/wikipedia/commons/c/c9/De-Fabian.ogg", "mp3_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/c/c9/De-Fabian.ogg/De-Fabian.ogg.mp3"}], "word": "Fabian", "lang": "German", "lang_code": "de", "senses": [{"links": [["given name", "given name"]], "glosses": ["a male given name"], "tags": ["masculine"], "id": "Fabian-de-name-h8YdwBAs", "categories": [{"name": "German given names", "kind": "topical", "parents": ["Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}, {"name": "German male given names", "kind": "topical", "parents": ["Male given names", "Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}]}]}
{"pos": "name", "head_templates": [{"name": "head", "args": {"1": "oc", "2": "proper noun", "head": "", "g": "m", "g2": ""}, "expansion": "Fabian m"}, {"name": "oc-proper noun", "args": {"1": "m"}, "expansion": "Fabian m"}], "word": "Fabian", "lang": "Occitan", "lang_code": "oc", "senses": [{"links": [["given name", "given name"], ["Fabian", "Fabian#English"]], "raw_glosses": ["(Gascony) a male given name, equivalent to English Fabian"], "glosses": ["a male given name, equivalent to English Fabian"], "tags": ["Gascony", "masculine"], "id": "Fabian-oc-name-VtvZQ6Yw", "categories": [{"name": "Gascon", "kind": "other", "parents": [], "source": "w"}, {"name": "Occitan given names", "kind": "topical", "parents": ["Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}, {"name": "Occitan male given names", "kind": "topical", "parents": ["Male given names", "Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}]}]}
{"pos": "name", "head_templates": [{"name": "pl-proper noun", "args": {"1": "m-pr"}, "expansion": "Fabian m pers"}], "inflection_templates": [{"name": "pl-decl-noun-m-pr", "args": {"nomp": "Fabianowie"}}], "forms": [{"form": "", "source": "declension", "tags": ["table-tags"]}, {"form": "pl-decl-noun-m-pr", "source": "declension", "tags": ["inflection-template"]}, {"form": "Fabian", "tags": ["nominative", "singular"], "source": "declension"}, {"form": "Fabianowie", "tags": ["nominative", "plural"], "source": "declension"}, {"form": "Fabiana", "tags": ["genitive", "singular"], "source": "declension"}, {"form": "Fabianów", "tags": ["genitive", "plural"], "source": "declension"}, {"form": "Fabianowi", "tags": ["dative", "singular"], "source": "declension"}, {"form": "Fabianom", "tags": ["dative", "plural"], "source": "declension"}, {"form": "Fabiana", "tags": ["accusative", "singular"], "source": "declension"}, {"form": "Fabianów", "tags": ["accusative", "plural"], "source": "declension"}, {"form": "Fabianem", "tags": ["instrumental", "singular"], "source": "declension"}, {"form": "Fabianami", "tags": ["instrumental", "plural"], "source": "declension"}, {"form": "Fabianie", "tags": ["locative", "singular"], "source": "declension"}, {"form": "Fabianach", "tags": ["locative", "plural"], "source": "declension"}, {"form": "Fabianie", "tags": ["singular", "vocative"], "source": "declension"}, {"form": "Fabianowie", "tags": ["plural", "vocative"], "source": "declension"}], "etymology_text": "Borrowed from Latin Fabianus.", "etymology_templates": [{"name": "glossary", "args": {"1": "loanword", "2": "Borrowed"}, "expansion": "Borrowed"}, {"name": "bor", "args": {"1": "pl", "2": "la", "3": "Fabianus", "4": "", "5": "", "lit": "", "pos": "", "tr": "", "ts": "", "id": "", "sc": "", "g": "", "g2": "", "g3": "", "nocat": "", "sort": ""}, "expansion": "Latin Fabianus"}, {"name": "bor+", "args": {"1": "pl", "2": "la", "3": "Fabianus"}, "expansion": "Borrowed from Latin Fabianus"}], 
"sounds": [{"ipa": "/ˈfa.bjan/"}, {"rhymes": "-abjan"}], "hyphenation": ["Fa‧bian"], "word": "Fabian", "lang": "Polish", "lang_code": "pl", "senses": [{"links": [["given name", "given name"], ["Fabian", "Fabian#English"]], "glosses": ["a male given name, equivalent to English Fabian"], "tags": ["masculine", "person"], "id": "Fabian-pl-name-VtvZQ6Yw", "categories": [{"name": "Polish given names", "kind": "topical", "parents": ["Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}, {"name": "Polish male given names", "kind": "topical", "parents": ["Male given names", "Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}]}]}
{"pos": "name", "head_templates": [{"name": "head", "args": {"1": "sv", "2": "proper noun", "head": "", "g": "c", "3": "genitive", "4": "Fabians"}, "expansion": "Fabian c (genitive Fabians)"}, {"name": "sv-proper noun", "args": {"1": "c"}, "expansion": "Fabian c (genitive Fabians)"}], "forms": [{"form": "Fabians", "tags": ["genitive"]}], "word": "Fabian", "lang": "Swedish", "lang_code": "sv", "senses": [{"links": [["given name", "given name"]], "glosses": ["a male given name"], "tags": ["common-gender"], "id": "Fabian-sv-name-h8YdwBAs", "categories": [{"name": "Swedish given names", "kind": "topical", "parents": ["Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}, {"name": "Swedish male given names", "kind": "topical", "parents": ["Male given names", "Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}]}]}
As a human I can see that Fabian, the first record in the file linked, goes from lines 2 to 7. Line 8 is a new record. But I can't work out a regex pattern that will allow me to extract the whole of records like Fabian, which are personal names.
Can you help?
Given that the input data is in JSON format, it's best to parse it as such, using `ConvertFrom-Json`, which allows you to filter by the properties of the JSON objects using `Where-Object`.

Storing the filtered output in a variable, here `$personalNameObjects`, gives you `[pscustomobject]` instances representing those input JSON objects whose `.senses.links` property values contain either `given name` or `surname` (as substrings, as there are variations, such as a plural `s` or a suffix such as `#English`) - further filtering, such as by entry type, may be needed.

To get just the unique names themselves - assuming they're stored in the `.word` property - take that property's values from `$personalNameObjects` and deduplicate them.

Note:
- Given the size of the input file (almost 1 GB), `[System.IO.File]::ReadLines()` is used to improve reading performance; `Get-Content -LiteralPath names.json` works too, but would be noticeably slower.
- `Convert-Path` is used to pass the input file's full path, because .NET methods don't necessarily resolve relative paths against PowerShell's current location.

If needed, you can later convert the filtered parsed-from-JSON objects back to JSON using `ConvertTo-Json`; be sure to use a sufficiently large `-Depth` argument to prevent inadvertent truncation (see this post for background).
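
The steps described above can be sketched as follows. This is a minimal sketch rather than a definitive implementation, and it rests on several assumptions: the file is named `names.json` (as in the note above) and holds one JSON object per line; `Select-PersonalName` is a hypothetical helper name introduced here for illustration; `-Depth 20` is an arbitrary value chosen to exceed the nesting seen in the sample records; and `Sort-Object -Unique` is just one way to deduplicate (it compares case-insensitively by default).

```powershell
# Hypothetical helper: parses one line of JSON and emits the parsed object
# only if one of its .senses.links values mentions a given name or surname.
function Select-PersonalName {
  [CmdletBinding()]
  param(
    # One line of the input file, assumed to be one complete JSON object.
    [Parameter(Mandatory, ValueFromPipeline)]
    [AllowEmptyString()]
    [string] $Line
  )
  process {
    if ([string]::IsNullOrWhiteSpace($Line)) { return }  # skip blank lines
    $record = ConvertFrom-Json -InputObject $Line
    # Member-access enumeration collects the .links values of all senses;
    # with a collection on the left, -match acts as a substring filter.
    if ($record.senses.links -match 'given name|surname') { $record }
  }
}

# Assumed input file name, taken from the note above.
$inputFile = 'names.json'

if (Test-Path -LiteralPath $inputFile) {
  # .NET methods need a full path, hence Convert-Path.
  $fullPath = Convert-Path -LiteralPath $inputFile

  # Read the file lazily, line by line, and keep only personal-name records.
  $personalNameObjects = [System.IO.File]::ReadLines($fullPath) | Select-PersonalName

  # Unique names, assuming they are stored in the .word property.
  $uniqueNames = $personalNameObjects.word | Sort-Object -Unique

  # Back to JSON, one compact object per line; -Depth guards against truncation.
  $jsonLines = $personalNameObjects |
    ForEach-Object { ConvertTo-Json -InputObject $_ -Depth 20 -Compress }
}
```

Parsing each line on its own keeps the filter streaming, so non-matching records can be discarded as they are read; the matching objects are still collected in memory in `$personalNameObjects`.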