DRAFT: Basic search index documentation

Work in progress; not yet verified.

Note: Cells with a yellow background or red text indicate areas under investigation.

Overview/notes

*documentation on basic search options

mod-search field types:

  • Full-text capable fields (also known as multi-lang fields): analyzed and preprocessed fields that support searching within a field.

  • Term fields (keyword, boolean, date fields, etc.): non-analyzed fields that are matched exactly.
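
The difference between the two field types can be illustrated with a minimal sketch. It is not mod-search code: the `analyze` function is a toy stand-in for a real analyzer, and the function names and sample values are hypothetical.

```python
import re

def analyze(text: str) -> list[str]:
    """Toy analyzer (assumption): lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def full_text_match(field_value: str, query: str) -> bool:
    """Analyzed field: every query token must appear among the field's tokens."""
    field_tokens = set(analyze(field_value))
    return all(token in field_tokens for token in analyze(query))

def term_match(field_value: str, query: str) -> bool:
    """Non-analyzed field: the query must equal the stored value exactly."""
    return field_value == query

title = "Buddhism and the arts of Japan"   # hypothetical full-text value
hrid = "in00000012345"                     # hypothetical term value

print(full_text_match(title, "japan arts"))  # matches tokens inside the field
print(term_match(hrid, "in00000012345"))     # matches the whole value
print(term_match(hrid, "in0000001"))         # a partial value does not match
```

The sketch only shows why a full-text field can be searched word by word while a term field is compared as a single value.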

Relevancy

Relevance sorting uses the Okapi BM25 algorithm, which takes into account the following factors:

  • Term Frequency (TF): First, it looks at how often your search words appear in each instance. If an instance contains your search words many times, it gets a higher score because it is more likely to be a good match.

  • Inverse Document Frequency (IDF): Then, it checks how common or rare your search words are across all the instances in the library. If your words are rare, they get a higher score. If they are common, they get a lower score. This helps give importance to unique words.

  • Document Length (DL): BM25 also considers how long each instance is. If an instance is very long, it might dilute the importance of your search words, so it gets a lower score.

  • Parameter Tuning: BM25 has a few parameters that you can adjust to fine-tune your search. These parameters help you control how much importance you want to give to term frequency, inverse document frequency, and document length.

  • Calculation: Finally, BM25 combines all these factors using a mathematical formula to calculate a score for each instance. The instance with the highest score is considered the best match for your search.
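
The factors above can be sketched with the textbook BM25 formula. This is an illustration under stated assumptions, not the scoring code used by mod-search or Elasticsearch: the function name and sample documents are hypothetical, and `k1=1.2` and `b=0.75` are commonly used defaults rather than values confirmed for any FOLIO deployment.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # IDF: rare terms contribute more than common ones.
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Length normalization: long documents are penalized via b.
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "global africa".split(),
    "global trade and global markets in africa and asia and europe".split(),
    "history of europe".split(),
]
scores = bm25_scores(["global", "africa"], docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {' '.join(doc)}")
```

In this toy collection the short document that contains both terms outranks the longer one, even though the longer one repeats "global", and the document with neither term scores zero.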

Word-stemming & exact phrase

When the language of the record is English, word-stemming is applied, even in exact phrase searches. For example, a search for “buy” will retrieve records that contain “buying”. See the Elasticsearch documentation: https://www.elastic.co/docs/manage-data/data-store/text-analysis/stemming#algorithmic-stemmers
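
The effect of stemming can be sketched with a toy suffix stripper. This is an assumption-laden illustration only: `toy_stem` is hypothetical and far cruder than the algorithmic stemmers Elasticsearch provides, but it shows why two different words can match when they reduce to the same stem.

```python
def toy_stem(word: str) -> str:
    """Strip a few common English suffixes; a crude stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def same_stem(query_word: str, indexed_word: str) -> bool:
    """Two words match when they reduce to the same stem."""
    return toy_stem(query_word) == toy_stem(indexed_word)

for query, indexed in [("buy", "buying"), ("America", "Americas"), ("America", "American")]:
    print(query, indexed, same_stem(query, indexed))
```

Because matching happens on stems rather than on the original words, a stemmer that does not strip a given suffix (here, "-n" in "American") will not connect the two forms.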

Related documentation on potential changes:

Supported operators

| Search | Description | Supported operator | Example | Query search syntax | Notes |
|---|---|---|---|---|---|
| Exact phrase | Searches for text that contains all search terms in the order entered | Enclosing terms in wildcards (asterisks) | *Global africa* finds: Global africa; Globalized africa. Does not find: africa global | == | Word-stemming still applies when the language of the record is English. |
| Contains all | Searches for text that contains all search terms, regardless of order | N/A | Global africa finds: Global africa; Africa global; Africa globalization; globalization africa | Full-text: all or =. Term: all or == | Default operator for most full-text types (exception: Subject, which requires wildcards and otherwise performs an exact phrase search). Supported in query and advanced search. Word-stemming applies to records with a language of English, with limits: “Africa” will not find “African” and “America” will not find “American”, but “America” will find “Americas”. |
| Contains any | Searches for text that contains any of the search terms, regardless of order | N/A | Global africa finds: Global africa; Africa global; africa; global; globalization | any | Supported in query and advanced search. Word-stemming applies to records with a language of English, with limits: “Africa” will not find “African” and “America” will not find “American”, but “America” will find “Americas”. |
| Starts with | Searches for text that starts with the characters before the wildcard | * | Scenario 1: Title> buddhi* finds: Buddhism; Buddhism and the arts of Japan; Buddhist logic. Does not find: All is change : the two-thousand-year journey of Buddhism to the West. Scenario 2: Title> *buddhi* finds: Buddhism; Buddhism and the arts of Japan; Buddhist logic; All is change : the two-thousand-year journey of Buddhism to the West. See note. | * | To find instances where the entire field starts with certain characters, add an asterisk to the end of the query. To find instances where the field contains a value that starts with certain characters, wrap the text in asterisks. For example, a search for “buddh*” looks for instances where the field value begins with “buddh”, so it returns fewer results than a search for “buddhism”. |
| Masking - zero or more, leading | A zero or more character wildcard search, beginning with a wildcard | * | *chemistry finds: biochemistry; thermochemistry; immunochemistry; geochemistry | * | |
| Masking - zero or more, trailing | A zero or more character wildcard search, ending with a wildcard | * | surg* finds: surgeon; surgeries; surgical; surgery | * | Performs a “Starts with” search |
| Masking - zero or more, internal | A zero or more character wildcard search, containing a wildcard within the term | * | wom*n finds: woman; women; womyn | * | |
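
The operator semantics in this table can be sketched in a few lines. This is an approximation under stated assumptions, not mod-search behavior: the function names are hypothetical, tokens are split on whitespace only, no stemming is applied (so stem-based matches such as “Globalized africa” are not reproduced), and the `*` wildcard is emulated with Python's `fnmatch` against the whole field value.

```python
from fnmatch import fnmatchcase

def tokens(text):
    """Lowercase and split on whitespace (no stemming in this sketch)."""
    return text.lower().split()

def exact_phrase(field, query):
    """All query tokens appear in the field, adjacent and in the same order."""
    f, q = tokens(field), tokens(query)
    return any(f[i:i + len(q)] == q for i in range(len(f) - len(q) + 1))

def contains_all(field, query):
    """Every query token appears in the field, in any order."""
    return set(tokens(query)) <= set(tokens(field))

def contains_any(field, query):
    """At least one query token appears in the field."""
    return bool(set(tokens(query)) & set(tokens(field)))

def wildcard(field, pattern):
    """Match the whole field value; * stands for zero or more characters."""
    return fnmatchcase(field.lower(), pattern.lower())

long_title = "All is change : the two-thousand-year journey of Buddhism to the West"
print(exact_phrase("africa global", "Global africa"))   # order matters
print(contains_all("Africa global", "Global africa"))   # order does not matter
print(wildcard(long_title, "buddhi*"))                  # field does not start with buddhi
print(wildcard(long_title, "*buddhi*"))                 # buddhi occurs inside the field
```

The last two calls mirror the two “Starts with” scenarios: a trailing asterisk anchors the match at the start of the field, while wrapping the text in asterisks lets it match anywhere in the field.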

 

Inventory

Instance search options

| # | Search option - UI | Search option - BE | Type | Fields included | Default operator | Notes | Updates |
|---|---|---|---|---|---|---|---|
| 1 | Keyword (title, contributor, identifier, HRID, UUID) | keyword | Full-text and term? | Title (resource title; alternative title, of all types; index title; series statement); Contributors; Identifier (all); Instance HRID; Instance UUID | Performs a “contains all” on full-text fields and an “exact phrase” on identifier fields | Does not search all full-text terms, only those identified in the title of the basic search option | |
| 2 | Contributor | contributors | Full-text | Contributors (regardless of contributor type or name type) | Contains all | | |
| 3 | Title (all) | title | Full-text | Resource title; alternative title (of all types); index title; series statement | Contains all | | |
| 4 | Identifier (all) | identifiers.value | Term | All identifiers, regardless of type | Exact phrase | | |
| 5 | Classification, normalized | classifications.classificationNumber | Term | Classification number | Exact phrase | | |
| 6 | ISBN | isbn | Term | ISBN, Invalid ISBN | Contains all | | |
| 7 | ISSN | issn | Term | ISSN, Invalid ISSN, Linking ISSN | Exact phrase | | |
| 8 | LCCN, normalized | lccn | Term | LCCN, Canceled LCCN | Exact phrase | | |
| 9 | OCLC number, normalized | identifiers.typeId + identifiers.value | Term | OCLC, Cancelled OCLC | Contains all | Does not appear to be working in Ramsons and Sunflower? Current RRT thread 5/9. | |
| 10 | Instance notes (all) | note.note | Full-text | Notes of all note types and administrative notes | Contains all | | |
| 11 | Instance administrative notes | administrativeNotes | Full-text | Administrative notes | Contains all | | |
| 12 | Place of publication | publication.place | Full-text | Place of publication | Contains all | | |
| 13 | Subject | subjects | Full-text | Subjects | Exact phrase | | |
| 14 | Instance HRID | hrid | Term | Instance HRID | Exact phrase | | |
| 15 | Instance UUID | id | Term | Instance UUID | Exact phrase | | |
| 16 | Authority UUID | authorityId | Term | Authority ID | Exact phrase | Need to understand whether there are separate fields for contributors and subjects (the mod-search readme seems to imply that there are) | |
| 17 | All | all | Full-text or term? | | Contains all | To search in query: cql.all. Not sure if this truly includes “all” fields, but this particular query includes fields from instances, holdings, and items. To search just instance fields, query = cql.allInstances | |
| 18 | Query search | N/A | | | N/A | CQL queries constructed from any indexed fields | |
| 19 | Advanced search | N/A | | All basic search options | Contains all (operators/modifiers can be changed in modal) | Populated with advanced search query containing human readable operator text | |
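
As an illustration of how the back-end names and default operators in this table could be combined into a query string, here is a minimal sketch. The lookup and the `build_cql` helper are hypothetical; only the index names and the operator syntax (`all` for “contains all”, `==` for “exact phrase”) are taken from this page, and the result has not been verified against mod-search.

```python
# Hypothetical lookup built from a few rows of the instance table:
# UI search option -> (back-end index, default CQL relation).
INSTANCE_OPTIONS = {
    "Contributor": ("contributors", "all"),
    "Title (all)": ("title", "all"),
    "Identifier (all)": ("identifiers.value", "=="),
    "ISSN": ("issn", "=="),
    "Instance HRID": ("hrid", "=="),
    "Instance UUID": ("id", "=="),
}

def build_cql(option: str, value: str) -> str:
    """Build a CQL clause for a UI search option using its default operator."""
    index, relation = INSTANCE_OPTIONS[option]
    # Escape backslashes and double quotes so the value stays one quoted term.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{index} {relation} "{escaped}"'

print(build_cql("Title (all)", "global africa"))
print(build_cql("ISSN", "0028-0836"))
```

Keeping the mapping in one table-like structure makes it easy to compare against the rows above when the documented defaults change.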

 

Holdings search options

| # | Search option - UI | Search option - BE | Type | Fields included | Default operator | Notes | Updates |
|---|---|---|---|---|---|---|---|
| 1 | Keyword (title, contributor, identifier, HRID, UUID) | keyword | Full-text and term? | Instance level: Title (resource title; alternative title, of all types; index title; series statement), Contributors, Identifier (all); Holdings HRID; Holdings UUID | Contains all | Performs a “contains all” on full-text fields and an “exact phrase” on identifier fields | |
| 2 | ISBN | isbn | Term | ISBN, Invalid ISBN | Contains all | | |
| 3 | ISSN | issn | Term | ISSN, Invalid ISSN, Linking ISSN | Exact phrase | | |
| 4 | Call number, not normalized | holdingsFullCallNumbers | Term | Does this only look for matches on Prefix + Call number + Suffix? | Exact phrase | Case sensitive? Leading, internal, and trailing spaces NOT removed | |
| 5 | Call number, normalized | holdingsNormalizedCallNumbers | Term | Prefix + Call number + Suffix | Exact phrase | Leading, internal, and trailing spaces removed | |
| 6 | Holdings notes (all) | holdings.notes.note | Full-text | Holdings notes of all note types and holdings administrative notes | Contains all | | |
| 7 | Holdings administrative notes | holdings.administrativeNotes | Full-text | Holdings administrative notes | Contains all | | |
| 8 | Holdings HRID | holdings.hrid | Term | Holdings HRID | Exact phrase | | |
| 9 | Holdings UUID | holdings.id | Term | Holdings UUID | Exact phrase | | |
| 10 | All | all | Full-text or term? | | Contains all | To search in query: cql.all. Not sure if this truly includes “all” fields, but this particular query includes fields from instances, holdings, and items. To search just holdings fields, query = cql.allHoldings | |
| 11 | Query search | N/A | | | | CQL queries constructed from any indexed fields. Can combine record types? | |
| 12 | Advanced search | N/A | | All basic search options | Contains all (operators/modifiers can be changed in modal) | Populated with advanced search query containing human readable operator text | |
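
The difference between the two call number options in this table can be sketched as follows. This is an assumption-based illustration, not the mod-search implementation: the helper names and sample values are hypothetical, joining the parts with single spaces is an assumption, and case handling is left untouched because this page does not confirm it.

```python
def full_call_number(prefix: str, call_number: str, suffix: str) -> str:
    """Join the non-empty parts with single spaces (assumed separator)."""
    return " ".join(part for part in (prefix, call_number, suffix) if part)

def normalize_call_number(value: str) -> str:
    """Remove leading, internal, and trailing whitespace."""
    return "".join(value.split())

stored = full_call_number("Oversize", "QA76.73 .P98", "v.2")
print(stored)                          # spaces kept, as in the not-normalized option
print(normalize_call_number(stored))   # spaces removed, as in the normalized option

# With normalization, queries that differ only in spacing compare equal.
print(normalize_call_number(" QA76.73.P98 ") == normalize_call_number("QA76.73 .P98"))
```

Under these assumptions, an exact phrase search on the normalized option tolerates spacing differences, while the not-normalized option requires the spacing to match the stored value.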

 

 

Item search options

| # | Search option - UI | Search option - BE | Type | Fields included | Default operator | Notes | Updates |
|---|---|---|---|---|---|---|---|
| 1 | Keyword (title, contributor, identifier, HRID, UUID, Barcode) | | Full-text and term? | Instance level: Title (resource title; alternative title, of all types; index title; series statement), Contributors, Identifier (all); Item HRID; Item UUID; Barcode | Contains all | Performs a “contains all” on full-text fields and an “exact phrase” on identifier fields | Sunflower: Includes “Barcode” |
| 2 | Barcode | item.barcode | Term | Barcode | Exact phrase | | |
| 3 | ISBN | isbn | Term | ISBN, Invalid ISBN | Contains all | | |
| 4 | ISSN | issn | Term | ISSN, Invalid ISSN, Linking ISSN | Exact phrase | | |
| 5 | Effective call number (item), not normalized | itemFullCallNumbers | Term | Prefix + Call number + Suffix? | Exact phrase | | |
| 6 | Effective call number (item), normalized | itemNormalizedCallNumbers | Term | Prefix + Call number + Suffix | Exact phrase | Currently does NOT contain all of the elements that are marked as “Effective call number” on the Item detail view | |
| 7 | Item notes (all) | item.notes.note | Full-text | Notes of all note types and administrative notes | Contains all | | |
| 8 | Item administrative notes | item.administrativeNotes | Full-text | Administrative notes | Contains all | | |
| 9 | Circulation notes | item.circulationNotes.note | Full-text | Circulation notes | Contains all | | |
| 10 | Item HRID | item.hrid | Term | Item HRID | Exact phrase | | |
| 11 | Item UUID | item.id | Term | Item UUID | Exact phrase | | |
| 12 | All | all | Full-text or term? | | Contains all | To search in query: cql.all. Not sure if this truly includes “all” fields, but this particular query includes fields from instances, holdings, and items. To search just item fields, query = cql.allItems | |
| 13 | Query search | N/A | | | | CQL queries constructed from any indexed fields. Can combine record types? | |
| 14 | Advanced search | N/A | | All basic search options | Contains all (operators/modifiers can be changed in modal) | Populated with advanced search query containing human readable operator text | |
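
The “All” rows in the instance, holdings, and item tables name four query indexes. The sketch below shows one way to pick between them and to encode the result as a `query` parameter. The helper names and scope keys are hypothetical, the `all` relation is taken from the “Contains all” default listed on this page, and the behavior of these indexes is still marked as uncertain above.

```python
from urllib.parse import urlencode

# Index names as listed in the "All" rows of the tables on this page.
ALL_INDEXES = {
    "any": "cql.all",
    "instance": "cql.allInstances",
    "holdings": "cql.allHoldings",
    "item": "cql.allItems",
}

def all_fields_query(text: str, scope: str = "any") -> str:
    """Build a contains-all query over the chosen record scope.

    Assumes the text contains no double quotes.
    """
    return f'{ALL_INDEXES[scope]} all "{text}"'

def to_query_string(cql: str) -> str:
    """URL-encode the CQL as a `query` parameter."""
    return urlencode({"query": cql})

cql = all_fields_query("global africa", scope="item")
print(cql)
print(to_query_string(cql))
```

Scoping the query to one record type avoids matches that come only from fields on the other record types.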

 

 

MARC authority

Mapping rules: https://github.com/folio-org/data-import-processing-core/blob/master/src/test/resources/org/folio/processing/mapping/authority/authorityRules.json

MARC authority search options

| # | Search option - UI | Type | Fields included | Default operator | Notes | Updates |
|---|---|---|---|---|---|---|
| 1 | keyword | Full-text | identifiers, personalName, sftPersonalName, saftPersonalName, personalNameTitle, sftPersonalNameTitle, saftPersonalNameTitle, corporateName, sftCorporateName, saftCorporateName, corporateNameTitle, sftCorporateNameTitle, meetingName, sftMeetingName, saftMeetingName, uniformTitle, sftUniformTitle, saftUniformTitle, topicalTerm, sftTopicalTerm, saftTopicalTerm, geographicName, sftGeographicName, saftGeographicName, genreTerm, sftGenreTerm, saftGenreTerm | Contains all | | |
| 2 | Identifier (all) | Term | All identifiers regardless of type, LCCN, natural ID | Exact phrase | | |
| 3 | LCCN | Term | LCCN | Contains all | | |
| 4 | Personal name | Full-text | personalName, sftPersonalName (see from), saftPersonalName (see also) | Contains all | | |
| 5 | Corporate/Conference name | Full-text | corporateName, sftCorporateName (see from), saftCorporateName (see also) | Contains all | | |
| 6 | Geographic name | Full-text | geographicName, sftGeographicName (see from), saftGeographicName (see also) | Contains all | | |
| 7 | Name-title | Full-text | personalNameTitle, sftPersonalNameTitle (see from), saftPersonalNameTitle (see also) | Contains all | | |
| 8 | Uniform title | Full-text | uniformTitle, sftUniformTitle (see from), saftUniformTitle (see also) | Contains all | | |
| 9 | Subject | Full-text | topicalTerm, sftTopicalTerm (see from), saftTopicalTerm (see also) | Contains all | | |
| 10 | Children’s subject heading | Full-text | For LCCNs that start with the prefix “sj”: topicalTerm, sftTopicalTerm (see from), saftTopicalTerm (see also) | Contains all | | |
| 11 | Genre | Full-text | genreTerm, sftGenreTerm (see from), saftGenreTerm (see also) | Contains all | | |
| 12 | Advanced search | | All basic search options | Contains all (operators/modifiers can be changed in modal) | Populated with advanced search query containing human readable operator text | |
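
The per-option rows in this table follow a naming pattern: each heading field is paired with an `sft` (see from) and an `saft` (see also) variant. A minimal sketch of that pattern follows. The helper and the lookup are hypothetical and reflect only the rows listed above; the keyword row shows that not every combination is listed, so the pattern should not be taken as proof that a field exists.

```python
def authority_fields(heading_field: str) -> list[str]:
    """Return the heading field plus its see-from (sft) and see-also (saft) variants."""
    capitalized = heading_field[0].upper() + heading_field[1:]
    return [heading_field, f"sft{capitalized}", f"saft{capitalized}"]

# Hypothetical lookup: UI search option -> heading field, per the rows above.
AUTHORITY_OPTIONS = {
    "Personal name": "personalName",
    "Corporate/Conference name": "corporateName",
    "Geographic name": "geographicName",
    "Name-title": "personalNameTitle",
    "Uniform title": "uniformTitle",
    "Subject": "topicalTerm",
    "Genre": "genreTerm",
}

for option, field in AUTHORITY_OPTIONS.items():
    print(option, "->", ", ".join(authority_fields(field)))
```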