Retrieve bibliographical records with Python via OAI-PMH
In this post we look at how to retrieve bibliographical records with Python from a library catalogue via the OAI-PMH protocol. It is also an experiment on my side: can I use the platform’s formatting options (which support code explanation only in a minimalist way) creatively enough that readers can follow along? We’ll see; if you have any suggestions for improvement, leave a comment or contact me elsewhere.
What is OAI-PMH? Its definition reads as follows:
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data Providers are repositories that expose structured metadata via OAI-PMH. Service Providers then make OAI-PMH service requests to harvest that metadata. OAI-PMH is a set of six verbs or services that are invoked within HTTP.
In short, OAI-PMH is a standard API for retrieving records. It is very popular in the cultural heritage world: many libraries, archives and museums provide data access this way.
The protocol provides 6 ‘verbs’ or endpoints to retrieve information:
GetRecord: to retrieve an individual metadata record
Identify: information about a repository
ListIdentifiers: to harvest record identifiers
ListMetadataFormats: to retrieve the metadata formats available from a repository. These formats are alternative XML representations of the same content, but it also happens that a library provides different content via different formats. With the `ListIdentifiers` and `ListRecords` verbs the chosen format is passed as the `metadataPrefix` parameter. At least one format, Dublin Core, is mandatory.
ListRecords: to harvest records
ListSets: to retrieve the set structure of a repository. Sets are optional collections within a repository, e.g. there might be individual sets according to the physical collections of a library.
The response of these verbs is always XML. For the List* verbs the maintainer of the service sets a page size, so you do not retrieve all records at once, only a limited number (e.g. 100). Pagination is implemented via a distinct XML element called resumptionToken, whose value the client should send with the next call. The value can change from call to call, so the client always has to extract it. In the above links you can find more detailed information and examples.
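To make the pagination mechanism tangible before we turn to a dedicated client library, here is a minimal sketch that issues the ListRecords requests “by hand” with the standard library and lxml, and follows the resumptionToken from page to page. It uses the DataLab endpoint we will harvest later in this post and the mandatory Dublin Core format (oai_dc); the early break is only there so that the demonstration does not walk through the whole catalogue.

from urllib.parse import urlencode
from urllib.request import urlopen
from lxml import etree

BASE_URL = 'https://data.digar.ee/repox/OAIHandler'
OAI = {'oai': 'http://www.openarchives.org/OAI/2.0/'}

params = {'verb': 'ListRecords', 'metadataPrefix': 'oai_dc'}
pages = 0
while True:
    with urlopen(BASE_URL + '?' + urlencode(params)) as response:
        tree = etree.fromstring(response.read())
    records = tree.findall('.//oai:record', namespaces=OAI)
    print('page %d: %d records' % (pages + 1, len(records)))
    # the resumptionToken element is present (and non-empty) as long as there are further pages
    token = tree.findtext('.//oai:resumptionToken', namespaces=OAI)
    pages += 1
    if not token or pages == 3:  # stop after a few pages; this is only a demonstration
        break
    # for the next page only the verb and the token are sent
    params = {'verb': 'ListRecords', 'resumptionToken': token}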
There are different OAI-PMH client implementations in Python; we use Mathias Loesch’s Sickle library (version 0.7.0) to retrieve the Estonian National Bibliography from DataLab, the data sharing platform of the Estonian National Library. At the time of writing it returns almost 410 000 MARCXML records. We save them into local files, each containing 100 000 records.
As OAI-PMH is based on XML technology we use the lxml library (version 6.0.0), a lightweight XML and HTML processing toolbox.
import io
import sys
import os
from lxml import etree
from sickle import Sickle
from sickle.iterator import OAIResponseIterator
if len(sys.argv) == 1:  # <1>
    print('Please give a directory name')
    exit()
dir = sys.argv[1]  # <2>
if not os.path.exists(dir):
    os.mkdir(dir)
print('saving to %s' % (dir))
namespaces = {  # <3>
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'marc21': 'http://www.loc.gov/MARC21/slim'
}
header = '<?xml version="1.0" encoding="utf8"?>' + "\n" \
    + '<collection>' + "\n"  # <4>
footer = '</collection>'

When we run Python, the `sys.argv` list contains the arguments we passed to the interpreter. The `len()` function returns the number of elements in a collection, so here it gives the number of arguments. The first element is the name of the current script itself, so if the length is one, the script did not receive any additional argument. We should pass at least one argument: the name of the directory where the script will store the records. If we don’t provide it, the script prints a message and stops running (with the `exit()` function). If we do provide an argument, it is used as the name of the output directory. The script then creates the directory if it does not already exist, and notifies us where it will save the files.
The `namespaces` variable registers the XML namespaces the process needs; otherwise the XPath expressions would not work. The `header` and `footer` are needed at the beginning and end of the output XML files. XML files are hierarchical: here the main content is a list of MARC records, and their parent element will be `<collection>`.
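To make the argument handling concrete, here is a small sketch. It assumes the script was saved as harvest.py (a hypothetical file name) and started as `python harvest.py estonia`:

import sys

# started as: python harvest.py estonia
print(len(sys.argv))  # 2 -> there is one extra argument besides the script name
print(sys.argv[0])    # 'harvest.py', the script itself
print(sys.argv[1])    # 'estonia', this becomes the output directory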
To save the records we have to do several things, like naming the output file and inserting the header and footer. We will save records several times, and for such a repeatable task it is always a good idea to create a function.
def write_output(xmlrecords, file_counter, dir_name):  # <1>
    """
    Writes MARC records to file

    Parameters
    ----------
    xmlrecords : list
        A list of strings containing the individual MARCXML records
    file_counter : int
        A file name counter
    dir_name : str
        The name of the output directory
    """
    file_name = f'{file_counter:06}.xml'  # <2>
    file_path = os.path.join(dir_name, file_name)
    print(file_path)  # <3>
    with io.open(file_path, 'w', encoding='utf8') as f:  # <4>
        f.write(header)  # <5>
        f.write("\n".join(xmlrecords) + "\n")  # <6>
        f.write(footer)  # <7>

The `write_output()` function creates the output files in the specified directory. It writes the XML header, joins together and writes the XML records, and finally writes the XML footer. Creating the output file name might look a bit complicated at first sight. The `os.path.join()` function is a safe way to create a path on your operating system: you might know that the directory separator character is different in Windows (`\`) than on Mac and Linux (`/`). With this function Python returns the path syntax that fits your OS. Here it has two parameters: a directory name and a file name. We generate the latter with the f-string syntax. `{file_counter:06}` returns a string that takes the value of the `file_counter` variable (a number) and pads it with zeros on the left up to six characters, so 1 becomes 000001, 1000 becomes 001000, etc. The last part of the f-string appends the file extension. Just to log the progress, the function prints out the generated path.
opens the file in writing mode with UTF-8 character encoding
first, writes out the above defined XML header
then, joins the individual XML records to a single string and writes it (alternatively we could iterate over the records and write them one by one).
finally, writes out the XML footer (the closing collection tag).
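A quick illustration of the zero-padding and the path building (the directory name estonia is only an example):

import os

print(f'{1:06}')                               # 000001
print(f'{1000:06}')                            # 001000
print(os.path.join('estonia', f'{1:06}.xml'))  # estonia/000001.xml (estonia\000001.xml on Windows)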
Then we create some variables to store the harvested records and to keep status information, then initialize and start the harvester.
xmlrecords = []  # <1>
file_counter = 0
record_counter = 0
sickle = Sickle('https://data.digar.ee/repox/OAIHandler',
                iterator=OAIResponseIterator, max_retries=4)  # <2>
it = sickle.ListRecords(metadataPrefix='marc21xml', set='erb')  # <3>

Initialisation of the necessary variables: the collector of the individual records, the output file counter and the record counter.
The Sickle harvester is initialized with the endpoint of the OAI-PMH service and two extra arguments: `iterator=OAIResponseIterator` makes the harvester yield whole OAI-PMH responses (instead of single records), and `max_retries` specifies how many times it may retry fetching a single URL. The latter is needed because sometimes there are communication problems between the server and the client (due to problems on either side or in the network).

Start the harvest using the `ListRecords` verb. It returns an iterator, so we do not have to take care of the pagination with individual HTTP calls and the resumptionToken mentioned above; it is handled automatically by the Sickle module. We use two arguments: we set the metadata schema to MARCXML and the set to `erb`, which stands for the Estonian National Bibliography. These values can be found through a preliminary investigation of the supported values returned by the `ListMetadataFormats` and `ListSets` verbs, as the sketch below shows.
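Such an investigation can be done with the same Sickle client. A possible sketch (the exact values printed depend on the repository):

from sickle import Sickle

sickle = Sickle('https://data.digar.ee/repox/OAIHandler')

# list the metadata formats the repository supports
for metadata_format in sickle.ListMetadataFormats():
    print(metadata_format.metadataPrefix)         # e.g. oai_dc, marc21xml, ...

# list the sets (collections) the repository exposes
for oai_set in sickle.ListSets():
    print(oai_set.setSpec, '-', oai_set.setName)  # e.g. erb - Estonian National Bibliography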
In the following the script iterates over the individual XML responses of the calls.
for content in it:  # <1>
    tree = etree.ElementTree(content.xml)  # <2>
    records = tree.xpath(
        '//oai:record/oai:metadata/marc21:record',  # <3>
        namespaces=namespaces)
    for record in records:  # <4>
        xmlrecord = (  # <5>
            etree
            .tostring(record, encoding='utf8',
                      method='xml')  # <6>
            .decode('utf-8')  # <7>
            .replace(
                "<?xml version='1.0' encoding='utf8'?>\n",
                '')  # <8>
        )
        xmlrecords.append(xmlrecord)  # <9>
        record_counter += 1  # <10>
        if len(xmlrecords) >= 100000:  # <11>
            write_output(xmlrecords, file_counter, dir)  # <12>
            xmlrecords = []  # <13>
            file_counter += 1

Iterates over the responses. We do not need to issue the individual HTTP requests; the underlying library, Sickle, does it for us automatically.
The XML content of the response is turned into an XML element tree, which is an internal data structure with its own API.
Selection of the individual records with an XPath expression. For the expression we have to provide the XML namespaces. In OAI-PMH an individual record (`<oai:record>`) has two parts: a header, which contains the record identifier and other metadata (such as the set it belongs to and the date of last modification), and the actual record (`<oai:metadata>`). The latter is the container of the MARCXML record (`<marc21:record>`); see the schematic response after this list. If you would like to apply this script to another OAI-PMH server, please check the namespaces it uses. The prefixes themselves are not important, you can name them as you like (although keeping them as they appear in the source, i.e. in the XML response of the service, helps when debugging problems), but you should always use the namespace URLs used by the service.

Iterates over the individual records that `xpath()` returned.

Transformation of the above mentioned tree structure of a record into an XML string via a sequence of steps:
creates the XML string...
decodes the UTF-8 byte sequence into a Python string...
removes the XML declaration – we do not need it for each record.
adds the record into the record collection
increases the record counter by one
if the record collection has reached 100 000 records...
the records are written to a file...
then it empties the collection and increases the file counter by one
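For reference, here is a schematic (not verbatim) ListRecords response, showing the elements the XPath expression navigates and the resumptionToken mentioned earlier; the actual field contents are left out:

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>...</identifier>
        <datestamp>...</datestamp>
        <setSpec>erb</setSpec>
      </header>
      <metadata>
        <record xmlns="http://www.loc.gov/MARC21/slim">
          <!-- the MARCXML fields of one bibliographic record -->
        </record>
      </metadata>
    </record>
    <!-- ... more record elements ... -->
    <resumptionToken>...</resumptionToken>
  </ListRecords>
</OAI-PMH>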
During the iteration we collect records but save the list to a file only when its size reaches the limit, so there is a chance that when the iteration ends we still have unsaved records. We should save them as well and print a report.
if len(xmlrecords) > 0:  # <1>
    write_output(xmlrecords, file_counter, dir)
    file_counter += 1
print('saved %d records to %d files' % (record_counter, file_counter))  # <2>

After the iteration, if the record collection is not empty, write it to a file and increase the file counter so that it reflects the number of files written.
finally notify the user about the number of records ingested and files written
If you reached this point without errors, congratulations, you have downloaded the Estonian National Bibliography’s MARCXML records. If Python throws errors during the process, or the explanation is not clear enough, please let me know.
A final note: while you can use OAI-PMH to download records from many libraries, the protocol has some downsides:
you cannot select or search for records; you depend on what the organisation has selected for you via the sets of the protocol
the iteration over the records is serial, and you cannot skip records or jump to the n-th record. The resumption token gives you the key for the next iteration only. Some implementations, however, include a counter in the token that you might “hack” by manipulating it. Sometimes this works, but it is not supported by the protocol itself. In this post we did not work with the resumption token directly; you will have to do some investigation yourself to figure out how to implement such a hack when using Sickle or another Python library.
it sometimes happens that the server stops responding, responds with an error, or even invalidates the resumption token. In that case you can retry the same URL request (we saw the max_retries parameter when Sickle was initialized). If that does not work, you have two options: either hack the resumption token to jump ahead (but, as mentioned before, this is not possible if there is no counter in it) or start over.

