Getting Started with LR Data Services
What Was Data Services?
- Make it easier to get relevant data.
- Get clean data.
- In depth scrutiny of the LR data.
- I.e., schemas and signatures validated, envelopes filtered.
- Mostly external, complementary to an LR Node.
What Is LR Data Services Now?
A Way to Get the Data You Want
Only interested in one specific kind of data? Data Services aims to provide a way to extract the data that's relevant to you through simple customization.
It's A Design Pattern, With Some "Batteries Included".
- Follow simple conventions to identify discriminators within the LR documents
- Reuse or modify community-sourced libraries to aid in extracting data for your use case
- Install your data service code into an LR Node's CouchDB
- Access your custom solution via the extract HTTP service API installed on the node.
A Path Towards Making LR use Fewer Resources
We know LR can potentially hold mountains of data, and several of its data extraction services consume large amounts of storage space. Data Services is a way to extract data in a very focused manner, which should result in substantial storage savings.
What Is a Discriminator?
A discriminator in Data Services is a characteristic of the data that you have identified and want to capture.
- This allows you to exclude data that you do not know how to process or do not recognize as quality information.
- For example, if you were only interested in ASN URLs contained within the <conformsTo> element of XML metadata that uses a Dublin Core Terms schema, the identified ASN URL becomes your discriminator.
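As a rough illustration of the example above, the sketch below pulls candidate discriminators out of a Dublin Core Terms XML payload. The function name `extractConformsTo`, the `dct:` prefix in the sample, and the regular-expression approach are assumptions made for illustration; they are not taken from the LR code base, and a real implementation would probably use a proper XML parser.

```javascript
// Hypothetical helper: collect the text of every <conformsTo> element,
// with or without a namespace prefix, and keep only values that look
// like ASN URLs.
function extractConformsTo(xml) {
  var pattern = /<(?:\w+:)?conformsTo[^>]*>([^<]*)<\/(?:\w+:)?conformsTo>/g;
  var found = [];
  var match;
  while ((match = pattern.exec(xml)) !== null) {
    var value = match[1].trim();
    // Assumption: ASN URLs live under purl.org/ASN/resources.
    if (/^https?:\/\/purl\.org\/ASN\/resources\//.test(value)) {
      found.push(value);
    }
  }
  return found;
}

// Made-up sample payload: one ASN URL and one free-text alignment.
var sample =
  '<dc xmlns:dct="http://purl.org/dc/terms/">' +
  '<dct:conformsTo>http://purl.org/ASN/resources/S1000132</dct:conformsTo>' +
  '<dct:conformsTo>Some state standard, described in prose</dct:conformsTo>' +
  '</dc>';

console.log(extractConformsTo(sample));
```

Only the first element survives the filter; the free-text value is exactly the kind of data a discriminator lets you exclude.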
What Can Data Services Do?
Following the conventions outlined, you will be able to extract data using:
- Resource Locator by Discriminator
- Resource Locator by start of Discriminator
- Resource Locator by Timestamp
- Discriminator by Resource Locator
- Discriminator by start of Resource Locator
- Discriminator by Timestamp
- Discriminated Resource Locator by Timestamp
- All Discriminated Resources
The Alignment to Standards Prototype Example
The prototype implementation provides a Data Service that allows the extract service to request resource data and aggregations where ASNs are Discriminators, `resource_locator`s are Resource Locators, and `node_timestamp`s are Timestamps, which will enable us to get:
- `resource_locator` by ASN
- `resource_locator` by start of ASN
- `resource_locator` by `node_timestamp`
- ASN by `resource_locator`
- ASN by start of `resource_locator`
- ASN by `node_timestamp`
- ASN `resource_locator`s by `node_timestamp`
- All ASN `resource_locator`s
Conventions
To follow the K.I.S.S. principle, we're adopting a set of conventions to make adding new data services simple. These conventions are expressed in a CouchDB design document.
- A field named dataservice must be present so the extract service can locate the implementation. The field contains a map that gives the name and description of the data service. More details on this are TBD!
- Views must follow a specific naming convention.
- View functions must emit specific keys according to the view they implement.
- Timestamps must be represented as seconds from epoch.
- List functions should be named after their output format. The prototype includes a reference sample, to-json.
{
  "_id": "_design/standards-alignment",
  "dataservice": {
    "name": "Standards Alignment Data Service",
    "description": "This is where I would document how this data service works."
  },
  "views": {
    "discriminator-by-resource": {
      "map": "function (doc) { emit([doc.resource_locator, getDescriminator(doc), getEpochTimestamp(doc)], null); }"
    },
    "discriminator-by-resource-ts": {
      "map": "function (doc) { emit([doc.resource_locator, getEpochTimestamp(doc), getDescriminator(doc)], null); }"
    },
    "resource-by-discriminator": {
      "map": "function (doc) { emit([getDescriminator(doc), doc.resource_locator, getEpochTimestamp(doc)], null); }"
    },
    "resource-by-discriminator-ts": {
      "map": "function (doc) { emit([getDescriminator(doc), getEpochTimestamp(doc), doc.resource_locator], null); }"
    },
    "resource-by-ts": {
      "map": "function (doc) { emit([getEpochTimestamp(doc), doc.resource_locator], null); }"
    },
    "discriminator-by-ts": {
      "map": "function (doc) { emit([getEpochTimestamp(doc), getDescriminator(doc)], null); }"
    }
  },
  "lists": { "to-json": "function(head, req) { ... }" }
}
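The views above call a helper named getEpochTimestamp whose body is not shown. A minimal sketch of what such a helper might look like follows, assuming `doc.node_timestamp` holds an ISO 8601 string; the actual helper in the prototype may differ, and older JavaScript engines may not parse ISO 8601 through `Date.parse`.

```javascript
// Sketch of a timestamp helper for the "seconds from epoch" convention.
// Assumption: doc.node_timestamp is an ISO 8601 string such as
// "2012-02-28T16:59:31Z".
function getEpochTimestamp(doc) {
  var millis = Date.parse(doc.node_timestamp);
  if (isNaN(millis)) {
    return null; // missing or unparseable timestamp
  }
  return Math.floor(millis / 1000); // milliseconds -> whole seconds
}

console.log(getEpochTimestamp({ node_timestamp: "2012-02-28T16:59:31Z" }));
```

Returning null for a bad timestamp is a design choice of this sketch; a map function could equally skip such documents instead of emitting a key.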
How Map Functions Work
In LR, each record is a JavaScript object, much like the one displayed below. The map function processes each individual object once for each view.
A Resource Data Document
How Map Functions Work: Try It
A map function takes one argument, the Resource Data Document. The function should perform the following tasks against the document:
- Evaluate the document to determine if the document should be included in the index. Does it meet some set of criteria specific to your use case?
- If the document should be included, define the structure of the keys and values in the index. Remember to follow the key conventions defined earlier.
Below is a sample implementation of a data services map function. Try editing it and click Run to learn what gets created in the index.
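For readers without the interactive editor, here is a minimal sketch of the two tasks described above, written as a resource-by-discriminator map function. The `emit` stand-in, the use of a `doc.keys` array as the source of discriminators, and the sample document are assumptions for illustration only; the prototype's own map functions may select discriminators differently.

```javascript
// Stand-in for CouchDB's emit() so the sketch runs outside CouchDB.
var index = [];
function emit(key, value) {
  index.push({ key: key, value: value });
}

// Sketch of a "resource-by-discriminator" map function.
// Assumption: discriminators are the entries of doc.keys that look like
// ASN URLs.
function map(doc) {
  // Task 1: decide whether the document belongs in the index.
  if (!doc.resource_locator || !doc.node_timestamp || !doc.keys) {
    return;
  }
  var ts = Math.floor(Date.parse(doc.node_timestamp) / 1000);
  if (isNaN(ts)) {
    return;
  }
  // Task 2: emit one key per discriminator, following the convention
  // [discriminator, resource locator, timestamp] with a null value.
  for (var i = 0; i < doc.keys.length; i++) {
    if (doc.keys[i].indexOf("http://purl.org/ASN/resources/") === 0) {
      emit([doc.keys[i], doc.resource_locator, ts], null);
    }
  }
}

// Made-up sample document.
map({
  resource_locator: "http://www.shodor.org/interactivate/activities/Advanced",
  node_timestamp: "2012-02-28T16:59:31Z",
  keys: ["math", "http://purl.org/ASN/resources/S1000132"]
});
console.log(JSON.stringify(index));
```

The key "math" is ignored because it does not match the discriminator test, so the index receives a single entry for this document.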
The Extract Service Output Specification
The basic JSON format for the extract service output is specified using the JSON Schema Internet Draft.
The Extract Service Example Output
This is an example of the data service output with doc_IDs being returned.
List Functions in Detail
List functions are responsible for formatting the response for the service output.
As the previous slides defined and displayed, there are two basic parts to the response: the response wrapper and the results.
Data Service Response Wrapper
The list function will need to group each result in the "documents" property of the response wrapper.
Review of Prototype List Implementation
List functions group results and embed results into the Extract service's format for records in the documents list. Because the amount of data requiring processing could be large, we must try to always design list functions to buffer little and mostly stream.
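The "buffer little, mostly stream" idea can be sketched as follows. Inside CouchDB, `start()`, `send()` and `getRow()` are provided as globals; here they are passed in so the sketch can run anywhere. The grouping rule (first key element) and the `group` field name are assumptions for illustration, and the wrapper is simplified relative to the prototype's to-json.

```javascript
// Sketch of a list function that holds only one group in memory at a time.
function makeToJson(start, send, getRow) {
  return function (head, req) {
    start({ headers: { "Content-Type": "application/json" } });
    send('{"documents":[');
    var current = null; // the only group buffered at any moment
    var sentAny = false;
    function flush() {
      if (current === null) return;
      send((sentAny ? "," : "") + JSON.stringify(current));
      sentAny = true;
    }
    var row;
    while ((row = getRow())) {
      var groupKey = row.key[0];
      // Rows arrive sorted by key, so a new first element starts a new group.
      if (current === null || current.group !== groupKey) {
        flush(); // stream the finished group before starting the next
        current = { group: groupKey, resource_data: [] };
      }
      current.resource_data.push(row.id);
    }
    flush();
    send("]}");
  };
}

// Tiny stand-ins for the CouchDB globals, for demonstration only.
var rows = [
  { id: "doc1", key: ["a", "http://example.org/1", 1330448371] },
  { id: "doc2", key: ["a", "http://example.org/2", 1330448372] },
  { id: "doc3", key: ["b", "http://example.org/3", 1330448373] }
];
var output = "";
var toJson = makeToJson(
  function () {},
  function (chunk) { output += chunk; },
  function () { return rows.shift(); }
);
toJson({}, {});
console.log(output);
```

Because each group is sent as soon as the next one begins, memory use depends on the largest group rather than on the size of the whole result.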
Customize or Make your Own List Function
The prototype implementation should be an adequate skeleton for enhancing the prototype or building your own list function from scratch.
- Start out by copying one of the samples.
- Don't forget any required files within the lib folder. These helper functions have been abstracted to accommodate varied discriminator formats. In most cases, unless you need supplemental_data, you can use the default to-json as is, without modification.
- Remember to stream as much as you can. This will help keep your system resources under control.
The Extract Service
Provides a common HTTP interface to access data services with simplified parameters.
- It bridges simple query parameters and the raw CouchDB view and list functions, so you don't have to understand much of the complexity of CouchDB's query system. The Extract service transforms your parameters into the equivalent CouchDB query.
- If you run your own node and understand CouchDB's API, nothing prevents you from using the views directly for queries the extract service does not support.
The Extract Service HTTP Request Format
The Basics
GET /extract/<data service name>/<view name>[?<DS Query Params>]
Custom List functions (AKA Roll Your Own Output Format)
GET /extract/<data service name>/<view name>/format/<your function name>[?<DS Query Params>]
The DS Query Params
Parameter | Description |
---|---|
from | ISO 8601 formatted timestamp for the start of the range. |
until | ISO 8601 formatted timestamp for the end of the range. |
resource | The resource locator for which you wish to harvest data. |
discriminator | The discriminator for which you wish to harvest data. |
resource-starts-with | A resource locator prefix; data is harvested for every resource locator that starts with the specified value (e.g. resource-starts-with=http://shodor.org will return all resources from http://shodor.org). |
discriminator-starts-with | A discriminator prefix; data is harvested for every discriminator that starts with the specified value. |
ids_only | When present, the resource_data values are a list of doc_IDs instead of full resource_data documents (the default behavior). |
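The request format and parameters above can be combined with a small helper like the one sketched here. The function name `buildExtractPath` and its argument order are hypothetical; the sketch only assembles and percent-encodes the path and does not contact a node.

```javascript
// Hypothetical helper that assembles an extract request path.
// `format` is optional and selects a custom list function.
function buildExtractPath(service, view, params, format) {
  var path = "/extract/" + encodeURIComponent(service) +
             "/" + encodeURIComponent(view);
  if (format) {
    path += "/format/" + encodeURIComponent(format);
  }
  var pairs = [];
  for (var name in params) {
    if (Object.prototype.hasOwnProperty.call(params, name)) {
      // Percent-encode values so JSON brackets and quotes survive transport.
      pairs.push(encodeURIComponent(name) + "=" + encodeURIComponent(params[name]));
    }
  }
  return pairs.length ? path + "?" + pairs.join("&") : path;
}

console.log(buildExtractPath(
  "standards-alignment-lr-paradata",
  "resource-by-discriminator",
  { discriminator: '["matched"]', ids_only: "true" }
));
```

Encoding the values is a choice of this sketch; whether a given node also accepts unencoded brackets and quotes in the query string is not something the sketch assumes.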
Extract Service Parameter Matrix
Data Services aims to let you narrow in on your data, so not all parameters work with all data service views. Here is a breakdown of which parameters work with each view. You need not be concerned about the characteristics of the data returned, because the map functions have already taken care of that for you. If the data service map functions only emit keys for documents that contain the word "exciting", then anything this service returns will be, at the very least, "exciting". The parameters just control how much to return.
View | Parameter Set | Description |
---|---|---|
discriminator-by-resource | resource | Get a list of discriminators for a specific resource locator. |
discriminator-by-resource | resource-starts-with | Get a list of discriminators where the resource locator starts with a specified prefix. |
discriminator-by-resource-ts | resource, from, until | Get a list of discriminators for a specific resource locator within a specified period of time. |
discriminator-by-resource-ts | resource-starts-with | Get a list of discriminators where the resource locator starts with a specified prefix. Timestamps are included in the output. |
discriminator-by-ts | from, until | Get a list of discriminators for a specified period of time. |
resource-by-discriminator | discriminator | Get a list of resource locators for a specified discriminator. |
resource-by-discriminator | discriminator-starts-with | Get a list of resource locators where the discriminator starts with a specified prefix. |
resource-by-discriminator-ts | discriminator, from, until | Get a list of resource locators for a specified discriminator within a specified period of time. |
resource-by-discriminator-ts | discriminator-starts-with | Get a list of resource locators where the discriminator starts with a specified prefix. Timestamps are included in the output. |
resource-by-ts | from, until | Get a list of resource locators for a specified period of time. |
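To make the matrix more concrete, the sketch below shows one way parameters could be translated into a CouchDB key range for the "-ts" views, whose keys look like [subject, timestamp, other]. This is an assumption about how such a translation might work, not the extract service's actual code; it treats the discriminator or resource as a plain string and relies on CouchDB's view collation, in which null sorts before numbers and objects sort after everything else.

```javascript
// Convert an ISO 8601 string to whole seconds since the epoch.
function toEpochSeconds(iso) {
  return Math.floor(Date.parse(iso) / 1000);
}

// Sketch: map extract parameters to startkey/endkey for a "-ts" view.
function toKeyRange(params) {
  var exact = params.discriminator || params.resource;
  var prefix = params["discriminator-starts-with"] ||
               params["resource-starts-with"];
  if (exact) {
    // An open-ended range uses null (lowest) and {} (highest) as bounds.
    var from = params.from ? toEpochSeconds(params.from) : null;
    var until = params.until ? toEpochSeconds(params.until) : {};
    // The trailing {} keeps keys stamped exactly at `until` inside the range.
    return { startkey: [exact, from], endkey: [exact, until, {}] };
  }
  if (prefix) {
    // "\ufff0" is a high code point commonly used to close a prefix range.
    return { startkey: [prefix], endkey: [prefix + "\ufff0"] };
  }
  return {}; // no restriction
}

console.log(JSON.stringify(toKeyRange({
  discriminator: "http://purl.org/ASN/resources/S1000132",
  from: "2012-02-28T16:59:31Z",
  until: "2012-02-28T16:59:31Z"
})));
```

The resulting values would be JSON-encoded and sent to the view as CouchDB's `startkey` and `endkey` query parameters. Views keyed only by timestamp would need a different, simpler range.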
Extract Service API - Try It
Here are some example requests that you can try against a real data service install. Click on a line to populate the example into the input box, or edit it by hand. Click Run to execute the request. This can take a bit of time and may cause your browser to complain. It's okay to click "Wait" or "Continue" if prompted by your browser until the request completes.
GET /extract/standards-alignment-lr-paradata/resource-by-discriminator?ids_only&discriminator=["matched"]
GET /extract/standards-alignment-lr-paradata/resource-by-discriminator?ids_only&discriminator-starts-with=["matched","http://purl.org/ASN/resources/S"]
GET /extract/standards-alignment-dc-conformsTo/discriminator-by-resource?resource-starts-with=http://www.shodor.org/interactivate/activities/Advanced
GET /extract/standards-alignment-lr-paradata/resource-by-discriminator-ts?ids_only=true&discriminator=["matched","http://purl.org/ASN/resources/S1000132"]&from=2012-02-28T16:59:31Z&until=2012-02-28T16:59:31Z
Next Steps
- Check out the code on Github:
- Ask Questions on the LR Dev Google Group: