Meeting summary prepared by Kris Maloney
I attended a two-day NISO meeting on the development of standards for metasearching. In my opinion, the meeting was very successful, resulting in a number of key recommendations. It is clear that additional standards will be required in order to complete projects like Scholars Portal. This report describes the meeting.
The morning of the first day began with presentations from a variety of people representing the key stakeholders in this area. The presentations are available on the NISO page. The following are some highlights.
Brenda Bailey-Hainer talked about the state/public library perspective. The slides from her presentation are very close to the points that she made. She talked about the broad customer base that she needed to serve and the variety of services that needed to be integrated into a Portal. There were no big revelations here - just confirmation of what we are learning in the Scholars Portal project.
George Machovec talked about the academic library perspective. He is from the Colorado Alliance of Research Libraries. He talked almost completely from his slides so I won't repeat the content here.
Jenny Walker from ExLibris talked about her perspective as a library systems vendor (more than as a metasearch provider). She raised some interesting questions regarding trends and standard approaches. What is the role of the traditional library catalog? What is the balance between Union Catalog and federated searching? She sees many libraries moving forward with RFPs for metasearching, but for metasearching in combination with other services like ILL, Electronic Reserves and local content management. She mentioned that the University of Amsterdam is using MetaLib as the primary access point to the library. She also mentioned that AARLIN (our Australian counterpart, which moved from FD to ExLibris) is focusing on researchers. (The director from Iowa said that the Scholars Portal project is focusing on undergrads. I commented that each library is selecting its own focus.)
Peter Noerr from Muse Global talked about issues from the metasearch system provider's perspective. Again, he talked from his slides and seemed to hold back any insight that he has gained; it seemed like an advertisement more than a presentation.
Ed Moura from Gale talked about the content aggregators' perspective. His presentation followed his slides. He feels that metasearch should be part of the ILS and wonders why another level of system architecture would be required. Somebody asked if the publishers planned for access via these metasearchers. He said that they do plan for standard access but they are currently having problems because of the increased use of their resources from portals. He said that they have a page that provides information about standard access. A discussion ensued about the increased load on databases caused by metasearch tools: a single user who doesn't know what they are looking for begins acting like 50 users. For example, somebody looking for flea collars in PsycINFO, ERIC, etc. blocks access to the system for others, like a neurosurgeon trying to put in a grant application. Screen scraping seems to be more 'expensive' for vendors. Searches have gone up tenfold but retrieval hasn't gone up at a commensurate rate. He mentioned that you can get a rough estimate of the impact on indexes by considering how many resources are in each profile. For example, if you have 10 resources in a profile, you can estimate that the index provider will see a tenfold increase in searching, because a single search actually produces 10 separate searches.
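The rough estimate described above can be written out as simple arithmetic. The following is only a sketch of that rule of thumb, not something presented at the meeting; the function name and the figure of 500 daily searches are hypothetical.

```python
def estimated_target_searches(user_searches: int, resources_in_profile: int) -> int:
    """Each user search fans out to every resource in the profile."""
    if user_searches < 0 or resources_in_profile < 0:
        raise ValueError("counts must be non-negative")
    return user_searches * resources_in_profile


# One user search against a 10-resource profile produces 10 target searches.
single = estimated_target_searches(1, 10)

# A hypothetical day of 500 user searches against the same profile.
daily = estimated_target_searches(500, 10)

print(single, daily)
```

The estimate ignores caching and any searches a target refuses, so it is an upper bound on the extra load rather than a measurement.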
Marc Krellenstein talked from the publishers' perspective. He admitted that they were, at best, ambivalent about this technology. They have branding issues and, in addition, consider themselves a primary search provider. They do have a standard interface, but it is XQuery-based rather than based on library standards. They would be happy to make their 'standards' available to this group for adoption. They are moving to an XML repository approach to all of their data. There are details related to this on his slides. They don't want to expose their information in a way that is not optimal. They provide primary search as well as content and they see the entire package as being important. They want to be sure that there is appropriate branding and resolution of duplicates (will the appropriate content be delivered or, more essentially perhaps, will their content be delivered). Questions from the audience: ExLibris asked if customers are asking for this, and also if they were limiting this type of access in their licenses. The answer was that they have had only limited interest in this functionality raised to them by the library community, and that they do limit access based on their contract. The slides reflect that Elsevier feels that it is important that they be able to deliver the information that is most relevant. Somebody mentioned that it was odd that the publisher would be able to determine relevance; that seemed like a function for the customer. This raised the question of where relevancy should be determined: in the search or in the results.
The individual groups met for 5 ½ hours over the course of two days. My initial group was Metasearch ID, which finished within 2 hours. I then joined the Result Set group.
Groups reported out intermediate results. See Appendix A. Overall it seemed like groups spent a great deal of time identifying scope and issues. Only the Metasearch ID group reached a recommendation at this point.
Final Recommendations:
Statistics:
- have additional working groups at ALA
- reinforce NISO Z39.7
- inform NISO Z39.7 that metasearching could skew results
- work with library schools to increase understanding of the impact of metasearching
- NISO would examine existing initiatives to determine overlap. Determine what expertise could bring to the table. There are too many standards for vendors to support.
- What one thing would you say would drive the others:
o short term we need a forum with library and librarians
o publication to raise awareness of the problems (particularly that metasearching changes the view of the session)
Access Management:
Recommendations:
- Develop formal use case model to define the problem space in the metasearch environment. Scenario development may be a good first step.
- Examine existing services (e.g., Shibboleth, digital certificates) as they apply to solving the above problem
- Initiate standards development if the above does not result in acceptable solutions
- Be independent of the authentication process: note that if our recommendation for authentication requires a specific scheme or invents a new scheme, it will fail
The core of the problem is this: once an entity (library, university) has authenticated the person, the rest of the world needs to know that this person is certified to use the appropriate resources. We also need to know that the entity (library, university) can be trusted to authenticate. The metasearch engine does not have to be involved in authentication or authorization; it just needs to be able to pass the certification through.
CNI did a white paper a couple of years ago describing the current state of authentication, authorization and certification.
In simple words, the recommendation is to understand the problem and see what existing standards and practices will solve the problem.
Metasearch Identification:
Recommendations:
Two methods of identifying the metasearch engine to the target:
- Special address, URL, port, etc. (e.g., :210 for Z39.50)
- Protocol element, parameter (e.g., &metasearch="code", default="yes/no", other codes depend on target, for http)
Will create a guideline and put it on the NISO website for approval.
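The second method (a protocol parameter on an HTTP request) can be sketched as follows. This is only an illustration under stated assumptions: the parameter name "metasearch" comes from the example above, but the host name, the value "yes", and the rule that a missing, empty or "no" value means an ordinary request are all hypothetical, since the guideline had not yet been written.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tag_as_metasearch(url: str, code: str = "yes") -> str:
    """Return url with a metasearch=<code> query parameter added or replaced."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "metasearch"
    ]
    params.append(("metasearch", code))
    return urlunsplit(parts._replace(query=urlencode(params)))


def is_metasearch_request(url: str) -> bool:
    """Target-side check: any value other than empty or 'no' marks a metasearch engine."""
    params = dict(parse_qsl(urlsplit(url).query, keep_blank_values=True))
    return params.get("metasearch", "no") not in ("", "no")


# Hypothetical target URL; example.org is a placeholder host.
tagged = tag_as_metasearch("http://target.example.org/search?q=flea+collars")
print(tagged)
print(is_metasearch_request(tagged))
```

A target that recognizes the parameter could skip building its full display screens for such requests, which is the cost the Metasearch ID group was concerned about.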
Collection Description:
Goal:
- Helping users find the best resources. Supporting point, search and point, and retrieve.
Open exchange of:
- Collection description (what)
- Service access description (how)
Requires:
- Metadata definitions for both entities
- Schemas for exchange
- Methods to operate on the information
Actions needed:
- Define entities and develop semantic descriptions for these
- Develop schemas to support exchange
- Develop service descriptions
Other work:
- JISC
- ISO Library Directory Standards
- Dublin Core (Collections)
- Digital Reference Standard
- NCIP Implementers
- EAD
- Digital Library Community
- ILL
Recommendations:
- Joint working party with other players active in this arena
- NISO and JISC host meeting to define short term action plan (by this fall)
Searching:
Short-term recommendations:
Information exchange between Metasearch vendor and content vendor
- Info vendor publishes guidelines
- Metasearcher user guidelines
Information exchange between metasearch vendor and content vendor
- License info
- Best practices
- Character set
- Impact on vendor
Best practices
- XML preferred, next best Z39.50
- Offer dumb HTML if no standard available (there is a possibility that we should define some meta tags so that scrapers will have an easier job).
- Make sure native UI is available as a link (so that publishers will accept)
- Metasearch should use session info if available
- Limit frivolous searching (this is the idea that systems in the pipeline would be able to refuse a search if it were deemed frivolous).
- Utilize resource description if available (another proposed standard)
- Start standards work
- Develop shared vocabulary
- Standards for session/authentication management
- Work with SRW/SRU or other web service models to see if metasearch issues can be addressed
Long-term Recommendations:
Standards work:
- Develop a request attribute set to increase server performance
- Move towards less stateful models and standards for session mgmt; e.g. separate authentication
- Dynamic feedback on server status
- Sort out Relevancy Ranking/Result set order issues
- Develop standard XML gateway option (perhaps SRW/SRU or not?) to include:
o Diagnostic/error code standards
o Query languages
o Service description (e.g., WSDL)
o Support non-biblio sources
o Partial vs. complete result options
o Single request/multiple searches
We don't want to deprecate Z39.50.
Result sets:
Context:
- Focus on issues related to inconsistency of descriptive metadata
- Mainly discussed data elements rather than transmission (protocols). Can be separate from search mechanism/protocol
- Validated the needs for administrative metadata
General recommendations:
- Need a next-generation search protocol based on current technologies
- As part of search protocol development consideration should be given to: providing result set metadata, standard single record metadata, both descriptive and administrative
- May need a more informal short-term process to define metadata needs, not necessarily tied to a specific protocol
Short-term:
- Further consider defining metadata that could be added to an HTML document with information about the result set
- Low barrier approach to help metasearch providers get more information for result sets. Participants: interested content and metasearch providers
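The idea of adding result set metadata to an HTML results page could look something like the sketch below. No tag names were agreed at the meeting, so the "resultset." prefix and the three element names are purely hypothetical; the point is only that a metasearch engine could read a few meta tags instead of scraping the display.

```python
from html.parser import HTMLParser


class ResultSetMetaParser(HTMLParser):
    """Collect <meta name="resultset.*" content="..."> tags from a results page."""

    PREFIX = "resultset."  # hypothetical naming convention, not a standard

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        if name.startswith(self.PREFIX):
            # Store the element under its name with the prefix removed.
            self.metadata[name[len(self.PREFIX):]] = attr.get("content") or ""


# A made-up results page carrying three hypothetical result set elements.
page = """<html><head>
<meta name="resultset.total-hits" content="1280">
<meta name="resultset.returned" content="20">
<meta name="resultset.sort-order" content="relevance">
</head><body>...</body></html>"""

parser = ResultSetMetaParser()
parser.feed(page)
print(parser.metadata)
```

Because the tags sit in the page head, a content provider could add them without changing the native display, which fits the low barrier goal.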
Result Set Long-term:
- Define possible data element set that could be used as part of search protocol (best practices)
- Ensure that data elements are considered as part of a standard search protocol
Single Record Short-term:
- Develop a best practice for a basic level of metadata to be available
- Review work of other groups (e.g., Dublin Core, electronic theses and dissertations) for overlap
- Particularly focus on standardizing journal citation data needed for article linking
Single-Record Long-term
- Develop element set for administrative/control information about a single record
- Review existing record schemas to determine if an existing schema can accommodate this data
- Work within an existing structure or develop a standard schema for transmitting administrative data
- Work must be done in conjunction with any protocol development or reference implementations
Open Issues
- How to allow extension to core metadata so that information providers can transmit extra value information
- If cross-database searching is allowed in a single search, how will variations in the result set metadata be handled at the database level
- Can result set metadata be overridden at single record level
- Tension between the need for a simple to implement protocol and the need for rich metadata to provide advanced features
Discussion:
Seems like metasearch id, search and result set are very closely intertwined. Wondering if it is possible to get together and write short-term best practices.
Four major working areas that result from the recommendation (according to Pat Stevens):
NISO is a member driven organization. Activities happen because people are willing to participate in them. Organizations that have large staffs have higher membership fees. They asked if people would be willing to work in the following areas:
- Access Management (several vendors raised hands - 14 people)
- Statistics. At the beginning we are talking about raising awareness. The focal point will be the Z39.7. Only two people volunteered.
- Short-term best practices work on search and retrieve (lots of interest - 17 people)
- Long-term development of improved search and retrieve.
- Resource description. (Two people volunteered).
The planning committee will be meeting to turn this into a formal recommendation.
Appendix A - Interim Reports:
Access Management:
Like an onion. Every time we feel like we are beginning to understand the issues, we come up with something that complicates them.
Problem: Understand the roles and responsibilities of the actors in the meta-search delivery continuum. There are many differing functions and no clear path to a solution; we are beginning to think that not every function may be conducive to a standard.
Definitions:
o Authentication - validation of the user credentials.
o Certification - communicates results of previous authentication (authentication, authenticator, organization, attributes). May be accomplished with digital certificates.
o Validation - confirmation that the entity has the right to use the service.
Assumptions.
Originating organization is responsible for authentication. An authenticated user is certified by the library to access remote services including meta-search systems. The certification is passed through intermediaries (meta-search systems) to the target. The intermediary may make use of the certification attributes. One sign-on provides for certification to many data service providers to execute federated search.
- Desired outcome.
o Single certification could be sent to multiple targets.
o Data service providers must be able to...
o Not another standard.
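The pass-through model in the assumptions above can be sketched as a small data structure. This is an illustration only: the field names follow the Certification definition in these notes, while the class, the function names, the sample values, and the trust check used for validation are hypothetical and do not correspond to any existing scheme.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Certification:
    """Result of a previous authentication, issued by the originating organization."""
    authenticator: str   # who performed the authentication
    organization: str    # originating library or university
    attributes: dict = field(default_factory=dict)  # e.g. user category


def metasearch_forward(cert, targets):
    """Intermediary passes the same certification, unchanged, to every target."""
    return {target: cert for target in targets}


def target_validates(cert, trusted_orgs):
    """Assumed validation rule: accept only organizations the target trusts."""
    return cert.organization in trusted_orgs


# Hypothetical sign-on at the library, then one federated search across two targets.
cert = Certification("campus-login", "Example University", {"role": "student"})
sent = metasearch_forward(cert, ["target-a", "target-b"])
accepted = all(target_validates(c, {"Example University"}) for c in sent.values())
print(accepted)
```

The metasearch system never authenticates anyone here; it only carries the certification along, which matches the desired outcome of a single certification sent to multiple targets.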
Statistics:
Used for...
o Purchasing decision
o Measuring effectiveness and efficiency (quality of service, relevance)
o Gauge value of library to funding agencies
Effects on stats of adding metasearch to information environment:
Does this make us non-compliant? One user can do multiple searches with multiple sessions. Or metasearch engine can create a session and hold it allowing many users to use it. How the metasearch engine is tweaked will create difference in search activity and statistics.
What do libraries really want us to measure?
Demographics, comparative data, outcomes. Support analyzing use of product: database selected, searched, used, hits, etc. How many targets selected (automatic vs. manual)? Support setting of relevance of targets. Search time? Track quality of services? Content providers may want some metrics provided to them.
Search options (this group seemed to have lots of very technical people so they moved quickly to solutions)
- Spent a lot of time figuring out what search options were. Lots of overlap with result sets.
- Search targets: LMS targets, maps, museum data.
- Could we define a metasearch standard that is independent of the transmission carrier?
- Search problems: server overload, search continuation, relevancy ranking, consistency with native interface
- Proposing three standards. Most people prefer XML (SRW/SRU), then Z39.50, and then native.
Metasearch identification:
Problem: Resource providers are doing a lot of work to build screens that are immediately discarded by the metasearch providers. Demand has also increased because of multiple search requests from metasearch engines.
Can we set up a mechanism that will allow a metasearch engine to identify itself?
Set up a special address for metasearch.
Use special parameters.
- Collection description:
o Collection - one or more digital or physical items
o Scope - helping people search smarter
- physical and digital materials
- libraries, museums, etc.
- both descriptive and service descriptions
- levels of standard needed (data elements, services for input/retrieval/update, policies)
Also developed a diagram. User is using a metasearch engine as an agent. Metasearch agent uses a collection description service that has a bunch of records about collections and how they are accessed. That gives the metasearch agent some information about what collections to present to the user. It might also be used to suggest additional searches when the user has sub-standard results. They have a pilot underway in the UK to build a repository of these kinds of collection descriptions.
Result set:
- Added additional scope to include information about information that is only relevant to the result set.
- Kept going back to the idea that we need to have a well defined record-level description especially when we incorporate information other than bibliographic.
- Tension between long-term and short-term solutions. Really trying to look at a quick way to get out of the screen scraping trap.
- Are having trouble separating these issues from issues related to search.