Framing the Risk Management Framework: Actionable Instructions by NIST in the “Measure” Section of the AI RMF

April 13, 2023 | Ben Winters, Senior Counsel, and Grant Fergusson, Equal Justice Works Fellow

Note: This piece is part of a series examining NIST’s A.I. Risk Management Framework. If you missed our previous parts, click here for our introduction to the “Govern” function, click here for our introduction to the “Manage” function, and click here for our introduction to the “Map” function.

Released on January 26, 2023 by the National Institute of Standards and Technology (NIST), the A.I. Risk Management Framework is a four-part, voluntary framework intended to guide the responsible development and use of A.I. systems. At the core of the framework are recommendations divided into four overarching functions: (1) Govern, which covers top-level policy decisions and organizational culture around A.I. development; (2) Map, which covers efforts to contextualize A.I. risks and potential benefits; (3) Measure, which covers efforts to assess and quantify A.I. risks; and (4) Manage, which covers the active steps an organization should take to mitigate risks and prioritize elements of trustworthy A.I. systems. In addition to the core Framework, NIST also hosts supplemental resources like a community Playbook to help organizations navigate the Framework. Over the next few weeks, EPIC will combine the work we’ve done to distill the A.I. RMF’s instructions into a deeper framework for analyzing, contextualizing, and implementing the A.I. RMF’s key recommendations.

The Measure Function of the A.I. Risk Management Framework urges companies to build and deploy A.I. systems carefully, centering human experience and a wide range of impact points, including environmental impacts and impacts on civil rights and civil liberties. In particular, it calls for regular testing of systems for validity, reliability, transparency, accountability, safety, security, and fairness.

EPIC believes the main suggested actions under the Measure Function can be encompassed by four recommendations:

  • Identify and document testing procedures and metrics to measure both A.I. trustworthiness and significant A.I. risks.
  • Regularly assess and update testing procedures and metrics to ensure their effectiveness and responsiveness to emergent A.I. risks.
  • When developing and evaluating A.I. systems, prioritize efforts to improve system validity, reliability, explainability, transparency, accountability, safety, security, and fairness.
  • Incorporate human research requirements, legal frameworks, feedback from diverse stakeholders, perspectives from impacted communities, and the risk of environmental harms into the metrics and testing procedures used to evaluate A.I. systems.

A breakdown of where and how each suggested action within the Measure Function maps onto these four recommendations is provided below.

Identify and document testing procedures and metrics to measure both A.I. trustworthiness and significant A.I. risks under different conditions.

  • [Measure 1.1] Establish approaches for detecting, tracking and measuring known risks, errors, incidents or negative impacts.
  • [Measure 1.1] Identify testing procedures and metrics to demonstrate whether or not the system is fit for purpose and functioning as claimed.
  • [Measure 1.1] Identify testing procedures and metrics to demonstrate AI system trustworthiness.
  • [Measure 1.1] Document metric selection criteria and include considered but unused metrics.
  • [Measure 1.1] Monitor AI system external inputs including training data, models developed for other contexts, system components reused from other contexts, and third-party tools and resources.
  • [Measure 1.1] Report metrics to inform assessments of system generalizability and reliability.
  • [Measure 1.1] Document risks or trustworthiness characteristics identified in the Map function that will not be measured, including justification for non-measurement.
  • [Measure 1.2] Develop and utilize metrics to monitor, characterize and track external inputs, including any third-party tools.
  • [Measure 1.2] Collect and report software quality metrics such as rates of bug occurrence and severity, time to response, and time to repair (See Manage 4.3).
  • [Measure 1.3] Track processes and measure and document change in AI system performance.
  • [Measure 2.2] Utilize disaggregated evaluation methods (e.g., by race, age, gender, ethnicity, ability, region) to improve AI system performance when deployed in real-world settings (see the disaggregated evaluation sketch following this list).
  • [Measure 2.2] Establish thresholds and alert procedures for dataset representativeness within the context of use.
  • [Measure 2.3] Regularly test and evaluate systems in non-optimized conditions, and in collaboration with AI actors in user interaction and user experience (UI/UX) roles.
  • [Measure 2.3] Measure AI systems prior to deployment in conditions similar to expected scenarios.
  • [Measure 2.3] Document differences between measurement setting and the deployment environment(s).
  • [Measure 2.4] Utilize hypothesis testing or human domain expertise to measure monitored distribution differences in new input or output data relative to test environments.
  • [Measure 2.5] Define and document processes to establish the system’s operational conditions and limits.
  • [Measure 2.5] Establish practices to specify and document the assumptions underlying measurement models to ensure proxies accurately reflect the concept being measured.
  • [Measure 2.5] Monitor operating conditions for system performance outside of defined limits.
  • [Measure 2.5] Identify TEVV approaches for exploring AI system limitations, including testing scenarios that differ from the operational environment. Consult experts with knowledge of specific context of use.
  • [Measure 2.5] Define post-alert actions. Possible actions may include: (1) alerting other relevant AI actors before action, (2) requesting subsequent human review of action, (3) alerting downstream users and stakeholders that the system is operating outside its defined validity limits, (4) tracking and mitigating possible error propagation, and (5) action logging.
  • [Measure 2.5] Log input data and relevant system configuration information whenever there is an attempt to use the system beyond its well-defined range of system validity.
  • [Measure 2.6] Thoroughly measure system performance in development and deployment contexts, and under stress conditions.
  • [Measure 2.6] Employ test data assessments and simulations before proceeding to production testing.
  • [Measure 2.6] Track multiple performance quality and error metrics.
  • [Measure 2.6] Stress-test system performance under likely scenarios (e.g., concept drift, high load) and beyond known limitations, in consultation with domain experts.
  • [Measure 2.6] Test the system under conditions similar to those related to past known incidents and measure system performance and safety characteristics.
  • [Measure 2.6] Align measurement to the goal of continuous improvement. Seek to increase the range of conditions under which the system is able to fail safely through system modifications in response to in-production testing and events.
  • [Measure 2.6] Document, practice and measure incident response plans for AI system incidents, including measuring response and down times.
  • [Measure 2.7] Establish and track AI system security tests and metrics (e.g., red-teaming activities, frequency and rate of anomalous events, system down-time, incident response times, time-to-bypass, etc.).
  • [Measure 2.8] Measure and document human oversight of AI systems. Specifically, (1) document the degree of oversight that is provided by specified AI actors regarding AI system output; (2) maintain statistics about downstream actions by end users and operators such as system overrides; (3) maintain statistics about and document reported errors or complaints, time to respond, and response types; and (4) maintain and report statistics about adjudication activities.
  • [Measure 2.9] Verify systems are developed to produce explainable models, post-hoc explanations and audit logs.
  • [Measure 2.9] Document AI model details including model type (e.g., convolutional neural network, reinforcement learning, decision tree, random forest, etc.), data features, training algorithms, proposed uses, decision thresholds, training data, evaluation data, and ethical considerations.
  • [Measure 2.9] Establish, document, and report performance and error metrics across demographic groups and other segments relevant to the deployment context.
  • [Measure 2.10] Quantify privacy-level data aspects such as the ability to identify individuals or groups (e.g., k-anonymity metrics, l-diversity, t-closeness); a k-anonymity sketch follows this list.
  • [Measure 2.11] Define the actions to be taken if disparity levels rise above acceptable levels.
  • [Measure 2.11] Apply pre-processing data transformations to address factors related to demographic balance and data representativeness.
  • [Measure 3.1] Measure error response times and track response quality.
  • [Measure 3.2] Determine and document the rate of occurrence and severity level for complex or difficult-to-measure risks when: (1) prioritizing new measurement approaches for deployment tasks; (2) allocating AI system risk management resources; (3) evaluating AI system improvements; (4) making go/no-go decisions for subsequent system iterations.
  • [Measure 4.3] Delimit and characterize baseline operation values and states.
  • [Measure 4.3] Utilize qualitative approaches to augment and complement quantitative baseline measures, in close coordination with impact assessment, human factors and socio-technical AI actors.
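
As an illustration of the disaggregated evaluation called for in Measure 2.2, the same performance metrics can be computed separately for each demographic segment and then compared. Below is a minimal sketch in Python, assuming a pandas DataFrame with hypothetical columns group, y_true, and y_pred for a binary classifier; the column names and toy data are illustrative, not drawn from the Framework.

    import pandas as pd
    from sklearn.metrics import accuracy_score, recall_score

    def disaggregated_report(df: pd.DataFrame, group_col: str = "group") -> pd.DataFrame:
        """Compute per-group performance metrics for a binary classifier."""
        rows = []
        for name, g in df.groupby(group_col):
            rows.append({
                group_col: name,
                "n": len(g),
                "accuracy": accuracy_score(g["y_true"], g["y_pred"]),
                # False negative rate = 1 - recall for the positive class.
                "false_negative_rate": 1 - recall_score(g["y_true"], g["y_pred"], zero_division=0),
            })
        return pd.DataFrame(rows)

    # Toy example; real use would draw on held-out evaluation data.
    df = pd.DataFrame({
        "group":  ["a", "a", "a", "b", "b", "b"],
        "y_true": [1, 0, 1, 1, 1, 0],
        "y_pred": [1, 0, 0, 1, 1, 1],
    })
    print(disaggregated_report(df))

Large gaps between per-group rates are the signal to investigate further and, where warranted, to trigger the representativeness thresholds and alert procedures this function describes.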
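
For the privacy quantification in Measure 2.10, k-anonymity is the simplest of the named metrics: k is the size of the smallest group of records that share the same quasi-identifier values. Here is a minimal sketch, again in Python with pandas; the quasi-identifier columns are hypothetical, and l-diversity or t-closeness would require additional logic over the sensitive attribute.

    import pandas as pd

    def k_anonymity(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
        """Return k: the size of the smallest equivalence class over the quasi-identifiers."""
        return int(df.groupby(quasi_identifiers).size().min())

    # Toy example; "zip_code" and "age_band" stand in for real quasi-identifiers.
    records = pd.DataFrame({
        "zip_code":  ["20001", "20001", "20001", "20002", "20002"],
        "age_band":  ["30-39", "30-39", "30-39", "40-49", "40-49"],
        "diagnosis": ["A", "B", "A", "C", "C"],  # sensitive attribute, not used for k
    })
    print(k_anonymity(records, ["zip_code", "age_band"]))  # -> 2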

Regularly assess and update testing procedures and metrics to ensure their effectiveness and responsiveness to emergent A.I. risks.

  • [Measure 1.1] Define acceptable limits for system performance (e.g., distribution of errors), and include course correction suggestions if/when the system performs beyond acceptable limits.
  • [Measure 1.1] Define metrics for, and regularly assess, AI actor competency for effective system operation.
  • [Measure 1.1] Assess and document pre- vs post-deployment system performance.
  • [Measure 1.1] Include existing and emergent risks in assessments of pre- and post-deployment system performance.
  • [Measure 1.2] Assess external validity of all measurements (e.g., the degree to which measurements taken in one context can generalize to other contexts).
  • [Measure 1.2] Assess effectiveness of existing metrics and controls on a regular basis throughout the AI system lifecycle.
  • [Measure 1.2] Document reports of errors, incidents and negative impacts and assess sufficiency and efficacy of existing metrics for repairs and upgrades.
  • [Measure 1.2] Develop new metrics when existing metrics are insufficient or ineffective for implementing repairs and upgrades.
  • [Measure 1.3] Evaluate TEVV processes regarding incentives to identify risks and impacts.
  • [Measure 1.3] Utilize separate testing teams established in the Govern function (2.1 and 4.1) to enable independent decisions and course-correction for AI systems.
  • [Measure 1.3] Document test outcomes and course correct.
  • [Measure 1.3] Assess independence and stature of TEVV and oversight AI actors, to ensure they have the required levels of independence and resources to perform assurance, compliance, and feedback tasks effectively.
  • [Measure 2.1] Regularly assess the effectiveness of tools used to document measurement approaches, test sets, metrics, processes, and materials used.
  • [Measure 2.1] Update the tools used to document measurement approaches, test sets, metrics, processes, and materials used as needed.
  • [Measure 2.4] Monitor and document how metrics and performance indicators observed in production differ from the same metrics collected during pre-deployment testing. When differences are observed, consider error propagation and feedback loop risks.
  • [Measure 2.4] Monitor for anomalies using approaches such as control limits, confidence intervals, integrity constraints and ML algorithms. When anomalies are observed, consider error propagation and feedback loop risks.
  • [Measure 2.4] Verify alerts are in place for when distributions in new input data or generated predictions observed in production differ from pre-deployment test outcomes, or when anomalies are detected (a minimal drift-check sketch follows this list).
  • [Measure 2.4] Assess the accuracy and quality of generated outputs against new collected ground-truth information as it becomes available.
  • [Measure 2.4] Utilize human review to track processing of unexpected data and reliability of generated outputs; warn system users when outputs may be unreliable.
  • [Measure 2.4] Verify that human overseers responsible for these processes have clearly defined responsibilities and training for specified tasks.
  • [Measure 2.5] Utilize standard statistical methods to test bias, inferential associations, correlation, and covariance in adopted measurement models.
  • [Measure 2.5] Utilize standard statistical methods to test variance and reliability of system outcomes.
  • [Measure 2.5] Modify the system over time to extend its range of system validity to new operating conditions.
  • [Measure 2.6] Apply chaos engineering approaches to test systems in extreme conditions and gauge unexpected responses.
  • [Measure 2.6] Document the range of conditions under which the system has been tested and demonstrated to fail safely.
  • [Measure 2.6] Compare documented safety testing and monitoring information with established risk tolerances on an on-going basis.
  • [Measure 2.7] Use red-team exercises to actively test the system under adversarial or stress conditions, measure system response, assess failure modes or determine if system can return to normal function after an unexpected adverse event.
  • [Measure 2.7] Document red-team exercise results as part of continuous improvement efforts, including the range of security test conditions and results.
  • [Measure 2.7] Modify system security procedures and countermeasures to increase robustness and resilience to attacks in response to testing and events experienced in production.
  • [Measure 2.8] Track, document, and measure organizational accountability regarding AI systems via policy exceptions and escalations, and document “go” and “no-go” decisions made by accountable parties.
  • [Measure 2.8] Track and audit the effectiveness of organizational mechanisms related to AI risk management, including: (1) lines of communication between AI actors, executive leadership, users and impacted communities; (2) roles and responsibilities for AI actors and executive leadership; and (3) organizational accountability roles, e.g., chief model risk officers, AI oversight committees, responsible or ethical AI directors, etc.
  • [Measure 2.9] Test explanation methods and resulting explanations prior to deployment to gain feedback from relevant AI actors, end users, and potentially impacted individuals or groups about whether explanations are accurate, clear, and understandable.
  • [Measure 2.9] Test for changes in models over time, including for models that adjust in response to production data.
  • [Measure 2.11] Evaluate underlying data distributions and employ sensitivity analysis during the analysis of quantified harms.
  • [Measure 2.11] Evaluate quality metrics including false positive rates and false negative rates.
  • [Measure 2.11] Monitor system outputs for performance or bias issues that exceed established tolerance levels.
  • [Measure 2.11] Ensure periodic model updates; test and recalibrate with updated and more representative data to stay within acceptable levels of difference.
  • [Measure 2.11] Consider mediations to mitigate differences, especially those that can be traced to past patterns of unfair or biased human decision making.
  • [Measure 2.13] Review selected system metrics and associated TEVV processes to determine if they are able to sustain system improvements, including the identification and removal of errors.
  • [Measure 2.13] Regularly evaluate system metrics for utility, and consider descriptive approaches in place of overly complex methods.
  • [Measure 2.13] Review selected system metrics for acceptability within the end user and impacted community of interest.
  • [Measure 2.13] Assess effectiveness of metrics for identifying and measuring risks.
  • [Measure 3.1] Compare AI system risks with: (1) simpler or traditional models, (2) human baseline performance, and (3) other manual performance benchmarks.
  • [Measure 3.1] Assess effectiveness of metrics for identifying and measuring emergent risks.
  • [Measure 3.2] Establish processes for tracking emergent risks that may not be measurable with current approaches. Some processes may include: (1) recourse mechanisms for faulty AI system outputs; (2) bug bounties; (3) human-centered design approaches; (4) user-interaction and experience research; and (5) participatory stakeholder engagement with affected or potentially impacted individuals and communities.
  • [Measure 3.3] Evaluate measurement approaches to determine efficacy for enhancing organizational understanding of real world impacts.
  • [Measure 4.1] Analyze and document system-internal measurement processes in comparison to collected end user feedback.
  • [Measure 4.2] Measure frequency of AI systems’ override decisions, evaluate and document results, and feed insights back into continual improvement processes.
  • [Measure 4.3] Monitor and assess measurements as part of continual improvement to identify potential system adjustments or modifications.
  • [Measure 4.3] Perform and document sensitivity analysis to characterize actual and expected variance in performance after applying system or procedural updates.
  • [Measure 4.3] Document decisions related to the sensitivity analysis and record expected influence on system performance and identified risks.
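
One common way to implement the production-versus-test distribution checks in Measure 2.4 is a two-sample statistical test on each monitored feature or output. Below is a minimal sketch in Python using SciPy's Kolmogorov-Smirnov test; the 0.01 significance threshold, the synthetic data, and the print-based alert are illustrative assumptions rather than values taken from the Framework.

    import numpy as np
    from scipy.stats import ks_2samp

    def check_drift(reference: np.ndarray, production: np.ndarray, alpha: float = 0.01) -> bool:
        """Flag drift when the production distribution differs significantly from the
        pre-deployment reference distribution."""
        stat, p_value = ks_2samp(reference, production)
        drifted = p_value < alpha
        if drifted:
            # In practice this would trigger the organization's post-alert actions.
            print(f"Drift detected: KS statistic={stat:.3f}, p={p_value:.4f}")
        return drifted

    rng = np.random.default_rng(0)
    reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # pre-deployment test data
    production = rng.normal(loc=0.4, scale=1.0, size=5_000)  # shifted production data
    check_drift(reference, production)

Control limits and confidence intervals, also named in Measure 2.4, serve the same purpose for scalar metrics tracked over time.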

When developing and evaluating A.I. systems, prioritize efforts to improve system validity, reliability, explainability, transparency, accountability, safety, security, and fairness.

  • [Measure 1.1] Identify transparency metrics to assess whether stakeholders have access to necessary information about system design, development, deployment, use, and evaluation.
  • [Measure 1.1] Utilize accountability metrics to determine whether AI designers, developers, and deployers maintain clear and transparent lines of responsibility and are open to inquiries.
  • [Measure 2.1] Leverage existing industry best practices for transparency and documentation of all possible aspects of measurements. Examples include datasheets for datasets and model cards.
  • [Measure 2.2] Analyze differences between intended and actual population of users or data subjects, including likelihood for errors, incidents, or negative impacts.
  • [Measure 2.2] Evaluate data representativeness through (1) investigating known failure modes, (2) assessing data quality and diverse sourcing, (3) applying public benchmarks, (4) traditional bias testing, (5) chaos engineering, and (6) stakeholder feedback.
  • [Measure 2.3] Measure and document performance criteria such as accuracy (false positive rate, false negative rate, etc.) and efficiency (training times, prediction latency, etc.).
  • [Measure 2.3] Measure assurance criteria such as AI actor competency and experience.
  • [Measure 2.5] Establish or identify, and document approaches to measure forms of validity, including: (1) construct validity (the test is measuring the concept it claims to measure); (2) internal validity (relationship being tested is not influenced by other factors or variables); and (3) external validity (results are generalizable beyond the training condition) (standard approaches include the use of experimental design principles and statistical analyses and modeling).
  • [Measure 2.5] Assess and document system variance. Standard approaches include confidence intervals, standard deviation, standard error, bootstrapping, or cross-validation (see the bootstrap sketch following this list).
  • [Measure 2.5] Establish or identify, and document robustness measures.
  • [Measure 2.5] Establish or identify, and document reliability measures.
  • [Measure 2.6] Measure and monitor system performance in real-time to enable rapid response when AI system incidents are detected.
  • [Measure 2.7] Use countermeasures (e.g., authentication, throttling, differential privacy, robust ML approaches) to increase the range of security conditions under which the system is able to return to normal function.
  • [Measure 2.8] Instrument the system for measurement and tracking, e.g., by maintaining histories, audit logs and other information that can be used by AI actors to review and evaluate possible sources of error, bias, or vulnerability.
  • [Measure 2.9] When possible or available, utilize approaches that are inherently explainable, such as traditional and penalized generalized linear models, decision trees, nearest-neighbor and prototype-based approaches, rule-based models, generalized additive models, explainable boosting machines and neural additive models.
  • [Measure 2.9] Explain systems using a variety of methods, e.g., visualizations, model extraction, feature importance, and others.
  • [Measure 2.9] Since explanations may not accurately summarize complex systems, test explanations according to properties such as fidelity, consistency, robustness, and interpretability.
  • [Measure 2.9] Assess the characteristics of system explanations according to properties such as fidelity (local and global), ambiguity, interpretability, interactivity, consistency, and resilience to attack/manipulation.
  • [Measure 2.9] Secure model development processes to avoid vulnerability to external manipulation such as gaming explanation processes.
  • [Measure 2.9] Use transparency tools such as data statements and model cards to document explanatory and validation information.
  • [Measure 2.10] Document collection, use, management, and disclosure of personally sensitive information in datasets, in accordance with privacy and data governance policies.
  • [Measure 2.10] Establish and document protocols (authorization, duration, type) and access controls for training sets or production data containing personally sensitive information, in accordance with privacy and data governance policies.
  • [Measure 2.10] Monitor internal queries to production data for detecting patterns that isolate personal records.
  • [Measure 2.10] Monitor disclosures of personally sensitive information (PSI) and inference of sensitive or legally protected attributes.
  • [Measure 2.10] Assess the risk of manipulation from overly customized content. 
  • [Measure 2.10] Evaluate information presented to representative users at various points along axes of difference between individuals (e.g., individuals of different ages, genders, races, political affiliation, etc.).
  • [Measure 2.10] Use privacy-enhancing techniques such as differential privacy, when publicly sharing dataset information.
  • [Measure 2.11] Conduct fairness assessments to manage computational and statistical forms of bias which include the following steps: (1) identify types of harms, including allocational, representational, quality of service, stereotyping, or erasure; (2) identify across, within, and intersecting groups that might be harmed; (3) quantify harms using both a general fairness metric, if appropriate (e.g., demographic parity, equalized odds, equal opportunity, statistical hypothesis tests), and custom, context-specific metrics developed in collaboration with affected communities; (4) analyze quantified harms for contextually significant differences across groups, within groups, and among intersecting groups; and (5) refine identification of within-group and intersectional group disparities.
  • [Measure 2.11] Consider biases affecting small groups, within-group or intersectional communities, or single individuals.
  • [Measure 2.11] Understand and consider sources of bias in training and TEVV data, including: (1) differences in distributions of outcomes across and within groups, including intersecting groups; (2) completeness, representativeness and balance of data sources; (3) identify input data features that may serve as proxies for demographic group membership (i.e., credit score, ZIP code) or otherwise give rise to emergent bias within AI systems; (4) forms of systemic bias in images, text (or word embeddings), audio or other complex or unstructured data.
  • [Measure 2.11] Leverage impact assessments to identify and classify system impacts and harms to end users, other individuals, and groups with input from potentially impacted communities.
  • [Measure 2.11] Identify the classes of individuals, groups, or environmental ecosystems which might be impacted through direct engagement with potentially impacted communities.
  • [Measure 2.11] Evaluate systems with regard to disability inclusion, including consideration of disability status in bias testing and of discriminatory screen-out processes that may arise from non-inclusive design or deployment decisions.
  • [Measure 2.11] Develop objective functions in consideration of systemic biases and in-group/out-group dynamics.
  • [Measure 2.11] Use context-specific fairness metrics to examine how system performance varies across groups, within groups, and/or for intersecting groups. Metrics may include statistical parity, error-rate equality, statistical parity difference, equal opportunity difference, average absolute odds difference, standardized mean difference, and percentage point differences (see the fairness-metrics sketch following this list).
  • [Measure 2.11] Customize fairness metrics to specific context of use to examine how system performance and potential harms vary within contextual norms.
  • [Measure 2.11] Apply in-processing to balance model performance quality with bias considerations.
  • [Measure 2.11] Apply model selection approaches with transparent and deliberate consideration of bias management and other trustworthy characteristics.
  • [Measure 2.11] Utilize human-centered design practices to generate deeper focus on societal impacts and counter human-cognitive biases within the AI lifecycle.
  • [Measure 2.11] Evaluate practices along the lifecycle to identify potential sources of human-cognitive bias such as availability, observational, and confirmation bias, and to make implicit decision-making processes more explicit and open to investigation.
  • [Measure 4.3] Develop baseline quantitative measures for trustworthy characteristics.
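
For the variance assessment in Measure 2.5, bootstrapping is a standard way to turn a single evaluation score into a confidence interval: resample the evaluation set with replacement, recompute the metric, and report the spread. Below is a minimal sketch in Python for classification accuracy; the resample count, seed, 95% interval, and toy data are illustrative choices.

    import numpy as np

    def bootstrap_accuracy_ci(y_true: np.ndarray, y_pred: np.ndarray,
                              n_resamples: int = 2_000, seed: int = 0) -> tuple[float, float]:
        """Bootstrap a 95% confidence interval for classification accuracy."""
        rng = np.random.default_rng(seed)
        n = len(y_true)
        scores = []
        for _ in range(n_resamples):
            idx = rng.integers(0, n, size=n)  # resample indices with replacement
            scores.append(float((y_true[idx] == y_pred[idx]).mean()))
        return float(np.percentile(scores, 2.5)), float(np.percentile(scores, 97.5))

    y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 10)
    y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1] * 10)
    low, high = bootstrap_accuracy_ci(y_true, y_pred)
    print(f"accuracy 95% CI: [{low:.3f}, {high:.3f}]")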
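
Two of the fairness metrics named in Measure 2.11, statistical parity difference and equal opportunity difference, reduce to simple arithmetic over group-level rates. Here is a minimal sketch in Python, assuming binary labels and predictions as NumPy arrays and a boolean mask marking the protected group; the array names and toy data are illustrative.

    import numpy as np

    def statistical_parity_difference(y_pred: np.ndarray, protected: np.ndarray) -> float:
        """P(pred = 1 | protected) - P(pred = 1 | unprotected)."""
        return float(y_pred[protected].mean() - y_pred[~protected].mean())

    def equal_opportunity_difference(y_true: np.ndarray, y_pred: np.ndarray,
                                     protected: np.ndarray) -> float:
        """Difference in true positive rates between protected and unprotected groups."""
        def tpr(mask: np.ndarray) -> float:
            positives = mask & (y_true == 1)
            return float(y_pred[positives].mean())
        return tpr(protected) - tpr(~protected)

    y_true    = np.array([1, 1, 0, 1, 0, 1, 1, 0])
    y_pred    = np.array([1, 0, 0, 1, 1, 1, 1, 0])
    protected = np.array([True, True, True, True, False, False, False, False])
    print(statistical_parity_difference(y_pred, protected))         # -0.25
    print(equal_opportunity_difference(y_true, y_pred, protected))  # about -0.33

Values near zero indicate parity on that metric; what counts as an acceptable level of difference is the context-specific judgment the Framework asks organizations to document.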

Incorporate human research requirements, legal frameworks, feedback from diverse stakeholders, perspectives from impacted communities, and the risk of environmental harms into the metrics and testing procedures used to evaluate A.I. systems.

  • [Measure 1.2] Determine frequency and scope for sharing metrics and related information with stakeholders and impacted communities.
  • [Measure 1.2] Utilize stakeholder feedback processes established in the Map function to capture, act upon and share feedback from end users and potentially impacted communities.
  • [Measure 1.3] Plan and evaluate AI system prototypes with end user populations early and continuously in the AI lifecycle.
  • [Measure 1.3] Evaluate interdisciplinary and demographically diverse internal teams established in Map 1.2.
  • [Measure 1.3] Evaluate effectiveness of external stakeholder feedback mechanisms, specifically related to processes for eliciting, evaluating and integrating input from diverse groups.
  • [Measure 1.3] Evaluate effectiveness of external stakeholder feedback mechanisms for enhancing AI actor visibility and decision making regarding AI system risks and trustworthy characteristics.
  • [Measure 2.2] Follow human subjects research requirements as established by organizational and disciplinary requirements, including informed consent and compensation, during dataset collection activities.
  • [Measure 2.2] Construct datasets in close collaboration with experts with knowledge of the context of use.
  • [Measure 2.2] Follow intellectual property and privacy rights related to datasets and their use, including for the subjects represented in the data.
  • [Measure 2.2] Use informed consent for individuals providing data used in system testing and evaluation.
  • [Measure 2.3] Conduct regular and sustained engagement with potentially impacted communities.
  • [Measure 2.3] Maintain a demographically diverse, multidisciplinary, and collaborative internal team.
  • [Measure 2.3] Evaluate feedback from stakeholder engagement activities, in collaboration with human factors and socio-technical experts.
  • [Measure 2.3] Collaborate with socio-technical, human factors, and UI/UX experts to identify notable characteristics in context of use that can be translated into system testing scenarios.
  • [Measure 2.4] Collect use cases from the operational environment for system testing and monitoring activities in accordance with organizational policies and regulatory or disciplinary requirements (e.g., informed consent, institutional review board approval, human research protections).
  • [Measure 2.5] Define the operating conditions and socio-technical context under which the AI system will be validated.
  • [Measure 2.6] Collect pertinent safety statistics (e.g., out-of-range performance, incident response times, system down time, injuries, etc.) in anticipation of potential information sharing with impacted communities or as required by AI system oversight personnel.
  • [Measure 2.7] Verify that information about errors and attack patterns is shared with incident databases, other organizations with similar systems, and system users and stakeholders (MANAGE-4.1).
  • [Measure 2.7] Develop and maintain information sharing practices with AI actors from other organizations to learn from common attacks.
  • [Measure 2.7] Verify that third party AI resources and personnel undergo security audits and screenings. Risk indicators may include failure of third parties to provide relevant security information.
  • [Measure 2.8] Calibrate controls for users in close collaboration with experts in user interaction and user experience (UI/UX), human computer interaction (HCI), and/or human-AI teaming.
  • [Measure 2.8] Test provided explanations for calibration with different audiences including operators, end users, decision makers and decision subjects (individuals for whom decisions are being made), and to enable recourse for consequential system decisions that affect end users or subjects.
  • [Measure 2.9] Test the quality of system explanations with end-users and other groups.
  • [Measure 2.10] Specify privacy-related values, frameworks, and attributes that are applicable in the context of use through direct engagement with end users and potentially impacted groups and communities.
  • [Measure 2.10] Collaborate with privacy experts, AI end users and operators, and other domain experts to determine optimal differential privacy metrics within contexts of use.
  • [Measure 2.11] Define acceptable levels of difference in performance in accordance with established organizational governance policies, business requirements, regulatory compliance, legal frameworks, and ethical standards within the context of use.
  • [Measure 2.11] Identify groups within the expected population that may require disaggregated analysis, in collaboration with impacted communities.
  • [Measure 2.11] Leverage experts with knowledge in the specific context of use to investigate substantial measurement differences and identify root causes for those differences.
  • [Measure 2.11] Apply post-processing mathematical/computational techniques to model results in close collaboration with impact assessors, socio-technical experts, and other AI actors with expertise in the context of use.
  • [Measure 2.11] Collect and share information about differences in outcomes for the identified groups.
  • [Measure 2.11] Work with human factors experts to evaluate biases in the presentation of system output to end users, operators and practitioners.
  • [Measure 2.11] Utilize processes to enhance contextual awareness, such as diverse internal staff and stakeholder engagement.
  • [Measure 2.12] Include environmental impact indicators in AI system design and development plans, including reducing consumption and improving efficiencies.
  • [Measure 2.12] Identify and implement key indicators of AI system energy and water consumption and efficiency, and/or GHG emissions.
  • [Measure 2.12] Establish measurable baselines for sustainable AI system operation in accordance with organizational policies, regulatory compliance, legal frameworks, and environmental protection and sustainability norms.
  • [Measure 2.12] Assess tradeoffs between AI system performance and sustainable operations in accordance with organizational principles and policies, regulatory compliance, legal frameworks, and environmental protection and sustainability norms.
  • [Measure 2.12] Identify and establish acceptable resource consumption and efficiency, and GHG emissions levels, along with actions to be taken if indicators rise above acceptable levels.
  • [Measure 2.12] Estimate AI system emissions levels throughout the AI lifecycle via carbon calculators or a similar process (see the emissions-estimate sketch following this list).
  • [Measure 3.1] Compare end user and community feedback about deployed AI systems to internal measures of system performance.
  • [Measure 3.1] Elicit and track feedback from AI actors in user support roles about the type of metrics, explanations and other system information required for fulsome resolution of system issues. Consider: (1) instances where explanations are insufficient for investigating possible error sources or identifying responses; and (2) system metrics, including system logs and explanations, for identifying and diagnosing sources of system error.
  • [Measure 3.1] Elicit and track feedback from AI actors in incident response and support roles about the adequacy of staffing and resources to perform their duties in an effective and timely manner.
  • [Measure 3.2] Identify AI actors responsible for tracking emergent risks and inventory methods.
  • [Measure 3.3] Measure efficacy of end user and operator error reporting processes.
  • [Measure 3.3] Categorize and analyze type and rate of end user appeal requests and results.
  • [Measure 3.3] Measure feedback activity participation rates and awareness of feedback activity availability.
  • [Measure 3.3] Utilize feedback to analyze measurement approaches and determine subsequent courses of action.
  • [Measure 3.3] Analyze end user and community feedback in close collaboration with domain experts.
  • [Measure 4.1] Support mechanisms for capturing feedback from system end users (including domain experts, operators, and practitioners). Successful approaches are: (1) conducted in settings where end users are able to openly share their doubts and insights about AI system output, and in connection to their specific context of use (including setting and task-specific lines of inquiry); (2) developed and implemented by human-factors and socio-technical domain experts and researchers; and (3) designed to ensure control of interviewer and end user subjectivity and biases.
  • [Measure 4.1] Identify and document approaches (1) for evaluating and integrating elicited feedback from system end users, (2) in collaboration with human-factors and socio-technical domain experts, (3) to actively inform a process of continual improvement.
  • [Measure 4.1] Evaluate feedback from end users alongside evaluated feedback from impacted communities (MEASURE 3.3).
  • [Measure 4.1] Utilize end user feedback to investigate how selected metrics and measurement approaches interact with organizational and operational contexts.
  • [Measure 4.1] Identify and implement approaches to measure effectiveness and satisfaction with end user elicitation techniques, and document results.
  • [Measure 4.2] Integrate feedback from end users, operators, and affected individuals and communities from Map function as inputs to assess AI system trustworthiness characteristics. Ensure both positive and negative feedback is being assessed.
  • [Measure 4.2] Evaluate feedback in connection with AI system trustworthiness characteristics from Measure 2.5 to 2.11.
  • [Measure 4.2] Evaluate feedback regarding end user satisfaction with, and confidence in, AI system performance including whether output is considered valid and reliable, and explainable and interpretable.
  • [Measure 4.2] Identify mechanisms to confirm/support AI system output (e.g., recommendations), and end user perspectives about that output.
  • [Measure 4.2] Consult AI actors in impact assessment, human factors and socio-technical tasks to assist with analysis and interpretation of results.
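
The lifecycle emissions estimate in Measure 2.12 typically starts from a simple product: energy consumed by the hardware, multiplied by a data-center overhead factor and the carbon intensity of the local grid; dedicated carbon calculators refine this with region- and hardware-specific factors. Below is a minimal back-of-the-envelope sketch in Python; the power draw, PUE, and grid-intensity figures are illustrative placeholders, not measured values.

    def estimate_co2_kg(gpu_count: int, avg_power_watts: float, hours: float,
                        pue: float = 1.5, grid_kg_co2_per_kwh: float = 0.4) -> float:
        """Rough CO2 estimate: hardware energy x data-center overhead (PUE) x grid intensity."""
        energy_kwh = gpu_count * avg_power_watts * hours / 1000.0
        return energy_kwh * pue * grid_kg_co2_per_kwh

    # Hypothetical training run: 8 GPUs averaging ~300 W for two weeks.
    print(f"{estimate_co2_kg(gpu_count=8, avg_power_watts=300, hours=24 * 14):.1f} kg CO2e")

Tracking the same figures for inference over time supports the acceptable consumption levels and corrective actions that the Measure 2.12 items above call for.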
