Module 1 Assignment: Text preprocessing and linguistic signals

Contents

Module 1 Assignment: Text preprocessing and linguistic signals#

Scenario#

You are advising a product team evaluating an NLP workflow before using it in customer-facing communication. The stakeholders are: product manager, support lead, privacy reviewer, and model evaluator.

Task#

Answer the module question: What is lost and gained when language becomes data?

Use the module lab and course readings to produce: NLP evaluation packet with task framing, retrieval/evaluation design, and deployment guardrails focused on text preprocessing and linguistic signals: Compare tokenization choices on a small corpus..

Required Evidence#

Define the decision or system boundary in one paragraph.
Identify the dataset, proxy data, or evidence source you used: synthetic support messages, retrieval snippets, intent labels, and factuality checks.
Compare at least two alternatives, baselines, policies, or designs.
Report one quantitative result or structured scoring table.
Explain two failure modes and one mitigation for each.
State what additional evidence would be required before real deployment.

Submission#

Submit the completed notebook plus a 900-1200 word memo. The memo must include clear headings for context, method, evidence, risks, recommendation, and open questions.

# Assignment workspace for Module 1: Text preprocessing and linguistic signals
module = 1
decision = "What is lost and gained when language becomes data?"
artifact = "NLP evaluation packet with task framing, retrieval/evaluation design, and deployment guardrails focused on text preprocessing and linguistic signals: Compare tokenization choices on a small corpus."

alternatives = [
    {"option": "baseline_or_manual_process", "strength": "", "risk": "", "evidence": ""},
    {"option": "ai_assisted_or_advanced_option", "strength": "", "risk": "", "evidence": ""},
]

recommendation = {
    "decision": decision,
    "recommended_option": "",
    "minimum_evidence_before_pilot": [],
    "monitoring_metric": "",
    "rollback_trigger": "",
}

{"module": module, "artifact": artifact, "alternatives": alternatives, "recommendation": recommendation}

{'module': 1,
 'artifact': 'NLP evaluation packet with task framing, retrieval/evaluation design, and deployment guardrails focused on text preprocessing and linguistic signals: Compare tokenization choices on a small corpus.',
 'alternatives': [{'option': 'baseline_or_manual_process',
   'strength': '',
   'risk': '',
   'evidence': ''},
  {'option': 'ai_assisted_or_advanced_option',
   'strength': '',
   'risk': '',
   'evidence': ''}],
 'recommendation': {'decision': 'What is lost and gained when language becomes data?',
  'recommended_option': '',
  'minimum_evidence_before_pilot': [],
  'monitoring_metric': '',
  'rollback_trigger': ''}}

Acceptance Criteria#

Your submission is complete only if another reviewer can reproduce your reasoning from the evidence you provide. You do not need production-grade data, but you must be explicit about proxy-data limits and what would change with real institutional data.