Created with Colab

ieg-dhr · Jan 4, 2025 · dbc520c
1 parent 61c1f28
Showing 1 changed file with 36 additions and 87 deletions.
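
This commit drops the old `**END OF ARTICLE**` splitting, so `analyze_text` now returns the raw model response and the `separated_articles` column holds unparsed strings. Below is a minimal sketch of how those strings could be parsed downstream, assuming the model follows the tag format requested in the prompt and emits the three tags in the stated order for each article. `parse_separated_articles` and `ARTICLE_PATTERN` are hypothetical names and are not part of the notebook.

```python
import re
from typing import Dict, List

# Hypothetical helper, not part of the notebook. It assumes the model emits,
# per article, the three tags in the order requested by the prompt.
ARTICLE_PATTERN = re.compile(
    r"<article>(?P<article>.*?)</article>\s*"
    r"<verification>(?P<verification>.*?)</verification>\s*"
    r"<human_verification_needed>(?P<flag>.*?)</human_verification_needed>",
    re.DOTALL,  # articles may span several lines
)


def parse_separated_articles(response: str) -> List[Dict[str, object]]:
    """Return one dict per tag triple found in the raw model response."""
    results = []
    for match in ARTICLE_PATTERN.finditer(response or ""):
        results.append({
            "article": match.group("article").strip(),
            "verification": match.group("verification").strip(),
            # The prompt asks for the literal values "True" or "False".
            "human_verification_needed": match.group("flag").strip().lower() == "true",
        })
    return results


# Illustrative inputs only; the sample text is made up.
sample = (
    "<article>Erdbeben in Lissabon.</article>"
    "<verification>Einheit kohärent.</verification>"
    "<human_verification_needed>False</human_verification_needed>"
)
parsed = parse_separated_articles(sample)
no_hit = parse_separated_articles("Kein relevanter Artikel gefunden.")
```

A regular expression is used here instead of an XML parser because extracted newspaper text may contain characters such as `&` or `<` that would make the response invalid XML. Applied to the DataFrame, something like `df['separated_articles'].apply(parse_separated_articles)` followed by `DataFrame.explode` would give one row per article, similar to what the removed loop produced.
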
diff --git a/Large_Language_Models_Article_Separation.ipynb b/Large_Language_Models_Article_Separation.ipynb
@@ -4,7 +4,7 @@
   "metadata": {
     "colab": {
       "provenance": [],
-      "authorship_tag": "ABX9TyP3HNW6Xlqgf54aKWyHGm2s",
+      "authorship_tag": "ABX9TyN8MxU868OagAbRX3uguOkR",
       "include_colab_link": true
     },
     "kernelspec": {
@@ -142,29 +142,6 @@
       "execution_count": null,
       "outputs": []
     },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "#Importing a Text File Containing an Example of how to Structure the Output"
-      ],
-      "metadata": {
-        "id": "MSzGg3kmERmQ"
-      }
-    },
-    {
-      "cell_type": "code",
-      "source": [
-        "\n",
-        "with open('/content/NLP-Course4Humanities_2024/datasets/structure_example_AS.txt', 'r') as file:\n",
-        "    examples = file.read()\n",
-        "examples"
-      ],
-      "metadata": {
-        "id": "2MwqzKZ3DvfP"
-      },
-      "execution_count": null,
-      "outputs": []
-    },
     {
       "cell_type": "code",
       "source": [
@@ -175,14 +152,14 @@
         "# Initialize OpenAI client with NVIDIA API settings\n",
         "client = OpenAI(\n",
         "    base_url=\"https://integrate.api.nvidia.com/v1\",\n",
-        "    api_key = userdata.get('NVIDIA_TOKEN')\n",
+        "    api_key=userdata.get('NVIDIA_TOKEN')\n",
         ")\n",
         "\n",
         "def analyze_dataframe(df: pd.DataFrame, text_column: str) -> pd.DataFrame:\n",
-        "    def analyze_text(text: str) -> List[Dict[str, str]]:\n",
-        "        system_prompt = f\"\"\"\n",
+        "    def analyze_text(text: str) -> str:\n",
+        "        system_prompt = \"\"\"\n",
         "# System Instructions\n",
-        "You are an expert text analyst and information retrieval specialist and hate summarization as well as enumerations. Use {examples} for structuring your answer.\n",
+        "You are an expert text analyst and information retrieval specialist and hate summarization as well as enumerations.\n",
         "Your task is to carefully analyze given texts and extract complete articles that contain specific themes. You never change original texts.\n",
         "\n",
         "Classify as relevant if the text contains:\n",
@@ -201,30 +178,26 @@
         "- historical references\n",
         "- comparisons\n",
         "\n",
-        "Your output should consist of the extracted articles and the verification\n",
+        "Your output should consist of nothing but the XML structure <article></article><verification></verification><human_verification_needed></human_verification_needed>\n",
         "\n",
         "Maintain a neutral, objective stance throughout the analysis. Focus on accuracy and completeness in your extractions\n",
         "\"\"\"\n",
         "        user_prompt = f\"\"\"\n",
-        "# Task Instructions\n",
-        "Bitte führe die folgenden Schritte aus:\n",
-        "1. Lese jeden Text aufmerksam durch. Behandle jeden Text als eigene Einheit, ohne auf andere Texte zu referieren\n",
-        "2. Identifiziere alle Artikel zum Thema Erdbeben und Erstoß\n",
-        "3. Für jedes Vorkommen des Themas:\n",
-        "   a. Bestimme den Anfang des Artikels, in dem das Thema vorkommen.\n",
-        "   b. Kontrolliere Satz für Satz, ob diese zusammengehören, Ende den Artikel, wenn die Sätze nicht mehr zusammengehören.\n",
-        "   c. Markiere den vollständigen Artikel von Anfang bis Ende.\n",
-        "   d. Wenn der Artikel zu lang für eine Antwort ist, antworte mit Ja auf \"article too long, human addition needed\":\n",
-        "   e. Berücksichtige auch sehr kurze und sehr lange Artikel\n",
-        "4. Überprüfe jeden markierten Artikel:\n",
-        "   a. Stelle sicher, dass er eine Einheit bildet, auch wenn es nicht mehr um Erdbeben geht.\n",
-        "   b. Vergewissere dich, dass er eines der genannten Themen enthält.\n",
-        "   c. Prüfe, ob der extrahierte Text tatsächlich im Dokument ist\n",
-        "5. Extrahiere jeden überprüften Artikel als Originaltext, der nichts als den originalen Text enthält\n",
-        "6. Korrigiere OCR-Fehler\n",
-        "7. Wenn keine Artikel gefunden wurden, gib \"Keine Artikel mit dem angegebenen Thema gefunden.\" aus.\n",
-        "\n",
-        "Führe nun diese Schritte für den folgenden Text aus:\n",
+        "Bitte befolgen Sie diese Spezifikationen:\n",
+        "1. Definition eines Artikels: Ein Artikel ist eine semantische Einheit im Text, die sich deutlich von vorangehendem und nachfolgendem Inhalt abgrenzt (z.B. durch eine eigene Überschrift).\n",
+        "2. Antwortformat:\n",
+        "- Wenn ein oder mehrere relevante Artikel gefunden werden, strukturieren Sie Ihre Antwort mit XML-Tags wie im folgenden Beispiel, unter Verwendung der Tags article, verification und human_verification_needed (True oder False): <article>vollständiger extrahierter Artikelinhalt</article><verification>Ist die Einheit kohärent? Ist das Thema vorhanden? Ist der Artikel vollständig? Wurden alle Artikel gefunden?</verification><human_verification_needed>False</human_verification_needed>\n",
+        "- Geben Sie alle relevanten Artikel in ihrer Originalform zurück, ohne Ergänzungen, Auslassungen, Korrekturen oder Kommentare.\n",
+        "- Wenn keine relevanten Artikel gefunden werden, ist keine besondere Strukturierung erforderlich; geben Sie einfach \"Kein relevanter Artikel gefunden.\" ohne weitere Erklärungen zurück.\n",
+        "3. Hinweise zur Segmentierung:\n",
+        "- Stellen Sie sicher, dass über mehrere Absätze verteilte Artikel als eine Einheit behandelt werden.\n",
+        "4. Menschliche Überprüfung notwendig:\n",
+        "- Kann die Werte \"True\" oder \"False\" haben\n",
+        "- False: Wenn Sie glauben, den Artikel korrekt segmentiert und seine Relevanz richtig eingeschätzt zu haben.\n",
+        "- True: Wenn Sie unsicher sind, ob Sie den vollständigen Inhalt des Artikels, wie er im Zeitungsdokument enthalten ist, erfasst haben oder ob er relevant ist.\n",
+        "\n",
+        "Hier ist das Zeitungsdokument:\n",
+        "\n",
         "{text}\n",
         "\"\"\"\n",
         "        try:\n",
@@ -245,52 +218,28 @@
         "                temperature=0.0,\n",
         "                max_tokens=20000\n",
         "            )\n",
-        "\n",
-        "            content = completion.choices[0].message.content\n",
-        "\n",
-        "            # Split the content into individual articles\n",
-        "            articles = []\n",
-        "            if \"Keine Artikel mit dem angegebenen Thema gefunden.\" in content:\n",
-        "                return []\n",
-        "\n",
-        "            # Split by \"**END OF ARTICLE**\" if present, otherwise treat as single article\n",
-        "            if \"**END OF ARTICLE**\" in content:\n",
-        "                parts = content.split(\"**END OF ARTICLE**\")\n",
-        "                articles = [{\"article\": part.strip()} for part in parts if part.strip()]\n",
-        "            else:\n",
-        "                articles = [{\"article\": content.strip()}]\n",
-        "\n",
-        "            return articles\n",
-        "\n",
+        "            return completion.choices[0].message.content\n",
         "        except Exception as e:\n",
-        "            print(f\"Error in AI processing: {str(e)}\")\n",
-        "            return []\n",
+        "            print(f\"Error in API call: {str(e)}\")\n",
+        "            return \"\"\n",
         "\n",
         "    # Apply the analysis to each row in the DataFrame\n",
-        "    all_articles = []\n",
-        "    for index, row in df.iterrows():\n",
-        "        articles = analyze_text(row[text_column])\n",
-        "        for i, article in enumerate(articles, 1):\n",
-        "            new_row = row.to_dict()\n",
-        "            new_row['extracted_article'] = article['article']\n",
-        "            new_row['article_part'] = i\n",
-        "            new_row['total_parts'] = len(articles)\n",
-        "            all_articles.append(new_row)\n",
-        "\n",
-        "    # Create a new DataFrame with individual rows for each article\n",
-        "    result_df = pd.DataFrame(all_articles)\n",
+        "    df['separated_articles'] = df[text_column].apply(lambda x: analyze_text(x) if pd.notna(x) else \"\")\n",
         "\n",
-        "    return result_df\n",
+        "    return df\n",
         "\n",
-        "# Usage example\n",
-        "text_column = 'plainpagefulltext'\n",
-        "result_df = analyze_dataframe(df, text_column)\n",
+        "# Usage example (assuming df is your input DataFrame)\n",
+        "if __name__ == \"__main__\":\n",
+        "    # Process the DataFrame\n",
+        "    text_column = 'plainpagefulltext'  # or your text column name\n",
+        "    result_df = analyze_dataframe(df, text_column)\n",
         "\n",
-        "# Save the results to an Excel file\n",
-        "result_df.to_excel('test_1.xlsx', index=False)\n",
+        "    # Save the results\n",
+        "    result_df.to_excel('analyzed_results.xlsx', index=False)\n",
         "\n",
-        "# Display the first few rows of the result\n",
-        "print(result_df.head())"
+        "    # Display sample results\n",
+        "    print(\"\\nSample of processed articles:\")\n",
+        "    print(result_df['separated_articles'].head())"
       ],
       "metadata": {
         "id": "G6GHbkcUb0hR"