Initial Document Import

Foreword and import statistics

Importing millions of documents from legacy systems into Livingdocs takes time. We observed these numbers:

  • 50k articles per hour
  • 100k - 300k images per hour

During these observations, memory usage was around 4GB of RAM and roughly 25Mbps of inbound and outbound bandwidth was used.

If no images are imported, the document throughput is considerably higher.

Custom document IDs

When migrating an existing system, it is best practice to migrate all entries of the old system into Livingdocs. To ease the migration, user-defined identifiers are supported, so a custom import script can reuse the existing identifiers.

To prevent conflicts with the id generation of Postgres, the maximum allowed custom id is configurable.

Example

Execute the following SQL to prevent id conflicts when new documents are created.

Replace 100000 with a value higher than the largest id of the legacy system you’d like to import documents from.

ALTER SEQUENCE documents_id_seq RESTART WITH 100000;

Livingdocs Server Configuration needed to support custom ids:

// server configuration
documents: {
  allowCustomIdsBelow: 100000,
}

Example curl request to import a document with a custom document id. Note that the systemName must always be the same for all documents, otherwise the mapping does not work properly:

curl -k -X POST "https://server.livingdocs.io/api/v1/import/documents" \
  -H "Authorization: Bearer ey1234" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data-binary @- << EOF
{
  "systemName": "import",
  "documents": [{
    "documentId": 1,
    "id": "123abc",
    "title": "test import",
    "contentType": "article",
    "checksum": "xyz456",
    "livingdoc": {
      "content": [],
      "design": {
        "name": "living-times",
        "version": "1.0.1"
      }
    },
    "metadata": {
      "description": "foo"
    }
  }]
}
EOF

Custom publication dates

When importing articles from legacy systems, you should set the publicationDate. The publicationDate property is described in the Public API and Import API reference documentation.

The publicationDate indicates when an article was published or updated and is important for search to work properly.

If an article has multiple publication dates and you want to keep a history of, for example, the created and updated dates, we advise importing the same article twice.

First, import the article with the publicationDate set to the date the article was originally published. Then re-import the article with the new publicationDate; this effectively ‘updates’ the article.
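
A sketch of this flow, reusing the request body from the curl example above (the dates are placeholders, and the exact placement of publicationDate on a document entry should be checked against the Import API reference):

{
  "systemName": "import",
  "documents": [{
    "id": "123abc",
    "title": "test import",
    "contentType": "article",
    "checksum": "xyz456",
    "publicationDate": "2015-03-01T10:00:00.000Z",
    "livingdoc": {
      "content": [],
      "design": {
        "name": "living-times",
        "version": "1.0.1"
      }
    }
  }]
}

For the second request, send the same entry again with a new checksum and the updated publicationDate, e.g. "2019-06-15T08:30:00.000Z". Because the id stays the same, the import maps to the same document and ‘updates’ it with the new publicationDate.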

We save the firstPublicationDate of an article, so you can access both dates later in your delivery and show when an article was initially published and when it was last updated.