So far, “integer” has always been sufficient for me to handle my data in terms of file sizes, character counts, etc.
Of course, Delphi uses Integer extensively for all such operations internally too, e.g. TFile’s ReadAllLines method, or TStringList’s LoadFromFile.
But now, in the age of AI, I’m finding that my training datasets are growing into the multi-TB range, and I’m sure we will all get there soon enough.
Basically, the Delphi RTL must be updated to use a bigger data type, ASAP, or things will begin to break rapidly.
Windows APIs have actually been using bigger numbers for some time now, and some Delphi RTL routines also use types like NativeUInt, or split the value into two Integers, but not often.
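As an example of the Windows side of this (a sketch, not RTL code — the `FileSize64` helper name is my own): the Win32 `GetFileSizeEx` call has reported 64-bit file sizes for decades, so a thin wrapper can return an Int64 even where higher-level wrappers truncate.

```pascal
uses Winapi.Windows, System.SysUtils;

// Returns the size of a file as Int64 via GetFileSizeEx,
// which fills a LARGE_INTEGER and so has no 2 GB ceiling.
function FileSize64(const Path: string): Int64;
var
  H: THandle;
  Size: LARGE_INTEGER;
begin
  H := CreateFile(PChar(Path), GENERIC_READ, FILE_SHARE_READ, nil,
    OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
  if H = INVALID_HANDLE_VALUE then
    RaiseLastOSError;
  try
    if not GetFileSizeEx(H, Size) then
      RaiseLastOSError;
    Result := Size.QuadPart;
  finally
    CloseHandle(H);
  end;
end;
```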
Has anybody heard of anyone at Emb/Idera raising this issue?
Defaults in C#/.NET are all integers, and I have very rarely needed to use longs. The cost of storage and processing for TBs of data just isn’t worth it, in my opinion. I’m not saying don’t do it, but data and processing limits will usually be more limiting than 32-bit integers.
That would likely be 16 GB of memory, minimum: a 64-bit pointer for each item, not including the actual list item data! A TStringList would start at 32 GB and would most likely be much larger than that!
I’m not sure the memory allocator would even allow that amount of memory to be allocated - it appears to use a single contiguous block to hold the complete list of pointers to the items.
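The rough arithmetic behind those figures, assuming a 64-bit target and TStringList’s per-item pair of string and object references (the item count here is illustrative — 2^31 items, i.e. one per line once you pass the Integer limit):

```pascal
program ListMemoryEstimate;
{$APPTYPE CONSOLE}
const
  ItemCount = Int64(2) * 1024 * 1024 * 1024;  // 2^31 items, one per "line"
  GB = Int64(1024) * 1024 * 1024;
begin
  // Bare pointer array: 8 bytes per item on a 64-bit target -> 16 GB
  Writeln('Pointer array: ', ItemCount * 8 div GB, ' GB');
  // TStringList keeps a string reference plus an object reference
  // per item (16 bytes) -> 32 GB, before any actual string data
  Writeln('Item records:  ', ItemCount * 16 div GB, ' GB');
end.
```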
Loading with which RTL classes, exactly? Can you provide a reproducible example? Obviously do not include huge files, but it might help to know what size they are.
They’re examples of methods, not examples of how to reproduce the problem. Those methods work fine for me. Please give a reproducible example of how they do not work for you. As described earlier, please also include the file sizes (not the actual files).
Just try them with a 5 GB text file - in a typical fashion, no tricks; it’s just the size that matters.
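Something like this should reproduce it (a sketch - the file name, line content, and repeat count are arbitrary; it writes roughly 2.5 GB, just past the 2^31-byte Integer limit, so make sure the disk space is there):

```pascal
program ReproBigFile;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.Classes, System.IOUtils;
const
  // 42 bytes per line, written 60 million times = ~2.5 GB
  Line: AnsiString = '0123456789012345678901234567890123456789'#13#10;
var
  FS: TFileStream;
  I: Integer;
  Lines: TArray<string>;
begin
  FS := TFileStream.Create('big.txt', fmCreate);
  try
    for I := 1 to 60000000 do
      FS.WriteBuffer(Line[1], Length(Line));
  finally
    FS.Free;
  end;
  // Expected to fail once the file size no longer fits in an Integer.
  Lines := TFile.ReadAllLines('big.txt');
end.
```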
And what’s 5 (or 50 for that matter) GB these days? Surely that would soon become 5-50TB, the way things are going. I have a 90TB drive now, but it’s filling rather fast…
It looks like the issue is that methods like ReadAllLines (which ultimately calls DoReadAllBytes) are attempting to read everything at once from the system, and at least Windows and macOS are saying “NO”. They could probably be changed so that the data is read in chunks, instead.
The ultimate error at the end was that it saw the buffer size as a negative value after the Integer overflow - or it just crashed with an overflow error, unless overflow checks were disabled.
Out of curiosity, I “patched” DoReadAllBytes (called by ReadAllLines) so that it reads the data in chunks (of 1 MB, or smaller when it reaches the end). Now the problem is that TEncoding.GetString fails with a range check error, since it only accepts Integer parameters - so for ReadAllLines, there are at least two fixes needed.
For TStrings.LoadFromFile, it ultimately calls TStrings.LoadFromStream, which right from the outset uses an Integer for the size - but it will suffer the same fate as ReadAllLines anyway, since it also uses TEncoding.GetString.
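For reference, the chunked-read idea looks something like this (a minimal sketch with a hypothetical helper name, not the actual patch - and it only fixes the reading side; the Integer parameters of TEncoding.GetString remain a separate problem):

```pascal
uses System.SysUtils, System.Classes;

// Reads a file of arbitrary size into a TBytes buffer in 1 MB chunks,
// using Int64 offsets throughout so there is no 2 GB ceiling on the read.
function ReadAllBytesChunked(const Path: string): TBytes;
const
  ChunkSize = 1024 * 1024;  // 1 MB per read
var
  FS: TFileStream;
  Total, Offset, ToRead: Int64;
begin
  FS := TFileStream.Create(Path, fmOpenRead or fmShareDenyWrite);
  try
    Total := FS.Size;           // TStream.Size is already Int64
    SetLength(Result, Total);   // NativeInt length on a 64-bit target
    Offset := 0;
    while Offset < Total do
    begin
      ToRead := Total - Offset;
      if ToRead > ChunkSize then
        ToRead := ChunkSize;    // smaller final chunk at the end
      FS.ReadBuffer(Result[Offset], ToRead);
      Inc(Offset, ToRead);
    end;
  finally
    FS.Free;
  end;
end;
```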
I wouldn’t expect the ‘solution’ to be quite that simple.
I’m curious as to when the code that must have been in the CP/M version of Turbo Pascal to handle these sorts of issues got removed - everyone, in all products, seems to have removed it.
Once upon a time, handling the ‘file bigger than memory’ issue was a given, yet these days I always have difficulty finding an editor that can handle a huge log file.
Kinda weird that the authors didn’t even think they should break the log file ‘every day’, or some other reasonable period.
This is going to sound like a shameless plug* but it’s not.
UltraEdit https://www.ultraedit.com can handle MASSIVE files and it responds to searches in those files pretty much instantaneously. I use it for dealing with obscenely large SQL text dump files of 2 GB or more.
Note that is obscenely large not obscene files - you’ll need a whole different editor for those…
*[UltraEdit is also owned by Idera and I sometimes do DevRel stuff for them too]
I expect it doesn’t use the technique of loading an entire file at once into an array of strings. Apparently there is at least one case where that is necessary.
I’m sure there are many more: it’s widespread - they use Integer everywhere, just as 40 years ago they may have used Word, working with kilobytes.
We have an app that parses 5GB+ CSV files from a third party data provider. It uses TBufferedFileStream for the file access, which uses Int64 for seeking/size. We are loading into our own data structure. I have considered changing our CSV parser to use mmap() but that’s an exercise for another day. I shudder to think about using TStringList for super large files!
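A streaming approach along those lines might look like this (a sketch - the procedure name is mine and the actual parsing step is omitted; it relies on TBufferedFileStream’s Int64 positions and processes one line at a time, so memory use stays small regardless of file size):

```pascal
uses System.SysUtils, System.Classes;

// Streams a large CSV file line by line instead of loading it whole.
procedure ProcessLargeCsv(const Path: string);
var
  Stream: TBufferedFileStream;
  Reader: TStreamReader;
  Line: string;
begin
  Stream := TBufferedFileStream.Create(Path, fmOpenRead or fmShareDenyWrite);
  try
    Reader := TStreamReader.Create(Stream, TEncoding.UTF8);
    try
      while not Reader.EndOfStream do
      begin
        Line := Reader.ReadLine;  // each line is small, even if the file is not
        // ... parse Line into the app's own data structure ...
      end;
    finally
      Reader.Free;
    end;
  finally
    Stream.Free;
  end;
end;
```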